Focus On Oracle


wordcloud and jieba

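wordcloud is a word cloud generator written in Python, and it is very convenient to use. It analyzes a text file and renders the keywords it finds as an image. All we have to prepare is the text data, a font, and a template (mask) image.
wordcloud has been tested on Python 2.7, 3.4, 3.5, 3.6 and 3.7. There are two ways to install it:
1. With pip
   pip install wordcloud
2. With conda
   conda install -c conda-forge wordcloud
Note: wordcloud depends on numpy and pillow. The examples in this article also use matplotlib to display the images.
In this article wordcloud is installed with conda: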
[root@xd07dbm01 ~]# conda install -c conda-forge wordcloud
Collecting package metadata: done
Solving environment: done
## Package Plan ##
  environment location: /opt/anaconda2
  added / updated specs:
    - wordcloud
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.6.9                |           py27_0         888 KB  conda-forge
    wordcloud-1.5.0            |py27h14c3975_1000         182 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.0 MB
The following NEW packages will be INSTALLED:

  wordcloud          conda-forge/linux-64::wordcloud-1.5.0-py27h14c3975_1000
The following packages will be UPDATED:
  conda              anaconda/cloud/conda-forge::conda-4.6~ --> conda-forge::conda-4.6.9-py27_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
wordcloud-1.5.0      | 182 KB    | ######################################## | 100%
conda-4.6.9          | 888 KB    | ######################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
[root@xd07dbm01 ~]#
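
After the installation finishes, it can be worth confirming that wordcloud and the dependencies named above are importable. The following is a minimal sketch, not part of the original article; it assumes Python 3 (importlib.util is not available on the Python 2.7 environment shown in the log), and the helper name find_missing is made up for illustration.

```python
import importlib.util

# Import names, not package names: pillow is imported as "PIL".
REQUIRED_MODULES = ("wordcloud", "numpy", "PIL")


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("wordcloud and its dependencies are importable")
```

If anything is reported missing, rerun the pip or conda command for that package.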
Clone the wordcloud repository:
[root@xd07dbm01 ~]# git clone https://github.com/amueller/word_cloud.git
Cloning into 'word_cloud'...
remote: Enumerating objects: 3669, done.
remote: Total 3669 (delta 0), reused 0 (delta 0), pack-reused 3669
Receiving objects: 100% (3669/3669), 67.64 MiB | 411.00 KiB/s, done.
Resolving deltas: 100% (2040/2040), done.
Below are two of amueller's demos:
[root@xd07dbm01 ~]# cd word_cloud/
[root@xd07dbm01 word_cloud]# cd examples/

[root@xd07dbm01 examples]# cat simple.py

#!/usr/bin/env python
"""
Minimal Example
===============
Generating a square wordcloud from the US constitution using default arguments.
"""
import os
from os import path
from wordcloud import WordCloud
# get data directory (using getcwd() is needed to support running example in generated IPython notebook)
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()

# Read the whole text.
text = open(path.join(d, 'constitution.txt')).read()
# Generate a word cloud image
wordcloud = WordCloud().generate(text)
# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
# The pil way (if you don't have matplotlib)
# image = wordcloud.to_image()
# image.show() 

[root@xd07dbm01 examples]# python simple.py

[root@xd07dbm01 examples]# python masked.py
[root@xd07dbm01 examples]#
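Note: if you run these scripts in a Linux virtual machine hosted on Windows and connect to it over ssh, you need to install an X server such as Xmanager or Xming on Windows and export DISPLAY to the address of the Windows machine. Only then can the image windows be shown on your desktop.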
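
If no X server is available, one alternative is to skip the window and write the image to a file. The sketch below is an illustration rather than part of the original demos: it assumes a Linux/X11 session where an unset DISPLAY means no window can be opened, and the helper names has_display and render are hypothetical. It relies only on WordCloud.generate, WordCloud.to_file and the matplotlib calls already used in simple.py.

```python
import os


def has_display(environ=None):
    """Return True when an X display is configured (DISPLAY is set and non-empty)."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("DISPLAY"))


def render(text, out_path="wordcloud.png"):
    """Show the cloud in a window if possible, otherwise save it as a PNG."""
    # Imported here so that has_display() can be used without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud().generate(text)
    if has_display():
        import matplotlib.pyplot as plt

        plt.imshow(cloud, interpolation="bilinear")
        plt.axis("off")
        plt.show()
    else:
        cloud.to_file(out_path)
        print("No DISPLAY set; wrote", out_path)
```

Calling render(open("constitution.txt").read()) on a machine without DISPLAY would then leave wordcloud.png in the current directory, which can be copied back and viewed locally.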
wordcloud also provides a command-line tool, named wordcloud_cli:
$ wordcloud_cli --text mytext.txt --imagefile wordcloud.png
You can also convert a PDF to text and pipe it straight into a new image:
$ pdftotext mydocument.pdf - | wordcloud_cli --imagefile wordcloud.png

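Commonly used wordcloud parameters:
wordcloud.WordCloud(
    font_path=None,  # path to the font file; must be set for Chinese text, otherwise the characters are not rendered correctly
    width=400, # canvas width, default 400
    height=200, # canvas height, default 200
    margin=2, # spacing kept around each word
    ranks_only=None,
    prefer_horizontal=0.9, # ratio of horizontal to vertical placement attempts; the default 0.9 leaves about 0.1 for vertical words
    mask=None, # image array that defines the shape; set it to draw the cloud inside a picture
    scale=1, # scaling between computation and drawing; also a commonly used parameter
    color_func=None, # function that returns the color of each word; by default colors are picked at random from colormap
    max_words=200, # maximum number of words to display
    min_font_size=4, # smallest font size
    stopwords=None, # stop words, i.e. words that are excluded from the image; None uses the built-in STOPWORDS list
    random_state=None, # seed for the random number generator; fix it to get a reproducible layout and colors
    background_color='black', # background color, default black; accepts a color name such as white or a hex value
    max_font_size=None, # largest font size
    font_step=1, # step size used when changing the font size, default 1
    mode='RGB', # image mode, default RGB
    relative_scaling='auto',
    regexp=None,
    collocations=True,
    colormap='viridis', # see https://matplotlib.org/examples/color/colormaps_reference.html
    normalize_plurals=True,
    contour_width=0,
    contour_color='black',
    repeat=False)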
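
To make color_func and random_state more concrete, here is a small sketch of a custom color function. The name warm_color_func and the chosen hue range are made up for illustration; the argument list mirrors the grey_color_func used in the upstream examples. The function is called directly here so it can be tried without generating an image.

```python
import random


def warm_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return an HSL color string with a random warm hue (0 to 60 degrees)."""
    rng = random_state if random_state is not None else random.Random()
    hue = rng.randint(0, 60)
    return "hsl(%d, 80%%, 50%%)" % hue


# The same seed yields the same color, which is what makes a run reproducible.
first = warm_color_func("oracle", 20, (0, 0), None, random_state=random.Random(42))
again = warm_color_func("oracle", 20, (0, 0), None, random_state=random.Random(42))
print(first, first == again)
```

It would be passed in as WordCloud(color_func=warm_color_func, random_state=42), or applied to an existing cloud with wc.recolor(color_func=warm_color_func, random_state=42).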

The following example is adapted from masked.py:

#!/usr/bin/env python
"""
Masked wordcloud
================
Using a mask you can generate wordclouds in arbitrary shapes.
"""
from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import random
import os
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# get data directory (using getcwd() is needed to support running example in generated IPython notebook)
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
# Read the whole text.
text = open(path.join(d, 'alice.txt')).read()
# read the mask image (alice_color.png)
# photo from https://www.deviantart.com/jirkavinse/art/Real-Life-Alice-282261010
alice_color = np.array(Image.open(path.join(d, "alice_color.png")))
# start from the built-in stop words and add "said"
stopwords = set(STOPWORDS)
stopwords.add("said")
wc = WordCloud(background_color="white", max_words=2000, mask=alice_color,
               stopwords=stopwords, contour_width=3, contour_color='steelblue')

# generate word cloud; you can use wc.generate_from_text(text) as well
wc.generate(text)

# show original picture
plt.imshow(alice_color, cmap=plt.cm.gray)
plt.axis("off")
plt.show()

# store to file alice1.png
wc.to_file(path.join(d, "alice1.png"))
# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

#recolor and save to file alice2.png
img_colors = ImageColorGenerator(alice_color)
wc.recolor(color_func=img_colors)
# store to file
wc.to_file(path.join(d, "alice2.png"))
# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.tight_layout()
plt.show()

#recolor and save to file alice3.png
def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
wc.recolor(color_func=grey_color_func,random_state=3)
# store to file
wc.to_file(path.join(d, "alice3.png"))
# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.tight_layout()
plt.show()
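
A mask does not have to come from a picture. In wordcloud, pure white (255) entries of the mask are treated as masked out and everything else is free to draw on, so a shape can also be built directly with numpy. The sketch below follows the idea of the upstream single-word example; the function name circle_mask and its default sizes are chosen here for illustration.

```python
import numpy as np


def circle_mask(size=300, radius=130):
    """Build a square mask: 0 inside the circle (drawable), 255 outside (masked out)."""
    center = size // 2
    x, y = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return 255 * outside.astype(int)


mask = circle_mask()
# The center is drawable (0) and the corner is masked out (255).
print(mask.shape, mask[150, 150], mask[0, 0])
```

The array can then be used in place of alice_color, for example WordCloud(background_color="white", mask=circle_mask()).generate(text).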

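wordcloud is a very good tool, but on its own it is not enough to build a Chinese word cloud. English text can be split into words on spaces, whereas Chinese is written without spaces between words and segmenting it is more complicated. We therefore also need a Chinese word segmentation library, jieba, which is currently the most popular Chinese segmentation tool for Python. You can install it with 'pip install jieba'. As you will see, using wordcloud together with jieba is very convenient.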
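
The difference between the two languages is easy to see in a few lines. This is only an illustrative sketch: the Chinese sentence is the sample used in the jieba README, and the jieba call is skipped when the library is not installed.

```python
english = "wordcloud is a handy tool"
chinese = "我来到北京清华大学"  # sample sentence from the jieba README

# Splitting on whitespace works for English ...
print(english.split())
# ... but leaves a Chinese sentence as one unsplit string.
print(chinese.split())

try:
    import jieba
except ImportError:  # jieba is optional for this sketch
    jieba = None

if jieba is not None:
    # jieba.cut yields the segmented words; joining them with spaces
    # produces text that wordcloud can tokenize.
    print(" ".join(jieba.cut(chinese)))
```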

Below, in the simplest possible way, we analyze some of the keywords in Romance of the Three Kingdoms (三国演义):

# coding=utf-8
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# read the text
text = open("./sanguoyanyi.txt").read()
# segment the Chinese text with jieba, then join the words with spaces
text = " ".join(jieba.cut(text))
# generate the word cloud and display it
wordcloud = WordCloud(font_path='./fonts/SourceHanSerif/SourceHanSerifK-Light.otf', width=1024, height=800, max_words=200).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show() 
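
A possible refinement, not used in the original script, is to count the segmented words yourself and hand the counts to WordCloud.generate_from_frequencies. That makes it easy to drop punctuation, single-character particles and filler words before drawing. In the sketch below the helper count_words, the min_length rule and the sample tokens are all assumptions for illustration; the tokens are written by hand in place of real jieba.cut output.

```python
from collections import Counter


def count_words(tokens, stopwords=(), min_length=2):
    """Count tokens, skipping whitespace, stop words and tokens shorter than min_length."""
    cleaned = (token.strip() for token in tokens)
    return Counter(
        token for token in cleaned
        if len(token) >= min_length and token not in stopwords
    )


# Hand-written tokens standing in for the output of jieba.cut(text).
tokens = ["却说", "曹操", "说", "刘备", ",", "曹操", "的", "曹操", "刘备", "\n"]
freqs = count_words(tokens, stopwords={"却说"})
print(freqs.most_common())
```

With real data this would become freqs = count_words(jieba.cut(text)), followed by WordCloud(font_path='./fonts/SourceHanSerif/SourceHanSerifK-Light.otf').generate_from_frequencies(freqs).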

For details, see the following link:

https://amueller.github.io/word_cloud/auto_examples/wordcloud_cn.html


Reference
https://amueller.github.io/word_cloud/
https://github.com/amueller/word_cloud
https://amueller.github.io/word_cloud/auto_examples/colored.html
https://github.com/amueller/tensorflow-workshop/
https://github.com/amueller/advanced_training
https://github.com/fxsjy/jieba

https://github.com/fxsjy/jieba/blob/master/test/demo.py

https://www.deviantart.com/jirkavinse/art/Real-Life-Alice-282261010


Keywords: jieba wordcloud python
