>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门") # jieba default mode
>>> jieba.enable_paddle() # Enable paddle mode. Supported since version 0.40; earlier versions do not support it.
>>> words = pseg.cut("我爱北京天安门", use_paddle=True) # paddle mode
>>> for word, flag in words:
... print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
“Jieba” (Chinese for “to stutter”) Chinese text segmentation: built to be the best Python Chinese word segmentation module.
Features
Supports three segmentation modes:
Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
Full Mode gets all the possible words from the sentence. Fast but not accurate.
Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
Supports Traditional Chinese
Supports customized dictionaries
MIT License
Online demo
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Usage
Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: Download http://pypi.python.org/pypi/jieba/ , run python setup.py install after extracting.
Manual installation: place the jieba directory in the current directory or python site-packages directory.
Then import the module with import jieba.
Algorithm
Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
Use dynamic programming to find the most probable combination based on the word frequency.
For unknown words, a HMM-based model is used with the Viterbi algorithm.
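As a quick way to inspect the word graph, the default Tokenizer exposes a get_DAG helper (an internal, undocumented detail; shown here only as a sketch):

import jieba

jieba.initialize()
# Maps each character index of the sentence to the indices where a
# dictionary word starting at that character can end, i.e. the DAG
# of all candidate words.
print(jieba.dt.get_DAG(u"我来到北京清华大学"))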
Main Functions
Cut
The jieba.cut function accepts three input parameters: the string to be cut; cut_all, controlling the cut mode; and whether to use the Hidden Markov Model.
jieba.cut_for_search accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
jieba.cut and jieba.cut_for_search return a generator, over which you can iterate with a for loop to get the segmentation result (in unicode).
jieba.lcut and jieba.lcut_for_search return a list.
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. jieba.dt is the default Tokenizer, to which almost all global functions are mapped.
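For example, a short script exercising jieba.cut and jieba.cut_for_search in the three modes; it produces the output shown below:

# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("[Full Mode]: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("[Accurate Mode]: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default; the HMM recognizes unknown words
print("[Unknown Word Recognition]: " + ", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print("[Search Engine Mode]: " + ", ".join(seg_list))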
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Unknown Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Add a custom dictionary
Load dictionary
Developers can specify their own custom dictionary to be included in the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path of the custom dictionary
The dictionary format is the same as that of dict.txt: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If file_name is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
Either the word frequency or the POS tag may be omitted. The word frequency will be filled with a suitable value if omitted.
For example:
创新办 3 i
云计算 5
凱特琳 nz
台中
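A sketch of loading such entries, assuming they are saved in a file named userdict.txt (a hypothetical path):

import jieba

jieba.load_userdict("userdict.txt")  # hypothetical file containing the entries above
print("/ ".join(jieba.lcut("李小福是创新办主任也是云计算方面的专家")))
# with the custom entries loaded, 创新办 and 云计算 stay whole instead of being split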
Change a Tokenizer’s tmp_dir and cache_file to specify the path of the cache file, for use on a restricted file system.
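A minimal sketch, assuming /tmp/jieba_cache is a writable directory (both the path and setting the attribute directly on jieba.dt are assumptions):

import jieba

jieba.dt.tmp_dir = "/tmp/jieba_cache"  # assumed writable directory for the cache file
jieba.initialize()  # the prefix-dictionary cache is built under tmp_dir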
jieba.analyse.TextRank() creates a new TextRank instance.
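For example, the textrank convenience function can be used directly (the sample sentence here is only illustrative):

import jieba.analyse

sentence = "此外,公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元,增资后,吉林欧亚置业注册资本由7000万元增加到5亿元。"
# top 5 keywords by TextRank; allowPOS restricts candidates to the given POS tags
for keyword, weight in jieba.analyse.textrank(sentence, topK=5, withWeight=True, allowPOS=('ns', 'n', 'vn', 'v')):
    print('%s %s' % (keyword, weight))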
Part of Speech Tagging
jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized Tokenizer. tokenizer specifies the jieba.Tokenizer to internally use. jieba.posseg.dt is the default POSTokenizer.
Tags the POS of each word after segmentation, using labels compatible with ICTCLAS.
Example:
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
... print('%s %s' % (w.word, w.flag))
...
我 r
爱 v
北京 ns
天安门 ns
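A customized POSTokenizer can likewise wrap its own jieba.Tokenizer; a sketch, assuming data/dict.txt.big has been downloaded (see Using Other Dictionaries below):

import jieba
import jieba.posseg as pseg

tk = jieba.Tokenizer(dictionary='data/dict.txt.big')  # assumed path to a downloaded dictionary
pos_tk = pseg.POSTokenizer(tokenizer=tk)  # POS tagging on top of the custom Tokenizer
for word, flag in pos_tk.cut("我爱北京天安门"):
    print('%s %s' % (word, flag))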
Parallel Processing
Principle: Split the target text by line, assign the lines to multiple Python processes, and then merge the results, which is considerably faster.
Based on the multiprocessing module of Python.
Usage:
jieba.enable_parallel(4) # Enable parallel processing. The parameter is the number of processes.
Result: On a four-core 3.4GHz Linux machine, accurate segmentation of the Complete Works of Jin Yong reaches 1MB/s, 3.3 times faster than the single-process version.
Note that parallel processing supports only default tokenizers, jieba.dt and jieba.posseg.dt.
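A sketch of such a throughput test (large_file.txt is a hypothetical input file; jieba also provides jieba.disable_parallel() to switch back):

import time
import jieba

jieba.enable_parallel(4)  # enable parallel mode with 4 processes

content = open('large_file.txt', 'rb').read()  # hypothetical large UTF-8 text file
t1 = time.time()
words = '/ '.join(jieba.cut(content))
t2 = time.time()
print('speed: %.2f bytes/second' % (len(content) / (t2 - t1)))

jieba.disable_parallel()  # restore single-process mode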
Tokenize: return words with position
The input must be unicode
Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
Command Line Interface
$> python -m jieba --help
Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of ' / ' for word delimiter; or a
                        space if it is used without DELIM
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, use DELIM
                        instead of '_' for POS delimiter
  -D DICT, --dict DICT  use DICT as dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT together with the default dictionary or
                        DICT (if specified)
  -a, --cut-all         full pattern cutting (ignored with POS tagging)
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr
  -V, --version         show program's version number and exit

If no filename specified, use STDIN instead.
Initialization
By default, Jieba doesn’t build the prefix dictionary unless it’s necessary. Building it takes 1 to 3 seconds and happens only once. If you want to initialize Jieba manually, you can call:
import jieba
jieba.initialize() # (optional)
You can also specify the dictionary (not supported before version 0.28) :
jieba.set_dictionary('data/dict.txt.big')
Using Other Dictionaries
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
A smaller dictionary for a smaller memory footprint:
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
There is also a bigger dictionary that has better support for traditional Chinese (繁體):
https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
By default, an in-between dictionary is used, called dict.txt and included in the distribution.
In either case, download the file you want, and then call jieba.set_dictionary('data/dict.txt.big') or just replace the existing dict.txt.
Segmentation speed
1.5 MB / Second in Full Mode
400 KB / Second in Default Mode
Test Env: Intel® Core™ i7-2600 CPU @ 3.4GHz;《围城》.txt
PaddlePaddle: PaddlePaddle is an open-source deep learning framework developed by Baidu. The name stands for "PArallel Distributed Deep LEarning", and the framework targets large-scale training and high-performance inference. In jieba, paddle mode uses a sequence-labeling model (a bidirectional GRU) trained with PaddlePaddle to deliver a more advanced level of Chinese word segmentation; it also supports POS tagging, improving word splitting and understanding in complex contexts.
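Paddle mode requires installing the paddlepaddle-tiny package first (jieba 0.40 or later, as noted in the example at the top):

pip install paddlepaddle-tiny==1.6.1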