python 2.7 - When creating a dictionary for the German language, I am facing some problems in making tokens -


I am using the Sublime Text editor for coding.

code:

# coding: utf-8
import nltk

line = "frau präsidentin, zu recht befaßt sich das parlament regelmäßig mit der verkehrssicherheit."
print nltk.word_tokenize(line.decode('utf8'))

result:

[u'frau', u'pr', u'\xe4', u'sidentin', u',', u'zu', u'recht', u'befa', u'\xdf', u't', u'sich', u'das', u'parlament', u'regelm', u'\xe4', u'\xdf', u'ig', u'mit', u'der', u'verkehrssicherheit', u'.']
[finished in 0.4s]

The tokens are still not correct: words like "präsidentin" are being broken into sub-tokens at the umlaut, which I don't want.
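As a stopgap while the proper tokenizer models are sorted out, a Unicode-aware regex can tokenize this sentence without splitting at "ä" or "ß". This is only a minimal sketch, not NLTK's tokenizer; it is written in Python 3 syntax (in Python 2.7 you would pass a unicode string instead):

```python
import re

# Split the sentence into runs of word characters or single punctuation
# marks. With Unicode matching, 'ä' and 'ß' count as word characters,
# so they stay inside their words.
line = "frau präsidentin, zu recht befaßt sich das parlament regelmäßig mit der verkehrssicherheit."
tokens = re.findall(r"\w+|[^\w\s]", line, re.UNICODE)
print(tokens)
```

This keeps "präsidentin" and "befaßt" as single tokens, at the cost of NLTK's more careful handling of clitics and abbreviations.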

According to the docs:

This particular tokenizer requires the Punkt sentence tokenization models to be installed.

I'm guessing you need these installed, including the German model. Instructions for installing them can be found at http://www.nltk.org/data.html, or the models can be downloaded directly here.
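For what it's worth, the characters the tokenizer is splitting on are ordinary lowercase letters as far as Unicode is concerned, so a tokenizer with proper Unicode (and German model) support should keep them inside word tokens. A quick stdlib check, in Python 3 syntax:

```python
import unicodedata

# Look up the Unicode general category of the characters NLTK split on.
# Both 'ä' and 'ß' are category 'Ll' (lowercase letter), i.e. normal
# word-internal characters, not punctuation.
categories = {ch: unicodedata.category(ch) for ch in "äß"}
print(categories)  # -> {'ä': 'Ll', 'ß': 'Ll'}
```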

