python 2.7 - When tokenizing German text with NLTK, I am having problems with the tokens
I am using the Sublime Text editor for coding.
code:
# coding: utf-8
import nltk

line = "frau präsidentin, zu recht befaßt sich das parlament regelmäßig mit der verkehrssicherheit."
print nltk.word_tokenize(line.decode('utf8'))
result:
[u'frau', u'pr', u'\xe4', u'sidentin', u',', u'zu', u'recht', u'befa', u'\xdf', u't', u'sich', u'das', u'parlament', u'regelm', u'\xe4', u'\xdf', u'ig', u'mit', u'der', u'verkehrssicherheit', u'.'] [finished in 0.4s]
The tokens are still not correct: "präsidentin" is being broken into sub-tokens at the umlaut, which I don't want.
According to the docs:
This particular tokenizer requires the Punkt sentence tokenization models to be installed.
I'm guessing you need these models, including the German one. Instructions for installing them can be found at http://www.nltk.org/data.html (e.g. via `python -m nltk.downloader punkt`), or the models can be downloaded directly here
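To illustrate the expected behavior, here is a minimal sketch (using only the standard `re` module, not NLTK) of a Unicode-aware tokenizer. The point is that once the text is a proper Unicode string and the pattern matches Unicode word characters, "präsidentin" and "befaßt" stay in one piece instead of splitting at "ä" and "ß". The regex pattern here is an illustrative assumption, not NLTK's actual tokenizer, and the sketch uses the `print()` function form so it runs unchanged on Python 3 as well.

```python
# -*- coding: utf-8 -*-
# Sketch of Unicode-aware tokenization (assumption: simple regex stand-in
# for NLTK's tokenizer, to show what correct token boundaries look like).
import re

line = (u"frau präsidentin, zu recht befaßt sich das parlament "
        u"regelmäßig mit der verkehrssicherheit.")

# With re.UNICODE, \w matches letters like ä and ß, so words containing
# umlauts are kept whole; punctuation becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", line, re.UNICODE)
print(tokens)
```

With a correctly installed Punkt model, `nltk.word_tokenize` on a decoded Unicode string should likewise keep umlaut words intact rather than emitting `u'\xe4'` as a separate token.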