python - Text parsing - date recogniser
Does anyone know if there's a Python text parser that recognises embedded dates? For instance, given a sentence
"bla bla bla bla 12 jan 14 bla bla bla 01/04/15 bla bla bla"
the parser would pick out the 2 date occurrences. I know of some Java tools, but are there Python ones? Would NLTK be overkill?
Thanks
Here is an attempt to nondeterministically (read: exhaustively) solve the problem of finding dates in tokenized text. It enumerates all the ways of partitioning the sentence (as a list of tokens), with partition sizes from minps to maxps.
For each partitioning, it runs the parser, which outputs a list of parsed dates together with the token range each one was parsed from.
Each parser output is scored with the sum of the squared token-range lengths (so as to prefer one date parsed from 4 tokens over 2 dates parsed from 2 tokens each).
Finally, it finds and outputs the parse with the best score.
The 3 building blocks of the algorithm:
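To make the scoring rule concrete, here is a small worked example. It is only a sketch: span_score is a hypothetical stand-in that takes bare (start, end) token ranges instead of the full parser output, but the arithmetic is the same.

```python
def span_score(spans):
    """Sum of squared span lengths; spans are (start, end) token ranges."""
    return sum((end - start) ** 2 for start, end in spans)

# One date parsed from 4 tokens (tokens 3..6)
one_long = [(3, 7)]
# Two dates parsed from 2 tokens each (tokens 3..4 and 5..6)
two_short = [(3, 5), (5, 7)]

print(span_score(one_long))   # 4**2 = 16
print(span_score(two_short))  # 2**2 + 2**2 = 8
```

Squaring the lengths is what makes a single long match beat several short ones that cover the same tokens.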
    from dateutil.parser import parse as parsedate

    def partition(lst, minps, maxps, i=0):
        # yield every way of splitting lst into consecutive parts
        # of minps..maxps tokens, as lists of (startindex, tokens)
        if lst == []:
            yield []
        else:
            for l in range(minps, maxps + 1):
                if l > len(lst):
                    continue
                for z in partition(lst[l:], minps, maxps, i + l):
                    yield [(i, lst[:l])] + z

    def parsedates(p):
        for x in p:
            i, pi = x
            try:
                d = parsedate(' '.join(pi))
                # output: (startindex, endindex, parseddate)
                if d:
                    yield i, i + len(pi), d
            except Exception:
                # this part is not a date, skip it
                pass

    def score(p):
        score = 0
        for pi in p:
            score += (pi[1] - pi[0]) ** 2
        return score
Finding the parse with the best score:
    def bestparse(toks, maxps=3):
        bestscore = 0
        bestparse = None  # stays None if no date is found at all
        for ps in partition(toks, 1, maxps):
            l = list(parsedates(ps))
            s = score(l)
            if s > bestscore:
                bestscore = s
                bestparse = l
        return bestparse
Some tests:
    l = ['bla', 'bla', 'bla', '12', 'jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
    for bpi in bestparse(l):
        print('found date %s @ tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))
found date 2014-01-12 00:00:00 @ tokens 3,4,5
found date 2015-01-04 00:00:00 @ tokens 9
    l = ['fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
    for bpi in bestparse(l, 5):
        print('found date %s @ tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))
found date 1999-01-23 23:30:00 @ tokens 3,4,5,6
Beware that this can be computationally expensive, so you may want to run it on single short phrases, not on a whole document. You may even want to split long phrases into chunks.
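To get a feel for the cost, the sketch below counts how many partitionings the exhaustive search has to visit, and shows one possible way of cutting a long token list into overlapping chunks. Both count_partitionings and chunk_tokens are hypothetical helpers that are not part of the code above, and the size and overlap defaults are arbitrary assumptions. The overlap is there so that a date sitting on a chunk boundary still appears whole in one of the chunks.

```python
from functools import lru_cache

def count_partitionings(n, minps=1, maxps=3):
    """Number of ways to split n tokens into consecutive parts of minps..maxps tokens."""
    @lru_cache(maxsize=None)
    def count(remaining):
        if remaining == 0:
            return 1
        return sum(count(remaining - l)
                   for l in range(minps, maxps + 1)
                   if l <= remaining)
    return count(n)

def chunk_tokens(toks, size=8, overlap=2):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` tokens."""
    if overlap >= size:
        raise ValueError('overlap must be smaller than size')
    if not toks:
        return
    step = size - overlap
    for start in range(0, max(len(toks) - overlap, 1), step):
        yield start, toks[start:start + size]

for n in (6, 12, 20):
    print(n, 'tokens ->', count_partitionings(n), 'partitionings')
```

The offset returned with each chunk can be added to the token indices of a parse to map them back to positions in the full sentence. Dates found in the overlapping region may be reported twice, so the results would need de-duplicating.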
Another point for improvement is the partitioning function. If you have prior information, such as how many dates there can be at most in a single sentence, the number of ways of partitioning it can be greatly reduced.
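One possible way to use such prior information is sketched below. Note that this swaps in a different enumeration: instead of partitioning the whole sentence, it picks at most maxdates disjoint candidate spans and leaves every other token alone. span_choices and its maxdates parameter are hypothetical names introduced for illustration. Each chosen span would then be handed to the date parser and the result scored as before.

```python
def span_choices(n, maxps=3, maxdates=2, start=0):
    """Yield lists of at most maxdates disjoint (start, end) spans over n tokens.

    Each span covers 1..maxps tokens; spans are listed in left-to-right order.
    """
    yield []  # choosing no (further) span is always an option
    if maxdates == 0:
        return
    for s in range(start, n):
        for l in range(1, maxps + 1):
            e = s + l
            if e > n:
                break
            # the remaining spans must start at or after the end of this one
            for rest in span_choices(n, maxps, maxdates - 1, e):
                yield [(s, e)] + rest

for choice in span_choices(3, maxps=2, maxdates=1):
    print(choice)
```

For a fixed maxdates the number of choices grows polynomially with the sentence length, while the number of full partitionings grows exponentially, so the saving gets larger as sentences get longer.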