python - Text parsing - date recogniser -


Does anyone know if there's a Python text parser that recognises embedded dates? For instance, given the sentence

"bla bla bla bla 12 jan 14 bla bla bla 01/04/15 bla bla bla"

the parser would pick out the 2 date occurrences. I know of some Java tools, but are there Python ones? Would NLTK be overkill?

Thanks

Here is an attempt to nondeterministically (read: exhaustively) solve the problem of finding dates in tokenized text. It enumerates all ways of partitioning the sentence (as a list of tokens), with partition sizes from minps to maxps.

Each partitioning is run through the parser, which outputs a list of parsed dates, together with the token range each one was parsed from.

Each parser output is scored with the sum of the token ranges squared (so as to prefer one date parsed from 4 tokens rather than 2 dates parsed from 2 tokens each).
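To make the scoring rule concrete, here is a small standalone sketch. The name `range_score` and the use of plain `(start, end)` pairs are assumptions made for illustration only; the answer's own `score` function works on the parser's output tuples instead.

```python
def range_score(ranges):
    """Sum of squared token-range lengths; ranges are (start, end) pairs."""
    return sum((end - start) ** 2 for start, end in ranges)

one_long = [(0, 4)]           # one date covering 4 tokens
two_short = [(0, 2), (2, 4)]  # two dates covering 2 tokens each

print(range_score(one_long))   # 16
print(range_score(two_short))  # 8
```

Squaring the lengths is what makes the single longer parse win, since a plain sum would give both cases the same score of 4.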

Finally, it finds and outputs the parse with the best score.

The 3 building blocks of the algorithm:

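from dateutil.parser import parse as parsedate

def partition(lst, minps, maxps, i=0):
    # Yield every way of splitting lst into consecutive chunks of
    # minps..maxps tokens, as lists of (start_index, chunk) pairs.
    if lst == []:
        yield []
    else:
        for l in range(minps, maxps+1):
            if l > len(lst): continue
            for z in partition(lst[l:], minps, maxps, i+l):
                yield [(i, lst[:l])] + z

def parsedates(p):
    # Try to parse each chunk of a partitioning as a date.
    for x in p:
        i, pi = x
        try:
            d = parsedate(' '.join(pi))
            # output: (startindex, endindex, parseddate)
            if d: yield i, i+len(pi), d
        except Exception:
            # chunk is not a date: skip it
            pass

def score(p):
    # Sum of squared token-range lengths of the parsed dates.
    score = 0
    for pi in p:
        score += (pi[1]-pi[0])**2
    return score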

Finding the parse with the best score:

def bestparse(toks, maxps=3):
    bestscore = 0
    best = None  # stays None if no partitioning yields a date
    for ps in partition(toks, 1, maxps):
        l = list(parsedates(ps))
        s = score(l)
        if s > bestscore:
            bestscore = s
            best = l
    return best

Some tests:

l = ['bla', 'bla', 'bla', '12', 'jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
for bpi in bestparse(l):
    print('found date %s @ tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 2014-01-12 00:00:00 @ tokens 3,4,5

found date 2015-01-04 00:00:00 @ tokens 9

l = ['fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
for bpi in bestparse(l, 5):
    print('found date %s @ tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 1999-01-23 23:30:00 @ tokens 3,4,5,6

Beware that it can be computationally expensive, so you may want to run it on single short phrases, not on a whole document. You may even want to split long phrases into chunks.
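To get a feel for the cost, the number of partitionings can be counted without enumerating them. The helper below is a hypothetical addition (the name `count_partitions` is not part of the answer); it counts the ways of splitting n tokens into consecutive chunks of minps to maxps tokens, which is how many candidate partitionings an exhaustive search has to visit.

```python
from functools import lru_cache

def count_partitions(n, minps, maxps):
    """Count the splits of n tokens into chunks of minps..maxps tokens."""
    @lru_cache(maxsize=None)
    def count(remaining):
        if remaining == 0:
            return 1  # one way to split nothing: the empty partitioning
        return sum(count(remaining - size)
                   for size in range(minps, maxps + 1)
                   if size <= remaining)
    return count(n)

print(count_partitions(12, 1, 3))  # 927
print(count_partitions(20, 1, 3))  # 121415
```

The count grows exponentially with the number of tokens, and each partitioning also costs one parser call per chunk, which is why short phrases are the safer input.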

Another point for improvement is the partitioning function. If you have prior information, such as how many dates there can be at most in a single sentence, the number of ways of partitioning it can be reduced.
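One possible way to use such prior knowledge is sketched below. Note that this is a different enumeration from the full partitioning used above, not the answer's own method: instead of splitting the whole sentence, it picks at most `max_dates` non-overlapping token spans and leaves every other token out. The function name `span_selections` and its parameters are assumptions made for this illustration.

```python
def span_selections(n, maxps, max_dates, start=0):
    """Yield lists of at most max_dates non-overlapping (start, end)
    token spans, each 1..maxps tokens long, within n tokens."""
    yield []  # selecting no further span is always allowed
    if max_dates == 0:
        return
    for s in range(start, n):
        for l in range(1, maxps + 1):
            if s + l > n:
                break  # span would run past the last token
            # later spans must start at or after the end of this one
            for rest in span_selections(n, maxps, max_dates - 1, s + l):
                yield [(s, s + l)] + rest

selections = list(span_selections(12, 3, 2))
print(len(selections))
```

Each selected span would then be joined and handed to the date parser, and the selections scored as before. For 12 tokens, spans of up to 3 tokens and at most 2 dates, this yields 445 candidates, and the count grows polynomially rather than exponentially with sentence length.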

