python - Text parsing - date recogniser -

- September 15, 2013

Does anyone know if there's a Python text parser that recognises embedded dates? For instance, given the sentence

"bla bla bla bla 12 jan 14 bla bla bla 01/04/15 bla bla bla"

the parser would pick out the 2 date occurrences. I know of some Java tools, but are there Python ones? Would NLTK be overkill?

Thanks.

Here is my attempt to nondeterministically (read: exhaustively) solve the problem of finding dates in tokenized text. It enumerates all ways of partitioning the sentence (as a list of tokens), with partition sizes from minps to maxps.

Each partitioning is run through the parser, which outputs a list of parsed dates together with the token range each one was parsed from.

Each parser output is scored with the sum of the token ranges squared (so as to prefer one date parsed from 4 tokens rather than 2 dates parsed from 2 tokens each).
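As a small worked example of that scoring rule, the sketch below compares the two cases mentioned. The name `range_score` and the bare `(start, end)` pairs are used only for illustration here; the end index is assumed to be exclusive.

```python
def range_score(ranges):
    """Sum of squared token-range lengths; ranges are (start, end), end exclusive."""
    return sum((end - start) ** 2 for start, end in ranges)

one_date = [(3, 7)]            # one date covering 4 tokens
two_dates = [(3, 5), (5, 7)]   # two dates covering 2 tokens each

print(range_score(one_date))   # 16
print(range_score(two_dates))  # 8
```

Squaring the lengths is what makes the single 4-token parse (16) win over the two 2-token parses (8), even though both cover the same tokens.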

Finally, it finds and outputs the parse with the best score.

The 3 building blocks of the algorithm:

    from dateutil.parser import parse as parsedate

    def partition(lst, minps, maxps, i=0):
        # yield every split of lst into consecutive parts of size minps..maxps;
        # each part is (start index in the original list, list of tokens)
        if lst == []:
            yield []
        else:
            for l in range(minps, maxps+1):
                if l > len(lst): continue
                for z in partition(lst[l:], minps, maxps, i+l):
                    yield [(i, lst[:l])] + z

    def parsedates(p):
        for x in p:
            i, pi = x
            try:
                d = parsedate(' '.join(pi))
                # output: (startindex, endindex, parseddate)
                if d: yield i, i+len(pi), d
            except (ValueError, OverflowError):
                # this part is not a date; skip it
                pass

    def score(p):
        total = 0
        for pi in p:
            total += (pi[1]-pi[0])**2
        return total

Finding the parse with the best score:

    def bestparse(toks, maxps=3):
        bestscore = 0
        best = []  # stays empty if no partitioning yields a date
        for ps in partition(toks, 1, maxps):
            dates = list(parsedates(ps))
            s = score(dates)
            if s > bestscore:
                bestscore = s
                best = dates
        return best

Some tests:

    l = ['bla', 'bla', 'bla', '12', 'jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
    for bpi in bestparse(l):
        print('found date %s @ tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 2014-01-12 00:00:00 @ tokens 3,4,5

found date 2015-01-04 00:00:00 @ tokens 9

    l = ['fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
    for bpi in bestparse(l, 5):
        print('found date %s @ tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 1999-01-23 23:30:00 @ tokens 3,4,5,6

Beware that this can be computationally expensive, so you may want to run it on single short phrases, not on a whole document. You may even want to split long phrases into chunks.
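To give a rough feel for the cost, the sketch below counts how many partitionings exist for a given sentence length, and shows one possible way to cut a long token list into overlapping windows. Both `count_partitions` and `chunks` are hypothetical helpers written for this illustration, and the overlapping-window scheme is an assumption, not something the approach above prescribes.

```python
def count_partitions(n, maxps, minps=1):
    """Number of ways to split n tokens into consecutive parts of size minps..maxps."""
    counts = [1] + [0] * n  # counts[0] = 1: an empty list has exactly one partitioning
    for k in range(1, n + 1):
        counts[k] = sum(counts[k - l] for l in range(minps, maxps + 1) if l <= k)
    return counts[n]

def chunks(toks, size, overlap):
    """Yield (offset, window) pairs; consecutive windows share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not toks:
        return
    step = size - overlap
    for start in range(0, max(len(toks) - overlap, 1), step):
        yield start, toks[start:start + size]

print(count_partitions(12, 3))  # 927 partitionings for a 12-token sentence
for offset, window in chunks(list("abcdefg"), size=4, overlap=2):
    print(offset, window)
```

If the overlap is at least maxps - 1, any run of up to maxps tokens falls entirely inside at least one window, so a date is not cut in half at a boundary. The offset lets you map token indices found in a window back to the full sentence; dates found twice in overlapping windows would still need to be de-duplicated.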

Another point for improvement is the partitioning function. If you have prior information on how many dates there can be at most in a single sentence, the number of ways of partitioning it can be greatly reduced.
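One possible way to exploit such a bound, sketched here as an assumption rather than as part of the code above: instead of partitioning the whole sentence, pick at most `maxdates` non-overlapping token ranges and only try to parse those. The names `spans`, `span_sets` and `maxdates` are hypothetical.

```python
def spans(n, maxps, start=0):
    """All (start, end) token ranges of length 1..maxps beginning at or after `start`."""
    for i in range(start, n):
        for l in range(1, maxps + 1):
            if i + l <= n:
                yield (i, i + l)

def span_sets(n, maxps, maxdates, start=0):
    """All left-to-right lists of at most `maxdates` non-overlapping spans."""
    yield []  # choosing no further span is always allowed
    if maxdates == 0:
        return
    for (i, j) in spans(n, maxps, start):
        # the next span may only start where this one ends
        for rest in span_sets(n, maxps, maxdates - 1, j):
            yield [(i, j)] + rest

# 12 tokens, ranges of up to 3 tokens, at most 2 dates in the sentence
print(sum(1 for _ in span_sets(12, 3, 2)))
```

Each yielded list can be turned into `[(i, toks[i:j]) for i, j in chosen]` and passed to `parsedates` and `score` just like a partitioning. For a fixed `maxdates`, the number of candidates grows polynomially with the sentence length rather than exponentially.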
