Extract tuples from a file with Python and regex



I have an HTML-format file containing several sorts of data, and I need to extract pairs of (id, title). I wrote a regex that seems to work fine in an online regex tester.
The file I need to extract data from:

<g id="node841" class="cond_node"><title>sr_aud_nbest_list_playlistplayplaylist_cond</title>
<g id="node842" class="prompt_node"><title>sr_aud_nbest_list_playlistplayplaylist_prompt</title>
<g id="edge841" class="edge"><title>sr_aud_nbest_list_playlistplayplaylist_cond&#45;&gt;sr_aud_nbest_list_playlistplayplaylist_prompt</title>
<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node841" class="prompt_node"><title>sr_aud_nbest_list_playlistplayplaylist_prompt</title>
<g id="node841" class="cmd_node"><title>sr_aud_nbest_list_playlistplayplaylist_cmd</title>
<g id="node856" class="exit_node"><title>exit_63</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_4</title><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>

With the regex:

(<g\sid="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|exit)) 

I am extracting the entire lines that meet the above conditions.
My Python script reads the file and uses the regex to extract the specific lines:

result = re.search(r'(id="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|exit))', svg) 

But the problem is that the result contains only 1 pair of data (only node id 848), separated by a space character, not the entire list of lines extracted by the regex.

Do you have any idea how to extract the data that matches the regex from the entire file, not just 1 line? In this particular case the extracted data should be, as the online regex tester says:

<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_4</title>

As mentioned in the comments, regular expressions may not be the best tool to parse XML.
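For illustration, here is a minimal sketch of a parser-based alternative using the standard library's `html.parser`. It assumes the input looks like the excerpt in the question, which has no closing `</g>` tags and so would be rejected by a strict XML parser on its own; for a complete, well-formed SVG file something like `xml.etree.ElementTree` may be a more natural fit. The class name `NodeTitleParser` and the shortened sample string are made up for this example.

```python
from html.parser import HTMLParser

# Shortened sample based on the data in the question (assumption: one <g> per line).
svg = '''<g id="node841" class="cond_node"><title>sr_aud_nbest_list_playlistplayplaylist_cond</title>
<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node856" class="exit_node"><title>exit_63</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
'''


class NodeTitleParser(HTMLParser):
    """Collect (id, title) pairs for <g class="node"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current_id = None   # id of the most recent <g class="node">, if any
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "g":
            # Remember the id only when the class is exactly "node".
            self._current_id = attrs.get("id") if attrs.get("class") == "node" else None
        elif tag == "title" and self._current_id is not None:
            self._in_title = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self.pairs.append((self._current_id, "".join(self._buffer)))
            self._in_title = False
            self._current_id = None  # keep only the first <title> per <g>


parser = NodeTitleParser()
parser.feed(svg)
parser.close()
pairs = parser.pairs
print(pairs)
```

Because the class attribute is compared for equality with "node", elements such as cond_node, prompt_node, cmd_node and exit_node are skipped without needing a negative lookahead.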

Having said that, the problem with your approach seems to be that you are using search instead of findall or finditer, so it returns only the first match instead of all of them.

import re

p = r'(<g\sid="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|exit))'
# finditer yields every non-overlapping match, not only the first one
for match in re.finditer(p, svg):
    print(match.group())

Note, however, that in the last case this captures the entire line, not just the first <title>.
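If the goal is the (id, title) pairs themselves rather than whole lines, one option is to put capture groups around the two values and use `re.findall`, which returns a list of tuples when the pattern has more than one group. The pattern below is a simplified variant written for this sketch, not the regex from the question, and the sample string is a shortened copy of the question's data.

```python
import re

# Shortened sample based on the data in the question.
svg = '''<g id="node841" class="cond_node"><title>sr_aud_nbest_list_playlistplayplaylist_cond</title>
<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node856" class="exit_node"><title>exit_63</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
'''

# Group 1 captures the id, group 2 the text of the first <title> that follows.
# Matching class="node" exactly already excludes cond_node, cmd_node, etc.
pair_pattern = re.compile(r'<g\s+id="(\w+)"\s+class="node">\s*<title>([^<]*)</title>')

# With two groups, findall returns a list of (id, title) tuples.
pairs = pair_pattern.findall(svg)
for node_id, title in pairs:
    print(node_id, title)
```

Since `[^<]*` stops at the first `<`, only the first title after each matching `<g>` is captured, even when a second `<title>` follows on the same line.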

