Extract tuples from a file with Python and regex
I have an HTML-format file containing various sorts of data, and I need to extract pairs of (id, title). I wrote a regex that seems to work fine in an online regex tester.
The file I need to extract data from:
<g id="node841" class="cond_node"><title>sr_aud_nbest_list_playlistplayplaylist_cond</title>
<g id="node842" class="prompt_node"><title>sr_aud_nbest_list_playlistplayplaylist_prompt</title>
<g id="edge841" class="edge"><title>sr_aud_nbest_list_playlistplayplaylist_cond->sr_aud_nbest_list_playlistplayplaylist_prompt</title>
<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node841" class="prompt_node"><title>sr_aud_nbest_list_playlistplayplaylist_prompt</title>
<g id="node841" class="cmd_node"><title>sr_aud_nbest_list_playlistplayplaylist_cmd</title>
<g id="node856" class="exit_node"><title>exit_63</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_4</title><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
With this regex:
(<g\sid="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|exit))
I am extracting the entire lines that meet the conditions above.
My Python script reads the file and uses the regex to extract the specific lines:
result = re.search(r'(id="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|exit))', svg)
The problem is that result contains only one pair of data (only node id 848), separated by a space character, not the entire list of lines matched by the regex.
Do you have any idea how to extract the data that matches the regex across the entire file, not just one line? In this particular case the extracted data should be, as the online regex tester says:
<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_4</title>
As mentioned in the comments, regular expressions may not be the best tool for parsing XML.
Having said that, the problem with your approach seems to be that you are using re.search instead of re.findall or re.finditer, so it returns only the first match instead of all of them.
import re
p = r'(<g\sid="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|exit))'
for match in re.finditer(p, svg):
    print(match.group())
Note, however, that in the last case this captures the entire line, not just the first <title>.
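If the goal is the (id, title) pairs themselves rather than whole lines, one option is to put capture groups around the two values and let re.findall collect them. The following is only a sketch: the pattern is not the one from the question, the sample string is a shortened copy of the data above, and it assumes that class="node" alone identifies the wanted elements. It should work whether the elements sit on one line or several.

```python
import re

# Shortened sample of the data from the question (layout is an assumption).
svg = """<g id="node841" class="cond_node"><title>sr_aud_nbest_list_playlistplayplaylist_cond</title>
<g id="node848" class="node"><title>sr_aud_main_link_51</title>
<g id="node856" class="exit_node"><title>exit_63</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
<g id="node860" class="node"><title>sr_aud_confirmnaplayplaylistname_notavailable_4</title><title>sr_aud_confirmnaplayplaylistname_notavailable_3</title>
"""

# Hypothetical pattern: group 1 is the id, group 2 is the text of the
# first <title>. class="node" must match exactly, so cond_node,
# prompt_node, exit_node and so on are skipped.
pattern = re.compile(r'<g\s+id="(\w+)"\s+class="node">\s*<title>([^<]*)</title>')

# With more than one capture group, findall returns a list of tuples.
pairs = pattern.findall(svg)
for node_id, title in pairs:
    print(node_id, title)
```

Because [^<]* stops at the next tag, only the first <title> of each element is captured, which sidesteps the whole-line issue mentioned above.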