regex - regexp_extract in Hive not working as expected
I started working with Hive for data preparation and ran into a peculiar problem when using the regexp_extract UDF. I am working on XML structures and trying to extract elements from an XML string. Here is an example. The string I am operating on is:
<b>ajsdnf</b> <a>asdhf</a> <a>alfnv</a> <b>ajsdnf</b> <a>test</a>
The regular expression (<a>.*?<\/a>)
should extract the strings that contain the <a> elements together with their tags. When I check the logic on regex101, it finds the right groups.
However, when I run it against Hive like this:
select regexp_extract('<b>ajsdnf</b><a>asdhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(<a>.*?<\/a>)',0) from some_table limit 1;
it returns only the first match, <a>asdhf</a>. As I read the documentation of regexp_extract, it should return all occurrences if I pass the integer 0 as the 3rd parameter. Is there a chance I can achieve the following result?
<a>asdhf</a> <a>alfnv</a> <a>test</a>
And if you are wondering why I am not using XPath to deal with the XML problem: I have a more complex structure and want to extract parts of the XML tree including their child nodes. The XPath UDFs of Hive cannot handle that at the moment.
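A note on the behaviour described above: Hive documents the third argument of regexp_extract as the index passed to Java's Matcher.group(), and in Java group 0 means the whole text of one match, not every occurrence. The Java sketch below illustrates that difference outside Hive. The class and method names are hypothetical, and the empty-string fallback is an assumption of this sketch rather than a statement about the UDF's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupZeroDemo {
    // Same pattern as in the question; escaping "/" is optional in Java regex.
    static final Pattern A_TAG = Pattern.compile("(<a>.*?</a>)");

    // Roughly what one regexp_extract(..., 0) call yields:
    // group 0 (the whole match) of the FIRST match only.
    static String firstMatch(String input) {
        Matcher m = A_TAG.matcher(input);
        return m.find() ? m.group(0) : ""; // "" on no match is this sketch's choice
    }

    // Collecting every occurrence requires calling find() repeatedly.
    static List<String> allMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = A_TAG.matcher(input);
        while (m.find()) {
            out.add(m.group(0));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<b>ajsdnf</b><a>asdhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>";
        System.out.println(firstMatch(xml));  // <a>asdhf</a>
        System.out.println(allMatches(xml));  // [<a>asdhf</a>, <a>alfnv</a>, <a>test</a>]
    }
}
```

regex101 lists every match because it keeps searching after the first hit, which a single group(0) lookup does not do.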
select regexp_replace('<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(.*?)(<a>.*?<\/a>)(.*?)','$2') from some_table limit 1;
This did the trick. Thanks to nhahtdh for the suggestions.
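For anyone who wants to try the replacement approach outside Hive, here is a minimal Java sketch. It assumes regexp_replace behaves like Java's Matcher.replaceAll, which is an assumption here rather than something checked against the Hive source. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Group 1 lazily consumes whatever precedes the next <a> element,
    // group 2 is the <a>...</a> element itself, group 3 matches nothing.
    static final Pattern AROUND_A = Pattern.compile("(.*?)(<a>.*?</a>)(.*?)");

    // Replacing each match with "$2" keeps only the <a> elements.
    static String keepATags(String input) {
        return AROUND_A.matcher(input).replaceAll("$2");
    }

    public static void main(String[] args) {
        String xml = "<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>";
        System.out.println(keepATags(xml));
        // <a>a<b>aksdhf</b>dhf</a><a>alfnv</a><a>test</a>
    }
}
```

One caveat of this pattern: text that follows the last </a> is never part of a match, so it stays in the output. The sample string happens to end with an <a> element, so the issue does not show up there.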