regex - regexp_extract in Hive not working as expected


I started working with Hive for data preparation and ran into a peculiar problem when using the regexp_extract UDF. I am working on XML structures and trying to extract elements from an XML string. Here is an example. The string I am operating on is:

<b>ajsdnf</b> <a>asdhf</a> <a>alfnv</a> <b>ajsdnf</b> <a>test</a> 

The regular expression (<a>.*?<\/a>) should extract all substrings that consist of an <a> element, including its tags. When I check the logic on regex101, it finds the right groups.

However, when I run it against Hive like this:

select regexp_extract('<b>ajsdnf</b><a>asdhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(<a>.*?<\/a>)',0) from some_table limit 1;

it returns only the first match, <a>asdhf</a>. As I read the documentation of regexp_extract, it should return all occurrences if I pass the integer 0 as the third parameter. Is there any chance I can achieve the following result?

<a>asdhf</a> <a>alfnv</a> <a>test</a> 

And if you are wondering why I am not using XPath to deal with the XML problem: I have a more complex structure and want to extract parts of the XML tree including their child nodes. The XPath UDFs of Hive cannot handle this at the moment.
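For what it's worth, the third argument of regexp_extract is the capture-group index within the first match (0 meaning the whole match), so it does not iterate over all matches. Below is a minimal sketch of that difference. It uses Python's re module as a stand-in for the Java regex engine behind Hive, so it approximates the behaviour and is not Hive itself. The names first_match and all_matches are made up for illustration, and returning None on no match is a choice of this sketch.

```python
import re

XML = "<b>ajsdnf</b><a>asdhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>"
PATTERN = re.compile(r"(<a>.*?</a>)")


def first_match(text, group=0):
    """Approximates regexp_extract: search once, then pick one group.

    group=0 is the whole match, group=1 the first parenthesised group.
    Returns None when nothing matches (an assumption of this sketch).
    """
    m = PATTERN.search(text)
    return m.group(group) if m else None


def all_matches(text):
    """What the question asks for: every non-overlapping match, in order."""
    return [m.group(0) for m in PATTERN.finditer(text)]


print(first_match(XML))            # <a>asdhf</a>
print(" ".join(all_matches(XML)))  # <a>asdhf</a> <a>alfnv</a> <a>test</a>
```

So a single extract call can only ever yield one element; getting all of them needs either iteration over the matches or a replace-style approach that removes everything else.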

select regexp_replace('<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(.*?)(<a>.*?<\/a>)(.*?)','$2') from some_table limit 1;

This did the trick. Thanks to nhahtdh for the suggestions.
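A rough illustration of why the replace-based query works: each match consumes whatever text sits in front of an <a> element together with the element itself, and the replacement keeps only group 2. The sketch below mimics this with Python's re.sub, where \2 plays the role of $2. It is an approximation under that assumption and not the Hive implementation, and keep_a_elements is a hypothetical name.

```python
import re

# Same pattern as in the query; the "/" needs no escaping here.
PATTERN = re.compile(r"(.*?)(<a>.*?</a>)(.*?)")


def keep_a_elements(text):
    """Replace every match with its second group, i.e. the <a> element."""
    return PATTERN.sub(r"\2", text)


xml = "<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>"
print(keep_a_elements(xml))
# <a>a<b>aksdhf</b>dhf</a><a>alfnv</a><a>test</a>

# Text after the last <a> element is never part of a match, so it survives.
print(keep_a_elements(xml + "<b>tail</b>"))
# <a>a<b>aksdhf</b>dhf</a><a>alfnv</a><a>test</a><b>tail</b>
```

One caveat this makes visible: anything following the last <a> element is left in the output, so input that does not end with an <a> element would need an extra cleanup step.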

