Substring using Oracle SQL regex

advertisements

I've created a regex which would capture the string I need. When I am testing regex on websites such as rubular.com then everything works, however when I put the same regex into REGEXP_SUBSTR function then it doesn't work.

Here are 2 SQL examples (one with text in English and another one in Kristaps Porzingis' language):

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">This is the text I would like to substr! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<')
  FROM dual;

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<')
  FROM dual;

I am trying to select text between <li class="b"> and next html tag, which in this case is <p class="tutorial">.

Any advice on what am I doing wrong?


Parsing HTML with regex is not advisable, you'd better fetch the strings and parse them with a language that has convenient means to parse HTML.

If you only have Oracle DBMS at hand, for a one-off job, you may consider using the following regexp_substr:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>',
      '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) as RESULT from dual

See the REXTESTER demo:

Here,

  • <li\s+class="b"> - matches <li, 1+ whitespaces, class="b">literal substring
  • ([^<]+) - captures into Group #1 one or more chars other than <

The last 1 argument lets you access the contents of this Group 1.