How to sanitize a string with nested HTML tags but keep the <em> marks?


I am trying to sanitize Solr search results, because they contain HTML tags:

ActionController::Base.helpers.sanitize( result_string )

It is easy to sanitize a string that is not highlighted, like: I know <ul><li>ruby</li> <li>rails</li></ul>.

But when the results are highlighted, I have additional important tags inside, <em> and </em>:

I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.

So when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags left in it. And that is bad :)

How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?

I found a way, but it's slow and not pretty:

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag|
  string.gsub!( "<<em>#{tag}</em>>",  '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end

string = ActionController::Base.helpers.sanitize string, tags: %w(em)

How can I optimize this, or is there a better solution? For example, a regex that removes HTML tags but keeps <em> and </em>.

Please help, thanks.

You could call gsub! to discard all tags and keep only the <em> tags that stand on their own, that is, those that are not embedded inside another HTML tag.

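result_string.gsub!(/(<(?!\/?em>)[^<>]+>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')

would do the trick.

To explain:

# first group (<(?!\/?em>)[^<>]+>)
# find all html tags that are not <em> or </em>:
# the negative lookahead (?!\/?em>) rejects those two,
# and [^<>]+ accepts tag names of any length (p, ul, span, ...)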

# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>>   or <<em>ul</em>>

# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>>   or  </<em>ul</em>>

# and gsub replaces all of this with empty string
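
As a rough illustration, the same idea can be wrapped in a small plain-Ruby method and tried on the strings from the question. The names HIGHLIGHT_SAFE_TAGS and strip_tags_keep_em are made up for this sketch, and it assumes the highlighter emits bare <em> and </em> tags without attributes. The first alternative uses a negative lookahead so that plain tags of any length are covered, and the non-bang gsub is used because gsub! returns nil when nothing was replaced.

```ruby
# Hypothetical names for illustration; not part of Rails or Solr.
# Assumes highlighting tags are exactly <em> and </em> (no attributes).
HIGHLIGHT_SAFE_TAGS = %r{
  <(?!/?em>)[^<>]+>      # any plain tag except <em> and </em>
  | </?<em>\w*</em>>     # opening or closing tag whose name is wrapped in <em>
}x

# Returns a new string with every tag removed except standalone <em> and </em>.
def strip_tags_keep_em(text)
  text.gsub(HIGHLIGHT_SAFE_TAGS, '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.'

puts strip_tags_keep_em(highlighted)
# prints: I <em>know</em> <em>ruby</em> <em>rails</em>.
```

This only removes tags by pattern and is not an HTML sanitizer, so in a Rails app it is probably still worth passing the result through sanitize with tags: %w(em) afterwards, as in the question.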