I am trying to sanitize Solr search results, because they contain HTML tags:
ActionController::Base.helpers.sanitize( result_string )
It is easy to sanitize a non-highlighted string like: I know <ul><li>ruby</li> <li>rails</li></ul>.
But when the results are highlighted, I have additional important tags inside, <em> and </em>:
I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>
So when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags left in it. And it is bad :)
How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?
I found a way, but it's slow and not pretty:
string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag|
  string.gsub!("<<em>#{tag}</em>>", '')
  string.gsub!("</<em>#{tag}</em>>", '')
end
string = ActionController::Base.helpers.sanitize string, tags: %w(em)
How can I optimize it, or is there a better solution? For example, a regex that removes the HTML tags but keeps <em> and </em>.
Please help, thanks.
You could call gsub! to discard every tag except the <em> and </em> tags that stand on their own, i.e. those that are not embedded inside another HTML tag.
result_string.gsub!(/(<\/?(?!em>)\w+[^>]*>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')
would do the trick.
To explain:
# first group (<\/?(?!em>)\w+[^>]*>)
# finds all html tags that are not <em> or </em>
# (the lookahead (?!em>) skips <em> and </em>; \w+[^>]* covers
# tag names of any length, with or without attributes)
# second group (<<em>\w*<\/em>>)
# finds all opening tags that have <em> </em> inside of them, like:
# <<em>li</em>> or <<em>ul</em>>
# third group (<\/<em>\w*<\/em>>)
# finds all closing tags that have <em> </em> inside of them:
# </<em>li</em>> or </<em>ul</em>>
# and gsub! replaces all of these with an empty string
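As a quick check, here is a minimal plain-Ruby sketch of the same idea applied to the string from the question. The helper name strip_tags_keep_em and the two constants are hypothetical, introduced only for illustration. The pattern for ordinary tags uses a negative lookahead so that tag names of any length are covered; that is an assumption about the tags you expect in the results, not something Solr or Rails guarantees.

```ruby
# Hypothetical helper, for illustration only: removes every tag except
# <em> and </em>, including tags whose name was wrapped by the highlighter.

# Matches <<em>ul</em>> as well as </<em>ul</em>>
HIGHLIGHTED_TAG = /<\/?<em>\w*<\/em>>/
# Matches any ordinary tag except <em> and </em>, attributes included
PLAIN_TAG = /<\/?(?!em>)\w+[^>]*>/

def strip_tags_keep_em(text)
  # Non-destructive gsub, so the caller's string is left untouched
  text.gsub(Regexp.union(HIGHLIGHTED_TAG, PLAIN_TAG), '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

puts strip_tags_keep_em(highlighted)
# prints: I <em>know</em> <em>ruby</em> <em>rails</em>
```

If you still want a safety net, the result can be passed through sanitize with tags: %w(em) afterwards, as in the question; at that point only the highlighting tags should be left for it to keep.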