How do I match the text with a regular expression ignoring punctuation and line breaks


I have an app where I need to find the position of a list of words in a passage of text. A regex is clearly the way to do this, but the issue is that there may be all kinds of punctuation or new lines between the words. How do I express "find these words, possibly separated by some non-alphanumeric characters"?

UPDATE:

An example would be that I need to find the range of:

shouted help these regular expressions are horrible so

in

The developer shouted "help", these regular expressions are horrible! So, please help me :(
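
One possible way to get that range, sketched in Python: join the escaped words with a "one or more non-alphanumeric characters" separator and search case-insensitively. The function name `find_phrase_span` and the separator class are assumptions made for illustration; they are not part of the question or the answer below.

```python
import re

def find_phrase_span(words, text):
    """Return (start, end) of the first run of `words` in `text`, or None."""
    # Anything that is not a letter or digit may sit between two words.
    separator = r"[^A-Za-z0-9]+"
    pattern = r"\b" + separator.join(re.escape(w) for w in words) + r"\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.span() if match else None

text = 'The developer shouted "help", these regular expressions are horrible! So, please help me :('
words = "shouted help these regular expressions are horrible so".split()

span = find_phrase_span(words, text)
if span is not None:
    start, end = span
    print(text[start:end])  # shouted "help", these regular expressions are horrible! So
```

This sketch only tolerates punctuation and line breaks between words. A word that is itself split across a line break would need a pattern that also allows such characters inside a word.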


Description

\b(?:[a-z](?:[a-z\n\r.:;,?!-]*[a-z])?)\b


This regular expression (used with the case-insensitive flag, since the sample matches below include `How` and `I`) will do the following:

  • Requires all words to start and end with a-z, or be a single letter long
  • Allows words to contain new line characters, or common punctuation like .:;,?!-
  • Words are not allowed to contain spaces
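
As a rough illustration of those rules, the pattern can be run from Python. The sample string is made up for this sketch, and `re.IGNORECASE` is assumed because the pattern only lists lowercase letters.

```python
import re

# The pattern from the description; IGNORECASE is assumed so that
# capitalised words such as "How" are matched as well.
WORD = re.compile(r"\b(?:[a-z](?:[a-z\n\r.:;,?!-]*[a-z])?)\b", re.IGNORECASE)

sample = "pos-\nsibly separated, but breaks?\nHow"

# Print each match with its (start, end) offsets; end is exclusive.
for m in WORD.finditer(sample):
    print(m.span(), repr(m.group()))

found = [m.group() for m in WORD.finditer(sample)]
```

The hyphenated word is kept together across the line break, and the trailing comma after "separated" is dropped because a word must end with a letter. "breaks?" and "How" are still joined into one match, because `?` and the line break are both allowed inside a word.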

Example

Live Demo

https://regex101.com/r/bK4oO8/1

Sample text

How do I match text with a regular expres
sion ignoring punctuation and line breaks?
How do I do "find these words pos-
sibly separated but some non-alphanumeric characters"?

Sample Matches

MATCH 1
0.  [0-3]   `How`

MATCH 2
0.  [4-6]   `do`

MATCH 3
0.  [7-8]   `I`

MATCH 4
0.  [9-14]  `match`

MATCH 5
0.  [15-19] `text`

MATCH 6
0.  [20-24] `with`

MATCH 7
0.  [25-26] `a`

MATCH 8
0.  [27-34] `regular`

MATCH 9
0.  [35-46] `expres
sion`

MATCH 10
0.  [47-55] `ignoring`

MATCH 11
0.  [56-67] `punctuation`

MATCH 12
0.  [68-71] `and`

MATCH 13
0.  [72-76] `line`

MATCH 14
0.  [77-88] `breaks?
How`

MATCH 15
0.  [89-91] `do`

MATCH 16
0.  [92-93] `I`

MATCH 17
0.  [94-96] `do`

MATCH 18
0.  [98-102]    `find`

MATCH 19
0.  [103-108]   `these`

MATCH 20
0.  [109-114]   `words`

MATCH 21
0.  [115-125]   `pos-
sibly`

MATCH 22
0.  [126-135]   `separated`

MATCH 23
0.  [136-139]   `but`

MATCH 24
0.  [140-144]   `some`

MATCH 25
0.  [145-161]   `non-alphanumeric`

MATCH 26
0.  [162-172]   `characters`

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    [a-z]                    any character of: 'a' to 'z'
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      [a-z\n\r.:;,?!-]*        any character of: 'a' to 'z', '\n'
                               (newline), '\r' (carriage return),
                               '.', ':', ';', ',', '?', '!', '-' (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
      [a-z]                    any character of: 'a' to 'z'
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
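
The same breakdown can be kept next to the pattern itself. This is a sketch using Python's `re.VERBOSE` flag, which ignores whitespace outside character classes and allows `#` comments. The names `WORD_VERBOSE` and `WORD_COMPACT` are only illustrative.

```python
import re

WORD_VERBOSE = re.compile(r"""
    \b                          # word boundary
    (?:
        [a-z]                   # first letter of the word
        (?:
            [a-z\n\r.:;,?!-]*   # letters, line breaks or punctuation (greedy)
            [a-z]               # last letter of the word
        )?                      # the tail is optional: single-letter words
    )
    \b                          # word boundary
""", re.VERBOSE | re.IGNORECASE)

# The one-line form, for comparison.
WORD_COMPACT = re.compile(r"\b(?:[a-z](?:[a-z\n\r.:;,?!-]*[a-z])?)\b", re.IGNORECASE)

sample = "How do I match a regular expres\nsion?"
verbose_matches = WORD_VERBOSE.findall(sample)
compact_matches = WORD_COMPACT.findall(sample)
print(verbose_matches == compact_matches)
```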


Extra Credit

You may also want to eliminate matches like #14 above, where a ? is followed by a new line character. In that configuration the ? should not be considered part of the word, whereas a - followed by a new line really is a hyphen joining the two halves of a word. In that case, consider this expression:

\b(?:[a-z](?:(?:[a-z-]+|[.:;,?!-]+(?![\n\r])|[\n\r]+)*[a-z])?)\b

Live Demo: https://regex101.com/r/bK4oO8/2
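
A quick way to check the stricter pattern from Python is sketched below. The sample string is invented for the check, and `re.IGNORECASE` is again an assumption.

```python
import re

# The "extra credit" pattern: punctuation may not be followed by a line
# break, but a hyphen followed by a line break still joins a word.
STRICT_WORD = re.compile(
    r"\b(?:[a-z](?:(?:[a-z-]+|[.:;,?!-]+(?![\n\r])|[\n\r]+)*[a-z])?)\b",
    re.IGNORECASE,
)

sample = "line breaks?\nHow do pos-\nsibly"

# No capturing groups are used, so findall returns the whole matches.
found = STRICT_WORD.findall(sample)
print(found)
```

Here "breaks" and "How" come back as separate words, while "pos-" and "sibly" stay joined. One caveat: `[a-z-]+` sits inside a repeated group, so a long run of letters that cannot end on a word boundary (for example, letters followed directly by a digit or underscore) may cause heavy backtracking in backtracking engines. It is worth trying the pattern on real input before relying on it.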