Capture the text in the middle of a sentence using R


I have a string like the one below:

1字樓High Care Level

I want to extract the term "High Care Level", so I use the following script:

> gsub("([A-Z][a-z]+ Care Level).*","\\1","1字樓High Care Level")
[1] "1字樓High Care Level"

Why is 1字樓 still there?

Remark: 1字樓 is a Chinese phrase.


1字樓 is still there because your pattern does not match it, so it is not replaced.
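
To see this in isolation, here is a minimal sketch with a made-up ASCII string (`x` is a hypothetical input, not the one from the question). It illustrates that gsub rewrites only the part of the string the pattern matches and leaves everything else as it is:

```r
# Hypothetical input, chosen only for illustration
x <- "abcHigh Care Level"

# The pattern matches "High Care Level" and replaces it with itself,
# so the unmatched prefix "abc" stays in the result.
unanchored <- gsub("([A-Z][a-z]+ Care Level).*", "\\1", x)
print(unanchored)

# Once the pattern also consumes the prefix, the prefix is replaced too.
anchored <- gsub("^.*([A-Z][a-z]+ Care Level).*", "\\1", x)
print(anchored)
```

The first call returns the input unchanged; the second returns only the captured group.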

You need to match the leading part of the string as well, and keep only the text you want in a capture group (as you have already done). Anchor the expression at the start of the string with ^ and follow it with .* to consume any number of characters. The group ([A-Z][a-z]+ Care Level) then captures one upper-case letter followed by one or more lower-case letters, a space, and Care Level, and the trailing .* matches the rest of the string. Because the pattern now matches the whole string, the whole string is replaced by the contents of back-reference 1 alone:

gsub("^.*([A-Z][a-z]+ Care Level).*","\\1","1字樓High Care Level")
#[1] "High Care Level"
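
As an alternative sketch, if the goal is to extract the phrase and not to rewrite the string, base R's regexpr and regmatches can pull out the matched text directly, so nothing around it needs to be matched. The variable names `x`, `pattern`, `m` and `extracted` are only illustrative, and this assumes the session handles the string as UTF-8:

```r
x <- "1字樓High Care Level"
pattern <- "[A-Z][a-z]+ Care Level"

# regexpr() locates the first match; regmatches() returns the matched text.
m <- regexpr(pattern, x)
extracted <- regmatches(x, m)
print(extracted)
```

Note that when the pattern does not match, regmatches returns a zero-length character vector, whereas the gsub approach returns the input unchanged.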