I have a text containing several words, and I want to strip the derivational suffixes from them. For example, I want to remove suffixes such as -ed and -ing and keep the base verb: given verifying or verified, I want to keep verify. I found Python's strip method, which removes given characters from the beginning or end of a string, but that is not exactly what I want. Is there a library, in Python for example, that does this?
I tried the code from the proposed post and noticed odd trimming in several words. For example, I have the following text:
We goin all the way βπƒβ΅οΈβ΅οΈ Think ive caught on to a really good song ! Im writing π Lookin back on the stuff i did when i was lil makes me laughh π‚ I sneezed on the beat and the beat got sicka #nashnewvideo http://t.co/10cbUQswHR Homee βοΈβοΈβοΈπ΄ So much respect for this man , truly amazing guy βοΈ @edsheeran http://t.co/DGxvXpo1OM" What a day .. RT @edsheeran: Having some food with @ShawnMendes #VoiceSave christina π Im gunna make the βοΈ sign my signature pose You all are so beautiful .. π soooo beautiful Thought that was a really awesome quote Beautiful things don't ask for attention"""
After running the code below (which also removes non-Latin characters and URLs), I get the following output:
we goin all the way think ive caught on to a realli good song im write lookin back on the stuff i did when i wa lil make me laughh i sneez on the beat and the beat got sicka nashnewvideo home so much respect for thi man truli amaz guy what a day rt have some food with voicesav christina im gunna make the sign my signatur pose you all are so beauti soooo beauti thought that wa a realli awesom quot beauti thing dont ask for attent
For example, it trims beautiful to beauti, quote to quot, and really to realli. My code is the following:

reader = csv.reader(f)
print doc
for row in reader:
    text = re.sub(r"(?:\@|https?\://)\S+", "", row)
    filter(lambda x: x in string.printable, text)
    out = text.translate(string.maketrans("",""), string.punctuation)
    out = re.sub("[\W\d]", " ", out.strip())
    word_list = out.split()
    str1 = ""
    for verb in word_list:
        verb = verb.lower()
        verb = nltk.stem.porter.PorterStemmer().stem_word(verb)
        str1 = str1+" "+verb+" "
    list.append(str1)
    str1 = "\n"
Instead of a stemmer, you can use a lemmatizer. Here's an example with Python NLTK:
from nltk.stem import WordNetLemmatizer

s = """ You all are so beautiful soooo beautiful Thought that was a really awesome quote Beautiful things don't ask for attention """

wnl = WordNetLemmatizer()
print(" ".join([wnl.lemmatize(i) for i in s.split()]))
#You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
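In some cases it may not do what you expect, because lemmatize() treats every word as a noun unless you pass a part of speech (with pos='v', going becomes go):

print(wnl.lemmatize('going'))
#going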
Then you can combine both approaches:
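One possible way to combine them, as a rough sketch rather than a definitive recipe: lemmatize first, and only if the word comes back unchanged try the stemmer, accepting its result only when it is itself a known word. That last check is my own assumption, added so that stems like beauti are rejected. The helpers `toy_lemmatize`, `toy_stem`, `KNOWN_WORDS`, `IRREGULAR` and `normalize` are all hypothetical stand-ins so the example runs without NLTK or its WordNet data.

```python
# Hypothetical toy data standing in for WordNet; not part of NLTK.
KNOWN_WORDS = {"go", "beautiful", "thing", "quote", "really", "verify"}
IRREGULAR = {"things": "thing", "verified": "verify"}


def toy_lemmatize(word):
    """Dictionary lookup; returns the word unchanged when it is unknown."""
    return IRREGULAR.get(word, word)


def toy_stem(word):
    """Crude suffix stripping, far simpler than the Porter algorithm."""
    for suffix in ("ing", "ful", "ly", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def normalize(word, lemmatize=toy_lemmatize, stem=toy_stem,
              is_word=KNOWN_WORDS.__contains__):
    """Lemmatize; fall back to the stem only if the stem is a real word."""
    word = word.lower()
    lemma = lemmatize(word)
    if lemma != word:
        return lemma
    candidate = stem(word)
    # Reject stems such as "beauti" or "quot" that are not words themselves.
    return candidate if is_word(candidate) else word


if __name__ == "__main__":
    for w in ["going", "beautiful", "things", "quote", "Verified"]:
        print(w, "->", normalize(w))
```

With NLTK installed, you could presumably pass `WordNetLemmatizer().lemmatize` and `PorterStemmer().stem` as `lemmatize` and `stem`, and something like `lambda w: bool(wordnet.synsets(w))` (from `nltk.corpus`) as `is_word`; I have not tested that combination on your data.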