Remove word extensions in Python

I've got a text with several words, and I want to remove the derivative endings from them. For example, I want to remove the endings -ed and -ing and keep the base verb: if I have verifying or verified, I want to keep verify. I found the strip method in Python, which removes given characters from the beginning or end of a string, but that is not exactly what I want. Is there a Python library that does this?

I've tried the code from the proposed post and noticed some odd trimming in several words. For example, I have the following text:

 We goin all the way βπƒβ΅οΈβ΅οΈ
 Think ive caught on to a really good song ! Im writing π
 Lookin back on the stuff i did when i was lil makes me laughh π‚
 I sneezed on the beat and the beat got sicka
 #nashnewvideo http://t.co/10cbUQswHR
 Homee βοΈβοΈβοΈπ΄
 So much respect for this man , truly amazing guy βοΈ @edsheeran
 http://t.co/DGxvXpo1OM"
 What a day ..
 RT @edsheeran: Having some food with @ShawnMendes
 #VoiceSave  christina π
 Im gunna make the βοΈ sign my signature pose
 You all are so beautiful .. π soooo beautiful
 Thought that was a really awesome quote
 Beautiful things don't ask for attention"""

After running the code below (which also removes non-Latin characters and URLs), I get:

 we  goin  all  the  way
 think  ive  caught  on  to  a  realli  good  song  im  write
 lookin  back  on  the  stuff  i  did  when  i  wa  lil  make  me  laughh
 i  sneez  on  the  beat  and  the  beat  got  sicka
 nashnewvideo
 home
 so  much  respect  for  thi  man  truli  amaz  guy
 what  a  day
 rt  have  some  food  with
 voicesav  christina
 im  gunna  make  the  sign  my  signatur  pose
 you  all  are  so  beauti  soooo  beauti
 thought  that  wa  a  realli  awesom  quot
 beauti  thing  dont  ask  for  attent

For example, it trims beautiful to beauti, quote to quot, and really to realli. My code is the following:

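import csv
import re
import string

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # create the stemmer once, not once per word
lines = []

reader = csv.reader(f)  # f: the already-open CSV file
for row in reader:
    # Drop @mentions and URLs from the text column.
    text = re.sub(r"(?:@|https?://)\S+", "", row[2])
    # Keep printable ASCII only; the filtered result has to be assigned.
    text = "".join(ch for ch in text if ch in string.printable)
    # Remove punctuation, then turn digits and other non-letters into spaces.
    out = text.translate(str.maketrans("", "", string.punctuation))
    out = re.sub(r"[\W\d]", " ", out.strip())
    # Lowercase and stem every word, then rebuild the line.
    stems = [stemmer.stem(word.lower()) for word in out.split()]
    lines.append(" ".join(stems))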


Instead of a stemmer, you can use a lemmatizer. Here's an example with Python's NLTK:

from nltk.stem import WordNetLemmatizer

s = """
 You all are so beautiful soooo beautiful
 Thought that was a really awesome quote
 Beautiful things don't ask for attention
 """

wnl = WordNetLemmatizer()
# Prints: You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
print(" ".join(wnl.lemmatize(w) for w in s.split()))

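In some cases it may not do what you expect, because lemmatize() treats every word as a noun unless you pass a part-of-speech tag:

print(wnl.lemmatize('going'))           # going
print(wnl.lemmatize('going', pos='v'))  # go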

Then you can combine both approaches: stemming and lemmatization.
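
One way to read "combine both approaches", as a rough sketch rather than a definitive recipe: try the lemmatizer with a verb tag, then with a noun tag, and only fall back to the stemmer if the word comes back unchanged. The names normalize and nltk_normalizer are made up for this illustration, and the NLTK wiring assumes the WordNet data has already been downloaded (for example with nltk.download('wordnet')).

```python
def normalize(word, lemmatize, stem=None):
    """Return a base form for `word`.

    `lemmatize(word, pos)` and `stem(word)` are callables supplied by the
    caller (hypothetical interface, chosen for this sketch).
    """
    word = word.lower()
    # Try the verb reading first, then the noun reading.
    for pos in ("v", "n"):
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    # The lemmatizer left the word alone; optionally fall back to the stemmer.
    return stem(word) if stem is not None else word


def nltk_normalizer(use_stemmer=False):
    """Wire normalize() to NLTK's WordNetLemmatizer and PorterStemmer."""
    # Imported here so normalize() itself can be used without NLTK installed.
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    wnl = WordNetLemmatizer()
    stem = PorterStemmer().stem if use_stemmer else None
    return lambda word: normalize(word, wnl.lemmatize, stem)


if __name__ == "__main__":
    to_base = nltk_normalizer()
    print(" ".join(to_base(w) for w in "I was verifying the things".split()))
```

Turning the stemmer fallback on brings back non-words such as beauti for anything WordNet leaves alone, so keep use_stemmer=False if you need real words in the output.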