Strip URL - Python


OK, how do I use a regex to remove http AND/OR www from a URL, leaving just the domain?

Assume x is any kind of TLD or ccTLD.

Input example:





If you really want to use regular expressions instead of urlparse() or splitting the string:

>>> import re
>>> domain = ''
>>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]

The regular expression might be a bit simplistic, but it works. It also doesn't replace anything; I think extracting the domain is easier than stripping the rest away.
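Here is a rough sketch of how that pattern might be wrapped up and tried out. The function name `extract_domain` and the `example.com`-style URLs are my own placeholders, not from the question, and the character class is written as `[a-zA-Z0-9-]` so that the digit 0 is accepted too.

```python
import re

# Optional scheme, optional leading labels, then capture "name.tld".
# The class [a-zA-Z0-9-] covers letters, all ten digits and the hyphen.
DOMAIN_RE = re.compile(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')

def extract_domain(url):
    """Return the captured 'name.tld' part, or None if nothing matches."""
    m = DOMAIN_RE.match(url)
    return m.group(1) if m else None

if __name__ == '__main__':
    # Placeholder inputs, chosen only for illustration.
    for url in ('http://www.example.com/', 'www.example10.org', 'localhost'):
        print(url, '->', extract_domain(url))
```

Checking for `None` avoids the `AttributeError` you would get from calling `.groups()` on a failed match. The pattern stays simplistic: a path with two or more dots can fool it, so `'http://example.com/a.b.html'` comes back as `'b.html'`.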

To support domains like '', one can do the following:

>>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
>>> p.match(domain).groups()

('google', '')

So you have to check the result for domains like '', and join the two groups again in that case. Normal domains should work OK. I could not make it work when there are multiple subdomains.
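The check-and-join step could look something like the sketch below. The helper name `registered_domain` and the tiny `TWO_PART_SUFFIXES` set are assumptions of mine for illustration; a real list of two-part suffixes would be much longer.

```python
import re

# Two-group pattern: group 1 is an optional extra label, group 2 is "name.tld".
P = re.compile(
    r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*'
)

# Hypothetical, deliberately incomplete set of two-part suffixes.
TWO_PART_SUFFIXES = {'co.uk', 'com.au', 'co.jp'}

def registered_domain(url):
    """Join the two groups only when group 2 looks like a two-part suffix."""
    m = P.match(url)
    if m is None:
        return None
    name, rest = m.groups()
    if name and rest in TWO_PART_SUFFIXES:
        return name + '.' + rest
    return rest

if __name__ == '__main__':
    # Placeholder inputs, chosen only for illustration.
    print(registered_domain('http://www.example.co.uk/'))
    print(registered_domain('http://www.example.com/'))
```

This only papers over the one case. Inputs with several subdomains, or a two-part suffix with no leading www (for instance `'http://example.co.uk/'`, which yields just `'co.uk'`), still come out wrong.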

One-liner without regular expressions or fancy modules:

>>> domain = ''
>>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
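For comparison, the urlparse() route mentioned earlier might look roughly like this in Python 3, where the function lives in `urllib.parse`. The name `strip_url` and the sample URLs are placeholders I made up. Unlike the one-liner, this sketch also copes with https and ports, but it only drops a leading `www.` and keeps any other subdomain.

```python
from urllib.parse import urlparse

def strip_url(url):
    """Return the host name without scheme, port, path or a leading 'www.'."""
    # urlparse only fills in the network location when the input has a
    # scheme or starts with '//', so add '//' to bare host names.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host

if __name__ == '__main__':
    # Placeholder inputs, chosen only for illustration.
    for url in ('http://www.example.com/path', 'www.example.org',
                'https://example.com:8080/x'):
        print(url, '->', strip_url(url))
```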