Strip URL - Python

How do I use a regex to remove http:// and/or www. so that http://www.domain.com/ becomes domain.com?

Assume x is any kind of TLD or ccTLD.

Input example:

http://www.domain.x/

www.domain.x

Output:

domain.x


If you really want to use regular expressions instead of urlparse() or splitting the string:

>>> import re
>>> domain = 'http://www.example.com/'
>>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]
'example.com'

The regular expression might be a bit simplistic, but it works. It also extracts the domain instead of replacing the prefix, which I think is easier.
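If replacing is what you want, as the question asks, here is a minimal sketch using re.sub. The function name strip_prefix is made up for illustration, and the pattern assumes the only things to remove are an optional scheme and an optional leading 'www.':

```python
import re

def strip_prefix(url):
    # Remove an optional scheme ('http://', 'https://', ...) and an
    # optional leading 'www.' from the start of the string.
    stripped = re.sub(r'^(?:\w+://)?(?:www\.)?', '', url)
    # Keep only the host part, dropping any path after the first '/'.
    return stripped.split('/', 1)[0]

print(strip_prefix('http://www.domain.x/'))
print(strip_prefix('www.domain.x'))
```

Unlike the match-based version, this leaves other subdomains in place, so it only covers the http/www case from the question.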

To support domains like 'co.uk', one can do the following:

>>> domain = 'http://www.google.co.uk/'
>>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
>>> p.match(domain).groups()
('google', 'co.uk')

So you have to check the result for suffixes like 'co.uk' and join the two groups again in that case. Normal domains should work fine. I could not make it work when there are multiple subdomains.
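The check-and-join step could look roughly like this. It is only a sketch: registered_domain and TWO_PART_SUFFIXES are hypothetical names, and the suffix set is a tiny hand-picked sample, not a complete list.

```python
import re

PATTERN = re.compile(
    r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*'
)

# Assumption: a small illustrative sample of two-part suffixes.
TWO_PART_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}

def registered_domain(url):
    match = PATTERN.match(url)
    if match is None:
        return None
    label, tail = match.groups()
    # For suffixes like 'co.uk' the name ends up in the first group,
    # so join the two groups back together.
    if label and tail in TWO_PART_SUFFIXES:
        return label + '.' + tail
    return tail

print(registered_domain('http://www.google.co.uk/'))
print(registered_domain('http://www.example.com/'))
```

Note that with this pattern an input without a leading subdomain, such as 'http://google.co.uk/', appears to give (None, 'co.uk'), so the helper returns just 'co.uk' there.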

One-liner without regular expressions or fancy modules:

>>> domain = 'http://www.example.com/'
>>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
'example.com'
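The one-liner only strips 'http://' and always keeps the last two labels. For comparison, here is a rough sketch of the urlparse route mentioned earlier, written for Python 3, where that module lives at urllib.parse. The function name strip_url is hypothetical, and the sketch assumes only a leading 'www.' should be dropped:

```python
from urllib.parse import urlsplit

def strip_url(url):
    # urlsplit only recognises a host after '//', so add it for
    # bare inputs such as 'www.domain.x'.
    if '//' not in url:
        url = '//' + url
    host = urlsplit(url).hostname or ''
    # Drop a leading 'www.' and leave everything else untouched.
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host

print(strip_url('http://www.domain.x/'))
print(strip_url('www.domain.x'))
```

This handles https and ports as well, because the parsing is left to the standard library.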