Regex to split URLs in text that has no separators


Apologies for yet another regex question!

I have some input text which, rather unhelpfully, has multiple URLs (and only URLs) all on one line with no separators:

https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains just two URLs, but there could be more.

I'm trying to separate the URLs into a list using Python.

I've searched for solutions and tried a few, but I can't get any of them to work exactly, as they greedily consume all the following URLs, e.g. https://stackoverflow.com/a/6883094/659346

I realise that's probably because https://... is legally allowed in the query part of a URL, but in my case I'm willing to assume it never appears there, and that whenever it occurs it's the start of the next URL.

I also tried (http[s]://.*?), but with the ? it matches just the https://, and without the ? it matches the whole bit of text.


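You need to use a positive lookahead assertion. The non-greedy .*? then stops as soon as the lookahead sees the next http:// or https://, a whitespace character, or the end of the string.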

>>> import re
>>> s = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
>>> re.findall(r'https?://.*?(?=https?://|$|\s)', s)
['https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZg', 'https://console.developers.google.com/project/reducted/?authuser=1']
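For comparison, here is a minimal sketch that puts the lazy-only, greedy, and lookahead patterns side by side, which may make it clearer why the attempts in the question fail. The helper name split_urls and the example.* URLs are made up for illustration, and the sketch relies on the question's assumption that http:// or https:// never appears inside a URL.

```python
import re

# Same pattern as in the answer: a lazy body terminated by a lookahead.
URL_PATTERN = re.compile(r'https?://.*?(?=https?://|\s|$)')


def split_urls(text):
    """Return every URL in text.

    Assumes (as in the question) that 'http://' or 'https://'
    always marks the start of a new URL. Hypothetical helper name.
    """
    return URL_PATTERN.findall(text)


# Made-up sample input: three URLs glued together, trailing newline.
sample = "https://a.example/x?q=1https://b.example/y/http://c.example\n"

# Lazy without a lookahead: .*? is allowed to match nothing, so it does.
lazy = re.findall(r'https?://.*?', sample)

# Greedy: .* runs to the end of the line and swallows the later URLs.
greedy = re.findall(r'https?://.*', sample)

# Lazy plus lookahead: .*? grows only until the next URL start,
# a whitespace character, or the end of the string.
urls = split_urls(sample)

print(lazy)    # ['https://', 'https://', 'http://']
print(greedy)  # ['https://a.example/x?q=1https://b.example/y/http://c.example']
print(urls)    # ['https://a.example/x?q=1', 'https://b.example/y/', 'http://c.example']
```

The lookahead is zero-width, so the https:// that ends one match is not consumed and is still available as the start of the next one.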