Regular Expressions Aren’t the Devil

I love regular expressions. Okay, I love the challenge of crafting regular expressions. I do not enjoy reading regular expressions that I have not created or, really, even the ones I do create. But give me a problem and tell me to make a regular expression to match things and I am all over it.

A co-worker wanted a regular expression to turn unlinked URLs in text into HTML links and to fix existing links whose URLs lacked a protocol. In other words, if “www.google.com” appeared in some text, it needed to be replaced with <a href="http://www.google.com">www.google.com</a>, and <a href="www.google.com">some link text</a> needed to turn into <a href="http://www.google.com">some link text</a>.

My first pass was a monster regular expression that handled both situations, but I couldn’t get the replacement string to account for the link text that already existed in the second case. I also couldn’t adequately cover links that had other attributes before the href attribute. So I scrapped that one.

This is what I came up with after separating it into two replacement passes. I share it with you both as a testament to my regular expression abilities (good or bad, you decide) and because this situation seems like one that might come up pretty frequently.

First pass (unlinked URLs)
Regular expression: (?<=\s|^)(?<domain>www\.[^\s]+)(?=\s)|(?<=\s)(?<protocol>http[s]?://){1}(?<domain>(www)?\.?[^\s]+)(?=\s)
Replacement string: <a href="http://${domain}">${domain}</a>

Second pass (links missing a protocol)
Regular expression: href="(?<domain>www\.[^"]+)"
Replacement string: href="http://${domain}"
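
For anyone who wants to try the two passes, here is a rough Python sketch of the same idea. It is an adaptation, not the expressions above verbatim: the originals use named groups and ${domain} substitutions in what looks like .NET syntax (my assumption), while Python's re module wants (?P<name>...) groups, fixed-width lookbehinds, and unique group names. The function names link_bare_urls, fix_href_protocol, and fix_links are made up for illustration.

```python
import re

# Pass 1: wrap bare URLs in an anchor tag. The two alternatives capture
# into separate groups because Python does not allow duplicate group names.
BARE_URL = re.compile(
    r"(?:(?<=\s)|^)"             # preceded by whitespace or the start of the text
    r"(?:(?P<www>www\.\S+)"      # www.example.com
    r"|https?://(?P<rest>\S+))"  # http://example.com or https://example.com
    r"(?=\s)"                    # followed by whitespace, as in the original
)

# Pass 2: add a protocol to href values that start with "www."
HREF_WWW = re.compile(r'href="(?P<domain>www\.[^"]+)"')


def link_bare_urls(text: str) -> str:
    """Turn unlinked URLs into <a> elements (hypothetical helper)."""
    def repl(match: re.Match) -> str:
        domain = match.group("www") or match.group("rest")
        return f'<a href="http://{domain}">{domain}</a>'
    return BARE_URL.sub(repl, text)


def fix_href_protocol(text: str) -> str:
    """Prefix http:// to href values that lack a protocol (hypothetical helper)."""
    return HREF_WWW.sub(r'href="http://\g<domain>"', text)


def fix_links(text: str) -> str:
    """Run both passes, bare URLs first."""
    return fix_href_protocol(link_bare_urls(text))


print(fix_links("see www.google.com now"))
print(fix_links('<a href="www.google.com">some link text</a>'))
```

Two behaviors carry over from the expressions above and are worth knowing about: a URL at the very end of the text is left alone because the lookahead insists on trailing whitespace, and an https:// URL comes out with an http:// href because the replacement string hard-codes the protocol.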