Regular Expressions (part 2)

The first regexp came about after someone on IRC asked for help: could anyone remove the comments from a C++ source file? I tried to help:

(/\*([^*]|[\n]|(\*+([^*/]|[\n])))*\*+/)|(//.*)

Note: this is not the exact expression I came up with at the time, but it is richer and better than the one I gave back then.

Let’s analyze it:

  1. /\* – matches the /* that begins any block comment.
  2. [^*]|[\n] – matches any character except *, or the newline character.
  3. \*+ – matches one or more * in the middle of a comment.
  4. [^*/]|[\n] – matches any character except * and /, or a newline.
  5. \*+/ – matches one or more * followed by the / character.
  6. //.* – matches // followed by any characters up to the end of the line.

After matching the opening /*, the expression becomes a bit harder to understand. What happens next is that we repeatedly match either anything (including newlines) except *, or one or more * followed by anything except * or /, so the run of * that ends the comment is never consumed here. Finally we match one or more * followed by a /.
The second part matches only single-line comments in C++.
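To see the expression at work, here is a minimal sketch in Python. It assumes the pattern is written with the literal * characters escaped as \* and the newline as \n. The names COMMENT_RE and strip_comments, and the sample input, are made up for this illustration.

```python
import re

# Assumed form of the expression discussed above, with the literal '*'
# characters escaped so they are not read as quantifiers.
COMMENT_RE = re.compile(r"(/\*([^*]|[\n]|(\*+([^*/]|[\n])))*\*+/)|(//.*)")


def strip_comments(source: str) -> str:
    """Remove /* ... */ and // comments by replacing each match with nothing."""
    return COMMENT_RE.sub("", source)


# Hypothetical C++ snippet: a two-line block comment and a one-line comment.
sample = "int a = 1; /* first\n * second **/ int b = 2; // trailing\nint c = 3;\n"
print(repr(strip_comments(sample)))
```

Keep in mind that a regexp knows nothing about C++ syntax: a // or /* inside a string literal would be removed as well, so this is a quick helper rather than a real parser.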

Have you ever received an email full of HTML garbage? It has happened to me more than once, and it’s extremely annoying to have to pick the text out of the middle of the HTML. I decided to create a regular expression that would help me remove this kind of garbage. If you don’t know what I mean by garbage, here is an example of such an email:

<html><div style='background-color:'><DIV>
<DIV>
<P class=MsoNormal><FONT color=navy face=Impact size=5><SPAN style="BACKGROUND: #f7f7f7; COLOR: navy;
FONT-FAMILY: Impact; FONT-SIZE: 18pt">This is extremely&nbsp</SPAN></FONT><FONT color=#9966ff
face=Impact size=5 FAMILY="SANSSERIF"> <SPAN style="BACKGROUND: #f7f7f7; COLOR: #9966ff; FONT-FAMILY: Impact;
FONT-SIZE: 18pt">annoying&nbsp;</SPAN></FONT> <FONT color=navy face=Tahoma FAMILY="SANSSERIF">
<SPAN style="BACKGROUND: #f7f7f7; COLOR: navy; FONT-FAMILY: Tahoma">&nbsp;</SPAN></FONT>

In this case I used this regexp:

(<[^<]*>)|&nbsp;

This one is quite easy to grasp: we grab everything between a < and a >, but with a safeguard that excludes any further < from the match, since regular expressions are greedy and will match as much as they can.
The &nbsp; alternative matches the HTML entity for the non-breaking space. We could filter other entities in the same way, but this one seems to do the trick in most situations.
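As a rough illustration, the same expression can be applied with Python’s re module. The names TAG_RE and strip_html and the shortened sample line are invented for this sketch, and every match is simply replaced with an empty string.

```python
import re

# The expression from the text: any <...> tag with no '<' inside it, or the
# literal entity &nbsp;
TAG_RE = re.compile(r"(<[^<]*>)|&nbsp;")


def strip_html(text: str) -> str:
    """Delete every tag and every &nbsp; entity, keeping the text in between."""
    return TAG_RE.sub("", text)


# Hypothetical, shortened version of the kind of email shown above.
sample = (
    "<P class=MsoNormal><FONT color=navy>This is extremely&nbsp;</FONT> "
    '<SPAN style="COLOR: navy">annoying&nbsp;</SPAN></P>'
)
print(strip_html(sample))
```

Because [^<] also matches a newline, tags that are broken across several lines, like the ones in the email above, are removed too.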
