Regular Expressions (part 1)

A while ago I was refactoring the #includes of a C++ project and needed to know which project files were including STL headers (in our case, that meant the included file name had no .h extension). So I decided to write a regular expression to find them:

#include:b*\<.+[^.h]\>

If you enter this simple expression in the Find box of Visual Studio (with the regular expressions option enabled), you get a list of all the #includes that don’t have the .h at the end of the file name. Now I’ll break the regular expression down and explain it step by step:

  1. #include – matches the literal text #include.
  2. :b* – matches any number of spaces or tabs (including none).
  3. \< – matches the < character.
  4. .+ – matches one or more characters.
  5. [^.h] – matches one character that is neither . nor h. This rules out names ending in .h, but it also rules out any other name that ends in h, such as <cmath>.
  6. \> – matches the > character.
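
To try the same idea outside Visual Studio, here is a rough Python sketch. It assumes that `:b*` can be written as `[ \t]*`, and it relies on `<` and `>` having no special meaning in Python's `re` module. The sample lines and variable names are made up for illustration. The second pattern is not from the expression above: it swaps the character class for a negative lookbehind, which is one way to reject only names ending in the two-character sequence `.h`.

```python
import re

# Rough Python equivalent of the Visual Studio expression.
# Assumption: VS's ":b*" (spaces or tabs) is written "[ \t]*" here.
NO_DOT_H = re.compile(r'#include[ \t]*<.+[^.h]>')

# Stricter variant (an alternative, not the original expression):
# the negative lookbehind (?<!\.h) rejects only names ending in ".h".
NOT_ENDING_IN_DOT_H = re.compile(r'#include[ \t]*<[^>]+(?<!\.h)>')

# Hypothetical sample lines.
lines = [
    '#include <iostream>',
    '#include <stdio.h>',
    '#include\t<vector>',
    '#include <cmath>',
    '#include "local.h"',
]

matched = [line for line in lines if NO_DOT_H.search(line)]
strict = [line for line in lines if NOT_ENDING_IN_DOT_H.search(line)]

print(matched)  # <iostream> and <vector>
print(strict)   # <iostream>, <vector> and <cmath>
```

The first pattern skips `<cmath>` because `[^.h]` only looks at the single character before `>`, and that character is `h`.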

Meaning of the characters in the expression:

  • \ – escapes a character: the character after this symbol is treated as a normal character instead of a special character used in regular expressions.
  • :b – space or tab.
  • * – 0 or more times.
  • + – 1 or more times.
  • . – any character except the end of line.
  • [] – any one of the characters inside the []; with a leading ^, as in [^…], any one character that is not inside the [].
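
Each of these symbols can be checked in isolation. The snippet below is a small Python sketch, not Visual Studio syntax: `[ \t]` stands in for `:b`, and `matches` is a helper invented for this example.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# \  escapes: "\." is a literal dot, while a bare "." is a wildcard
print(matches(r'a\.h', 'a.h'), matches(r'a\.h', 'axh'), matches(r'a.h', 'axh'))
# *  zero or more; "[ \t]" stands in for Visual Studio's ":b"
print(matches(r'x[ \t]*y', 'xy'), matches(r'x[ \t]*y', 'x \t y'))
# +  one or more
print(matches(r'ab+', 'a'), matches(r'ab+', 'abbb'))
# .  any character except the end of line
print(matches(r'a.c', 'a\nc'))
# [] one character from the set; [^] one character not in the set
print(matches(r'[.h]', 'h'), matches(r'[^.h]', 'h'), matches(r'[^.h]', 'm'))
```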

Note: this regular expression might not work in other programs, because it uses syntax that is specific to VS, such as the :b that matches a space or tab.

Another example: removing the leading characters (garbage) from lines of actual code:

1.          #include <iostream>
2.             using namespace std;
3.         int main()
4.           {
5.           cout << "Hello World!";
6.        return 0;
7.         }

I’m sure you’ve already come across something like this, and when you paste it into the editor it’s really a pain in the ass to remove all that garbage line by line. Here’s another expression that will help with this task:

^[^a-zA-Z_$/{}#"'+-]+

Again, let’s go step by step:

  1. ^ – this means that we’ll start to match only at the beginning of a line.
  2. [^…]+ – matches one or more characters that are not in the set that follows the ^.
  3. a-zA-Z_$/{}#"'+- – the excluded set: the letters from a to z (same for uppercase letters) and the following characters: _, $, /, {, }, #, ", ', + and -.

This means that the expression matches everything from the start of a line up to, but not including, the first character that belongs to the excluded set. In VS, replace the match with an empty string to remove the garbage.
Note: It’s quite possible that the regular expressions presented here will fail in some cases (especially the second one), because it’s really complicated to test all the possibilities, but in the general case they should work.
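
The same replacement can be tried in Python, where this particular expression needs no translation. The snippet below is a minimal sketch that applies it line by line to the sample above; the variable names and the extra `(void)x;` line are made up for illustration.

```python
import re

# The expression from above, unchanged; Python's re module reads
# this subset of syntax the same way.
LEADING_GARBAGE = re.compile(r'''^[^a-zA-Z_$/{}#"'+-]+''')

pasted = [
    '1.          #include <iostream>',
    '2.             using namespace std;',
    '3.         int main()',
    '4.           {',
    '5.           cout << "Hello World!";',
    '6.        return 0;',
    '7.         }',
]

# Replace the match at the start of each line with an empty string.
cleaned = [LEADING_GARBAGE.sub('', line) for line in pasted]
print('\n'.join(cleaned))

# A case where the expression removes too much: "(" is not in the
# excluded set, so it is stripped together with the garbage.
too_much = LEADING_GARBAGE.sub('', '8.   (void)x;')
print(too_much)
```

Note that the original indentation is removed along with the line numbers, and that a code line starting with a character outside the excluded set, such as `(` or a digit, loses its first characters.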

I hope these two examples show you the power of regular expressions, or even prove useful to you 😉 If you have any comments about this article, or any problems with a regular expression, just let me know.
