Regular Expressions in Programming Languages: The Story of C++


In this issue of OSFY, we present the third article on regular expressions in programming languages. The earlier articles covered the use of regular expressions in general, in Python and then in Perl. Read on to discover the intricacies of regular expressions in C++.

Interpreted languages often have dynamically typed variables that don’t require prior declaration. An additional benefit of such variables is that they can hold different types of data. For example, the same variable can hold an integer, a character or a string. Due to these qualities, scripts written in such languages are often very compact. This is not the case with compiled languages, which usually need more declarations and initialisation, and whose statically typed variables often make the code longer. Even if the regular expression syntax for interpreted and compiled languages is the same, how regular expressions are used in real programs is different. So, I believe it is time to discuss regular expressions in compiled languages. In this article, I will discuss the use of regular expressions in C++.

Standards of C++

People often fail to notice that programming languages like C and C++ have gone through several standards, and that not all of these standards offer the same features. Regular expressions are a case in point. For languages like Perl and Python, support for regular expressions is expected because of the very nature of these languages (they are scripting languages widely used for text processing and Web application development).

For a language like C++, heavily used for high-performance computing applications, system programming, embedded system development, etc, many felt that the inclusion of regular expressions was unnecessary. Many of the initial standards of C++ didn’t have a natural way for handling regular expressions. I will briefly discuss the different standards of C++ and which among them has support for regular expressions.

Bjarne Stroustrup began work on C++ in 1979. The language was initially known as ‘C with Classes’ and was renamed C++ in 1983. A book titled ‘The C++ Programming Language’, first published in 1985 by Stroustrup himself, and its subsequent editions served as the de facto standard for C++ until 1998, when the language was standardised by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) as ISO/IEC 14882:1998, informally called C++98. The next three standards of C++ are informally called C++03, C++11 and C++14. Hopefully, by the time this article gets published, the latest standard of C++, informally called C++17, would have been released and will have some major changes to C++. After this, the next big changes in C++ will take place with a newer standard, informally known as C++20, which is set to be released in 2020.

The first three standards of C++, namely the de facto standard of C++, C++98 and C++03, do not have any inbuilt mechanism for handling regular expressions. Things changed with C++11 when native support for regular expressions was added with the help of a new header file called <regex>. In fact, the support for regular expressions was one of the most important changes brought in by this standard. C++14 also has provision for native support of regular expressions, and it is highly unlikely that C++17 or any future standards of C++ will quash the support for handling regular expressions. One problem we might face in this regard is that the academic community in India mostly revolves around the C++98 standard, which doesn’t support regular expressions. But this is just a personal opinion and I don’t have any documented evidence to prove my statement.

Figure 1: Output of regex_search() and regex_match()

The C++11 standard

Unlike C++03 and C++14, for which the changes were minimal, C++11 was a major revision of C++. GCC 5 fully supports the features of C++11 and C++14, and the latter became the default standard in GCC 6. The C++11 standard made many changes to the core language. One of them is the inclusion of a new integer data type called long long int, which is guaranteed to be at least 64 bits wide. Earlier, the widest standard integer type was long int, which is only guaranteed to be at least 32 bits wide. Extern templates were also added to C++ by this standard.

Many more changes were made to the core of the C++ language by the C++11 standard, but the changes were not limited to the core alone — the C++ standard library was also enhanced by the C++11 standard. Changes were made to the C++ standard library in such a way that multiple threads could be created very easily. New methods for generating pseudo-random numbers were also provided by the C++11 standard. A uniform method for computing the return type of function objects was also included by the C++11 standard. Though a lot of changes have been made to the standard library in C++11, the one that concerns us the most is the inclusion of a new header file called <regex>.

Regular expressions in C++11

In C++, regular expressions are supported through the standard library rather than the core language. The header file called <regex> was added to the C++ standard library for this purpose. The header file <regex> is also available in C++14 and, hence, what we learn for C++11 also applies to C++14. There are some changes to the header file <regex> in C++14, which will be discussed later in this article. The header file <regex> provides three main functions. These are regex_match( ), regex_search( ) and regex_replace( ). The function regex_match( ) succeeds only if the regular expression matches the entire string, whereas regex_search( ) succeeds if the regular expression matches any part of the string. The function regex_replace( ) not only finds matches, but also replaces each matched substring with a replacement string. All these functions use a regular expression to denote the string to be matched.

Other than these three functions, the header file <regex> also defines a number of classes like regex, wregex, etc, and a few iterator types like regex_iterator and regex_token_iterator. But to simplify and shorten our discussion, I will only cover the class regex and the three functions, regex_search( ), regex_match( ) and regex_replace( ). I believe it is impossible to discuss all the features of the header file <regex> in a short article like this, but the topics I will cover are a good starting point for any serious C++ programmer to catch up with professional users of regular expressions. Now let us see how regular expressions are used in C++ with the help of a small C++ program.

Figure 2: The function regex_replace() in C++

A simple C++ program using regular expressions

The code below shows a C++ program called regex1.cc. I am sure you are all familiar with the .cc extension of C++ programs. This and all the other C++ programs and text files used in this article can be downloaded from opensourceforu.com/article_source_code/September17C++.zip.

#include <iostream>
#include <regex>
using namespace std;

int main( )
{
    char str[ ] = "Open Source For You";
    regex pat("Source");
    if( regex_search(str,pat) )
    {
        cout << "Match Found\n";
    }
    else
    {
        cout << "No Match Found\n";
    }
    return 0;
}

I’m assuming that the syntax of C is quite well known to readers, who will understand the simple C++ programs we discuss in this article, so no further skills are required. Now let us study and analyse the program. The first two lines #include <iostream> and #include <regex> include the two header files <iostream> and <regex>. The next line of code using namespace std; adds the std namespace to the program so that cout, cin, etc, can be used without the help of the scope resolution operator (::). The line int main( ) declares the only function in this program, the main( ) function.

This is one problem we face when programming languages like C++ or Java are used. You need to write a lot of code to set up the environment and get things moving. This is one reason why you should stick with languages like Perl or Python rather than C++ or Java if your whole aim is to process a text file. But if you are writing system software and want to analyse a system log file, then using regular expressions in C++ is a very good idea.

The next line of code char str[ ] = "Open Source For You"; initialises a character array called str[ ] with a string in which we will search for a pattern. In this particular case, the character array is initialised with the string Open Source For You. If you replace the line of code char str[ ] = "Open Source For You"; with string str = "Open Source For You"; and thereby use an object str of the string class of C++ instead of a character array, the program will still work equally well. Remember that the string class of C++ is just a specialisation of the class template basic_string for the type char.

This modified program called string.cc is also available for downloading. On execution with the commands g++ string.cc and ./a.out, the program string.cc will also behave just like the program regex1.cc. Since I am expecting a mixed group of readers with expertise in different programming languages, I tried to make the C++ programs look as much as possible like C programs, in the belief that C is the language of academia and everybody has had a stint with it as a student. I could have even used the printf( ) and scanf( ) functions instead of cout and cin. But a line should be drawn somewhere, and this is where I have stopped making C++ programs look like C.

The next line of code regex pat("Source"); is very important. It sets up the regular expression pattern to be searched for in the string Open Source For You. Here the pattern is the word ‘Source’, which is stored in an object called pat of the class regex.

The next few lines of code contain an if-else statement. The line of code if( regex_search(str,pat) ) uses the function regex_search( ) provided by the header file <regex> to search for the pattern stored in the object pat of the class regex in the string stored inside the character array str[ ]. If a match is found, the line of code cout << "Match Found\n"; is executed and prints the message Match Found. If a match is not found, the else part of the code cout << "No Match Found\n"; is executed and prints the message No Match Found. This program can be compiled with the command g++ regex1.cc, where g++ is the C++ compiler provided by GCC (GNU Compiler Collection). This will produce an executable called a.out. This is then executed with the command ./a.out. On execution, the program prints the message Match Found on the screen because the function regex_search( ) searches the entire string to find a match. Since the word Source is present in the string Open Source For You, a match is found.

Now, it is time for us to revisit the difference between the functions regex_search( ) and regex_match( ). To do this, the line of code if( regex_search(str,pat) ) in the program regex1.cc is replaced with the line if( regex_match(str,pat) ).

This modified code is available as a program named regex2.cc. Compiling it with the command g++ regex2.cc produces an executable called a.out, which is then executed with the command ./a.out. Now the output printed on the screen is No Match Found. Why? As mentioned earlier, this is due to a difference between the functions regex_search( ) and regex_match( ). The function regex_search( ) succeeds if the pattern matches any part of the string, whereas the function regex_match( ) returns true only if the pattern matches the entire string, from the first character to the last. In this case, the pattern Source accounts for just one word of the string Open Source For You, and hence no match is found by the function regex_match( ). The same would happen with the pattern Open, even though that word is at the very beginning of the string. A pattern such as .*Source.* is needed for regex_match( ) to succeed here. Figure 1 shows the output of the programs regex1.cc and regex2.cc.

Figure 3: Case-sensitive and case-insensitive matches

Pattern replacement in C++

Let’s now study the working of the function regex_replace(). Consider the program regex3.cc which uses the function regex_replace( ). This function will search for a match and if it finds one, the function will replace the matched string with a replacement string.

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main( )
{
    char str[ ] = "Open Source Software is Good";
    regex pat("Open Source");
    char rep[ ] = "Free";
    cout <<regex_replace(str,pat,rep)<<'\n';
    return 0;
}

Except for the line of code cout <<regex_replace(str,pat,rep)<<'\n'; I don’t think any further explanation is required. This is the line in which the function regex_replace( ) is called with three parameters: the character array str[ ] where the string to be searched is stored, the regular expression pattern to be matched stored in the object pat of the class regex, and the replacement string stored in the character array rep[ ]. Execute the program regex3.cc with the commands g++ regex3.cc and ./a.out. You will see the message Free Software is Good on the screen. Nothing surprising there, because the string in the character array str[ ] contains Open Source Software is Good, the pattern to be searched for is Open Source and the replacement string is Free. Hence, a match is found and a replacement is done by the function regex_replace( ).

The next question to be answered is whether the function regex_replace( ) behaves like the function regex_search( ) or the function regex_match( ). In order to understand the behaviour of the function regex_replace( ) clearly, let us modify the program regex3.cc slightly to get a program called regex4.cc as shown in the following code:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main( )
{
    char str[ ] = "Open Source Software is Good";
    regex pat("Good");
    char rep[ ] = "Excellent";
    cout <<regex_replace(str,pat,rep)<<'\n';
    return 0;
}

On execution with the commands g++ regex4.cc and ./a.out, the program regex4.cc prints the message Open Source Software is Excellent. This clearly tells us that the function regex_replace( ) behaves like the function regex_search( ), whereby the whole string is searched for possible matches for the given regular expression, unlike the function regex_match( ), which succeeds only if the entire string matches. Figure 2 shows the output of the two programs, regex3.cc and regex4.cc.

File processing in C++ with regular expressions

The next question that needs to be answered is: How do we process data inside a text file with a regular expression? To test the working of such a program, a text file called file1.txt is used, which is the same one used in the previous articles in this series on regular expressions.

unix is an operting system

Unix is an Operating System

UNIX IS AN OPERATING SYSTEM

Linux is also an Operating System

Now let us consider the following C++ program called regex5.cc that reads and processes the text file file1.txt line by line, to print all the lines that contain the word ‘UNIX’.

#include <iostream>
#include <string>
#include <fstream>
#include <regex>
using namespace std;

int main( )
{
    ifstream file("file1.txt");
    string str;
    regex pat("UNIX");
    while (getline(file, str))
    {
        if( regex_search(str,pat) )
        {
            cout << str <<"\n";
        }
    }
    return 0;
}

When the program regex5.cc is executed with the commands g++ regex5.cc and ./a.out, the message printed on the screen is UNIX IS AN OPERATING SYSTEM. So, a case-sensitive pattern match is carried out here. The next question is: How do we carry out a case-insensitive pattern match? For this purpose, we use a regex constant called icase. When the line of code regex pat("UNIX"); is replaced with the line regex pat("UNIX", regex_constants::icase); a case-insensitive match is carried out, and this results in a match for three lines in the text file file1.txt. Figure 3 shows the results of the case-sensitive and case-insensitive matches. There are many other regex constants defined in the namespace regex_constants. Some of them are nosubs, optimize, collate, etc. Use of these regex constants will add more power to your regular expressions. It is a good idea to learn more about them as you gain more experience with C++ regular expressions.

Regular expressions in C++14 and C++17

It is now time for us to discuss regular expressions in C++14. Luckily, except for a few minor changes, the <regex> header file of C++11 has remained largely unchanged even after the introduction of the later standard C++14. For example, C++14 added deleted overloads of the functions regex_match( ) and regex_search( ) so that a temporary string object can no longer be passed when the details of the match are requested, which prevents those details from referring to a string that has already been destroyed. But such changes only make the existing functions safer to use and do not affect their basic working. And finally, what are the changes that will be brought on by C++17? Hopefully, nothing major. So far, there have been no rumours of a major revision to the header file <regex>. Therefore, whatever we have learnt from this article can be used for a long time.

Regular expression style of C++

Unlike the previous two articles in this series, in this article I have started by explaining C++ code snippets using regular expressions directly, without providing details regarding the kind of regular expression syntax being used in C++. Sometimes it is better to attack the problem directly than beat around the bush. But even then, it is absolutely essential to know the regular expression syntax used with C++. Otherwise, this article may just be a set of ‘Do-It-Yourself’ instructions. C++11 regular expressions support multiple regular expression styles like ECMAScript syntax, AWK script syntax, Grep syntax, etc.

ECMAScript is a scripting language specification and JavaScript is its most well-known implementation. The regular expression syntax of ECMAScript, which C++11 uses by default, is not much different from the other regular expression flavours. There are some minor differences between the styles, though. For example, the notation \d used in Perl style regular expressions to denote decimal digits is available in ECMAScript style regular expressions, but not in the POSIX-derived styles such as basic, extended and grep. The POSIX character class notation [[:digit:]], on the other hand, is accepted by all the styles supported by C++11, and that is the notation used in this article. I am not going to point out any other such difference, but just keep in mind that C++11 supports multiple regular expression styles and some of the styles differ slightly from the others.

Figure 4: Regular expressions for numbers

A practical regular expression for C++

Now let us discuss a practical regular expression with which we can find out some real data rather than finding out ‘strings starting with abc and ending with xyz’. Our aim is to identify those lines that contain only numbers. Consider the text file file2.txt with the following data to test our regular expressions:

abcxyz

a1234z

111222

123456

aaaaaaa

zzzzzzz

AA111

111

2222

33333

22.22

BBBB

Now consider the program regex7.cc with the following code:

#include <iostream>
#include <string>
#include <fstream>
#include <regex>
using namespace std;

int main()
{
    ifstream file("file2.txt");
    string str;
    regex pat("^[[:digit:]]+$");
    while (getline(file, str))
    {
        if( regex_search(str,pat) )
        {
            cout << str <<"\n";
        }
    }
    return 0;
}

On execution with the commands g++ regex7.cc and ./a.out, the program prints only those lines that consist entirely of digits. Figure 4 shows the output of the program regex7.cc.

Except for the line of code regex pat("^[[:digit:]]+$"); which defines the regular expression pattern to be searched for, there is no difference between the working of the programs regex5.cc and regex7.cc. The caret symbol ^ is used to denote that the match should happen at the very beginning and the dollar symbol $ is used to denote that the match should occur at the end. In the middle, there is the regular expression [[:digit:]]+ which implies one or more occurrences of decimal digits, the same as [0-9]+. So, the given regular expression finds a match only if the line of text contains only decimal digits and nothing more. Due to this reason, lines of text like AA111, 22.22, a1234z, etc, are not selected.

Now it is time for us to wind up the discussion. Like the previous two articles in this series, I have covered the use of regular expressions in a particular programming language as well as some other aspects of the programming language that will affect the usage of regular expressions. In this article, the lengthy discussion about the standards of C++ was absolutely essential, without which you might blindly apply regular expressions on all standards of C++ without considering the subtle differences between them. The topics on regular expressions discussed in this article may not be comprehensive but they provide an adequate basis for any good C++ programmer to build up from. In the next part of this series we will discuss the use of regular expressions in yet another programming language, maybe one that is much used on the Internet and the World Wide Web.
