Regular Expressions in Programming Languages: Java for You


This is the sixth and final part of a series of articles on regular expressions in programming languages. In this article, we will discuss the use of regular expressions in Java, a very powerful programming language.

Java is an object-oriented general-purpose programming language. Java applications are initially compiled to bytecode, which can then be run on a Java virtual machine (JVM), independent of the underlying computer architecture. According to Wikipedia, “A Java virtual machine is an abstract computing machine that enables a computer to run a Java program.” Don’t get confused with this complicated definition—just imagine that JVM acts as software capable of running Java bytecode. JVM acts as an interpreter for Java bytecode. This is the reason why Java is often called a compiled and interpreted language. The development of Java—initially called Oak—began in 1991 by James Gosling, Mike Sheridan and Patrick Naughton. The first public implementation of Java was released as Java 1.0 in 1996 by Sun Microsystems. Currently, Oracle Corporation owns Sun Microsystems. Unlike many other programming languages, Java has a mascot called Duke (shown in Figure 1).

As with previous articles in this series I really wanted to begin with a brief discussion about the history of Java by describing the different platforms and versions of Java. But here I am at a loss. The availability of a large number of Java platforms and the complicated version numbering scheme followed by Sun Microsystems makes such a discussion difficult. For example, in order to explain terms like Java 2, Java SE, Core Java, JDK, Java EE, etc, in detail, a series of articles might be required. Such a discussion about the history of Java might be a worthy pursuit for another time but definitely not for this article. So, all I am going to do is explain a few key points regarding various Java implementations.

First of all, Java Card, Java ME (Micro Edition), Java SE (Standard Edition) and Java EE (Enterprise Edition) are all different Java platforms that target different classes of devices and application domains. For example, Java SE is customised for general-purpose use on desktop PCs, servers and similar devices. Another important question that requires an answer is, ‘What is the difference between Java SE and Java 2?’ Books like ‘Learn Java 2 in 48 Hours’ or ‘Learn Java SE in Two Days’ can confuse beginners a lot while making a choice. In a nutshell, there is no difference between the two. All this confusion arises due to the complicated naming convention followed by Sun Microsystems.

The December 1998 release of Java was called Java 2, and the version name J2SE 1.2 was given to JDK 1.2 to distinguish it from the other platforms of Java. Again, J2SE 1.5 (JDK 1.5) was renamed J2SE 5.0 and later as Java SE 5, citing the maturity of J2SE over the years as the reason for this name change. The latest version of Java is Java SE 9, which was released in September 2017. But actually, when you say Java 9, you mean JDK 1.9. So, keep in mind that Java SE was formerly known as Java 2 Platform, Standard Edition or J2SE.

Figure 1: Duke – the mascot of Java

The Java Development Kit (JDK) is an implementation of one of the Java Platforms, Standard Edition, Enterprise Edition, or Micro Edition in the form of a binary product. The JDK includes the JVM and a few other tools like the compiler (javac), debugger (jdb), applet viewer, etc, which are required for the development of Java applications and applets. The latest version of JDK is JDK 9.0.1 released in October 2017. OpenJDK is a free and open source implementation of Java SE. The OpenJDK implementation is licensed under the GNU General Public License (GNU GPL). The Java Class Library (JCL) is a set of dynamically loadable libraries that Java applications can call at run time. JCL contains a number of packages, and each of them contains a number of classes to provide various functionalities. Some of the packages in JCL include java.lang, java.io, java.net, java.util, etc.

The ‘Hello World’ program in Java

Other than console based Java application programs, special classes like the applet, servlet, swing, etc, are used to develop Java programs to complete a variety of tasks. For example, Java applets are programs that are embedded in other applications, typically in a Web page displayed in a browser. Regular expressions can be used in Java application programs and programs based on other classes like the applet, swing, servlet, etc, without making any changes. Since there is no difference in the use of regular expressions, all our discussions are based on simple Java application programs. But before exploring Java programs using regular expressions let us build our muscles by executing a simple ‘Hello World’ program in Java. The code given below shows the program HelloWorld.java.

public class HelloWorld
{
    public static void main(String[] args)
    {
        System.out.println("Hello World");
    }
}

To compile the Java source file HelloWorld.java, open a terminal in the directory containing the file and execute the command:

javac HelloWorld.java

Figure 2: Hello World program in Java

Now a Java class file called HelloWorld.class containing the Java bytecode is created in the directory. The JVM can be invoked to execute this class file containing bytecode with the command:

java HelloWorld

The message ‘Hello World’ is displayed on the terminal. Figure 2 shows the execution and output of the Java program HelloWorld.java. The program contains a special method named main( ), the starting point of this program, which will be identified and executed by the JVM. Remember that a method in an object oriented programming paradigm is nothing but a function in a procedural programming paradigm. The main( ) method contains the following line of code, which prints the message ‘Hello World’ on the terminal:

‘System.out.println("Hello World");’

The program HelloWorld.java and all the other programs discussed in this article can be downloaded from opensourceforu.com/article_source_code/January18javaforyou.zip.

Regular expressions in Java

Now coming down to business, let us discuss regular expressions in Java. The first question to be answered is ‘What flavour of regular expression is being used in Java?’ Well, Java uses a Perl-style flavour that is very close to PCRE (Perl Compatible Regular Expressions), although it is implemented by Java’s own engine rather than by the PCRE library. So, most of the regular expressions we developed in the previous articles on Python, Perl and PHP will work in Java with little or no modification, because those languages also use Perl-style regular expressions. One practical point to remember is that a regular expression in Java is written as a string literal, so every backslash in the pattern has to be doubled.

Since we have already covered much of this syntax in the previous articles on Python, Perl and PHP, I am not going to reintroduce it here. But I would like to point out a few minor differences between the classic Perl-style syntax and the version used in Java. For example, the regular expressions in Java lack the embedded comment syntax ‘(?#comment)’ available in programming languages like Perl. Another point worth noting concerns quantifiers. Quantifiers allow you to specify how many times a character, group or character class must occur for a match. Almost all Perl-style flavours have greedy quantifiers and reluctant quantifiers. In addition to these two, the regular expression syntax of Java has possessive quantifiers.

To differentiate between these three quantifiers, consider the string aaaaaa. The regular expression pattern ‘a+a’ uses a greedy quantifier, which is the default. This pattern matches the whole string aaaaaa: the sub-pattern ‘a+’ first consumes all six characters and then gives back one of them, so that it ends up matching only aaaaa and the final a in the pattern can match the last character. Now consider the reluctant quantifier in ‘a+?a’. This pattern only matches the string aa, since the sub-pattern ‘a+?’ matches as little as possible, which here is the single character a. Now let us see the effect of the possessive quantifier in the pattern ‘a++a’. This pattern will not result in any match, because a possessive quantifier behaves like a greedy quantifier that never backtracks. The sub-pattern ‘a++’ possessively matches the whole string aaaaaa, and the last character a in the pattern ‘a++a’ is left with nothing to match. So, a possessive quantifier matches greedily, and once it has matched a character it never gives it back.
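The behaviour of the three quantifiers described above can be checked with a minimal sketch like the one below. It is not one of the downloadable files; the class name QuantifierDemo and the helper method firstMatch( ) are hypothetical names chosen only for illustration, and the Pattern and Matcher classes it uses are explained later in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo
{
    // Hypothetical helper: returns the first substring of input matched
    // by regex, or null if the pattern does not match anywhere.
    static String firstMatch(String regex, String input)
    {
        Matcher mat = Pattern.compile(regex).matcher(input);
        return mat.find() ? mat.group() : null;
    }

    public static void main(String[] args)
    {
        String str = "aaaaaa";
        // Greedy: a+ takes everything, then gives back one character.
        System.out.println("Greedy     a+a  : " + firstMatch("a+a", str));
        // Reluctant: a+? takes as little as possible.
        System.out.println("Reluctant  a+?a : " + firstMatch("a+?a", str));
        // Possessive: a++ takes everything and never gives anything back.
        System.out.println("Possessive a++a : " + firstMatch("a++a", str));
    }
}
```

On the string aaaaaa, the greedy pattern should report aaaaaa, the reluctant pattern aa, and the possessive pattern null, since it finds no match at all.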

You can download and test the three example Java files Greedy.java, Reluctant.java and Possessive.java for a better understanding of these concepts. In Java, regular expression processing is enabled with the help of the package java.util.regex. This package was included in the Java Class Library (JCL) by J2SE 1.4 (JDK 1.4). So, if you are going to use regular expressions in Java, make sure that you have JDK 1.4 or later installed on your system. Execute the command:

java -version

… at the terminal to find the particular version of Java installed on your system. The later versions of Java have fixed many bugs and added support for features like named capture and Unicode based regular expression processing. There are also some third party packages that support regular expression processing in Java but our discussion strictly covers the classes offered by the package java.util.regex, which is standard and part of the JCL. The package java.util.regex offers two classes called Pattern and Matcher that are used jointly for regular expression processing. The Pattern class enables us to define a regular expression pattern. The Matcher class helps us match a regular expression pattern with the contents of a string.
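As a small illustration of the named capture feature mentioned above, here is a minimal sketch, assuming Java 7 or later. The class name NamedGroupDemo, the method yearOf( ), the group names and the sample text are all my own inventions for illustration and are not part of the downloadable source files. The Pattern and Matcher calls it relies on are explained in the next section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo
{
    // Hypothetical helper: returns the text captured by the group named
    // "year", or null if the input contains no yyyy-mm style date.
    static String yearOf(String input)
    {
        // Backslashes are doubled because the pattern is a Java string literal.
        Pattern pat = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");
        Matcher mat = pat.matcher(input);
        return mat.find() ? mat.group("year") : null;
    }

    public static void main(String[] args)
    {
        System.out.println(yearOf("Released in 2017-09"));
    }
}
```

Here the group is fetched by its name with group("year") instead of by its position in the pattern, which tends to keep longer patterns readable.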

Java programs using regular expressions

Let us now execute and analyse a simple Java program using regular expressions. The code given below shows the program Regex1.java.

import java.util.regex.*;

class Regex1
{
    public static void main(String args[])
    {
        Pattern pat = Pattern.compile("Open Source");
        Matcher mat = pat.matcher("Magazine Open Source For You");
        if(mat.matches( ))
        {
            System.out.println("Match from " + (mat.start( )+1) + " to " + (mat.end( )));
        }
        else
        {
            System.out.println("No Match Found");
        }
    }
}

Open a terminal in the same directory containing the file Regex1.java and execute the following commands to view the output:

javac Regex1.java

and

java Regex1

You will be surprised to see the message ‘No Match Found’ displayed in the terminal. Let us analyse the code in detail to understand the reason for this output. The first line of code:

‘import java.util.regex.*;’

…imports the classes Pattern and Matcher from the package java.util.regex. The line of code:

‘Pattern pat = Pattern.compile("Open Source");’

…generates the regular expression pattern with the help of the method compile( ) provided by the Pattern class. The Pattern object thus generated is stored in the object pat. A PatternSyntaxException is thrown if the regular expression syntax is invalid. The line of code:

‘Matcher mat = pat.matcher("Magazine Open Source For You");’

…uses the matcher( ) method of the Pattern class to generate a Matcher object, because the Matcher class does not have a public constructor. The Matcher object thus generated is stored in the variable mat. The line of code:

‘if(mat.matches( ))’

…uses the method matches( ) provided by the class Matcher to perform a matching between the regular expression pattern ‘Open Source’ and the string ‘Magazine Open Source For You’. The method matches( ) returns true if there is a match and false if there is no match. But the important thing to remember is that the method matches( ) returns true only if the pattern matches the whole string. In this case, the string ‘Open Source’ is just a substring of the string ‘Magazine Open Source For You’, so there is no match, the method matches( ) returns false, and the else branch of the if statement displays the message ‘No Match Found’ on the terminal.

If you replace the line of code:

‘Pattern pat = Pattern.compile("Open Source");’

…with the line of code:

‘Pattern pat = Pattern.compile("Magazine Open Source For You");’

…then you will get a match and the matches( ) method will return true. The file with this modification, Regex2.java, is also available for download. The line of code:

‘System.out.println("Match from " + (mat.start( )+1) + " to " + (mat.end( )));’

…uses two methods provided by the Matcher class, start( ) and end( ). The method start( ) returns the start index of the previous match and the method end( ) returns the offset after the last character matched. So, the output of the program Regex2.java will be ‘Match from 1 to 28’.

Figure 3: Output of Regex1.java and Regex2.java

Figure 3 shows the output of Regex1.java and Regex2.java. An important point to remember is that the indexing starts at 0 and that is the reason why 1 is added to the value returned by the method start( ) as (mat.start( )+1). Since the method end( ) returns the index immediately after the last matched character, nothing needs to be done there.

The matches( ) method of the Matcher class is of little use for this sort of comparison, where we are looking for a pattern inside a longer string. But many other useful methods are provided by the class Matcher to carry out different types of comparisons. The method find( ) provided by the class Matcher is useful if you want to find a substring match.
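The difference between matches( ) and find( ) can be seen side by side in the following minimal sketch. The class name MatchVersusFind and its two helper methods are hypothetical names chosen for illustration and are not part of the downloadable files. The sketch also shows why start( ) and end( ) are convenient as they are: since start( ) is a 0-based index and end( ) points just past the last matched character, the pair can be handed to substring( ) unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchVersusFind
{
    // Hypothetical helper: true only if regex matches the entire input.
    static boolean matchesWhole(String regex, String input)
    {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: returns the first matching substring found
    // anywhere in input, or null if there is none.
    static String foundText(String regex, String input)
    {
        Matcher mat = Pattern.compile(regex).matcher(input);
        if (!mat.find())
        {
            return null;
        }
        // start( ) is inclusive and end( ) is exclusive, like substring( ).
        return input.substring(mat.start(), mat.end());
    }

    public static void main(String[] args)
    {
        String str = "Magazine Open Source For You";
        System.out.println("matches( ): " + matchesWhole("Open Source", str));
        System.out.println("find( )   : " + foundText("Open Source", str));
    }
}
```

With the string used in Regex1.java, matches( ) should report false while find( ) should recover the substring Open Source.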

Replace the line of code:

‘if(mat.matches( ))’

…in Regex1.java with the line of code:

‘if(mat.find( ))’

…to obtain the program Regex3.java. On execution, Regex3.java will display the message ‘Match from 10 to 20’ on the terminal. This is due to the fact that the substring ‘Open Source’ appears from the 10th character to the 20th character in the string ‘Magazine Open Source For You’. The method find( ) also returns true in case of a match and false if there is no match. The method find( ) can be used repeatedly to find all the matching substrings present in a string. Consider the program Regex4.java shown below.

import java.util.regex.*;

class Regex4
{
    public static void main(String args[])
    {
        Pattern pat = Pattern.compile("abc");
        String str = "abcdabcdabcd";
        Matcher mat = pat.matcher(str);
        while(mat.find( ))
        {
            System.out.println("Match from " + (mat.start( )+1) + " to " + (mat.end( )));
        }
    }
}

In this case, the method find( ) will search the whole string and find matches at positions starting at the first, fifth and ninth characters. The line of code:

‘String str = "abcdabcdabcd";’

…is used to store the string to be searched, and in the line of code:

‘Matcher mat = pat.matcher(str);’

…this string is used by the method matcher( ) for further processing. Figure 4 shows the output of the programs Regex3.java and Regex4.java.

Now, what if you want the matched string displayed instead of the index at which a match is obtained? Well, then you have to use the method group( ) provided by the class Matcher. Consider the program Regex5.java shown below:

import java.util.regex.*;

class Regex5
{
    public static void main(String args[])
    {
        Pattern pat = Pattern.compile("S.*r");
        String str = "Sachin Tendulkar Hits a Sixer";
        Matcher mat = pat.matcher(str);
        int i=1;
        while(mat.find( ))
        {
            System.out.println("Matched String " + i + " : " + mat.group( ));
            i++;
        }
    }
}

On execution, the program Regex5.java displays the message ‘Matched String 1 : Sachin Tendulkar Hits a Sixer’ on the terminal. What is the reason for matching the whole string? Because the pattern ‘S.*r’ searches for a string starting with S, followed by zero or more occurrences of any character, and finally ending with an r. Since the pattern ‘.*’ results in a greedy match, the whole string is matched.

Now replace the line of code:

‘Pattern pat = Pattern.compile("S.*r");’

…in Regex5.java with the line:

‘Pattern pat = Pattern.compile("S.*?r");’

…to get Regex6.java. What will be the output of Regex6.java? Since this is the last article of this series on regular expressions, I request you to try your best to find the answer before proceeding any further. Figure 5 shows the output of Regex5.java and Regex6.java. But what is the reason for the output shown by Regex6.java? Again, I request you to ponder over the problem for some time and find out the answer. If you don’t get the answer, download the file Regex6.java from the link shown earlier, and in that file I have given the explanation as a comment.

Figure 4: Output of Regex3.java and Regex4.java

So, with that example, let us wind up our discussion about regular expressions in Java. Java is a very powerful programming language and the effective use of regular expressions will make it even more powerful. The basic stuff discussed here will definitely kick-start your journey towards the efficient use of regular expressions in Java. And now it is time to say farewell.

Figure 5: Output of Regex5.java and Regex6.java

In this series we have discussed regular expression processing in six different programming languages. Four of these (Python, Perl, PHP and Java) use a regular expression style closely related to PCRE (Perl Compatible Regular Expressions). The other two programming languages we discussed in this series, C++ and JavaScript, use a style known as the ECMAScript regular expression style. The articles in this series were never intended to describe the complexities of intricate regular expressions in detail. Instead, I tried to focus on the different flavours of regular expressions and how they can be used in various programming languages. Any decent textbook on regular expressions will give a language-agnostic discussion of regular expressions but we were more concerned with the actual execution of regular expressions in programming languages.

Before concluding this series, I would like to go over the important takeaways. First, always remember the fact that there are many different regular expression flavours. The differences between many of them are subtle, yet they can cause havoc if used indiscriminately. Second, the style of regular expression used in a programming language depends on the flavour of the regular expression implemented by the language’s regular expression engine. Due to this reason, a single programming language may support multiple regular expression styles with the help of different regular expression engines and library functions. Third, the way different languages support regular expressions is different. In some languages the support for regular expressions is part of the language core. An example of such a language is Perl. In some other languages the regular expressions are supported with the help of library functions. C++ is a programming language in which regular expressions are implemented using library functions. Due to this, not all versions and standards of a programming language may support the use of regular expressions. For example, in C++, the support for regular expressions starts with the C++11 standard. For the same reason, the different versions of a particular programming language itself might support different regular expression styles. You must be very careful about these important points while developing programs using regular expressions to avoid dangerous pitfalls.

So, finally, we are at the end of a long journey of learning regular expressions. But an even longer and far more exciting journey of practising and developing regular expressions lies ahead. Good luck!
