Regular Expressions in Programming Languages: The JavaScript Story


Each programming language has its own way of parsing regular expressions. We have looked at how regular expressions work with different languages in the earlier four articles in this series. Now we explore regular expressions in JavaScript.

In the previous issue of OSFY, we tackled pattern matching in PHP using regular expressions. PHP is most often used as a server-side scripting language, but what if your client doesn’t want to bother the server with all the work? Then you have to process regular expressions on the client side with JavaScript, a language that is almost synonymous with client-side scripting. So, in this article, we’ll discuss regular expressions in JavaScript.

Though JavaScript is technically a general-purpose programming language, it is often used as a client-side scripting language to create interactive Web pages. With the help of JavaScript runtime environments like Node.js, JavaScript can also be used on the server side. However, in this article, we will discuss only the client-side scripting aspects of JavaScript because we have already discussed regular expressions in a server-side scripting language, PHP. Just as we saw with PHP in the previous article in this series, you will mostly see JavaScript code embedded inside HTML pages. As mentioned earlier in the series, limited knowledge of HTML syntax will in no way affect the understanding of the regular expressions used in JavaScript. Though we are mostly interested in the use of regular expressions, as always, let’s begin with a brief discussion on the syntax and history of JavaScript.

JavaScript is an interpreted programming language. ECMAScript is the scripting language specification on which JavaScript is based; it is published by Ecma International (formerly the European Computer Manufacturers Association, ECMA) as ECMA-262 and by the International Organization for Standardization (ISO) as ISO/IEC 16262. JavaScript was introduced in the Netscape Navigator browser (now defunct) in 1995; Microsoft soon followed with its own version, which was officially named JScript. The first edition of ECMAScript was released in June 1997 in an effort to settle the disputes between Netscape and Microsoft regarding the standardisation of JavaScript. At the time of writing, the latest edition of ECMAScript is version 8, released in June 2017. All modern Web browsers support JavaScript with the help of a JavaScript engine that is based on the ECMAScript specification. Chrome V8, often known simply as V8, is an open source JavaScript engine developed by Google for the Chrome browser. Even though JavaScript has borrowed a lot of syntax from Java, do remember that JavaScript is not Java.

Figure 1: Standalone application in JavaScript

Standalone JavaScript applications

Now that we have some idea about the scope and evolution of JavaScript, the next obvious question is, can it be used to develop standalone applications rather than only being used as an embedded scripting language inside HTML scripts? Well, anything is possible with computers and yes, JavaScript can be used to develop standalone applications. But whether it is a good idea to do so or not is debatable. Anyway, there are many different JavaScript shells that allow you to run JavaScript code snippets directly. But, most often, this is done during testing and not for developing useful standalone JavaScript applications.

Like standalone PHP applications, standalone JavaScript applications are also not very popular because there are other programming languages more suitable for developing standalone applications. JSDB, JLS, JSPL, etc, are some JavaScript shells that will allow you to run standalone JavaScript applications. But I will use Node.js, which I mentioned earlier, to run our standalone JavaScript file first.js with the following single line of code:

console.log('This is a stand-alone application');

Open a terminal in the same directory containing the file first.js and execute the following command:

node -v

This confirms that Node.js is installed on your system by printing its version number. If Node.js is not installed, install it first. Then execute the following command at the terminal to run the script first.js:

node first.js

The message ‘This is a stand-alone application’ is displayed on the terminal. Figure 1 shows the output of the script first.js. This and all the other scripts discussed in this article can be downloaded from opensourceforu.com/article_source_code/December17Javascript.zip

Hello World in JavaScript

Whenever someone discusses programming languages, it is customary to begin with ‘Hello World’ programs; so let us not change that tradition. The code below shows the ‘Hello World’ script hello.html in JavaScript:

<html>
<body>
<script>
alert('Hello World');
</script>
</body>
</html>

Now let us try to understand the code. The HTML part of the code is straightforward and needs no explanation. All the JavaScript code should be placed within the <script> tags (<script> and </script>). In this example, the following code uses the alert( ) function to display the message ‘Hello World’ in a dialogue box:

alert('Hello World');

To view the effect of the JavaScript code, open the file using any Web browser. I have used Mozilla Firefox for this purpose. Figure 2 shows the output of the file hello.html. Please note that a file containing JavaScript code alone can have the extension .js, whereas an HTML file with embedded JavaScript code will have the extension .html or .htm.

Figure 2: Hello World in JavaScript

Regular expressions in JavaScript

There are many different flavours of regular expressions used by various programming languages. In this series we have discussed two of the very popular regular expression styles. The Perl Compatible Regular Expressions (PCRE) style is very popular, and we have seen regular expressions in this style being used when we discussed the programming languages Python, Perl and PHP in some of the previous articles in this series. But we have also discussed the ECMAScript style of regular expressions when we discussed regular expressions in C++. If you refer to that article on regular expressions in C++ you will come across some subtle differences between PCRE and the ECMAScript style of regular expressions. JavaScript also uses ECMAScript style regular expressions. JavaScript’s support for regular expressions is built-in and is available for direct use. Since we have already dealt with the syntax of the ECMAScript style of regular expressions, we can directly work with a simple JavaScript file containing regular expressions.

JavaScript with regular expressions

Consider the script called regex1.html shown below. To save some space I have only shown the JavaScript portion of the script and not the HTML code. But the complete file is available for download.

<script>
var str = "Working with JavaScript";
var pat = /Java/;
if(str.search(pat) != -1) {
  document.write('<b>MATCH FOUND</b>');
} else {
  document.write('<b>NO MATCH</b>');
}
</script>

Open the file regex1.html in any Web browser and you will see the message ‘MATCH FOUND’ displayed on the Web page in bold text. This may look like an anomaly if you were searching for the word Java, since the string contains only the word JavaScript. So, let us go through the JavaScript code in detail to find out what happened. The following line of code stores a string in the variable str:

var str = "Working with JavaScript";

The line of code shown below creates a regular expression pattern and stores it in the variable pat:

var pat = /Java/;

The regular expression patterns are specified as characters within a pair of forward slash ( / ) characters. Here, the regular expression pattern specifies the word Java. The RegExp object is used to specify regular expression patterns in JavaScript. This regular expression can also be defined with the RegExp( ) constructor using the following line of code:

var pat = new RegExp("Java");

This is instead of the line of code:

var pat = /Java/;

A script called regex2.html with this modification is available for download. The output for the script regex2.html is the same as that of regex1.html. The next few lines of code involve an if-else block. The following line of code uses the search( ) method provided by the String object:

if(str.search(pat) != -1)

The search( ) method takes a regular-expression pattern as an argument, and returns either the position of the start of the first matching substring or −1 if there is no match. If a match is found, the following line of code inside the if block prints the message ‘MATCH FOUND’ in bold:

document.write('<b>MATCH FOUND</b>');

Otherwise, the following line of code inside the else block prints the message ‘NO MATCH’ in bold:

document.write('<b>NO MATCH</b>');
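The behaviour of search( ) described above can also be tried outside the browser. The following is a minimal sketch meant to be run with Node.js; the variable names are only illustrative and are not taken from the downloadable scripts.

```javascript
// The same string and pattern as in regex1.html
const str = "Working with JavaScript";

// A regular expression literal and the equivalent RegExp( ) constructor call
const literalPat = /Java/;
const constructedPat = new RegExp("Java");

// search( ) returns the index of the first match, or -1 if there is none
const foundAt = str.search(literalPat);       // 13, where "Java" starts
const sameIndex = str.search(constructedPat); // 13, both forms behave alike
const missing = str.search(/Python/);         // -1, no match

console.log(foundAt, sameIndex, missing);
```

Comparing the result with -1, as the if statement in regex1.html does, is therefore enough to tell a match from a failure.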

Remember that the search( ) method looks for a substring match and not for a complete word. This is the reason why the script reports ‘MATCH FOUND’. If you are interested in matching Java only as a separate word, then replace the line of code:

var pat = /Java/;

…with:

var pat = /\sJava\s/;

The script with this modification, regex3.html, is also available for download. The notation \s denotes a whitespace character, so this pattern makes sure that Java is present in the string as a separate word surrounded by whitespace, and not just as a substring of words like JavaScript, Javanese, etc. If you open the script regex3.html in a Web browser, you will see the message ‘NO MATCH’ displayed on the Web page.
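One caveat with /\sJava\s/ is that it needs a whitespace character on both sides, so it will not match a string that begins or ends with the word Java. The word boundary notation \b is a commonly used alternative. The sketch below compares the two under Node.js; the helper name hasWord and the sample strings are made up for illustration.

```javascript
// Whitespace-delimited pattern versus word-boundary-delimited pattern
const spaced = /\sJava\s/;
const bounded = /\bJava\b/;

// Hypothetical helper: true if the pattern matches anywhere in the text
function hasWord(pattern, text) {
  return text.search(pattern) !== -1;
}

console.log(hasWord(spaced, "Working with JavaScript"));  // false
console.log(hasWord(spaced, "I like Java a lot"));        // true
console.log(hasWord(spaced, "Java is a language"));       // false: nothing before Java
console.log(hasWord(bounded, "Java is a language"));      // true
console.log(hasWord(bounded, "Working with JavaScript")); // false
```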

Figure 3: Input to regex4.html

Methods for pattern matching

In the last section, we saw the search( ) method provided by the String object. The String object also provides three other methods for regular expression processing: replace( ), match( ) and split( ). Consider the script regex4.html shown below, which uses the method replace( ):

<html>
<body>
<form id="f1">
ENTER TEXT HERE: <input type="text" name="data">
</form>
<button onclick="check( )">CLICK</button>
<script>
function check( ) {
  var x = document.getElementById("f1");
  var text = "";
  text += x.elements[0].value;
  text = text.replace(/I am/i,"We are");
  document.write(text);
}
</script>
</body>
</html>

Open the file regex4.html in a Web browser and you will see a text box to enter data and a button labelled CLICK. If you enter a string like ‘I am good’ and press the button, you will see the output message ‘We are good’ displayed on the Web page. Let us analyse the code in detail to understand how it works. There is an HTML form which contains the text box to enter data, and a button that, when pressed, calls a JavaScript function called check( ). The JavaScript code is placed inside the <script> tags. The following line of code gets the HTML form, through which its elements can be accessed:

var x = document.getElementById("f1");

In this case, there is only one element in the HTML form, the text box. The following line of code reads the content of the text box to the variable text:

text += x.elements[0].value;

The following line of code uses the replace( ) method to test for a regular expression pattern and if a match is found, the matched substring is replaced with the replacement string:

text = text.replace(/I am/i,"We are");

In this case, the regular expression pattern is /I am/i and the replacement string is We are. If you observe carefully, you will see that the regular expression pattern is followed by an ‘i’. Well, we came across similar constructs throughout the series. This ‘i’ is an example of a regular expression flag, and this particular one instructs the regular expression engine to perform a case-insensitive match. So, you will get a match whether you enter ‘I AM’, ‘i am’ or even ‘i aM’.

There are other flags also like g, m, etc. The flag g will result in a global match rather than stopping after the first match. The flag m is used to enable the multi-line mode. Also note the fact that the replace( ) method did not replace the contents of the variable text; instead, it returned the modified string, which then was explicitly stored in the variable text. The following line of code writes the contents on to the Web page:

document.write(text);

Figure 3 shows the input for the script regex4.html and Figure 4 shows the output.
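To see the effect of the flags i, g and m side by side, here is a small sketch that can be run with Node.js. The sample strings and variable names are assumptions chosen for illustration and do not come from regex4.html.

```javascript
const line = "I am here. i am there.";

// With only the i flag, replace( ) stops after the first match
const first = line.replace(/I am/i, "We are");
// Adding the g flag replaces every match
const all = line.replace(/I am/gi, "We are");

console.log(first); // We are here. i am there.
console.log(all);   // We are here. We are there.

// replace( ) returns a new string; the original is left untouched
console.log(line);  // I am here. i am there.

// With the m flag, ^ also matches at the start of each line
const twoLines = "first line\nsecond line";
const withoutM = twoLines.search(/^second/);  // -1
const withM = twoLines.search(/^second/m);    // 11

console.log(withoutM, withM);
```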

A method called match( ) is also provided by the String object for regular expression processing. The search( ) method returns the starting index of the matched substring, whereas the match( ) method returns the matched substring itself, wrapped in an array (or null if there is no match). What will happen if we replace the line of code:

text = text.replace(/I am/i,"We are");

…in regex4.html with the following code?

text = text.match(/\d+/);

Open the file regex5.html, which has this modification, enter the string ‘article part 5’ in the text box and press the button. You will see the number ‘5’ displayed on the Web page. Here the regular expression pattern is /\d+/, which matches one or more occurrences of a decimal digit.
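The value returned by match( ) deserves a closer look, because it is an array rather than a plain string, and it is null when nothing matches. The following Node.js sketch illustrates this; the input strings and the ‘NO MATCH’ fallback are assumptions made for the example.

```javascript
const input = "article part 5 of 10";

// Without the g flag: an array whose element 0 is the first match
const firstNumber = input.match(/\d+/);
console.log(firstNumber[0]);      // 5
console.log(firstNumber.index);   // 13, the position of the match
console.log(String(firstNumber)); // 5, which is what document.write( ) would show

// With the g flag: an array of all the matches
const allNumbers = input.match(/\d+/g);
console.log(allNumbers.length);   // 2

// No match at all: match( ) returns null, so guard before indexing
const none = "no digits here".match(/\d+/);
const shown = none ? none[0] : "NO MATCH";
console.log(shown);               // NO MATCH
```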

Figure 4: Output of regex4.html

Another method provided by the String object for regular expression processing is the split( ) method. This breaks the string on which it was called into an array of substrings, using either a plain string or a regular expression pattern as the separator. For example, replace the line of code:

text = text.replace(/I am/i,"We are");

…in regex4.html with the code:

text = text.split(".");

…to obtain regex6.html.

Open the file regex6.html, enter the IPv4 address 192.100.50.10 in dotted-decimal notation in the text box and press the button. The IPv4 address will be displayed as ‘192,100,50,10’, because the array returned by split( ) is written to the page with its elements joined by commas. The IPv4 address string is split into substrings based on the separator ‘.’ (dot).
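The separator passed to split( ) in regex6.html is a plain string. As a rough sketch of how a regular expression separator behaves, consider the following Node.js snippet; the second input string and the variable names are invented for the example.

```javascript
const address = "192.100.50.10";

// A string separator, as in regex6.html
const byString = address.split(".");
// The equivalent regular expression; the dot must be escaped,
// because an unescaped dot matches any character
const byRegex = address.split(/\./);

console.log(String(byString)); // 192,100,50,10
console.log(String(byRegex));  // 192,100,50,10

// A regular expression separator can cover several delimiters at once:
// here, any run of commas, semicolons or whitespace
const fields = "alpha, beta;gamma   delta".split(/[,;\s]+/);
console.log(String(fields));   // alpha,beta,gamma,delta
```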

Processing strings with regular expressions

In previous articles in this series we mostly dealt with regular expressions that processed numbers. For a change, in this article, we will look at some regular expressions to process strings. Nowadays, computer science professionals from India face difficulties in deciding whether to use American English spelling or the British English spelling while preparing technical documents. I always get confused with colour/color, programme/program, centre/center, pretence/pretense, etc. Let us look at a few simple techniques to handle situations like this.

For example, the regular expression /colo(?:u)?r/ will match both the spellings ‘color’ and ‘colour’. The question mark symbol ( ? ) denotes zero or one occurrence of the preceding character or group. The notation (?:u) groups u with the grouping operator ( ), and the notation ?: makes it a non-capturing group, so the matched substring is not stored in memory unnecessarily. Since the group contains a single character, the shorter pattern /colou?r/ is equivalent. So, here a match is obtained with and without the letter u.

What about the spellings ‘programme’ and ‘program’? The regular expression /program(?:me)?/ will accept both these spellings. The regular expression /cent(?:re|er)/ will accept both the spellings, ‘center’ and ‘centre’. Here the pipe symbol ( | ) is used as an alternation operator.

What about words like ‘biscuit’ and ‘cookie’? In British English the word ‘biscuit’ is preferred over the word ‘cookie’ and the reverse is the case in American English. The regular expression /(?:cookie|biscuit)/ will accept both the words — ‘cookie’ and ‘biscuit’. The regular expression /preten[cs]e/ will match both the spellings, ‘pretence’ and ‘pretense’. Here the character class operator [ ] is used in the regular expression pattern to match either the letter c or the letter s.
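The spelling patterns discussed in this section can be checked quickly with a short Node.js sketch like the one below. The object name patterns and the helper accepts are hypothetical names introduced only for this illustration.

```javascript
// The patterns from this section, keyed by the British spelling
const patterns = {
  colour: /colo(?:u)?r/,
  programme: /program(?:me)?/,
  centre: /cent(?:re|er)/,
  biscuit: /(?:cookie|biscuit)/,
  pretence: /preten[cs]e/
};

// Hypothetical helper: true if the word contains a match for the pattern
function accepts(pattern, word) {
  return word.search(pattern) !== -1;
}

console.log(accepts(patterns.colour, "color"));      // true
console.log(accepts(patterns.colour, "colour"));     // true
console.log(accepts(patterns.programme, "program")); // true
console.log(accepts(patterns.centre, "centre"));     // true
console.log(accepts(patterns.centre, "center"));     // true
console.log(accepts(patterns.biscuit, "cookie"));    // true
console.log(accepts(patterns.pretence, "pretense")); // true
console.log(accepts(patterns.pretence, "pretenze")); // false
```

Note that these patterns are not anchored, so they also match inside longer words; for instance, /program(?:me)?/ finds a match in ‘programmer’.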

I have only discussed specific solutions to the problems mentioned here so as to keep the regular expressions very simple. But with the help of more complicated regular expressions, it is possible to solve many of these problems in a more general way rather than solving individual cases. As mentioned earlier, C++ also uses ECMAScript style regular expressions; so the regular expression patterns we developed in the article on regular expressions in C++ can largely be reused in JavaScript. Keep in mind that a backslash doubled inside a C++ string literal (for example, "\\d+") is written only once in a JavaScript regular expression literal (/\d+/).

Just like the pattern followed in the previous articles in this series, after a brief discussion on the specific programming language, in this case, JavaScript, we moved on to the use of the regular expression syntax in that language. This should be enough for practitioners of JavaScript, who are willing to get their hands dirty by practising with more regular expressions. In the next part of this series on regular expressions, we will discuss the very powerful programming language, Java, a distant cousin of JavaScript.
