Blog Entry - 24th June 2007 - Programming - JavaScript

Irregular Expressions


The Regular Expression is an invaluable tool for programming.

However, try to use it in earnest, exploiting most if not all of its various features, and you may occasionally run into browser differences and other traps for the unwary.

lastIndex property

You will be familiar with the exec function (Section 15.10.6.2 RegExp.prototype.exec(string)):-

Performs a regular expression match of string against the regular expression and returns an Array object containing the results of the match, or null if the string did not match.

The Array object returned has the following properties:-

0 The matched sub-string.
1 .. N The values of all capturing parentheses.
length The number of array entries.
index The start of the matched sub-string.
input The string against which the exec was run.
[lastIndex] The index of the character immediately following the matched sub-string. This is non-standard and only supported by Internet Explorer, so do not use it.

The RegExp object instance (re in the example below) is also updated and has the following properties :-

source The regular expression pattern.
global True if "g" was used in the pattern's switches, e.g. /ab/g
ignoreCase True if "i" was used in the pattern's switches, e.g. /ab/i
multiline True if "m" was used in the pattern's switches, e.g. /ab/m
lastIndex The string position at which to start the search. If the "g" (global) switch is set then this will be updated with the string position after the last match. Unlike the lastIndex property on the Array returned above, this property is supported by all browsers.

var str = "abcabcabc";
var re = /(ab)(ca)/g;

var match = re.exec(str);

for (var propertyName in match)
{
	alert(propertyName + " " + match[propertyName]);
}

alert("source" + " " + re.source);
alert("global" + " " + re.global);
alert("ignoreCase" + " " + re.ignoreCase);
alert("multiline" + " " + re.multiline);
alert("lastIndex" + " " + re.lastIndex);


As noted above, the lastIndex property is only updated if the "g" flag is set.

If you are going to use the lastIndex property, then it must be on the RegExp instance and not the Array returned from exec.
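As a rough sketch of how lastIndex on the RegExp instance drives repeated exec calls, the loop below collects the start position of every match. The name findAllIndexes is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper (not from the original post): collects the start
// index of every match by calling exec repeatedly on a global RegExp.
function findAllIndexes(re, input)
{
	if (!re.global)
	{
		// Without "g", lastIndex is never advanced and the loop would not end.
		throw new Error("findAllIndexes requires the g flag");
	}

	var indexes = [];
	var match;

	re.lastIndex = 0; // always start from the beginning of input

	while ((match = re.exec(input)) !== null)
	{
		indexes.push(match.index);

		// A zero-length match does not advance lastIndex, so step past it.
		if (match[0].length === 0)
		{
			re.lastIndex++;
		}
	}

	return indexes;
}

var indexes = findAllIndexes(/ab/g, "1ab2ab3ab"); // 1, 4, 7
```

Only re.lastIndex is read or written here; the lastIndex property on the returned Array is never touched.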

Note that the RegExp.$1 ... RegExp.$9 properties are being deprecated, so try to stop using these.

Object Reuse

What do you think the following code will do? Will it alert 1, 4, 7 or 1, 1, 1?

var str = "1ab2ab3ab";

find(str);
find(str);
find(str);

function find(
	input /*: String*/
)
{
	var re = /ab/gi;

	// re.lastIndex = 0;
	var match = re.exec(input);

	if (!match)
	{
		alert("Not Found");
		return;
	}

	alert(match.index);
}


For Mozilla and Firefox: 1, 4, 7

For Internet Explorer: 1, 1, 1

It seems that in Mozilla and Firefox the RegExp created from the literal is reused on each call rather than created afresh, so its lastIndex is not reset between calls.

The solution is to uncomment the line re.lastIndex = 0 above, if you want find always to start from the beginning.
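To see the effect regardless of how a particular browser treats the literal, the sketch below keeps a single RegExp object outside the functions, which appears to be what happens in Mozilla and Firefox. The names sharedRe, findNext and findFirst are illustrative only.

```javascript
var sharedRe = /ab/gi; // one object, reused by every call below

// Continues from wherever the previous exec left off.
function findNext(input)
{
	var match = sharedRe.exec(input);
	return match ? match.index : -1;
}

// Always searches from the start by resetting lastIndex first.
function findFirst(input)
{
	sharedRe.lastIndex = 0;
	var match = sharedRe.exec(input);
	return match ? match.index : -1;
}

var str = "1ab2ab3ab";
var continued = [findNext(str), findNext(str), findNext(str)];    // 1, 4, 7
var restarted = [findFirst(str), findFirst(str), findFirst(str)]; // 1, 1, 1
```

Resetting lastIndex before each exec gives the same answer whether or not the object is shared.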

This is not a problem with the String.prototype.replace function, where the lastIndex property is ignored:-

var str = "1ab2ab3ab";

replaceAB(str);
replaceAB(str);
replaceAB(str);


function replaceAB(
	input /*: String*/
)
{
	var re = /ab/gi;

	re.lastIndex = 2;

	alert(input.replace(re, "cd"));
}


Alerts 1cd2cd3cd for all browsers.

Parenthesis Matches

As you will be aware, you can capture sub-matches in parentheses.

Thus:-

var str = "abcabcabc";
var re = /(ab)(ca)/g;

var match = re.exec(str);
alert(match[1]);
alert(match[2]);


Will alert ab and then ca.

But what if a parenthesis has a * quantifier, so that it can match 0 or more times, and it matches 0 times?

In Internet Explorer the Array will contain the empty string "", but in Opera and Mozilla it will contain the value undefined.

Thus:-

var str = "b";
var re = /(b+)([^a])*/;

var match = re.exec(str);
alert("'" + match[1] + "' " + typeof match[1]);
alert("'" + match[2] + "' " + typeof match[2]);

// COMPARED TO 
var str = "b";
var re = /(b+)([^a]*)/;

var match = re.exec(str);
alert("'" + match[1] + "' " + typeof match[1]);
alert("'" + match[2] + "' " + typeof match[2]);
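One way to write code that behaves the same under either convention is to treat undefined and the empty string alike when reading a capture. The helper name captureOrEmpty below is hypothetical and used only for this sketch.

```javascript
// Hypothetical helper: returns capture n of an exec result, treating an
// unmatched group (undefined in some browsers, "" in others) as "".
function captureOrEmpty(match, n)
{
	var value = match[n];
	return (typeof value === "undefined") ? "" : value;
}

// Quantifier on the parenthesis: the group itself may match 0 times.
var quantifiedGroup = /(b+)([^a])*/.exec("b");

// Quantifier inside the parenthesis: the group matches an empty string.
var quantifiedInside = /(b+)([^a]*)/.exec("b");

var first = captureOrEmpty(quantifiedGroup, 2);
var second = captureOrEmpty(quantifiedInside, 2);
```

Both first and second are then empty strings, so later code does not need to care which browser produced the match.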


Replacement Function

15.5.4.11 String.prototype.replace (searchValue, replaceValue) allows a function to be provided as the replaceValue:-

If replaceValue is a function, then for each matched substring, call the function with the following m + 3 arguments. Argument 1 is the substring that matched. If searchValue is a regular expression, the next m arguments are all of the captures in the MatchResult (see 15.10.2.1). Argument m + 2 is the offset within string where the match occurred, and argument m + 3 is string. The result is a string value derived from the original input by replacing each matched substring with the corresponding return value of the function call, converted to a string if need be.

Thus:-

var str = "abcde|abcde";
var re = /(a)(bc)(de)/g;

alert(str.replace(re, replacementFunction));

function replacementFunction(
	matchedSubstring /*: String*/,
	paren1 /*: String*/,
	paren2 /*: String*/,
	paren3 /*: String*/,
	matchOffset /*: Number*/,
	inputString /*: String*/ 
) /*: String*/
{
	return paren3 + paren2 + paren1 + matchOffset;
}


Versions of Opera older than 9 do not support this.
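Because the offset and the input string are always the last two of the m + 3 arguments, a replacement function does not need to know how many parentheses the pattern has. The sketch below reads them from the end of the argument list; describeMatch is a made-up name, and the sketch assumes the pattern uses only ordinary numbered captures.

```javascript
// Hypothetical replacement function: works for any number of captures by
// locating the offset relative to the end of the argument list.
// Assumes only numbered captures, so the last argument is the input string.
function describeMatch()
{
	var args = Array.prototype.slice.call(arguments);
	var matchOffset = args[args.length - 2];       // argument m + 2
	var captures = args.slice(1, args.length - 2); // arguments 2 .. m + 1

	return "[" + captures.join(",") + "@" + matchOffset + "]";
}

var described = "abcde|abcde".replace(/(a)(bc)(de)/g, describeMatch);
// described is "[a,bc,de@0]|[a,bc,de@6]"
```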

Safari Bug

I have recently noted the following entry concerning a Safari bug:-

Yet Another Safari Bug

It seems to involve a quantifier on a parenthesis:-

var a = [];

for (var i = 0; i < 1000; i++)
{
	a[i] = "abcdefghi";
}

var string = a.join("");

alert(/(.)+/.test(string));


According to the entry, this crashes Safari when string's length is bigger than about 7000 characters.

My Exec Function

To give myself some extra control, I use the following function:-

var input /*: String*/ = "123abc123def123";
var re /*: RegExp*/ = /(abc)*(123)/;
var result /*: Array*/ = [];
var overflow /*: int*/ = 10;
var str /*: String*/ = "";

while (overflow--)
{
	result = RegExp_Exec(re, input, result.lastIndex);

	if (result === null)
	{
		break;
	}

	str = "";
	str += "0 :\t\t" + result[0] + "\r\n";
	str += "1 :\t\t" + result[1] + " " + typeof result[1] + "\r\n";
	str += "2 :\t\t" + result[2] + " " + typeof result[2] + "\r\n";
	str += "index :\t\t" + result.index + "\r\n";
	str += "lastIndex :\t" + result.lastIndex + "\r\n";

	alert(str);
}


/*
 *
 *  (JavaScript Function)
 *
 *  RegExp_Exec
 *
 *  June 2007
 *
 *  A function to enhance the RegExp.prototype.exec method
 *  
 *
 *  Written by Julian Turner 2007
 *  Copyright Free
 * 
 *  @param (RegExp) re
 *      RegExp instance
 *
 *  @param (String) input
 *      The string to be searched
 *
 *  @param (Number) startIndex OPTIONAL
 *      input character to start from (and including).  
 *
 *  @return (Array) 
 *  	The array returned by RegExp.prototype.exec
 *      but with 1..9 "undefined" converted to empty string
 *      and a "lastIndex" property, even if "g" switch not set
 */

function RegExp_Exec(
	re /*: RegExp*/,
	input /*: String*/,
	startIndex /*: int*/
) /*: Array*/
{
	if (typeof startIndex == "undefined")
	{
		startIndex = 0;
	}

	if (re.global)
	{
		re.lastIndex = startIndex;
	}
	else
	{
		input = input.substring(startIndex, input.length);
	}

	var result /*: Array*/ = re.exec(input);

	if (result === null)
	{
		return result;
	}

	for (var i /*: int*/ = 0; i < result.length; i++)
	{
		if (typeof result[i] === "undefined")
		{
			result[i] = "";
		}
	}

	if (re.global)
	{
		result.lastIndex = re.lastIndex;
	}
	else
	{
		result.index = startIndex + result.index;
		result.lastIndex = result.index + result[0].length;
	}

	return result;
}
