$PAGE_HEADER Help - Regular Expression Keyword Matching

If you've programmed in Perl or any other language with built-in regular-expression capabilities, then you probably know how much easier regular expressions make text processing and pattern matching. If you're unfamiliar with the term, a regular expression is simply a string of characters that defines a pattern used to search for a matching string. The AlertSite keyword match facility allows you to use the power of regular expressions to create complex pattern matches to monitor your sites.

Note: The regular expression feature is offered to customers as a courtesy to provide expanded matching functionality. We do not offer technical support for the use of regular expressions; detailed help and examples are provided below.

The following is a brief introduction to regular expression syntax to get you started.

Regular expression:  cat
Matches:  cat, catalog, sophisticated (but not Catherine, because matching is case-sensitive by default)

Regular expression:  t.n
Matches:  tan, ten, tin, ton, t n, t#n, tpn, etc.

Regular expression:  t[aeio]n
Matches:  tan, ten, tin, ton

Regular expression:  t(a|e|i|o|oo)n
Matches:  tan, ten, tin, ton, toon

As you can see, parentheses may be used for grouping contiguous sets of character patterns together with an optional "|" operator to provide alternative selections during matching. That is, any of the alternative patterns within the group may produce a match (with left to right precedence):

Regular expression:  Good (morning|afternoon|evening)!
Matches:  Good morning!, Good afternoon!, Good evening!
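
The basic patterns above can be tried out with a short script. The following is a minimal sketch using Python's re module, which follows Perl-style syntax; the AlertSite matching engine itself is not shown here, and the sample strings and the contains() helper are illustrative assumptions.

```python
import re

# Each entry: (pattern, text expected to match, text expected not to match).
# The sample strings are illustrative only.
examples = [
    (r"cat", "sophisticated", "car"),        # literal substring
    (r"t.n", "t#n", "tn"),                   # "." needs exactly one character
    (r"t[aeio]n", "tin", "tun"),             # one character from the set
    (r"t(a|e|i|o|oo)n", "toon", "tuun"),     # one alternative from the group
    (r"Good (morning|afternoon|evening)!", "Good evening!", "Good night!"),
]

def contains(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

for pattern, hit, miss in examples:
    print(pattern, contains(pattern, hit), contains(pattern, miss))
```

Each line of output shows the pattern, then whether it was found in the first sample and in the second.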

Regular expression:  Surprise!*
Matches:  Surprise, Surprise!, Surprise!!, Surprise!!!, and so on.

If the "*" notation follows the wildcard (period) character, the combination matches any run of zero or more characters, including spaces, tabs and line breaks, between two other parts of the pattern:

Regular expression:   Hello.*There!
Matches:  HelloThere!, Hello There!, Hello everyone over There!, and so on.

The following quantifier notations specify how many times the element immediately to the left of the quantifier may repeat:

Quantifier notations:

* 0 or more times
+ 1 or more times
? 0 or 1 time
{n} Exactly n times
{n,} At least n times
{n,m} At least n but not more than m times
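
To see each quantifier in action, the sketch below checks a handful of strings against small patterns using Python's re module. The full() helper and the sample strings (including "colou?r") are assumptions made for illustration.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = [
    (r"Surprise!*", "Surprise"),   # "*" allows zero "!" characters
    (r"Surprise!+", "Surprise"),   # "+" requires at least one "!"
    (r"colou?r", "color"),         # "?" makes the "u" optional
    (r"a{3}", "aaa"),              # exactly three
    (r"a{2,}", "aaaaa"),           # two or more
    (r"a{2,4}", "aaaaa"),          # five exceeds the maximum of four
]

results = [full(pattern, text) for pattern, text in checks]
for (pattern, text), ok in zip(checks, results):
    print(pattern, repr(text), ok)
```

Only the second and last checks fail: "Surprise!+" needs at least one exclamation mark, and "a{2,4}" cannot cover five characters.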

Regular expression:  [0-9]{3}\-[0-9]{2}\-[0-9]{4}
Matches:  All social security numbers of the form 123-12-1234

In regular expressions, the hyphen ("-") notation has special meaning inside square brackets; it indicates a (sequential) range of possible characters such as A-Z, a-z, or 0-9. Thus, the notation [0-9]{3} in the first element of the pattern matches any string of exactly 3 digits, each of which may range from 0 to 9. This is followed by an "escaped" hyphen character. Escaping the "-" character with a backslash ("\") guarantees that it is read as a literal hyphen rather than as a range indicator. Outside square brackets the escape is optional, but it is harmless.

If, in your template pattern, you wish to make the hyphen optional -- if, say, you consider both 999-99-9999 and 999999999 acceptable formats -- you can use the "?" quantifier notation as shown:

Regular expression:  [0-9]{3}\-?[0-9]{2}\-?[0-9]{4}
Matches:  All social security numbers of the forms 123-12-1234 and 123121234
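
A quick way to check the optional-hyphen pattern is to run it against a few sample numbers. This is a sketch in Python's re module; the is_ssn() helper and the sample values are hypothetical.

```python
import re

# Pattern from the example above: each hyphen is optional.
SSN = re.compile(r"[0-9]{3}\-?[0-9]{2}\-?[0-9]{4}")

def is_ssn(text):
    """Return True if the whole text has the expected shape."""
    return SSN.fullmatch(text) is not None

for sample in ["123-12-1234", "123121234", "123-121234", "12-123-1234"]:
    print(sample, is_ssn(sample))
```

Because each hyphen is optional on its own, a mixed form such as 123-121234 is also accepted. The last sample is rejected because its first group has only two digits.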

Let's take a look at another example. One format for US car plate numbers consists of four numeric characters followed by two letters. Thus, a regular expression might first include a "[0-9]{4}" numeric part, followed by a "[A-Z]{2}" alphabetic part:

Regular expression:  [0-9]{4}[A-Z]{2}
Matches:  US car plate numbers of the format: 8836KV

Regular expression:  \b[^xy][a-z]+\b
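Matches:  Lowercase words of two or more characters, except those that start with the letter x or y.

In the above example, the "+" quantifier is used to specify one or more characters in the range a-z, and the "\b" notation is used to match at word boundaries. Note that "[^xy]" accepts any single character other than x or y, including a space, a digit, or an uppercase letter; to restrict the first character to lowercase letters, use "[a-wz]" in its place.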
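
The sketch below runs the word-boundary pattern over a sample sentence with Python's re module. It also tries a stricter variant, "\b[a-wz][a-z]*\b", which is an assumption added here for comparison and not part of the original example. The sample sentence is illustrative.

```python
import re

text = "apple xylophone banana yellow zebra"

# Pattern from the example above. "[^xy]" can also consume the space
# in front of a word, so some results carry a leading space.
loose = re.findall(r"\b[^xy][a-z]+\b", text)

# Stricter variant (an assumption, not from the original text): the
# first character must be a lowercase letter other than x or y.
strict = re.findall(r"\b[a-wz][a-z]*\b", text)

print(loose)
print(strict)
```

On this sample the stricter variant returns only apple, banana and zebra, while the original pattern also picks up " xylophone" and " yellow" by matching the preceding space as the first character.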

Commonly used notations:

\d [0-9]
\D [^0-9]
\w [A-Za-z0-9_]
\W [^A-Za-z0-9_]
\s [ \t\n\r\f]
\S [^ \t\n\r\f]
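
As a rough check of these equivalences, the sketch below compares each shorthand with an explicit bracket form on a sample string, using Python's re module. The re.ASCII flag is used because Python otherwise lets \d, \w and \s match non-ASCII characters too; the sample string is an assumption. Python's \s additionally matches the vertical tab, which the sample does not contain.

```python
import re

sample = "Order_42 shipped\ton 2004-06-26\n"

# (shorthand, explicit bracket form) pairs to compare.
pairs = [
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\w", r"[A-Za-z0-9_]"),
    (r"\W", r"[^A-Za-z0-9_]"),
    (r"\s", r"[ \t\n\r\f]"),
    (r"\S", r"[^ \t\n\r\f]"),
]

for shorthand, explicit in pairs:
    same = (re.findall(shorthand, sample, re.ASCII)
            == re.findall(explicit, sample, re.ASCII))
    print(shorthand, explicit, same)
```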

To illustrate, we can substitute "\d" for every instance of "[0-9]" in our earlier social security number expressions. The revised regular expression is:

Regular expression:  \d{3}\-\d{2}\-\d{4}
Matches:  All social security numbers of the form 123-12-1234

Or, suppose you want to match an IP address. It consists of four 1-byte segments (octets). Each segment has a value between 0 and 255 and is separated from the others by a period. Thus, each individual segment of the IP address has at least one and at most three digits. The following regular expression might be used to match just such a construct:

Regular expression:  \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Matches:  Strings made up of four segments of one to three digits each, separated by periods, such as 192.168.0.1. Note that the pattern does not verify that each segment falls between 0 and 255.

You need to escape the period character because you literally want it to be there; you do not want it read in terms of its special meaning in regular expression syntax, as explained earlier. Other special characters that need to be escaped when used in a literal match are discussed in the "Additional Considerations" section below.
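
The digit-count pattern checks only the shape of an address, not the numeric range of each segment. The sketch below, written with Python's re module, shows one possible way to add a range check in code; the helper names looks_like_ip() and is_valid_ipv4() are hypothetical.

```python
import re

# Pattern from the example above.
IP_SHAPE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def looks_like_ip(text):
    """True if the whole text has the dotted four-segment shape."""
    return IP_SHAPE.fullmatch(text) is not None

def is_valid_ipv4(text):
    """Hypothetical helper: shape check plus a 0-255 check per segment."""
    return looks_like_ip(text) and all(
        int(part) <= 255 for part in text.split(".")
    )

for sample in ["192.168.0.1", "999.999.999.999", "192.168.0"]:
    print(sample, looks_like_ip(sample), is_valid_ipv4(sample))
```

The second sample has the right shape but fails the range check, and the third has only three segments.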

Perhaps you're trying to match a particular type of date string. A typical date format might be: June 26, 1951. One example of a regular expression to match strings of this type would be:

Regular expression:  [A-Za-z]+\s+[0-9]{1,2},\s*[0-9]{4}
Matches:  All dates with the format of Month DD, YYYY

Broken down, the first element of the expression ("[A-Za-z]+") matches the month (or rather, any word consisting of at least 1 alphabetic character), followed by mandatory whitespace of one or more characters ("\s+"), followed by the day of the month of up to 2 digits ("[0-9]{1,2}"), followed by a mandatory comma, followed by optional whitespace ("\s*"), followed by a four-digit year field ("[0-9]{4}"). This pattern may be adequate, but you might also choose to enclose the full set of month names within a parenthetical grouping, separated by the "|" notation, such as (January|February|March ... ) instead of the weaker "[A-Za-z]+" notation.

Note that the "\s" is shorthand notation for whitespace, and matches either a blank space, tab, newline, return, or form-feed character.
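
The difference between the loose month pattern and an explicit list of month names can be sketched as follows in Python's re module. The variable names and sample strings are assumptions for illustration.

```python
import re

# Pattern from the example above: any alphabetic word counts as a month.
loose_date = re.compile(r"[A-Za-z]+\s+[0-9]{1,2},\s*[0-9]{4}")

# Stricter variant using a parenthetical group of month names.
months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
strict_date = re.compile(r"(" + months + r")\s+[0-9]{1,2},\s*[0-9]{4}")

for sample in ["June 26, 1951", "June 26,1951", "Junk 26, 1951"]:
    print(sample,
          loose_date.fullmatch(sample) is not None,
          strict_date.fullmatch(sample) is not None)
```

Both patterns accept the first two samples; only the loose pattern accepts "Junk 26, 1951".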

Additional Special Character Definitions:

\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character
$ Match the end of the line
| Alternation (OR)
() Grouping
[] Character class
\w Match a "word" character (alphanumerics and "_" chars)
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string (same as ^)
\Z Match only at end of string (same as $)

Site Keyword Matching

The AlertSite keyword matching facility treats an entire web page as one continuous line of text. Therefore, both the Plain Text and Regular Expression keyword match types permit matches across multiple lines of HTML source text. Typical HTML source text usually includes plain text mixed together with HTML tags and attributes, and may optionally include snippets of programmatic scripting code.

It may be possible for your regular expression to satisfy multiple pattern matches on the same web page. Which pattern ultimately gets matched may or may not be what you desire. For example, you may only want to consider a match successful if the keyword or pattern is found at a particular location on the web page, or only if it appears on the page along with another keyword located somewhere else on the same page.

Let's say that the following HTML source code sample was retrieved from viewing the source of a page on your web site:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>CompanyName : My Company Website</title>

... additional HTML source code ...

<a href="http://www.CompanyName.com/login.shtml">Click Here</a>

... additional HTML source code ...

<strong>Copyright&copy;1999-2004 CompanyName.</strong><br>

... additional HTML source code ...

If you wanted to create a regular expression to match "CompanyName", but only when it appears in the title of your web page, you might use the following regular expression:

Regular expression:  <title>.*CompanyName.*</title>
Matches:  Any occurrence of CompanyName between the <title> and </title> HTML tags.
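
The title-only match can be sketched with Python's re module against a shortened copy of the sample page. The re.DOTALL flag is an assumption used here to imitate treating the page as one continuous line, since Python's "." does not cross line breaks by default. The "Welcome" title is a made-up variation.

```python
import re

page = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>CompanyName : My Company Website</title>
<a href="http://www.CompanyName.com/login.shtml">Click Here</a>
<strong>Copyright&copy;1999-2004 CompanyName.</strong><br>
"""

title_pattern = re.compile(r"<title>.*CompanyName.*</title>", re.DOTALL)
in_title = title_pattern.search(page) is not None

# The same page with the name removed from the title only.
renamed = page.replace("<title>CompanyName", "<title>Welcome")
in_renamed_title = title_pattern.search(renamed) is not None

print(in_title, in_renamed_title)
```

The second check fails even though CompanyName still appears in the link and the copyright line, because neither occurrence is followed by a closing </title> tag.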

Similarly, if you wanted your match to require multiple keywords from different areas of the page, say for example, CompanyName followed somewhere by "Login Successful", it might look something like this:

Regular expression:  CompanyName.*Login Successful
Matches:  The string "CompanyName", followed by any number of characters (all the "middle stuff"), followed by the string "Login Successful".

In the above examples, the ".*" quantifier will generally match as much of the source text as possible while still allowing the whole regular expression to match. Quantifiers that grab as much text as possible are called maximal match or greedy quantifiers (see Quantifier notations above).

But there are times when we would like these quantifiers to match a minimal piece of a text, rather than a maximal piece. The minimal match or non-greedy quantifiers are: ??, *?, +?, and {}?. These are the same standard quantifiers but with a ? appended to them. They have the following meanings:

?? Match 0 or 1 times. Try 0 first, then 1
*? Match 0 or more times, but as few as possible
+? Match 1 or more times, but as few as possible
{n}? Match exactly n times. Equivalent to {n}
{n,}? Match at least n times, but as few as possible
{n,m}? Match at least n but not more than m times, as few as possible
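
The difference between greedy and non-greedy quantifiers is easiest to see side by side. The following sketch uses Python's re module; the sample strings are illustrative.

```python
import re

html = "<b>first</b> and <b>second</b>"

# Greedy: ".*" runs to the last "</b>" on the line.
greedy = re.search(r"<b>.*</b>", html).group()

# Non-greedy: ".*?" stops at the first "</b>" that completes a match.
lazy = re.search(r"<b>.*?</b>", html).group()

# The same idea with a counted quantifier.
greedy_count = re.search(r"a{2,4}", "aaaaa").group()
lazy_count = re.search(r"a{2,4}?", "aaaaa").group()

print(greedy)
print(lazy)
print(greedy_count, lazy_count)
```

The greedy search returns the whole sample string, while the non-greedy search returns only the first bold element.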

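Since a regular expression can match a string in several different ways, the following principles help predict which way the regular expression will match:

The match as a whole is found at the earliest possible position in the text.
In an alternation such as a|b|c, the leftmost alternative that allows the whole expression to match is used.
Greedy quantifiers match as much as possible, and non-greedy quantifiers as little as possible, while still allowing the whole expression to match.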

Express Yourself

Now that you've been introduced to the pattern matching power of regular expressions, it's up to you to decide whether to use a Plain Text match or the more powerful Regular Expression type. When used appropriately, regular expressions can help a great deal in constructing complex pattern matches for your site monitoring needs. This tutorial touches only briefly on the full capabilities of regular expression pattern matching. For additional information, you may wish to consult one of the many widely available regular expression tutorials on the internet.

Advanced Pattern Matching:

In order to handle more complex pattern matching requirements, you may choose to use some of the more advanced features of regular expression syntax such as subpattern location independence and lookahead assertions. Suggested solutions to some of these situations are presented below. For more detailed information, please consider reviewing an online tutorial on regular expression syntax.

Regular expression:  ALPHA|BETA
Matches:  Any occurrence of either ALPHA or BETA, anywhere on the page (overlapping permitted).

Regular expression:  (?=.*ALPHA).*BETA
Matches:  When both ALPHA and BETA occur, anywhere on the page (overlapping permitted).

Regular expression:  (?:^.*ALPHA.*BETA)|(?:^.*BETA.*ALPHA)
Matches:  When both ALPHA and BETA occur, anywhere on the page (non-overlapping).

Regular expression:  (?<=ALPHA)BETA
Matches:  Any occurrence of BETA that is immediately preceded by ALPHA (positive look-behind assertion).
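
The four advanced patterns above can be compared on a few sample strings. This is a sketch in Python's re module; the found() helper, the use of re.DOTALL to imitate single-line page handling, and the sample strings are all assumptions.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text, re.DOTALL) is not None

either = r"ALPHA|BETA"
both_lookahead = r"(?=.*ALPHA).*BETA"
both_ordered = r"(?:^.*ALPHA.*BETA)|(?:^.*BETA.*ALPHA)"
behind = r"(?<=ALPHA)BETA"

samples = ["only ALPHA here", "BETA then ALPHA", "ALPHABETA", "ALPHA BETA"]
for text in samples:
    print(text,
          found(either, text),
          found(both_lookahead, text),
          found(both_ordered, text),
          found(behind, text))
```

Only "ALPHABETA" satisfies the look-behind pattern, because the look-behind requires ALPHA to sit directly in front of BETA with nothing in between.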

Case Sensitivity: To make your match criteria wholly or partially case insensitive, you may embed the (?i) and (?i:pattern) notations within your regular expressions, respectively. Here are some potential solutions (where ALPHA and BETA are your keywords or sub-patterns):

Regular expression:  (?i)alpha-beta
Matches:  Any occurrence of ALPHA and BETA, regardless of case, separated by a dash (e.g., alpha-beta, ALPHA-BETA, aLpHa-BetA, etc.).

Regular expression:  (?i:alpha)-BETA
Matches:  Any occurrence of ALPHA regardless of case, followed by a dash and an uppercase BETA (e.g., aLpHa-BETA, Alpha-BETA, alphA-BETA, etc.).
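
Whole-pattern and partial case insensitivity can be compared with a short script. The sketch below uses Python's re module, which supports both notations (the scoped (?i:pattern) form requires Python 3.6 or later); the sample strings are illustrative.

```python
import re

# (?i) at the start makes the entire pattern case insensitive.
whole = re.compile(r"(?i)alpha-beta")

# (?i:...) limits case insensitivity to the enclosed sub-pattern.
partial = re.compile(r"(?i:alpha)-BETA")

for sample in ["aLpHa-BetA", "aLpHa-BETA", "ALPHA-beta"]:
    print(sample,
          whole.search(sample) is not None,
          partial.search(sample) is not None)
```

The first pattern accepts all three samples, while the second accepts only the sample whose BETA is fully uppercase.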

Additional Considerations

Some other things you may want to consider when constructing your regular expressions: