Introduction
Regular expressions specify text by its characteristics rather than by its exact characters. They do so through a syntax borrowed from tools such as GREP, LEX and YACC.
The following syntax definition uses EBNF notation.
Informal description
A regular expression is composed of a sequence of sub-expressions, each taking one of the forms listed in the operators table below. The entire expression may be preceded by ^ to indicate that it matches only at the start of a line, or followed by $ to indicate that it matches only at the end of a line.
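The effect of the two anchors can be sketched in Python, used here only as a stand-in for the tool's own engine (an assumption). In Python the re.MULTILINE flag is needed to get the per-line behaviour described above, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample: three lines, each containing the word "error".
text = "error: disk full\nwarning: error ignored\nlast error"

# re.MULTILINE makes ^ and $ match at line boundaries instead of only
# at the start and end of the whole string.
at_line_start = re.findall(r"^error.*", text, flags=re.MULTILINE)
at_line_end = re.findall(r".*error$", text, flags=re.MULTILINE)

print(at_line_start)  # lines that begin with "error"
print(at_line_end)    # lines that finish with "error"
```

Only the first line begins with "error" and only the last line ends with it, so each list holds a single line.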
EBNF syntax specification
EBNF (Extended Backus-Naur Form) is a style of specification used for formal syntax descriptions. Within the syntax description the following metacharacters are used:
a ::= b    Construct a is defined by construct b
[a]        Indicates that construct a is optional
(abc)      Indicates that constructs a, b and c are taken as a single construct
a|b        Indicates either construct a or construct b
'abc'      The literal characters 'a' followed by 'b' followed by 'c'
<a>        Single syntax construct a defined in the specification
Regular expression operators
a+         One or more occurrences of a
a*         Zero or more occurrences of a
a?         Zero or one (i.e. optional) occurrence of a
a{n}       Exactly n occurrences of a
a{n,}      n or more occurrences of a
a{,m}      Zero or at most m occurrences of a
a{n,m}     At least n but not more than m occurrences of a
a|b        Either a or b
a||b       a or b or both a and b in any order
abc        a followed by b followed by c
[abc]      A single character, one of a or b or c
[a-b]      A single character, ranging in value from a to b inclusive
[^abc]     A single character, any except a, b or c
(abc)      a followed by b followed by c
"abc"      The letters a followed by b followed by c with no special significance attached to a, b or c
.          Any character except a newline
\a         The letter a, with no special significance attached to a

Special forms:
  \t       The tab character
  \n       The newline character
  \r       The return character
  \f       The formfeed character
  \b       The backspace character
  \xNN     The hex character NN
  \0ooo    The octal character ooo
  \w       A single character, one of [a-zA-Z0-9_]
  \W       Any single character not matching \w
  \d       A single character [0-9]
  \D       A single character not matching \d
  \s       A whitespace character [\t\r\n\f\b\ ]
  \S       A single character not matching \s
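Most of these operators can be tried out in Python's re module, which is used below as an approximation of this syntax and not as its definition. Several points are assumptions: Python has no "abc" quoting operator and no a||b operator, so a||b is spelled ab?|ba? here; the re.ASCII flag restricts \w, \d and \s to ASCII as in the table; and Python's \s does not include the backspace character. All sample strings are made up.

```python
import re

# Each entry: (pattern, string that should match, string that should not).
CASES = [
    (r"a+", "aaa", ""),
    (r"a*", "", "b"),
    (r"a?", "a", "aa"),
    (r"a{3}", "aaa", "aa"),
    (r"a{2,}", "aaaa", "a"),
    (r"a{,2}", "aa", "aaa"),
    (r"a{1,2}", "aa", "aaa"),
    (r"a|b", "b", "c"),
    (r"ab?|ba?", "ba", "aa"),  # assumed Python spelling of a||b
    (r"[abc]", "b", "d"),
    (r"[a-c]", "c", "d"),
    (r"[^abc]", "d", "a"),
    (r".", "x", "\n"),
    (r"\.", ".", "x"),
    (r"\x41", "A", "B"),
    (r"\w", "_", "-"),
    (r"\W", "-", "a"),
    (r"\d", "7", "x"),
    (r"\D", "x", "7"),
    (r"\s", "\t", "x"),
    (r"\S", "x", " "),
]

def matches_whole(pattern, s):
    """Return True when pattern matches the entire string s."""
    return re.fullmatch(pattern, s, flags=re.ASCII) is not None

results = [(p, matches_whole(p, good), matches_whole(p, bad))
           for p, good, bad in CASES]

for pattern, accepted, wrongly_accepted in results:
    print(f"{pattern:10} accepts: {accepted}  rejects: {not wrongly_accepted}")
```

Matching against the whole string keeps each check unambiguous; a search anywhere in the string would accept some of the counter-samples.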
Examples
Expression                Matches            Does not match
-----------------------------------------------------------
"this"|"that"             this               This
                          that               That
\d{2}\.\d{2}              23.45              2.4
                          03.22              0.1
[a-zA-Z_]\w*              Identifier         2Identifiers
\(\*[\x01-\x7F]+\*\)      (* a comment *)    ( No comment *)
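The rows of this table can be checked with a short Python sketch. It assumes that each sample is tested as a whole string, and it writes the quoted literals "this" and "that" without quotes, because Python's re module has no quoting operator.

```python
import re

# (Python spelling of the expression, samples that match, samples that do not)
ROWS = [
    (r"this|that", ["this", "that"], ["This", "That"]),
    (r"\d{2}\.\d{2}", ["23.45", "03.22"], ["2.4", "0.1"]),
    (r"[a-zA-Z_]\w*", ["Identifier"], ["2Identifiers"]),
    (r"\(\*[\x01-\x7F]+\*\)", ["(* a comment *)"], ["( No comment *)"]),
]

def matches_whole(pattern, s):
    """Return True when pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

failures = []
for pattern, should_match, should_not_match in ROWS:
    for s in should_match:
        if not matches_whole(pattern, s):
            failures.append((pattern, s, "expected a match"))
    for s in should_not_match:
        if matches_whole(pattern, s):
            failures.append((pattern, s, "expected no match"))

print(failures)  # an empty list means every row behaved as the table says
```

The whole-string assumption matters for 2Identifiers: a search anywhere in the string would find Identifiers inside it.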
Regular expression examples
1. Locate Internet references
("http://"|"https://"|"mailto:"|"ftp://")[^ \n\r\"\<\\]+
This expression detects Internet references that start with 'http://', 'https://', 'mailto:' or 'ftp://'.
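A rough Python equivalent, offered as a sketch and not as the tool's own behaviour: the quoted alternatives are written without quotes, and the \" and \< escapes become a plain " and <, since neither character is special in Python's re module. The name URL_RE and the sample text are hypothetical.

```python
import re

# Scheme prefix, then one or more characters that are not a space,
# newline, carriage return, double quote, '<' or backslash.
URL_RE = re.compile(r'(?:http://|https://|mailto:|ftp://)[^ \n\r"<\\]+')

sample = ('See <a href="http://example.com/docs">docs</a> '
          'or write to mailto:someone@example.com today.')

found = URL_RE.findall(sample)
print(found)
```

The first reference stops at the closing double quote and the second at the following space, which is what the excluded characters are for.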
In English, the expression reads:
"Find all occurrences of text that start with 'http://', 'https://', 'mailto:' or 'ftp://' and are followed by at least one character that is not a space, a newline (\n), a carriage return (\r), a double quote (\"), an opening angle bracket (\<), or a backslash (\\)"
2. Locate all H1 HTML Tags
(<[hH]1>)(.+)(</[hH]1>)
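The expression can be tried in Python roughly as follows. This is a sketch under assumptions: the character class is written [hH], because a | inside brackets is an ordinary character in Python, and the name H1_RE and the sample markup are hypothetical.

```python
import re

# Group 1: opening tag, group 2: heading text, group 3: closing tag.
H1_RE = re.compile(r"(<[hH]1>)(.+)(</[hH]1>)")

html = "<h1>First title</h1>\n<p>text</p>\n<H1>Second title</H1>"

headings = [m.group(2) for m in H1_RE.finditer(html)]
print(headings)
```

Because . does not match a newline, each heading is found separately when the headings sit on different lines. Two headings on the same line would be captured as one match, since .+ takes as many characters as it can.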
In English, the expression reads:
"Find all occurrences of text that start with an opening tag bracket followed by 'h1' or 'H1' and a closing tag bracket, then one or more characters, then an opening tag bracket, a forward slash, 'h1' or 'H1', and a closing tag bracket."
3. Locate all HTML Hex Colors
#{1}([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})
In English, the expression reads:
Find a # character followed by six hex characters (A-F, a-f, 0-9) or three hex characters (A-F, a-f, 0-9).
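The same expression happens to be valid in Python's re module as written, so a quick check is possible. The name COLOR_RE and the sample style text are hypothetical, and Python is only a stand-in for the tool's engine.

```python
import re

# '#' followed by six hex digits, or failing that, three hex digits.
COLOR_RE = re.compile(r"#{1}([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})")

css = "color: #FF8800; background: #abc; border: #12"

colors = [m.group(0) for m in COLOR_RE.finditer(css)]
print(colors)
```

The six-digit alternative is listed first so that a full colour is not cut short to its first three digits; #12 is skipped because it has fewer than three hex digits.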
4. Locate all HTML Entities
&([^;\s])+;
In English, the expression reads:
Find a & character followed by one or more characters other than ";" and whitespace, then followed by a ";" character.
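This expression also runs unchanged in Python's re module, which gives a simple way to see what it picks up. The name ENTITY_RE and the sample text are hypothetical.

```python
import re

# '&', then one or more characters that are neither ';' nor whitespace,
# then the closing ';'.
ENTITY_RE = re.compile(r"&([^;\s])+;")

text = "Fish &amp; chips &lt;b&gt; R & D;"

entities = [m.group(0) for m in ENTITY_RE.finditer(text)]
print(entities)
```

The bare & in "R & D;" is not reported, because it is followed by whitespace. The expression checks only the shape of an entity, not whether its name is a defined one.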