The Tower Bridge in Sacramento, California Parsing Concepts
Regular Expressions
Main
Latest News
Getting Started
Screen Shots
Download
Documentation
Contributors
Contact
About GOLD
How It Works
FAQ
Why Use GOLD?
Comparison
Revision History
Freeware License
More ...
Articles
What is a Parser?
Backus-Naur Form
DFA Lexer
LALR Parsing
Glossary
Links
More ...


Regular Expressions

A Regular Expression is a simple, yet powerful, notation that is used to represent simple patterns. They are used extensively in programming language theory. In particular, Regular Expressions are used to describe the "terminals" of a programming language. The term "terminal" refers to the reserved words, symbols, literals, identifiers, etc... which are the basic components of a programming language.

When a program is analyzed, the text is chopped into different logical units by the tokenizer. The tokenizer produces a number of "tokens" which contain the same information as the original program. Of course, the tokenizer has the ability to ignore information such as comments. While terminals are used to represent the classification of information, tokens contain the actual information. Essentially, the category of token is its associated terminal.

Regular expressions are used to describe these kind of patterns. The notation consists of expressions constructed from a series of characters. Sub-expressions are delimited by using parenthesis '(' and ')'. The vertical-bar character '|' is used to denote alternate expressions. Any of these items, can be followed by a special character that specifies the number that can appear in sequence.

Symbol Meaning
Kleene Closure. This symbol denotes 0 or more or the specified character(s)
One or more. This symbol denotes 1 or more of the specified character(s)
Optional. This symbol denotes 0 or 1 of the specified character(s)

Many scanner (lexer) generators and parsing systems have expanded the notation to include set literals and sometimes named sets. In the case of Lex, literal sets of characters are delimited using the square brackets '[' and ']' and named sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet while the text "{abc}" refers to a set named "abc". This type of notation permits a short-cut notation for regular expressions. The expression (a|b|c)+ can be defined as [abc]+ .

Examples

Defintion Possible Terminals
Example1 = abc* ab, abc, abcc, abccc, abcccc, ...
Example2 = ab?c abc, ac
Example3 = a|b|c a, b, c
Example4 = a[12]*b ab, a1b, a2b, a12b, a21b, a22b, a111b, ...
Example5 = '*'+ *, **, ***, ****, ...
Example6 = {Letter}+ cat, dog, Sacramento, ...
ListFunction = c[ad]+r car, cdr, caar, cadr, cdar, cddr, caaar, ...
ListFunction = c(a|d)+r The same as the above using a different, yet equivalent, regular expression.