Parsing Concepts
	Regular Expressions

Main
	Latest News
	Getting Started
	Screen Shots
	Download
	Documentation
	Contributors
	Contact

About GOLD
	How It Works
	FAQ
	Why Use GOLD?
	Comparison
	Revision History
	Freeware License
	More ...

Articles
	What is a Parser?
	Backus-Naur Form
	DFA Lexer
	LALR Parsing
	Glossary
	Links
	More ...

Regular Expressions

A Regular Expression is a simple, yet powerful, notation that is used to represent simple patterns. They are used extensively in programming language theory. In particular, Regular Expressions are used to describe the "terminals" of a programming language. The term "terminal" refers to the reserved words, symbols, literals, identifiers, etc... which are the basic components of a programming language.

When a program is analyzed, the text is chopped into different logical units by the tokenizer. The tokenizer produces a number of "tokens" which contain the same information as the original program. Of course, the tokenizer has the ability to ignore information such as comments. While terminals are used to represent the classification of information, tokens contain the actual information. Essentially, the category of token is its associated terminal.

Regular expressions are used to describe these kind of patterns. The notation consists of expressions constructed from a series of characters. Sub-expressions are delimited by using parenthesis '(' and ')'. The vertical-bar character '|' is used to denote alternate expressions. Any of these items, can be followed by a special character that specifies the number that can appear in sequence.

Symbol	Meaning
	Kleene Closure. This symbol denotes 0 or more or the specified character(s)
	One or more. This symbol denotes 1 or more of the specified character(s)
	Optional. This symbol denotes 0 or 1 of the specified character(s)

Many scanner (lexer) generators and parsing systems have expanded the notation to include set literals and sometimes named sets. In the case of Lex, literal sets of characters are delimited using the square brackets '[' and ']' and named sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet while the text "{abc}" refers to a set named "abc". This type of notation permits a short-cut notation for regular expressions. The expression (a|b|c)+ can be defined as [abc]+ .

Examples

Defintion	Possible Terminals
Example1 = abc*	ab, abc, abcc, abccc, abcccc, ...
Example2 = ab?c	abc, ac
Example3 = a\|b\|c	a, b, c
Example4 = a[12]*b	ab, a1b, a2b, a12b, a21b, a22b, a111b, ...
Example5 = '*'+	, , , **, ...
Example6 = {Letter}+	cat, dog, Sacramento, ...
ListFunction = c[ad]+r	car, cdr, caar, cadr, cdar, cddr, caaar, ...
ListFunction = c(a\|d)+r	The same as the above using a different, yet equivalent, regular expression.