Getting Started
	Terminals & Regular Expressions

Main
	Latest News
	Getting Started
	Screen Shots
	Download
	Documentation
	Contributors
	Contact

About GOLD
	How It Works
	FAQ
	Why Use GOLD?
	Comparison
	Revision History
	Freeware License
	More ...

Articles
	What is a Parser?
	Backus-Naur Form
	DFA Lexer
	LALR Parsing
	Glossary
	Links
	More ...

What is a Terminal?

When talking about grammars, a terminal is any symbol, identifier, reserved word, etc... that describes a piece of data found in the language. Essentially, anything that can be recognized as a meaningful piece of lexical information is a terminal.

What is considered a terminal varies greatly between programming languages. In most cases, there is significant overlap. For instance, the following piece of text would specify a simple mathematical expression in most languages:

1 + 2 * 3

In this example, the terminals are '1', '+', '2','*', and '3'. Often, the format of terminal is specified using a regular expression. This is true of GOLD and many other parsing systems.

Regular Expressions

Introduction

A regular expression is a simple, yet powerful, notation that is used to represent simple patterns. They are used extensively in programming language theory. In particular, regular expressions are used to describe the "terminals" of a programming language.

The notation is fairly easy to master. Each expression consists of a series of characters and sub-expressions delimited using parenthesis '(' and ')'. Any of these items, can be followed by a special character that specifies the number that they can appear in a sequence.

The vertical-bar character '|' is used to denote alternate expressions.

Special Symbols
	Zero or more
	One or more
	Optional

ab*

The asterisk * denotes zero or more of the selected characters. For instance, the example to the left defines a regular expression that accepts a letter 'a' followed by zero or more letter 'b's. The text "a", "ab", "abb", "abbb", etc... would be accepted.

ab+

The plus symbol + denotes one or more of the selected characters. Unlike the last example, because the plus requires one or more 'b's, a single "a" will not be accepted. In other words, the text "ab", "abb", "abbb", etc... would be accepted.

ab?

Finally, the question mark ? is used to indicate the selected characters are optional. In other words, they can appear zero or 1 times. In this example, only the text "a" and "ab" would be accepted.

Notation Variations

Many parsing systems have expanded the notation to include set literals and sometimes named sets. In the case of Lex, literal sets of characters are delimited using the square brackets '[' and ']' and named sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet while the text "{abc}" refers to a set named "abc". This type of notation permits a short-cut notation for regular expressions. The expression (a|b|c)+ can be defined as [abc]+.

It should be noted that standard set notation uses curly brackets to denote a set, not the name of a set. As a result, the notation used by most parsing systems differs quite a bit from the official notation.

Examples

Defintion	Possible Terminals
Example1 = abc*	ab, abc, abcc, abccc, abcccc, ...
Example2 = ab?c	abc, ac
Example3 = a\|b\|c	a, b, c
Example4 = a[12]*b	ab, a1b, a2b, a12b, a21b, a22b, a111b, ...
Example5 = '*'+	, , , **, ...
Example6 = {Letter}+	cat, dog, Sacramento, ...
ListFunction = c[ad]+r	car, cdr, caar, cadr, cdar, cddr, caaar, ...
ListFunction = c(a\|d)+r	The same as the above using a different, yet equivalent, regular expression.

Next: Syntax & Backus-Naur Form