The Tower Bridge in Sacramento, California Parsing Concepts
Glossary
Main
Latest News
Getting Started
Screen Shots
Download
Documentation
Contributors
Contact
About GOLD
How It Works
FAQ
Why Use GOLD?
Comparison
Revision History
Freeware License
More ...
Articles
What is a Parser?
Backus-Naur Form
DFA Lexer
LALR Parsing
Glossary
Links
More ...


This page contains a short description of a number of different parsing terms and concepts.

 

Backus-Naur Form Backus-Naur Form is a notation used to describe the syntax of programming languages. For more information, please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
Big Endian Byte
Ordering
Big Endian Byte Ordering is a binary storage format where the most significant byte is stored in lower memory. This is the format used by the Motorola family of processors. For more information, please see theinline-arrow-r.gif (99 bytes)Byte Ordering page.
Configuration In parsing terms, a configuration is a rule production in the process of being completed. Configurations play a major role in the construction of LR parse tables.
Context Free Grammars Please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
Deterministic Finite
Automata
A Deterministic Finite Automaton is often used to analyze a series of characters.  Often it is implemented using state driven transition graph. Please see theinline-arrow-r.gif (99 bytes)Deterministic Finite Automata page for more information.
Grammar Please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
LALR Parsing LALR Parsing, or "Lookahead LR parsing", is a variant of LR Parsing that combines different "configurations" to limit the size of the parse tables. As a result, the algorithm is slightly less powerful than the LR Parsing. Grammars that can be parsed by a LR parser, might be found to be "ambiguous" by LALR. However, this is very rarely the case and real-world examples are few.

The number of states eliminated by LALR are sometimes huge. The C programming language, for instance, has over 10,000 LR states. LALR drops this number to slightly more than 200.

For more information, please see theinline-arrow-r.gif (99 bytes)LALR Algorithm page.

Little Endian Byte
Ordering
Little Endian Byte Ordering is a binary storage format where the least significant byte is stored in lower memory. This is the format used by the Intel family of processors. For more information, please see theinline-arrow-r.gif (99 bytes)Little Endian Byte Ordering page.
LL Parsing LL Parsing, or Left-to-right Left-derivative parsing, refers to a class of parsers that analyze text using a top-down algorithm.

LL Parsers have the advantage of being very simple in the design - at least conceptually. Unlike LR Parsers, the system does not need to generate tables ahead of time. Instead, only a set of rules and lookahead data is needed.

However, LL Parsers are not as powerful as LR parsers. Rules cannot have left-recursion since it causes problems with the recursive decent algorithm. In addition, LL Parsers are known as memory intensive and slow. Both of these are due to the "search" nature of the algorithm. Optimization is possible, but the complexity of the LL Parser runtime engine increases dramatically.

LR Parsing LR Parsing, or Left-to-right Right-derivative parsing, uses tables to determine when a rule is complete and when additional tokens are needed to be read from the source. Unlike LL Parsing, the LR Parser does very little "thinking" at runtime. All decisions are based on the content of the parse tables. As a result, LR Parsing is faster and more memory efficient than LL Parsing.

The construction of these tables where all the "thinking" takes place. LR parser generators, such as YACC and GOLD, construct these tables by analyzing the grammar and determining all the possible "states" the system can be in when parsing.

Each state represents a point in the parse process where a number of tokens have been read from the source and rules are in different states of completion. Each production in a state of completion is called a "configuration" and each state is really a configuration set.

LR parse tables can be huge, and, as a result, often a variant of LR Parsing is used. For more information, please see theinline-arrow-r.gif (99 bytes)LALR Algorithm page.

Nonterminal Please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
Nullable Rule A nullable rule is a type of rule which is optional, or in other words, can contain no symbols. In the GOLD Meta-Language, a nullable rule can be specified by adding a blank entry:
<Optional ID> ::= Identifier
               |
Parser A parser is software, such as a procedure or library, that organizes text into a set of logical units used by a programming language. Please see theinline-arrow-r.gif (99 bytes)What is a Parser page for more information.
Parser Generator A parser generator is a generic term that refers to any software program that helps develop a working parser. Examples include the GOLD, YACC and ANTLR.
Production Please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
Reduce-Reduce
Conflict
A Reduce-Reduce Conflict is a caused when a grammar allows two or more different rules to be reduced at the same time, for the same token. When this happens, the grammar becomes ambiguous since a program can be interpreted more than one way.

For instance, assume you have the following grammar:

<S> ::= <A> | <B>

<A> ::= Identifier
<B> ::= Identifier

When the system reads an identifier, it cannot determine if it has completed <A> or<B>.

When a LALR parser generator is analyzing a grammar and constructing the parse tables, these conflicts are located immediately.

Regular Expression A Regular Expression is a notation used describe patterns of characters. In programming language theory, they are often used to describe a languages' terminals. For more information, please see theinline-arrow-r.gif (99 bytes)Regular Expression page
Rule Please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
Semantics The semantics of a programming language refers to how the actual statements, constructs, etc... are interpreted. This is quite different from the syntax of a programming language which refers to how different symbols and reserved words are arranged.

For instance, in both Visual Basic and C++, the following is a valid expression:

a & b

Even though the syntax is the same between the two languages, the semantics are quite different. In C++, this expression is interpreted as a binary-and of "a" and "b". In Visual Basic, however, this expression returns the concatenation of two strings.

Symbol In parsing terms, a symbol is the building block of a grammar and can be either a terminal or nonterminal. Essentially, the term "symbol" is used to refer to either of its two forms.
Syntax The term "syntax" refers to the structure of a programming language, in particular, the different series of symbols and words that make up the basic parts of the language. This is quite different from the programming language's semantics - the actual meaning of those different parts.
Shift A "shift" is an action performed by a parsing engine when it reads a token that is valid in the current state.  Lookahead parsers maintain a list of terminals which are expected to be read from each state, given that the syntax of the program is correct. If the token is not the expected, a syntax error occurs.

In the bottom-up parser engines, such as GOLD and YACC, a shift pushes the token onto an internal stack that is used to hold tokens that are being used to construct completed rules.

Shift-Reduce
Conflict
The Shift-Reduce Conflict is the most common type of conflict found in grammars. It is caused when the grammar allows a rule to be reduced  for particular token, but, at the same time, allowing another rule to be shifted for that same token.

As a result, the grammar is ambiguous since a program can be interpreted more than one way.

This error is often caused by recursive grammar definitions where the system cannot determine when one rule is complete and another is just started. The Builder documentation contains an inline-arrow-r.gif (99 bytes)example of the common if-then-else grammar problem and how to fit it.

Terminal In Backus-Naur Form, a terminal is used to denote a programming language's reserved words and symbols. Please see theinline-arrow-r.gif (99 bytes)Grammar and Backus-Naur Form page
Unicode Please see theinline-arrow-r.gif (99 bytes)Unicode page