Parsing Concepts
	Grammars and Backus-Naur Form

Main
	Latest News
	Getting Started
	Screen Shots
	Download
	Documentation
	Contributors
	Contact

About GOLD
	How It Works
	FAQ
	Why Use GOLD?
	Comparison
	Revision History
	Freeware License
	More ...

Articles
	What is a Parser?
	Backus-Naur Form
	DFA Lexer
	LALR Parsing
	Glossary
	Links
	More ...

Context Free Grammars

Grammars provide rules that specify the structure of languages, independently from the actual meaning of the content. Grammars are classified according to the complexity of the structure they describe. The class of context free grammars (CFG) is the most common one use to describe the syntax of programming languages. In this class, the category a token belongs to (e.g. reserved words, identifiers, etc.) is what matters rather than the specific token (e.g. the identifier xyz).

In addition, the formatting of the program (the content of whitespace) and the actual text of identifiers does not affect the syntax of the grammar. This is very important in parsing technology. Grammars that are not context free cannot be parsed by the LR, LALR or LL parsing algorithms.

Backus-Naur Form

Backus-Naur Form, or BNF for short, is a notation used to describe context free grammars. The notation breaks down the grammar into a series of rules - which are used to describe how the programming languages tokens form different logical units

The actual reserved words and recognized symbol categories in the grammar represent "terminals". Usually, terminals are left without special formatting or are delimited by single or double quotes. Examples include: if, while, '=' and identifier.

In Backus-Naur Form, rules are represented with a "nonterminal" - which are structure names. Typically, nonterminals are delimited by angle-brackets, but this is not always the case. Examples include <statement> and <exp>. Both terminals and nonterminals are referred to generically as "symbols". Each nonterminal is defined using a series of one or more rules (also called productions). They have the following format:

N ::= s

where N is a nonterminal and s is a series of zero or more terminals and nonterminals. Different alternatives for rules can be specified in Backus-Naur Form. For readability, often productions are grouped together and separated by a “pipe” symbol which is read as the word “or”.

In summary , there are slight variations in use, but the notation has the following properties.

A rule consists of one or more productions.
The production starts with a single nonterminal, which is the name of the rule being defined
This nonterminal is followed by a ::= symbol which means “as defined as”. The ::= symbol is often used interchangeably with the symbol. They both have the same meaning.
The symbol is followed by a series of terminals and nonterminals.

The following chart identifies the various parts of a rule definition.

For example, the following defines a rule called <Value> which can contain either an Identifier terminal or the contents of another rule called <Literal>

<Value> ::= Identifier | <Literal>
<Literal> ::= Number | String

The <Literal> rule can contain either a Number or String terminal. As a result of this definition, a <Value> can contain an Identifier, Number or String.

Rules can also be recursively defined. The following rule defines a series of one or more Identifiers.

<Identifiers> ::= Identifier <Identifiers>
| Identifier

Extended BNF

There is another version of BNF called Extended BNF, or EBNF (ISO/IEC 14977), for short. This variant was originally developed by Niklaus Wirth to define the syntax for the Pascal Programming Language. The notation was designed to simplify the notation of BNF by allowing the developer to use special notation for defining lists and optional sets of symbols.

Variations between different versions of EBNF exist, but most use similar notation. Square brackets "[ ... ]" are used to denote optional elements of a rule. Elements of a rule can also be grouped together using braces "{ ... }" which denotes a repetition of zero to infinity. Symbols can also be grouped using parenthesis "( ... )" and followed by a Kleene closure. In this case, the semantics are identical to those used in Regular Expressions.

This format, while powerful, creates a number of implied rules. For instance, if the programmer was to define a rule with an optional clause, the system would have two distinct forms of the rule - the one with the clause and one without. This is also the case with lists and other enhanced features. For instance, the If-then-else statement could be defined as:

<If Stm> ::= IF <Expression> THEN <Stms> [ ELSE <Stms> ]

This would create the following rules:

<If Stm> ::= IF <Expression> THEN <Stms>
| IF <Expression> THEN <Stms> ELSE <Stms>