Design Pattern Evangelist Blog

Smart pointers about software design

Interpreter Design Pattern – Grammars

Grammars define the structure rules for natural and programming languages.


School Room Grammar

Introduction

This blog continues the Interpreter Design Pattern series by expanding upon Grammars, which I introduced in Interpreter Design Pattern – Programming Language Grammar Primer.

Programming languages allow developers to communicate functional intent with computers and other developers. One could argue that communication with other developers is more important than with the computer. A computer would work just as well with machine language as it does with a higher-level programming language. Two quotes by Donald Knuth come to mind:

Programming languages are bound by their grammar rules that document their syntax and structure. If you already feel confident about Programming Language Grammars, then skim this blog or skip it completely.

Here’s a summary of the Grammar concepts presented here:

Natural Language Grammars

People have been communicating with each other via natural language since long before we did so with programming languages. Natural languages and programming languages have a lot in common, but they have a lot of differences as well. Except for constructed languages, such as Esperanto, spoken languages emerge from cultural and social interaction, warts and all. Natural languages require context, contain ambiguity, and are littered with irregular forms, such as the conjugation of the verb to be.

For comparison, here are some aspects of natural languages that, for the most part, don’t exist in programming languages.

Implicit/Internalized Grammars

All natural languages have grammars, but most of us don’t think about them much. Most children are near experts in the first language they learn by around age three or four, but none of these toddlers would be able to describe its grammar rules. They choose their words based upon whether they sound right or wrong, which comes from countless hours of hearing the language from their parents, other family members, friends, etc. They have been in training mode for a very long time.

Children learn grammar in school. When I was in elementary school, we learned common English sentence grammars such as Noun-Verb-Noun as in John likes pizza. As we progressed, we learned more grammar rules and parts of speech, such as prepositions, prepositional phrases, subordinate clauses, infinitives, gerunds, active/passive voice, etc.

We implicitly understood these grammar rules and parts of speech, mostly by ear, but in the classroom we were learning formal grammar.

Some grammar rules we never learn formally. For example, all native English speakers would choose big blue marble rather than blue big marble in a phrase. If asked why, they would probably say that the first one just sounds better.

Adjective Order Rules

But there is a grammar rule for the order of adjectives in English based upon the category an adjective denotes. In that rule, size adjectives (e.g., big) come before color adjectives (e.g., blue).

I first saw this grammar rule a few years ago, well into my 50s. And yet, if I encountered a sentence that shuffled any of these adjective categories in a phrase, I suspect my ear would have flagged it as sounding a bit odd. I don’t know if non-native English speakers are taught this grammar rule or whether they develop an ear for it as well.

We know the grammar rules of our native natural languages even if we really don’t know them.

In contrast, when I learned Latin in secondary school, we were bombarded with grammar rules almost from the start. Latin’s grammar rules usually take the form of tables of suffixes, which are critical indicators of a word’s use in a Latin sentence. Word order isn’t as critical in Latin as it is in English. While I was able to memorize these myriad tables of suffixes, I never developed an ear, or auris, for the language.

Latin Grammar Tables

Natural Language Ambiguity

Natural languages have ambiguity, which requires context for resolution. Consider the sentence: Tom saw Jane with a telescope. This has at least two possible interpretations: Tom used a telescope to see Jane, or Tom saw Jane while she was carrying a telescope.

We need context to resolve the ambiguity. However, a slightly different version of the same sentence has no ambiguity: Galileo saw the moons of Jupiter with a telescope.

Sometimes a clever phrase depends upon ambiguity, such as Martin Fowler’s advice when in a difficult work environment:

You can Change Your Organization, or you can Change Your Organization.

This repeats the same phrase twice, but with two different meanings: Change how your Organization operates, or move to a new Organization.

Grammar Parsing Backtracking

Sometimes our interpretation is directed down the wrong path, and we must backtrack and start again.

Time flies like an arrow. Fruit flies like a banana. ― Groucho Marx

These two sentences have identical surface structure, but they require different parsing. The two-word phrase flies like has two structures: flies as a verb followed by like introducing a simile, or flies as a noun followed by like as its verb.

The first sentence frames our mind for the verb/simile parsing, and when we get to the second sentence, we don’t realize that it’s really the noun/verb parsing until we hit the last word. The verb/simile parsing no longer makes sense, and we need to reparse it mentally.

Programming Language Grammars

Spoken languages came first, and then their grammar rules came later. It’s like physicists observing natural phenomena and then creating models that formalize their observations. The constant evolution of natural language is what leads to ambiguity, irregular forms, backtracking, etc.

Programming languages must be deterministic. They can’t contain ambiguity, irregular forms, backtracking, etc. The behavior of every program must be clear and obvious. It would be disastrous if the same program exhibited different behaviors each time it were executed. I know there are exceptions, such as randomized algorithms and concurrency issues. I’m more interested in language features that are intended to be deterministic.

Programming language grammars are defined using context-free grammars. Context-free grammars originate from automata theory and come with its mathematical rigor. Pushdown automata are the theoretical machines that accept (operate upon) the languages that context-free grammars generate.

Automata theory is indifferent to programming languages. But it turns out that programming language designers can leverage that theory in practice. The theory provides a rigorous foundation upon which programming languages can be constructed. For example, programming languages are defined using context-free grammars, and parsers are implemented as pushdown automata. They are two sides of the same coin. This is why parsers can be autogenerated from grammars.
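To make the grammar/automaton pairing concrete, here is a minimal, hypothetical Java sketch (not from this blog series) of pushdown behavior: a stack-based recognizer for balanced parentheses, a classic context-free language that no finite automaton can accept.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A pushdown automaton in miniature: the stack is what lets it accept a
// context-free language (balanced parentheses) beyond finite automata.
class BalancedParens {
    static boolean accepts(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : input.toCharArray()) {
            if (c == '(') {
                stack.push(c);                      // push: remember an unmatched '('
            } else if (c == ')') {
                if (stack.isEmpty()) return false;  // ')' with nothing to match
                stack.pop();
            } else {
                return false;                       // alphabet is only '(' and ')'
            }
        }
        return stack.isEmpty();                     // accept iff every '(' was matched
    }
}
```

The stack is the essential ingredient: it gives the machine unbounded memory for nesting depth, which is exactly what context-free nesting requires.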

I mentioned the basic parts of programming language grammars in Interpreter Design Pattern Introduction - Programming Language Grammar Primer. For more details about grammars see:


Railroad Tracks in Japan

Programs Emerge from Grammars

Grammar rules are more than just syntax. They define the structural framework of a program via a unique parse tree that matches the program. The non-terminal nodes in the tree will be grammar rules. The leaf nodes will be the tokens in the program. For example, given this program:

A = B + C * 2;
D = 1

The constructed parse tree could be:

[Parse tree diagram]

We don’t build the tree bottom-up from the program. We start with an initial rule, such as statements, and expand it one rule at a time, growing the tree from its root until each non-terminal node has been expanded down to a leaf node. Parsers use the tokens in the program as road signs to know which rules to expand to build the parse tree.

statements would expand to stmt ; stmt. Then each stmt expands to assign, and so on. All programs that can be expressed with the grammar rules can be created by expanding from the root.
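That top-down expansion is exactly what a recursive-descent parser does: each method is one grammar rule, and the upcoming token decides which rule to expand next. Here is a hypothetical sketch for the two-statement program above; the rule names (statements, stmt, expr, term, factor) are assumptions for illustration, not this blog’s exact grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Recursive-descent sketch of "expanding from the root". Assumed grammar:
//   statements ::= stmt [; stmt]*      stmt   ::= ID = expr
//   expr       ::= term [+ term]*      term   ::= factor [* factor]*
//   factor     ::= ID | NUMBER
class TinyParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    TinyParser(String src) {
        Matcher m = Pattern.compile("[A-Za-z]+|\\d+|[=+*;]").matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }
    private String next() { return tokens.get(pos++); }

    String statements() {                  // the root rule: expansion starts here
        StringBuilder sb = new StringBuilder("(statements " + stmt());
        while (peek().equals(";")) { next(); sb.append(" ").append(stmt()); }
        return sb.append(")").toString();
    }
    private String stmt() {
        String id = next();                // the identifier being assigned
        next();                            // consume '='
        return "(assign " + id + " " + expr() + ")";
    }
    private String expr() {                // '+' binds looser than '*'
        String node = term();
        while (peek().equals("+")) { next(); node = "(+ " + node + " " + term() + ")"; }
        return node;
    }
    private String term() {
        String node = next();              // factor: an ID or a NUMBER token
        while (peek().equals("*")) { next(); node = "(* " + node + " " + next() + ")"; }
        return node;
    }
}
```

Running `new TinyParser("A = B + C * 2; D = 1").statements()` renders the parse tree as the s-expression `(statements (assign A (+ B (* C 2))) (assign D 1))` — the tokens steered which rules were expanded at each step.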

The grammars for General-Purpose Languages (GPL) are surprisingly small considering that every possible program that can be written can be generated uniquely from the root. Here are a few GPL grammar rules:

A Developer’s Notion of Programming Language Grammars

I suspect that most developers know just enough grammar of a programming language to get by. They don’t know the entire set of grammar rules, and I suspect they internalize what they do know. It’s like people who don’t know all the formal grammar rules of their first natural language but know them by ear, such as the adjective order rule mentioned above.

Most, if not all, programming languages have the concept of an if statement with an optional else clause. Most developers learned this concept in their earliest programming experience. If they’re like me, they saw a few examples, internalized the structure, and then could apply it almost universally. And if we made a syntax or structural mistake, then the compiler or IDE would correct us almost immediately.

I doubt that anyone thinks of the if and else Java grammar rules the way they’re listed by Aarhus University’s Department of Computer Science:

<if then statement>::= if ( <expression> ) <statement>
<if then else statement>::= if ( <expression> ) <statement no short if> else <statement>
<if then else statement no short if> ::= if ( <expression> ) <statement no short if> else <statement no short if>

If anything, we tend to think of the if and else Java grammar rules more like this version listed by Oracle:

Statement ::= if ParExpression Statement [else Statement] // Snippet of one Statement of the set.
ParExpression ::= ( Expression )
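Those “no short if” productions in the first version are not bureaucratic noise: they resolve the classic dangling-else ambiguity by forcing an else to bind to the nearest unmatched if. A quick hypothetical Java illustration (not from this blog) shows the effect; note that the misleading indentation has no bearing on how the grammar groups the statements.

```java
// The else below binds to the NEAREST if (`n > 10`), despite the indentation
// suggesting it belongs to the outer if. The "no short if" grammar rules
// guarantee this grouping.
class DanglingElse {
    static String classify(int n) {
        if (n > 0)
            if (n > 10)
                return "big";
        else                       // actually pairs with `if (n > 10)`
            return "small positive";
        return "not positive";     // reached whenever n <= 0
    }
}
```

So `classify(-1)` returns "not positive", not "small positive", because the else never belonged to the outer if in the first place.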

Rational Number Evaluator Grammar Use Case

DSL grammars tend to be smaller than GPL grammars. Let’s see some examples using the Rational Number Evaluator Use Case.

A Few Examples

I’d like to get a feel for what the Rational Number Evaluator might look like in use first, and then I’ll come up with the grammar rules. Each line will represent a single expression that will be evaluated:

2                 // Should evaluate to 2
2 5/6             // Should evaluate to 2 5/6
2 6/8             // Should evaluate to 2 3/4
4 10/4            // Should evaluate to 6 1/2
-1                // Should evaluate to -1
5/7               // Should evaluate to 5/7
7/5               // Should evaluate to 1 2/5
+(2, 3)           // Should evaluate to 5. Note: Using prefix notation. More on this shortly.
-(2, 3)           // Should evaluate to -1
*(3 1/3, 5 5/7)   // Should evaluate to 19 1/21
/(3 1/3, 5 5/7)   // Should evaluate to 7/12
*(2, 3)           // Should evaluate to 6
*(-2, 3)          // Should evaluate to -6
*(2, -3)          // Should evaluate to -6
*(-2, -3)         // Should evaluate to 6
a = 3             // Should evaluate to 3
+(a, 5)           // Should evaluate to 8
+(3, *(2, 6), 5)  // Should evaluate to 20
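The normalization behind results like 2 6/8 evaluating to 2 3/4 is ordinary gcd reduction plus mixed-fraction rendering. Here is a hypothetical sketch (names assumed; not this blog’s implementation) of that behavior:

```java
// Hypothetical Rational sketch: reduce by the gcd, keep the sign on the
// numerator, and render improper fractions as mixed fractions.
class Rational {
    final int num, den;

    Rational(int num, int den) {
        int g = gcd(Math.abs(num), Math.abs(den));
        if (g == 0) g = 1;
        int sign = den < 0 ? -1 : 1;       // keep the sign on the numerator
        this.num = sign * num / g;
        this.den = sign * den / g;
    }

    private static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    @Override public String toString() {
        int whole = num / den;             // integer part (truncates toward zero)
        int rem = Math.abs(num % den);     // leftover numerator
        if (rem == 0) return String.valueOf(whole);
        if (whole == 0) return num + "/" + den;
        return whole + " " + rem + "/" + den;
    }
}
```

For instance, 2 6/8 is 22/8, which reduces to 11/4 and renders as "2 3/4"; 4 10/4 is 26/4, which renders as "6 1/2".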

I’m using prefix notation for arithmetic operations for at least two reasons:

It will be similar to how Microsoft Excel uses prefix notation in many of its functions. The SUM function is prefix, and its syntax is almost identical to what I’ve shown above. For example, Excel supports the following in a cell definition:

=SUM(3, 12, 5)

which would also evaluate to 20.

Rational Expression Evaluation DSL Grammar Rules

Here are the Rational Expression Evaluation DSL Grammar Rules:

Statement ::= Assignment | Expression
Assignment ::= Identifier = Expression
Expression ::= Identifier | Rational | AddOperation | SubtractOperation | MultiplyOperation | DivideOperation
Identifier ::= AlphanumericValue // Must start with a letter.
Rational ::= Integer | Fraction | MixedFraction
Fraction ::= Integer / Integer // Whitespace on either side of slash is optional.
MixedFraction ::= Integer Integer / Integer // Whitespace between the first two Integers is mandatory. Whitespace on either side of slash is optional.
AddOperation ::= + ( ExpressionList ) // Whitespace is optional.
SubtractOperation ::= - ( Expression, Expression ) // Whitespace is optional.
MultiplyOperation ::= * ( ExpressionList ) // Whitespace is optional.
DivideOperation ::= / ( Expression, Expression ) // Whitespace is optional.
ExpressionList ::= Expression [, Expression]* // Whitespace is optional. Star means any number of repeating Comma/Expression extensions.

Notice that the arithmetic operations define Expressions as operands. That will allow us to nest expressions.
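To see that nesting fall out of the grammar, here is a hypothetical evaluator sketch restricted to integer operands (rationals, identifiers, and assignment are omitted; this is not the design the upcoming blogs will present). Because each operand is parsed as an Expression, evaluating an operand can recurse into another operation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: evaluates the prefix operations from the grammar
// above, restricted to integers. Operands are Expressions, so nesting recurses.
class PrefixEval {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    PrefixEval(String src) {
        Matcher m = Pattern.compile("-?\\d+|[-+*/(),]").matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    long expression() {
        String t = tokens.get(pos++);
        if (t.matches("-?\\d+")) return Long.parseLong(t);  // integer operand
        expect("(");
        List<Long> args = new ArrayList<>();
        args.add(expression());                             // operands are
        while (tokens.get(pos).equals(",")) {               // Expressions, so
            pos++;                                          // nesting recurses
            args.add(expression());
        }
        expect(")");
        switch (t) {
            case "+": return args.stream().mapToLong(Long::longValue).sum();
            case "*": return args.stream().reduce(1L, (a, b) -> a * b);
            case "-": return args.get(0) - args.get(1);
            case "/": return args.get(0) / args.get(1);
            default: throw new IllegalArgumentException("Unknown operator: " + t);
        }
    }

    private void expect(String s) {
        if (!tokens.get(pos++).equals(s)) throw new IllegalStateException("Expected " + s);
    }
}
```

Evaluating `+(3, *(2, 6), 5)` recurses into the nested `*(2, 6)` before summing, yielding 20 — the same shape of nesting the full Rational version will support.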

These rules will be the basis of my upcoming design and implementation. As that continues, I may have to return and tweak them a bit.

Summary

Grammars define the semantics and structure of a programming language. This applies to both General-Purpose Languages and Domain-Specific Languages. The grammars must define deterministic structure so that programs exhibit deterministic behavior.

References

Here are some free resources:

See: Interpreter Design Pattern Introduction/References.


Previous: Interpreter Design Pattern – Domain-Specific Languages

Next: Interpreter Design Pattern – Grammar to Design

Home: Design Pattern Evangelist Blog