SE250:HTTB:WIP:Parsing


Back to Work in Progress

Parsing in General

Parsing, or "syntactic analysis", is the process of converting a sequence of tokens into a grammatical structure, based on a formal grammar. In software, parsing usually refers to taking a sequence of tokens, created from a (relatively) high-level language by a Tokeniser, and inferring a (partially) syntactically correct parse tree from it, for use in converting to a lower-level language. It is part of the compilation process.

Lexical Analysis

While lexical analysis is not strictly part of parsing, the two are quite closely linked, as the output of a Tokeniser (which performs lexical analysis) is what a Parser works on.

Tokens

Tokens are simply blocks of text that are recognized to be important to a program. Tokens can be specific constant combinations of characters, e.g. "for", "+", ")", etc., or they can be a combination of characters that conform to a particular defined pattern. For example, "42" would be recognized as a number, with the value 42.

Example Token implementation in C

Token structure:

typedef struct {
	token_t type;       /* which kind of token this is (see token_t below) */
	union {             /* the token's value; which member is valid depends on type */
		int symval;     /* T_SYMBOL: the symbol character */
		char* strval;   /* T_IDENT / T_STRING: identifier or string text */
		int intval;     /* T_INTEGER: integer value, e.g. 42 */
		double fltval;  /* T_FLOAT: floating point value */
	} val;
} Token;

Token types:

typedef enum {
	T_SYMBOL,    /* single-character symbol, e.g. '+' or ')' */
	T_IDENT,     /* identifier, e.g. a variable name */
	T_INTEGER,   /* integer literal, e.g. 42 */
	T_FLOAT,     /* floating point literal */
	T_STRING,    /* string constant */
	T_END,       /* end of input */
	T_NOTHING    /* no token */
} token_t;
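
For example (an illustrative snippet, not from the course code), an integer token holding the value 42 could be filled in by hand like this:

Token t;
t.type = T_INTEGER;   /* mark it as an integer token */
t.val.intval = 42;    /* store the value in the matching union member */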

Tokeniser

A Tokeniser performs the process of converting a sequence of characters (e.g. in a C source file) into a sequence of tokens. This process usually has two components - the scanner and the tokeniser. The scanner function searches for sequences of characters that conform to predefined patterns. The tokeniser then takes the resulting sections of characters, classifies them, and creates their tokens.
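
As a rough sketch (not the course's actual tokeniser), the following next_token function reads one token from a string using the Token and token_t definitions above; it only recognises integers and single-character symbols, and skips whitespace:

#include <ctype.h>
#include <stdlib.h>

/* Scan one token starting at *src, and advance *src past it. */
Token next_token(const char **src) {
	Token tok;
	const char *s = *src;

	while (isspace(*s))
		s++;                                /* skip whitespace between tokens */

	if (*s == '\0') {
		tok.type = T_END;                   /* end of input */
	} else if (isdigit(*s)) {
		tok.type = T_INTEGER;               /* scanner found a digit sequence */
		tok.val.intval = (int)strtol(s, (char **)&s, 10);
	} else {
		tok.type = T_SYMBOL;                /* anything else: a one-character symbol */
		tok.val.symval = *s++;
	}

	*src = s;
	return tok;
}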

Basic overview of the Tokeniser (Lexical analyser) — screencast

Regular expressions

Regular expressions are used to describe regular grammars. Put simply, a grammar is regular if it is not recursive. Regexes (the regular expressions implemented in most programming tools) have an extended syntax, making them a bit more expressive but slower to process. They are often used in Tokenisers to define what should be tokenised.

Most text in a regular expression is treated as a literal. For example, cat will match any occurrence of the word cat. Here are some fairly standard operators:

Operator   Description
.          Matches any one character.
*          Matches the preceding element 0 or more times.
[ ]        Matches one of the characters between the brackets.
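
As an illustration (an assumption — the course tokeniser may not use a regex library at all), here is a minimal sketch that uses the POSIX <regex.h> API to find the first integer literal in a string:

#include <regex.h>
#include <stdio.h>

int main(void) {
	regex_t re;
	regmatch_t m;
	const char *input = "x = 42 + y";

	/* [0-9]+ matches one or more digits, i.e. an integer literal */
	if (regcomp(&re, "[0-9]+", REG_EXTENDED) != 0)
		return 1;

	if (regexec(&re, input, 1, &m, 0) == 0)
		printf("matched \"%.*s\" at offset %d\n",
		       (int)(m.rm_eo - m.rm_so), input + m.rm_so, (int)m.rm_so);

	regfree(&re);
	return 0;
}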

Syntactic Analysis

Parser

A parser takes a sequence of tokens and attempts to construct a valid structure out of them. As it reads through the tokens, it compares what it finds to a defined formal grammar. When it finds what seems to be a valid sequence, it adds the tokens to the parse tree, in the structure defined for that sequence.

As the formal grammar the parser uses is defined specifically to find allowable combinations of token types, the parser is able to detect localised syntactic errors. For example, it will detect if a "for" statement has an incorrect number of arguments. However, as most parsers don't keep an overall picture of the program "in mind", most cannot detect errors on a larger scale. For example, they will not pick up if a variable is used before initialization. Things like this are usually checked later in the compilation process.

Parse Trees

A parse tree is produced by recursively following the rules of a grammar. Each level represents a matched production. Just as multiple grammars can be written to parse the same language, there can be multiple equivalent parse trees for the same expression.

Example of a parse tree:

(Image: example parse tree — http://www.resisty.com/upload/parsetree.jpg)

Abstract Syntax Trees

Abstract syntax trees are simplified trees which unambiguously show the structure of a text.

Example (same expression as above for comparison):

(Image: abstract syntax tree for the same expression — http://img151.imageshack.us/img151/6968/astdv8.png)
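
As a minimal sketch (illustrative only; the field names and the mk_node helper are not the course's actual mkNode function), an AST node for binary expressions could reuse the Token struct from above:

#include <stdlib.h>

typedef struct Node {
	Token tok;            /* the operator or operand token at this node */
	struct Node *left;    /* left child, or NULL for a leaf */
	struct Node *right;   /* right child, or NULL for a leaf */
} Node;

/* Allocate and fill in a new tree node. */
Node *mk_node(Token tok, Node *left, Node *right) {
	Node *n = malloc(sizeof *n);
	n->tok = tok;
	n->left = left;
	n->right = right;
	return n;
}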

Grammars

A grammar formally defines the acceptable syntax of a language. We specify a grammar using productions. Productions look like this:

<S> -> <C>
<S> -> <C><S>
<SC> -> "<S>"

To understand these, try reading them like this:
A string can be a character.
A string can be a character followed by a string.
A string constant can be a quote, followed by a string, followed by a quote.

Grammars for mathematical expressions tend to be a bit more complex. Here is a cut down example (where N is a number):

<E> -> <N>
<E> -> ( <E> )
<E> -> <E> + <E>
<E> -> <E> - <E>
<E> -> <E> * <E>
<E> -> <E> / <E>

There are two (related) problems with this grammar. The first is that it's ambiguous. It could match 1 - 2 + 3 as (1 - 2) + 3 = -1 + 3 = 2. Or, it could match it as 1 - (2 + 3) = 1 - 5 = -4. The second is that order of operations is ignored. For example, multiplication should take precedence over addition.
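
One standard way to fix both problems (a common textbook refactoring, not part of the original example) is to split the grammar into levels, so that * and / bind more tightly than + and -, and operators of the same level group from the left:

<E> -> <E> + <T>
<E> -> <E> - <T>
<E> -> <T>
<T> -> <T> * <F>
<T> -> <T> / <F>
<T> -> <F>
<F> -> ( <E> )
<F> -> <N>

With this grammar, 1 - 2 + 3 can only be parsed as (1 - 2) + 3, and in 1 + 2 * 3 the multiplication is grouped first.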

Example Grammar

Balanced Parentheses

<B> -> ε
<B> -> (<B>) <B>
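
Here ε denotes the empty string. As a sketch of how such a grammar can be turned into code (an illustrative recursive-descent recogniser, not part of the original notes), each production becomes a function that consumes input characters:

#include <stdio.h>

static const char *p;   /* current position in the input string */

/* <B> -> ε | (<B>) <B> */
static int parse_B(void) {
	if (*p == '(') {
		p++;                       /* consume '(' */
		if (!parse_B()) return 0;  /* inner <B> */
		if (*p != ')') return 0;   /* expect the matching ')' */
		p++;                       /* consume ')' */
		return parse_B();          /* trailing <B> */
	}
	return 1;                      /* ε: match the empty string */
}

int main(void) {
	p = "(()())()";
	if (parse_B() && *p == '\0')
		printf("balanced\n");
	else
		printf("not balanced\n");
	return 0;
}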

Statements

<St> -> if(<condition>) <St>
<St> -> while(<condition>) <St>
<St> -> if(<condition>) <St> else <St>
<Stlist> -> ε
<Stlist> -> <St><Stlist>
<St> -> {<Stlist>}

Lexical Analyzer - Flex

Lexical analysis ties together the operations above: it consists of a scanner and a tokeniser, which between them perform the analysis. Although these two functions are usually written separately, the whole analysis can also be done in a single pass, reading one character at a time. This is what flex, a lexical analyser generator, does, and it is generally faster and more efficient than the hand-written route. A flex-generated analyser uses a lookup table and goto-style state transitions to jump between states: it picks up lexical patterns in the text and collects the sequences of characters that match them. The user supplies a specification telling flex which patterns should be collected as tokens and which should be ignored; some flex setups also provide predefined pattern libraries, so that common terms do not all need to be stated explicitly. After processing the specification, flex outputs the generated analyser as a C source file.
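
As a small illustration (a generic flex specification, not taken from the course), the rules below tokenise integers and single symbols; flex turns this file into a C scanner whose yylex function applies the longest-matching rule each time input is read:

%{
#include <stdio.h>
%}

%%
[0-9]+      { printf("T_INTEGER: %s\n", yytext); }
[ \t\n]+    { /* skip whitespace between tokens */ }
.           { printf("T_SYMBOL: %s\n", yytext); }
%%

int main(void) { return yylex(); }
int yywrap(void) { return 1; }

Running flex on this specification produces lex.yy.c, which is then compiled like any other C file.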

Screencasts

User      Description
sgha014   Using the mkNode function.
vpup001   Constructing a parse tree of the given output.
rwan064   Lexical analysis - scanning and tokenising.
jsmi233   The parsing process.