Saturday, June 27, 2009

introducing jasi and my first parsing strategy

JASI - Just Another Scheme Interpreter.
I'm doing an experiment of writing a scheme interpreter(hosted here, code discussed in this post is available only upto revision#10).
I've read sometimes about basics of compilers, grammars, ast etc and followed that path to start with. So I started with creating a EBNF grammar(not strictly correct) for a subset of scheme. Then a parser. Basically there are two kind of parsers, Bottom-up(e.g. shift-reduce) and Top-down(e.g. recursive descent). Since this is an experiment, So I wanted to write the parser by hand. Its easy to write a recursive descent parser(algo is simple, each non-terminal symbol has a method to create its part of ast, we start with calling that method of the start non-terminal of the grammar and that in turn recursively calls same method for other non-terminals) by hand, So I tried to write the grammar in a way so that its easy to write a top-down recursive descent parser for it.I removed left recursion, tried to left factorize it so that starter symbol of any non-terminal for each production rule is distinct(from all other production rules for that non-terminal) and single look-ahead is enough to parse. Here is that grammar I wrote to start with...
expr :=
|CHAR ;charExpr
|STRING ;stringExpr
|NUMBER ;numberExpr
|VARIABLE ; variableExpr
|LPAREN DEFINE VARIABLE expr RPAREN ;definitionExpr
|LPAREN IF expr expr expr? RPAREN ;ifExpr
|LPAREN LAMBDA LPAREN variable* RPAREN expr+ RPAREN ;lambdaExpr
|LPAREN variable expr* RPAREN ;applicationExpr

CHAR := #\any-non-white-space-char
NUMBER := DIGIT(DIGIT)*
DIGIT := 0|1|2|3|4|5|6|7|8|9
STRING := "any-characters"
LPAREN := '('
RPAREN := )
VARIABLE := [^#0-9()] any-non-white-space-char

Notice that, I gave labels to each production rule and each label is mapped to a AST in the code. So here is the (classes of)tree structure that I created for ast...
AST
CharAST
StringAST
NumberAST
VariableAST
DefinitionAST
IfAST
LambdaAST
ApplicationAST

I wrote a class called Tokenizer to make tokens from stream of characters. Its implemented in terms of a finite state machine. And wrote another Parser class(a recursive descent top down parser) that gets tokens from Tokenizer and builds the AST. The most important method of Parser is here and the beauty is its simplicity(it derives straightforward from the grammar and indeed this is the reason parsers can be generated. :)
public AST parseExpr() {
Token currentToken = Tokenizer.getNextToken();
switch(currentToken.getType()) {
case Constants.TOKEN_TYPE_CHAR:
return parseCharExpr(currentToken);
case Constants.TOKEN_TYPE_NUMBER:
return parseNumberExpr(currentToken);
case Constants.TOKEN_TYPE_STRING:
return parseStringExpr(currentToken);
case Constants.TOKEN_TYPE_VARIABLE:
return parseVariableExpr(currentToken);
case Constants.TOKEN_TYPE_LPAREN:
currentToken = Tokenizer.getNextToken();
switch(currentToken.getType()) {
case Constants.TOKEN_TYPE_DEFINE:
return parseDefinitionExpr(currentToken);
case Constants.TOKEN_TYPE_IF:
return parseIfExpr(currentToken);
case Constants.TOKEN_TYPE_LAMBDA:
return parseLambdaExpr(currentToken);
case Constants.TOKEN_TYPE_VARIABLE:
return parseApplicationExpr(currentToken);
default:
throw new RuntimeException("Could not" +
" parse:" + currentToken.getType() +
":" + currentToken.getSpelling());
}
default:
throw new RuntimeException("Could not " +
"parse at:" + currentToken.getType() +
":" + currentToken.getSpelling());
}
}

No comments:

Post a Comment