Saturday, June 27, 2009

scheme parser = "read"

This is another article for my ongoing learnings with writing a scheme interpreter(hosted here). As I described in my previous post that I started with writing a formal EBNF grammar and a recursive descent parser for it.
When I came to implementing "read" function of scheme, then I realised that it essentially is the parser of scheme and list data structure is AST. Moreover when I write the formal grammar then the syntactical rules of the language are fixed at the time of parsing itself and its very difficult now to provide macros. Instead the parser or the "read" should just read *any* valid s-expression and "eval" only should derive any meaning from it.
So I've now changed my implementation and written a "read" that reads any valid s-expression and "eval" provides meaning to it.

introducing jasi and my first parsing strategy

JASI - Just Another Scheme Interpreter.
I'm doing an experiment of writing a scheme interpreter(hosted here, code discussed in this post is available only upto revision#10).
I've read sometimes about basics of compilers, grammars, ast etc and followed that path to start with. So I started with creating a EBNF grammar(not strictly correct) for a subset of scheme. Then a parser. Basically there are two kind of parsers, Bottom-up(e.g. shift-reduce) and Top-down(e.g. recursive descent). Since this is an experiment, So I wanted to write the parser by hand. Its easy to write a recursive descent parser(algo is simple, each non-terminal symbol has a method to create its part of ast, we start with calling that method of the start non-terminal of the grammar and that in turn recursively calls same method for other non-terminals) by hand, So I tried to write the grammar in a way so that its easy to write a top-down recursive descent parser for it.I removed left recursion, tried to left factorize it so that starter symbol of any non-terminal for each production rule is distinct(from all other production rules for that non-terminal) and single look-ahead is enough to parse. Here is that grammar I wrote to start with...
expr :=
|CHAR ;charExpr
|STRING ;stringExpr
|NUMBER ;numberExpr
|VARIABLE ; variableExpr
|LPAREN DEFINE VARIABLE expr RPAREN ;definitionExpr
|LPAREN IF expr expr expr? RPAREN ;ifExpr
|LPAREN LAMBDA LPAREN variable* RPAREN expr+ RPAREN ;lambdaExpr
|LPAREN variable expr* RPAREN ;applicationExpr

CHAR := #\any-non-white-space-char
NUMBER := DIGIT(DIGIT)*
DIGIT := 0|1|2|3|4|5|6|7|8|9
STRING := "any-characters"
LPAREN := '('
RPAREN := )
VARIABLE := [^#0-9()] any-non-white-space-char

Notice that, I gave labels to each production rule and each label is mapped to a AST in the code. So here is the (classes of)tree structure that I created for ast...
AST
CharAST
StringAST
NumberAST
VariableAST
DefinitionAST
IfAST
LambdaAST
ApplicationAST

I wrote a class called Tokenizer to make tokens from stream of characters. Its implemented in terms of a finite state machine. And wrote another Parser class(a recursive descent top down parser) that gets tokens from Tokenizer and builds the AST. The most important method of Parser is here and the beauty is its simplicity(it derives straightforward from the grammar and indeed this is the reason parsers can be generated. :)
public AST parseExpr() {
Token currentToken = Tokenizer.getNextToken();
switch(currentToken.getType()) {
case Constants.TOKEN_TYPE_CHAR:
return parseCharExpr(currentToken);
case Constants.TOKEN_TYPE_NUMBER:
return parseNumberExpr(currentToken);
case Constants.TOKEN_TYPE_STRING:
return parseStringExpr(currentToken);
case Constants.TOKEN_TYPE_VARIABLE:
return parseVariableExpr(currentToken);
case Constants.TOKEN_TYPE_LPAREN:
currentToken = Tokenizer.getNextToken();
switch(currentToken.getType()) {
case Constants.TOKEN_TYPE_DEFINE:
return parseDefinitionExpr(currentToken);
case Constants.TOKEN_TYPE_IF:
return parseIfExpr(currentToken);
case Constants.TOKEN_TYPE_LAMBDA:
return parseLambdaExpr(currentToken);
case Constants.TOKEN_TYPE_VARIABLE:
return parseApplicationExpr(currentToken);
default:
throw new RuntimeException("Could not" +
" parse:" + currentToken.getType() +
":" + currentToken.getSpelling());
}
default:
throw new RuntimeException("Could not " +
"parse at:" + currentToken.getType() +
":" + currentToken.getSpelling());
}
}