Lexing (Tokenizing): Converting a string to a list of tokens
Token: A meaningful string
Typically: keywords, identifiers, numbers, etc.
"The short Wizard" \(\Rightarrow\) [Det; Adj; Noun]
type token = Int of int | Add | Sub | LParen | RParen;;
tokenize "2 + ( 4 - 5)";;
=> [Int(2); Add; LParen; Int(4); Sub; Int(5); RParen]
How to Tokenize?
One way: regular expressions (RE) and boring repetition
(* Try one anchored regexp after another at the front of the remaining text. *)
let tokenize s =
  let re_num = Re.Perl.compile (Re.Perl.re "^([0-9]+)") in
  let re_add = Re.Perl.compile (Re.Perl.re "^\\+") in
  let re_sub = Re.Perl.compile (Re.Perl.re "^-") in
  let re_lparen = Re.Perl.compile (Re.Perl.re "^\\(") in
  let re_rparen = Re.Perl.compile (Re.Perl.re "^\\)") in
  (* drop the first n characters of text *)
  let drop n text = String.sub text n (String.length text - n) in
  let rec mklst text =
    if text = "" then []
    else if Re.execp re_num text then
      let matched = Re.Group.get (Re.exec re_num text) 1 in
      Int (int_of_string matched) :: mklst (drop (String.length matched) text)
    else if Re.execp re_add text then Add :: mklst (drop 1 text)
    else if Re.execp re_sub text then Sub :: mklst (drop 1 text)
    else if Re.execp re_lparen text then LParen :: mklst (drop 1 text)
    else if Re.execp re_rparen text then RParen :: mklst (drop 1 text)
    (* skip whitespace and any other unrecognized character *)
    else mklst (drop 1 text)
  in
  mklst s;;
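The regexp tokenizer depends on the external Re library. As a minimal sketch of the same idea using only the standard library, the version below scans characters by index instead of re-slicing the string. The names `tokenize_chars` and `is_digit` are made up for this illustration, and it assumes the same `token` type as above.

```ocaml
type token = Int of int | Add | Sub | LParen | RParen

(* hypothetical helper: is c a decimal digit? *)
let is_digit c = c >= '0' && c <= '9'

(* Scan s left to right, keeping an index into the string. *)
let tokenize_chars (s : string) : token list =
  let n = String.length s in
  let rec go i =
    if i >= n then []
    else
      match s.[i] with
      | '+' -> Add :: go (i + 1)
      | '-' -> Sub :: go (i + 1)
      | '(' -> LParen :: go (i + 1)
      | ')' -> RParen :: go (i + 1)
      | c when is_digit c ->
        (* consume the longest run of digits starting at i *)
        let j = ref i in
        while !j < n && is_digit s.[!j] do incr j done;
        Int (int_of_string (String.sub s i (!j - i))) :: go !j
      | _ -> go (i + 1) (* skip whitespace and anything unrecognized *)
  in
  go 0

let example = tokenize_chars "2 + ( 4 - 5)"
```

Here `example` should be the token list [Int 2; Add; LParen; Int 4; Sub; Int 5; RParen]. Silently skipping unknown characters mirrors the regexp version; a stricter lexer would probably raise an error instead.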
Parsing: token list to AST
Can check whether the text is grammatically correct
Many types of parsers: we will use recursive descent
Recursive descent parsing (RDP) is top-down; the grammar slides showed bottom-up
Consider the basic grammar for Polish notation
\(E \rightarrow A\vert + A\ E \vert - A\ E\)
\(A \rightarrow 0\vert 1\vert \dots\vert 9\)
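As a worked example (the input + 1 - 2 3 is chosen here only for illustration), one leftmost derivation under this grammar is:
\(E \Rightarrow + A\ E \Rightarrow + 1\ E \Rightarrow + 1 - A\ E \Rightarrow + 1 - 2\ E \Rightarrow + 1 - 2\ A \Rightarrow + 1 - 2\ 3\)
Each step expands the leftmost nonterminal, and the last \(E\) uses the production \(E \rightarrow A\).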
type ast = Num of int | Plus of ast * ast | Minus of ast * ast
(* Returns the tree for one E together with the tokens left unconsumed. *)
let rec parse tokens = match tokens with
  | [] -> raise (Failure "not enough tokens")
  | Int(x)::xs -> Num(x), xs
  | Add::Int(y)::xs ->
    let s, l = parse xs in
    (Plus(Num(y), s), l)
  | Sub::Int(y)::xs ->
    let s, l = parse xs in
    (Minus(Num(y), s), l)
  | _ -> raise (Failure "not gram correct")
(* The whole input is accepted only if no tokens are left over. *)
let parse_wrapper tokens =
  let tree, leftover = parse tokens in
  if leftover = [] then
    tree
  else
    raise (Failure "have extra tokens");;
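To see what the parser accepts and rejects, here is a small sketch that runs it on a few hand-written token lists. The types and the parser are restated so the snippet runs on its own; `try_parse` is a hypothetical helper added only to capture the failure message.

```ocaml
type token = Int of int | Add | Sub | LParen | RParen
type ast = Num of int | Plus of ast * ast | Minus of ast * ast

(* The parser from the notes, restated so this snippet is self-contained. *)
let rec parse tokens = match tokens with
  | [] -> raise (Failure "not enough tokens")
  | Int x :: xs -> (Num x, xs)
  | Add :: Int y :: xs -> let s, l = parse xs in (Plus (Num y, s), l)
  | Sub :: Int y :: xs -> let s, l = parse xs in (Minus (Num y, s), l)
  | _ -> raise (Failure "not gram correct")

let parse_wrapper tokens =
  let tree, leftover = parse tokens in
  if leftover = [] then tree else raise (Failure "have extra tokens")

(* hypothetical helper: "ok" on success, otherwise the failure message *)
let try_parse tokens =
  match parse_wrapper tokens with
  | _ -> "ok"
  | exception Failure msg -> msg

(* + 1 - 2 3 parses to Plus (Num 1, Minus (Num 2, Num 3)) *)
let good = parse_wrapper [Add; Int 1; Sub; Int 2; Int 3]
(* 1 2 : a complete E followed by a stray token *)
let extra = try_parse [Int 1; Int 2]
(* + 1 : the trailing E is missing *)
let short = try_parse [Add; Int 1]
(* + + : no production matches *)
let bad = try_parse [Add; Add]
```

The three failing inputs hit the three different `Failure` messages, which is a quick way to check that each error branch is reachable.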
Important: knowing which branch you are looking for
Backtracking vs Predictive
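Predictive: what's the next symbol?
First(nt): the set of terminals that can begin a string derived from the nonterminal nt
Only goes so far: with conflicting (overlapping) first sets, the next symbol alone cannot pick a branch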
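A minimal sketch of predictive parsing for the Polish-notation grammar, with one function per nonterminal and the branch chosen from a single token of lookahead. The names `lookahead`, `parse_a`, and `parse_e` are assumptions made for this illustration, and the `Int` token stands in for the digit nonterminal \(A\).

```ocaml
type token = Int of int | Add | Sub | LParen | RParen
type ast = Num of int | Plus of ast * ast | Minus of ast * ast

(* The next token, if any, without consuming it. *)
let lookahead tokens = match tokens with [] -> None | t :: _ -> Some t

(* A -> 0 | 1 | ... | 9 ; First(A) = { Int } *)
let parse_a tokens = match tokens with
  | Int n :: rest -> (Num n, rest)
  | _ -> raise (Failure "expected a number")

(* E -> A | + A E | - A E
   First(A) = { Int }, First(+ A E) = { Add }, First(- A E) = { Sub }.
   The three sets are disjoint, so the next token decides the branch. *)
let rec parse_e tokens = match lookahead tokens with
  | Some (Int _) -> parse_a tokens
  | Some Add ->
    let a, rest = parse_a (List.tl tokens) in
    let e, rest = parse_e rest in
    (Plus (a, e), rest)
  | Some Sub ->
    let a, rest = parse_a (List.tl tokens) in
    let e, rest = parse_e rest in
    (Minus (a, e), rest)
  | _ -> raise (Failure "no production of E starts here")

let tree, leftover = parse_e [Sub; Int 7; Add; Int 1; Int 2]
```

If the grammar instead had two productions starting with the same terminal, say \(E \rightarrow A\) and \(E \rightarrow A + E\), both first sets would contain `Int` and this one-token test could not choose between them. That is the conflict case, where backtracking or rewriting the grammar is needed.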
Converting to AST
Recall a Tree in OCaml
type tree = Leaf | Node of int * tree * tree;;
Node(2,Node(0,Leaf,Leaf),Leaf);;
Modify for Tokens
type expr = Num of int | Plus of expr * expr | Minus of expr * expr;;
(Plus(Num 1, Num 2));;
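Because `expr` is just a tree type, it can be walked with the same kind of recursion as `tree`. As a small sketch, the hypothetical function `to_prefix` below (not part of the notes) prints an expression back out in prefix (Polish) notation.

```ocaml
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* hypothetical helper: one case per constructor, recursing into both children *)
let rec to_prefix e = match e with
  | Num n -> string_of_int n
  | Plus (l, r) -> "+ " ^ to_prefix l ^ " " ^ to_prefix r
  | Minus (l, r) -> "- " ^ to_prefix l ^ " " ^ to_prefix r

let example = to_prefix (Plus (Num 1, Minus (Num 2, Num 3)))
```

Here `example` is the string "+ 1 - 2 3", the same shape of input that the Polish-notation grammar describes.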