CMSC330

Parsers

Lexing
Parsing
Interpreting

Lexing

Lexing (tokenizing): converting a string to a list of tokens

Token: A meaningful string

Typically: keywords, identifiers, numbers, etc.

"The short wizard" \(\Rightarrow\) [Det; Adj; Noun]


type token = Int of int | Add | Sub | LParen | RParen;;

tokenize "2 + ( 4 - 5)";;
=> [Int 2; Add; LParen; Int 4; Sub; Int 5; RParen]
          

How to Tokenize?

One way: regular expressions (RE) and boring repetition


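(* One anchored regexp per token kind; uses the ocaml-re library (Re) *)
let tokenize s =
  let re_num = Re.Perl.compile (Re.Perl.re "^([0-9]+)") in
  (* the backslash is doubled so the regexp engine sees \+ and \( *)
  let re_add = Re.Perl.compile (Re.Perl.re "^\\+") in
  let re_sub = Re.Perl.compile (Re.Perl.re "^-") in
  let re_lparen = Re.Perl.compile (Re.Perl.re "^\\(") in
  let re_rparen = Re.Perl.compile (Re.Perl.re "^\\)") in
  (* drop the first n characters of text *)
  let drop n text = String.sub text n (String.length text - n) in
  let rec mklst text =
    if text = "" then [] else
    if Re.execp re_num text then
      let matched = Re.Group.get (Re.exec re_num text) 1 in
      Int (int_of_string matched) :: mklst (drop (String.length matched) text)
    else if Re.execp re_add text then Add :: mklst (drop 1 text)
    else if Re.execp re_sub text then Sub :: mklst (drop 1 text)
    else if Re.execp re_lparen text then LParen :: mklst (drop 1 text)
    else if Re.execp re_rparen text then RParen :: mklst (drop 1 text)
    (* anything else (e.g. whitespace) is skipped one character at a time *)
    else mklst (drop 1 text) in
  mklst s;;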
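
Another way, sketched here as an illustration rather than taken from the slides: scan the string character by character with the standard library only. The name `tokenize_chars` is hypothetical; the `token` type is repeated so the snippet stands alone.

```ocaml
type token = Int of int | Add | Sub | LParen | RParen

(* Sketch: walk the string by index instead of matching regexps. *)
let tokenize_chars (s : string) : token list =
  let n = String.length s in
  let is_digit c = c >= '0' && c <= '9' in
  (* scan i returns the tokens found in s from position i onward *)
  let rec scan i =
    if i >= n then []
    else match s.[i] with
      | '+' -> Add :: scan (i + 1)
      | '-' -> Sub :: scan (i + 1)
      | '(' -> LParen :: scan (i + 1)
      | ')' -> RParen :: scan (i + 1)
      | c when is_digit c ->
        (* digits_end j is the index just past the run of digits starting at j *)
        let rec digits_end j =
          if j < n && is_digit s.[j] then digits_end (j + 1) else j in
        let j = digits_end i in
        Int (int_of_string (String.sub s i (j - i))) :: scan j
      | _ -> scan (i + 1)  (* skip whitespace and anything unrecognized *)
  in
  scan 0
```

Like the regexp version, this silently skips characters it does not recognize; raising an error there instead is a reasonable design choice.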
          

Parsing

Parsing: token list to AST

Can also check whether the text is grammatically correct

Many types of parsers: we will use recursive descent

Recursive descent parsing (RDP) is top-down; the grammar slides showed bottom-up

Consider the basic grammar for Polish notation

\(E \rightarrow A\vert + A\ E \vert - A\ E\)
\(A \rightarrow 0\vert 1\vert \dots\vert 9\)

  • Which Branch am I in/looking for?

  • Which Token are we looking for?


type ast = Num of int | Plus of ast * ast | Minus of ast * ast

(* parse returns the tree for one E plus the tokens it did not consume *)
let rec parse tokens = match tokens with
  | [] -> raise (Failure "not enough tokens")
  (* E -> A *)
  | Int x :: xs -> (Num x, xs)
  (* E -> + A E *)
  | Add :: Int y :: xs ->
    let s, l = parse xs in
    (Plus (Num y, s), l)
  (* E -> - A E *)
  | Sub :: Int y :: xs ->
    let s, l = parse xs in
    (Minus (Num y, s), l)
  | _ -> raise (Failure "not grammatically correct")

(* a complete parse must consume every token *)
let parse_wrapper tokens =
  let tree, leftover = parse tokens in
  if leftover = [] then
    tree
  else
    raise (Failure "have extra tokens");;
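
The two questions above (which branch, which token) are often answered with one function per nonterminal. A minimal sketch of that layout for the same grammar follows; the names `parse_a`, `parse_e`, and `parse_all` are hypothetical, and the types are repeated so the snippet stands alone.

```ocaml
type token = Int of int | Add | Sub | LParen | RParen
type ast = Num of int | Plus of ast * ast | Minus of ast * ast

(* A -> 0 | 1 | ... | 9 : consume exactly one integer token *)
let parse_a tokens = match tokens with
  | Int x :: rest -> (Num x, rest)
  | _ -> failwith "expected a number"

(* E -> A | + A E | - A E : the first token selects the branch *)
let rec parse_e tokens = match tokens with
  | Add :: rest ->
    let left, rest = parse_a rest in
    let right, rest = parse_e rest in
    (Plus (left, right), rest)
  | Sub :: rest ->
    let left, rest = parse_a rest in
    let right, rest = parse_e rest in
    (Minus (left, right), rest)
  | _ -> parse_a tokens

(* a complete parse must leave no tokens behind *)
let parse_all tokens = match parse_e tokens with
  | (tree, []) -> tree
  | _ -> failwith "have extra tokens"
```

For example, `parse_all [Add; Int 1; Sub; Int 2; Int 3]` builds `Plus (Num 1, Minus (Num 2, Num 3))`: the `Add` picks the `+ A E` branch, `Int 1` is the `A`, and the remaining tokens are parsed as the nested `E`.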
          

Important: knowing which branch you are looking for

Backtracking vs Predictive

Predictive: what's the next symbol?

First(nt): the set of terminals that can begin a string derived from the nonterminal nt
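
As a worked example (computed by hand here, not taken from the slides), the First sets for the Polish notation grammar are:

\(\mathrm{First}(A) = \{0, 1, \dots, 9\}\)
\(\mathrm{First}(E) = \{+, -, 0, 1, \dots, 9\}\)

The three alternatives of \(E\) begin with \(\{0, \dots, 9\}\), \(\{+\}\), and \(\{-\}\) respectively. These sets are disjoint, so one token of lookahead is enough to pick the branch.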

Only so good: conflicting first sets

  • Can rewrite grammar
  • Can rewrite parser
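
A small illustration of a conflict, using a hypothetical grammar that is not from the slides:

\(S \rightarrow a\ b \vert a\ c\)

Both alternatives have First set \(\{a\}\), so seeing \(a\) does not say which branch to take. Rewriting the grammar by factoring out the common prefix removes the conflict:

\(S \rightarrow a\ T\)
\(T \rightarrow b \vert c\)

Rewriting the parser instead would mean keeping the original grammar and looking at two tokens before choosing a branch.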

Converting to AST

Recall a Tree in OCaml


type tree = Leaf | Node of int * tree * tree;;
Node (2, Node (0, Leaf, Leaf), Leaf);;
          

Modify for Tokens


type expr = Num of int | Plus of expr * expr | Minus of expr * expr;;
Plus (Num 1, Num 2);;
          

Interpreting
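
A minimal sketch of what interpreting the AST could look like, assuming the `expr` type from the previous slide; the function name `eval` is an assumption, not from the slides.

```ocaml
type expr = Num of int | Plus of expr * expr | Minus of expr * expr

(* Sketch: walk the tree and compute the integer value it denotes. *)
let rec eval (e : expr) : int = match e with
  | Num n -> n
  | Plus (a, b) -> eval a + eval b
  | Minus (a, b) -> eval a - eval b
```

With this definition, `eval (Plus (Num 1, Num 2))` evaluates to `3`.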