CMSC330

Regular Expressions

Regular Languages
Regex Intro
Regex in OCaml

Regular Languages

I want to determine the cardinal direction I am facing

I can model this problem with English

I can model this problem with a subset of English

Subset: {Turn Left 90 degrees, Turn Right 90 degrees}
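As a rough sketch of how far this two-instruction subset goes, the OCaml below tracks the direction faced after a sequence of turns. The type and function names (direction, instruction, turn, facing) are made up for illustration, and starting out facing North is an assumption.

```ocaml
(* The four cardinal directions. *)
type direction = North | East | South | West

(* The two "words" of the restricted language. *)
type instruction = TurnLeft | TurnRight

(* Apply one 90 degree turn to the current direction. *)
let turn dir instr =
  match dir, instr with
  | North, TurnRight -> East
  | East, TurnRight -> South
  | South, TurnRight -> West
  | West, TurnRight -> North
  | North, TurnLeft -> West
  | West, TurnLeft -> South
  | South, TurnLeft -> East
  | East, TurnLeft -> North

(* Follow a whole sequence of instructions, starting from [start]. *)
let facing start instrs = List.fold_left turn start instrs

let () =
  (* North -> East -> South -> East *)
  assert (facing North [TurnRight; TurnRight; TurnLeft] = East)
```

Only the current direction has to be remembered, which is why such a small language is enough for this problem.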

There is a minimum language needed to compute a problem

Different classes of languages exist

Regular languages can compute very simple problems

A regular language is any language that can be defined by a regular expression

Regex Intro

Regular Expression: A pattern that describes a set of strings

Regular Expressions are used to describe regular languages (future lecture)

For now: a tool used to search for text

A pattern is built from:

  • An Alphabet: symbols in the string
  • Concatenation
  • Boolean (or)
  • Precedence
  • Quantification

How to define the set?

We write a pattern or a regular expression

Our first pattern

"a"

Describes the set {"a"}

Our second pattern

"hello"

Describes the set {"hello"}

Boring

"hello|hi"

Describes the set {"hello", "hi"}

Boolean Or

"this|that"

Describes the set {"this", "that"}

The or operator's scope extends to the start or end of the pattern, or until another |

"this|that|the other thing"

Describes the set {"this", "that", "the other thing"}
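One way to check such a set by hand is to test membership in OCaml with the re library. This is a minimal sketch: in_set is a hypothetical helper name, and it wraps the pattern in Re.whole_string so that the entire string has to match, not just part of it.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* [in_set pattern s] is true when the whole string [s] belongs to the set
   described by [pattern]. Re.whole_string anchors the match at both ends. *)
let in_set pattern s =
  let re = Re.compile (Re.whole_string (Re.Posix.re pattern)) in
  Re.execp re s

let () =
  assert (in_set "this|that|the other thing" "that");
  assert (in_set "this|that|the other thing" "the other thing");
  (* "thisthat" is not one of the three alternatives. *)
  assert (not (in_set "this|that|the other thing" "thisthat"))
```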

Precedence

"cliff|clyff"

{"cliff","clyff"}

A lot of shared characters

"cl(i|y)ff"

Describes the same set
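The claim that both patterns describe the same set can be spot-checked in OCaml. This is only a sketch over a few sample strings, not a proof; in_set is a hypothetical helper that requires the whole string to match. The last two assertions show what happens when the parentheses are left out.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Hypothetical helper: true when the whole string [s] matches [pattern]. *)
let in_set pattern s =
  Re.execp (Re.compile (Re.whole_string (Re.Posix.re pattern))) s

let () =
  (* Both patterns agree on every sample string. *)
  List.iter
    (fun s -> assert (in_set "cliff|clyff" s = in_set "cl(i|y)ff" s))
    ["cliff"; "clyff"; "clff"; "cliyff"];
  (* Without parentheses the or splits the whole pattern: "cli" or "yff". *)
  assert (in_set "cli|yff" "cli");
  assert (not (in_set "cli|yff" "cliff"))
```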

Quantification

"0|1|2|3|4|5|6|7|8|9"

{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

What about two digit strings?

"(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)"

Cringe

"(0|1|2|3|4|5|6|7|8|9){2}"

What about infinite repetition?

Kleene Operator

"(ha)*"

{"", "ha", "haha", "hahaha",...}
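A short OCaml sketch of both kinds of repetition, using the re library. in_set is a hypothetical helper that requires the whole string to match the pattern.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Hypothetical helper: true when the whole string [s] matches [pattern]. *)
let in_set pattern s =
  Re.execp (Re.compile (Re.whole_string (Re.Posix.re pattern))) s

let two_digits = "(0|1|2|3|4|5|6|7|8|9){2}"

let () =
  (* {2} means exactly two repetitions of the group. *)
  assert (in_set two_digits "42");
  assert (not (in_set two_digits "7"));
  assert (not (in_set two_digits "123"));
  (* The Kleene star allows zero or more repetitions, including none. *)
  assert (in_set "(ha)*" "");
  assert (in_set "(ha)*" "hahaha");
  assert (not (in_set "(ha)*" "hah"))
```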

Bracket Expressions

"0|1|2|3|4|5|6|7|8|9"
"[0-9]"
"[a-z]"

Any ASCII range works, and ranges can be combined (an implicit or)

"[a-zA-Z]"

Any lowercase or uppercase letter

"[aeiou]"

{"a", "e", "i", "o", "u"}

A leading ^ inside the brackets negates the set

"[^aeiou]"

Any single character except a, e, i, o, u
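The bracket expressions above can be tried out in OCaml with the re library. This is a small sketch; in_set is a hypothetical helper that requires the whole string to match, which makes it visible that a bracket expression stands for exactly one character.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Hypothetical helper: true when the whole string [s] matches [pattern]. *)
let in_set pattern s =
  Re.execp (Re.compile (Re.whole_string (Re.Posix.re pattern))) s

let () =
  (* Two ranges in one bracket expression: lowercase or uppercase. *)
  assert (in_set "[a-zA-Z]" "q");
  assert (in_set "[a-zA-Z]" "Q");
  assert (not (in_set "[a-zA-Z]" "7"));
  (* A negated set matches one character that is not listed. *)
  assert (in_set "[^aeiou]" "b");
  assert (not (in_set "[^aeiou]" "a"));
  (* Still a single character: two consonants are not in the set. *)
  assert (not (in_set "[^aeiou]" "bc"))
```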

Other helpful symbols

  • "+" one or more repeats
  • "?" zero or one occurrence
  • "." any single character
  • "^" outside of a bracket expression: start of string
  • "$" end of string

Note: these symbols must be escaped with a backslash to be matched literally
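The sketch below exercises each of these symbols in OCaml with the re library, including an escaped plus. in_set is a hypothetical helper that requires the whole string to match; the anchored example at the end uses a plain search instead, so that ^ and $ do the anchoring themselves. The sample patterns are made up for illustration.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Hypothetical helper: true when the whole string [s] matches [pattern]. *)
let in_set pattern s =
  Re.execp (Re.compile (Re.whole_string (Re.Posix.re pattern))) s

let () =
  (* "+" needs at least one repetition, unlike "*". *)
  assert (in_set "(ha)+" "haha");
  assert (not (in_set "(ha)+" ""));
  (* "?" makes the preceding item optional. *)
  assert (in_set "colou?r" "color" && in_set "colou?r" "colour");
  (* "." stands for any single character. *)
  assert (in_set "c.t" "cat" && in_set "c.t" "c+t");
  (* Escaped, the plus is a literal character. {|...|} is an OCaml quoted
     string, so the backslash reaches the regex parser unchanged. *)
  assert (in_set {|1 \+ 2|} "1 + 2");
  (* Unescaped, " +" means one or more spaces, so "1 + 2" is rejected. *)
  assert (not (in_set "1 + 2" "1 + 2"));
  (* Re.execp searches anywhere in the string; ^ and $ pin the match. *)
  let anchored = Re.compile (Re.Posix.re "^ab+$") in
  assert (Re.execp anchored "abbb");
  assert (not (Re.execp anchored "xabbb"))
```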

"1 \+ 2"

Regex in OCaml

Needs the re library

#require "re";; (* utop only *)
          

Create a regex:

(* "\\." is an escaped dot: OCaml strings need the backslash doubled *)
let my_re = Re.compile (Re.Posix.re "[0-9]+\\.[0-9]+")
          

Matching

Check whether a string contains a match for the regex

let my_re = Re.compile (Re.Posix.re "[0-9]+\\.[0-9]+") in
(* Re.execp returns true if the regex matches somewhere in the string *)
let did_match = Re.execp my_re "26.19" in
if did_match then
  print_string "successfully matched"
else
  print_string "did not match"
          

Grouping

Searching is great, parsing is better

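let my_re = Re.compile (Re.Posix.re "([0-9]+)\\.[0-9]+") in
(* Re.exec raises Not_found if there is no match *)
let matched = Re.exec my_re "26.19" in
(* Group 1 is the text captured by the first pair of parentheses: "26" *)
print_string (Re.Group.get matched 1)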
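Because Re.exec raises Not_found when nothing matches, parsing code often prefers the option-returning Re.exec_opt. The sketch below is one possible way to split a decimal into its two parts; the names decimal_re and split_decimal are hypothetical, and the second capture group is added here for illustration.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Two capture groups: digits before the dot and digits after it. *)
let decimal_re = Re.compile (Re.Posix.re "([0-9]+)\\.([0-9]+)")

(* Return both captured parts, or None when the string has no decimal. *)
let split_decimal s =
  match Re.exec_opt decimal_re s with
  | Some m -> Some (Re.Group.get m 1, Re.Group.get m 2)
  | None -> None

let () =
  assert (split_decimal "26.19" = Some ("26", "19"));
  assert (split_decimal "no digits here" = None)
```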
          

Parentheses show precedence AND capture

let my_re = Re.compile (Re.Posix.re "(([0-9]){3})-([0-9]{3}-[0-9]{4})") in
let m = Re.exec my_re "123-456-7890" in
(* Group 1 is the area code, group 2 is the inner ([0-9]), group 3 is the rest *)
let ac = Re.Group.get m 1 in
let rest = Re.Group.get m 3 in
print_string ("the area code is " ^ ac);
print_string ("\nthe rest is " ^ rest)
          

Group numbers are determined by the order of the open parentheses, counted left to right
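To make the numbering concrete, here is a sketch with nested groups. The pattern is a made-up variation on the phone number example, chosen so that every group captures a distinct piece of the input. Group 0 is always the whole match.

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Open parentheses, left to right:
   1: (([0-9]{3})-([0-9]{3}))   2: ([0-9]{3})   3: ([0-9]{3})   4: ([0-9]{4}) *)
let phone_re = Re.compile (Re.Posix.re "(([0-9]{3})-([0-9]{3}))-([0-9]{4})")

let () =
  let m = Re.exec phone_re "123-456-7890" in
  assert (Re.Group.get m 0 = "123-456-7890");
  (* The outer group opens first, so it is group 1. *)
  assert (Re.Group.get m 1 = "123-456");
  assert (Re.Group.get m 2 = "123");
  assert (Re.Group.get m 3 = "456");
  assert (Re.Group.get m 4 = "7890")
```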

Other ways to build a regex value

Re.Posix.re "I am ([0-9]+) years old"
          
Re.seq [Re.str "I am "; Re.group (Re.rep1 Re.digit); Re.str " years old"]
          

Useful for building regexes procedurally
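As a sketch of what building procedurally can look like, the function below turns a list of words into a single regex with the combinators Re.alt and Re.str. The name any_of is hypothetical. Since Re.str matches its argument literally, no escaping is needed for characters such as "+".

```ocaml
(* Requires the re library (in utop: #require "re";;). *)

(* Build a regex that matches exactly one of the given words. *)
let any_of words =
  Re.compile (Re.whole_string (Re.alt (List.map Re.str words)))

let () =
  let re = any_of ["this"; "that"; "a+b"] in
  assert (Re.execp re "that");
  (* The plus is matched literally, not as "one or more". *)
  assert (Re.execp re "a+b");
  assert (not (Re.execp re "aab"))
```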