CMSC330

Regular Expressions

Regular Languages
Regex Intro
Regex in Python

Regular Languages

I want to determine the cardinal direction I am facing

I can model this problem with English

I can model this problem with a subset of English

Subset: {Turn Left 90 degrees, Turn Right 90 degrees}
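
The two-command subset is enough to track a heading. A minimal sketch, assuming we start facing north and encode the two commands as the hypothetical letters "L" and "R" (neither assumption comes from the slides):

```python
# Headings listed clockwise, so a right turn moves one step forward in the list.
HEADINGS = ["N", "E", "S", "W"]

def facing(turns, start="N"):
    """Return the heading after applying a sequence of 90-degree turns."""
    i = HEADINGS.index(start)
    for t in turns:
        if t == "R":      # Turn Right 90 degrees
            i = (i + 1) % 4
        elif t == "L":    # Turn Left 90 degrees
            i = (i - 1) % 4
        else:
            raise ValueError(f"unknown command: {t!r}")
    return HEADINGS[i]

print(facing("RRL"))  # E
```

Only the current heading has to be remembered, never the full history of turns.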

There is a minimum language needed to solve a problem

Different classes of languages exist

Regular languages can model only very simple problems

A regular language is any language that can be defined by a regular expression

Regex Intro

Regular Expression: A pattern that describes a set of strings

Regular Expressions are used to describe regular languages (future lecture)

For now: a tool used to search for text

A pattern is built from:

  • An alphabet: the symbols that may appear in the string
  • Concatenation
  • Boolean or (alternation)
  • Precedence (grouping with parentheses)
  • Quantification (repetition)

How to define the set?

We write a pattern, also called a regular expression

Our first pattern

"a"

Describes the set {"a"}

Our second pattern

"hello"

Describes the set {"hello"}

Boring

"hello|hi"

Describes the set {"hello", "hi"}

Boolean Or

"this|that"

Describes the set {"this", "that"}

The or operator's scope extends to the start or end of the pattern, or until another |

"this|that|the other thing"

Describes the set {"this", "that", "the other thing"}
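
The set can be checked with Python's re module, which the Regex in Python part of this lecture introduces. A minimal sketch; the helper name `in_set` is made up for this example, and `fullmatch` is used so that the whole string has to match:

```python
import re

pattern = re.compile(r"this|that|the other thing")

def in_set(s):
    # fullmatch succeeds only if the entire string matches the pattern
    return pattern.fullmatch(s) is not None

accepted = [s for s in ["this", "that", "the other thing", "the"] if in_set(s)]
print(accepted)  # ['this', 'that', 'the other thing']
```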

Precedence

"cliff|clyff"

{"cliff","clyff"}

A lot of shared characters

"cl(i|y)ff"

Describes the same set
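
A quick way to convince yourself, sketched with Python's re module. The third pattern is an extra illustration, not from the slides: it shows what happens when the parentheses are left out, since | then splits the entire pattern.

```python
import re

flat = re.compile(r"cliff|clyff")
grouped = re.compile(r"cl(i|y)ff")
# Without parentheses the | splits the whole pattern: "cli" or "yff".
ungrouped = re.compile(r"cli|yff")

candidates = ["cliff", "clyff", "cloff", "cli", "yff"]
flat_set = [s for s in candidates if flat.fullmatch(s)]
grouped_set = [s for s in candidates if grouped.fullmatch(s)]
ungrouped_set = [s for s in candidates if ungrouped.fullmatch(s)]

print(flat_set)       # ['cliff', 'clyff']
print(grouped_set)    # ['cliff', 'clyff']
print(ungrouped_set)  # ['cli', 'yff']
```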

Quantification

"0|1|2|3|4|5|6|7|8|9"

{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

What about two digit strings?

"(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)"

Cringe

"(0|1|2|3|4|5|6|7|8|9){2}"

What about unbounded repetition?

Kleene Star: zero or more repetitions

"(ha)*"

{"", "ha", "haha", "hahaha",...}
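
Both quantifiers can be tried out in Python. A minimal sketch; the helper `accepted` is a hypothetical name used only here, and it keeps the samples that a pattern matches in full:

```python
import re

two_digits = re.compile(r"(0|1|2|3|4|5|6|7|8|9){2}")
laugh = re.compile(r"(ha)*")

def accepted(pattern, samples):
    # Keep only the samples that the pattern matches from start to end.
    return [s for s in samples if pattern.fullmatch(s) is not None]

print(accepted(two_digits, ["07", "42", "7", "123"]))     # ['07', '42']
print(accepted(laugh, ["", "ha", "haha", "hah", "haa"]))  # ['', 'ha', 'haha']
```

Note that the empty string is in the set described by "(ha)*", because zero repetitions are allowed.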

Bracket Expressions

"0|1|2|3|4|5|6|7|8|9"
"[0-9]"
"[a-z]"

Any ASCII range works; ranges can also be combined

"[a-zA-Z]"

Any lowercase or uppercase letter

"[aeiou]"

{"a", "e", "i", "o", "u"}

A leading ^ inside the brackets negates the set

"[^aeiou]"

Any single character except a, e, i, o, u
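
The bracket expressions above can be compared side by side. A small sketch in Python; the dictionary labels and the `classify` helper are made up for this example. Each bracket expression matches exactly one character, so single characters are tested:

```python
import re

CLASSES = {
    "digit": re.compile(r"[0-9]"),
    "letter": re.compile(r"[a-zA-Z]"),
    "vowel": re.compile(r"[aeiou]"),
    "not a lowercase vowel": re.compile(r"[^aeiou]"),
}

def classify(ch):
    # Return the label of every bracket expression that matches the character.
    return [name for name, pat in CLASSES.items() if pat.fullmatch(ch)]

print(classify("e"))  # ['letter', 'vowel']
print(classify("Q"))  # ['letter', 'not a lowercase vowel']
print(classify("7"))  # ['digit', 'not a lowercase vowel']
```

The negated set accepts digits and uppercase letters too: it excludes only the five characters listed.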

Other helpful symbols

  • "+" one or more repetitions
  • "?" zero or one repetition
  • "." any single character
  • "^" outside of brackets: start of string
  • "$" end of string
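
Each symbol in the list can be tried in Python. A minimal sketch; the sample patterns ("colou?r", "c.t", "^hi$") and the helper `full` are illustrative choices, not from the slides:

```python
import re

def full(pattern, s):
    # True when the pattern matches the entire string.
    return re.fullmatch(pattern, s) is not None

print(full(r"(ha)+", ""))          # False: + needs at least one repetition
print(full(r"(ha)+", "haha"))      # True
print(full(r"colou?r", "color"))   # True: the u is optional
print(full(r"colou?r", "colour"))  # True
print(full(r"c.t", "c9t"))         # True: . stands for any one character
print(full(r"c.t", "ct"))          # False: . must consume a character

# ^ and $ pin the pattern to the start and end of the string being searched.
print(re.search(r"^hi$", "hi") is not None)     # True
print(re.search(r"^hi$", "oh hi") is not None)  # False
```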

Note: these symbols must be escaped with a backslash to be matched literally

"1 \+ 2"

Regex in Python

Needs the re module

#regex.py
import re
          

Create a compiled regex:

#regex2.py
import re
# raw string (r"...") so the backslash reaches the regex engine unchanged
my_re = re.compile(r"[0-9]+\.[0-9]+")
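
The backslash before the dot matters: an unescaped "." matches any character, not only a literal dot. A short sketch comparing the two; the variable names are made up for this example, and `re.escape` is the standard-library helper that adds the backslashes needed to match a string literally:

```python
import re

escaped = re.compile(r"[0-9]+\.[0-9]+")
unescaped = re.compile(r"[0-9]+.[0-9]+")

print(bool(escaped.fullmatch("12.3")))    # True
print(bool(escaped.fullmatch("12x3")))    # False: \. accepts only a literal dot
print(bool(unescaped.fullmatch("12x3")))  # True: . accepts the x
print(bool(unescaped.fullmatch("123")))   # True: . accepts the digit 2

# re.escape builds a pattern that matches its argument character for character.
literal = re.compile(re.escape("1 + 2"))
print(bool(literal.fullmatch("1 + 2")))   # True
```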
          

Matching

Check whether a string matches the regex

#regex3.py
import re
my_re = re.compile(r"[0-9]+\.[0-9]+")
# match() tries to match at the start of the string
if my_re.match("12.3"):
  print("successfully matched")
else:
  print("did not match")
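
match is one of several ways to apply a compiled regex. A brief sketch of how match, fullmatch, search, and findall differ on the same pattern; the sample strings are made up for illustration:

```python
import re

my_re = re.compile(r"[0-9]+\.[0-9]+")

prefix_ok = my_re.match("12.3abc") is not None      # match only anchors the start
whole_ok = my_re.fullmatch("12.3abc") is not None   # the whole string must match
start_ok = my_re.match("x = 12.3") is not None      # nothing matches at position 0
found = my_re.search("x = 12.3").group()            # search scans the whole string
all_found = my_re.findall("1.5 and 22.75")          # every non-overlapping match

print(prefix_ok)  # True
print(whole_ok)   # False
print(start_ok)   # False
print(found)      # 12.3
print(all_found)  # ['1.5', '22.75']
```

If trailing text such as "abc" should be rejected, fullmatch (or a pattern ending in "$") is the safer choice.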
          

Grouping

Searching is great, parsing is better

#regex4.py
import re
my_re = re.compile(r"[0-9]+\.([0-9]+)")
m = my_re.match("12.3")
if m:
  # group(1) is the text captured by the first pair of parentheses
  print("the decimal part is " + m.group(1))
else:
  print("did not match")
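
Capturing more than one piece works the same way. A minimal sketch that wraps both halves of the number in parentheses; this two-group pattern is a variation on the slide's pattern, not taken from it:

```python
import re

my_re = re.compile(r"([0-9]+)\.([0-9]+)")
m = my_re.match("12.3")
if m:
    whole = m.group(0)           # group(0) is always the entire match
    int_part, frac_part = m.groups()  # groups() returns every captured group
    print(whole)      # 12.3
    print(int_part)   # 12
    print(frac_part)  # 3
```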
          

Parentheses show precedence AND capture

#regex5.py
import re
my_re = re.compile(r"(([0-9]){3})-([0-9]{3}-[0-9]{4})")
m = my_re.match("123-456-7890")
if m:
  print("the area code is " + m.group(1))
  # group(2) is the inner ([0-9]), so the rest of the number is group(3)
  print("the rest is " + m.group(3))
else:
  print("did not match")
          

Group numbers are determined by the order of the opening parentheses
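
Printing every group of the phone-number pattern makes the numbering visible. A short sketch using the same pattern and input as regex5.py:

```python
import re

my_re = re.compile(r"(([0-9]){3})-([0-9]{3}-[0-9]{4})")
m = my_re.match("123-456-7890")
assert m is not None

print(m.group(0))  # 123-456-7890  (the entire match)
print(m.group(1))  # 123           (first open paren)
print(m.group(2))  # 3             (second open paren, nested inside the first)
print(m.group(3))  # 456-7890      (third open paren)
```

Group 2 holds only "3" because a group that is repeated by a quantifier keeps the text from its last repetition.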