CMSC330

Regular Expressions

Strings and Files
Regex Intro
Regex in Ruby

Strings and Files

Everything is a file (at least in UNIX, for the most part)

Programs live in RAM

Hard drives are for long term storage

Programs read files to pick up where they left off

Used in save or configuration files

Some files are binary files

Some have text

P1b loads data from a file

How to do this?

#load_file.rb
f = File.open("file.txt")
line = f.gets #readline
lines = f.readlines
f.close
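
The same steps can also be written with the block form of File.open, which closes the file automatically when the block ends. This is a minimal sketch: the helper name first_and_rest and the temporary file standing in for "file.txt" are assumptions made for illustration.

```ruby
require "tempfile"

# Hypothetical helper: returns the first line and the remaining lines of a file.
def first_and_rest(path)
  # The block form of File.open closes the file when the block exits,
  # even if an exception is raised inside it.
  File.open(path) do |f|
    first = f.gets        # first line, or nil if the file is empty
    rest  = f.readlines   # every remaining line, as an array of strings
    [first, rest]
  end
end

# Demo with a temporary file standing in for "file.txt".
Tempfile.create("demo") do |tmp|
  tmp.write("one\ntwo\nthree\n")
  tmp.flush
  first, rest = first_and_rest(tmp.path)
  p first   # "one\n"
  p rest    # ["two\n", "three\n"]
end
```

Note that gets and readlines keep the trailing newline on each line.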
          

Commonly we need to go line by line

#line_by_line.rb
f = File.open("file.txt")
line = f.gets
while line
  #do something
  line = f.gets
end
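
A shorter way to write the same loop, sketched here under the assumption that each line only needs its newline stripped, is File.foreach. The helper name stripped_lines and the temporary file are illustrative, not part of P1b.

```ruby
require "tempfile"

# Hypothetical helper: returns every line of the file without its newline.
def stripped_lines(path)
  result = []
  # File.foreach yields one line at a time and closes the file when done.
  File.foreach(path) do |line|
    result << line.chomp
  end
  result
end

# Demo with a temporary file standing in for "file.txt".
Tempfile.create("demo") do |tmp|
  tmp.write("red\ngreen\nblue\n")
  tmp.flush
  p stripped_lines(tmp.path)   # ["red", "green", "blue"]
end
```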
          
#fun_strings.rb
# remove the trailing newline

a = "hello\n"
puts a.chomp
a.chomp!
puts a

# substrings
a = "hello"
puts a[1]
puts a[-1]
puts a[1..]
puts a[0..4]
puts a[2..4]
puts a[2...4]

# substitution
a = "hello world"
a["world"] = "cliff"
puts a
a.sub!("hello", "bye")
puts a
# gsub! replaces every match in place (plain gsub returns a copy and leaves a unchanged)
a.gsub!("f","ph")
puts a

# searching
a = "hello world"
puts a.include?("hello")
puts a.index("o")
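
The difference between the plain and ! forms of these methods (chomp vs chomp!, sub vs sub!, gsub vs gsub!) is easy to miss, so here is a small sketch. The helper names respelled and respell! are made up for this example.

```ruby
# Hypothetical helpers, named only for this sketch.
def respelled(word)
  word.gsub("f", "ph")    # returns a new string; word itself is unchanged
end

def respell!(word)
  word.gsub!("f", "ph")   # edits word in place; returns nil if nothing changed
end

name = "cliff".dup
puts respelled(name)        # cliphph
puts name                   # cliff
respell!(name)
puts name                   # cliphph
p respell!("hello".dup)     # nil
```

Because the ! forms return nil when nothing changes, it is usually safer to rely on their side effect than on their return value.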
          

Not helpful if we want to find alternative spellings

Color vs Colour

Grey vs Gray

Cliff vs Kliff vs Clyff vs Klyff vs Qulyph

Regex Intro

Regular Expression: A pattern that describes a set of strings

Regular Expressions are used to describe regular languages (future lecture)

For now: a tool used to search for text

A regular expression is built from:

  • An Alphabet: symbols in the string
  • Concatenation
  • Branching (or)
  • Grouping
  • Repetition
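
As a rough preview of how these pieces look in Ruby (the syntax is covered in the next section), the sketch below handles some of the alternative spellings from earlier. The constant names and the helper spelled_like? are assumptions made for this example.

```ruby
# Concatenation: characters written side by side must appear in that order.
GREY  = /gr(a|e)y/        # branching inside a group: a or e
COLOR = /colou?r/         # repetition: ? allows zero or one u
NAME  = /(c|k)l(i|y)f+/   # repetition: + allows one or more f

# Hypothetical helper: true if the pattern occurs anywhere in word.
def spelled_like?(word, pattern)
  !(word =~ pattern).nil?
end

puts spelled_like?("gray", GREY)      # true
puts spelled_like?("grey", GREY)      # true
puts spelled_like?("colour", COLOR)   # true
puts spelled_like?("klyff", NAME)     # true
puts spelled_like?("groy", GREY)      # false
```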

More about this in a future lecture

Regex in Ruby

Regexp Syntax: /pattern/

#regex1.rb
puts /pattern/.class
          

Check to see if a string contains a pattern

#regex2.rb
puts "pattern" =~ /pattern/
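
The =~ operator returns the index where the first match starts, or nil when there is no match. Since 0 is truthy in Ruby, the result can be used directly in an if. A minimal sketch, with match_position as a made-up helper name:

```ruby
# Hypothetical helper: index of the first match, or nil if there is none.
def match_position(text, pattern)
  text =~ pattern
end

p match_position("pattern", /pattern/)        # 0
p match_position("a pattern", /pattern/)      # 2
p match_position("nothing here", /pattern/)   # nil

# 0 counts as true in Ruby, so a match at the very start still passes.
puts "found" if match_position("pattern", /pattern/)
```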
          

# regex3.rb
line = gets 

# string literal
if line =~ /cliff/
  puts "line contains 'cliff'"
end

# one string or another
if line =~ /cliff|kliff/
  puts "line contains 'cliff' or 'kliff'"
end

# we can add parentheses
if line =~ /(c|k)liff/
  puts "line contains 'cliff' or 'kliff'"
end

# and character classes for common letter groups
if line =~ /[a-z]liff/
  puts "a lowercase letter followed by 'liff'"
end

# repetition makes things easy
if line =~ /[a-z]{5}/
  puts "5 lowercase letters in a row"
end

# digits can be matched too
if line =~ /[0-9]{1,3}/
  puts "a run of 1, 2, or 3 digits"
end

# combining adds complexity
if line =~ /[A-Za-z0-9]+/
  puts "alphanumeric with length 1+"
end

# can match some fun things
if line =~ /-?[0-9]+/
  puts "any integer"
end

# can match some useful things
if line =~ /-?[0-9]+(,-?[0-9]+)*/
  puts "a list of numbers"
end

# can be tricky
if line =~ /^start \d*  end$/
  puts "start number end"
end

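# can be really tricky: [^start] is a negated character class (any one
# character except s, t, a, or r), and the unescaped . matches any character
if line =~ /[^start]\$\d+.\d{2}/
  puts "dollar amount not preceded by s, t, a, or r"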
end
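
The unescaped . in the last pattern is a common trap: it matches any character, not just a period. The sketch below contrasts an unescaped and an escaped dot, using the ^ and $ anchors from the earlier example. The constants LOOSE and STRICT and the helper price? are names invented for this illustration.

```ruby
LOOSE  = /^\$\d+.\d{2}$/    # . matches any single character
STRICT = /^\$\d+\.\d{2}$/   # \. matches only a literal period

# Hypothetical helper: true if text matches the given pattern.
def price?(text, pattern)
  !(text =~ pattern).nil?
end

puts price?("$5.99", LOOSE)        # true
puts price?("$5x99", LOOSE)        # true, because . accepts the x
puts price?("$5x99", STRICT)       # false
puts price?("$5.99", STRICT)       # true
puts price?("pay $5.99", STRICT)   # false, ^ requires the match to start the line
```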
          

Restriction: can only look forward, not predictive

Grouping

Searching is great, parsing is better

#regex4.rb
line = gets
if line =~ /\(\d{3}\) \d{3}-\d{4}/
  puts "is phone number"
end
          

How can we find out the area code?

  • Capture Groups
  • Scanning

Parenthesized parts are stored as global variables

Refer to them using $1, $2, $3, etc.

#back_reference.rb
line = gets
if line =~ /\((\d{3})\) \d{3}-\d{4}/
  puts $1
end
          

Issue: each match will reset the groupings
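
One way to work around the reset, sketched here as an assumption rather than the required approach, is to copy $1 into an ordinary variable (or return it from a method) before running another match. The PHONE constant, the area_code helper, and the sample numbers are made up for this example.

```ruby
# Same phone pattern as above, with a capture group around each part.
PHONE = /\((\d{3})\) (\d{3})-(\d{4})/

# Hypothetical helper: returns the area code, or nil if no phone number is found.
def area_code(line)
  if line =~ PHONE
    $1   # first parenthesized group
  end
end

puts area_code("call (301) 555-0100 today")   # 301
p area_code("no phone here")                  # nil

# Each new match overwrites the groups, so save what you need first.
"(301) 555-0100" =~ PHONE
saved = $1
"no digits" =~ /(\d+)/   # this match fails, which clears $1
p $1                     # nil
p saved                  # "301"
```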

String's scan method matches and groups

Will return array

# scanning.rb
line = gets
arr = line.scan(/\d+/)
puts arr

arr = line.scan(/[a-zA-Z0-9]{2}/)
puts arr

# parentheses give an array of arrays
arr = line.scan(/(\d)(\d)/)
puts arr
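
To make the return values concrete, here is a sketch that runs the same three patterns on a fixed string instead of gets. The helper names and the sample input are assumptions for illustration. It uses p rather than puts because puts flattens nested arrays and prints one element per line.

```ruby
# Hypothetical helpers wrapping the scans shown above.
def numbers_in(line)
  line.scan(/\d+/)          # no groups: array of matched strings
end

def digit_pairs(line)
  line.scan(/(\d)(\d)/)     # groups: one inner array per match
end

line = "ab 1234 c5"
p numbers_in(line)                # ["1234", "5"]
p line.scan(/[a-zA-Z0-9]{2}/)     # ["ab", "12", "34", "c5"]
p digit_pairs(line)               # [["1", "2"], ["3", "4"]]
```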
          

Can also take a code block

# scanning2.rb
line = gets
# with a block, scan yields each match and returns the original string, not an array
line.scan(/\d+/){|number|
  puts number.to_i*number.to_i
}
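
When the results are needed afterwards, one option is to chain map onto the array form of scan, or to fill a collection from inside the block. With capture groups, the block receives one value per group. A minimal sketch; squares and parse_pairs are hypothetical helper names, and the key=value input format is invented for the example.

```ruby
# Hypothetical helper: squares of every number found in the line.
def squares(line)
  line.scan(/\d+/).map { |number| number.to_i ** 2 }
end

# Hypothetical helper: collects key=value pairs into a hash.
def parse_pairs(line)
  pairs = {}
  # Two capture groups, so the block receives two values per match.
  line.scan(/(\w+)=(\d+)/) { |key, value| pairs[key] = value.to_i }
  pairs
end

p squares("3 apples and 10 pears")   # [9, 100]
pairs = parse_pairs("x=1, y=22")
puts pairs["x"] + pairs["y"]          # 23
```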