5 May 2017

Learning objectives

A first glance at regular expressions

  • Matching and substituting character strings
  • s/^lecture([0-9]{1,2}).*[^_].Rmd$/\1.Rmd/g
  • Ugly, unreadable, terrifying expression
  • See R for data science
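
As a rough sketch of what the expression above does, the same pattern can be passed to R's sub(). The file names are hypothetical and only chosen to exercise the pattern; note that backslashes are doubled inside an R string.

```r
# Hypothetical file names (not from the slides)
files <- c("lecture7_intro.Rmd", "lecture12-regex.Rmd",
           "lecture3_.Rmd", "notes.Rmd")

# Keep only the lecture number; names that do not match are left unchanged
renamed <- sub("^lecture([0-9]{1,2}).*[^_].Rmd$", "\\1.Rmd", files)
print(renamed)
# [1] "7.Rmd"         "12.Rmd"        "lecture3_.Rmd" "notes.Rmd"
```

The third name is untouched because the character two places before "Rmd" is an underscore, which [^_] rejects.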

Motivation

What if we want to match

  • any letter followed by 'n'?
  • any vowel followed by 'n'?
  • two letters followed by 'n'?
  • any number of letters followed by 'n'?

Regular expressions!

  • allow us to match much more complicated patterns
  • build patterns from a simple vocabulary and grammar
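
A minimal sketch of how the four questions above might be written as patterns. The word list is made up for illustration, and [a-z] is assumed to be an acceptable stand-in for "letter" here.

```r
# Hypothetical example words (not from the slides)
words <- c("an", "on", "barn", "tin", "xyz", "n")

any_letter  <- grepl("[a-z]n", words)     # any letter followed by 'n'
any_vowel   <- grepl("[aeiou]n", words)   # any vowel followed by 'n'
two_letters <- grepl("[a-z]{2}n", words)  # two letters followed by 'n'
any_number  <- grepl("[a-z]*n", words)    # zero or more letters followed by 'n'

print(rbind(any_letter, any_vowel, two_letters, any_number))
```

Only the last pattern accepts the bare "n", because * also allows zero letters.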

Finite state automaton

Most relevant consequence

  1. It will always return the earliest (leftmost) match it finds.

  2. Given a choice, it always favors a match over a nonmatch.

Examples

The topic of the day is isotopes.
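
One way to see both rules on this sentence is sketched below. The patterns "top" and "(iso)?top" are assumptions chosen for illustration; the slide does not say which pattern was used.

```r
x <- "The topic of the day is isotopes."

# Rule 1: the leftmost match wins, so "top" is found inside "topic" first
first_pos <- as.integer(regexpr("top", x))
all_pos   <- as.integer(gregexpr("top", x)[[1]])
print(first_pos)  # 5
print(all_pos)    # 5 28

# Even with an optional "iso", the earlier (shorter) match is returned
leftmost <- regmatches(x, regexpr("(iso)?top", x))
print(leftmost)   # "top"

# Rule 2: when the optional part can match, it is matched
replaced <- sub("(iso)?top", "X", "isotopes")
print(replaced)   # "Xes"
```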

Typical R functions using regex

grep, grepl

Search for matches of a regular expression/pattern in a character vector; either return the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match

grep("c", "cat")
[1] 1
grep("a", "cat", value = TRUE)
[1] "cat"
grepl("at", "cat")
[1] TRUE

sub, gsub

Search a character vector for regular expression matches and replace that match with another string

sub("c", "h", "cat")
[1] "hat"
sub("c", "kl", "cataract")
[1] "klataract"
gsub("c", "n", "cataract")
[1] "natarant"

Meta-characters

Characters with a special meaning, not matched literally

. (period) represents any single character (but not the empty string '')

vec <- c('ct', 'cat', 'cart', 'dog', 'rat', 'carert', 'bet')
grep(".at", vec)
[1] 2 5
grep("..t", vec)
[1] 2 3 5 6 7

+ represents one or more occurrences

grep('c.+t', vec)
[1] 2 3 6

Meta characters

* represents zero or more occurrences

vec 
[1] "ct"     "cat"    "cart"   "dog"    "rat"    "carert" "bet"   
grep('c.*t', vec)
[1] 1 2 3 6

Group terms with parentheses '(' and ')'

grep('c(.r)+t', vec)
[1] 3 6
grep('c(.r)*t', vec)
[1] 1 3 6

Quantifying number of matches

Applies to the preceding character or group

  • ? The preceding item is optional and will be matched at most once.
  • * The preceding item will be matched zero or more times.
  • + The preceding item will be matched one or more times.
  • {n} The preceding item is matched exactly 'n' times.
  • {n,} The preceding item is matched 'n' or more times.
  • {n,m} The preceding item is matched at least 'n' times, but not more than 'm' times.
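
The quantifiers listed above can be compared side by side on one small vector. This is only a sketch; vec3 is a made-up vector that differs from the slides' vec.

```r
# Hypothetical vector: 'c', then zero to four 'a's, then 't'
vec3 <- c("ct", "cat", "caat", "caaat", "caaaat")

q_opt   <- grep("ca?t", vec3)      # zero or one 'a'
q_star  <- grep("ca*t", vec3)      # zero or more
q_plus  <- grep("ca+t", vec3)      # one or more
q_n     <- grep("ca{2}t", vec3)    # exactly two
q_n_up  <- grep("ca{2,}t", vec3)   # two or more
q_range <- grep("ca{2,3}t", vec3)  # two or three

print(list(q_opt, q_star, q_plus, q_n, q_n_up, q_range))
```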

Match positions

Other useful ones include

  • ^ Start of string
  • $ End of string

vec
[1] "ct"     "cat"    "cart"   "dog"    "rat"    "carert" "bet"   
grep('r.$', vec)
[1] 3 6
grep('^c', vec)
[1] 1 2 3 6

Meta characters

  • | ( logical OR )
vec
[1] "ct"     "cat"    "cart"   "dog"    "rat"    "carert" "bet"   
grep('(c.t)|(c.rt)', vec)
[1] 2 3

Character classes

  • [a-z] lowercase letters
  • [a-zA-Z] any letter
  • [0-9] any digit
  • [aeiou] any vowel
  • [0-7ivx] any of 0 to 7, i, v, and x

Inside a character class, a leading ^ means anything except the following characters. E.g.

[^0-9] anything except a digit
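
A short sketch of character classes and their negation. The vector ids is hypothetical.

```r
# Hypothetical strings (not from the slides)
ids <- c("a1", "B2", "c_", "42", "iv")

has_digit    <- grep("[0-9]", ids)        # contains a digit
has_vowel    <- grep("[aeiou]", ids)      # contains a lowercase vowel
has_nondigit <- grep("[^0-9]", ids)       # contains something other than a digit
no_digits    <- grep("^[^0-9]+$", ids)    # consists only of non-digits
only_0_7ivx  <- grep("^[0-7ivx]+$", ids)  # consists only of 0 to 7, i, v, x

print(list(has_digit, has_vowel, has_nondigit, no_digits, only_0_7ivx))
```

Note the two roles of ^: inside the brackets it negates the class, outside it anchors the match to the start of the string.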

Matching metacharacters

We saw a bunch of special characters: . + * ] [ $. What if we want to match them literally?

vec2 = c("ct", "cat", "caat", "caart", "caaat", "caaraat", "c.t")
grep('c.t', vec2)
[1] 2 7

Escape them with a backslash \, or in R's case, two backslashes \\

#grep('c\.t', vec2) will not work: R tries to read \. as a string escape like \n, and \. is not a valid one

#Use two \'s
grep('c\\.t', vec2)
[1] 7

Matching metacharacters

To match a \, our pattern must represent \\

Our string must contain 4 backslashes!

vec = c("a\\backslash", "nobackslash")
#grep('\\', vec) gives error
grep('\\\\', vec)
[1] 1
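
When no regular expression features are needed at all, an alternative to escaping is the fixed = TRUE argument of grep() and related functions, which treats the pattern as a plain string. This is a sketch on a made-up vector, not something the slides cover.

```r
# Hypothetical vector (not from the slides)
paths <- c("a\\backslash", "nobackslash", "c.t", "cat")

# The R string "\\" holds a single backslash character
n_chars <- nchar("\\")
print(n_chars)  # 1

# With fixed = TRUE the pattern is matched literally, so no regex escaping is needed
has_backslash <- grep("\\", paths, fixed = TRUE)
has_dot       <- grep(".", paths, fixed = TRUE)
literal_c_dot <- grep("c.t", paths, fixed = TRUE)

print(list(has_backslash, has_dot, literal_c_dot))
```

The string-level escaping still applies, which is why the backslash is written as "\\" even here.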

Search and replace

The sub function allows search and replacement

vec2
[1] "ct"      "cat"     "caat"    "caart"   "caaat"   "caaraat" "c.t"    
sub('a+', 'a', vec2)
[1] "ct"     "cat"    "cat"    "cart"   "cat"    "caraat" "c.t"   

sub replaces only first match, gsub replaces all

Use the backreferences \1, \2 etc to refer to first, second group, etc.

gsub('(a+)r(a+)', 'b\\1brc\\2c', vec2)
[1] "ct"          "cat"         "caat"        "caart"       "caaat"      
[6] "cbaabrcaact" "c.t"        

Search and replace

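Use \U, \L, \E to make following backreferences upper or lower case or leave unchanged. These escapes only take effect with perl = TRUE; the calls below omit it, so the U and E end up in the output as literal letters.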

vec2
[1] "ct"      "cat"     "caat"    "caart"   "caaat"   "caaraat" "c.t"    
gsub('(a+)r(a+)', '\\U\\1r\\2', vec2)
[1] "ct"       "cat"      "caat"     "caart"    "caaat"    "cUaaraat"
[7] "c.t"     
gsub('(a+)r(a+)', '\\U\\1r\\E\\2', vec2)
[1] "ct"        "cat"       "caat"      "caart"     "caaat"     "cUaarEaat"
[7] "c.t"      
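
For comparison, here is a sketch of case conversion with perl = TRUE. The replacement strings differ slightly from the ones above: \E is placed directly after the first backreference so that only that group is upper-cased.

```r
vec2 <- c("ct", "cat", "caat", "caart", "caaat", "caaraat", "c.t")

# Upper-case the first group only; \E switches conversion off again
upper_first <- gsub("(a+)r(a+)", "\\U\\1\\Er\\2", vec2, perl = TRUE)
print(upper_first[6])  # "cAAraat"

# Capitalize each word: first letter upper case, remainder lower case
capitalized <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "regular expressions", perl = TRUE)
print(capitalized)     # "Regular Expressions"
```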

Summary

. stands for any character.

[ABC] means A,B or C.

[A-Z] means any uppercase letter between A and Z.

[0-9] means any digit between 0 and 9.

List of metacharacters: '$ * + . ? [ ] ^ { } | ( ) \'. If you need to match one of those characters literally, precede it with a doubled backslash.

Extended list of regular expressions

Used inside a bracket expression, e.g. [[:digit:]]; these classes do not need the perl=TRUE flag

Readable short cuts

[:digit:] Digits: '0 1 2 3 4 5 6 7 8 9'.

[:alpha:] Alphabetic characters: '[:lower:]' and '[:upper:]'.

[:upper:] Upper-case letters.

[:lower:] Lower-case letters.

Note that the set of alphabetic characters includes accented and special letters such as ß, ç or ö, which are very common in some languages. Therefore, it is more general than [A-Za-z], which covers ASCII letters only.

Extended list of regular expressions

For other characters

[:punct:] Punctuation characters: '! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.

[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, and space.

[:blank:] Blank characters: space and tab.

Extended list of regular expressions

For combination of other classes

[:alnum:] Alphanumeric characters: '[:alpha:]' and '[:digit:]'.

[:graph:] Graphical characters: '[:alnum:]' and '[:punct:]'.

[:print:] Printable characters: '[:alnum:]', '[:punct:]' and space.

[:xdigit:] Hexadecimal digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f'.
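
A brief sketch of these classes in use. They are written inside an enclosing pair of brackets, e.g. [[:digit:]], and can be negated like any other class. The vector s is hypothetical.

```r
# Hypothetical strings (not from the slides)
s <- c("abc", "123", "a1 b2", "x-y", "")

has_digit  <- grepl("[[:digit:]]", s)     # contains a digit
only_alpha <- grepl("^[[:alpha:]]+$", s)  # letters only
has_space  <- grepl("[[:space:]]", s)     # contains whitespace
has_punct  <- grepl("[[:punct:]]", s)     # contains punctuation

# Negated class: drop everything that is not alphanumeric
cleaned <- gsub("[^[:alnum:]]", "", s)
print(cleaned)  # "abc" "123" "a1b2" "xy" ""
```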

Concatenating strings

Base R

  • paste() concatenates strings.
  • paste0() concatenates strings without a separator
  • cat() prints and concatenates strings
paste("toto","tata",sep=' ')
[1] "toto tata"
paste("toto","tata",sep=",")
[1] "toto,tata"
x <- c("a","b","c")
paste(x,collapse="-")
[1] "a-b-c"
cat(c("a","b","c"), sep = "+")
a+b+c

Splitting a string

strsplit(): Split the elements of a character vector 'x' into substrings according to the matches to substring 'split' within them. By default 'split' is interpreted as a regular expression.

strsplit("a.b.c", "\\.")
[[1]]
[1] "a" "b" "c"
unlist(strsplit("a.b.c", "\\."))
[1] "a" "b" "c"
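
Because 'split' is a regular expression by default, the period had to be escaped above. Two variations are sketched here, with made-up inputs: a literal split using fixed = TRUE, and a split on a genuine pattern.

```r
# Literal split: with fixed = TRUE the period needs no escaping
parts_fixed <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
print(parts_fixed)  # "a" "b" "c"

# Regex split: one or more digits act as the separator
parts_regex <- strsplit("a1b22c", "[0-9]+")[[1]]
print(parts_regex)  # "a" "b" "c"

# strsplit is vectorised and returns one list element per input string
parts_list <- strsplit(c("a-b", "c-d-e"), "-")
print(lengths(parts_list))  # 2 3
```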

Mastering regular expressions

Working with regular expressions

  1. Start simply and expand
  2. Think about the most comprehensive – shortest – expression
  3. Think about negation rather than inclusion of all possibilities
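
The third tip can be illustrated with a quoted-text example. This is a sketch under the assumption that the goal is to pull out each quoted piece; the input line is made up.

```r
# Hypothetical input line (not from the slides)
line <- 'say "hello" and "bye"'

# Inclusion: ".*" is greedy and runs on to the last quote
greedy <- regmatches(line, regexpr('".*"', line))
print(greedy)   # "\"hello\" and \"bye\""

# Negation: "anything except a quote" stops at the first closing quote
negated <- regmatches(line, regexpr('"[^"]*"', line))
print(negated)  # "\"hello\""

# All quoted pieces
all_quoted <- regmatches(line, gregexpr('"[^"]*"', line))[[1]]
print(all_quoted)  # "\"hello\"" "\"bye\""
```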