Data processing with R tidyverse

5 May 2017

Learning objectives

A first glance at regular expressions

Matching and substituting character strings
s/^lecture([0-9]{1,2}).*[^_].Rmd$/\1.Rmd/g
Ugly, unreadable, terrifying expression
See R for data science

Motivation

What if we want to match

any letter followed by 'n'?
any vowel followed by 'n'?
two letters followed by 'n'?
any number of letters followed by 'n'?

Regular expressions!

allow us to match much more complicated patterns
build patterns from a simple vocabulary and grammar

Finite state automaton

Most relevant consequence

It will always return the earliest (leftmost) match it finds.
Given a choice, it always favors a match over a nonmatch.

Examples

The topic of the day is isotopes.
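
The leftmost rule can be checked on the sentence above. The following is a minimal sketch using base R's `regexpr()`, `gregexpr()` and `regmatches()`; the patterns are illustrative choices, not taken from the original slides.

```r
s <- "The topic of the day is isotopes."

# regexpr() reports the start of the first (leftmost) match only.
first_is <- regexpr("is", s)
as.integer(first_is)            # 22: the standalone word "is", not "isotopes"

# "i" followed by any run of lowercase letters: the match starts at the
# earliest "i" in the string (inside "topic"), even though a longer match
# ("isotopes") exists further to the right.
leftmost <- regmatches(s, regexpr("i[a-z]*", s))
leftmost                        # "ic"

# gregexpr() returns every non-overlapping match instead.
all_hits <- regmatches(s, gregexpr("i[a-z]*", s))[[1]]
all_hits                        # "ic" "is" "isotopes"
```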

Typical R functions using regex

`grep, grepl`

Search for matches of a regular expression (pattern) in a character vector. grep returns the indices of the matching elements, or the matching elements themselves with value = TRUE; grepl returns a TRUE/FALSE vector indicating which elements match.

grep("c", "cat")

[1] 1

grep("a", "cat", value = TRUE)

[1] "cat"

grepl("at", "cat")

[1] TRUE
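
The three return types are easier to tell apart on a vector with several elements. A small sketch, assuming a made-up vector `animals` that is not part of the original slides:

```r
animals <- c("cat", "dog", "cow", "rat")   # made-up example data

idx  <- grep("c", animals)                 # positions of matching elements
vals <- grep("c", animals, value = TRUE)   # the matching elements themselves
lgl  <- grepl("c", animals)                # one TRUE/FALSE per element

idx    # 1 3
vals   # "cat" "cow"
lgl    # TRUE FALSE TRUE FALSE

# A logical vector is convenient for counting and subsetting.
n_hits   <- sum(lgl)       # 2
no_match <- animals[!lgl]  # "dog" "rat"
```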

`sub, gsub`

Search a character vector for regular expression matches and replace each match with another string

sub("c", "h", "cat")

[1] "hat"

sub("c", "kl", "cataract")

[1] "klataract"

gsub("c", "n", "cataract")

[1] "natarant"

Meta-characters

Characters that do not stand for themselves

`.` (period) represents any single character (it does not match the empty string '')

vec <- c('ct', 'cat', 'cart', 'dog', 'rat', 'carert', 'bet')
grep(".at", vec)

[1] 2 5

grep("..t", vec)

[1] 2 3 5 6 7

+ represents one or more occurrences

grep('c.+t', vec)

[1] 2 3 6

Meta characters

* represents zero or more occurrences

vec

[1] "ct"     "cat"    "cart"   "dog"    "rat"    "carert" "bet"

grep('c.*t', vec)

[1] 1 2 3 6

Group terms with parentheses '(' and ')'

grep('c(.r)+t', vec)

[1] 3 6

grep('c(.r)*t', vec)

[1] 1 3 6

Quantifying number of matches

Quantifiers apply to the preceding character or group

? The preceding item is optional and will be matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly 'n' times.
{n,} The preceding item is matched 'n' or more times.
{n,m} The preceding item is matched at least 'n' times, but not more than 'm' times.
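
The quantifiers listed above can be compared side by side. This is a minimal sketch on a hypothetical vector `vec3` (not from the slides) in which only the number of "a" characters varies:

```r
vec3 <- c("ct", "cat", "caat", "caaat", "caaaat")   # hypothetical test vector

q_opt   <- grep("ca?t",     vec3, value = TRUE)   # zero or one "a"
q_two   <- grep("ca{2}t",   vec3, value = TRUE)   # exactly two
q_min2  <- grep("ca{2,}t",  vec3, value = TRUE)   # two or more
q_1to3  <- grep("ca{1,3}t", vec3, value = TRUE)   # between one and three

q_opt    # "ct" "cat"
q_two    # "caat"
q_min2   # "caat" "caaat" "caaaat"
q_1to3   # "cat" "caat" "caaat"
```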

Match positions

Other useful ones include

^ Start of string
$ End of string

vec

[1] "ct"     "cat"    "cart"   "dog"    "rat"    "carert" "bet"

grep('r.$', vec)

[1] 3 6

grep('^c', vec)

[1] 1 2 3 6

Meta characters

| ( logical OR )

vec

[1] "ct"     "cat"    "cart"   "dog"    "rat"    "carert" "bet"

grep('(c.t)|(c.rt)', vec)

[1] 2 3
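
Where the parentheses sit changes what the alternation covers. A short sketch on the same `vec`; the patterns are illustrative and not from the original slides:

```r
vec <- c("ct", "cat", "cart", "dog", "rat", "carert", "bet")

# Parentheses limit the alternation to part of the pattern.
inner <- grep("^c(a|ar)t$", vec)   # "cat" or "cart"
first <- grep("^(c|r)at$", vec)    # "cat" or "rat"

# Without parentheses each anchor belongs to one alternative only:
# "starts with c" OR "ends with t".
loose <- grep("^c|t$", vec)

inner   # 2 3
first   # 2 5
loose   # 1 2 3 5 6 7
```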

Character classes

[a-z] lowercase letters
[a-zA-Z] any letter
[0-9] any digit
[aeiou] any vowel
[0-7ivx] any of 0 to 7, i, v, and x

Inside a character class, a leading ^ means anything except the following characters. E.g.

[^0-9] anything except a digit
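
With character classes and quantifiers, the four questions from the motivation can be written down directly. A minimal sketch, assuming a made-up vector `words` of lowercase test strings:

```r
words <- c("an", "in", "sn", "ban", "brn", "n", "1n")   # made-up test strings

letter_n  <- grep("[a-z]n",    words, value = TRUE)  # any letter followed by "n"
vowel_n   <- grep("[aeiou]n",  words, value = TRUE)  # any vowel followed by "n"
two_n     <- grep("[a-z]{2}n", words, value = TRUE)  # two letters followed by "n"
letters_n <- grep("^[a-z]*n$", words, value = TRUE)  # any number of letters, then "n"
other_n   <- grep("[^a-z]n",   words, value = TRUE)  # a non-letter followed by "n"

letter_n    # "an" "in" "sn" "ban" "brn"
vowel_n     # "an" "in" "ban"
two_n       # "ban" "brn"
letters_n   # "an" "in" "sn" "ban" "brn" "n"
other_n     # "1n"
```

The anchors in the fourth pattern are a design choice: without them, "zero or more letters followed by n" would match any string containing an "n".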

Matching metacharacters

We saw a bunch of special characters: . + * ] [ $. What if we want to match them literally?

vec2 = c("ct", "cat", "caat", "caart", "caaat", "caaraat", "c.t")
grep('c.t', vec2)

[1] 2 7

Escape them with a backslash \ or, in R's case, two: \\

# grep('c\.t', vec2) fails with an error: '\.' is not a recognized escape sequence in an R string (unlike '\n')

# Use two backslashes instead
grep('c\\.t', vec2)

[1] 7
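
Doubling the backslash is not the only option. As a sketch of two alternatives available in base R, a metacharacter can be placed inside a character class, or the whole pattern can be treated as plain text with `fixed = TRUE`:

```r
vec2 <- c("ct", "cat", "caat", "caart", "caaat", "caaraat", "c.t")

escaped <- grep("c\\.t", vec2)               # backslash escape
bracket <- grep("c[.]t", vec2)               # "." inside [] is literal
literal <- grep("c.t", vec2, fixed = TRUE)   # no regex interpretation at all

escaped   # 7
bracket   # 7
literal   # 7

# The R string "c\\.t" holds four characters: c, \, ., t
n_chars <- nchar("c\\.t")   # 4
```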

Matching metacharacters

To match a backslash \, the regular expression itself must be \\

Since R also uses \ as the escape character in strings, our R string must contain 4 backslashes!

vec = c("a\\backslash", "nobackslash")
#grep('\\', vec) gives error
grep('\\\\', vec)

[1] 1
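
Counting characters makes the two layers of escaping visible. A small sketch; the variable `x` simply repeats the vector used above, and `fixed = TRUE` is shown as one way to skip the regex layer:

```r
x <- c("a\\backslash", "nobackslash")

one <- nchar("\\")      # 1: the R literal "\\" is a single backslash character
two <- nchar("\\\\")    # 2: two backslashes, which the regex engine reads as one literal \

hits_regex <- grep("\\\\", x)               # regex \\ matches one backslash
hits_fixed <- grep("\\", x, fixed = TRUE)   # plain substring search for one backslash

hits_regex   # 1
hits_fixed   # 1
```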

Search and replace

The sub function allows search and replacement

vec2

[1] "ct"      "cat"     "caat"    "caart"   "caaat"   "caaraat" "c.t"

sub('a+', 'a', vec2)

[1] "ct"     "cat"    "cat"    "cart"   "cat"    "caraat" "c.t"

sub replaces only first match, gsub replaces all

Use the backreferences \1, \2 etc to refer to first, second group, etc.

gsub('(a+)r(a+)', 'b\\1brc\\2c', vec2)

[1] "ct"          "cat"         "caat"        "caart"       "caaat"      
[6] "cbaabrcaact" "c.t"
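
Backreferences are handy for reordering parts of a string. The following sketch uses hypothetical date strings (not from the slides) and also shows a backreference inside the pattern itself:

```r
dates <- c("2017-05-05", "2016-12-31")   # hypothetical input

# Three groups: year, month, day; reorder them in the replacement.
swapped <- sub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3/\\2/\\1", dates)
swapped   # "05/05/2017" "31/12/2016"

# Inside a pattern, \\1 matches whatever the first group matched,
# so "(.)\\1" finds any character that is immediately repeated.
doubled <- grep("(.)\\1", c("cat", "caat", "cart", "book"), value = TRUE)
doubled   # "caat" "book"
```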

Search and replace

Use `\U, \L, \E` to make following backreferences upper or lower case or leave unchanged. These escapes are only interpreted with perl = TRUE; without it, the letters U, L and E are inserted literally.

vec2

[1] "ct"      "cat"     "caat"    "caart"   "caaat"   "caaraat" "c.t"

gsub('(a+)r(a+)', '\\U\\1r\\2', vec2, perl = TRUE)

[1] "ct"      "cat"     "caat"    "caart"   "caaat"   "cAArAAt" "c.t"

gsub('(a+)r(a+)', '\\U\\1r\\E\\2', vec2, perl = TRUE)

[1] "ct"      "cat"     "caat"    "caart"   "caaat"   "cAAraat" "c.t"
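
In R, the case-conversion escapes are documented for `perl = TRUE` only. A minimal sketch on a hypothetical vector `txt`, combining `\U`, `\L` and `\E`:

```r
txt <- c("cat", "caaraat", "dog")   # hypothetical input

# Upper-case the first letter, lower-case the rest (needs perl = TRUE).
capitalised <- sub("^([a-z])([a-z]*)$", "\\U\\1\\L\\2", txt, perl = TRUE)
capitalised   # "Cat" "Caaraat" "Dog"

# \\E ends the case conversion, so only the first group is upper-cased.
partly <- gsub("(a+)r(a+)", "\\U\\1\\Er\\2", txt, perl = TRUE)
partly        # "cat" "cAAraat" "dog"
```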

Summary

. stands for any character.

[ABC] means A,B or C.

[A-Z] means any uppercase letter between A and Z.

[0-9] means any digit between 0 and 9.

List of metacharacters: '$ * + . ? [ ] ^ { } | ( ) \'. To match one of these characters literally, precede it with a doubled backslash.

Extended list of regular expressions

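These classes must be written inside a bracket expression, e.g. [[:digit:]]. They work with the default regex engine as well as with perl=TRUE.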

Readable short cuts

[:digit:] Digits: '0 1 2 3 4 5 6 7 8 9'.

[:alpha:] Alphabetic characters: '[:lower:]' and '[:upper:]'.

[:upper:] Upper-case letters.

[:lower:] Lower-case letters.

Note that the set of alphabetic characters includes accented and special letters such as ß, ç or ö, which are very common in some languages. It is therefore more general than [A-Za-z], which covers ASCII letters only.

Extended list of regular expressions

For other characters

[:punct:] Punctuation characters: '! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.

[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, and space.

[:blank:] Blank characters: space and tab.

Extended list of regular expressions

Combinations of other classes

[:alnum:] Alphanumeric characters: '[:alpha:]' and '[:digit:]'.

[:graph:] Graphical characters: '[:alnum:]' and '[:punct:]'.

[:print:] Printable characters: '[:alnum:]', '[:punct:]' and space.

[:xdigit:] Hexadecimal digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f'.
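
The named classes are used inside a bracket expression, which gives the doubled brackets below. A short sketch on made-up ASCII strings (results for non-ASCII letters depend on the locale, so they are left out here):

```r
x <- c("abc", "a1c", "A B", "x,y!")   # made-up strings

has_digit <- grepl("[[:digit:]]", x)        # does the element contain a digit?
no_punct  <- gsub("[[:punct:]]", "", x)     # strip punctuation
only_up   <- grep("^[[:upper:][:space:]]+$", x, value = TRUE)  # two classes in one bracket
not_alnum <- gsub("[^[:alnum:]]", "_", x)   # negated class

has_digit   # FALSE TRUE FALSE FALSE
no_punct    # "abc" "a1c" "A B" "xy"
only_up     # "A B"
not_alnum   # "abc" "a1c" "A_B" "x_y_"
```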

Concatenating strings

Base R

paste() concatenates strings, separated by sep (a space by default).
paste0() concatenates strings without a separator.
cat() concatenates and prints strings

paste("toto","tata",sep=' ')

[1] "toto tata"

paste("toto","tata",sep=",")

[1] "toto,tata"

x <- c("a","b","c")
paste(x,collapse="-")

[1] "a-b-c"

cat(c("a","b","c"), sep = "+")

a+b+c
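
paste() and paste0() are vectorised, which makes them useful for generating names. A minimal sketch; the file names are hypothetical and only echo the lecture pattern from the introduction:

```r
ids <- 1:3

files  <- paste0("lecture", ids, ".Rmd")     # no separator
labels <- paste("lecture", ids, sep = "_")   # vectorised over ids
one    <- paste(files, collapse = ", ")      # collapse to a single string

files    # "lecture1.Rmd" "lecture2.Rmd" "lecture3.Rmd"
labels   # "lecture_1" "lecture_2" "lecture_3"
one      # "lecture1.Rmd, lecture2.Rmd, lecture3.Rmd"
```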

Splitting a string

strsplit(): Split the elements of a character vector 'x' into substrings according to the matches of the regular expression 'split' within them. The result is a list with one element per input string.

strsplit("a.b.c", "\\.")

[[1]]
[1] "a" "b" "c"

unlist(strsplit("a.b.c", "\\."))

[1] "a" "b" "c"
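
Because 'split' is a regular expression, the separator does not have to be a single fixed character. A small sketch with made-up input strings:

```r
# Split on runs of digits.
parts_regex <- strsplit("a1b22c333d", "[0-9]+")[[1]]
parts_regex   # "a" "b" "c" "d"

# fixed = TRUE treats the separator literally, so "." needs no escaping.
parts_fixed <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
parts_fixed   # "a" "b" "c"

# One list element per input string.
parts_multi <- strsplit(c("x-y", "p-q-r"), "-")
n_parts <- lengths(parts_multi)
n_parts       # 2 3
```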

Mastering regular expressions

Working with regular expressions

Start simply and expand
Think about the most comprehensive (shortest) expression
Think about negation rather than inclusion of all possibilities
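
The advice above can be sketched by building a pattern in small steps. The file names are hypothetical, and the final pattern is a simplified relative of the expression shown in the learning objectives, not the author's exact one:

```r
files <- c("lecture1_intro.Rmd", "lecture12_regex.Rmd", "notes.Rmd")  # hypothetical names

# Step 1: does the name start with "lecture"?
step1 <- grepl("^lecture", files)

# Step 2: ... followed by one or two digits?
step2 <- grepl("^lecture[0-9]{1,2}", files)

# Step 3: capture the digits and keep only them plus the extension.
short <- sub("^lecture([0-9]{1,2}).*\\.Rmd$", "\\1.Rmd", files)

# Negation: "everything up to the first underscore" is simply [^_]+
prefix <- sub("^([^_]+)_.*$", "\\1", files)

step1    # TRUE TRUE FALSE
step2    # TRUE TRUE FALSE
short    # "1.Rmd" "12.Rmd" "notes.Rmd"
prefix   # "lecture1" "lecture12" "notes.Rmd"
```

Elements that do not match are returned unchanged by sub(), which is why "notes.Rmd" survives both substitutions.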
