A first glance at regular expressions
- Matching and substituting character strings
s/^lecture([0-9]{1,2}).*[^_].Rmd$/\1.Rmd/g
- Ugly, unreadable, terrifying expression
- See R for Data Science
5 May 2017
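As a rough sketch of what the substitution expression above does, the same pattern can be tried in R with sub(). The file names below are hypothetical and chosen only to exercise the pattern; note that the backreference \1 has to be written '\\1' inside an R string.

```r
# Hypothetical file names, used only to exercise the pattern
files <- c("lecture3_intro.Rmd", "lecture12_regex.Rmd", "notes.Rmd")

# Same pattern as the sed-style expression; the capture group keeps
# the lecture number and everything up to ".Rmd" is dropped
renamed <- sub("^lecture([0-9]{1,2}).*[^_].Rmd$", "\\1.Rmd", files)
renamed
# [1] "3.Rmd"     "12.Rmd"    "notes.Rmd"
```

Names that do not match the pattern, such as "notes.Rmd", are returned unchanged.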
The regex engine always returns the earliest (leftmost) match it finds.
Given a choice, it always favors a match over a non-match.
The topic of the day is isotopes.
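The leftmost-match rule can be checked on the sentence above. This is a small sketch using the pattern "is", which is an assumption made for illustration; regexpr() reports only the first match, while gregexpr() lists all of them.

```r
s <- "The topic of the day is isotopes."

# regexpr() returns the position of the leftmost match only
first <- as.integer(regexpr("is", s))
first
# [1] 22

# gregexpr() returns every match, from left to right
all_hits <- as.integer(gregexpr("is", s)[[1]])
all_hits
# [1] 22 25

# sub() therefore replaces the standalone word, not the start of "isotopes"
replaced <- sub("is", "IS", s)
replaced
# [1] "The topic of the day IS isotopes."
```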
grep, grepl
Search for matches of a regular expression/pattern in a character vector; either return the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match
grep("c", "cat")
[1] 1
grep("a", "cat", value = TRUE)
[1] "cat"
grepl("at", "cat")
[1] TRUE
sub, gsub
Search a character vector for regular expression matches and replace that match with another string
sub("c", "h", "cat")
[1] "hat"
sub("c", "kl", "cataract")
[1] "klataract"
gsub("c", "n", "cataract")
[1] "natarant"
. (period) represents any single character (it does not match the empty string '')
vec <- c('ct', 'cat', 'cart', 'dog', 'rat', 'carert', 'bet')
grep(".at", vec)
[1] 2 5
grep("..t", vec)
[1] 2 3 5 6 7
+ represents one or more occurrences
grep('c.+t', vec)
[1] 2 3 6
* represents zero or more occurrences
vec
[1] "ct" "cat" "cart" "dog" "rat" "carert" "bet"
grep('c.*t', vec)
[1] 1 2 3 6
Group terms with parentheses '(' and ')'
grep('c(.r)+t', vec)
[1] 3 6
grep('c(.r)*t', vec)
[1] 1 3 6
? The preceding item is optional and will be matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly 'n' times.
{n,} The preceding item is matched 'n' or more times.
{n,m} The preceding item is matched at least 'n' times, but not more than 'm' times.
^ Start of string
$ End of string
vec
[1] "ct" "cat" "cart" "dog" "rat" "carert" "bet"
grep('r.$', vec)
[1] 3 6
grep('^c', vec)
[1] 1 2 3 6
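The ? and {n}, {n,}, {n,m} quantifiers from the list above are not exercised by the examples so far. A minimal sketch, using a hypothetical vector vec3 that varies only in the number of a's:

```r
# Hypothetical test vector: zero to three a's between 'c' and 't'
vec3 <- c("ct", "cat", "caat", "caaat")

q_optional <- grep("ca?t", vec3)      # zero or one 'a'
q_optional
# [1] 1 2

q_exact <- grep("ca{2}t", vec3)       # exactly two 'a'
q_exact
# [1] 3

q_at_least <- grep("ca{2,}t", vec3)   # two or more 'a'
q_at_least
# [1] 3 4

q_range <- grep("ca{1,2}t", vec3)     # one or two 'a'
q_range
# [1] 2 3
```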
| (logical OR)
vec
[1] "ct" "cat" "cart" "dog" "rat" "carert" "bet"
grep('(c.t)|(c.rt)', vec)
[1] 2 3
Inside a character class ^ means anything except the following characters. E.g.
[^0-9] anything except a digit
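A short sketch of a negated character class in use. The input strings are made up for illustration. Note that ^ only negates when it is the first character inside the brackets; outside brackets it still means start of string.

```r
# Hypothetical inputs
x <- c("123", "a1", "abc")

# Elements that contain at least one character that is not a digit
has_nondigit <- grep("[^0-9]", x)
has_nondigit
# [1] 2 3

# Remove everything that is not a digit
digits_only <- gsub("[^0-9]", "", "tel: 555-0100")
digits_only
# [1] "5550100"
```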
We saw a bunch of special characters . + * ] [ $ What if we want to match them?
vec2 = c("ct", "cat", "caat", "caart", "caaat", "caaraat", "c.t")
grep('c.t', vec2)
[1] 2 7
Escape them with a backslash \ or, in R's case, two backslashes \\
# grep('c\.t', vec2) will not work, because R reads \. as an escape sequence like \n (and rejects it as unrecognized)
# Use two \'s
grep('c\\.t', vec2)
[1] 7
To match a \, our pattern must represent \\
Our string must contain 4 backslashes!
vec = c("a\\backslash", "nobackslash")
# grep('\\', vec) gives an error
grep('\\\\', vec)
[1] 1
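The reason four backslashes are needed is that there are two layers of escaping: R's string parser halves them first, then the regex engine halves them again. A small sketch to make the layers visible (fixed = TRUE is shown as one way to skip the regex layer):

```r
x <- "a\\backslash"

# The two typed backslashes are stored as a single character
n_chars <- nchar(x)
n_chars
# [1] 11
writeLines(x)   # prints a\backslash

# Regex route: the R string "\\\\" reaches the engine as \\, i.e. one literal \
found_regex <- grepl("\\\\", x)

# Fixed-string route: no regex layer, so one stored backslash is enough
found_fixed <- grepl("\\", x, fixed = TRUE)

c(found_regex, found_fixed)
# [1] TRUE TRUE
```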
vec2
[1] "ct" "cat" "caat" "caart" "caaat" "caaraat" "c.t"
sub('a+', 'a', vec2)
[1] "ct" "cat" "cat" "cart" "cat" "caraat" "c.t"
sub replaces only the first match, gsub replaces all
Use the backreferences \1, \2, etc. to refer to the first group, second group, etc.
gsub('(a+)r(a+)', 'b\\1brc\\2c', vec2)
[1] "ct" "cat" "caat" "caart" "caaat"
[6] "cbaabrcaact" "c.t"
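Backreferences are also handy for reordering captured pieces. The following is a minimal sketch with made-up input strings, not taken from the slides:

```r
# Swap two words: group 2 is emitted before group 1
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")
swapped
# [1] "world hello"

# Reorder a year-month-day string (hypothetical input) into day/month/year
reordered <- sub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3/\\2/\\1", "2017-05-05")
reordered
# [1] "05/05/2017"
```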
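\U, \L, \E make the following backreferences upper case or lower case, or leave them unchanged (\E). They only take effect with perl = TRUE; without that flag, as in the calls below, the letters U and E are inserted literally.
vec2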
[1] "ct" "cat" "caat" "caart" "caaat" "caaraat" "c.t"
gsub('(a+)r(a+)', '\\U\\1r\\2', vec2)
[1] "ct" "cat" "caat" "caart" "caaat" "cUaaraat"
[7] "c.t"
gsub('(a+)r(a+)', '\\U\\1r\\E\\2', vec2)
[1] "ct" "cat" "caat" "caart" "caaat" "cUaarEaat"
[7] "c.t"
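In the outputs above, U and E appear literally because the calls do not set perl = TRUE. A sketch of the case conversion with the flag set follows; the replacement strings are slightly rearranged from the ones above (an assumption made for clarity) so that the literal r always sits outside the converted part.

```r
vec2 <- c("ct", "cat", "caat", "caart", "caaat", "caaraat", "c.t")

# Upper-case the first group only; \\E ends the conversion before 'r'
first_upper <- gsub("(a+)r(a+)", "\\U\\1\\Er\\2", vec2, perl = TRUE)
first_upper[6]
# [1] "cAAraat"

# Upper-case the second group only
second_upper <- gsub("(a+)r(a+)", "\\1r\\U\\2", vec2, perl = TRUE)
second_upper[6]
# [1] "caarAAt"
```

Only the sixth element contains a match, so all other elements are returned unchanged.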
. stands for any character.
[ABC] means A, B or C.
[A-Z] means any upper-case letter between A and Z.
[0-9] means any digit between 0 and 9.
List of metacharacters '$ * + . ? [ ] ^ { } | ( ) \'. If you need to match one of those characters literally, precede it with a doubled backslash.
Requires perl=TRUE flag
[:digit:] Digits: '0 1 2 3 4 5 6 7 8 9'.
[:alpha:] Alphabetic characters: '[:lower:]' and '[:upper:]'.
[:upper:] Upper-case letters.
[:lower:] Lower-case letters.
Note that the set of alphabetic characters includes letters such as ß, ç or ö, which are very common in some languages. Therefore, it is more general than [A-Za-z], which covers ASCII characters only.
[:punct:] Punctuation characters: '! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.
[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
[:blank:] Blank characters: space and tab.
[:alnum:] Alphanumeric characters: '[:alpha:]' and '[:digit:]'.
[:graph:] Graphical characters: '[:alnum:]' and '[:punct:]'.
[:print:] Printable characters: '[:alnum:]', '[:punct:]' and space.
[:xdigit:] Hexadecimal digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f'.
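These class names are only valid inside a bracket expression, so in a pattern they are written with double brackets, e.g. [[:digit:]]. A brief sketch with made-up inputs:

```r
# Hypothetical inputs
x <- c("abc", "a1", "B-2")

# Which elements contain a digit?
has_digit <- grepl("[[:digit:]]", x)
has_digit
# [1] FALSE  TRUE  TRUE

# Strip punctuation
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")
no_punct
# [1] "Hello world"

# Collapse any run of whitespace (space, tab, newline) to one space
squeezed <- gsub("[[:space:]]+", " ", "a \t b\n c")
squeezed
# [1] "a b c"
```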
paste() concatenates strings.
paste0() concatenates strings with no separator.
cat() prints and concatenates strings.
paste("toto","tata",sep=' ')
[1] "toto tata"
paste("toto","tata",sep=",")
[1] "toto,tata"
x <- c("a","b","c")
paste(x,collapse="-")
[1] "a-b-c"
cat(c("a","b","c"), sep = "+")
a+b+c
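paste0() is listed above without an example. A minimal sketch follows; the file-name pieces are illustrative only. It shows that paste0() behaves like paste() with sep = "" and is vectorised over its arguments.

```r
# Build several names at once; shorter arguments are recycled
file_names <- paste0("lecture", 1:3, ".Rmd")
file_names
# [1] "lecture1.Rmd" "lecture2.Rmd" "lecture3.Rmd"

# Equivalent to paste() with an empty separator
same <- identical(paste0("toto", "tata"), paste("toto", "tata", sep = ""))
same
# [1] TRUE

# collapse works as in paste()
joined <- paste0(c("a", "b", "c"), collapse = "")
joined
# [1] "abc"
```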
strsplit(): splits the elements of a character vector 'x' into substrings according to the matches to substring 'split' within them.
strsplit("a.b.c", "\\.")
[[1]]
[1] "a" "b" "c"
unlist(strsplit("a.b.c", "\\."))
[1] "a" "b" "c"
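Because 'split' is treated as a regular expression by default, it can be a pattern rather than a single character, and fixed = TRUE avoids the need to escape the period. A short sketch with made-up inputs:

```r
# Split on runs of digits
by_digits <- strsplit("a1b22c", "[0-9]+")[[1]]
by_digits
# [1] "a" "b" "c"

# fixed = TRUE treats "." literally, so no escaping is needed
by_dot <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
by_dot
# [1] "a" "b" "c"

# With a vector input, the result is a list with one element per string
pieces <- strsplit(c("a-b", "c-d-e"), "-")
lengths(pieces)
# [1] 2 3
```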