2 May 2017

Learning objectives

  • (re)view some base R
  • know the different data types: numeric, logical, factor …
  • understand what a list, a vector, a data.frame … are

Getting started

Let's get ready to use R and RStudio. Do the following:

  • Open up RStudio
  • Maximize the RStudio window
  • Click the Console pane; at the prompt (>), type 3 + 2 and hit Enter
> 3 + 2

Arithmetic operations

You will not be surprised that R is very good at computing

arithmetic operators

  • +: addition
  • -: subtraction
  • *: multiplication
  • /: division
  • ^ or **: exponentiation
  • %%: modulo (remainder after division)
  • %/%: integer division

Remember

R will:

  • first perform exponentiation
  • then multiplications and/or divisions
  • and finally additions and/or subtractions.

If you need to change the priority during the evaluation, use parentheses – i.e. ( and ) – to group calculations.
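A quick sketch of these precedence rules; the numbers are arbitrary and chosen only for illustration:

```r
# exponentiation is evaluated before multiplication
a <- 2 * 3^2      # same as 2 * 9
# multiplication is evaluated before addition
b <- 2 + 3 * 4    # same as 2 + 12
# parentheses force the addition to happen first
d <- (2 + 3) * 4  # same as 5 * 4
# exponentiation also binds tighter than the unary minus
e <- -2^2         # same as -(2^2)
print(c(a, b, d, e))
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.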

Data types and structures

R base

Necessary R base

We could be tempted to drop base R, but the tidyverse is built on top of it, so some base functions need to be known.

4 main types

Type                 Example
numeric              integer (2L), double (2.34)
character (string)   "tidyverse !"
logical (boolean)    TRUE / FALSE
complex              2+0i

Special cases

NA   # not available, missing data
NA_real_
NA_integer_
NA_character_
NA_complex_
NULL # empty
-Inf, Inf # infinite values

missing and infinite

c(NA_real_, 2.45, 45.67)
[1]    NA  2.45 45.67
c(Inf, 2.45, 45.67)
[1]   Inf  2.45 45.67
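As a small sketch of how these special values are usually detected and handled, the base functions is.na() and is.finite() can be applied to the two vectors above (the variable names are only illustrative):

```r
x <- c(NA_real_, 2.45, 45.67)
y <- c(Inf, 2.45, 45.67)

# flag missing values, element by element
missing_flags <- is.na(x)
# flag values that are neither missing nor infinite
finite_flags <- is.finite(y)

# a single NA makes the mean NA ...
mean_raw <- mean(x)
# ... unless missing values are explicitly removed
mean_clean <- mean(x, na.rm = TRUE)

print(missing_flags)
print(finite_flags)
print(c(mean_raw, mean_clean))
```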

Structures

Vectors

c() is the function that combines (concatenates) values into a vector

Example

4
[1] 4
c(43, 5.6, 2.90)
[1] 43.0  5.6  2.9

Factors

factor() converts strings to factors; the levels are the dictionary of allowed values

Example

factor(c("AA", "BB", "AA", "CC"))
[1] AA BB AA CC
Levels: AA BB CC

Matrix (2D), Arrays (\(\geq\) 3D)

won't dig into those

Example

matrix(1:4, nrow = 2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Lists

very important, as they can contain anything

Example

list(f = factor(c("AA", "AA")),
     v = c(43, 5.6, 2.90),
     s = 4)
$f
[1] AA AA
Levels: AA

$v
[1] 43.0  5.6  2.9

$s
[1] 4

Data frames are special lists

data.frame

same as a list, but all elements must have the same length

Example

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))
   f    v s
1 AA 43.0 4
2 AA  5.6 4
3 BB  2.9 4

Data types 2

# evaluate
typeof(2)
[1] "double"
# check
is.integer(2.34)
[1] FALSE
# check with an actual integer
is.integer(2L)
[1] TRUE
# convert
as.integer(2.34)
[1] 2

Vectors


Vectors are the simplest type of object in R.

print(5)
[1] 5

[1] means we made a numeric vector of length 1. Now look at what the : operator does:

1:30
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30

How many elements are in the thing we made here? What does the [24] signify?

Vectors

concatenate

Think of vectors as collections of simple things (like numbers) that are ordered. We can create vectors from other vectors using the c function:

c(2, TRUE, "a string")
[1] "2"        "TRUE"     "a string"

We can use the assignment operator <- to associate a name to our vectors in order to reuse them:

my_vec <- c(3, 4, 1:3)
my_vec
[1] 3 4 1 2 3

Tip

RStudio has the built-in shortcut Alt+- for <-

Advice

Even though = also works for assignment, prefer <-: = is also used to name function arguments, so keeping <- for assignment avoids ambiguity

Vectors

(cont.)

The following will build a character vector. We know this because the elements are all in "quotes".

char_vec <- c("dog", "cat", "ape")

Now use the c function to combine a length-one vector containing the number 4 with char_vec. What happens?

c(4, char_vec)
[1] "4"   "dog" "cat" "ape"

Notice that the 4 is quoted. R turned it into a character vector and then combined it with char_vec.

Remember

All elements in an atomic vector must be of the same type. Otherwise, they are silently coerced.

Vectors

Hierarchy
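When types are mixed, atomic vectors are coerced along the order logical < integer < double < character. A minimal sketch that checks this with typeof(); the variable names are hypothetical:

```r
# each call mixes two types; typeof() reports the type R settled on
lgl_int <- typeof(c(TRUE, 2L))    # logical + integer
int_dbl <- typeof(c(2L, 2.5))     # integer + double
dbl_chr <- typeof(c(2.5, "a"))    # double + character

# logical values become 1 / 0 when promoted to numbers
promoted <- c(TRUE, FALSE, 3)

print(c(lgl_int, int_dbl, dbl_chr))
print(promoted)
```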

Vectors

built-in

R has a few built-in vectors. One of these is LETTERS. What does it contain?

LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

How do we extract the first element (the letter A) from this vector? Here is how to do it:

LETTERS[1]
[1] "A"

Use the square brackets [] to subset vectors

Vectors

subset

Important

Unlike Python or Perl, R vectors use 1-based indexing!

How to extract more than one element

select elements from position 3 to 10:

LETTERS[3:10]
[1] "C" "D" "E" "F" "G" "H" "I" "J"

Remember what the : operator does?

Take a look:

3:10
[1]  3  4  5  6  7  8  9 10

Can you see how LETTERS[3:10] works now?

Exercise

find a way to output

[1] "B" "C" "D" "E"

find a way to output

[1] "B" "C" "D" "E" "G"

find a way to output the first 5 letters plus the second-to-last one

[1] "A" "B" "C" "D" "E" "Y"

Tip

the length of a vector is provided by length()

find a way to output all letters except the first one

 [1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
[18] "S" "T" "U" "V" "W" "X" "Y" "Z"

Tip

subsetting also accepts negative indexes, which exclude the given positions

Solution

  • indexes from 2 to 5
LETTERS[2:5]
[1] "B" "C" "D" "E"

  • indexes from 2 to 5, plus 7
LETTERS[c(2:5, 7)]
[1] "B" "C" "D" "E" "G"

  • indexes from 1 to 5, plus the second-to-last one
LETTERS[c(1:5, length(LETTERS) - 1)]
[1] "A" "B" "C" "D" "E" "Y"

  • all indexes except 1

LETTERS[-1]
 [1] "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
[18] "S" "T" "U" "V" "W" "X" "Y" "Z"

Named vectors

As with a dict in Python or an associative array in Perl, names (character strings) can be used as indexes

char_vec[1]
[1] "dog"
names(char_vec) <- c("first", "second", "third")
char_vec["first"]
first 
"dog" 

Note that the [1] index prefix is no longer displayed

char_vec
 first second  third 
 "dog"  "cat"  "ape" 

Vectorized operation

my_vec <- 10:18
my_vec + 2
[1] 12 13 14 15 16 17 18 19 20


R recycles vectors that are too short, without any warning when the longer length is a multiple of the shorter one:

my_vec * c(1:3)
[1] 10 22 36 13 28 45 16 34 54

Vectorized operation

(cont.)

Have a look at the following operation:

c(1:3) + c(1:2) * c(1:4)
Warning in c(1:3) + c(1:2) * c(1:4): longer object length is not a multiple
of shorter object length
[1] 2 6 6 9

Details

The steps R performs behind the scenes are:

c(1, 2, 3, 1) + (c(1, 2, 1, 2) * c(1, 2, 3, 4))
[1] 2 6 6 9
c(1, 2, 3, 1) + c(1, 4, 3, 8)
[1] 2 6 6 9

Vectors

tricky filling

x <- numeric(10)
x[20] <- 1
head(x, 20)
 [1]  0  0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA  1

source: Kevin Ushey

Warning!

Unlike Python, which raises an index out of range error, R silently expands the vector and fills the gap with missing values

Factors

Vectors with qualitative data

my_f <- factor(c("cytoplasm", "nucleus", "extracellular", "nucleus", "nucleus"))
my_f
[1] cytoplasm     nucleus       extracellular nucleus       nucleus      
Levels: cytoplasm extracellular nucleus

Representation

Internally, the data are stored as integer codes

str(my_f)
 Factor w/ 3 levels "cytoplasm","extracellular",..: 1 3 2 3 3

Dictionary

The labels attached to those codes are called levels. By default, they are sorted alphabetically

levels(my_f)
[1] "cytoplasm"     "extracellular" "nucleus"      

To reorder those levels, the safest way is to use the forcats package
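For comparison, here is a minimal base R sketch of reordering levels (it deliberately does not use forcats; my_f2 and the chosen order are only illustrative). Passing levels = to factor() reorders them, whereas assigning to levels() would relabel the data instead:

```r
my_f <- factor(c("cytoplasm", "nucleus", "extracellular", "nucleus", "nucleus"))

# rebuild the factor with an explicit level order
my_f2 <- factor(my_f, levels = c("nucleus", "cytoplasm", "extracellular"))

# the labels are unchanged, only the underlying integer codes differ
print(levels(my_f2))
print(as.integer(my_f2))
print(as.character(my_f2))
```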

Matrix

A matrix is a 2D array

M <- matrix(1:6, ncol = 2, nrow = 3)
M
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
M <- matrix(1:6, ncol = 2, nrow = 3, byrow = TRUE)
M
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

Array

Similar to a matrix but with \(\geq\) 3 dimensions

A <- array(1:24, dim = c(2, 4, 3))
A
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

, , 2

     [,1] [,2] [,3] [,4]
[1,]    9   11   13   15
[2,]   10   12   14   16

, , 3

     [,1] [,2] [,3] [,4]
[1,]   17   19   21   23
[2,]   18   20   22   24

Lists

Also called recursive vectors. The most permissive type: a list can contain anything, including other lists!

  • squares are atomic
  • rounded are lists

source: H. Wickham - R for data science, licence CC

Lists

Pepper analogy

Example

l <- list(name = "Farina",
          firstname = "Geoff",
          year = 1995)
l["firstname"]
$firstname
[1] "Geoff"
l[["firstname"]]
[1] "Geoff"

Question

How to subset a single pepper seed?
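One possible answer, sketched on a hypothetical nested list (shaker and its element names are invented for illustration): [ keeps the list wrapper, [[ extracts the content, and the two can be chained to reach a single value.

```r
shaker <- list(
  packet_1 = c("seed_a", "seed_b", "seed_c"),
  packet_2 = list(inner = c(10, 20))
)

# single brackets: still a list, of length 1
still_a_list <- shaker["packet_1"]
# double brackets: the character vector itself
one_packet <- shaker[["packet_1"]]
# chain [[ and [ to reach one element
one_seed <- shaker[["packet_1"]][2]
# [[ can be repeated to go through nested lists
nested_seed <- shaker[["packet_2"]][["inner"]][1]

print(one_seed)
print(nested_seed)
```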

Data frames

It's the most important structure to remember. The whole tidyverse focuses on data frames.

More precisely, on a tweaked data.frame: the tibble

definition

A data.frame is a list where all columns (i.e. vectors) have the same length

built-in example

women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

Data frames

subset

We can extract a vector (column) from a data frame in a few different ways:

Using double square brackets [[ ]]

women[["height"]]
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

Or its shorthand: the $ operator

women$height
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72


Remember the pepper analogy introduced by Hadley?

What would be the output of women["height"]?
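A short sketch to check the answer, using the built-in women dataset (col_df and col_vec are illustrative names): single brackets keep the data frame wrapper, double brackets return the bare vector.

```r
# single brackets: a data frame with one column
col_df <- women["height"]
# double brackets: the numeric vector itself
col_vec <- women[["height"]]

print(class(col_df))
print(dim(col_df))
print(class(col_vec))
print(length(col_vec))
```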

Data frame as a table

A data frame can be treated as a table, where a specific cell is extracted by its row and column:

first 5 rows

head(women, 5)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126

only one cell with [row, col]

  • first coordinate = row
  • second coordinate = col
women[4, 2]
[1] 123

Logical operators

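In addition to the arithmetic operators

Perform comparisons

  • == equal
  • != not equal
  • < less than
  • <= less than or equal
  • > greater than
  • >= greater than or equal
  • ! not (negation)
  • &, && and (& is vectorized, && works on single values)
  • |, || or (| is vectorized, || works on single values)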
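A minimal sketch of these operators on a small, arbitrary vector, followed by their most common use: subsetting with a logical vector. The variable names are hypothetical.

```r
x <- c(1, 5, 10)

# comparisons are vectorized: one result per element
above_two <- x > 2
# & and | combine logical vectors element by element
in_range <- x > 2 & x < 10
at_edges <- x < 2 | x == 10
# ! negates each element
not_above <- !above_two

# && works on single values, typically inside if ()
single_test <- length(x) == 3 && x[1] == 1

# a logical vector inside [] keeps the TRUE positions
kept <- x[above_two]
# the same idea selects rows of a data frame
tall <- women[women$height > 70, ]

print(in_range)
print(kept)
print(tall)
```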

Using library()

with only the default packages loaded (filter() comes from stats)

x <- 1:10
filter(x, rep(1, 3))
Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Conflicts! When 2 packages export a function with the same name, the most recently loaded one wins

library(dplyr)
filter(x, rep(1, 3))

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('integer', 'numeric')"

Solution

Use the :: operator to call a function from a specific package

stats::filter(x, rep(1, 3))
Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Pipes with magrittr

developed by Stefan Milton Bache

Compare the classic approach of nested parentheses with the magrittr pipeline

R base

set.seed(12)
round(mean(rnorm(5)), 2)
[1] -0.76

magrittr

set.seed(12)
rnorm(5) %>%
  mean() %>%
  round(2)
[1] -0.76

Of note, the magrittr pipe needs to be loaded with one of:

library(magrittr)
library(dplyr)
library(tidyverse)

Coding style

R is rather flexible and permissive with its syntax. However, being more strict tends to ease the debugging process.

See the tidyverse style guide's recommendations

In summary:

Good

  • use spaces
  • use more lines
    • } alone on its line, except for } else {
    • using the pipe %>% to display a single instruction per line
    • break list definitions, function arguments …
  • avoid using names of existing functions and variables
  • prefer snake_case to CamelCase

Bad

# example from http://adv-r.had.co.nz/Style.html
T <- FALSE
c <- 10
mean <- function(x) sum(x)
# lacks spaces for readability
average<-mean(feet/12+inches,na.rm=TRUE)
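For contrast, a sketch of the same last line written with spaces. The feet and inches values are hypothetical, invented so the line can run, and the sketch assumes a fresh session in which mean has not been overwritten as in the block above:

```r
# hypothetical measurements, invented for illustration
feet <- c(60, 66, NA)
inches <- c(2, 4, 1)

# spaces around operators and after commas
average <- mean(feet / 12 + inches, na.rm = TRUE)
print(average)
```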