Data processing with R tidyverse

2 May 2017

Overview

This four-day course provides a complete introduction to data science in R with the tidyverse. It focuses on getting data ready, exploratory analysis, visualization and handling models.

Preparing data takes up to 90% of the time spent in analysis; speeding this up is the mission of this course.

This workshop is composed of 30 hours:

lectures

~ 8 hours
available online
very short exercises included
can be converted to PDF using Chrome

practical sessions

~ 15 hours
using your own laptop
teachers available
supplementary exercises if needed

bring your own data

~ 7 hours (Day 4)
teachers available
project provided for those that don't have data yet

Time table and contents

Day 1

Lecture 1
- Introduction to R and RStudio (AG)
Lecture 2
- Basic data types, control structures (AG)
Importing data into R
- Lecture 3 Markdown (EK)
- Lecture 4 readr (EK)
Setup Bring Your Own data Session (for Day 4)
Lecture 5
- Introduction to tidy data (RK)

Time table

(cont.)

Day 2

Lecture 6
- Data wrangling with dplyr (RK)
Lecture 7
- Visualization with ggplot2 (AG)

Time table

(cont.)

Day 3

Lecture 8
- Functional programming with purrr (EK)
Lecture 9
- Handling statistical models with broom (AG)
[Lecture 10]
- Programming with tidyeval (AG)

Day 4

Project work
- Project (BYOD) or
- Microarray analysis

What is R?

R is shorthand for "GNU R":

An interactive programming language derived from S
Focus on data analysis ("stats") and plotting
R is also shorthand for the ecosystem around this language
- Book authors
- Package developers
- Ordinary useRs

Learning to use R will make you more efficient and facilitate the use of advanced data analysis tools

Why use R?

It's free!
easy to install / maintain
easy to process big files and analyse huge amounts of data
integrated data visualization tools, even dynamic
fast, and even faster with C++ integration via Rcpp.
easy to get help
- huge R community on the web
- Stack Overflow, with many tags such as tidyverse, dplyr, ggplot2, etc.
- R-bloggers

Twitter R community

#rstats on twitter

Constant trend

From Touchon & McCoy. Ecosphere. 2016

Packages

+10,000 in Jan 2017

CRAN

reliable: package is checked during submission process

MRAN for Windows users
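
A typical CRAN install can be sketched as follows. `install_if_missing()` is a hypothetical helper written for this illustration (it is not part of R), and the package name is only an example:

```r
# Hypothetical helper: install a CRAN package only when it is missing.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE when the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here
install_if_missing("stats")
```

For a package that is not installed yet, the same call would go through `install.packages()` and therefore needs a configured CRAN mirror.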

bioconductor

dedicated to biology

typical install:

source("https://bioconductor.org/biocLite.R")
biocLite("limma")

GitHub

easy install thanks to devtools

# install.packages("devtools")
devtools::install_github("tidyverse/readr")

could be a security issue (see next slide)

CRAN install from RStudio

GitHub install from the RStudio console

Security

source: Bob Rudis' blog

Help pages

2 possibilities for manual pages.

?log
help(log)

In Rstudio, the help page can be viewed in the bottom right pane

Sadly

manual pages are often unhelpful; vignettes (and the articles on the tidyverse website) are now better and describe workflows.
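
Beyond `?log` and `help(log)`, a few other helpers from the `utils` package can make documentation easier to find. This is a minimal sketch; the search terms and the `grid` package are arbitrary examples:

```r
# Search the installed documentation by keyword (what ??logarithm does)
hits <- help.search("logarithm")

# List objects on the search path whose name matches a pattern
readers <- apropos("^read\\.")

# Run the examples given at the bottom of a help page
example(log)

# List the vignettes shipped with a package
grid_vignettes <- vignette(package = "grid")
```

Printing `hits` or `grid_vignettes` displays the matching help pages and vignettes.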

Drawback: Steep learning curve

Period of much suckiness

Tidyverse

creator

R base is complex, has a long history and many contributors

Why R is hard to learn

Unhelpful help: ?print
generic methods: print.data.frame
too many commands: open source
inconsistent names: read.csv, load, readRDS
inconsistent syntax: open source
too many ways to select variables: df$x, df$"x", df[,"x"], df[[1]]
[…] see r4stats' post for the full list
the tidyverse curse

source: Robert A. Muenchen's blog
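
The "too many ways to select variables" point can be checked directly. A small sketch on a toy data frame invented for this illustration:

```r
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"))

by_dollar   <- df$x
by_quoted   <- df$"x"
by_matrix   <- df[, "x"]
by_position <- df[[1]]
by_name     <- df[["x"]]

# All five notations return the same numeric vector
same_result <- identical(by_dollar, by_quoted) &&
  identical(by_dollar, by_matrix) &&
  identical(by_dollar, by_position) &&
  identical(by_dollar, by_name)

# Single brackets with a name return a one-column data frame instead
one_column <- df["x"]
```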

"Navigating the balance between base R and the tidyverse is a challenge to learn" (Robert A. Muenchen)

Tidyverse

creator

We think the tidyverse is better, especially for beginners. It is

recent (both an issue and an advantage)
unified
consistent

Hadley Wickham

Hadley, chief scientist at RStudio, coined the term tidyverse at the useR! meeting in 2016. He developed and maintains most of the core tidyverse packages.

Tidyverse

packages

Tidyverse

packages in processes

Tidyverse

workflow

Pipeline

David Robinson

@drob on twitter

Tidyverse criticism

core / extended

Core

ggplot2, for data visualization
dplyr, for data manipulation
tidyr, for data tidying
readr, for data import
purrr, for functional programming
tibble, for tibbles, a modern re-imagining of data frames

Extended

Working with specific types of vectors:
- hms, for times
- stringr, for strings
- lubridate, for date/times
- forcats, for factors
Importing other types of data:
- feather, for sharing with other languages
- haven, for SPSS, SAS and Stata files
- httr, for web apis
- jsonlite for JSON
- readxl, for .xls and .xlsx files
- rvest, for web scraping
- xml2, for XML
Modelling
- modelr, for modelling within a pipeline
- broom, for turning models into tidy data

source: http://tidyverse.tidyverse.org/ H.Wickham
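
To give a feel for how the core packages combine, here is a minimal sketch using tibble and dplyr. The data are invented for the illustration:

```r
library(tibble)
library(dplyr)

# Toy data invented for illustration
measurements <- tibble(
  sample = c("a", "a", "b", "b", "c"),
  value  = c(1, 3, 10, 20, NA)
)

# Drop missing values, then compute one mean per sample
summary_tbl <- measurements %>%
  filter(!is.na(value)) %>%
  group_by(sample) %>%
  summarise(mean_value = mean(value))
```

Sample "c" only has a missing value, so it disappears from `summary_tbl`.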

Tidyverse criticism

dialect

@ucfagls yeah. I think the tidyverse is a dialect. But its accent isn’t so thick
- Hadley Wickham (@hadleywickham), 12 January 2017

Tidyverse criticism

controversy

SO's comment

See the popularity of the data.table versus dplyr question.

Easily summarized: data.table is faster, but for fewer than 10 million rows the difference is negligible.

Tidyverse criticism

jobs

Realized today: #tidyverse R and base #rstats have little in common. Beware when looking for job which requires knowledge of R.
- Yeedle N. (@Yeedle), 2 March 2017

Personal complaints:

still young, so it changes quickly and drastically. Backward compatibility is not always maintained.
tibbles are nice, but a lot of non-tidyverse packages require matrices. Row names are still an issue.
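
One possible workaround for the matrix and row names issue, sketched with helpers from the tibble package. The expression table is invented for the illustration:

```r
library(tibble)

# Toy expression table invented for illustration
expr <- tibble(
  gene = c("g1", "g2", "g3"),
  s1   = c(1.2, 3.4, 5.6),
  s2   = c(2.1, 4.3, 6.5)
)

# Tibbles do not keep row names: move the identifier column into the
# row names of a plain data frame, then convert to a numeric matrix
expr_df  <- column_to_rownames(as.data.frame(expr), var = "gene")
expr_mat <- as.matrix(expr_df)

# And back again: row names become a regular column
expr_back <- as_tibble(
  rownames_to_column(as.data.frame(expr_mat), var = "gene")
)
```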

Anyway, learning the tidyverse does not prevent you from learning base R; it helps to get things done early in the process.

Tidyverse

trends

source: rdocumentation (2017/04/18)

Tidyverse

trends

source: rdocumentation (2017/04/18)

RStudio

Rstudio

What is it?

RStudio is an Integrated Development Environment.
It makes working with R much easier

Features

Console to run R, with syntax highlighter
Editor to work with scripts
Viewer for data / plots / website
Package management (including building)
Autocompletion using TAB
Cheatsheets
Git integration for versioning
Inline outputs (>= v1.03)
Keyboard shortcuts
Notebooks

Warning

Don't mix up R and RStudio.
R needs to be installed first.

Rstudio

The 4 panels layout

Four panels

scripting

could be your main window
should be an R Markdown document
tabs are great

Environment

Environment: displays loaded objects and their str()
History is useless IMO
nice git integration

Console

could be hidden with inline outputs
the preview release embeds a nice terminal tab

Files / Plots / Help

necessary package management tab
the Plots tab becomes useless when using inline outputs
very useful help tab

For reproducibility, options to activate / deactivate

Code diagnostics, highly recommended

using Global Options -> Code -> Diagnostics editing pane:

source: Kevin Ushey's article

check argument calls

missing arguments

variable definitions

unused variables & style recommendations

The dream team

Rstudio

Working directory and projects

The working directory is where R looks for files (to read or write).
Using the console, try:

> getwd()

setwd() or relative paths

It is possible to change the location using setwd() (in the console or interactively in RStudio)
A better way is to use projects in Rstudio

Projects

They solve most issues with working directories: get rid of setwd()
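
Inside a project, the working directory is the project folder, so files can be addressed with relative paths. A minimal sketch, where the `data` folder and `results.csv` file are hypothetical names:

```r
# Where R currently reads and writes files
current_dir <- getwd()

# A path relative to the working directory, built in a portable way
rel_path <- file.path("data", "results.csv")

# The absolute location this relative path resolves to
abs_path <- normalizePath(rel_path, mustWork = FALSE)

# Check that the file is there before trying to read it
has_file <- file.exists(rel_path)
```

Because nothing in this code mentions an absolute location, it keeps working when the project folder is moved or shared.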

Using `library()`

with only `base` loaded

x <- 1:10
filter(x, rep(1, 3))

Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Conflicts! when 2 packages export a function

with the same name, the latest loaded wins

library(dplyr)
filter(x, rep(1, 3))

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('integer', 'numeric')"

Solution

using the :: operator to call a function from a specific package

stats::filter(x, rep(1, 3))

Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA
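
The `::` notation works in both directions. The sketch below assumes dplyr is installed and calls each `filter()` by its full name, so the result does not depend on which packages are attached or in which order:

```r
# Which attached packages currently provide an object named "filter"?
providers <- find("filter")

# Moving sum over three values, from the stats package
smoothed <- stats::filter(1:10, rep(1, 3))

# Row selection on a data frame, from the dplyr package
four_cyl <- dplyr::filter(mtcars, cyl == 4)
```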

Pipes with magrittr

developed by Stefan Milton Bache

compare the classic nested-parentheses approach with the magrittr pipeline

R base

set.seed(12)
round(mean(rnorm(5)), 2)

[1] -0.76

magrittr

set.seed(12)
rnorm(5) %>%
  mean() %>%
  round(2)

[1] -0.76

Of note, the pipe needs to be loaded with one of:

library(magrittr)
library(dplyr)
library(tidyverse)
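
As a further sketch of how the pipe passes values along (the numbers are arbitrary): by default the left-hand side becomes the first argument of the next call, and a dot can send it to another argument instead.

```r
library(magrittr)

# The left-hand side becomes the first argument of the right-hand call
piped  <- c(1, 4, 9) %>% sqrt() %>% sum()
nested <- sum(sqrt(c(1, 4, 9)))

# The dot places the left-hand side somewhere other than first position
rounded <- 2 %>% round(3.14159, digits = .)
```

`piped` and `nested` hold the same value; only the reading order differs.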

Coding style

R is rather flexible and permissive with its syntax. However, being more strict tends to ease the debugging process.

See the tidyverse style guide recommendations

In summary:

Good

use spaces
use more lines
- } alone on its line, except for } else {
- using the pipe %>% to display a single instruction per line
- break list definitions, function arguments …
avoid using names of existing functions and variables
prefer snake_case to CamelCase

Bad

# example from http://adv-r.had.co.nz/Style.html
T <- FALSE
c <- 10
mean <- function(x) sum(x)

# lacks spaces for readability
average<-mean(feet/12+inches,na.rm=TRUE)
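
For contrast, a sketch of code that follows the recommendations listed under "Good". The function and variable names are invented for the illustration:

```r
# snake_case names that do not mask existing functions
heights_in_inches <- c(70, 64, NA)

# Spaces around operators and after commas
average_height <- mean(heights_in_inches, na.rm = TRUE)

# Closing braces alone on their line, except before else
describe_height <- function(total_inches) {
  if (is.na(total_inches)) {
    "unknown"
  } else {
    paste(total_inches %/% 12, "ft", total_inches %% 12, "in")
  }
}

described <- describe_height(70)
```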

Hexbins

After David Robinson's laptop: see and get inspired!

Hadley Wickham

Bob Rudis
