2 May 2017

Data transformation

Introduction

Preparing data is the most time consuming part of of data analysis

dplyr in a nutshell

  • dplyr is a tool box for working with data in tibbles/data frames
  • The most import data manipulation operations are covered
    • Selection and manipulation of observation, variables and values
    • Summarizing
    • Grouping
    • Joining and intersecting tibbles
  • In a workflow typically follows reshaping operations from tidyr
  • Fast, in particular for in-memory data by writing key pieces in C++ (using Rcpp)
  • Standard interface to work with data in a data.frame, a data.table or database.

Learning objectives

dplyr in a nutshell

  • Learn the basic vocabulary of dplyr
  • Exercise commands
  • Translating questions into data manipulation statements

dplyr

Introduction: Cheat sheets

dplyr

Introduction: Cheat sheets (cont.)

Installation

  • dplyr is included in the tidyverse package
  • biomaRt is part of bioconductor (check)
install.packages("dplyr")
# OR
install.packages("tidyverse")

source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")

Retrieving sample data

Gene from chromosome 21 from biomaRt

# Load the library
library(biomaRt)
gene_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org")
gene_set <- useDataset(gene_mart , dataset="hsapiens_gene_ensembl")

gene_by_exon <- getBM(
  mart = gene_set,
  attributes = c(
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "chromosome_name",
    "start_position",
    "end_position",
    "hgnc_symbol",
    "hgnc_id",
    "strand",
    "gene_biotype"
    ), 
  filter = "chromosome_name",
  value = "21"
  )

as_tibble()

Important with large tables

gene_by_exon <- as_tibble(gene_by_exon)
gene_by_exon
# A tibble: 2,417 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <int>
 1 ENSG00000264452       ENST00000583496              21       44439035
 2 ENSG00000274046       ENST00000617336              21        7092616
 3 ENSG00000278775       ENST00000619112              21        8433085
 4 ENSG00000276873       ENST00000616808              21       39171462
 5 ENSG00000236545       ENST00000458654              21       41870633
 6 ENSG00000160255       ENST00000302347              21       44885953
 7 ENSG00000160255       ENST00000518033              21       44885953
 8 ENSG00000160255       ENST00000523126              21       44885953
 9 ENSG00000160255       ENST00000521995              21       44885953
10 ENSG00000160255       ENST00000397846              21       44885953
# ... with 2,407 more rows, and 5 more variables: end_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>

glimpse()

Inspect data frames / tibbles

  • Use glimpse() to show some values and types per column.
  • The Environment tab in RStudio tab does it too
glimpse(gene_by_exon)
Observations: 2,417
Variables: 9
$ ensembl_gene_id       <chr> "ENSG00000264452", "ENSG00000274046", "E...
$ ensembl_transcript_id <chr> "ENST00000583496", "ENST00000617336", "E...
$ chromosome_name       <int> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, ...
$ start_position        <int> 44439035, 7092616, 8433085, 39171462, 41...
$ end_position          <int> 44439110, 7092716, 8433174, 39171560, 41...
$ hgnc_symbol           <chr> "", "", "", "", "", "ITGB2", "ITGB2", "I...
$ hgnc_id               <chr> "", "", "", "", "", "HGNC:6155", "HGNC:6...
$ strand                <int> 1, 1, 1, -1, 1, -1, -1, -1, -1, -1, -1, ...
$ gene_biotype          <chr> "snoRNA", "misc_RNA", "misc_RNA", "misc_...

Pipes in R

magrittr

select()

Selecting specific columns

select() is part of the biomaRt package, need to address dplyr-package!

gene_by_exon %>%
  dplyr::select(hgnc_symbol, chromosome_name, start_position, end_position)
# A tibble: 2,417 x 4
   hgnc_symbol chromosome_name start_position end_position
         <chr>           <int>          <int>        <int>
 1                          21       44439035     44439110
 2                          21        7092616      7092716
 3                          21        8433085      8433174
 4                          21       39171462     39171560
 5                          21       41870633     41872054
 6       ITGB2              21       44885953     44931989
 7       ITGB2              21       44885953     44931989
 8       ITGB2              21       44885953     44931989
 9       ITGB2              21       44885953     44931989
10       ITGB2              21       44885953     44931989
# ... with 2,407 more rows

select() helper functions

select() has many helper functions

  • contains(), starts_with(), ends_with() for literal strings
  • matches() for regex
  • one_of() variables in a character vector
  • everything() all remaining variable
  • Can be combined

select()

Negative selection

Dropping columns

We can drop columns by "negating" their names. Since helpers give us column names, we can negate them too. Note that the

gene_by_exon %>%
  dplyr::select(-strand, -starts_with("ensembl"),
         -ends_with("id"))
# A tibble: 2,417 x 5
   chromosome_name start_position end_position hgnc_symbol   gene_biotype
             <int>          <int>        <int>       <chr>          <chr>
 1              21       44439035     44439110                     snoRNA
 2              21        7092616      7092716                   misc_RNA
 3              21        8433085      8433174                   misc_RNA
 4              21       39171462     39171560                   misc_RNA
 5              21       41870633     41872054                  antisense
 6              21       44885953     44931989       ITGB2 protein_coding
 7              21       44885953     44931989       ITGB2 protein_coding
 8              21       44885953     44931989       ITGB2 protein_coding
 9              21       44885953     44931989       ITGB2 protein_coding
10              21       44885953     44931989       ITGB2 protein_coding
# ... with 2,407 more rows

filter()

inspect subsets of data

  • Take a look at the amyloid precursor protein gene APP
  • Note the rendering of the data.frame, not a tibble.
gene_by_exon %>%
  filter(hgnc_symbol == "APP") %>% 
  dplyr::select(hgnc_symbol, everything())
# A tibble: 17 x 9
   hgnc_symbol ensembl_gene_id ensembl_transcript_id chromosome_name
         <chr>           <chr>                 <chr>           <int>
 1         APP ENSG00000142192       ENST00000346798              21
 2         APP ENSG00000142192       ENST00000354192              21
 3         APP ENSG00000142192       ENST00000348990              21
 4         APP ENSG00000142192       ENST00000357903              21
 5         APP ENSG00000142192       ENST00000440126              21
 6         APP ENSG00000142192       ENST00000439274              21
 7         APP ENSG00000142192       ENST00000464867              21
 8         APP ENSG00000142192       ENST00000358918              21
 9         APP ENSG00000142192       ENST00000448850              21
10         APP ENSG00000142192       ENST00000415997              21
11         APP ENSG00000142192       ENST00000491395              21
12         APP ENSG00000142192       ENST00000474136              21
13         APP ENSG00000142192       ENST00000463070              21
14         APP ENSG00000142192       ENST00000548570              21
15         APP ENSG00000142192       ENST00000462267              21
16         APP ENSG00000142192       ENST00000466453              21
17         APP ENSG00000142192       ENST00000359726              21
# ... with 5 more variables: start_position <int>, end_position <int>,
#   hgnc_id <chr>, strand <int>, gene_biotype <chr>

filter()

Multiple conditions, AND (&)

Genes in a particular range

# Comma separated conditions are combined with '&'
gene_by_exon %>%
  filter(start_position > 10000000, end_position < 12000000) %>% 
  dplyr::select(hgnc_symbol, start_position, end_position)
# A tibble: 21 x 3
   hgnc_symbol start_position end_position
         <chr>          <int>        <int>
 1 IGHV1OR21-1       10649400     10649835
 2       BAGE2       10413477     10516431
 3       BAGE2       10413477     10516431
 4       BAGE2       10413477     10516431
 5                   10482738     10605716
 6                   10397644     10397778
 7                   10328411     10342737
 8                   10328411     10342737
 9                   10136419     10137004
10  SLC25A15P4       10612455     10613379
# ... with 11 more rows

filter()

Multiple conditions, OR (|)

Genes close to the telomers

gene_by_exon %>%
  filter(start_position < 7000000 | end_position >46665124 )
# A tibble: 131 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <int>
 1 ENSG00000284550       ENST00000616627              21        6859171
 2 ENSG00000275496       ENST00000621924              21        6228966
 3 ENSG00000275496       ENST00000617746              21        6228966
 4 ENSG00000275496       ENST00000624446              21        6228966
 5 ENSG00000275496       ENST00000623405              21        6228966
 6 ENSG00000275496       ENST00000623575              21        6228966
 7 ENSG00000275496       ENST00000623506              21        6228966
 8 ENSG00000275496       ENST00000619488              21        6228966
 9 ENSG00000279064       ENST00000623723              21        5707004
10 ENSG00000276076       ENST00000619537              21        6560714
# ... with 121 more rows, and 5 more variables: end_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>

filter()

Set operations

  • The two below are equivalent
  • For larger operations use inner_join()

Set operations

gene_by_exon %>%
  filter(is.element(hgnc_symbol, c("APP", "ATP50", "PRMT2")))
# A tibble: 33 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <int>
 1 ENSG00000160310       ENST00000355680              21       46635167
 2 ENSG00000160310       ENST00000397638              21       46635167
 3 ENSG00000160310       ENST00000291705              21       46635167
 4 ENSG00000160310       ENST00000397637              21       46635167
 5 ENSG00000160310       ENST00000334494              21       46635167
 6 ENSG00000160310       ENST00000397628              21       46635167
 7 ENSG00000160310       ENST00000440086              21       46635167
 8 ENSG00000160310       ENST00000482508              21       46635167
 9 ENSG00000160310       ENST00000481861              21       46635167
10 ENSG00000160310       ENST00000455177              21       46635167
# ... with 23 more rows, and 5 more variables: end_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>

Set operations

gene_by_exon %>%
  filter(hgnc_symbol %in% c("APP", "ATP50", "PRMT2"))
# A tibble: 33 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <int>
 1 ENSG00000160310       ENST00000355680              21       46635167
 2 ENSG00000160310       ENST00000397638              21       46635167
 3 ENSG00000160310       ENST00000291705              21       46635167
 4 ENSG00000160310       ENST00000397637              21       46635167
 5 ENSG00000160310       ENST00000334494              21       46635167
 6 ENSG00000160310       ENST00000397628              21       46635167
 7 ENSG00000160310       ENST00000440086              21       46635167
 8 ENSG00000160310       ENST00000482508              21       46635167
 9 ENSG00000160310       ENST00000481861              21       46635167
10 ENSG00000160310       ENST00000455177              21       46635167
# ... with 23 more rows, and 5 more variables: end_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>

arrange()

Sort columns

Perform a nested sorting of all Genes:

1. hgnc_symbol
2. gene_biotype
3. end_position
gene_by_exon %>%
  arrange(hgnc_symbol, gene_biotype, end_position) %>% 
  dplyr::select(hgnc_symbol, gene_biotype, end_position, everything())
# A tibble: 2,417 x 9
   hgnc_symbol gene_biotype end_position ensembl_gene_id
         <chr>        <chr>        <int>           <chr>
 1                antisense      5087867 ENSG00000279687
 2                antisense     14144468 ENSG00000224905
 3                antisense     14144468 ENSG00000224905
 4                antisense     14144468 ENSG00000224905
 5                antisense     14947096 ENSG00000235277
 6                antisense     14964233 ENSG00000229047
 7                antisense     14992854 ENSG00000231201
 8                antisense     17810845 ENSG00000244676
 9                antisense     17810845 ENSG00000244676
10                antisense     21226829 ENSG00000226771
# ... with 2,407 more rows, and 5 more variables:
#   ensembl_transcript_id <chr>, chromosome_name <int>,
#   start_position <int>, hgnc_id <chr>, strand <int>

arrange()

desc to reverse sort columns

Find the last gene on chromosome 21.

gene_by_exon %>%
  arrange(desc(end_position)) %>%
  dplyr::select(end_position, everything()) # way to reorder arr_delay 1st column
# A tibble: 2,417 x 9
   end_position ensembl_gene_id ensembl_transcript_id chromosome_name
          <int>           <chr>                 <chr>           <int>
 1     46691226 ENSG00000212932       ENST00000427757              21
 2     46665124 ENSG00000160310       ENST00000355680              21
 3     46665124 ENSG00000160310       ENST00000397638              21
 4     46665124 ENSG00000160310       ENST00000291705              21
 5     46665124 ENSG00000160310       ENST00000397637              21
 6     46665124 ENSG00000160310       ENST00000334494              21
 7     46665124 ENSG00000160310       ENST00000397628              21
 8     46665124 ENSG00000160310       ENST00000440086              21
 9     46665124 ENSG00000160310       ENST00000482508              21
10     46665124 ENSG00000160310       ENST00000481861              21
# ... with 2,407 more rows, and 5 more variables: start_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>
gene_by_exon %>%
  dplyr::select(hgnc_symbol, 
                starts_with("ensembl"),
                contains("position"),
                -contains("transcript"))
# A tibble: 2,417 x 4
   hgnc_symbol ensembl_gene_id start_position end_position
         <chr>           <chr>          <int>        <int>
 1             ENSG00000264452       44439035     44439110
 2             ENSG00000274046        7092616      7092716
 3             ENSG00000278775        8433085      8433174
 4             ENSG00000276873       39171462     39171560
 5             ENSG00000236545       41870633     41872054
 6       ITGB2 ENSG00000160255       44885953     44931989
 7       ITGB2 ENSG00000160255       44885953     44931989
 8       ITGB2 ENSG00000160255       44885953     44931989
 9       ITGB2 ENSG00000160255       44885953     44931989
10       ITGB2 ENSG00000160255       44885953     44931989
# ... with 2,407 more rows

Verbs for inspecting data

Summary

  • as_tibble() (dplyr::as_data_frame) to convert to a tibble (different from as.data.frame in R base)
  • glimpse() - some of each column
  • filter() - subsetting
  • arrange() - sorting (desc to reverse the sort)
  • select() - picking (and omiting) columns

Value transformation

mutate()

rename()

Renaming columns

  • Rename columns with rename(new_name = old_name).
    • To keep the order correct, read/remember the renaming = as "was".
    • Rename will replace column names
gene_by_exon %>%
  rename(stop = end_position, start = start_position)
# A tibble: 2,417 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name    start     stop
             <chr>                 <chr>           <int>    <int>    <int>
 1 ENSG00000264452       ENST00000583496              21 44439035 44439110
 2 ENSG00000274046       ENST00000617336              21  7092616  7092716
 3 ENSG00000278775       ENST00000619112              21  8433085  8433174
 4 ENSG00000276873       ENST00000616808              21 39171462 39171560
 5 ENSG00000236545       ENST00000458654              21 41870633 41872054
 6 ENSG00000160255       ENST00000302347              21 44885953 44931989
 7 ENSG00000160255       ENST00000518033              21 44885953 44931989
 8 ENSG00000160255       ENST00000523126              21 44885953 44931989
 9 ENSG00000160255       ENST00000521995              21 44885953 44931989
10 ENSG00000160255       ENST00000397846              21 44885953 44931989
# ... with 2,407 more rows, and 4 more variables: hgnc_symbol <chr>,
#   hgnc_id <chr>, strand <int>, gene_biotype <chr>

mutate()

Add or replace columns

Gene length

  • Note that new variables can be used right away
gene_by_exon %>%
  mutate(  length = end_position - start_position) %>% 
  dplyr::select(hgnc_symbol, length)  
# A tibble: 2,417 x 2
   hgnc_symbol length
         <chr>  <int>
 1                 75
 2                100
 3                 89
 4                 98
 5               1421
 6       ITGB2  46036
 7       ITGB2  46036
 8       ITGB2  46036
 9       ITGB2  46036
10       ITGB2  46036
# ... with 2,407 more rows

distinct()

Unique rows

This is not a gene set for statistics

gene_by_exon %>%
  filter(gene_biotype == "protein_coding") %>% 
  mutate(  length = end_position - start_position) %>% 
  dplyr::select(hgnc_symbol, length) %>% 
  distinct()
# A tibble: 232 x 2
   hgnc_symbol length
         <chr>  <int>
 1       ITGB2  46036
 2       S100B   6246
 3       DIP2A 110953
 4   KRTAP12-2    738
 5   KRTAP12-4    446
 6        PCNT 121647
 7   KRTAP12-3    409
 8   KRTAP10-6   1237
 9   KRTAP12-1    587
10   KRTAP10-2   1148
# ... with 222 more rows

mutate_at(), mutate_all(), mutate_if()

Changing many columns

Usage

  • Use the select helpers with mutate_at()
  • Use column conditions for mutate_if()
  • vars() to wrap variables and funs() to wrap functions

Coordinate conversion

gene_by_exon %>% 
  mutate_at(vars(contains("position")), funs(. + 1))
# A tibble: 2,417 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <dbl>
 1 ENSG00000264452       ENST00000583496              21       44439036
 2 ENSG00000274046       ENST00000617336              21        7092617
 3 ENSG00000278775       ENST00000619112              21        8433086
 4 ENSG00000276873       ENST00000616808              21       39171463
 5 ENSG00000236545       ENST00000458654              21       41870634
 6 ENSG00000160255       ENST00000302347              21       44885954
 7 ENSG00000160255       ENST00000518033              21       44885954
 8 ENSG00000160255       ENST00000523126              21       44885954
 9 ENSG00000160255       ENST00000521995              21       44885954
10 ENSG00000160255       ENST00000397846              21       44885954
# ... with 2,407 more rows, and 5 more variables: end_position <dbl>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>

Coordinate conversion

gene_by_exon %>% 
  mutate_if(is.numeric, funs(. + 1))
# A tibble: 2,417 x 9
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <dbl>          <dbl>
 1 ENSG00000264452       ENST00000583496              22       44439036
 2 ENSG00000274046       ENST00000617336              22        7092617
 3 ENSG00000278775       ENST00000619112              22        8433086
 4 ENSG00000276873       ENST00000616808              22       39171463
 5 ENSG00000236545       ENST00000458654              22       41870634
 6 ENSG00000160255       ENST00000302347              22       44885954
 7 ENSG00000160255       ENST00000518033              22       44885954
 8 ENSG00000160255       ENST00000523126              22       44885954
 9 ENSG00000160255       ENST00000521995              22       44885954
10 ENSG00000160255       ENST00000397846              22       44885954
# ... with 2,407 more rows, and 5 more variables: end_position <dbl>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <dbl>, gene_biotype <chr>

Grouping and summarizing

group_by()

Grouping

  • Not a summarising transformation but frequently used in conjunction
  • Grouping by more than one or more variables
  • Use ungroup()

Let's compute the average number of transcripts

gene_by_exon %>%
  group_by(ensembl_gene_id, hgnc_symbol) %>% 
  summarise(nexon =n_distinct(ensembl_transcript_id))
# A tibble: 841 x 3
# Groups:   ensembl_gene_id [?]
   ensembl_gene_id hgnc_symbol nexon
             <chr>       <chr> <int>
 1 ENSG00000141956      PRDM15    16
 2 ENSG00000141959        PFKL    12
 3 ENSG00000142149        HUNK     4
 4 ENSG00000142156      COL6A1     7
 5 ENSG00000142166      IFNAR1     3
 6 ENSG00000142168        SOD1     4
 7 ENSG00000142173      COL6A2     8
 8 ENSG00000142178        SIK1     2
 9 ENSG00000142182      DNMT3L     4
10 ENSG00000142185       TRPM2     8
# ... with 831 more rows

count()

Replacing calls to table()

head(as.data.frame(table(gene_by_exon$hgnc_symbol) ))
     Var1 Freq
1          454
2   AATBC    3
3  ABCC13    6
4   ABCG1   11
5 ADAMTS1    6
6 ADAMTS5    1
  • count() does the grouping for you
gene_by_exon %>%
  count(hgnc_symbol)
# A tibble: 522 x 2
   hgnc_symbol     n
         <chr> <int>
 1               454
 2       AATBC     3
 3      ABCC13     6
 4       ABCG1    11
 5     ADAMTS1     6
 6     ADAMTS5     1
 7      ADARB1    15
 8      AGPAT3    18
 9        AIRE     5
10 ANKRD20A11P     6
# ... with 512 more rows

Non-standard evaluation

Basics

  • filter(), mutate(), summarise(), arrange(), select() and group_by() take unquoted variable names.
  • Non-standard evaluation is standard in many tidyverse functions
  • Reduces typing
  • Enhances speed on back-end
  • Use of lazyeval package
  • Verbs have alternative version with trailing underscore (select_) for standard evaluation
  • The system is to be "soft deprectated" in dplyr 0.6.0, set to be released soon.
  • Vignette "Non-standard evaluation"

Examples

#NSE 
gene_by_exon %>% 
  summarize(mean(start_position))
# A tibble: 1 x 1
  `mean(start_position)`
                   <dbl>
1               32182221
# SE
gene_by_exon %>% 
  summarize_("mean(start_position)") #!
# A tibble: 1 x 1
  `mean(start_position)`
                   <dbl>
1               32182221
gene_by_exon %>% 
  summarize_(quote(mean(start_position))) #!
# A tibble: 1 x 1
  `mean(start_position)`
                   <dbl>
1               32182221
gene_by_exon %>% 
  summarize_(~mean(start_position)) #!
# A tibble: 1 x 1
  `mean(start_position)`
                   <dbl>
1               32182221

Joining data frames

Sample data

Common manual input

gene_interest <- tribble(~symbol, ~gene_interest, 
                   "ATP50", "yes",
                   "APP",   "yes",
                   "DIP2A", "yes",
                   "TP53",  "yes")

Inner join

gene_by_exon %>% 
  inner_join(gene_interest, 
             by = c("hgnc_symbol"="symbol")) 
# A tibble: 29 x 10
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <int>
 1 ENSG00000160305       ENST00000400274              21       46458899
 2 ENSG00000160305       ENST00000457905              21       46458899
 3 ENSG00000160305       ENST00000466639              21       46458899
 4 ENSG00000160305       ENST00000435722              21       46458899
 5 ENSG00000160305       ENST00000417564              21       46458899
 6 ENSG00000160305       ENST00000473752              21       46458899
 7 ENSG00000160305       ENST00000494435              21       46458899
 8 ENSG00000160305       ENST00000480553              21       46458899
 9 ENSG00000160305       ENST00000472364              21       46458899
10 ENSG00000160305       ENST00000481883              21       46458899
# ... with 19 more rows, and 6 more variables: end_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>,
#   gene_interest <chr>
  # provide the equivalence since columns have a different name

Outer join

gene_by_exon %>% 
  full_join(gene_interest, 
             by = c("hgnc_symbol"="symbol")) 
# A tibble: 2,419 x 10
   ensembl_gene_id ensembl_transcript_id chromosome_name start_position
             <chr>                 <chr>           <int>          <int>
 1 ENSG00000264452       ENST00000583496              21       44439035
 2 ENSG00000274046       ENST00000617336              21        7092616
 3 ENSG00000278775       ENST00000619112              21        8433085
 4 ENSG00000276873       ENST00000616808              21       39171462
 5 ENSG00000236545       ENST00000458654              21       41870633
 6 ENSG00000160255       ENST00000302347              21       44885953
 7 ENSG00000160255       ENST00000518033              21       44885953
 8 ENSG00000160255       ENST00000523126              21       44885953
 9 ENSG00000160255       ENST00000521995              21       44885953
10 ENSG00000160255       ENST00000397846              21       44885953
# ... with 2,409 more rows, and 6 more variables: end_position <int>,
#   hgnc_symbol <chr>, hgnc_id <chr>, strand <int>, gene_biotype <chr>,
#   gene_interest <chr>

Relational operations

mutating joins

lead() and lag()

Comparing rows above and below

Calculate intergenic distance

gene_by_exon %>%
  dplyr::select(ensembl_gene_id, start_position, end_position) %>% 
  distinct() %>% 
  arrange(start_position) %>% 
  mutate(intergenic_length = start_position - lag(end_position)) 
# A tibble: 837 x 4
   ensembl_gene_id start_position end_position intergenic_length
             <chr>          <int>        <int>             <int>
 1 ENSG00000279493        5011799      5017145                NA
 2 ENSG00000277117        5022493      5040666              5348
 3 ENSG00000279687        5073458      5087867             32792
 4 ENSG00000280071        5079294      5128425             -8573
 5 ENSG00000276612        5116343      5133805            -12082
 6 ENSG00000275464        5130871      5154734             -2934
 7 ENSG00000280433        5155499      5165472               765
 8 ENSG00000279669        5232668      5243833             67196
 9 ENSG00000279094        5499151      5502542            255318
10 ENSG00000274333        5553637      5614880             51095
# ... with 827 more rows

Summary

Most commonly used - 80%

  • select() - columns
  • filter() - rows meeting condition
  • arrange() - sort
  • glimpse() - inspect
  • rename() - change column name
  • mutate() - copy manipulated values
  • group_by(), ungroup()
  • summarise() - group-wise, table-wise summaries
  • lead() and lag() - Values in other rows
  • inner_join and friends - Merging tables

Other 20%

Occasionally handy

  • Assembly: bind_rows, bind_cols
  • Windows function, min_rank, dense_rank, cumsum. See vignette
  • do: arbitrary code on each chunk. To be replaced by tidyr::nest() %>% mutate(purrr::map()) data.tables)

Data base access

SQL mapping

  • dplyr code can be translated into SQL and query databases online (using dbplyr)
  • different types of tabular data (dplyr SQL backend, databases,

Bigger data, go for data.table

  • See this interesting thread about comparing data.table versus dplyr
  • data.table, see introduction is very efficient but the syntax is not so easy.
  • Main advantage: inline replacement (tidyverse is frequently copying)
  • As a summary: tl;dr data.table for speed, dplyr for readability and convenience Prashanth Sriram
  • Hadley recommends that for data > 1-2 Gb, if speed is your main matter, go for data.table
  • dtplyr is a dplyr interface to data.table but slower than native data.table

Thank you for your attention