Practical data transformation

This short tutorial lets you explore the dplyr functionality covered in the previous lecture. Every question can be answered with a single chain of verbs connected by %>% pipes. Avoid temporary variables and functions from outside the tidyverse.
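As a rough illustration of the piping style asked for here, the sketch below answers the same question twice, once with temporary variables and once as a single pipeline. It uses the built-in mtcars data set purely as a stand-in; the object names with_temps and piped are made up for this example.

```r
library(dplyr)

# Style to avoid in this tutorial: intermediate results in temporary variables
tmp <- filter(mtcars, cyl == 4)
tmp <- arrange(tmp, desc(mpg))
with_temps <- head(tmp, 3)

# Preferred style: the same steps chained with %>% in one statement
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

# Both approaches give the same three rows
identical(with_temps, piped)
```

The pipeline reads top to bottom in the order the steps are applied, which is the main reason to prefer it for the questions in this tutorial.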

Tip

If biomaRt is not installed, install it with

# R >= 3.5: biocLite() has been superseded by BiocManager
install.packages("BiocManager")
BiocManager::install("biomaRt")

Get the data for chromosome 21 from biomaRt. Work in an Rmarkdown document and set the chunk option ‘cache = TRUE’ so that the query is not sent to Ensembl again on every knit.

library(biomaRt)
library(tidyverse)  # provides as_tibble() and the dplyr verbs used below

## Loading required package: methods

gene_mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")
gene_set <- useDataset(mart = gene_mart, dataset = "hsapiens_gene_ensembl")

# Named genes_by_exon to match the name used in the questions below
genes_by_exon <- as_tibble(getBM(
  mart = gene_set,
  attributes = c(
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "ensembl_exon_id",
    "chromosome_name",
    "start_position",
    "end_position",
    "hgnc_symbol",
    "hgnc_id",
    "strand",
    "gene_biotype",
    "phenotype_description"
    ), 
  filters = "chromosome_name",
  values = "21"
  ))
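One way to switch on the caching mentioned above is a knitr call in a setup chunk near the top of the Rmarkdown document. This is only a sketch of one option: it turns on caching for every chunk, whereas the same option can also be set in the header of the biomaRt chunk alone.

```r
# Assumed to sit in a setup chunk at the top of the Rmarkdown document.
# Caches the results of all later chunks, including the biomaRt query,
# so that they are only re-run when the chunk code changes.
knitr::opts_chunk$set(cache = TRUE)
```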

Extract the processed pseudogenes from the genes_by_exon data set. Convert genes_by_exon to a tibble if it is not one already. Use glimpse() to find the relevant columns and distinct() to identify how pseudogenes are coded. Store the result in a tibble called pseudogenes.
Count the number of pseudogenes in the set (without using table(), obviously).
Extract a unique set of gene IDs without the redundancy introduced by transcript and exon information. Store the result in a tibble called genes.
Sort the genes by their length.
Calculate the average gene length by gene_biotype.
Calculate the total number of genes and their average length by gene_biotype.
What is the most frequent single word in the phenotype descriptions on chromosome 21? Split the column with separate(), reshape the resulting columns with gather(), and count the words, all in a single dplyr statement.
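The verbs needed for the questions above can be sketched on a small invented table before running them on the real data. Everything in toy below (gene IDs, coordinates, phenotypes) is hypothetical and only mimics the column names returned by the biomaRt query. Computing gene length as end minus start plus one is an assumption of this sketch, as is the limit of two words per description.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per exon, all values invented
toy <- tibble(
  ensembl_gene_id = c("G1", "G1", "G2", "G3", "G3", "G4"),
  ensembl_exon_id = c("E1", "E2", "E3", "E4", "E5", "E6"),
  start_position  = c(100, 100, 1000, 2000, 2000, 5000),
  end_position    = c(600, 600, 1200, 2900, 2900, 5100),
  gene_biotype    = c("protein_coding", "protein_coding", "processed_pseudogene",
                      "protein_coding", "protein_coding", "processed_pseudogene"),
  phenotype_description = c("muscle weakness", "muscle weakness", NA,
                            "hearing loss", "muscle atrophy", NA)
)

# One row per gene: drop exon-level columns, de-duplicate, sort by length
toy_genes <- toy %>%
  select(ensembl_gene_id, start_position, end_position, gene_biotype) %>%
  distinct() %>%
  mutate(length = end_position - start_position + 1) %>%  # assumed definition
  arrange(desc(length))

# Count the genes of one biotype without table()
n_pseudo <- toy_genes %>%
  filter(gene_biotype == "processed_pseudogene") %>%
  count()

# Number of genes and mean length per biotype
biotype_summary <- toy_genes %>%
  group_by(gene_biotype) %>%
  summarise(n_genes = n(), mean_length = mean(length))

# Word frequencies: split into columns, stack the columns, count
word_counts <- toy %>%
  distinct(ensembl_gene_id, phenotype_description) %>%
  filter(!is.na(phenotype_description)) %>%
  separate(phenotype_description, into = c("word1", "word2"),
           sep = " ", fill = "right", extra = "drop") %>%
  gather(key = "position", value = "word", word1, word2) %>%
  filter(!is.na(word)) %>%
  count(word, sort = TRUE)
```

On the real data the descriptions contain more than two words, so the into argument of separate() needs more column names; with extra = "drop", any words beyond the named columns are silently discarded.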

dplyr

Roland Krause

3 May 2017