This short tutorial lets you explore the dplyr functionality covered in the previous lecture. Every question can be answered with a single chain of %>% pipes. Avoid temporary variables and functions from outside the tidyverse.

Tip

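If you are missing `biomaRt`, install it with

# biocLite() was retired in Bioconductor 3.8; BiocManager is its replacement
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")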
  1. Get the data for chromosome 21 from biomaRt. Work in R Markdown and set the chunk option ‘cache = TRUE’ so that the download is not repeated every time the document is knitted.
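library(biomaRt)
## Loading required package: methods
# Load the tidyverse after biomaRt so that dplyr::select() is not masked; as_tibble() comes from here
library(tidyverse)
gene_mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "https://www.ensembl.org")
gene_set <- useDataset(dataset = "hsapiens_gene_ensembl", mart = gene_mart)

# Download exon-level annotation for chromosome 21 and convert it to a tibble
genes_by_exon <- as_tibble(getBM(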
  mart = gene_set,
  attributes = c(
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "ensembl_exon_id",
    "chromosome_name",
    "start_position",
    "end_position",
    "hgnc_symbol",
    "hgnc_id",
    "strand",
    "gene_biotype",
    "phenotype_description"
    ), 
  filters = "chromosome_name",
  values = "21"
  ))
  1. Extract the processed pseudogenes from the genes_by_exon data set. Convert genes_by_exon to a tibble if it is not one already. Use glimpse() to find the relevant columns and distinct() to see how pseudogenes are coded. Store the result in a tibble called pseudogenes.

  2. Count the number of pseudogenes in the set (without using table(), obviously).

  3. Extract a unique set of gene ids, without the redundancy introduced by the transcript and exon information. Store the result in a tibble called genes.

  4. Sort the genes by their length.

  5. Calculate the average length per gene by gene_biotype.

  6. Calculate the total number of genes and their average length by gene_biotype.

  7. What is the most frequent single word in the phenotype descriptions on chromosome 21? Split the column with separate(), gather() the resulting columns and count() the words, all in a single dplyr statement.
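
The verbs these tasks rely on can be tried offline first. The following is only a sketch on a small invented tibble, not a solution on the real download: every value in toy is made up, the names toy, toy_genes, by_biotype and word_counts are hypothetical, and defining gene length as end_position - start_position + 1 is an assumption. Results are stored in variables here only so they can be inspected afterwards.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for genes_by_exon; all values are invented for illustration.
toy <- tibble(
  ensembl_gene_id = c("G1", "G1", "G2", "G3", "G3", "G4"),
  ensembl_exon_id = c("E1", "E2", "E3", "E4", "E5", "E6"),
  start_position  = c(100, 100, 500, 900, 900, 2000),
  end_position    = c(400, 400, 650, 1900, 1900, 2100),
  gene_biotype    = c("protein_coding", "protein_coding", "processed_pseudogene",
                      "protein_coding", "protein_coding", "processed_pseudogene"),
  phenotype_description = c("heart defect", "heart defect", NA,
                            "septal defect", "septal defect", NA)
)

# distinct() lists the unique values of a column, here the biotype coding.
toy %>% distinct(gene_biotype)

# One row per gene (exon-level redundancy dropped), with a length column
# (assumed definition: end - start + 1), sorted from longest to shortest.
toy_genes <- toy %>%
  distinct(ensembl_gene_id, start_position, end_position, gene_biotype) %>%
  mutate(length = end_position - start_position + 1) %>%
  arrange(desc(length))

# Number of genes and mean length per biotype in one summarise() call.
by_biotype <- toy_genes %>%
  group_by(gene_biotype) %>%
  summarise(n_genes = n(), mean_length = mean(length))

# Word frequencies: one description per gene, split into word columns,
# gathered into a single column, then counted.
word_counts <- toy %>%
  distinct(ensembl_gene_id, phenotype_description) %>%
  filter(!is.na(phenotype_description)) %>%
  separate(phenotype_description, into = c("w1", "w2"), sep = " ", fill = "right") %>%
  gather(key = "position", value = "word", w1, w2) %>%
  filter(!is.na(word)) %>%
  count(word, sort = TRUE)
```

On the real data the descriptions contain more than two words, so the into argument of separate() would need more column names than the two used in this sketch.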