This short tutorial lets you explore the dplyr functionality covered in the previous lecture. Every question can be answered with a single chain of verbs connected by %>% pipes. Avoid temporary variables and functions from outside the tidyverse.
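As a reminder of what a single piped statement looks like, here is a minimal sketch. The `toy` table and its columns are invented for illustration and are not part of the exercise data.

```r
library(dplyr)
library(tibble)

# Toy data, invented for illustration only
toy <- tibble(
  gene   = c("A", "B", "C", "D"),
  type   = c("x", "y", "x", "y"),
  length = c(100, 250, 300, 50)
)

# One pipeline, no intermediate objects:
# keep long genes, sort them, then keep two columns
long_genes <- toy %>%
  filter(length > 75) %>%
  arrange(desc(length)) %>%
  select(gene, length)
```

Each verb receives the result of the previous one as its first argument, so no temporary variable is needed between steps.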
If biomaRt is not installed, install it with
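# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")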
library(biomaRt)
## Loading required package: methods
library(tidyverse)  # provides as_tibble() and the dplyr/tidyr verbs
gene_mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")
gene_set <- useDataset(dataset = "hsapiens_gene_ensembl", mart = gene_mart)
# Fetch the annotation for chromosome 21 and convert it to a tibble
genes_by_exon <- as_tibble(getBM(
  mart = gene_set,
  attributes = c(
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "ensembl_exon_id",
    "chromosome_name",
    "start_position",
    "end_position",
    "hgnc_symbol",
    "hgnc_id",
    "strand",
    "gene_biotype",
    "phenotype_description"
  ),
  filters = "chromosome_name",
  values = "21"
))
Extract the processed pseudogenes from the genes_by_exon data set. Convert the genes_by_exon data set to a tibble. Use glimpse() to find the correct columns and distinct() to identify how pseudogenes are coded. Store the results in a tibble called pseudogenes.
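One possible approach, sketched on a small stand-in table: `toy_exons` and its rows are invented, and the label "processed_pseudogene" is an assumption that should be checked against the distinct() output on the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for genes_by_exon; the rows are invented
toy_exons <- tribble(
  ~ensembl_gene_id, ~ensembl_exon_id, ~gene_biotype,
  "G1", "E1", "protein_coding",
  "G1", "E2", "protein_coding",
  "G2", "E3", "processed_pseudogene",
  "G3", "E4", "lncRNA",
  "G4", "E5", "processed_pseudogene"
)

glimpse(toy_exons)                    # column names and types at a glance
toy_exons %>% distinct(gene_biotype)  # the labels actually present

# Assumption: pseudogenes are coded as "processed_pseudogene" in gene_biotype
pseudogenes <- toy_exons %>%
  filter(gene_biotype == "processed_pseudogene")
```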
Count the number of pseudogenes in the set (without using table(), obviously).
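A hedged sketch of the counting step: because each row describes an exon, counting rows would overcount genes, so this version counts distinct gene ids. `pseudo_toy` is an invented stand-in for the pseudogenes tibble.

```r
library(dplyr)
library(tibble)

# Invented stand-in: gene G2 appears twice because it has two exons
pseudo_toy <- tibble(
  ensembl_gene_id = c("G2", "G2", "G4"),
  ensembl_exon_id = c("E3", "E6", "E5"),
  gene_biotype    = "processed_pseudogene"
)

# Count genes, not rows
n_pseudogenes <- pseudo_toy %>%
  summarise(n = n_distinct(ensembl_gene_id)) %>%
  pull(n)
```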
Extract a unique set of gene ids without the redundancy introduced by transcript and exon information. Store the results in a tibble called genes.
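A minimal sketch of one way to do this, assuming the redundancy comes only from the transcript and exon id columns. `toy_exons` is invented for illustration.

```r
library(dplyr)
library(tibble)

# Invented stand-in: G1 is repeated once per transcript/exon combination
toy_exons <- tribble(
  ~ensembl_gene_id, ~ensembl_transcript_id, ~ensembl_exon_id, ~start_position, ~end_position, ~gene_biotype,
  "G1", "T1", "E1",  100,  900, "protein_coding",
  "G1", "T1", "E2",  100,  900, "protein_coding",
  "G1", "T2", "E1",  100,  900, "protein_coding",
  "G2", "T3", "E3", 2000, 2300, "processed_pseudogene"
)

# Drop the columns that vary within a gene, then collapse duplicate rows
genes <- toy_exons %>%
  select(-ensembl_transcript_id, -ensembl_exon_id) %>%
  distinct()
```

On the real table, any other column that can vary within a gene (phenotype_description is a likely candidate) would probably have to be dropped as well before distinct() yields one row per gene.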
Sort the genes by their length.
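The data set has no length column, so a sketch of this step first has to derive one. The example below assumes inclusive coordinates (hence the `+ 1`) and sorts longest first; both choices are assumptions, and `genes_toy` is invented.

```r
library(dplyr)
library(tibble)

# Invented stand-in for the genes tibble
genes_toy <- tibble(
  ensembl_gene_id = c("G1", "G2", "G3"),
  start_position  = c(100, 2000, 5000),
  end_position    = c(900, 2300, 9000)
)

genes_sorted <- genes_toy %>%
  mutate(length = end_position - start_position + 1) %>%  # assumes inclusive coordinates
  arrange(desc(length))                                   # longest gene first
```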
Calculate the average gene length for each gene_biotype.
Calculate the total number of genes and their average length by gene_biotype.
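Both per-biotype summaries can come out of one group_by() and summarise() chain. This is a sketch on invented data (`genes_toy`), again assuming inclusive coordinates when computing the length.

```r
library(dplyr)
library(tibble)

# Invented stand-in with one row per gene
genes_toy <- tibble(
  ensembl_gene_id = c("G1", "G2", "G3", "G4"),
  gene_biotype    = c("protein_coding", "processed_pseudogene",
                      "protein_coding", "processed_pseudogene"),
  start_position  = c(101, 2001, 5001, 7001),
  end_position    = c(900, 2300, 9000, 7500)
)

biotype_summary <- genes_toy %>%
  mutate(length = end_position - start_position + 1) %>%
  group_by(gene_biotype) %>%
  summarise(
    n_genes     = n(),          # number of genes per biotype
    mean_length = mean(length)  # average length per biotype
  )
```

Dropping the `n_genes` line gives the average length alone.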
What is the most frequent single word in the phenotype description on chromosome 21? Split the column using separate(), gather() the columns, and count() in a single dplyr statement.
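A sketch of the separate, gather, count pattern on invented strings. The number of word columns (`w1` to `w4`) is an assumption and must be at least the word count of the longest description in the real data. Splitting on a single space is also an assumption.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented descriptions, for illustration only
pheno_toy <- tibble(
  phenotype_description = c("Alzheimer disease", "Parkinson disease",
                            "Amyotrophic lateral sclerosis", NA)
)

word_counts <- pheno_toy %>%
  filter(!is.na(phenotype_description)) %>%
  # one column per word; short descriptions are padded with NA on the right
  separate(phenotype_description, into = paste0("w", 1:4),
           sep = " ", fill = "right") %>%
  # stack the word columns into a single column
  gather(key = "position", value = "word") %>%
  filter(!is.na(word)) %>%
  count(word, sort = TRUE)
```

The first row of `word_counts` then holds the most frequent word. On real data it may be worth lower-casing the words first so that capitalisation does not split the counts.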