Create a project R_workshop/day1-beginner using the project manager utility, top-right of the RStudio window.
Create a folder data using bottom-right panel > Files tab > New Folder button.
Create an R script to save your commands: top-left panel > Create icon > New Script entry. Now you have the 4 panels of the RStudio layout.
Save the script as practical-beginner.R
Download this simple tab-separated file http://lsru.github.io/r_workshop/data/women.tsv and save it inside the folder R_workshop/day1-beginner/data.
Remember, your current active RStudio project should be day1-beginner.
Load it. All paths are relative to the root, which is the project's folder.
library("readr")
df <- read_tsv("data/women.tsv", col_names = TRUE)
df
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
Thanks to readr, the object df is already a tibble (see tibble diff and the RStudio blog: tibble).
We keep this section short, as we will focus on dplyr to perform tasks on data frames.
Access one column and display only its first elements:
head(df$height)
## [1] 58 59 60 61 62 63
Using a similar syntax, apply:
mean() to find the mean of the women's heights.
mean(df$height)
## [1] 65
var() to find the variance of the women's weights.
var(df$weight)
## [1] 240.2095
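As a quick check of what var() computes, here is a minimal sketch of the sample variance done by hand. It assumes that R's built-in women dataset holds the same values as women.tsv (it matches the table printed above); the names x, n and manual_var are only illustrative.

```r
# Assumption: the built-in `women` dataset mirrors the downloaded women.tsv
x <- women$weight
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

manual_var
var(x)

# Both values should agree up to floating-point tolerance
all.equal(manual_var, var(x))
```

Dividing by n - 1 rather than n is what makes var() the sample variance.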
To compute the BMI (remember that height is in inches and weight in US pounds), the formula is:
\[BMI = \frac{weight}{height^2} \times 703\]
For the first individual (^2 squares a value):
(115 / 58^2) * 703
## [1] 24.0324
Compute the BMI of all individuals and store it in bmi, then compute its mean and median:
bmi <- (df$weight / df$height^2) * 703
mean(bmi)
## [1] 22.72443
median(bmi)
## [1] 22.46272
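The same formula can be wrapped in a small function, which makes it clear that R applies the arithmetic to whole vectors at once. This is only a sketch: bmi_imperial is a hypothetical helper that is not part of the workshop material, and the built-in women dataset is assumed to match women.tsv.

```r
# Hypothetical helper: BMI from weight in US pounds and height in inches
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Single individual (the first row of the table)
bmi_imperial(115, 58)

# Whole columns at once: arithmetic in R is vectorised
# Assumption: the built-in `women` dataset mirrors women.tsv
bmi_check <- bmi_imperial(women$weight, women$height)
summary(bmi_check)
```

Because the function works element-wise, no loop over individuals is needed.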
First load dplyr. This enables the use of the %>% pipe operator.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Using the df dataset:
Plot the height as a function of the weight (geom_point()).
library("ggplot2")
df %>%
  ggplot(aes(x = weight, y = height)) +
  geom_point()
Map the size of the points to bmi.
df %>%
  ggplot(aes(x = weight, y = height, size = bmi)) +
  geom_point()
df has 2 columns, and both contain values.
Use gather() from tidyr to get two columns:
measure for either height or weight
value for actual measurements
Remember that gather() takes all columns by default.
Store the result as df_melt and draw a boxplot of value for each measure.
library("tidyr")
df_melt <- gather(df, measure, value)
df_melt %>%
  ggplot(aes(x = measure, y = value)) +
  geom_boxplot()
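gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). The reshaping step above could be sketched with the newer function as follows, assuming tidyr 1.0.0 or later is installed and using the built-in women dataset as a stand-in for df; women_long is an illustrative name.

```r
library("tidyr")

# Assumption: the built-in `women` dataset stands in for df
# cols = everything() selects all columns, like gather() does by default
women_long <- pivot_longer(women,
                           cols = everything(),
                           names_to = "measure",
                           values_to = "value")

head(women_long)
```

The result has the same two columns, although the rows come out in a different order than with gather().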
Let's add bmi as a third column to df.
df$bmi <- bmi
head(df)
## Source: local data frame [6 x 3]
##
## height weight bmi
## (int) (int) (dbl)
## 1 58 115 24.03240
## 2 59 117 23.62856
## 3 60 120 23.43333
## 4 61 123 23.23811
## 5 62 126 23.04318
## 6 63 129 22.84883
Gather (tidyr) the 3 columns and plot all densities using different colours, and set them translucent. You will need to make a new df_melt data frame first.
df_melt <- gather(df, measure, value)
df_melt %>%
  ggplot(aes(x = value, fill = measure, colour = measure)) +
  geom_density(alpha = 0.7)
The 3 distributions have very different ranges.
Facet the plot by measure (use the appropriate free scale).
df_melt %>%
  ggplot(aes(x = value, fill = measure, colour = measure)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ measure, scales = "free")
When faceting, the 3 distributions are drawn in distinct plots, so mapping the colours to measure is useless.
Use a lightblue colour for all. Be careful NOT to set the colour inside aes().
df_melt %>%
  ggplot(aes(x = value)) +
  geom_density(fill = "lightblue", alpha = 0.7) +
  facet_wrap(~ measure, scales = "free")
Microarray data from the GEO dataset GSE35982.
Save the file in your data folder.
Load it with read_tsv() and store it into a data frame named gse. The file will be uncompressed seamlessly.
gse <- read_tsv("data/GSE35982.tsv.gz")
Is the data tidy? No, since all samples (column names starting with "GSM") are in different columns.
Tidy it: see the gather() help page to select columns based on characters.
library("tidyr")
gse_melt <- gse %>%
  gather(sample, value, starts_with("GSM"))
library("ggplot2")
gse_melt %>%
  ggplot(aes(x = sample, y = value)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw()
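To see what the starts_with() selection does without loading the full dataset, here is a minimal sketch on a toy table. The data frame toy, its probe column and its two GSM columns are hypothetical and only mimic the layout described above.

```r
library("tidyr")

# Hypothetical toy table: one identifier column and two sample columns
toy <- data.frame(probe = c("p1", "p2"),
                  GSM1 = c(1.5, 2.5),
                  GSM2 = c(3.5, 4.5))

# Only the columns whose names start with "GSM" are gathered;
# probe is kept as is and repeated for each sample
toy_long <- gather(toy, sample, value, starts_with("GSM"))
toy_long
```

Each row of toy_long now holds one measurement for one probe in one sample, which is the shape ggplot2 expects.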
The locale setting in readr lets you specify the decimal mark used for floating-point numbers.
Re-read the file with a comma as decimal mark, then replace the GSM878683 column of gse by the correct one found in the data frame you just created.
gsefr <- read_tsv("data/GSE35982.tsv.gz", locale = locale(decimal_mark = ","))
gse$GSM878683 <- gsefr$GSM878683
gse_melt <- gse %>%
  gather(sample, value, starts_with("GSM"))
gse_melt %>%
  ggplot(aes(x = sample, y = value)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw()
Yes, the samples are perfectly normalised.
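The effect of the decimal mark can be checked on a couple of values before re-reading a whole file. The sketch below uses readr's parse_double() with the same locale() call as above. The vector raw is made up for illustration, and the base R line is an alternative, not what the workshop uses.

```r
library("readr")

# Hypothetical values written with a decimal comma
raw <- c("7,25", "10,5")

# Tell readr that "," is the decimal mark
parsed <- parse_double(raw, locale = locale(decimal_mark = ","))
parsed

# Base R alternative: swap the comma for a dot, then convert
converted <- as.numeric(sub(",", ".", raw, fixed = TRUE))
converted
```

Both approaches should return the same numeric values, but the locale argument avoids editing the text by hand.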