R_workshop/day1-beginner
using the project manager utility, top-right of the rstudio window.
data
using bottom-right panel > Files tab > New Folder buttonR
commands. top-left panel > Create icon > New Script entry.Now, you have the 4 panels of the rstudio layout.
practical-beginner.R
Download this simple tab-separated file http://lsru.github.io/r_workshop/data/women.tsv
and save it inside the folder R_workshop/day1-beginner/data
.
Remember, your current active rstudio project should be day1-beginner
load it: All paths are relative to the root which is the projects folder
library("readr")
df <- read_tsv("data/women.tsv", col_names = TRUE)
## Parsed with column specification:
## cols(
## height = col_integer(),
## weight = col_integer()
## )
df
## # A tibble: 15 x 2
## height weight
## <int> <int>
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
Thanks to readr
the object df
is already a tibble diff rstudio blog: tibble
We keep this section short, as we will focus on dplyr
to perform tasks on data frames
Access to one column, display only the first elements
head(df$height)
## [1] 58 59 60 61 62 63
Using a similar syntax, apply:
the function mean()
to find the mean of women’ height.
the function var()
to find the variance of women’ weight.
To compute her BMI (remember height
are inches and weight
US pounds) the formula is:
\[BMI = \frac{weight}{height^2} * 703\]
For the first individual (^2
for square):
(115 / 58^2) * 703
## [1] 24.0324
Compute the BMI for all individuals, save it as bmi
Compute the mean and median of all BMI
First load dplyr
. This enables the use of the %>%
pipe operator
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Using df
dataset:
plot the heigh
in function of the weight (geom_point()
)
use the previous scatterplot, but map the point’ size to the bmi
df
has 2 columns, both contain values.
gather()
from tidyr
to get two columns
measure
for either height or weightvalue
for actual measurementsRemember that gather
takes by default all columns.
store the result into df_melt
plot the distribution as boxplots of both measures
Let’s add bmi
as a third column to df
.
df$bmi <- bmi
head(df)
## # A tibble: 6 x 3
## height weight bmi
## <int> <int> <dbl>
## 1 58 115 24.03240
## 2 59 117 23.62856
## 3 60 120 23.43333
## 4 61 123 23.23811
## 5 62 126 23.04318
## 6 63 129 22.84883
tidyr
) the 3 columns and plot all densities using different colours and set them translucent You will need to make a new df_melt
data frame first.The 3 distributions have very different ranges.
measure
(Use the appropriate free scale
).When faceting, the 3 distributions are drawn in distinct plots: mapping the colours to measure
is useless.
lightblue
colour for all. Be careful to NOT set the colour inside aes()
.Microarray data from the GEO dataset GSE35982.
data
folder.read it using read_tsv()
and store it into a data frame named gse
. The file will be uncompressed seamlessly.
Is the file tidy?
Gather the samples. Look at the gather
help page to select columns based on characters.
plot the distributions as boxplots
Any obvious issues? Check the file and find out what happened.
the locale
setting in readr
allows to specify the decimal mark used for float numbers
Replace the wrong column in gse
by the correct one found in the data frame you just created.
tidy the samples again.
plot the distributions as boxplots
do the data appear normalised?