Create a new project in the folder R_workshop/day1-beginner using the project manager utility, top-right of the RStudio window.
Create a folder named data using bottom-right panel > Files tab > New Folder button.
Create a new script to store your R commands: top-left panel > Create icon > New Script entry.
You now have the 4 panels of the RStudio layout.
Save the script as practical-beginner.R.
Download this simple tab-separated file, http://lsru.github.io/r_workshop/data/women.tsv, and save it inside the folder R_workshop/day1-beginner/data.
Remember, your current active RStudio project should be day1-beginner.
Load it. All paths are relative to the project root, which is the project's folder:
library("readr")
df <- read_tsv("data/women.tsv", col_names = TRUE)
## Parsed with column specification:
## cols(
## height = col_integer(),
## weight = col_integer()
## )
df
## # A tibble: 15 x 2
## height weight
## <int> <int>
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
Thanks to readr, the object df is already a tibble (for the differences from a plain data frame, see the RStudio blog: tibble).
We keep this section short, as we will focus on dplyr to perform tasks on data frames.
Access one column and display only its first elements:
head(df$height)
## [1] 58 59 60 61 62 63
Using a similar syntax, apply:
- the function mean() to find the mean of the women's heights.
- the function var() to find the variance of the women's weights.
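As a minimal sketch of the two calls above, the block below uses the built-in `datasets::women` data frame as a stand-in for df. This is an assumption made so the example runs without the downloaded file; the built-in dataset appears to hold the same 15 rows as the tibble printed earlier.

```r
# Stand-in for df (assumption): the built-in women dataset
df <- datasets::women

# Same $ syntax as head(df$height), applied to other functions
mean_height <- mean(df$height)  # arithmetic mean of the heights
var_weight  <- var(df$weight)   # sample variance of the weights

mean_height
var_weight
```

Note that var() returns the sample variance (denominator n - 1), so it equals sd(df$weight)^2.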
To compute the BMI of an individual (remember that heights are in inches and weights in US pounds), the formula is:
\[\mathrm{BMI} = \frac{\mathrm{weight}}{\mathrm{height}^2} \times 703\]
For the first individual (^2 squares a value):
(115 / 58^2) * 703
## [1] 24.0324
Compute the BMI for all individuals and save it as bmi.
Compute the mean and median of all BMI values.
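Because arithmetic in R is vectorised, the formula used for the first individual can be applied to whole columns at once. The sketch below is one possible way to do it, again assuming `datasets::women` as a stand-in for df.

```r
# Stand-in for df (assumption): the built-in women dataset
df <- datasets::women

# Element-wise: each weight is divided by the square of the matching height
bmi <- (df$weight / df$height^2) * 703

# One value per individual, then two summary statistics
bmi
mean(bmi)
median(bmi)
```

The first element of bmi should match the value computed by hand for the first individual.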
First load dplyr. This enables the use of the %>% pipe operator
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
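To illustrate what the pipe does, here is a small sketch: `x %>% f(y)` is evaluated as `f(x, y)`, so a data frame can be passed from one step to the next. The stand-in data (`datasets::women`) and the name `height_summary` are assumptions for illustration only.

```r
library(dplyr)

# Stand-in for df (assumption): the built-in women dataset as a tibble
df <- as_tibble(datasets::women)

# df is passed as the first argument of summarise()
height_summary <- df %>%
  summarise(mean_height = mean(height),
            var_weight  = var(weight))

height_summary
```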
Using the df dataset:
- plot the height as a function of the weight (geom_point()).
- reuse the previous scatterplot, but map the point size to the bmi.
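The two plots above could be sketched as follows. The stand-in data (`datasets::women`) and the object names `p_scatter` and `p_size` are assumptions; bmi is recomputed here so the block runs on its own.

```r
library(ggplot2)

# Stand-in for df (assumption): the built-in women dataset
df <- datasets::women
bmi <- (df$weight / df$height^2) * 703

# Height as a function of weight: weight on x, height on y
p_scatter <- ggplot(df, aes(x = weight, y = height)) +
  geom_point()

# Same plot, with the point size mapped to bmi inside aes()
# (bmi is not a column of df yet, so it is taken from the workspace)
p_size <- ggplot(df, aes(x = weight, y = height, size = bmi)) +
  geom_point()

p_scatter
p_size
```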
df has 2 columns, and both contain values.
Use gather() from tidyr to get two columns:
- measure, for either height or weight
- value, for the actual measurements
Remember that gather() takes all columns by default.
Store the result in df_melt.
Plot the distribution of both measures as boxplots.
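A minimal sketch of the reshaping step, assuming `datasets::women` as a stand-in for df. gather() is used because the exercise names it; in recent tidyr versions it is superseded by pivot_longer(), which is not shown here.

```r
library(tidyr)
library(ggplot2)

# Stand-in for df (assumption): the built-in women dataset
df <- datasets::women

# No columns are selected, so gather() collects all of them:
# column names go to "measure", cell contents go to "value"
df_melt <- gather(df, key = "measure", value = "value")

# One box per measure
p_box <- ggplot(df_melt, aes(x = measure, y = value)) +
  geom_boxplot()

head(df_melt)
p_box
```

The long table has one row per original cell, so 15 rows and 2 columns become 30 rows.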
Let’s add bmi as a third column to df.
df$bmi <- bmi
head(df)
## # A tibble: 6 x 3
## height weight bmi
## <int> <int> <dbl>
## 1 58 115 24.03240
## 2 59 117 23.62856
## 3 60 120 23.43333
## 4 61 123 23.23811
## 5 62 126 23.04318
## 6 63 129 22.84883
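With bmi stored as a third column, the density exercise that follows can be approached along these lines. This is only a sketch under assumptions: `datasets::women` stands in for df, the object names `p_dens` and `p_facet` are hypothetical, and "lightblue colour" is interpreted as the fill of the density curve.

```r
library(tidyr)
library(ggplot2)

# Stand-in for df (assumption), with bmi added as a third column
df <- datasets::women
df$bmi <- (df$weight / df$height^2) * 703

# A new long table is needed because df now has 3 columns
df_melt <- gather(df, key = "measure", value = "value")

# Overlaid densities: fill mapped to measure, alpha sets the translucency
p_dens <- ggplot(df_melt, aes(x = value, fill = measure)) +
  geom_density(alpha = 0.5)

# One panel per measure with free scales; the fill is a fixed value,
# so it is set outside aes()
p_facet <- ggplot(df_melt, aes(x = value)) +
  geom_density(fill = "lightblue") +
  facet_wrap(~ measure, scales = "free")

p_dens
p_facet
```

Placing fill = "lightblue" inside aes() would instead treat the string as a data value and map it to a default colour with a legend.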
Gather (with tidyr) the 3 columns and plot all densities using different colours, and set them translucent. You will need to make a new df_melt data frame first.
The 3 distributions have very different ranges.
Facet by measure (use the appropriate free scale).
When faceting, the 3 distributions are drawn in distinct plots: mapping the colours to measure is useless.
Use a lightblue colour for all. Be careful NOT to set the colour inside aes().

Microarray data from the GEO dataset GSE35982.
Download the file into your data folder.
Read it using read_tsv() and store it in a data frame named gse. The file will be uncompressed seamlessly.
Is the file tidy?
Gather the samples. Look at the gather help page to select columns based on characters.
Plot the distributions as boxplots.
Any obvious issues? Check the file and find out what happened.
The locale argument of readr lets you specify the decimal mark used for floating-point numbers: read the file again into a new data frame with the appropriate setting.
Replace the wrong column in gse with the correct one found in the data frame you just created.
Tidy the samples again.
Plot the distributions as boxplots.
Do the data appear normalised?
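The steps above can be sketched on a toy file, since the real GSE35982 file and its column names are not shown here. Everything in the block is hypothetical: the file content, the column names (probe, sample_a, sample_b) and the object names. Column types are also given explicitly to keep the behaviour predictable, whereas the exercise probably relies on readr guessing them.

```r
library(readr)
library(tidyr)
library(ggplot2)

# Hypothetical toy file: sample_a uses "," as decimal mark, sample_b uses "."
tmp <- tempfile(fileext = ".tsv")
writeLines(c("probe\tsample_a\tsample_b",
             "p1\t7,25\t8.10",
             "p2\t6,50\t7.95",
             "p3\t9,00\t8.40"), tmp)

# First read with the default locale: sample_a is kept as text
toy <- read_tsv(tmp, col_types = cols(probe = col_character(),
                                      sample_a = col_character(),
                                      sample_b = col_double()))

# Second read: the locale declares "," as the decimal mark
toy_comma <- read_tsv(tmp,
                      locale = locale(decimal_mark = ","),
                      col_types = cols(probe = col_character(),
                                       sample_a = col_double(),
                                       sample_b = col_character()))

# Replace the wrong column by the correctly parsed one
toy$sample_a <- toy_comma$sample_a

# Gather only the columns whose names start with "sample"
toy_long <- gather(toy, key = "sample", value = "value",
                   starts_with("sample"))

# One box per sample
p_toy <- ggplot(toy_long, aes(x = sample, y = value)) +
  geom_boxplot()
p_toy
```

For the real file, the selection helper passed to gather() would have to match the actual sample column names, which should be checked with names(gse) first.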