You will learn to:
- use
readr
and/orreadxl
to import your data into R - use the interactive RStudio interface to visualise your data
- appreciate tibbles
- adjust the type of the data you would like to import
2 May 2017
readr
and/or readxl
to import your data into R.csv
, .tsv
, …).xls
, .xlsx
).sas
from SAS, .sav
from SPSS, .dta
from Stata)read.csv()
, read.delim()
)Use as_tibble()
to convert a data.frame
to a tibble
tibble
vs data.frame
data.frame
iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5.0 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa 11 5.4 3.7 1.5 0.2 setosa 12 4.8 3.4 1.6 0.2 setosa 13 4.8 3.0 1.4 0.1 setosa 14 4.3 3.0 1.1 0.1 setosa 15 5.8 4.0 1.2 0.2 setosa 16 5.7 4.4 1.5 0.4 setosa 17 5.4 3.9 1.3 0.4 setosa 18 5.1 3.5 1.4 0.3 setosa 19 5.7 3.8 1.7 0.3 setosa 20 5.1 3.8 1.5 0.3 setosa 21 5.4 3.4 1.7 0.2 setosa 22 5.1 3.7 1.5 0.4 setosa 23 4.6 3.6 1.0 0.2 setosa 24 5.1 3.3 1.7 0.5 setosa 25 4.8 3.4 1.9 0.2 setosa 26 5.0 3.0 1.6 0.2 setosa 27 5.0 3.4 1.6 0.4 setosa 28 5.2 3.5 1.5 0.2 setosa 29 5.2 3.4 1.4 0.2 setosa 30 4.7 3.2 1.6 0.2 setosa 31 4.8 3.1 1.6 0.2 setosa 32 5.4 3.4 1.5 0.4 setosa 33 5.2 4.1 1.5 0.1 setosa 34 5.5 4.2 1.4 0.2 setosa 35 4.9 3.1 1.5 0.2 setosa 36 5.0 3.2 1.2 0.2 setosa 37 5.5 3.5 1.3 0.2 setosa 38 4.9 3.6 1.4 0.1 setosa 39 4.4 3.0 1.3 0.2 setosa 40 5.1 3.4 1.5 0.2 setosa 41 5.0 3.5 1.3 0.3 setosa 42 4.5 2.3 1.3 0.3 setosa 43 4.4 3.2 1.3 0.2 setosa 44 5.0 3.5 1.6 0.6 setosa 45 5.1 3.8 1.9 0.4 setosa 46 4.8 3.0 1.4 0.3 setosa 47 5.1 3.8 1.6 0.2 setosa 48 4.6 3.2 1.4 0.2 setosa 49 5.3 3.7 1.5 0.2 setosa 50 5.0 3.3 1.4 0.2 setosa 51 7.0 3.2 4.7 1.4 versicolor 52 6.4 3.2 4.5 1.5 versicolor 53 6.9 3.1 4.9 1.5 versicolor 54 5.5 2.3 4.0 1.3 versicolor 55 6.5 2.8 4.6 1.5 versicolor 56 5.7 2.8 4.5 1.3 versicolor 57 6.3 3.3 4.7 1.6 versicolor 58 4.9 2.4 3.3 1.0 versicolor 59 6.6 2.9 4.6 1.3 versicolor 60 5.2 2.7 3.9 1.4 versicolor 61 5.0 2.0 3.5 1.0 versicolor 62 5.9 3.0 4.2 1.5 versicolor 63 6.0 2.2 4.0 1.0 versicolor 64 6.1 2.9 4.7 1.4 versicolor 65 5.6 2.9 3.6 1.3 versicolor 66 6.7 3.1 4.4 1.4 versicolor 67 5.6 3.0 4.5 1.5 versicolor 68 5.8 2.7 4.1 1.0 versicolor 69 6.2 2.2 4.5 1.5 versicolor 70 5.6 2.5 3.9 1.1 versicolor 71 5.9 3.2 4.8 1.8 versicolor 72 6.1 2.8 4.0 1.3 versicolor 73 6.3 2.5 4.9 1.5 versicolor 74 6.1 2.8 4.7 1.2 versicolor 75 6.4 2.9 4.3 1.3 versicolor 76 6.6 3.0 4.4 1.4 versicolor 77 6.8 2.8 4.8 1.4 versicolor 78 6.7 3.0 5.0 1.7 versicolor 79 6.0 2.9 4.5 1.5 versicolor 80 5.7 2.6 3.5 1.0 versicolor 81 5.5 2.4 3.8 1.1 versicolor 82 5.5 2.4 3.7 1.0 versicolor 83 5.8 2.7 3.9 1.2 versicolor 84 6.0 2.7 5.1 1.6 versicolor 85 5.4 3.0 4.5 1.5 versicolor 86 6.0 3.4 4.5 1.6 versicolor 87 6.7 3.1 4.7 1.5 versicolor 88 6.3 2.3 4.4 1.3 versicolor 89 5.6 3.0 4.1 1.3 versicolor 90 5.5 2.5 4.0 1.3 versicolor 91 5.5 2.6 4.4 1.2 versicolor 92 6.1 3.0 4.6 1.4 versicolor 93 5.8 2.6 4.0 1.2 versicolor 94 5.0 2.3 3.3 1.0 versicolor 95 5.6 2.7 4.2 1.3 versicolor 96 5.7 3.0 4.2 1.2 versicolor 97 5.7 2.9 4.2 1.3 versicolor 98 6.2 2.9 4.3 1.3 versicolor 99 5.1 2.5 3.0 1.1 versicolor 100 5.7 2.8 4.1 1.3 versicolor 101 6.3 3.3 6.0 2.5 virginica 102 5.8 2.7 5.1 1.9 virginica 103 7.1 3.0 5.9 2.1 virginica 104 6.3 2.9 5.6 1.8 virginica 105 6.5 3.0 5.8 2.2 virginica 106 7.6 3.0 6.6 2.1 virginica 107 4.9 2.5 4.5 1.7 virginica 108 7.3 2.9 6.3 1.8 virginica 109 6.7 2.5 5.8 1.8 virginica 110 7.2 3.6 6.1 2.5 virginica 111 6.5 3.2 5.1 2.0 virginica 112 6.4 2.7 5.3 1.9 virginica 113 6.8 3.0 5.5 2.1 virginica 114 5.7 2.5 5.0 2.0 virginica 115 5.8 2.8 5.1 2.4 virginica 116 6.4 3.2 5.3 2.3 virginica 117 6.5 3.0 5.5 1.8 virginica 118 7.7 3.8 6.7 2.2 virginica 119 7.7 2.6 6.9 2.3 virginica 120 6.0 2.2 5.0 1.5 virginica 121 6.9 3.2 5.7 2.3 virginica 122 5.6 2.8 4.9 2.0 virginica 123 7.7 2.8 6.7 2.0 virginica 124 6.3 2.7 4.9 1.8 virginica 125 6.7 3.3 5.7 2.1 virginica 126 7.2 3.2 6.0 1.8 virginica 127 6.2 2.8 4.8 1.8 virginica 128 6.1 3.0 4.9 1.8 virginica 129 6.4 2.8 5.6 2.1 virginica 130 7.2 3.0 5.8 1.6 virginica 131 7.4 2.8 6.1 1.9 virginica 132 7.9 3.8 6.4 2.0 virginica 133 6.4 2.8 5.6 2.2 virginica 134 6.3 2.8 5.1 1.5 virginica 135 6.1 2.6 5.6 1.4 virginica 136 7.7 3.0 6.1 2.3 virginica 137 6.3 3.4 5.6 2.4 virginica 138 6.4 3.1 5.5 1.8 virginica 139 6.0 3.0 4.8 1.8 virginica 140 6.9 3.1 5.4 2.1 virginica 141 6.7 3.1 5.6 2.4 virginica 142 6.9 3.1 5.1 2.3 virginica 143 5.8 2.7 5.1 1.9 virginica 144 6.8 3.2 5.9 2.3 virginica 145 6.7 3.3 5.7 2.5 virginica 146 6.7 3.0 5.2 2.3 virginica 147 6.3 2.5 5.0 1.9 virginica 148 6.5 3.0 5.2 2.0 virginica 149 6.2 3.4 5.4 2.3 virginica 150 5.9 3.0 5.1 1.8 virginica
tibble
vs data.frame
tibble
iris %>% as_tibble()
# A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fctr> 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5.0 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa # ... with 140 more rows
tibble
adjusts to widthiris %>% as_tibble()
# A tibble: 150 x 5 Sepal.Length Sepal.Width <dbl> <dbl> 1 5.1 3.5 2 4.9 3.0 3 4.7 3.2 4 4.6 3.1 5 5.0 3.6 6 5.4 3.9 7 4.6 3.4 8 5.0 3.4 9 4.4 2.9 10 4.9 3.1 # ... with 140 more rows, and # 3 more variables: # Petal.Length <dbl>, # Petal.Width <dbl>, # Species <fctr>
tibble()
base::data.frame()
but
data.frame(`bad name` = 1:4, x = rep(letters[1:2], 2)) %>% str()
'data.frame': 4 obs. of 2 variables: $ bad.name: int 1 2 3 4 $ x : Factor w/ 2 levels "a","b": 1 2 1 2
tibble(`bad name` = 1:4, x = rep(letters[1:2], 2)) %>% str()
Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of 2 variables: $ bad name: int 1 2 3 4 $ x : chr "a" "b" "a" "b"
tribble()
~
)tribble( ~x, ~y, ~z, "a", 2, 3.6, "b", 1, 8.5 )
# A tibble: 2 x 3 x y z <chr> <dbl> <dbl> 1 a 2 3.6 2 b 1 8.5
datapasta
packageSeven file formats are supported by the readr package:
read_csv()
: comma separated (CSV) filesread_tsv()
: tab separated filesread_delim()
: general delimited filesread_fwf()
: fixed width filesread_table()
: tabular files where colums are separated by white-space.read_log()
: web log filesTo import excel files (.xls
and .xlsx
):
read_excel()
read_xls()
read_xlsx()
read_sas()
for SASread_sav()
for SPSSread_dta()
for Statareadr
or readxl
Import Dataset
button in the upper right panel or click on the file in the lower right panel
mtcars.csv
file shipped with readr
to your project folder:file.copy( from = readr::readr_example("mtcars.csv"), to = "mtcars.csv" )
Import Dataset
button to import the mtcars.csv
file.readr
functionsread_csv()
read_csv2()
read_tsv()
read_delim()
read_delim(file, delim = "|", ...)
read_fwf()
csv
file: mtcars.csv"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb" 21,6,160,110,3.9,2.62,16.46,0,1,4,4 21,6,160,110,3.9,2.875,17.02,0,1,4,4 22.8,4,108,93,3.85,2.32,18.61,1,1,4,1 21.4,6,258,110,3.08,3.215,19.44,1,0,3,1 18.7,8,360,175,3.15,3.44,17.02,0,0,3,2 ...
read_csv()
readr
provides some example filesreadr::readr_example()
to access them.zip
, .gz
, …)read_csv()
readr_example("mtcars.csv") %>% # Generates the path to the example file read_csv()
Parsed with column specification: cols( mpg = col_double(), cyl = col_integer(), disp = col_double(), hp = col_integer(), drat = col_double(), wt = col_double(), qsec = col_double(), vs = col_integer(), am = col_integer(), gear = col_integer(), carb = col_integer() )
# A tibble: 32 x 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 # ... with 22 more rows
guess_max
optionmessage = FALSE
in your rmarkdown chunk option.col_types = cols()
col_types
to avoid any problemParsed with column specification: cols( mpg = col_double(), cyl = col_integer(), disp = col_double(), hp = col_integer(), drat = col_double(), wt = col_double(), qsec = col_double(), vs = col_integer(), am = col_integer(), gear = col_integer(), carb = col_integer() )
Use the following functions in cols()
:
col_double()
col_integer()
col_character()
col_logical()
col_factor()
col_date()
col_datetime()
col_time()
col_guess()
col_skip()
readr_example("mtcars.csv") %>% read_csv(col_types = cols(mpg = col_double(), cyl = col_integer(), disp = col_double(), hp = col_integer(), drat = col_double(), wt = col_double(), qsec = col_double(), vs = col_integer(), am = col_integer(), gear = col_integer(), carb = col_integer()) )
readr_example("mtcars.csv") %>% read_csv(col_types = cols( mpg = "d", cyl = "i", disp = "d", hp = "i", drat = "d", wt = "d", qsec = "d", vs = "i", am = "i", gear = "i", carb = "i")) readr_example("mtcars.csv") %>% read_csv(col_types = "dididddiiii")
mtcars.csv
file again but
readr_example("mtcars.csv") %>% read_csv(col_types = cols_only(cyl = col_character(), mpg = col_double(), gear = col_integer()))
# A tibble: 32 x 3 mpg cyl gear <dbl> <chr> <int> 1 21.0 6 4 2 21.0 6 4 3 22.8 4 4 4 21.4 6 3 # ... with 28 more rows
readr_example("mtcars.csv") %>% read_csv(col_types = "dc_______i_")
# A tibble: 32 x 3 mpg cyl gear <dbl> <chr> <int> 1 21.0 6 4 2 21.0 6 4 3 22.8 4 4 4 21.4 6 3 # ... with 28 more rows
challenge.csv
(provided by readr_example()
)challenge
problems(challenge)
to list the failures againchallenge <- readr_example("challenge.csv") %>% read_csv()
Parsed with column specification: cols( x = col_integer(), y = col_character() )
Warning in rbind(names(probs), probs_f): number of columns of result is not a multiple of vector length (arg 1)
Warning: 1000 parsing failures. row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1001 x no trailing characters .23837975086644292 'challenge.csv' file 2 1002 x no trailing characters .41167997173033655 'challenge.csv' row 3 1003 x no trailing characters .7460716762579978 'challenge.csv' col 4 1004 x no trailing characters .723450553836301 'challenge.csv' expected 5 1005 x no trailing characters .614524137461558 'challenge.csv' ... ................. ... ....................................................................... ........ ....................................................................... ...... ....................................................................... .... ....................................................................... ... ....................................................................... ... ....................................................................... ........ ....................................................................... See problems(...) for more details.
challenge.csv
(provided by readr_example()
)challenge
challenge <- readr_example("challenge.csv") %>% read_csv(col_types = "dD")
challenge <- readr_example("challenge.csv") %>% read_csv(guess_max = 1500) # n > 1000: look at the first row causing a failure
Parsed with column specification: cols( x = col_double(), y = col_date(format = "") )
skip
argumentTo skip the first n rows
n_max
argumentTo stop reading after n rows
col_names
argumentoverride column names
readr_example("mtcars.csv") %>% read_csv()
# A tibble: 32 x 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 # ... with 28 more rows
readr_example("mtcars.csv") %>% read_csv(skip = 3, n_max = 3, col_names = FALSE)
# A tibble: 3 x 11 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 <dbl> <int> <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int> 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
readr_example("mtcars.csv") %>% read_csv(skip = 3, n_max = 3, col_names = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"))
# A tibble: 3 x 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <int> <int> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int> 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
"name";"first_name";"born";"value" "Dupont";"Michel";03/10/71;1,2 "Doe";"John";27/02/74;1,7 "Mustermann";"Max";14/08/69;1,6
read_delim()
read_delim("data/locale_1.csv", delim = ";")
# A tibble: 3 x 4 name first_name born value <chr> <chr> <chr> <dbl> 1 Dupont Michel 03/10/71 12 2 Doe John 27/02/74 17 3 Mustermann Max 14/08/69 16
locale
argumentread_delim("data/locale_1.csv", delim = ";", locale = locale(decimal_mark = ",", date_format = "%d/%m/%y"))
# A tibble: 3 x 4 name first_name born value <chr> <chr> <date> <dbl> 1 Dupont Michel 1971-10-03 1.2 2 Doe John 1974-02-27 1.7 3 Mustermann Max 1969-08-14 1.6
?parse_datetime
to list the datetime format specificationsread_csv2()
is a shortcut to read_delim()
using ;
as a delimiter and ,
as the decimal markread_csv2("data/locale_1.csv", locale = locale(date_format = "%d/%m/%y")) read_csv2("data/locale_1.csv") %>% mutate_at(c("born"), parse_date, format = "%d/%m/%y")
readxl
.xls
and .xlsx
filesread_excel()
read_xls()
read_xlsx()
excel_sheets()
to list all sheets in the Excel filereadxl::readxl_example()
[1] "clippy.xls" "clippy.xlsx" "datasets.xls" [4] "datasets.xlsx" "deaths.xls" "deaths.xlsx" [7] "geometry.xls" "geometry.xlsx" "type-me.xls" [10] "type-me.xlsx"
readxl
and the import of excel files
datasets.xlsx
example filemtcars
sheetfile.copy( from = readxl::readxl_example("datasets.xlsx"), to = "datasets.xlsx" ) # To copy it to the root of your project
deaths
datasetdeaths.xls
file (using readxl_example("deaths.xlsx")
to get the path to it)readxl_example("deaths.xlsx") %>% read_excel(sheet = 1)
# A tibble: 18 x 6 `Lots of people` X__1 X__2 X__3 X__4 <chr> <chr> <chr> <chr> <chr> 1 simply cannot resist writing <NA> <NA> <NA> <NA> 2 at the top <NA> of 3 or merging <NA> <NA> <NA> 4 Name Profession Age Has kids Date of birth 5 David Bowie musician 69 TRUE 17175 6 Carrie Fisher actor 60 TRUE 20749 7 Chuck Berry musician 90 TRUE 9788 8 Bill Paxton actor 61 TRUE 20226 # ... with 10 more rows, and 1 more variables: X__5 <chr>
readxl 1.0.0
to read in your data frame of interest (have a look at ?readxl::read_excel
)readxl_example("deaths.xlsx") %>% read_excel(sheet = 1, range = "A5:F15")
# A tibble: 10 x 6 Name Profession Age `Has kids` `Date of birth` <chr> <chr> <dbl> <lgl> <dttm> 1 David Bowie musician 69 TRUE 1947-01-08 2 Carrie Fisher actor 60 TRUE 1956-10-21 3 Chuck Berry musician 90 TRUE 1926-10-18 4 Bill Paxton actor 61 TRUE 1955-05-17 5 Prince musician 57 TRUE 1958-06-07 6 Alan Rickman actor 69 FALSE 1946-02-21 7 Florence Henderson actor 82 TRUE 1934-02-14 8 Harper Lee author 89 FALSE 1926-04-28 9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 10 George Michael musician 53 FALSE 1963-06-25 # ... with 1 more variables: `Date of death` <dttm>
readxl_example("deaths.xlsx") %>% read_excel(range = "arts!A5:F15") readxl_example("deaths.xlsx") %>% read_excel(range = cell_limits(c(5, 1), c(15, 6), "arts")) # have a look at ?cellranger::cell_limits
readr
using the col_types
argumentcols()
function won't work here!
readr
guesses column type based on the data.readxl
guesses column type based on Excel cell typescol_types
should be adjusted to a character vectorlook at the correspondance in the readxl vignette
vignette("cell-and-column-types")
How it is in Excel | How it will be in R | How to request in col_types |
---|---|---|
anything | non-existent | "skip" |
empty | logical, but all NA | you cannot request this |
boolean | logical | "logical" |
numeric | numeric | "numeric" |
datetime | POSIXct | "date" |
text | character | "text" |
anything | list | "list" |
clippy.xls
example file provided by readxl
readxl_example("clippy.xls") %>% read_excel()
# A tibble: 4 x 2 name value <chr> <chr> 1 Name Clippy 2 Species paperclip 3 Approx date of death 39083 4 Weight in grams 0.9
col_types
readxl_example("clippy.xls") %>% read_excel(col_types = c("text", "list"))
# A tibble: 4 x 2 name value <chr> <list> 1 Name <chr [1]> 2 Species <chr [1]> 3 Approx date of death <dttm [1]> 4 Weight in grams <dbl [1]>
readxl_example("clippy.xls") %>% read_excel(col_types = c("text", "list")) %>% tibble::deframe() %>% str()
List of 4 $ Name : chr "Clippy" $ Species : chr "paperclip" $ Approx date of death: POSIXct[1:1], format: "2007-01-01" $ Weight in grams : num 0.9
readr
and/or readxl
to import your data into R
vignette("readr")
vignette("locales")
vignette("sheet-geometry")
vignette("cell-and-column-types")