Section 5.3
1. Use the read_csv function to read each of the files
that the following code saves in the files object:
library(readr)
path <- system.file("extdata", package = "dslabs")
files <- list.files(path)
files
## [1] "2010_bigfive_regents.xls"
## [2] "carbon_emissions.csv"
## [3] "fertility-two-countries-example.csv"
## [4] "HRlist2.txt"
## [5] "life-expectancy-and-fertility-two-countries-example.csv"
## [6] "murders.csv"
## [7] "olive.csv"
## [8] "RD-Mortality-Report_2015-18-180531.pdf"
## [9] "ssa-death-probability.csv"
filename <- c("carbon_emissions.csv",
              "fertility-two-countries-example.csv",
              "life-expectancy-and-fertility-two-countries-example.csv",
              "olive.csv",
              "murders.csv",
              "ssa-death-probability.csv")
dir <- system.file("extdata", package="dslabs")
fullpath <- file.path(dir, filename)
file.copy(fullpath, filename)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
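The FALSE values above most likely mean that the six files were already present in the working directory: file.copy does not overwrite an existing destination unless overwrite = TRUE is passed. A minimal sketch of that behavior, using a throwaway temporary directory rather than the dslabs files (the names tmp, src and dest are illustrative only):

```r
# Work in a temporary directory so no real files are touched
tmp <- tempfile("copy-demo")
dir.create(tmp)
src <- file.path(tmp, "source.csv")
dest <- file.path(tmp, "dest.csv")
writeLines(c("x,y", "1,2"), src)

# First copy succeeds because dest does not exist yet
first <- file.copy(src, dest)

# Second copy is refused because dest already exists
second <- file.copy(src, dest)

# overwrite = TRUE replaces the existing file
third <- file.copy(src, dest, overwrite = TRUE)

c(first = first, second = second, third = third)
```

Checking file.exists(filename) before copying is another way to confirm which case applies.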
read_csv("carbon_emissions.csv", n_max=5)
## Rows: 5 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Year, Total carbon emissions from fossil fuel consumption and cemen...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 2
## Year Total carbon emissions from fossil fuel consumption and cement product…¹
## <dbl> <dbl>
## 1 1751 3
## 2 1752 3
## 3 1753 3
## 4 1754 3
## 5 1755 3
## # ℹ abbreviated name:
## # ¹`Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)`
read_csv("fertility-two-countries-example.csv", n_max=5)
## Rows: 2 Columns: 57
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (56): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 57
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17
## 2 South K… 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62
## # ℹ 46 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## # `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## # `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## # `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## # `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## # `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## # `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …
read_csv("life-expectancy-and-fertility-two-countries-example.csv", n_max=5)
## Rows: 2 Columns: 113
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (112): 1960_fertility, 1960_life_expectancy, 1961_fertility, 1961_life_e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 113
## country `1960_fertility` `1960_life_expectancy` `1961_fertility`
## <chr> <dbl> <dbl> <dbl>
## 1 Germany 2.41 69.3 2.44
## 2 South Korea 6.16 53.0 5.99
## # ℹ 109 more variables: `1961_life_expectancy` <dbl>, `1962_fertility` <dbl>,
## # `1962_life_expectancy` <dbl>, `1963_fertility` <dbl>,
## # `1963_life_expectancy` <dbl>, `1964_fertility` <dbl>,
## # `1964_life_expectancy` <dbl>, `1965_fertility` <dbl>,
## # `1965_life_expectancy` <dbl>, `1966_fertility` <dbl>,
## # `1966_life_expectancy` <dbl>, `1967_fertility` <dbl>,
## # `1967_life_expectancy` <dbl>, `1968_fertility` <dbl>, …
read_csv("olive.csv", n_max=5)
## New names:
## • `` -> `...1`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 5 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Region
## dbl (9): ...1, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linole...
## num (1): eicosenoic
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 11
## ...1 Region Area palmitic palmitoleic stearic oleic linoleic linolenic
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 North-Apulia 1 1 1075 75 226 7823 672
## 2 2 North-Apulia 1 1 1088 73 224 7709 781
## 3 3 North-Apulia 1 1 911 54 246 8113 549
## 4 4 North-Apulia 1 1 966 57 240 7952 619
## 5 5 North-Apulia 1 1 1051 67 259 7771 672
## # ℹ 2 more variables: arachidic <dbl>, eicosenoic <dbl>
read_csv("murders.csv", n_max=5)
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, abb, region
## dbl (2): population, total
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## state abb region population total
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
read_csv("ssa-death-probability.csv", n_max=5)
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (3): Age, DeathProb, LifeExp
## num (1): NumberOfLives
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## Age Sex DeathProb NumberOfLives LifeExp
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 Male 0.00638 100000 76.2
## 2 1 Male 0.000453 99362 75.6
## 3 2 Male 0.000282 99317 74.7
## 4 3 Male 0.00023 99289 73.7
## 5 4 Male 0.000169 99266 72.7
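The six calls above follow the same pattern, so they can also be written as a loop. The following is only a sketch, assuming dslabs and a version of readr that supports show_col_types are installed; it reads the files directly from the package directory instead of copying them first, and previews is an illustrative name.

```r
library(readr)

# Full paths to the CSV files shipped with dslabs (assumes dslabs is installed)
dir <- system.file("extdata", package = "dslabs")
filename <- c("carbon_emissions.csv",
              "fertility-two-countries-example.csv",
              "life-expectancy-and-fertility-two-countries-example.csv",
              "olive.csv",
              "murders.csv",
              "ssa-death-probability.csv")
fullpath <- file.path(dir, filename)

# Read the first five rows of each file; show_col_types = FALSE quiets
# the column specification message printed by each call
previews <- lapply(fullpath, read_csv, n_max = 5, show_col_types = FALSE)
names(previews) <- filename

# Number of columns detected in each file
sapply(previews, ncol)
```

The olive file still triggers its naming message and parsing warning here, since the loop does not change how that file is read.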
2. Note that one of the files, olive.csv, gives us a
warning. This is because the first line of the file is missing the
header for the first column.
Read the help file for read_csv to figure out how to
read in the file without reading this header. If you skip the header,
you should not get this warning. Save the result to an object called
dat.
?read_csv
dat <- read_csv("olive.csv", skip = 1)
## New names:
## • `1` -> `1...1`
## • `1` -> `1...3`
## • `1` -> `1...4`
## Rows: 571 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): North-Apulia
## dbl (11): 1...1, 1...3, 1...4, 1075, 75, 226, 7823, 672, 36, 60, 29
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
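With skip = 1 alone, read_csv treats the first data row as the header, which is why the column names above are values such as 1075 and only 571 rows are reported. One possible variation, sketched here under the assumption that dslabs is installed, adds col_names = FALSE so that no line is used as a header. The result is stored under the illustrative name dat_no_header so that dat is left unchanged.

```r
library(readr)

# Assumes dslabs is installed; reads olive.csv straight from the package directory
olive_path <- file.path(system.file("extdata", package = "dslabs"), "olive.csv")

# skip = 1 drops the incomplete header line; col_names = FALSE tells read_csv
# not to treat the next line as a header, so columns get default names X1, X2, ...
dat_no_header <- read_csv(olive_path, skip = 1, col_names = FALSE,
                          show_col_types = FALSE)

dim(dat_no_header)
names(dat_no_header)
```

The default names are no more informative than before, but the first observation is kept as data rather than consumed as column names.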
3. A problem with the previous approach is that we don’t know what the columns represent. Type:
names(dat)
to see that the names are not informative.
Use the readLines function (or read_lines from readr, as in the solution below) to read in just the first
line (we later learn how to extract values from the output).
names(dat)
## [1] "1...1" "North-Apulia" "1...3" "1...4" "1075"
## [6] "75" "226" "7823" "672" "36"
## [11] "60" "29"
read_lines("olive.csv", skip=0, n_max=1)
## [1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
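The header comes back as a single string. As a minimal base R sketch of extracting the individual names from it, the string printed above is pasted in literally so the example runs without the file; header and col_names are illustrative names.

```r
# Header line as returned by read_lines() above, pasted in literally
header <- ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"

# Split on commas; strsplit() returns a list, so take its first element
col_names <- strsplit(header, ",", fixed = TRUE)[[1]]

length(col_names)  # number of fields in the header line
col_names[1]       # empty string: the first column has no name
```

The header yields 11 fields while dat has 12 columns, so the names cannot simply be assigned one-to-one without first deciding which column the header leaves unlabeled.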