HarvardX PH125.1x: Chapter 5 Exercises (Part 1)

Section 5.3

1. Use the read_csv function to read each of the files that the following code saves in the files object:

library(readr)
path <- system.file("extdata", package = "dslabs")
files <- list.files(path)
files

## [1] "2010_bigfive_regents.xls"                               
## [2] "carbon_emissions.csv"                                   
## [3] "fertility-two-countries-example.csv"                    
## [4] "HRlist2.txt"                                            
## [5] "life-expectancy-and-fertility-two-countries-example.csv"
## [6] "murders.csv"                                            
## [7] "olive.csv"                                              
## [8] "RD-Mortality-Report_2015-18-180531.pdf"                 
## [9] "ssa-death-probability.csv"

# CSV files to copy from the dslabs package into the working directory
filename <- c("carbon_emissions.csv", "fertility-two-countries-example.csv",
              "life-expectancy-and-fertility-two-countries-example.csv",
              "olive.csv", "murders.csv", "ssa-death-probability.csv")
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)
# file.copy returns one logical per file: TRUE if copied, FALSE otherwise
file.copy(fullpath, filename)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE
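
The FALSE values above mean that none of the six files were copied. A likely cause, though the output alone does not confirm it, is that the files already existed in the working directory from an earlier run: file.copy does not overwrite existing files unless overwrite = TRUE is set. The sketch below illustrates that behaviour with hypothetical temporary files (src, dst, and example.csv are invented for this illustration and are not part of the exercise).

```r
# Hypothetical temporary locations, used only for this illustration.
src_dir <- tempfile("src"); dir.create(src_dir)
dst_dir <- tempfile("dst"); dir.create(dst_dir)
src <- file.path(src_dir, "example.csv")
dst <- file.path(dst_dir, "example.csv")
writeLines(c("a,b", "1,2"), src)

first  <- file.copy(src, dst)                    # destination does not exist yet
second <- file.copy(src, dst)                    # destination exists, not overwritten
third  <- file.copy(src, dst, overwrite = TRUE)  # overwriting explicitly allowed

c(first, second, third)
## [1]  TRUE FALSE  TRUE
```

If that is what happened here, the read_csv calls below still work because the earlier copies are in place.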

read_csv("carbon_emissions.csv", n_max=5)

## Rows: 5 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Year, Total carbon emissions from fossil fuel consumption and cemen...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 5 × 2
##    Year Total carbon emissions from fossil fuel consumption and cement product…¹
##   <dbl>                                                                    <dbl>
## 1  1751                                                                        3
## 2  1752                                                                        3
## 3  1753                                                                        3
## 4  1754                                                                        3
## 5  1755                                                                        3
## # ℹ abbreviated name:
## #   ¹`Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)`

read_csv("fertility-two-countries-example.csv", n_max=5)

## Rows: 2 Columns: 57
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): country
## dbl (56): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 57
##   country  `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Germany    2.41   2.44   2.47   2.49   2.49   2.48   2.44   2.37   2.28   2.17
## 2 South K…   6.16   5.99   5.79   5.57   5.36   5.16   4.99   4.85   4.73   4.62
## # ℹ 46 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## #   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## #   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## #   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## #   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## #   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## #   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …

read_csv("life-expectancy-and-fertility-two-countries-example.csv", n_max=5)

## Rows: 2 Columns: 113
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): country
## dbl (112): 1960_fertility, 1960_life_expectancy, 1961_fertility, 1961_life_e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 113
##   country     `1960_fertility` `1960_life_expectancy` `1961_fertility`
##   <chr>                  <dbl>                  <dbl>            <dbl>
## 1 Germany                 2.41                   69.3             2.44
## 2 South Korea             6.16                   53.0             5.99
## # ℹ 109 more variables: `1961_life_expectancy` <dbl>, `1962_fertility` <dbl>,
## #   `1962_life_expectancy` <dbl>, `1963_fertility` <dbl>,
## #   `1963_life_expectancy` <dbl>, `1964_fertility` <dbl>,
## #   `1964_life_expectancy` <dbl>, `1965_fertility` <dbl>,
## #   `1965_life_expectancy` <dbl>, `1966_fertility` <dbl>,
## #   `1966_life_expectancy` <dbl>, `1967_fertility` <dbl>,
## #   `1967_life_expectancy` <dbl>, `1968_fertility` <dbl>, …

read_csv("olive.csv", n_max=5)

## New names:
## • `` -> `...1`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 5 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Region
## dbl (9): ...1, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linole...
## num (1): eicosenoic
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 5 × 11
##    ...1 Region        Area palmitic palmitoleic stearic oleic linoleic linolenic
##   <dbl> <chr>        <dbl>    <dbl>       <dbl>   <dbl> <dbl>    <dbl>     <dbl>
## 1     1 North-Apulia     1        1        1075      75   226     7823       672
## 2     2 North-Apulia     1        1        1088      73   224     7709       781
## 3     3 North-Apulia     1        1         911      54   246     8113       549
## 4     4 North-Apulia     1        1         966      57   240     7952       619
## 5     5 North-Apulia     1        1        1051      67   259     7771       672
## # ℹ 2 more variables: arachidic <dbl>, eicosenoic <dbl>

read_csv("murders.csv", n_max=5)

## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, abb, region
## dbl (2): population, total
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 5 × 5
##   state      abb   region population total
##   <chr>      <chr> <chr>       <dbl> <dbl>
## 1 Alabama    AL    South     4779736   135
## 2 Alaska     AK    West       710231    19
## 3 Arizona    AZ    West      6392017   232
## 4 Arkansas   AR    South     2915918    93
## 5 California CA    West     37253956  1257

read_csv("ssa-death-probability.csv", n_max=5)

## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (3): Age, DeathProb, LifeExp
## num (1): NumberOfLives
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 5 × 5
##     Age Sex   DeathProb NumberOfLives LifeExp
##   <dbl> <chr>     <dbl>         <dbl>   <dbl>
## 1     0 Male   0.00638         100000    76.2
## 2     1 Male   0.000453         99362    75.6
## 3     2 Male   0.000282         99317    74.7
## 4     3 Male   0.00023          99289    73.7
## 5     4 Male   0.000169         99266    72.7
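
The six calls above differ only in the file name, so they could be collapsed into a single loop. This is a minimal sketch, assuming the dslabs package is installed; the names csv_files and previews are introduced here for illustration and do not appear in the exercise. Reading from the full paths also avoids copying the files first.

```r
library(readr)

# Locate the example files shipped with dslabs and keep only the CSV files.
path <- system.file("extdata", package = "dslabs")
csv_files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)

# Read the first five rows of each file. show_col_types = FALSE quiets the
# column specification message shown in the outputs above.
previews <- lapply(csv_files, read_csv, n_max = 5, show_col_types = FALSE)
names(previews) <- basename(csv_files)
```

Each preview can then be inspected by name, for example previews[["murders.csv"]].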

2. Note that one of the files, the olive file, gives us a warning. This is because the first line of the file is missing the header for the first column.

Read the help file for read_csv to figure out how to read in the file without reading this header. If you skip the header, you should not get this warning. Save the result to an object called dat.

?read_csv

dat <- read_csv("olive.csv", skip = 1)

## New names:
## • `1` -> `1...1`
## • `1` -> `1...3`
## • `1` -> `1...4`
## Rows: 571 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): North-Apulia
## dbl (11): 1...1, 1...3, 1...4, 1075, 75, 226, 7823, 672, 36, 60, 29
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
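
The column names in the output above (for example 1075 and North-Apulia) suggest that with skip = 1 alone, read_csv treats the first data row as the header, so that row is lost from the data. One possible alternative is to add col_names = FALSE, so that no line is used as a header. The sketch below shows the difference on a hypothetical three-line file with the same quirk as olive.csv (the file contents and the names tmp, a, and b are made up for illustration).

```r
library(readr)

# Hypothetical miniature file: the first header field is empty, as in olive.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c(",Region,Area", "1,North-Apulia,1", "2,North-Apulia,1"), tmp)

# skip = 1 alone: the first data row is consumed as column names.
a <- read_csv(tmp, skip = 1, show_col_types = FALSE)

# skip = 1 with col_names = FALSE: every data row is kept and the columns
# receive the default names X1, X2, X3.
b <- read_csv(tmp, skip = 1, col_names = FALSE, show_col_types = FALSE)

nrow(a)
## [1] 1
nrow(b)
## [1] 2
names(b)
## [1] "X1" "X2" "X3"
```

Applied to the exercise, the call would be read_csv("olive.csv", skip = 1, col_names = FALSE); the default X names are still uninformative, which is the problem the next exercise addresses.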

3. A problem with the previous approach is that we don’t know what the columns represent. Type:

names(dat)

to see that the names are not informative.

Use the readLines function to read in just the first line (we later learn how to extract values from the output).

names(dat)

##  [1] "1...1"        "North-Apulia" "1...3"        "1...4"        "1075"        
##  [6] "75"           "226"          "7823"         "672"          "36"          
## [11] "60"           "29"

read_lines("olive.csv", n_max = 1)

## [1] ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic"
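
The exercise asks for base R's readLines, while the call above uses readr's read_lines; both return the header as a single string. The sketch below uses readLines on a hypothetical temporary file that holds the header shown above and the first data row as it appears in names(dat), then splits both lines on commas. Splitting with strsplit is one possible way to extract the values and is not necessarily the method the course introduces later.

```r
# Hypothetical temporary file holding the first two lines of olive.csv,
# copied from the outputs above.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  ",Region,Area,palmitic,palmitoleic,stearic,oleic,linoleic,linolenic,arachidic,eicosenoic",
  "1,North-Apulia,1,1,1075,75,226,7823,672,36,60,29"
), tmp)

header <- readLines(tmp, n = 1)             # base R equivalent of read_lines(n_max = 1)
fields <- strsplit(header, ",")[[1]]        # the empty first name becomes ""
first_row <- strsplit(readLines(tmp, n = 2)[2], ",")[[1]]

length(fields)
## [1] 11
length(first_row)
## [1] 12
```

The header has 11 fields while the data row has 12, so the header names cannot be assigned to dat one-to-one without deciding which column is unnamed. This mismatch may also be related to the parsing warning seen in Exercise 1.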


Dimple K. Patel

2024-01-16