L06 Data Import

Data Science 1 with R (STAT 301-1)

Author

YOUR NAME

Load Packages

Code
# Loading package(s)
library(tidyverse)
library(readr)
library(janitor)

Datasets

All datasets are either coded inline or contained in the data sub-directory within the downloaded zip file.

Exercises

Exercise 1

Demonstrate how to read in TopBabyNamesByState.txt contained in the data sub-directory using the appropriate function from readr package. After reading in the data, determine the top male and female names in 1984 for South Dakota.

Code
top_baby_names <- read_delim("data/TopBabyNamesbyState.txt")
Code
top_baby_names %>%
  filter(year == 1984, state == "SD")
# A tibble: 2 × 5
  state gender  year top_name occurences
  <chr> <chr>  <dbl> <chr>         <dbl>
1 SD    F       1984 Jessica         166
2 SD    M       1984 Matthew         220

Exercise 2

What is the main difference between read_csv(), read_csv2(), and read_tsv()?

They all accept the parameters na, col_names, and trim_ws. Please describe in your own words what each of these parameters control.

Solution

read_csv() reads comma delimited files, read_csv2() reads semicolon separated files, and read_tsv() reads tab delimited files. na controls which strings are dictated as missing values, col_names sets the column names of a character vector, and trim_ws gets rid of leading/trailing whitespace from character strings.

Exercise 3

Read in the fixed width file fwf_example.txt contained in the sub-directory data. We have provided the column names so the final dataset has appropriate names. You may want to look at the dataset in a text editor to get an idea of what you are dealing with. (Hint: Use fwf_widths() for col_positions and the skip parameter where appropriate.)

Code
# Column names.
column_names <- c(
  "Entry", "Per.", "Post Date", "GL Account", "Description", "Srce.", 
  "Cflow", "Ref.", "Post", "Debit", "Credit", "Alloc."
  )
Code
col_widths <-c(8, 4, 12, 13, 27, 5, 4, 10, 13, 19, 13, 4)

fwf_data <- read_fwf(
  file = "data/fwf_example.txt",
  skip = 2,
  col_positions = fwf_widths(
    widths = col_widths, 
    col_names = column_names
    )
)
Code
fwf_data %>%
  glimpse()
Rows: 17
Columns: 12
$ Entry        <dbl> 16524, 191675, 191667, 191673, 80495, 80507, 80509, 80497…
$ Per.         <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01…
$ `Post Date`  <chr> "10/17/2012", "01/14/2013", "01/14/2013", "01/14/2013", "…
$ `GL Account` <dbl> 3930621977, 2368183100, 3714468136, 2632703881, 276638979…
$ Description  <chr> "TXNPUES", "OUNHQEX XUFQONY", "GHAKASC QHJXDFM", "PAHFSAP…
$ Srce.        <chr> "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1…
$ Cflow        <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes…
$ Ref.         <chr> "RHMXWPCP", NA, NA, NA, "TGZGMOXG", "USUKVMZO", "DNUNTASS…
$ Post         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ Debit        <dbl> NA, NA, 3172.53, 983.21, NA, NA, NA, NA, 9167.93, 746.70,…
$ Credit       <dbl> 5007.10, 43537.00, NA, NA, 903.78, 670.31, 848.50, 7.31, …
$ Alloc.       <chr> "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes…

Exercise 4

Generate the correct format string to parse each of the following dates and times:

Code
# various date and time string formats
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
Code
d1 <- "January 1, 2010"
parse_date(d1, format = "%B %d, %Y")
[1] "2010-01-01"
Code
d2 <- "2015-Mar-07"
parse_date(d2, format = "%Y-%b-%d")
[1] "2015-03-07"
Code
d3 <- "06-Jun-2017"
parse_date(d3, format = "%d-%b-%Y")
[1] "2017-06-06"
Code
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format = "%B %d (%Y)")
[1] "2015-08-19" "2015-07-01"
Code
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, format = "%m/%d/%y")
[1] "2014-12-30"
Code
t1 <- "1705"
parse_time(t1, format = "%H%M" )
17:05:00
Code
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
23:15:10.12

Exercise 5

Read in and store cdc.txt from the data sub-directory.

  • Create a graphic that compares the distribution of weights for each gender.
  • Write out a copy of the dataset to the data sub-directory as a tab deliminated text file named cdc_tab.txt.
  • Write out a copy of the dataset to the data sub-directory as an RDS file named cdc.rds.
  • List some benefits of writing a dataset to an RDS file type.
Code
cdc_data <- read_delim("data/cdc.txt")

ggplot(cdc_data, aes(x = weight, y = gender)) +
  geom_boxplot(varwidth = TRUE)

Code
write_tsv(x = cdc_data, file = "data/cdc_tab.txt")

write_rds(x= cdc_data, file = "data/cdc.rds")

Solution

RDS files tend to take up less storage space, preserves data types and classes, thus meaning that they don’t need to be redefined after the file is reloaded. Meanwhile, RDS also supports list columns, which feather does not.

Exercise 6

What types of files do the packages haven and readxl deal with?

Solution

Haven reads SPSS, Stata, and SAS files, whereas readxl reads excel files.

Exercise 7

What function in janitor helps you deal with non-syntactic column names in R and and also ensures column names are systematically handled? Demonstrate its use.

Code
dirty_tibble <- tibble(
  `id.number` = c(1,2,3,4,5),
  `First Name` = c('Liam','Ally','Jessica', 'David', 'Rohan'),
  `LastName` = c('Liam','Ally','Jessica', 'David', 'Rohan')
)

clean_tibble <- dirty_tibble %>% 
  clean_names()
Code
print(clean_tibble)
# A tibble: 5 × 3
  id_number first_name last_name
      <dbl> <chr>      <chr>    
1         1 Liam       Liam     
2         2 Ally       Ally     
3         3 Jessica    Jessica  
4         4 David      David    
5         5 Rohan      Rohan    

Solution

The clean_names function handles troublesome variable names with spacing, special characters, or other irregularities. As you can see,id.number, First Name, and LastName now are all separated by an underscore; since one variable had a space, they were all defaultd to a specific standardzied format. This is very beneficial when exporting data from excel or other files, where the variable names may not be clean.

Solution

Exercise 8

Suppose a data file takes a long time to read in, which can be a problem since we will need to read it in every time we come back to work on it. How could you increase the read in speed of this dataset for future R sessions?

Solution

The fread() function is extremely faster (10x) than their base equivalents, which can be found from the data.table package. Even if it doesn’t fit as well into the tidyverse, the speed is very useful for larger datasets.

Challenge

Required for graduate students, but not for undergraduate students.

Extract 2016 total population estimates for each state and DC from cc-est2016-alldata.csv contained in the data subdirectory. Force RStudio to display all rows and have them arranged from largest to smallest in population.

Hint: May want to read the cc-est2016-alldata.pdf for details.