Data Science 1 with R (STAT 301-1)

# Loading package(s)
library(tidyverse)
library(readr)
library(janitor)

All datasets are either coded inline or contained in the data sub-directory within the downloaded zip file.
Demonstrate how to read in TopBabyNamesByState.txt contained in the data sub-directory using the appropriate function from the readr package. After reading in the data, determine the top male and female names in 1984 for South Dakota.
top_baby_names <- read_delim("data/TopBabyNamesbyState.txt")

top_baby_names %>%
  filter(year == 1984, state == "SD")

# A tibble: 2 × 5
state gender year top_name occurences
<chr> <chr> <dbl> <chr> <dbl>
1 SD F 1984 Jessica 166
2 SD M 1984 Matthew 220
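Since readr 2.0, read_delim() guesses the delimiter when none is supplied. If the file is in fact tab-delimited (worth confirming in a text editor), passing the delimiter explicitly is a bit safer and silences the guessing message; a minimal sketch:

top_baby_names <- read_delim("data/TopBabyNamesbyState.txt", delim = "\t")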
What is the main difference between read_csv(), read_csv2(), and read_tsv()?
They all accept the parameters na, col_names, and trim_ws. Please describe in your own words what each of these parameters controls.
read_csv() reads comma-delimited files, read_csv2() reads semicolon-delimited files (common in locales that use the comma as the decimal mark), and read_tsv() reads tab-delimited files. na controls which strings are interpreted as missing values, col_names controls whether the first row is treated as column names (or supplies a character vector of names to use instead), and trim_ws controls whether leading/trailing whitespace is trimmed from fields before parsing.
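A small illustration of all three parameters at once, using inline data rather than a file (the column names here are made up; I() marks the string as literal data in readr 2.0+):

read_csv(
  I("1, ., x \n2, 5, y"),
  col_names = c("id", "score", "label"),  # no header row, so supply the names
  na = ".",                               # treat "." as a missing value
  trim_ws = TRUE                          # strip the stray spaces around fields
)
# A 2 × 3 tibble; score is NA in the first row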
Read in the fixed-width file fwf_example.txt contained in the sub-directory data. We have provided the column names so the final dataset has appropriate names. You may want to look at the dataset in a text editor to get an idea of what you are dealing with. (Hint: Use fwf_widths() for col_positions and the skip parameter where appropriate.)
# Column names
column_names <- c(
  "Entry", "Per.", "Post Date", "GL Account", "Description", "Srce.",
  "Cflow", "Ref.", "Post", "Debit", "Credit", "Alloc."
)

# Column widths, measured from the raw file
col_widths <- c(8, 4, 12, 13, 27, 5, 4, 10, 13, 19, 13, 4)

fwf_data <- read_fwf(
  file = "data/fwf_example.txt",
  skip = 2,
  col_positions = fwf_widths(
    widths = col_widths,
    col_names = column_names
  )
)

fwf_data %>%
  glimpse()

Rows: 17
Columns: 12
$ Entry <dbl> 16524, 191675, 191667, 191673, 80495, 80507, 80509, 80497…
$ Per. <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01…
$ `Post Date` <chr> "10/17/2012", "01/14/2013", "01/14/2013", "01/14/2013", "…
$ `GL Account` <dbl> 3930621977, 2368183100, 3714468136, 2632703881, 276638979…
$ Description <chr> "TXNPUES", "OUNHQEX XUFQONY", "GHAKASC QHJXDFM", "PAHFSAP…
$ Srce. <chr> "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1", "S1…
$ Cflow <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes…
$ Ref. <chr> "RHMXWPCP", NA, NA, NA, "TGZGMOXG", "USUKVMZO", "DNUNTASS…
$ Post <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ Debit <dbl> NA, NA, 3172.53, 983.21, NA, NA, NA, NA, 9167.93, 746.70,…
$ Credit <dbl> 5007.10, 43537.00, NA, NA, 903.78, 670.31, 848.50, 7.31, …
$ Alloc. <chr> "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes…
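If you would rather not measure the widths by hand, readr can also guess the column boundaries from runs of whitespace with fwf_empty(); a sketch using the same file:

fwf_guess <- read_fwf(
  file = "data/fwf_example.txt",
  skip = 2,
  col_positions = fwf_empty("data/fwf_example.txt", skip = 2, col_names = column_names)
)

Guessed boundaries can be wrong when columns run together, so the explicit fwf_widths() call above is the safer choice here.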
Generate the correct format string to parse each of the following dates and times:
# various date and time string formats
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"

parse_date(d1, format = "%B %d, %Y")
[1] "2010-01-01"

parse_date(d2, format = "%Y-%b-%d")
[1] "2015-03-07"

parse_date(d3, format = "%d-%b-%Y")
[1] "2017-06-06"

parse_date(d4, format = "%B %d (%Y)")
[1] "2015-08-19" "2015-07-01"

parse_date(d5, format = "%m/%d/%y")
[1] "2014-12-30"

parse_time(t1, format = "%H%M")
17:05:00

parse_time(t2, format = "%I:%M:%OS %p")
23:15:10.12
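One related wrinkle: %B and %b match month names in the reading locale, so non-English dates need an explicit locale argument. A minimal sketch with a French date:

parse_date("1 janvier 2015", format = "%d %B %Y", locale = locale("fr"))
[1] "2015-01-01"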
Read in and store cdc.txt from the data sub-directory. Then write the dataset back out to the data sub-directory as a tab-delimited text file named cdc_tab.txt and as an RDS file named cdc.rds.

cdc_data <- read_delim("data/cdc.txt")

# Quick check that the data read in sensibly
ggplot(cdc_data, aes(x = weight, y = gender)) +
  geom_boxplot(varwidth = TRUE)

write_tsv(x = cdc_data, file = "data/cdc_tab.txt")
write_rds(x = cdc_data, file = "data/cdc.rds")

RDS files tend to take up less storage space and preserve data types and classes, which means they don't need to be redefined after the file is reloaded. RDS also supports list-columns, which the feather format does not.
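As a quick check on the round trip (this snippet is just an illustration):

cdc_back <- read_rds("data/cdc.rds")
identical(cdc_back, cdc_data)  # should be TRUE: column types and classes survive intact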
What types of files do the packages haven and readxl deal with?
haven reads SPSS, Stata, and SAS files, whereas readxl reads Excel files (both .xls and .xlsx).
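A minimal sketch of the corresponding readers (the file names here are hypothetical):

library(haven)
library(readxl)

spss_data  <- read_sav("data/survey.sav")      # SPSS
stata_data <- read_dta("data/survey.dta")      # Stata
sas_data   <- read_sas("data/survey.sas7bdat") # SAS
excel_data <- read_excel("data/survey.xlsx")   # Excel (.xls or .xlsx)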
What function in janitor helps you deal with non-syntactic column names in R and also ensures column names are systematically handled? Demonstrate its use.
dirty_tibble <- tibble(
  `id.number` = c(1, 2, 3, 4, 5),
  `First Name` = c("Liam", "Ally", "Jessica", "David", "Rohan"),
  `LastName` = c("Liam", "Ally", "Jessica", "David", "Rohan")
)

clean_tibble <- dirty_tibble %>%
  clean_names()

print(clean_tibble)

# A tibble: 5 × 3
id_number first_name last_name
<dbl> <chr> <chr>
1 1 Liam Liam
2 2 Ally Ally
3 3 Jessica Jessica
4 4 David David
5 5 Rohan Rohan
The clean_names() function handles troublesome variable names with spaces, special characters, or other irregularities. As you can see, id.number, First Name, and LastName were all converted to a single standardized format, with words separated by underscores. This is very useful when bringing in data from Excel or other sources, where variable names may not be clean.
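The same heuristic is available for any character vector via make_clean_names():

make_clean_names(c("id.number", "First Name", "LastName"))
[1] "id_number"  "first_name" "last_name"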
Suppose a data file takes a long time to read in, which can be a problem since we will need to read it in every time we come back to work on it. How could you increase the read in speed of this dataset for future R sessions?
The fread() function from the data.table package is roughly 10x faster than its base equivalents. Even though it doesn't fit as naturally into the tidyverse, that speed is very useful for larger datasets.
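A minimal sketch, assuming data.table is installed:

library(data.table)

cdc_fast <- fread("data/cdc.txt")  # much faster than read_delim() on large files
cdc_tbl  <- as_tibble(cdc_fast)    # convert back for tidyverse workflows

Another option is to read the file once and cache it with write_rds(), as above, so later sessions can call read_rds() instead of re-parsing the text file.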
Required for graduate students, but not for undergraduate students.
Extract 2016 total population estimates for each state and DC from cc-est2016-alldata.csv contained in the data sub-directory. Force RStudio to display all rows, arranged from largest to smallest in population.

Hint: You may want to read cc-est2016-alldata.pdf for details.
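A sketch of one approach, assuming the standard layout of this Census file (per the codebook, YEAR == 9 should be the 7/1/2016 estimate and AGEGRP == 0 the all-ages total; confirm both codes in cc-est2016-alldata.pdf):

read_csv("data/cc-est2016-alldata.csv") %>%
  filter(YEAR == 9, AGEGRP == 0) %>%
  group_by(STNAME) %>%
  summarize(total_pop = sum(TOT_POP)) %>%
  arrange(desc(total_pop)) %>%
  print(n = Inf)  # n = Inf forces all 51 rows to display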