packages = c(
"dplyr","ggplot2","stringr", "dslabs", "readr", "tidyr", "purrr",
"lubridate"
)
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
options(digits=4, scipen=12)
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyr)
library(dslabs)
Q1: A collaborator sends you a file containing data for three years of average race finish times.
read.csv("data/AgeGroup.txt") %>% setNames(c('age_group',2015:2017))
age_group 2015 2016 2017
1 20 3:46 3:22 3:50
2 30 3:50 3:43 4:43
3 40 4:39 3:49 4:51
4 50 4:48 4:59 5:01
Are these data considered “tidy” in R? Why or why not?
Q2: Below are four versions of the same dataset.
read.table("data/state1.txt",header=T,sep="")
state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
read.table("data/state2.txt",header=T,sep="")
state abb region var people
1 Alabama AL South population 4779736
2 Alabama AL South total 135
3 Alaska AK West population 710231
4 Alaska AK West total 19
5 Arizona AZ West population 6392017
6 Arizona AZ West total 232
read.table("data/state3.txt",header=T,sep="")
read.table("data/state4.txt",header=T,sep="")
state abb region rate
1 Alabama AL South 0.0000282
2 Alaska AK West 0.0000268
3 Arizona AZ West 0.0000363
4 Arkansas AR South 0.0000319
5 California CA West 0.0000337
6 Colorado CO West 0.0000129
Which one is in a tidy format?
Q1: Your file called “times.csv” has age groups and average race finish times for three years of marathons. You read in the data file using the following command.
d = read_csv("data/times.csv")
Parsed with column specification:
cols(
age_group = col_integer(),
`2015` = col_time(format = ""),
`2016` = col_time(format = ""),
`2017` = col_time(format = "")
)
Which commands will help you “tidy” the data?
d %>% gather(year, time, `2015`:`2017`) %>% data.frame
age_group year time
1 20 2015 03:46:00
2 30 2015 03:50:00
3 40 2015 04:39:00
4 50 2015 04:48:00
5 20 2016 03:22:00
6 30 2016 03:43:00
7 40 2016 03:49:00
8 50 2016 04:59:00
9 20 2017 03:50:00
10 30 2017 04:43:00
11 40 2017 04:51:00
12 50 2017 05:01:00
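The reshaping above can be reproduced without the data file. The following is a minimal sketch, assuming an in-memory copy of the table (the names `d_wide` and `d_long` are hypothetical, and the times are kept as plain strings rather than parsed times). It also uses `convert = TRUE`, which asks `gather()` to convert the new `year` column from character to integer.

```r
library(dplyr)
library(tidyr)

# Hypothetical in-memory copy of the wide table; times kept as strings.
# check.names = FALSE preserves the non-syntactic column names 2015..2017.
d_wide <- data.frame(
  age_group = c(20, 30, 40, 50),
  `2015` = c("3:46", "3:50", "4:39", "4:48"),
  `2016` = c("3:22", "3:43", "3:49", "4:59"),
  `2017` = c("3:50", "4:43", "4:51", "5:01"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Stack the three year columns into key/value pairs;
# convert = TRUE runs type.convert() on the key column (year).
d_long <- d_wide %>%
  gather(year, time, `2015`:`2017`, convert = TRUE)

dim(d_long)      # 12 rows, 3 columns
class(d_long$year)
```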
Q2: You have a dataset on U.S. contagious diseases, but it is in the following wide format:
D = read.table("data/diseases.txt", header=T, sep="")
D
state year population Hepatitis_A Mumps Polio Rubella
1 Alabama 1990 4040587 86 19 76 1
2 Alabama 1991 4066003 39 14 65 0
3 Alabama 1992 4097169 35 12 24 0
4 Alabama 1993 4133242 40 22 67 0
5 Alabama 1994 4173361 72 12 39 0
6 Alabama 1995 4216645 75 2 38 0
Which of the following would transform this into a tidy dataset, with each row representing an observation of the incidence of each specific disease (as shown below)?
D %>% gather(disease, count, Hepatitis_A:Rubella) %>% head(10)
state year population disease count
1 Alabama 1990 4040587 Hepatitis_A 86
2 Alabama 1991 4066003 Hepatitis_A 39
3 Alabama 1992 4097169 Hepatitis_A 35
4 Alabama 1993 4133242 Hepatitis_A 40
5 Alabama 1994 4173361 Hepatitis_A 72
6 Alabama 1995 4216645 Hepatitis_A 75
7 Alabama 1990 4040587 Mumps 19
8 Alabama 1991 4066003 Mumps 14
9 Alabama 1992 4097169 Mumps 12
10 Alabama 1993 4133242 Mumps 22
Q3: You have successfully formatted marathon finish times into a tidy object called D. The first few lines are shown below.
D = read.table("data/times_long.txt", header=T, sep=",")
D
age_group year time
1 20 2015 03:46
2 30 2015 03:50
3 40 2015 04:39
4 50 2015 04:48
5 20 2016 03:22
Select the code that converts these data back to the wide format, where each year has a separate column.
D %>% spread(year, time)
age_group 2015 2016
1 20 03:46 03:22
2 30 03:50 <NA>
3 40 04:39 <NA>
4 50 04:48 <NA>
Q4: You have a file, state2.txt, with the following contents:
D = read.table("data/state2.txt", header=T, sep="")
D
state abb region var people
1 Alabama AL South population 4779736
2 Alabama AL South total 135
3 Alaska AK West population 710231
4 Alaska AK West total 19
5 Arizona AZ West population 6392017
6 Arizona AZ West total 232
You would like to transform it into a dataset where population and total are each their own column (shown below). Which code would best accomplish this?
D %>% spread(key=var, value=people)
state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
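As a self-contained illustration of the same step, here is a minimal sketch that rebuilds the six rows shown above in memory (the names `D_long` and `D_wide` are hypothetical) and spreads the `var` column into separate `population` and `total` columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical in-memory copy of the long table from state2.txt
D_long <- data.frame(
  state  = rep(c("Alabama", "Alaska", "Arizona"), each = 2),
  abb    = rep(c("AL", "AK", "AZ"), each = 2),
  region = rep(c("South", "West", "West"), each = 2),
  var    = rep(c("population", "total"), times = 3),
  people = c(4779736, 135, 710231, 19, 6392017, 232),
  stringsAsFactors = FALSE
)

# Each distinct value of var becomes a column, filled from people
D_wide <- D_long %>% spread(key = var, value = people)

names(D_wide)
nrow(D_wide)   # one row per state
```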
Q1: A collaborator sends you a file containing data for two years of average race finish times.
D = read.csv("data/times2.txt")
D
age_group X2015_time X2015_participants X2016_time
1 20 3:46 54 3:22
2 30 3:50 60 3:43
3 40 4:39 29 3:49
4 50 4:48 10 4:59
X2016_participants
1 62
2 58
3 33
4 14
Which of the answers below best tidies the data?
D %>% gather(key=key, value=value, -age_group) %>%
separate(col=key, into=c("year", "variable_name"), sep = "_") %>%
spread(key=variable_name, value=value) %>% data.frame
Warning: attributes are not identical across measure variables; they will be dropped
age_group year participants time
1 20 X2015 54 3:46
2 20 X2016 62 3:22
3 30 X2015 60 3:50
4 30 X2016 58 3:43
5 40 X2015 29 4:39
6 40 X2016 33 3:49
7 50 X2015 10 4:48
8 50 X2016 14 4:59
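The output above still carries the `X` prefix that `read.csv()` adds to column names starting with a digit, and every value ends up as text. One possible cleanup is sketched below, assuming an in-memory copy of the table read with strings rather than factors (the name `tidy_times` is hypothetical). It strips the prefix with `sub()` and uses `convert = TRUE` in `spread()` so that `participants` becomes numeric again.

```r
library(dplyr)
library(tidyr)

# Hypothetical in-memory copy of times2.txt, with strings instead of factors
D <- data.frame(
  age_group = c(20, 30, 40, 50),
  X2015_time = c("3:46", "3:50", "4:39", "4:48"),
  X2015_participants = c(54, 60, 29, 10),
  X2016_time = c("3:22", "3:43", "3:49", "4:59"),
  X2016_participants = c(62, 58, 33, 14),
  stringsAsFactors = FALSE
)

tidy_times <- D %>%
  # stack all measure columns; mixed types are coerced to character
  gather(key = key, value = value, -age_group) %>%
  # split "X2015_time" into "X2015" and "time"
  separate(col = key, into = c("year", "variable_name"), sep = "_") %>%
  # drop the leading X and store the year as an integer
  mutate(year = as.integer(sub("^X", "", year))) %>%
  # one column per variable; convert = TRUE re-types the new columns
  spread(key = variable_name, value = value, convert = TRUE)

tidy_times
```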
Q2: You are in the process of tidying some data on heights, hand length, and wingspan for basketball players in the draft. Currently, you have the following:
stats = read.table("data/player.txt",header=T,sep="")
stats
key value
1 allen_height 75.00
2 allen_hand_length 8.25
3 allen_wingspan 79.25
4 bamba_height 83.25
5 bamba_hand_length 9.75
6 bamba_wingspan 94.00
Select all of the correct commands below that would turn this data into a “tidy” format.
stats %>%
separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>%
spread(key=variable_name, value=value)
player hand_length height wingspan
1 allen 8.25 75.00 79.25
2 bamba 9.75 83.25 94.00
stats %>%
separate(col=key, into=c("player", "variable_name1", "variable_name2"),
sep="_", fill="right") %>%
unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>%
spread(key = variable_name, value = value)
player hand_length height_NA wingspan_NA
1 allen 8.25 75.00 79.25
2 bamba 9.75 83.25 94.00
Q1: You have created two tables, tab1 and tab2, of state population and election data, similar to our module videos:
tab1 = read.table("data/tab1.txt",header=T,sep="",stringsAsFactors=F)
tab1
state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Delaware 897934
5 District_of_Columbia 601723
tab2 = read.table("data/tab2.txt",header=T,sep="",stringsAsFactors=F)
tab2
state electoral_votes
1 Alabama 9
2 Alaska 3
3 Arizona 11
4 California 55
5 Colorado 9
6 Connecticut 7
What are the dimensions of the table dat, created by the following command?
left_join(tab1, tab2, by = "state") %>% dim
[1] 5 3
Q2: We are still using the tab1 and tab2 tables shown in question 1. What join command would create a new table dat with three rows and two columns?
semi_join(tab1, tab2, by = "state")
state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
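To compare the join types side by side, the sketch below rebuilds tab1 and tab2 in memory from the rows shown in question 1 and collects the dimensions of each dplyr join. The helper names `joins` and `dims` are hypothetical.

```r
library(dplyr)

# In-memory copies of the two tables shown in question 1
tab1 <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Delaware", "District_of_Columbia"),
  population = c(4779736, 710231, 6392017, 897934, 601723),
  stringsAsFactors = FALSE
)
tab2 <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "California", "Colorado", "Connecticut"),
  electoral_votes = c(9, 3, 11, 55, 9, 7),
  stringsAsFactors = FALSE
)

joins <- list(
  left  = left_join(tab1, tab2, by = "state"),   # all rows of tab1
  right = right_join(tab1, tab2, by = "state"),  # all rows of tab2
  inner = inner_join(tab1, tab2, by = "state"),  # states in both tables
  full  = full_join(tab1, tab2, by = "state"),   # states in either table
  semi  = semi_join(tab1, tab2, by = "state"),   # tab1 rows with a match, tab1 columns only
  anti  = anti_join(tab1, tab2, by = "state")    # tab1 rows without a match
)

# One row per join type: number of rows and columns in the result
dims <- t(sapply(joins, dim))
colnames(dims) <- c("rows", "cols")
dims
```

Only the three states Alabama, Alaska, and Arizona appear in both tables, which is why `inner_join()` and `semi_join()` return three rows; `semi_join()` additionally keeps only the columns of tab1.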
Q1: Which of the following are real differences between the join and bind functions? Please select all correct answers.
Q1: We have two simple tables, shown below:
df1 = data.frame(x=c("a","b"), y=c("a","a"), stringsAsFactors=F); df1
x y
1 a a
2 b a
df2 = data.frame(x=c("a","a"), y=c("a","b"), stringsAsFactors=F); df2
x y
1 a a
2 a b
Which command would result in the following table?
dplyr::setdiff(df1, df2)
x y
1 b a
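For comparison, here is a short sketch of the other dplyr set operations applied to the same two tables. The result names (`in_both`, `in_either`, `only_in_df1`, `only_in_df2`) are hypothetical. Note that `setdiff()` is not symmetric: swapping the arguments gives a different row.

```r
library(dplyr)

df1 <- data.frame(x = c("a", "b"), y = c("a", "a"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c("a", "a"), y = c("a", "b"), stringsAsFactors = FALSE)

in_both     <- dplyr::intersect(df1, df2)  # rows present in both tables
in_either   <- dplyr::union(df1, df2)      # distinct rows from either table
only_in_df1 <- dplyr::setdiff(df1, df2)    # rows of df1 that are not in df2
only_in_df2 <- dplyr::setdiff(df2, df1)    # rows of df2 that are not in df1

only_in_df1
only_in_df2
```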
Q1: Which feature of HTML documents allows us to extract the table that we are interested in?
Q2: In the video, we use the following code to extract the murders table (tab) from our downloaded html file h:
tab <- h %>% html_nodes("table")
tab <- tab[[2]] %>% html_table
Why did we use the html_nodes() command instead of the html_node() command?
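The difference can be seen on a small example. The sketch below does not use the downloaded murders page; it assumes a made-up HTML string with two tables (the variable names `html`, `all_tables`, and `first_only` are hypothetical). `html_nodes()` returns every matching node, so the second table can be picked by index, while `html_node()` returns only the first match.

```r
library(rvest)

# Made-up page with two tables; only the second holds the data of interest
html <- '<html><body>
<table><tr><th>note</th></tr><tr><td>1</td></tr></table>
<table><tr><th>state</th><th>total</th></tr>
<tr><td>Alabama</td><td>135</td></tr></table>
</body></html>'

h <- read_html(html)

all_tables <- h %>% html_nodes("table")  # every <table> node in the page
first_only <- h %>% html_node("table")   # only the first <table> node

length(all_tables)

# Select the second table and convert it to a data frame
tab <- all_tables[[2]] %>% html_table()
tab
```

In rvest 1.0.0 and later these functions are also available under the names `html_elements()` and `html_element()`; the older names used in the quiz still work.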