packages = c(
"dplyr","ggplot2","stringr", "dslabs", "readr", "tidyr", "purrr",
"lubridate"
)
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
rm(list=ls(all.names=TRUE))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=4, scipen=12)
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyr)
library(dslabs)
Q1: A collaborator sends you a file containing data for three years of average race finish times.
A <- read.csv("data/AgeGroup.txt")
incomplete final line found by readTableHeader on 'data/AgeGroup.txt'
A
A %>% setNames(c('age_group',2015:2017))
Are these data considered “tidy” in R? Why or why not?
Q2: Below are four versions of the same dataset.
read.table("data/state1.txt",header=T,sep="")
read.table("data/state2.txt",header=T,sep="")
read.table("data/state3.txt",header=T,sep="")
read.table("data/state4.txt",header=T,sep="")
Which one is in a tidy format?
Q1: Your file called “times.csv” has age groups and average race finish times for three years of marathons. You read in the data file using the following command.
d = read_csv("data/times.csv")
Parsed with column specification:
cols(
age_group = col_integer(),
`2015` = col_time(format = ""),
`2016` = col_time(format = ""),
`2017` = col_time(format = "")
)
d
# A table laid out like this is called a wide table
Which commands will help you “tidy” the data?
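d %>% gather(year, time, `2015`:`2017`) %>% data.frame # key and value are matched by position: year (key) holds the former column names, time (value) holds the cell values
# gather() reshapes a wide table into a long table
# gather(data, key = "new column recording which source column each value came from", value = "new column holding the stacked values", <columns to gather>)
# Long table (tidy form): each row is one observation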
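To make the wide-to-long step concrete, here is a minimal sketch on a small made-up table. The `wide` data frame and its values are hypothetical and are not the contents of times.csv. It also shows `pivot_longer()`, which tidyr 1.0.0 and later offer as the successor to `gather()`.

```r
library(tidyr)

# Hypothetical wide table: one row per age group, one column per year
wide <- data.frame(
  age_group = c(20, 30),
  `2015` = c("3:46", "3:50"),
  `2016` = c("3:22", "3:43"),
  check.names = FALSE,       # keep `2015` / `2016` as column names
  stringsAsFactors = FALSE
)

# gather(): the key column receives the old column names,
# the value column receives the cell values
long <- gather(wide, key = "year", value = "time", `2015`:`2016`)

# pivot_longer() expresses the same reshape with names_to / values_to
long2 <- pivot_longer(wide, cols = `2015`:`2016`,
                      names_to = "year", values_to = "time")

print(long)
```

Both results have one row per (age_group, year) pair; only the row order differs between the two functions.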
Q2: You have a dataset on U.S. contagious diseases, but it is in the following wide format:
D = read.table("data/diseases.txt", header=T, sep="")
D
Which of the following would transform this into a tidy dataset, with each row representing an observation of the incidence of each specific disease (as shown below)?
D %>% gather(disease, count, "Hepatitis_A": "Rubella") %>% head(10)
Q3: You have successfully formatted marathon finish times into a tidy object called D. The first few lines are shown below.
D = read.table("data/times_long.txt", header=T, sep=",")
D
Select the code that converts these data back to the wide format, where each year has a separate column.
D %>% spread(year, time)
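The reverse direction can be sketched the same way. The `long` data frame below is made up for illustration and is not the contents of times_long.txt. `pivot_wider()` is the tidyr 1.0.0+ counterpart of `spread()`.

```r
library(tidyr)

# Hypothetical long (tidy) table: one row per age group and year
long <- data.frame(
  age_group = c(20, 30, 20, 30),
  year = c(2015, 2015, 2016, 2016),
  time = c("3:46", "3:50", "3:22", "3:43"),
  stringsAsFactors = FALSE
)

# spread(): each distinct value of `year` becomes its own column
wide <- spread(long, key = year, value = time)

# pivot_wider() performs the same reshape with names_from / values_from
wide2 <- pivot_wider(long, names_from = year, values_from = time)

print(wide)
```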
Q4: You have a file in the following format:
D = read.table("data/state2.txt", header=T, sep="")
D
You would like to transform it into a dataset where population and total are each their own column (shown below). Which code would best accomplish this?
D %>% spread(key=var, value=people)
Q1: A collaborator sends you a file containing data for two years of average race finish times.
D = read.csv("data/times2.txt")
D
Which of the answers below best tidies the data?
D %>% gather(key=key, value=value, -age_group) %>%
separate(col=key, into=c("year", "variable_name"), sep = "_") %>%
spread(key=variable_name, value=value) %>% data.frame
attributes are not identical across measure variables;
they will be dropped
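The gather, separate, spread chain above can be traced on a small table. This is a sketch under assumptions: the `wide` data frame, its column names, and its values are hypothetical stand-ins for times2.txt. The warning shown above typically arises when the gathered columns are factors with differing levels; the sketch uses numeric columns, so it should not appear here.

```r
library(tidyr)
library(dplyr)

# Hypothetical table whose column names encode two things: year and variable
wide <- data.frame(
  age_group = c(20, 30),
  `2015_time` = c(226, 230),
  `2015_participants` = c(54, 60),
  `2016_time` = c(202, 223),
  `2016_participants` = c(62, 58),
  check.names = FALSE
)

tidy <- wide %>%
  # 1. stack every column except age_group into key/value pairs
  gather(key = "key", value = "value", -age_group) %>%
  # 2. split "2015_time" into year = "2015" and variable_name = "time"
  separate(col = key, into = c("year", "variable_name"), sep = "_") %>%
  # 3. give each variable its own column again
  spread(key = variable_name, value = value)

print(tidy)
```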
Q2: You are in the process of tidying some data on heights, hand length, and wingspan for basketball players in the draft. Currently, you have the following:
stats = read.table("data/player.txt",header=T,sep="")
stats
Select all of the correct commands below that would turn this data into a “tidy” format.
stats %>%
separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>%
spread(key=variable_name, value=value)
stats %>%
separate(col=key, into=c("player", "variable_name1", "variable_name2"),
sep="_", fill="right") %>%
unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>%
spread(key = variable_name, value = value)
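The two commands above differ in the column names they produce. The sketch below illustrates this on a made-up `stats` table (player labels and measurements are hypothetical, not the contents of player.txt): with `extra = "merge"` the second piece keeps everything after the first underscore, whereas `fill = "right"` followed by `unite()` pastes the filled-in `NA` into the name under `unite()`'s default `na.rm = FALSE`.

```r
library(tidyr)
library(dplyr)

# Hypothetical key/value table; some variable names contain an underscore
stats <- data.frame(
  key = c("p1_height", "p1_hand_length", "p1_wingspan",
          "p2_height", "p2_hand_length", "p2_wingspan"),
  value = c(75, 8.25, 79.25, 83.25, 9.75, 94),
  stringsAsFactors = FALSE
)

# Approach 1: split only at the first "_" and merge the remainder
tidy <- stats %>%
  separate(col = key, into = c("player", "variable_name"),
           sep = "_", extra = "merge") %>%
  spread(key = variable_name, value = value)

# Approach 2: split into three pieces, pad with NA, then glue back together
alt <- stats %>%
  separate(col = key, into = c("player", "variable_name1", "variable_name2"),
           sep = "_", fill = "right") %>%
  unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>%
  spread(key = variable_name, value = value)

print(names(tidy))
print(names(alt))
```

Comparing the two sets of names shows why the first approach gives cleaner column headers.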
Q1: You have created two tables, tab1 and tab2, of state population and election data, similar to our module videos:
tab1 = read.table("data/tab1.txt",header=T,sep="",stringsAsFactors=F)
tab1
tab2 = read.table("data/tab2.txt",header=T,sep="",stringsAsFactors=F)
tab2
What are the dimensions of the table dat, created by the following command?
left_join(tab1, tab2, by = "state") %>% dim
Q2: We are still using the tab1 and tab2 tables shown in question 1. What join command would create a new table dat with three rows and two columns?
semi_join(tab1, tab2, by = "state")
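How the join type determines the dimensions of the result can be sketched with two small stand-in tables. The state labels and numbers below are hypothetical, not the contents of tab1.txt or tab2.txt; the only assumption carried over is that the two tables share some, but not all, states.

```r
library(dplyr)

# Hypothetical tables: 5 states with population, 6 states with electoral votes,
# of which 3 states (A, B, C) appear in both
tab1 <- data.frame(state = c("A", "B", "C", "D", "E"),
                   population = c(10, 20, 30, 40, 50),
                   stringsAsFactors = FALSE)
tab2 <- data.frame(state = c("A", "B", "C", "X", "Y", "Z"),
                   electoral_votes = c(9, 3, 11, 6, 55, 7),
                   stringsAsFactors = FALSE)

# left_join keeps every row of tab1 and adds tab2's columns (NA where unmatched)
dim_left <- dim(left_join(tab1, tab2, by = "state"))

# inner_join keeps only states present in both tables, with columns from both
dim_inner <- dim(inner_join(tab1, tab2, by = "state"))

# semi_join keeps the rows of tab1 that have a match, but adds no columns
dim_semi <- dim(semi_join(tab1, tab2, by = "state"))

print(rbind(left = dim_left, inner = dim_inner, semi = dim_semi))
```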
Q1: Which of the following are real differences between the join and bind functions? Please select all correct answers.
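One way to explore this question is to run both families on the same inputs. The sketch below uses two made-up data frames (`a` and `b` are hypothetical) whose rows are deliberately in a different order: joins align rows by a key column, while the bind functions combine by position and never look at keys.

```r
library(dplyr)

# Hypothetical tables sharing an id column, with rows in different orders
a <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c(3, 1, 2), y = c("C", "A", "B"),
                stringsAsFactors = FALSE)

# Join: rows are matched on id, so y lines up with the right x
joined <- left_join(a, b, by = "id")

# bind_cols: columns are pasted side by side by position (no matching);
# both inputs must have the same number of rows
bound <- bind_cols(a, select(b, y))

# bind_rows: tables are stacked; columns missing from one side become NA
stacked <- bind_rows(a, b)

print(joined)
print(bound)
print(stacked)
```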
Q1: We have two simple tables, shown below:
df1 = data.frame(x=c("a","b"), y=c("a","a"), stringsAsFactors=F); df1
df2 = data.frame(x=c("a","a"), y=c("a","b"), stringsAsFactors=F); df2
Which command would result in the following table?
dplyr::setdiff(df1, df2)
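For comparison, the three dplyr set operators can be run on the same df1 and df2 defined above. This is a small sketch; the `only_in_df1`, `in_both`, and `in_either` names are introduced here for illustration.

```r
library(dplyr)

df1 <- data.frame(x = c("a", "b"), y = c("a", "a"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c("a", "a"), y = c("a", "b"), stringsAsFactors = FALSE)

# Rows of df1 that do not appear in df2
only_in_df1 <- dplyr::setdiff(df1, df2)

# Rows that appear in both tables
in_both <- dplyr::intersect(df1, df2)

# Unique rows that appear in either table
in_either <- dplyr::union(df1, df2)

print(only_in_df1)
print(in_both)
print(in_either)
```

These operators compare whole rows, so both tables need the same column names.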
Q1: Which feature of html documents allows us to extract the table that we are interested in?
Q2: In the video, we use the following code to extract the murders table (tab) from our downloaded html file h:
tab <- h %>% html_nodes("table")
tab <- tab[[2]] %>% html_table
Why did we use the html_nodes() command instead of the html_node() command?
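The difference between the two functions can be seen on a tiny inline page. This is a sketch under assumptions: the HTML string below is made up and merely stands in for the downloaded file h, with the table of interest placed second.

```r
library(rvest)

# Hypothetical page containing two tables; the one we want is the second
page <- read_html("<html><body>
  <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>state</th><th>total</th></tr><tr><td>X</td><td>10</td></tr></table>
</body></html>")

# html_nodes() returns every match, so we can index into the set
all_tables <- html_nodes(page, "table")

# html_node() returns only the first match
first_only <- html_node(page, "table")

# Pick the second table and convert it to a data frame
tab <- html_table(all_tables[[2]])

print(length(all_tables))
print(tab)
```

In rvest 1.0.0 and later these functions are also available under the names `html_elements()` and `html_element()`.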