packages = c(
  "dplyr","ggplot2","stringr", "dslabs", "readr", "tidyr", "purrr",
  "lubridate"
  )
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=4, scipen=12)
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyr)
library(dslabs)

A. Tidy Data Overview

A1 Tidy Data

Q1: A collaborator sends you a file containing data for three years of average race finish times.

A <- read.csv("data/AgeGroup.txt")
incomplete final line found by readTableHeader on 'data/AgeGroup.txt'
A
A %>% setNames(c('age_group',2015:2017))

Are these data considered “tidy” in R? Why or why not?

  • No. These data are not considered “tidy” because the variable “year” is stored in the header.

Q2: Below are four versions of the same dataset.

read.table("data/state1.txt",header=T,sep="")
read.table("data/state2.txt",header=T,sep="")
read.table("data/state3.txt",header=T,sep="")
read.table("data/state4.txt",header=T,sep="")

Which one is in a tidy format?

  • ans: A
  • in tidy format, every observation is a row and every variable is a column.



B. Reshaping Data

B1. Reshaping Data

Q1: Your file called “times.csv” has age groups and average race finish times for three years of marathons. You read in the data file using the following command.

d = read_csv("data/times.csv")
Parsed with column specification:
cols(
  age_group = col_integer(),
  `2015` = col_time(format = ""),
  `2016` = col_time(format = ""),
  `2017` = col_time(format = "")
)
d
#這樣的表格可稱: 寬表格

Which commands will help you “tidy” the data?

d %>% gather(year, time, `2015`:`2017`) %>% data.frame #沒有指定value,就代表key會是value的欄位名稱,value是類別變數的值
# gather將寬表格整理成長表格
# gather(dataset,key = "類別變數,紀錄數值變數的來源" , value = "多個數值變數的累積" , key , value)
#長表格,tidy form,一個row是一次紀錄

Q2: You have a dataset on U.S. contagious diseases, but it is in the following wide format:

D = read.table("data/diseases.txt", header=T, sep="")
D

Which of the following would transform this into a tidy dataset, with each row representing an observation of the incidence of each specific disease (as shown below)?

D %>% gather(disease, count, "Hepatitis_A": "Rubella") %>% head(10)

Q3: You have successfully formatted marathon finish times into a tidy object called D. The first few lines are shown below.

D = read.table("data/times_long.txt", header=T, sep=",")
D

Select the code that converts these data back to the wide format, where each year has a separate column.

D %>% spread(year, time)

Q4: You have a file

D = read.table("data/state2.txt", header=T, sep="")
D

You would like to transform it into a dataset where population and total are each their own column (shown below). Which code would best accomplish this? Select the code that converts these data back to the wide format, where each year has a separate column.

D %>% spread(key=var, value=people)
B3 Separate and Unite

Q1: A collaborator sends you a file containing data for two years of average race finish times.

D = read.csv("data/times2.txt")
D

Which of the answers below best tidys the data?

D %>% gather(key=key, value=value, -age_group) %>%
    separate(col=key, into=c("year", "variable_name"), sep = "_") %>% 
    spread(key=variable_name, value=value) %>% data.frame
attributes are not identical across measure variables;
they will be dropped

Q2: You are in the process of tidying some data on heights, hand length, and wingspan for basketball players in the draft. Currently, you have the following:

stats = read.table("data/player.txt",header=T,sep="")
stats

Select all of the correct commands below that would turn this data into a “tidy” format.

stats %>%
    separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>% 
    spread(key=variable_name, value=value)
stats %>%
    separate(col=key, into=c("player", "variable_name1", "variable_name2"), 
             sep="_", fill="right") %>% 
    unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>% 
    spread(key = variable_name, value = value)



C. Combining Tables

C1 Combining Tables

Q1: You have created a tab1 and tab2 of state population and election data, similar to our module videos:

tab1 = read.table("data/tab1.txt",header=T,sep="",stringsAsFactors=F)
tab1
tab2 = read.table("data/tab2.txt",header=T,sep="",stringsAsFactors=F)
tab2

What are the dimensions of the table dat, created by the following command?

left_join(tab1, tab2, by = "state") %>% dim

Q2: We are still using the tab1 and tab2 tables shown in question 1. What join command would create a new table dat with three rows and two columns?

semi_join(tab1, tab2, by = "state")
C2 Binding

Q1: Which of the following are real differences between the join and bind functions? Please select all correct answers.

  • Binding functions combine by position, while join functions match by variables.
  • Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).
  • Bind functions can combine both vectors and dataframes, while join functions work for only for dataframes.
C3 Set Operators

Q1: We have two simple tables, shown below:

df1 = data.frame(x=c("a","b"), y=c("a","a"), stringsAsFactors=F); df1
df2 = data.frame(x=c("a","a"), y=c("a","b"), stringsAsFactors=F); df2

Which command would result in the following table?

dplyr::setdiff(df1, df2)



D. Web Scraping

D1 Web Scraping

Q1: Which feature of html documents allows us to extract the table that we are interested in?

  • All elements in an html page are specified as “nodes”; we can use the node “tables” to identify and extract the specific table we are interested in before we do additional data cleaning.

Q2: In the video, we use the following code to extract the murders table (tab) from our downloaded html file h:

tab <- h %>% html_nodes(“table”) 
tab <- tab[[2]] %>% html_table

Why did we use the html_nodes() command instead of the html_node command?

  • The html_node command only selects the first node of a specified type. In this example the first “table” node is a legend table and not the actual data we are interested in.
  • We used html_nodes so that we could specify the second “table” element using the tab[[2]] command








---
title: "Wrangling, Tidy Data"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

<br>

```{r}
packages = c(
  "dplyr","ggplot2","stringr", "dslabs", "readr", "tidyr", "purrr",
  "lubridate"
  )
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
```

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
options(digits=4, scipen=12)
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyr)
library(dslabs)
```

- - -

### A. Tidy Data Overview

##### A1 Tidy Data

**Q1:** A collaborator sends you a file containing data for three years of average race finish times. 
```{r}
A <- read.csv("data/AgeGroup.txt")
A
A %>% setNames(c('age_group',2015:2017))
```
_Are these data considered “tidy” in R? Why or why not?_

+ No. These data are not considered “tidy” because the variable “year” is stored in the header.
+

**Q2:** Below are four versions of the same dataset. 
```{r}
read.table("data/state1.txt",header=T,sep="")
```

```{r}
read.table("data/state2.txt",header=T,sep="")
```

```{r eval=F}
read.table("data/state3.txt",header=T,sep="")
```

```{r}
read.table("data/state4.txt",header=T,sep="")
```
_Which one is in a tidy format?_

+ ans: A
+ in tidy format, every observation is a row and every variable is a column. 

<br>

- - -

### B. Reshaping Data

##### B1. Reshaping Data

**Q1:** Your file called “times.csv” has age groups and average race finish times for three years of marathons. You read in the data file using the following command.

```{r}
d = read_csv("data/times.csv")
d
#這樣的表格可稱: 寬表格
```

_Which commands will help you “tidy” the data?_
```{r}
d %>% gather(year, time, `2015`:`2017`) %>% data.frame #沒有指定value，就代表key會是value的欄位名稱，value是類別變數的值
# gather將寬表格整理成長表格
# gather(dataset,key = "類別變數，紀錄數值變數的來源" , value = "多個數值變數的累積" , key , value)

#長表格，tidy form，一個row是一次紀錄
```

**Q2:** You have a dataset on U.S. contagious diseases, but it is in the following wide format:
```{r}
D = read.table("data/diseases.txt", header=T, sep="")
D
```

_Which of the following would transform this into a tidy dataset, with each row representing an observation of the incidence of each specific disease (as shown below)?_
```{r}
D %>% gather(disease, count, "Hepatitis_A": "Rubella") %>% head(10)
```

**Q3:** You have successfully formatted marathon finish times into a tidy object called `D`. The first few lines are shown below.
```{r}
D = read.table("data/times_long.txt", header=T, sep=",")
D
```

_Select the code that converts these data back to the wide format, where each year has a separate column._
```{r}
D %>% spread(year, time)
```

**Q4:** You have a file
```{r}
D = read.table("data/state2.txt", header=T, sep="")
D
```
You would like to transform it into a dataset where population and total are each their own column (shown below). _Which code would best accomplish this? Select the code that converts these data back to the wide format, where each year has a separate column._
```{r}
D %>% spread(key=var, value=people)
```

##### B3 Separate and Unite

**Q1:** A collaborator sends you a file containing data for two years of average race finish times.
```{r}
D = read.csv("data/times2.txt")
D
```

_Which of the answers below best tidys the data?_
```{r}
D %>% gather(key=key, value=value, -age_group) %>%
	separate(col=key, into=c("year", "variable_name"), sep = "_") %>% 
	spread(key=variable_name, value=value) %>% data.frame
```

**Q2:** You are in the process of tidying some data on heights, hand length, and wingspan for basketball players in the draft. Currently, you have the following:
```{r}
stats = read.table("data/player.txt",header=T,sep="")
stats
```

_Select all of the correct commands below that would turn this data into a “tidy” format._

```{r}
stats %>%
	separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>% 
	spread(key=variable_name, value=value)
```

```{r}
stats %>%
	separate(col=key, into=c("player", "variable_name1", "variable_name2"), 
	         sep="_", fill="right") %>% 
	unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>% 
	spread(key = variable_name, value = value)
```

<br>

- - -

### C. Combining Tables

##### C1 Combining Tables

**Q1:** You have created a tab1 and tab2 of state population and election data, similar to our module videos:
```{r}
tab1 = read.table("data/tab1.txt",header=T,sep="",stringsAsFactors=F)
tab1
tab2 = read.table("data/tab2.txt",header=T,sep="",stringsAsFactors=F)
tab2
```

_What are the dimensions of the table dat, created by the following command?_
```{r}
left_join(tab1, tab2, by = "state") %>% dim
```


**Q2:** _We are still using the tab1 and tab2 tables shown in question 1. What join command would create a new table `dat` with three rows and two columns?_
```{r}
semi_join(tab1, tab2, by = "state")
```

##### C2 Binding

**Q1:** _Which of the following are real differences between the join and bind functions? Please select all correct answers._

+ Binding functions combine by position, while join functions match by variables.
+ Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).
+ Bind functions can combine both vectors and dataframes, while join functions work for only for dataframes.
+


##### C3 Set Operators

**Q1:** We have two simple tables, shown below:
```{r}
df1 = data.frame(x=c("a","b"), y=c("a","a"), stringsAsFactors=F); df1
df2 = data.frame(x=c("a","a"), y=c("a","b"), stringsAsFactors=F); df2
```

_Which command would result in the following table?_
```{r}
dplyr::setdiff(df1, df2)
```

<br>

- - -

### D. Web Scraping

##### D1 Web Scraping

**Q1:** _Which feature of html documents allows us to extract the table that we are interested in?_

+ All elements in an html page are specified as “nodes”; we can use the node “tables” to identify and extract the specific table we are interested in before we do additional data cleaning.
+

**Q2:** In the video, we use the following code to extract the murders table (tab) from our downloaded html file `h`:
```{r eval=F}
tab <- h %>% html_nodes(“table”) 
tab <- tab[[2]] %>% html_table
```

_Why did we use the html_nodes() command instead of the html_node command?_

+ The html_node command only selects the first node of a specified type. In this example the first “table” node is a legend table and not the actual data we are interested in.
+ We used html_nodes so that we could specify the second “table” element using the tab[[2]] command
+

<br>

- - -

<br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>
