【目錄】

1. Data Import
2. Tidy Data
3. Web Scraping
4. String Processing


【Data Import】

system.file() #回傳根目錄"C:/PROGRA~1/R/R-35~1.0/library/base"
file.path(路徑,檔名) #會幫你自動選對的slash黏上 
file.copy(被複製的路徑,欲複製到的路徑)


【R的基本function VS readr ?】

  • Base R functions import data as a data frame, while “readr” functions import data as a tibble.
  • 如果read.csv時沒有用StringAsFactor=F,他會將character轉為factor
  • The base R import functions can read .csv files,but cannot files with other delimiters, such as .tsv files, or fixed-width files.
  • The base R import functions can read files with other delimiters(分隔號) like .tsv using read.delim and can read fixed-width files using read.fwf.


【What is Tibble ?】

  • A tibble is also a data.frame.
    • Tibble prints better in console (it fits to screen size) and it also prints data types in header.
    • To put it simple, tibbles are “evolutions” of data.frames, printing and subsetting are less troublesome with tibbles than data.frames.
    • For example, if you want to look 10 first rows of a data.frame, you will use head(data.frame), for tibbles you can simple use print(tibble) or call tibble in R Console.
  • When subsetting, if you use “[” on tibble R will return always another tibble, but when you use “[[” R will return a vector.
  • For data.frames, when using “[” you might get a vector or another data.frame, which is quite unpleasant.
  • Another difference, tibble does not have partial matching.
url = https://~/murders.csv
read_csv(url)                       # 可以直接讀取網址檔
download.file(url, "murders.csv")   # 把網上的檔案載下來並命名為murders.csv
tempdir() 
tempfile()
# tempfile創建一個字符串,而不是文件,這可能是一個唯一的文件名


【Tidy Data】

dat_tidy <- dat_wide %>%
gather(key = disease, value = count, “Hepatitis A”: “Rubella”)
gather(key = key, value = value, -age_group)  # 把除了其中一個以外的全部的變數都拿去gather
D %>% gather(key=key, value=value, -age_group) %>%
  separate(col=key, into=c("year", "variable_name"), "_") %>% 
  spread(key=variable_name, value=value) %>% data.frame
# 默認的separate就是"_"了,所以即使不寫也沒關係
stats %>%
  separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>% 
  spread(key=variable_name, value=value)
# 等同於
stats %>%
  separate(col=key, into=c("player", "variable_name1", "variable_name2"), 
           sep="_", fill="right") %>% 
  unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>% 
  spread(key = variable_name, value = value)
# fill="right"是叫他把name1,name2的欄位黏在player右邊


【Merge data】

  • left_join()
    • all rows in the left-hand table are retained in the final table, columns from both tables will be included in the final table
    • 左邊保持原樣,右邊對照左邊的obs.把右邊的column和值加入(NA來自右邊的)
  • right_join()
    • 與left_join相反
  • inner_join()
    • 選兩邊都有obs.的column,把他們merge起來(取交集)
  • full_join()
    • 不管是否欄列對稱,全部資料保留,缺項塞NA(取聯集)
  • semi_join()
    • 從inner_join的结果中只取左邊的值
  • anti_join()
    • 與semi_join相反,為左邊獨有的obs.


Bind_column(tb1,tb2) == cbind()
Bind_row(tb1,tb2,tb3) == rbind()
  • Binding functions combine by position, while join functions match by variables.

  • Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).

  • Bind functions can combine both vectors and dataframes, while join functions work for only for dataframes.

intersect(c(1:10),c(6:15))
intersect(c("a","b"),c("b","c")) 
setdiff(c(1:10),c(6:15))
[1] 1 2 3 4 5
setequal(c(1:10),c(6:15))
[1] FALSE
setequal(c(1:10),c(1:10))
[1] TRUE
union(c(1:10),c(6:15))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15


【Web Scraping】


library(rvest)
url = https://~
h = read_html(url)
tab = h %>% html_nodes("table")
tab = tab[[2]] # 還是一行一行亂亂的
tab = tab %>% html_table
class(tab)
[1]  "data.frame"

For the guacamole recipe page we already have done this and determined that we need the following selectors:

library(rvest)
h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()

You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:

guacamole <- list(recipe, prep_time, ingredients)

Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts this information:

get_recipe <- function(url){
    h <- read_html(url)
    recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
    prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
    ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
    return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
}

get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")

Below is an example for Scraping rain data from a website:

year <- c(2013:2017)
date <- c(201301:201312,201401:201412,201501:201512,201601:201612,201701:201712)
td <- data.frame('rd'=NA)
url <- paste0('http://e-service.cwb.gov.tw/HistoryDataQuery/YearDataController.do?command=viewMain&station=466910&stname=%25E9%259E%258D%25E9%2583%25A8&datepicker=',year)
for( i in 1:5){
  rd <- url[i] %>% 
    GET() %>% 
    read_html() %>% 
    html_nodes(.,css='td:nth-child(21)') %>% 
    html_text()
  rd <- data.frame('rd'=rd)
  td <- rbind(td,rd)
}
td <- td[-1,]
d1 <- data.frame('date'=date)
d1 <- cbind(d1,td)

d1$td <- gsub("\U00A0", "", d1$td)
View(d1)

write.csv(d1, file="C:/Users/shiki/Downloads/1.xlsx")


【String Processing】

str_detect(tab$population, ",")   # 回傳TRUE/FALSE
str_subset(tab$population, ",")   # 回傳字串 ex. 123,456/1,000
str_extract(tab$population, ",")  # 回傳逗號(提取""裡面的東西)











---
title: "Data Wargling R Notebook"
output: html_notebook
---

### 【目錄】

[1. Data Import](#n1) <br>
[2. Tidy Data](#n2) <br>
[3. Web Scraping](#n3) <br>
[4. String Processing](#n4) <br>

<hr>

### <a id="n1"></a>【Data Import】

```{r}
system.file() #回傳根目錄"C:/PROGRA~1/R/R-35~1.0/library/base"
file.path(路徑,檔名) #會幫你自動選對的slash黏上 
file.copy(被複製的路徑,欲複製到的路徑)
```

+ read_lines() #查看有沒有header
+ The import functions in the readr package all start as read_, while the import functions for base R all start with read.

<br>

#### 【R的基本function VS readr ?】

+ Base R functions import data as a data frame, while "readr" functions import data as a tibble.
+ 如果read.csv時沒有用StringAsFactor=F，他會將character轉為factor
+ The base R import functions can read .csv files,but cannot files with other delimiters, such as .tsv files, or fixed-width files.
+ The base R import functions can read files with other delimiters(分隔號) 
like .tsv using read.delim and can read fixed-width files using read.fwf.

<br>

#### 【What is Tibble ?】

+ A tibble is also a data.frame. 
    + Tibble prints better in console (it fits to screen size) and it also prints data types in header.
    + To put it simple, tibbles are "evolutions" of data.frames, printing and subsetting are less troublesome with tibbles than data.frames.
    + For example, if you want to look 10 first rows of a data.frame, you will use head(data.frame), for tibbles you can simple use print(tibble) or call tibble in R Console.
+ When subsetting, if you use "[" on tibble R will return always another tibble, but when you use "[[" R will return a vector.
+ For data.frames, when using "[" you might get a vector or another data.frame, which is quite unpleasant. 
+ Another difference, tibble does not have partial matching.

```{r}
url = https://~/murders.csv
read_csv(url)                       # 可以直接讀取網址檔
download.file(url, "murders.csv")   # 把網上的檔案載下來並命名為murders.csv
tempdir() 
tempfile()
# tempfile創建一個字符串，而不是文件，這可能是一個唯一的文件名
```

<br>
<hr>

### <a id="n2"></a>【Tidy Data】

```{r}
dat_tidy <- dat_wide %>%
gather(key = disease, value = count, “Hepatitis A”: “Rubella”)
```
  
+ specified that the “key” column will be called “disease”, the value of each entry will be called “count”, and that the columns Hepatitis A through Rubella will all be included in the gather command.
+ speard跟gather相反,speard是把一個欄位裡面n個factor變成n個欄位,並把每row右邊的值輸入n個欄位下面
+ gather是把n個欄位聚集成1個新的欄位,並創第2個新的欄位把n欄下的obs.輸入對應的row

```{r}
gather(key = key, value = value, -age_group)  # 把除了其中一個以外的全部的變數都拿去gather
```


```{r}
D %>% gather(key=key, value=value, -age_group) %>%
  separate(col=key, into=c("year", "variable_name"), "_") %>% 
  spread(key=variable_name, value=value) %>% data.frame
# 默認的separate就是"_"了,所以即使不寫也沒關係
```


```{r}
stats %>%
  separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>% 
  spread(key=variable_name, value=value)
# 等同於
stats %>%
  separate(col=key, into=c("player", "variable_name1", "variable_name2"), 
           sep="_", fill="right") %>% 
  unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>% 
  spread(key = variable_name, value = value)
# fill="right"是叫他把name1,name2的欄位黏在player右邊
```

<br>

#### 【Merge data】

+ left_join()
    + all rows in the left-hand table are retained in the final table, columns from both tables will be included in the final table
    + 左邊保持原樣,右邊對照左邊的obs.把右邊的column和值加入(NA來自右邊的)
+ right_join()
    + 與left_join相反
+ inner_join()
    + 選兩邊都有obs.的column,把他們merge起來(取交集)
+ full_join()
    + 不管是否欄列對稱,全部資料保留,缺項塞NA(取聯集)
+ semi_join()
    + 從inner_join的结果中只取左邊的值
+ anti_join()
    + 與semi_join相反,為左邊獨有的obs.

<br>

```{r}
Bind_column(tb1,tb2) == cbind()
Bind_row(tb1,tb2,tb3) == rbind()
```

+ Binding functions combine by position, while join functions match by variables.

+ Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).

+ Bind functions can combine both vectors and dataframes, while join functions work for only for dataframes.


```{r}
intersect(c(1:10),c(6:15))
intersect(c("a","b"),c("b","c")) 
```
 
```{r}
setdiff(c(1:10),c(6:15))
```

```{r}
setequal(c(1:10),c(6:15))
setequal(c(1:10),c(1:10))
```

```{r}
union(c(1:10),c(6:15))
```

<br>
<hr>

### <a id="n3"></a>【Web Scraping】

<br>

+ The html_node command only selects the first node of a specified type.
+ html_nodes so that we could specify the elements we want. 
    + For example, the second “table” element using the tab[[2]] command.
+ selector garget 教學:https://selectorgadget.com/

```{r}
library(rvest)
url = https://~
h = read_html(url)
tab = h %>% html_nodes("table")
tab = tab[[2]] # 還是一行一行亂亂的
tab = tab %>% html_table
class(tab)
[1]  "data.frame"
```
  
  
For the guacamole recipe page we already have done this and determined that we need the following selectors:
  
```{r}
library(rvest)
h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
```

You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:
  
```{r}
guacamole <- list(recipe, prep_time, ingredients)
```

Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts this information:
  
```{r}
get_recipe <- function(url){
    h <- read_html(url)
    recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
    prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
    ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
    return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
}

get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")
```


Below is an example for Scraping rain data from a website:
  
```{r}
year <- c(2013:2017)
date <- c(201301:201312,201401:201412,201501:201512,201601:201612,201701:201712)
td <- data.frame('rd'=NA)
url <- paste0('http://e-service.cwb.gov.tw/HistoryDataQuery/YearDataController.do?command=viewMain&station=466910&stname=%25E9%259E%258D%25E9%2583%25A8&datepicker=',year)
for( i in 1:5){
  rd <- url[i] %>% 
    GET() %>% 
    read_html() %>% 
    html_nodes(.,css='td:nth-child(21)') %>% 
    html_text()
  rd <- data.frame('rd'=rd)
  td <- rbind(td,rd)
}
td <- td[-1,]
d1 <- data.frame('date'=date)
d1 <- cbind(d1,td)

d1$td <- gsub("\U00A0", "", d1$td)
View(d1)

write.csv(d1, file="C:/Users/shiki/Downloads/1.xlsx")
```


<br><hr>

### <a id="n4"></a>【String Processing】
  
```{r}
str_detect(tab$population, ",")   # 回傳TRUE/FALSE
str_subset(tab$population, ",")   # 回傳字串 ex. 123,456/1,000
str_extract(tab$population, ",")  # 回傳逗號(提取""裡面的東西)
```

<br><br><hr><br><br><br>
  

<style>
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

title{
    font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
    font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}
</style><br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

