【Data Import】
system.file() #回傳根目錄"C:/PROGRA~1/R/R-35~1.0/library/base"
file.path(路徑,檔名) #會幫你自動選對的slash黏上
file.copy(被複製的路徑,欲複製到的路徑)
- read_lines() #查看有沒有header
- The import functions in the readr package all start as read_, while the import functions for base R all start with read.
【R的基本function VS readr ?】
- Base R functions import data as a data frame, while “readr” functions import data as a tibble.
- 如果read.csv時沒有用StringAsFactor=F,他會將character轉為factor
- The base R import functions can read .csv files,but cannot files with other delimiters, such as .tsv files, or fixed-width files.
- The base R import functions can read files with other delimiters(分隔號) like .tsv using read.delim and can read fixed-width files using read.fwf.
【What is Tibble ?】
- A tibble is also a data.frame.
- Tibble prints better in console (it fits to screen size) and it also prints data types in header.
- To put it simple, tibbles are “evolutions” of data.frames, printing and subsetting are less troublesome with tibbles than data.frames.
- For example, if you want to look 10 first rows of a data.frame, you will use head(data.frame), for tibbles you can simple use print(tibble) or call tibble in R Console.
- When subsetting, if you use “[” on tibble R will return always another tibble, but when you use “[[” R will return a vector.
- For data.frames, when using “[” you might get a vector or another data.frame, which is quite unpleasant.
- Another difference, tibble does not have partial matching.
url = https://~/murders.csv
read_csv(url) # 可以直接讀取網址檔
download.file(url, "murders.csv") # 把網上的檔案載下來並命名為murders.csv
tempdir()
tempfile()
# tempfile創建一個字符串,而不是文件,這可能是一個唯一的文件名
【Tidy Data】
dat_tidy <- dat_wide %>%
gather(key = disease, value = count, “Hepatitis A”: “Rubella”)
- specified that the “key” column will be called “disease”, the value of each entry will be called “count”, and that the columns Hepatitis A through Rubella will all be included in the gather command.
- speard跟gather相反,speard是把一個欄位裡面n個factor變成n個欄位,並把每row右邊的值輸入n個欄位下面
- gather是把n個欄位聚集成1個新的欄位,並創第2個新的欄位把n欄下的obs.輸入對應的row
gather(key = key, value = value, -age_group) # 把除了其中一個以外的全部的變數都拿去gather
D %>% gather(key=key, value=value, -age_group) %>%
separate(col=key, into=c("year", "variable_name"), "_") %>%
spread(key=variable_name, value=value) %>% data.frame
# 默認的separate就是"_"了,所以即使不寫也沒關係
stats %>%
separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>%
spread(key=variable_name, value=value)
# 等同於
stats %>%
separate(col=key, into=c("player", "variable_name1", "variable_name2"),
sep="_", fill="right") %>%
unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>%
spread(key = variable_name, value = value)
# fill="right"是叫他把name1,name2的欄位黏在player右邊
【Merge data】
- left_join()
- all rows in the left-hand table are retained in the final table, columns from both tables will be included in the final table
- 左邊保持原樣,右邊對照左邊的obs.把右邊的column和值加入(NA來自右邊的)
- right_join()
- inner_join()
- 選兩邊都有obs.的column,把他們merge起來(取交集)
- full_join()
- 不管是否欄列對稱,全部資料保留,缺項塞NA(取聯集)
- semi_join()
- anti_join()
Bind_column(tb1,tb2) == cbind()
Bind_row(tb1,tb2,tb3) == rbind()
Binding functions combine by position, while join functions match by variables.
Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).
Bind functions can combine both vectors and dataframes, while join functions work for only for dataframes.
intersect(c(1:10),c(6:15))
intersect(c("a","b"),c("b","c"))
setdiff(c(1:10),c(6:15))
[1] 1 2 3 4 5
setequal(c(1:10),c(6:15))
[1] FALSE
setequal(c(1:10),c(1:10))
[1] TRUE
union(c(1:10),c(6:15))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
【Web Scraping】
- The html_node command only selects the first node of a specified type.
- html_nodes so that we could specify the elements we want.
- For example, the second “table” element using the tab[[2]] command.
- selector garget 教學:https://selectorgadget.com/
library(rvest)
url = https://~
h = read_html(url)
tab = h %>% html_nodes("table")
tab = tab[[2]] # 還是一行一行亂亂的
tab = tab %>% html_table
class(tab)
[1] "data.frame"
For the guacamole recipe page we already have done this and determined that we need the following selectors:
library(rvest)
h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:
guacamole <- list(recipe, prep_time, ingredients)
Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts this information:
get_recipe <- function(url){
h <- read_html(url)
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
}
get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")
Below is an example for Scraping rain data from a website:
year <- c(2013:2017)
date <- c(201301:201312,201401:201412,201501:201512,201601:201612,201701:201712)
td <- data.frame('rd'=NA)
url <- paste0('http://e-service.cwb.gov.tw/HistoryDataQuery/YearDataController.do?command=viewMain&station=466910&stname=%25E9%259E%258D%25E9%2583%25A8&datepicker=',year)
for( i in 1:5){
rd <- url[i] %>%
GET() %>%
read_html() %>%
html_nodes(.,css='td:nth-child(21)') %>%
html_text()
rd <- data.frame('rd'=rd)
td <- rbind(td,rd)
}
td <- td[-1,]
d1 <- data.frame('date'=date)
d1 <- cbind(d1,td)
d1$td <- gsub("\U00A0", "", d1$td)
View(d1)
write.csv(d1, file="C:/Users/shiki/Downloads/1.xlsx")
【String Processing】
str_detect(tab$population, ",") # 回傳TRUE/FALSE
str_subset(tab$population, ",") # 回傳字串 ex. 123,456/1,000
str_extract(tab$population, ",") # 回傳逗號(提取""裡面的東西)
---
title: "Data Wargling R Notebook"
output: html_notebook
---

### 【目錄】

[1. Data Import](#n1) <br>
[2. Tidy Data](#n2) <br>
[3. Web Scraping](#n3) <br>
[4. String Processing](#n4) <br>

<hr>

### <a id="n1"></a>【Data Import】

```{r}
system.file() #回傳根目錄"C:/PROGRA~1/R/R-35~1.0/library/base"
file.path(路徑,檔名) #會幫你自動選對的slash黏上 
file.copy(被複製的路徑,欲複製到的路徑)
```

+ read_lines() #查看有沒有header
+ The import functions in the readr package all start as read_, while the import functions for base R all start with read.

<br>

#### 【R的基本function VS readr ?】

+ Base R functions import data as a data frame, while "readr" functions import data as a tibble.
+ 如果read.csv時沒有用StringAsFactor=F，他會將character轉為factor
+ The base R import functions can read .csv files,but cannot files with other delimiters, such as .tsv files, or fixed-width files.
+ The base R import functions can read files with other delimiters(分隔號) 
like .tsv using read.delim and can read fixed-width files using read.fwf.

<br>

#### 【What is Tibble ?】

+ A tibble is also a data.frame. 
    + Tibble prints better in console (it fits to screen size) and it also prints data types in header.
    + To put it simple, tibbles are "evolutions" of data.frames, printing and subsetting are less troublesome with tibbles than data.frames.
    + For example, if you want to look 10 first rows of a data.frame, you will use head(data.frame), for tibbles you can simple use print(tibble) or call tibble in R Console.
+ When subsetting, if you use "[" on tibble R will return always another tibble, but when you use "[[" R will return a vector.
+ For data.frames, when using "[" you might get a vector or another data.frame, which is quite unpleasant. 
+ Another difference, tibble does not have partial matching.

```{r}
url = https://~/murders.csv
read_csv(url)                       # 可以直接讀取網址檔
download.file(url, "murders.csv")   # 把網上的檔案載下來並命名為murders.csv
tempdir() 
tempfile()
# tempfile創建一個字符串，而不是文件，這可能是一個唯一的文件名
```

<br>
<hr>

### <a id="n2"></a>【Tidy Data】

```{r}
dat_tidy <- dat_wide %>%
gather(key = disease, value = count, “Hepatitis A”: “Rubella”)
```
  
+ specified that the “key” column will be called “disease”, the value of each entry will be called “count”, and that the columns Hepatitis A through Rubella will all be included in the gather command.
+ speard跟gather相反,speard是把一個欄位裡面n個factor變成n個欄位,並把每row右邊的值輸入n個欄位下面
+ gather是把n個欄位聚集成1個新的欄位,並創第2個新的欄位把n欄下的obs.輸入對應的row

```{r}
gather(key = key, value = value, -age_group)  # 把除了其中一個以外的全部的變數都拿去gather
```


```{r}
D %>% gather(key=key, value=value, -age_group) %>%
  separate(col=key, into=c("year", "variable_name"), "_") %>% 
  spread(key=variable_name, value=value) %>% data.frame
# 默認的separate就是"_"了,所以即使不寫也沒關係
```


```{r}
stats %>%
  separate(col=key, into=c("player", "variable_name"), sep="_", extra="merge") %>% 
  spread(key=variable_name, value=value)
# 等同於
stats %>%
  separate(col=key, into=c("player", "variable_name1", "variable_name2"), 
           sep="_", fill="right") %>% 
  unite(col = variable_name, variable_name1, variable_name2, sep = "_") %>% 
  spread(key = variable_name, value = value)
# fill="right"是叫他把name1,name2的欄位黏在player右邊
```

<br>

#### 【Merge data】

+ left_join()
    + all rows in the left-hand table are retained in the final table, columns from both tables will be included in the final table
    + 左邊保持原樣,右邊對照左邊的obs.把右邊的column和值加入(NA來自右邊的)
+ right_join()
    + 與left_join相反
+ inner_join()
    + 選兩邊都有obs.的column,把他們merge起來(取交集)
+ full_join()
    + 不管是否欄列對稱,全部資料保留,缺項塞NA(取聯集)
+ semi_join()
    + 從inner_join的结果中只取左邊的值
+ anti_join()
    + 與semi_join相反,為左邊獨有的obs.

<br>

```{r}
Bind_column(tb1,tb2) == cbind()
Bind_row(tb1,tb2,tb3) == rbind()
```

+ Binding functions combine by position, while join functions match by variables.

+ Joining functions can join datasets of different dimensions, but the bind functions must match on the appropriate dimension (either same row or column numbers).

+ Bind functions can combine both vectors and dataframes, while join functions work for only for dataframes.


```{r}
intersect(c(1:10),c(6:15))
intersect(c("a","b"),c("b","c")) 
```
 
```{r}
setdiff(c(1:10),c(6:15))
```

```{r}
setequal(c(1:10),c(6:15))
setequal(c(1:10),c(1:10))
```

```{r}
union(c(1:10),c(6:15))
```

<br>
<hr>

### <a id="n3"></a>【Web Scraping】

<br>

+ The html_node command only selects the first node of a specified type.
+ html_nodes so that we could specify the elements we want. 
    + For example, the second “table” element using the tab[[2]] command.
+ selector garget 教學:https://selectorgadget.com/

```{r}
library(rvest)
url = https://~
h = read_html(url)
tab = h %>% html_nodes("table")
tab = tab[[2]] # 還是一行一行亂亂的
tab = tab %>% html_table
class(tab)
[1]  "data.frame"
```
  
  
For the guacamole recipe page we already have done this and determined that we need the following selectors:
  
```{r}
library(rvest)
h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
```

You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:
  
```{r}
guacamole <- list(recipe, prep_time, ingredients)
```

Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts this information:
  
```{r}
get_recipe <- function(url){
    h <- read_html(url)
    recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
    prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
    ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
    return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
}

get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")
```


Below is an example for Scraping rain data from a website:
  
```{r}
year <- c(2013:2017)
date <- c(201301:201312,201401:201412,201501:201512,201601:201612,201701:201712)
td <- data.frame('rd'=NA)
url <- paste0('http://e-service.cwb.gov.tw/HistoryDataQuery/YearDataController.do?command=viewMain&station=466910&stname=%25E9%259E%258D%25E9%2583%25A8&datepicker=',year)
for( i in 1:5){
  rd <- url[i] %>% 
    GET() %>% 
    read_html() %>% 
    html_nodes(.,css='td:nth-child(21)') %>% 
    html_text()
  rd <- data.frame('rd'=rd)
  td <- rbind(td,rd)
}
td <- td[-1,]
d1 <- data.frame('date'=date)
d1 <- cbind(d1,td)

d1$td <- gsub("\U00A0", "", d1$td)
View(d1)

write.csv(d1, file="C:/Users/shiki/Downloads/1.xlsx")
```


<br><hr>

### <a id="n4"></a>【String Processing】
  
```{r}
str_detect(tab$population, ",")   # 回傳TRUE/FALSE
str_subset(tab$population, ",")   # 回傳字串 ex. 123,456/1,000
str_extract(tab$population, ",")  # 回傳逗號(提取""裡面的東西)
```

<br><br><hr><br><br><br>
  

<style>
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

title{
    font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
    font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}
</style><br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

