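What Is Data Science? Installing and Introducing the tidyverse Package

Definition: data science is a set of scientific methods for discovering patterns and regularities in data and turning them into insight that creates commercial value for industry.

This chapter focuses on two "super" R packages that are widely used in data science, tidyverse and caret, with an emphasis on how data science is practised in industry.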

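To make data science projects easier to carry out, Hadley Wickham, Chief Scientist at RStudio, released the tidyverse, a "super" package that bundles a set of R packages sharing a consistent design. He received the 2019 COPSS Presidents' Award, given by the Committee of Presidents of Statistical Societies, in part for this work.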

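Downloading and installing the tidyverse package

# install.packages("tidyverse")  # run once to download and install the package
library(tidyverse)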

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

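The output shows that loading tidyverse automatically attaches eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats (the most important of these are introduced below). It also shows that filter() and lag() from dplyr mask the functions of the same name in the stats package, which ships with base R. To call the original versions, use their fully qualified names, stats::filter() or stats::lag().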
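A minimal sketch of the masking behaviour, assuming dplyr is installed. The vector `x` and the data frame `df` are made up for illustration and do not come from the original text.

```r
library(dplyr)

x <- 1:5
df <- data.frame(x = x)

# After library(dplyr), a bare filter() call resolves to dplyr::filter(),
# which keeps the rows of a data frame that satisfy a condition.
kept <- filter(df, x > 3)

# The masked function is still reachable through its namespace.
# stats::filter() applies a linear filter, here a 3-point moving average.
smoothed <- stats::filter(x, rep(1 / 3, 3))

kept
smoothed
```

Writing the namespace explicitly (`dplyr::filter()` or `stats::filter()`) also makes the intent clear to a reader who does not know which packages are attached.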

The pipe operator %>%

When function calls are nested many levels deep, the R code becomes hard to read.

x<- 1:10
y<-round(sum(sin(sqrt(log(x)))),2)
y

## [1] 8.43

One way to simplify this is to introduce intermediate variables:

x<-1:10
z<-sqrt(log(x))
w<-sum(sin(z))
y<-round(w,2)
y

## [1] 8.43

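The intermediate objects z and w in the code above may have no meaning of their own, and they cost two extra lines of code.

Another solution is to use the pipe operator %>% to streamline the code: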

library(tidyverse)
x<-1:10
y<-x %>%log %>% sqrt %>% sin %>% sum %>% round(2)
y

## [1] 8.43

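The pipe operator %>% can be read as "then": x %>% log passes x as the argument to the function log(). Its main advantage is that a series of pipes can be chained into one long pipeline, as in the example above.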

Keyboard shortcuts for the pipe operator

In RStudio, the shortcut for inserting %>% depends on the operating system:

1. Windows: Ctrl+Shift+M

2. Mac: Command+Shift+M

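Basic usage of the pipe operator

The pipe operator has five basic forms:

1. x %>% f is equivalent to f(x);

2. x %>% f(y) is equivalent to f(x, y);

3. x %>% f %>% g is equivalent to g(f(x));

4. x %>% f(y, .) is equivalent to f(y, x);

5. x %>% f(y, z = .) is equivalent to f(y, z = x).

A pipeline should generally not be too long. Beyond about 10 steps, consider splitting it into several commands, which makes errors easier to locate.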
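The five forms can be checked directly. The sketch below loads magrittr, which provides %>% and is also attached by library(tidyverse). The functions and values are arbitrary choices for illustration.

```r
library(magrittr)  # provides %>%

x <- c(2, 8, 18)

checks <- c(
  # 1. x %>% f is f(x)
  form1 = identical(x %>% sqrt, sqrt(x)),
  # 2. x %>% f(y) is f(x, y)
  form2 = identical(x %>% round(1), round(x, 1)),
  # 3. x %>% f %>% g is g(f(x))
  form3 = identical(x %>% sqrt %>% sum, sum(sqrt(x))),
  # 4. x %>% f(y, .) is f(y, x): the dot marks where x goes
  form4 = identical(3 %>% rep("a", .), rep("a", 3)),
  # 5. x %>% f(y, z = .) is f(y, z = x): the dot fills a named argument
  form5 = identical("-" %>% paste("a", "b", sep = .), paste("a", "b", sep = "-"))
)

checks
```

Every element of `checks` should be TRUE. In forms 4 and 5 the dot appears as a direct argument, so the left-hand side is not also inserted as the first argument.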

Reading data

getwd()

## [1] "/Users/huanghuilin/Desktop/360安全云盘同步版/2022/2022年数据挖掘课程/上机/第二次实验课：tidyverse包的使用"

bank <- read.csv("bank-additional-full.csv",header = T,sep=";")
dim(bank)

## [1] 41188    21

# Time how long this command takes to run
t1<-system.time(read.csv("bank-additional-full.csv",header = T,sep=";"))

To read the data faster, we can use read_delim() from the readr package:

getwd()

## [1] "/Users/huanghuilin/Desktop/360安全云盘同步版/2022/2022年数据挖掘课程/上机/第二次实验课：tidyverse包的使用"

library(readr)

bank <- read_delim("bank-additional-full.csv", col_names = T,delim =";")

# Time how long this command takes to run
t2<-system.time(read_delim("bank-additional-full.csv", col_names = T,delim =";")) 

bank

## # A tibble: 41,188 × 21
##      age job   marital education default housing loan  contact month day_of_week
##    <dbl> <chr> <chr>   <chr>     <chr>   <chr>   <chr> <chr>   <chr> <chr>      
##  1    56 hous… married basic.4y  no      no      no    teleph… may   mon        
##  2    57 serv… married high.sch… unknown no      no    teleph… may   mon        
##  3    37 serv… married high.sch… no      yes     no    teleph… may   mon        
##  4    40 admi… married basic.6y  no      no      no    teleph… may   mon        
##  5    56 serv… married high.sch… no      no      yes   teleph… may   mon        
##  6    45 serv… married basic.9y  unknown no      no    teleph… may   mon        
##  7    59 admi… married professi… no      no      no    teleph… may   mon        
##  8    41 blue… married unknown   unknown no      no    teleph… may   mon        
##  9    24 tech… single  professi… no      yes     no    teleph… may   mon        
## 10    25 serv… single  high.sch… no      yes     no    teleph… may   mon        
## # … with 41,178 more rows, and 11 more variables: duration <dbl>,
## #   campaign <dbl>, pdays <dbl>, previous <dbl>, poutcome <chr>,
## #   emp.var.rate <dbl>, cons.price.idx <dbl>, cons.conf.idx <dbl>,
## #   euribor3m <dbl>, nr.employed <dbl>, y <chr>

write_csv(bank,"bank.csv")  # save the data with write_csv() from the readr package

When the data set is very large, fread() from the data.table package is recommended for a further speed-up.

library(data.table)
bank<-fread("bank-additional-full.csv")
t3<-system.time(fread("bank-additional-full.csv"))
rbind(t1,t2,t3)[,1:3]

##    user.self sys.self elapsed
## t1     0.351    0.005   0.357
## t2     0.176    0.011   0.097
## t3     0.081    0.002   0.083

Summary: judging by the elapsed time of this single run, fread() is the fastest at reading the same data, read_delim() comes next, and read.csv() is the slowest.
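A single call to system.time() can be noisy, so it may help to repeat each measurement and compare medians. The helper `time_reader()` below is hypothetical (it is not part of any package), and the temporary CSV file is generated only so that the sketch runs on its own.

```r
# Hypothetical helper: run reader(path) several times and
# return the median elapsed time in seconds.
time_reader <- function(reader, path, times = 5) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(reader(path))[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Small made-up data set written to a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:1000, b = rnorm(1000)), tmp, row.names = FALSE)

t_base <- time_reader(read.csv, tmp)
t_base
```

The same helper can be passed readr::read_csv or data.table::fread when those packages are installed, for example `time_reader(data.table::fread, tmp)`.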

Data cleaning

The tidyr package, part of the tidyverse, makes cleaning data easier.

Combining data

library(tidyverse)
table1   # table1 is panel data and is already tidy

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

# table4a and table4b hold the data of table1 before tidying
table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

table4b

## # A tibble: 3 × 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

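table4a holds the cases information from table1, and table4b holds the population information. In both tables, the two values of the variable year are used as the column names '1999' and '2000'.

First, use gather() from tidyr to gather the two columns of table4a into a single variable, cases.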

library(tidyverse)
tidy4a<- table4a %>% 
         gather('1999','2000',key = 'year',value ='cases')

tidy4a

## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

The result has two new columns, year and cases. The same operation can be applied to table4b.

library(tidyverse)
tidy4b<- table4b %>% 
         gather('1999','2000',key = 'year',value ='population')

tidy4b

## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

The result has two new columns, year and population. Now use left_join() from dplyr to merge tidy4a and tidy4b side by side, which gives the complete tidy data set.

library(dplyr)
left_join(tidy4a,tidy4b)

## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583
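In tidyr 1.0.0 and later, gather() still works but has been superseded by pivot_longer(). The sketch below repeats the same tidying with pivot_longer(). The object names `long4a`, `long4b`, and `tidy4` are chosen here for illustration.

```r
library(tidyr)
library(dplyr)

# Move the year columns into rows; names_to / values_to play the
# roles of key / value in gather().
long4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

long4b <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Naming the key columns explicitly avoids relying on the default
# choice of join variables.
tidy4 <- left_join(long4a, long4b, by = c("country", "year"))
tidy4
```

The rows come out ordered by country and then year, which differs from the gather() output, but the content is the same.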

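Cleaning data where one observation spans several rows or one column holds several variables

Case 1: one observation spans several rows. Use spread() to spread the data out so that each observation occupies exactly one row.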

library(tidyverse)
table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table2 %>% 
      spread(key=type,value=count)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Notes: 1. gather() makes a wide table narrower and longer.

2. spread() makes a long table shorter and wider.
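As with gather(), spread() has been superseded in tidyr 1.0.0 and later, here by pivot_wider(). A minimal sketch of the equivalent call on table2, with the result stored in `wide2` (a name chosen for illustration):

```r
library(tidyr)
library(dplyr)

# names_from / values_from play the roles of key / value in spread().
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

wide2
```

The result has one row per country and year, with cases and population as separate columns.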

Case 2: several variables share one column. Use separate() to split the information in the variable rate.

library(tidyverse)
table3

## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

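Here the variable rate actually contains two variables, cases and population. The argument convert = TRUE converts the resulting cases and population columns to a more suitable type, namely <int> (integer); otherwise they inherit the <chr> type of the original rate column.

By default, separate() splits at any character that is neither a letter nor a digit. You can also pass sep = "/" to split explicitly at "/".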
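A short sketch of the explicit-separator form on table3. The result name `split3` is chosen for illustration.

```r
library(tidyr)
library(dplyr)

# sep is interpreted as a regular expression; "/" matches a literal slash.
split3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

split3
```

Stating the separator explicitly guards against values that contain other non-alphanumeric characters, such as a decimal point, which the default rule would also split on.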

You can also give the exact position of the split. For example, to split the variable year after its second character into century and year:

table5 <- table3 %>% 
    separate(year, into = c("century", "year"), sep = 2)
table5

## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

The inverse of separate() is unite(). For example, to merge century and year in table5 back into one variable:

table5 %>% unite(new, century, year,sep="")

## # A tibble: 6 × 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

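Here sep = "" means that no character is inserted between the merged values; the default separator is an underscore, "_".

Data transformation

The dplyr package, part of the tidyverse, is designed for data transformation: selecting observations, sorting, selecting variables, creating new variables, joining data sets, and so on. We demonstrate it with the flights data from the nycflights13 package, which covers all 336,776 flights that departed from the three New York City airports in 2013 and has 19 variables.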

library(tidyverse)
# install.packages("nycflights13")
library(nycflights13)
flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

names(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

getwd()

## [1] "/Users/huanghuilin/Desktop/360安全云盘同步版/2022/2022年数据挖掘课程/上机/第二次实验课：tidyverse包的使用"

names<-read.csv("names of flights.csv")
knitr::kable(names)

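names	meaning
dep_time	departure time
sched_dep_time	scheduled departure time
dep_delay	departure delay
arr_time	arrival time
sched_arr_time	scheduled arrival time
arr_delay	arrival delay
carrier	airline
flight	flight number
tailnum	tail number
origin	origin
dest	destination
air_time	time in the air
distance	distance
hour	hour of scheduled departure
minute	minute of scheduled departure
time_hour	scheduled departure date and hour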
Selecting observations with filter()

Suppose we are only interested in flights on January 1. We can keep just those rows with filter() from dplyr.

jan1 <-flights %>% filter( month == 1, day == 1)
jan1

## # A tibble: 842 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

To pick all flights that departed in November or December:

flights %>% filter( month == 11 | month == 12)

## # A tibble: 55,403 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     1        5           2359         6      352            345
##  2  2013    11     1       35           2250       105      123           2356
##  3  2013    11     1      455            500        -5      641            651
##  4  2013    11     1      539            545        -6      856            827
##  5  2013    11     1      542            545        -3      831            855
##  6  2013    11     1      549            600       -11      912            923
##  7  2013    11     1      550            600       -10      705            659
##  8  2013    11     1      554            600        -6      659            701
##  9  2013    11     1      554            600        -6      826            827
## 10  2013    11     1      554            600        -6      749            751
## # … with 55,393 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# An equivalent way to write it:
flights %>% filter( month %in% c(11, 12))

## # A tibble: 55,403 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     1        5           2359         6      352            345
##  2  2013    11     1       35           2250       105      123           2356
##  3  2013    11     1      455            500        -5      641            651
##  4  2013    11     1      539            545        -6      856            827
##  5  2013    11     1      542            545        -3      831            855
##  6  2013    11     1      549            600       -11      912            923
##  7  2013    11     1      550            600       -10      705            659
##  8  2013    11     1      554            600        -6      659            701
##  9  2013    11     1      554            600        -6      826            827
## 10  2013    11     1      554            600        -6      749            751
## # … with 55,393 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# To pick the data for the first half of the year:
flights %>% filter(month %in% 1:6)

## # A tibble: 166,158 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 166,148 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
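The equivalence of the `|` form and the `%in%` form can be checked on a small example, along with a common mistake. The tibble `toy` below is made up for illustration.

```r
library(dplyr)

toy <- tibble(
  month  = c(1, 6, 11, 12),
  flight = c("A", "B", "C", "D")
)

by_or <- toy %>% filter(month == 11 | month == 12)
by_in <- toy %>% filter(month %in% c(11, 12))

# Both forms keep the same two rows.
same_rows <- isTRUE(all.equal(by_or, by_in))

# Pitfall: `month == 11 | 12` is parsed as (month == 11) | 12.
# The number 12 is coerced to TRUE, so every row is kept.
wrong <- toy %>% filter(month == 11 | 12)

same_rows
nrow(wrong)
```

The `%in%` form scales better when the set of allowed values grows, as in `month %in% 1:6`.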

Sorting observations with arrange()

To sort observations, use arrange() from dplyr. First, sort the flights data in ascending order by year, month, and day:

flights %>% arrange(year,month,day)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Next, to sort in ascending order of dep_delay (departure delay), enter:

flights %>% arrange(dep_delay)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

To sort in descending order of dep_delay (departure delay), enter:

flights %>% arrange(desc(dep_delay))

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The output shows that the largest value of dep_delay is 1301, meaning that the flight departed 1301 minutes (21 hours 41 minutes) late.
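Two details are worth checking by hand: the conversion of 1301 minutes into hours and minutes, and where arrange() puts missing values. The tibble `toy` is made up for illustration.

```r
library(dplyr)

# Integer division and remainder convert minutes to hours and minutes.
delay_hm <- c(hours = 1301 %/% 60, minutes = 1301 %% 60)
delay_hm

toy <- tibble(
  flight    = c("A", "B", "C", "D"),
  dep_delay = c(5, NA, 1301, -43)
)

# arrange() places NA last in ascending order ...
asc <- toy %>% arrange(dep_delay)

# ... and also last when the order is reversed with desc().
dsc <- toy %>% arrange(desc(dep_delay))

asc
dsc
```

Because missing values always sort to the end, the first rows of a descending sort are genuine maxima rather than NA entries.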

Selecting variables with select()

To keep only some of the variables, use select() from dplyr. First, select only the three variables year, month, and day from the flights data set:

flights %>% select(year,month,day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

# The command above is equivalent to:
flights %>% select(year:day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

To select every variable except year, month, and day, enter:

flights %>% select(-(year:day))

## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

select() can also reorder the variables in a data frame. To move time_hour and air_time in front of all other variables, enter:

select (flights,time_hour,air_time,everything())

## # A tibble: 336,776 × 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # … with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

The helper everything() stands for all remaining variables.
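In dplyr 1.0.0 and later, relocate() offers a more direct way to move columns to the front. The sketch below compares it with the select() plus everything() idiom and shows two other selection helpers. The one-row tibble `toy` is made up for illustration.

```r
library(dplyr)

toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11, air_time = 227
)

# Two ways to move air_time to the front.
via_select   <- toy %>% select(air_time, everything())
via_relocate <- toy %>% relocate(air_time)

# Selection helpers match columns by name pattern.
delays  <- toy %>% select(ends_with("delay"))   # dep_delay, arr_delay
d_start <- toy %>% select(starts_with("d"))     # day, dep_delay

names(via_select)
names(via_relocate)
```

relocate() keeps every column by construction, so it cannot drop a variable by accident the way a select() call without everything() would.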

Renaming variables with rename()

To change a variable name, use rename() from dplyr. Suppose we want to rename the variable sched_dep_time to s_dep_time:

flights %>% rename(s_dep_time=sched_dep_time)

## # A tibble: 336,776 × 19
##     year month   day dep_time s_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>      <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517        515         2      830            819
##  2  2013     1     1      533        529         4      850            830
##  3  2013     1     1      542        540         2      923            850
##  4  2013     1     1      544        545        -1     1004           1022
##  5  2013     1     1      554        600        -6      812            837
##  6  2013     1     1      554        558        -4      740            728
##  7  2013     1     1      555        600        -5      913            854
##  8  2013     1     1      557        600        -3      709            723
##  9  2013     1     1      557        600        -3      838            846
## 10  2013     1     1      558        600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

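Creating new variables with mutate()

To create new variables from existing ones, use mutate() from dplyr; the new variables are placed last. To keep the demonstration compact, we first use select() to pick a few variables and call the resulting data set f1: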

f1<- flights %>% select(ends_with("delay"),distance,air_time)
f1

## # A tibble: 336,776 × 4
##    dep_delay arr_delay distance air_time
##        <dbl>     <dbl>    <dbl>    <dbl>
##  1         2        11     1400      227
##  2         4        20     1416      227
##  3         2        33     1089      160
##  4        -1       -18     1576      183
##  5        -6       -25      762      116
##  6        -4        12      719      150
##  7        -5        19     1065      158
##  8        -3       -14      229       53
##  9        -3        -8      944      140
## 10        -2         8      733      138
## # … with 336,766 more rows

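Here ends_with("delay") selects the variables whose names end in "delay". Now we define two new variables, gain and speed, for the time made up in the air and the flying speed respectively: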

f1 %>% mutate(gain=dep_delay-arr_delay, speed=distance/air_time*60)

## # A tibble: 336,776 × 6
##    dep_delay arr_delay distance air_time  gain speed
##        <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1         2        11     1400      227    -9  370.
##  2         4        20     1416      227   -16  374.
##  3         2        33     1089      160   -31  408.
##  4        -1       -18     1576      183    17  517.
##  5        -6       -25      762      116    19  394.
##  6        -4        12      719      150   -16  288.
##  7        -5        19     1065      158   -24  404.
##  8        -3       -14      229       53    11  259.
##  9        -3        -8      944      140     5  405.
## 10        -2         8      733      138   -10  319.
## # … with 336,766 more rows

The variable distance is measured in miles and air_time in minutes, so speed = distance / air_time * 60 is the number of miles flown per hour.
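Within one mutate() call, later expressions can refer to variables created earlier in the same call. The sketch below uses the first two rows of f1 shown above, retyped by hand into a tibble named `f1_toy`. The extra variables `hours` and `gain_per_hour` are added here for illustration.

```r
library(dplyr)

f1_toy <- tibble(
  dep_delay = c(2, 4),
  arr_delay = c(11, 20),
  distance  = c(1400, 1416),
  air_time  = c(227, 227)
)

out <- f1_toy %>%
  mutate(
    gain          = dep_delay - arr_delay,  # minutes made up in the air
    hours         = air_time / 60,          # minutes converted to hours
    speed         = distance / hours,       # miles per hour
    gain_per_hour = gain / hours            # uses gain and hours from above
  )

out
```

Computing `hours` once and reusing it gives the same speed as distance / air_time * 60 and makes the unit conversion explicit.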

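Computing summary statistics with summarize()

summarize() from dplyr computes summary statistics of the sample. For example, to compute the mean of the variable dep_delay (departure delay) and name the resulting column delay: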

flights %>% summarize(delay=mean(dep_delay,na.rm=TRUE))

## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

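The argument na.rm = TRUE tells mean() to drop missing values when computing the average.

A more important feature of summarize() is that it can compute statistics by group.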

m_delay<-flights %>% group_by(month) %>% summarize(delay=mean(dep_delay,na.rm=TRUE))
m_delay %>% arrange(delay)   # sort by delay in ascending order

## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1    11  5.44
##  2    10  6.24
##  3     9  6.72
##  4     1 10.0 
##  5     2 10.8 
##  6     8 12.6 
##  7     5 13.0 
##  8     3 13.2 
##  9     4 13.9 
## 10    12 16.6 
## 11     6 20.8 
## 12     7 21.7

m_delay %>% arrange(desc(delay))  # sort by delay in descending order

## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1     7 21.7 
##  2     6 20.8 
##  3    12 16.6 
##  4     4 13.9 
##  5     3 13.2 
##  6     5 13.0 
##  7     8 12.6 
##  8     2 10.8 
##  9     1 10.0 
## 10     9  6.72
## 11    10  6.24
## 12    11  5.44

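The results show that, over the year, the months with the worst delays are June and July, followed by December, all of which are peak travel seasons.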
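A small example shows why na.rm = TRUE matters and how counts can be reported next to a grouped mean. The tibble `toy` and the column names `n_flights` and `n_missing` are made up for illustration.

```r
library(dplyr)

toy <- tibble(
  month     = c(1, 1, 2, 2),
  dep_delay = c(10, NA, 0, 4)
)

# Without na.rm = TRUE, a single missing value makes the mean NA.
overall <- toy %>% summarize(delay = mean(dep_delay))

by_month <- toy %>%
  group_by(month) %>%
  summarize(
    delay     = mean(dep_delay, na.rm = TRUE),  # mean of non-missing values
    n_flights = n(),                            # rows in the group
    n_missing = sum(is.na(dep_delay))           # missing delays in the group
  )

overall
by_month
```

Reporting the group size alongside the mean helps judge how much weight each monthly average deserves.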

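Merging data frames with left_join()

In practice we often need to combine data frames from different sources in a meaningful way. left_join() from dplyr does this, and it is generally faster than the base R function merge() on large data. Continuing with the nycflights13 package: besides flights, it contains other related data sets, such as airlines, which gives the full names of the airlines: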

library(nycflights13)
airlines

## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

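The output shows that the airlines data frame holds the abbreviation (carrier) and full name (name) of 16 airlines. The most commonly used join function in dplyr is left_join(), whose basic form is:

left_join(x, y, by = "var")

This command merges data frame x with data frame y. It keeps every row of x and matches on the variable var, so that the variables of y are added to x. The matching variable var is called the key.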
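The "keep every row of x" rule is easiest to see when some keys have no match. In the sketch below, `fl` and `al` are made-up miniature versions of flights and airlines, and the carrier code "ZZ" is hypothetical.

```r
library(dplyr)

fl <- tibble(
  flight  = c(1, 2, 3),
  carrier = c("UA", "AA", "ZZ")   # "ZZ" has no entry in al
)

al <- tibble(
  carrier = c("AA", "UA", "DL"),  # "DL" is never used in fl
  name    = c("American Airlines Inc.", "United Air Lines Inc.",
              "Delta Air Lines Inc.")
)

joined <- left_join(fl, al, by = "carrier")
joined
```

All three rows of `fl` survive: the unmatched carrier "ZZ" gets NA for name, while the unused row for "DL" in `al` is dropped.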

Example: add the variable name from the airlines data frame to the flights data frame.

library(nycflights13)
names(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

names(airlines)

## [1] "carrier" "name"

new<-left_join(flights,airlines,by="carrier")
dim<-rbind.data.frame(dim(flights),dim(airlines),dim(new))  # stack the dimension vectors
names(dim)=c("nrow","ncolumn")
knitr::kable(dim)

nrow	ncolumn
336776	19
16	2
336776	20

names(new)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"      "name"

new[,c(10,18:20)]

## # A tibble: 336,776 × 4
##    carrier minute time_hour           name                    
##    <chr>    <dbl> <dttm>              <chr>                   
##  1 UA          15 2013-01-01 05:00:00 United Air Lines Inc.   
##  2 UA          29 2013-01-01 05:00:00 United Air Lines Inc.   
##  3 AA          40 2013-01-01 05:00:00 American Airlines Inc.  
##  4 B6          45 2013-01-01 05:00:00 JetBlue Airways         
##  5 DL           0 2013-01-01 06:00:00 Delta Air Lines Inc.    
##  6 UA          58 2013-01-01 05:00:00 United Air Lines Inc.   
##  7 B6           0 2013-01-01 06:00:00 JetBlue Airways         
##  8 EV           0 2013-01-01 06:00:00 ExpressJet Airlines Inc.
##  9 B6           0 2013-01-01 06:00:00 JetBlue Airways         
## 10 AA           0 2013-01-01 06:00:00 American Airlines Inc.  
## # … with 336,766 more rows

write.csv(airlines,"airlines.csv")
write.csv(flights,"flights.csv")
write.csv(new,"flights_airlines.csv")

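When calling left_join(), if the argument by = "carrier" is omitted, the function matches on all variables that have the same name in both data frames. This is called a "natural join".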
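The following sketch illustrates a natural join on made-up data. The data frames `x` and `y` and their columns `id`, `day`, `a`, `b` are hypothetical. It assumes that the join keys of a natural join are exactly the column names shared by the two inputs, which `intersect()` recovers here.

```r
library(dplyr)

# Hypothetical toy data sharing two column names: id and day
x <- tibble(id = c(1, 1, 2), day = c(1, 2, 1), a = c(10, 20, 30))
y <- tibble(id = c(1, 2), day = c(1, 1), b = c("p", "q"))

# Column names present in both data frames; these act as the join keys
common <- intersect(names(x), names(y))
common

# Natural join: by is omitted, and dplyr prints a message naming the keys
natural <- left_join(x, y)

# The same join with the keys written out explicitly
explicit <- left_join(x, y, by = common)
identical(natural, explicit)
```

Writing the keys out explicitly avoids surprises when two data frames happen to share a column name that is not meant to be a key.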

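Example: the R package nycflights13 also provides the data frame weather, which contains hourly weather data for the three New York airports. We merge it with the flights data using a natural join: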

library(nycflights13)
weather

## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>

new<-left_join(flights,weather)
new

## # A tibble: 336,776 × 28
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 20 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, wind_speed <dbl>,
## #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>

# Stack the dimensions of the three data frames into one table
dims <- rbind.data.frame(dim(flights), dim(weather), dim(new))
names(dims) <- c("nrow", "ncolumn")
knitr::kable(dims)

nrow	ncolumn
336776	19
26115	15
336776	28

write.csv(weather,"weather.csv")
write.csv(new,"flights_weather.csv")

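The result shows that the weather variables have been merged into the data frame. The merged data has 28 columns: the 19 columns of flights plus the 15 columns of weather, minus the 6 shared key columns (year, month, day, origin, hour, time_hour).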

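The R package nycflights13 also contains another data frame, airports, with information about airports. We merge it with the flights data set, matching rows where the value of dest in flights equals the value of faa in airports: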

library(nycflights13)
airports

## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows

new<-left_join(flights,airports,by=c("dest"="faa"))
new

## # A tibble: 336,776 × 26
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 18 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## #   tzone <chr>

write.csv(airports,"airports.csv")
write.csv(new,"flights_airports.csv")

The argument by = c("dest" = "faa") above specifies the matching rule: the value of dest in the first data frame must equal the value of faa in the second data frame.
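A minimal sketch of joining on keys with different names, using made-up data. The data frames `trips` and `places` and their values are hypothetical. Only the column names `dest`, `faa` and `name` follow the example above.

```r
library(dplyr)

# Hypothetical toy data: the key is called dest in one table and faa in the other
trips  <- tibble(flight = c(101, 102, 103), dest = c("AAA", "BBB", "ZZZ"))
places <- tibble(faa = c("AAA", "BBB"), name = c("Airport A", "Airport B"))

# Match trips$dest against places$faa
res <- left_join(trips, places, by = c("dest" = "faa"))
names(res)
res
```

The key column keeps its name from the first data frame (dest), and a destination with no matching faa code gets `NA` in name.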

Besides the commonly used left_join(), the R package dplyr provides other join functions, listed in the table below:

getwd()

## [1] "/Users/huanghuilin/Desktop/360安全云盘同步版/2022/2022年数据挖掘课程/上机/第二次实验课：tidyverse包的使用"

join<-read.csv("dplyr包中数据合并的函数及功能.csv",header=T)
knitr::kable(join)

Function	Description
left_join(x,y)	Keeps all observations in data frame x
right_join(x,y)	Keeps all observations in data frame y
full_join(x,y)	Keeps all observations in both data frames x and y

See the help documentation for details on how to use these functions.
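To make the difference between the three functions concrete, here is a small sketch on made-up data. The data frames `x` and `y`, the column `key`, and all values are hypothetical. It compares which keys survive each join.

```r
library(dplyr)

# Hypothetical toy data: keys 2 and 3 appear in both tables
x <- tibble(key = c(1, 2, 3), a = c("a1", "a2", "a3"))
y <- tibble(key = c(2, 3, 4), b = c("b2", "b3", "b4"))

left  <- left_join(x, y, by = "key")   # keeps the keys of x: 1, 2, 3
right <- right_join(x, y, by = "key")  # keeps the keys of y: 2, 3, 4
full  <- full_join(x, y, by = "key")   # keeps the keys of both: 1, 2, 3, 4

# Number of rows returned by each join
c(left = nrow(left), right = nrow(right), full = nrow(full))
```

In each result, a key that is missing from one table has `NA` in the columns contributed by that table.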

Data preprocessing with the tidyverse package

Huang Huilin

3/31/2022