接下來陸續介紹了tidyverse家族,家庭主要成員包括
功能 包
dplyr 定義了資料處理的語法,包括6個主要的函數:
其他: + count() + left_join() + full_join()
主要的函數
library(dplyr)
library(tidyverse) # scoped 函数
library(ggplot2)
用途:select()函式可選要分析的欄位。 使用方式:select(資料名稱,欄位條件1,欄位條件2,…)
# 以ggplot2 自帶 diamonds dataset 為例
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# 查看 diamonds 的簡單統計量
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
我們只選擇 carat, cut, color, clarity, depth, table 這些欄位
diamonds_select1 <- diamonds %>%
select(carat:table)
head(diamonds_select1,5)
## # A tibble: 5 x 6
## carat cut color clarity depth table
## <dbl> <ord> <ord> <ord> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55
## 2 0.21 Premium E SI1 59.8 61
## 3 0.23 Good E VS1 56.9 65
## 4 0.290 Premium I VS2 62.4 58
## 5 0.31 Good J SI2 63.3 58
這裏我們不想選擇 cut 和 price 這兩個欄位
diamonds_select2 <- diamonds %>%
select(carat, color:z,-price)
head(diamonds_select2,5)
## # A tibble: 5 x 8
## carat color clarity depth table x y z
## <dbl> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 E SI2 61.5 55 3.95 3.98 2.43
## 2 0.21 E SI1 59.8 61 3.89 3.84 2.31
## 3 0.23 E VS1 56.9 65 4.05 4.07 2.31
## 4 0.290 I VS2 62.4 58 4.2 4.23 2.63
## 5 0.31 J SI2 63.3 58 4.34 4.35 2.75
我們希望把x,y,z放到前面,把price放到最後一欄
diamonds_select4 <- diamonds %>%
select(x,y,z, everything(), price)
head(diamonds_select4)
## # A tibble: 6 x 10
## x y z carat cut color clarity depth table price
## <dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
## 1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
## 2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
## 3 4.05 4.07 2.31 0.23 Good E VS1 56.9 65 327
## 4 4.2 4.23 2.63 0.290 Premium I VS2 62.4 58 334
## 5 4.34 4.35 2.75 0.31 Good J SI2 63.3 58 335
## 6 3.94 3.96 2.48 0.24 Very Good J VVS2 62.8 57 336
mutate 主要用來新增欄位
# 新建一個欄位,可以當作編號欄位
diamonds_mutate <- diamonds %>%
mutate(num = 1:length(diamonds$cut))
head(diamonds_mutate)
## # A tibble: 6 x 11
## carat cut color clarity depth table price x y z num
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 2
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 4
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 5
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 6
透過已知欄位新增欄位
diamonds_mutate2 <- diamonds %>%
mutate(newprice = price * 0.95)
head(diamonds_mutate2)
## # A tibble: 6 x 11
## carat cut color clarity depth table price x y z newprice
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 310.
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 310.
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 311.
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 317.
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 318.
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 319.
新建多个欄位,包含剛剛新建出的欄位
diamonds_mutate3 <- diamonds %>%
mutate(newprice = price * 0.95, newtotal = newprice + 100)
head(diamonds_mutate3)
## # A tibble: 6 x 12
## carat cut color clarity depth table price x y z newprice
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 310.
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 310.
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 311.
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 317.
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 318.
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 319.
## # … with 1 more variable: newtotal <dbl>
使用if_else
新增條件欄位
#使用if_else新增条件列
diamonds_mutate4 <- diamonds %>%
mutate(type = if_else(price > 3933, "> mean", "<= mean"))
head(diamonds_mutate4)
## # A tibble: 6 x 11
## carat cut color clarity depth table price x y z type
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 <= mean
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 <= mean
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 <= mean
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 <= mean
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 <= mean
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 <= mean
filter 主要就是以 row的方向進行選擇
diamonds_filter <- diamonds %>%
filter(price > 1000)
head(diamonds_filter)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
## 4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56
## 5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72
## 6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38
單一屬性中的多個不同類型的選擇,可以用%in%
# %in%
diamonds_filter2 <- diamonds %>%
filter(cut %in% c('Ideal','Very Good','Good'))
head(diamonds_filter2)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 4 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 5 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 6 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
多個屬性的,多個條件的選擇。
# 每個屬性的條件用逗號隔開即可
diamonds_filter3 <- diamonds %>%
filter(cut %in% c('Ideal','Very Good','Good'), price>3000)
head(diamonds_filter3)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.9 Very Good I SI2 62.5 58 3001 6.1 6.15 3.83
## 2 0.9 Ideal I SI2 62 56 3001 6.16 6.2 3.83
## 3 0.72 Very Good D SI1 61.5 58 3001 5.74 5.78 3.54
## 4 0.7 Good D VS1 63.6 60 3001 5.61 5.52 3.54
## 5 0.7 Very Good D VS1 63.4 59 3001 5.58 5.55 3.53
## 6 0.6 Ideal G VVS1 62.1 56 3001 5.42 5.43 3.37
# 也可使用 & 和 |等符號串連邏輯
diamonds_filter4 <- diamonds %>%
filter(cut %in% c('Ideal','Very Good','Good') | price>2000 | color =='D')
head(diamonds_filter4)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 4 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 5 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 6 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
篩選rows,可以看出來filter函數中的參數為邏輯值即可,這就非常強大,能夠篩選出複雜的多條件。同樣的,還支持很多運算符: ==,>,>=等等 &,|,!,xor() is.na() between(),near()
summarise()函式用來計算統計值 單獨使用時一般就是對資料進行一個整體的了解,與group_by()合併使用時就是對每個分組的統計量進行了解
# 我們想查看的統計量是共有多少row,最高價,平均depth,cut的種類
diamonds_sum <- diamonds %>%
summarise(nrow = n(), max_price = max(price), mean_depth = mean(depth), nd_cut = n_distinct(cut))
head(diamonds_sum)
## # A tibble: 1 x 4
## nrow max_price mean_depth nd_cut
## <int> <int> <dbl> <int>
## 1 53940 18823 61.7 5
summarise函式能對輸入的資料集指定不同屬性進行聚合。除了像上面常用的聚合有求和,求平均值,還支持:
group_by()函數的功能為設定分組依據,通常會與summarise()函式合併使用
diamonds_groupby <- diamonds %>%
group_by(color) %>%
summarise(nrow = n(), mean_price = mean(price))
head(diamonds_groupby)
## # A tibble: 6 x 3
## color nrow mean_price
## <ord> <int> <dbl>
## 1 D 6775 3170.
## 2 E 9797 3077.
## 3 F 9542 3725.
## 4 G 11292 3999.
## 5 H 8304 4487.
## 6 I 5422 5092.
做一些簡單的線型回歸,然後提取統計量
# 用常用到的mtcars
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
group_by(cyl) %>%
group_modify(
~ broom::tidy(lm(mpg ~ wt, data = .))
)
## # A tibble: 6 x 6
## # Groups: cyl [3]
## cyl term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 (Intercept) 39.6 4.35 9.10 0.00000777
## 2 4 wt -5.65 1.85 -3.05 0.0137
## 3 6 (Intercept) 28.4 4.18 6.79 0.00105
## 4 6 wt -2.78 1.33 -2.08 0.0918
## 5 8 (Intercept) 23.9 3.01 7.94 0.00000405
## 6 8 wt -2.19 0.739 -2.97 0.0118
# summarise()的進階用法
mtcars %>%
group_by(cyl) %>%
summarise(
broom::tidy(lm(mpg ~ wt))
)
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
## # A tibble: 6 x 6
## # Groups: cyl [3]
## cyl term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 (Intercept) 39.6 4.35 9.10 0.00000777
## 2 4 wt -5.65 1.85 -3.05 0.0137
## 3 6 (Intercept) 28.4 4.18 6.79 0.00105
## 4 6 wt -2.78 1.33 -2.08 0.0918
## 5 8 (Intercept) 23.9 3.01 7.94 0.00000405
## 6 8 wt -2.19 0.739 -2.97 0.0118
排序功能,預設為遞增排序
diamonds_arrange <- diamonds %>%
group_by(color) %>%
summarise(nrow = n(), mean_price = mean(price))%>%
arrange(desc(mean_price))
head(diamonds_arrange)
## # A tibble: 6 x 3
## color nrow mean_price
## <ord> <int> <dbl>
## 1 J 2808 5324.
## 2 I 5422 5092.
## 3 H 8304 4487.
## 4 G 11292 3999.
## 5 F 9542 3725.
## 6 D 6775 3170.
按多個屬性排序
diamonds_arrange2 <- diamonds %>%
arrange(price & desc(depth))
head(diamonds_arrange2)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
其功能就是統計某個變量中各組出現的次數
統計所有的資料筆數
diamonds_count <- diamonds %>%
count()
head(diamonds_count)
## # A tibble: 1 x 1
## n
## <int>
## 1 53940
統計類別變數,各個類別的數量
diamonds_count2 <- diamonds %>%
count(cut)
diamonds_count2
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
和 table()
所達到的目的一樣
table(diamonds$cut)
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
diamonds_count3 <- diamonds %>%
count(new = 7 * (price %% 10))
head(diamonds_count3)
## # A tibble: 6 x 2
## new n
## <dbl> <int>
## 1 0 5406
## 2 7 5012
## 3 14 5586
## 4 21 5266
## 5 28 5475
## 6 35 5419
當你想要加入一個統計欄位的時候可以用比較傳統的:group_by() + mutate() 也可以用更簡單的add_count()
diamonds_count4 <- diamonds %>%
group_by(color) %>%
mutate(total_col = n()) %>%
ungroup()
head(diamonds_count4)
## # A tibble: 6 x 11
## carat cut color clarity depth table price x y z total_col
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 9797
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 9797
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 9797
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5422
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 2808
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 2808
add_count()的用法,最後會有一個“n”的欄位就是統計結果
diamonds_count5 <- diamonds %>%
add_count(color)
head(diamonds_count5)
## # A tibble: 6 x 11
## carat cut color clarity depth table price x y z n
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 9797
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 9797
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 9797
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5422
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 2808
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 2808
dplyr的一些函數(mutate(select()等)),事實上,這些函數加上後綴_all,_at,_if,形成三組變體函數,可以方便對特定的子集進行操作。
Operate | _all | _at | _if |
---|---|---|---|
select() | select_all() | select_at() | select_if() |
mutate() | mutate_all() | mutate_at() | mutate_if() |
rename() | rename_all() | rename_at() | rename_if() |
arrange() | arrange_all() | arrange_at() | arrange_if() |
filter() | filter_all() | filter_at() | filter_if() |
distinct() | distinct_all() | distinct_at() | distinct_if() |
group_by() | group_by_all() | group_by_at() | group_by_if() |
summarise() | summarise_all() | summarise_at() | summarise_if() |
map() | map_all() | map_at() | map_if() |
modify() | modify_all() | modify_at() | modify_if() |
可以用 select_if 選擇某一數據類型
# 下看一下資料結構
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# 挑選出所有factor型態的資料
diamonds_select3 <- diamonds %>%
select_if(is.factor)
head(diamonds_select3,5)
## # A tibble: 5 x 3
## cut color clarity
## <ord> <ord> <ord>
## 1 Ideal E SI2
## 2 Premium E SI1
## 3 Good E VS1
## 4 Premium I VS2
## 5 Good J SI2