---
title: "A first look at the tidyverse"
toc: true # add a table of contents to the rendered document
number-sections: true # number the section headings
format: html
editor: visual
---
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
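The flights dataset in the nycflights13 package holds on-time data for every flight that departed from New York City's three major airports (JFK, LGA, EWR) in 2013, 336,776 records in total.
Compute the number of departing flights and the mean departure delay in 2013 for each of the three New York airports (hint: group_by and summarise).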
library(nycflights13)
flights %>%
  group_by(origin) %>%
  summarise(n = n(), depm = mean(dep_delay, na.rm = TRUE))
## # A tibble: 3 × 3
## origin n depm
## <chr> <int> <dbl>
## 1 EWR 120835 15.1
## 2 JFK 111279 12.1
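## 3 LGA 104662 10.3
Compute the number of flights and the mean departure delay in 2013 for each airline departing from New York, sorted by number of flights.
flights %>%
  group_by(carrier) %>%
  summarise(n = n(), depm = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(n))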
## # A tibble: 16 × 3
## carrier n depm
## <chr> <int> <dbl>
## 1 UA 58665 12.1
## 2 B6 54635 13.0
## 3 EV 54173 20.0
## 4 DL 48110 9.26
## 5 AA 32729 8.59
## 6 MQ 26397 10.6
## 7 US 20536 3.78
## 8 9E 18460 16.7
## 9 WN 12275 17.7
## 10 VX 5162 12.9
## 11 FL 3260 18.7
## 12 AS 714 5.80
## 13 F9 685 20.2
## 14 YV 601 19.0
## 15 HA 342 4.90
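## 16 OO 32 12.6
Compute the three most frequent destinations for each of the three New York airports, together with the mean flight distance (hint: group_by, summarise, arrange, slice_max).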
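One point worth noting for this task: after group_by(origin, dest), summarise() removes only the last grouping level, so the result stays grouped by origin and slice_max() then picks the top rows within each airport. The following is a minimal sketch of that behaviour, not the original solution. It states `.groups = "drop_last"` explicitly (this is what summarise() does by default in this situation, so the informational message is not printed) and ungroups at the end. The name `top_dest` is illustrative.

```r
library(dplyr)
library(nycflights13)

top_dest <- flights %>%
  group_by(origin, dest) %>%
  # Drop only the last grouping level (dest); the result stays grouped by origin
  summarise(n = n(), distm = mean(distance), .groups = "drop_last") %>%
  # Still grouped by origin, so this keeps the 3 busiest destinations per airport
  slice_max(n, n = 3) %>%
  # Return a plain, ungrouped tibble
  ungroup()

top_dest
```

Ending with ungroup() is a design choice: it avoids carrying the origin grouping into later steps by accident.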
flights %>%
  group_by(origin, dest) %>%
  summarise(n = n(), distm = mean(distance)) %>%
  slice_max(n, n = 3)
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 4
## # Groups: origin [3]
## origin dest n distm
## <chr> <chr> <int> <dbl>
## 1 EWR ORD 6100 719
## 2 EWR BOS 5327 200
## 3 EWR SFO 5127 2565
## 4 JFK LAX 11262 2475
## 5 JFK SFO 8204 2586
## 6 JFK BOS 5898 187
## 7 LGA ATL 10263 762
## 8 LGA ORD 8857 733
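## 9 LGA CLT 6168 544
What the code does: converts the iris dataset to a tibble, sorts by Species, then sorts within each species by the columns whose names start with "Sepal" in descending order: Sepal.Length first, with ties broken by Sepal.Width.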
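To make the tie-breaking explicit, the same ordering can be written with one desc() call per column. This is a sketch that assumes iris has exactly two columns starting with "Sepal" (Sepal.Length and Sepal.Width, in that order). The names `explicit` and `with_across` are illustrative.

```r
library(dplyr)
library(tibble)

# Explicit form: Species ascending, then Sepal.Length and Sepal.Width descending
explicit <- as_tibble(iris) %>%
  arrange(Species, desc(Sepal.Length), desc(Sepal.Width))

# across() form: apply desc() to every column whose name starts with "Sepal"
with_across <- as_tibble(iris) %>%
  arrange(Species, across(starts_with("Sepal"), desc))

# Compare the two row orders
identical(explicit, with_across)
```

The across() spelling scales better when many columns share a prefix, while the explicit form shows the sort priority at a glance.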
tibble(iris) %>%
  arrange(Species, across(starts_with("Sepal"), desc))
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.8 4 1.2 0.2 setosa
## 2 5.7 4.4 1.5 0.4 setosa
## 3 5.7 3.8 1.7 0.3 setosa
## 4 5.5 4.2 1.4 0.2 setosa
## 5 5.5 3.5 1.3 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 5.4 3.9 1.3 0.4 setosa
## 8 5.4 3.7 1.5 0.2 setosa
## 9 5.4 3.4 1.7 0.2 setosa
## 10 5.4 3.4 1.5 0.4 setosa
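## # ℹ 140 more rows
What the code does: groups the starwars dataset by gender, computes the mean mass within each gender group (ignoring missing values), and keeps only the rows whose mass is greater than the mean of their own group. Rows with a missing mass are dropped, because the comparison evaluates to NA.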
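The grouped filter can be unpacked into two steps, which makes the value each row is compared against visible. This is a sketch rather than the original solution; the column name `mean_mass` and the object names are hypothetical.

```r
library(dplyr)

# Step 1: attach each row's group mean as an explicit column
with_mean <- starwars %>%
  group_by(gender) %>%
  mutate(mean_mass = mean(mass, na.rm = TRUE))

# Step 2: keep rows heavier than their own group's mean.
# Rows with missing mass disappear because NA > x evaluates to NA.
heavier <- with_mean %>%
  filter(mass > mean_mass) %>%
  select(name, gender, mass, mean_mass)

heavier
```

The compact form that follows does the same comparison in one call, since mean() inside a grouped filter() is evaluated once per group.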
starwars %>%
  group_by(gender) %>%
  filter(mass > mean(mass, na.rm = TRUE))
## # A tibble: 15 × 14
## # Groups: gender [3]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth … 202 136 none white yellow 41.9 male mascu…
## 2 Owen L… 178 120 brown, gr… light blue 52 male mascu…
## 3 Beru W… 165 75 brown light blue 47 fema… femin…
## 4 Chewba… 228 112 brown unknown blue 200 male mascu…
## 5 Jabba … 175 1358 <NA> green-tan… orange 600 herm… mascu…
## 6 Jek To… 180 110 brown fair blue NA <NA> <NA>
## 7 IG-88 200 140 none metal red 15 none mascu…
## 8 Bossk 190 113 none green red 53 male mascu…
## 9 Ayla S… 178 55 none blue hazel 48 fema… femin…
## 10 Gregar… 185 85 black dark brown NA <NA> <NA>
## 11 Lumina… 170 56.2 black yellow blue 58 fema… femin…
## 12 Zam We… 168 55 blonde fair, gre… yellow NA fema… femin…
## 13 Shaak … 178 57 none red, blue… black NA fema… femin…
## 14 Grievo… 216 159 none brown, wh… green, y… NA male mascu…
## 15 Tarfful 234 136 brown brown blue NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
What the code does: selects the name, homeworld, and species columns from the starwars dataset, then uses mutate() with across() to convert every column except name to a factor.
starwars %>%
  select(name, homeworld, species) %>%
  mutate(across(!name, as.factor))
## # A tibble: 87 × 3
## name homeworld species
## <chr> <fct> <fct>
## 1 Luke Skywalker Tatooine Human
## 2 C-3PO Tatooine Droid
## 3 R2-D2 Naboo Droid
## 4 Darth Vader Tatooine Human
## 5 Leia Organa Alderaan Human
## 6 Owen Lars Tatooine Human
## 7 Beru Whitesun Lars Tatooine Human
## 8 R5-D4 Tatooine Droid
## 9 Biggs Darklighter Tatooine Human
## 10 Obi-Wan Kenobi Stewjon Human
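## # ℹ 77 more rows
What the code does: converts the mtcars dataset to a tibble, groups it by vs, adds a new column hp_cut that splits hp into three equal-width intervals, and then regroups the data by hp_cut. Because cut() runs separately within each vs group, the two groups get different breakpoints, which is why the result has six hp_cut groups rather than three.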
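The effect of the grouping on cut() can be checked by counting factor levels with and without group_by(vs). This is a small sketch; `cars`, `ungrouped_levels`, and `grouped_levels` are illustrative names.

```r
library(dplyr)
library(tibble)

cars <- as_tibble(mtcars)

# Ungrouped: cut() sees all hp values at once, so there is one set of 3 intervals
ungrouped_levels <- cars %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  pull(hp_cut) %>%
  nlevels()

# Grouped by vs: cut() runs once per group, so each group gets its own 3 intervals
grouped_levels <- cars %>%
  group_by(vs) %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  pull(hp_cut) %>%
  nlevels()

c(ungrouped = ungrouped_levels, grouped = grouped_levels)
```

If shared breakpoints across both vs groups are wanted, one option is to call cut() before group_by().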
tibble(mtcars) %>%
  group_by(vs) %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  group_by(hp_cut)
## # A tibble: 32 × 12
## # Groups: hp_cut [6]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_cut
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 (90.8,172]
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 (90.8,172]
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 (75.7,99.3]
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 (99.3,123]
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 (172,254]
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 (99.3,123]
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 (172,254]
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 (51.9,75.7]
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 (75.7,99.3]
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 (99.3,123]
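## # ℹ 22 more rows
Read https://dplyr.tidyverse.org/reference/mutate-joins.html and explain what the four mutating-join functions do. Give a concrete example of each and explain its output.
inner_join()
: Returns only the rows whose key values match in both datasets, i.e. it keeps only the keys present in both. Here ids 2 and 3 appear in both df1 and df2, so two rows are returned.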
df1 <- tibble(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), age = c(25, 30, 35))
result <- inner_join(df1, df2, by = "id")
print(result)
## # A tibble: 2 × 3
## id name age
## <dbl> <chr> <dbl>
## 1 2 Bob 25
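## 2 3 Charlie 30
left_join()
: Returns all rows of the left dataset, adding the columns of the right dataset where the key matches. Left rows without a match (id 1 here) get NA in the right-hand columns.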
result <- left_join(df1, df2, by = "id")
print(result)
## # A tibble: 3 × 3
## id name age
## <dbl> <chr> <dbl>
## 1 1 Alice NA
## 2 2 Bob 25
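## 3 3 Charlie 30
right_join()
: Returns all rows of the right dataset, adding the columns of the left dataset where the key matches. Right rows without a match (id 4 here) get NA in the left-hand columns.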
result <- right_join(df1, df2, by = "id")
print(result)
## # A tibble: 3 × 3
## id name age
## <dbl> <chr> <dbl>
## 1 2 Bob 25
## 2 3 Charlie 30
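## 3 4 <NA> 35
full_join()
: Returns all rows of both datasets, matching them where the key agrees. Rows that exist on only one side (ids 1 and 4 here) are kept, with NA in the columns from the other side.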
result <- full_join(df1, df2, by = "id")
print(result)
## # A tibble: 4 × 3
## id name age
## <dbl> <chr> <dbl>
## 1 1 Alice NA
## 2 2 Bob 25
## 3 3 Charlie 30
## 4 4 <NA> 35
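The four results above differ only in which ids survive the join. As a compact recap, the sketch below runs all four joins on the same df1 and df2 and collects the surviving ids. The names `joins` and `kept_ids` are illustrative and not part of the original exercise.

```r
library(dplyr)
library(tibble)

df1 <- tibble(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), age = c(25, 30, 35))

# The four mutating joins, stored by name so they can be applied in a loop
joins <- list(
  inner = inner_join,
  left  = left_join,
  right = right_join,
  full  = full_join
)

# Apply each join to the same pair of tables and record which ids it keeps
kept_ids <- lapply(joins, function(join_fn) join_fn(df1, df2, by = "id")$id)
kept_ids
```

Comparing the id vectors side by side shows the pattern: inner keeps the intersection of keys, left and right keep one side's keys, and full keeps the union.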