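This exercise uses the flights dataset from the nycflights13 package: on-time data for all flights that departed from New York's three major airports (JFK, LGA, EWR) in 2013, 336,776 records in total.
Compute the number of departing flights and the mean departure delay for each of the three New York airports in 2013 (you may use the group_by and summarise functions).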
library(dplyr)         # group_by(), summarise(), %>%
library(nycflights13)  # provides the flights dataset
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
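Before the solution, a minimal sketch of how group_by() and summarise() interact with missing values. The toy tibble and the extra column n_obs are hypothetical and only for illustration; the point is that n() counts every row in a group, while mean(..., na.rm = TRUE) averages only the rows where a delay was recorded (in flights, dep_delay is presumably NA for flights that never departed).

```r
library(dplyr)

# Hypothetical toy data: three flights from "A", two from "B";
# one delay is missing.
toy <- tibble(
  origin    = c("A", "A", "A", "B", "B"),
  dep_delay = c(10, NA, 20, 5, 15)
)

toy_summary <- toy %>%
  group_by(origin) %>%
  summarise(
    n     = n(),                           # counts every row, NA or not
    n_obs = sum(!is.na(dep_delay)),        # rows with a recorded delay
    depm  = mean(dep_delay, na.rm = TRUE)  # mean over recorded delays only
  )

toy_summary
```

So the flight count and the mean delay in the summaries that follow may be based on different numbers of rows.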
flights %>%
  group_by(origin) %>%
  summarise(n = n(), depm = mean(dep_delay, na.rm = TRUE))
## # A tibble: 3 × 3
## origin n depm
## <chr> <int> <dbl>
## 1 EWR 120835 15.1
## 2 JFK 111279 12.1
## 3 LGA 104662 10.3
Compute the number of flights departing from New York in 2013 and the mean departure delay for each airline.
flights %>%
  group_by(carrier) %>%
  summarise(n = n(), depm = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(n))
## # A tibble: 16 × 3
## carrier n depm
## <chr> <int> <dbl>
## 1 UA 58665 12.1
## 2 B6 54635 13.0
## 3 EV 54173 20.0
## 4 DL 48110 9.26
## 5 AA 32729 8.59
## 6 MQ 26397 10.6
## 7 US 20536 3.78
## 8 9E 18460 16.7
## 9 WN 12275 17.7
## 10 VX 5162 12.9
## 11 FL 3260 18.7
## 12 AS 714 5.80
## 13 F9 685 20.2
## 14 YV 601 19.0
## 15 HA 342 4.90
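## 16 OO 32 12.6
For each of the three New York airports, find the top three destinations and their mean flight distance (you may use the group_by, summarise, arrange, and slice_max functions).
flights %>%
  group_by(origin, dest) %>%
  summarise(n = n(), distm = mean(distance)) %>%
  slice_max(n, n = 3)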
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 4
## # Groups: origin [3]
## origin dest n distm
## <chr> <chr> <int> <dbl>
## 1 EWR ORD 6100 719
## 2 EWR BOS 5327 200
## 3 EWR SFO 5127 2565
## 4 JFK LAX 11262 2475
## 5 JFK SFO 8204 2586
## 6 JFK BOS 5898 187
## 7 LGA ATL 10263 762
## 8 LGA ORD 8857 733
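## 9 LGA CLT 6168 544
What the code does: tibble(iris) converts the iris data frame to a tibble, and the pipe operator %>% passes it to arrange() as its first argument. The remaining arguments give the sort columns and order: rows are sorted by Species first (ascending), then by every column whose name starts with "Sepal" (Sepal.Length, then Sepal.Width) in descending order.
tibble(iris) %>%
  arrange(Species, across(starts_with("Sepal"), desc))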
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.8 4 1.2 0.2 setosa
## 2 5.7 4.4 1.5 0.4 setosa
## 3 5.7 3.8 1.7 0.3 setosa
## 4 5.5 4.2 1.4 0.2 setosa
## 5 5.5 3.5 1.3 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 5.4 3.9 1.3 0.4 setosa
## 8 5.4 3.7 1.5 0.2 setosa
## 9 5.4 3.4 1.7 0.2 setosa
## 10 5.4 3.4 1.5 0.4 setosa
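## # ℹ 140 more rows
What the code does: groups the starwars dataset by the gender column, computes the mean of mass within each group (ignoring missing values), and keeps the characters whose mass is greater than their group's mean. Characters with a missing mass are dropped, because the comparison evaluates to NA.
starwars %>%
  group_by(gender) %>%
  filter(mass > mean(mass, na.rm = TRUE))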
## # A tibble: 15 × 14
## # Groups: gender [3]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth … 202 136 none white yellow 41.9 male mascu…
## 2 Owen L… 178 120 brown, gr… light blue 52 male mascu…
## 3 Beru W… 165 75 brown light blue 47 fema… femin…
## 4 Chewba… 228 112 brown unknown blue 200 male mascu…
## 5 Jabba … 175 1358 <NA> green-tan… orange 600 herm… mascu…
## 6 Jek To… 180 110 brown fair blue NA <NA> <NA>
## 7 IG-88 200 140 none metal red 15 none mascu…
## 8 Bossk 190 113 none green red 53 male mascu…
## 9 Ayla S… 178 55 none blue hazel 48 fema… femin…
## 10 Gregar… 185 85 black dark brown NA <NA> <NA>
## 11 Lumina… 170 56.2 black yellow blue 58 fema… femin…
## 12 Zam We… 168 55 blonde fair, gre… yellow NA fema… femin…
## 13 Shaak … 178 57 none red, blue… black NA fema… femin…
## 14 Grievo… 216 159 none brown, wh… green, y… NA male mascu…
## 15 Tarfful 234 136 brown brown blue NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
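The grouped filter above can be checked on a small example. The sketch below uses a hypothetical toy tibble (the names toy, heavy, and heavy_explicit are illustrative) and compares the compact form with an explicit two-step version that first stores the group mean in a column.

```r
library(dplyr)

# Hypothetical toy data with one missing mass
toy <- tibble(
  name   = c("a", "b", "c", "d", "e"),
  gender = c("f", "f", "m", "m", "m"),
  mass   = c(50, 70, 80, NA, 100)
)

# Compact form: mean() is evaluated separately inside each gender group
heavy <- toy %>%
  group_by(gender) %>%
  filter(mass > mean(mass, na.rm = TRUE)) %>%
  ungroup()

# Explicit form: store the group mean first, then compare row by row
heavy_explicit <- toy %>%
  group_by(gender) %>%
  mutate(group_mean = mean(mass, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(mass > group_mean)

heavy$name
heavy_explicit$name
```

Both versions keep the same rows. The row with a missing mass is dropped in either case, since filter() discards rows where the condition is NA.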
What the code does: selects the name, homeworld, and species columns from the starwars dataset, then converts every column except name (that is, homeworld and species) to a factor.
starwars %>%
  select(name, homeworld, species) %>%
  mutate(across(!name, as.factor))
## # A tibble: 87 × 3
## name homeworld species
## <chr> <fct> <fct>
## 1 Luke Skywalker Tatooine Human
## 2 C-3PO Tatooine Droid
## 3 R2-D2 Naboo Droid
## 4 Darth Vader Tatooine Human
## 5 Leia Organa Alderaan Human
## 6 Owen Lars Tatooine Human
## 7 Beru Whitesun Lars Tatooine Human
## 8 R5-D4 Tatooine Droid
## 9 Biggs Darklighter Tatooine Human
## 10 Obi-Wan Kenobi Stewjon Human
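## # ℹ 77 more rows
What the code does: converts the mtcars dataset to a tibble and groups it by the vs column (engine shape). Within each vs group, cut(hp, 3) splits the hp column (horsepower) into 3 equal-width intervals and stores the result in a new column hp_cut. Finally, the data is regrouped by hp_cut. Because the breaks are computed separately for each vs group, the result contains 6 distinct intervals (Groups: hp_cut [6]) rather than 3.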
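A small sketch of why binning inside groups yields more intervals than binning the whole column. The toy tibble and the names global and by_group are hypothetical; the only functions used are cut() from base R and the dplyr verbs already used in this exercise.

```r
library(dplyr)

# Hypothetical toy data: two groups with very different ranges of x
toy <- tibble(
  grp = c(1, 1, 1, 2, 2, 2),
  x   = c(1, 2, 3, 10, 20, 30)
)

# cut() on the whole column: one set of breaks, at most 3 intervals
global <- toy %>%
  mutate(x_cut = cut(x, 3))

# cut() inside each group: the breaks are recomputed per group
by_group <- toy %>%
  group_by(grp) %>%
  mutate(x_cut = cut(x, 3)) %>%
  ungroup()

n_distinct(global$x_cut)
n_distinct(by_group$x_cut)
```

In this toy data the ungrouped version produces 3 distinct intervals and the grouped version 6, because each group gets its own three bins.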
tibble(mtcars) %>%
  group_by(vs) %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  group_by(hp_cut)
## # A tibble: 32 × 12
## # Groups: hp_cut [6]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_cut
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 (90.8,172]
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 (90.8,172]
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 (75.7,99.3]
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 (99.3,123]
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 (172,254]
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 (99.3,123]
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 (172,254]
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 (51.9,75.7]
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 (75.7,99.3]
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 (99.3,123]
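## # ℹ 22 more rows
Read https://dplyr.tidyverse.org/reference/mutate-joins.html and explain what the four mutating-join functions do. For each one, give a concrete example and explain its output.
inner_join(): keeps only the rows of x whose key has a match in y. Here only ids 2 and 3 appear in both students and scores, so Alice (id 1) and the score for id 4 are dropped.
library(dplyr)
students <- tibble(
  id = 1:3,
  name = c("Alice", "Bob", "Charlie")
)
scores <- tibble(
  id = c(2, 3, 4),
  score = c(90, 85, 70)
)
inner_join(students, scores, by = "id")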
## # A tibble: 2 × 3
## id name score
## <dbl> <chr> <dbl>
## 1 2 Bob 90
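## 2 3 Charlie 85
left_join(): keeps every row of x (students) and fills the columns from y (scores) with NA where there is no match. Charlie (id 3) has no score, so his score is NA; id 4, which exists only in scores, is dropped.
library(dplyr)
students <- tibble(
  id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
scores <- tibble(
  id = c(1, 2, 4),
  score = c(90, 85, 88)
)
result <- left_join(students, scores, by = "id")
print(result)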
## # A tibble: 3 × 3
## id name score
## <dbl> <chr> <dbl>
## 1 1 Alice 90
## 2 2 Bob 85
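## 3 3 Charlie NA
right_join(): keeps every row of y (scores). id 4 has no matching student, so its name is NA; Charlie (id 3), who exists only in students, is dropped.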
result <- right_join(students, scores, by = "id")
print(result)
## # A tibble: 3 × 3
## id name score
## <dbl> <chr> <dbl>
## 1 1 Alice 90
## 2 2 Bob 85
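## 3 4 <NA> 88
full_join(): keeps every row from both x and y. Unmatched rows from either side are retained, with NA in the columns that come from the other table.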
result <- full_join(students, scores, by = "id")
print(result)
## # A tibble: 4 × 3
## id name score
## <dbl> <chr> <dbl>
## 1 1 Alice 90
## 2 2 Bob 85
## 3 3 Charlie NA
## 4 4 <NA> 88
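The four joins can also be compared side by side. The sketch below reuses the students and scores tables from the left_join() example and collects the ids that each join keeps; the helper names joins and kept_ids are illustrative only.

```r
library(dplyr)

students <- tibble(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
scores   <- tibble(id = c(1, 2, 4), score = c(90, 85, 88))

# The four mutating joins, stored as functions in a named list
joins <- list(
  inner = inner_join,
  left  = left_join,
  right = right_join,
  full  = full_join
)

# Apply each join to the same inputs and keep only the resulting ids
kept_ids <- lapply(joins, function(f) f(students, scores, by = "id")$id)
kept_ids
```

The inner join keeps the ids common to both tables, the left join all ids from students, the right join all ids from scores, and the full join the union of the two.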