title: "A first look at the tidyverse"
toc: true # add a table of contents to the rendered document
number-sections: true # number the section headings
format: html
editor: visual
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Question 1: Write code

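Use the flights dataset from the nycflights13 package. It contains on-time data for all flights that departed from New York City's three major airports (JFK, LGA, EWR) in 2013, 336,776 records in total.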
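
As a quick sanity check before writing any analysis code, the figures quoted above can be confirmed directly from the data. This is a minimal sketch that assumes the nycflights13 package is already installed; it only inspects the dataset and is not part of the original exercise.

```r
library(nycflights13)

# Number of rows and columns in the flights data frame
dim(flights)

# The distinct departure airports recorded in the origin column
sort(unique(flights$origin))
```

The first call should report 336,776 rows, and the second should list EWR, JFK, and LGA.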

Question 2: Explain the code

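  1. What the code does: converts the iris dataset to a tibble, sorts the rows by Species, and then sorts in descending order by the columns whose names start with "Sepal": first by Sepal.Length, with ties broken by Sepal.Width.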

    tibble(iris) %>% 
      arrange(Species,across(starts_with("Sepal"), desc))
    ## # A tibble: 150 × 5
    ##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
    ##  1          5.8         4            1.2         0.2 setosa 
    ##  2          5.7         4.4          1.5         0.4 setosa 
    ##  3          5.7         3.8          1.7         0.3 setosa 
    ##  4          5.5         4.2          1.4         0.2 setosa 
    ##  5          5.5         3.5          1.3         0.2 setosa 
    ##  6          5.4         3.9          1.7         0.4 setosa 
    ##  7          5.4         3.9          1.3         0.4 setosa 
    ##  8          5.4         3.7          1.5         0.2 setosa 
    ##  9          5.4         3.4          1.7         0.2 setosa 
    ## 10          5.4         3.4          1.5         0.4 setosa 
    ## # ℹ 140 more rows
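  2. What the code does: groups the starwars dataset by gender, computes the mean mass of each gender group with missing values ignored (na.rm = TRUE), and keeps only the rows whose mass exceeds the mean of their own group. Rows with a missing mass are dropped as well, because the comparison evaluates to NA.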

    starwars %>% 
      group_by(gender) %>% 
      filter(mass > mean(mass, na.rm = TRUE))
    ## # A tibble: 15 × 14
    ## # Groups:   gender [3]
    ##    name    height   mass hair_color skin_color eye_color birth_year sex   gender
    ##    <chr>    <int>  <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
    ##  1 Darth …    202  136   none       white      yellow          41.9 male  mascu…
    ##  2 Owen L…    178  120   brown, gr… light      blue            52   male  mascu…
    ##  3 Beru W…    165   75   brown      light      blue            47   fema… femin…
    ##  4 Chewba…    228  112   brown      unknown    blue           200   male  mascu…
    ##  5 Jabba …    175 1358   <NA>       green-tan… orange         600   herm… mascu…
    ##  6 Jek To…    180  110   brown      fair       blue            NA   <NA>  <NA>  
    ##  7 IG-88      200  140   none       metal      red             15   none  mascu…
    ##  8 Bossk      190  113   none       green      red             53   male  mascu…
    ##  9 Ayla S…    178   55   none       blue       hazel           48   fema… femin…
    ## 10 Gregar…    185   85   black      dark       brown           NA   <NA>  <NA>  
    ## 11 Lumina…    170   56.2 black      yellow     blue            58   fema… femin…
    ## 12 Zam We…    168   55   blonde     fair, gre… yellow          NA   fema… femin…
    ## 13 Shaak …    178   57   none       red, blue… black           NA   fema… femin…
    ## 14 Grievo…    216  159   none       brown, wh… green, y…       NA   male  mascu…
    ## 15 Tarfful    234  136   brown      brown      blue            NA   male  mascu…
    ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
    ## #   vehicles <list>, starships <list>
  3. What the code does: selects the name, homeworld, and species columns from the starwars dataset, then uses mutate() with across() to convert every column except name to a factor.

    starwars %>%
      select(name, homeworld, species) %>%
      mutate(across(!name, as.factor))
    ## # A tibble: 87 × 3
    ##    name               homeworld species
    ##    <chr>              <fct>     <fct>  
    ##  1 Luke Skywalker     Tatooine  Human  
    ##  2 C-3PO              Tatooine  Droid  
    ##  3 R2-D2              Naboo     Droid  
    ##  4 Darth Vader        Tatooine  Human  
    ##  5 Leia Organa        Alderaan  Human  
    ##  6 Owen Lars          Tatooine  Human  
    ##  7 Beru Whitesun Lars Tatooine  Human  
    ##  8 R5-D4              Tatooine  Droid  
    ##  9 Biggs Darklighter  Tatooine  Human  
    ## 10 Obi-Wan Kenobi     Stewjon   Human  
    ## # ℹ 77 more rows
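  4. What the code does: converts the mtcars dataset to a tibble, groups it by vs, and adds a new column hp_cut that splits hp into three equal-width intervals. Because the data is grouped when cut() runs, the intervals are computed separately for each vs group, which is why the result has six distinct hp_cut values. The final group_by(hp_cut) replaces the vs grouping with a grouping by the new column.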

    tibble(mtcars) %>%
      group_by(vs) %>%
      mutate(hp_cut = cut(hp, 3)) %>%
      group_by(hp_cut)
    ## # A tibble: 32 × 12
    ## # Groups:   hp_cut [6]
    ##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb hp_cut     
    ##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>      
    ##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 (90.8,172] 
    ##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 (90.8,172] 
    ##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 (75.7,99.3]
    ##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 (99.3,123] 
    ##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 (172,254]  
    ##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 (99.3,123] 
    ##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 (172,254]  
    ##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 (51.9,75.7]
    ##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 (75.7,99.3]
    ## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 (99.3,123] 
    ## # ℹ 22 more rows
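
The effect of grouping on cut() in the last item is easy to miss, so the sketch below contrasts the grouped call with an ungrouped one. The object names `grouped` and `ungrouped` are introduced here for illustration only, and `library(dplyr)` is loaded instead of the full tidyverse to keep the example small.

```r
library(dplyr)

# Grouped: cut() runs once per vs group, so each group gets its own breaks
grouped <- as_tibble(mtcars) %>%
  group_by(vs) %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  ungroup()

# Ungrouped: cut() sees the whole hp column and produces one set of breaks
ungrouped <- as_tibble(mtcars) %>%
  mutate(hp_cut = cut(hp, 3))

# Count the distinct intervals produced by each approach
n_distinct(grouped$hp_cut)
n_distinct(ungrouped$hp_cut)
```

The grouped version yields six distinct intervals (three per vs group), matching the `Groups: hp_cut [6]` header in the output above, while the ungrouped version yields three.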

Question 3: Look up the help pages to understand functions

Read https://dplyr.tidyverse.org/reference/mutate-joins.html and describe what the four mutating join functions do. For each one, give a practical example and explain its output.

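  1. inner_join(): returns only the rows whose key values match in both datasets, so a key must be present on both sides to be kept. Here only id 2 and id 3 appear in both df1 and df2, so the result has two rows.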

    df1 <- tibble(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
    df2 <- tibble(id = c(2, 3, 4), age = c(25, 30, 35))
    result <- inner_join(df1, df2, by = "id")
    print(result)
    ## # A tibble: 2 × 3
    ##      id name      age
    ##   <dbl> <chr>   <dbl>
    ## 1     2 Bob        25
    ## 2     3 Charlie    30
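  2. left_join(): returns every row of the left dataset and adds the columns of matching rows from the right dataset. Alice (id 1) has no match in df2, so her age is NA.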

    result <- left_join(df1, df2, by = "id")
    print(result)
    ## # A tibble: 3 × 3
    ##      id name      age
    ##   <dbl> <chr>   <dbl>
    ## 1     1 Alice      NA
    ## 2     2 Bob        25
    ## 3     3 Charlie    30
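  3. right_join(): returns every row of the right dataset and adds the columns of matching rows from the left dataset. id 4 has no match in df1, so its name is NA.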

    result <- right_join(df1, df2, by = "id")
    print(result)
    ## # A tibble: 3 × 3
    ##      id name      age
    ##   <dbl> <chr>   <dbl>
    ## 1     2 Bob        25
    ## 2     3 Charlie    30
    ## 3     4 <NA>       35
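  4. full_join(): returns all rows from both datasets, matching them on the key where possible. Keys found on only one side (id 1 and id 4) are kept, with NA in the columns from the other dataset.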

    result <- full_join(df1, df2, by = "id")
    print(result)
    ## # A tibble: 4 × 3
    ##      id name      age
    ##   <dbl> <chr>   <dbl>
    ## 1     1 Alice      NA
    ## 2     2 Bob        25
    ## 3     3 Charlie    30
    ## 4     4 <NA>       35
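
To compare the four joins side by side, the sketch below reruns them on the same df1 and df2 and collects which ids survive and how many rows each result has. The list name `joins` is an illustrative choice, not part of the original answer.

```r
library(dplyr)

df1 <- tibble(id = c(1, 2, 3), name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), age = c(25, 30, 35))

# Run all four mutating joins on the same pair of tables
joins <- list(
  inner = inner_join(df1, df2, by = "id"),
  left  = left_join(df1, df2, by = "id"),
  right = right_join(df1, df2, by = "id"),
  full  = full_join(df1, df2, by = "id")
)

# Which ids are kept by each join
lapply(joins, function(x) x$id)

# Number of rows in each result
sapply(joins, nrow)
```

The row counts come out as 2, 3, 3, and 4: the inner join keeps the intersection of the keys, the left and right joins keep all keys from one side, and the full join keeps their union.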