Workshop 7: Creating strings and using join commands

Libraries needed:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(babynames)
library(nycflights13)

14.3.4

Question 2: What’s the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?

paste() and paste0() differ in the separator they use. paste() inserts a separator between its arguments (a single space by default, which you can change with the sep argument), while paste0() always joins its arguments with no separator at all.

paste("My","name","is","Dalia")
## [1] "My name is Dalia"
paste0("My","name","is","Dalia")
## [1] "MynameisDalia"
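As a small base R sketch of the separator behaviour described above (the strings are made up for illustration), the sep argument changes what paste() puts between the pieces, whereas paste0() has no sep argument and treats anything passed as sep as one more string to join:

```r
# paste() uses a single space unless sep says otherwise
sep_default <- paste("My", "name")                    # "My name"
sep_custom  <- paste("2013", "12", "25", sep = "-")   # "2013-12-25"

# paste0() never inserts a separator
no_sep <- paste0("2013", "12", "25")                  # "20131225"

# paste0() has no sep argument, so "-" is simply appended as another piece
sep_ignored <- paste0("a", "b", sep = "-")            # "ab-"
```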

To recreate the equivalent of paste() with str_c(), you need to pass the separator explicitly, because str_c() uses sep = "" by default. For the same reason, str_c() with no sep argument, or with sep = "", is the equivalent of paste0().

paste("My","name","is","Dalia")
## [1] "My name is Dalia"
str_c("My","name","is","Dalia", sep = " ")
## [1] "My name is Dalia"
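The equivalence above holds for ordinary strings, but the two families treat missing values differently. The sketch below uses illustrative inputs to show str_c() standing in for paste0(), and then the NA case: paste() converts NA to the text "NA", while str_c() returns NA. Wrapping the value in str_replace_na() is one way to get the paste() behaviour back.

```r
library(stringr)

# str_c() with its default sep = "" matches paste0()
like_paste0 <- str_c("My", "name", "is", "Dalia")
same_as_paste0 <- identical(like_paste0, paste0("My", "name", "is", "Dalia"))

# Missing values: paste() turns NA into the string "NA" ...
na_paste <- paste("Name:", NA_character_)
# ... while str_c() propagates the missing value
na_str_c <- str_c("Name:", NA_character_, sep = " ")
# str_replace_na() converts NA to "NA" first, mimicking paste()
na_fixed <- str_c("Name:", str_replace_na(NA_character_), sep = " ")
```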

14.5.3

Question 2: Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?

str_length() gives the length of each baby name, and str_sub() then extracts the character at the middle position. A name with an even number of characters has no single middle letter, so you can either set the result to NA for those names, as the code below does, or keep the letter just after the midpoint.

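babynames |>
  mutate(name_length = str_length(name)) |>
  # Integer division keeps the middle position a whole number for odd lengths
  mutate(middle_char = str_sub(name, name_length %/% 2 + 1, name_length %/% 2 + 1)) |>
  # Even-length names have no single middle letter, so set them to NA
  mutate(middle_char = replace(middle_char, name_length %% 2 == 0, NA))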
## # A tibble: 1,924,665 × 7
##     year sex   name          n   prop name_length middle_char
##    <dbl> <chr> <chr>     <int>  <dbl>       <int> <chr>      
##  1  1880 F     Mary       7065 0.0724           4 <NA>       
##  2  1880 F     Anna       2604 0.0267           4 <NA>       
##  3  1880 F     Emma       2003 0.0205           4 <NA>       
##  4  1880 F     Elizabeth  1939 0.0199           9 a          
##  5  1880 F     Minnie     1746 0.0179           6 <NA>       
##  6  1880 F     Margaret   1578 0.0162           8 <NA>       
##  7  1880 F     Ida        1472 0.0151           3 d          
##  8  1880 F     Alice      1414 0.0145           5 i          
##  9  1880 F     Bertha     1320 0.0135           6 <NA>       
## 10  1880 F     Sarah      1288 0.0132           5 r          
## # ℹ 1,924,655 more rows
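The other ways of handling even-length names can be sketched on a small hand-picked vector (names_demo is a hypothetical sample, not taken from babynames). Option A keeps the letter just after the midpoint. Option B, a further variant not used in the answer above, keeps both middle letters.

```r
library(stringr)

names_demo <- c("Ida", "Mary", "Alice", "Minnie")  # hypothetical sample names
len <- str_length(names_demo)

# Option A: one letter; for even lengths this is the letter after the midpoint
mid_after <- str_sub(names_demo, len %/% 2 + 1, len %/% 2 + 1)

# Option B: for even lengths, start one position earlier to keep both middle letters
mid_both <- str_sub(names_demo, (len + 1) %/% 2, len %/% 2 + 1)
```

For odd-length names both options return the same single letter, so the choice only affects names such as "Mary" and "Minnie".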

19.2.4

Question 4: We know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?

One way to represent this information is a separate table of special dates with one row per date. The date columns (year, month, and day) together form its primary key, and the same three columns act as the foreign key that connects it to flights. Joining the table to flights adds a holiday column that states which designated special day, if any, each flight falls on.

special_days <- tibble(
  year = c(2013, 2013, 2013, 2013),
  month = c(1, 7, 11, 12),
  day = c(1, 4, 28, 25),  # Thanksgiving 2013 fell on November 28
  holiday = c("New Year's Day", "Independence Day", "Thanksgiving Day", "Christmas Day")
)

# Flights on any other date get NA in the holiday column
flights_special_days <- flights |>
  left_join(special_days, by = c("year", "month", "day"))

flights_special_days
## # A tibble: 336,776 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <dbl> <dbl> <dbl>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, holiday <chr>
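Because the printed output hides the holiday column, a quick check on toy data may make the key and the join easier to see. This is only a sketch: special_days_demo and flights_demo are hypothetical stand-ins for the real tables. It first confirms that (year, month, day) identifies each special day uniquely, then shows that a left join keeps every flight and fills holiday with NA on ordinary dates.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature versions of the two tables
special_days_demo <- tibble(
  year = c(2013, 2013), month = c(7, 12), day = c(4, 25),
  holiday = c("Independence Day", "Christmas Day")
)
flights_demo <- tibble(
  year = c(2013, 2013, 2013), month = c(7, 7, 12), day = c(4, 5, 25),
  flight = c(1, 2, 3)
)

# Primary key check: no (year, month, day) combination may appear twice
duplicate_keys <- special_days_demo |>
  count(year, month, day) |>
  filter(n > 1)
key_is_unique <- nrow(duplicate_keys) == 0

# The left join keeps all three flights; the July 5 flight gets NA for holiday
joined_demo <- flights_demo |>
  left_join(special_days_demo, by = c("year", "month", "day"))
```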

19.3.4