Import your data

results <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/results.csv')
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 25220 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): position, positionText, time, milliseconds, fastestLap, rank, fast...
## dbl (10): resultId, raceId, driverId, constructorId, number, grid, positionO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(results)
Data summary
Name results
Number of rows 25220
Number of columns 18
_______________________
Column type frequency:
character 8
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
position 0 1 1 2 0 34 0
positionText 0 1 1 2 0 39 0
time 0 1 2 11 0 6488 0
milliseconds 0 1 2 8 0 6687 0
fastestLap 0 1 1 2 0 80 0
rank 0 1 1 2 0 26 0
fastestLapTime 0 1 2 8 0 6266 0
fastestLapSpeed 0 1 2 7 0 6395 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
resultId 0 1 12611.23 7281.58 1 6305.75 12610.5 18915.25 25225 ▇▇▇▇▇
raceId 0 1 517.95 290.34 1 287.00 503.0 762.00 1064 ▆▇▇▆▆
driverId 0 1 250.84 258.25 1 56.00 158.0 347.00 854 ▇▃▂▁▂
constructorId 0 1 47.48 58.39 1 6.00 25.0 57.00 214 ▇▂▁▁▁
number 6 1 17.59 14.80 0 7.00 15.0 23.00 208 ▇▁▁▁▁
grid 0 1 11.21 7.27 0 5.00 11.0 17.00 34 ▇▇▇▃▁
positionOrder 0 1 12.93 7.74 1 6.00 12.0 19.00 39 ▇▇▆▂▁
points 0 1 1.80 4.03 0 0.00 0.0 2.00 50 ▇▁▁▁▁
laps 0 1 45.79 30.04 0 21.00 52.0 66.00 200 ▅▇▁▁▁
statusId 0 1 17.72 26.10 1 1.00 11.0 14.00 139 ▇▁▁▁▁

Chapter 13

What are primary keys in your data?

The Primary keys in my data are raceId, rank, and driverId.

Can you divide your data into two?

Divide it using dplyr::select in a way the two have a common variable, which you could use to join the two.

data_1 <- results %>% select(raceId:statusId)
data_1
## # A tibble: 25,220 × 17
##    raceId driverId construct…¹ number  grid posit…² posit…³ posit…⁴ points  laps
##     <dbl>    <dbl>       <dbl>  <dbl> <dbl> <chr>   <chr>     <dbl>  <dbl> <dbl>
##  1     18        1           1     22     1 "1"     1             1     10    58
##  2     18        2           2      3     5 "2"     2             2      8    58
##  3     18        3           3      7     7 "3"     3             3      6    58
##  4     18        4           4      5    11 "4"     4             4      5    58
##  5     18        5           1     23     3 "5"     5             5      4    58
##  6     18        6           3      8    13 "6"     6             6      3    57
##  7     18        7           5     14    17 "7"     7             7      2    55
##  8     18        8           6      1    15 "8"     8             8      1    53
##  9     18        9           2      4     2 "\\N"   R             9      0    47
## 10     18       10           7     12    18 "\\N"   R            10      0    43
## # … with 25,210 more rows, 7 more variables: time <chr>, milliseconds <chr>,
## #   fastestLap <chr>, rank <chr>, fastestLapTime <chr>, fastestLapSpeed <chr>,
## #   statusId <dbl>, and abbreviated variable names ¹​constructorId, ²​position,
## #   ³​positionText, ⁴​positionOrder
data_2 <- results %>% select(raceId, rank, fastestLapTime)
data_2
## # A tibble: 25,220 × 3
##    raceId rank  fastestLapTime
##     <dbl> <chr> <chr>         
##  1     18 2     1:27.452      
##  2     18 3     1:27.739      
##  3     18 5     1:28.090      
##  4     18 7     1:28.603      
##  5     18 1     1:27.418      
##  6     18 14    1:29.639      
##  7     18 12    1:29.534      
##  8     18 4     1:27.903      
##  9     18 9     1:28.753      
## 10     18 13    1:29.558      
## # … with 25,210 more rows

Can you join the two together?

Use tidyr::left_join or other joining functions.

data_1 %>%
    left_join(data_2, by = c("raceId", "fastestLapTime"))
## # A tibble: 488,388 × 18
##    raceId driverId construct…¹ number  grid posit…² posit…³ posit…⁴ points  laps
##     <dbl>    <dbl>       <dbl>  <dbl> <dbl> <chr>   <chr>     <dbl>  <dbl> <dbl>
##  1     18        1           1     22     1 "1"     1             1     10    58
##  2     18        2           2      3     5 "2"     2             2      8    58
##  3     18        3           3      7     7 "3"     3             3      6    58
##  4     18        4           4      5    11 "4"     4             4      5    58
##  5     18        5           1     23     3 "5"     5             5      4    58
##  6     18        6           3      8    13 "6"     6             6      3    57
##  7     18        7           5     14    17 "7"     7             7      2    55
##  8     18        8           6      1    15 "8"     8             8      1    53
##  9     18        9           2      4     2 "\\N"   R             9      0    47
## 10     18       10           7     12    18 "\\N"   R            10      0    43
## # … with 488,378 more rows, 8 more variables: time <chr>, milliseconds <chr>,
## #   fastestLap <chr>, rank.x <chr>, fastestLapTime <chr>,
## #   fastestLapSpeed <chr>, statusId <dbl>, rank.y <chr>, and abbreviated
## #   variable names ¹​constructorId, ²​position, ³​positionText, ⁴​positionOrder