Week 9: Apply it to your data 8

Import your data

results <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/results.csv')

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 25220 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): position, positionText, time, milliseconds, fastestLap, rank, fast...
## dbl (10): resultId, raceId, driverId, constructorId, number, grid, positionO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(results)

Data summary
Name	results
Number of rows	25220
Number of columns	18
_______________________
Column type frequency:
character	8
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
position	1	1	2	34
positionText	1	1	2	39
time	1	2	11	6488
milliseconds	1	2	8	6687
fastestLap	1	1	2	80
rank	1	1	2	26
fastestLapTime	1	2	8	6266
fastestLapSpeed	1	2	7	6395

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
resultId	0	1	12611.23	7281.58	1	6305.75	12610.5	18915.25	25225	▇▇▇▇▇
raceId	0	1	517.95	290.34	1	287.00	503.0	762.00	1064	▆▇▇▆▆
driverId	0	1	250.84	258.25	1	56.00	158.0	347.00	854	▇▃▂▁▂
constructorId	0	1	47.48	58.39	1	6.00	25.0	57.00	214	▇▂▁▁▁
number	6	1	17.59	14.80	0	7.00	15.0	23.00	208	▇▁▁▁▁
grid	0	1	11.21	7.27	0	5.00	11.0	17.00	34	▇▇▇▃▁
positionOrder	0	1	12.93	7.74	1	6.00	12.0	19.00	39	▇▇▆▂▁
points	0	1	1.80	4.03	0	0.00	0.0	2.00	50	▇▁▁▁▁
laps	0	1	45.79	30.04	0	21.00	52.0	66.00	200	▅▇▁▁▁
statusId	0	1	17.72	26.10	1	1.00	11.0	14.00	139	▇▁▁▁▁

Chapter 13

What are primary keys in your data?

The Primary keys in my data are raceId, rank, and driverId.

Can you divide your data into two?

Divide it using dplyr::select in a way the two have a common variable, which you could use to join the two.

data_1 <- results %>% select(raceId:statusId)
data_1

## # A tibble: 25,220 × 17
##    raceId driverId construct…¹ number  grid posit…² posit…³ posit…⁴ points  laps
##     <dbl>    <dbl>       <dbl>  <dbl> <dbl> <chr>   <chr>     <dbl>  <dbl> <dbl>
##  1     18        1           1     22     1 "1"     1             1     10    58
##  2     18        2           2      3     5 "2"     2             2      8    58
##  3     18        3           3      7     7 "3"     3             3      6    58
##  4     18        4           4      5    11 "4"     4             4      5    58
##  5     18        5           1     23     3 "5"     5             5      4    58
##  6     18        6           3      8    13 "6"     6             6      3    57
##  7     18        7           5     14    17 "7"     7             7      2    55
##  8     18        8           6      1    15 "8"     8             8      1    53
##  9     18        9           2      4     2 "\\N"   R             9      0    47
## 10     18       10           7     12    18 "\\N"   R            10      0    43
## # … with 25,210 more rows, 7 more variables: time <chr>, milliseconds <chr>,
## #   fastestLap <chr>, rank <chr>, fastestLapTime <chr>, fastestLapSpeed <chr>,
## #   statusId <dbl>, and abbreviated variable names ¹constructorId, ²position,
## #   ³positionText, ⁴positionOrder

data_2 <- results %>% select(raceId, rank, fastestLapTime)
data_2

## # A tibble: 25,220 × 3
##    raceId rank  fastestLapTime
##     <dbl> <chr> <chr>         
##  1     18 2     1:27.452      
##  2     18 3     1:27.739      
##  3     18 5     1:28.090      
##  4     18 7     1:28.603      
##  5     18 1     1:27.418      
##  6     18 14    1:29.639      
##  7     18 12    1:29.534      
##  8     18 4     1:27.903      
##  9     18 9     1:28.753      
## 10     18 13    1:29.558      
## # … with 25,210 more rows

Can you join the two together?

Use tidyr::left_join or other joining functions.

data_1 %>%
    left_join(data_2, by = c("raceId", "fastestLapTime"))

## # A tibble: 488,388 × 18
##    raceId driverId construct…¹ number  grid posit…² posit…³ posit…⁴ points  laps
##     <dbl>    <dbl>       <dbl>  <dbl> <dbl> <chr>   <chr>     <dbl>  <dbl> <dbl>
##  1     18        1           1     22     1 "1"     1             1     10    58
##  2     18        2           2      3     5 "2"     2             2      8    58
##  3     18        3           3      7     7 "3"     3             3      6    58
##  4     18        4           4      5    11 "4"     4             4      5    58
##  5     18        5           1     23     3 "5"     5             5      4    58
##  6     18        6           3      8    13 "6"     6             6      3    57
##  7     18        7           5     14    17 "7"     7             7      2    55
##  8     18        8           6      1    15 "8"     8             8      1    53
##  9     18        9           2      4     2 "\\N"   R             9      0    47
## 10     18       10           7     12    18 "\\N"   R            10      0    43
## # … with 488,378 more rows, 8 more variables: time <chr>, milliseconds <chr>,
## #   fastestLap <chr>, rank.x <chr>, fastestLapTime <chr>,
## #   fastestLapSpeed <chr>, statusId <dbl>, rank.y <chr>, and abbreviated
## #   variable names ¹constructorId, ²position, ³positionText, ⁴positionOrder