Week 7: Apply it to your data 6

Import Data

library(tidyverse)

results <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/results.csv')

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 25220 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): position, positionText, time, milliseconds, fastestLap, rank, fast...
## dbl (10): resultId, raceId, driverId, constructorId, number, grid, positionO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
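Eight columns come back as character, even though fields such as position and fastestLap look numeric. A likely cause is that the file writes missing values as \N. The sketch below shows one way to handle that at import time, using the `na` argument of `read_csv()`. The three-row `sample_csv` is a made-up stand-in for the real file, and both `sample_csv` and `sample_results` are hypothetical names used only for illustration.

```r
library(readr)

# Hypothetical three-row sample standing in for results.csv (not real data)
sample_csv <- I("resultId,position,fastestLap\n1,1,39\n2,\\N,41\n3,3,\\N\n")

# Treat "\N" as missing so the columns are parsed as numbers rather than text
sample_results <- read_csv(
    sample_csv,
    na = c("", "NA", "\\N"),
    show_col_types = FALSE
)

sample_results
```

If the same `na` argument were passed when reading the full file, later steps could test for missing values with `is.na()` instead of comparing against the "\\N" string. That assumes \N is the only missing-value marker in the file.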

skimr::skim(results)

Data summary
Name	results
Number of rows	25220
Number of columns	18
_______________________
Column type frequency:
character	8
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
position	1	1	2	34
positionText	1	1	2	39
time	1	2	11	6488
milliseconds	1	2	8	6687
fastestLap	1	1	2	80
rank	1	1	2	26
fastestLapTime	1	2	8	6266
fastestLapSpeed	1	2	7	6395

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
resultId	0	1	12611.23	7281.58	1	6305.75	12610.5	18915.25	25225	▇▇▇▇▇
raceId	0	1	517.95	290.34	1	287.00	503.0	762.00	1064	▆▇▇▆▆
driverId	0	1	250.84	258.25	1	56.00	158.0	347.00	854	▇▃▂▁▂
constructorId	0	1	47.48	58.39	1	6.00	25.0	57.00	214	▇▂▁▁▁
number	6	1	17.59	14.80	0	7.00	15.0	23.00	208	▇▁▁▁▁
grid	0	1	11.21	7.27	0	5.00	11.0	17.00	34	▇▇▇▃▁
positionOrder	0	1	12.93	7.74	1	6.00	12.0	19.00	39	▇▇▆▂▁
points	0	1	1.80	4.03	0	0.00	0.0	2.00	50	▇▁▁▁▁
laps	0	1	45.79	30.04	0	21.00	52.0	66.00	200	▅▇▁▁▁
statusId	0	1	17.72	26.10	1	1.00	11.0	14.00	139	▇▁▁▁▁

# "\N" marks a missing finishing position; drop those rows and store position as a factor
results <- results %>%
    filter(position != "\\N") %>%
    mutate(position = as_factor(position))

Introduction

Questions

Variation

Visualizing distributions

results %>%
    ggplot(aes(x = position)) +
    geom_bar()

Typical values

results %>%
    
    # Optional filter on car number, left disabled
    # filter(number < 3) %>%
    
    # Histogram of car numbers
    ggplot(aes(x = number)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
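The message above appears because `geom_histogram()` falls back to 30 bins when no bin size is given. A minimal sketch of setting the bin width explicitly follows. The `car_numbers` data frame is a hypothetical stand-in for `results$number`, and the width of 5 is an arbitrary choice for illustration, not a recommendation for the real data.

```r
library(ggplot2)

# Hypothetical car numbers standing in for results$number (not real data)
car_numbers <- data.frame(number = c(1, 3, 4, 7, 7, 11, 14, 16, 22, 27, 33, 44))

# binwidth = 5 makes every bin five units wide;
# boundary = 0 aligns the bin edges at 0, 5, 10, ...
number_hist <- ggplot(car_numbers, aes(x = number)) +
    geom_histogram(binwidth = 5, boundary = 0)

number_hist
```

Trying a few widths is usually worthwhile, since narrow bins show individual car numbers while wide bins show the overall shape.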

Unusual values

Missing Values

Covariation

A categorical and continuous variable

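results %>%
    
    # fastestLap is read as text because "\N" marks a missing lap; drop those rows and convert to numbers
    filter(fastestLap != "\\N") %>%
    mutate(fastestLap = as.numeric(fastestLap)) %>%
    ggplot(aes(x = position, y = fastestLap)) +
    geom_boxplot()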

Two categorical variables

results %>%
    
    count(position, rank) %>%
    
    ggplot(aes(x = position, y = rank)) +
    geom_tile(aes(fill = n))

Two continuous variables

results %>%
    # grid and positionOrder are both numeric columns, so a scatterplot applies
    ggplot(aes(x = grid, y = positionOrder)) +
    geom_point()

Amanda Simpson

2022-03-07

Patterns and models