Intro Essay
I chose this dataset because I’m a fan of formula 1. I found myself
asking is there a mathematical way to prove starting positions and where
they finshed. It is easy to look at race and say who ever starts first
will finish first but what we fail to take into account is keep a lead
sometimes can be harder than chasing one. In this sport where split
seconds are the difference bewteen winning or losing, I will be
attempting to explain whether or starting position(the data calls it
qualifying position) and the final position of a race have any
correlation.
The libraries I loaded in and the dataset
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
setwd("C:/Users/Danny/OneDrive/Documents/Data_110/Project_2")
Formula_1 <- read_csv("Database_Formula_1.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 25840 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (21): Grand Prix, Starting Time, Circuit, Location, Country, Position, ...
## dbl (11): Year, Round, lat, lng, Altitude, Position Order, Race Number, Lap...
## date (2): Date, Date of Birth
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Removing some of the Variables I’m not going to use
Formula_1 <- Formula_1 |>
select(-Code, -Date, -`Date of Birth`, - `Total Time`)
How the Data was filtered
Formula_1 <- replace(Formula_1, Formula_1 == "\\N", NA)
Fm1year<- Formula_1 %>% filter(Year >= 2010 & Year <= 2020)
F1<- Fm1year [!is.na(Fm1year$`Fastest Lap`) & !is.na(Fm1year$`Av. Fastest Lap (km/h)`) & !is.na(Fm1year$fastestLap),]
The line of linear regression for F1
Linerg <- lm(`Position Order` ~ `Qualifying Position`, data = F1)
summary(Linerg)
##
## Call:
## lm(formula = `Position Order` ~ `Qualifying Position`, data = F1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5691 -3.1332 -0.9255 2.2924 19.6591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.69736 0.14081 26.26 <2e-16 ***
## `Qualifying Position` 0.64359 0.01091 58.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.592 on 4427 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.4401, Adjusted R-squared: 0.44
## F-statistic: 3480 on 1 and 4427 DF, p-value: < 2.2e-16
I tried to graph 4500 points but it turned out to be too much
highchart() %>%
hc_add_series(data = F1,
type = "line",
hcaes(x = `Qualifying Position`,
y = `Position Order`,)) %>%
hc_title(text="Qualifying Position and Position Order") %>%
hc_xAxis(title = list(text="Qualifying Position"))%>%
hc_yAxis(title = list(text="Position Order")) %>%
hc_plotOptions(series = list(marker = list(symbol = "square"))) %>%
hc_tooltip(shared = TRUE,
borderColor = "blue",
pointFormat = "{point.gas}: {point.y:.2f}<br>")
So 4500 points was too much so I decided to filter for only one
season which was the 2019 season. And got a similar correlation to the
10 year average
F1_2019 <-filter(F1, Year == 2019)
Linerg2 <- lm(`Position Order`~ `Qualifying Position`, data = F1_2019)
summary(Linerg2)
##
## Call:
## lm(formula = `Position Order` ~ `Qualifying Position`, data = F1_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4802 -3.0147 -0.9303 2.1666 14.9946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.73039 0.44987 8.292 1.59e-15 ***
## `Qualifying Position` 0.63749 0.03767 16.924 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.412 on 412 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4101, Adjusted R-squared: 0.4087
## F-statistic: 286.4 on 1 and 412 DF, p-value: < 2.2e-16
ggplot(F1_2019, aes(x = `Qualifying Position`, y = `Position Order`)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = "Qualifying Position vs Final Position",
x = "Qualifying Position", y = "Position Order")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

I graphed all the qualifying positions and final positions for each
driver and included in which race it was and what time they were apart
of.
highchart() %>%
hc_chart(type = "scatter") %>%
hc_title(text = "Qualifying Position VS Final Position For F1 2019 Season") %>%
hc_xAxis(title = list(text = "Qualifying Position")) %>%
hc_yAxis(title = list(text = "Position Order")) %>%
hc_add_series(data = F1, type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`)) %>%
hc_tooltip(headerFormat = "<b>{point.name}</b><br>",
pointFormat = "Team: {point.Team} <br>Grand Prix: {point.Grand Prix}<br>Qualifying Position: {point.x}<br>Position Order: {point.y}" ) %>%
hc_tooltip (crosshairs = TRUE, Shared = TRUE, valueDecimals = 2) %>%
hc_plotOptions(series = list(marker = list(enabled = TRUE))) %>%
hc_credits(enabled = FALSE)
highchart() %>%
hc_chart(type = "scatter") %>%
hc_title(text = "Qualifying Position VS Final Position For F1 2019 Season") %>%
hc_xAxis(title = list(text = "Qualifying Position")) %>%
hc_yAxis(title = list(text = "Position Order")) %>%
hc_add_series(data = F1_2019, type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`, color = `Team`)) %>%
hc_tooltip(headerFormat = "<b>{point.name}</b><br>",
pointFormat = "Team: {point.Team} <br>Grand Prix: {point.Grand Prix}<br>Qualifying Position: {point.x}<br>Position Order: {point.y}" ) %>%
hc_tooltip (crosshairs = TRUE, Shared = TRUE, valueDecimals = 4) %>%
hc_plotOptions(series = list(marker = list(enabled = TRUE))) %>%
hc_credits(enabled = FALSE)
Flitering the data even futher only seeing who finished in the top 5
for every race in the 2019 season and what correlation it has
F1_top5<- filter(F1_2019, `Position Order` >= 1 & `Position Order` <= 5)
Linerg3 <- lm(`Position Order`~ `Qualifying Position`, data = F1_top5)
summary(Linerg3)
##
## Call:
## lm(formula = `Position Order` ~ `Qualifying Position`, data = F1_top5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4296 -1.0963 0.0593 1.0593 2.3704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.31854 0.18914 12.259 < 2e-16 ***
## `Qualifying Position` 0.15555 0.03223 4.826 4.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.289 on 103 degrees of freedom
## Multiple R-squared: 0.1844, Adjusted R-squared: 0.1765
## F-statistic: 23.29 on 1 and 103 DF, p-value: 4.84e-06
Graphing the correlation bewteen Qualifying Position and Position
Order
ggplot(F1_top5, aes(x = `Qualifying Position`, y = `Position Order`)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "QF Position vs Final Position",
x = "Qualifying Position", y = "Position Order")
## `geom_smooth()` using formula = 'y ~ x'

With all this information I decided to change the graph to something
else I decided a scatterplot would suit it best
highchart() %>%
hc_chart(type = "scatter") %>%
hc_title(text = "Relationship Between Top 5 Finishes and Qualifying Position For The F1 2019 Season") %>%
hc_xAxis(title = list(text = "Qualifying Position")) %>%
hc_yAxis(title = list(text = "Position Order")) %>%
hc_add_series(data = F1_top5 %>% filter(Team == "Red Bull"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Red Bull") %>%
hc_add_series(data = F1_top5 %>% filter(Team == "Mercedes"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Mercedes") %>%
hc_add_series(data = F1_top5 %>% filter(Team == "Ferrari"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Ferrari") %>%
hc_add_series(data = F1_top5 %>% filter(Team == "Alfa Romeo"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Alfa Romeo") %>%
hc_add_series(data = F1_top5 %>% filter(Team == "McLaren"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "McLaren") %>%
hc_add_series(data = F1_top5 %>% filter(Team == "Toro Rosso"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Toro Rosso") %>%
hc_add_series(data = F1_top5 %>% filter(Team == "Racing Point"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Racing Point") %>%
hc_tooltip(headerFormat = "<b>{point.name}</b><br>",
pointFormat = "Team: {point.Team} <br>Grand Prix: {point.Grand Prix}<br> Lastname: {point.Lastname}<br> Qualifying Position: {point.x}<br>Position Order: {point.y}" ) %>%
hc_tooltip (crosshairs = TRUE, Shared = TRUE, valueDecimals = 2) %>%
hc_plotOptions(series = list(marker = list(enabled = TRUE))) %>%
hc_legend(enabled = TRUE,
layout = "vertical",
align = "right",
verticalAlign = "middle",
title = list(text = "Teams"),
borderWidth = 1) %>%
hc_colors(colors =c("#7cb5ec", "#2b908f", "#f45b5b", "#8085e9" ,"#f7a35c","#434348", "#e4d354")) %>%
hc_credits(enabled = FALSE)
Essay
There has been some research done on this. In fact I found an
article from sportmakers (https://www.sportmakers.co.uk/f1/what-is-pole-position-in-formula-1-does-it-matter/#:~:text=A%20number%20of%20stats%20are,the%20number%20of%20drivers%20involved.)
found that 40% of winners come from the first starting position
historically. Meaning that where you start really does matter and no
matter how much skill you think you have there’s only so much you do
when starting in a lower position. So my thoughts of defending first is
harder than fighting for it was very wrong. Even for different circuits
like for example Monaco which is a street circuit (meaning they race on
actual public roads normally in a big city) the turns are very tight and
sportmakers anaylsis found that who ever starts first wins the race 75%
of the time. Also we tend to forget how much these formula one cars have
inproved over time. Add in rules changes that have made some types of
overtakes illegal and you will find that some of the most legendary
racers of the past can barely win over 50% of there races starting in
first versus Lewis Hamliton who has a 60% win rate when starting
first.
Now back to what I did. I took at first a decade worth of qualifying
position (starting Position) and Position Order (Final Position) and
tried to plot it out and it turned out terrible. About 4500 reasons
terrible so I decided to only plot for the 2019 F1 Season and it was
still alot so I turned my attention to what matters the racers who
finished top 5. More specfially what there correlation was. For data
F1_2019 (it includes all position 1-24) the p value was less than 0.01
and the r squared was 0.41. For F1_top5 (only including final positions
1-5) the p value was also less than 0.01 and the r squared was but
surpising the r squared fell to 0.18. Meaning that only 18% of the top 5
finshes are explained by the qualifying position. I did not expect this
I expected the r squared to only go up as the research earlier would
suggest that the higher you start the more likely you are to stay there.
But as for the visualization it showed exactly what I thought. Only 8
drivers out of the 105 managed to finished in the top 5 while qualifying
at a position 11 or lower. In fact out of those 8 only 3 finished in
podium positions (top 3) but none won a race. That means 92% of top 5
finishers started in spots 10 and above. If you want to win good luck
trying to win a race starting below 4th as only starting spots 1-4 won a
race. I wish I could create like a prediction model based on all races
to try and predict where teams finished in terms of points for this
season. If I were to bet the top 3 in no order it would Red Bull,
Ferrari, and Mercedes as they have the most top 5 finishes by far
compared to the rest. I was also going to plot where each race took
place historcally using leaflet but I ran out of time.