Intro Essay

I chose this dataset because I’m a fan of formula 1. I found myself asking is there a mathematical way to prove starting positions and where they finshed. It is easy to look at race and say who ever starts first will finish first but what we fail to take into account is keep a lead sometimes can be harder than chasing one. In this sport where split seconds are the difference bewteen winning or losing, I will be attempting to explain whether or starting position(the data calls it qualifying position) and the final position of a race have any correlation.

The libraries I loaded in and the dataset

library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
setwd("C:/Users/Danny/OneDrive/Documents/Data_110/Project_2")
Formula_1 <- read_csv("Database_Formula_1.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 25840 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (21): Grand Prix, Starting Time, Circuit, Location, Country, Position, ...
## dbl  (11): Year, Round, lat, lng, Altitude, Position Order, Race Number, Lap...
## date  (2): Date, Date of Birth
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Removing some of the Variables I’m not going to use

Formula_1 <- Formula_1 |>
  select(-Code, -Date, -`Date of Birth`, - `Total Time`)

How the Data was filtered

Formula_1 <- replace(Formula_1, Formula_1 == "\\N", NA)

Fm1year<- Formula_1 %>% filter(Year >= 2010 & Year <= 2020)

F1<- Fm1year [!is.na(Fm1year$`Fastest Lap`) & !is.na(Fm1year$`Av. Fastest Lap (km/h)`) & !is.na(Fm1year$fastestLap),]

The line of linear regression for F1

Linerg <- lm(`Position Order` ~ `Qualifying Position`, data = F1)
summary(Linerg)
## 
## Call:
## lm(formula = `Position Order` ~ `Qualifying Position`, data = F1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5691  -3.1332  -0.9255   2.2924  19.6591 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.69736    0.14081   26.26   <2e-16 ***
## `Qualifying Position`  0.64359    0.01091   58.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.592 on 4427 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.4401, Adjusted R-squared:   0.44 
## F-statistic:  3480 on 1 and 4427 DF,  p-value: < 2.2e-16

I tried to graph 4500 points but it turned out to be too much

highchart() %>% 
  hc_add_series(data = F1,
                type = "line",
                hcaes(x = `Qualifying Position`, 
                      y = `Position Order`,)) %>%
      hc_title(text="Qualifying Position and Position Order") %>%
  hc_xAxis(title = list(text="Qualifying Position"))%>%
  hc_yAxis(title = list(text="Position Order")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "square"))) %>%
  hc_tooltip(shared = TRUE,
             borderColor = "blue",
             pointFormat = "{point.gas}: {point.y:.2f}<br>")

So 4500 points was too much so I decided to filter for only one season which was the 2019 season. And got a similar correlation to the 10 year average

F1_2019 <-filter(F1, Year == 2019)
Linerg2 <- lm(`Position Order`~ `Qualifying Position`, data = F1_2019)
summary(Linerg2)
## 
## Call:
## lm(formula = `Position Order` ~ `Qualifying Position`, data = F1_2019)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4802  -3.0147  -0.9303   2.1666  14.9946 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.73039    0.44987   8.292 1.59e-15 ***
## `Qualifying Position`  0.63749    0.03767  16.924  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.412 on 412 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4101, Adjusted R-squared:  0.4087 
## F-statistic: 286.4 on 1 and 412 DF,  p-value: < 2.2e-16
ggplot(F1_2019, aes(x = `Qualifying Position`, y = `Position Order`)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") + 
  labs(title = "Qualifying Position vs Final Position",
       x = "Qualifying Position", y = "Position Order")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

I graphed all the qualifying positions and final positions for each driver and included in which race it was and what time they were apart of.

highchart() %>% 
  hc_chart(type = "scatter") %>%
  hc_title(text = "Qualifying Position VS Final Position For F1 2019 Season") %>%
  hc_xAxis(title = list(text = "Qualifying Position")) %>%
  hc_yAxis(title = list(text = "Position Order")) %>%
  hc_add_series(data = F1, type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`)) %>%
  hc_tooltip(headerFormat = "<b>{point.name}</b><br>",
             pointFormat = "Team: {point.Team} <br>Grand Prix: {point.Grand Prix}<br>Qualifying Position: {point.x}<br>Position Order: {point.y}" ) %>%
  hc_tooltip (crosshairs = TRUE, Shared = TRUE, valueDecimals = 2) %>%
  hc_plotOptions(series = list(marker = list(enabled = TRUE))) %>%
  hc_credits(enabled = FALSE)
highchart() %>% 
  hc_chart(type = "scatter") %>%
  hc_title(text = "Qualifying Position VS Final Position For F1 2019 Season") %>%
  hc_xAxis(title = list(text = "Qualifying Position")) %>%
  hc_yAxis(title = list(text = "Position Order")) %>%
  hc_add_series(data = F1_2019, type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`, color = `Team`)) %>%
  hc_tooltip(headerFormat = "<b>{point.name}</b><br>",
             pointFormat = "Team: {point.Team} <br>Grand Prix: {point.Grand Prix}<br>Qualifying Position: {point.x}<br>Position Order: {point.y}" ) %>%
  hc_tooltip (crosshairs = TRUE, Shared = TRUE, valueDecimals = 4) %>%
  hc_plotOptions(series = list(marker = list(enabled = TRUE))) %>%
  hc_credits(enabled = FALSE)

Flitering the data even futher only seeing who finished in the top 5 for every race in the 2019 season and what correlation it has

F1_top5<- filter(F1_2019, `Position Order` >= 1 & `Position Order` <= 5)
Linerg3 <- lm(`Position Order`~ `Qualifying Position`, data = F1_top5)
summary(Linerg3)
## 
## Call:
## lm(formula = `Position Order` ~ `Qualifying Position`, data = F1_top5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4296 -1.0963  0.0593  1.0593  2.3704 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.31854    0.18914  12.259  < 2e-16 ***
## `Qualifying Position`  0.15555    0.03223   4.826 4.84e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.289 on 103 degrees of freedom
## Multiple R-squared:  0.1844, Adjusted R-squared:  0.1765 
## F-statistic: 23.29 on 1 and 103 DF,  p-value: 4.84e-06

Graphing the correlation bewteen Qualifying Position and Position Order

ggplot(F1_top5, aes(x = `Qualifying Position`, y = `Position Order`)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue") + 
  labs(title = "QF Position vs Final Position",
       x = "Qualifying Position", y = "Position Order")
## `geom_smooth()` using formula = 'y ~ x'

With all this information I decided to change the graph to something else I decided a scatterplot would suit it best

highchart() %>% 
  hc_chart(type = "scatter") %>%
  hc_title(text = "Relationship Between Top 5 Finishes and Qualifying Position For The F1 2019 Season") %>%
  hc_xAxis(title = list(text = "Qualifying Position")) %>%
  hc_yAxis(title = list(text = "Position Order")) %>%
 hc_add_series(data = F1_top5 %>% filter(Team == "Red Bull"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Red Bull")  %>%
  hc_add_series(data = F1_top5 %>% filter(Team == "Mercedes"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Mercedes") %>%
   hc_add_series(data = F1_top5 %>% filter(Team == "Ferrari"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Ferrari") %>%
  hc_add_series(data = F1_top5 %>% filter(Team == "Alfa Romeo"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Alfa Romeo") %>%
   hc_add_series(data = F1_top5 %>% filter(Team == "McLaren"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "McLaren") %>%
  hc_add_series(data = F1_top5 %>% filter(Team == "Toro Rosso"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Toro Rosso") %>%
   hc_add_series(data = F1_top5 %>% filter(Team == "Racing Point"), type = "scatter", hcaes(x = `Qualifying Position`, y = `Position Order`), name = "Racing Point") %>%
  hc_tooltip(headerFormat = "<b>{point.name}</b><br>",
             pointFormat = "Team: {point.Team} <br>Grand Prix: {point.Grand Prix}<br> Lastname: {point.Lastname}<br> Qualifying Position: {point.x}<br>Position Order: {point.y}" ) %>%
  hc_tooltip (crosshairs = TRUE, Shared = TRUE, valueDecimals = 2) %>%
  hc_plotOptions(series = list(marker = list(enabled = TRUE))) %>%
  hc_legend(enabled = TRUE, 
    layout = "vertical",
    align = "right",
    verticalAlign = "middle",
    title = list(text = "Teams"),
    borderWidth = 1) %>%
    hc_colors(colors =c("#7cb5ec", "#2b908f", "#f45b5b", "#8085e9" ,"#f7a35c","#434348", "#e4d354")) %>%
  hc_credits(enabled = FALSE)

Essay

There has been some research done on this. In fact I found an article from sportmakers (https://www.sportmakers.co.uk/f1/what-is-pole-position-in-formula-1-does-it-matter/#:~:text=A%20number%20of%20stats%20are,the%20number%20of%20drivers%20involved.) found that 40% of winners come from the first starting position historically. Meaning that where you start really does matter and no matter how much skill you think you have there’s only so much you do when starting in a lower position. So my thoughts of defending first is harder than fighting for it was very wrong. Even for different circuits like for example Monaco which is a street circuit (meaning they race on actual public roads normally in a big city) the turns are very tight and sportmakers anaylsis found that who ever starts first wins the race 75% of the time. Also we tend to forget how much these formula one cars have inproved over time. Add in rules changes that have made some types of overtakes illegal and you will find that some of the most legendary racers of the past can barely win over 50% of there races starting in first versus Lewis Hamliton who has a 60% win rate when starting first.

Now back to what I did. I took at first a decade worth of qualifying position (starting Position) and Position Order (Final Position) and tried to plot it out and it turned out terrible. About 4500 reasons terrible so I decided to only plot for the 2019 F1 Season and it was still alot so I turned my attention to what matters the racers who finished top 5. More specfially what there correlation was. For data F1_2019 (it includes all position 1-24) the p value was less than 0.01 and the r squared was 0.41. For F1_top5 (only including final positions 1-5) the p value was also less than 0.01 and the r squared was but surpising the r squared fell to 0.18. Meaning that only 18% of the top 5 finshes are explained by the qualifying position. I did not expect this I expected the r squared to only go up as the research earlier would suggest that the higher you start the more likely you are to stay there. But as for the visualization it showed exactly what I thought. Only 8 drivers out of the 105 managed to finished in the top 5 while qualifying at a position 11 or lower. In fact out of those 8 only 3 finished in podium positions (top 3) but none won a race. That means 92% of top 5 finishers started in spots 10 and above. If you want to win good luck trying to win a race starting below 4th as only starting spots 1-4 won a race. I wish I could create like a prediction model based on all races to try and predict where teams finished in terms of points for this season. If I were to bet the top 3 in no order it would Red Bull, Ferrari, and Mercedes as they have the most top 5 finishes by far compared to the rest. I was also going to plot where each race took place historcally using leaflet but I ran out of time.