F1 2022 Results

Introduction

F1 is an internationally enjoyed sport for those who enjoy adrenaline rush. Since 1950, this sport for a single seated race car has become a race of the highest prestige, finding itself in the realm of luxury being sponsored by organizations such as Ferrari and Rolex. With the world moving to electric vehicles, and climate change the hottest topic of discussion, gas guzzling sports like F1 has been finding a big challenge of keeping viewership from both the external as well as internal issues.

This data was collected by toUpperCase78 through webscraping of the official F1 website. The dataset and associated ReadMe file can be found on GitHub with the data being scrapped from Offical F1.The data has been separated into a variety of different topics, where this analysis will be focusing on the records on each race. The other datasets are available on the same GitHub link.

What am I looking for?

There are an endless amount of research opportunities that may affect the F1 sport. However, we would like to keep a focus internally to what has been happening with the sport.

Over the past few seasons, there has been a growing concern that the top teams are too significantly ahead of the rest. In the 2022 season, there were major modifications to rules and mechanical requirements to hopefully allow other teams to not only catch up, but also give a sense of competition. Hamilton, Leclerc, Perez, and Verstappen are the power houses that have made a name for themselves over the last 5+ years. They represent teams for Ferrari, Mercedes, and Red Bull. They have continued to take the top seats making the F1 series feel like a battle between these teams, and more specifically, these drivers. Major news outlets as well as major community forums have placed their concern that the races have lost its touch because of this 3 way battle with no others teams able to come close, Race Fans reports.

Looking through the 2022 races, can we see any trends with the top three teams? Did these changes for 2022 make the races feel more competitive?

The Variables

Define variables

Column Title Type Description
Track chr Name or Location of the Track
Position num Final Rank of the Driver at Event
No num Number of the Driver
Driver chr Driver Name
Team chr Team Name
Starting Grid num Position the Driver Started at this event
Laps num Number of Laps Completed
Time/Retired num Finishing Time
Points num Number of Points Received from Race
+1 Pt chr Additional 1 Point for the Driver Who Got the Fastest Lap Time
Fastest Lap POSIXct Time of Fastest Lap

The purpose of having Time/Retired be a POSIXct is because only the first place gets a time, and all places after get a value that is added to that first place record.

Loading

# Prepare for load of libraries
setwd("~/Data Study/Data 110 MC/Project Final")
library("pacman")
pacman::p_load(tidyverse,dplyr,readxl,plotly,ggthemes,treemapify,RColorBrewer,hrbrthemes,viridis)
# Read in the Data f1dataset.xlsx
df <- read_excel("f1dataset.xlsx",sheet = "Sheet1")

EDA

# Explore the data
str(df)
## tibble [440 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Track        : chr [1:440] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
##  $ Position     : chr [1:440] "1" "2" "3" "4" ...
##  $ No           : num [1:440] 16 55 44 63 20 77 31 22 14 24 ...
##  $ Driver       : chr [1:440] "Charles Leclerc" "Carlos Sainz" "Lewis Hamilton" "George Russell" ...
##  $ Team         : chr [1:440] "Ferrari" "Ferrari" "Mercedes" "Mercedes" ...
##  $ Starting Grid: num [1:440] 1 3 5 9 7 6 11 16 8 15 ...
##  $ Laps         : num [1:440] 57 57 57 57 57 57 57 57 57 57 ...
##  $ Time/Retired : chr [1:440] "6.7749814814814804E-2" "5.5979999999999999" "9.6750000000000007" "11.211" ...
##  $ Points       : num [1:440] 26 18 15 12 10 8 6 4 2 1 ...
##  $ +1 Pt        : chr [1:440] "Yes" "No" "No" "No" ...
##  $ Fastest Lap  : POSIXct[1:440], format: "1899-12-31 00:01:35" "1899-12-31 00:01:36" ...

Here we can see that some of the columns have strange variable types, such as Position being characterized as a character rather than a number. Below is a chart of each of the variable types and what we expect.

Column Title Type Expected Type
Track chr chr
Position chr num
No num num
Driver chr chr
Team chr chr
Starting Grid num num
Laps num num
Time/Retired chr num
Points num num
+1 Pt chr chr
Fastest Lap POSIXct POSIXct

For the value types that don’t line up (i.e. Position), we will review and correct

# Take a look at the head and tail of the dataset to see how it is set.
head(df,10)
## # A tibble: 10 × 11
##    Track   Position    No Driver      Team  Start…¹  Laps Time/…² Points `+1 Pt`
##    <chr>   <chr>    <dbl> <chr>       <chr>   <dbl> <dbl> <chr>    <dbl> <chr>  
##  1 Bahrain 1           16 Charles Le… Ferr…       1    57 6.7749…     26 Yes    
##  2 Bahrain 2           55 Carlos Sai… Ferr…       3    57 5.5979…     18 No     
##  3 Bahrain 3           44 Lewis Hami… Merc…       5    57 9.6750…     15 No     
##  4 Bahrain 4           63 George Rus… Merc…       9    57 11.211      12 No     
##  5 Bahrain 5           20 Kevin Magn… Haas…       7    57 14.754      10 No     
##  6 Bahrain 6           77 Valtteri B… Alfa…       6    57 16.119       8 No     
##  7 Bahrain 7           31 Esteban Oc… Alpi…      11    57 19.422…      6 No     
##  8 Bahrain 8           22 Yuki Tsuno… Alph…      16    57 20.385…      4 No     
##  9 Bahrain 9           14 Fernando A… Alpi…       8    57 22.39        2 No     
## 10 Bahrain 10          24 Guanyu Zhou Alfa…      15    57 23.064       1 No     
## # … with 1 more variable: `Fastest Lap` <dttm>, and abbreviated variable names
## #   ¹​`Starting Grid`, ²​`Time/Retired`
tail(df,10)
## # A tibble: 10 × 11
##    Track     Position    No Driver    Team  Start…¹  Laps Time/…² Points `+1 Pt`
##    <chr>     <chr>    <dbl> <chr>     <chr>   <dbl> <dbl> <chr>    <dbl> <chr>  
##  1 Abu Dhabi 11          22 Yuki Tsu… Alph…      11    58 89.370…      0 No     
##  2 Abu Dhabi 12          24 Guanyu Z… Alfa…      15    57 +1 lap       0 No     
##  3 Abu Dhabi 13          23 Alexande… Will…      19    57 +1 lap       0 No     
##  4 Abu Dhabi 14          10 Pierre G… Alph…      17    57 +1 lap       0 No     
##  5 Abu Dhabi 15          77 Valtteri… Alfa…      18    57 +1 lap       0 No     
##  6 Abu Dhabi 16          47 Mick Sch… Haas…      12    57 +1 lap       0 No     
##  7 Abu Dhabi 17          20 Kevin Ma… Haas…      16    57 +1 lap       0 No     
##  8 Abu Dhabi 18          44 Lewis Ha… Merc…       5    55 DNF          0 No     
##  9 Abu Dhabi 19           6 Nicholas… Will…      20    55 DNF          0 No     
## 10 Abu Dhabi NC          14 Fernando… Alpi…      10    27 DNF          0 No     
## # … with 1 more variable: `Fastest Lap` <dttm>, and abbreviated variable names
## #   ¹​`Starting Grid`, ²​`Time/Retired`
# Check standard NA values (Not inclusive of NC or DNF)
colSums(is.na(df))
##         Track      Position            No        Driver          Team 
##             0             0             0             0             0 
## Starting Grid          Laps  Time/Retired        Points         +1 Pt 
##             2             0             0             0             0 
##   Fastest Lap 
##            14

Clean the Data

# Cleaning the data to be used for analysis

# Add Main Event Race Date and Race Number
df2 <- df %>%
  mutate(`Race Number` = ifelse(Track == "Bahrain", 1, 
                                ifelse(Track == "Saudi Arabia", 2,
                                ifelse(Track == "Australia", 3,
                                ifelse(Track == "Emilia Romagna", 4,
                                ifelse(Track == "Miami", 5,
                                ifelse(Track == "Spain", 6,
                                ifelse(Track == "Monaco", 7,
                                ifelse(Track == "Azerbaijan", 8,
                                ifelse(Track == "Canada", 9,
                                ifelse(Track == "Great Britain", 10,
                                ifelse(Track == "Austria", 11,
                                ifelse(Track == "France", 12,
                                ifelse(Track == "Hungary", 13,
                                ifelse(Track == "Belgium", 14,
                                ifelse(Track == "Netherlands", 15,
                                ifelse(Track == "Italy", 16,
                                ifelse(Track == "Singapore", 17,
                                ifelse(Track == "Japan", 18,
                                ifelse(Track == "United States", 19,
                                ifelse(Track == "Mexico", 20,
                                ifelse(Track == "Brazil", 21,
                                ifelse(Track == "Abu Dhabi", 22, "Not Listed")))))))))))))))))))))))

df2 <- df2 %>%
  mutate(`Race Date` = ifelse(Track == "Bahrain", format(as.Date("2022-03-20"), format = "%Y/%m/%d"), 
                                ifelse(Track == "Saudi Arabia", format(as.Date("2022-03-27"), format = "%Y/%m/%d"),
                                ifelse(Track == "Australia", format(as.Date("2022-04-10"), format = "%Y/%m/%d"),
                                ifelse(Track == "Emilia Romagna", format(as.Date("2022-04-24"), format = "%Y/%m/%d"),
                                ifelse(Track == "Miami", format(as.Date("2022-05-08"), format = "%Y/%m/%d"),
                                ifelse(Track == "Spain", format(as.Date("2022-05-22"), format = "%Y/%m/%d"),
                                ifelse(Track == "Monaco", format(as.Date("2022-05-29"), format = "%Y/%m/%d"),
                                ifelse(Track == "Azerbaijan", format(as.Date("2022-06-12"), format = "%Y/%m/%d"),
                                ifelse(Track == "Canada", format(as.Date("2022-06-19"), format = "%Y/%m/%d"),
                                ifelse(Track == "Great Britain", format(as.Date("2022-07-03"), format = "%Y/%m/%d"),
                                ifelse(Track == "Austria", format(as.Date("2022-07-10"), format = "%Y/%m/%d"),
                                ifelse(Track == "France", format(as.Date("2022-07-24"), format = "%Y/%m/%d"),
                                ifelse(Track == "Hungary", format(as.Date("2022-07-31"), format = "%Y/%m/%d"),
                                ifelse(Track == "Belgium", format(as.Date("2022-08-28"), format = "%Y/%m/%d"),
                                ifelse(Track == "Netherlands", format(as.Date("2022-09-04"), format = "%Y/%m/%d"),
                                ifelse(Track == "Italy", format(as.Date("2022-09-11"), format = "%Y/%m/%d"),
                                ifelse(Track == "Singapore", format(as.Date("2022-10-02"), format = "%Y/%m/%d"),
                                ifelse(Track == "Japan", format(as.Date("2022-10-09"), format = "%Y/%m/%d"),
                                ifelse(Track == "United States", format(as.Date("2022-10-23"), format = "%Y/%m/%d"),
                                ifelse(Track == "Mexico", format(as.Date("2022-10-30"), format = "%Y/%m/%d"),
                                ifelse(Track == "Brazil", format(as.Date("2022-11-13"), format = "%Y/%m/%d"),
                                ifelse(Track == "Abu Dhabi", format(as.Date("2022-11-20"), format = "%Y/%m/%d"), 
                                       "Not Listed")))))))))))))))))))))))
df2$`Race Date` <- as.Date(df2$`Race Date`)
df2$Position <- as.numeric(df2$Position)
df2$`Race Number` <- as.numeric(df2$`Race Number`)

There are three main columns I would like to clean up before moving on to the actual analysis. These are the Position, Time/Retired “NA” types, and Time/Retired value types. Currently Time/Retired is set to initial completion time (first place), and then all additional placements are a value of first place time + additional time. I will change this to be total times.

# Cleaning Position Column
df2$Position[which(df2$Position == "NC")] <- NA
# Cleaning Time/Retired for the DNF/DNS and +laps
df2$`Time/Retired`[which(df2$`Time/Retired` %in% c("DNF","DNS","+1 lap","+2 laps","+6 laps"))] <- 0
str(df2)
## tibble [440 × 13] (S3: tbl_df/tbl/data.frame)
##  $ Track        : chr [1:440] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
##  $ Position     : num [1:440] 1 2 3 4 5 6 7 8 9 10 ...
##  $ No           : num [1:440] 16 55 44 63 20 77 31 22 14 24 ...
##  $ Driver       : chr [1:440] "Charles Leclerc" "Carlos Sainz" "Lewis Hamilton" "George Russell" ...
##  $ Team         : chr [1:440] "Ferrari" "Ferrari" "Mercedes" "Mercedes" ...
##  $ Starting Grid: num [1:440] 1 3 5 9 7 6 11 16 8 15 ...
##  $ Laps         : num [1:440] 57 57 57 57 57 57 57 57 57 57 ...
##  $ Time/Retired : chr [1:440] "6.7749814814814804E-2" "5.5979999999999999" "9.6750000000000007" "11.211" ...
##  $ Points       : num [1:440] 26 18 15 12 10 8 6 4 2 1 ...
##  $ +1 Pt        : chr [1:440] "Yes" "No" "No" "No" ...
##  $ Fastest Lap  : POSIXct[1:440], format: "1899-12-31 00:01:35" "1899-12-31 00:01:36" ...
##  $ Race Number  : num [1:440] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Race Date    : Date[1:440], format: "2022-03-20" "2022-03-20" ...
# Cleaning format of Time/Retired
# Want to change everything to seconds, depending on results, may want to add an additional column for hours conversion

# Race 1
df2$`Time/Retired`[1] <- 5853.6
# Race 2
df2$`Time/Retired`[21] <- 5059.3
# Race 3
df2$`Time/Retired`[41] <- 5266.5
# Race 4
df2$`Time/Retired`[61] <- 5528
# Race 5
df2$`Time/Retired`[81] <- 5664.3
# Race 6
df2$`Time/Retired`[101] <- 5840.5
# Race 7
df2$`Time/Retired`[121] <- 6990.3
# Race 8
df2$`Time/Retired`[141] <- 5645.9
# Race 9
df2$`Time/Retired`[161] <- 5781.8
# Race 10
df2$`Time/Retired`[181] <- 4670.3
# Race 11
df2$`Time/Retired`[201] <- 5064.3
# Race 12
df2$`Time/Retired`[221] <- 5402.1
# Race 13
df2$`Time/Retired`[241] <- 5965.9
# Race 14
df2$`Time/Retired`[261] <- 5152.9
# Race 15
df2$`Time/Retired`[281] <- 5802.8
# Race 16
df2$`Time/Retired`[301] <- 4827.5
# Race 17
df2$`Time/Retired`[321] <- 3740.2
# Race 18
df2$`Time/Retired`[341] <- 3704
# Race 19
df2$`Time/Retired`[361] <- 6131.7
# Race 20
df2$`Time/Retired`[381] <- 5916.7
# Race 21
df2$`Time/Retired`[401] <- 5914
# Race 22
df2$`Time/Retired`[421] <- 5265.9
df2$`Time/Retired` <- as.numeric(df2$`Time/Retired`)
df2$`Time/Retired` <- round(df2$`Time/Retired`,1)

df2 <- df2 %>%
   mutate(`Total Time` = ifelse(`Race Number` == 1, `Time/Retired`[1]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[1]) * `Time/Retired`,
                                ifelse(`Race Number` == 2,`Time/Retired`[21]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[21]) * `Time/Retired`,
                                ifelse(`Race Number` == 3,`Time/Retired`[41]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[41]) * `Time/Retired`,
                                ifelse(`Race Number` == 4,`Time/Retired`[61]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[61]) * `Time/Retired`,
                                ifelse(`Race Number` == 5,`Time/Retired`[81]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[81]) * `Time/Retired`,
                                ifelse(`Race Number` == 6,`Time/Retired`[101]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[101]) * `Time/Retired`,
                                ifelse(`Race Number` == 7,`Time/Retired`[121]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[121]) * `Time/Retired`,
                                ifelse(`Race Number` == 8,`Time/Retired`[141]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[141]) * `Time/Retired`,
                                ifelse(`Race Number` == 9,`Time/Retired`[161]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[161]) * `Time/Retired`,
                                ifelse(`Race Number` == 10,`Time/Retired`[181]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[181]) * `Time/Retired`,
                                ifelse(`Race Number` == 11,`Time/Retired`[201]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[201]) * `Time/Retired`,
                                ifelse(`Race Number` == 12,`Time/Retired`[221]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[221]) * `Time/Retired`,
                                ifelse(`Race Number` == 13,`Time/Retired`[241]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[241]) * `Time/Retired`,
                                ifelse(`Race Number` == 14,`Time/Retired`[261]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[261]) * `Time/Retired`,
                                ifelse(`Race Number` == 15,`Time/Retired`[281]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[281]) * `Time/Retired`,
                                ifelse(`Race Number` == 16,`Time/Retired`[301]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[301]) * `Time/Retired`,
                                ifelse(`Race Number` == 17,`Time/Retired`[321]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[321]) * `Time/Retired`,
                                ifelse(`Race Number` == 18,`Time/Retired`[341]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[341]) * `Time/Retired`,
                                ifelse(`Race Number` == 19,`Time/Retired`[361]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[361]) * `Time/Retired`,
                                ifelse(`Race Number` == 20,`Time/Retired`[381]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[381]) * `Time/Retired`,
                                ifelse(`Race Number` == 21,`Time/Retired`[401]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[401]) * `Time/Retired`,
                                ifelse(`Race Number` == 22,`Time/Retired`[421]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[421]) * `Time/Retired`,0)))))))))))))))))))))))

Re-review of the Data

We want to review the data to make sure the expected dataset is in the format we would want for analysis

df2 <- df2 %>%
  mutate(across(c("Time/Retired","Fastest Lap","Total Time"), ~na_if(.,0)))
str(df2)
## tibble [440 × 14] (S3: tbl_df/tbl/data.frame)
##  $ Track        : chr [1:440] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
##  $ Position     : num [1:440] 1 2 3 4 5 6 7 8 9 10 ...
##  $ No           : num [1:440] 16 55 44 63 20 77 31 22 14 24 ...
##  $ Driver       : chr [1:440] "Charles Leclerc" "Carlos Sainz" "Lewis Hamilton" "George Russell" ...
##  $ Team         : chr [1:440] "Ferrari" "Ferrari" "Mercedes" "Mercedes" ...
##  $ Starting Grid: num [1:440] 1 3 5 9 7 6 11 16 8 15 ...
##  $ Laps         : num [1:440] 57 57 57 57 57 57 57 57 57 57 ...
##  $ Time/Retired : num [1:440] 5853.6 5.6 9.7 11.2 14.8 ...
##  $ Points       : num [1:440] 26 18 15 12 10 8 6 4 2 1 ...
##  $ +1 Pt        : chr [1:440] "Yes" "No" "No" "No" ...
##  $ Fastest Lap  : POSIXct[1:440], format: "1899-12-31 00:01:35" "1899-12-31 00:01:36" ...
##  $ Race Number  : num [1:440] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Race Date    : Date[1:440], format: "2022-03-20" "2022-03-20" ...
##  $ Total Time   : num [1:440] 5854 5859 5863 5865 5868 ...
colSums(is.na(df2))
##         Track      Position            No        Driver          Team 
##             0            64             0             0             0 
## Starting Grid          Laps  Time/Retired        Points         +1 Pt 
##             2             0           161             0             0 
##   Fastest Lap   Race Number     Race Date    Total Time 
##            14             0             0           161

This is our final output of the data frame:

Column Name Type
Track chr
Position num
No num
Driver chr
Team chr
Starting Grid num
Laps num
Time/Retired num
Points num
+1 Pt chr
Fastest Lap POSIXct
Race Number num
Race Date Date
Total Time num

Initial review of data

Through our initial EDA we added a few columns to help clarify the race details and results details. We also made assumption on the data types such that it will be easier to do analysis on. For example, we made modifications to the Time/Retired column to be in a single type of data type, numeric. Originally this was set as a 1st place gets a final time, those who finished within a lap, would get a additional time value, those who were lapped, get a number of lap rating, and finally, there is value on those who Did Not Finish or Did Not Start. We created the Total Time column to get a cumulative race time for each of the drivers. Those who got lapped, Did Not Finish, or Did Not Start received a 0. We had to make those who lapped 0 because typically after a lap, time is not registered and thus data on actual final times are difficult to make assumptions on as the range of possibilities are too large.

Analysis

# Include statistical analysis
# talk about the statistical analysis
p1d <- df2 %>%
  group_by(Team) %>%
  summarise("Total Points" = sum(Points, na.rm=T),
            "Average Position" = round(mean(Position, na.rm=T),2),
            "Average Starting Position" = round(mean(`Starting Grid`, na.rm=T),2),
            "Average Total Time" = round(mean(`Total Time`, na.rm=T),2)) %>%
  mutate("Color Code" = ifelse(Team == "Alfa Romeo Ferrari", "#900000",
                               ifelse(Team == "AlphaTauri RBPT", "#2B4562",
                               ifelse(Team == "Alpine Renault", "#0090FF",
                               ifelse(Team == "Aston Martin Aramco Mercedes", "#006F62",
                               ifelse(Team == "Ferrari", "#DC0000",
                               ifelse(Team == "Haas Ferrari", "#FFFFFF",
                               ifelse(Team == "McLaren Mercedes", "#FF8700",
                               ifelse(Team == "Mercedes", "#00D2BE",
                               ifelse(Team == "Red Bull Racing RBPT", "#0600EF",
                               ifelse(Team == "Williams Mercedes", "#005AFF","No Color")))))))))))
str(p1d)
## tibble [10 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Team                     : chr [1:10] "Alfa Romeo Ferrari" "AlphaTauri RBPT" "Alpine Renault" "Aston Martin Aramco Mercedes" ...
##  $ Total Points             : num [1:10] 53 36 165 57 519 35 150 495 724 8
##  $ Average Position         : num [1:10] 11.65 12.26 8.41 11.2 3.4 ...
##  $ Average Starting Position: num [1:10] 12.82 13.09 10.52 14.11 4.32 ...
##  $ Average Total Time       : num [1:10] 5443 5366 5566 5409 5449 ...
##  $ Color Code               : chr [1:10] "#900000" "#2B4562" "#0090FF" "#006F62" ...

Accessory - Team Color Palette

# This list is in alpha order for each of the teams
alphaCP <- c("#900000","#2B4562","#0090FF","#006F62","#DC0000","#FFFFFF","#FF8700","#00D2BE","#0600EF","#005AFF")

How did each of the Teams do?

p1 <- ggplot(p1d, aes(x = Team, y = `Total Points`, fill = Team)) +
  geom_bar(stat='identity', alpha = 0.75, color = "black") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle = 45), panel.background = element_rect(fill = "grey")) +
  scale_fill_manual(name = "Team with Offical Colors", 
                    values = c("Alfa Romeo Ferrari" = "#900000",
                               "AlphaTauri RBPT" = "#2B4562",
                               "Alpine Renault" = "#0090FF",
                               "Aston Martin Aramco Mercedes" = "#006F62",
                               "Ferrari" = "#DC0000",
                               "Haas Ferrari" = "#FFFFFF",
                               "McLaren Mercedes" = "#FF8700",
                               "Mercedes" = "#00D2BE",
                               "Red Bull Racing RBPT" = "#0600EF",
                               "Williams Mercedes" = "#005AFF")) +
  xlab("Team") +
  ylab("Team's Total Points") +
  ggtitle("Team Ranking for the 2022 Season")
p1 <- ggplotly(p1)
p1
df3 <- df2 %>%
  filter(Team %in% c("Ferrari","Mercedes","Red Bull Racing RBPT"))
p2 <- ggplot(df3, aes(x = Team, y = `Points`, fill = Driver)) + 
  geom_bar(stat = "identity")
p2 <- ggplotly(p2)
p2
p3 <- df3 %>%
  ggplot(aes(x = Driver, y = Points, fill = Driver, text = paste("Date of Race:", `Race Date`))) + 
    geom_boxplot() +
    scale_fill_viridis(discrete = T, alpha = 0.75) +
    geom_jitter(color = "black", size = 0.3, alpha = 0.9) +
    theme_ipsum() +
    theme(axis.text.x=element_text(angle = 45, vjust = 0.5, hjust = 1)) +
    xlab("Each Driver per Team") +
    ggtitle("Analysis on Top 3 Teams")
p3 <- ggplotly(p3)
p3

Through the general point system, we can see that there are three primary teams who have done well this year. That would be Ferrari, Mercedes, and Red Bull Racing. Though, most drives have a high point as well as a 0 mark, only George Russell is noted of having these high points and 0 points being an outlier. Looking at his record, it seems he is a very consistent driver with most of his points being from 10 to 18 on almost all races except for 3. Yet, he is still above all teams combined points (excluding the three we are reviewing). This shows there is still a big difference between teams, going against the F1 desire to make 2022 more competitive amongst all teams.

p2d <- df2 %>%
  group_by(Driver, Team) %>%
  summarise("Total Points" = sum(Points, na.rm=T),
            "Average Position" = round(mean(Position, na.rm=T),2),
            "Average Starting Position" = round(mean(`Starting Grid`, na.rm=T),2),
            "Average Total Time" = round(mean(`Total Time`, na.rm=T),2)) %>%
  arrange(desc(`Total Points`))
p2d
## # A tibble: 22 × 6
## # Groups:   Driver [22]
##    Driver          Team                 `Total Points` Average…¹ Avera…² Avera…³
##    <chr>           <chr>                         <dbl>     <dbl>   <dbl>   <dbl>
##  1 Max Verstappen  Red Bull Racing RBPT            433      2.81    3.41   5408.
##  2 Charles Leclerc Ferrari                         291      3.32    3.91   5401.
##  3 Sergio Perez    Red Bull Racing RBPT            291      3.95    4.82   5407.
##  4 George Russell  Mercedes                        262      4.38    6.36   5566.
##  5 Lewis Hamilton  Mercedes                        233      5.38    6.5    5461.
##  6 Carlos Sainz    Ferrari                         228      3.5     4.73   5507.
##  7 Lando Norris    McLaren Mercedes                116      7.8     8.05   5419.
##  8 Esteban Ocon    Alpine Renault                   90      8.2    11.6    5542.
##  9 Fernando Alonso Alpine Renault                   75      8.65    9.41   5599.
## 10 Valtteri Bottas Alfa Romeo Ferrari               47     10.4    11.4    5477.
## # … with 12 more rows, and abbreviated variable names ¹​`Average Position`,
## #   ²​`Average Starting Position`, ³​`Average Total Time`

Additional Analysis

summary(lm(df3$`Points`~df3$`Total Time`))
## 
## Call:
## lm(formula = df3$Points ~ df3$`Total Time`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3469  -3.5398  -0.4149   2.7986  10.6511 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      1.481e+01  4.416e+00   3.355  0.00109 **
## df3$`Total Time` 1.037e-04  8.019e-04   0.129  0.89729   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.263 on 111 degrees of freedom
##   (결측으로 인하여 19개의 관측치가 삭제되었습니다.)
## Multiple R-squared:  0.0001508,  Adjusted R-squared:  -0.008857 
## F-statistic: 0.01674 on 1 and 111 DF,  p-value: 0.8973
cor.test(df3$`Points`,df3$`Total Time`, use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  df3$Points and df3$`Total Time`
## t = 0.12938, df = 111, p-value = 0.8973
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1728424  0.1965635
## sample estimates:
##        cor 
## 0.01227955

We take a look through some of the statistics between Points and Total Time. We see some interesting aspect that there is little to no correlation between the amount of points a driver makes versus their race time.

df4 <- df2 %>%
  filter(Team %in% c("Alpine Renault","McLaren Mercedes"))

p4 <- df4 %>%
  ggplot(aes(x = Driver, y = Points, fill = Driver, text = paste("Date of Race:", `Race Date`))) + 
    geom_boxplot() +
    scale_fill_viridis(discrete = T, alpha = 0.75) +
    geom_jitter(color = "black", size = 0.3, alpha = 0.9) +
    theme_ipsum() +
    theme(axis.text.x=element_text(angle = 45, vjust = 0.5, hjust = 1)) +
    xlab("Each Driver per Team") +
    ggtitle("Analysis on Top 3 Teams")
p4 <- ggplotly(p4)
p4

To compare, looking at the next two teams (or the next 4 drivers), we see it is significantly less than what we can see from the top 3 teams. The boxplots here show a much larger concentration at 15 or below. Only two of these drivers are able to get into the top 4, and it occurs a total of 3 times in the entire season.

Summary of Findings

The goal of 2022 for the corporate F1 organization was to find a way to make the events more competitive. We wanted to look through this analysis and the results of each of the races to see if they reached their goal making a more competitive race. Did they succeed? Unfortunately, looking at not only the placements, but also looking at each of the team ranking and even down to the driver rankings, we find that the top 3 teams are stuck as the top three, the middle two who may be a contender are no where near close. The rest of the teams? No competition.

Even after all of the rule changes, monetary restrictions, safety and mechanical changes that F1 put in place, the same three teams are yet again in the lead. We take a look at who could possibly try to tame the beasts, but there is only one other driver who even came close, Lando Norris who in a single race early in the season got 3rd place. Since then, no other racer has ever come in third place outside of the top 3 teams. Statistically, we don’t see much that shows number of points related to speed. Even in the events where the racers from the top teams get disqualified, do not finish, or even get lapped, no other teams are able to get into the top 3. F1 has a long road ahead of them on policy and rule changes if they want to make this sport more active for fans, more competitive for racers, and sustainable for sponsors of the non-top 3 teams.

In terms of the project analysis, if more time is granted, I would like to look for additional information on the race, such as race conditions, weather conditions, and crash events. Some teams may be well equipped to handle extremely dry conditions as compared to wet conditions, or at higher elevations. Crash related information is relative as that may be the cause for some of the lap or incomplete races as well. I would also like to take time to find a more automated method to the clean up and reformatting efforts in the early stages of the project.