F1 is an internationally enjoyed sport for those who enjoy adrenaline rush. Since 1950, this sport for a single seated race car has become a race of the highest prestige, finding itself in the realm of luxury being sponsored by organizations such as Ferrari and Rolex. With the world moving to electric vehicles, and climate change the hottest topic of discussion, gas guzzling sports like F1 has been finding a big challenge of keeping viewership from both the external as well as internal issues.
This data was collected by toUpperCase78 through webscraping of the official F1 website. The dataset and associated ReadMe file can be found on GitHub with the data being scrapped from Offical F1.The data has been separated into a variety of different topics, where this analysis will be focusing on the records on each race. The other datasets are available on the same GitHub link.
There are an endless amount of research opportunities that may affect the F1 sport. However, we would like to keep a focus internally to what has been happening with the sport.
Over the past few seasons, there has been a growing concern that the top teams are too significantly ahead of the rest. In the 2022 season, there were major modifications to rules and mechanical requirements to hopefully allow other teams to not only catch up, but also give a sense of competition. Hamilton, Leclerc, Perez, and Verstappen are the power houses that have made a name for themselves over the last 5+ years. They represent teams for Ferrari, Mercedes, and Red Bull. They have continued to take the top seats making the F1 series feel like a battle between these teams, and more specifically, these drivers. Major news outlets as well as major community forums have placed their concern that the races have lost its touch because of this 3 way battle with no others teams able to come close, Race Fans reports.
Looking through the 2022 races, can we see any trends with the top three teams? Did these changes for 2022 make the races feel more competitive?
Define variables
| Column Title | Type | Description |
|---|---|---|
| Track | chr | Name or Location of the Track |
| Position | num | Final Rank of the Driver at Event |
| No | num | Number of the Driver |
| Driver | chr | Driver Name |
| Team | chr | Team Name |
| Starting Grid | num | Position the Driver Started at this event |
| Laps | num | Number of Laps Completed |
| Time/Retired | num | Finishing Time |
| Points | num | Number of Points Received from Race |
| +1 Pt | chr | Additional 1 Point for the Driver Who Got the Fastest Lap Time |
| Fastest Lap | POSIXct | Time of Fastest Lap |
The purpose of having Time/Retired be a POSIXct is because only the first place gets a time, and all places after get a value that is added to that first place record.
# Prepare for load of libraries
setwd("~/Data Study/Data 110 MC/Project Final")
library("pacman")
pacman::p_load(tidyverse,dplyr,readxl,plotly,ggthemes,treemapify,RColorBrewer,hrbrthemes,viridis)
# Read in the Data f1dataset.xlsx
df <- read_excel("f1dataset.xlsx",sheet = "Sheet1")
# Explore the data
str(df)
## tibble [440 × 11] (S3: tbl_df/tbl/data.frame)
## $ Track : chr [1:440] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
## $ Position : chr [1:440] "1" "2" "3" "4" ...
## $ No : num [1:440] 16 55 44 63 20 77 31 22 14 24 ...
## $ Driver : chr [1:440] "Charles Leclerc" "Carlos Sainz" "Lewis Hamilton" "George Russell" ...
## $ Team : chr [1:440] "Ferrari" "Ferrari" "Mercedes" "Mercedes" ...
## $ Starting Grid: num [1:440] 1 3 5 9 7 6 11 16 8 15 ...
## $ Laps : num [1:440] 57 57 57 57 57 57 57 57 57 57 ...
## $ Time/Retired : chr [1:440] "6.7749814814814804E-2" "5.5979999999999999" "9.6750000000000007" "11.211" ...
## $ Points : num [1:440] 26 18 15 12 10 8 6 4 2 1 ...
## $ +1 Pt : chr [1:440] "Yes" "No" "No" "No" ...
## $ Fastest Lap : POSIXct[1:440], format: "1899-12-31 00:01:35" "1899-12-31 00:01:36" ...
Here we can see that some of the columns have strange variable types, such as Position being characterized as a character rather than a number. Below is a chart of each of the variable types and what we expect.
| Column Title | Type | Expected Type |
|---|---|---|
| Track | chr | chr |
| Position | chr | num |
| No | num | num |
| Driver | chr | chr |
| Team | chr | chr |
| Starting Grid | num | num |
| Laps | num | num |
| Time/Retired | chr | num |
| Points | num | num |
| +1 Pt | chr | chr |
| Fastest Lap | POSIXct | POSIXct |
For the value types that don’t line up (i.e. Position), we will review and correct
# Take a look at the head and tail of the dataset to see how it is set.
head(df,10)
## # A tibble: 10 × 11
## Track Position No Driver Team Start…¹ Laps Time/…² Points `+1 Pt`
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Bahrain 1 16 Charles Le… Ferr… 1 57 6.7749… 26 Yes
## 2 Bahrain 2 55 Carlos Sai… Ferr… 3 57 5.5979… 18 No
## 3 Bahrain 3 44 Lewis Hami… Merc… 5 57 9.6750… 15 No
## 4 Bahrain 4 63 George Rus… Merc… 9 57 11.211 12 No
## 5 Bahrain 5 20 Kevin Magn… Haas… 7 57 14.754 10 No
## 6 Bahrain 6 77 Valtteri B… Alfa… 6 57 16.119 8 No
## 7 Bahrain 7 31 Esteban Oc… Alpi… 11 57 19.422… 6 No
## 8 Bahrain 8 22 Yuki Tsuno… Alph… 16 57 20.385… 4 No
## 9 Bahrain 9 14 Fernando A… Alpi… 8 57 22.39 2 No
## 10 Bahrain 10 24 Guanyu Zhou Alfa… 15 57 23.064 1 No
## # … with 1 more variable: `Fastest Lap` <dttm>, and abbreviated variable names
## # ¹`Starting Grid`, ²`Time/Retired`
tail(df,10)
## # A tibble: 10 × 11
## Track Position No Driver Team Start…¹ Laps Time/…² Points `+1 Pt`
## <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Abu Dhabi 11 22 Yuki Tsu… Alph… 11 58 89.370… 0 No
## 2 Abu Dhabi 12 24 Guanyu Z… Alfa… 15 57 +1 lap 0 No
## 3 Abu Dhabi 13 23 Alexande… Will… 19 57 +1 lap 0 No
## 4 Abu Dhabi 14 10 Pierre G… Alph… 17 57 +1 lap 0 No
## 5 Abu Dhabi 15 77 Valtteri… Alfa… 18 57 +1 lap 0 No
## 6 Abu Dhabi 16 47 Mick Sch… Haas… 12 57 +1 lap 0 No
## 7 Abu Dhabi 17 20 Kevin Ma… Haas… 16 57 +1 lap 0 No
## 8 Abu Dhabi 18 44 Lewis Ha… Merc… 5 55 DNF 0 No
## 9 Abu Dhabi 19 6 Nicholas… Will… 20 55 DNF 0 No
## 10 Abu Dhabi NC 14 Fernando… Alpi… 10 27 DNF 0 No
## # … with 1 more variable: `Fastest Lap` <dttm>, and abbreviated variable names
## # ¹`Starting Grid`, ²`Time/Retired`
# Check standard NA values (Not inclusive of NC or DNF)
colSums(is.na(df))
## Track Position No Driver Team
## 0 0 0 0 0
## Starting Grid Laps Time/Retired Points +1 Pt
## 2 0 0 0 0
## Fastest Lap
## 14
# Cleaning the data to be used for analysis
# Add Main Event Race Date and Race Number
df2 <- df %>%
mutate(`Race Number` = ifelse(Track == "Bahrain", 1,
ifelse(Track == "Saudi Arabia", 2,
ifelse(Track == "Australia", 3,
ifelse(Track == "Emilia Romagna", 4,
ifelse(Track == "Miami", 5,
ifelse(Track == "Spain", 6,
ifelse(Track == "Monaco", 7,
ifelse(Track == "Azerbaijan", 8,
ifelse(Track == "Canada", 9,
ifelse(Track == "Great Britain", 10,
ifelse(Track == "Austria", 11,
ifelse(Track == "France", 12,
ifelse(Track == "Hungary", 13,
ifelse(Track == "Belgium", 14,
ifelse(Track == "Netherlands", 15,
ifelse(Track == "Italy", 16,
ifelse(Track == "Singapore", 17,
ifelse(Track == "Japan", 18,
ifelse(Track == "United States", 19,
ifelse(Track == "Mexico", 20,
ifelse(Track == "Brazil", 21,
ifelse(Track == "Abu Dhabi", 22, "Not Listed")))))))))))))))))))))))
df2 <- df2 %>%
mutate(`Race Date` = ifelse(Track == "Bahrain", format(as.Date("2022-03-20"), format = "%Y/%m/%d"),
ifelse(Track == "Saudi Arabia", format(as.Date("2022-03-27"), format = "%Y/%m/%d"),
ifelse(Track == "Australia", format(as.Date("2022-04-10"), format = "%Y/%m/%d"),
ifelse(Track == "Emilia Romagna", format(as.Date("2022-04-24"), format = "%Y/%m/%d"),
ifelse(Track == "Miami", format(as.Date("2022-05-08"), format = "%Y/%m/%d"),
ifelse(Track == "Spain", format(as.Date("2022-05-22"), format = "%Y/%m/%d"),
ifelse(Track == "Monaco", format(as.Date("2022-05-29"), format = "%Y/%m/%d"),
ifelse(Track == "Azerbaijan", format(as.Date("2022-06-12"), format = "%Y/%m/%d"),
ifelse(Track == "Canada", format(as.Date("2022-06-19"), format = "%Y/%m/%d"),
ifelse(Track == "Great Britain", format(as.Date("2022-07-03"), format = "%Y/%m/%d"),
ifelse(Track == "Austria", format(as.Date("2022-07-10"), format = "%Y/%m/%d"),
ifelse(Track == "France", format(as.Date("2022-07-24"), format = "%Y/%m/%d"),
ifelse(Track == "Hungary", format(as.Date("2022-07-31"), format = "%Y/%m/%d"),
ifelse(Track == "Belgium", format(as.Date("2022-08-28"), format = "%Y/%m/%d"),
ifelse(Track == "Netherlands", format(as.Date("2022-09-04"), format = "%Y/%m/%d"),
ifelse(Track == "Italy", format(as.Date("2022-09-11"), format = "%Y/%m/%d"),
ifelse(Track == "Singapore", format(as.Date("2022-10-02"), format = "%Y/%m/%d"),
ifelse(Track == "Japan", format(as.Date("2022-10-09"), format = "%Y/%m/%d"),
ifelse(Track == "United States", format(as.Date("2022-10-23"), format = "%Y/%m/%d"),
ifelse(Track == "Mexico", format(as.Date("2022-10-30"), format = "%Y/%m/%d"),
ifelse(Track == "Brazil", format(as.Date("2022-11-13"), format = "%Y/%m/%d"),
ifelse(Track == "Abu Dhabi", format(as.Date("2022-11-20"), format = "%Y/%m/%d"),
"Not Listed")))))))))))))))))))))))
df2$`Race Date` <- as.Date(df2$`Race Date`)
df2$Position <- as.numeric(df2$Position)
df2$`Race Number` <- as.numeric(df2$`Race Number`)
There are three main columns I would like to clean up before moving on to the actual analysis. These are the Position, Time/Retired “NA” types, and Time/Retired value types. Currently Time/Retired is set to initial completion time (first place), and then all additional placements are a value of first place time + additional time. I will change this to be total times.
# Cleaning Position Column
df2$Position[which(df2$Position == "NC")] <- NA
# Cleaning Time/Retired for the DNF/DNS and +laps
df2$`Time/Retired`[which(df2$`Time/Retired` %in% c("DNF","DNS","+1 lap","+2 laps","+6 laps"))] <- 0
str(df2)
## tibble [440 × 13] (S3: tbl_df/tbl/data.frame)
## $ Track : chr [1:440] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
## $ Position : num [1:440] 1 2 3 4 5 6 7 8 9 10 ...
## $ No : num [1:440] 16 55 44 63 20 77 31 22 14 24 ...
## $ Driver : chr [1:440] "Charles Leclerc" "Carlos Sainz" "Lewis Hamilton" "George Russell" ...
## $ Team : chr [1:440] "Ferrari" "Ferrari" "Mercedes" "Mercedes" ...
## $ Starting Grid: num [1:440] 1 3 5 9 7 6 11 16 8 15 ...
## $ Laps : num [1:440] 57 57 57 57 57 57 57 57 57 57 ...
## $ Time/Retired : chr [1:440] "6.7749814814814804E-2" "5.5979999999999999" "9.6750000000000007" "11.211" ...
## $ Points : num [1:440] 26 18 15 12 10 8 6 4 2 1 ...
## $ +1 Pt : chr [1:440] "Yes" "No" "No" "No" ...
## $ Fastest Lap : POSIXct[1:440], format: "1899-12-31 00:01:35" "1899-12-31 00:01:36" ...
## $ Race Number : num [1:440] 1 1 1 1 1 1 1 1 1 1 ...
## $ Race Date : Date[1:440], format: "2022-03-20" "2022-03-20" ...
# Cleaning format of Time/Retired
# Want to change everything to seconds, depending on results, may want to add an additional column for hours conversion
# Race 1
df2$`Time/Retired`[1] <- 5853.6
# Race 2
df2$`Time/Retired`[21] <- 5059.3
# Race 3
df2$`Time/Retired`[41] <- 5266.5
# Race 4
df2$`Time/Retired`[61] <- 5528
# Race 5
df2$`Time/Retired`[81] <- 5664.3
# Race 6
df2$`Time/Retired`[101] <- 5840.5
# Race 7
df2$`Time/Retired`[121] <- 6990.3
# Race 8
df2$`Time/Retired`[141] <- 5645.9
# Race 9
df2$`Time/Retired`[161] <- 5781.8
# Race 10
df2$`Time/Retired`[181] <- 4670.3
# Race 11
df2$`Time/Retired`[201] <- 5064.3
# Race 12
df2$`Time/Retired`[221] <- 5402.1
# Race 13
df2$`Time/Retired`[241] <- 5965.9
# Race 14
df2$`Time/Retired`[261] <- 5152.9
# Race 15
df2$`Time/Retired`[281] <- 5802.8
# Race 16
df2$`Time/Retired`[301] <- 4827.5
# Race 17
df2$`Time/Retired`[321] <- 3740.2
# Race 18
df2$`Time/Retired`[341] <- 3704
# Race 19
df2$`Time/Retired`[361] <- 6131.7
# Race 20
df2$`Time/Retired`[381] <- 5916.7
# Race 21
df2$`Time/Retired`[401] <- 5914
# Race 22
df2$`Time/Retired`[421] <- 5265.9
df2$`Time/Retired` <- as.numeric(df2$`Time/Retired`)
df2$`Time/Retired` <- round(df2$`Time/Retired`,1)
df2 <- df2 %>%
mutate(`Total Time` = ifelse(`Race Number` == 1, `Time/Retired`[1]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[1]) * `Time/Retired`,
ifelse(`Race Number` == 2,`Time/Retired`[21]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[21]) * `Time/Retired`,
ifelse(`Race Number` == 3,`Time/Retired`[41]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[41]) * `Time/Retired`,
ifelse(`Race Number` == 4,`Time/Retired`[61]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[61]) * `Time/Retired`,
ifelse(`Race Number` == 5,`Time/Retired`[81]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[81]) * `Time/Retired`,
ifelse(`Race Number` == 6,`Time/Retired`[101]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[101]) * `Time/Retired`,
ifelse(`Race Number` == 7,`Time/Retired`[121]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[121]) * `Time/Retired`,
ifelse(`Race Number` == 8,`Time/Retired`[141]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[141]) * `Time/Retired`,
ifelse(`Race Number` == 9,`Time/Retired`[161]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[161]) * `Time/Retired`,
ifelse(`Race Number` == 10,`Time/Retired`[181]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[181]) * `Time/Retired`,
ifelse(`Race Number` == 11,`Time/Retired`[201]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[201]) * `Time/Retired`,
ifelse(`Race Number` == 12,`Time/Retired`[221]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[221]) * `Time/Retired`,
ifelse(`Race Number` == 13,`Time/Retired`[241]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[241]) * `Time/Retired`,
ifelse(`Race Number` == 14,`Time/Retired`[261]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[261]) * `Time/Retired`,
ifelse(`Race Number` == 15,`Time/Retired`[281]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[281]) * `Time/Retired`,
ifelse(`Race Number` == 16,`Time/Retired`[301]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[301]) * `Time/Retired`,
ifelse(`Race Number` == 17,`Time/Retired`[321]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[321]) * `Time/Retired`,
ifelse(`Race Number` == 18,`Time/Retired`[341]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[341]) * `Time/Retired`,
ifelse(`Race Number` == 19,`Time/Retired`[361]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[361]) * `Time/Retired`,
ifelse(`Race Number` == 20,`Time/Retired`[381]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[381]) * `Time/Retired`,
ifelse(`Race Number` == 21,`Time/Retired`[401]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[401]) * `Time/Retired`,
ifelse(`Race Number` == 22,`Time/Retired`[421]*(`Time/Retired` != 0) + (`Time/Retired` != `Time/Retired`[421]) * `Time/Retired`,0)))))))))))))))))))))))
We want to review the data to make sure the expected dataset is in the format we would want for analysis
df2 <- df2 %>%
mutate(across(c("Time/Retired","Fastest Lap","Total Time"), ~na_if(.,0)))
str(df2)
## tibble [440 × 14] (S3: tbl_df/tbl/data.frame)
## $ Track : chr [1:440] "Bahrain" "Bahrain" "Bahrain" "Bahrain" ...
## $ Position : num [1:440] 1 2 3 4 5 6 7 8 9 10 ...
## $ No : num [1:440] 16 55 44 63 20 77 31 22 14 24 ...
## $ Driver : chr [1:440] "Charles Leclerc" "Carlos Sainz" "Lewis Hamilton" "George Russell" ...
## $ Team : chr [1:440] "Ferrari" "Ferrari" "Mercedes" "Mercedes" ...
## $ Starting Grid: num [1:440] 1 3 5 9 7 6 11 16 8 15 ...
## $ Laps : num [1:440] 57 57 57 57 57 57 57 57 57 57 ...
## $ Time/Retired : num [1:440] 5853.6 5.6 9.7 11.2 14.8 ...
## $ Points : num [1:440] 26 18 15 12 10 8 6 4 2 1 ...
## $ +1 Pt : chr [1:440] "Yes" "No" "No" "No" ...
## $ Fastest Lap : POSIXct[1:440], format: "1899-12-31 00:01:35" "1899-12-31 00:01:36" ...
## $ Race Number : num [1:440] 1 1 1 1 1 1 1 1 1 1 ...
## $ Race Date : Date[1:440], format: "2022-03-20" "2022-03-20" ...
## $ Total Time : num [1:440] 5854 5859 5863 5865 5868 ...
colSums(is.na(df2))
## Track Position No Driver Team
## 0 64 0 0 0
## Starting Grid Laps Time/Retired Points +1 Pt
## 2 0 161 0 0
## Fastest Lap Race Number Race Date Total Time
## 14 0 0 161
This is our final output of the data frame:
| Column Name | Type |
|---|---|
| Track | chr |
| Position | num |
| No | num |
| Driver | chr |
| Team | chr |
| Starting Grid | num |
| Laps | num |
| Time/Retired | num |
| Points | num |
| +1 Pt | chr |
| Fastest Lap | POSIXct |
| Race Number | num |
| Race Date | Date |
| Total Time | num |
Through our initial EDA we added a few columns to help clarify the race details and results details. We also made assumption on the data types such that it will be easier to do analysis on. For example, we made modifications to the Time/Retired column to be in a single type of data type, numeric. Originally this was set as a 1st place gets a final time, those who finished within a lap, would get a additional time value, those who were lapped, get a number of lap rating, and finally, there is value on those who Did Not Finish or Did Not Start. We created the Total Time column to get a cumulative race time for each of the drivers. Those who got lapped, Did Not Finish, or Did Not Start received a 0. We had to make those who lapped 0 because typically after a lap, time is not registered and thus data on actual final times are difficult to make assumptions on as the range of possibilities are too large.
# Include statistical analysis
# talk about the statistical analysis
p1d <- df2 %>%
group_by(Team) %>%
summarise("Total Points" = sum(Points, na.rm=T),
"Average Position" = round(mean(Position, na.rm=T),2),
"Average Starting Position" = round(mean(`Starting Grid`, na.rm=T),2),
"Average Total Time" = round(mean(`Total Time`, na.rm=T),2)) %>%
mutate("Color Code" = ifelse(Team == "Alfa Romeo Ferrari", "#900000",
ifelse(Team == "AlphaTauri RBPT", "#2B4562",
ifelse(Team == "Alpine Renault", "#0090FF",
ifelse(Team == "Aston Martin Aramco Mercedes", "#006F62",
ifelse(Team == "Ferrari", "#DC0000",
ifelse(Team == "Haas Ferrari", "#FFFFFF",
ifelse(Team == "McLaren Mercedes", "#FF8700",
ifelse(Team == "Mercedes", "#00D2BE",
ifelse(Team == "Red Bull Racing RBPT", "#0600EF",
ifelse(Team == "Williams Mercedes", "#005AFF","No Color")))))))))))
str(p1d)
## tibble [10 × 6] (S3: tbl_df/tbl/data.frame)
## $ Team : chr [1:10] "Alfa Romeo Ferrari" "AlphaTauri RBPT" "Alpine Renault" "Aston Martin Aramco Mercedes" ...
## $ Total Points : num [1:10] 53 36 165 57 519 35 150 495 724 8
## $ Average Position : num [1:10] 11.65 12.26 8.41 11.2 3.4 ...
## $ Average Starting Position: num [1:10] 12.82 13.09 10.52 14.11 4.32 ...
## $ Average Total Time : num [1:10] 5443 5366 5566 5409 5449 ...
## $ Color Code : chr [1:10] "#900000" "#2B4562" "#0090FF" "#006F62" ...
Accessory - Team Color Palette
# This list is in alpha order for each of the teams
alphaCP <- c("#900000","#2B4562","#0090FF","#006F62","#DC0000","#FFFFFF","#FF8700","#00D2BE","#0600EF","#005AFF")
How did each of the Teams do?
p1 <- ggplot(p1d, aes(x = Team, y = `Total Points`, fill = Team)) +
geom_bar(stat='identity', alpha = 0.75, color = "black") +
theme_minimal() +
theme(axis.text.x=element_text(angle = 45), panel.background = element_rect(fill = "grey")) +
scale_fill_manual(name = "Team with Offical Colors",
values = c("Alfa Romeo Ferrari" = "#900000",
"AlphaTauri RBPT" = "#2B4562",
"Alpine Renault" = "#0090FF",
"Aston Martin Aramco Mercedes" = "#006F62",
"Ferrari" = "#DC0000",
"Haas Ferrari" = "#FFFFFF",
"McLaren Mercedes" = "#FF8700",
"Mercedes" = "#00D2BE",
"Red Bull Racing RBPT" = "#0600EF",
"Williams Mercedes" = "#005AFF")) +
xlab("Team") +
ylab("Team's Total Points") +
ggtitle("Team Ranking for the 2022 Season")
p1 <- ggplotly(p1)
p1
df3 <- df2 %>%
filter(Team %in% c("Ferrari","Mercedes","Red Bull Racing RBPT"))
p2 <- ggplot(df3, aes(x = Team, y = `Points`, fill = Driver)) +
geom_bar(stat = "identity")
p2 <- ggplotly(p2)
p2
p3 <- df3 %>%
ggplot(aes(x = Driver, y = Points, fill = Driver, text = paste("Date of Race:", `Race Date`))) +
geom_boxplot() +
scale_fill_viridis(discrete = T, alpha = 0.75) +
geom_jitter(color = "black", size = 0.3, alpha = 0.9) +
theme_ipsum() +
theme(axis.text.x=element_text(angle = 45, vjust = 0.5, hjust = 1)) +
xlab("Each Driver per Team") +
ggtitle("Analysis on Top 3 Teams")
p3 <- ggplotly(p3)
p3
Through the general point system, we can see that there are three primary teams who have done well this year. That would be Ferrari, Mercedes, and Red Bull Racing. Though, most drives have a high point as well as a 0 mark, only George Russell is noted of having these high points and 0 points being an outlier. Looking at his record, it seems he is a very consistent driver with most of his points being from 10 to 18 on almost all races except for 3. Yet, he is still above all teams combined points (excluding the three we are reviewing). This shows there is still a big difference between teams, going against the F1 desire to make 2022 more competitive amongst all teams.
p2d <- df2 %>%
group_by(Driver, Team) %>%
summarise("Total Points" = sum(Points, na.rm=T),
"Average Position" = round(mean(Position, na.rm=T),2),
"Average Starting Position" = round(mean(`Starting Grid`, na.rm=T),2),
"Average Total Time" = round(mean(`Total Time`, na.rm=T),2)) %>%
arrange(desc(`Total Points`))
p2d
## # A tibble: 22 × 6
## # Groups: Driver [22]
## Driver Team `Total Points` Average…¹ Avera…² Avera…³
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Max Verstappen Red Bull Racing RBPT 433 2.81 3.41 5408.
## 2 Charles Leclerc Ferrari 291 3.32 3.91 5401.
## 3 Sergio Perez Red Bull Racing RBPT 291 3.95 4.82 5407.
## 4 George Russell Mercedes 262 4.38 6.36 5566.
## 5 Lewis Hamilton Mercedes 233 5.38 6.5 5461.
## 6 Carlos Sainz Ferrari 228 3.5 4.73 5507.
## 7 Lando Norris McLaren Mercedes 116 7.8 8.05 5419.
## 8 Esteban Ocon Alpine Renault 90 8.2 11.6 5542.
## 9 Fernando Alonso Alpine Renault 75 8.65 9.41 5599.
## 10 Valtteri Bottas Alfa Romeo Ferrari 47 10.4 11.4 5477.
## # … with 12 more rows, and abbreviated variable names ¹`Average Position`,
## # ²`Average Starting Position`, ³`Average Total Time`
summary(lm(df3$`Points`~df3$`Total Time`))
##
## Call:
## lm(formula = df3$Points ~ df3$`Total Time`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3469 -3.5398 -0.4149 2.7986 10.6511
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.481e+01 4.416e+00 3.355 0.00109 **
## df3$`Total Time` 1.037e-04 8.019e-04 0.129 0.89729
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.263 on 111 degrees of freedom
## (결측으로 인하여 19개의 관측치가 삭제되었습니다.)
## Multiple R-squared: 0.0001508, Adjusted R-squared: -0.008857
## F-statistic: 0.01674 on 1 and 111 DF, p-value: 0.8973
cor.test(df3$`Points`,df3$`Total Time`, use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: df3$Points and df3$`Total Time`
## t = 0.12938, df = 111, p-value = 0.8973
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1728424 0.1965635
## sample estimates:
## cor
## 0.01227955
We take a look through some of the statistics between Points and Total Time. We see some interesting aspect that there is little to no correlation between the amount of points a driver makes versus their race time.
df4 <- df2 %>%
filter(Team %in% c("Alpine Renault","McLaren Mercedes"))
p4 <- df4 %>%
ggplot(aes(x = Driver, y = Points, fill = Driver, text = paste("Date of Race:", `Race Date`))) +
geom_boxplot() +
scale_fill_viridis(discrete = T, alpha = 0.75) +
geom_jitter(color = "black", size = 0.3, alpha = 0.9) +
theme_ipsum() +
theme(axis.text.x=element_text(angle = 45, vjust = 0.5, hjust = 1)) +
xlab("Each Driver per Team") +
ggtitle("Analysis on Top 3 Teams")
p4 <- ggplotly(p4)
p4
To compare, looking at the next two teams (or the next 4 drivers), we see it is significantly less than what we can see from the top 3 teams. The boxplots here show a much larger concentration at 15 or below. Only two of these drivers are able to get into the top 4, and it occurs a total of 3 times in the entire season.
The goal of 2022 for the corporate F1 organization was to find a way to make the events more competitive. We wanted to look through this analysis and the results of each of the races to see if they reached their goal making a more competitive race. Did they succeed? Unfortunately, looking at not only the placements, but also looking at each of the team ranking and even down to the driver rankings, we find that the top 3 teams are stuck as the top three, the middle two who may be a contender are no where near close. The rest of the teams? No competition.
Even after all of the rule changes, monetary restrictions, safety and mechanical changes that F1 put in place, the same three teams are yet again in the lead. We take a look at who could possibly try to tame the beasts, but there is only one other driver who even came close, Lando Norris who in a single race early in the season got 3rd place. Since then, no other racer has ever come in third place outside of the top 3 teams. Statistically, we don’t see much that shows number of points related to speed. Even in the events where the racers from the top teams get disqualified, do not finish, or even get lapped, no other teams are able to get into the top 3. F1 has a long road ahead of them on policy and rule changes if they want to make this sport more active for fans, more competitive for racers, and sustainable for sponsors of the non-top 3 teams.
In terms of the project analysis, if more time is granted, I would like to look for additional information on the race, such as race conditions, weather conditions, and crash events. Some teams may be well equipped to handle extremely dry conditions as compared to wet conditions, or at higher elevations. Crash related information is relative as that may be the cause for some of the lap or incomplete races as well. I would also like to take time to find a more automated method to the clean up and reformatting efforts in the early stages of the project.