Introduction

Formula 1 has been a beloved sport to its loyal fan base for over 70 years now, but the sport is considered one of the most data-driven professional sports in the world. Each race team has a data team that comes with them to each race weekend, and one that stays back at their respective factories. They are constantly combing through the thousands of data points that a car returns every lap and then meeting with their team’s principal, engineers, and drivers to determine the greatest strategy for every race weekend. Along with this, Formula 1 and their broadcast partners in recent years teamed up with AWS to deliver insights and predictions during the race broadcast. However, one of the greatest periods in the history of the sport is from 2000-2021 did not have data for the public to see until a few years ago.

There is plenty of data but little insight into the powerhouses of the sport so fans can understand how one one-hundredth of a second can make a difference in your outcome. The success of the Netflix docuseries Formula 1: Drive to Survive and the documentary Schumacher based on the career of the legendary Michael Schumacher have attracted a large new fan base to the sport. As a lifelong F1 fan myself, I wanted to take a look back at how the sport has evolved over the past twenty years and look at the teams, drivers, and car performance to see how they have evolved. This is because, for the upcoming 2022 season, there is a brand-new car design and new rules that will make the sport a competitive field for all teams that are racing. Since the end of what has been one of if not the greatest era of Formula 1 yet, I wanted to take a look back and see what made it truly remarkable but also see how the sport can improve. I hope you all enjoy it!

Dataset

For my data set, I have chosen to use the Ergast Developer API’s(http://ergast.com/mrd/db/) data files that have data that look at Formula 1 from 1950-the present. This data set came as multiple .csv files located in a zip file that would have to be joined together when needed for each visualization. When looking at the structure of the data files, each has one or multiple possible connections that can be made to connect different sheets. For this reason, I plan on using sqldf in building out the data frames for each visualization.

The first table that I used in the data set is the driver’s file. In this file, you will have the drivers which will be used to link to other data sets, the driver’s reference which is the driver’s last name. These two columns will be used for my wins per driver visualization.

summary(df_drivers)
##     driverId      driverRef            number              code          
##  Min.   :  1.0   Length:853         Length:853         Length:853        
##  1st Qu.:214.0   Class :character   Class :character   Class :character  
##  Median :427.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :427.1                                                           
##  3rd Qu.:640.0                                                           
##  Max.   :854.0                                                           
##    forename           surname          nationality       
##  Length:853         Length:853         Length:853        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 

The next data set that I used was the races data file from Ergast. In this data set, I used the raceId to link this table to others as this is the key for this table. Next, I used the year column to help determine which year the races were held to help filter out the thousands of rows of data and look at a specific period. Finally, I used the circuitId as a foreign key to connect to the circuits table which I will talk about next. This table will be used in all of my visualizations to help understand the years that I am looking at and look at results for those races. One Modification that was made in this data frame was to set the time column to not appear as it was creating an issue with joining other tables with the same column name in it.

summary (df_races)
##      raceId            year          round          circuitId    
##  Min.   :   1.0   Min.   :1950   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 265.0   1st Qu.:1976   1st Qu.: 4.000   1st Qu.: 9.00  
##  Median : 529.0   Median :1992   Median : 8.000   Median :18.00  
##  Mean   : 530.7   Mean   :1991   Mean   : 8.369   Mean   :22.19  
##  3rd Qu.: 793.0   3rd Qu.:2008   3rd Qu.:12.000   3rd Qu.:32.00  
##  Max.   :1073.0   Max.   :2021   Max.   :22.000   Max.   :77.00  
##  Grand_Prix_name        time          
##  Length:1057        Length:1057       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

The circuits data set is showing the information regarding each race track that has been used in Formula 1. I will be using the circuitId number to connect my race data set, and the circuit name to help show what racetracks team are the most successful at.

summary(df_circuits)
##    circuitId   circuitRef         track_name          location        
##  Min.   : 1   Length:77          Length:77          Length:77         
##  1st Qu.:20   Class :character   Class :character   Class :character  
##  Median :39   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :39                                                           
##  3rd Qu.:58                                                           
##  Max.   :77                                                           
##    country               lat              lng                alt        
##  Length:77          Min.   :-37.85   Min.   :-118.189   Min.   :  -7.0  
##  Class :character   1st Qu.: 33.45   1st Qu.:  -9.203   1st Qu.:  18.0  
##  Mode  :character   Median : 41.17   Median :   4.327   Median : 129.0  
##                     Mean   : 33.72   Mean   :   3.551   Mean   : 247.5  
##                     3rd Qu.: 47.20   3rd Qu.:  19.249   3rd Qu.: 332.0  
##                     Max.   : 57.27   Max.   : 144.968   Max.   :2227.0  
##      url           
##  Length:77         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

The constructor’s data gives us the constructorId and the constructor_name which will be used to connect with other tables to determine how successful the teams have been since 2000. One note about this data frame has been renamed to df_teams as a way to simplify the naming convention.

summary(df_teams)
##  constructorId   constructorRef     constructors_name  nationality       
##  Min.   :  1.0   Length:211         Length:211         Length:211        
##  1st Qu.: 54.5   Class :character   Class :character   Class :character  
##  Median :107.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :107.0                                                           
##  3rd Qu.:159.5                                                           
##  Max.   :214.0

The pitstop data in the dataset uses the raceId to tell the data what race the pitstop came from. I will be using that and the duration of the pitstop to determine the average pitstop time per year.

summary(df_pit_stops)
##      raceId          driverId          stop            lap       
##  Min.   : 841.0   Min.   :  1.0   Min.   :1.000   Min.   : 1.00  
##  1st Qu.: 881.0   1st Qu.: 17.0   1st Qu.:1.000   1st Qu.:13.00  
##  Median : 937.0   Median :813.0   Median :2.000   Median :25.00  
##  Mean   : 939.9   Mean   :487.2   Mean   :1.768   Mean   :25.13  
##  3rd Qu.: 993.0   3rd Qu.:826.0   3rd Qu.:2.000   3rd Qu.:35.00  
##  Max.   :1065.0   Max.   :854.0   Max.   :6.000   Max.   :78.00  
##      time             duration          milliseconds    
##  Length:8518        Length:8518        Min.   :  12897  
##  Class :character   Class :character   1st Qu.:  21896  
##  Mode  :character   Mode  :character   Median :  23492  
##                                        Mean   :  65200  
##                                        3rd Qu.:  26055  
##                                        Max.   :2077164

Next, the qualifying data set looks at the qualifying time set by a driver at each race. For this dataset, I used the raceId column to connect to other data sets to connect my qualifying data to a race, the constructor id to connect to a certain team name, and the q1 column which gives the best lap time given by each driver during the first of three qualifying stages.

summary(df_qualifying)
##    qualifyId        raceId          driverId     constructorId   
##  Min.   :   1   Min.   :   1.0   Min.   :  1.0   Min.   :  1.00  
##  1st Qu.:2245   1st Qu.: 109.0   1st Qu.: 15.0   1st Qu.:  4.00  
##  Median :4488   Median : 353.0   Median : 41.0   Median :  9.00  
##  Mean   :4495   Mean   : 543.3   Mean   :276.6   Mean   : 40.04  
##  3rd Qu.:6752   3rd Qu.: 951.0   3rd Qu.:815.0   3rd Qu.: 22.00  
##  Max.   :9013   Max.   :1065.0   Max.   :854.0   Max.   :214.00  
##      number      grid_position        q1                 q2           
##  Min.   : 0.00   Min.   : 1.00   Length:8973        Length:8973       
##  1st Qu.: 7.00   1st Qu.: 6.00   Class :character   Class :character  
##  Median :14.00   Median :11.00   Mode  :character   Mode  :character  
##  Mean   :17.33   Mean   :11.31                                        
##  3rd Qu.:21.00   3rd Qu.:17.00                                        
##  Max.   :99.00   Max.   :28.00                                        
##       q3           
##  Length:8973       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

The results table shows us where a driver has started in a race and subsequently where they have finished. This table acts as almost a master table since it houses constructorsId,raceId, and driverId which are all crucial in understanding the performance of a driver and team in a race. I have used these to help connect other tables using sqldf for my visualizations, as well as position and grid position to help show the starting grid placement plays a major role in a driver’s chance to win. One note about this table is that I made the time column set to NULL as it was not needed and was interfering with other tables with the same name.

summary(df_results)
##     resultId         raceId          driverId     constructorId   
##  Min.   :    1   Min.   :   1.0   Min.   :  1.0   Min.   :  1.00  
##  1st Qu.: 6311   1st Qu.: 287.8   1st Qu.: 56.0   1st Qu.:  6.00  
##  Median :12620   Median : 503.0   Median :158.0   Median : 25.00  
##  Mean   :12621   Mean   : 518.4   Mean   :251.2   Mean   : 47.51  
##  3rd Qu.:18930   3rd Qu.: 763.0   3rd Qu.:347.0   3rd Qu.: 58.00  
##  Max.   :25245   Max.   :1065.0   Max.   :854.0   Max.   :214.00  
##     number               grid        position         positionText      
##  Length:25240       Min.   : 0.0   Length:25240       Length:25240      
##  Class :character   1st Qu.: 5.0   Class :character   Class :character  
##  Mode  :character   Median :11.0   Mode  :character   Mode  :character  
##                     Mean   :11.2                                        
##                     3rd Qu.:17.0                                        
##                     Max.   :34.0                                        
##  positionOrder       points            laps        milliseconds      
##  Min.   : 1.00   Min.   : 0.000   Min.   :  0.00   Length:25240      
##  1st Qu.: 6.00   1st Qu.: 0.000   1st Qu.: 21.00   Class :character  
##  Median :12.00   Median : 0.000   Median : 52.00   Mode  :character  
##  Mean   :12.93   Mean   : 1.801   Mean   : 45.79                     
##  3rd Qu.:19.00   3rd Qu.: 2.000   3rd Qu.: 66.00                     
##  Max.   :39.00   Max.   :50.000   Max.   :200.00                     
##   fastestLap            rank           fastestLapTime     fastestLapSpeed   
##  Length:25240       Length:25240       Length:25240       Length:25240      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     statusId     
##  Min.   :  1.00  
##  1st Qu.:  1.00  
##  Median : 11.00  
##  Mean   : 17.72  
##  3rd Qu.: 14.00  
##  Max.   :139.00

Finally, I used the lap_times data to help look at a team’s lap times in a race. I used this table to join the column raceId in hopes to look at average lap times in a race versus average qualifying lap time for several teams but I was not able to due to a lack of ability to join the constructor’s table.

summary(df_lap_times)
##      raceId          driverId          lap           position     
##  Min.   :   1.0   Min.   :  1.0   Min.   : 1.00   Min.   : 1.000  
##  1st Qu.: 120.0   1st Qu.: 15.0   1st Qu.:14.00   1st Qu.: 5.000  
##  Median : 343.0   Median : 33.0   Median :29.00   Median : 9.000  
##  Mean   : 517.4   Mean   :258.8   Mean   :29.97   Mean   : 9.646  
##  3rd Qu.: 943.0   3rd Qu.:813.0   3rd Qu.:44.00   3rd Qu.:14.000  
##  Max.   :1065.0   Max.   :854.0   Max.   :87.00   Max.   :24.000  
##      time            milliseconds    
##  Length:505806      Min.   :  55404  
##  Class :character   1st Qu.:  82018  
##  Mode  :character   Median :  90606  
##                     Mean   :  95557  
##                     3rd Qu.: 102389  
##                     Max.   :7507547

Findings

Wins Per Team on Each Track since 2000

df_team_race <- sqldf(
  "select *
  from df_teams inner join df_results using (constructorId)"
)
df_team_race <- sqldf("select * 
                       from df_team_race inner join df_races using (raceId)")
df_team_race <- sqldf("select *
                      from df_team_race inner join df_circuits using (circuitId)")


df_team_race_win <- sqldf("select constructors_name, count(position) as 'Wins',track_name
                          from df_team_race
                          where year >= 2000
                          and position == 1
                          group by track_name, constructors_name")


getPalette = colorRampPalette(brewer.pal(8, "Set2"))
team_race_win <- ggplot(df_team_race_win, aes( x = Wins, y= track_name, fill = constructors_name))+
  geom_bar(stat= "identity", position = position_stack(reverse = TRUE)) +
  geom_text(aes(x = Wins, 
                y = track_name,label = Wins), 
            position = position_stack(vjust=0.5,reverse=TRUE),
            size = 4) + 
  labs(title = "Wins per Team  by Track Since 2000", x= "Wins", y = "Track Name", fill = "Team Name")+
  theme_light() +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_fill_manual(values = getPalette(length(unique(df_team_race_win$constructors_name))))
  team_race_win

This stacked bar chart is looking at Wins per Team by Track Since 2000. This graph shows us not only what teams are successful at which track but also how many times a race has been held there. The dominance of the Ferrari F1 team is easily shown in the blue color. This is due to their exceptional cars and driver pairings that are brought out each year. One reason in particular for their success is thanks to Michael Schumacher and his amazing time at Ferrari in the early to late 2000s. Another team that has shown great success is McLaren due to their driver pairing of now seven-time world champion Lewis Hamilton and Jenson Button. Looking at the other end of the spectrum the teams with fewer wins in total and by track are most likely because they are the “minor league’ teams that have a smaller budget which means less experienced drivers, and a less successful R&D department.

Wins per Driver since 2000

df_race_year <- sqldf("select* 
                       from df_races inner join df_results using (raceId)")

df_wins_per_driver <- sqldf("Select driverRef as 'Driver Name', count(position) as 'Wins'
                            from df_drivers inner join df_race_year using (driverId)
                            where position == 1
                            and year >= 2000
                            group by driverRef
                            ORDER BY count(position) DESC, driverRef")
ggplot(df_wins_per_driver, aes(x= Wins, y= reorder(`Driver Name`,Wins)))+
  geom_bar(stat= "identity", position = position_stack(reverse = FALSE),color= "darkgreen", fill = "lightgreen")+
  labs(title = "Wins per Driver Since 2000", x= "Wins", y = "Drivers Last Name")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(data = df_wins_per_driver, aes(x= round(Wins,digits = 0), y = `Driver Name`, label = Wins, fill =NULL), hjust = -0.1,size = 4)

This bar chart is looking at Career wins per driver since 2000, and you can see that is skewed in one direction and highlights the best drivers of the last 20 years. The skewed this way because both Lewis Hamilton and Michael Schumacher are considered the greatest drivers in F1 history. The drivers that have more wins than Max Verstappen have another thing in common; they were all at one time driving for Ferrari, Mercedes, or McLaren.

Average Pit Stop Time

avg_pit_stop<- sqldf("select year, avg(stop) as 'Average Pit stops per Race', avg(duration) as 'Average Pit Stop Time'
                     from df_pit_stops inner join df_races using (raceId)
                     where year >= 2011
                     group by Year")      
x_axis_years <- min(avg_pit_stop$year): max(avg_pit_stop$year)
pit_stop <- ggplot(data = avg_pit_stop,aes(x = year, y = `Average Pit Stop Time`))+
geom_line(color = 'black', size = 1.5)+
geom_point(shape = 21, size = 3, color= 'white', fill = 'red')+
  labs(title = "Average Pit Stop time (2011-2021)", x= "Years",
       y = "Average Pit Stop Time
(Seconds)")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(labels = x_axis_years, breaks = x_axis_years)+
  geom_label_repel(aes(label= round(`Average Pit Stop Time`,digits =2)),
                   box.padding = 1,
                   point.padding = 1,
                   size =3,
                   color ='black',
                   segment.color = 'red')
pit_stop

The timing of a pit stop is crucial in F1. This is because teams will use it to execute what is known as an “undercut” to get ahead of them before they are expected to pit. If the undercut is performed perfectly it can change the tide in a race, but one small mistake from your pit crew can derail that. The average pitstop time that is measured here is the time from when the car enters the pit lane until it exits the pit lane again. The graph is showing us that since 2011 which is the first year of the pitstop data that there is an overall increase in the average pit stop time. This could be due to a few different reasons. First, the aspect of human error will play, for example, one wheel gunner cannot get his wheel off fast enough, etc. Second is the car has sustained damage to a part that can be replaced on the car such as the front wing. The final possibility is the issue of Drivers having to serve timed penalties by serving a drive-through penalty or a penalty to be served during a pit stop. One thing to note this year’s data shows roughly six-tenths of a second difference because the 2021 season is not finished yet.

Average Qualifying time since 2010

df_all_teams <- sqldf("select year, constructors_name, avg(q1) as 'Average Q1 Lap Time'
                               from df_lap_times inner join df_races using (raceId)
                               inner join df_qualifying using (raceId)
                               inner join df_teams using (constructorId)
                               where year >= 2010
                               group by constructors_name, year")
trellis_df <- df_all_teams %>%
  filter(constructors_name %in% c("Mercedes", "McLaren", "Red Bull", "Ferrari")) %>%
  data.frame()
team_trellis<-ggplot(trellis_df, aes(x= year, group=constructors_name)) +
  geom_line(stat ="identity", aes(y=(Average.Q1.Lap.Time)), color="steelblue")+
  geom_point(aes(y= Average.Q1.Lap.Time),shape = 21, size = 2, color= 'white', fill = 'red')+
  labs(title= "Average Qualifying Times for Four Teams Since 2010", x="Year",y= "Average Q1 Lap Time
       (Minutes)")+
  theme_light()+
  facet_wrap(~constructors_name, nrow=2, ncol=2)
team_trellis

When looking at this trellis chart, we are observing the average qualifying pace for the first Qualifying Session. In this chart, I wanted to look at the top four performing teams over the last 11 years; Mercedes, Ferrari, McLaren, and Red Bull. Between the four teams, each follows a fairly similar plot to them, except for McLaren which had similar trends as Red Bull but drastically increased followed by a radical decrease the following year. Some key takeaways here are that after the 2014 season, each team had a gradual decrease. This is most likely due to changes in the engine regulations that improved horsepower. The increase of the average qualifying time could be attributed to poor weather and track conditions, worn-down engine parts that will hurt performance. However, this helps us to explain why teams want to go as fast as possible in qualifying, to gain the best position on the grid for the race.

Race Results Based on Qualifying Position

library(plotly)
df_wins_based_grid <- sqldf("select grid, count(position) as 'Wins'
                   from df_results inner join df_races using (raceId)
                   where position == 1
                   and year >= 2000
                   group by grid")

Race_Result_Qualifying<-plot_ly(df_wins_based_grid, labels = ~grid, values = ~Wins)%>%
  add_pie(hole=0.6)%>%
  layout(title =" Total Wins Based on Starting Grid Position",legend=list(title=list(text='Starting Grid Position',font = 4)) )%>%
  layout(annotations=list(text=paste0("Total number of races since 2000:
  ", (sum(df_wins_based_grid$Wins))),
          "showarrow"=FALSE))
Race_Result_Qualifying

Continuing off of the trellis chart based on qualifying pace, this donut chart looks at the probability of winning the race depending on their starting grid position. Out of the 403 total races over the last 21 years. We see that the data is extremely skewed to the front row of the grid. If you are in pole position when starting the race, there is a little over 50% probability that you would win the race, and if you are in second place to start; you have a 25% chance of winning. This shows that there is a 75% chance that the drivers starting in first or second will win the race.

Conclusions

In conclusion, when looking at Formula 1 over the last 21 years we are seeing a similar trend. The best drivers who drive for the best teams are usually more successful, which is no surprise. Along with this, it shows that the more successful team has a better-performing car and team that will help them win. All of these visualizations show why the FIA (the F1 governing body) is working to drastically change the sport next year to level the fight for the worse teams to compete. These improvements will make the sport more interesting for fans and make the sport even more competitive.