This assignment performs a comprehensive data analysis using a wide range of R functionality. It gives the authors the opportunity to demonstrate their understanding of R programming and data analysis techniques: we explore, manipulate and visualize data, draw insights and communicate our findings clearly and concisely. Because both authors are interested in Formula 1 (hereafter “F1”), the report first analyses how the sport has developed over roughly the last two decades (2000 - 2022). Afterwards, the two drivers Fernando Alonso and Lewis Hamilton, who have both been in the sport for the majority of the analysed time frame, are compared in order to showcase how much of a difference individual performance can make for the results.
As a first step, we want to see what the data looks like.
## 'data.frame': 9160 obs. of 36 variables:
## $ constructorId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ driverId : int 1 1 14 18 1 1 825 4 8 8 ...
## $ circuitId : int 18 17 20 12 5 11 15 3 20 7 ...
## $ raceId : int 16 843 132 867 844 28 913 990 96 131 ...
## $ resultId : int 7856 20825 2391 21407 20852 213 22426 23808 1652 2352 ...
## $ grid : int 17 3 5 9 4 1 9 13 4 5 ...
## $ positionText : chr "Finished Race" "Finished Race" "Retired" "Finished Race" ...
## $ positionOrder : int 3 1 21 8 4 5 10 7 18 4 ...
## $ points : num 6 25 0 4 12 4 1 6 0 3 ...
## $ laps : int 71 56 27 57 58 70 60 56 9 70 ...
## $ time : chr "+18.944" "1:36:58.226" "\\N" "+24.653" ...
## $ milliseconds : chr "5562025" "5818226" "\\N" "6281302" ...
## $ fastestLap : chr "59" "48" "\\N" "48" ...
## $ rank : chr "6" "2" "\\N" "17" ...
## $ fastestLapTime : chr "1:14.345" "1:40.415" "\\N" "1:44.806" ...
## $ fastestLapSpeed: chr "208.654" "195.424" "\\N" "186.138" ...
## $ statusId : int 1 1 4 1 1 1 1 11 5 1 ...
## $ year : int 2009 2011 2002 2012 2011 2008 2014 2018 2004 2002 ...
## $ round : int 16 3 9 8 4 11 14 2 7 8 ...
## $ gp : chr "Brazilian Grand Prix" "Chinese Grand Prix" "European Grand Prix" "European Grand Prix" ...
## $ date : chr "2009-10-18" "2011-04-17" "2002-06-23" "2012-06-24" ...
## $ gp_place : chr "Autódromo José Carlos Pace" "Shanghai International Circuit" "Nürburgring" "Valencia Street Circuit" ...
## $ location : chr "São Paulo" "Shanghai" "Nürburg" "Valencia" ...
## $ country : chr "Brazil" "China" "Germany" "Spain" ...
## $ lat.x : num -23.7 31.3 50.3 39.5 41 ...
## $ lng : num -46.7 121.22 6.947 -0.332 29.405 ...
## $ alt : chr "785" "5" "578" "4" ...
## $ forename : chr "Lewis" "Lewis" "David" "Jenson" ...
## $ surname : chr "Hamilton" "Hamilton" "Coulthard" "Button" ...
## $ dob : chr "1985-01-07" "1985-01-07" "1971-03-27" "1980-01-19" ...
## $ nationality.x : chr "British" "British" "British" "British" ...
## $ name : chr "McLaren" "McLaren" "McLaren" "McLaren" ...
## $ nationality.y : chr "British" "British" "British" "British" ...
## $ Country : chr "UK" "UK" "UK" "UK" ...
## $ long : num -3.11 -3.11 -3.11 -3.11 -3.11 ...
## $ lat.y : num 58.5 58.5 58.5 58.5 58.5 ...
The data contains quite a lot of columns of class “chr” (character), which makes sense given the amount of textual content in the dataset. The other classes are integer and numeric, which is also to be expected, since motorsport involves a lot of numbers. As a next step, we want to look at some simple plots in order to get some indication of what the data tells us.
Now, let’s look at what we know about the dimensions of the dataset.
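The `str()` output also shows that several character columns (for example `milliseconds`, `rank` or `fastestLapSpeed`) actually hold numbers, with the string `\N` standing in for missing values. A minimal sketch of how such columns could be cleaned, assuming the same layout; the toy data frame `results` and the helper `to_numeric` are hypothetical and not part of the original analysis:

```r
# Toy data mimicking three character columns from the str() output above.
results <- data.frame(
  milliseconds    = c("5562025", "5818226", "\\N", "6281302"),
  rank            = c("6", "2", "\\N", "17"),
  fastestLapSpeed = c("208.654", "195.424", "\\N", "186.138"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the "\N" placeholder with NA, then convert.
to_numeric <- function(x) {
  x[x == "\\N"] <- NA
  as.numeric(x)
}

# Apply the helper to every column while keeping the data frame structure.
results[] <- lapply(results, to_numeric)
str(results)
```

After the conversion, these columns would also show up in a `summary()` of the numeric columns.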
## [1] "Number of columns 36"
## [1] "Number of rows 9160"
OK, now we know the number of rows and columns, but let’s see whether a summary of the numeric columns provides us with further insights.
## grid positionOrder points laps statusId
## Min. : 0 Min. : 1.00 Min. : 0.000 Min. : 0.0 Min. : 1.000
## 1st Qu.: 6 1st Qu.: 6.00 1st Qu.: 0.000 1st Qu.:49.0 1st Qu.: 1.000
## Median :11 Median :11.00 Median : 0.000 Median :56.0 Median : 4.000
## Mean :11 Mean :11.13 Mean : 3.535 Mean :51.8 Mean : 9.255
## 3rd Qu.:16 3rd Qu.:16.00 3rd Qu.: 5.000 3rd Qu.:66.0 3rd Qu.: 11.000
## Max. :24 Max. :24.00 Max. :50.000 Max. :87.0 Max. :141.000
## year round
## Min. :2000 Min. : 1.000
## 1st Qu.:2006 1st Qu.: 5.000
## Median :2012 Median :10.000
## Mean :2011 Mean : 9.966
## 3rd Qu.:2017 3rd Qu.:15.000
## Max. :2022 Max. :22.000
What we can see is the distribution of the rankings in the columns “grid” and “positionOrder”. The best-placed driver takes position 1 (min = 1; the 0 in “grid” does not denote a regular grid slot but marks a driver who did not take up a regular grid position, for example because of a pit-lane start). Additionally, we can see that the number of laps differs between races, with a maximum of 87 laps in one race. The year column confirms that we are looking at the desired time frame of roughly the last two decades, more specifically 2000 to 2022.
Next, we want to look at which Grand Prix locations are used frequently.
It’s interesting to see that a few Grands Prix have been held only very rarely (e.g. the Miami GP, held for the first time in 2022), while others have been a regular race location since the start of F1 in 1950 (e.g. the British GP at Silverstone, one of the most traditional race tracks).
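Counting how often a Grand Prix was held needs some care, because the dataset has one row per driver and race. A minimal sketch with hypothetical toy values, using the column names `raceId`, `year` and `gp` from the `str()` output:

```r
# Toy results table: one row per driver and race (hypothetical values).
results <- data.frame(
  raceId = c(1, 1, 2, 2, 3, 3, 4),
  year   = c(2021, 2021, 2021, 2021, 2022, 2022, 2022),
  gp     = c("British Grand Prix", "British Grand Prix",
             "Italian Grand Prix", "Italian Grand Prix",
             "British Grand Prix", "British Grand Prix",
             "Miami Grand Prix"),
  stringsAsFactors = FALSE
)

# Keep one row per race so that each Grand Prix is counted once per season,
# not once per driver.
races <- unique(results[, c("raceId", "year", "gp")])

# Number of times each Grand Prix was held, most frequent first.
gp_counts <- sort(table(races$gp), decreasing = TRUE)
print(gp_counts)
```

Without the `unique()` step, a race with 20 starters would be counted 20 times.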
Next, we want to look at whether drivers finish their races or how often they fail to do so, for example because of a crash. We’ll look into this again further down, but let’s have a first glimpse.
As we can see, most of the cars actually make it to the finish line, which contradicts the image of F1 held by some spectators who basically watch the races to see crashes.
Having taken these first sneak peeks into the data, let’s jump into how the sport has changed since 2000.
In this chapter we look at how F1 has changed over the last 20 years. We explore where races have taken place and how this has changed, as well as where drivers and teams have come from and when. We end with a conclusion on what this development means for the sport and where these decisions might stem from.
First, we look at the locations where GPs have taken place over the years:
Before going into detail on the changes we can see, it is important to mention that both the GP locations on the F1 calendar and the number of races can change every year.
As we can see, at the beginning of the millennium most races were held in Europe, while a few were spread across Asia, alongside one traditional race in Australia, two races in the US and one race in South America. Over time, F1 has gained more and more interest. Especially in the last three years, thanks to a Netflix series on the sport, viewership has skyrocketed, race tickets are mostly sold out and the races are mostly broadcast on private TV stations. With this development, F1 had to find ways to grow further and reach new target groups. It did so by increasing the number of races (17 in 2000, 22 in 2022) and by distributing the additional races more evenly across continents (most of the European races have been kept).
With the internationalization of the race locations and the increasing interest in the sport all over the world, the teams and drivers have also become more diverse in terms of their nationalities.
As can be seen in the animation, the constructors in the early 2000s were mostly based in Europe, which changed towards the end of the decade. Nevertheless, by the mid-2010s the general conclusion was that most of the sport’s knowledge still lies within Europe, which is why by the end of that decade almost all teams had their hub back in Europe (the exception being the Haas F1 team, which is based in the US).
When looking at the drivers’ places of origin, we want to look even further back than the last 20 years and compare the period from the founding of F1 in 1950 until 1999 with the period after the turn of the millennium:
Even allowing for the fact that the world has become more globalized in the last two decades, the diversification of the drivers’ countries of origin is worth mentioning. Whereas before the turn of the millennium North America, South America and Europe accounted for most of the drivers, the Asian continent has provided quite a few drivers over the last two decades. As previously mentioned, this goes hand in hand with the globalization strategy of F1 as a whole. An additional point that comes with the more diverse driver field is the sponsors’ countries of origin. It is no secret that F1 is an extremely expensive sport and that money is urgently needed in order to be successful. It therefore makes sense to bring in a driver with a nationality that is not already present in one of the other teams, as this attracts attention from the driver’s “home market”, so more funding can be expected.
To see which origins have been most prominently represented in the sport over the last 20 years, we look at the first and last names of the drivers and check whether there is a recognizable pattern.
While the most frequent first names appear to come from French- or Spanish-speaking countries (e.g. Sébastien, Charles, Esteban, Felipe or Pedro), something interesting can be detected when looking at the last names. Two last names have been most present in the sport over the last two decades: Verstappen and Schumacher. The reason is that the brothers Michael and Ralf Schumacher drove at the same time in the early 2000s, whereas Michael’s son Mick drove for Haas F1 until 2022 (in 2023 he is the reserve driver at Mercedes). A similar story applies to the Verstappen family: Jos Verstappen drove in F1 until 2003, whereas Max currently drives for Red Bull and won the World Championship in 2021 and 2022.
To provide a quantitative summary of what we have discussed so far, we show below how the number of drivers, constructors and races has changed in the observed time frame.
As mentioned before, F1 aims to become even more global than it already is, which is why we can see a clear upward trend in the number of races held per year over the last two decades.
On the other hand, the number of teams on the grid has not changed much in the last 20 years. There were always between 10 and 12 teams in a season, and as each team usually enters two drivers in every race for the entire season (unless there are special circumstances such as financial trouble or driver substitutions), the mean number of drivers per race moves largely in parallel with the graph for the constructors.
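Yearly counts of this kind can be derived from the results table by counting distinct IDs per season. The sketch below uses hypothetical toy values and a hypothetical helper `count_unique`; it counts distinct drivers per season rather than the mean number of drivers per race described above:

```r
# Toy results table: one row per driver and race (hypothetical values).
results <- data.frame(
  year          = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001),
  raceId        = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  driverId      = c(10, 11, 10, 11, 10, 12, 10, 12, 10, 13),
  constructorId = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

# Hypothetical helper: number of distinct values in a vector.
count_unique <- function(x) length(unique(x))

# Distinct races, drivers and constructors per season.
per_year <- aggregate(
  cbind(races = raceId, drivers = driverId, constructors = constructorId) ~ year,
  data = results,
  FUN  = count_unique
)
print(per_year)
```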
The last change in the sport we want to point out, before moving to a direct comparison of two drivers who have been active for most of the highlighted time frame, is the safety and reliability of the cars and how it has developed over the years. While a lot of people would assume that this has improved greatly since the start of the sport in the middle of the last century, most people probably wouldn’t expect a big change over the last two decades.
The graph shows two curves that describe how car reliability in the sport has developed over the last 20 years. The number of cars that finished races increased from roughly 230 in 2000 to roughly 380 in 2022. Logically, the number of cars that had to be retired during races followed the opposite trend, going from just under 150 in 2000 to around 60 in 2022. Part of the increase in finishers comes from the longer calendar (17 races in 2000, 22 in 2022), but the drop in retirements despite more races points to a real gain in reliability. The reason for this development is that more and more standardized components and new technologies were introduced, which have made F1 cars more reliable and reduced the number of mechanical failures. One key area of improvement has been the engine. In the early 2000s, engine failures were common in F1, but new regulations now require engines to last for multiple races. This has helped to reduce the number of engine failures and increased the reliability of the cars.
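Because the number of races per season changes, it can help to look at shares as well as absolute counts. A minimal sketch with hypothetical toy values, assuming `positionText` only takes the two values visible in the `str()` output (“Finished Race” and “Retired”):

```r
# Toy results table (hypothetical values).
results <- data.frame(
  year = c(2000, 2000, 2000, 2000, 2022, 2022, 2022, 2022),
  positionText = c("Finished Race", "Retired", "Retired", "Finished Race",
                   "Finished Race", "Finished Race", "Finished Race", "Retired"),
  stringsAsFactors = FALSE
)

# Absolute number of finished and retired cars per season.
outcome_by_year <- table(results$year, results$positionText)
print(outcome_by_year)

# Share of retirements per season (each row sums to 1).
retired_share <- prop.table(outcome_by_year, margin = 1)[, "Retired"]
print(round(retired_share, 2))
```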
In the next section of the report, we move the focus from general F1 statistics and overall numbers to a narrower view. Drivers in F1 are all highly talented individuals, as they compete in the pinnacle of motorsport. Two drivers who have been around for the majority of the observed time frame of this report are Fernando Alonso and Lewis Hamilton. Most people agree that they are two of the greatest Formula One drivers of all time. Both have achieved great success in the sport and have a wealth of experience and talent. Alonso made his debut in the early 2000s, while Hamilton joined in the middle of the decade.
To see how often they have raced in different seasons, we can look at the following bar plot:
As can be seen, Fernando Alonso started in F1 in 2001 and has competed in every year since, except 2002, 2019 and 2020. Lewis Hamilton made his debut in 2007 and has competed in every year since. In most years, both drivers drove the same number of races, except in 2016 and 2017, when Alonso drove one race fewer per year. One measure that is often used to rank how talented and impactful a driver was is the total number of points collected over their career. Let’s compare the two drivers on their all-time collected points.
## Group.1 x
## 1 Alonso 2061.0
## 2 Hamilton 4396.5
As we see in the graph, Hamilton, although he started in F1 five seasons later than Alonso, overtook Alonso in all-time career points after nine seasons. Knowing that Hamilton has won seven World Championships so far (compared to two for Alonso), it is not surprising that Hamilton has collected more points than Alonso over his career. Nevertheless, the size of the margin comes as a surprise to many: Hamilton has collected nearly 4400 points to date, whereas Alonso is just above 2000 points after his 19 F1 seasons.
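The column names `Group.1` and `x` in the output above suggest that the totals were computed with `aggregate()` and a `by` list. A minimal sketch of such a call, assuming the data frame name `Hamilton_Alonso` used in the modelling section; the point values are hypothetical toy data:

```r
# Toy subset with one row per driver and race (hypothetical values).
Hamilton_Alonso <- data.frame(
  surname = c("Alonso", "Hamilton", "Alonso", "Hamilton", "Hamilton"),
  points  = c(8, 10, 6, 25, 12.5)
)

# Sum of points per driver; with an unnamed `by` list the result
# columns are called Group.1 (driver) and x (total points).
total_points <- aggregate(Hamilton_Alonso$points,
                          by  = list(Hamilton_Alonso$surname),
                          FUN = sum)
print(total_points)
```

Naming the list element, for example `by = list(driver = Hamilton_Alonso$surname)`, would give a more readable column header.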
Looking at all-time points might not give the full picture, because a driver sometimes collects many more points than the other simply because of the car in a given season. Therefore, some people compare drivers on how they performed against each other season by season. As a next step, we compare the drivers’ points for each season in which they both competed.
In this comparison, too, Hamilton is quite clearly ahead. While he scored more points in 2008 and 2009, it was Alonso’s time to shine between 2010 and 2013, when he scored more points than Hamilton. The differences in points were not very big in these seasons, especially compared to how things developed from 2014 onwards: Hamilton collected World Championships like no one before him, while Alonso wasn’t as successful as he used to be. In the time range highlighted in this report, Alonso scored more points in 4 seasons, while Hamilton was the higher scorer in 9 seasons.
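The season-by-season comparison amounts to a cross-tabulation of points by year and driver. A minimal sketch with hypothetical toy values (the real season totals are not reproduced here):

```r
# Toy data: two races per season and driver (hypothetical values).
Hamilton_Alonso <- data.frame(
  year    = c(2008, 2008, 2008, 2008, 2010, 2010, 2010, 2010),
  surname = c("Alonso", "Hamilton", "Alonso", "Hamilton",
              "Alonso", "Hamilton", "Alonso", "Hamilton"),
  points  = c(5, 10, 6, 8, 25, 18, 15, 12)
)

# Points per season (rows) and driver (columns).
season_points <- xtabs(points ~ year + surname, data = Hamilton_Alonso)
print(season_points)

# Driver with more points in each season.
# Note: which.max() returns the first column on ties, so tied seasons
# would need separate handling.
season_leader <- colnames(season_points)[apply(season_points, 1, which.max)]
names(season_leader) <- rownames(season_points)
print(table(season_leader))
```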
As a next step, we go one layer deeper and look at who scored better in each race of the seasons in which they both competed. For that, we look at the points scored in each race and show trends with a smoothing line.
The graphs show that Hamilton started off stronger than Alonso even in his early years, but then at some point lost his momentum. In other years it was the other way around. The clearest trends are to be seen in Hamilton’s seasons since 2014, where his trend, his collected points and his rankings in each race were extremely strong, and the margin between the two is therefore remarkable. To clarify why the y-axis of the points changes after 2010: the FIA (the sport’s governing body) changed the number of points awarded per race from a maximum of 10 points for the winner before 2010 to 25 points for the winner from 2010 onwards. In 2014 the scale even reaches 50 points, because the season finale in Abu Dhabi awarded double points.
There are further measures on which drivers’ career statistics are compared. One of them is the percentage of races won out of the races driven. This is presented for each of the drivers below.
One would think it an amazing achievement that Alonso has won nearly 10% of the races he participated in. But on this metric Hamilton wins a whopping third of all the races he ever drove - an incredible number.
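The win percentage is simply the share of races with `positionOrder` equal to 1. A minimal sketch, again with hypothetical toy values rather than the real career data:

```r
# Toy data: finishing positions per driver (hypothetical values).
Hamilton_Alonso <- data.frame(
  surname       = c(rep("Alonso", 10), rep("Hamilton", 6)),
  positionOrder = c(1, 3, 5, 7, 2, 9, 4, 12, 6, 8,
                    1, 1, 2, 4, 3, 7)
)

# The mean of a logical vector is the share of TRUE values,
# here the share of races won per driver.
win_rate <- tapply(Hamilton_Alonso$positionOrder == 1,
                   Hamilton_Alonso$surname,
                   mean)
print(round(100 * win_rate, 1))
```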
When looking at how often each finishing position was reached by both drivers, the difference becomes even more striking.
It becomes evident that Hamilton has finished first much more often, second quite a bit more often and third a bit more often. Alonso reached all other positions more often, hence scoring worse than Hamilton.
After all these insights, the big question that remains untouched is to what extent the drivers really make this big difference and to what extent it is the car of the team they drove for.
Looking at the boxplot, we see some very interesting things. Let’s first look at how Alonso did for the different teams he drove for. He was quite successful at Renault and Ferrari (a bit better at Ferrari), finishing 4th or better in 50% of his races. For Alpine, McLaren and Minardi he wasn’t as successful, with median rankings of 9, 11 and 15. Hamilton only drove for two teams. While his median result at McLaren was 4th place (which is what Alonso reached at his best teams), at Mercedes he finished second or better in 50% of his races and 4th or better in 75% - another unbelievable achievement. The best comparison can be made on what each of the drivers achieved when driving for the same team. The only “team overlap” between Hamilton and Alonso is McLaren. Looking at their median results for the team (Alonso 11th, Hamilton 4th), the discussion on who has been more successful there seems settled.
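The medians read off the boxplot can also be computed directly. A minimal sketch, assuming the columns `surname`, `name` (constructor) and `positionOrder` from the dataset; the values are hypothetical toy data chosen to mirror the medians quoted above:

```r
# Toy data: finishing positions per driver and constructor (hypothetical values).
Hamilton_Alonso <- data.frame(
  surname       = c(rep("Alonso", 3), rep("Hamilton", 6)),
  name          = c(rep("McLaren", 3), rep("McLaren", 3), rep("Mercedes", 3)),
  positionOrder = c(10, 11, 13, 2, 4, 5, 1, 2, 3)
)

# Median finishing position for every driver-constructor combination.
median_position <- aggregate(positionOrder ~ surname + name,
                             data = Hamilton_Alonso,
                             FUN  = median)
print(median_position)
```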
As a last step, we do some modelling, while being aware that certain trends can already be assumed from the previous results.
We will have a look at two linear models. First, we want to see whether the driver has an impact on the finishing position. Based on the previous analysis, we can assume that there is one.
model_surname_position <- lm(positionOrder ~ surname, data = Hamilton_Alonso)
summary(model_surname_position)
##
## Call:
## lm(formula = positionOrder ~ surname, data = Hamilton_Alonso)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.494 -3.787 -1.787 2.506 19.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.4944 0.3147 26.988 < 2e-16 ***
## surnameHamilton -3.7073 0.4620 -8.024 4.62e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.955 on 666 degrees of freedom
## Multiple R-squared: 0.08815, Adjusted R-squared: 0.08678
## F-statistic: 64.39 on 1 and 666 DF, p-value: 4.622e-15
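Not surprisingly, the base case is Alonso (the first level in alphabetical order), so the intercept of 8.49 is his average finishing position. When the surname is Hamilton, the expected position is lower by 3.71 places, i.e. better, and this effect is significant (p-value < 0.05). The R-squared of 0.088 shows, however, that the driver alone explains only about 9% of the variance in finishing positions.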
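With a single two-level predictor, the coefficients of `lm()` are just group means in disguise: the intercept is the mean of the reference level and the slope is the difference between the two group means. A minimal sketch on hypothetical toy data (the data frame `toy` is not part of the original analysis):

```r
# Toy data: three finishing positions per driver (hypothetical values).
toy <- data.frame(
  surname       = c("Alonso", "Alonso", "Alonso",
                    "Hamilton", "Hamilton", "Hamilton"),
  positionOrder = c(6, 8, 10, 2, 4, 6)
)

fit <- lm(positionOrder ~ surname, data = toy)

# Group means for comparison with the model coefficients.
group_means <- tapply(toy$positionOrder, toy$surname, mean)

print(coef(fit))     # intercept = mean for Alonso, slope = Hamilton minus Alonso
print(group_means)
```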
As a next linear model we will look at whether the constructor has an impact on the number of points collected. Based on the previously seen boxplot, for some teams a significant effect can be assumed.
model_constructors_points <- lm(points ~ name, data = Hamilton_Alonso)
summary(model_constructors_points)
##
## Call:
## lm(formula = points ~ name, data = Hamilton_Alonso)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.417 -4.629 0.000 4.371 32.582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6818 1.0188 3.614 0.000324 ***
## nameFerrari 8.7140 1.2303 7.083 3.62e-12 ***
## nameMcLaren 1.9475 1.1228 1.734 0.083303 .
## nameMercedes 13.7357 1.1253 12.206 < 2e-16 ***
## nameMinardi -3.6818 1.9298 -1.908 0.056845 .
## nameRenault 0.7333 1.2119 0.605 0.545351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.758 on 662 degrees of freedom
## Multiple R-squared: 0.423, Adjusted R-squared: 0.4186
## F-statistic: 97.05 on 5 and 662 DF, p-value: < 2.2e-16
As expected, we can see significant effects of Ferrari and Mercedes on the achieved points. We have to keep in mind that the base case is Alpine: the intercept of 3.68 is the average number of points per race for Alpine, and each value in the column Estimate is the difference to that baseline. Driving for Ferrari is thus associated with about 8.7 more points per race than Alpine, and driving for Mercedes with about 13.7 more. This shows the success these teams achieved when Alonso and Hamilton drove, or still drive, for them.
With this report, we hope to have provided an insight into how F1 has changed over time and how two drivers can be compared. While the developments of F1 need no further explanation, the driver comparison should, in our opinion, conclude by valuing both drivers and not only Lewis Hamilton.
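Since all estimates are relative to the base case, choosing a different reference team changes the coefficients but not the fitted team means. A minimal sketch of this using `relevel()`, on hypothetical toy data (the data frame `toy` and its values are illustrative only):

```r
# Toy data: points per race for three constructors (hypothetical values).
toy <- data.frame(
  name   = c("Alpine", "Alpine", "Ferrari", "Ferrari", "Mercedes", "Mercedes"),
  points = c(2, 6, 10, 14, 16, 20)
)

# Default: the alphabetically first level (Alpine) is the baseline.
fit_alpine <- lm(points ~ name, data = toy)
print(coef(fit_alpine))

# Same model with Mercedes as the baseline instead.
toy$name <- relevel(factor(toy$name), ref = "Mercedes")
fit_mercedes <- lm(points ~ name, data = toy)
print(coef(fit_mercedes))
```

In both fits, the intercept plus a team’s coefficient gives that team’s mean points per race.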
In terms of on-track abilities, both Hamilton and Alonso have proven themselves to be fast, consistent, and capable of winning races and championships. Ultimately, both Lewis Hamilton and Fernando Alonso are legends of the sport and have made a huge impact on F1. Their careers and accomplishments are a testament to their talent and dedication to the sport.