My topic for discussion and statistical analysis is the NASCAR season statistics from 2007 to 2022. The variables below are listed from simple to complex, and each row of the data gives one driver's metrics for a specific year. The data comes from the official NASCAR website, and the data frame contains 1,111 observations and 20 variables.
Wins - The sum of the driver’s victories
AvgStart - The sum of the driver’s starting positions divided by the number of races
AvgFinish - The sum of the driver’s finishing positions divided by the number of races
AvgPos - The sum of the driver’s position each lap divided by the number of laps
PassDiff - The number of green flag passes minus the number of times passed under green flag
GreenFlagPasses - Number of times driver passes another car during green flag
GreenFlagPassed - Number of times driver is passed during green flag
QualityPasses - Number of passes in the top 15 while under green flag conditions by driver
PercentQualityPasses - The sum of quality passes divided by green flag passes
NumFastestLaps - Number of laps where the driver had the fastest speed on the lap
LapsInTop15 - Number of laps the driver ran in the top 15
PercentLapsInTop15 - The sum of the laps run in the top 15 divided by total laps completed
LapsLed - The sum of the laps led in a race
PercentLapsLed - The sum of the laps led divided by total laps completed
TotalLaps - The sum of the laps completed by a driver that year
DriverRating - Formula combining wins, finish, top15-finish, average running position while on lead lap, average speed under green, fastest lap, led most laps, and lead lap finish with a maximum rating of 150 points
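Several of the derived variables above follow directly from the raw counts. As a minimal sketch, the arithmetic can be written out in R. The counts below are made up for two hypothetical drivers and are not rows from the NASCAR data, and expressing the percent variables on a 0 to 100 scale is an assumption about how the dataset stores them.

```r
# Hypothetical counts for two made-up drivers (illustration only, not real data)
toy <- data.frame(
  Driver          = c("Driver A", "Driver B"),
  GreenFlagPasses = c(3000, 2500),
  GreenFlagPassed = c(2800, 2600),
  QualityPasses   = c(1500, 1000),
  LapsInTop15     = c(6000, 3000),
  TotalLaps       = c(10000, 9000)
)

# PassDiff: green flag passes minus the number of times passed under green
toy$PassDiff <- toy$GreenFlagPasses - toy$GreenFlagPassed

# Percent variables: a count divided by its base (0 to 100 scale is an assumption)
toy$PercentQualityPasses <- 100 * toy$QualityPasses / toy$GreenFlagPasses
toy$PercentLapsInTop15   <- 100 * toy$LapsInTop15 / toy$TotalLaps

toy
```

A negative PassDiff, as for the second hypothetical driver, means the driver was passed more often than they passed others.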
From research and at a glance, driver rating appears to be the most important variable. Can we find the predictors that best predict driver rating? Let us start with a simple linear regression: I will explore the relationship between one chosen variable and driver rating, and draw initial conclusions from that first model. I will then extend the model into a multiple regression to predict driver rating.
There are two steps to complete before starting the statistical analysis: loading the necessary libraries and loading the data.
The chunk below is the first step.
# library() attaches the packages used in this analysis: tidyverse, tidymodels, and plotly
library(tidyverse)
library(tidymodels)
library(plotly)
# Set the working directory to the folder that holds the data file (adjust the path for your machine)
setwd("C:/Users/Angel/OneDrive/Documents/Datasets")
The chunk below is the second step.
# read_csv() reads the CSV file from the working directory into a tibble in the global environment
nascar <- read_csv("nascar_driver_statistics.csv")
# head() shows the first six rows of the data
head(nascar)
## # A tibble: 6 × 21
## ...1 Driver Wins AvgStart AvgMidRace AvgFinish AvgPos PassDiff
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Joey Logano 2 10.9 13.5 13.9 11.9 175
## 2 2 Ross Chastain 2 14.6 13.2 14.2 12.4 94
## 3 3 Kyle Larson 2 8 14.4 13.6 12.6 268
## 4 4 Ryan Blaney 0 10.5 11.4 13.8 12.1 215
## 5 5 Denny Hamlin 2 12.2 16.5 16.1 13.8 383
## 6 6 Chase Elliott 4 11 10.9 11.9 11.2 299
## # ℹ 13 more variables: GreenFlagPasses <dbl>, GreenFlagPassed <dbl>,
## # QualityPasses <dbl>, PercentQualityPasses <dbl>, NumFastestLaps <dbl>,
## # LapsInTop15 <dbl>, PercentLapsInTop15 <dbl>, LapsLed <dbl>,
## # PercentLapsLed <dbl>, TotalLaps <dbl>, DriverRating <dbl>, Points <dbl>,
## # Year <dbl>
Comments: Which predictor variables work best in a regression model to predict DriverRating? There are two things to notice here. First, the outcome variable (response, Y) is DriverRating. Second, the first predictor variable (explanatory, X) is AvgFinish.
# The pipe "|>" passes the nascar data frame to ggplot as its data argument
nascar |>
  # Map AvgFinish to the x-axis; geom_histogram bins a continuous variable to show its distribution
  ggplot(aes(x = AvgFinish)) +
  geom_histogram(bins = 30)
Comments: The distribution of average finish looks roughly normal, but there is an outlier with an average finish below five.
# The pipe "|>" passes the nascar data frame to ggplot as its data argument
nascar |>
  # Map DriverRating to the x-axis; geom_histogram bins a continuous variable to show its distribution
  ggplot(aes(x = DriverRating)) +
  geom_histogram(bins = 30)
Comments: The distribution of driver rating looks roughly normal, but there is an outlier with a rating over 150.
# The pipe "|>" passes the nascar data frame to ggplot as its data argument
nascar |>
  # In aes(), map AvgFinish to the x-axis and DriverRating to the y-axis
  ggplot(aes(x = AvgFinish, y = DriverRating)) +
  # geom_point draws the scatter plot and geom_smooth adds a fitted "lm" regression line
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 0.8) +
  # theme_bw sets a black-and-white theme, and labs sets the axis labels, title, and caption
  theme_bw() +
  labs(x = "Average Finish", y = "Driver Rating", title = "A Scatterplot of Average Finish to Driver Rating", caption = "Source: 2017-2023 NASCAR Digital Media, LLC.")
Comments: There is a negative association between average finish and driver rating. In other words, as average finish increases (a worse finishing position), driver rating tends to decrease.
# Name the model fit1 and use lm() to regress DriverRating on AvgFinish (Y ~ X)
fit1 <- lm(data = nascar, DriverRating ~ AvgFinish)
# summary() prints the coefficients, R-squared, and p-values used to draw conclusions
summary(fit1)
##
## Call:
## lm(formula = DriverRating ~ AvgFinish, data = nascar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.228 -5.605 0.291 5.940 35.607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.68236 0.79944 149.71 <2e-16 ***
## AvgFinish -2.45106 0.02996 -81.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.059 on 1109 degrees of freedom
## Multiple R-squared: 0.8578, Adjusted R-squared: 0.8577
## F-statistic: 6692 on 1 and 1109 DF, p-value: < 2.2e-16
Comments: There are three things to consider: the linear model, the adjusted R-squared value, and the p-value.
The fitted linear model is
DriverRating = 119.682 - 2.451(AvgFinish).
The intercept says that at an average finish of zero the predicted driver rating is 119.682. No driver can average a finish of zero, so this is an extrapolation rather than a realistic prediction. The slope says that for every one-position increase in average finish, the predicted driver rating decreases by 2.451.
Roughly 86% of the variation in driver rating is explained by this model (adjusted R-squared = 0.8577).
The p-value is below 0.05, which is evidence that the relationship between average finish and driver rating is statistically significant.
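To make the slope and intercept concrete, here is a small worked sketch that plugs a few average finishes into the fitted line. The coefficients are copied from the summary output above; the three average finish values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 119.68236
slope     <- -2.45106

# Hypothetical average finishing positions (illustrative values, not from the data)
avg_finish <- c(10, 20, 30)

# Predicted driver rating from the fitted line: intercept + slope * AvgFinish
predicted_rating <- intercept + slope * avg_finish
predicted_rating
```

Each step of 10 positions lowers the predicted rating by about 24.5 points. With the fitted object in memory, predict(fit1, newdata = tibble(AvgFinish = c(10, 20, 30))) should give the same values up to rounding.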
# augment() from broom (attached by tidymodels) adds .fitted and .resid columns for each observation
augment(fit1) |>
  # In aes(), map the fitted values to the x-axis and the residuals to the y-axis
  ggplot(aes(x = .fitted, y = .resid)) +
  # geom_point draws the scatter plot and geom_hline adds a horizontal reference line at y = 0
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linewidth = 1.5) +
  # theme_bw sets a black-and-white theme, and labs sets the axis labels and title
  theme_bw() +
  labs(x = "Fitted Values", y = "Residual Values", title = "Residuals vs. Fitted Values")
Comments: The residual plot does not show an obvious pattern around the horizontal line at y = 0, which is what we are looking for.
New criteria: The first model predicts driver rating well, but there is room to improve the regression model. Through trial and error, which predictors best predict driver rating?
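# Name the model fit2 and regress DriverRating on six candidate predictors
fit2 <- lm(data = nascar, formula = DriverRating ~ AvgFinish + AvgPos + QualityPasses + LapsInTop15 + LapsLed + TotalLaps)
# summary() prints the coefficients, R-squared, and p-values for the multiple regression
summary(fit2)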
##
## Call:
## lm(formula = DriverRating ~ AvgFinish + AvgPos + QualityPasses +
## LapsInTop15 + LapsLed + TotalLaps, data = nascar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.5151 -3.0250 0.3896 2.8739 17.3802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.021e+02 1.042e+00 98.005 < 2e-16 ***
## AvgFinish -2.360e-01 4.975e-02 -4.743 2.38e-06 ***
## AvgPos -1.672e+00 6.038e-02 -27.701 < 2e-16 ***
## QualityPasses 1.297e-03 6.900e-04 1.880 0.0604 .
## LapsInTop15 1.426e-03 1.779e-04 8.017 2.75e-15 ***
## LapsLed 6.297e-03 6.142e-04 10.253 < 2e-16 ***
## TotalLaps -6.320e-05 5.176e-05 -1.221 0.2223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.431 on 1104 degrees of freedom
## Multiple R-squared: 0.9661, Adjusted R-squared: 0.966
## F-statistic: 5250 on 6 and 1104 DF, p-value: < 2.2e-16
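Comments: This is only the second model, and at first glance it looks promising. The adjusted R-squared rises from 0.8577 to 0.966 and the residual standard error falls from 9.059 to 4.431, so the added predictors explain noticeably more of the variation in driver rating. QualityPasses (p = 0.0604) and TotalLaps (p = 0.2223) are not significant at the 0.05 level, so they are candidates to drop in a further round of trial and error.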
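One way to make the trial-and-error comparison more systematic is to compare nested models by adjusted R-squared and a partial F-test. The sketch below is only an illustration on simulated stand-in data: the data frame sim, its coefficients, and the model names small and large are assumptions made for the example, and none of the numbers it prints describe the NASCAR data.

```r
# Simulated stand-in data (NOT the NASCAR data); column names mirror the report
set.seed(42)
n <- 300
AvgPos       <- runif(n, min = 5, max = 35)
AvgFinish    <- AvgPos + rnorm(n, sd = 3)
DriverRating <- 110 - 2 * AvgPos - 0.2 * AvgFinish + rnorm(n, sd = 4)
sim <- data.frame(DriverRating, AvgFinish, AvgPos)

# Nested models: the larger model adds AvgPos to the smaller one
small <- lm(DriverRating ~ AvgFinish, data = sim)
large <- lm(DriverRating ~ AvgFinish + AvgPos, data = sim)

# Adjusted R-squared penalizes extra predictors, so it can be compared across models
adj_r2 <- c(small = summary(small)$adj.r.squared,
            large = summary(large)$adj.r.squared)
adj_r2

# Partial F-test: does adding AvgPos significantly reduce the residual sum of squares?
f_test <- anova(small, large)
f_test
```

Applied to the real models, the same pattern would be anova(fit1, fit2), since fit1 is nested inside fit2.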
cor(nascar$DriverRating, nascar$AvgFinish)
## [1] -0.9261987
cor(nascar$DriverRating, nascar$AvgPos)
## [1] -0.9720197
AvgPos has the stronger correlation with DriverRating (-0.972 versus -0.926 for AvgFinish), which is the premise of the final plot.
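These correlations tie back to the regression output: in a simple linear regression with one predictor, R-squared equals the squared correlation between the predictor and the response. A short sketch using the two correlations printed above (the variable names r_finish and r_pos are just labels for this example):

```r
# Correlations copied from the cor() output
r_finish <- -0.9261987  # DriverRating vs. AvgFinish
r_pos    <- -0.9720197  # DriverRating vs. AvgPos

# In simple linear regression, R-squared is the square of the correlation
r2_finish <- r_finish^2
r2_pos    <- r_pos^2

round(c(AvgFinish = r2_finish, AvgPos = r2_pos), 4)
```

The AvgFinish value reproduces the Multiple R-squared of 0.8578 reported for fit1, and the AvgPos value suggests that a simple regression on AvgPos alone would explain about 94% of the variation in driver rating.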
# Fit the simple regression once so its coefficients can be reused for the dashed line
fit_pos <- lm(DriverRating ~ AvgPos, data = nascar)
p1 <- nascar |>
  # The text aesthetic is used by ggplotly() to show the driver's name in the hover tooltip
  ggplot(aes(x = AvgPos, y = DriverRating, color = AvgFinish, text = paste("Name:", Driver))) +
  geom_point() +
  geom_abline(slope = coef(fit_pos)[2], intercept = coef(fit_pos)[1], linetype = "dashed", color = "black") +
  scale_color_distiller(palette = "Blues") +
  theme_minimal(base_size = 12) +
  labs(x = "Average Position", y = "Driver Rating", title = "Scatterplot of Driver Rating vs. Average Position", caption = "Source: 2017-2023 NASCAR Digital Media, LLC.")
# Convert the static ggplot into an interactive plotly chart
ggplotly(p1)
From the beginning, the goal was to first build a model between average finish and driver rating. A linear model makes statistical sense for this scenario because the scatterplot shows a roughly straight-line relationship. Besides the linear model itself, the relationship can also be shown through scatter plots, correlations, and diagnostic plots, all of which are worth exploring when modeling the relationship between these two variables. The second question was which predictors best predict driver rating. This meant adjusting the model to increase the adjusted R-squared value, because we want the model to explain as much of the variation in driver rating as possible. The last scatterplot visualizes the linear model for the best single predictor, the driver's average position, with points colored by average finish. Drivers like Kevin Harvick, Martin Truex Jr., and Jimmie Johnson are among the top performers on these metrics.