Introduction (Section 1)

My topic for discussion and statistical analysis is NASCAR season statistics from 2007-2022. The variables range from simple to complex statistics, and each row displays the metrics of a driver for a specific year. The data comes from the official NASCAR website, and the data frame contains 1,111 rows of observations and 20 variables.

Variable Description

Wins - The sum of the driver’s victories

AvgStart - The sum of the driver’s starting positions divided by the number of races

AvgFinish - The sum of the driver’s finishing positions divided by the number of races

AvgPos - The sum of the driver’s position each lap divided by the number of laps

PassDiff - The number of green flag passes minus the number of times passed under green flag

GreenFlagPasses - Number of times the driver passes another car during green flag

GreenFlagPassed - Number of times driver is passed during green flag

QualityPasses - Number of passes in the top 15 while under green flag conditions by driver

PercentQualityPasses - The sum of quality passes divided by green flag passes

NumFastestLaps - Number of laps where the driver had the fastest speed on the lap

LapsInTop15 - Number of laps the driver ran in the top 15

PercentLapsInTop15 - The sum of the laps run in the top 15 divided by total laps completed

LapsLed - The sum of the laps led in a race

PercentLapsLed - The sum of the laps led divided by total laps completed

TotalLaps - The sum of the laps completed by a driver that year

DriverRating - Formula combining wins, finish, top15-finish, average running position while on lead lap, average speed under green, fastest lap, led most laps, and lead lap finish with a maximum rating of 150 points
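Several of the variables above are simple arithmetic on other columns. As a rough sketch only, the derived metrics could be computed as shown here. The season totals and object names are made up for illustration and are not taken from the data set, and the 0-100 percent scale is an assumption.

```r
# Hypothetical season totals for one driver (illustrative values only)
green_flag_passes <- 2500   # times the driver passed another car under green
green_flag_passed <- 2325   # times the driver was passed under green
quality_passes    <- 1400   # passes made while running in the top 15
laps_in_top15     <- 6200   # laps run in the top 15
total_laps        <- 9300   # laps completed that year

# PassDiff: passes made minus times passed
pass_diff <- green_flag_passes - green_flag_passed

# Percent metrics, assuming they are expressed on a 0-100 scale
percent_quality_passes <- 100 * quality_passes / green_flag_passes
percent_laps_in_top15  <- 100 * laps_in_top15 / total_laps

pass_diff
percent_quality_passes
round(percent_laps_in_top15, 1)
```

A positive PassDiff means the driver gained more positions under green than were lost.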

What questions can be asked?

Driver rating appears, both from research and at a glance, to be the most important variable. Can we find the predictors that best predict driver rating? Let us start with a simple linear regression: I will explore the relationship between one chosen variable and driver rating and draw initial conclusions from that first model. Then I will extend the model into a multiple regression to predict driver rating.

Working with the Data (Section 2)

Two steps are needed before starting the statistical analysis: loading the libraries and loading the data.

This chunk below is the first part.

# library() attaches the packages used in this report: tidyverse, tidymodels, and plotly
library(tidyverse)
library(tidymodels)
library(plotly)

# Set the working directory to the folder that contains the data file
setwd("C:/Users/Angel/OneDrive/Documents/Datasets")

This chunk below is the second part.

# The function reads csv files from your set working directory into your global environment
nascar <- read_csv("nascar_driver_statistics.csv")

# head() prints the first six rows of the data
head(nascar)

## # A tibble: 6 × 21
##    ...1 Driver         Wins AvgStart AvgMidRace AvgFinish AvgPos PassDiff
##   <dbl> <chr>         <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl>
## 1     1 Joey Logano       2     10.9       13.5      13.9   11.9      175
## 2     2 Ross Chastain     2     14.6       13.2      14.2   12.4       94
## 3     3 Kyle Larson       2      8         14.4      13.6   12.6      268
## 4     4 Ryan Blaney       0     10.5       11.4      13.8   12.1      215
## 5     5 Denny Hamlin      2     12.2       16.5      16.1   13.8      383
## 6     6 Chase Elliott     4     11         10.9      11.9   11.2      299
## # ℹ 13 more variables: GreenFlagPasses <dbl>, GreenFlagPassed <dbl>,
## #   QualityPasses <dbl>, PercentQualityPasses <dbl>, NumFastestLaps <dbl>,
## #   LapsInTop15 <dbl>, PercentLapsInTop15 <dbl>, LapsLed <dbl>,
## #   PercentLapsLed <dbl>, TotalLaps <dbl>, DriverRating <dbl>, Points <dbl>,
## #   Year <dbl>
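The tibble reports 21 columns because the first one, ...1, is an unnamed column that read_csv() named automatically. It appears to be a row index rather than one of the 20 variables. A minimal sketch of dropping such a column follows, using a small made-up data frame (toy is hypothetical and only stands in for nascar).

```r
# Toy stand-in for the loaded data (values are illustrative only)
toy <- data.frame(
  idx = 1:3,
  Driver = c("Driver A", "Driver B", "Driver C"),
  Wins = c(2, 0, 4)
)

# Rename the first column to mimic the auto-named index column
names(toy)[1] <- "...1"

# Keep every column except the index column
toy_clean <- toy[, names(toy) != "...1", drop = FALSE]

names(toy_clean)
dim(toy_clean)
```

The same subsetting expression can be applied to nascar if the index column is not wanted in later steps.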

Comments: Which predictor variables work best in a regression model to predict DriverRating? There are two things to notice here. First, the outcome variable (response, Y) is DriverRating. Second, the first predictor variable (explanatory, X) is AvgFinish.

What does the distribution of average finish look like?

# "|>" another way of saying "pipe", basically a mechanism to inform ggplot of what data to use
nascar |>
  
# aes() maps AvgFinish to the x-axis; geom_bar() draws one bar per distinct value, with height equal to its count
  ggplot(aes(x = AvgFinish)) +
  geom_bar()

Comments: The distribution looks roughly normal, but there is an outlier below five.

What does the distribution of driver rating look like?

# "|>" another way of saying "pipe", basically a mechanism to inform ggplot of what data to use
nascar |>
  
# aes() maps DriverRating to the x-axis; geom_bar() draws one bar per distinct value, with height equal to its count
  ggplot(aes(x = DriverRating)) +
  geom_bar()

Comments: The distribution looks roughly normal, but there is an outlier above 150, which is notable given the 150-point maximum in the rating formula.
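One way to look more closely at a value like this is to pull out the rows that exceed the stated maximum. The sketch below uses a made-up data frame (ratings and its values are hypothetical) and a cutoff taken from the variable description.

```r
# Hypothetical ratings (illustrative values only, not from the data set)
ratings <- data.frame(
  Driver = c("Driver A", "Driver B", "Driver C", "Driver D"),
  DriverRating = c(98.4, 151.2, 72.0, 110.5)
)

# Maximum rating according to the variable description
max_rating <- 150

# Keep only the rows whose rating exceeds the stated maximum
over_cap <- ratings[ratings$DriverRating > max_rating, ]

over_cap
nrow(over_cap)
```

On the real data the equivalent check would be nascar |> filter(DriverRating > 150), which shows which driver and year the outlier belongs to.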

Scatter plot to see relationship between average finish and driver rating

# "|>" another way of saying "pipe", basically a mechanism to inform ggplot of what data to use
nascar |>

# In ggplot, the function "aes" or aesthetics, set x to avgfinish and y to driverrating
  ggplot(aes(x = AvgFinish, y = DriverRating)) +

# geom_point() draws the scatter plot and geom_smooth(method = "lm") adds a least-squares regression line
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = .8) +

# The first function sets the background to black and white, meanwhile the second function sets labels to your axis and title  
  theme_bw() +
  labs(x = "Average Finish", y = "Driver Rating", title = "A Scatterplot of Average Finish to Driver Rating", caption = "Source: 2017-2023 NASCAR Digital Media, LLC.")

Comments: There is a negative association between average finish and driver rating. In other words, as average finish increases (a worse finishing position), driver rating tends to decrease.

Building a linear model

# Name the model fit1 and use lm() to regress DriverRating on AvgFinish (Y ~ X)
fit1 <- lm(data = nascar, DriverRating ~ AvgFinish)

# This function allows us to see the statistics and make conclusions/observations
summary(fit1)

## 
## Call:
## lm(formula = DriverRating ~ AvgFinish, data = nascar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.228  -5.605   0.291   5.940  35.607 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.68236    0.79944  149.71   <2e-16 ***
## AvgFinish    -2.45106    0.02996  -81.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.059 on 1109 degrees of freedom
## Multiple R-squared:  0.8578, Adjusted R-squared:  0.8577 
## F-statistic:  6692 on 1 and 1109 DF,  p-value: < 2.2e-16

Comments: There are three things to consider: the linear model, the adjusted R-squared value, and the p-value.

Written out, the linear model is DriverRating = 119.682 - 2.451(AvgFinish). The intercept says that when average finish is zero, the predicted driver rating is 119.682; since finishing positions start at 1, this is an extrapolation rather than a realistic case. The slope says that for each one-position increase in average finish, the predicted driver rating decreases by 2.451 points.
Roughly 86% of the variation in driver rating is explained by this model.
The p-value is below 0.05, which is evidence that the relationship between average finish and driver rating is statistically significant.
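To make the fitted equation concrete, here is a small sketch that plugs a few average finishes into it. The coefficients are copied from the summary(fit1) output; predict_rating and the example inputs are hypothetical names and values chosen for illustration.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 119.68236
slope <- -2.45106

# Hypothetical helper: predicted driver rating for a given average finish
predict_rating <- function(avg_finish) {
  intercept + slope * avg_finish
}

# Predicted ratings for average finishes of 5th, 10th, and 20th
predicted <- predict_rating(c(5, 10, 20))
predicted

# Going from 10th to 20th lowers the prediction by 10 times the slope
predicted[2] - predicted[3]
```

With the fitted object available, predict(fit1, newdata = data.frame(AvgFinish = c(5, 10, 20))) should give the same values up to rounding of the coefficients.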

Compare and contrast with a residual plot

# "|>" another way of saying "pipe", basically a mechanism to inform ggplot of what data to use
fit1 |>

# In ggplot, the function "aes" or aesthetics, set x to .fitted and y to .resid (Fitted to Residual values)  
  ggplot(aes(x = .fitted, y = .resid)) +

# The first function allows us to visualize points on a scatter plot and the second function adds a horizontal line at y = 0  
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linewidth = 1.5) +
 
# The first function sets the background to black and white, and the second function sets labels to your axis and title  
  theme_bw() +
  labs(x = "Fitted Values", y = "Residual Values", title = "Residuals vs. Fitted Values")

Comments: The residual plot does not show an obvious pattern around the horizontal line at y = 0, which is what we are looking for.
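For readers unfamiliar with the two plotted quantities, the sketch below shows what they are: a fitted value is the model's prediction for an observation, and a residual is the observed value minus that prediction. It uses simulated data, so x, y, toy_fit, and the simulation settings are all assumptions and not the NASCAR data.

```r
# Simulated data that loosely resembles a negative linear relationship
set.seed(1)
x <- runif(50, min = 1, max = 40)
y <- 120 - 2.5 * x + rnorm(50, sd = 9)

# Fit a simple linear model to the simulated data
toy_fit <- lm(y ~ x)

# Fitted values are the model's predictions for each observation
fitted_vals <- fitted(toy_fit)

# Residuals computed by hand: observed minus fitted
resid_by_hand <- y - fitted_vals

# They match what residuals() returns
all.equal(unname(resid_by_hand), unname(residuals(toy_fit)))

# With an intercept in the model, least-squares residuals average to zero
abs(mean(resid_by_hand)) < 1e-8
```

Because the residuals always average to zero for a model with an intercept, the plot is read for shape (curvature or a funnel) rather than for its overall level.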

What if we add more predictor variables to our model?

New criteria: We know that the first model predicts driver rating well, but there is room to improve the regression model. Through trial and error, which predictors best predict driver rating?

fit2 <- lm(data = nascar, formula = DriverRating ~ AvgFinish + AvgPos + QualityPasses + LapsInTop15 + LapsLed + TotalLaps)

summary(fit2)

## 
## Call:
## lm(formula = DriverRating ~ AvgFinish + AvgPos + QualityPasses + 
##     LapsInTop15 + LapsLed + TotalLaps, data = nascar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5151  -3.0250   0.3896   2.8739  17.3802 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.021e+02  1.042e+00  98.005  < 2e-16 ***
## AvgFinish     -2.360e-01  4.975e-02  -4.743 2.38e-06 ***
## AvgPos        -1.672e+00  6.038e-02 -27.701  < 2e-16 ***
## QualityPasses  1.297e-03  6.900e-04   1.880   0.0604 .  
## LapsInTop15    1.426e-03  1.779e-04   8.017 2.75e-15 ***
## LapsLed        6.297e-03  6.142e-04  10.253  < 2e-16 ***
## TotalLaps     -6.320e-05  5.176e-05  -1.221   0.2223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.431 on 1104 degrees of freedom
## Multiple R-squared:  0.9661, Adjusted R-squared:  0.966 
## F-statistic:  5250 on 6 and 1104 DF,  p-value: < 2.2e-16

Comments: This is only the second model, and at first glance it looks promising. Every predictor except QualityPasses and TotalLaps is significant at the 0.05 level, and the residual standard error dropped from 9.059 to 4.431.

The adjusted R-squared value increased from 0.8577 to 0.966, which suggests this model fits better than the former.
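Adjusted R-squared penalizes R-squared for the number of predictors, which is why it is the fairer number to compare across the two models. The sketch below applies the standard formula to the values printed in the two summaries. adjusted_r2 is a hypothetical helper name, and the inputs are the rounded Multiple R-squared values, so the results agree only to printed precision.

```r
# Hypothetical helper implementing the standard adjusted R-squared formula
# r2: multiple R-squared, n: number of observations, p: number of predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# 1,111 observations, as stated in the introduction
n <- 1111

# First model: one predictor; second model: six predictors
adj_fit1 <- adjusted_r2(r2 = 0.8578, n = n, p = 1)
adj_fit2 <- adjusted_r2(r2 = 0.9661, n = n, p = 6)

round(adj_fit1, 4)
round(adj_fit2, 3)
```

The denominators n - p - 1 are 1109 and 1104, matching the residual degrees of freedom reported in the two summaries.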

cor(nascar$DriverRating, nascar$AvgFinish)

## [1] -0.9261987

cor(nascar$DriverRating, nascar$AvgPos)

## [1] -0.9720197

AvgPos has a stronger correlation with DriverRating than AvgFinish does (-0.972 versus -0.926), which will be the premise of the final plot.
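In a simple linear regression with one predictor, R-squared equals the squared correlation between the predictor and the response. The sketch below squares the two correlations printed above to compare the single-predictor fits; the variable names are hypothetical.

```r
# Correlations copied from the cor() output
r_finish <- -0.9261987   # DriverRating vs. AvgFinish
r_pos    <- -0.9720197   # DriverRating vs. AvgPos

# Squared correlation = R-squared of the matching one-predictor model
r2_finish <- r_finish^2
r2_pos    <- r_pos^2

round(r2_finish, 4)
round(r2_pos, 4)
```

The first value reproduces the Multiple R-squared of the first model, and the second suggests AvgPos alone would explain roughly 94% of the variation in driver rating.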

# Fit the single-predictor model once so its coefficients can be reused for the reference line
fit_pos <- lm(DriverRating ~ AvgPos, data = nascar)

# Color points by AvgFinish; the "text" aesthetic supplies the driver's name to the plotly tooltip
p1 <- nascar |>
  ggplot(aes(x = AvgPos, y = DriverRating, color = AvgFinish, text = paste("Name:", Driver))) +
  geom_point() +
  scale_color_distiller(palette = "Blues") +
  theme_minimal(base_size = 12) +
  labs(x = "Average Position", y = "Driver Rating", title = "Scatterplot of Driver Rating vs. Average Position", caption = "Source: 2017-2023 NASCAR Digital Media, LLC.")

# Add the fitted regression line as a dashed black line
p1 <- p1 + geom_abline(slope = coef(fit_pos)[2], intercept = coef(fit_pos)[1], linetype = "dashed", color = "black")

# Convert the static ggplot into an interactive plotly figure
ggplotly(p1)

Executive Summary (Section 3)

From the beginning, the goal was first to build a model relating average finish to driver rating. A linear model makes statistical sense for this scenario. Besides the model itself, the relationship can be shown through scatter plots, correlations, and diagnostic plots, all of which are worth exploring when modeling the relationship between two variables. The second question was which predictors best predict driver rating. This meant adjusting the model to increase the adjusted R-squared value, because we want the model to explain as much of the variation in driver rating as possible. The last scatterplot visualizes the linear model for the strongest single predictor, the driver's average position, with points colored by average finish. Drivers like Kevin Harvick, Martin Truex Jr., and Jimmie Johnson rank among the top on these performance metrics.

Final Project

A. Diaz-Nova

2023-12-11