Intro to Linear Regression: First Base Hitting Stats

My comments:
Here I am reading firstbasestats.csv into RStudio and displaying the internal structure of the resulting data frame, firstbase.

# Read in data
firstbase = read.csv("firstbasestats.csv")
str(firstbase)
## 'data.frame':    23 obs. of  15 variables:
##  $ Player            : chr  "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
##  $ Pos               : chr  "1B" "1B" "1B" "1B" ...
##  $ Team              : chr  "LAD" "CHW" "TEX" "STL" ...
##  $ GP                : int  159 157 157 151 160 140 160 145 146 143 ...
##  $ AB                : int  612 601 593 561 638 551 583 555 545 519 ...
##  $ H                 : int  199 183 179 178 175 152 141 139 132 124 ...
##  $ X2B               : int  47 40 26 41 35 27 25 28 40 23 ...
##  $ HR                : int  21 15 27 35 32 20 36 22 8 18 ...
##  $ RBI               : int  100 75 76 115 97 84 94 85 53 63 ...
##  $ AVG               : num  0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
##  $ OBP               : num  0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
##  $ SLG               : num  0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
##  $ OPS               : num  0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
##  $ WAR               : num  5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
##  $ Payroll.Salary2023: num  27000000 19500000 4050000 26000000 14500000 ...

My comments:
Here we keep all the columns except Pos and save the result back into the firstbase data frame.

# Drop the 'Pos' column
firstbase <- firstbase[, !(names(firstbase) %in% c("Pos"))]
str(firstbase)
## 'data.frame':    23 obs. of  14 variables:
##  $ Player            : chr  "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
##  $ Team              : chr  "LAD" "CHW" "TEX" "STL" ...
##  $ GP                : int  159 157 157 151 160 140 160 145 146 143 ...
##  $ AB                : int  612 601 593 561 638 551 583 555 545 519 ...
##  $ H                 : int  199 183 179 178 175 152 141 139 132 124 ...
##  $ X2B               : int  47 40 26 41 35 27 25 28 40 23 ...
##  $ HR                : int  21 15 27 35 32 20 36 22 8 18 ...
##  $ RBI               : int  100 75 76 115 97 84 94 85 53 63 ...
##  $ AVG               : num  0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
##  $ OBP               : num  0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
##  $ SLG               : num  0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
##  $ OPS               : num  0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
##  $ WAR               : num  5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
##  $ Payroll.Salary2023: num  27000000 19500000 4050000 26000000 14500000 ...
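
My comments:
As a side note, base R has a one-line equivalent for dropping a single column: assigning NULL removes it in place. A minimal alternative (not run here):

# Equivalent one-liner: assigning NULL drops the 'Pos' column in place
# firstbase$Pos <- NULL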

My comments:
The summary() function provides a concise overview of the main statistical properties of firstbase. Applied to our dataset, it summarizes each column: basic descriptive statistics (minimum, quartiles, mean, and maximum) for numeric columns, and length/class/mode information for character columns (frequency counts would appear for factor columns).

summary(firstbase)
##     Player              Team                 GP              AB       
##  Length:23          Length:23          Min.   :  5.0   Min.   : 14.0  
##  Class :character   Class :character   1st Qu.:105.5   1st Qu.:309.0  
##  Mode  :character   Mode  :character   Median :131.0   Median :465.0  
##                                        Mean   :120.2   Mean   :426.9  
##                                        3rd Qu.:152.0   3rd Qu.:558.0  
##                                        Max.   :160.0   Max.   :638.0  
##        H              X2B              HR             RBI        
##  Min.   :  3.0   Min.   : 1.00   Min.   : 0.00   Min.   :  1.00  
##  1st Qu.: 74.5   1st Qu.:13.50   1st Qu.: 8.00   1st Qu.: 27.00  
##  Median :115.0   Median :23.00   Median :18.00   Median : 63.00  
##  Mean   :110.0   Mean   :22.39   Mean   :17.09   Mean   : 59.43  
##  3rd Qu.:146.5   3rd Qu.:28.00   3rd Qu.:24.50   3rd Qu.: 84.50  
##  Max.   :199.0   Max.   :47.00   Max.   :36.00   Max.   :115.00  
##       AVG              OBP              SLG              OPS        
##  Min.   :0.2020   Min.   :0.2140   Min.   :0.2860   Min.   :0.5000  
##  1st Qu.:0.2180   1st Qu.:0.3030   1st Qu.:0.3505   1st Qu.:0.6445  
##  Median :0.2420   Median :0.3210   Median :0.4230   Median :0.7290  
##  Mean   :0.2499   Mean   :0.3242   Mean   :0.4106   Mean   :0.7346  
##  3rd Qu.:0.2750   3rd Qu.:0.3395   3rd Qu.:0.4690   3rd Qu.:0.8175  
##  Max.   :0.3250   Max.   :0.4070   Max.   :0.5780   Max.   :0.9810  
##       WAR         Payroll.Salary2023
##  Min.   :-1.470   Min.   :  720000  
##  1st Qu.: 0.190   1st Qu.:  739200  
##  Median : 1.310   Median : 4050000  
##  Mean   : 1.788   Mean   : 6972743  
##  3rd Qu.: 3.140   3rd Qu.: 8150000  
##  Max.   : 7.860   Max.   :27000000
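
My comments:
One thing summary() does not report is the standard deviation. A small sketch to add it for the numeric columns (output not shown):

# Standard deviation of each numeric column, to complement summary()
sapply(firstbase[sapply(firstbase, is.numeric)], sd)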

My comments:
Here we create a variable model1 to store our fitted linear regression, using the lm() function to model the relationship between Payroll.Salary2023 and RBI in the firstbase dataset.
From the summary we can say the following: for each additional RBI, predicted payroll/salary increases by about $157,088, and this coefficient is statistically significant (p = 0.00133). The overall model is also statistically significant, with an F-test p-value of 0.001331.

# Linear Regression (one variable)
model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10250331  -5220790   -843455   2386848  13654950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -2363744    2866320  -0.825  0.41883   
## RBI           157088      42465   3.699  0.00133 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6516000 on 21 degrees of freedom
## Multiple R-squared:  0.3945, Adjusted R-squared:  0.3657 
## F-statistic: 13.68 on 1 and 21 DF,  p-value: 0.001331
# Looking at the model1 variable
model1
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)
## 
## Coefficients:
## (Intercept)          RBI  
##    -2363744       157088
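
My comments:
To make the coefficients concrete, we can compute a predicted salary by hand. For a hypothetical player with 80 RBI (an illustrative number, not a row from the dataset):

# Predicted salary = intercept + slope * RBI
coef(model1)[1] + coef(model1)[2] * 80
# -2363744 + 157088 * 80 = 10203296, i.e. about $10.2 million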

My comments:
Here we are extracting the residuals from the linear model model1. The residuals are the differences between the actual and fitted Payroll.Salary2023 values. No new data is involved: lm() computes fitted values for the same training observations it was estimated on, and the residuals are actual minus fitted (the sanity check after the output below verifies this). Positive residuals indicate under-prediction, and negative residuals indicate over-prediction.

# Model residuals (used for the Sum of Squared Errors below)
model1$residuals
##           1           2           3           4           5           6 
##  13654950.2  10082148.6  -5524939.3  10298631.2   1626214.0  -6731642.8 
##           7           8           9          10          11          12 
##  -5902522.2 -10250330.7  -4711916.8   -532796.1  -6667082.5  -6696203.1 
##          13          14          15          16          17          18 
##   7582148.6  -4916640.9  -1898125.3   -336532.3   -995042.5  -1311618.3 
##          19          20          21          22          23 
##   -843454.5   8050721.3   1250336.9   1847040.4   2926656.0
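
My comments:
To confirm that no new data is involved, we can recompute the residuals ourselves as actual minus fitted values on the training observations (a quick sanity check):

# Residuals are actual minus fitted values on the training data itself
resid_check <- firstbase$Payroll.Salary2023 - fitted(model1)
all.equal(unname(resid_check), unname(model1$residuals))  # should be TRUE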

My comments:
At 8.914926 × 10^14, this SSE looks enormous, but that is because salaries are in dollars, so the squared errors are in squared dollars. SSE is most useful for comparing models fit to the same data: a smaller SSE indicates a better fit, meaning the predictions are closer to the actual values.

SSE = sum(model1$residuals^2)
SSE
## [1] 8.914926e+14
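
My comments:
Because SSE is in squared dollars, it is easier to read after converting it back to the dollar scale. Dividing by the residual degrees of freedom and taking the square root recovers the residual standard error reported by summary(model1):

# sqrt(SSE / residual df) reproduces the residual standard error (~6,516,000)
sqrt(SSE / df.residual(model1))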

My comments:
Here we create a second model, model2, again using lm(), this time modeling Payroll.Salary2023 as a function of AVG and RBI. From the summary we can say the following: holding AVG fixed, each additional RBI is associated with an increase of about $108,850 in Payroll.Salary2023, which is statistically significant (p = 0.0388). The AVG coefficient of 74,374,031 looks huge only because AVG lives on a 0-to-1 scale; it works out to about $74,374 per point (.001) of batting average and is marginally significant (p = 0.0986). The overall model is statistically significant, with an F-test p-value of 0.001636.

# Linear Regression (two variables)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
summary(model2)
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9097952 -4621582   -33233  3016541 10260245 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -18083756    9479037  -1.908   0.0709 .
## AVG          74374031   42934155   1.732   0.0986 .
## RBI            108850      49212   2.212   0.0388 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6226000 on 20 degrees of freedom
## Multiple R-squared:  0.4735, Adjusted R-squared:  0.4209 
## F-statistic: 8.994 on 2 and 20 DF,  p-value: 0.001636
model2
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)
## 
## Coefficients:
## (Intercept)          AVG          RBI  
##   -18083756     74374031       108850

My comments:
This SSE, 7.751841 × 10^14, is smaller than model1's, indicating that model2 fits the training data better: its predictions are closer to the actual values.

# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 7.751841e+14

My comments:
Here we create a third model, model3, using lm() to model Payroll.Salary2023 as a function of HR, RBI, AVG, OBP, and OPS. From the summary() output we can say the following: the overall model is statistically significant, with an F-test p-value of 0.006951. However, the individual predictors HR, RBI, AVG, OBP, and OPS show no strong evidence of a significant effect on Payroll.Salary2023 (their p-values are high); only the intercept is statistically significant. This suggests that while the model as a whole is significant, the individual contributions of most predictors are limited or masked by multicollinearity.

# Linear Regression (all variables)
model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, data=firstbase)
summary(model3)
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, 
##     data = firstbase)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9611440 -3338119    64016  4472451  9490309 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -31107859   11738494  -2.650   0.0168 *
## HR            -341069     552069  -0.618   0.5449  
## RBI            115786     113932   1.016   0.3237  
## AVG         -63824769  104544645  -0.611   0.5496  
## OBP          27054948  131210166   0.206   0.8391  
## OPS          60181012   95415131   0.631   0.5366  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6023000 on 17 degrees of freedom
## Multiple R-squared:  0.5811, Adjusted R-squared:  0.4579 
## F-statistic: 4.717 on 5 and 17 DF,  p-value: 0.006951
model3
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, 
##     data = firstbase)
## 
## Coefficients:
## (Intercept)           HR          RBI          AVG          OBP          OPS  
##   -31107859      -341069       115786    -63824769     27054948     60181012

My comments:
This SSE, 6.167793 × 10^14, is the smallest so far, indicating that model3's predictions are closest to the actual values on the training data. That is expected: adding predictors can never increase the training SSE.

# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 6.167793e+14

My comments:
In model4 we use just four predictors, RBI, AVG, OBP, and OPS, in relation to salary.

# Remove HR
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data=firstbase)
summary(model4)
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data = firstbase)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9399551 -3573842    98921  3979339  9263512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -29466887   11235931  -2.623   0.0173 *
## RBI             71495      87015   0.822   0.4220  
## AVG         -11035457   59192453  -0.186   0.8542  
## OBP          86360720   87899074   0.982   0.3389  
## OPS           9464546   47788458   0.198   0.8452  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5919000 on 18 degrees of freedom
## Multiple R-squared:  0.5717, Adjusted R-squared:  0.4765 
## F-statistic: 6.007 on 4 and 18 DF,  p-value: 0.00298

My comments:
I am changing the code to remove two columns instead of three because I already removed the Pos column; this drops the remaining non-numeric columns, Player and Team.

firstbase <- firstbase[, -(1:2)]  # drop Player and Team (columns 1 and 2)
str(firstbase)
## 'data.frame':    23 obs. of  12 variables:
##  $ GP                : int  159 157 157 151 160 140 160 145 146 143 ...
##  $ AB                : int  612 601 593 561 638 551 583 555 545 519 ...
##  $ H                 : int  199 183 179 178 175 152 141 139 132 124 ...
##  $ X2B               : int  47 40 26 41 35 27 25 28 40 23 ...
##  $ HR                : int  21 15 27 35 32 20 36 22 8 18 ...
##  $ RBI               : int  100 75 76 115 97 84 94 85 53 63 ...
##  $ AVG               : num  0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
##  $ OBP               : num  0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
##  $ SLG               : num  0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
##  $ OPS               : num  0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
##  $ WAR               : num  5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
##  $ Payroll.Salary2023: num  27000000 19500000 4050000 26000000 14500000 ...

My comments:
The correlation coefficient between runs batted in (RBI) and 2023 payroll/salary in the firstbase dataset is approximately 0.6281, indicating a moderate-to-strong positive linear relationship between the two variables.

# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
## [1] 0.6281239
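
My comments:
To check whether this correlation is statistically distinguishable from zero, cor.test() gives a p-value and confidence interval (output not shown):

# Significance test for the RBI/salary correlation
cor.test(firstbase$RBI, firstbase$Payroll.Salary2023)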

My comments:
These two predictors are highly correlated (r ≈ 0.80) and should not be used together in a linear regression model.

cor(firstbase$AVG, firstbase$OBP)
## [1] 0.8028894

My comments:
Looking at the correlation matrix gives a good idea of which predictors to remove. We either drop predictors that are strongly correlated with one another, or use Principal Component Analysis (PCA) to obtain a set of uncorrelated components (a minimal prcomp() sketch follows the matrix below).

cor(firstbase)
##                           GP        AB         H       X2B        HR       RBI
## GP                 1.0000000 0.9779421 0.9056508 0.8446267 0.7432552 0.8813917
## AB                 0.9779421 1.0000000 0.9516701 0.8924632 0.7721339 0.9125839
## H                  0.9056508 0.9516701 1.0000000 0.9308318 0.7155225 0.9068893
## X2B                0.8446267 0.8924632 0.9308318 1.0000000 0.5889699 0.8485911
## HR                 0.7432552 0.7721339 0.7155225 0.5889699 1.0000000 0.8929048
## RBI                0.8813917 0.9125839 0.9068893 0.8485911 0.8929048 1.0000000
## AVG                0.4430808 0.5126292 0.7393167 0.6613085 0.3444242 0.5658479
## OBP                0.4841583 0.5026125 0.6560021 0.5466537 0.4603408 0.5704463
## SLG                0.6875270 0.7471949 0.8211406 0.7211259 0.8681501 0.8824090
## OPS                0.6504483 0.6980141 0.8069779 0.6966830 0.7638721 0.8156612
## WAR                0.5645243 0.6211558 0.7688712 0.6757470 0.6897677 0.7885666
## Payroll.Salary2023 0.4614889 0.5018820 0.6249911 0.6450730 0.5317619 0.6281239
##                          AVG       OBP       SLG       OPS       WAR
## GP                 0.4430808 0.4841583 0.6875270 0.6504483 0.5645243
## AB                 0.5126292 0.5026125 0.7471949 0.6980141 0.6211558
## H                  0.7393167 0.6560021 0.8211406 0.8069779 0.7688712
## X2B                0.6613085 0.5466537 0.7211259 0.6966830 0.6757470
## HR                 0.3444242 0.4603408 0.8681501 0.7638721 0.6897677
## RBI                0.5658479 0.5704463 0.8824090 0.8156612 0.7885666
## AVG                1.0000000 0.8028894 0.7254274 0.7989005 0.7855945
## OBP                0.8028894 1.0000000 0.7617499 0.8987390 0.7766375
## SLG                0.7254274 0.7617499 1.0000000 0.9686752 0.8611140
## OPS                0.7989005 0.8987390 0.9686752 1.0000000 0.8799893
## WAR                0.7855945 0.7766375 0.8611140 0.8799893 1.0000000
## Payroll.Salary2023 0.5871543 0.7025979 0.6974086 0.7394981 0.8086359
##                    Payroll.Salary2023
## GP                          0.4614889
## AB                          0.5018820
## H                           0.6249911
## X2B                         0.6450730
## HR                          0.5317619
## RBI                         0.6281239
## AVG                         0.5871543
## OBP                         0.7025979
## SLG                         0.6974086
## OPS                         0.7394981
## WAR                         0.8086359
## Payroll.Salary2023          1.0000000
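
My comments:
Since PCA came up above, here is a minimal prcomp() sketch (output not shown). Standardizing with scale. = TRUE matters because the stats are on very different scales:

# PCA on the predictors (excluding the salary response), standardized first
pca <- prcomp(firstbase[, names(firstbase) != "Payroll.Salary2023"], scale. = TRUE)
summary(pca)  # proportion of variance explained by each component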

My comments:
Detecting multicollinearity with VIF. Interpreting VIF values: VIF < 5 indicates low multicollinearity; VIF between 5 and 10, moderate multicollinearity; VIF > 10, high multicollinearity requiring attention. We clearly have problems here: most predictors are far above 10 (a sketch recomputing VIF for a reduced model follows the output below).

# Load necessary library
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
# Fit a multiple regression model
model <- lm(Payroll.Salary2023 ~ ., data=firstbase)
#model <- lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, data=firstbase)

# Calculate VIF
vif_values <- vif(model)
print(vif_values)
##          GP          AB           H         X2B          HR         RBI 
##    45.02486   228.90781   280.41486    34.06695   106.40164    36.88522 
##         AVG         OBP         SLG         OPS         WAR 
##    61.18817  8272.37730 26419.42759 56407.76362    13.32842
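
My comments:
For comparison, we can recompute VIF for a smaller model with just RBI and OBP. Since those two are correlated at about 0.57 in the matrix above, each VIF is 1/(1 - r^2), roughly 1.5, well below the concern threshold (output not shown):

# VIF for a reduced two-predictor model
vif(lm(Payroll.Salary2023 ~ RBI + OBP, data = firstbase))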

My comments:
Interpreting the heatmap: color intensity indicates the strength of the correlation. Darker red indicates a strong positive correlation, darker blue a strong negative correlation, and white little to no correlation. The numbers show the exact correlation coefficients. Well, as my sister says: this is a hot mess! LOL

# Set the CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Install and load the corrplot package
install.packages("corrplot")
## Installing package into 'C:/Users/krist/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'corrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\krist\AppData\Local\Temp\RtmpE38yvM\downloaded_packages
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
# Calculate the correlation matrix for the firstbase dataset
correlation_matrix <- cor(firstbase, use="complete.obs")

# Visualize the correlation matrix using a heatmap
corrplot(correlation_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black", 
         number.cex = 0.7, col = colorRampPalette(c("blue", "white", "red"))(200))

My comments:
Here we remove the AVG predictor to see how the p-values end up and how the multicollinearity behaves. We want to keep the predictors with the lowest p-values and the least multicollinearity.

# Removing AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP + OPS, data=firstbase)
summary(model5)
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI + OBP + OPS, data = firstbase)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9465449 -3411234   259746  4102864  8876798 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -29737007   10855411  -2.739    0.013 *
## RBI             72393      84646   0.855    0.403  
## OBP          82751360   83534224   0.991    0.334  
## OPS           7598051   45525575   0.167    0.869  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5767000 on 19 degrees of freedom
## Multiple R-squared:  0.5709, Adjusted R-squared:  0.5031 
## F-statistic: 8.426 on 3 and 19 DF,  p-value: 0.000913

My comments:
Here in model6 we use only two predictors: RBI and OBP. In this combination, OBP is the most statistically significant with a p-value of 0.00969, while RBI is marginally significant (p = 0.0736). model6 also has the lowest overall F-test p-value, 0.0002149, making it the most statistically significant of all the models so far. Therefore this is the best model to use for predicting the salaries of our two new players, Matt and Josh.

model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
summary(model6)
## 
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI + OBP, data = firstbase)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9045497 -3487008   139497  4084739  9190185 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -28984802    9632560  -3.009  0.00693 **
## RBI             84278      44634   1.888  0.07360 . 
## OBP          95468873   33385182   2.860  0.00969 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5625000 on 20 degrees of freedom
## Multiple R-squared:  0.5703, Adjusted R-squared:  0.5273 
## F-statistic: 13.27 on 2 and 20 DF,  p-value: 0.0002149
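
My comments:
With only 23 observations, the coefficient estimates carry real uncertainty, so it is worth looking at their confidence intervals (output not shown):

# 95% confidence intervals for the model6 coefficients
confint(model6, level = 0.95)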

My comments:
Here we read the test file firstbasestats_test.csv into a data frame called firstbaseTest, which contains 2 observations of 15 variables. Each observation is a new player with the same variables as our firstbase dataset.

# Read in test set
firstbaseTest = read.csv("firstbasestats_test.csv")
str(firstbaseTest)
## 'data.frame':    2 obs. of  15 variables:
##  $ Player            : chr  "Matt Olson" "Josh Bell"
##  $ Pos               : chr  "1B" "1B"
##  $ Team              : chr  "ATL" "SD"
##  $ GP                : int  162 156
##  $ AB                : int  616 552
##  $ H                 : int  148 147
##  $ X2B               : int  44 29
##  $ HR                : int  34 17
##  $ RBI               : int  103 71
##  $ AVG               : num  0.24 0.266
##  $ OBP               : num  0.325 0.362
##  $ SLG               : num  0.477 0.422
##  $ OPS               : num  0.802 0.784
##  $ WAR               : num  3.29 3.5
##  $ Payroll.Salary2023: num  21000000 16500000

My comments:
We are dropping the Pos column since that column has the same value for both players and it will not help with our model or prediction.

# Drop the 'Pos' column from our test dataset
firstbaseTest <- firstbaseTest[, !(names(firstbaseTest) %in% c("Pos"))]
str(firstbaseTest)
## 'data.frame':    2 obs. of  14 variables:
##  $ Player            : chr  "Matt Olson" "Josh Bell"
##  $ Team              : chr  "ATL" "SD"
##  $ GP                : int  162 156
##  $ AB                : int  616 552
##  $ H                 : int  148 147
##  $ X2B               : int  44 29
##  $ HR                : int  34 17
##  $ RBI               : int  103 71
##  $ AVG               : num  0.24 0.266
##  $ OBP               : num  0.325 0.362
##  $ SLG               : num  0.477 0.422
##  $ OPS               : num  0.802 0.784
##  $ WAR               : num  3.29 3.5
##  $ Payroll.Salary2023: num  21000000 16500000

My comments:
Here we use our best model, model6, to predict salaries for the two new players, Matt and Josh. These predictions could inform their 2024 salaries or their next contracts. According to their RBI and OBP, Matt should be paid about $10.7 million and Josh about $11.6 million. I improved the formatting of the code for better readability.

# Make test set predictions
predictTest = predict(model6, newdata=firstbaseTest)

# Combine the predictions with the Player names
predictedResults = data.frame(Player = firstbaseTest$Player, PredictedSalary = predictTest)

# Format the predicted salaries as currency
predictedResults$PredictedSalary = paste0("$", formatC(predictedResults$PredictedSalary, format = "f", big.mark = ",", digits = 2))

# Print the results
print(predictedResults)
##       Player PredictedSalary
## 1 Matt Olson  $10,723,185.80
## 2  Josh Bell  $11,558,647.39
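
My comments:
These are point predictions; with a training set of only 23 players they hide a lot of uncertainty. A short sketch adding 95% prediction intervals (output not shown):

# Prediction intervals give a plausible salary range for each player
predict(model6, newdata = firstbaseTest, interval = "prediction", level = 0.95)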

My comments:
Here we compute the out-of-sample R-squared on the test set. SSE is the sum of squared prediction errors for the two test players, and SST uses the mean training-set salary as the baseline. The result, about 0.548, means model6 explains roughly 55% of the salary variation in the test set.

SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST
## [1] 0.5477734
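
My comments:
To express the test error in dollars rather than as an R-squared, we can convert the test SSE to an RMSE (output not shown):

# Test-set RMSE: typical prediction error on the dollar scale
sqrt(SSE / nrow(firstbaseTest))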

My comments:
Here I wanted to make a summary table with key metrics such as R-squared, adjusted R-squared, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and the overall F-test p-value to show that model6 is the best model. I also learned about the AIC and BIC criteria for comparing multiple models and choosing the best one; for example, BIC is similar to AIC but applies a stronger penalty for the number of parameters in the model.

# Function to extract model metrics
get_model_summary <- function(model) {
  summary_model <- summary(model)
  p_value <- pf(summary_model$fstatistic[1], summary_model$fstatistic[2], summary_model$fstatistic[3], lower.tail = FALSE)
  
  data.frame(
    Model = deparse(substitute(model)),
    R_squared = summary_model$r.squared,
    Adjusted_R_squared = summary_model$adj.r.squared,
    AIC = AIC(model),
    BIC = BIC(model),
    P_value = p_value
  )
}

# Extract summaries for each model
model_summaries <- rbind(
  get_model_summary(model1),
  get_model_summary(model2),
  get_model_summary(model3),
  get_model_summary(model4),
  get_model_summary(model5),
  get_model_summary(model6)
)

# Print the summary table
print(model_summaries)
##         Model R_squared Adjusted_R_squared      AIC      BIC      P_value
## value  model1 0.3945396          0.3657082 790.9049 794.3114 0.0013305155
## value1 model2 0.4735310          0.4208841 789.6896 794.2316 0.0016357927
## value2 model3 0.5811122          0.4579099 790.4320 798.3805 0.0069506447
## value3 model4 0.5717074          0.4765313 788.9427 795.7557 0.0029799760
## value4 model5 0.5708804          0.5031247 786.9871 792.6646 0.0009129988
## value5 model6 0.5702513          0.5272764 785.0208 789.5628 0.0002148551
# Identify the best model based on Adjusted R-squared
best_model <- model_summaries[which.max(model_summaries$Adjusted_R_squared), ]
print(best_model)
##         Model R_squared Adjusted_R_squared      AIC      BIC      P_value
## value5 model6 0.5702513          0.5272764 785.0208 789.5628 0.0002148551

My comments:
Loading the packages I need for some visualizations and graphs.

if (!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require("reshape2")) install.packages("reshape2")
## Loading required package: reshape2
## Warning: package 'reshape2' was built under R version 4.3.3
library(ggplot2)
library(reshape2)

My comments:
I wanted to graph the p-values of the six models to show that model6 is the best model, with the lowest p-value.

# Create a list of models
models <- list(model1, model2, model3, model4, model5, model6)

# Initialize a data frame to store the metrics
model_comparison <- data.frame(
  Model = character(),
  Adj_R_Squared = numeric(),
  AIC = numeric(),
  BIC = numeric(),
  Residual_SE = numeric(),
  P_Value = numeric(),
  stringsAsFactors = FALSE
)

# Function to extract p-value from the model summary
extract_p_value <- function(model) {
  fstat <- summary(model)$fstatistic
  p_value <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
  return(p_value)
}

# Loop through each model and extract the metrics
for (i in 1:length(models)) {
  model <- models[[i]]
  model_summary <- summary(model)
  
  # Extract metrics
  adj_r_squared <- model_summary$adj.r.squared
  aic <- AIC(model)
  bic <- BIC(model)
  residual_se <- model_summary$sigma
  p_value <- extract_p_value(model)
  
  # Append the metrics to the data frame
  model_comparison <- rbind(model_comparison, data.frame(
    Model = paste("model", i, sep = ""),
    Adj_R_Squared = adj_r_squared,
    AIC = aic,
    BIC = bic,
    Residual_SE = residual_se,
    P_Value = p_value
  ))
}

# Reshape the data frame to a long format for metrics
model_comparison_long <- melt(model_comparison, id.vars = "Model", measure.vars = c("Adj_R_Squared", "AIC", "BIC", "Residual_SE"))

# Plot each metric separately
ggplot(model_comparison_long, aes(x = Model, y = value, color = variable, group = variable)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "Model Comparison Metrics", x = "Model", y = "Value", color = "Metric") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot the p-values
ggplot(model_comparison, aes(x = Model, y = P_Value, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "P-Values of the Models", x = "Model", y = "P-Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

My comments:
I combined the original dataset and the two new players so I could make some graphs. Just curious.

# Read in test set
firstbaseAll = read.csv("firstbasestats_all.csv")
str(firstbaseAll)
## 'data.frame':    25 obs. of  15 variables:
##  $ Player            : chr  "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
##  $ Pos               : chr  "1B" "1B" "1B" "1B" ...
##  $ Team              : chr  "LAD" "CHW" "TEX" "STL" ...
##  $ GP                : int  159 157 157 151 160 140 160 145 146 143 ...
##  $ AB                : int  612 601 593 561 638 551 583 555 545 519 ...
##  $ H                 : int  199 183 179 178 175 152 141 139 132 124 ...
##  $ X2B               : int  47 40 26 41 35 27 25 28 40 23 ...
##  $ HR                : int  21 15 27 35 32 20 36 22 8 18 ...
##  $ RBI               : int  100 75 76 115 97 84 94 85 53 63 ...
##  $ AVG               : num  0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
##  $ OBP               : num  0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
##  $ SLG               : num  0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
##  $ OPS               : num  0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
##  $ WAR               : num  5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
##  $ Payroll.Salary2023: num  27000000 19500000 4050000 26000000 14500000 ...

My comments:
I am dropping the Pos column from the combined dataset.

# Drop the 'Pos' column from combined dataset
firstbaseAll <- firstbaseAll[, !(names(firstbaseAll) %in% c("Pos"))]
str(firstbaseAll)
## 'data.frame':    25 obs. of  14 variables:
##  $ Player            : chr  "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
##  $ Team              : chr  "LAD" "CHW" "TEX" "STL" ...
##  $ GP                : int  159 157 157 151 160 140 160 145 146 143 ...
##  $ AB                : int  612 601 593 561 638 551 583 555 545 519 ...
##  $ H                 : int  199 183 179 178 175 152 141 139 132 124 ...
##  $ X2B               : int  47 40 26 41 35 27 25 28 40 23 ...
##  $ HR                : int  21 15 27 35 32 20 36 22 8 18 ...
##  $ RBI               : int  100 75 76 115 97 84 94 85 53 63 ...
##  $ AVG               : num  0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
##  $ OBP               : num  0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
##  $ SLG               : num  0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
##  $ OPS               : num  0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
##  $ WAR               : num  5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
##  $ Payroll.Salary2023: num  27000000 19500000 4050000 26000000 14500000 ...

My comments:
Here I wanted to graph RBI, OBP, and salary for the combined dataset, including the two new players.

# Convert Player column to factor to maintain the order in the plots
firstbaseAll$Player <- factor(firstbaseAll$Player, levels = firstbaseAll$Player)

# Plot RBI
p1 <- ggplot(firstbaseAll, aes(x = Player, y = RBI, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "RBI by Player", x = "Player", y = "RBI") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot OBP
p2 <- ggplot(firstbaseAll, aes(x = Player, y = OBP, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "OBP by Player", x = "Player", y = "OBP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot Salary2023
p3 <- ggplot(firstbaseAll, aes(x = Player, y = Payroll.Salary2023, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Payroll.Salary2023 by Player", x = "Player", y = "Payroll.Salary2023") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Print the plots
print(p1)

print(p2)

print(p3)

My comments:
Here I wanted to graph the p-values of the two most statistically significant predictors, RBI and OBP (plus the intercept), across all six models. As I understand it, the way these p-values shift from model to model reflects the multicollinearity among the predictors.

# Create a list of models
models <- list(model1, model2, model3, model4, model5, model6)

# Initialize a data frame to store p-values
model_pvalues <- data.frame(
  Model = character(),
  Coefficient = character(),
  PValue = numeric(),
  stringsAsFactors = FALSE
)

# Loop through each model and extract p-values
for (i in 1:length(models)) {
  model <- models[[i]]
  model_summary <- summary(model)
  
  # Extract p-values and coefficient names
  p_values <- coef(model_summary)[, "Pr(>|t|)"]
  coefficients <- rownames(coef(model_summary))
  
  # Filter for the desired coefficients
  desired_coefficients <- c("(Intercept)", "RBI", "OBP")
  for (j in 1:length(p_values)) {
    if (coefficients[j] %in% desired_coefficients) {
      model_pvalues <- rbind(model_pvalues, data.frame(
        Model = paste("model", i, sep = ""),
        Coefficient = coefficients[j],
        PValue = p_values[j]
      ))
    }
  }
}

# Reshape the data frame to long format
model_pvalues_long <- melt(model_pvalues, id.vars = c("Model", "Coefficient"))

# Create the line graph
ggplot(model_pvalues_long, aes(x = Coefficient, y = value, color = Model, group = Model)) +
  geom_line() +
  geom_point() +
  labs(title = "P-Values of Selected Coefficients by Model", x = "Coefficient", y = "P-Value", color = "Model") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I will say that doing visualizations in Tableau is a lot easier than in R. I will take this into consideration for my final project.

The End, Thank you!