My comments:
Here I am reading firstbasestats.csv into RStudio and then displaying
the internal structure of the dataframe called firstbase.
# Read in data
firstbase = read.csv("firstbasestats.csv")
str(firstbase)
## 'data.frame': 23 obs. of 15 variables:
## $ Player : chr "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
## $ Pos : chr "1B" "1B" "1B" "1B" ...
## $ Team : chr "LAD" "CHW" "TEX" "STL" ...
## $ GP : int 159 157 157 151 160 140 160 145 146 143 ...
## $ AB : int 612 601 593 561 638 551 583 555 545 519 ...
## $ H : int 199 183 179 178 175 152 141 139 132 124 ...
## $ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
## $ HR : int 21 15 27 35 32 20 36 22 8 18 ...
## $ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
## $ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
## $ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
## $ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
## $ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
## $ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
## $ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
My comments:
Here we are keeping all the columns except the
column Pos and calling the new dataframe firstbase.
# Drop the 'Pos' column
firstbase <- firstbase[, !(names(firstbase) %in% c("Pos"))]
str(firstbase)
## 'data.frame': 23 obs. of 14 variables:
## $ Player : chr "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
## $ Team : chr "LAD" "CHW" "TEX" "STL" ...
## $ GP : int 159 157 157 151 160 140 160 145 146 143 ...
## $ AB : int 612 601 593 561 638 551 583 555 545 519 ...
## $ H : int 199 183 179 178 175 152 141 139 132 124 ...
## $ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
## $ HR : int 21 15 27 35 32 20 36 22 8 18 ...
## $ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
## $ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
## $ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
## $ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
## $ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
## $ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
## $ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
My comments:
The summary() function here provides a concise overview of the main
statistical properties of firstbase. Applied to our dataset, it
summarizes each column: basic descriptive statistics for numeric
columns, and just the length, class, and mode for character columns
(as seen for Player and Team).
summary(firstbase)
## Player Team GP AB
## Length:23 Length:23 Min. : 5.0 Min. : 14.0
## Class :character Class :character 1st Qu.:105.5 1st Qu.:309.0
## Mode :character Mode :character Median :131.0 Median :465.0
## Mean :120.2 Mean :426.9
## 3rd Qu.:152.0 3rd Qu.:558.0
## Max. :160.0 Max. :638.0
## H X2B HR RBI
## Min. : 3.0 Min. : 1.00 Min. : 0.00 Min. : 1.00
## 1st Qu.: 74.5 1st Qu.:13.50 1st Qu.: 8.00 1st Qu.: 27.00
## Median :115.0 Median :23.00 Median :18.00 Median : 63.00
## Mean :110.0 Mean :22.39 Mean :17.09 Mean : 59.43
## 3rd Qu.:146.5 3rd Qu.:28.00 3rd Qu.:24.50 3rd Qu.: 84.50
## Max. :199.0 Max. :47.00 Max. :36.00 Max. :115.00
## AVG OBP SLG OPS
## Min. :0.2020 Min. :0.2140 Min. :0.2860 Min. :0.5000
## 1st Qu.:0.2180 1st Qu.:0.3030 1st Qu.:0.3505 1st Qu.:0.6445
## Median :0.2420 Median :0.3210 Median :0.4230 Median :0.7290
## Mean :0.2499 Mean :0.3242 Mean :0.4106 Mean :0.7346
## 3rd Qu.:0.2750 3rd Qu.:0.3395 3rd Qu.:0.4690 3rd Qu.:0.8175
## Max. :0.3250 Max. :0.4070 Max. :0.5780 Max. :0.9810
## WAR Payroll.Salary2023
## Min. :-1.470 Min. : 720000
## 1st Qu.: 0.190 1st Qu.: 739200
## Median : 1.310 Median : 4050000
## Mean : 1.788 Mean : 6972743
## 3rd Qu.: 3.140 3rd Qu.: 8150000
## Max. : 7.860 Max. :27000000
My comments:
Here we are creating a variable model1 to save the
value from our linear regression. We are also using the lm() function to
store the relationship between salary2003 and RBI from the dataset
firstbase.
From the summary we can say the following: RBI: For each additional RBI,
the payroll/salary increases by 157088 units, and this relationship is
statistically significant. The overall model is statistically
significant, as indicated by the p-value 0.001331, also indicating the
model is statistically significant
# Linear Regression (one variable)
model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)
##
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10250331 -5220790 -843455 2386848 13654950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2363744 2866320 -0.825 0.41883
## RBI 157088 42465 3.699 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6516000 on 21 degrees of freedom
## Multiple R-squared: 0.3945, Adjusted R-squared: 0.3657
## F-statistic: 13.68 on 1 and 21 DF, p-value: 0.001331
# Looking at the model1 variable
model1
##
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)
##
## Coefficients:
## (Intercept) RBI
## -2363744 157088
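As an illustration of how these coefficients are used, here is a
small sketch that plugs a hypothetical RBI value (80, not a player
from the dataset) into the fitted line:
# Sketch: predicted 2023 salary for a hypothetical player with 80
# RBI, i.e. -2363744 + 157088 * 80
predict(model1, newdata = data.frame(RBI = 80))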
My comments:
Here we are extracting the residuals from the linear model model1.
The residuals show the difference between the actual and predicted
(fitted) Payroll.Salary2023 values. At first I was not sure when we
gave the model new data to predict from, but no new data is involved:
the residuals are computed on the same training observations the
model was fit to, as the quick check after the printout below
confirms. Positive residuals indicate under-prediction, and negative
residuals indicate over-prediction.
# Sum of Squared Errors
model1$residuals
## 1 2 3 4 5 6
## 13654950.2 10082148.6 -5524939.3 10298631.2 1626214.0 -6731642.8
## 7 8 9 10 11 12
## -5902522.2 -10250330.7 -4711916.8 -532796.1 -6667082.5 -6696203.1
## 13 14 15 16 17 18
## 7582148.6 -4916640.9 -1898125.3 -336532.3 -995042.5 -1311618.3
## 19 20 21 22 23
## -843454.5 8050721.3 1250336.9 1847040.4 2926656.0
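To make this concrete, here is a minimal check (a sketch; nothing new
is predicted): the residuals are simply the observed salaries minus
the fitted values on the training data.
# Residuals = observed - fitted, computed on the training data itself
all.equal(unname(model1$residuals),
          firstbase$Payroll.Salary2023 - unname(fitted(model1)))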
My comments:
The SSE is about 8.914926 × 10^14. Because salaries are in dollars,
the squared errors are enormous in absolute terms, so the raw SSE is
only meaningful when compared across models fit to the same data: a
smaller SSE indicates a better fit, meaning the predictions are
closer to the actual values.
SSE = sum(model1$residuals^2)
SSE
## [1] 8.914926e+14
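To put this SSE on a scale-free footing, a quick sketch relates it to
the total sum of squares; the result should match the Multiple
R-squared of 0.3945 reported above.
# R-squared = 1 - SSE/SST, so SSE is meaningful relative to SST
SST = sum((firstbase$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST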
My comments:
Here we are creating a second model saved in a variable called
model2. We are again using the lm() function, this time modeling the
relationship between Payroll.Salary2023 and the predictors AVG and
RBI from the dataset firstbase.
From the summary we can say the following: each unit increase in RBI
is associated with an increase of about $108,850 in
Payroll.Salary2023, and this is statistically significant
(p = 0.0388). Each unit increase in AVG is associated with an
increase of about $74,374,031, but only with marginal significance
(p = 0.0986); note that a full one-unit increase in AVG is enormous,
since batting averages range between 0 and 1. The overall model is
statistically significant, as indicated by the F-statistic p-value of
0.001636.
# Linear Regression (two variables)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
summary(model2)
##
## Call:
## lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9097952 -4621582 -33233 3016541 10260245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18083756 9479037 -1.908 0.0709 .
## AVG 74374031 42934155 1.732 0.0986 .
## RBI 108850 49212 2.212 0.0388 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6226000 on 20 degrees of freedom
## Multiple R-squared: 0.4735, Adjusted R-squared: 0.4209
## F-statistic: 8.994 on 2 and 20 DF, p-value: 0.001636
model2
##
## Call:
## lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)
##
## Coefficients:
## (Intercept) AVG RBI
## -18083756 74374031 108850
My comments:
The SSE is about 7.751841 × 10^14, lower than model1's. A smaller SSE
indicates a better fit of the model to the data, as it means the
predictions are closer to the actual values.
# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 7.751841e+14
My comments:
Here we are creating a third model saved in a variable called model3.
We are using the lm() function to model the relationship between
Payroll.Salary2023 and HR + RBI + AVG + OBP + OPS from the dataset
firstbase. From the summary() function we can say the following: the
overall model is statistically significant, as indicated by the
F-statistic p-value of 0.006951. However, the individual predictors
HR, RBI, AVG, OBP, and OPS do not show strong evidence of a
significant impact on Payroll.Salary2023 (as suggested by their high
p-values); only the intercept is statistically significant. A model
that is significant overall while none of its predictors are
individually significant is a classic symptom of multicollinearity
among the predictors.
# Linear Regression (all variables)
model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, data=firstbase)
summary(model3)
##
## Call:
## lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS,
## data = firstbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9611440 -3338119 64016 4472451 9490309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31107859 11738494 -2.650 0.0168 *
## HR -341069 552069 -0.618 0.5449
## RBI 115786 113932 1.016 0.3237
## AVG -63824769 104544645 -0.611 0.5496
## OBP 27054948 131210166 0.206 0.8391
## OPS 60181012 95415131 0.631 0.5366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6023000 on 17 degrees of freedom
## Multiple R-squared: 0.5811, Adjusted R-squared: 0.4579
## F-statistic: 4.717 on 5 and 17 DF, p-value: 0.006951
model3
##
## Call:
## lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS,
## data = firstbase)
##
## Coefficients:
## (Intercept) HR RBI AVG OBP OPS
## -31107859 -341069 115786 -63824769 27054948 60181012
My comments:
The SSE is about 6.167793 × 10^14, the lowest so far. Keep in mind,
though, that the training SSE can only decrease as predictors are
added to a nested model, so a lower SSE by itself does not prove the
extra predictors are worthwhile.
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 6.167793e+14
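Putting the three training SSEs side by side (a quick sketch) makes
the pattern explicit; it also shows why adjusted R-squared, AIC, and
BIC are used later for a fairer comparison.
# Training SSE for each nested model; it can only shrink as
# predictors are added, so smaller is not automatically better
c(model1 = sum(model1$residuals^2),
  model2 = sum(model2$residuals^2),
  model3 = sum(model3$residuals^2))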
My comments:
In model4 we are using just 4 predictors, RBI, AVG, OBP, and OPS, in
relation to salary, dropping HR.
# Remove HR
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data=firstbase)
summary(model4)
##
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data = firstbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9399551 -3573842 98921 3979339 9263512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29466887 11235931 -2.623 0.0173 *
## RBI 71495 87015 0.822 0.4220
## AVG -11035457 59192453 -0.186 0.8542
## OBP 86360720 87899074 0.982 0.3389
## OPS 9464546 47788458 0.198 0.8452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5919000 on 18 degrees of freedom
## Multiple R-squared: 0.5717, Adjusted R-squared: 0.4765
## F-statistic: 6.007 on 4 and 18 DF, p-value: 0.00298
My comments:
I am changing the code to remove 2 columns instead of 3 because I
already removed the Pos column. This drops the remaining character
columns, Player and Team, so the dataframe is entirely numeric and
ready for cor().
firstbase <- firstbase[, -(1:2)]
str(firstbase)
## 'data.frame': 23 obs. of 12 variables:
## $ GP : int 159 157 157 151 160 140 160 145 146 143 ...
## $ AB : int 612 601 593 561 638 551 583 555 545 519 ...
## $ H : int 199 183 179 178 175 152 141 139 132 124 ...
## $ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
## $ HR : int 21 15 27 35 32 20 36 22 8 18 ...
## $ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
## $ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
## $ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
## $ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
## $ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
## $ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
## $ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
My comments:
The correlation coefficient between the number of
runs batted in (RBI) and the payroll/salary for 2023 in the firstbase
dataset, resulting in a value of approximately 0.6281. This indicates a
moderate to strong positive linear relationship between these two
variables.
# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
## [1] 0.6281239
My comments:
These 2 predictors are highly correlated (r ≈ 0.80) and should not be
used together in a linear regression model.
cor(firstbase$AVG, firstbase$OBP)
## [1] 0.8028894
My comments:
When looking at the correlation matrix we can get a good idea of
which predictors to remove. We either have to drop predictors that
are closely correlated with each other, or use Principal Component
Analysis (PCA) to get a set of uncorrelated components, as sketched
after the matrix below.
cor(firstbase)
## GP AB H X2B HR RBI
## GP 1.0000000 0.9779421 0.9056508 0.8446267 0.7432552 0.8813917
## AB 0.9779421 1.0000000 0.9516701 0.8924632 0.7721339 0.9125839
## H 0.9056508 0.9516701 1.0000000 0.9308318 0.7155225 0.9068893
## X2B 0.8446267 0.8924632 0.9308318 1.0000000 0.5889699 0.8485911
## HR 0.7432552 0.7721339 0.7155225 0.5889699 1.0000000 0.8929048
## RBI 0.8813917 0.9125839 0.9068893 0.8485911 0.8929048 1.0000000
## AVG 0.4430808 0.5126292 0.7393167 0.6613085 0.3444242 0.5658479
## OBP 0.4841583 0.5026125 0.6560021 0.5466537 0.4603408 0.5704463
## SLG 0.6875270 0.7471949 0.8211406 0.7211259 0.8681501 0.8824090
## OPS 0.6504483 0.6980141 0.8069779 0.6966830 0.7638721 0.8156612
## WAR 0.5645243 0.6211558 0.7688712 0.6757470 0.6897677 0.7885666
## Payroll.Salary2023 0.4614889 0.5018820 0.6249911 0.6450730 0.5317619 0.6281239
## AVG OBP SLG OPS WAR
## GP 0.4430808 0.4841583 0.6875270 0.6504483 0.5645243
## AB 0.5126292 0.5026125 0.7471949 0.6980141 0.6211558
## H 0.7393167 0.6560021 0.8211406 0.8069779 0.7688712
## X2B 0.6613085 0.5466537 0.7211259 0.6966830 0.6757470
## HR 0.3444242 0.4603408 0.8681501 0.7638721 0.6897677
## RBI 0.5658479 0.5704463 0.8824090 0.8156612 0.7885666
## AVG 1.0000000 0.8028894 0.7254274 0.7989005 0.7855945
## OBP 0.8028894 1.0000000 0.7617499 0.8987390 0.7766375
## SLG 0.7254274 0.7617499 1.0000000 0.9686752 0.8611140
## OPS 0.7989005 0.8987390 0.9686752 1.0000000 0.8799893
## WAR 0.7855945 0.7766375 0.8611140 0.8799893 1.0000000
## Payroll.Salary2023 0.5871543 0.7025979 0.6974086 0.7394981 0.8086359
## Payroll.Salary2023
## GP 0.4614889
## AB 0.5018820
## H 0.6249911
## X2B 0.6450730
## HR 0.5317619
## RBI 0.6281239
## AVG 0.5871543
## OBP 0.7025979
## SLG 0.6974086
## OPS 0.7394981
## WAR 0.8086359
## Payroll.Salary2023 1.0000000
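As mentioned above, PCA is one alternative to dropping correlated
predictors. Here is a minimal sketch of what that would look like on
our numeric predictors (not part of the original analysis):
# PCA on the standardized predictors (all columns except the salary);
# the resulting components are uncorrelated by construction
predictors = firstbase[, names(firstbase) != "Payroll.Salary2023"]
pca = prcomp(predictors, scale. = TRUE)
summary(pca)  # proportion of variance captured by each component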
My comments:
Detecting multicollinearity with VIF. Interpreting VIF values:
VIF < 5 means low multicollinearity; VIF between 5 and 10 means
moderate multicollinearity; VIF > 10 means high multicollinearity
requiring attention. Several of our VIFs are in the thousands, so we
clearly have a problem here.
# Load necessary library
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
# Fit a multiple regression model
model <- lm(Payroll.Salary2023 ~ ., data=firstbase)
#model <- lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS, data=firstbase)
# Calculate VIF
vif_values <- vif(model)
print(vif_values)
## GP AB H X2B HR RBI
## 45.02486 228.90781 280.41486 34.06695 106.40164 36.88522
## AVG OBP SLG OPS WAR
## 61.18817 8272.37730 26419.42759 56407.76362 13.32842
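One common remedy, sketched below rather than part of the original
analysis, is to drop the predictor with the highest VIF (OPS here),
refit, and repeat until all VIFs fall under a chosen threshold such
as 5 or 10.
# Refit without OPS (the highest VIF above) and re-check the VIFs
model_reduced = lm(Payroll.Salary2023 ~ . - OPS, data=firstbase)
vif(model_reduced)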
My comments:
Interpreting the heatmap: color intensity indicates the strength of
the correlation. Darker red indicates a strong positive correlation,
darker blue indicates a strong negative correlation, and white
indicates little to no correlation. The numbers show the exact
correlation coefficients. Well, like my sister says: this is a hot
mess! LOL
# Set the CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install and load the corrplot package
install.packages("corrplot")
## Installing package into 'C:/Users/krist/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\krist\AppData\Local\Temp\RtmpE38yvM\downloaded_packages
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
# Calculate the correlation matrix for the firstbase dataset
correlation_matrix <- cor(firstbase, use="complete.obs")
# Visualize the correlation matrix using a heatmap
corrplot(correlation_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
number.cex = 0.7, col = colorRampPalette(c("blue", "white", "red"))(200))
My comments:
Here we are removing the AVG predictor so we can see how the p-values
end up and how the multicollinearity behaves. We want to keep the
predictors with the lowest p-values and the least multicollinearity.
#Removing AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP + OPS, data=firstbase)
summary(model5)
##
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI + OBP + OPS, data = firstbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9465449 -3411234 259746 4102864 8876798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29737007 10855411 -2.739 0.013 *
## RBI 72393 84646 0.855 0.403
## OBP 82751360 83534224 0.991 0.334
## OPS 7598051 45525575 0.167 0.869
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5767000 on 19 degrees of freedom
## Multiple R-squared: 0.5709, Adjusted R-squared: 0.5031
## F-statistic: 8.426 on 3 and 19 DF, p-value: 0.000913
My comments:
Here in model6 we are only using 2 predictors: RBI and OBP. In this
combination, OBP is the most statistically significant with a p-value
of 0.00969, and RBI is marginally significant (p = 0.0736). Model6
also has the lowest overall F-statistic p-value of 0.0002149, and the
summary table later confirms it has the highest adjusted R-squared,
so it is the best model so far. Therefore this is the model we will
use to predict the salaries of our 2 new players, Matt and Josh.
model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
summary(model6)
##
## Call:
## lm(formula = Payroll.Salary2023 ~ RBI + OBP, data = firstbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9045497 -3487008 139497 4084739 9190185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28984802 9632560 -3.009 0.00693 **
## RBI 84278 44634 1.888 0.07360 .
## OBP 95468873 33385182 2.860 0.00969 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5625000 on 20 degrees of freedom
## Multiple R-squared: 0.5703, Adjusted R-squared: 0.5273
## F-statistic: 13.27 on 2 and 20 DF, p-value: 0.0002149
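To see the uncertainty around these estimates, a one-line check on
the fitted model6 gives 95% confidence intervals; intervals that
exclude zero correspond to the significant coefficients above.
# 95% confidence intervals for the model6 coefficients
confint(model6)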
My comments:
Here we are creating a variable called firstbaseTest that contains 2
observations of 15 variables, read from firstbasestats_test.csv. Each
observation is a new player with the same variables as our firstbase
dataset.
# Read in test set
firstbaseTest = read.csv("firstbasestats_test.csv")
str(firstbaseTest)
## 'data.frame': 2 obs. of 15 variables:
## $ Player : chr "Matt Olson" "Josh Bell"
## $ Pos : chr "1B" "1B"
## $ Team : chr "ATL" "SD"
## $ GP : int 162 156
## $ AB : int 616 552
## $ H : int 148 147
## $ X2B : int 44 29
## $ HR : int 34 17
## $ RBI : int 103 71
## $ AVG : num 0.24 0.266
## $ OBP : num 0.325 0.362
## $ SLG : num 0.477 0.422
## $ OPS : num 0.802 0.784
## $ WAR : num 3.29 3.5
## $ Payroll.Salary2023: num 21000000 16500000
My comments:
We are dropping the Pos column since it has the same value for both
players and will not help with our model or prediction.
# Drop the 'Pos' column from our test dataset
firstbaseTest <- firstbaseTest[, !(names(firstbaseTest) %in% c("Pos"))]
str(firstbaseTest)
## 'data.frame': 2 obs. of 14 variables:
## $ Player : chr "Matt Olson" "Josh Bell"
## $ Team : chr "ATL" "SD"
## $ GP : int 162 156
## $ AB : int 616 552
## $ H : int 148 147
## $ X2B : int 44 29
## $ HR : int 34 17
## $ RBI : int 103 71
## $ AVG : num 0.24 0.266
## $ OBP : num 0.325 0.362
## $ SLG : num 0.477 0.422
## $ OPS : num 0.802 0.784
## $ WAR : num 3.29 3.5
## $ Payroll.Salary2023: num 21000000 16500000
My comments:
Here we are using our best model, model6, to predict the salaries of
the 2 new players, Matt and Josh. These predictions could inform
their 2024 salaries or their next contracts. According to their RBI
and OBP, Matt should get paid about $10.7 million and Josh about
$11.6 million. I improved the formatting of the code for better
readability.
# Make test set predictions
predictTest = predict(model6, newdata=firstbaseTest)
# Combine the predictions with the Player names
predictedResults = data.frame(Player = firstbaseTest$Player, PredictedSalary = predictTest)
# Format the predicted salaries as currency
predictedResults$PredictedSalary = paste0("$", formatC(predictedResults$PredictedSalary, format = "f", big.mark = ",", digits = 2))
# Print the results
print(predictedResults)
## Player PredictedSalary
## 1 Matt Olson $10,723,185.80
## 2 Josh Bell $11,558,647.39
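As a sanity check (a small sketch), Matt Olson's prediction can be
reproduced by hand from the model6 coefficients and his RBI (103) and
OBP (0.325); the result should match the $10.7 million above up to
rounding of the displayed coefficients.
# Manual prediction: intercept + RBI coef * 103 + OBP coef * 0.325
b = coef(model6)
b["(Intercept)"] + b["RBI"] * 103 + b["OBP"] * 0.325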
My comments:
Here we compute the out-of-sample R-squared for model6 on the test
set: SSE is the sum of squared prediction errors for the two new
players, and SST uses the mean of the training-set salaries as the
baseline. The result of about 0.55 means the model explains roughly
55% of the test-set variance relative to that baseline, although with
only 2 test observations this number should be taken with caution.
SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST
## [1] 0.5477734
Here I wanted to make a summary table that includes key metrics such
as R-squared, adjusted R-squared, AIC (Akaike Information Criterion),
BIC (Bayesian Information Criterion), and p-value to show that model6
is the best model. I also learned about the AIC and BIC criteria for
comparing multiple models and choosing the best one. For example, BIC
is similar to AIC but introduces a stronger penalty for the number of
parameters in the model.
# Function to extract model metrics
get_model_summary <- function(model) {
summary_model <- summary(model)
p_value <- pf(summary_model$fstatistic[1], summary_model$fstatistic[2], summary_model$fstatistic[3], lower.tail = FALSE)
data.frame(
Model = deparse(substitute(model)),
R_squared = summary_model$r.squared,
Adjusted_R_squared = summary_model$adj.r.squared,
AIC = AIC(model),
BIC = BIC(model),
P_value = p_value
)
}
# Extract summaries for each model
model_summaries <- rbind(
get_model_summary(model1),
get_model_summary(model2),
get_model_summary(model3),
get_model_summary(model4),
get_model_summary(model5),
get_model_summary(model6)
)
# Print the summary table
print(model_summaries)
## Model R_squared Adjusted_R_squared AIC BIC P_value
## value model1 0.3945396 0.3657082 790.9049 794.3114 0.0013305155
## value1 model2 0.4735310 0.4208841 789.6896 794.2316 0.0016357927
## value2 model3 0.5811122 0.4579099 790.4320 798.3805 0.0069506447
## value3 model4 0.5717074 0.4765313 788.9427 795.7557 0.0029799760
## value4 model5 0.5708804 0.5031247 786.9871 792.6646 0.0009129988
## value5 model6 0.5702513 0.5272764 785.0208 789.5628 0.0002148551
# Identify the best model based on Adjusted R-squared
best_model <- model_summaries[which.max(model_summaries$Adjusted_R_squared), ]
print(best_model)
## Model R_squared Adjusted_R_squared AIC BIC P_value
## value5 model6 0.5702513 0.5272764 785.0208 789.5628 0.0002148551
My comments:
Loading the packages I need for some visualizations and graphs.
if (!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require("reshape2")) install.packages("reshape2")
## Loading required package: reshape2
## Warning: package 'reshape2' was built under R version 4.3.3
library(ggplot2)
library(reshape2)
My comments:
I wanted to graph the p-values of the 6 models to show that model6 is
the best model, with the lowest p-value.
# Create a list of models
models <- list(model1, model2, model3, model4, model5, model6)
# Initialize a data frame to store the metrics
model_comparison <- data.frame(
Model = character(),
Adj_R_Squared = numeric(),
AIC = numeric(),
BIC = numeric(),
Residual_SE = numeric(),
P_Value = numeric(),
stringsAsFactors = FALSE
)
# Function to extract p-value from the model summary
extract_p_value <- function(model) {
fstat <- summary(model)$fstatistic
p_value <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
return(p_value)
}
# Loop through each model and extract the metrics
for (i in 1:length(models)) {
model <- models[[i]]
model_summary <- summary(model)
# Extract metrics
adj_r_squared <- model_summary$adj.r.squared
aic <- AIC(model)
bic <- BIC(model)
residual_se <- model_summary$sigma
p_value <- extract_p_value(model)
# Append the metrics to the data frame
model_comparison <- rbind(model_comparison, data.frame(
Model = paste("model", i, sep = ""),
Adj_R_Squared = adj_r_squared,
AIC = aic,
BIC = bic,
Residual_SE = residual_se,
P_Value = p_value
))
}
# Reshape the data frame to a long format for metrics
model_comparison_long <- melt(model_comparison, id.vars = "Model", measure.vars = c("Adj_R_Squared", "AIC", "BIC", "Residual_SE"))
# Plot each metric separately
ggplot(model_comparison_long, aes(x = Model, y = value, color = variable, group = variable)) +
geom_line() +
geom_point() +
facet_wrap(~ variable, scales = "free_y") +
labs(title = "Model Comparison Metrics", x = "Model", y = "Value", color = "Metric") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot the p-values
ggplot(model_comparison, aes(x = Model, y = P_Value, group = 1)) +
geom_line() +
geom_point() +
labs(title = "P-Values of the Models", x = "Model", y = "P-Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
My comments:
I combined the original dataset and the 2 new
players so I could do some graphs. Just curious.
# Read in test set
firstbaseAll = read.csv("firstbasestats_all.csv")
str(firstbaseAll)
## 'data.frame': 25 obs. of 15 variables:
## $ Player : chr "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
## $ Pos : chr "1B" "1B" "1B" "1B" ...
## $ Team : chr "LAD" "CHW" "TEX" "STL" ...
## $ GP : int 159 157 157 151 160 140 160 145 146 143 ...
## $ AB : int 612 601 593 561 638 551 583 555 545 519 ...
## $ H : int 199 183 179 178 175 152 141 139 132 124 ...
## $ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
## $ HR : int 21 15 27 35 32 20 36 22 8 18 ...
## $ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
## $ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
## $ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
## $ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
## $ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
## $ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
## $ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
My comments:
I am dropping the Pos column from the combined dataset.
# Drop the 'Pos' column from combined dataset
firstbaseAll <- firstbaseAll[, !(names(firstbaseAll) %in% c("Pos"))]
str(firstbaseAll)
## 'data.frame': 25 obs. of 14 variables:
## $ Player : chr "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
## $ Team : chr "LAD" "CHW" "TEX" "STL" ...
## $ GP : int 159 157 157 151 160 140 160 145 146 143 ...
## $ AB : int 612 601 593 561 638 551 583 555 545 519 ...
## $ H : int 199 183 179 178 175 152 141 139 132 124 ...
## $ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
## $ HR : int 21 15 27 35 32 20 36 22 8 18 ...
## $ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
## $ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
## $ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
## $ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
## $ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
## $ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
## $ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
My comments:
Here I wanted to graph the RBI OBP and Salary for
the combined dataset including the 2 new players
# Convert Player column to factor to maintain the order in the plots
firstbaseAll$Player <- factor(firstbaseAll$Player, levels = firstbaseAll$Player)
# Plot RBI
p1 <- ggplot(firstbaseAll, aes(x = Player, y = RBI, group = 1)) +
geom_line() +
geom_point() +
labs(title = "RBI by Player", x = "Player", y = "RBI") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot OBP
p2 <- ggplot(firstbaseAll, aes(x = Player, y = OBP, group = 1)) +
geom_line() +
geom_point() +
labs(title = "OBP by Player", x = "Player", y = "OBP") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot Salary2023
p3 <- ggplot(firstbaseAll, aes(x = Player, y = Payroll.Salary2023, group = 1)) +
geom_line() +
geom_point() +
labs(title = "Payroll.Salary2023 by Player", x = "Player", y = "Payroll.Salary2023") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Print the plots
print(p1)
print(p2)
print(p3)
My comments:
Here I wanted to graph the p-values of the intercept and the two most
statistically significant predictors (RBI and OBP) across all 6
models. As I understand it, the way these p-values jump around from
model to model reflects the multicollinearity among the predictors.
# Create a list of models
models <- list(model1, model2, model3, model4, model5, model6)
# Initialize a data frame to store p-values
model_pvalues <- data.frame(
Model = character(),
Coefficient = character(),
PValue = numeric(),
stringsAsFactors = FALSE
)
# Loop through each model and extract p-values
for (i in 1:length(models)) {
model <- models[[i]]
model_summary <- summary(model)
# Extract p-values and coefficient names
p_values <- coef(model_summary)[, "Pr(>|t|)"]
coefficients <- rownames(coef(model_summary))
# Filter for the desired coefficients
desired_coefficients <- c("(Intercept)", "RBI", "OBP")
for (j in 1:length(p_values)) {
if (coefficients[j] %in% desired_coefficients) {
model_pvalues <- rbind(model_pvalues, data.frame(
Model = paste("model", i, sep = ""),
Coefficient = coefficients[j],
PValue = p_values[j]
))
}
}
}
# Reshape the data frame to long format
model_pvalues_long <- melt(model_pvalues, id.vars = c("Model", "Coefficient"))
# Create the line graph
ggplot(model_pvalues_long, aes(x = Coefficient, y = value, color = Model, group = Model)) +
geom_line() +
geom_point() +
labs(title = "P-Values of Selected Coefficients by Model", x = "Coefficient", y = "P-Value", color = "Model") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
I will say that doing visualizations in Tableau is a lot easier than in R. I will take this into consideration for my final project.
The End, Thank you!