# Read in data
firstbase = read.csv("Ass 6/firstbasestats.csv")
str(firstbase)
'data.frame': 23 obs. of 15 variables:
$ Player : chr "Freddie Freeman" "Jose Abreu" "Nate Lowe" "Paul Goldschmidt" ...
$ Pos : chr "1B" "1B" "1B" "1B" ...
$ Team : chr "LAD" "CHW" "TEX" "STL" ...
$ GP : int 159 157 157 151 160 140 160 145 146 143 ...
$ AB : int 612 601 593 561 638 551 583 555 545 519 ...
$ H : int 199 183 179 178 175 152 141 139 132 124 ...
$ X2B : int 47 40 26 41 35 27 25 28 40 23 ...
$ HR : int 21 15 27 35 32 20 36 22 8 18 ...
$ RBI : int 100 75 76 115 97 84 94 85 53 63 ...
$ AVG : num 0.325 0.305 0.302 0.317 0.274 0.276 0.242 0.251 0.242 0.239 ...
$ OBP : num 0.407 0.379 0.358 0.404 0.339 0.34 0.327 0.305 0.288 0.319 ...
$ SLG : num 0.511 0.446 0.492 0.578 0.48 0.437 0.477 0.423 0.36 0.391 ...
$ OPS : num 0.918 0.824 0.851 0.981 0.818 0.777 0.804 0.729 0.647 0.71 ...
$ WAR : num 5.77 4.19 3.21 7.86 3.85 3.07 5.05 1.32 -0.33 1.87 ...
$ Payroll.Salary2023: num 27000000 19500000 4050000 26000000 14500000 ...
The output shows the structure of the dataset: 23 players and 15 variables describing their performance and pay. We can use stats such as batting average (AVG), home runs (HR), and slugging percentage (SLG) to see how well each player did during the season and, most importantly, compare that performance with their salary (Payroll.Salary2023). This lets us evaluate whether a player is paid efficiently relative to what he produces on the field.
summary(firstbase)
Player Pos Team GP AB H X2B
Length:23 Length:23 Length:23 Min. : 5.0 Min. : 14.0 Min. : 3.0 Min. : 1.00
Class :character Class :character Class :character 1st Qu.:105.5 1st Qu.:309.0 1st Qu.: 74.5 1st Qu.:13.50
Mode :character Mode :character Mode :character Median :131.0 Median :465.0 Median :115.0 Median :23.00
Mean :120.2 Mean :426.9 Mean :110.0 Mean :22.39
3rd Qu.:152.0 3rd Qu.:558.0 3rd Qu.:146.5 3rd Qu.:28.00
Max. :160.0 Max. :638.0 Max. :199.0 Max. :47.00
HR RBI AVG OBP SLG OPS WAR
Min. : 0.00 Min. : 1.00 Min. :0.2020 Min. :0.2140 Min. :0.2860 Min. :0.5000 Min. :-1.470
1st Qu.: 8.00 1st Qu.: 27.00 1st Qu.:0.2180 1st Qu.:0.3030 1st Qu.:0.3505 1st Qu.:0.6445 1st Qu.: 0.190
Median :18.00 Median : 63.00 Median :0.2420 Median :0.3210 Median :0.4230 Median :0.7290 Median : 1.310
Mean :17.09 Mean : 59.43 Mean :0.2499 Mean :0.3242 Mean :0.4106 Mean :0.7346 Mean : 1.788
3rd Qu.:24.50 3rd Qu.: 84.50 3rd Qu.:0.2750 3rd Qu.:0.3395 3rd Qu.:0.4690 3rd Qu.:0.8175 3rd Qu.: 3.140
Max. :36.00 Max. :115.00 Max. :0.3250 Max. :0.4070 Max. :0.5780 Max. :0.9810 Max. : 7.860
Payroll.Salary2023
Min. : 720000
1st Qu.: 739200
Median : 4050000
Mean : 6972743
3rd Qu.: 8150000
Max. :27000000
The output above summarizes each column of the dataset and gives some first insights. For example, the maximum number of home runs (HR) is 36, while the maximum number of doubles (X2B) is higher at 47. The mean salary (6,972,743) is also well above the median (4,050,000), which suggests that a few very high salaries pull the average up. These summaries give a good sense of what we can analyze.
# Linear Regression (one variable)
model1 = lm(Payroll.Salary2023 ~ RBI, data=firstbase)
summary(model1)
Call:
lm(formula = Payroll.Salary2023 ~ RBI, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-10250331 -5220790 -843455 2386848 13654950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2363744 2866320 -0.825 0.41883
RBI 157088 42465 3.699 0.00133 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6516000 on 21 degrees of freedom
Multiple R-squared: 0.3945, Adjusted R-squared: 0.3657
F-statistic: 13.68 on 1 and 21 DF, p-value: 0.001331
The p-value for RBI (0.00133) is below 0.05, which is strong evidence that RBI is significantly associated with payroll salary. However, the R-squared shows that RBI explains only 39.45% of the variance in salary, so other factors besides RBI likely influence salary as well.
# Sum of Squared Errors
model1$residuals
1 2 3 4 5 6 7 8 9 10
13654950.2 10082148.6 -5524939.3 10298631.2 1626214.0 -6731642.8 -5902522.2 -10250330.7 -4711916.8 -532796.1
11 12 13 14 15 16 17 18 19 20
-6667082.5 -6696203.1 7582148.6 -4916640.9 -1898125.3 -336532.3 -995042.5 -1311618.3 -843454.5 8050721.3
21 22 23
1250336.9 1847040.4 2926656.0
The values above are the residuals of model1: the actual salary minus the predicted salary for each player. A negative residual means the model's prediction is higher than the actual salary, and a positive residual means the prediction is lower than the actual salary.
SSE = sum(model1$residuals^2)
SSE
[1] 8.914926e+14
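This large value is the total squared difference between the actual salaries and the salaries predicted by model1. Because it is measured in squared dollars, it is most useful for comparing models fit to the same data.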
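The SSE also connects to the residual standard error reported in the model1 summary. As a small sketch, assuming the usual definition (the square root of SSE divided by the residual degrees of freedom) and using only numbers copied from the output above, the reported value can be recomputed by hand. The variable names are illustrative.

```r
# Values copied from the model1 output above
sse_model1 <- 8.914926e14   # sum(model1$residuals^2)
df_resid   <- 21            # residual degrees of freedom (23 players - 2 coefficients)

# Residual standard error = sqrt(SSE / residual degrees of freedom)
rse_model1 <- sqrt(sse_model1 / df_resid)
print(round(rse_model1))
```

The result should agree with the "Residual standard error: 6516000" line up to rounding, which puts the typical prediction error in dollars rather than squared dollars.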
# Linear Regression (two variables)
model2 = lm(Payroll.Salary2023 ~ AVG + RBI, data=firstbase)
summary(model2)
Call:
lm(formula = Payroll.Salary2023 ~ AVG + RBI, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9097952 -4621582 -33233 3016541 10260245
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18083756 9479037 -1.908 0.0709 .
AVG 74374031 42934155 1.732 0.0986 .
RBI 108850 49212 2.212 0.0388 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6226000 on 20 degrees of freedom
Multiple R-squared: 0.4735, Adjusted R-squared: 0.4209
F-statistic: 8.994 on 2 and 20 DF, p-value: 0.001636
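Only RBI (p = 0.0388) is significant at the 0.05 level. AVG (p = 0.0986) is above 0.05 and is significant only at the 0.10 level. The overall F-test (p-value 0.001636) indicates that the two variables together have a significant relationship with salary, and R-squared rises from 0.3945 to 0.4735.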
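To make the significance check explicit, the sketch below compares the two p-values copied from the model2 coefficient table against a threshold. The 0.05 cutoff and the variable names are assumptions for illustration.

```r
# p-values copied from the model2 coefficient table above
p_values <- c(AVG = 0.0986, RBI = 0.0388)

# Conventional significance level (an assumption; other cutoffs are possible)
alpha <- 0.05

# TRUE where the predictor is significant at the chosen level
significant <- p_values < alpha
print(significant)
```

With a fitted model available, the same p-values can be read from the "Pr(>|t|)" column of coef(summary(model2)).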
# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
[1] 7.751841e+14
This is the total squared difference between the observed and predicted payroll salaries for model2. It is lower than the SSE of model1 (8.914926e+14), so adding AVG reduces the model's overall error on this data.
# Linear Regression (multiple variables)
model3 = lm(Payroll.Salary2023 ~ HR + RBI + AVG + OBP+ OPS, data=firstbase)
summary(model3)
Call:
lm(formula = Payroll.Salary2023 ~ HR + RBI + AVG + OBP + OPS,
data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9611440 -3338119 64016 4472451 9490309
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31107859 11738494 -2.650 0.0168 *
HR -341069 552069 -0.618 0.5449
RBI 115786 113932 1.016 0.3237
AVG -63824769 104544645 -0.611 0.5496
OBP 27054948 131210166 0.206 0.8391
OPS 60181012 95415131 0.631 0.5366
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6023000 on 17 degrees of freedom
Multiple R-squared: 0.5811, Adjusted R-squared: 0.4579
F-statistic: 4.717 on 5 and 17 DF, p-value: 0.006951
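None of the individual predictors (HR, RBI, AVG, OBP, and OPS) is significant when they are considered together, because all of their p-values are greater than 0.05. The overall model is still significant (p-value 0.006951) and the predictors together explain 58.11% of the variance in salary, but their individual contributions are unclear. This pattern usually means the predictors are highly correlated with each other.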
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
[1] 6.167793e+14
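The three SSE values can be tied back to the R-squared values in the summaries through R-squared = 1 - SSE/SST. The sketch below is a consistency check under one assumption: the total sum of squares (SST) is not printed anywhere above, so it is backed out of model1's SSE and its rounded R-squared. All names are illustrative.

```r
# SSE values copied from the output above
sse <- c(model1 = 8.914926e14, model2 = 7.751841e14, model3 = 6.167793e14)

# R-squared of model1 as printed by summary(model1) (rounded to 4 digits)
r2_model1 <- 0.3945

# Back out the total sum of squares: R2 = 1 - SSE/SST  =>  SST = SSE / (1 - R2)
sst <- sse[["model1"]] / (1 - r2_model1)

# Recompute R-squared for every model from its SSE
r2 <- 1 - sse / sst
print(round(r2, 4))
```

The recomputed values for model2 and model3 should agree with the reported 0.4735 and 0.5811 up to rounding, which shows that a lower SSE on the same data is the same thing as a higher R-squared.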
# Remove HR
model4 = lm(Payroll.Salary2023 ~ RBI + AVG + OBP+OPS, data=firstbase)
summary(model4)
Call:
lm(formula = Payroll.Salary2023 ~ RBI + AVG + OBP + OPS, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9399551 -3573842 98921 3979339 9263512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29466887 11235931 -2.623 0.0173 *
RBI 71495 87015 0.822 0.4220
AVG -11035457 59192453 -0.186 0.8542
OBP 86360720 87899074 0.982 0.3389
OPS 9464546 47788458 0.198 0.8452
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5919000 on 18 degrees of freedom
Multiple R-squared: 0.5717, Adjusted R-squared: 0.4765
F-statistic: 6.007 on 4 and 18 DF, p-value: 0.00298
Without HR, the overall model is still significant (p-value 0.00298), but none of the individual predictors (RBI, AVG, OBP, OPS) is significant because all of their p-values are greater than 0.05. This suggests that the predictors are correlated with each other, so the model cannot separate their individual effects on payroll salary.
firstbase<-firstbase[,-(1:3)]
The code above removes the first three columns (Player, Pos, Team) from the dataset ‘firstbase’. These are character columns, and cor() only works on numeric columns.
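Dropping columns by position works here, but it depends on the column order. As an alternative sketch, the numeric columns can be selected by type instead. The small data frame `demo` is a hypothetical stand-in built from the first three rows of the str() output above, not the full dataset.

```r
# Hypothetical stand-in using the first three players shown in str(firstbase)
demo <- data.frame(
  Player = c("Freddie Freeman", "Jose Abreu", "Nate Lowe"),
  Team   = c("LAD", "CHW", "TEX"),
  HR     = c(21, 15, 27),
  RBI    = c(100, 75, 76)
)

# Flag numeric columns, then keep only those
is_num <- sapply(demo, is.numeric)
demo_numeric <- demo[, is_num, drop = FALSE]
print(names(demo_numeric))
```

Selecting by type keeps working if the character columns are reordered or another one is added.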
# Correlations
cor(firstbase$RBI, firstbase$Payroll.Salary2023)
[1] 0.6281239
The correlation coefficient between RBI and payroll salary is 0.628, which indicates a moderately strong positive correlation. As RBI increases, payroll salary tends to increase as well.
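In a regression with a single predictor, R-squared equals the square of the correlation between the predictor and the response. The short check below uses the correlation printed above and compares it with the R-squared reported for model1 (0.3945). The variable names are illustrative.

```r
# Correlation between RBI and salary, copied from the output above
r_rbi_salary <- 0.6281239

# Squaring it should reproduce the Multiple R-squared of model1
r_squared <- r_rbi_salary^2
print(round(r_squared, 4))
```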
cor(firstbase$AVG, firstbase$OBP)
[1] 0.8028894
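The correlation coefficient between AVG and OBP is 0.803, a strong positive correlation: players with a higher batting average typically have a higher on-base percentage. Because these two predictors carry much of the same information, it is hard to separate their effects when both are in the same model.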
cor(firstbase)
GP AB H X2B HR RBI AVG OBP SLG OPS
GP 1.0000000 0.9779421 0.9056508 0.8446267 0.7432552 0.8813917 0.4430808 0.4841583 0.6875270 0.6504483
AB 0.9779421 1.0000000 0.9516701 0.8924632 0.7721339 0.9125839 0.5126292 0.5026125 0.7471949 0.6980141
H 0.9056508 0.9516701 1.0000000 0.9308318 0.7155225 0.9068893 0.7393167 0.6560021 0.8211406 0.8069779
X2B 0.8446267 0.8924632 0.9308318 1.0000000 0.5889699 0.8485911 0.6613085 0.5466537 0.7211259 0.6966830
HR 0.7432552 0.7721339 0.7155225 0.5889699 1.0000000 0.8929048 0.3444242 0.4603408 0.8681501 0.7638721
RBI 0.8813917 0.9125839 0.9068893 0.8485911 0.8929048 1.0000000 0.5658479 0.5704463 0.8824090 0.8156612
AVG 0.4430808 0.5126292 0.7393167 0.6613085 0.3444242 0.5658479 1.0000000 0.8028894 0.7254274 0.7989005
OBP 0.4841583 0.5026125 0.6560021 0.5466537 0.4603408 0.5704463 0.8028894 1.0000000 0.7617499 0.8987390
SLG 0.6875270 0.7471949 0.8211406 0.7211259 0.8681501 0.8824090 0.7254274 0.7617499 1.0000000 0.9686752
OPS 0.6504483 0.6980141 0.8069779 0.6966830 0.7638721 0.8156612 0.7989005 0.8987390 0.9686752 1.0000000
WAR 0.5645243 0.6211558 0.7688712 0.6757470 0.6897677 0.7885666 0.7855945 0.7766375 0.8611140 0.8799893
Payroll.Salary2023 0.4614889 0.5018820 0.6249911 0.6450730 0.5317619 0.6281239 0.5871543 0.7025979 0.6974086 0.7394981
WAR Payroll.Salary2023
GP 0.5645243 0.4614889
AB 0.6211558 0.5018820
H 0.7688712 0.6249911
X2B 0.6757470 0.6450730
HR 0.6897677 0.5317619
RBI 0.7885666 0.6281239
AVG 0.7855945 0.5871543
OBP 0.7766375 0.7025979
SLG 0.8611140 0.6974086
OPS 0.8799893 0.7394981
WAR 1.0000000 0.8086359
Payroll.Salary2023 0.8086359 1.0000000
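A correlation matrix this size is easier to use when the highly correlated predictor pairs are listed directly. The sketch below does that for the four predictors of model4, with the values copied from the matrix above. The 0.8 cutoff is an assumption chosen for illustration, and the names `cm` and `high_pairs` are hypothetical.

```r
# Sub-matrix of the correlations above for the model4 predictors
vars <- c("RBI", "AVG", "OBP", "OPS")
cm <- matrix(c(
  1.0000000, 0.5658479, 0.5704463, 0.8156612,
  0.5658479, 1.0000000, 0.8028894, 0.7989005,
  0.5704463, 0.8028894, 1.0000000, 0.8987390,
  0.8156612, 0.7989005, 0.8987390, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Positions above the diagonal whose absolute correlation exceeds the cutoff
cutoff <- 0.8
high <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

# One row per highly correlated pair
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

With a numeric data frame at hand, `cm` could instead be taken from cor() on the relevant columns. Pairs that show up here are natural candidates for dropping one of the two variables.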
# Remove AVG
model5 = lm(Payroll.Salary2023 ~ RBI + OBP+OPS, data=firstbase)
summary(model5)
Call:
lm(formula = Payroll.Salary2023 ~ RBI + OBP + OPS, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9465449 -3411234 259746 4102864 8876798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29737007 10855411 -2.739 0.013 *
RBI 72393 84646 0.855 0.403
OBP 82751360 83534224 0.991 0.334
OPS 7598051 45525575 0.167 0.869
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5767000 on 19 degrees of freedom
Multiple R-squared: 0.5709, Adjusted R-squared: 0.5031
F-statistic: 8.426 on 3 and 19 DF, p-value: 0.000913
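The model explains 57.09% of the variance in salary and the overall p-value is 0.000913, but none of the predictors (RBI, OBP, OPS) is individually significant because their p-values are greater than 0.05. Removing AVG barely changes R-squared (0.5717 to 0.5709) and raises adjusted R-squared (0.4765 to 0.5031), which indicates that AVG added little beyond what the other predictors already explain.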
# Remove OPS
model6 = lm(Payroll.Salary2023 ~ RBI + OBP, data=firstbase)
summary(model6)
Call:
lm(formula = Payroll.Salary2023 ~ RBI + OBP, data = firstbase)
Residuals:
Min 1Q Median 3Q Max
-9045497 -3487008 139497 4084739 9190185
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28984802 9632560 -3.009 0.00693 **
RBI 84278 44634 1.888 0.07360 .
OBP 95468873 33385182 2.860 0.00969 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5625000 on 20 degrees of freedom
Multiple R-squared: 0.5703, Adjusted R-squared: 0.5273
F-statistic: 13.27 on 2 and 20 DF, p-value: 0.0002149
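The overall p-value (0.0002149) is significant. OBP is significant at the 0.01 level (p = 0.00969), while RBI is only marginally significant (p = 0.0736), so OBP has the stronger and more statistically significant relationship with salary. This model also has the highest adjusted R-squared (0.5273) of the six models.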
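Since the six models differ in the number of predictors, adjusted R-squared is the fairer way to compare them. The sketch below recomputes it from the reported R-squared values using the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of players and p the number of predictors. The data frame `models` is a hypothetical helper assembled from the summaries above.

```r
n <- 23  # number of players in the training data

# R-squared and number of predictors, copied from the six summaries above
models <- data.frame(
  model = paste0("model", 1:6),
  r2    = c(0.3945, 0.4735, 0.5811, 0.5717, 0.5709, 0.5703),
  p     = c(1, 2, 5, 4, 3, 2)
)

# Adjusted R-squared penalizes each additional predictor
models$adj_r2 <- 1 - (1 - models$r2) * (n - 1) / (n - models$p - 1)
print(models)

# Model with the highest adjusted R-squared
best <- models$model[which.max(models$adj_r2)]
print(best)
```

The recomputed values should agree with the "Adjusted R-squared" lines up to rounding. They show how model3 has the highest plain R-squared while the smaller models score better once the number of predictors is taken into account.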
# Read in test set
firstbaseTest = read.csv("Ass 6/firstbasestats_test.csv")
str(firstbaseTest)
'data.frame': 2 obs. of 15 variables:
$ Player : chr "Matt Olson" "Josh Bell"
$ Pos : chr "1B" "1B"
$ Team : chr "ATL" "SD"
$ GP : int 162 156
$ AB : int 616 552
$ H : int 148 147
$ X2B : int 44 29
$ HR : int 34 17
$ RBI : int 103 71
$ AVG : num 0.24 0.266
$ OBP : num 0.325 0.362
$ SLG : num 0.477 0.422
$ OPS : num 0.802 0.784
$ WAR : num 3.29 3.5
$ Payroll.Salary2023: num 21000000 16500000
The test set contains the stats and salaries of two players, Matt Olson (ATL) and Josh Bell (SD). Some of their stats are similar: they have 148 and 147 hits and batting averages of 0.240 and 0.266. Others differ: Olson has 34 home runs and 103 RBI against 17 and 71 for Bell. Their salaries also differ: 21,000,000 for Olson and 16,500,000 for Bell.
# Make test set predictions
predictTest = predict(model6, newdata=firstbaseTest)
predictTest
1 2
10723186 11558647
The model predicts salaries of 10,723,186 for Olson and 11,558,647 for Bell. Both predictions are well below the actual salaries of 21,000,000 and 16,500,000.
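The predictions come straight from the fitted equation of model6: intercept + coefficient of RBI × RBI + coefficient of OBP × OBP. As a sketch, the calculation can be repeated by hand with the coefficients as printed in the model6 summary and the test-set values of RBI and OBP. Because the printed coefficients are rounded, small differences from predict() are expected. The variable names are illustrative.

```r
# Coefficients copied from the model6 summary above (rounded as printed)
coefs <- c(intercept = -28984802, RBI = 84278, OBP = 95468873)

# Test-set values copied from str(firstbaseTest)
test_players <- data.frame(
  Player = c("Matt Olson", "Josh Bell"),
  RBI    = c(103, 71),
  OBP    = c(0.325, 0.362)
)

# Predicted salary = intercept + b_RBI * RBI + b_OBP * OBP
manual_pred <- coefs[["intercept"]] +
  coefs[["RBI"]] * test_players$RBI +
  coefs[["OBP"]] * test_players$OBP
print(round(manual_pred))
```

The hand-computed values should land within a few dozen dollars of the predict() output. They also show why Bell is predicted slightly higher than Olson: his higher OBP outweighs his lower RBI in this model.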
# Compute R-squared
SSE = sum((firstbaseTest$Payroll.Salary2023 - predictTest)^2)
SST = sum((firstbaseTest$Payroll.Salary2023 - mean(firstbase$Payroll.Salary2023))^2)
1 - SSE/SST
[1] 0.5477734
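The out-of-sample R-squared is about 0.548, meaning the model's squared prediction error on the test set is about 54.8% lower than that of a baseline that predicts the training-set mean salary for every player. This is a substantial improvement, but the test set has only two players, so the estimate is not very reliable, and the predictions above show there is still room to make the model more accurate.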
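The test-set R-squared can be reproduced from numbers already shown above, which makes the two ingredients visible. This is a sketch with values copied from the output: the actual and predicted salaries of the two test players, and the training mean salary as printed (rounded) by summary(firstbase). The variable names are illustrative.

```r
# Actual and predicted salaries for Olson and Bell, copied from the output above
actual    <- c(21000000, 16500000)
predicted <- c(10723186, 11558647)

# Mean training salary as printed by summary(firstbase) (rounded)
train_mean <- 6972743

# Error of the model vs. error of always predicting the training mean
sse_test <- sum((actual - predicted)^2)
sst_test <- sum((actual - train_mean)^2)

r2_test <- 1 - sse_test / sst_test
print(round(r2_test, 4))
```

The baseline uses the training mean rather than the test mean, so this value can be negative when a model predicts worse than the baseline.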