Wilson Chau Data 621 Assignment 1
#Observation:
baseball_data <- read.csv("https://raw.githubusercontent.com/Wilchau/Data_621_Assignment_1/main/moneyball-training-data.csv")
baseball_eval <- read.csv("https://raw.githubusercontent.com/Wilchau/Data_621_Assignment_1/main/moneyball-evaluation-data.csv")
str(baseball_data)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
summary(baseball_data)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
Going through basic observation we can see that the training data set has 17 variables and 2,276 total Observation. Going through each variable I notice some are missing, such as TEAM_BATTING_HBP(Batters hit by pitch) having missing values. INDEX is the Identification Variable, and TARGET_WINS is the response variable to influence the regression model. Another variable to look at would be TEAM_PITCHING_H, TEAM_PITCHING_BB due to the high number. The other variables are all potential predictor variables for our linear model.
boxplot(baseball_data$TARGET_WINS,
main = "Box Plot of TARGET_WINS",
ylab = "Number of Wins")
boxplot(baseball_data$TEAM_BATTING_H, baseball_data$TEAM_BATTING_2B,
baseball_data$TEAM_BATTING_3B, baseball_data$TEAM_BATTING_HR,
baseball_data$TEAM_BATTING_BB, baseball_data$TEAM_BATTING_SO,
baseball_data$TEAM_BASERUN_SB, baseball_data$TEAM_BASERUN_CS,
baseball_data$TEAM_BATTING_HBP,
main = "Box Plots of Selected Variables",
ylab = "Values",
names = c("TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
"TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO",
"TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP"))
library(corrplot)
## corrplot 0.92 loaded
selected_vars <- c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
"TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO",
"TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP",
"TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB",
"TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")
correlation_matrix <- cor(baseball_data[, selected_vars], use = "pairwise.complete.obs")
print(correlation_matrix)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS 1.00000000 0.388767521 0.28910365 0.142608411
## TEAM_BATTING_H 0.38876752 1.000000000 0.56284968 0.427696575
## TEAM_BATTING_2B 0.28910365 0.562849678 1.00000000 -0.107305824
## TEAM_BATTING_3B 0.14260841 0.427696575 -0.10730582 1.000000000
## TEAM_BATTING_HR 0.17615320 -0.006544685 0.43539729 -0.635566946
## TEAM_BATTING_BB 0.23255986 -0.072464013 0.25572610 -0.287235841
## TEAM_BATTING_SO -0.03175071 -0.463853571 0.16268519 -0.669781188
## TEAM_BASERUN_SB 0.13513892 0.123567797 -0.19975724 0.533506448
## TEAM_BASERUN_CS 0.02240407 0.016705668 -0.09981406 0.348764919
## TEAM_BATTING_HBP 0.07350424 -0.029112176 0.04608475 -0.174247154
## TEAM_PITCHING_H -0.10993705 0.302693709 0.02369219 0.194879411
## TEAM_PITCHING_HR 0.18901373 0.072853119 0.45455082 -0.567836679
## TEAM_PITCHING_BB 0.12417454 0.094193027 0.17805420 -0.002224148
## TEAM_PITCHING_SO -0.07843609 -0.252656790 0.06479231 -0.258818931
## TEAM_FIELDING_E -0.17648476 0.264902478 -0.23515099 0.509778447
## TEAM_FIELDING_DP -0.03485058 0.155383321 0.29087998 -0.323074847
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS 0.176153200 0.23255986 -0.03175071
## TEAM_BATTING_H -0.006544685 -0.07246401 -0.46385357
## TEAM_BATTING_2B 0.435397293 0.25572610 0.16268519
## TEAM_BATTING_3B -0.635566946 -0.28723584 -0.66978119
## TEAM_BATTING_HR 1.000000000 0.51373481 0.72706935
## TEAM_BATTING_BB 0.513734810 1.00000000 0.37975087
## TEAM_BATTING_SO 0.727069348 0.37975087 1.00000000
## TEAM_BASERUN_SB -0.453578426 -0.10511564 -0.25448923
## TEAM_BASERUN_CS -0.433793868 -0.13698837 -0.21788137
## TEAM_BATTING_HBP 0.106181160 0.04746007 0.22094219
## TEAM_PITCHING_H -0.250145481 -0.44977762 -0.37568637
## TEAM_PITCHING_HR 0.969371396 0.45955207 0.66717889
## TEAM_PITCHING_BB 0.136927564 0.48936126 0.03700514
## TEAM_PITCHING_SO 0.184707564 -0.02075682 0.41623330
## TEAM_FIELDING_E -0.587339098 -0.65597081 -0.58466444
## TEAM_FIELDING_DP 0.448985348 0.43087675 0.15488939
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS 0.13513892 0.02240407 0.07350424
## TEAM_BATTING_H 0.12356780 0.01670567 -0.02911218
## TEAM_BATTING_2B -0.19975724 -0.09981406 0.04608475
## TEAM_BATTING_3B 0.53350645 0.34876492 -0.17424715
## TEAM_BATTING_HR -0.45357843 -0.43379387 0.10618116
## TEAM_BATTING_BB -0.10511564 -0.13698837 0.04746007
## TEAM_BATTING_SO -0.25448923 -0.21788137 0.22094219
## TEAM_BASERUN_SB 1.00000000 0.65524480 -0.06400498
## TEAM_BASERUN_CS 0.65524480 1.00000000 -0.07051390
## TEAM_BATTING_HBP -0.06400498 -0.07051390 1.00000000
## TEAM_PITCHING_H 0.07328505 -0.05200781 -0.02769699
## TEAM_PITCHING_HR -0.41651072 -0.42256605 0.10675878
## TEAM_PITCHING_BB 0.14641513 -0.10696124 0.04785137
## TEAM_PITCHING_SO -0.13712861 -0.21022274 0.22157375
## TEAM_FIELDING_E 0.50963090 0.04832189 0.04178971
## TEAM_FIELDING_DP -0.49707763 -0.21424801 -0.07120824
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS -0.10993705 0.18901373 0.124174536
## TEAM_BATTING_H 0.30269371 0.07285312 0.094193027
## TEAM_BATTING_2B 0.02369219 0.45455082 0.178054204
## TEAM_BATTING_3B 0.19487941 -0.56783668 -0.002224148
## TEAM_BATTING_HR -0.25014548 0.96937140 0.136927564
## TEAM_BATTING_BB -0.44977762 0.45955207 0.489361263
## TEAM_BATTING_SO -0.37568637 0.66717889 0.037005141
## TEAM_BASERUN_SB 0.07328505 -0.41651072 0.146415134
## TEAM_BASERUN_CS -0.05200781 -0.42256605 -0.106961236
## TEAM_BATTING_HBP -0.02769699 0.10675878 0.047851371
## TEAM_PITCHING_H 1.00000000 -0.14161276 0.320676162
## TEAM_PITCHING_HR -0.14161276 1.00000000 0.221937505
## TEAM_PITCHING_BB 0.32067616 0.22193750 1.000000000
## TEAM_PITCHING_SO 0.26724807 0.20588053 0.488498653
## TEAM_FIELDING_E 0.66775901 -0.49314447 -0.022837561
## TEAM_FIELDING_DP -0.22865059 0.43917040 0.324457226
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS -0.07843609 -0.17648476 -0.03485058
## TEAM_BATTING_H -0.25265679 0.26490248 0.15538332
## TEAM_BATTING_2B 0.06479231 -0.23515099 0.29087998
## TEAM_BATTING_3B -0.25881893 0.50977845 -0.32307485
## TEAM_BATTING_HR 0.18470756 -0.58733910 0.44898535
## TEAM_BATTING_BB -0.02075682 -0.65597081 0.43087675
## TEAM_BATTING_SO 0.41623330 -0.58466444 0.15488939
## TEAM_BASERUN_SB -0.13712861 0.50963090 -0.49707763
## TEAM_BASERUN_CS -0.21022274 0.04832189 -0.21424801
## TEAM_BATTING_HBP 0.22157375 0.04178971 -0.07120824
## TEAM_PITCHING_H 0.26724807 0.66775901 -0.22865059
## TEAM_PITCHING_HR 0.20588053 -0.49314447 0.43917040
## TEAM_PITCHING_BB 0.48849865 -0.02283756 0.32445723
## TEAM_PITCHING_SO 1.00000000 -0.02329178 0.02615804
## TEAM_FIELDING_E -0.02329178 1.00000000 -0.49768495
## TEAM_FIELDING_DP 0.02615804 -0.49768495 1.00000000
corrplot(correlation_matrix,
method = "color", # Color-coded cells
type = "upper", # Display only upper triangle
tl.cex = 0.3, # Text label size
tl.col = "black", # Text label color
diag = FALSE, # Exclude diagonal elements
addCoef.col = "black", # Color of correlation coefficients
tl.srt = 45) # Text label rotation angle
Going through the correlation matrix, I notice that Team_BAtting_H along
with TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR are all values of
the hits. TEAM_BATTING_BB and TEAM_PITCHING_BB have strong correlation.
Our focus will be on winning.
So far we see that there are missing data, a lot of outliers as shown in the summary table and some need to prepare the data and transform it.
#PART 2: DATA PREPARATION: I will focus on working on removing the NA values or missing values by deleting the observations, variables, imputation with the mean/median/mode. Imputation is the easiest way, but however it reduces the variance in the dataset and shrinks standard errors which can invalidate the hypothesis tests.
##PART 3: BUILD MODELS In this section I will work on building at least three different linear regression models:
Model 1: I chose only all variables, this would allow me to see which variables is significant.
model_1 <- lm(formula = TARGET_WINS ~ ., data = baseball_data)
summary(model_1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = baseball_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.0626 -5.4196 -0.0423 5.2111 22.9355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.4562317 19.7385030 3.063 0.00254 **
## INDEX -0.0002478 0.0008508 -0.291 0.77122
## TEAM_BATTING_H 1.8111103 2.7908648 0.649 0.51723
## TEAM_BATTING_2B 0.0267462 0.0303941 0.880 0.38008
## TEAM_BATTING_3B -0.1018043 0.0777401 -1.310 0.19208
## TEAM_BATTING_HR -4.6100155 10.5666083 -0.436 0.66317
## TEAM_BATTING_BB -4.4606275 3.6457882 -1.224 0.22279
## TEAM_BATTING_SO 0.4303282 2.6231874 0.164 0.86988
## TEAM_BASERUN_SB 0.0335937 0.0288100 1.166 0.24519
## TEAM_BASERUN_CS -0.0130338 0.0719436 -0.181 0.85645
## TEAM_BATTING_HBP 0.0837038 0.0499097 1.677 0.09532 .
## TEAM_PITCHING_H -1.7887761 2.7903398 -0.641 0.52233
## TEAM_PITCHING_HR 4.6958245 10.5649821 0.444 0.65725
## TEAM_PITCHING_BB 4.5120283 3.6432611 1.238 0.21721
## TEAM_PITCHING_SO -0.4618971 2.6214432 -0.176 0.86034
## TEAM_FIELDING_E -0.1724513 0.0415365 -4.152 5.16e-05 ***
## TEAM_FIELDING_DP -0.1063200 0.0371964 -2.858 0.00478 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.489 on 174 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5503, Adjusted R-squared: 0.509
## F-statistic: 13.31 on 16 and 174 DF, p-value: < 2.2e-16
Model 2: Highly statistical Variables Only
For Model 2 I include only the one to be best fit model based model 1 where I saw TEAM_BATTING_H , TEAM_BATTING_SO, TEAM_FIELDING_E, and TEAM_PITCHING_HR as viable candidates for having positive estimates.
model_2 <- lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO +
TEAM_FIELDING_E + TEAM_FIELDING_DP, data = baseball_data)
summary(model_2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = baseball_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.968 -7.675 0.280 7.374 31.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.365267 5.488846 16.099 < 2e-16 ***
## TEAM_BATTING_H 0.011370 0.003215 3.537 0.000415 ***
## TEAM_BATTING_3B 0.187827 0.019636 9.565 < 2e-16 ***
## TEAM_BATTING_HR 0.130575 0.009150 14.270 < 2e-16 ***
## TEAM_BATTING_SO -0.025391 0.004158 -6.107 1.24e-09 ***
## TEAM_BASERUN_SB 0.078100 0.005703 13.693 < 2e-16 ***
## TEAM_PITCHING_SO -0.004121 0.003532 -1.167 0.243564
## TEAM_FIELDING_E -0.112888 0.007027 -16.064 < 2e-16 ***
## TEAM_FIELDING_DP -0.097127 0.012639 -7.685 2.49e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.59 on 1826 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3552, Adjusted R-squared: 0.3524
## F-statistic: 125.8 on 8 and 1826 DF, p-value: < 2.2e-16
Model 3: Eliminate variables that are negative or skewed.
model_3 <-lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB +
TEAM_FIELDING_E + TEAM_BATTING_HR, data = baseball_data)
summary(model_3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB +
## TEAM_FIELDING_E + TEAM_BATTING_HR, data = baseball_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.724 -9.003 0.146 8.425 40.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.354874 2.958501 3.500 0.000475 ***
## TEAM_BATTING_H 0.046809 0.002111 22.177 < 2e-16 ***
## TEAM_BASERUN_SB 0.058158 0.003660 15.891 < 2e-16 ***
## TEAM_FIELDING_E -0.037661 0.002164 -17.406 < 2e-16 ***
## TEAM_BATTING_HR 0.024259 0.005792 4.188 2.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.42 on 2140 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.292
## F-statistic: 222 on 4 and 2140 DF, p-value: < 2.2e-16
#PART 4: SELECT MODELS
From these three models I went through each of the following: Mean Squared Error Root MSE Adjusted R-Squared F-statistics I observe that the first model has the highest R-squared, it wasn’t statisfically significant, but had some multicollinearity issues. Looking at F-statistics in Model 3 this has the highest out of the three so I chose Model 3.
predictions <- predict(model_3, newdata = baseball_data)
rmse <- sqrt(mean((baseball_data$TARGET_WINS - predictions)^2))
summary(model_3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB +
## TEAM_FIELDING_E + TEAM_BATTING_HR, data = baseball_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.724 -9.003 0.146 8.425 40.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.354874 2.958501 3.500 0.000475 ***
## TEAM_BATTING_H 0.046809 0.002111 22.177 < 2e-16 ***
## TEAM_BASERUN_SB 0.058158 0.003660 15.891 < 2e-16 ***
## TEAM_FIELDING_E -0.037661 0.002164 -17.406 < 2e-16 ***
## TEAM_BATTING_HR 0.024259 0.005792 4.188 2.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.42 on 2140 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.292
## F-statistic: 222 on 4 and 2140 DF, p-value: < 2.2e-16