Data 621 Assignment 1

Wilson Chau Data 621 Assignment 1

#Observation:

baseball_data <- read.csv("https://raw.githubusercontent.com/Wilchau/Data_621_Assignment_1/main/moneyball-training-data.csv") 
baseball_eval <- read.csv("https://raw.githubusercontent.com/Wilchau/Data_621_Assignment_1/main/moneyball-evaluation-data.csv")
str(baseball_data)

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

summary(baseball_data)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

Going through basic observation we can see that the training data set has 17 variables and 2,276 total Observation. Going through each variable I notice some are missing, such as TEAM_BATTING_HBP(Batters hit by pitch) having missing values. INDEX is the Identification Variable, and TARGET_WINS is the response variable to influence the regression model. Another variable to look at would be TEAM_PITCHING_H, TEAM_PITCHING_BB due to the high number. The other variables are all potential predictor variables for our linear model.

boxplot(baseball_data$TARGET_WINS, 
        main = "Box Plot of TARGET_WINS",
        ylab = "Number of Wins")

boxplot(baseball_data$TEAM_BATTING_H, baseball_data$TEAM_BATTING_2B, 
        baseball_data$TEAM_BATTING_3B, baseball_data$TEAM_BATTING_HR,
        baseball_data$TEAM_BATTING_BB, baseball_data$TEAM_BATTING_SO,
        baseball_data$TEAM_BASERUN_SB, baseball_data$TEAM_BASERUN_CS,
        baseball_data$TEAM_BATTING_HBP,
        main = "Box Plots of Selected Variables",
        ylab = "Values",
        names = c("TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B", 
                  "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO", 
                  "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP"))

library(corrplot)

## corrplot 0.92 loaded

selected_vars <- c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B", 
                    "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO", 
                    "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP", 
                    "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", 
                    "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")

correlation_matrix <- cor(baseball_data[, selected_vars], use = "pairwise.complete.obs")
print(correlation_matrix)

##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS       1.00000000    0.388767521      0.28910365     0.142608411
## TEAM_BATTING_H    0.38876752    1.000000000      0.56284968     0.427696575
## TEAM_BATTING_2B   0.28910365    0.562849678      1.00000000    -0.107305824
## TEAM_BATTING_3B   0.14260841    0.427696575     -0.10730582     1.000000000
## TEAM_BATTING_HR   0.17615320   -0.006544685      0.43539729    -0.635566946
## TEAM_BATTING_BB   0.23255986   -0.072464013      0.25572610    -0.287235841
## TEAM_BATTING_SO  -0.03175071   -0.463853571      0.16268519    -0.669781188
## TEAM_BASERUN_SB   0.13513892    0.123567797     -0.19975724     0.533506448
## TEAM_BASERUN_CS   0.02240407    0.016705668     -0.09981406     0.348764919
## TEAM_BATTING_HBP  0.07350424   -0.029112176      0.04608475    -0.174247154
## TEAM_PITCHING_H  -0.10993705    0.302693709      0.02369219     0.194879411
## TEAM_PITCHING_HR  0.18901373    0.072853119      0.45455082    -0.567836679
## TEAM_PITCHING_BB  0.12417454    0.094193027      0.17805420    -0.002224148
## TEAM_PITCHING_SO -0.07843609   -0.252656790      0.06479231    -0.258818931
## TEAM_FIELDING_E  -0.17648476    0.264902478     -0.23515099     0.509778447
## TEAM_FIELDING_DP -0.03485058    0.155383321      0.29087998    -0.323074847
##                  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS          0.176153200      0.23255986     -0.03175071
## TEAM_BATTING_H      -0.006544685     -0.07246401     -0.46385357
## TEAM_BATTING_2B      0.435397293      0.25572610      0.16268519
## TEAM_BATTING_3B     -0.635566946     -0.28723584     -0.66978119
## TEAM_BATTING_HR      1.000000000      0.51373481      0.72706935
## TEAM_BATTING_BB      0.513734810      1.00000000      0.37975087
## TEAM_BATTING_SO      0.727069348      0.37975087      1.00000000
## TEAM_BASERUN_SB     -0.453578426     -0.10511564     -0.25448923
## TEAM_BASERUN_CS     -0.433793868     -0.13698837     -0.21788137
## TEAM_BATTING_HBP     0.106181160      0.04746007      0.22094219
## TEAM_PITCHING_H     -0.250145481     -0.44977762     -0.37568637
## TEAM_PITCHING_HR     0.969371396      0.45955207      0.66717889
## TEAM_PITCHING_BB     0.136927564      0.48936126      0.03700514
## TEAM_PITCHING_SO     0.184707564     -0.02075682      0.41623330
## TEAM_FIELDING_E     -0.587339098     -0.65597081     -0.58466444
## TEAM_FIELDING_DP     0.448985348      0.43087675      0.15488939
##                  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS           0.13513892      0.02240407       0.07350424
## TEAM_BATTING_H        0.12356780      0.01670567      -0.02911218
## TEAM_BATTING_2B      -0.19975724     -0.09981406       0.04608475
## TEAM_BATTING_3B       0.53350645      0.34876492      -0.17424715
## TEAM_BATTING_HR      -0.45357843     -0.43379387       0.10618116
## TEAM_BATTING_BB      -0.10511564     -0.13698837       0.04746007
## TEAM_BATTING_SO      -0.25448923     -0.21788137       0.22094219
## TEAM_BASERUN_SB       1.00000000      0.65524480      -0.06400498
## TEAM_BASERUN_CS       0.65524480      1.00000000      -0.07051390
## TEAM_BATTING_HBP     -0.06400498     -0.07051390       1.00000000
## TEAM_PITCHING_H       0.07328505     -0.05200781      -0.02769699
## TEAM_PITCHING_HR     -0.41651072     -0.42256605       0.10675878
## TEAM_PITCHING_BB      0.14641513     -0.10696124       0.04785137
## TEAM_PITCHING_SO     -0.13712861     -0.21022274       0.22157375
## TEAM_FIELDING_E       0.50963090      0.04832189       0.04178971
## TEAM_FIELDING_DP     -0.49707763     -0.21424801      -0.07120824
##                  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS          -0.10993705       0.18901373      0.124174536
## TEAM_BATTING_H        0.30269371       0.07285312      0.094193027
## TEAM_BATTING_2B       0.02369219       0.45455082      0.178054204
## TEAM_BATTING_3B       0.19487941      -0.56783668     -0.002224148
## TEAM_BATTING_HR      -0.25014548       0.96937140      0.136927564
## TEAM_BATTING_BB      -0.44977762       0.45955207      0.489361263
## TEAM_BATTING_SO      -0.37568637       0.66717889      0.037005141
## TEAM_BASERUN_SB       0.07328505      -0.41651072      0.146415134
## TEAM_BASERUN_CS      -0.05200781      -0.42256605     -0.106961236
## TEAM_BATTING_HBP     -0.02769699       0.10675878      0.047851371
## TEAM_PITCHING_H       1.00000000      -0.14161276      0.320676162
## TEAM_PITCHING_HR     -0.14161276       1.00000000      0.221937505
## TEAM_PITCHING_BB      0.32067616       0.22193750      1.000000000
## TEAM_PITCHING_SO      0.26724807       0.20588053      0.488498653
## TEAM_FIELDING_E       0.66775901      -0.49314447     -0.022837561
## TEAM_FIELDING_DP     -0.22865059       0.43917040      0.324457226
##                  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS           -0.07843609     -0.17648476      -0.03485058
## TEAM_BATTING_H        -0.25265679      0.26490248       0.15538332
## TEAM_BATTING_2B        0.06479231     -0.23515099       0.29087998
## TEAM_BATTING_3B       -0.25881893      0.50977845      -0.32307485
## TEAM_BATTING_HR        0.18470756     -0.58733910       0.44898535
## TEAM_BATTING_BB       -0.02075682     -0.65597081       0.43087675
## TEAM_BATTING_SO        0.41623330     -0.58466444       0.15488939
## TEAM_BASERUN_SB       -0.13712861      0.50963090      -0.49707763
## TEAM_BASERUN_CS       -0.21022274      0.04832189      -0.21424801
## TEAM_BATTING_HBP       0.22157375      0.04178971      -0.07120824
## TEAM_PITCHING_H        0.26724807      0.66775901      -0.22865059
## TEAM_PITCHING_HR       0.20588053     -0.49314447       0.43917040
## TEAM_PITCHING_BB       0.48849865     -0.02283756       0.32445723
## TEAM_PITCHING_SO       1.00000000     -0.02329178       0.02615804
## TEAM_FIELDING_E       -0.02329178      1.00000000      -0.49768495
## TEAM_FIELDING_DP       0.02615804     -0.49768495       1.00000000

corrplot(correlation_matrix,
         method = "color",  # Color-coded cells
         type = "upper",    # Display only upper triangle
         tl.cex = 0.3,      # Text label size
         tl.col = "black",  # Text label color
         diag = FALSE,      # Exclude diagonal elements
         addCoef.col = "black", # Color of correlation coefficients
         tl.srt = 45)       # Text label rotation angle

Going through the correlation matrix, I notice that Team_BAtting_H along with TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR are all values of the hits. TEAM_BATTING_BB and TEAM_PITCHING_BB have strong correlation. Our focus will be on winning.

So far we see that there are missing data, a lot of outliers as shown in the summary table and some need to prepare the data and transform it.

#PART 2: DATA PREPARATION: I will focus on working on removing the NA values or missing values by deleting the observations, variables, imputation with the mean/median/mode. Imputation is the easiest way, but however it reduces the variance in the dataset and shrinks standard errors which can invalidate the hypothesis tests.

##PART 3: BUILD MODELS In this section I will work on building at least three different linear regression models:

Model 1: I chose only all variables, this would allow me to see which variables is significant.

model_1 <- lm(formula = TARGET_WINS ~ ., data = baseball_data)
summary(model_1)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = baseball_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0626  -5.4196  -0.0423   5.2111  22.9355 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60.4562317 19.7385030   3.063  0.00254 ** 
## INDEX            -0.0002478  0.0008508  -0.291  0.77122    
## TEAM_BATTING_H    1.8111103  2.7908648   0.649  0.51723    
## TEAM_BATTING_2B   0.0267462  0.0303941   0.880  0.38008    
## TEAM_BATTING_3B  -0.1018043  0.0777401  -1.310  0.19208    
## TEAM_BATTING_HR  -4.6100155 10.5666083  -0.436  0.66317    
## TEAM_BATTING_BB  -4.4606275  3.6457882  -1.224  0.22279    
## TEAM_BATTING_SO   0.4303282  2.6231874   0.164  0.86988    
## TEAM_BASERUN_SB   0.0335937  0.0288100   1.166  0.24519    
## TEAM_BASERUN_CS  -0.0130338  0.0719436  -0.181  0.85645    
## TEAM_BATTING_HBP  0.0837038  0.0499097   1.677  0.09532 .  
## TEAM_PITCHING_H  -1.7887761  2.7903398  -0.641  0.52233    
## TEAM_PITCHING_HR  4.6958245 10.5649821   0.444  0.65725    
## TEAM_PITCHING_BB  4.5120283  3.6432611   1.238  0.21721    
## TEAM_PITCHING_SO -0.4618971  2.6214432  -0.176  0.86034    
## TEAM_FIELDING_E  -0.1724513  0.0415365  -4.152 5.16e-05 ***
## TEAM_FIELDING_DP -0.1063200  0.0371964  -2.858  0.00478 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.489 on 174 degrees of freedom
##   (2085 observations deleted due to missingness)
## Multiple R-squared:  0.5503, Adjusted R-squared:  0.509 
## F-statistic: 13.31 on 16 and 174 DF,  p-value: < 2.2e-16

Model 2: Highly statistical Variables Only

For Model 2 I include only the one to be best fit model based model 1 where I saw TEAM_BATTING_H , TEAM_BATTING_SO, TEAM_FIELDING_E, and TEAM_PITCHING_HR as viable candidates for having positive estimates.

model_2 <- lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
TEAM_FIELDING_E + TEAM_FIELDING_DP, data = baseball_data)
summary(model_2)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = baseball_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.968  -7.675   0.280   7.374  31.949 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      88.365267   5.488846  16.099  < 2e-16 ***
## TEAM_BATTING_H    0.011370   0.003215   3.537 0.000415 ***
## TEAM_BATTING_3B   0.187827   0.019636   9.565  < 2e-16 ***
## TEAM_BATTING_HR   0.130575   0.009150  14.270  < 2e-16 ***
## TEAM_BATTING_SO  -0.025391   0.004158  -6.107 1.24e-09 ***
## TEAM_BASERUN_SB   0.078100   0.005703  13.693  < 2e-16 ***
## TEAM_PITCHING_SO -0.004121   0.003532  -1.167 0.243564    
## TEAM_FIELDING_E  -0.112888   0.007027 -16.064  < 2e-16 ***
## TEAM_FIELDING_DP -0.097127   0.012639  -7.685 2.49e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.59 on 1826 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3552, Adjusted R-squared:  0.3524 
## F-statistic: 125.8 on 8 and 1826 DF,  p-value: < 2.2e-16

Model 3: Eliminate variables that are negative or skewed.

model_3 <-lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
TEAM_FIELDING_E + TEAM_BATTING_HR, data = baseball_data)
summary(model_3)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_BATTING_HR, data = baseball_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.724  -9.003   0.146   8.425  40.502 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.354874   2.958501   3.500 0.000475 ***
## TEAM_BATTING_H   0.046809   0.002111  22.177  < 2e-16 ***
## TEAM_BASERUN_SB  0.058158   0.003660  15.891  < 2e-16 ***
## TEAM_FIELDING_E -0.037661   0.002164 -17.406  < 2e-16 ***
## TEAM_BATTING_HR  0.024259   0.005792   4.188 2.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.42 on 2140 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.2933, Adjusted R-squared:  0.292 
## F-statistic:   222 on 4 and 2140 DF,  p-value: < 2.2e-16

#PART 4: SELECT MODELS

From these three models I went through each of the following: Mean Squared Error Root MSE Adjusted R-Squared F-statistics I observe that the first model has the highest R-squared, it wasn’t statisfically significant, but had some multicollinearity issues. Looking at F-statistics in Model 3 this has the highest out of the three so I chose Model 3.

predictions <- predict(model_3, newdata = baseball_data)
rmse <- sqrt(mean((baseball_data$TARGET_WINS - predictions)^2))
summary(model_3)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_BATTING_HR, data = baseball_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.724  -9.003   0.146   8.425  40.502 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.354874   2.958501   3.500 0.000475 ***
## TEAM_BATTING_H   0.046809   0.002111  22.177  < 2e-16 ***
## TEAM_BASERUN_SB  0.058158   0.003660  15.891  < 2e-16 ***
## TEAM_FIELDING_E -0.037661   0.002164 -17.406  < 2e-16 ***
## TEAM_BATTING_HR  0.024259   0.005792   4.188 2.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.42 on 2140 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.2933, Adjusted R-squared:  0.292 
## F-statistic:   222 on 4 and 2140 DF,  p-value: < 2.2e-16

Data 621 Assignment 1

Wilson Chau

2023-10-02