Contents: Objective, Library, Data Import, Data Exploration, Data Preparation, Data Cleaning, Model Ready Prep, Build Models, Coefficient, Evaluating, Residual, Prediction Result, Conclusion
This report outlines the development of a multiple linear regression model to predict the total number of wins for a baseball team in a season (TARGET_WINS). Using a training dataset of historical team performance, we undertook a comprehensive process of data cleaning, feature engineering, and model development.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(tidyr)
library(tidyselect)
mb_evaluation <- read.csv(
"https://raw.githubusercontent.com/GullitNa/DATA621-HW1/main/moneyball-evaluation-data.csv",
stringsAsFactors = FALSE
)
mb_training <- read.csv(
"https://raw.githubusercontent.com/GullitNa/DATA621-HW1/main/moneyball-training-data.csv",
stringsAsFactors = FALSE
)
head(mb_evaluation)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 9 1209 170 33 83
## 2 10 1221 151 29 88
## 3 14 1395 183 29 93
## 4 47 1539 309 29 159
## 5 60 1445 203 68 5
## 6 63 1431 236 53 10
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1 447 1080 62 50
## 2 516 929 54 39
## 3 509 816 59 47
## 4 486 914 148 57
## 5 95 416 NA NA
## 6 215 377 NA NA
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1 NA 1209 83 447
## 2 NA 1221 88 516
## 3 NA 1395 93 509
## 4 42 1539 159 486
## 5 NA 3902 14 257
## 6 NA 2793 20 420
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 1080 140 156
## 2 929 135 164
## 3 816 156 153
## 4 914 124 154
## 5 1123 616 130
## 6 736 572 105
head(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
# Check the structure of each data set
str(mb_evaluation)
## 'data.frame': 259 obs. of 16 variables:
## $ INDEX : int 9 10 14 47 60 63 74 83 98 120 ...
## $ TEAM_BATTING_H : int 1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
## $ TEAM_BATTING_2B : int 170 151 183 309 203 236 219 158 177 212 ...
## $ TEAM_BATTING_3B : int 33 29 29 29 68 53 55 42 78 42 ...
## $ TEAM_BATTING_HR : int 83 88 93 159 5 10 37 33 23 58 ...
## $ TEAM_BATTING_BB : int 447 516 509 486 95 215 568 356 466 452 ...
## $ TEAM_BATTING_SO : int 1080 929 816 914 416 377 527 609 689 584 ...
## $ TEAM_BASERUN_SB : int 62 54 59 148 NA NA 365 185 150 52 ...
## $ TEAM_BASERUN_CS : int 50 39 47 57 NA NA NA NA NA NA ...
## $ TEAM_BATTING_HBP: int NA NA NA 42 NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
## $ TEAM_PITCHING_HR: int 83 88 93 159 14 20 40 39 25 62 ...
## $ TEAM_PITCHING_BB: int 447 516 509 486 257 420 613 418 497 482 ...
## $ TEAM_PITCHING_SO: int 1080 929 816 914 1123 736 569 715 734 622 ...
## $ TEAM_FIELDING_E : int 140 135 156 124 616 572 490 328 226 184 ...
## $ TEAM_FIELDING_DP: int 156 164 153 154 130 105 NA 104 132 145 ...
str(mb_training)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
# Check Column names
names(mb_evaluation)
## [1] "INDEX" "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B"
## [5] "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [9] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
names(mb_training)
## [1] "INDEX" "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [5] "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO"
## [9] "TEAM_BASERUN_SB" "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [17] "TEAM_FIELDING_DP"
# Check Dimensions
dim(mb_evaluation)
## [1] 259 16
dim(mb_training)
## [1] 2276 17
To begin our analysis, we first explored the “moneyball training data” to understand its structure and key characteristics. This is a critical first step to ensure we are building our models on a solid foundation.
We also found that several columns had missing values. Most notably, TEAM_BATTING_HBP was missing in about 92% of rows and was removed. Other columns with sparse missing data were handled in the next phase.
# Summary statistics: min, quartiles, median, mean, max, and NA counts
summary(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
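Since corrplot is loaded but not yet used, a quick correlation map of the numeric columns (including TARGET_WINS) can hint at which predictors matter most. A minimal sketch, using pairwise-complete observations so the missing values noted above do not drop whole rows:
# Correlation map of the raw training data (pairwise-complete to tolerate NAs)
num_train <- mb_training[sapply(mb_training, is.numeric)]
num_train$INDEX <- NULL  # drop the row identifier before correlating
cor_mat <- cor(num_train, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color", type = "lower", tl.cex = 0.7)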
To improve our model’s accuracy and reliability, we performed a series of transformations. Our goal was to clean the data and reduce the influence of extreme values. Before imputing, we created missingness flags for every column that had NAs so the models can learn from what was originally missing (a minimal sketch of this flagging step follows the missing-ratio check below).
# Check the Ratio of missing data
colSums(is.na(mb_training)) / nrow(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0.00000000 0.00000000 0.00000000 0.00000000
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0.00000000 0.00000000 0.00000000 0.04481547
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 0.05755712 0.33919156 0.91608084 0.00000000
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0.00000000 0.00000000 0.04481547 0.00000000
## TEAM_FIELDING_DP
## 0.12565905
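A minimal standalone sketch of the flagging idea (the full prep recipe later builds these flags inside apply_prep(); flag_demo here is a throwaway object and does not alter mb_training):
# Add a 0/1 indicator for each NA-containing column, computed before any imputation
na_flag_cols <- names(which(colSums(is.na(mb_training)) > 0))
flag_demo <- mb_training %>%
mutate(across(all_of(na_flag_cols), ~ as.integer(is.na(.x)), .names = "{.col}_MISSING")) %>%
dplyr::select(ends_with("_MISSING"))
colSums(flag_demo)  # number of originally-missing rows per flagged column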
# TEAM_BATTING_HBP is missing in about 92% of rows, so we drop the column as unreliable
mb_training$TEAM_BATTING_HBP <- NULL
# For the remaining columns with missing values, impute with the column median
mb_training$TEAM_BATTING_SO[is.na(mb_training$TEAM_BATTING_SO)] <- median(mb_training$TEAM_BATTING_SO, na.rm = TRUE)
mb_training$TEAM_BASERUN_SB[is.na(mb_training$TEAM_BASERUN_SB)] <- median(mb_training$TEAM_BASERUN_SB, na.rm = TRUE)
mb_training$TEAM_PITCHING_SO[is.na(mb_training$TEAM_PITCHING_SO)] <- median(mb_training$TEAM_PITCHING_SO, na.rm = TRUE)
mb_training$TEAM_BASERUN_CS[is.na(mb_training$TEAM_BASERUN_CS)] <- median(mb_training$TEAM_BASERUN_CS, na.rm = TRUE)
mb_training$TEAM_FIELDING_DP[is.na(mb_training$TEAM_FIELDING_DP)] <- median(mb_training$TEAM_FIELDING_DP, na.rm = TRUE)
summary(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 736.3
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. : 0.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 67.0 1st Qu.: 44.00 1st Qu.: 1419 1st Qu.: 50.0
## Median :101.0 Median : 49.00 Median : 1518 Median :107.0
## Mean :123.4 Mean : 51.51 Mean : 1779 Mean :105.7
## 3rd Qu.:151.0 3rd Qu.: 54.25 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :697.0 Max. :201.00 Max. :30132 Max. :343.0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 626.0 1st Qu.: 127.0 1st Qu.:134.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.5 Mean : 246.5 Mean :146.7
## 3rd Qu.: 611.0 3rd Qu.: 957.0 3rd Qu.: 249.2 3rd Qu.:161.2
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
To improve model stability and meet the assumptions of linear regression, we addressed outliers by capping extreme values at the 1.5×IQR threshold.
# Function to identify outliers based on 1.5*IQR rule
identify_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- which(x < lower_bound | x > upper_bound)
return(outliers)
}
numeric_cols <- sapply(mb_training, is.numeric)
outliers_list <- lapply(mb_training[, numeric_cols], identify_outliers)
outliers_list
## $INDEX
## integer(0)
##
## $TARGET_WINS
## [1] 1 53 296 299 391 415 418 422 427 445 859 862 982 1082 1199
## [16] 1210 1211 1345 1505 1584 1708 1825 1828 2012 2015 2031 2232 2233 2235 2239
## [31] 2242 2276
##
## $TEAM_BATTING_H
## [1] 53 55 69 72 191 272 273 286 287 288 289 295 296 297 298
## [16] 299 416 417 418 420 423 434 466 853 854 855 856 860 980 997
## [31] 998 1082 1093 1192 1210 1211 1346 1347 1348 1456 1603 1604 1635 1636 1710
## [46] 1736 1810 1811 1812 1813 1815 1822 1823 1828 1905 1908 1939 2012 2022 2062
## [61] 2136 2219 2220 2233 2237 2239 2276
##
## $TEAM_BATTING_2B
## [1] 295 296 297 381 387 388 424 425 768 777 1810 2012 2062 2063 2239
##
## $TEAM_BATTING_3B
## [1] 59 286 295 298 302 407 408 416 429 639 645 650 980 1093 1585
## [16] 1590 1604 1699 1700 1710 1713 1726 1810 1813 1823 1828 2190 2219 2227
##
## $TEAM_BATTING_HR
## integer(0)
##
## $TEAM_BATTING_BB
## [1] 1 53 54 55 56 57 59 207 272 273 274 294 296 297 298
## [16] 299 341 342 344 391 392 393 394 396 399 400 401 402 403 407
## [31] 409 410 411 412 413 414 415 416 417 418 419 420 421 422 425
## [46] 637 638 639 640 860 861 862 863 864 865 866 881 882 919 921
## [61] 976 982 996 997 998 999 1044 1045 1082 1191 1192 1193 1194 1195 1210
## [76] 1211 1341 1345 1346 1347 1348 1349 1350 1393 1397 1425 1428 1534 1584 1590
## [91] 1698 1701 1703 1707 1811 1812 1813 1814 1815 1816 1818 1819 1822 1823 1824
## [106] 1825 1826 1828 1829 1830 1888 1895 2015 2016 2017 2018 2136 2137 2191 2220
## [121] 2221 2223 2232 2233 2234 2239 2240 2241 2242
##
## $TEAM_BATTING_SO
## [1] 298 299 393 415 860 861 998 999 1211 1345 1349 1350 1812 1813 1823
## [16] 1824 2015 2016 2233 2239
##
## $TEAM_BASERUN_SB
## [1] 63 64 65 66 67 68 69 71 72 278 279 280 281 282 283
## [16] 285 286 287 288 289 291 295 395 396 406 428 429 430 431 432
## [31] 434 435 436 442 444 445 539 540 642 643 644 645 646 647 648
## [46] 650 661 845 846 847 848 849 851 853 889 890 980 1046 1047 1048
## [61] 1049 1050 1086 1087 1088 1089 1090 1091 1093 1098 1101 1197 1198 1200 1202
## [76] 1255 1256 1343 1351 1396 1405 1406 1585 1592 1593 1594 1596 1597 1598 1599
## [91] 1600 1604 1705 1706 1707 1711 1722 1810 1827 1828 1898 1899 1900 1901 1903
## [106] 1904 1905 1906 1907 1908 1914 1915 1916 1917 1920 1921 1922 1923 2021 2022
## [121] 2023 2024 2025 2026 2028 2034 2114 2138 2190 2219 2225 2227 2228 2236 2237
## [136] 2238
##
## $TEAM_BASERUN_CS
## [1] 2 3 9 12 13 14 15 30 31 32 35 51 89 93 94
## [16] 96 97 98 125 126 127 130 148 149 150 152 154 156 167 183
## [31] 184 189 190 191 193 194 195 197 198 202 203 204 223 224 258
## [46] 261 297 313 314 319 321 326 341 342 343 344 348 349 350 351
## [61] 354 355 356 357 365 370 371 386 389 390 392 417 457 458 459
## [76] 460 461 487 491 492 493 495 502 506 508 513 524 532 534 549
## [91] 550 554 555 556 557 558 559 560 561 562 576 577 583 584 596
## [106] 599 600 602 606 623 624 665 670 671 672 673 674 675 701 726
## [121] 728 732 736 737 746 747 761 762 767 770 772 774 776 786 790
## [136] 797 802 803 815 816 818 821 822 823 843 867 868 892 900 902
## [151] 903 904 905 926 927 932 934 937 939 942 949 966 968 974 982
## [166] 991 992 1015 1017 1018 1021 1024 1026 1028 1035 1037 1040 1053 1058 1059
## [181] 1060 1061 1062 1070 1071 1072 1074 1109 1114 1115 1138 1150 1152 1161 1167
## [196] 1170 1178 1210 1211 1213 1228 1231 1257 1258 1263 1264 1266 1267 1268 1269
## [211] 1273 1291 1294 1301 1302 1303 1307 1314 1326 1348 1361 1362 1363 1365 1366
## [226] 1367 1368 1378 1384 1391 1409 1410 1415 1418 1420 1437 1444 1447 1448 1449
## [241] 1451 1452 1453 1457 1490 1503 1504 1509 1513 1518 1519 1534 1535 1537 1542
## [256] 1543 1544 1545 1546 1555 1556 1557 1558 1560 1561 1562 1563 1578 1580 1581
## [271] 1582 1583 1622 1627 1629 1630 1655 1656 1657 1672 1675 1695 1696 1697 1729
## [286] 1734 1735 1737 1738 1765 1767 1770 1771 1785 1787 1789 1790 1791 1792 1794
## [301] 1805 1809 1811 1825 1840 1841 1842 1843 1847 1863 1869 1870 1871 1874 1875
## [316] 1925 1930 1931 1932 1961 1962 1963 1972 1977 1985 1986 1988 1990 1991 1992
## [331] 2005 2007 2009 2049 2053 2054 2057 2083 2107 2110 2111 2112 2113 2114 2115
## [346] 2116 2119 2120 2121 2122 2139 2150 2152 2155 2160 2163 2164 2169 2170 2171
## [361] 2179 2186 2188 2189 2194 2195 2196 2197 2199 2209 2212 2215 2232 2252 2257
## [376] 2258 2259 2260 2261 2262 2263
##
## $TEAM_PITCHING_H
## [1] 1 53 54 55 56 57 58 59 60 62 63 68 69 70 71
## [16] 72 159 245 258 272 273 274 275 276 282 286 287 288 289 294
## [31] 295 296 297 298 299 391 392 393 399 400 401 402 403 404 407
## [46] 409 410 411 412 413 415 416 417 418 419 420 421 422 423 424
## [61] 425 426 427 428 429 434 435 524 614 627 637 638 639 640 649
## [76] 650 724 736 749 820 832 844 853 854 855 856 857 860 861 862
## [91] 863 864 865 866 868 881 882 964 976 977 979 980 982 996 997
## [106] 998 999 1032 1044 1062 1072 1082 1083 1093 1168 1180 1191 1192 1193 1194
## [121] 1210 1211 1223 1328 1340 1341 1342 1345 1346 1347 1348 1349 1350 1351 1369
## [136] 1393 1394 1397 1479 1584 1588 1589 1590 1591 1597 1603 1604 1605 1606 1698
## [151] 1701 1702 1710 1712 1790 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819
## [166] 1822 1823 1824 1825 1826 1828 1829 1830 1842 1854 1869 1882 1895 1896 1897
## [181] 1898 1900 1904 1905 1907 1908 2012 2015 2016 2017 2018 2019 2022 2110 2136
## [196] 2137 2166 2177 2191 2208 2219 2220 2221 2222 2223 2228 2232 2233 2239 2240
## [211] 2241 2242 2264 2276
##
## $TEAM_PITCHING_HR
## [1] 832 964 1810 1882
##
## $TEAM_PITCHING_BB
## [1] 1 19 53 68 207 245 258 269 272 282 285 286 295 298 299
## [16] 341 342 344 395 396 409 415 417 512 627 648 724 749 820 853
## [31] 860 861 862 865 919 921 951 964 996 997 998 999 1083 1089 1191
## [46] 1203 1210 1211 1233 1340 1342 1345 1349 1350 1397 1425 1426 1428 1468 1479
## [61] 1534 1572 1584 1585 1710 1810 1811 1812 1813 1823 1824 1828 1900 1986 2015
## [76] 2016 2026 2110 2123 2136 2137 2177 2219 2226 2227 2228 2233 2234 2239 2253
##
## $TEAM_PITCHING_SO
## [1] 1 282 294 298 299 393 394 415 512 524 736 860 861 868 882
## [16] 964 984 998 999 1082 1211 1340 1341 1342 1345 1349 1350 1369 1381 1698
## [31] 1801 1812 1813 1814 1823 1824 1826 1854 2015 2016 2136 2177 2233 2239 2276
##
## $TEAM_FIELDING_E
## [1] 1 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## [16] 68 69 70 71 171 269 272 273 274 275 276 277 278 279 280
## [31] 281 282 283 284 285 294 295 296 297 298 299 391 392 393 394
## [46] 395 396 399 400 401 402 403 404 405 406 407 408 409 410 411
## [61] 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426
## [76] 427 428 429 430 431 432 433 434 435 436 437 438 439 539 637
## [91] 638 639 640 641 642 643 644 645 646 647 649 650 749 844 845
## [106] 846 847 848 849 850 851 852 853 855 860 861 862 863 864 865
## [121] 866 881 882 883 976 977 978 979 980 981 982 996 997 998 999
## [136] 1044 1045 1046 1047 1048 1049 1050 1082 1083 1084 1085 1086 1087 1088 1090
## [151] 1092 1093 1096 1102 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201
## [166] 1202 1204 1205 1206 1207 1210 1211 1340 1341 1342 1345 1346 1347 1348 1349
## [181] 1350 1351 1393 1394 1395 1396 1397 1584 1585 1588 1589 1590 1591 1592 1593
## [196] 1594 1595 1596 1597 1598 1599 1600 1601 1698 1699 1700 1701 1702 1703 1704
## [211] 1705 1706 1707 1708 1709 1710 1711 1810 1811 1812 1813 1814 1815 1816 1817
## [226] 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1895 1896
## [241] 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911
## [256] 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026
## [271] 2027 2028 2029 2031 2136 2137 2138 2190 2191 2219 2220 2221 2222 2223 2224
## [286] 2225 2226 2227 2228 2229 2230 2232 2233 2234 2235 2236 2237 2238 2239 2240
## [301] 2241 2242 2276
##
## $TEAM_FIELDING_DP
## [1] 53 58 87 174 175 274 303 304 305 307 309 312 339 342 368
## [16] 402 409 418 422 442 447 453 542 543 545 637 656 660 677 680
## [31] 753 754 762 795 886 889 922 976 996 997 1078 1100 1103 1104 1145
## [46] 1234 1247 1249 1250 1251 1341 1397 1402 1403 1404 1407 1431 1438 1441 1443
## [61] 1445 1493 1494 1495 1498 1499 1534 1535 1536 1588 1610 1612 1613 1615 1621
## [76] 1701 1722 1723 1777 1815 1829 1830 1917 1918 1921 2017 2034 2038 2039 2042
## [91] 2137 2194
# visualize
cols_to_plot <- c("TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E")
par(mfrow = c(2, length(cols_to_plot))) # 2 rows: before & after
# Boxplots before capping
for(col in cols_to_plot){
boxplot(mb_training[[col]], main = paste(col, "Before Capping"), col = "lightblue")
}
Applying the 1.5×IQR cap and comparing the before-and-after boxplots, the distributions look much cleaner.
# Function to cap outliers
cap_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
mb_training_capped <- mb_training
numeric_cols <- sapply(mb_training_capped, is.numeric)
mb_training_capped[numeric_cols] <- lapply(mb_training[numeric_cols], cap_outliers)
# summary
summary(mb_training_capped)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 39.50 Min. :1152 Min. :110.5
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.89 Mean :1464 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :123.50 Max. :1769 Max. :370.5
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. :257.5 Min. : 4.375
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.750
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.000
## Mean : 54.95 Mean : 99.61 Mean :506.6 Mean : 736.289
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.000
## Max. :129.00 Max. :264.00 Max. :773.5 Max. :1399.000
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :28.62 Min. :1137 Min. : 0.0
## 1st Qu.: 67.0 1st Qu.:44.00 1st Qu.:1419 1st Qu.: 50.0
## Median :101.0 Median :49.00 Median :1518 Median :107.0
## Mean :117.7 Mean :49.59 Mean :1583 Mean :105.7
## 3rd Qu.:151.0 3rd Qu.:54.25 3rd Qu.:1682 3rd Qu.:150.0
## Max. :277.0 Max. :69.62 Max. :2078 Max. :300.0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. :273.5 Min. : 129.5 Min. : 65.0 Min. : 93.12
## 1st Qu.:476.0 1st Qu.: 626.0 1st Qu.:127.0 1st Qu.:134.00
## Median :536.5 Median : 813.5 Median :159.0 Median :149.00
## Mean :546.4 Mean : 799.8 Mean :205.9 Mean :146.98
## 3rd Qu.:611.0 3rd Qu.: 957.0 3rd Qu.:249.2 3rd Qu.:161.25
## Max. :813.5 Max. :1453.5 Max. :432.6 Max. :202.12
# Visualize the same columns after capping
cols_to_plot <- c("TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E")
par(mfrow = c(2, length(cols_to_plot)))
# Boxplots after capping
for(col in cols_to_plot){
boxplot(mb_training_capped[[col]], main = paste(col, "After Capping"), col = "lightgreen")
}
Next, we impute the remaining sparse fields, add interpretable engineered features, log-transform heavily right-skewed count variables with log1p, and winsorize outliers (extreme season statistics) so that a handful of unusual seasons cannot dominate the fit and hurt accuracy.
safe_div <- function(num, den, default = 0) ifelse(den > 0, num/den, default)
build_prep_params <- function(train_df, target = "TARGET_WINS", id = "INDEX",
winsor_probs = c(.01, .99)) {
# columns with any NA in training
na_cols <- names(which(colSums(is.na(train_df)) > 0))
medians <- train_df %>%
summarise(across(all_of(na_cols), ~median(.x, na.rm = TRUE)))
# build a temporary transformed training set to compute winsor caps on
tmp <- apply_prep(train_df, list(medians = medians, caps = NULL,
na_cols = na_cols, high_err_cut = NULL),
target = target, id = id, do_caps = FALSE)
num_pred <- tmp %>%
dplyr::select(where(is.numeric), -all_of(c(target))) %>% names()  # dplyr:: avoids the MASS::select masking
caps <- list(
lower = sapply(tmp[num_pred], quantile, probs = winsor_probs[1], na.rm = TRUE),
upper = sapply(tmp[num_pred], quantile, probs = winsor_probs[2], na.rm = TRUE)
)
# example bucket/flag threshold (very high errors)
high_err_cut <- quantile(tmp$TEAM_FIELDING_E, 0.90, na.rm = TRUE)
list(medians = medians, caps = caps, na_cols = na_cols, high_err_cut = high_err_cut)
}
apply_prep <- function(df, params, target = "TARGET_WINS", id = "INDEX",
do_caps = TRUE) {
medians <- params$medians
na_cols <- params$na_cols
out <- df %>%
select(-all_of(id)) %>%
# missingness flags before imputation
mutate(across(all_of(na_cols), ~as.integer(is.na(.x)),
.names = "{.col}_MISSING")) %>%
# impute with training medians
mutate(across(all_of(names(medians)), ~replace_na(.x, medians[[cur_column()]]))) %>%
# IMPORTANT: feature engineering
mutate(
SINGLES = pmax(0, TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR),
XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR,
FREE_BASES = TEAM_BATTING_BB + coalesce(TEAM_BATTING_HBP, 0),
SB_ATT = TEAM_BASERUN_SB + coalesce(TEAM_BASERUN_CS, 0),
SB_PCT = safe_div(TEAM_BASERUN_SB, SB_ATT, 0),
K_BB = safe_div(TEAM_PITCHING_SO, TEAM_PITCHING_BB, 0),
HR_RATE_ALLOWED = safe_div(TEAM_PITCHING_HR, TEAM_PITCHING_H, 0),
NET_HR = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_SO = TEAM_PITCHING_SO - TEAM_BATTING_SO,
NET_BB = TEAM_BATTING_BB - TEAM_PITCHING_BB,
DP_PER_ERROR = TEAM_FIELDING_DP / (TEAM_FIELDING_E + 1)
) %>%
# --- light transforms for heavy right-skew (zero-safe) ---
mutate(across(c(TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO,
TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_SO, TEAM_BATTING_HR,
SINGLES, XBH, FREE_BASES, SB_ATT),
log1p, .names = "{.col}")) %>% # overwrite with log1p versions
# --- an example simple bucket flag (kept numeric for modeling) ---
mutate(HIGH_ERRORS_FLAG = as.integer(TEAM_FIELDING_E > params$high_err_cut))
# winsorize using training caps (keeps extreme seasons from dominating)
if (do_caps && !is.null(params$caps)) {
for (nm in intersect(names(out), names(params$caps$lower))) {
out[[nm]] <- pmin(pmax(out[[nm]], params$caps$lower[[nm]]), params$caps$upper[[nm]])
}
}
out
}
The result is a data-prep recipe whose imputation values and cutoffs are learned from the training data only. It cleans the data and enriches it with engineered features, log-transforms the skewed counts, and winsorizes extremes, all without leaking information from the evaluation set.
Next we define the final, simpler version of the recipe functions (these supersede the definitions above): build the recipe on the training set, then apply it to both data sets.
winsorize_iqr_vec <- function(x, q = 1.5) {
qs <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- qs[2] - qs[1]
lo <- qs[1] - q * iqr
hi <- qs[2] + q * iqr
x <- pmin(pmax(x, lo), hi)
x
}
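A quick sanity check of the helper on a toy vector (values are purely illustrative):
# 200 lies far above Q3 + 1.5*IQR, so it is pulled down to the upper fence (12); the rest are unchanged
winsorize_iqr_vec(c(5, 7, 8, 9, 200))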
build_prep_params <- function(df, id = "INDEX", target = "TARGET_WINS") {
# Columns to impute; their medians are learned from the training data
imp_cols <- intersect(c("TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS",
"TEAM_PITCHING_SO","TEAM_FIELDING_DP","TEAM_BATTING_HBP"),
names(df))
medians <- vapply(df[imp_cols], function(x) stats::median(x, na.rm = TRUE), numeric(1))
# which columns get missingness flags
missing_flags <- intersect(c("TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS",
"TEAM_PITCHING_SO","TEAM_FIELDING_DP","TEAM_BATTING_HBP"),
names(df))
# Columns to winsorize and log-transform
wins_cols <- intersect(c("TEAM_PITCHING_H","TEAM_PITCHING_BB",
"TEAM_PITCHING_SO","TEAM_FIELDING_E"),
names(df))
trans_cols <- wins_cols
# Return the prep recipe as a list
list(
id = id,
target = target,
medians = medians,
wins_cols = wins_cols,
trans_cols = trans_cols,
missing_flags = missing_flags
)
}
apply_prep <- function(df, p){
if (!"TEAM_BATTING_HBP" %in% names(df)) df$TEAM_BATTING_HBP <- NA_real_
out <- df %>%
# flags FIRST to preserve original NA info
dplyr::mutate(dplyr::across(tidyselect::any_of(p$missing_flags),
~ as.integer(is.na(.x)), .names = "{.col}_missing")) %>%
dplyr::mutate(FREE_BASES = TEAM_BATTING_BB + dplyr::coalesce(TEAM_BATTING_HBP, 0)) %>%
dplyr::select(-TEAM_BATTING_HBP) %>%
# impute NAs with training medians
dplyr::mutate(dplyr::across(tidyselect::any_of(names(p$medians)),
~ ifelse(is.na(.x), p$medians[cur_column()], .x))) %>%
# engineered features
dplyr::mutate(
SINGLES = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR,
XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR,
SB_ATT = TEAM_BASERUN_SB + dplyr::coalesce(TEAM_BASERUN_CS, 0),
SB_PCT = dplyr::if_else(SB_ATT > 0, TEAM_BASERUN_SB / SB_ATT, 0),
K_BB = dplyr::if_else(TEAM_PITCHING_BB > 0, TEAM_PITCHING_SO / TEAM_PITCHING_BB, 0),
HR_RATE_ALLOWED = dplyr::if_else(TEAM_PITCHING_H > 0, TEAM_PITCHING_HR / TEAM_PITCHING_H, 0),
NET_HR = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_SO = TEAM_PITCHING_SO - TEAM_BATTING_SO,
NET_BB = TEAM_BATTING_BB - TEAM_PITCHING_BB,
DP_PER_ERROR = TEAM_FIELDING_DP / (TEAM_FIELDING_E + 1)
) %>%
# winsorize/transform
dplyr::mutate(dplyr::across(tidyselect::any_of(p$wins_cols), winsorize_iqr_vec),
dplyr::across(tidyselect::any_of(p$trans_cols), log1p)) %>%
dplyr::select(-tidyselect::any_of(p$id))
out
}
Now we build the prep recipe on the training data and apply it to both the training and evaluation sets.
tr_raw <- mb_training
ev_raw <- mb_evaluation
prep_params <- build_prep_params(tr_raw, id = "INDEX", target = "TARGET_WINS")
mb_train_prepared <- apply_prep(tr_raw, prep_params)
mb_eval_prepared <- apply_prep(ev_raw, prep_params)
dim(mb_train_prepared); dim(mb_eval_prepared)
## [1] 2276 31
## [1] 259 30
colSums(is.na(mb_train_prepared))
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 0 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0 0
## TEAM_BATTING_SO_missing TEAM_BASERUN_SB_missing TEAM_BASERUN_CS_missing
## 0 0 0
## TEAM_PITCHING_SO_missing TEAM_FIELDING_DP_missing FREE_BASES
## 0 0 0
## SINGLES XBH SB_ATT
## 0 0 0
## SB_PCT K_BB HR_RATE_ALLOWED
## 0 0 0
## NET_HR NET_SO NET_BB
## 0 0 0
## DP_PER_ERROR
## 0
colSums(is.na(mb_eval_prepared))
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0 0 0
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0 0 0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 0 0 0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 0 0 0
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_SO_missing
## 0 0 0
## TEAM_BASERUN_SB_missing TEAM_BASERUN_CS_missing TEAM_PITCHING_SO_missing
## 0 0 0
## TEAM_FIELDING_DP_missing FREE_BASES SINGLES
## 0 0 0
## XBH SB_ATT SB_PCT
## 0 0 0
## K_BB HR_RATE_ALLOWED NET_HR
## 0 0 0
## NET_SO NET_BB DP_PER_ERROR
## 0 0 0
As a sanity check, we verify that setdiff() returns only TARGET_WINS (the one column the evaluation set lacks) and that no NA values remain in either prepared data set.
setdiff(names(mb_train_prepared), names(mb_eval_prepared))
## [1] "TARGET_WINS"
setdiff(names(mb_eval_prepared), names(mb_train_prepared))
## character(0)
c(train_any_NA = anyNA(mb_train_prepared),
eval_any_NA = anyNA(mb_eval_prepared))
## train_any_NA eval_any_NA
## FALSE FALSE
train_num <- dplyr::select(mb_train_prepared, where(is.numeric))
eval_num <- dplyr::select(mb_eval_prepared, where(is.numeric))
c(train_any_inf = any(is.infinite(as.matrix(train_num))),
eval_any_inf = any(is.infinite(as.matrix(eval_num))))
## train_any_inf eval_any_inf
## FALSE FALSE
With a clean and enriched dataset, we constructed four different multiple linear regression models. Our approach was to start with a simple baseline and progressively test more complex hypotheses about what drives team wins.
## BUILD MODELS - SECTION 3
# Model 1: Traditional Baseball Statistics (Basic Model)
# Focus on core batting, pitching, and fielding metrics
model1 <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + # Offense
TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E, # Defense
data = mb_train_prepared)
# Model 2: Engineered Features Model (Advanced Metrics)
# Uses the composite variables we created for more baseball-specific insights
model2 <- lm(TARGET_WINS ~
SINGLES + XBH + FREE_BASES + # Offensive production
HR_RATE_ALLOWED + K_BB + DP_PER_ERROR + # Pitching/fielding efficiency
SB_PCT + NET_BB, # Base running and discipline
data = mb_train_prepared)
# Model 3: Comprehensive Model with Interactions
# Includes most variables plus some key interactions that make baseball sense
model3 <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E +
TEAM_BASERUN_SB + TEAM_PITCHING_SO +
# Interactions that make theoretical sense:
TEAM_BATTING_H:TEAM_BATTING_HR + # Hit productivity
TEAM_PITCHING_H:TEAM_PITCHING_HR + # Pitching vulnerability
TEAM_BATTING_BB:TEAM_PITCHING_BB + # Discipline differential
TEAM_FIELDING_E:TEAM_BASERUN_SB, # Errors vs aggressive running
data = mb_train_prepared)
# Model 4: Parsimonious Model (Variable Selection using Stepwise)
# Let R help select the most impactful variables automatically
full_model <- lm(TARGET_WINS ~ ., data = mb_train_prepared)
model4 <- step(full_model, direction = "both", trace = 0) # Stepwise selection
# Display model summaries
cat("=== MODEL 1: Traditional Statistics ===\n")
## === MODEL 1: Traditional Statistics ===
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E,
## data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -8.834 0.169 9.042 49.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.234006 24.529242 0.417 0.676559
## TEAM_BATTING_H 0.046456 0.002786 16.673 < 2e-16 ***
## TEAM_BATTING_HR 0.008399 0.025297 0.332 0.739903
## TEAM_BATTING_BB 0.027377 0.002862 9.565 < 2e-16 ***
## TEAM_PITCHING_H 1.699264 3.891374 0.437 0.662389
## TEAM_PITCHING_HR -0.016747 0.023161 -0.723 0.469713
## TEAM_FIELDING_E -4.413656 1.191548 -3.704 0.000217 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.86 on 2269 degrees of freedom
## Multiple R-squared: 0.2277, Adjusted R-squared: 0.2257
## F-statistic: 111.5 on 6 and 2269 DF, p-value: < 2.2e-16
cat("\n=== MODEL 2: Engineered Features ===\n")
##
## === MODEL 2: Engineered Features ===
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ SINGLES + XBH + FREE_BASES + HR_RATE_ALLOWED +
## K_BB + DP_PER_ERROR + SB_PCT + NET_BB, data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.680 -9.115 0.202 8.982 65.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.568871 5.285908 -3.135 0.00174 **
## SINGLES 0.044029 0.003638 12.103 < 2e-16 ***
## XBH 0.050577 0.005942 8.512 < 2e-16 ***
## FREE_BASES 0.026595 0.003741 7.110 1.55e-12 ***
## HR_RATE_ALLOWED 9.403030 16.038247 0.586 0.55774
## K_BB 0.477563 0.687961 0.694 0.48765
## DP_PER_ERROR 1.862919 1.137729 1.637 0.10169
## SB_PCT 21.098956 2.726786 7.738 1.51e-14 ***
## NET_BB 0.005693 0.002052 2.774 0.00558 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.7 on 2267 degrees of freedom
## Multiple R-squared: 0.2464, Adjusted R-squared: 0.2437
## F-statistic: 92.65 on 8 and 2267 DF, p-value: < 2.2e-16
cat("\n=== MODEL 3: Comprehensive with Interactions ===\n")
##
## === MODEL 3: Comprehensive with Interactions ===
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E +
## TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_BATTING_H:TEAM_BATTING_HR +
## TEAM_PITCHING_H:TEAM_PITCHING_HR + TEAM_BATTING_BB:TEAM_PITCHING_BB +
## TEAM_FIELDING_E:TEAM_BASERUN_SB, data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.249 -8.868 -0.161 8.646 52.266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.524e+01 5.209e+01 -0.869 0.385131
## TEAM_BATTING_H 4.991e-02 4.466e-03 11.174 < 2e-16 ***
## TEAM_BATTING_HR 1.623e-01 1.315e-01 1.235 0.216956
## TEAM_BATTING_BB 6.294e-02 6.088e-02 1.034 0.301355
## TEAM_PITCHING_H 6.704e+00 7.159e+00 0.936 0.349138
## TEAM_PITCHING_HR -1.323e-01 5.975e-01 -0.221 0.824830
## TEAM_FIELDING_E -3.670e+00 1.924e+00 -1.907 0.056594 .
## TEAM_BASERUN_SB 2.118e-01 5.001e-02 4.236 2.36e-05 ***
## TEAM_PITCHING_SO 4.917e-01 1.206e+00 0.408 0.683572
## TEAM_BATTING_H:TEAM_BATTING_HR -8.527e-05 6.050e-05 -1.409 0.158877
## TEAM_PITCHING_H:TEAM_PITCHING_HR 1.345e-02 7.528e-02 0.179 0.858265
## TEAM_BATTING_BB:TEAM_PITCHING_BB -5.759e-03 8.512e-03 -0.677 0.498729
## TEAM_FIELDING_E:TEAM_BASERUN_SB -3.063e-02 8.785e-03 -3.487 0.000498 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.57 on 2263 degrees of freedom
## Multiple R-squared: 0.2623, Adjusted R-squared: 0.2584
## F-statistic: 67.05 on 12 and 2263 DF, p-value: < 2.2e-16
cat("\n=== MODEL 4: Stepwise Selection ===\n")
##
## === MODEL 4: Stepwise Selection ===
summary(model4)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP + SB_PCT + K_BB + HR_RATE_ALLOWED +
## NET_SO + NET_BB, data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.610 -8.073 -0.053 7.627 58.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.319319 27.085683 1.452 0.146734
## TEAM_BATTING_H 0.050734 0.003988 12.722 < 2e-16 ***
## TEAM_BATTING_2B -0.028058 0.009458 -2.967 0.003042 **
## TEAM_BATTING_3B 0.109077 0.017049 6.398 1.91e-10 ***
## TEAM_BATTING_HR -0.215950 0.058028 -3.721 0.000203 ***
## TEAM_BATTING_BB 0.069076 0.008316 8.306 < 2e-16 ***
## TEAM_BATTING_SO -0.020975 0.004266 -4.917 9.43e-07 ***
## TEAM_BASERUN_SB 0.022301 0.005892 3.785 0.000158 ***
## TEAM_PITCHING_H 19.283709 4.623214 4.171 3.15e-05 ***
## TEAM_PITCHING_BB -29.997712 4.547823 -6.596 5.24e-11 ***
## TEAM_PITCHING_SO 12.341996 2.607991 4.732 2.36e-06 ***
## TEAM_FIELDING_E -14.998163 1.332353 -11.257 < 2e-16 ***
## TEAM_FIELDING_DP -0.135404 0.013106 -10.331 < 2e-16 ***
## SB_PCT 12.553555 3.994507 3.143 0.001696 **
## K_BB -2.635252 1.319341 -1.997 0.045902 *
## HR_RATE_ALLOWED 374.999970 85.226561 4.400 1.13e-05 ***
## NET_SO 0.005329 0.001015 5.249 1.67e-07 ***
## NET_BB 0.006406 0.003925 1.632 0.102825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.84 on 2258 degrees of freedom
## Multiple R-squared: 0.341, Adjusted R-squared: 0.336
## F-statistic: 68.72 on 17 and 2258 DF, p-value: < 2.2e-16
# Function to extract key coefficient insights
analyze_coefficients <- function(model, model_name) {
cat("\n", paste("====", model_name, "Coefficient Analysis ===="), "\n")
coef_df <- as.data.frame(summary(model)$coefficients)
coef_df <- coef_df[-1, ] # Remove intercept
coef_df <- coef_df[order(abs(coef_df$Estimate), decreasing = TRUE), ]
# Highlight counter-intuitive coefficients
counter_intuitive <- coef_df[coef_df$Estimate < 0 &
!grepl("PITCHING|FIELDING_E|BATTING_SO|BASERUN_CS", rownames(coef_df)), ]
if(nrow(counter_intuitive) > 0) {
cat("POTENTIAL COUNTER-INTUITIVE COEFFICIENTS (negative but should be positive):\n")
print(counter_intuitive[, c("Estimate", "Pr(>|t|)")])
} else {
cat("All coefficients align with theoretical expectations.\n")
}
# Top 5 most impactful variables
cat("\nTOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):\n")
print(head(coef_df[, c("Estimate", "Pr(>|t|)")], 5))
}
# Analyze each model
analyze_coefficients(model1, "MODEL 1: Traditional")
##
## ==== MODEL 1: Traditional Coefficient Analysis ====
## All coefficients align with theoretical expectations.
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## TEAM_FIELDING_E -4.41365637 2.171781e-04
## TEAM_PITCHING_H 1.69926391 6.623889e-01
## TEAM_BATTING_H 0.04645571 5.732837e-59
## TEAM_BATTING_BB 0.02737706 2.814836e-21
## TEAM_PITCHING_HR -0.01674690 4.697127e-01
analyze_coefficients(model2, "MODEL 2: Engineered Features")
##
## ==== MODEL 2: Engineered Features Coefficient Analysis ====
## All coefficients align with theoretical expectations.
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## SB_PCT 21.09895583 1.512744e-14
## HR_RATE_ALLOWED 9.40303048 5.577404e-01
## DP_PER_ERROR 1.86291853 1.016855e-01
## K_BB 0.47756291 4.876461e-01
## XBH 0.05057736 3.072995e-17
analyze_coefficients(model3, "MODEL 3: Comprehensive")
##
## ==== MODEL 3: Comprehensive Coefficient Analysis ====
## POTENTIAL COUNTER-INTUITIVE COEFFICIENTS (negative but should be positive):
## Estimate Pr(>|t|)
## TEAM_BATTING_H:TEAM_BATTING_HR -8.526705e-05 0.1588775
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## TEAM_PITCHING_H 6.7039785 0.3491384292
## TEAM_FIELDING_E -3.6703168 0.0565940695
## TEAM_PITCHING_SO 0.4917158 0.6835718479
## TEAM_BASERUN_SB 0.2118431 0.0000236405
## TEAM_BATTING_HR 0.1623481 0.2169557527
analyze_coefficients(model4, "MODEL 4: Stepwise")
##
## ==== MODEL 4: Stepwise Coefficient Analysis ====
## POTENTIAL COUNTER-INTUITIVE COEFFICIENTS (negative but should be positive):
## Estimate Pr(>|t|)
## K_BB -2.63525186 4.590157e-02
## TEAM_BATTING_HR -0.21595012 2.029069e-04
## TEAM_FIELDING_DP -0.13540379 1.768399e-24
## TEAM_BATTING_2B -0.02805799 3.041682e-03
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## HR_RATE_ALLOWED 374.99997 1.132659e-05
## TEAM_PITCHING_BB -29.99771 5.242284e-11
## TEAM_PITCHING_H 19.28371 3.146453e-05
## TEAM_FIELDING_E -14.99816 1.221650e-28
## SB_PCT 12.55355 1.695668e-03
# Check for multicollinearity
cat("\n=== MULTICOLLINEARITY CHECK (VIF) ===\n")
##
## === MULTICOLLINEARITY CHECK (VIF) ===
cat("Model 1 VIF:\n")
## Model 1 VIF:
print(vif(model1))
## TEAM_BATTING_H TEAM_BATTING_HR TEAM_BATTING_BB TEAM_PITCHING_H
## 1.921821 27.778299 1.459812 3.346197
## TEAM_PITCHING_HR TEAM_FIELDING_E
## 23.867422 3.792397
cat("\nModel 2 VIF:\n")
##
## Model 2 VIF:
print(vif(model2))
## SINGLES XBH FREE_BASES HR_RATE_ALLOWED K_BB
## 2.666819 2.856819 2.552833 5.338461 2.129716
## DP_PER_ERROR SB_PCT NET_BB
## 3.148032 1.197411 1.161732
cat("\nModel 3 VIF:\n")
##
## Model 3 VIF:
print(vif(model3))
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## TEAM_BATTING_H TEAM_BATTING_HR
## 5.155971 783.171135
## TEAM_BATTING_BB TEAM_PITCHING_H
## 689.520454 11.823970
## TEAM_PITCHING_HR TEAM_FIELDING_E
## 16585.349178 10.326119
## TEAM_BASERUN_SB TEAM_PITCHING_SO
## 225.507497 2.159819
## TEAM_BATTING_H:TEAM_BATTING_HR TEAM_PITCHING_H:TEAM_PITCHING_HR
## 369.286690 14216.671830
## TEAM_BATTING_BB:TEAM_PITCHING_BB TEAM_FIELDING_E:TEAM_BASERUN_SB
## 637.652929 258.979229
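The warning above notes that term-level VIFs are not meaningful when interaction terms are present and suggests type = 'predictor'. A hedged follow-up sketch (assuming the installed car version supports the type argument, as the warning implies):
# Re-check Model 3 collinearity at the predictor level rather than per model term
print(vif(model3, type = "predictor"))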
The four models were evaluated based on their predictive accuracy, interpretability, and statistical robustness.
Based on the in-sample performance metrics below, Model 4 is the clear winner. It has the lowest Root Mean Square Error (RMSE), indicating its predictions are, on average, closest to the actual number of wins. Its R-squared of 0.341 (adjusted R-squared 0.336) shows that it explains about 34% of the variability in team wins.
Model 2, meanwhile, had logically sound coefficients and low multicollinearity.
# Function to calculate model performance metrics
calculate_metrics <- function(model, data) {
predictions <- predict(model, newdata = data)
actual <- data$TARGET_WINS
residuals <- actual - predictions
mse <- mean(residuals^2)
rmse <- sqrt(mse)
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
return(c(
RMSE = rmse,
MSE = mse,
R2 = r_squared,
Adj_R2 = adj_r_squared,
AIC = AIC(model),
BIC = BIC(model)
))
}
# Compare all models
model_comparison <- data.frame(
Model1 = calculate_metrics(model1, mb_train_prepared),
Model2 = calculate_metrics(model2, mb_train_prepared),
Model3 = calculate_metrics(model3, mb_train_prepared),
Model4 = calculate_metrics(model4, mb_train_prepared)
)
cat("=== MODEL PERFORMANCE COMPARISON ===\n")
## === MODEL PERFORMANCE COMPARISON ===
print(round(model_comparison, 4))
## Model1 Model2 Model3 Model4
## RMSE 13.8397 13.6715 13.5266 12.7850
## MSE 191.5368 186.9097 182.9680 163.4565
## R2 0.2277 0.2464 0.2623 0.3410
## Adj_R2 0.2257 0.2437 0.2584 0.3360
## AIC 18435.5697 18383.9118 18343.4007 18096.7489
## BIC 18481.4111 18441.2135 18423.6231 18205.6222
The Residuals vs Fitted plots look reasonably solid for all four models: the points scatter around the zero line with no pronounced curvature or funnel shape, so the linearity and constant-variance assumptions appear acceptable. The Shapiro-Wilk p-values shown below, however, are all far below the standard 0.05 threshold, giving strong statistical evidence that the residuals of all four models are not normally distributed.
# Residual plots for each model
par(mfrow = c(2, 2))
plot(model1, which = 1, main = "Model 1: Residuals vs Fitted")
plot(model2, which = 1, main = "Model 2: Residuals vs Fitted")
plot(model3, which = 1, main = "Model 3: Residuals vs Fitted")
plot(model4, which = 1, main = "Model 4: Residuals vs Fitted")
# Reset plotting parameters
par(mfrow = c(1, 1))
# Normality check
cat("\n=== NORMALITY TESTS (Shapiro-Wilk) ===\n")
##
## === NORMALITY TESTS (Shapiro-Wilk) ===
cat("Model 1 residuals normality p-value:", shapiro.test(residuals(model1))$p.value, "\n")
## Model 1 residuals normality p-value: 2.332624e-07
cat("Model 2 residuals normality p-value:", shapiro.test(residuals(model2))$p.value, "\n")
## Model 2 residuals normality p-value: 2.449603e-08
cat("Model 3 residuals normality p-value:", shapiro.test(residuals(model3))$p.value, "\n")
## Model 3 residuals normality p-value: 3.61602e-07
cat("Model 4 residuals normality p-value:", shapiro.test(residuals(model4))$p.value, "\n")
## Model 4 residuals normality p-value: 3.818181e-12
We now use our chosen model (Model 2, selected for its interpretability and theoretically sound coefficients) to generate final predictions for the evaluation teams.
final_predictions <- predict(model2, newdata = mb_eval_prepared)
evaluation_results <- data.frame(
INDEX = mb_evaluation$INDEX,
PREDICTED_WINS = final_predictions
)
evaluation_results <- evaluation_results %>%
mutate(PREDICTED_WINS = round(PREDICTED_WINS, 0))
head(evaluation_results)
## INDEX PREDICTED_WINS
## 1 9 66
## 2 10 69
## 3 14 75
## 4 47 88
## 5 60 67
## 6 63 68
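If the predictions need to be shared or submitted, a minimal sketch of exporting them (the file name moneyball_predictions.csv is an assumption, not part of the assignment):
# Write the INDEX / PREDICTED_WINS table to a CSV for submission or later review
write.csv(evaluation_results, "moneyball_predictions.csv", row.names = FALSE)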
The top-performing model is Model 4, which has the best predictive power as indicated by the lowest Root Mean Square Error (RMSE = 12.79) and the highest R-squared (0.341). However, it contains several counter-intuitive coefficients, such as a negative coefficient for TEAM_BATTING_HR (-0.216) and a positive coefficient for TEAM_PITCHING_H (+19.28), which likely reflect multicollinearity.
For projects prioritizing interpretability, Model 2 (Engineered Features) is the recommended choice. Although its performance metrics are weaker (RMSE = 13.67, R-squared = 0.246), all of its coefficients are theoretically sound and its multicollinearity is low (all VIFs below 6). It also highlights the value of engineered features such as SB_PCT, whose coefficient of about +21 on a 0-1 scale translates to roughly +2.1 wins for every 10-percentage-point improvement in stolen-base success rate. Model 3 is the second-best performer (RMSE = 13.53, R-squared = 0.262) but should be avoided due to severe multicollinearity (VIFs as high as roughly 16,585).