Contents: Objective, Library, Data Import, Data Exploration, Data Preparation, Data Cleaning, Model Ready Prep, Build Models, Coefficient, Evaluating, Residual, Prediction Result, Conclusion
This report outlines the development of a multiple linear regression model to predict the total number of wins for a baseball team in a season (TARGET_WINS). Using a training dataset of historical team performance, we undertook a comprehensive process of data cleaning, feature engineering, and model development.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(tidyr)
library(tidyselect)
mb_evaluation <- read.csv(
"https://raw.githubusercontent.com/GullitNa/DATA621-HW1/main/moneyball-evaluation-data.csv",
stringsAsFactors = FALSE
)
mb_training <- read.csv(
"https://raw.githubusercontent.com/GullitNa/DATA621-HW1/main/moneyball-training-data.csv",
stringsAsFactors = FALSE
)
head(mb_evaluation)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 9 1209 170 33 83
## 2 10 1221 151 29 88
## 3 14 1395 183 29 93
## 4 47 1539 309 29 159
## 5 60 1445 203 68 5
## 6 63 1431 236 53 10
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1 447 1080 62 50
## 2 516 929 54 39
## 3 509 816 59 47
## 4 486 914 148 57
## 5 95 416 NA NA
## 6 215 377 NA NA
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1 NA 1209 83 447
## 2 NA 1221 88 516
## 3 NA 1395 93 509
## 4 42 1539 159 486
## 5 NA 3902 14 257
## 6 NA 2793 20 420
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 1080 140 156
## 2 929 135 164
## 3 816 156 153
## 4 914 124 154
## 5 1123 616 130
## 6 736 572 105
head(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
# Check the structure of each data set
str(mb_evaluation)
## 'data.frame': 259 obs. of 16 variables:
## $ INDEX : int 9 10 14 47 60 63 74 83 98 120 ...
## $ TEAM_BATTING_H : int 1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
## $ TEAM_BATTING_2B : int 170 151 183 309 203 236 219 158 177 212 ...
## $ TEAM_BATTING_3B : int 33 29 29 29 68 53 55 42 78 42 ...
## $ TEAM_BATTING_HR : int 83 88 93 159 5 10 37 33 23 58 ...
## $ TEAM_BATTING_BB : int 447 516 509 486 95 215 568 356 466 452 ...
## $ TEAM_BATTING_SO : int 1080 929 816 914 416 377 527 609 689 584 ...
## $ TEAM_BASERUN_SB : int 62 54 59 148 NA NA 365 185 150 52 ...
## $ TEAM_BASERUN_CS : int 50 39 47 57 NA NA NA NA NA NA ...
## $ TEAM_BATTING_HBP: int NA NA NA 42 NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
## $ TEAM_PITCHING_HR: int 83 88 93 159 14 20 40 39 25 62 ...
## $ TEAM_PITCHING_BB: int 447 516 509 486 257 420 613 418 497 482 ...
## $ TEAM_PITCHING_SO: int 1080 929 816 914 1123 736 569 715 734 622 ...
## $ TEAM_FIELDING_E : int 140 135 156 124 616 572 490 328 226 184 ...
## $ TEAM_FIELDING_DP: int 156 164 153 154 130 105 NA 104 132 145 ...
str(mb_training)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
# Check Column names
names(mb_evaluation)
## [1] "INDEX" "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B"
## [5] "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [9] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
names(mb_training)
## [1] "INDEX" "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [5] "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO"
## [9] "TEAM_BASERUN_SB" "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [17] "TEAM_FIELDING_DP"
# Check Dimensions
dim(mb_evaluation)
## [1] 259 16
dim(mb_training)
## [1] 2276 17
To begin our analysis, we first explored the “moneyball training data” to understand its structure and key characteristics. This is a critical first step to ensure we are building our models on a solid foundation.
We also found that several columns had missing values. Most notably, TEAM_BATTING_HBP was missing in about 92% of rows and was removed. Other columns with sparse missing data were handled in the next phase.
# Summary statistics: min, quartiles, median, mean, max, and NA counts
summary(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
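Since corrplot is loaded but not yet used, a quick correlation map of the numeric columns (including TARGET_WINS) can hint at which predictors matter most. A minimal sketch, using pairwise-complete observations so the missing values noted above do not drop whole rows:
# Correlation map of the raw training data (pairwise-complete to tolerate NAs)
num_train <- mb_training[sapply(mb_training, is.numeric)]
num_train$INDEX <- NULL  # drop the row identifier before correlating
cor_mat <- cor(num_train, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color", type = "lower", tl.cex = 0.7)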
To improve our model’s accuracy and reliability, we performed a series of transformations. Our goal was to clean the data and reduce the influence of extreme values. Before imputing, we created missingness flags for every column that had NAs so the models can learn from what was originally missing (a minimal sketch of this flagging step follows the missing-ratio check below).
# Check the Ratio of missing data
colSums(is.na(mb_training)) / nrow(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0.00000000 0.00000000 0.00000000 0.00000000
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0.00000000 0.00000000 0.00000000 0.04481547
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 0.05755712 0.33919156 0.91608084 0.00000000
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0.00000000 0.00000000 0.04481547 0.00000000
## TEAM_FIELDING_DP
## 0.12565905
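A minimal standalone sketch of the flagging idea (the full prep recipe later builds these flags inside apply_prep(); flag_demo here is a throwaway object and does not alter mb_training):
# Add a 0/1 indicator for each NA-containing column, computed before any imputation
na_flag_cols <- names(which(colSums(is.na(mb_training)) > 0))
flag_demo <- mb_training %>%
mutate(across(all_of(na_flag_cols), ~ as.integer(is.na(.x)), .names = "{.col}_MISSING")) %>%
dplyr::select(ends_with("_MISSING"))
colSums(flag_demo)  # number of originally-missing rows per flagged column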
# TEAM_BATTING_HBP is missing in about 92% of rows, so we drop the column as unreliable
mb_training$TEAM_BATTING_HBP <- NULL
# For the remaining columns with missing values, impute with the column median
mb_training$TEAM_BATTING_SO[is.na(mb_training$TEAM_BATTING_SO)] <- median(mb_training$TEAM_BATTING_SO, na.rm = TRUE)
mb_training$TEAM_BASERUN_SB[is.na(mb_training$TEAM_BASERUN_SB)] <- median(mb_training$TEAM_BASERUN_SB, na.rm = TRUE)
mb_training$TEAM_PITCHING_SO[is.na(mb_training$TEAM_PITCHING_SO)] <- median(mb_training$TEAM_PITCHING_SO, na.rm = TRUE)
mb_training$TEAM_BASERUN_CS[is.na(mb_training$TEAM_BASERUN_CS)] <- median(mb_training$TEAM_BASERUN_CS, na.rm = TRUE)
mb_training$TEAM_FIELDING_DP[is.na(mb_training$TEAM_FIELDING_DP)] <- median(mb_training$TEAM_FIELDING_DP, na.rm = TRUE)
summary(mb_training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 736.3
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. : 0.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 67.0 1st Qu.: 44.00 1st Qu.: 1419 1st Qu.: 50.0
## Median :101.0 Median : 49.00 Median : 1518 Median :107.0
## Mean :123.4 Mean : 51.51 Mean : 1779 Mean :105.7
## 3rd Qu.:151.0 3rd Qu.: 54.25 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :697.0 Max. :201.00 Max. :30132 Max. :343.0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 626.0 1st Qu.: 127.0 1st Qu.:134.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.5 Mean : 246.5 Mean :146.7
## 3rd Qu.: 611.0 3rd Qu.: 957.0 3rd Qu.: 249.2 3rd Qu.:161.2
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
To improve model stability and meet the assumptions of linear regression, we addressed outliers by capping extreme values at the 1.5×IQR threshold.
# Function to identify outliers based on 1.5*IQR rule
identify_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- which(x < lower_bound | x > upper_bound)
return(outliers)
}
numeric_cols <- sapply(mb_training, is.numeric)
outliers_list <- lapply(mb_training[, numeric_cols], identify_outliers)
outliers_list
## $INDEX
## integer(0)
##
## $TARGET_WINS
## [1] 1 53 296 299 391 415 418 422 427 445 859 862 982 1082 1199
## [16] 1210 1211 1345 1505 1584 1708 1825 1828 2012 2015 2031 2232 2233 2235 2239
## [31] 2242 2276
##
## $TEAM_BATTING_H
## [1] 53 55 69 72 191 272 273 286 287 288 289 295 296 297 298
## [16] 299 416 417 418 420 423 434 466 853 854 855 856 860 980 997
## [31] 998 1082 1093 1192 1210 1211 1346 1347 1348 1456 1603 1604 1635 1636 1710
## [46] 1736 1810 1811 1812 1813 1815 1822 1823 1828 1905 1908 1939 2012 2022 2062
## [61] 2136 2219 2220 2233 2237 2239 2276
##
## $TEAM_BATTING_2B
## [1] 295 296 297 381 387 388 424 425 768 777 1810 2012 2062 2063 2239
##
## $TEAM_BATTING_3B
## [1] 59 286 295 298 302 407 408 416 429 639 645 650 980 1093 1585
## [16] 1590 1604 1699 1700 1710 1713 1726 1810 1813 1823 1828 2190 2219 2227
##
## $TEAM_BATTING_HR
## integer(0)
##
## $TEAM_BATTING_BB
## [1] 1 53 54 55 56 57 59 207 272 273 274 294 296 297 298
## [16] 299 341 342 344 391 392 393 394 396 399 400 401 402 403 407
## [31] 409 410 411 412 413 414 415 416 417 418 419 420 421 422 425
## [46] 637 638 639 640 860 861 862 863 864 865 866 881 882 919 921
## [61] 976 982 996 997 998 999 1044 1045 1082 1191 1192 1193 1194 1195 1210
## [76] 1211 1341 1345 1346 1347 1348 1349 1350 1393 1397 1425 1428 1534 1584 1590
## [91] 1698 1701 1703 1707 1811 1812 1813 1814 1815 1816 1818 1819 1822 1823 1824
## [106] 1825 1826 1828 1829 1830 1888 1895 2015 2016 2017 2018 2136 2137 2191 2220
## [121] 2221 2223 2232 2233 2234 2239 2240 2241 2242
##
## $TEAM_BATTING_SO
## [1] 298 299 393 415 860 861 998 999 1211 1345 1349 1350 1812 1813 1823
## [16] 1824 2015 2016 2233 2239
##
## $TEAM_BASERUN_SB
## [1] 63 64 65 66 67 68 69 71 72 278 279 280 281 282 283
## [16] 285 286 287 288 289 291 295 395 396 406 428 429 430 431 432
## [31] 434 435 436 442 444 445 539 540 642 643 644 645 646 647 648
## [46] 650 661 845 846 847 848 849 851 853 889 890 980 1046 1047 1048
## [61] 1049 1050 1086 1087 1088 1089 1090 1091 1093 1098 1101 1197 1198 1200 1202
## [76] 1255 1256 1343 1351 1396 1405 1406 1585 1592 1593 1594 1596 1597 1598 1599
## [91] 1600 1604 1705 1706 1707 1711 1722 1810 1827 1828 1898 1899 1900 1901 1903
## [106] 1904 1905 1906 1907 1908 1914 1915 1916 1917 1920 1921 1922 1923 2021 2022
## [121] 2023 2024 2025 2026 2028 2034 2114 2138 2190 2219 2225 2227 2228 2236 2237
## [136] 2238
##
## $TEAM_BASERUN_CS
## [1] 2 3 9 12 13 14 15 30 31 32 35 51 89 93 94
## [16] 96 97 98 125 126 127 130 148 149 150 152 154 156 167 183
## [31] 184 189 190 191 193 194 195 197 198 202 203 204 223 224 258
## [46] 261 297 313 314 319 321 326 341 342 343 344 348 349 350 351
## [61] 354 355 356 357 365 370 371 386 389 390 392 417 457 458 459
## [76] 460 461 487 491 492 493 495 502 506 508 513 524 532 534 549
## [91] 550 554 555 556 557 558 559 560 561 562 576 577 583 584 596
## [106] 599 600 602 606 623 624 665 670 671 672 673 674 675 701 726
## [121] 728 732 736 737 746 747 761 762 767 770 772 774 776 786 790
## [136] 797 802 803 815 816 818 821 822 823 843 867 868 892 900 902
## [151] 903 904 905 926 927 932 934 937 939 942 949 966 968 974 982
## [166] 991 992 1015 1017 1018 1021 1024 1026 1028 1035 1037 1040 1053 1058 1059
## [181] 1060 1061 1062 1070 1071 1072 1074 1109 1114 1115 1138 1150 1152 1161 1167
## [196] 1170 1178 1210 1211 1213 1228 1231 1257 1258 1263 1264 1266 1267 1268 1269
## [211] 1273 1291 1294 1301 1302 1303 1307 1314 1326 1348 1361 1362 1363 1365 1366
## [226] 1367 1368 1378 1384 1391 1409 1410 1415 1418 1420 1437 1444 1447 1448 1449
## [241] 1451 1452 1453 1457 1490 1503 1504 1509 1513 1518 1519 1534 1535 1537 1542
## [256] 1543 1544 1545 1546 1555 1556 1557 1558 1560 1561 1562 1563 1578 1580 1581
## [271] 1582 1583 1622 1627 1629 1630 1655 1656 1657 1672 1675 1695 1696 1697 1729
## [286] 1734 1735 1737 1738 1765 1767 1770 1771 1785 1787 1789 1790 1791 1792 1794
## [301] 1805 1809 1811 1825 1840 1841 1842 1843 1847 1863 1869 1870 1871 1874 1875
## [316] 1925 1930 1931 1932 1961 1962 1963 1972 1977 1985 1986 1988 1990 1991 1992
## [331] 2005 2007 2009 2049 2053 2054 2057 2083 2107 2110 2111 2112 2113 2114 2115
## [346] 2116 2119 2120 2121 2122 2139 2150 2152 2155 2160 2163 2164 2169 2170 2171
## [361] 2179 2186 2188 2189 2194 2195 2196 2197 2199 2209 2212 2215 2232 2252 2257
## [376] 2258 2259 2260 2261 2262 2263
##
## $TEAM_PITCHING_H
## [1] 1 53 54 55 56 57 58 59 60 62 63 68 69 70 71
## [16] 72 159 245 258 272 273 274 275 276 282 286 287 288 289 294
## [31] 295 296 297 298 299 391 392 393 399 400 401 402 403 404 407
## [46] 409 410 411 412 413 415 416 417 418 419 420 421 422 423 424
## [61] 425 426 427 428 429 434 435 524 614 627 637 638 639 640 649
## [76] 650 724 736 749 820 832 844 853 854 855 856 857 860 861 862
## [91] 863 864 865 866 868 881 882 964 976 977 979 980 982 996 997
## [106] 998 999 1032 1044 1062 1072 1082 1083 1093 1168 1180 1191 1192 1193 1194
## [121] 1210 1211 1223 1328 1340 1341 1342 1345 1346 1347 1348 1349 1350 1351 1369
## [136] 1393 1394 1397 1479 1584 1588 1589 1590 1591 1597 1603 1604 1605 1606 1698
## [151] 1701 1702 1710 1712 1790 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819
## [166] 1822 1823 1824 1825 1826 1828 1829 1830 1842 1854 1869 1882 1895 1896 1897
## [181] 1898 1900 1904 1905 1907 1908 2012 2015 2016 2017 2018 2019 2022 2110 2136
## [196] 2137 2166 2177 2191 2208 2219 2220 2221 2222 2223 2228 2232 2233 2239 2240
## [211] 2241 2242 2264 2276
##
## $TEAM_PITCHING_HR
## [1] 832 964 1810 1882
##
## $TEAM_PITCHING_BB
## [1] 1 19 53 68 207 245 258 269 272 282 285 286 295 298 299
## [16] 341 342 344 395 396 409 415 417 512 627 648 724 749 820 853
## [31] 860 861 862 865 919 921 951 964 996 997 998 999 1083 1089 1191
## [46] 1203 1210 1211 1233 1340 1342 1345 1349 1350 1397 1425 1426 1428 1468 1479
## [61] 1534 1572 1584 1585 1710 1810 1811 1812 1813 1823 1824 1828 1900 1986 2015
## [76] 2016 2026 2110 2123 2136 2137 2177 2219 2226 2227 2228 2233 2234 2239 2253
##
## $TEAM_PITCHING_SO
## [1] 1 282 294 298 299 393 394 415 512 524 736 860 861 868 882
## [16] 964 984 998 999 1082 1211 1340 1341 1342 1345 1349 1350 1369 1381 1698
## [31] 1801 1812 1813 1814 1823 1824 1826 1854 2015 2016 2136 2177 2233 2239 2276
##
## $TEAM_FIELDING_E
## [1] 1 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## [16] 68 69 70 71 171 269 272 273 274 275 276 277 278 279 280
## [31] 281 282 283 284 285 294 295 296 297 298 299 391 392 393 394
## [46] 395 396 399 400 401 402 403 404 405 406 407 408 409 410 411
## [61] 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426
## [76] 427 428 429 430 431 432 433 434 435 436 437 438 439 539 637
## [91] 638 639 640 641 642 643 644 645 646 647 649 650 749 844 845
## [106] 846 847 848 849 850 851 852 853 855 860 861 862 863 864 865
## [121] 866 881 882 883 976 977 978 979 980 981 982 996 997 998 999
## [136] 1044 1045 1046 1047 1048 1049 1050 1082 1083 1084 1085 1086 1087 1088 1090
## [151] 1092 1093 1096 1102 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201
## [166] 1202 1204 1205 1206 1207 1210 1211 1340 1341 1342 1345 1346 1347 1348 1349
## [181] 1350 1351 1393 1394 1395 1396 1397 1584 1585 1588 1589 1590 1591 1592 1593
## [196] 1594 1595 1596 1597 1598 1599 1600 1601 1698 1699 1700 1701 1702 1703 1704
## [211] 1705 1706 1707 1708 1709 1710 1711 1810 1811 1812 1813 1814 1815 1816 1817
## [226] 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1895 1896
## [241] 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911
## [256] 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026
## [271] 2027 2028 2029 2031 2136 2137 2138 2190 2191 2219 2220 2221 2222 2223 2224
## [286] 2225 2226 2227 2228 2229 2230 2232 2233 2234 2235 2236 2237 2238 2239 2240
## [301] 2241 2242 2276
##
## $TEAM_FIELDING_DP
## [1] 53 58 87 174 175 274 303 304 305 307 309 312 339 342 368
## [16] 402 409 418 422 442 447 453 542 543 545 637 656 660 677 680
## [31] 753 754 762 795 886 889 922 976 996 997 1078 1100 1103 1104 1145
## [46] 1234 1247 1249 1250 1251 1341 1397 1402 1403 1404 1407 1431 1438 1441 1443
## [61] 1445 1493 1494 1495 1498 1499 1534 1535 1536 1588 1610 1612 1613 1615 1621
## [76] 1701 1722 1723 1777 1815 1829 1830 1917 1918 1921 2017 2034 2038 2039 2042
## [91] 2137 2194
# visualize
cols_to_plot <- c("TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E")
par(mfrow = c(2, length(cols_to_plot))) # 2 rows: before & after
# Boxplots before capping
for(col in cols_to_plot){
boxplot(mb_training[[col]], main = paste(col, "Before Capping"), col = "lightblue")
}
Applying the 1.5×IQR cap and comparing the before-and-after boxplots, the distributions look much cleaner.
# Function to cap outliers
cap_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
mb_training_capped <- mb_training
numeric_cols <- sapply(mb_training_capped, is.numeric)
mb_training_capped[numeric_cols] <- lapply(mb_training[numeric_cols], cap_outliers)
# summary
summary(mb_training_capped)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 39.50 Min. :1152 Min. :110.5
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.89 Mean :1464 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :123.50 Max. :1769 Max. :370.5
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. :257.5 Min. : 4.375
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.750
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.000
## Mean : 54.95 Mean : 99.61 Mean :506.6 Mean : 736.289
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.000
## Max. :129.00 Max. :264.00 Max. :773.5 Max. :1399.000
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :28.62 Min. :1137 Min. : 0.0
## 1st Qu.: 67.0 1st Qu.:44.00 1st Qu.:1419 1st Qu.: 50.0
## Median :101.0 Median :49.00 Median :1518 Median :107.0
## Mean :117.7 Mean :49.59 Mean :1583 Mean :105.7
## 3rd Qu.:151.0 3rd Qu.:54.25 3rd Qu.:1682 3rd Qu.:150.0
## Max. :277.0 Max. :69.62 Max. :2078 Max. :300.0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. :273.5 Min. : 129.5 Min. : 65.0 Min. : 93.12
## 1st Qu.:476.0 1st Qu.: 626.0 1st Qu.:127.0 1st Qu.:134.00
## Median :536.5 Median : 813.5 Median :159.0 Median :149.00
## Mean :546.4 Mean : 799.8 Mean :205.9 Mean :146.98
## 3rd Qu.:611.0 3rd Qu.: 957.0 3rd Qu.:249.2 3rd Qu.:161.25
## Max. :813.5 Max. :1453.5 Max. :432.6 Max. :202.12
# Visualize the same columns after capping
cols_to_plot <- c("TEAM_PITCHING_H", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E")
par(mfrow = c(2, length(cols_to_plot)))
# Boxplots after capping
for(col in cols_to_plot){
boxplot(mb_training_capped[[col]], main = paste(col, "After Capping"), col = "lightgreen")
}
Next, we impute the remaining sparse fields, add interpretable engineered features, log-transform heavily right-skewed count variables with log1p, and winsorize outliers (extreme season statistics) so that a handful of unusual seasons cannot dominate the fit and hurt accuracy.
safe_div <- function(num, den, default = 0) ifelse(den > 0, num/den, default)
build_prep_params <- function(train_df, target = "TARGET_WINS", id = "INDEX",
winsor_probs = c(.01, .99)) {
# columns with any NA in training
na_cols <- names(which(colSums(is.na(train_df)) > 0))
medians <- train_df %>%
summarise(across(all_of(na_cols), ~median(.x, na.rm = TRUE)))
# build a temporary transformed training set to compute winsor caps on
tmp <- apply_prep(train_df, list(medians = medians, caps = NULL,
na_cols = na_cols, high_err_cut = NULL),
target = target, id = id, do_caps = FALSE)
num_pred <- tmp %>%
dplyr::select(where(is.numeric), -all_of(c(target))) %>% names()  # dplyr:: avoids the MASS::select masking
caps <- list(
lower = sapply(tmp[num_pred], quantile, probs = winsor_probs[1], na.rm = TRUE),
upper = sapply(tmp[num_pred], quantile, probs = winsor_probs[2], na.rm = TRUE)
)
# example bucket/flag threshold (very high errors)
high_err_cut <- quantile(tmp$TEAM_FIELDING_E, 0.90, na.rm = TRUE)
list(medians = medians, caps = caps, na_cols = na_cols, high_err_cut = high_err_cut)
}
apply_prep <- function(df, params, target = "TARGET_WINS", id = "INDEX",
do_caps = TRUE) {
medians <- params$medians
na_cols <- params$na_cols
out <- df %>%
select(-all_of(id)) %>%
# missingness flags before imputation
mutate(across(all_of(na_cols), ~as.integer(is.na(.x)),
.names = "{.col}_MISSING")) %>%
# impute with training medians
mutate(across(all_of(names(medians)), ~replace_na(.x, medians[[cur_column()]]))) %>%
# IMPORTANT: feature engineering
mutate(
SINGLES = pmax(0, TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR),
XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR,
FREE_BASES = TEAM_BATTING_BB + coalesce(TEAM_BATTING_HBP, 0),
SB_ATT = TEAM_BASERUN_SB + coalesce(TEAM_BASERUN_CS, 0),
SB_PCT = safe_div(TEAM_BASERUN_SB, SB_ATT, 0),
K_BB = safe_div(TEAM_PITCHING_SO, TEAM_PITCHING_BB, 0),
HR_RATE_ALLOWED = safe_div(TEAM_PITCHING_HR, TEAM_PITCHING_H, 0),
NET_HR = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_SO = TEAM_PITCHING_SO - TEAM_BATTING_SO,
NET_BB = TEAM_BATTING_BB - TEAM_PITCHING_BB,
DP_PER_ERROR = TEAM_FIELDING_DP / (TEAM_FIELDING_E + 1)
) %>%
# --- light transforms for heavy right-skew (zero-safe) ---
mutate(across(c(TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO,
TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_SO, TEAM_BATTING_HR,
SINGLES, XBH, FREE_BASES, SB_ATT),
log1p, .names = "{.col}")) %>% # overwrite with log1p versions
# --- an example simple bucket flag (kept numeric for modeling) ---
mutate(HIGH_ERRORS_FLAG = as.integer(TEAM_FIELDING_E > params$high_err_cut))
# winsorize using training caps (keeps extreme seasons from dominating)
if (do_caps && !is.null(params$caps)) {
for (nm in intersect(names(out), names(params$caps$lower))) {
out[[nm]] <- pmin(pmax(out[[nm]], params$caps$lower[[nm]]), params$caps$upper[[nm]])
}
}
out
}
The result is a data-prep recipe whose imputation values and cutoffs are learned from the training data only. It cleans the data and enriches it with engineered features, log-transforms the skewed counts, and winsorizes extremes, all without leaking information from the evaluation set.
Next we define the final, simpler version of the recipe functions (these supersede the definitions above): build the recipe on the training set, then apply it to both data sets.
winsorize_iqr_vec <- function(x, q = 1.5) {
qs <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- qs[2] - qs[1]
lo <- qs[1] - q * iqr
hi <- qs[2] + q * iqr
x <- pmin(pmax(x, lo), hi)
x
}
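A quick sanity check of the helper on a toy vector (values are purely illustrative):
# 200 lies far above Q3 + 1.5*IQR, so it is pulled down to the upper fence (12); the rest are unchanged
winsorize_iqr_vec(c(5, 7, 8, 9, 200))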
build_prep_params <- function(df, id = "INDEX", target = "TARGET_WINS") {
# Columns to impute; their medians are learned from the training data
imp_cols <- intersect(c("TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS",
"TEAM_PITCHING_SO","TEAM_FIELDING_DP","TEAM_BATTING_HBP"),
names(df))
medians <- vapply(df[imp_cols], function(x) stats::median(x, na.rm = TRUE), numeric(1))
# which columns get missingness flags
missing_flags <- intersect(c("TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS",
"TEAM_PITCHING_SO","TEAM_FIELDING_DP","TEAM_BATTING_HBP"),
names(df))
# Columns to winsorize and log-transform
wins_cols <- intersect(c("TEAM_PITCHING_H","TEAM_PITCHING_BB",
"TEAM_PITCHING_SO","TEAM_FIELDING_E"),
names(df))
trans_cols <- wins_cols
# Return the prep recipe as a list
list(
id = id,
target = target,
medians = medians,
wins_cols = wins_cols,
trans_cols = trans_cols,
missing_flags = missing_flags
)
}
apply_prep <- function(df, p){
if (!"TEAM_BATTING_HBP" %in% names(df)) df$TEAM_BATTING_HBP <- NA_real_
out <- df %>%
# flags FIRST to preserve original NA info
dplyr::mutate(dplyr::across(tidyselect::any_of(p$missing_flags),
~ as.integer(is.na(.x)), .names = "{.col}_missing")) %>%
dplyr::mutate(FREE_BASES = TEAM_BATTING_BB + dplyr::coalesce(TEAM_BATTING_HBP, 0)) %>%
dplyr::select(-TEAM_BATTING_HBP) %>%
# impute NAs with training medians
dplyr::mutate(dplyr::across(tidyselect::any_of(names(p$medians)),
~ ifelse(is.na(.x), p$medians[cur_column()], .x))) %>%
# engineered features
dplyr::mutate(
SINGLES = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR,
XBH = TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR,
SB_ATT = TEAM_BASERUN_SB + dplyr::coalesce(TEAM_BASERUN_CS, 0),
SB_PCT = dplyr::if_else(SB_ATT > 0, TEAM_BASERUN_SB / SB_ATT, 0),
K_BB = dplyr::if_else(TEAM_PITCHING_BB > 0, TEAM_PITCHING_SO / TEAM_PITCHING_BB, 0),
HR_RATE_ALLOWED = dplyr::if_else(TEAM_PITCHING_H > 0, TEAM_PITCHING_HR / TEAM_PITCHING_H, 0),
NET_HR = TEAM_BATTING_HR - TEAM_PITCHING_HR,
NET_SO = TEAM_PITCHING_SO - TEAM_BATTING_SO,
NET_BB = TEAM_BATTING_BB - TEAM_PITCHING_BB,
DP_PER_ERROR = TEAM_FIELDING_DP / (TEAM_FIELDING_E + 1)
) %>%
# winsorize/transform
dplyr::mutate(dplyr::across(tidyselect::any_of(p$wins_cols), winsorize_iqr_vec),
dplyr::across(tidyselect::any_of(p$trans_cols), log1p)) %>%
dplyr::select(-tidyselect::any_of(p$id))
out
}
Now we build the prep recipe on the training data and apply it to both the training and evaluation sets.
tr_raw <- mb_training
ev_raw <- mb_evaluation
prep_params <- build_prep_params(tr_raw, id = "INDEX", target = "TARGET_WINS")
mb_train_prepared <- apply_prep(tr_raw, prep_params)
mb_eval_prepared <- apply_prep(ev_raw, prep_params)
dim(mb_train_prepared); dim(mb_eval_prepared)
## [1] 2276 31
## [1] 259 30
colSums(is.na(mb_train_prepared))
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 0 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0 0
## TEAM_BATTING_SO_missing TEAM_BASERUN_SB_missing TEAM_BASERUN_CS_missing
## 0 0 0
## TEAM_PITCHING_SO_missing TEAM_FIELDING_DP_missing FREE_BASES
## 0 0 0
## SINGLES XBH SB_ATT
## 0 0 0
## SB_PCT K_BB HR_RATE_ALLOWED
## 0 0 0
## NET_HR NET_SO NET_BB
## 0 0 0
## DP_PER_ERROR
## 0
colSums(is.na(mb_eval_prepared))
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0 0 0
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0 0 0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## 0 0 0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 0 0 0
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_SO_missing
## 0 0 0
## TEAM_BASERUN_SB_missing TEAM_BASERUN_CS_missing TEAM_PITCHING_SO_missing
## 0 0 0
## TEAM_FIELDING_DP_missing FREE_BASES SINGLES
## 0 0 0
## XBH SB_ATT SB_PCT
## 0 0 0
## K_BB HR_RATE_ALLOWED NET_HR
## 0 0 0
## NET_SO NET_BB DP_PER_ERROR
## 0 0 0
As a sanity check, we verify that setdiff() returns only TARGET_WINS (the one column the evaluation set lacks) and that no NA values remain in either prepared data set.
setdiff(names(mb_train_prepared), names(mb_eval_prepared))
## [1] "TARGET_WINS"
setdiff(names(mb_eval_prepared), names(mb_train_prepared))
## character(0)
c(train_any_NA = anyNA(mb_train_prepared),
eval_any_NA = anyNA(mb_eval_prepared))
## train_any_NA eval_any_NA
## FALSE FALSE
train_num <- dplyr::select(mb_train_prepared, where(is.numeric))
eval_num <- dplyr::select(mb_eval_prepared, where(is.numeric))
c(train_any_inf = any(is.infinite(as.matrix(train_num))),
eval_any_inf = any(is.infinite(as.matrix(eval_num))))
## train_any_inf eval_any_inf
## FALSE FALSE
With a clean and enriched dataset, we constructed four different multiple linear regression models. Our approach was to start with a simple baseline and progressively test more complex hypotheses about what drives team wins.
## BUILD MODELS - SECTION 3
# Model 1: Traditional Baseball Statistics (Basic Model)
# Focus on core batting, pitching, and fielding metrics
model1 <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + # Offense
TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E, # Defense
data = mb_train_prepared)
# Model 2: Engineered Features Model (Advanced Metrics)
# Uses the composite variables we created for more baseball-specific insights
model2 <- lm(TARGET_WINS ~
SINGLES + XBH + FREE_BASES + # Offensive production
HR_RATE_ALLOWED + K_BB + DP_PER_ERROR + # Pitching/fielding efficiency
SB_PCT + NET_BB, # Base running and discipline
data = mb_train_prepared)
# Model 3: Comprehensive Model with Interactions
# Includes most variables plus some key interactions that make baseball sense
model3 <- lm(TARGET_WINS ~
TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E +
TEAM_BASERUN_SB + TEAM_PITCHING_SO +
# Interactions that make theoretical sense:
TEAM_BATTING_H:TEAM_BATTING_HR + # Hit productivity
TEAM_PITCHING_H:TEAM_PITCHING_HR + # Pitching vulnerability
TEAM_BATTING_BB:TEAM_PITCHING_BB + # Discipline differential
TEAM_FIELDING_E:TEAM_BASERUN_SB, # Errors vs aggressive running
data = mb_train_prepared)
# Model 4: Parsimonious Model (Variable Selection using Stepwise)
# Let R help select the most impactful variables automatically
full_model <- lm(TARGET_WINS ~ ., data = mb_train_prepared)
model4 <- step(full_model, direction = "both", trace = 0) # Stepwise selection
# Display model summaries
cat("=== MODEL 1: Traditional Statistics ===\n")
## === MODEL 1: Traditional Statistics ===
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E,
## data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -8.834 0.169 9.042 49.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.234006 24.529242 0.417 0.676559
## TEAM_BATTING_H 0.046456 0.002786 16.673 < 2e-16 ***
## TEAM_BATTING_HR 0.008399 0.025297 0.332 0.739903
## TEAM_BATTING_BB 0.027377 0.002862 9.565 < 2e-16 ***
## TEAM_PITCHING_H 1.699264 3.891374 0.437 0.662389
## TEAM_PITCHING_HR -0.016747 0.023161 -0.723 0.469713
## TEAM_FIELDING_E -4.413656 1.191548 -3.704 0.000217 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.86 on 2269 degrees of freedom
## Multiple R-squared: 0.2277, Adjusted R-squared: 0.2257
## F-statistic: 111.5 on 6 and 2269 DF, p-value: < 2.2e-16
cat("\n=== MODEL 2: Engineered Features ===\n")
##
## === MODEL 2: Engineered Features ===
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ SINGLES + XBH + FREE_BASES + HR_RATE_ALLOWED +
## K_BB + DP_PER_ERROR + SB_PCT + NET_BB, data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.680 -9.115 0.202 8.982 65.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.568871 5.285908 -3.135 0.00174 **
## SINGLES 0.044029 0.003638 12.103 < 2e-16 ***
## XBH 0.050577 0.005942 8.512 < 2e-16 ***
## FREE_BASES 0.026595 0.003741 7.110 1.55e-12 ***
## HR_RATE_ALLOWED 9.403030 16.038247 0.586 0.55774
## K_BB 0.477563 0.687961 0.694 0.48765
## DP_PER_ERROR 1.862919 1.137729 1.637 0.10169
## SB_PCT 21.098956 2.726786 7.738 1.51e-14 ***
## NET_BB 0.005693 0.002052 2.774 0.00558 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.7 on 2267 degrees of freedom
## Multiple R-squared: 0.2464, Adjusted R-squared: 0.2437
## F-statistic: 92.65 on 8 and 2267 DF, p-value: < 2.2e-16
cat("\n=== MODEL 3: Comprehensive with Interactions ===\n")
##
## === MODEL 3: Comprehensive with Interactions ===
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E +
## TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_BATTING_H:TEAM_BATTING_HR +
## TEAM_PITCHING_H:TEAM_PITCHING_HR + TEAM_BATTING_BB:TEAM_PITCHING_BB +
## TEAM_FIELDING_E:TEAM_BASERUN_SB, data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.249 -8.868 -0.161 8.646 52.266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.524e+01 5.209e+01 -0.869 0.385131
## TEAM_BATTING_H 4.991e-02 4.466e-03 11.174 < 2e-16 ***
## TEAM_BATTING_HR 1.623e-01 1.315e-01 1.235 0.216956
## TEAM_BATTING_BB 6.294e-02 6.088e-02 1.034 0.301355
## TEAM_PITCHING_H 6.704e+00 7.159e+00 0.936 0.349138
## TEAM_PITCHING_HR -1.323e-01 5.975e-01 -0.221 0.824830
## TEAM_FIELDING_E -3.670e+00 1.924e+00 -1.907 0.056594 .
## TEAM_BASERUN_SB 2.118e-01 5.001e-02 4.236 2.36e-05 ***
## TEAM_PITCHING_SO 4.917e-01 1.206e+00 0.408 0.683572
## TEAM_BATTING_H:TEAM_BATTING_HR -8.527e-05 6.050e-05 -1.409 0.158877
## TEAM_PITCHING_H:TEAM_PITCHING_HR 1.345e-02 7.528e-02 0.179 0.858265
## TEAM_BATTING_BB:TEAM_PITCHING_BB -5.759e-03 8.512e-03 -0.677 0.498729
## TEAM_FIELDING_E:TEAM_BASERUN_SB -3.063e-02 8.785e-03 -3.487 0.000498 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.57 on 2263 degrees of freedom
## Multiple R-squared: 0.2623, Adjusted R-squared: 0.2584
## F-statistic: 67.05 on 12 and 2263 DF, p-value: < 2.2e-16
cat("\n=== MODEL 4: Stepwise Selection ===\n")
##
## === MODEL 4: Stepwise Selection ===
summary(model4)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP + SB_PCT + K_BB + HR_RATE_ALLOWED +
## NET_SO + NET_BB, data = mb_train_prepared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.610 -8.073 -0.053 7.627 58.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.319319 27.085683 1.452 0.146734
## TEAM_BATTING_H 0.050734 0.003988 12.722 < 2e-16 ***
## TEAM_BATTING_2B -0.028058 0.009458 -2.967 0.003042 **
## TEAM_BATTING_3B 0.109077 0.017049 6.398 1.91e-10 ***
## TEAM_BATTING_HR -0.215950 0.058028 -3.721 0.000203 ***
## TEAM_BATTING_BB 0.069076 0.008316 8.306 < 2e-16 ***
## TEAM_BATTING_SO -0.020975 0.004266 -4.917 9.43e-07 ***
## TEAM_BASERUN_SB 0.022301 0.005892 3.785 0.000158 ***
## TEAM_PITCHING_H 19.283709 4.623214 4.171 3.15e-05 ***
## TEAM_PITCHING_BB -29.997712 4.547823 -6.596 5.24e-11 ***
## TEAM_PITCHING_SO 12.341996 2.607991 4.732 2.36e-06 ***
## TEAM_FIELDING_E -14.998163 1.332353 -11.257 < 2e-16 ***
## TEAM_FIELDING_DP -0.135404 0.013106 -10.331 < 2e-16 ***
## SB_PCT 12.553555 3.994507 3.143 0.001696 **
## K_BB -2.635252 1.319341 -1.997 0.045902 *
## HR_RATE_ALLOWED 374.999970 85.226561 4.400 1.13e-05 ***
## NET_SO 0.005329 0.001015 5.249 1.67e-07 ***
## NET_BB 0.006406 0.003925 1.632 0.102825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.84 on 2258 degrees of freedom
## Multiple R-squared: 0.341, Adjusted R-squared: 0.336
## F-statistic: 68.72 on 17 and 2258 DF, p-value: < 2.2e-16
# Function to extract key coefficient insights
analyze_coefficients <- function(model, model_name) {
cat("\n", paste("====", model_name, "Coefficient Analysis ===="), "\n")
coef_df <- as.data.frame(summary(model)$coefficients)
coef_df <- coef_df[-1, ] # Remove intercept
coef_df <- coef_df[order(abs(coef_df$Estimate), decreasing = TRUE), ]
# Highlight counter-intuitive coefficients
counter_intuitive <- coef_df[coef_df$Estimate < 0 &
!grepl("PITCHING|FIELDING_E|BATTING_SO|BASERUN_CS", rownames(coef_df)), ]
if(nrow(counter_intuitive) > 0) {
cat("POTENTIAL COUNTER-INTUITIVE COEFFICIENTS (negative but should be positive):\n")
print(counter_intuitive[, c("Estimate", "Pr(>|t|)")])
} else {
cat("All coefficients align with theoretical expectations.\n")
}
# Top 5 most impactful variables
cat("\nTOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):\n")
print(head(coef_df[, c("Estimate", "Pr(>|t|)")], 5))
}
# Analyze each model
analyze_coefficients(model1, "MODEL 1: Traditional")
##
## ==== MODEL 1: Traditional Coefficient Analysis ====
## All coefficients align with theoretical expectations.
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## TEAM_FIELDING_E -4.41365637 2.171781e-04
## TEAM_PITCHING_H 1.69926391 6.623889e-01
## TEAM_BATTING_H 0.04645571 5.732837e-59
## TEAM_BATTING_BB 0.02737706 2.814836e-21
## TEAM_PITCHING_HR -0.01674690 4.697127e-01
analyze_coefficients(model2, "MODEL 2: Engineered Features")
##
## ==== MODEL 2: Engineered Features Coefficient Analysis ====
## All coefficients align with theoretical expectations.
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## SB_PCT 21.09895583 1.512744e-14
## HR_RATE_ALLOWED 9.40303048 5.577404e-01
## DP_PER_ERROR 1.86291853 1.016855e-01
## K_BB 0.47756291 4.876461e-01
## XBH 0.05057736 3.072995e-17
analyze_coefficients(model3, "MODEL 3: Comprehensive")
##
## ==== MODEL 3: Comprehensive Coefficient Analysis ====
## POTENTIAL COUNTER-INTUITIVE COEFFICIENTS (negative but should be positive):
## Estimate Pr(>|t|)
## TEAM_BATTING_H:TEAM_BATTING_HR -8.526705e-05 0.1588775
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## TEAM_PITCHING_H 6.7039785 0.3491384292
## TEAM_FIELDING_E -3.6703168 0.0565940695
## TEAM_PITCHING_SO 0.4917158 0.6835718479
## TEAM_BASERUN_SB 0.2118431 0.0000236405
## TEAM_BATTING_HR 0.1623481 0.2169557527
analyze_coefficients(model4, "MODEL 4: Stepwise")
##
## ==== MODEL 4: Stepwise Coefficient Analysis ====
## POTENTIAL COUNTER-INTUITIVE COEFFICIENTS (negative but should be positive):
## Estimate Pr(>|t|)
## K_BB -2.63525186 4.590157e-02
## TEAM_BATTING_HR -0.21595012 2.029069e-04
## TEAM_FIELDING_DP -0.13540379 1.768399e-24
## TEAM_BATTING_2B -0.02805799 3.041682e-03
##
## TOP 5 MOST IMPACTFUL VARIABLES (by coefficient magnitude):
## Estimate Pr(>|t|)
## HR_RATE_ALLOWED 374.99997 1.132659e-05
## TEAM_PITCHING_BB -29.99771 5.242284e-11
## TEAM_PITCHING_H 19.28371 3.146453e-05
## TEAM_FIELDING_E -14.99816 1.221650e-28
## SB_PCT 12.55355 1.695668e-03
# Check for multicollinearity
cat("\n=== MULTICOLLINEARITY CHECK (VIF) ===\n")
##
## === MULTICOLLINEARITY CHECK (VIF) ===
cat("Model 1 VIF:\n")
## Model 1 VIF:
print(vif(model1))
## TEAM_BATTING_H TEAM_BATTING_HR TEAM_BATTING_BB TEAM_PITCHING_H
## 1.921821 27.778299 1.459812 3.346197
## TEAM_PITCHING_HR TEAM_FIELDING_E
## 23.867422 3.792397
cat("\nModel 2 VIF:\n")
##
## Model 2 VIF:
print(vif(model2))
## SINGLES XBH FREE_BASES HR_RATE_ALLOWED K_BB
## 2.666819 2.856819 2.552833 5.338461 2.129716
## DP_PER_ERROR SB_PCT NET_BB
## 3.148032 1.197411 1.161732
cat("\nModel 3 VIF:\n")
##
## Model 3 VIF:
print(vif(model3))
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## TEAM_BATTING_H TEAM_BATTING_HR
## 5.155971 783.171135
## TEAM_BATTING_BB TEAM_PITCHING_H
## 689.520454 11.823970
## TEAM_PITCHING_HR TEAM_FIELDING_E
## 16585.349178 10.326119
## TEAM_BASERUN_SB TEAM_PITCHING_SO
## 225.507497 2.159819
## TEAM_BATTING_H:TEAM_BATTING_HR TEAM_PITCHING_H:TEAM_PITCHING_HR
## 369.286690 14216.671830
## TEAM_BATTING_BB:TEAM_PITCHING_BB TEAM_FIELDING_E:TEAM_BASERUN_SB
## 637.652929 258.979229
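The warning above notes that term-level VIFs are not meaningful when interaction terms are present and suggests type = 'predictor'. A hedged follow-up sketch (assuming the installed car version supports the type argument, as the warning implies):
# Re-check Model 3 collinearity at the predictor level rather than per model term
print(vif(model3, type = "predictor"))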
The four models were evaluated based on their predictive accuracy, interpretability, and statistical robustness.
Based on the in-sample performance metrics below, Model 4 is the clear winner. It has the lowest Root Mean Square Error (RMSE), indicating its predictions are, on average, closest to the actual number of wins. Its R-squared of 0.341 (adjusted R-squared 0.336) shows that it explains about 34% of the variability in team wins.
Model 2, meanwhile, had logically sound coefficients and low multicollinearity.
# Function to calculate model performance metrics
calculate_metrics <- function(model, data) {
predictions <- predict(model, newdata = data)
actual <- data$TARGET_WINS
residuals <- actual - predictions
mse <- mean(residuals^2)
rmse <- sqrt(mse)
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
return(c(
RMSE = rmse,
MSE = mse,
R2 = r_squared,
Adj_R2 = adj_r_squared,
AIC = AIC(model),
BIC = BIC(model)
))
}
# Compare all models
model_comparison <- data.frame(
Model1 = calculate_metrics(model1, mb_train_prepared),
Model2 = calculate_metrics(model2, mb_train_prepared),
Model3 = calculate_metrics(model3, mb_train_prepared),
Model4 = calculate_metrics(model4, mb_train_prepared)
)
cat("=== MODEL PERFORMANCE COMPARISON ===\n")
## === MODEL PERFORMANCE COMPARISON ===
print(round(model_comparison, 4))
## Model1 Model2 Model3 Model4
## RMSE 13.8397 13.6715 13.5266 12.7850
## MSE 191.5368 186.9097 182.9680 163.4565
## R2 0.2277 0.2464 0.2623 0.3410
## Adj_R2 0.2257 0.2437 0.2584 0.3360
## AIC 18435.5697 18383.9118 18343.4007 18096.7489
## BIC 18481.4111 18441.2135 18423.6231 18205.6222
The Residuals vs Fitted plots look reasonably solid for all four models: the points scatter around the zero line with no pronounced curvature or funnel shape, so the linearity and constant-variance assumptions appear acceptable. The Shapiro-Wilk p-values shown below, however, are all far below the standard 0.05 threshold, giving strong statistical evidence that the residuals of all four models are not normally distributed.
# Residual plots for each model
par(mfrow = c(2, 2))
plot(model1, which = 1, main = "Model 1: Residuals vs Fitted")
plot(model2, which = 1, main = "Model 2: Residuals vs Fitted")
plot(model3, which = 1, main = "Model 3: Residuals vs Fitted")
plot(model4, which = 1, main = "Model 4: Residuals vs Fitted")
# Reset plotting parameters
par(mfrow = c(1, 1))
# Normality check
cat("\n=== NORMALITY TESTS (Shapiro-Wilk) ===\n")
##
## === NORMALITY TESTS (Shapiro-Wilk) ===
cat("Model 1 residuals normality p-value:", shapiro.test(residuals(model1))$p.value, "\n")
## Model 1 residuals normality p-value: 2.332624e-07
cat("Model 2 residuals normality p-value:", shapiro.test(residuals(model2))$p.value, "\n")
## Model 2 residuals normality p-value: 2.449603e-08
cat("Model 3 residuals normality p-value:", shapiro.test(residuals(model3))$p.value, "\n")
## Model 3 residuals normality p-value: 3.61602e-07
cat("Model 4 residuals normality p-value:", shapiro.test(residuals(model4))$p.value, "\n")
## Model 4 residuals normality p-value: 3.818181e-12
We now use our chosen model (Model 2, selected for its interpretability and theoretically sound coefficients) to generate final predictions for the evaluation teams.
final_predictions <- predict(model2, newdata = mb_eval_prepared)
evaluation_results <- data.frame(
INDEX = mb_evaluation$INDEX,
PREDICTED_WINS = final_predictions
)
evaluation_results <- evaluation_results %>%
mutate(PREDICTED_WINS = round(PREDICTED_WINS, 0))
head(evaluation_results)
## INDEX PREDICTED_WINS
## 1 9 66
## 2 10 69
## 3 14 75
## 4 47 88
## 5 60 67
## 6 63 68
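If the predictions need to be shared or submitted, a minimal sketch of exporting them (the file name moneyball_predictions.csv is an assumption, not part of the assignment):
# Write the INDEX / PREDICTED_WINS table to a CSV for submission or later review
write.csv(evaluation_results, "moneyball_predictions.csv", row.names = FALSE)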
The top-performing model is Model 4, which has the best predictive power as indicated by the lowest Root Mean Square Error (RMSE = 12.79) and the highest R-squared (0.341). However, it contains several counter-intuitive coefficients, such as a negative coefficient for TEAM_BATTING_HR (-0.216) and a positive coefficient for TEAM_PITCHING_H (+19.28), which likely reflect multicollinearity.
For projects prioritizing interpretability, Model 2 (Engineered Features) is the recommended choice. Although its performance metrics are weaker (RMSE = 13.67, R-squared = 0.246), all of its coefficients are theoretically sound and its multicollinearity is low (all VIFs below 6). It also highlights the value of engineered features such as SB_PCT, whose coefficient of about +21 on a 0-1 scale translates to roughly +2.1 wins for every 10-percentage-point improvement in stolen-base success rate. Model 3 is the second-best performer (RMSE = 13.53, R-squared = 0.262) but should be avoided due to severe multicollinearity (VIFs as high as roughly 16,585).