install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
train_url <- "https://raw.githubusercontent.com/bb2955/Data-621/main/moneyball-training-data%20%281%29.csv"
eval_url <- "https://raw.githubusercontent.com/bb2955/Data-621/main/moneyball-evaluation-data%20%282%29.csv"
train <- read.csv(train_url)
eval <- read.csv(eval_url)
dim(train)
## [1] 2276 17
summary(train)
## INDEX             TARGET_WINS       TEAM_BATTING_H    TEAM_BATTING_2B
## Min.   :   1.0    Min.   :  0.00    Min.   : 891      Min.   : 69.0
## 1st Qu.: 630.8    1st Qu.: 71.00    1st Qu.:1383      1st Qu.:208.0
## Median :1270.5    Median : 82.00    Median :1454      Median :238.0
## Mean   :1268.5    Mean   : 80.79    Mean   :1469      Mean   :241.2
## 3rd Qu.:1915.5    3rd Qu.: 92.00    3rd Qu.:1537      3rd Qu.:273.0
## Max.   :2535.0    Max.   :146.00    Max.   :2554      Max.   :458.0
##
## TEAM_BATTING_3B   TEAM_BATTING_HR   TEAM_BATTING_BB   TEAM_BATTING_SO
## Min.   :  0.00    Min.   :  0.00    Min.   :  0.0     Min.   :   0.0
## 1st Qu.: 34.00    1st Qu.: 42.00    1st Qu.:451.0     1st Qu.: 548.0
## Median : 47.00    Median :102.00    Median :512.0     Median : 750.0
## Mean   : 55.25    Mean   : 99.61    Mean   :501.6     Mean   : 735.6
## 3rd Qu.: 72.00    3rd Qu.:147.00    3rd Qu.:580.0     3rd Qu.: 930.0
## Max.   :223.00    Max.   :264.00    Max.   :878.0     Max.   :1399.0
##                                                       NA's   :102
## TEAM_BASERUN_SB   TEAM_BASERUN_CS   TEAM_BATTING_HBP  TEAM_PITCHING_H
## Min.   :  0.0     Min.   :  0.0     Min.   :29.00     Min.   : 1137
## 1st Qu.: 66.0     1st Qu.: 38.0     1st Qu.:50.50     1st Qu.: 1419
## Median :101.0     Median : 49.0     Median :58.00     Median : 1518
## Mean   :124.8     Mean   : 52.8     Mean   :59.36     Mean   : 1779
## 3rd Qu.:156.0     3rd Qu.: 62.0     3rd Qu.:67.00     3rd Qu.: 1682
## Max.   :697.0     Max.   :201.0     Max.   :95.00     Max.   :30132
## NA's   :131       NA's   :772       NA's   :2085
## TEAM_PITCHING_HR  TEAM_PITCHING_BB  TEAM_PITCHING_SO  TEAM_FIELDING_E
## Min.   :  0.0     Min.   :   0.0    Min.   :    0.0   Min.   :  65.0
## 1st Qu.: 50.0     1st Qu.: 476.0    1st Qu.:  615.0   1st Qu.: 127.0
## Median :107.0     Median : 536.5    Median :  813.5   Median : 159.0
## Mean   :105.7     Mean   : 553.0    Mean   :  817.7   Mean   : 246.5
## 3rd Qu.:150.0     3rd Qu.: 611.0    3rd Qu.:  968.0   3rd Qu.: 249.2
## Max.   :343.0     Max.   :3645.0    Max.   :19278.0   Max.   :1898.0
##                                     NA's   :102
## TEAM_FIELDING_DP
## Min.   : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean   :146.4
## 3rd Qu.:164.0
## Max.   :228.0
## NA's   :286
colSums(is.na(train))
##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B
##                0                0                0                0
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO
##                0                0                0              102
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H
##              131              772             2085                0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E
##                0                0              102                0
## TEAM_FIELDING_DP
##              286
cor(train$TEAM_BATTING_HR, train$TARGET_WINS, use="complete.obs")
## [1] 0.1761532
cor(train$TEAM_PITCHING_H, train$TARGET_WINS, use="complete.obs")
## [1] -0.1099371
boxplot(train$TARGET_WINS)
Based on the summary, the Moneyball training dataset includes 2,276 team-season records and 17 variables, with TARGET_WINS as the outcome we are trying to predict. The average number of wins is 80.79 and the median is 82, which makes sense for a 162-game season, where a .500 record is 81 wins. The center of the distribution therefore looks realistic, although the minimum of 0 wins and the maximum of 146 wins are unlikely to be genuine season totals.
Some variables have very large ranges that stand out. For example, TEAM_PITCHING_H has a maximum value of 30,132 even though the mean is only 1,779 and the third quartile is 1,682. Similarly, TEAM_PITCHING_SO reaches 19,278 while the mean is about 818. These values are unusually high and may come from early seasons or data entry issues. If they are left as is, they could strongly affect the regression results.
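One way to put a number on "unusually high" is the 1.5 * IQR rule, the same rule boxplot() uses to draw its whiskers. The sketch below is illustrative only: count_outliers is a hypothetical helper, and toy is made-up data standing in for a column such as TEAM_PITCHING_H.

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences of a numeric vector.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Made-up stand-in for one column: typical values plus a single extreme one.
toy <- data.frame(hits_allowed = c(1400, 1450, 1500, 1520, 1550, 1600, 30000))

# Apply the helper to every column and report the count per column.
n_extreme <- sapply(toy, count_outliers)
print(n_extreme)
```

On the real data the same call would be sapply(train, count_outliers), which gives one count per variable and makes it easier to decide which columns need capping or transformation.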
There is also missing data in several variables. TEAM_BATTING_HBP is missing in 2,085 of the 2,276 rows (about 92%), which suggests it was not consistently recorded across all years. TEAM_BASERUN_CS (772 missing, about 34%) and TEAM_FIELDING_DP (286 missing, about 13%) also have a noticeable number of missing values, and TEAM_BASERUN_SB (131), TEAM_BATTING_SO (102), and TEAM_PITCHING_SO (102) have smaller gaps. These gaps will need to be handled before building the model.
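The raw counts from colSums(is.na(train)) are easier to compare as percentages. A minimal sketch, using a made-up data frame (toy and its column names are assumptions, not part of the Moneyball data):

```r
# Made-up stand-in for train with some missing values.
toy <- data.frame(
  wins = c(80, 75, 90, 66),
  hbp  = c(NA, NA, NA, 50),
  cs   = c(40, NA, 55, 60)
)

# Share of missing values per column, in percent, sorted from most to least missing.
pct_missing <- sort(round(100 * colMeans(is.na(toy)), 1), decreasing = TRUE)
print(pct_missing)
```

Replacing toy with train would rank the real variables the same way, which helps separate columns that can be imputed from columns that are mostly empty.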
Overall, most statistics look reasonable for baseball performance, but the extreme values and missing data will need to be addressed during the data preparation stage.
train[is.na(train)] <- median(train$TEAM_BATTING_H, na.rm=TRUE)
eval[is.na(eval)] <- median(train$TEAM_BATTING_H, na.rm=TRUE)
train <- train %>% select(-INDEX)
eval <- eval %>% select(-INDEX)
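First, several variables had missing values. Since lm() would otherwise drop every row that has a missing value in a model variable, the missing values were filled in before modeling. As written, the code above replaces every missing cell in both train and eval with a single number, the training median of TEAM_BATTING_H (1,454), rather than with the median of each variable. For columns such as TEAM_BASERUN_SB (median 101) and TEAM_PITCHING_SO (median 813.5), the filled value is therefore well above what is typical, and the Model 2 and Model 3 results below reflect that. Filling each column with its own training median would be the usual choice, because the median is less affected by the extreme values seen in some variables. TEAM_BATTING_HBP, which is missing in 2,085 of 2,276 rows, was filled in as well, but none of the three models uses it.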
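Per-variable median imputation can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the results in this report: impute_with_medians is a hypothetical helper, and toy_train and toy_eval are made-up data frames. Medians are taken from the training data and reused for the evaluation data so that both are filled consistently.

```r
# Hypothetical helper: fill NAs in `df` using column medians computed from `reference`.
impute_with_medians <- function(df, reference) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && col %in% names(reference)) {
      med <- median(reference[[col]], na.rm = TRUE)
      df[[col]][is.na(df[[col]])] <- med
    }
  }
  df
}

# Made-up stand-ins for train and eval.
toy_train <- data.frame(sb = c(50, 100, NA, 150), so = c(700, NA, 800, 900))
toy_eval  <- data.frame(sb = c(NA, 120), so = c(650, NA))

# Both data sets are filled with the training medians.
toy_train_filled <- impute_with_medians(toy_train, reference = toy_train)
toy_eval_filled  <- impute_with_medians(toy_eval,  reference = toy_train)
print(toy_train_filled)
print(toy_eval_filled)
```

Using the training medians for eval keeps information from the evaluation set out of the preparation step.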
The INDEX variable was removed because it is only an ID number and does not help predict wins.
No scaling was done because all variables are already baseball counts (like hits, home runs, and walks), which are easy to interpret in their original units.
After these steps, the dataset was complete and ready to be used in the regression models.
model1 <- lm(TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_PITCHING_H + TEAM_PITCHING_HR,
data=train)
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.615 -10.019 0.540 9.882 76.176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.8274888 1.7439993 38.319 < 2e-16 ***
## TEAM_BATTING_HR -0.1171131 0.0244067 -4.798 1.70e-06 ***
## TEAM_BATTING_BB 0.0249311 0.0032812 7.598 4.37e-14 ***
## TEAM_PITCHING_H -0.0006774 0.0002753 -2.461 0.0139 *
## TEAM_PITCHING_HR 0.1355754 0.0233591 5.804 7.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.19 on 2271 degrees of freedom
## Multiple R-squared: 0.07223, Adjusted R-squared: 0.07059
## F-statistic: 44.2 on 4 and 2271 DF, p-value: < 2.2e-16
model2 <- lm(TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_PITCHING_H + TEAM_PITCHING_HR +
TEAM_FIELDING_E + TEAM_BASERUN_SB,
data=train)
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_BASERUN_SB,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.705 -9.358 0.355 9.428 73.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.6008950 2.1264882 28.968 < 2e-16 ***
## TEAM_BATTING_HR -0.1276029 0.0243540 -5.239 1.76e-07 ***
## TEAM_BATTING_BB 0.0375847 0.0035499 10.587 < 2e-16 ***
## TEAM_PITCHING_H 0.0009774 0.0003293 2.968 0.00303 **
## TEAM_PITCHING_HR 0.1295960 0.0227426 5.698 1.37e-08 ***
## TEAM_FIELDING_E -0.0263590 0.0029705 -8.874 < 2e-16 ***
## TEAM_BASERUN_SB 0.0204186 0.0015010 13.603 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.6 on 2269 degrees of freedom
## Multiple R-squared: 0.1427, Adjusted R-squared: 0.1404
## F-statistic: 62.94 on 6 and 2269 DF, p-value: < 2.2e-16
model3 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
TEAM_BATTING_3B + TEAM_BATTING_HR +
TEAM_BATTING_BB +
TEAM_PITCHING_H + TEAM_PITCHING_HR +
TEAM_PITCHING_SO,
data=train)
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_H +
## TEAM_PITCHING_HR + TEAM_PITCHING_SO, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.287 -8.692 0.463 9.025 50.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.2520913 3.8987558 -2.117 0.0344 *
## TEAM_BATTING_H 0.0519738 0.0034431 15.095 < 2e-16 ***
## TEAM_BATTING_2B -0.0222290 0.0090772 -2.449 0.0144 *
## TEAM_BATTING_3B 0.0801464 0.0164588 4.870 1.20e-06 ***
## TEAM_BATTING_HR 0.0392109 0.0240414 1.631 0.1030
## TEAM_BATTING_BB 0.0210341 0.0029597 7.107 1.58e-12 ***
## TEAM_PITCHING_H -0.0022693 0.0002728 -8.318 < 2e-16 ***
## TEAM_PITCHING_HR -0.0003368 0.0217153 -0.016 0.9876
## TEAM_PITCHING_SO 0.0038185 0.0005997 6.368 2.32e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.54 on 2267 degrees of freedom
## Multiple R-squared: 0.2636, Adjusted R-squared: 0.261
## F-statistic: 101.5 on 8 and 2267 DF, p-value: < 2.2e-16
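The coefficients only partly match baseball logic, and several signs change from one model to the next.
Walks (TEAM_BATTING_BB) are positive and significant in all three models, which makes sense since getting on base more often helps a team score. In Model 3, hits (TEAM_BATTING_H) and triples (TEAM_BATTING_3B) are also positive, as expected. Home runs (TEAM_BATTING_HR), however, have a negative coefficient in Models 1 and 2 (about -0.12) and are positive only in Model 3 (0.039), where the estimate is not significant (p = 0.103).
On the pitching side, home runs allowed (TEAM_PITCHING_HR) have a positive coefficient in Models 1 and 2 (about 0.13) and are essentially zero in Model 3 (p = 0.99), which is the opposite of what baseball would suggest. Hits allowed (TEAM_PITCHING_H) are negative in Models 1 and 3 but positive in Model 2. Pitcher strikeouts (TEAM_PITCHING_SO) are positive in Model 3, and in Model 2 fielding errors (TEAM_FIELDING_E) are negative and stolen bases (TEAM_BASERUN_SB) are positive, all as expected.
The surprising signs, such as the negative coefficient for home runs, are most likely because some variables are closely related to each other. The opposite signs and similar sizes of the TEAM_BATTING_HR and TEAM_PITCHING_HR coefficients in Models 1 and 2 suggest that these two variables are strongly correlated. When similar variables are included in the same model, the regression can split their effects in ways that flip a sign, so a negative coefficient does not mean the relationship is truly negative. The same may apply to doubles (TEAM_BATTING_2B), which are negative in Model 3 once total hits are in the model.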
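Whether two predictors overlap this much can be checked with their correlation and a variance inflation factor (VIF). The sketch below uses simulated data, not the Moneyball file: hr_hit and hr_allowed are made-up stand-ins, and the VIF is computed by hand as 1 / (1 - R^2) rather than with an add-on package.

```r
set.seed(1)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-ins: hr_allowed is almost a copy of hr_hit.
n          <- 500
hr_hit     <- rnorm(n, mean = 100, sd = 30)
hr_allowed <- hr_hit + rnorm(n, sd = 3)
toy        <- data.frame(hr_hit, hr_allowed)

# Pairwise correlation between the two predictors.
r <- cor(toy$hr_hit, toy$hr_allowed)

# VIF for hr_hit: regress it on the other predictor and take 1 / (1 - R^2).
r2_hr_hit  <- summary(lm(hr_hit ~ hr_allowed, data = toy))$r.squared
vif_hr_hit <- 1 / (1 - r2_hr_hit)

print(c(correlation = r, vif = vif_hr_hit))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a coefficient's size and sign should not be read on their own. On the real data the first check would be cor(train$TEAM_BATTING_HR, train$TEAM_PITCHING_HR).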
A model that performs well overall (higher Adjusted R² and lower RMSE) can still be kept even if one coefficient seems unusual, as long as the reason is explained clearly. Here Adjusted R² rises from 0.071 in Model 1 to 0.140 in Model 2 and 0.261 in Model 3, so even the best of the three explains only about a quarter of the variation in wins.
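Adjusted R² and RMSE can be collected into one table instead of being read off each summary() printout. The sketch below uses R's built-in mtcars data purely as a stand-in, so the model names and formulas are assumptions rather than the models in this report.

```r
# Two nested models on the built-in mtcars data (stand-ins for model1..model3).
models <- list(
  small = lm(mpg ~ wt, data = mtcars),
  large = lm(mpg ~ wt + hp, data = mtcars)
)

# One row per model: adjusted R-squared and in-sample RMSE.
comparison <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  rmse   = sapply(models, function(m) sqrt(mean(residuals(m)^2)))
)
print(comparison)
```

With the report's own fits, the list would be list(model1 = model1, model2 = model2, model3 = model3).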
# In-sample RMSE for each model, computed from the training residuals
rmse1 <- sqrt(mean(residuals(model1)^2))
rmse2 <- sqrt(mean(residuals(model2)^2))
rmse3 <- sqrt(mean(residuals(model3)^2))
rmse1
## [1] 15.16928
rmse2
## [1] 14.58188
rmse3
## [1] 13.51415
plot(model3)
predictions <- predict(model3, newdata=eval)
submission <- data.frame(PREDICTED_WINS = predictions)
write.csv(submission, "moneyball_predictions.csv", row.names=FALSE)
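To choose the best model, I compared their RMSE values. RMSE is the square root of the average squared difference between predicted and actual wins, so a lower RMSE means the predictions are closer to the actual values. The values here are computed on the training data, so they measure how well each model fits the data it was built on rather than how it would do on new seasons.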
The results were:
Model 1 RMSE = 15.17
Model 2 RMSE = 14.58
Model 3 RMSE = 13.51
Model 3 has the lowest RMSE, so it fits the training data best. Its typical prediction error is about 13.5 wins, compared with about 14.6 for Model 2 and 15.2 for Model 1. Model 3 also has the highest Adjusted R² (0.261).
Because Model 3 fits best and most of its coefficients have the sign baseball would suggest, it was chosen as the final model.
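Because the RMSE values above come from the same rows the models were fitted on, a held-out check gives a more honest picture of prediction error. A minimal sketch on simulated data follows; the rmse helper, the toy data frame, and the 80/20 split are all assumptions for illustration.

```r
# Hypothetical helper: root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(42)  # arbitrary seed for a repeatable split

# Simulated stand-in for train: wins depend linearly on hits plus noise.
n        <- 200
toy      <- data.frame(hits = rnorm(n, mean = 1450, sd = 100))
toy$wins <- 10 + 0.05 * toy$hits + rnorm(n, sd = 8)

# Hold out 20% of the rows, fit on the rest, and score both parts.
test_idx <- sample(n, size = n %/% 5)
fit      <- lm(wins ~ hits, data = toy[-test_idx, ])

rmse_train <- rmse(toy$wins[-test_idx], fitted(fit))
rmse_test  <- rmse(toy$wins[test_idx], predict(fit, newdata = toy[test_idx, ]))
print(c(train = rmse_train, test = rmse_test))
```

If the held-out RMSE is much larger than the training RMSE, the model is fitting noise in the training rows. The same split could be applied to train before fitting the three models.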