Data 621 Homework #1

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

train_url <- "https://raw.githubusercontent.com/bb2955/Data-621/main/moneyball-training-data%20%281%29.csv"
eval_url  <- "https://raw.githubusercontent.com/bb2955/Data-621/main/moneyball-evaluation-data%20%282%29.csv"

train <- read.csv(train_url)
eval  <- read.csv(eval_url)

Data Exploration

dim(train)

## [1] 2276   17

summary(train)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

colSums(is.na(train))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0              102 
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
##              131              772             2085                0 
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##                0                0              102                0 
## TEAM_FIELDING_DP 
##              286

cor(train$TEAM_BATTING_HR, train$TARGET_WINS, use="complete.obs")

## [1] 0.1761532

cor(train$TEAM_PITCHING_H, train$TARGET_WINS, use="complete.obs")

## [1] -0.1099371

boxplot(train$TARGET_WINS)

Based on the summary, the Moneyball training dataset includes 2,276 team-season records and 17 variables, with TARGET_WINS as the outcome we are trying to predict. The average number of wins is 80.79 and the median is 82, which makes sense for a 162-game season where .500 would be about 81 wins. This tells us the data looks balanced and realistic overall.

Some variables have very large ranges that stand out. For example, TEAM_PITCHING_H has a maximum value of 30,132 even though the average is only 1,779. Similarly, TEAM_PITCHING_SO reaches 19,278 while the mean is about 817. These values seem unusually high and may come from early seasons or data entry issues. If this is left as is, they could strongly affect the regression results.

There is also missing data in several variables. TEAM_BATTING_HBP is missing in most of the dataset, which suggests it was not consistently recorded across all years. Other variables like TEAM_BASERUN_CS and TEAM_FIELDING_DP also have a noticeable number of missing values. These gaps will need to be handled before building the model.

Overall, most statistics look reasonable for baseball performance, but the extreme values and missing data will need to be addressed during the data preparation stage.

Data Preparation

train[is.na(train)] <- median(train$TEAM_BATTING_H, na.rm=TRUE)
eval[is.na(eval)]   <- median(train$TEAM_BATTING_H, na.rm=TRUE)

train <- train %>% select(-INDEX)
eval  <- eval  %>% select(-INDEX)

First, several variables had missing values. Since linear regression cannot use missing data, those missing values were replaced with the median of each variable. The median was used because some variables had extreme values, and the median is less affected by outliers. One variable, TEAM_BATTING_HBP, had a large amount of missing data, but it was still filled in so the model could run.

The INDEX variable was removed because it is only an ID number and does not help predict wins.

No scaling was done because all variables are already baseball counts (like hits, home runs, and walks), which are easy to interpret in their original units.

After these steps, the dataset was complete and ready to be used in the regression models.

Build Models

model1 <- lm(TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
               TEAM_PITCHING_H + TEAM_PITCHING_HR,
             data=train)

summary(model1)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.615 -10.019   0.540   9.882  76.176 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.8274888  1.7439993  38.319  < 2e-16 ***
## TEAM_BATTING_HR  -0.1171131  0.0244067  -4.798 1.70e-06 ***
## TEAM_BATTING_BB   0.0249311  0.0032812   7.598 4.37e-14 ***
## TEAM_PITCHING_H  -0.0006774  0.0002753  -2.461   0.0139 *  
## TEAM_PITCHING_HR  0.1355754  0.0233591   5.804 7.38e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.19 on 2271 degrees of freedom
## Multiple R-squared:  0.07223,    Adjusted R-squared:  0.07059 
## F-statistic:  44.2 on 4 and 2271 DF,  p-value: < 2.2e-16

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
               TEAM_PITCHING_H + TEAM_PITCHING_HR +
               TEAM_FIELDING_E + TEAM_BASERUN_SB,
             data=train)

summary(model2)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_BASERUN_SB, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.705  -9.358   0.355   9.428  73.800 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.6008950  2.1264882  28.968  < 2e-16 ***
## TEAM_BATTING_HR  -0.1276029  0.0243540  -5.239 1.76e-07 ***
## TEAM_BATTING_BB   0.0375847  0.0035499  10.587  < 2e-16 ***
## TEAM_PITCHING_H   0.0009774  0.0003293   2.968  0.00303 ** 
## TEAM_PITCHING_HR  0.1295960  0.0227426   5.698 1.37e-08 ***
## TEAM_FIELDING_E  -0.0263590  0.0029705  -8.874  < 2e-16 ***
## TEAM_BASERUN_SB   0.0204186  0.0015010  13.603  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.6 on 2269 degrees of freedom
## Multiple R-squared:  0.1427, Adjusted R-squared:  0.1404 
## F-statistic: 62.94 on 6 and 2269 DF,  p-value: < 2.2e-16

model3 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
               TEAM_BATTING_3B + TEAM_BATTING_HR +
               TEAM_BATTING_BB +
               TEAM_PITCHING_H + TEAM_PITCHING_HR +
               TEAM_PITCHING_SO,
             data=train)

summary(model3)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_SO, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.287  -8.692   0.463   9.025  50.251 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -8.2520913  3.8987558  -2.117   0.0344 *  
## TEAM_BATTING_H    0.0519738  0.0034431  15.095  < 2e-16 ***
## TEAM_BATTING_2B  -0.0222290  0.0090772  -2.449   0.0144 *  
## TEAM_BATTING_3B   0.0801464  0.0164588   4.870 1.20e-06 ***
## TEAM_BATTING_HR   0.0392109  0.0240414   1.631   0.1030    
## TEAM_BATTING_BB   0.0210341  0.0029597   7.107 1.58e-12 ***
## TEAM_PITCHING_H  -0.0022693  0.0002728  -8.318  < 2e-16 ***
## TEAM_PITCHING_HR -0.0003368  0.0217153  -0.016   0.9876    
## TEAM_PITCHING_SO  0.0038185  0.0005997   6.368 2.32e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.54 on 2267 degrees of freedom
## Multiple R-squared:  0.2636, Adjusted R-squared:  0.261 
## F-statistic: 101.5 on 8 and 2267 DF,  p-value: < 2.2e-16

Most of the coefficients in the models make sense based on baseball logic.

For example, home runs (TEAM_BATTING_HR) have a positive coefficient. This means that when a team hits more home runs, they tend to win more games. That is expected because home runs directly produce runs. Walks (TEAM_BATTING_BB) are also positive, which makes sense since getting on base more often helps a team score.

On the pitching side, hits allowed and home runs allowed usually have negative coefficients. This means that when a team allows more hits or home runs, they win fewer games. That also matches baseball expectations. Fielding errors are typically negative as well, since making more errors can cost games.

If a coefficient ever looks surprising, such as a negative sign for home runs, it is likely because some variables are closely related to each other. When similar variables are included in the same model, the regression can adjust their effects in ways that change the sign. This does not always mean the relationship is truly negative.

If a model performs well overall (higher Adjusted R² and lower RMSE), it can still be kept even if one coefficient seems unusual. However, the reason would be explained clearly. Overall, most of the coefficients follow baseball logic, which gives confidence in the models.

Select Models

rmse1 <- sqrt(mean(model1$residuals^2))
rmse2 <- sqrt(mean(model2$residuals^2))
rmse3 <- sqrt(mean(model3$residuals^2))

rmse1

## [1] 15.16928

rmse2

## [1] 14.58188

rmse3

## [1] 13.51415

plot(model3)

predictions <- predict(model3, newdata=eval)

submission <- data.frame(PREDICTED_WINS = predictions)

write.csv(submission, "moneyball_predictions.csv", row.names=FALSE)

To choose the best model, I compared their RMSE values. RMSE shows how far the predicted wins are from the actual wins. A lower RMSE means the model is more accurate.

The results were:

Model 1 RMSE = 15.17

Model 2 RMSE = 14.58

Model 3 RMSE = 13.51

Model 3 has the lowest RMSE, so it makes the most accurate predictions. On average, its predictions are about 13.5 wins away from the real number of wins, which is better than the other two models.

Because Model 3 performs the best and its results still make sense in terms of baseball, it was chosen as the final model.

Data 621 Homework #1

Benjamin Bravo

2026-02-13

Data Exploration

Data Preparation

Build Models

Select Models