Overview
In this homework assignment, you will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive. Each record contains the team's performance for the given year, with all statistics adjusted to match a 162-game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
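The training file is read the same way as the evaluation file shown further below; a minimal sketch of that import (the training file name is an assumption that mirrors the evaluation file) is:
Code
# Packages used throughout: readr/dplyr for import and wrangling, janitor for clean column names
library(tidyverse)
library(janitor)

# Path is an assumption, mirroring the evaluation file imported below
train <- read_csv("/data621/moneyball-training-data.csv") %>% clean_names()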
Rows: 2276 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): INDEX, TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
test <- read_csv("/data621/moneyball-evaluation-data.csv") %>% clean_names()
Rows: 259 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): INDEX, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATT...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
EDA
The data has been imported; let's check dimensions and missingness.
Code
cat("Dimensions\n")
Dimensions
Code
cat("===========\n")
===========
Code
cat(paste("Training:", paste(dim(train), collapse =" x "), "\n"))
Training: 2276 x 17
Code
cat(paste("Test:", paste(dim(test), collapse =" x "), "\n"))
Test: 259 x 16
Code
train_test_ratio <- round(dim(test)[1] / dim(train)[1], 4)
cat(paste("Ratio of test to train data:", train_test_ratio, "\n"))
Ratio of test to train data: 0.1138
Code
my_skim <- skim_with(
  numeric = sfl(median, p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL),
  append = TRUE
)

# Now run skim
my_skim(train) |> arrange(complete_rate)
Data summary

| Name | train |
|---|---|
| Number of rows | 2276 |
| Number of columns | 17 |
| Column type frequency (numeric) | 17 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | hist | median |
|---|---|---|---|---|---|---|
| team_batting_hbp | 2085 | 0.08 | 59.36 | 12.97 | ▂▇▇▅▁ | NA |
| team_baserun_cs | 772 | 0.66 | 52.80 | 22.96 | ▃▇▁▁▁ | NA |
| team_fielding_dp | 286 | 0.87 | 146.39 | 26.23 | ▁▂▇▆▁ | NA |
| team_baserun_sb | 131 | 0.94 | 124.76 | 87.79 | ▇▃▁▁▁ | NA |
| team_batting_so | 102 | 0.96 | 735.61 | 248.53 | ▁▆▇▇▁ | NA |
| team_pitching_so | 102 | 0.96 | 817.73 | 553.09 | ▇▁▁▁▁ | NA |
| index | 0 | 1.00 | 1268.46 | 736.35 | ▇▇▇▇▇ | 1270.5 |
| target_wins | 0 | 1.00 | 80.79 | 15.75 | ▁▁▇▅▁ | 82.0 |
| team_batting_h | 0 | 1.00 | 1469.27 | 144.59 | ▁▇▂▁▁ | 1454.0 |
| team_batting_2b | 0 | 1.00 | 241.25 | 46.80 | ▁▆▇▂▁ | 238.0 |
| team_batting_3b | 0 | 1.00 | 55.25 | 27.94 | ▇▇▂▁▁ | 47.0 |
| team_batting_hr | 0 | 1.00 | 99.61 | 60.55 | ▇▆▇▅▁ | 102.0 |
| team_batting_bb | 0 | 1.00 | 501.56 | 122.67 | ▁▁▇▇▁ | 512.0 |
| team_pitching_h | 0 | 1.00 | 1779.21 | 1406.84 | ▇▁▁▁▁ | 1518.0 |
| team_pitching_hr | 0 | 1.00 | 105.70 | 61.30 | ▇▇▆▁▁ | 107.0 |
| team_pitching_bb | 0 | 1.00 | 553.01 | 166.36 | ▇▁▁▁▁ | 536.5 |
| team_fielding_e | 0 | 1.00 | 246.48 | 227.77 | ▇▁▁▁▁ | 159.0 |
There are some missing values that should be addressed. I will remove team_batting_hbp, since very little of its data is available, and then use median imputation for the remaining missing values in the data set. As a reminder, we must apply the same steps to the test set for accurate modeling.
We also notice some columns containing 0s. Since there is little chance that a team can go a whole season without allowing a home run or giving up a hit, we will assume these are data entry/capture errors and include them in our imputation.
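As a quick sanity check before imputing, the zero entries can be counted per numeric column (a minimal sketch using the packages already loaded above):
Code
# Count zero entries in each numeric column and sort by the worst offenders
train %>%
  summarise(across(where(is.numeric), ~ sum(. == 0, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_zero") %>%
  arrange(desc(n_zero))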
Missing Data
Code
train <- train %>% dplyr::select(-team_batting_hbp, -index)
test  <- test %>% dplyr::select(-team_batting_hbp)
print("train cols")
We see our columns match except for target_wins, our target variable. Let's also impute any zero-valued predictors and remove any row with no target_wins before imputing.
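The row-removal step is not shown in the chunk below; a minimal sketch (assuming a zero or missing target_wins marks such a row) might be:
Code
# Drop training rows without a usable target (assumption: 0 or NA means no recorded wins)
train <- train %>% filter(!is.na(target_wins) & target_wins > 0)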
Code
impute_nonpositive_with_median <- function(df) {
  df %>%
    mutate(across(
      where(is.numeric),
      ~ {
        med <- median(.[. > 0], na.rm = TRUE)
        # fallback median if no positives
        if (is.na(med)) med <- median(., na.rm = TRUE)
        ifelse(. <= 0 | is.na(.), med, .)
      }
    ))
}

train <- impute_nonpositive_with_median(train)
test  <- impute_nonpositive_with_median(test)
Code
vars <- train %>% dplyr::select(where(is.numeric)) %>% dplyr::select(-target_wins)

for (col in names(vars)) {
  print(
    ggplot(train, aes(x = .data[[col]], y = target_wins)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
      geom_smooth(method = "loess", se = FALSE, color = "blue", linewidth = 1) +
      labs(title = paste("Linearity check:", col))
  )
}
`geom_smooth()` using formula = 'y ~ x'
Code
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:psych':
logit
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Code
model <- lm(target_wins ~ ., data = train)

par(mfrow = c(2, 3))
crPlots(model)
Pre-Processing
To satisfy the linearity requirement of linear regression, we will apply a Box-Cox transformation to the training and testing sets before modeling.
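For reference, the Box-Cox family estimates a power \(\lambda\) for each strictly positive predictor \(y\) and transforms it as

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(y), & \lambda = 0
\end{cases}
\]

with centering and scaling applied afterward by the same preprocessing step.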
Code
# Apply Box-Cox transformation, centering, and scaling
train_df <- as.data.frame(train)
test_df  <- as.data.frame(test)

train_x <- train_df[, setdiff(names(train_df), "target_wins")]
test_x  <- test_df  # test should have the same predictors only

# --- Preprocess predictors only ---
preproc <- preProcess(train_x, method = c("BoxCox", "center", "scale"))

# --- Apply transformations ---
train_trans <- predict(preproc, train_x)
test_trans  <- predict(preproc, test_x)
Let's take a look at the component-plus-residual plots to determine whether the linearity assumption is met.
Now that we have transformed our data, we can choose and train some models and then make predictions.
Let's begin with the kitchen sink: target_wins ~ . (all the predictors).
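The kitchen-sink fit itself is not shown as code; a minimal sketch on the transformed predictors (the names train_model_df and model_full are assumptions; train_trans and train_df come from the preprocessing chunk above) is:
Code
# Reattach the untransformed response to the transformed predictors, then fit every predictor
train_model_df <- cbind(train_trans, target_wins = train_df$target_wins)

model_full <- lm(target_wins ~ ., data = train_model_df)  # the "kitchen sink" model
summary(model_full)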
The first model shows some very important predictors:
- team_batting_h (hits)
- team_batting_3b (triples)
- team_batting_bb (walks)
- team_batting_so (strikeouts) [negative impact on wins]
- team_batting_hr (home runs)
- team_pitching_hr (home runs given up) [negative impact]
- team_pitching_so (strikeouts)
- team_fielding_e (defensive errors) [negative impact on wins]
- team_pitching_bb (walks given up) [negative impact on wins]
- team_fielding_dp (double plays turned, i.e., two outs on one play)
The overall F-test has a p-value below 2.2e-16, indicating a clear relationship between the predictors and the target variable.
The \(R^2 = 0.30\), indicating that the model is still missing some of the information needed to accurately predict wins.
In our second model we remove the non-significant predictors and see a small reduction to \(R^2 = 0.2946\). While we now have only significant predictors, we don't see an improvement in \(R^2\). Let's see if removing some outliers helps the model.
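The reduced fit is not shown as code; a sketch that keeps only the predictors flagged above (the exact variable list is an assumption based on the discussion, and train_model_df comes from the earlier sketch) might be:
Code
# Refit using only the predictors that were significant in the full model
model_sig <- lm(
  target_wins ~ team_batting_h + team_batting_3b + team_batting_bb +
    team_batting_so + team_batting_hr + team_pitching_hr + team_pitching_so +
    team_pitching_bb + team_fielding_e + team_fielding_dp,
  data = train_model_df
)
summary(model_sig)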
Code
# Compute diagnostic measures
influence <- influence.measures(model_sig)

# View the summary of flagged observations
# summary(influence)

# Extract Cook's Distance
cooksD <- cooks.distance(model_sig)

# Inspect top outliers
sort(cooksD, decreasing = TRUE)[1:20]
Influential points were identified using Cook's distance and removed to reduce leverage bias. The criterion was a Cook's distance above 0.015.
A total of 35 observations, roughly \(35/2276 \approx 1.5\%\) of the data (e.g., rows 1342, 1211, 1828, 299), were excluded from the dataset.
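The removal and refit can be sketched as follows (the 0.015 cutoff follows the criterion above; the names influential_rows, train_clean, and model_clean are assumptions):
Code
# Flag observations whose Cook's distance exceeds the cutoff and refit without them
influential_rows <- which(cooksD > 0.015)
length(influential_rows)  # roughly 35 observations

train_clean <- train_model_df[-influential_rows, ]
model_clean <- update(model_sig, data = train_clean)
summary(model_clean)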
After refitting the model:

- \(R^2\) improved from 0.2946 to 0.3161
- Diagnostic plots confirmed:
  - Improved residual homoscedasticity
  - Fewer high-leverage outliers
  - Similar coefficient directions and magnitudes
These results suggest that the original model was modestly affected by a small set of influential data points.
After their removal, the refitted linear regression model explains approximately 31.6% of the variance in team wins and better satisfies the OLS assumptions.
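The histogram chunk below assumes a vector predictions_ceiling of whole-number predicted wins for the evaluation set; a minimal sketch of how it might be produced (model_clean and test_trans as above) is:
Code
# Predict wins on the transformed evaluation predictors, then round up to whole wins
predictions_ceiling <- ceiling(predict(model_clean, newdata = test_trans))
summary(predictions_ceiling)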
Code
# Create histogram and save histogram info
m <- mean(predictions_ceiling)
s <- sd(predictions_ceiling)

# Re-plot with labels
h <- hist(predictions_ceiling,
          breaks = 20,
          col = "gray",
          border = "white",
          freq = FALSE,
          main = "Distribution of Predicted Wins",
          xlab = "Predicted Wins")

# Add the normal curve
xfit <- seq(min(predictions_ceiling), max(predictions_ceiling), length = 100)
yfit <- dnorm(xfit, mean = m, sd = s)
lines(xfit, yfit, col = "red", lwd = 2)

# Add mean and SD text annotation
text(x = m, y = max(yfit) / 2,
     labels = paste0("Mean = ", round(m, 1), "\nSD = ", round(s, 1)),
     col = "blue", cex = 0.9, pos = 4)
We generated predictions using the cleaned linear model after removing 35 influential observations identified by Cook’s Distance values greater than 0.015.
A ceiling function was applied to ensure all predicted team win totals were whole numbers.
We then visualized the distribution of predicted wins using a histogram with an overlaid normal distribution curve.
The resulting predictions followed an approximately normal (Gaussian) distribution, centered around 80.8 wins, a realistic outcome given that Major League Baseball teams play 162 games per season.
An average near 81 wins represents a balanced league, confirming that our model’s predictions align well with real-world expectations.
As a baseball fan, I recognize that additional variables could improve the model’s ability to explain team wins.
Metrics such as runs scored, hits with runners in scoring position, and runs allowed directly capture a team’s offensive efficiency and pitching effectiveness.
Including these types of performance indicators could enhance the model’s explanatory power and yield a higher \(R^2\), providing a more complete representation of what drives success over a 162-game season.