In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and captures the team's performance for the given year, with all statistics adjusted to a 162-game season.
The objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
## [1] 2276 17
The training data has 2276 observations and 17 attributes, two of which are the index column and the number-of-wins column, the latter being our target variable. Exploring this training dataset can help us gain insight into how to approach the problem at hand. The following visualization shows the histogram of each predictor for this assignment. The current column names are difficult to read since we don't know what they mean, but for now we have an idea of what each predictor variable's distribution looks like. Predictor variables such as "BATTING_HBP", "PITCHING_H", and "PITCHING_SO" seem to have a good number of outliers, which we can handle in the next section of this assignment, where we clean and wrangle the data to "tidy" this dataset.
Looking at these distributions only gives us a rough idea of the data. The following table provides summary statistics that give a better understanding of the data. We can treat these two views as complementary to each other.
| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 10.0 | 0 | 146 | 146 | -0.3989861 | 4.031017 | 0.3301823 |
| TEAM_BATTING_H | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 77.0 | 891 | 2554 | 1663 | 1.5723696 | 10.287564 | 3.0307891 |
| TEAM_BATTING_2B | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 32.0 | 69 | 458 | 389 | 0.2152436 | 3.008804 | 0.9810087 |
| TEAM_BATTING_3B | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 16.0 | 0 | 223 | 223 | 1.1101968 | 4.507201 | 0.5856226 |
| TEAM_BATTING_HR | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 53.0 | 0 | 264 | 264 | 0.1861648 | 2.038672 | 1.2691285 |
| TEAM_BATTING_BB | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 64.0 | 0 | 878 | 878 | -1.0264363 | 5.187412 | 2.5713150 |
| TEAM_BATTING_SO | 2174 | 735.60534 | 248.52642 | 750.0 | 742.31322 | 192.0 | 0 | 1399 | 1399 | NA | NA | 5.3301912 |
| TEAM_BASERUN_SB | 2145 | 124.76177 | 87.79117 | 101.0 | 110.81188 | 41.0 | 0 | 697 | 697 | NA | NA | 1.8955584 |
| TEAM_BASERUN_CS | 1504 | 52.80386 | 22.95634 | 49.0 | 50.35963 | 12.0 | 0 | 201 | 201 | NA | NA | 0.5919414 |
| TEAM_BATTING_HBP | 191 | 59.35602 | 12.96712 | 58.0 | 58.86275 | 8.0 | 29 | 95 | 66 | NA | NA | 0.9382681 |
| TEAM_PITCHING_H | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 118.0 | 1137 | 30132 | 28995 | 10.3363225 | 144.967058 | 29.4889618 |
| TEAM_PITCHING_HR | 2276 | 105.69859 | 61.29875 | 107.0 | 103.15697 | 50.0 | 0 | 343 | 343 | 0.2879774 | 2.397475 | 1.2848886 |
| TEAM_PITCHING_BB | 2276 | 553.00791 | 166.35736 | 536.5 | 542.62459 | 66.5 | 0 | 3645 | 3645 | 6.7483465 | 100.055543 | 3.4870317 |
| TEAM_PITCHING_SO | 2174 | 817.73045 | 553.08503 | 813.5 | 796.93391 | 173.5 | 0 | 19278 | 19278 | NA | NA | 11.8621151 |
| TEAM_FIELDING_E | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 42.0 | 65 | 1898 | 1833 | 2.9924375 | 13.982556 | 4.7743279 |
| TEAM_FIELDING_DP | 1990 | 146.38794 | 26.22639 | 149.0 | 147.57789 | 16.0 | 52 | 228 | 176 | NA | NA | 0.5879114 |
As expected, there are missing values in this dataset that will need to be removed or imputed, whichever method has the least impact on the distributions; we want to keep the data in its truest form while performing data cleaning. Skew and kurtosis return an NA value for several columns because they could not be computed due to missingness. First and foremost, we need to change the column names to something more easily understood.
In the context of this dataset, replacing NA values with zeros rather than performing mean or median imputation appears to be the most suitable approach. This choice is justified by the fact that each row may represent either the same team across different years or different teams within the same year. In sports data, a team either has a recorded statistic or it does not, with no in-between.
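A minimal sketch of this zero-fill step, mirroring the `zoo::na.fill()` call in the code appendix, with the median-imputation alternative shown only for contrast (the tidyverse variant is illustrative and not part of the original workflow):

```r
library(zoo)        # na.fill()
library(tidyverse)  # dplyr/tidyr helpers for the alternative below

# Chosen approach: treat a missing statistic as "did not happen" and fill with 0
train_zero <- as.data.frame(na.fill(train_df, fill = 0))

# Alternative considered but not used: median imputation per column
train_median <- train_df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE))))
```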
Moreover, the column names have been updated to be more easily understood by readers with little to no knowledge of baseball, and they no longer require the table explaining what the acronyms mean. We also added a "P" or "N" at the end of each column name to represent the theoretical positive or negative impact on winning a game, based on the table given for this assignment.
Because of the many features available in the dataset, we won't be performing any transformations; we want to use the data in its truest form and see whether we get sufficient results doing so.
## [1] "Wins" "Hits_P" "Double_Hits_P"
## [4] "Triple_Hits_P" "Homerun_P" "Walks_P"
## [7] "Batter_Hit_P" "Strike_Out_N" "Stolen_Base_P"
## [10] "Caught_Stealing_N" "Errors_N" "Double_Play_P"
## [13] "Walks_Allowed_N" "Hits_Allowed_N" "Homeruns_Allowed_N"
## [16] "Strike_Out_Pitcher_P"
The following chart shows the correlation between each pair of columns in the data. We can observe that "Homeruns_Allowed_N" and "Strike_Out_Pitcher_P" have the strongest negative correlation in the dataset. This will help us determine which features are best for building the models later.
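As a numeric check on the plot, the pairwise correlations can also be ranked to surface the most strongly negative pairs; a minimal sketch, assuming the renamed, zero-filled `train_df` from the code appendix:

```r
# Pairwise correlations among the 15 predictors (Wins excluded), most negative pairs first
correlation_Matrix <- cor(train_df[, 2:16])
cor_pairs <- as.data.frame(as.table(correlation_Matrix))
cor_pairs <- subset(cor_pairs, as.character(Var1) < as.character(Var2))  # drop self- and duplicate pairs
head(cor_pairs[order(cor_pairs$Freq), ], 5)
```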
Here, we use a linear model to estimate variable importance, which rates each feature by the impact it has on the model. It shows that Hits_P is the most important feature in this data, followed by Homeruns_Allowed_N and Strike_Out_Pitcher_P, with Double_Play_P being the least important. This is another tool that can help in the feature selection process.
After exploring and cleaning the data, we should have a good idea of which features we want to use in building the linear models.
##
## Call:
## lm(formula = Wins ~ Hits_P + Double_Hits_P + Triple_Hits_P +
## Walks_P + Strike_Out_N + Stolen_Base_P + Caught_Stealing_N +
## Errors_N + Double_Play_P + Walks_Allowed_N + Homeruns_Allowed_N +
## Strike_Out_Pitcher_P, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.110 -8.706 0.236 8.617 51.107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.5798159 3.8468397 5.350 9.69e-08 ***
## Hits_P 0.0470546 0.0032408 14.519 < 2e-16 ***
## Double_Hits_P -0.0100974 0.0094659 -1.067 0.286213
## Triple_Hits_P 0.0621507 0.0166086 3.742 0.000187 ***
## Walks_P 0.0023481 0.0045229 0.519 0.603705
## Strike_Out_N 0.0028463 0.0041928 0.679 0.497301
## Stolen_Base_P -0.0227720 0.0105501 -2.158 0.030996 *
## Caught_Stealing_N -0.0576685 0.0190889 -3.021 0.002547 **
## Errors_N -0.0004893 0.0003770 -1.298 0.194449
## Double_Play_P 0.0349073 0.0070858 4.926 8.98e-07 ***
## Walks_Allowed_N 0.0041496 0.0028262 1.468 0.142177
## Homeruns_Allowed_N -0.0304417 0.0029858 -10.195 < 2e-16 ***
## Strike_Out_Pitcher_P -0.0622201 0.0101196 -6.148 9.22e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.31 on 2263 degrees of freedom
## Multiple R-squared: 0.2897, Adjusted R-squared: 0.286
## F-statistic: 76.93 on 12 and 2263 DF, p-value: < 2.2e-16
This model includes nearly all of the available features to create the baseline needed to improve the fit, or R-squared. The "Everything Model" has an R-squared of 0.29.
##
## Call:
## lm(formula = Wins ~ Hits_P + Triple_Hits_P + Homerun_P + Batter_Hit_P +
## Strike_Out_Pitcher_P, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.846 -8.748 0.620 9.416 54.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.719719 3.872096 6.126 1.06e-09 ***
## Hits_P 0.029675 0.002583 11.491 < 2e-16 ***
## Triple_Hits_P 0.121788 0.017009 7.160 1.08e-12 ***
## Homerun_P 0.098998 0.009110 10.866 < 2e-16 ***
## Batter_Hit_P -0.006119 0.001704 -3.591 0.000336 ***
## Strike_Out_Pitcher_P 0.009217 0.006866 1.343 0.179566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.04 on 2270 degrees of freedom
## Multiple R-squared: 0.2077, Adjusted R-squared: 0.2059
## F-statistic: 119 on 5 and 2270 DF, p-value: < 2.2e-16
This model is based on the "Everything Model", keeping the coefficients that were the most significant and restricting the features to those with a theoretically positive effect, in hopes of creating a better model from this data. We did not accomplish that, since it returned an R-squared of roughly 0.21, a decrease from our initial model.
##
## Call:
## lm(formula = Wins ~ Hits_P + Homeruns_Allowed_N + Strike_Out_Pitcher_P,
## data = train_df, subset = Batter_Hit_P)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.317 -8.036 -0.022 7.552 46.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.099982 3.343992 12.59 <2e-16 ***
## Hits_P 0.039064 0.002196 17.79 <2e-16 ***
## Homeruns_Allowed_N -0.033212 0.001733 -19.16 <2e-16 ***
## Strike_Out_Pitcher_P -0.083097 0.007290 -11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.18 on 2150 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.2077
## F-statistic: 189.2 on 3 and 2150 DF, p-value: < 2.2e-16
Now we build a third model based on the correlation plot and importance chart, hoping it will be the best of all the prior models. In this model, we use the following features: Hits_P, Homeruns_Allowed_N, and Strike_Out_Pitcher_P.
The resulting R-squared was not what we intended; at roughly 0.21, it is similar to that of the second model.
Based on the R-squared of each model, the "Everything Model" has the best fit on the data, while the other two models had an R-squared of approximately 0.21. The initial thought was that the second and third models would be better than the first, but that wasn't the case: removing features from the model resulted in a lower-than-desired R-squared. Even the model with features chosen from the ranked importance performed poorly. Moreover, the Everything Model, with an R-squared of roughly 0.29, is still low, explaining only about 29% of the variance in the target variable. We had hoped that selecting the most important features would raise it to at least 0.50.
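To check these fit statistics side by side, the three models can be compared directly; a minimal sketch, assuming `model_1`, `model_2`, and `model_3` from the code appendix (the list labels are illustrative):

```r
# Compare R-squared, adjusted R-squared, and residual standard error across the three models
models <- list(everything = model_1, positive_only = model_2, correlation_based = model_3)
fit_stats <- sapply(models, function(m) {
  s <- summary(m)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared, sigma = s$sigma)
})
round(t(fit_stats), 3)
```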
Now, using the first model, we predict the number of wins for the teams in the evaluation dataset. The output below shows the distribution of the predicted wins, with a median of roughly 81 wins; a sketch of the prediction step follows the output.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.90 76.33 80.89 80.58 85.85 113.00
## 1 2 3 4 5 6 7 8
## 68.08311 69.09019 76.87568 82.97867 63.77830 66.45927 79.71547 73.97235
## 9 10
## 71.58869 76.80667
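The prediction step itself does not appear in the code appendix; a minimal sketch of how `model_1` might be applied to the evaluation set, assuming `eval_df` has been renamed and zero-filled so that it contains the same predictor columns used to fit the model:

```r
# Apply the "Everything Model" (model_1) to the evaluation data
# (assumes eval_df holds the same predictor columns as train_df)
predicted_wins <- predict(model_1, newdata = eval_df)

summary(predicted_wins)   # distribution of the predicted win totals
head(predicted_wins, 10)  # first ten predictions
```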
Feature Selection with the Caret Package. LINK
Faraway, J. J. (2005). Linear models with R. Chapman & Hall/CRC.
Sheather, S. J. (2009). A modern approach to regression with R. Springer.
library(tidyverse)
library(zoo)
library(kableExtra)
library(knitr)
library(broom)
library(DT)
library(faraway)
library(caret)
library(corrplot)
library(mlbench)
library(randomForest)
setwd("C:/Users/Nick Climaco/Documents/DataScience/DATA_621_F23/Homework-1")
set.seed(12041998)
eval_df <- read_csv("moneyball-evaluation-data.csv")
train_df <- read_csv("moneyball-training-data.csv")
dim(train_df)
# Histogram of each predictor (the index column is skipped)
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)[-1]) {
  hist(train_df[[col_name]], main = col_name, xlab = "Value")
}
par(mfrow = c(1, 1))
kable(tidy(train_df[-1]), "pipe")
colnames(train_df) <- c("Index", "Wins", "Hits_P","Double_Hits_P","Triple_Hits_P", "Homerun_P","Walks_P", "Batter_Hit_P", "Strike_Out_N",
"Stolen_Base_P", "Caught_Stealing_N", "Errors_N","Double_Play_P", "Walks_Allowed_N", "Hits_Allowed_N", "Homeruns_Allowed_N",
"Strike_Out_Pitcher_P")
colnames(eval_df) <- c("Index", "Wins", "Hits_P","Double_Hits_P","Triple_Hits_P", "Homerun_P","Walks_P", "Batter_Hit_P", "Strike_Out_N",
"Stolen_Base_P", "Caught_Stealing_N", "Errors_N","Double_Play_P", "Walks_Allowed_N", "Hits_Allowed_N",
"Strike_Out_Pitcher_P")
# Replace missing values with 0 (see discussion above) and drop the index column from the training data
train_df <- as.data.frame(na.fill(train_df, fill = 0))
train_df <- train_df[-1]
eval_df <- as.data.frame(na.fill(eval_df, fill = 0))
colnames(train_df)
correlation_Matrix <- cor(train_df[,2:16])
corrplot(correlation_Matrix,method = "color", type = "full", order = "hclust")
# Rank variable importance using a cross-validated linear model (caret)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(Wins ~ ., data = train_df, method = "lm", preProcess = "scale", trControl = control)
importance <- varImp(model, scale = FALSE)
plot(importance)
# Model 1: the "Everything Model"
model_1 <- lm(Wins ~ Hits_P + Double_Hits_P + Triple_Hits_P + Walks_P + Strike_Out_N +
                Stolen_Base_P + Caught_Stealing_N + Errors_N + Double_Play_P + Walks_Allowed_N +
                Homeruns_Allowed_N + Strike_Out_Pitcher_P,
              data = train_df)
summary(model_1)
plot(model_1, 1:2)
# Model 2: features with a theoretical positive impact ("_P" columns)
model_2 <- lm(Wins ~ Hits_P + Triple_Hits_P + Homerun_P + Batter_Hit_P + Strike_Out_Pitcher_P, data = train_df)
summary(model_2)
plot(model_2, 1:2)
# Model 3: features chosen from the correlation plot and importance chart
model_3 <- lm(Wins ~ Hits_P + Homeruns_Allowed_N + Strike_Out_Pitcher_P,
              data = train_df, subset = Batter_Hit_P)
summary(model_3)
plot(model_3, 1:2)