Overview

In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record contains the team's performance for the given year, with all of the statistics adjusted to match the performance of a 162-game season.

Objective

The objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).


Data Exploration

## [1] 2276   17

The training data has 2276 observations and 17 attributes, two of which are the index column and the number of wins, our target variable. Exploring this training dataset can help us gain insight into how to approach the problem at hand. The following visualization shows a histogram of each predictor. The current column names are difficult to read since we don’t yet know what they mean, but for now we have an idea of what each predictor variable’s distribution looks like. Predictor variables such as “BATTING_HBP”, “PITCHING_H”, and “PITCHING_SO” seem to have a good number of outliers, which we will handle in the next section of this assignment, where we clean and wrangle the data to “tidy” this dataset.

Summary Statistics for the Target and Predictor Variables

Looking at these distributions only gives us a rough idea of the data. The following table provides the summary statistics for a better understanding of each variable. We can treat the histograms and the table as complementary views of the same data.

| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 10.0 | 0 | 146 | 146 | -0.3989861 | 4.031017 | 0.3301823 |
| TEAM_BATTING_H | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 77.0 | 891 | 2554 | 1663 | 1.5723696 | 10.287564 | 3.0307891 |
| TEAM_BATTING_2B | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 32.0 | 69 | 458 | 389 | 0.2152436 | 3.008804 | 0.9810087 |
| TEAM_BATTING_3B | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 16.0 | 0 | 223 | 223 | 1.1101968 | 4.507201 | 0.5856226 |
| TEAM_BATTING_HR | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 53.0 | 0 | 264 | 264 | 0.1861648 | 2.038672 | 1.2691285 |
| TEAM_BATTING_BB | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 64.0 | 0 | 878 | 878 | -1.0264363 | 5.187412 | 2.5713150 |
| TEAM_BATTING_SO | 2174 | 735.60534 | 248.52642 | 750.0 | 742.31322 | 192.0 | 0 | 1399 | 1399 | NA | NA | 5.3301912 |
| TEAM_BASERUN_SB | 2145 | 124.76177 | 87.79117 | 101.0 | 110.81188 | 41.0 | 0 | 697 | 697 | NA | NA | 1.8955584 |
| TEAM_BASERUN_CS | 1504 | 52.80386 | 22.95634 | 49.0 | 50.35963 | 12.0 | 0 | 201 | 201 | NA | NA | 0.5919414 |
| TEAM_BATTING_HBP | 191 | 59.35602 | 12.96712 | 58.0 | 58.86275 | 8.0 | 29 | 95 | 66 | NA | NA | 0.9382681 |
| TEAM_PITCHING_H | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 118.0 | 1137 | 30132 | 28995 | 10.3363225 | 144.967058 | 29.4889618 |
| TEAM_PITCHING_HR | 2276 | 105.69859 | 61.29875 | 107.0 | 103.15697 | 50.0 | 0 | 343 | 343 | 0.2879774 | 2.397475 | 1.2848886 |
| TEAM_PITCHING_BB | 2276 | 553.00791 | 166.35736 | 536.5 | 542.62459 | 66.5 | 0 | 3645 | 3645 | 6.7483465 | 100.055543 | 3.4870317 |
| TEAM_PITCHING_SO | 2174 | 817.73045 | 553.08503 | 813.5 | 796.93391 | 173.5 | 0 | 19278 | 19278 | NA | NA | 11.8621151 |
| TEAM_FIELDING_E | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 42.0 | 65 | 1898 | 1833 | 2.9924375 | 13.982556 | 4.7743279 |
| TEAM_FIELDING_DP | 1990 | 146.38794 | 26.22639 | 149.0 | 147.57789 | 16.0 | 52 | 228 | 176 | NA | NA | 0.5879114 |

As expected, there are missing values in this dataset that will need to be removed or imputed, using whichever method has the least impact on the distributions; we want to keep the data in its truest form while cleaning it. Skew and kurtosis show NA for several columns because they could not be computed due to the missing values. First and foremost, we need to change the column names to something more easily understood.


Data Preparation

In the context of this dataset, replacing NA values with zeros rather than performing mean or median imputation appears to be the most suitable approach. This choice is justified by the fact that each row may represent either the same team across different years or different teams within the same year. In sports data, a team either has a recorded value for a statistic or it does not, with no in-between.
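A minimal sketch of the zero-fill step, assuming every column is numeric at this stage (the Appendix uses zoo::na.fill; the base-R equivalent is shown here):

# replace every NA with 0 instead of imputing a mean or median
train_df[is.na(train_df)] <- 0
eval_df[is.na(eval_df)] <- 0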

Moreover, the column names have been updated so that readers with little to no knowledge of baseball can understand them without the table explaining what the acronyms mean. We also added a “P” or “N” at the end of each column name to represent the theoretical positive or negative impact on winning a game, based on the table given for this assignment.

Because many features are available in the dataset, we won’t be performing any transformations; we want to use the data in its truest form and see whether we get sufficient results doing so.

##  [1] "Wins"                 "Hits_P"               "Double_Hits_P"       
##  [4] "Triple_Hits_P"        "Homerun_P"            "Walks_P"             
##  [7] "Batter_Hit_P"         "Strike_Out_N"         "Stolen_Base_P"       
## [10] "Caught_Stealing_N"    "Errors_N"             "Double_Play_P"       
## [13] "Walks_Allowed_N"      "Hits_Allowed_N"       "Homeruns_Allowed_N"  
## [16] "Strike_Out_Pitcher_P"

The following chart shows the correlation between each pair of columns in the data. We can observe that “Homeruns_Allowed_N” and “Strike_Out_Pitcher_P” have the strongest negative correlation in the dataset. This will help us determine which features are best for building the models later.

Here, we fit a linear model with caret to estimate variable importance, which rates each feature by the impact it has on the model. It shows that Hits_P is the most important feature in this data, followed by Homeruns_Allowed_N and Strike_Out_Pitcher_P, with Double_Play_P the least important. This is another tool that can help in the feature selection process.
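A minimal sketch of these two feature-selection aids, mirroring the Appendix code (corrplot and caret are the packages assumed):

library(corrplot)
library(caret)

# pairwise correlations among the renamed columns
correlation_Matrix <- cor(train_df[, 2:16])
corrplot(correlation_Matrix, method = "color", type = "full", order = "hclust")

# rank features by importance from a cross-validated linear model
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(Wins ~ ., data = train_df, method = "lm",
               preProcess = "scale", trControl = control)
plot(varImp(model, scale = FALSE))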


Model Building

After exploring and cleaning the data, we should have a good idea of what features we want to use in building the lm models.

Everything Model

## 
## Call:
## lm(formula = Wins ~ Hits_P + Double_Hits_P + Triple_Hits_P + 
##     Walks_P + Strike_Out_N + Stolen_Base_P + Caught_Stealing_N + 
##     Errors_N + Double_Play_P + Walks_Allowed_N + Homeruns_Allowed_N + 
##     Strike_Out_Pitcher_P, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.110  -8.706   0.236   8.617  51.107 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          20.5798159  3.8468397   5.350 9.69e-08 ***
## Hits_P                0.0470546  0.0032408  14.519  < 2e-16 ***
## Double_Hits_P        -0.0100974  0.0094659  -1.067 0.286213    
## Triple_Hits_P         0.0621507  0.0166086   3.742 0.000187 ***
## Walks_P               0.0023481  0.0045229   0.519 0.603705    
## Strike_Out_N          0.0028463  0.0041928   0.679 0.497301    
## Stolen_Base_P        -0.0227720  0.0105501  -2.158 0.030996 *  
## Caught_Stealing_N    -0.0576685  0.0190889  -3.021 0.002547 ** 
## Errors_N             -0.0004893  0.0003770  -1.298 0.194449    
## Double_Play_P         0.0349073  0.0070858   4.926 8.98e-07 ***
## Walks_Allowed_N       0.0041496  0.0028262   1.468 0.142177    
## Homeruns_Allowed_N   -0.0304417  0.0029858 -10.195  < 2e-16 ***
## Strike_Out_Pitcher_P -0.0622201  0.0101196  -6.148 9.22e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.31 on 2263 degrees of freedom
## Multiple R-squared:  0.2897, Adjusted R-squared:  0.286 
## F-statistic: 76.93 on 12 and 2263 DF,  p-value: < 2.2e-16

This model includes nearly all of the available features to create the baseline against which we will try to improve the fit (R-squared). The “Everything Model” has an R-squared of 0.29.

Positive and Significant Model

## 
## Call:
## lm(formula = Wins ~ Hits_P + Triple_Hits_P + Homerun_P + Batter_Hit_P + 
##     Strike_Out_Pitcher_P, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.846  -8.748   0.620   9.416  54.809 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          23.719719   3.872096   6.126 1.06e-09 ***
## Hits_P                0.029675   0.002583  11.491  < 2e-16 ***
## Triple_Hits_P         0.121788   0.017009   7.160 1.08e-12 ***
## Homerun_P             0.098998   0.009110  10.866  < 2e-16 ***
## Batter_Hit_P         -0.006119   0.001704  -3.591 0.000336 ***
## Strike_Out_Pitcher_P  0.009217   0.006866   1.343 0.179566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.04 on 2270 degrees of freedom
## Multiple R-squared:  0.2077, Adjusted R-squared:  0.2059 
## F-statistic:   119 on 5 and 2270 DF,  p-value: < 2.2e-16

This model is based on the “Everything Model”, picking the coefficients that were the most significant and restricting the features to those with a positive theoretical effect (“_P”), in hopes of creating a better model from this data. We did not accomplish that: it returned an R-squared of 0.20, a decrease from our initial model.

Top 4 Importance Based Model

## 
## Call:
## lm(formula = Wins ~ Hits_P + Homeruns_Allowed_N + Strike_Out_Pitcher_P, 
##     data = train_df, subset = Batter_Hit_P)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.317  -8.036  -0.022   7.552  46.800 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          42.099982   3.343992   12.59   <2e-16 ***
## Hits_P                0.039064   0.002196   17.79   <2e-16 ***
## Homeruns_Allowed_N   -0.033212   0.001733  -19.16   <2e-16 ***
## Strike_Out_Pitcher_P -0.083097   0.007290  -11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.18 on 2150 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.2077 
## F-statistic: 189.2 on 3 and 2150 DF,  p-value: < 2.2e-16

Now we build a third model based on the correlation plot and the importance chart, hoping it will be the best of the three. It uses Hits_P, Homeruns_Allowed_N, and Strike_Out_Pitcher_P as predictors, with Batter_Hit_P supplied through lm()’s subset argument, which restricts the rows used rather than adding a fourth predictor.

The resulting R-squared of roughly 0.21 was not the result we intended; it is similar to the second model.


Model Selection

Based on the R-squared of each model, the “Everything Model” has the best fit on the data, while the other two models had an R-squared of approximately 0.21. The initial thought was that the second and third models would outperform the first, but that wasn’t the case; it seems that dropping features resulted in a lower-than-desired R-squared. Even the model built from the ranked importances performed poorly. Moreover, the Everything Model’s R-squared of roughly 0.29 is still low: it explains only about 29% of the variance in the target variable. We had hoped that selecting the most important features would raise it to at least 0.50.

Now, using the first model, we predict the number of wins in the evaluation dataset. The distribution of the predicted wins is shown below, with a median of roughly 81 wins.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.90   76.33   80.89   80.58   85.85  113.00
##        1        2        3        4        5        6        7        8 
## 68.08311 69.09019 76.87568 82.97867 63.77830 66.45927 79.71547 73.97235 
##        9       10 
## 71.58869 76.80667
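A minimal sketch of this prediction step, assuming model_1 and the zero-filled eval_df from the Appendix:

# predict wins on the evaluation data with the "Everything Model"
predicted_wins <- predict(model_1, newdata = eval_df)

summary(predicted_wins)   # distribution of the predicted wins
head(predicted_wins, 10)  # first ten predictions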

Resources

Feature Selection with Caret Package LINK

Faraway, J. J. (2005). Linear models with R. Chapman & Hall/CRC.

Sheather, S. J. (2009). A modern approach to regression with R. Springer.


Appendix

library(tidyverse)
library(zoo)
library(kableExtra)
library(knitr)
library(broom)
library(DT)
library(faraway)
library(caret)
library(corrplot)
library(mlbench)
library(randomForest)

setwd("C:/Users/Nick Climaco/Documents/DataScience/DATA_621_F23/Homework-1")
set.seed(12041998)
         
# load the evaluation and training data
eval_df <- read_csv("moneyball-evaluation-data.csv")
train_df <- read_csv("moneyball-training-data.csv")
         
dim(train_df)

# plot a histogram of every column except the index, in a 4 x 4 grid
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)[-1]) {
    hist(train_df[[col_name]], main = col_name, xlab = "Value")
}

par(mfrow = c(1, 1))

# summary statistics table for every column except the index
kable(tidy(train_df[-1]), "pipe")

# rename columns to descriptive names; the P/N suffix marks the theoretical
# positive/negative impact on wins
colnames(train_df) <- c("Index", "Wins", "Hits_P","Double_Hits_P","Triple_Hits_P", "Homerun_P","Walks_P", "Batter_Hit_P", "Strike_Out_N",
                        "Stolen_Base_P", "Caught_Stealing_N", "Errors_N","Double_Play_P", "Walks_Allowed_N", "Hits_Allowed_N", "Homeruns_Allowed_N",
                        "Strike_Out_Pitcher_P")

colnames(eval_df) <- c("Index", "Wins", "Hits_P","Double_Hits_P","Triple_Hits_P", "Homerun_P","Walks_P", "Batter_Hit_P", "Strike_Out_N",
                       "Stolen_Base_P", "Caught_Stealing_N", "Errors_N","Double_Play_P", "Walks_Allowed_N", "Hits_Allowed_N",
                       "Strike_Out_Pitcher_P")

# replace every remaining NA with 0 and drop the index column
train_df <- as.data.frame(na.fill(train_df, fill = 0))
train_df <- train_df[-1]

eval_df <- as.data.frame(na.fill(eval_df, fill = 0))
         
colnames(train_df)

# correlation matrix of the predictor columns
correlation_Matrix <- cor(train_df[, 2:16])

corrplot(correlation_Matrix, method = "color", type = "full", order = "hclust")

# estimate variable importance from a cross-validated linear model fit with caret
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

model <- train(Wins ~ ., data = train_df, method = "lm", preProcess = "scale", trControl = control)

importance <- varImp(model, scale = FALSE)

plot(importance)

# Model 1: the "Everything Model" with nearly all available features
model_1 <- lm(Wins ~ Hits_P + Double_Hits_P + Triple_Hits_P + Walks_P + Strike_Out_N +
                  Stolen_Base_P + Caught_Stealing_N + Errors_N + Double_Play_P + Walks_Allowed_N +  
                  Homeruns_Allowed_N + Strike_Out_Pitcher_P, 
              data = train_df)

summary(model_1)

plot(model_1, 1:2)

# Model 2: significant features with a positive theoretical impact ("_P")
model_2 <- lm(Wins ~ Hits_P + Triple_Hits_P + Homerun_P + Batter_Hit_P + Strike_Out_Pitcher_P, data = train_df)

summary(model_2)

plot(model_2, 1:2)

# Model 3: top importance-based features; Batter_Hit_P is passed to lm()'s
# subset argument, so it restricts the rows used rather than acting as a predictor
model_3 <- lm(Wins ~ Hits_P + Homeruns_Allowed_N + Strike_Out_Pitcher_P,
              data = train_df, subset = Batter_Hit_P)


summary(model_3)

plot(model_3, 1:2)