In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and captures the team's performance for the given year, with all statistics adjusted to a 162-game season.
The objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
## [1] 2276 17
The training data has 2276 observations and 17 attributes, two of which are the index column and the number-of-wins column, the latter being our target variable. Exploring this training dataset can help us gain insight into how to approach the problem at hand. The following visualization shows the histogram of each predictor for this assignment. The current column names are difficult to read since we don't know what they mean, but for now we have an idea of what each predictor variable's distribution looks like. Predictor variables such as "BATTING_HBP", "PITCHING_H", and "PITCHING_SO" seem to have a good number of outliers, which we can handle in the next section of this assignment, where we clean and wrangle the data to "tidy" this dataset.
Looking at these distributions only gives us a rough idea of the data. The following table provides summary statistics that give a better understanding of the data. We can treat these two views as complementary to each other.
| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 10.0 | 0 | 146 | 146 | -0.3989861 | 4.031017 | 0.3301823 |
| TEAM_BATTING_H | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 77.0 | 891 | 2554 | 1663 | 1.5723696 | 10.287564 | 3.0307891 |
| TEAM_BATTING_2B | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 32.0 | 69 | 458 | 389 | 0.2152436 | 3.008804 | 0.9810087 |
| TEAM_BATTING_3B | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 16.0 | 0 | 223 | 223 | 1.1101968 | 4.507201 | 0.5856226 |
| TEAM_BATTING_HR | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 53.0 | 0 | 264 | 264 | 0.1861648 | 2.038672 | 1.2691285 |
| TEAM_BATTING_BB | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 64.0 | 0 | 878 | 878 | -1.0264363 | 5.187412 | 2.5713150 |
| TEAM_BATTING_SO | 2174 | 735.60534 | 248.52642 | 750.0 | 742.31322 | 192.0 | 0 | 1399 | 1399 | NA | NA | 5.3301912 |
| TEAM_BASERUN_SB | 2145 | 124.76177 | 87.79117 | 101.0 | 110.81188 | 41.0 | 0 | 697 | 697 | NA | NA | 1.8955584 |
| TEAM_BASERUN_CS | 1504 | 52.80386 | 22.95634 | 49.0 | 50.35963 | 12.0 | 0 | 201 | 201 | NA | NA | 0.5919414 |
| TEAM_BATTING_HBP | 191 | 59.35602 | 12.96712 | 58.0 | 58.86275 | 8.0 | 29 | 95 | 66 | NA | NA | 0.9382681 |
| TEAM_PITCHING_H | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 118.0 | 1137 | 30132 | 28995 | 10.3363225 | 144.967058 | 29.4889618 |
| TEAM_PITCHING_HR | 2276 | 105.69859 | 61.29875 | 107.0 | 103.15697 | 50.0 | 0 | 343 | 343 | 0.2879774 | 2.397475 | 1.2848886 |
| TEAM_PITCHING_BB | 2276 | 553.00791 | 166.35736 | 536.5 | 542.62459 | 66.5 | 0 | 3645 | 3645 | 6.7483465 | 100.055543 | 3.4870317 |
| TEAM_PITCHING_SO | 2174 | 817.73045 | 553.08503 | 813.5 | 796.93391 | 173.5 | 0 | 19278 | 19278 | NA | NA | 11.8621151 |
| TEAM_FIELDING_E | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 42.0 | 65 | 1898 | 1833 | 2.9924375 | 13.982556 | 4.7743279 |
| TEAM_FIELDING_DP | 1990 | 146.38794 | 26.22639 | 149.0 | 147.57789 | 16.0 | 52 | 228 | 176 | NA | NA | 0.5879114 |
As expected, there are missing values in this dataset that will need to be removed or imputed, whichever method has the least impact on the distributions; we want to keep the data in its truest form while performing data cleaning. Skew and kurtosis return an NA value for several columns because they could not be computed due to missingness. First and foremost, we need to change the column names to something more easily understood.
In the context of this dataset, replacing NA values with zeros rather than performing mean or median imputation appears to be the most suitable approach. This choice is justified by the fact that each row may represent either the same team across different years or different teams within the same year. In sports data, a team either has a recorded statistic or it does not, with no in-between.
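A minimal sketch of this zero-fill step, mirroring the `zoo::na.fill()` call in the code appendix, with the median-imputation alternative shown only for contrast (the tidyverse variant is illustrative and not part of the original workflow):

```r
library(zoo)        # na.fill()
library(tidyverse)  # dplyr/tidyr helpers for the alternative below

# Chosen approach: treat a missing statistic as "did not happen" and fill with 0
train_zero <- as.data.frame(na.fill(train_df, fill = 0))

# Alternative considered but not used: median imputation per column
train_median <- train_df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE))))
```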
Moreover, the column names have been updated to be more easily understood by readers with little to no knowledge of baseball, and they no longer require the table explaining what the acronyms mean. We also added a "P" or "N" at the end of each column name to represent the theoretical positive or negative impact on winning a game, based on the table given for this assignment.
Because of the many features available in the dataset, we won't be performing any transformations; we want to use the data in its truest form and see whether we get sufficient results doing so.
## [1] "Wins" "Hits_P" "Double_Hits_P"
## [4] "Triple_Hits_P" "Homerun_P" "Walks_P"
## [7] "Batter_Hit_P" "Strike_Out_N" "Stolen_Base_P"
## [10] "Caught_Stealing_N" "Errors_N" "Double_Play_P"
## [13] "Walks_Allowed_N" "Hits_Allowed_N" "Homeruns_Allowed_N"
## [16] "Strike_Out_Pitcher_P"
The following chart shows the correlation between each pair of columns in the data. We can observe that "Homeruns_Allowed_N" and "Strike_Out_Pitcher_P" have the strongest negative correlation in the dataset. This will help us determine which features are best for building the models later.
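As a numeric check on the plot, the pairwise correlations can also be ranked to surface the most strongly negative pairs; a minimal sketch, assuming the renamed, zero-filled `train_df` from the code appendix:

```r
# Pairwise correlations among the 15 predictors (Wins excluded), most negative pairs first
correlation_Matrix <- cor(train_df[, 2:16])
cor_pairs <- as.data.frame(as.table(correlation_Matrix))
cor_pairs <- subset(cor_pairs, as.character(Var1) < as.character(Var2))  # drop self- and duplicate pairs
head(cor_pairs[order(cor_pairs$Freq), ], 5)
```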
Here, we use a linear model to estimate variable importance, which rates each feature by the impact it has on the model. It shows that Hits_P is the most important feature in this data, followed by Homeruns_Allowed_N and Strike_Out_Pitcher_P, with Double_Play_P being the least important. This is another tool that can help in the feature selection process.
After exploring and cleaning the data, we should have a good idea of which features we want to use in building the linear models.
##
## Call:
## lm(formula = Wins ~ Hits_P + Double_Hits_P + Triple_Hits_P +
## Walks_P + Strike_Out_N + Stolen_Base_P + Caught_Stealing_N +
## Errors_N + Double_Play_P + Walks_Allowed_N + Homeruns_Allowed_N +
## Strike_Out_Pitcher_P, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.110 -8.706 0.236 8.617 51.107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.5798159 3.8468397 5.350 9.69e-08 ***
## Hits_P 0.0470546 0.0032408 14.519 < 2e-16 ***
## Double_Hits_P -0.0100974 0.0094659 -1.067 0.286213
## Triple_Hits_P 0.0621507 0.0166086 3.742 0.000187 ***
## Walks_P 0.0023481 0.0045229 0.519 0.603705
## Strike_Out_N 0.0028463 0.0041928 0.679 0.497301
## Stolen_Base_P -0.0227720 0.0105501 -2.158 0.030996 *
## Caught_Stealing_N -0.0576685 0.0190889 -3.021 0.002547 **
## Errors_N -0.0004893 0.0003770 -1.298 0.194449
## Double_Play_P 0.0349073 0.0070858 4.926 8.98e-07 ***
## Walks_Allowed_N 0.0041496 0.0028262 1.468 0.142177
## Homeruns_Allowed_N -0.0304417 0.0029858 -10.195 < 2e-16 ***
## Strike_Out_Pitcher_P -0.0622201 0.0101196 -6.148 9.22e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.31 on 2263 degrees of freedom
## Multiple R-squared: 0.2897, Adjusted R-squared: 0.286
## F-statistic: 76.93 on 12 and 2263 DF, p-value: < 2.2e-16
This model includes nearly all of the available features to create the baseline needed to improve the fit, or R-squared. The "Everything Model" has an R-squared of 0.29.
##
## Call:
## lm(formula = Wins ~ Hits_P + Triple_Hits_P + Homerun_P + Batter_Hit_P +
## Strike_Out_Pitcher_P, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.846 -8.748 0.620 9.416 54.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.719719 3.872096 6.126 1.06e-09 ***
## Hits_P 0.029675 0.002583 11.491 < 2e-16 ***
## Triple_Hits_P 0.121788 0.017009 7.160 1.08e-12 ***
## Homerun_P 0.098998 0.009110 10.866 < 2e-16 ***
## Batter_Hit_P -0.006119 0.001704 -3.591 0.000336 ***
## Strike_Out_Pitcher_P 0.009217 0.006866 1.343 0.179566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.04 on 2270 degrees of freedom
## Multiple R-squared: 0.2077, Adjusted R-squared: 0.2059
## F-statistic: 119 on 5 and 2270 DF, p-value: < 2.2e-16
This model is based on the "Everything Model", keeping the coefficients that were the most significant and restricting the features to those with a theoretically positive effect, in hopes of creating a better model from this data. We did not accomplish that, since it returned an R-squared of roughly 0.21, a decrease from our initial model.
##
## Call:
## lm(formula = Wins ~ Hits_P + Homeruns_Allowed_N + Strike_Out_Pitcher_P,
## data = train_df, subset = Batter_Hit_P)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.317 -8.036 -0.022 7.552 46.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.099982 3.343992 12.59 <2e-16 ***
## Hits_P 0.039064 0.002196 17.79 <2e-16 ***
## Homeruns_Allowed_N -0.033212 0.001733 -19.16 <2e-16 ***
## Strike_Out_Pitcher_P -0.083097 0.007290 -11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.18 on 2150 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.2077
## F-statistic: 189.2 on 3 and 2150 DF, p-value: < 2.2e-16
Now we build a third model based on the correlation plot and importance chart, hoping it will be the best of all the prior models. In this model, we use the following features: Hits_P, Homeruns_Allowed_N, and Strike_Out_Pitcher_P.
The resulting R-squared was not what we intended; at roughly 0.21, it is similar to that of the second model.
Based on the R-squared of each model, the "Everything Model" has the best fit on the data, while the other two models had an R-squared of approximately 0.21. The initial thought was that the second and third models would be better than the first, but that wasn't the case: removing features from the model resulted in a lower-than-desired R-squared. Even the model with features chosen from the ranked importance performed poorly. Moreover, the Everything Model, with an R-squared of roughly 0.29, is still low, explaining only about 29% of the variance in the target variable. We had hoped that selecting the most important features would raise it to at least 0.50.
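To check these fit statistics side by side, the three models can be compared directly; a minimal sketch, assuming `model_1`, `model_2`, and `model_3` from the code appendix (the list labels are illustrative):

```r
# Compare R-squared, adjusted R-squared, and residual standard error across the three models
models <- list(everything = model_1, positive_only = model_2, correlation_based = model_3)
fit_stats <- sapply(models, function(m) {
  s <- summary(m)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared, sigma = s$sigma)
})
round(t(fit_stats), 3)
```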
Now, using the first model, we predict the number of wins for the teams in the evaluation dataset. The output below shows the distribution of the predicted wins, with a median of roughly 81 wins; a sketch of the prediction step follows the output.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.90 76.33 80.89 80.58 85.85 113.00
## 1 2 3 4 5 6 7 8
## 68.08311 69.09019 76.87568 82.97867 63.77830 66.45927 79.71547 73.97235
## 9 10
## 71.58869 76.80667
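The prediction step itself does not appear in the code appendix; a minimal sketch of how `model_1` might be applied to the evaluation set, assuming `eval_df` has been renamed and zero-filled so that it contains the same predictor columns used to fit the model:

```r
# Apply the "Everything Model" (model_1) to the evaluation data
# (assumes eval_df holds the same predictor columns as train_df)
predicted_wins <- predict(model_1, newdata = eval_df)

summary(predicted_wins)   # distribution of the predicted win totals
head(predicted_wins, 10)  # first ten predictions
```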
Feature Selection with the Caret Package. LINK
Faraway, J. J. (2005). Linear models with R. Chapman & Hall/CRC.
Sheather, S. J. (2009). A modern approach to regression with R. Springer.
library(tidyverse)
library(zoo)
library(kableExtra)
library(knitr)
library(broom)
library(DT)
library(faraway)
library(caret)
library(corrplot)
library(mlbench)
library(randomForest)
setwd("C:/Users/Nick Climaco/Documents/DataScience/DATA_621_F23/Homework-1")
set.seed(12041998)
eval_df <- read_csv("moneyball-evaluation-data.csv")
train_df <- read_csv("moneyball-training-data.csv")
dim(train_df)
# Histogram of each predictor (the index column is skipped)
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)[-1]) {
  hist(train_df[[col_name]], main = col_name, xlab = "Value")
}
par(mfrow = c(1, 1))
kable(tidy(train_df[-1]), "pipe")
colnames(train_df) <- c("Index", "Wins", "Hits_P","Double_Hits_P","Triple_Hits_P", "Homerun_P","Walks_P", "Batter_Hit_P", "Strike_Out_N",
"Stolen_Base_P", "Caught_Stealing_N", "Errors_N","Double_Play_P", "Walks_Allowed_N", "Hits_Allowed_N", "Homeruns_Allowed_N",
"Strike_Out_Pitcher_P")
colnames(eval_df) <- c("Index", "Wins", "Hits_P","Double_Hits_P","Triple_Hits_P", "Homerun_P","Walks_P", "Batter_Hit_P", "Strike_Out_N",
"Stolen_Base_P", "Caught_Stealing_N", "Errors_N","Double_Play_P", "Walks_Allowed_N", "Hits_Allowed_N",
"Strike_Out_Pitcher_P")
# Replace missing values with 0 (see discussion above) and drop the index column from the training data
train_df <- as.data.frame(na.fill(train_df, fill = 0))
train_df <- train_df[-1]
eval_df <- as.data.frame(na.fill(eval_df, fill = 0))
colnames(train_df)
correlation_Matrix <- cor(train_df[,2:16])
corrplot(correlation_Matrix,method = "color", type = "full", order = "hclust")
# Rank variable importance using a cross-validated linear model (caret)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(Wins ~ ., data = train_df, method = "lm", preProcess = "scale", trControl = control)
importance <- varImp(model, scale = FALSE)
plot(importance)
# Model 1: the "Everything Model"
model_1 <- lm(Wins ~ Hits_P + Double_Hits_P + Triple_Hits_P + Walks_P + Strike_Out_N +
                Stolen_Base_P + Caught_Stealing_N + Errors_N + Double_Play_P + Walks_Allowed_N +
                Homeruns_Allowed_N + Strike_Out_Pitcher_P,
              data = train_df)
summary(model_1)
plot(model_1, 1:2)
# Model 2: features with a theoretical positive impact ("_P" columns)
model_2 <- lm(Wins ~ Hits_P + Triple_Hits_P + Homerun_P + Batter_Hit_P + Strike_Out_Pitcher_P, data = train_df)
summary(model_2)
plot(model_2, 1:2)
# Model 3: features chosen from the correlation plot and importance chart
model_3 <- lm(Wins ~ Hits_P + Homeruns_Allowed_N + Strike_Out_Pitcher_P,
              data = train_df, subset = Batter_Hit_P)
summary(model_3)
plot(model_3, 1:2)