HOMEWORK #1

Overview:

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).

Deliverables:

  • A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.
  • Assigned predictions (the number of wins for the team) for the evaluation data set.
  • Include your R statistical programming code in an Appendix.

1. DATA EXPLORATION

Data acquisition

First, we need to explore our given data set. I have published the original data sets in my GitHub account.

Read Data

Here, we read the data set and shorten the feature names for better readability in visualizations.
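The renaming step could look roughly like the sketch below. The long column names (`TARGET_WINS`, `TEAM_BATTING_H`, and so on) and the `prefix_map` lookup are assumptions made for illustration, since the original names are not shown in this write-up; the toy data frame stands in for the file read from GitHub.

```r
# Toy stand-in for the raw training data. The long column names are assumed,
# not taken from the write-up.
raw <- data.frame(
  TARGET_WINS      = c(70, 85),
  TEAM_BATTING_H   = c(1400, 1500),
  TEAM_BASERUN_SB  = c(90, 120),
  TEAM_PITCHING_SO = c(800, 950),
  TEAM_FIELDING_E  = c(150, 130)
)

# Map each (assumed) long prefix to the short prefix used in this write-up.
prefix_map <- c(
  TEAM_BATTING_  = "bt_",
  TEAM_BASERUN_  = "br_",
  TEAM_PITCHING_ = "ph_",
  TEAM_FIELDING_ = "fd_",
  TARGET_        = ""
)

# Replace each long prefix at the start of a column name with its short form.
short_names <- names(raw)
for (long in names(prefix_map)) {
  short_names <- sub(paste0("^", long), prefix_map[[long]], short_names)
}
names(raw) <- short_names

print(names(raw))
```

Keeping the mapping in one named vector makes it easy to see which category each short prefix belongs to.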

Summary

First, we take a look at a summary of the data. A few things of interest are revealed:

  • bt_SO, br_SB, br_CS, bt_HBP, ph_SO, and fd_DP have missing values
  • The max values of ph_H, ph_BB, ph_SO, and fd_E seem abnormally high
##       WINS             bt_H          bt_2B           bt_3B       
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##      bt_HR            bt_BB           bt_SO            br_SB      
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##      br_CS           bt_HBP           ph_H           ph_HR      
##  Min.   :  0.0   Min.   :29.00   Min.   : 1137   Min.   :  0.0  
##  1st Qu.: 38.0   1st Qu.:50.50   1st Qu.: 1419   1st Qu.: 50.0  
##  Median : 49.0   Median :58.00   Median : 1518   Median :107.0  
##  Mean   : 52.8   Mean   :59.36   Mean   : 1779   Mean   :105.7  
##  3rd Qu.: 62.0   3rd Qu.:67.00   3rd Qu.: 1682   3rd Qu.:150.0  
##  Max.   :201.0   Max.   :95.00   Max.   :30132   Max.   :343.0  
##  NA's   :772     NA's   :2085                                   
##      ph_BB            ph_SO              fd_E            fd_DP      
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0  
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0  
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0  
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4  
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0  
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0  
##                   NA's   :102                        NA's   :286
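Both observations can also be checked numerically rather than by eye. The sketch below uses a small made-up data frame (not the real data) and an illustrative `max_to_q3` ratio, which is an assumption of this example and not a measure used elsewhere in the write-up.

```r
# Toy data: one column with an extreme maximum, one with missing values.
toy <- data.frame(
  ph_H  = c(1400, 1450, 1500, 1550, 1600, 30000),
  bt_SO = c(500, NA, 700, 800, NA, 900),
  fd_DP = c(130, 140, 150, 160, 170, 180)
)

# Number of missing values per column.
na_count <- colSums(is.na(toy))

# Ratio of the maximum to the third quartile; a large ratio hints at
# an abnormally high maximum.
max_to_q3 <- sapply(toy, function(x) {
  max(x, na.rm = TRUE) / quantile(x, 0.75, na.rm = TRUE, names = FALSE)
})

overview <- data.frame(na_count = na_count, max_to_q3 = round(max_to_q3, 2))
print(overview)
```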

Dimensions

Let’s see the dimensions of our moneyball training data set.

## [1] 2276   16

The training data has 16 columns and 2,276 rows.

The explanatory columns are broken down into four categories:

  • Batting
  • Base run
  • Pitching
  • Fielding

Below you will see a preview of the columns and the first few observations broken down into these four categories.


QQ Plots

  • Most features deviate from the theoretical normal line in their QQ plots; however, this will be addressed by the models we build.
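One way to quantify how far a feature strays from the normal line is the correlation between theoretical and sample quantiles. The sketch below uses simulated data, and the helper `qq_cor` is a hypothetical name introduced here, not part of the original analysis.

```r
set.seed(1)
symmetric <- rnorm(500, mean = 1450, sd = 100)  # roughly normal feature
skewed    <- rexp(500, rate = 1 / 250)          # right-skewed feature

# Correlation between theoretical and sample quantiles: values close to 1
# mean the points lie close to a straight line in the QQ plot.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

print(c(symmetric = qq_cor(symmetric), skewed = qq_cor(skewed)))

# To draw the plot itself: qqnorm(skewed); qqline(skewed)
```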

Correlation Plot

  • There is a strong positive correlation between ph_H and bt_H
  • There is a strong positive correlation between ph_HR and bt_HR
  • There is a strong positive correlation between ph_BB and bt_BB
  • There is a strong positive correlation between ph_SO and bt_SO
  • bt_HBP and br_SB each seem to be only weakly correlated with WINS
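Strongly correlated pairs like these can also be listed programmatically. The following is a minimal sketch on simulated data; the 0.8 cutoff is an assumption chosen for illustration.

```r
set.seed(2)
n <- 200
bt_HR <- rnorm(n, 100, 40)
toy <- data.frame(
  WINS  = 81 + 0.1 * (bt_HR - 100) + rnorm(n, sd = 12),
  bt_HR = bt_HR,
  ph_HR = bt_HR + rnorm(n, sd = 5),  # nearly a copy of bt_HR
  br_SB = rnorm(n, 120, 50)
)
toy$br_SB[sample(n, 20)] <- NA       # some missing values

# Pairwise-complete correlations, so rows with NA are dropped per pair only.
cors <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and report those with |r| > 0.8.
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = round(cors[high], 2)
)
print(strong_pairs)
```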

Outliers

Extreme Values

While exploring the data, we noticed that the max values of ph_H, ph_BB, ph_SO, and fd_E seem abnormally high.

We see that the record for most hits in a season by team (ph_H) was set at 1,724 in 1921. However, we also know that the data points were normalized for 162 games in a season. To take a moderate approach, we will remove some of the most egregious outliers seen in these variables.
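The exact cutoff used for removal is not stated here, so the sketch below is only one possible rule: it drops rows whose value lies above the third quartile plus three interquartile ranges. The multiplier `k = 3`, the helper `upper_fence`, and the toy data are all assumptions.

```r
toy <- data.frame(
  WINS = c(70, 80, 85, 90, 60, 75),
  ph_H = c(1400, 1450, 1500, 1550, 1600, 30000),
  fd_E = c(120, 130, 140, 150, 1800, 160)
)

# Upper fence: Q3 + k * IQR. A large k removes only the most extreme values.
upper_fence <- function(x, k = 3) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + k * (q[2] - q[1])
}

check_cols <- c("ph_H", "fd_E")
is_extreme <- rep(FALSE, nrow(toy))
for (col in check_cols) {
  # Missing values are never flagged as extreme.
  above <- !is.na(toy[[col]]) & toy[[col]] > upper_fence(toy[[col]])
  is_extreme <- is_extreme | above
}

trimmed <- toy[!is_extreme, ]
print(trimmed)
```

A fence based on quartiles is not pulled upward by the extreme values themselves, unlike a rule based on the mean and standard deviation.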

Fill Missing Values

The following features have missing values.

  • bt_SO - Strikeouts by batters
  • br_SB - Stolen bases
  • br_CS - Caught stealing
  • bt_HBP - Batters hit by pitch (get a free base)
  • ph_SO - Strikeouts by pitchers
  • fd_DP - Double Plays

Since most values in bt_HBP are missing (2,085 of 2,276, about 92%), we will drop this feature.
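Dropping a mostly empty feature can be expressed as a rule on the fraction of missing values. This is a sketch on made-up data; the 50% `threshold` is an assumed cutoff, not one stated in the write-up.

```r
toy <- data.frame(
  WINS   = c(70, 80, 85, 90, 60, 75, 88, 92, 66, 79),
  bt_SO  = c(500, NA, 700, 800, 650, 900, 720, 810, 640, 700),
  bt_HBP = c(NA, NA, NA, 55, NA, NA, NA, NA, NA, NA)
)

# Fraction of missing values per column.
missing_frac <- colMeans(is.na(toy))
print(missing_frac)

# Keep only columns whose missing fraction is at or below the assumed cutoff.
threshold <- 0.5
keep <- names(missing_frac)[missing_frac <= threshold]
reduced <- toy[, keep, drop = FALSE]

print(names(reduced))
```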

Multivariate Imputation by Chained Equations (mice)

We will use Multivariate Imputation by Chained Equations (mice) to fill the missing values.
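A minimal sketch of this step with the `mice` package follows. The settings (`m = 5`, predictive mean matching via `method = "pmm"`, the seed) and the simulated data are assumptions; the write-up does not say which options were used. The code is skipped if the package is not installed.

```r
set.seed(3)
n <- 200
bt_H <- rnorm(n, 1450, 100)
toy <- data.frame(
  bt_H  = bt_H,
  bt_SO = 2000 - 0.8 * bt_H + rnorm(n, sd = 60),
  br_SB = rnorm(n, 120, 40)
)
toy$bt_SO[sample(n, 20)] <- NA
toy$br_SB[sample(n, 30)] <- NA

if (requireNamespace("mice", quietly = TRUE)) {
  # Build m imputed data sets; each incomplete column is modelled from the others.
  imp <- mice::mice(toy, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

  # Extract the first completed data set.
  filled <- mice::complete(imp, 1)
  print(colSums(is.na(filled)))
} else {
  message("Package 'mice' is not installed; skipping the imputation sketch.")
}
```

Predictive mean matching fills each gap with a value observed elsewhere in the same column, so imputed counts stay within the observed range.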

Address Correlated Features

While exploring the data, we noticed several features had strong positive linear relationships.

Let’s run a Variance Inflation Factor test to detect multicollinearity. Features with a VIF score > 10 will be reviewed.

##      bt_H     bt_2B     bt_3B     bt_HR     bt_BB     bt_SO     br_SB 
##  3.820596  2.467157  2.989892 36.501400  6.787771  5.279911  3.862460 
##     br_CS      ph_H     ph_HR     ph_BB     ph_SO      fd_E     fd_DP 
##  3.793169  4.073762 29.596294  6.468847  3.369127  4.988328  1.902235
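For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other predictors. The base R sketch below computes it directly from that definition on simulated data; `vif_manual` is a hypothetical helper and not necessarily the function that produced the output above.

```r
set.seed(4)
n <- 300
bt_HR <- rnorm(n, 100, 40)
toy <- data.frame(
  bt_HR = bt_HR,
  ph_HR = bt_HR + rnorm(n, sd = 8),  # almost collinear with bt_HR
  bt_2B = rnorm(n, 240, 45)          # unrelated to the other two
)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on all remaining predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```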

Let’s make another correlation plot with only these features.

  • bt_SO (strikeouts by batters) and bt_H (base hits by batters) have a strong positive correlation
  • bt_H (base hits by batters) and bt_BB (walks by batters) have a strong positive correlation
  • ph_BB (walks allowed) and bt_BB (walks by batters) have a strong negative correlation
  • ph_SO (strikeouts by pitchers) and bt_SO (strikeouts by batters) have a moderate negative correlation
  • ph_HR (homeruns allowed) and bt_HR (homeruns by batters) have a strong negative correlation
  • ph_SO (strikeouts by pitchers) and ph_BB (walks allowed) have a moderate negative correlation

To fix this, we can remove some correlated features and combine others.

  • Remove bt_HR. It has an extremely strong correlation with ph_HR.
  • Remove bt_SO. It has an extremely strong correlation with ph_SO.
  • Replace bt_H (total base hits by batters) with bt_1B = bt_H - bt_2B - bt_3B - bt_HR (singles)
  • Replace ph_BB and bt_BB with BB, the ratio of walks by batters to walks allowed
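The adjustments listed above can be sketched as follows on a toy data frame. Setting the ratio to `NA` when walks allowed is zero is an assumption of this example; the write-up does not say how that case was handled.

```r
toy <- data.frame(
  bt_H  = c(1400, 1500, 1450),
  bt_2B = c(240, 260, 250),
  bt_3B = c(40, 50, 45),
  bt_HR = c(100, 150, 120),
  bt_SO = c(700, 900, 800),
  bt_BB = c(500, 550, 0),
  ph_BB = c(520, 500, 0)
)

# Singles: total hits minus doubles, triples, and home runs.
toy$bt_1B <- toy$bt_H - toy$bt_2B - toy$bt_3B - toy$bt_HR

# Ratio of walks by batters to walks allowed. A zero denominator becomes NA
# (an assumption here) instead of Inf or NaN.
toy$BB <- ifelse(toy$ph_BB == 0, NA, toy$bt_BB / toy$ph_BB)

# Drop the removed and replaced columns.
dropped <- c("bt_H", "bt_HR", "bt_SO", "bt_BB", "ph_BB")
engineered <- toy[, setdiff(names(toy), dropped)]

print(engineered)
```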

These adjustments result in less multicollinearity.

##    bt_2B    bt_3B    br_SB    br_CS     ph_H    ph_HR    ph_SO     fd_E 
## 1.553145 2.338689 3.650821 3.686438 3.628940 2.311793 1.832450 6.805560 
##    fd_DP    bt_1B       BB 
## 1.865776 2.664315 5.725045

Linear Model 1.

We will begin with all independent variables and use backward elimination to remove the non-significant ones.

We will start by eliminating the variables with the highest p-values (lowest significance) from the model.

Let’s take a look at the resulting model:
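The elimination procedure can be automated along the lines of the sketch below: refit, drop the predictor with the largest p-value, and stop once every remaining p-value is at or below a cutoff. The helper `backward_eliminate`, the cutoff `alpha = 0.05`, and the simulated data are assumptions; the sketch also assumes numeric, non-collinear predictors.

```r
set.seed(42)
n <- 300
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n)
)
toy$WINS <- 80 + 5 * toy$x1 - 3 * toy$x2 + rnorm(n, sd = 2)

backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of all terms except the intercept, named by predictor.
    p_values <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    worst <- which.max(p_values)
    # Stop when everything is significant, or when only one predictor is
    # left (that last predictor is kept even if it is not significant).
    if (p_values[worst] <= alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, names(p_values)[worst])
  }
  fit
}

final <- backward_eliminate(toy, "WINS")
kept <- names(coef(final))[-1]
print(kept)
```

Base R's `step()` offers a similar search, but it selects on AIC and not on p-values, which is why the loop is written out here.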

Linear Model 2.

This linear model will be built using the variables we believe would have the highest correlation with WINS.

The following variables will be used:

  • Base hits by batters (1B, 2B, 3B, HR)
  • Walks by batters
  • Stolen bases
  • Strikeouts by batters

Let’s remove the two variables with low significance:

Model  N_Vars  Sigma   R_Sq   Adj_R_Sq  F_Stat   F_P_Val  MSE      RMSE
lm1    16      12.540  0.363  0.359     85.935   0        156.134  12.495
lm2    5       13.714  0.243  0.242     182.565  0        187.659  13.699
lm3    7       13.785  0.236  0.234     116.894  0        189.444  13.764
lm4    7       14.774  0.123  0.120     52.855   0        217.602  14.751
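The columns of this comparison can be pulled from any fitted `lm` object. The sketch below shows one way to do so on simulated data; `model_metrics` is a hypothetical helper, and counting `N_Vars` as the number of non-intercept coefficients is an assumption about how that column was defined.

```r
set.seed(6)
n <- 200
toy <- data.frame(bt_1B = rnorm(n, 1000, 80), br_SB = rnorm(n, 120, 40))
toy$WINS <- 20 + 0.05 * toy$bt_1B + 0.03 * toy$br_SB + rnorm(n, sd = 10)

model_metrics <- function(fit, label) {
  s <- summary(fit)
  f <- s$fstatistic          # named vector: value, numdf, dendf
  res <- residuals(fit)
  data.frame(
    Model    = label,
    N_Vars   = length(coef(fit)) - 1,  # assumed: non-intercept coefficients
    Sigma    = s$sigma,                # residual standard error
    R_Sq     = s$r.squared,
    Adj_R_Sq = s$adj.r.squared,
    F_Stat   = unname(f["value"]),
    F_P_Val  = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    MSE      = mean(res^2),            # mean squared residual on training data
    RMSE     = sqrt(mean(res^2))
  )
}

fit_a <- lm(WINS ~ bt_1B + br_SB, data = toy)
fit_b <- lm(WINS ~ bt_1B, data = toy)

comparison <- rbind(model_metrics(fit_a, "fit_a"), model_metrics(fit_b, "fit_b"))
print(comparison)
```

Sigma divides the residual sum of squares by the residual degrees of freedom, while the MSE here divides by the number of rows, so the RMSE always comes out slightly smaller than Sigma, as it does in the table above.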