Introduction

This homework explore, analyze and model a data set containing approximately 2000 records. Each record represents a professional baseball team from the years 1871-2006. Each record has the performance of the team for the given year, with all the statistics adjusted to match the performance of a 162 game season.

Approach

The following approach is used to buidling a multiple linear regression model. Steps: * Data exploration - the data are examine to understand the data structure, whether the data has missing values, outliers, whether the data are skewed, etc. Findings during the data exploration process can dictate the appropriate model to fit the data.

library(dplyr)
train <- read.csv("https://raw.githubusercontent.com/indianspice/DATA621/master/Hw1/moneyball-training-data.csv")

#remove the leading text "TEAM_" on the variable names to make our plots look less cluttered
colnames(train) = gsub("TEAM_", "", colnames(train))

#Remove index column
train <- train[,-1]

Explore the data

Summary and discriptive statistics Descriptive statistics is used here to summarize the data to gather insights into the information contained in the dataset.

The descriptive statistics below shows the the mean, mode, standard deviation, minimum and maximum of each variable in the dataset.

library(pastecs)
attach(train)
stat <- cbind(TARGET_WINS,BATTING_H,BATTING_2B,BATTING_3B,BATTING_HR,BATTING_BB,BATTING_SO,BASERUN_SB,BASERUN_CS,BATTING_HBP,PITCHING_H,PITCHING_HR,PITCHING_BB,PITCHING_SO,FIELDING_E,FIELDING_DP)

options(scipen=100)
options(digits=2)


stat.desc(stat, basic = F)
##              TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR
## median             82.00  1454.000     238.00      47.00     102.00
## mean               80.79  1469.270     241.25      55.25      99.61
## SE.mean             0.33     3.031       0.98       0.59       1.27
## CI.mean.0.95        0.65     5.943       1.92       1.15       2.49
## var               248.13 20906.614    2190.37     780.56    3665.92
## std.dev            15.75   144.591      46.80      27.94      60.55
## coef.var            0.19     0.098       0.19       0.51       0.61
##              BATTING_BB BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP
## median           512.00     750.00      101.0      49.00       58.00
## mean             501.56     735.61      124.8      52.80       59.36
## SE.mean            2.57       5.33        1.9       0.59        0.94
## CI.mean.0.95       5.04      10.45        3.7       1.16        1.85
## var            15048.14   61765.38     7707.3     526.99      168.15
## std.dev          122.67     248.53       87.8      22.96       12.97
## coef.var           0.24       0.34        0.7       0.43        0.22
##              PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## median          1518.00      107.00       536.5      813.50     159.00
## mean            1779.21      105.70       553.0      817.73     246.48
## SE.mean           29.49        1.28         3.5       11.86       4.77
## CI.mean.0.95      57.83        2.52         6.8       23.26       9.36
## var          1979207.03     3757.54     27674.8   305903.05   51879.62
## std.dev         1406.84       61.30       166.4      553.09     227.77
## coef.var           0.79        0.58         0.3        0.68       0.92
##              FIELDING_DP
## median            149.00
## mean              146.39
## SE.mean             0.59
## CI.mean.0.95        1.15
## var               687.82
## std.dev            26.23
## coef.var            0.18
stat.desc(stat, desc = F)
##          TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## nbr.val         2276      2276       2276       2276       2276       2276
## nbr.null           1         0          0          2         15          1
## nbr.na             0         0          0          0          0          0
## min                0       891         69          0          0          0
## max              146      2554        458        223        264        878
## range            146      1663        389        223        264        878
## sum           183880   3344058     549078     125749     226717    1141548
##          BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H
## nbr.val        2174       2145       1504         191       2276
## nbr.null         20          2          1           0          0
## nbr.na          102        131        772        2085          0
## min               0          0          0          29       1137
## max            1399        697        201          95      30132
## range          1399        697        201          66      28995
## sum         1599206     267614      79417       11337    4049483
##          PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## nbr.val         2276        2276        2174       2276        1990
## nbr.null          15           1          20          0           0
## nbr.na             0           0         102          0         286
## min                0           0           0         65          52
## max              343        3645       19278       1898         228
## range            343        3645       19278       1833         176
## sum           240570     1258646     1777746     560990      291312

Skewness and outliers

Examining skewness and outliers in the data is important prior to choosing the model. This is important because some models will require transformation of the data.

As seen below in the density matrix and boxplots, several variables are skewed. Four of the sixteen variables are normally or close to mormally distributed.

library(reshape)
library(ggplot2)

par(mfrow = c(3, 3))

datasub = melt(train)
ggplot(datasub, aes(x= value)) + 
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free') 

Missing values

The MICE library in r was used to provide analyze the missing values of the dataset. The analysis shows that 191 observations are complete, 1295 miss only TEAM_BATTING_HPB, 349 miss only TEAM_BATTING_HPB. The TEAM_BATTING_HPB has the most missing values accoss the values.

The VIM library was used to visualize missing values. The visualization shows that less than one percent of the data does not have any missing values.

library(mice)
md.pattern(train)
##      TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
##  191           1         1          1          1          1          1
## 1295           1         1          1          1          1          1
##  349           1         1          1          1          1          1
##   18           1         1          1          1          1          1
##   53           1         1          1          1          1          1
##  190           1         1          1          1          1          1
##  102           1         1          1          1          1          1
##   78           1         1          1          1          1          1
##                0         0          0          0          0          0
##      PITCHING_H PITCHING_HR PITCHING_BB FIELDING_E BATTING_SO PITCHING_SO
##  191          1           1           1          1          1           1
## 1295          1           1           1          1          1           1
##  349          1           1           1          1          1           1
##   18          1           1           1          1          1           1
##   53          1           1           1          1          1           1
##  190          1           1           1          1          1           1
##  102          1           1           1          1          0           0
##   78          1           1           1          1          1           1
##               0           0           0          0        102         102
##      BASERUN_SB FIELDING_DP BASERUN_CS BATTING_HBP     
##  191          1           1          1           1    0
## 1295          1           1          1           0    1
##  349          1           1          0           0    2
##   18          1           0          1           0    2
##   53          0           1          0           0    3
##  190          1           0          0           0    3
##  102          1           1          0           0    4
##   78          0           0          0           0    4
##             131         286        772        2085 3478
#Visualize missing values
library(VIM)
aggr_plot <- aggr(train, 
                  col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(train), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##     Variable Count
##  BATTING_HBP 0.916
##   BASERUN_CS 0.339
##  FIELDING_DP 0.126
##   BASERUN_SB 0.058
##   BATTING_SO 0.045
##  PITCHING_SO 0.045
##  TARGET_WINS 0.000
##    BATTING_H 0.000
##   BATTING_2B 0.000
##   BATTING_3B 0.000
##   BATTING_HR 0.000
##   BATTING_BB 0.000
##   PITCHING_H 0.000
##  PITCHING_HR 0.000
##  PITCHING_BB 0.000
##   FIELDING_E 0.000

Handling missing values

Data were filtered to remove missing records and 0 sent to NA

train <- train %>% 
  filter(TARGET_WINS >=22 & TARGET_WINS <= 124 & BATTING_H <= 1876 & BATTING_2B >= 116 & BATTING_2B <= 376 & BATTING_BB >= 292 &
           BATTING_BB <= 879 & BATTING_SO >= 326 & BATTING_SO <= 1535 & PITCHING_HR <= 258 & PITCHING_SO <= 1450)

train[train == 0] <- NA

Examine new dataset

par(mfrow = c(3, 3))

datasub = melt(train)
ggplot(datasub, aes(x= value)) + 
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free') 

Correlation amoung predictors

The visualzations below shows there are positive or negative correlations among values. There are a small number of values that are not correlated.

library(psych)
pairs.panels(train[1:8])  # select columns 1-8

pairs.panels(train[9:16])

#Data Preparation This step is extremently important to modelling data, it can make or break the quality of the model

Data transformation

Centering and scaling was used to transform individual predictors in the dataset using the caret library. The density diagrams of the transformed data shows that some variables were transformed from skewedness to normality or close to normality.`

library(caret)
trans = preProcess(train, 
                   c("BoxCox", "center", "scale"))
predictorsTrans = data.frame(
      trans = predict(trans, train))
#Density plot of tranformed data
dataTrans = melt(predictorsTrans)
ggplot(dataTrans, aes(x= value)) + 
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free') 

Build Model

The models are built with one dependent variable and measure the associations amoung all the predictor variables.

#All variables
summary(mod1 <- lm(trans.BATTING_H ~ ., data = predictorsTrans))
## 
## Call:
## lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03991 -0.01983 -0.00947  0.01019  0.10372 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)        0.224777   0.034764    6.47       0.000000000981 ***
## trans.TARGET_WINS -0.002832   0.003713   -0.76                0.447    
## trans.BATTING_2B   0.002624   0.005253    0.50                0.618    
## trans.BATTING_3B  -0.003003   0.004049   -0.74                0.459    
## trans.BATTING_HR   4.059118   0.588259    6.90       0.000000000093 ***
## trans.BATTING_BB  -0.003426   0.026823   -0.13                0.899    
## trans.BATTING_SO   0.437314   0.220932    1.98                0.049 *  
## trans.BASERUN_SB   0.006665   0.005640    1.18                0.239    
## trans.BASERUN_CS   0.001365   0.003971    0.34                0.731    
## trans.BATTING_HBP  0.000234   0.002331    0.10                0.920    
## trans.PITCHING_H   1.326131   0.007081  187.29 < 0.0000000000000002 ***
## trans.PITCHING_HR -4.016870   0.582888   -6.89       0.000000000097 ***
## trans.PITCHING_BB  0.003952   0.027994    0.14                0.888    
## trans.PITCHING_SO -0.472120   0.236579   -2.00                0.048 *  
## trans.FIELDING_E   0.002841   0.003922    0.72                0.470    
## trans.FIELDING_DP  0.004639   0.002977    1.56                0.121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03 on 174 degrees of freedom
##   (1792 observations deleted due to missingness)
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998 
## F-statistic: 6.08e+03 on 15 and 174 DF,  p-value: <0.0000000000000002

Evaluate the model

Risidual plots

res <- residuals(mod1)
plot(mod1)

abline(0,0)

Wald Test

The Wald test is used to evaluate the statistical significance of each coefficient in the model and is calculated by taking the ratio of the square of the regression coefficient to the square of the standard error of the coefficient.

library(survey)
regTermTest(mod1, "trans.TARGET_WINS")
## Wald test for trans.TARGET_WINS
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.58  on  1  and  174  df: p= 0.4
regTermTest(mod1, "trans.BATTING_2B")
## Wald test for trans.BATTING_2B
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.25  on  1  and  174  df: p= 0.6
regTermTest(mod1, "trans.BATTING_3B")
## Wald test for trans.BATTING_3B
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.55  on  1  and  174  df: p= 0.5
regTermTest(mod1, "trans.BATTING_HR")
## Wald test for trans.BATTING_HR
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  48  on  1  and  174  df: p= 0.00000000009
regTermTest(mod1, "trans.BATTING_BB")
## Wald test for trans.BATTING_BB
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.016  on  1  and  174  df: p= 0.9
regTermTest(mod1, "trans.BATTING_SO")
## Wald test for trans.BATTING_SO
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  3.9  on  1  and  174  df: p= 0.05
regTermTest(mod1, "trans.BASERUN_SB")
## Wald test for trans.BASERUN_SB
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  1.4  on  1  and  174  df: p= 0.2
regTermTest(mod1, "trans.BASERUN_CS")
## Wald test for trans.BASERUN_CS
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.12  on  1  and  174  df: p= 0.7
regTermTest(mod1, "trans.BATTING_HBP")
## Wald test for trans.BATTING_HBP
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.01  on  1  and  174  df: p= 0.9
regTermTest(mod1, "trans.PITCHING_HR")
## Wald test for trans.PITCHING_HR
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  47  on  1  and  174  df: p= 0.0000000001
regTermTest(mod1, "trans.PITCHING_BB")
## Wald test for trans.PITCHING_BB
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  0.02  on  1  and  174  df: p= 0.9
regTermTest(mod1, "trans.PITCHING_SO")
## Wald test for trans.PITCHING_SO
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  4  on  1  and  174  df: p= 0.05
regTermTest(mod1, "trans.FIELDING_DP")
## Wald test for trans.FIELDING_DP
##  in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F =  2.4  on  1  and  174  df: p= 0.1

Variable Importance

Assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter.

varImp(mod1)
##                   Overall
## trans.TARGET_WINS    0.76
## trans.BATTING_2B     0.50
## trans.BATTING_3B     0.74
## trans.BATTING_HR     6.90
## trans.BATTING_BB     0.13
## trans.BATTING_SO     1.98
## trans.BASERUN_SB     1.18
## trans.BASERUN_CS     0.34
## trans.BATTING_HBP    0.10
## trans.PITCHING_H   187.29
## trans.PITCHING_HR    6.89
## trans.PITCHING_BB    0.14
## trans.PITCHING_SO    2.00
## trans.FIELDING_E     0.72
## trans.FIELDING_DP    1.56

References: https://www.r-bloggers.com/evaluating-logistic-regression-models/