This homework explores, analyzes, and models a data set of approximately 2,000 records. Each record represents a professional baseball team from the years 1871-2006 and contains the team's performance statistics for that year, with all statistics adjusted to match a 162-game season.
The following approach is used to build a multiple linear regression model:
* Data exploration - the data are examined to understand their structure: whether there are missing values or outliers, whether the data are skewed, etc. Findings from the data exploration process dictate the appropriate model to fit to the data.
library(dplyr)
train <- read.csv("https://raw.githubusercontent.com/indianspice/DATA621/master/Hw1/moneyball-training-data.csv")
#remove the leading text "TEAM_" on the variable names to make our plots look less cluttered
colnames(train) = gsub("TEAM_", "", colnames(train))
#Remove index column
train <- train[,-1]
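As a quick first look at the structure of the data (dimensions, column types, and sample values), a minimal sketch using glimpse() from the already-loaded dplyr package:
#one row per column: its type and first few values
glimpse(train)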
The descriptive statistics below show the median, mean, standard error, variance, standard deviation, and coefficient of variation of each variable in the dataset, followed by a second table with the counts of values, zeros, and missing values, plus the minimum, maximum, range, and sum.
library(pastecs)
attach(train)
stat <- cbind(TARGET_WINS,BATTING_H,BATTING_2B,BATTING_3B,BATTING_HR,BATTING_BB,BATTING_SO,BASERUN_SB,BASERUN_CS,BATTING_HBP,PITCHING_H,PITCHING_HR,PITCHING_BB,PITCHING_SO,FIELDING_E,FIELDING_DP)
options(scipen=100)
options(digits=2)
stat.desc(stat, basic = F)
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR
## median 82.00 1454.000 238.00 47.00 102.00
## mean 80.79 1469.270 241.25 55.25 99.61
## SE.mean 0.33 3.031 0.98 0.59 1.27
## CI.mean.0.95 0.65 5.943 1.92 1.15 2.49
## var 248.13 20906.614 2190.37 780.56 3665.92
## std.dev 15.75 144.591 46.80 27.94 60.55
## coef.var 0.19 0.098 0.19 0.51 0.61
## BATTING_BB BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP
## median 512.00 750.00 101.0 49.00 58.00
## mean 501.56 735.61 124.8 52.80 59.36
## SE.mean 2.57 5.33 1.9 0.59 0.94
## CI.mean.0.95 5.04 10.45 3.7 1.16 1.85
## var 15048.14 61765.38 7707.3 526.99 168.15
## std.dev 122.67 248.53 87.8 22.96 12.97
## coef.var 0.24 0.34 0.7 0.43 0.22
## PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## median 1518.00 107.00 536.5 813.50 159.00
## mean 1779.21 105.70 553.0 817.73 246.48
## SE.mean 29.49 1.28 3.5 11.86 4.77
## CI.mean.0.95 57.83 2.52 6.8 23.26 9.36
## var 1979207.03 3757.54 27674.8 305903.05 51879.62
## std.dev 1406.84 61.30 166.4 553.09 227.77
## coef.var 0.79 0.58 0.3 0.68 0.92
## FIELDING_DP
## median 149.00
## mean 146.39
## SE.mean 0.59
## CI.mean.0.95 1.15
## var 687.82
## std.dev 26.23
## coef.var 0.18
stat.desc(stat, desc = F)
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## nbr.val 2276 2276 2276 2276 2276 2276
## nbr.null 1 0 0 2 15 1
## nbr.na 0 0 0 0 0 0
## min 0 891 69 0 0 0
## max 146 2554 458 223 264 878
## range 146 1663 389 223 264 878
## sum 183880 3344058 549078 125749 226717 1141548
## BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H
## nbr.val 2174 2145 1504 191 2276
## nbr.null 20 2 1 0 0
## nbr.na 102 131 772 2085 0
## min 0 0 0 29 1137
## max 1399 697 201 95 30132
## range 1399 697 201 66 28995
## sum 1599206 267614 79417 11337 4049483
## PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## nbr.val 2276 2276 2174 2276 1990
## nbr.null 15 1 20 0 0
## nbr.na 0 0 102 0 286
## min 0 0 0 65 52
## max 343 3645 19278 1898 228
## range 343 3645 19278 1833 176
## sum 240570 1258646 1777746 560990 291312
Examining skewness and outliers in the data is important prior to choosing a model, because some models require the data to be transformed.
As seen in the density plots below, several variables are skewed; only four of the sixteen variables are normally or close to normally distributed.
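To quantify the skew rather than judge it by eye, a minimal sketch using skew() from the psych package (loaded later for pairs.panels); values near zero suggest symmetry:
#skewness of each variable, largest first
library(psych)
sort(apply(train, 2, function(x) skew(x, na.rm = TRUE)), decreasing = TRUE)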
library(reshape)
library(ggplot2)
#reshape to long format so each variable gets its own density panel
datasub <- melt(train)
ggplot(datasub, aes(x = value)) +
geom_density(fill = 'red') + facet_wrap(~variable, scales = 'free')
The mice library in R was used to analyze the missing values in the dataset. The analysis shows that 191 observations are complete, 1,295 are missing only BATTING_HBP, and 349 are missing both BATTING_HBP and BASERUN_CS. BATTING_HBP has the most missing values of any variable (2,085 of 2,276 records).
The VIM library was used to visualize the missing values. The visualization shows that fewer than ten percent of the records (191 of 2,276) are free of missing values.
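Before running mice, the per-column counts can be cross-checked with a one-liner in base R:
#number of missing values per column, largest first
sort(colSums(is.na(train)), decreasing = TRUE)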
library(mice)
md.pattern(train)
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## 191 1 1 1 1 1 1
## 1295 1 1 1 1 1 1
## 349 1 1 1 1 1 1
## 18 1 1 1 1 1 1
## 53 1 1 1 1 1 1
## 190 1 1 1 1 1 1
## 102 1 1 1 1 1 1
## 78 1 1 1 1 1 1
## 0 0 0 0 0 0
## PITCHING_H PITCHING_HR PITCHING_BB FIELDING_E BATTING_SO PITCHING_SO
## 191 1 1 1 1 1 1
## 1295 1 1 1 1 1 1
## 349 1 1 1 1 1 1
## 18 1 1 1 1 1 1
## 53 1 1 1 1 1 1
## 190 1 1 1 1 1 1
## 102 1 1 1 1 0 0
## 78 1 1 1 1 1 1
## 0 0 0 0 102 102
## BASERUN_SB FIELDING_DP BASERUN_CS BATTING_HBP
## 191 1 1 1 1 0
## 1295 1 1 1 0 1
## 349 1 1 0 0 2
## 18 1 0 1 0 2
## 53 0 1 0 0 3
## 190 1 0 0 0 3
## 102 1 1 0 0 4
## 78 0 0 0 0 4
## 131 286 772 2085 3478
#Visualize missing values
library(VIM)
aggr_plot <- aggr(train, col = c('navyblue','red'), numbers = TRUE, sortVars = TRUE,
                  labels = names(train), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## BATTING_HBP 0.916
## BASERUN_CS 0.339
## FIELDING_DP 0.126
## BASERUN_SB 0.058
## BATTING_SO 0.045
## PITCHING_SO 0.045
## TARGET_WINS 0.000
## BATTING_H 0.000
## BATTING_2B 0.000
## BATTING_3B 0.000
## BATTING_HR 0.000
## BATTING_BB 0.000
## PITCHING_H 0.000
## PITCHING_HR 0.000
## PITCHING_BB 0.000
## FIELDING_E 0.000
The data were filtered to remove records with extreme values, and zeros were then recoded to NA.
train <- train %>%
filter(TARGET_WINS >=22 & TARGET_WINS <= 124 & BATTING_H <= 1876 & BATTING_2B >= 116 & BATTING_2B <= 376 & BATTING_BB >= 292 &
BATTING_BB <= 879 & BATTING_SO >= 326 & BATTING_SO <= 1535 & PITCHING_HR <= 258 & PITCHING_SO <= 1450)
train[train == 0] <- NA
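A quick check of what survived the filter and where the newly created NAs landed (a sketch using base R only):
#rows and columns remaining after filtering
dim(train)
#missing values per column now that zeros are NA
colSums(is.na(train))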
Examine the new dataset by regenerating the density plots.
datasub <- melt(train)
ggplot(datasub, aes(x = value)) +
geom_density(fill = 'red') + facet_wrap(~variable, scales = 'free')
The visualizations below show positive and negative correlations among the variables; a small number of variable pairs show little or no correlation.
library(psych)
pairs.panels(train[1:8]) # select columns 1-8
pairs.panels(train[9:16])
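To pull the strongest relationships out of the panel plots as numbers, a minimal sketch (the |0.7| cutoff is an arbitrary choice; pairwise-complete observations are used because of the missing values):
#list the variable pairs correlated above |0.7|
cors <- cor(train, use = "pairwise.complete.obs")
high <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           r = round(cors[high], 2))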
Data Preparation: this step is extremely important when modeling data; it can make or break the quality of the model.
A Box-Cox transformation, centering, and scaling were applied to the individual predictors using the caret library. The density plots of the transformed data show that some variables were transformed from skewed to normal or close to normal distributions.
library(caret)
trans <- preProcess(train, method = c("BoxCox", "center", "scale"))
predictorsTrans <- data.frame(trans = predict(trans, train))
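Printing the preProcess object shows which variables each step was applied to and summarizes the estimated Box-Cox lambdas:
#summary of the fitted transformations
trans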
#Density plot of transformed data
dataTrans = melt(predictorsTrans)
ggplot(dataTrans, aes(x= value)) +
geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
The model below is built with one dependent variable, trans.BATTING_H, and measures its association with all of the remaining predictor variables.
#All variables
summary(mod1 <- lm(trans.BATTING_H ~ ., data = predictorsTrans))
##
## Call:
## lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03991 -0.01983 -0.00947 0.01019 0.10372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.224777 0.034764 6.47 0.000000000981 ***
## trans.TARGET_WINS -0.002832 0.003713 -0.76 0.447
## trans.BATTING_2B 0.002624 0.005253 0.50 0.618
## trans.BATTING_3B -0.003003 0.004049 -0.74 0.459
## trans.BATTING_HR 4.059118 0.588259 6.90 0.000000000093 ***
## trans.BATTING_BB -0.003426 0.026823 -0.13 0.899
## trans.BATTING_SO 0.437314 0.220932 1.98 0.049 *
## trans.BASERUN_SB 0.006665 0.005640 1.18 0.239
## trans.BASERUN_CS 0.001365 0.003971 0.34 0.731
## trans.BATTING_HBP 0.000234 0.002331 0.10 0.920
## trans.PITCHING_H 1.326131 0.007081 187.29 < 0.0000000000000002 ***
## trans.PITCHING_HR -4.016870 0.582888 -6.89 0.000000000097 ***
## trans.PITCHING_BB 0.003952 0.027994 0.14 0.888
## trans.PITCHING_SO -0.472120 0.236579 -2.00 0.048 *
## trans.FIELDING_E 0.002841 0.003922 0.72 0.470
## trans.FIELDING_DP 0.004639 0.002977 1.56 0.121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03 on 174 degrees of freedom
## (1792 observations deleted due to missingness)
## Multiple R-squared: 0.998, Adjusted R-squared: 0.998
## F-statistic: 6.08e+03 on 15 and 174 DF, p-value: <0.0000000000000002
Residual plots are shown below. Note that, because of missing values, the model was fit on only 190 complete cases (1,792 observations were deleted).
res <- residuals(mod1)
#standard lm diagnostics: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
plot(mod1)
abline(0, 0)
The Wald test is used to evaluate the statistical significance of each coefficient in the model. It is calculated as the ratio of the square of the regression coefficient to the square of the coefficient's standard error; for a single coefficient this is just the square of its t-statistic.
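As a sanity check on the regTermTest output below, the statistic for a single term can be computed by hand from the fitted model (a minimal sketch, using trans.TARGET_WINS as the example term):
#Wald F for one coefficient: squared estimate over squared standard error
b <- coef(mod1)["trans.TARGET_WINS"]
se <- sqrt(diag(vcov(mod1))["trans.TARGET_WINS"])
W <- (b / se)^2
#p-value from the F distribution with 1 and residual degrees of freedom
pf(W, df1 = 1, df2 = df.residual(mod1), lower.tail = FALSE)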
library(survey)
regTermTest(mod1, "trans.TARGET_WINS")
## Wald test for trans.TARGET_WINS
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.58 on 1 and 174 df: p= 0.4
regTermTest(mod1, "trans.BATTING_2B")
## Wald test for trans.BATTING_2B
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.25 on 1 and 174 df: p= 0.6
regTermTest(mod1, "trans.BATTING_3B")
## Wald test for trans.BATTING_3B
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.55 on 1 and 174 df: p= 0.5
regTermTest(mod1, "trans.BATTING_HR")
## Wald test for trans.BATTING_HR
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 48 on 1 and 174 df: p= 0.00000000009
regTermTest(mod1, "trans.BATTING_BB")
## Wald test for trans.BATTING_BB
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.016 on 1 and 174 df: p= 0.9
regTermTest(mod1, "trans.BATTING_SO")
## Wald test for trans.BATTING_SO
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 3.9 on 1 and 174 df: p= 0.05
regTermTest(mod1, "trans.BASERUN_SB")
## Wald test for trans.BASERUN_SB
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 1.4 on 1 and 174 df: p= 0.2
regTermTest(mod1, "trans.BASERUN_CS")
## Wald test for trans.BASERUN_CS
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.12 on 1 and 174 df: p= 0.7
regTermTest(mod1, "trans.BATTING_HBP")
## Wald test for trans.BATTING_HBP
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.01 on 1 and 174 df: p= 0.9
regTermTest(mod1, "trans.PITCHING_HR")
## Wald test for trans.PITCHING_HR
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 47 on 1 and 174 df: p= 0.0000000001
regTermTest(mod1, "trans.PITCHING_BB")
## Wald test for trans.PITCHING_BB
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 0.02 on 1 and 174 df: p= 0.9
regTermTest(mod1, "trans.PITCHING_SO")
## Wald test for trans.PITCHING_SO
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 4 on 1 and 174 df: p= 0.05
regTermTest(mod1, "trans.FIELDING_DP")
## Wald test for trans.FIELDING_DP
## in lm(formula = trans.BATTING_H ~ ., data = predictorsTrans)
## F = 2.4 on 1 and 174 df: p= 0.1
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter.
varImp(mod1)
## Overall
## trans.TARGET_WINS 0.76
## trans.BATTING_2B 0.50
## trans.BATTING_3B 0.74
## trans.BATTING_HR 6.90
## trans.BATTING_BB 0.13
## trans.BATTING_SO 1.98
## trans.BASERUN_SB 1.18
## trans.BASERUN_CS 0.34
## trans.BATTING_HBP 0.10
## trans.PITCHING_H 187.29
## trans.PITCHING_HR 6.89
## trans.PITCHING_BB 0.14
## trans.PITCHING_SO 2.00
## trans.FIELDING_E 0.72
## trans.FIELDING_DP 1.56
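For an lm fit, caret's varImp() reports exactly this absolute t value, which can be verified directly from the model's coefficient table:
#absolute t statistics from the model summary (intercept dropped)
abs(summary(mod1)$coefficients[-1, "t value"])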
References: https://www.r-bloggers.com/evaluating-logistic-regression-models/