DATA 621 HW 1

Data Preparation

At a high level the data looks relatively tidy. However, once one digs into the dataset, some inconsistencies become evident. There are a fair number of NAs in six columns: TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP. In total they account for 3478 missing observations (102, 131, 772, 2085, 102, and 286 respectively). The largest share is in TEAM_BATTING_HBP, where ~92% of the 2276 observations in that column are missing.

Given these obstacles, I had to choose an appropriate method to transform the data. I set a baseline rule that any column with more than 10% of its observations missing could not be effectively imputed or used in the analysis. This led to the removal of the columns TEAM_BASERUN_CS, TEAM_BATTING_HBP, and TEAM_FIELDING_DP from the analysis. I then imputed the missing values for TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_PITCHING_SO with the mice function from the mice package, using its predictive mean matching method.
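The 10% rule can also be applied without typing the NA counts by hand. The following is a minimal sketch on a synthetic data frame; the column names a, b, c and the extra "keep" label for complete columns are assumptions made for illustration, not part of the original analysis.

```r
# Synthetic stand-in for the training data (hypothetical values, not the real file)
set.seed(1)
demo <- data.frame(
  a = c(NA, rnorm(99)),           # 1% missing
  b = c(rep(NA, 40), rnorm(60)),  # 40% missing
  c = rnorm(100)                  # complete
)

# Share of missing values per column
na_share <- colMeans(is.na(demo))

# Apply the 10% rule: remove above the threshold, impute if anything is
# missing, keep complete columns unchanged
threshold <- 0.10
decision <- ifelse(na_share > threshold, "remove",
                   ifelse(na_share > 0, "impute", "keep"))
print(round(na_share, 3))
print(decision)

# Columns retained for imputation and modelling
kept <- demo[, na_share <= threshold, drop = FALSE]
```

Computing the shares from the data keeps the rule in sync with the file if the data changes.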

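Predictive mean matching computes a predicted value of the target variable Y for every case from the specified imputation model. For each missing entry it then finds the observed cases whose predicted values are closest and borrows the observed value of one of them. Because every imputed value is a value that was actually observed, the imputations are realistic with respect to the dataset.

Once the data had been cleaned, I moved on to data exploration.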
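To make the matching idea concrete, here is a simplified base R sketch for a single variable. It is not the mice implementation: mice also draws the regression parameters at random, which this sketch skips. The function name pmm_sketch, the donor count k = 5, and the synthetic data are all assumptions for illustration.

```r
# Simplified predictive mean matching for one variable (illustrative only)
pmm_sketch <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)                     # imputation model on observed rows
  pred <- predict(fit, newdata = data.frame(x = x))  # predictions for every row
  k <- min(k, sum(obs))
  y_imp <- y
  for (i in which(!obs)) {
    # Donors: the k observed cases whose predicted values are closest
    d <- abs(pred[obs] - pred[i])
    donors <- which(obs)[order(d)[seq_len(k)]]
    # Borrow the observed value of one randomly chosen donor
    pick <- donors[sample.int(length(donors), 1)]
    y_imp[i] <- y[pick]
  }
  y_imp
}

# Synthetic example: 20 of 200 values are made missing
set.seed(42)
x <- rnorm(200)
y <- 3 + 2 * x + rnorm(200)
y_missing <- y
y_missing[sample(200, 20)] <- NA
y_filled <- pmm_sketch(y_missing, x)
summary(y_filled)
```

Every filled-in value is copied from an observed case, so imputations cannot fall outside the observed range.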

*Reference appendix 1A for the data preparation

Data Exploration

Using the now completed data set with the newly imputed values for missing observations, we can produce summary statistics and diagrams to get a general view of the data. Of the 12 candidate predictors of TARGET_WINS, the boxplots suggest that TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_BB are dominated by extreme values (their maxima of 30132, 19278, and 3645 sit far above their medians of 1518, 800, and 536.5), so they appear to be associated with wins only on the rarest occasions. This raises an interesting question: will these variables be statistically significant components of the linear model? This is tested in the model portion of the process.

*Reference appendix 1B for the summary statistics

Following the review of the summary statistics, I computed a correlation matrix together with a matrix of p-values to weed out variables that are clearly not significantly correlated with TARGET_WINS. TEAM_BATTING_SO is the only such variable (r = -0.03, p = 0.139), so it is dropped from the linear model.

*Reference appendix 1C for the Correlation matrix
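For readers who want to see where a correlation p-value comes from, the sketch below applies the standard t-test for a Pearson correlation and compares it with base R's cor.test. The helper name cor_pvalue and the synthetic vectors u and v are assumptions for illustration.

```r
# p-value for H0: rho = 0, from a sample correlation r and sample size n
cor_pvalue <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)   # two-sided
}

# Synthetic, weakly correlated data
set.seed(7)
u <- rnorm(500)
v <- 0.1 * u + rnorm(500)

r <- cor(u, v)
p_manual  <- cor_pvalue(r, length(u))
p_builtin <- cor.test(u, v)$p.value   # base R reference value
print(c(r = r, p_manual = p_manual, p_builtin = p_builtin))
```

With a large sample even small correlations become significant, so significance says little about how strong a relationship is.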

Build Models

Model 1:

This model is fitted on the completed training data set with imputed values, which contains only the variables considered relevant and ‘complete’. The correlation and p-value matrices indicated that TEAM_BATTING_SO is not significantly correlated with TARGET_WINS, so it is removed and all other variables in the completed training data set are kept in the model. The strongest correlation with TARGET_WINS is that of TEAM_BATTING_H (r = 0.39). Given that there are no glaringly strong correlations with TARGET_WINS in the matrix, we need to turn to the statistical significance of each variable within the model.

In model 1, four predictor coefficients are not statistically significant even at the 0.10 level (TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_PITCHING_BB, and TEAM_PITCHING_SO), and TEAM_BATTING_HR is only marginal (p = 0.067). I then removed the two predictors with the largest p-values, TEAM_PITCHING_HR and TEAM_PITCHING_SO, and refit the model.

Model 2:

After removing TEAM_BATTING_SO, TEAM_PITCHING_HR, and TEAM_PITCHING_SO, the model still contains two variables that are not statistically significant (TEAM_BATTING_BB and TEAM_PITCHING_BB) and need to be removed.

Model 3:

After removing TEAM_PITCHING_BB and TEAM_BATTING_BB, all remaining variables are statistically significant. However, the model can still be simplified at little cost in fit by removing TEAM_BATTING_2B and TEAM_PITCHING_H.

Model 4:

After retaining only the most statistically significant variables (TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BASERUN_SB, and TEAM_FIELDING_E), the model reaches its final form. From a conceptual standpoint, it is reasonable that base hits, triples, and home runs by batters, stolen bases, and errors have the largest impact on whether a team wins or loses. Consistent errors would clearly diminish a team's ability to win, which is why this variable has a negative coefficient. Base hits, triples, home runs, and stolen bases are the team's core scoring variables and are clearly relevant, from a gameplay perspective alone, to whether the team wins or loses.

Of the four models produced, model 4 is judged the strongest for prediction.

*Reference Appendix 2 for the models and model summaries
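The stepwise removal described in this section can be automated. The sketch below is a generic backward elimination loop on synthetic data, not the exact procedure used above: it drops one predictor at a time rather than two, and the function name backward_eliminate, the 0.05 cutoff, and the variables x1 to x4 are assumptions for illustration.

```r
# Synthetic data: x1 and x2 drive y, x3 and x4 are pure noise
set.seed(123)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 5 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

# Refit repeatedly, dropping the predictor with the largest p-value
# until every remaining predictor has p <= alpha
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    pvals <- coefs[-1, 4]                  # skip the intercept row
    names(pvals) <- rownames(coefs)[-1]
    if (max(pvals) <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(pvals)[which.max(pvals)])
  }
}

final <- backward_eliminate(d, "y")
print(summary(final)$coefficients)
```

Dropping one variable per refit is the more cautious choice, since p-values of the remaining predictors can shift after each removal.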

Select Models

Applying model checks to model 4 shows that the residuals are approximately normally distributed. In the qqnorm plot the residuals follow the qqline closely, which is in line with the normality assumption of the model. The skewness of the residuals is 0.055, the smallest of the four models.

Model 4 has the smallest adjusted R-squared (0.3188, versus 0.3268, 0.3272, and 0.3248 for models 1 to 3), but it gives up less than one percentage point of explained variance while using only five predictors, so I prefer it for prediction. Model 4 also has the largest F-statistic (214 on 5 and 2270 degrees of freedom), so we can reject the null hypothesis that all slope coefficients are zero. The F-statistic grows as non-significant variables are removed because nearly the same amount of variance is explained with fewer predictors.

Using model 4, I predicted target wins for a cleaned evaluation data set (imputed values for NAs in relevant columns and removal of irrelevant variables from the dataset).
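The same residual checks can be expressed numerically. The sketch below uses a small synthetic regression, computes skewness as the third standardized moment (assumed here to match the definition used by moments::skewness), and adds a Shapiro-Wilk test as a formal complement to the Q-Q plot. The object names are hypothetical.

```r
# Synthetic regression with normal errors
set.seed(2024)
x <- rnorm(400)
y <- 1 + 0.5 * x + rnorm(400)
fit_demo <- lm(y ~ x)
res <- residuals(fit_demo)

# Sample skewness: third central moment over the 1.5 power of the second
skew <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5

# Shapiro-Wilk test of normality (base R, valid for 3 to 5000 observations)
sw <- shapiro.test(res)
print(c(skewness = skew, shapiro_p = sw$p.value))
```

A skewness near zero indicates symmetry only. Heavy tails can still be present, which is why the Q-Q plot remains useful.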
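The overall F-statistic printed by summary() tests whether all slopes are zero. Whether a specific set of dropped predictors matters jointly is usually checked with a partial F-test between nested models. The sketch below shows this with anova() on synthetic data; the variables x1 to x3 and both model objects are assumptions for illustration.

```r
# Synthetic data: only x1 has a true effect on y
set.seed(99)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 10 + 3 * d$x1 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- lm(y ~ x1, data = d)

# Partial F-test: H0 is that the coefficients of x2 and x3 are both zero
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of both models for comparison
print(c(adj_r2_full = summary(full)$adj.r.squared,
        adj_r2_reduced = summary(reduced)$adj.r.squared))
```

A large p-value in the second row of the table would support the reduced model, and a small one would argue for keeping the dropped predictors.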
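A minimal sketch of the prediction step follows, on synthetic data. The evaluation frame must contain the same predictor columns as the training frame; the column names h, e, and unused are hypothetical, and the prediction interval is an addition for illustration, not part of the original output.

```r
# Synthetic training data and model (hypothetical columns h = hits, e = errors)
set.seed(5)
train <- data.frame(h = rnorm(100, 1450, 100), e = rnorm(100, 160, 30))
train$wins <- 10 + 0.04 * train$h - 0.03 * train$e + rnorm(100, 0, 5)
model <- lm(wins ~ h + e, data = train)

# Synthetic evaluation data with extra columns that the model does not use
eval_raw <- data.frame(INDEX = 1:3,
                       h = c(1400, 1500, 1600),
                       e = c(150, 200, 120),
                       unused = c(1, 2, 3))
eval_clean <- eval_raw[, c("h", "e")]

# Guard: every predictor in the model must be present in the evaluation data
needed <- all.vars(delete.response(terms(model)))
stopifnot(all(needed %in% names(eval_clean)))

# Point predictions with 95% prediction intervals
pred <- predict(model, newdata = eval_clean, interval = "prediction")
print(pred)
```

The interval columns lwr and upr show how uncertain each individual prediction is, which a bare vector of fitted values hides.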

*Reference Appendix 3A for validity checks and Appendix 3B for model prediction

Appendix 1A

library(ggplot2)
library(dplyr)
library(mice)
library(corrplot)
library(Hmisc)
library(moments)
library(reshape2)

## Load Training Data

trainingdata = read.csv(file='moneyball-training-data.csv',header = TRUE,sep=',')
evaldata = read.csv(file='moneyball-evaluation-data.csv',header = TRUE,sep=',')

## Summary Statistics

names(trainingdata)

##  [1] "INDEX"            "TARGET_WINS"      "TEAM_BATTING_H"  
##  [4] "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
##  [7] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
## [10] "TEAM_BASERUN_CS"  "TEAM_BATTING_HBP" "TEAM_PITCHING_H" 
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO"
## [16] "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"

sum1 = summary(trainingdata[,2:length(names(trainingdata))]) ## exclude index

print(sum1)

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286

## Initial view of data frame

head(trainingdata,5) ## All variables are quantitative

## correlation matrix to see relationship between variables

cor(trainingdata[,2:length(names(trainingdata))]) ## doing this is ineffective; requires transformation to parse NAs

##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS        1.0000000    0.388767521      0.28910365
## TEAM_BATTING_H     0.3887675    1.000000000      0.56284968
## TEAM_BATTING_2B    0.2891036    0.562849678      1.00000000
## TEAM_BATTING_3B    0.1426084    0.427696575     -0.10730582
## TEAM_BATTING_HR    0.1761532   -0.006544685      0.43539729
## TEAM_BATTING_BB    0.2325599   -0.072464013      0.25572610
## TEAM_BATTING_SO           NA             NA              NA
## TEAM_BASERUN_SB           NA             NA              NA
## TEAM_BASERUN_CS           NA             NA              NA
## TEAM_BATTING_HBP          NA             NA              NA
## TEAM_PITCHING_H   -0.1099371    0.302693709      0.02369219
## TEAM_PITCHING_HR   0.1890137    0.072853119      0.45455082
## TEAM_PITCHING_BB   0.1241745    0.094193027      0.17805420
## TEAM_PITCHING_SO          NA             NA              NA
## TEAM_FIELDING_E   -0.1764848    0.264902478     -0.23515099
## TEAM_FIELDING_DP          NA             NA              NA
##                  TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS          0.142608411     0.176153200      0.23255986
## TEAM_BATTING_H       0.427696575    -0.006544685     -0.07246401
## TEAM_BATTING_2B     -0.107305824     0.435397293      0.25572610
## TEAM_BATTING_3B      1.000000000    -0.635566946     -0.28723584
## TEAM_BATTING_HR     -0.635566946     1.000000000      0.51373481
## TEAM_BATTING_BB     -0.287235841     0.513734810      1.00000000
## TEAM_BATTING_SO               NA              NA              NA
## TEAM_BASERUN_SB               NA              NA              NA
## TEAM_BASERUN_CS               NA              NA              NA
## TEAM_BATTING_HBP              NA              NA              NA
## TEAM_PITCHING_H      0.194879411    -0.250145481     -0.44977762
## TEAM_PITCHING_HR    -0.567836679     0.969371396      0.45955207
## TEAM_PITCHING_BB    -0.002224148     0.136927564      0.48936126
## TEAM_PITCHING_SO              NA              NA              NA
## TEAM_FIELDING_E      0.509778447    -0.587339098     -0.65597081
## TEAM_FIELDING_DP              NA              NA              NA
##                  TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## TARGET_WINS                   NA              NA              NA
## TEAM_BATTING_H                NA              NA              NA
## TEAM_BATTING_2B               NA              NA              NA
## TEAM_BATTING_3B               NA              NA              NA
## TEAM_BATTING_HR               NA              NA              NA
## TEAM_BATTING_BB               NA              NA              NA
## TEAM_BATTING_SO                1              NA              NA
## TEAM_BASERUN_SB               NA               1              NA
## TEAM_BASERUN_CS               NA              NA               1
## TEAM_BATTING_HBP              NA              NA              NA
## TEAM_PITCHING_H               NA              NA              NA
## TEAM_PITCHING_HR              NA              NA              NA
## TEAM_PITCHING_BB              NA              NA              NA
## TEAM_PITCHING_SO              NA              NA              NA
## TEAM_FIELDING_E               NA              NA              NA
## TEAM_FIELDING_DP              NA              NA              NA
##                  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## TARGET_WINS                    NA     -0.10993705       0.18901373
## TEAM_BATTING_H                 NA      0.30269371       0.07285312
## TEAM_BATTING_2B                NA      0.02369219       0.45455082
## TEAM_BATTING_3B                NA      0.19487941      -0.56783668
## TEAM_BATTING_HR                NA     -0.25014548       0.96937140
## TEAM_BATTING_BB                NA     -0.44977762       0.45955207
## TEAM_BATTING_SO                NA              NA               NA
## TEAM_BASERUN_SB                NA              NA               NA
## TEAM_BASERUN_CS                NA              NA               NA
## TEAM_BATTING_HBP                1              NA               NA
## TEAM_PITCHING_H                NA      1.00000000      -0.14161276
## TEAM_PITCHING_HR               NA     -0.14161276       1.00000000
## TEAM_PITCHING_BB               NA      0.32067616       0.22193750
## TEAM_PITCHING_SO               NA              NA               NA
## TEAM_FIELDING_E                NA      0.66775901      -0.49314447
## TEAM_FIELDING_DP               NA              NA               NA
##                  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## TARGET_WINS           0.124174536               NA     -0.17648476
## TEAM_BATTING_H        0.094193027               NA      0.26490248
## TEAM_BATTING_2B       0.178054204               NA     -0.23515099
## TEAM_BATTING_3B      -0.002224148               NA      0.50977845
## TEAM_BATTING_HR       0.136927564               NA     -0.58733910
## TEAM_BATTING_BB       0.489361263               NA     -0.65597081
## TEAM_BATTING_SO                NA               NA              NA
## TEAM_BASERUN_SB                NA               NA              NA
## TEAM_BASERUN_CS                NA               NA              NA
## TEAM_BATTING_HBP               NA               NA              NA
## TEAM_PITCHING_H       0.320676162               NA      0.66775901
## TEAM_PITCHING_HR      0.221937505               NA     -0.49314447
## TEAM_PITCHING_BB      1.000000000               NA     -0.02283756
## TEAM_PITCHING_SO               NA                1              NA
## TEAM_FIELDING_E      -0.022837561               NA      1.00000000
## TEAM_FIELDING_DP               NA               NA              NA
##                  TEAM_FIELDING_DP
## TARGET_WINS                    NA
## TEAM_BATTING_H                 NA
## TEAM_BATTING_2B                NA
## TEAM_BATTING_3B                NA
## TEAM_BATTING_HR                NA
## TEAM_BATTING_BB                NA
## TEAM_BATTING_SO                NA
## TEAM_BASERUN_SB                NA
## TEAM_BASERUN_CS                NA
## TEAM_BATTING_HBP               NA
## TEAM_PITCHING_H                NA
## TEAM_PITCHING_HR               NA
## TEAM_PITCHING_BB               NA
## TEAM_PITCHING_SO               NA
## TEAM_FIELDING_E                NA
## TEAM_FIELDING_DP                1

## NA Transformation: Applied to TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_BASERUN_CS,TEAM_BATTING_HBP,TEAM_PITCHING_SO,TEAM_FIELDING_DP

## First, we determine the % of NAs relative to the data set.

naperTEAMBATTINGSO = 102/length(trainingdata$INDEX)
naperTEAMBASERUNSB = 131/length(trainingdata$INDEX)
naperTEAMBASERUNCS = 772/length(trainingdata$INDEX)
naperTEAMBATTINGhbp = 2085/length(trainingdata$INDEX)
naperTEAMPITCHINGSO = 102/length(trainingdata$INDEX)
naperTEAMFIELDINGDP = 286/length(trainingdata$INDEX)

## If the % of NAs is at most 10% we impute the missing values (predictive mean matching via mice); otherwise the column is removed

percentlist = c(naperTEAMBATTINGSO,naperTEAMBASERUNSB,naperTEAMBASERUNCS,naperTEAMBATTINGhbp,naperTEAMPITCHINGSO,naperTEAMFIELDINGDP)

ifelse(percentlist>.1,"remove","impute")

## [1] "impute" "impute" "remove" "remove" "impute" "remove"

## too many entries are missing to consider the following columns relevant to the analysis: TEAM_BASERUN_CS,TEAM_BATTING_HBP,TEAM_FIELDING_DP
## we remove these columns to focus only on the relevant data points

todrop = c('INDEX','TEAM_BASERUN_CS','TEAM_BATTING_HBP','TEAM_FIELDING_DP')
newtrainingdata = trainingdata[,!names(trainingdata) %in% todrop]
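The appendix as shown does not include the imputation call that produces the completed data set used in the following appendices. Below is a hedged sketch of what such a call could look like with mice, run on synthetic data. The number of imputations (m = 5), the seed, the choice of the first completed data set, and the columns wins, hits, and so are all assumptions, since the original settings are not shown.

```r
library(mice)

# Synthetic stand-in for newtrainingdata (hypothetical columns and values)
set.seed(10)
demo <- data.frame(wins = rnorm(200, 80, 15))
demo$hits <- 1000 + 5 * demo$wins + rnorm(200, 0, 50)
demo$so   <- 800 - 2 * demo$wins + rnorm(200, 0, 60)
demo$so[sample(200, 10)] <- NA   # 5% missing, below the 10% threshold

# Impute with predictive mean matching; m and seed are assumed values
imputed <- mice(demo, m = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Extract the first of the m completed data sets
completed_demo <- mice::complete(imputed, 1)
summary(completed_demo)
```

Fixing the seed makes the imputed values reproducible between runs.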

Appendix 1B

summary(completedatatraining)

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 542.0   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 733.0   Median :105.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 727.4   Mean   :136.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:169.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##  Min.   : 1137   Min.   :  0.0    Min.   :   0.0   Min.   :    0.0  
##  1st Qu.: 1419   1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  607.8  
##  Median : 1518   Median :107.0    Median : 536.5   Median :  800.0  
##  Mean   : 1779   Mean   :105.7    Mean   : 553.0   Mean   :  808.9  
##  3rd Qu.: 1682   3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  958.0  
##  Max.   :30132   Max.   :343.0    Max.   :3645.0   Max.   :19278.0  
##  TEAM_FIELDING_E 
##  Min.   :  65.0  
##  1st Qu.: 127.0  
##  Median : 159.0  
##  Mean   : 246.5  
##  3rd Qu.: 249.2  
##  Max.   :1898.0

dfmelt = melt(completedatatraining,id.var='TARGET_WINS')
p = ggplot(data = dfmelt,aes(x=variable,y=value))+geom_boxplot(aes(fill=TARGET_WINS))
p+facet_wrap(~variable,scales="free")

Appendix 1C

# correlation matrix of completed data

correlationmatrix = cor(completedatatraining)

corrandpvalues = rcorr(as.matrix(completedatatraining))

print(corrandpvalues)

##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS             1.00           0.39            0.29
## TEAM_BATTING_H          0.39           1.00            0.56
## TEAM_BATTING_2B         0.29           0.56            1.00
## TEAM_BATTING_3B         0.14           0.43           -0.11
## TEAM_BATTING_HR         0.18          -0.01            0.44
## TEAM_BATTING_BB         0.23          -0.07            0.26
## TEAM_BATTING_SO        -0.03          -0.42            0.19
## TEAM_BASERUN_SB         0.12           0.16           -0.19
## TEAM_PITCHING_H        -0.11           0.30            0.02
## TEAM_PITCHING_HR        0.19           0.07            0.45
## TEAM_PITCHING_BB        0.12           0.09            0.18
## TEAM_PITCHING_SO       -0.07          -0.23            0.08
## TEAM_FIELDING_E        -0.18           0.26           -0.24
##                  TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS                 0.14            0.18            0.23
## TEAM_BATTING_H              0.43           -0.01           -0.07
## TEAM_BATTING_2B            -0.11            0.44            0.26
## TEAM_BATTING_3B             1.00           -0.64           -0.29
## TEAM_BATTING_HR            -0.64            1.00            0.51
## TEAM_BATTING_BB            -0.29            0.51            1.00
## TEAM_BATTING_SO            -0.67            0.73            0.39
## TEAM_BASERUN_SB             0.53           -0.50           -0.34
## TEAM_PITCHING_H             0.19           -0.25           -0.45
## TEAM_PITCHING_HR           -0.57            0.97            0.46
## TEAM_PITCHING_BB            0.00            0.14            0.49
## TEAM_PITCHING_SO           -0.26            0.20           -0.01
## TEAM_FIELDING_E             0.51           -0.59           -0.66
##                  TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## TARGET_WINS                -0.03            0.12           -0.11
## TEAM_BATTING_H             -0.42            0.16            0.30
## TEAM_BATTING_2B             0.19           -0.19            0.02
## TEAM_BATTING_3B            -0.67            0.53            0.19
## TEAM_BATTING_HR             0.73           -0.50           -0.25
## TEAM_BATTING_BB             0.39           -0.34           -0.45
## TEAM_BATTING_SO             1.00           -0.33           -0.36
## TEAM_BASERUN_SB            -0.33            1.00            0.15
## TEAM_PITCHING_H            -0.36            0.15            1.00
## TEAM_PITCHING_HR            0.67           -0.44           -0.14
## TEAM_PITCHING_BB            0.06           -0.03            0.32
## TEAM_PITCHING_SO            0.42           -0.06            0.27
## TEAM_FIELDING_E            -0.58            0.59            0.67
##                  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## TARGET_WINS                  0.19             0.12            -0.07
## TEAM_BATTING_H               0.07             0.09            -0.23
## TEAM_BATTING_2B              0.45             0.18             0.08
## TEAM_BATTING_3B             -0.57             0.00            -0.26
## TEAM_BATTING_HR              0.97             0.14             0.20
## TEAM_BATTING_BB              0.46             0.49            -0.01
## TEAM_BATTING_SO              0.67             0.06             0.42
## TEAM_BASERUN_SB             -0.44            -0.03            -0.06
## TEAM_PITCHING_H             -0.14             0.32             0.27
## TEAM_PITCHING_HR             1.00             0.22             0.22
## TEAM_PITCHING_BB             0.22             1.00             0.49
## TEAM_PITCHING_SO             0.22             0.49             1.00
## TEAM_FIELDING_E             -0.49            -0.02            -0.03
##                  TEAM_FIELDING_E
## TARGET_WINS                -0.18
## TEAM_BATTING_H              0.26
## TEAM_BATTING_2B            -0.24
## TEAM_BATTING_3B             0.51
## TEAM_BATTING_HR            -0.59
## TEAM_BATTING_BB            -0.66
## TEAM_BATTING_SO            -0.58
## TEAM_BASERUN_SB             0.59
## TEAM_PITCHING_H             0.67
## TEAM_PITCHING_HR           -0.49
## TEAM_PITCHING_BB           -0.02
## TEAM_PITCHING_SO           -0.03
## TEAM_FIELDING_E             1.00
## 
## n= 2276 
## 
## 
## P
##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS                  0.0000         0.0000         
## TEAM_BATTING_H   0.0000                     0.0000         
## TEAM_BATTING_2B  0.0000      0.0000                        
## TEAM_BATTING_3B  0.0000      0.0000         0.0000         
## TEAM_BATTING_HR  0.0000      0.7550         0.0000         
## TEAM_BATTING_BB  0.0000      0.0005         0.0000         
## TEAM_BATTING_SO  0.1390      0.0000         0.0000         
## TEAM_BASERUN_SB  0.0000      0.0000         0.0000         
## TEAM_PITCHING_H  0.0000      0.0000         0.2585         
## TEAM_PITCHING_HR 0.0000      0.0005         0.0000         
## TEAM_PITCHING_BB 0.0000      0.0000         0.0000         
## TEAM_PITCHING_SO 0.0004      0.0000         0.0000         
## TEAM_FIELDING_E  0.0000      0.0000         0.0000         
##                  TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS      0.0000          0.0000          0.0000         
## TEAM_BATTING_H   0.0000          0.7550          0.0005         
## TEAM_BATTING_2B  0.0000          0.0000          0.0000         
## TEAM_BATTING_3B                  0.0000          0.0000         
## TEAM_BATTING_HR  0.0000                          0.0000         
## TEAM_BATTING_BB  0.0000          0.0000                         
## TEAM_BATTING_SO  0.0000          0.0000          0.0000         
## TEAM_BASERUN_SB  0.0000          0.0000          0.0000         
## TEAM_PITCHING_H  0.0000          0.0000          0.0000         
## TEAM_PITCHING_HR 0.0000          0.0000          0.0000         
## TEAM_PITCHING_BB 0.9155          0.0000          0.0000         
## TEAM_PITCHING_SO 0.0000          0.0000          0.6562         
## TEAM_FIELDING_E  0.0000          0.0000          0.0000         
##                  TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## TARGET_WINS      0.1390          0.0000          0.0000         
## TEAM_BATTING_H   0.0000          0.0000          0.0000         
## TEAM_BATTING_2B  0.0000          0.0000          0.2585         
## TEAM_BATTING_3B  0.0000          0.0000          0.0000         
## TEAM_BATTING_HR  0.0000          0.0000          0.0000         
## TEAM_BATTING_BB  0.0000          0.0000          0.0000         
## TEAM_BATTING_SO                  0.0000          0.0000         
## TEAM_BASERUN_SB  0.0000                          0.0000         
## TEAM_PITCHING_H  0.0000          0.0000                         
## TEAM_PITCHING_HR 0.0000          0.0000          0.0000         
## TEAM_PITCHING_BB 0.0075          0.1046          0.0000         
## TEAM_PITCHING_SO 0.0000          0.0059          0.0000         
## TEAM_FIELDING_E  0.0000          0.0000          0.0000         
##                  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## TARGET_WINS      0.0000           0.0000           0.0004          
## TEAM_BATTING_H   0.0005           0.0000           0.0000          
## TEAM_BATTING_2B  0.0000           0.0000           0.0000          
## TEAM_BATTING_3B  0.0000           0.9155           0.0000          
## TEAM_BATTING_HR  0.0000           0.0000           0.0000          
## TEAM_BATTING_BB  0.0000           0.0000           0.6562          
## TEAM_BATTING_SO  0.0000           0.0075           0.0000          
## TEAM_BASERUN_SB  0.0000           0.1046           0.0059          
## TEAM_PITCHING_H  0.0000           0.0000           0.0000          
## TEAM_PITCHING_HR                  0.0000           0.0000          
## TEAM_PITCHING_BB 0.0000                            0.0000          
## TEAM_PITCHING_SO 0.0000           0.0000                           
## TEAM_FIELDING_E  0.0000           0.2761           0.1918          
##                  TEAM_FIELDING_E
## TARGET_WINS      0.0000         
## TEAM_BATTING_H   0.0000         
## TEAM_BATTING_2B  0.0000         
## TEAM_BATTING_3B  0.0000         
## TEAM_BATTING_HR  0.0000         
## TEAM_BATTING_BB  0.0000         
## TEAM_BATTING_SO  0.0000         
## TEAM_BASERUN_SB  0.0000         
## TEAM_PITCHING_H  0.0000         
## TEAM_PITCHING_HR 0.0000         
## TEAM_PITCHING_BB 0.2761         
## TEAM_PITCHING_SO 0.1918         
## TEAM_FIELDING_E

corrplot(correlationmatrix, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

## Using the correlation and p-value matrix, TARGET_WINS is significantly correlated with every variable except TEAM_BATTING_SO (p = 0.139)

Appendix 2

## we now build our first multiple regression model

fit = lm(TARGET_WINS~.-TEAM_BATTING_SO,completedatatraining)
summary(fit)

## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO, data = completedatatraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.737  -8.712   0.120   8.437  58.150 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.6387446  3.8884918   0.679 0.497458    
## TEAM_BATTING_H    0.0500683  0.0033156  15.101  < 2e-16 ***
## TEAM_BATTING_2B  -0.0323184  0.0088988  -3.632 0.000288 ***
## TEAM_BATTING_3B   0.0641485  0.0162716   3.942 8.31e-05 ***
## TEAM_BATTING_HR   0.0495069  0.0270294   1.832 0.067143 .  
## TEAM_BATTING_BB   0.0062705  0.0056648   1.107 0.268444    
## TEAM_BASERUN_SB   0.0531803  0.0037933  14.019  < 2e-16 ***
## TEAM_PITCHING_H   0.0007901  0.0003839   2.058 0.039675 *  
## TEAM_PITCHING_HR -0.0094255  0.0238070  -0.396 0.692207    
## TEAM_PITCHING_BB  0.0024333  0.0039623   0.614 0.539202    
## TEAM_PITCHING_SO  0.0005013  0.0008170   0.614 0.539539    
## TEAM_FIELDING_E  -0.0351830  0.0026662 -13.196  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.92 on 2264 degrees of freedom
## Multiple R-squared:   0.33,  Adjusted R-squared:  0.3268 
## F-statistic: 101.4 on 11 and 2264 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))

plot(fit)

## Model 1 has 4 predictor coefficients that are not statistically significant even at the 0.10 level. Let's pull out the two largest p-values (TEAM_PITCHING_HR and TEAM_PITCHING_SO) and see how that affects the model in fit2

fit2 = lm(TARGET_WINS~.-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_SO,completedatatraining)
summary(fit2)

## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_PITCHING_HR - 
##     TEAM_PITCHING_SO, data = completedatatraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.923  -8.638   0.119   8.457  58.192 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.2147225  3.3072988   1.274 0.202663    
## TEAM_BATTING_H    0.0490732  0.0030738  15.965  < 2e-16 ***
## TEAM_BATTING_2B  -0.0308898  0.0086877  -3.556 0.000385 ***
## TEAM_BATTING_3B   0.0629724  0.0161788   3.892 0.000102 ***
## TEAM_BATTING_HR   0.0406906  0.0074319   5.475 4.85e-08 ***
## TEAM_BATTING_BB   0.0051740  0.0044093   1.173 0.240746    
## TEAM_BASERUN_SB   0.0535027  0.0037633  14.217  < 2e-16 ***
## TEAM_PITCHING_H   0.0007981  0.0003829   2.084 0.037256 *  
## TEAM_PITCHING_BB  0.0032831  0.0027314   1.202 0.229485    
## TEAM_FIELDING_E  -0.0355254  0.0026300 -13.508  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.92 on 2266 degrees of freedom
## Multiple R-squared:  0.3298, Adjusted R-squared:  0.3272 
## F-statistic: 123.9 on 9 and 2266 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))

plot(fit2)

## Model 2 still has two variables that are not statistically significant (TEAM_PITCHING_BB and TEAM_BATTING_BB) that we can weed out. Model 3 will be the fit less those variables

fit3 = lm(TARGET_WINS~.-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_SO-TEAM_PITCHING_BB-TEAM_BATTING_BB,completedatatraining)
summary(fit3)

## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_PITCHING_HR - 
##     TEAM_PITCHING_SO - TEAM_PITCHING_BB - TEAM_BATTING_BB, data = completedatatraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.609  -8.874   0.111   8.365  59.609 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.3141266  2.9912065   2.780 0.005489 ** 
## TEAM_BATTING_H   0.0484088  0.0030373  15.938  < 2e-16 ***
## TEAM_BATTING_2B -0.0293568  0.0086655  -3.388 0.000717 ***
## TEAM_BATTING_3B  0.0716738  0.0159635   4.490 7.48e-06 ***
## TEAM_BATTING_HR  0.0455804  0.0071978   6.333 2.90e-10 ***
## TEAM_BASERUN_SB  0.0550699  0.0036982  14.891  < 2e-16 ***
## TEAM_PITCHING_H  0.0010668  0.0002983   3.576 0.000357 ***
## TEAM_FIELDING_E -0.0385381  0.0024446 -15.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.94 on 2268 degrees of freedom
## Multiple R-squared:  0.3269, Adjusted R-squared:  0.3248 
## F-statistic: 157.3 on 7 and 2268 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))

plot(fit3)

## Model 3 can still be simplified by pulling out TEAM_BATTING_2B and TEAM_PITCHING_H

fit4 = lm(TARGET_WINS~.-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_SO-TEAM_PITCHING_BB-TEAM_BATTING_BB-TEAM_PITCHING_H-TEAM_BATTING_2B,completedatatraining)
summary(fit4)

## 
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_PITCHING_HR - 
##     TEAM_PITCHING_SO - TEAM_PITCHING_BB - TEAM_BATTING_BB - TEAM_PITCHING_H - 
##     TEAM_BATTING_2B, data = completedatatraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.941  -8.933   0.182   8.339  66.352 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.297838   2.911907   3.193  0.00143 ** 
## TEAM_BATTING_H   0.043781   0.002339  18.718  < 2e-16 ***
## TEAM_BATTING_3B  0.071040   0.015708   4.522 6.43e-06 ***
## TEAM_BATTING_HR  0.041090   0.007080   5.804 7.38e-09 ***
## TEAM_BASERUN_SB  0.050051   0.003465  14.445  < 2e-16 ***
## TEAM_FIELDING_E -0.031235   0.001708 -18.284  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 2270 degrees of freedom
## Multiple R-squared:  0.3203, Adjusted R-squared:  0.3188 
## F-statistic:   214 on 5 and 2270 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))

plot(fit4)
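
The step from Model 3 to Model 4 drops terms that were still significant, so it may help to check the simplification with a partial F-test on the nested models, alongside adjusted R-squared. The sketch below is illustrative only: it uses the built-in mtcars data as a stand-in, and the names full, reduced, cmp and adj are hypothetical. With the objects from this appendix, the equivalent call would be anova(fit4, fit3).

```r
# Illustrative stand-in data: mtcars replaces completedatatraining here.
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test for nested models. The null hypothesis is that the
# coefficients of the dropped terms (qsec and drat) are all zero.
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of each model, side by side.
adj <- c(reduced = summary(reduced)$adj.r.squared,
         full    = summary(full)$adj.r.squared)
print(adj)
```

A small p-value in the second row of the anova table suggests the dropped terms carry explanatory power as a group; a large one supports the smaller model.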

Appendix 3A

hist(fit4$residuals)

qqnorm(fit4$residuals)
qqline(fit4$residuals)

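# skewness() is not part of base R; a package that provides it (for example moments or e1071) must be loaded first
skewness(fit4$residuals)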

## [1] 0.05484275
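
For readers without a skewness package installed, the statistic can be computed in base R. The sketch below assumes the moment-based definition, the third central moment divided by the second central moment raised to the 3/2 power, both with divisor n. The function name sample_skewness is hypothetical, and other packages or options may use a slightly different small-sample correction, so the result need not match the value above to every digit.

```r
# Moment-based sample skewness: m3 / m2^(3/2), central moments with divisor n.
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]   # drop missing values
  d  <- x - mean(x)    # deviations from the mean
  m2 <- mean(d^2)      # second central moment
  m3 <- mean(d^3)      # third central moment
  m3 / m2^(3 / 2)
}

# A symmetric sample has zero skewness; a long right tail gives a positive value.
sym_skew   <- sample_skewness(c(-2, -1, 0, 1, 2))
right_skew <- sample_skewness(c(1, 1, 1, 2, 10))
print(c(symmetric = sym_skew, right_tailed = right_skew))
```

Applied to the model in this appendix, the call would be sample_skewness(fit4$residuals); a value near zero is consistent with roughly symmetric residuals.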

Appendix 3B

predict(fit4,newdata=evalcompletedatatraining)

##         1         2         3         4         5         6         7 
##  66.71448  66.91692  74.33463  88.80512  77.97709  74.97541  80.29598 
##         8         9        10        11        12        13        14 
##  73.28890  71.35320  72.68263  73.62612  83.15783  80.43170  79.46027 
##        15        16        17        18        19        20        21 
##  77.79081  78.79079  72.38085  81.46444  67.06066  91.52215  81.35846 
##        22        23        24        25        26        27        28 
##  83.80425  78.39718  72.22746  84.79579  88.45129  54.77286  75.35190 
##        29        30        31        32        33        34        35 
##  80.85762  75.25384  86.57856  84.08242  81.95286  82.38853  80.45831 
##        36        37        38        39        40        41        42 
##  80.74101  75.37595  89.56106  83.90218  87.32979  80.27988  86.20489 
##        43        44        45        46        47        48        49 
##  23.60629 102.96107  90.56256  91.47519  96.70438  74.25741  69.54925 
##        50        51        52        53        54        55        56 
##  76.57278  78.97788  85.22266  78.15118  73.77693  77.17613  78.94642 
##        57        58        59        60        61        62        63 
##  90.21744  74.62191  62.29379  78.17635  86.74168  76.59823  85.21853 
##        64        65        66        67        68        69        70 
##  86.21196  86.55352 100.93909  74.87785  82.49043  79.26032  87.51570 
##        71        72        73        74        75        76        77 
##  87.41311  76.00227  80.07645  84.63406  83.53398  86.22572  82.27870 
##        78        79        80        81        82        83        84 
##  82.38228  71.50317  77.84126  85.59126  89.69577  97.17700  80.36304 
##        85        86        87        88        89        90        91 
##  81.15918  80.38495  78.94351  82.04370  83.59884  90.27167  78.42373 
##        92        93        94        95        96        97        98 
##  82.75803  71.72502  82.41296  83.94514  80.39333  84.74159  96.63802 
##        99       100       101       102       103       104       105 
##  87.02933  90.65864  83.47495  71.52896  82.66486  78.20985  81.17272 
##       106       107       108       109       110       111       112 
##  82.70131  61.26755  83.45179  84.67205  59.00284  83.83428  87.95648 
##       113       114       115       116       117       118       119 
##  94.56469  91.71339  84.17833  82.67092  91.49590  82.93123  78.90285 
##       120       121       122       123       124       125       126 
##  77.04585  91.00522  66.38781  67.04003  61.05919  70.18043  87.00204 
##       127       128       129       130       131       132       133 
##  88.13301  75.56397  87.85645  93.44994  84.80089  78.65368  77.91263 
##       134       135       136       137       138       139       140 
##  85.05681  86.07824  70.52353  77.49619  77.86626  89.85170  81.45746 
##       141       142       143       144       145       146       147 
##  66.59201  70.49547  92.01889  76.26443  72.02778  72.09791  78.79598 
##       148       149       150       151       152       153       154 
##  80.84691  83.88397  81.27367  83.79820  83.10673  33.03463  72.41712 
##       155       156       157       158       159       160       161 
##  76.10251  75.61627  88.80220  71.35366  89.33885  71.64589  99.57021 
##       162       163       164       165       166       167       168 
## 100.98575  87.53355  99.89207  91.46320  86.39231  83.22247  81.85087 
##       169       170       171       172       173       174       175 
##  76.58715  81.68815  90.72254  86.94268  78.41565  89.70903  81.08591 
##       176       177       178       179       180       181       182 
##  73.47596  74.06177  74.86833  73.31127  79.38799  87.06255  85.08564 
##       183       184       185       186       187       188       189 
##  84.98129  82.30968  85.57112  99.99836  86.14053  71.12787  62.73376 
##       190       191       192       193       194       195       196 
## 111.69925  68.71517  80.20519  77.94621  79.07606  81.11658  67.72243 
##       197       198       199       200       201       202       203 
##  75.81336  77.34638  76.30824  82.87009  77.20046  80.25496  74.90562 
##       204       205       206       207       208       209       210 
##  86.43434  80.34110  79.23276  80.58263  78.73170  81.19785  71.61353 
##       211       212       213       214       215       216       217 
## 104.51662  92.36336  82.15997  67.17027  71.54462  85.03209  85.27166 
##       218       219       220       221       222       223       224 
##  95.15724  78.29498  78.03923  80.81846  80.84097  84.51920  80.76278 
##       225       226       227       228       229       230       231 
##  76.40184  75.54287  80.38288  82.39295  80.95485  76.38155  73.79972 
##       232       233       234       235       236       237       238 
##  93.60593  78.77698  85.70279  77.60486  73.56580  82.40282  78.76876 
##       239       240       241       242       243       244       245 
##  88.13596  73.83797  88.13787  85.67641  82.75440  86.44116  64.49621 
##       246       247       248       249       250       251       252 
##  87.52247  80.05688  85.54697  73.25122  89.83633  83.68202  57.63391 
##       253       254       255       256       257       258       259 
##  91.04826  34.73634  68.81767  73.71647  82.51630  85.04511  80.74351
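
The values above are point predictions only, and a few of them (rows 43, 153 and 254, for instance) sit far below the rest. One way to convey the uncertainty around each prediction is to ask predict() for prediction intervals. The sketch below is illustrative: it uses the built-in mtcars data, and the names fit, new_rows and pred are hypothetical. With the objects from this appendix, the call would be predict(fit4, newdata = evalcompletedatatraining, interval = "prediction").

```r
# Illustrative stand-in model: mtcars replaces the baseball training data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observations to score.
new_rows <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))

# interval = "prediction" returns a matrix with columns fit, lwr and upr:
# the point prediction and the bounds of a 95% prediction interval.
pred <- predict(fit, newdata = new_rows, interval = "prediction", level = 0.95)
print(pred)
```

Unusually wide intervals usually indicate rows whose predictor values lie far from the training data, which is worth checking before relying on an extreme prediction.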