At a high level the data looks relatively tidy. However, once one digs into the dataset, some inconsistencies become evident. There are a fair number of NAs in six columns: TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP. In total they account for 3478 missing observations. The largest share is in TEAM_BATTING_HBP, where 2085 of the 2276 observations (~92%) are missing. Given these obstacles I had to choose an appropriate method to transform the data. I set a baseline rule that any column with more than 10% of its observations missing could not be effectively imputed or used in the analysis. This led to the removal of TEAM_BASERUN_CS, TEAM_BATTING_HBP, and TEAM_FIELDING_DP from the analysis. I then imputed the missing values in TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_PITCHING_SO with the mice function from the mice package, using predictive mean matching to fill the missing observations.
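The 10% screening rule described above can be sketched in a few lines of base R. This is only an illustration on a made-up data frame: the column names, the values, and the extra "keep" label for columns with no missing values are assumptions, not part of the original analysis.

```r
# Toy stand-in for the training data (hypothetical columns and values)
set.seed(42)
toy = data.frame(
  wins       = round(rnorm(20, 80, 10)),
  strikeouts = round(rnorm(20, 700, 100)),
  caught     = round(rnorm(20, 50, 10)),
  hits       = round(rnorm(20, 1450, 100))
)
toy$strikeouts[3] = NA    # 1 of 20 missing  (5%)
toy$caught[1:12]  = NA    # 12 of 20 missing (60%)

# Share of missing values in each column
na_share = colMeans(is.na(toy))

# More than 10% missing: remove; some missing: impute; none missing: keep
decision = ifelse(na_share > 0.10, "remove",
                  ifelse(na_share > 0, "impute", "keep"))
print(decision)

# Drop the columns flagged for removal
cleaned = toy[, decision != "remove", drop = FALSE]
```

Computing the shares with colMeans(is.na(...)) avoids typing the NA counts by hand, so the rule keeps working if the data changes.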
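Predictive mean matching fits the specified imputation model, computes a predicted value of the target variable Y for every case, and fills each missing entry with the observed value of a case whose prediction is close to that of the missing case. Because every imputed value is a value that was actually observed, the imputations stay realistic with respect to the dataset.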
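To make the matching idea concrete, here is a simplified, single-imputation sketch written in base R. It is not the mice implementation: mice additionally perturbs the regression coefficients and repeats the process across several imputations. The function name pmm_impute, the choice of k = 5 donors, and the simulated data are all assumptions made for illustration.

```r
set.seed(123)
n = 100
x = rnorm(n)
y = 5 + 2 * x + rnorm(n)
y[sample(n, 15)] = NA                      # knock out 15 values of y

# Simplified predictive mean matching for one variable y with one predictor x
pmm_impute = function(y, x, k = 5) {
  miss = is.na(y)
  fit  = lm(y ~ x, subset = !miss)         # imputation model on observed cases
  pred = predict(fit, newdata = data.frame(x = x))  # predicted mean for every case
  observed = y[!miss]
  for (i in which(miss)) {
    # the k observed cases whose predicted means are closest to case i
    donors = order(abs(pred[!miss] - pred[i]))[1:k]
    # borrow the observed value of one randomly chosen donor
    y[i] = observed[donors[sample.int(length(donors), 1)]]
  }
  y
}

y_imputed = pmm_impute(y, x)
```

Every filled-in value is copied from an observed case, which is why the imputed column cannot contain values outside the observed range.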
Once the data had been cleansed I then moved on to my data exploration.
*Reference appendix 1A for the data preparation
Using the completed data set with the newly imputed values, we can produce summary statistics and diagrams to get a general view of the data. Of the 12 relevant variables with respect to TARGET_WINS, the boxplots suggest that TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_BB rarely coincide with wins except on the rarest occasions. This raises an interesting question: will these variables serve as statistically significant components of the linear model? This is tested in the model portion of the process.
*Reference appendix 1B for the summary statistics
Following the review of the summary statistics, I computed a correlation matrix together with its matrix of p-values to weed out the variables that are most obviously not significantly related to TARGET_WINS. TEAM_BATTING_SO is the only variable whose correlation with TARGET_WINS is not statistically significant (r = -0.03, p = 0.139), and it is therefore dropped from the linear model.
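The screening step can be illustrated with base R's cor.test, which returns the Pearson correlation and its p-value for one pair of variables. The appendix uses rcorr from Hmisc for the full matrix; the toy data, the column names, and the 0.05 cut-off below are assumptions for the sake of the example.

```r
set.seed(7)
n = 300
toy = data.frame(hits = rnorm(n), strikeouts = rnorm(n))
toy$wins = 0.5 * toy$hits + rnorm(n)       # wins depends on hits only

predictors = setdiff(names(toy), "wins")

# Correlation with the response and its p-value, one row per predictor
screen = t(sapply(predictors, function(v) {
  ct = cor.test(toy[[v]], toy$wins)        # Pearson correlation with t-test
  c(r = unname(ct$estimate), p = ct$p.value)
}))
screen = as.data.frame(screen)
screen$keep = screen$p < 0.05              # assumed significance threshold
print(screen)
```

A pairwise screen like this only looks at one predictor at a time, so a variable can pass or fail here and still behave differently once other variables are in the regression.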
*Reference appendix 1C for the Correlation matrix
Model 1:
This model is produced using the completed training data set with imputed values, which contains only the variables considered relevant and ‘complete’ with respect to the dataset. The correlation and p-value matrices indicated that TEAM_BATTING_SO is not significantly correlated with TARGET_WINS, so it is removed and all other variables in the completed training data set are kept in the model. The strongest correlation with TARGET_WINS is that of TEAM_BATTING_H (r = 0.39). Given that no variable is glaringly strongly correlated with TARGET_WINS, we need to turn to the model's statistical significance for each variable.
In model 1, four coefficients have p-values above 0.1: TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_PITCHING_BB, and TEAM_PITCHING_SO (TEAM_BATTING_HR is marginal at p = 0.067). I then pulled out the two with the largest p-values, TEAM_PITCHING_HR and TEAM_PITCHING_SO, and refit the model.
Model 2:
After pulling out TEAM_BATTING_SO, TEAM_PITCHING_HR, and TEAM_PITCHING_SO, the model still contains two variables that are not statistically significant, TEAM_BATTING_BB and TEAM_PITCHING_BB, which need to be removed.
Model 3:
After pulling out TEAM_PITCHING_BB and TEAM_BATTING_BB, all variables appear to be statistically significant. However, the model can still be made more efficient by pulling out TEAM_BATTING_2B and TEAM_PITCHING_H.
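The pruning carried out by hand across models 1 to 3 follows the general pattern of backward elimination. The sketch below automates that pattern on simulated data. It drops one variable per step rather than two, and the function name backward_eliminate, the 0.05 threshold, and the toy variables are assumptions rather than the procedure used in this report.

```r
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n),
                 noise1 = rnorm(n), noise2 = rnorm(n))
toy$y = 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)   # only x1 and x2 matter

# Refit repeatedly, removing the predictor with the largest p-value
# until every remaining predictor has p <= alpha
backward_eliminate = function(data, response, alpha = 0.05) {
  predictors = setdiff(names(data), response)
  repeat {
    fit   = lm(reformulate(predictors, response = response), data = data)
    coefs = summary(fit)$coefficients
    # p-values of the slopes, skipping the intercept in row 1
    pvals = setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    if (max(pvals) <= alpha || length(predictors) == 1) return(fit)
    predictors = setdiff(predictors, names(which.max(pvals)))
  }
}

final = backward_eliminate(toy, "y")
print(summary(final)$coefficients)
```

Removing one variable at a time is the more cautious variant, because the p-values of the remaining coefficients can shift after each refit.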
Model 4:
After retaining only the most statistically significant variables (TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BASERUN_SB, and TEAM_FIELDING_E), the model is optimized. From a conceptual standpoint, base hits, triples, and home runs by batters, stolen bases, and errors reasonably have the largest impact on whether a team wins or loses. Consistent errors would clearly diminish a team's ability to win, which is why this variable has a negative coefficient. Base hits, triples, home runs, and stolen bases are the team's core scoring variables and, from a gameplay perspective alone, clearly matter to whether it wins or loses.
Predictively, model 4 has the strongest predictions of all the models produced.
Applying model checks to model 4, it can be seen that the model's residuals are approximately normally distributed. The qqnorm plot with the qqline indicates that the residuals are in line with model validity. Of the four models produced, model 4 has the smallest skewness (0.055).
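The skewness check can be reproduced without an extra package. The helper below uses the usual moment-based definition, the third central moment divided by the second central moment raised to the power 3/2, which I take to be what skewness from the moments package computes. The helper name skew and the simulated regression are assumptions for illustration.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
skew = function(x) {
  d = x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

set.seed(11)
x = rnorm(500)
y = 1 + 2 * x + rnorm(500)                 # errors are normal by construction
fit = lm(y ~ x)
res = residuals(fit)

res_skew = skew(res)                       # should be close to 0 for normal errors
print(res_skew)

# In an interactive session the visual check is:
# qqnorm(res); qqline(res)
```

A skewness near zero only says the residuals are roughly symmetric, so it complements the Q-Q plot rather than replacing it.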
Despite having the smallest adjusted R-squared, model 4 has greater predictive power than the other three with its assortment of variables. Model 4 in fact has the largest F-statistic (214 on 5 and 2270 degrees of freedom), so we can firmly reject the null hypothesis that none of its predictors is related to TARGET_WINS; in that sense, removing the statistically non-significant variables did improve the model.
Using model 4, I applied a clean evaluation data set (imputed values for NAs in relevant columns and removal of irrelevant variables from the dataset) to produce the TARGET_WINS predictions.
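The adjusted R-squared and F-statistic quoted for the models follow directly from the multiple R-squared, the number of observations n, and the number of predictors p. The sketch below recomputes them from the rounded R-squared values in the appendix output, so small rounding differences are expected. The helper name fit_stats is an assumption.

```r
# Adjusted R-squared and overall F-statistic from R-squared, n, and p
fit_stats = function(r2, n, p) {
  adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
  f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
  c(adj_r2 = adj_r2, f_stat = f_stat)
}

n = 2276                                   # observations in the training data
model1 = fit_stats(0.3300, n, 11)          # model 1: 11 predictors
model4 = fit_stats(0.3203, n, 5)           # model 4: 5 predictors

print(round(rbind(model1, model4), 4))
```

Because R-squared is divided by p in the numerator, the F-statistic grows when predictors are removed and R-squared barely changes, which is the pattern seen from model 1 to model 4.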
*Reference Appendix 3A for validity checks and Appendix 3B for model prediction
library(ggplot2)
library(dplyr)
library(mice)
library(corrplot)
library(Hmisc)
library(moments)
library(reshape2)
## Load Training Data
trainingdata = read.csv(file='moneyball-training-data.csv',header = TRUE,sep=',')
evaldata = read.csv(file='moneyball-evaluation-data.csv',header = TRUE,sep=',')
## Summary Statistics
names(trainingdata)
## [1] "INDEX" "TARGET_WINS" "TEAM_BATTING_H"
## [4] "TEAM_BATTING_2B" "TEAM_BATTING_3B" "TEAM_BATTING_HR"
## [7] "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [10] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO"
## [16] "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
sum1 = summary(trainingdata[,2:length(names(trainingdata))]) ## exclude index
print(sum1)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
## Initial view of data frame
head(trainingdata,5) ## All variables are quantitative
## correlation matrix to see relationship between variables
cor(trainingdata[,2:length(names(trainingdata))]) ## doing this is ineffective; requires transformation to parse NAs
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS 1.0000000 0.388767521 0.28910365
## TEAM_BATTING_H 0.3887675 1.000000000 0.56284968
## TEAM_BATTING_2B 0.2891036 0.562849678 1.00000000
## TEAM_BATTING_3B 0.1426084 0.427696575 -0.10730582
## TEAM_BATTING_HR 0.1761532 -0.006544685 0.43539729
## TEAM_BATTING_BB 0.2325599 -0.072464013 0.25572610
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H -0.1099371 0.302693709 0.02369219
## TEAM_PITCHING_HR 0.1890137 0.072853119 0.45455082
## TEAM_PITCHING_BB 0.1241745 0.094193027 0.17805420
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E -0.1764848 0.264902478 -0.23515099
## TEAM_FIELDING_DP NA NA NA
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS 0.142608411 0.176153200 0.23255986
## TEAM_BATTING_H 0.427696575 -0.006544685 -0.07246401
## TEAM_BATTING_2B -0.107305824 0.435397293 0.25572610
## TEAM_BATTING_3B 1.000000000 -0.635566946 -0.28723584
## TEAM_BATTING_HR -0.635566946 1.000000000 0.51373481
## TEAM_BATTING_BB -0.287235841 0.513734810 1.00000000
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H 0.194879411 -0.250145481 -0.44977762
## TEAM_PITCHING_HR -0.567836679 0.969371396 0.45955207
## TEAM_PITCHING_BB -0.002224148 0.136927564 0.48936126
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E 0.509778447 -0.587339098 -0.65597081
## TEAM_FIELDING_DP NA NA NA
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## TARGET_WINS NA NA NA
## TEAM_BATTING_H NA NA NA
## TEAM_BATTING_2B NA NA NA
## TEAM_BATTING_3B NA NA NA
## TEAM_BATTING_HR NA NA NA
## TEAM_BATTING_BB NA NA NA
## TEAM_BATTING_SO 1 NA NA
## TEAM_BASERUN_SB NA 1 NA
## TEAM_BASERUN_CS NA NA 1
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H NA NA NA
## TEAM_PITCHING_HR NA NA NA
## TEAM_PITCHING_BB NA NA NA
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E NA NA NA
## TEAM_FIELDING_DP NA NA NA
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## TARGET_WINS NA -0.10993705 0.18901373
## TEAM_BATTING_H NA 0.30269371 0.07285312
## TEAM_BATTING_2B NA 0.02369219 0.45455082
## TEAM_BATTING_3B NA 0.19487941 -0.56783668
## TEAM_BATTING_HR NA -0.25014548 0.96937140
## TEAM_BATTING_BB NA -0.44977762 0.45955207
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP 1 NA NA
## TEAM_PITCHING_H NA 1.00000000 -0.14161276
## TEAM_PITCHING_HR NA -0.14161276 1.00000000
## TEAM_PITCHING_BB NA 0.32067616 0.22193750
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E NA 0.66775901 -0.49314447
## TEAM_FIELDING_DP NA NA NA
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## TARGET_WINS 0.124174536 NA -0.17648476
## TEAM_BATTING_H 0.094193027 NA 0.26490248
## TEAM_BATTING_2B 0.178054204 NA -0.23515099
## TEAM_BATTING_3B -0.002224148 NA 0.50977845
## TEAM_BATTING_HR 0.136927564 NA -0.58733910
## TEAM_BATTING_BB 0.489361263 NA -0.65597081
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H 0.320676162 NA 0.66775901
## TEAM_PITCHING_HR 0.221937505 NA -0.49314447
## TEAM_PITCHING_BB 1.000000000 NA -0.02283756
## TEAM_PITCHING_SO NA 1 NA
## TEAM_FIELDING_E -0.022837561 NA 1.00000000
## TEAM_FIELDING_DP NA NA NA
## TEAM_FIELDING_DP
## TARGET_WINS NA
## TEAM_BATTING_H NA
## TEAM_BATTING_2B NA
## TEAM_BATTING_3B NA
## TEAM_BATTING_HR NA
## TEAM_BATTING_BB NA
## TEAM_BATTING_SO NA
## TEAM_BASERUN_SB NA
## TEAM_BASERUN_CS NA
## TEAM_BATTING_HBP NA
## TEAM_PITCHING_H NA
## TEAM_PITCHING_HR NA
## TEAM_PITCHING_BB NA
## TEAM_PITCHING_SO NA
## TEAM_FIELDING_E NA
## TEAM_FIELDING_DP 1
## NA Transformation: Applied to TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_BASERUN_CS,TEAM_BATTING_HBP,TEAM_PITCHING_SO,TEAM_FIELDING_DP
## First, we determine the % of NAs relative to the data set.
naperTEAMBATTINGSO = 102/length(trainingdata$INDEX)
naperTEAMBASERUNSB = 131/length(trainingdata$INDEX)
naperTEAMBASERUNCS = 772/length(trainingdata$INDEX)
naperTEAMBATTINGhbp = 2085/length(trainingdata$INDEX)
naperTEAMPITCHINGSO = 102/length(trainingdata$INDEX)
naperTEAMFIELDINGDP = 286/length(trainingdata$INDEX)
## If the % of NAs is 10% or less we will impute the missing values (predictive mean matching via mice); otherwise the column is removed
percentlist = list(naperTEAMBATTINGSO,naperTEAMBASERUNSB,naperTEAMBASERUNCS,naperTEAMBATTINGhbp,naperTEAMPITCHINGSO,naperTEAMFIELDINGDP)
ifelse(percentlist>.1,"remove","impute")
## [1] "impute" "impute" "remove" "remove" "impute" "remove"
## too many entries are missing to consider the following columns relevant to the analysis: TEAM_BASERUN_CS,TEAM_BATTING_HBP,TEAM_FIELDING_DP
## we remove these columns to focus only on the relevant datapoints
todrop = c('INDEX','TEAM_BASERUN_CS','TEAM_BATTING_HBP','TEAM_FIELDING_DP')
newtrainingdata = trainingdata[,!names(trainingdata) %in% todrop]
## impute the remaining NAs with predictive mean matching
## (assumed call: the original imputation step is not shown, so mice defaults are used and results will vary with the random seed)
imputed = mice(newtrainingdata, method = 'pmm', printFlag = FALSE)
completedatatraining = complete(imputed, 1)
summary(completedatatraining)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 542.0 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 733.0 Median :105.0
## Mean : 99.61 Mean :501.6 Mean : 727.4 Mean :136.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:169.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 607.8
## Median : 1518 Median :107.0 Median : 536.5 Median : 800.0
## Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 808.9
## 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 958.0
## Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0
## TEAM_FIELDING_E
## Min. : 65.0
## 1st Qu.: 127.0
## Median : 159.0
## Mean : 246.5
## 3rd Qu.: 249.2
## Max. :1898.0
dfmelt = melt(completedatatraining,id.var='TARGET_WINS')
p = ggplot(data = dfmelt,aes(x=variable,y=value))+geom_boxplot(aes(fill=TARGET_WINS))
p+facet_wrap(~variable,scales="free")
# correlation matrix of completed data
correlationmatrix = cor(completedatatraining)
corrandpvalues = rcorr(as.matrix(completedatatraining))
print(corrandpvalues)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS 1.00 0.39 0.29
## TEAM_BATTING_H 0.39 1.00 0.56
## TEAM_BATTING_2B 0.29 0.56 1.00
## TEAM_BATTING_3B 0.14 0.43 -0.11
## TEAM_BATTING_HR 0.18 -0.01 0.44
## TEAM_BATTING_BB 0.23 -0.07 0.26
## TEAM_BATTING_SO -0.03 -0.42 0.19
## TEAM_BASERUN_SB 0.12 0.16 -0.19
## TEAM_PITCHING_H -0.11 0.30 0.02
## TEAM_PITCHING_HR 0.19 0.07 0.45
## TEAM_PITCHING_BB 0.12 0.09 0.18
## TEAM_PITCHING_SO -0.07 -0.23 0.08
## TEAM_FIELDING_E -0.18 0.26 -0.24
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS 0.14 0.18 0.23
## TEAM_BATTING_H 0.43 -0.01 -0.07
## TEAM_BATTING_2B -0.11 0.44 0.26
## TEAM_BATTING_3B 1.00 -0.64 -0.29
## TEAM_BATTING_HR -0.64 1.00 0.51
## TEAM_BATTING_BB -0.29 0.51 1.00
## TEAM_BATTING_SO -0.67 0.73 0.39
## TEAM_BASERUN_SB 0.53 -0.50 -0.34
## TEAM_PITCHING_H 0.19 -0.25 -0.45
## TEAM_PITCHING_HR -0.57 0.97 0.46
## TEAM_PITCHING_BB 0.00 0.14 0.49
## TEAM_PITCHING_SO -0.26 0.20 -0.01
## TEAM_FIELDING_E 0.51 -0.59 -0.66
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## TARGET_WINS -0.03 0.12 -0.11
## TEAM_BATTING_H -0.42 0.16 0.30
## TEAM_BATTING_2B 0.19 -0.19 0.02
## TEAM_BATTING_3B -0.67 0.53 0.19
## TEAM_BATTING_HR 0.73 -0.50 -0.25
## TEAM_BATTING_BB 0.39 -0.34 -0.45
## TEAM_BATTING_SO 1.00 -0.33 -0.36
## TEAM_BASERUN_SB -0.33 1.00 0.15
## TEAM_PITCHING_H -0.36 0.15 1.00
## TEAM_PITCHING_HR 0.67 -0.44 -0.14
## TEAM_PITCHING_BB 0.06 -0.03 0.32
## TEAM_PITCHING_SO 0.42 -0.06 0.27
## TEAM_FIELDING_E -0.58 0.59 0.67
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## TARGET_WINS 0.19 0.12 -0.07
## TEAM_BATTING_H 0.07 0.09 -0.23
## TEAM_BATTING_2B 0.45 0.18 0.08
## TEAM_BATTING_3B -0.57 0.00 -0.26
## TEAM_BATTING_HR 0.97 0.14 0.20
## TEAM_BATTING_BB 0.46 0.49 -0.01
## TEAM_BATTING_SO 0.67 0.06 0.42
## TEAM_BASERUN_SB -0.44 -0.03 -0.06
## TEAM_PITCHING_H -0.14 0.32 0.27
## TEAM_PITCHING_HR 1.00 0.22 0.22
## TEAM_PITCHING_BB 0.22 1.00 0.49
## TEAM_PITCHING_SO 0.22 0.49 1.00
## TEAM_FIELDING_E -0.49 -0.02 -0.03
## TEAM_FIELDING_E
## TARGET_WINS -0.18
## TEAM_BATTING_H 0.26
## TEAM_BATTING_2B -0.24
## TEAM_BATTING_3B 0.51
## TEAM_BATTING_HR -0.59
## TEAM_BATTING_BB -0.66
## TEAM_BATTING_SO -0.58
## TEAM_BASERUN_SB 0.59
## TEAM_PITCHING_H 0.67
## TEAM_PITCHING_HR -0.49
## TEAM_PITCHING_BB -0.02
## TEAM_PITCHING_SO -0.03
## TEAM_FIELDING_E 1.00
##
## n= 2276
##
##
## P
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS 0.0000 0.0000
## TEAM_BATTING_H 0.0000 0.0000
## TEAM_BATTING_2B 0.0000 0.0000
## TEAM_BATTING_3B 0.0000 0.0000 0.0000
## TEAM_BATTING_HR 0.0000 0.7550 0.0000
## TEAM_BATTING_BB 0.0000 0.0005 0.0000
## TEAM_BATTING_SO 0.1390 0.0000 0.0000
## TEAM_BASERUN_SB 0.0000 0.0000 0.0000
## TEAM_PITCHING_H 0.0000 0.0000 0.2585
## TEAM_PITCHING_HR 0.0000 0.0005 0.0000
## TEAM_PITCHING_BB 0.0000 0.0000 0.0000
## TEAM_PITCHING_SO 0.0004 0.0000 0.0000
## TEAM_FIELDING_E 0.0000 0.0000 0.0000
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS 0.0000 0.0000 0.0000
## TEAM_BATTING_H 0.0000 0.7550 0.0005
## TEAM_BATTING_2B 0.0000 0.0000 0.0000
## TEAM_BATTING_3B 0.0000 0.0000
## TEAM_BATTING_HR 0.0000 0.0000
## TEAM_BATTING_BB 0.0000 0.0000
## TEAM_BATTING_SO 0.0000 0.0000 0.0000
## TEAM_BASERUN_SB 0.0000 0.0000 0.0000
## TEAM_PITCHING_H 0.0000 0.0000 0.0000
## TEAM_PITCHING_HR 0.0000 0.0000 0.0000
## TEAM_PITCHING_BB 0.9155 0.0000 0.0000
## TEAM_PITCHING_SO 0.0000 0.0000 0.6562
## TEAM_FIELDING_E 0.0000 0.0000 0.0000
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## TARGET_WINS 0.1390 0.0000 0.0000
## TEAM_BATTING_H 0.0000 0.0000 0.0000
## TEAM_BATTING_2B 0.0000 0.0000 0.2585
## TEAM_BATTING_3B 0.0000 0.0000 0.0000
## TEAM_BATTING_HR 0.0000 0.0000 0.0000
## TEAM_BATTING_BB 0.0000 0.0000 0.0000
## TEAM_BATTING_SO 0.0000 0.0000
## TEAM_BASERUN_SB 0.0000 0.0000
## TEAM_PITCHING_H 0.0000 0.0000
## TEAM_PITCHING_HR 0.0000 0.0000 0.0000
## TEAM_PITCHING_BB 0.0075 0.1046 0.0000
## TEAM_PITCHING_SO 0.0000 0.0059 0.0000
## TEAM_FIELDING_E 0.0000 0.0000 0.0000
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## TARGET_WINS 0.0000 0.0000 0.0004
## TEAM_BATTING_H 0.0005 0.0000 0.0000
## TEAM_BATTING_2B 0.0000 0.0000 0.0000
## TEAM_BATTING_3B 0.0000 0.9155 0.0000
## TEAM_BATTING_HR 0.0000 0.0000 0.0000
## TEAM_BATTING_BB 0.0000 0.0000 0.6562
## TEAM_BATTING_SO 0.0000 0.0075 0.0000
## TEAM_BASERUN_SB 0.0000 0.1046 0.0059
## TEAM_PITCHING_H 0.0000 0.0000 0.0000
## TEAM_PITCHING_HR 0.0000 0.0000
## TEAM_PITCHING_BB 0.0000 0.0000
## TEAM_PITCHING_SO 0.0000 0.0000
## TEAM_FIELDING_E 0.0000 0.2761 0.1918
## TEAM_FIELDING_E
## TARGET_WINS 0.0000
## TEAM_BATTING_H 0.0000
## TEAM_BATTING_2B 0.0000
## TEAM_BATTING_3B 0.0000
## TEAM_BATTING_HR 0.0000
## TEAM_BATTING_BB 0.0000
## TEAM_BATTING_SO 0.0000
## TEAM_BASERUN_SB 0.0000
## TEAM_PITCHING_H 0.0000
## TEAM_PITCHING_HR 0.0000
## TEAM_PITCHING_BB 0.2761
## TEAM_PITCHING_SO 0.1918
## TEAM_FIELDING_E
corrplot(correlationmatrix, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
## Using the correlation and p value matrix it is shown that TARGET_WINS is significantly correlated with every variable except TEAM_BATTING_SO (p = 0.139)
## we now build our first multiple regression model
fit = lm(TARGET_WINS~.-TEAM_BATTING_SO,completedatatraining)
summary(fit)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO, data = completedatatraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.737 -8.712 0.120 8.437 58.150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6387446 3.8884918 0.679 0.497458
## TEAM_BATTING_H 0.0500683 0.0033156 15.101 < 2e-16 ***
## TEAM_BATTING_2B -0.0323184 0.0088988 -3.632 0.000288 ***
## TEAM_BATTING_3B 0.0641485 0.0162716 3.942 8.31e-05 ***
## TEAM_BATTING_HR 0.0495069 0.0270294 1.832 0.067143 .
## TEAM_BATTING_BB 0.0062705 0.0056648 1.107 0.268444
## TEAM_BASERUN_SB 0.0531803 0.0037933 14.019 < 2e-16 ***
## TEAM_PITCHING_H 0.0007901 0.0003839 2.058 0.039675 *
## TEAM_PITCHING_HR -0.0094255 0.0238070 -0.396 0.692207
## TEAM_PITCHING_BB 0.0024333 0.0039623 0.614 0.539202
## TEAM_PITCHING_SO 0.0005013 0.0008170 0.614 0.539539
## TEAM_FIELDING_E -0.0351830 0.0026662 -13.196 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.92 on 2264 degrees of freedom
## Multiple R-squared: 0.33, Adjusted R-squared: 0.3268
## F-statistic: 101.4 on 11 and 2264 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit)
## model 1 has 4 coefficients that are not statistically significant. Let's pull out the two largest p-values and see how that affects the model in fit2
fit2 = lm(TARGET_WINS~.-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_SO,completedatatraining)
summary(fit2)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_PITCHING_HR -
## TEAM_PITCHING_SO, data = completedatatraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.923 -8.638 0.119 8.457 58.192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2147225 3.3072988 1.274 0.202663
## TEAM_BATTING_H 0.0490732 0.0030738 15.965 < 2e-16 ***
## TEAM_BATTING_2B -0.0308898 0.0086877 -3.556 0.000385 ***
## TEAM_BATTING_3B 0.0629724 0.0161788 3.892 0.000102 ***
## TEAM_BATTING_HR 0.0406906 0.0074319 5.475 4.85e-08 ***
## TEAM_BATTING_BB 0.0051740 0.0044093 1.173 0.240746
## TEAM_BASERUN_SB 0.0535027 0.0037633 14.217 < 2e-16 ***
## TEAM_PITCHING_H 0.0007981 0.0003829 2.084 0.037256 *
## TEAM_PITCHING_BB 0.0032831 0.0027314 1.202 0.229485
## TEAM_FIELDING_E -0.0355254 0.0026300 -13.508 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.92 on 2266 degrees of freedom
## Multiple R-squared: 0.3298, Adjusted R-squared: 0.3272
## F-statistic: 123.9 on 9 and 2266 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit2)
## Model 2 still has two variables that are not statistically significant and can be weeded out. Model 3 will be the fit less those variables
fit3 = lm(TARGET_WINS~.-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_SO-TEAM_PITCHING_BB-TEAM_BATTING_BB,completedatatraining)
summary(fit3)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_PITCHING_HR -
## TEAM_PITCHING_SO - TEAM_PITCHING_BB - TEAM_BATTING_BB, data = completedatatraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.609 -8.874 0.111 8.365 59.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3141266 2.9912065 2.780 0.005489 **
## TEAM_BATTING_H 0.0484088 0.0030373 15.938 < 2e-16 ***
## TEAM_BATTING_2B -0.0293568 0.0086655 -3.388 0.000717 ***
## TEAM_BATTING_3B 0.0716738 0.0159635 4.490 7.48e-06 ***
## TEAM_BATTING_HR 0.0455804 0.0071978 6.333 2.90e-10 ***
## TEAM_BASERUN_SB 0.0550699 0.0036982 14.891 < 2e-16 ***
## TEAM_PITCHING_H 0.0010668 0.0002983 3.576 0.000357 ***
## TEAM_FIELDING_E -0.0385381 0.0024446 -15.764 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.94 on 2268 degrees of freedom
## Multiple R-squared: 0.3269, Adjusted R-squared: 0.3248
## F-statistic: 157.3 on 7 and 2268 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit3)
### Model three can still be made more efficient by pulling out TEAM_BATTING_2B and TEAM_PITCHING_H
fit4 = lm(TARGET_WINS~.-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_SO-TEAM_PITCHING_BB-TEAM_BATTING_BB-TEAM_PITCHING_H-TEAM_BATTING_2B,completedatatraining)
summary(fit4)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_PITCHING_HR -
## TEAM_PITCHING_SO - TEAM_PITCHING_BB - TEAM_BATTING_BB - TEAM_PITCHING_H -
## TEAM_BATTING_2B, data = completedatatraining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.941 -8.933 0.182 8.339 66.352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.297838 2.911907 3.193 0.00143 **
## TEAM_BATTING_H 0.043781 0.002339 18.718 < 2e-16 ***
## TEAM_BATTING_3B 0.071040 0.015708 4.522 6.43e-06 ***
## TEAM_BATTING_HR 0.041090 0.007080 5.804 7.38e-09 ***
## TEAM_BASERUN_SB 0.050051 0.003465 14.445 < 2e-16 ***
## TEAM_FIELDING_E -0.031235 0.001708 -18.284 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 2270 degrees of freedom
## Multiple R-squared: 0.3203, Adjusted R-squared: 0.3188
## F-statistic: 214 on 5 and 2270 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit4)
hist(fit4$residuals)
qqnorm(fit4$residuals)
qqline(fit4$residuals)
skewness(fit4$residuals)
## [1] 0.05484275
predict(fit4,newdata=evalcompletedatatraining)
## 1 2 3 4 5 6 7
## 66.71448 66.91692 74.33463 88.80512 77.97709 74.97541 80.29598
## 8 9 10 11 12 13 14
## 73.28890 71.35320 72.68263 73.62612 83.15783 80.43170 79.46027
## 15 16 17 18 19 20 21
## 77.79081 78.79079 72.38085 81.46444 67.06066 91.52215 81.35846
## 22 23 24 25 26 27 28
## 83.80425 78.39718 72.22746 84.79579 88.45129 54.77286 75.35190
## 29 30 31 32 33 34 35
## 80.85762 75.25384 86.57856 84.08242 81.95286 82.38853 80.45831
## 36 37 38 39 40 41 42
## 80.74101 75.37595 89.56106 83.90218 87.32979 80.27988 86.20489
## 43 44 45 46 47 48 49
## 23.60629 102.96107 90.56256 91.47519 96.70438 74.25741 69.54925
## 50 51 52 53 54 55 56
## 76.57278 78.97788 85.22266 78.15118 73.77693 77.17613 78.94642
## 57 58 59 60 61 62 63
## 90.21744 74.62191 62.29379 78.17635 86.74168 76.59823 85.21853
## 64 65 66 67 68 69 70
## 86.21196 86.55352 100.93909 74.87785 82.49043 79.26032 87.51570
## 71 72 73 74 75 76 77
## 87.41311 76.00227 80.07645 84.63406 83.53398 86.22572 82.27870
## 78 79 80 81 82 83 84
## 82.38228 71.50317 77.84126 85.59126 89.69577 97.17700 80.36304
## 85 86 87 88 89 90 91
## 81.15918 80.38495 78.94351 82.04370 83.59884 90.27167 78.42373
## 92 93 94 95 96 97 98
## 82.75803 71.72502 82.41296 83.94514 80.39333 84.74159 96.63802
## 99 100 101 102 103 104 105
## 87.02933 90.65864 83.47495 71.52896 82.66486 78.20985 81.17272
## 106 107 108 109 110 111 112
## 82.70131 61.26755 83.45179 84.67205 59.00284 83.83428 87.95648
## 113 114 115 116 117 118 119
## 94.56469 91.71339 84.17833 82.67092 91.49590 82.93123 78.90285
## 120 121 122 123 124 125 126
## 77.04585 91.00522 66.38781 67.04003 61.05919 70.18043 87.00204
## 127 128 129 130 131 132 133
## 88.13301 75.56397 87.85645 93.44994 84.80089 78.65368 77.91263
## 134 135 136 137 138 139 140
## 85.05681 86.07824 70.52353 77.49619 77.86626 89.85170 81.45746
## 141 142 143 144 145 146 147
## 66.59201 70.49547 92.01889 76.26443 72.02778 72.09791 78.79598
## 148 149 150 151 152 153 154
## 80.84691 83.88397 81.27367 83.79820 83.10673 33.03463 72.41712
## 155 156 157 158 159 160 161
## 76.10251 75.61627 88.80220 71.35366 89.33885 71.64589 99.57021
## 162 163 164 165 166 167 168
## 100.98575 87.53355 99.89207 91.46320 86.39231 83.22247 81.85087
## 169 170 171 172 173 174 175
## 76.58715 81.68815 90.72254 86.94268 78.41565 89.70903 81.08591
## 176 177 178 179 180 181 182
## 73.47596 74.06177 74.86833 73.31127 79.38799 87.06255 85.08564
## 183 184 185 186 187 188 189
## 84.98129 82.30968 85.57112 99.99836 86.14053 71.12787 62.73376
## 190 191 192 193 194 195 196
## 111.69925 68.71517 80.20519 77.94621 79.07606 81.11658 67.72243
## 197 198 199 200 201 202 203
## 75.81336 77.34638 76.30824 82.87009 77.20046 80.25496 74.90562
## 204 205 206 207 208 209 210
## 86.43434 80.34110 79.23276 80.58263 78.73170 81.19785 71.61353
## 211 212 213 214 215 216 217
## 104.51662 92.36336 82.15997 67.17027 71.54462 85.03209 85.27166
## 218 219 220 221 222 223 224
## 95.15724 78.29498 78.03923 80.81846 80.84097 84.51920 80.76278
## 225 226 227 228 229 230 231
## 76.40184 75.54287 80.38288 82.39295 80.95485 76.38155 73.79972
## 232 233 234 235 236 237 238
## 93.60593 78.77698 85.70279 77.60486 73.56580 82.40282 78.76876
## 239 240 241 242 243 244 245
## 88.13596 73.83797 88.13787 85.67641 82.75440 86.44116 64.49621
## 246 247 248 249 250 251 252
## 87.52247 80.05688 85.54697 73.25122 89.83633 83.68202 57.63391
## 253 254 255 256 257 258 259
## 91.04826 34.73634 68.81767 73.71647 82.51630 85.04511 80.74351