Purpose

This report covers an attempt to build a model to predict number of wins of a baseball team in a season based on several offensive and deffensive statistics. Resulting model explained about 36% of variability in the target variable and included most of the provided explanatory variables. Some potentially helpful variables were not included in the data set. For instance, number of At Bats can be used to calculate on-base percentage which may correlate strongly with winning percentage. The model can be revised with additional variables or further analysis.

Data Exploration

The data set describes baseball team statistics for the years 1871 to 2006 inclusive. Each record in the data set represents the performance of the team for the given year adjusted to the current length of the season - 162 games. The data set includes 16 variables (excluding the index) and the training set includes 2,276 records.

Dimensions

For reproducibility purposes, we have put the original data sets in git-hub account and then we will read them from there. Let’s load our data and look at the the dimensions of our data set.

#BTraining <- read.csv("moneyball-training-data.csv")
BTraining <- read.csv("https://raw.githubusercontent.com/theoracley/Data621/master/Homework1/moneyball-training-data.csv")
dim(BTraining)
## [1] 2276   17

As we can notice, the training data set has a total of 17 different variables. The total number or records available are 2276.

Structure

The below structure is currently present in the data, for simplicity purposes, we have loaded and treated this data set as a data frame in which all the variables are integers.

str(BTraining)
## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

Summary

Let’s look at the summary of our data.

BTraining.summary <- data.frame(unclass(summary(BTraining[2:17])),
check.names = FALSE,
row.names = NULL,
stringsAsFactors = FALSE)
BTraining.summary
##        TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B  TEAM_BATTING_3B
## 1 Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
## 2 1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
## 3 Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
## 4 Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
## 5 3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
## 6 Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
## 7             <NA>           <NA>            <NA>             <NA>
##    TEAM_BATTING_HR TEAM_BATTING_BB  TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
## 2 1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
## 3 Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
## 4 Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
## 5 3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
## 6 Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
## 7             <NA>            <NA>    NA's   :102     NA's   :131  
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 Min.   :  0.0    Min.   :29.00   Min.   : 1137    Min.   :  0.0  
## 2 1st Qu.: 38.0    1st Qu.:50.50   1st Qu.: 1419    1st Qu.: 50.0  
## 3 Median : 49.0    Median :58.00   Median : 1518    Median :107.0  
## 4 Mean   : 52.8    Mean   :59.36   Mean   : 1779    Mean   :105.7  
## 5 3rd Qu.: 62.0    3rd Qu.:67.00   3rd Qu.: 1682    3rd Qu.:150.0  
## 6 Max.   :201.0    Max.   :95.00   Max.   :30132    Max.   :343.0  
## 7   NA's   :772     NA's   :2085              <NA>             <NA>
##   TEAM_PITCHING_BB  TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 Min.   :   0.0   Min.   :    0.0   Min.   :  65.0    Min.   : 52.0  
## 2 1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0    1st Qu.:131.0  
## 3 Median : 536.5   Median :  813.5   Median : 159.0    Median :149.0  
## 4 Mean   : 553.0   Mean   :  817.7   Mean   : 246.5    Mean   :146.4  
## 5 3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2    3rd Qu.:164.0  
## 6 Max.   :3645.0   Max.   :19278.0   Max.   :1898.0    Max.   :228.0  
## 7             <NA>     NA's   :102               <NA>    NA's   :286

From the statistics above, we can see that there are 6 variables that contain some NA’s values, and we need to deal with them later.

BTraining.summary$TEAM_BATTING_SO[7]
## [1] "NA's   :102  "
BTraining.summary$TEAM_BASERUN_SB[7]
## [1] "NA's   :131  "
BTraining.summary$TEAM_BASERUN_CS[7]
## [1] "NA's   :772  "
BTraining.summary$TEAM_BATTING_HBP[7]
## [1] "NA's   :2085  "
BTraining.summary$TEAM_PITCHING_SO[7]
## [1] "NA's   :102  "
BTraining.summary$TEAM_FIELDING_DP[7]
## [1] "NA's   :286  "

Each variable is presented below with corresponding basic statistics (minimum, median and maximum values, mean and standard deviation, number of records with missing values and zero values), boxplot, density plot with highlighted mean value, and scatterplot against outcome variable (TARGET_WINS) with best fit line. This information is used to check general validity of data and adjust as necessary.

Let’s create a statistical container that will hold this information for each variable

sumBTraining = data.frame(Variable = character(),
                   Min = integer(),
                   Median = integer(),
                   Mean = double(),
                   SD = double(),
                   Max = integer(),
                   Num_NAs = integer(),
                   Num_Zeros = integer())
for (i in 2:17) {
  sumBTraining <- rbind(sumBTraining, data.frame(Variable = colnames(BTraining)[i],
                                   Min = min(BTraining[,i], na.rm=TRUE),
                                   Median = median(BTraining[,i], na.rm=TRUE),
                                   Mean = mean(BTraining[,i], na.rm=TRUE),
                                   SD = sd(BTraining[,i], na.rm=TRUE),
                                   Max = max(BTraining[,i], na.rm=TRUE),
                                   Num_NAs = sum(is.na(BTraining[,i])),
                                   Num_Zeros = length(which(BTraining[,i]==0)))
                 )
}
colnames(sumBTraining) <- c("", "Min", "Median", "Mean", "SD", "Max", "Num of NAs", "Num of Zeros")

Let’s go through every variable and check on the information output.

TEAM_BATTING_H: Number of team base hits (includes singles, doubles, triples and home runs)

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_H",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
891 1454 1469.27 144.5912 2554 0 0
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_H)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_H)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_H, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_H, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values. The range and distribution are reasonable.

TEAM_BATTING_2B: Number of team doubles

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_2B",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
69 238 241.2469 46.80141 458 0 0
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_2B)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_2B)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_2B, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_2B, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values. The range and distribution are reasonable.

TEAM_BATTING_3B: Number of team triples

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_3B",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 47 55.25 27.93856 223 0 2
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_3B)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_3B)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_3B, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_3B, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. There are 2 records with zero values which is unrealistic for a team in a season. One record (index 1347) has 12 variables with missing values, including the outcome variable. This record will be deleted from the data set. Second record (index 1494) has 7 missing variables, but it does have some recorded values in all categories - batting, pitching and fielding. Zero value for TEAM_BATTING_3B can be replaced with the median (because the distribution is right-skewed, median value will provide more realistic estimate).

TEAM_BATTING_HR: Number of team home runs

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_HR",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 102 99.61204 60.54687 264 0 15
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_HR)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_HR)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_HR, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_HR, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range is reasonable. The distribution is interesting because it is multimodal. Most likely this indicates major changes in game dynamics - perhaps, some rule adjustments started favoring batters. Or perhaps, this is an affect of steroid era. There are 15 records with zero values which is unrealistic for this variable. They can be imputed from other values.

TEAM_BATTING_BB: Number of team walks

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_BB",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 512 501.5589 122.6709 878 0 1
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_BB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_BB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_BB, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_BB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. There is one record (index 1347) that has a zero value. This record was discussed above (under TEAM_BATTING_3B) and it will be deleted from the data set.

TEAM_BATTING_HBP: Number of team batters hit by pitch

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_HBP",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
29 58 59.35602 12.96712 95 2085 0
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_HBP)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_HBP)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_HBP, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_HBP, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are 2,085 records - 91.6% of data set - that are missing value. Because this variable is missing for majority of records, it will not be imputed and will be left out from the regression model.

TEAM_BATTING_SO: Number of team strikeouts by batters

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_SO",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 750 735.6053 248.5264 1399 102 20
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_SO)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_SO)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_SO, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_SO, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are 122 records with missing or zero value (as wtih other variables a zero value is unrealistic). These values can be imputed. Similarly to homeruns, the distribution is multimodal, which is interesting enough for additional analysis. Another area of concern is a noticeable left tail. It is highly unlikely to have games without any strikeouts, so anything lower than 162 (average of 1 strikeout per game) is definitely suspect.

TEAM_BASERUN_SB: Number of team stolen bases

kable(sumBTraining[sumBTraining[,1]=="TEAM_BASERUN_SB",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 101 124.7618 87.79117 697 131 2
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BASERUN_SB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BASERUN_SB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BASERUN_SB, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BASERUN_SB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. The only issue are 133 records with missing or zero value. These values can be imputed in order to use these records in model building.

TEAM_BASERUN_CS: Number of team runners caught stealing

kable(sumBTraining[sumBTraining[,1]=="TEAM_BASERUN_CS",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 49 52.80386 22.95634 201 772 1
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BASERUN_CS)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BASERUN_CS)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BASERUN_CS, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BASERUN_CS, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable; however, there is significant number of missing values - 773, including one zero value. This represents a third of the entire data set. It may be possible to impute this value, but it may be necessary to leave this variable out of model building.

TEAM_FIELDING_E: Number of team fielding errors

kable(sumBTraining[sumBTraining[,1]=="TEAM_FIELDING_E",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
65 159 246.4807 227.771 1898 0 0
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_FIELDING_E)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_FIELDING_E)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_FIELDING_E, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_FIELDING_E, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values. Distribution has a very long right tail. Values in the 1,000 and above range are highly suspect. One of the highest historical number of errors is 867 errors by Washington in 1886 for 122 games. That is equal about 1,151 errors for 162 game season. There are multiple values above that number. This may unfavorably influence a model.

TEAM_FIELDING_DP: Number of team fielding double plays

kable(sumBTraining[sumBTraining[,1]=="TEAM_FIELDING_DP",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
52 149 146.3879 26.22639 228 286 0
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_FIELDING_DP)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_FIELDING_DP)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_FIELDING_DP, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_FIELDING_DP, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. Similar to a few other variables there is a medium number off missing values - 286 records. This value can be imputed.

TEAM_PITCHING_BB: Number of walks given up by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_BB",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 536.5 553.0079 166.3574 3645 0 1
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_BB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_BB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_BB, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_BB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values with the exception of record 1347 which will be deleted from model building. There are some unrealistic outliers. Current record of walks by a team in a season is held by 1949 Boston Red Sox - 835 walks in 155 games. For a 162 game season, this number is 873. This variable will be capped at 1,100 and any value over this will be set to this cap.

TEAM_PITCHING_H: Number of base hits given up by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_H",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
1137 1518 1779.21 1406.843 30132 0 0
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_H)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_H)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_H, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_H, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: Similar to TEAM_PITCHING_BB above, there are no missing value, but there issues with outliers. Based on visualizations, this variable will be capped at 13,000 and any value over this will be set to this cap.

TEAM_PITCHING_HR: Number of home runs given up by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_HR",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 107 105.6986 61.29875 343 0 15
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_HR)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_HR)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_HR, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_HR, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: This variable is more consistent than other pitching variables. The range and distribution are reasonable. Multimodality is interesting similar to a few other variables above. There are 15 zero values which can be imputed as needed.

TEAM_PITCHING_SO: Number of strikeouts by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_SO",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 813.5 817.7305 553.085 19278 102 20
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_SO)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_SO)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_SO, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_SO, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: This variable has 122 missing or zero values. They can be imputed as needed. There is also an outlier issue. Based on visualizations, this variable will be capped at 2,500 and any value over this will be set to this cap.

TARGET_WINS: Number of wins (Outcome)

kable(sumBTraining[sumBTraining[,1]=="TARGET_WINS",2:8], row.names=FALSE)
Min Median Mean SD Max Num of NAs Num of Zeros
0 82 80.79086 15.75215 146 0 1
# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TARGET_WINS)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TARGET_WINS)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TARGET_WINS, na.rm=TRUE)), color="red", linetype="dashed", size=1)

grid.arrange(bp, hp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. There are no missing values with the exception of record 1347.

Density Plots

par(mfrow = c(3, 3))

datasub = melt(BTraining[c(2:17)])
## Using  as id variables
ggplot(datasub, aes(x= value)) +
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
## Warning: Removed 3478 rows containing non-finite values (stat_density).

  • The density plots show various issues with skew an non-normality

Box Plots

Box plots can provide a visual representation of the variance of the data

vis <- melt(BTraining) %>%
   dplyr::filter(variable != "INDEX")
## Using  as id variables
ggplot(vis, aes(x = variable, y = value)) +
  geom_boxplot(show.legend = T) +
  stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
  coord_flip() +
  ylim(0, 2200)
## Warning: Removed 3671 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3671 rows containing non-finite values (stat_summary).

  • The box plots reveal that a great majority of the explanatory variables have high variances
  • Some of the variables contain extreme outliers that this graph does not show because i had to reduce the limits on the graph to get clear box plots
  • Many of the medians and means are also not aligned which demonstrates the outliers’ effects

Histograms

ggplot(vis, aes(value)) +
  geom_histogram(fill = "red")  +
  facet_wrap(~ variable, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3478 rows containing non-finite values (stat_bin).

  • The histograms reveal that very few of the variables are normally distributed
    • A few variables are multi-modal
    • Some of the variable exhibit a lot of skew (e.g. BASERUN_SB)

Missing Values

missmap(BTraining[2:17])

Correlation Matrix

# Correlation matrix
cm <- cor(BTraining, use="pairwise.complete.obs")
cm <- cm[2:17,2:17]
names <- c("Wins", "H", "2B", "3B", "HR", "BB", "SO", "SB", "CS", "HBP", "P-H", "P-HR", "P-BB", "P-SO", "E", "DP")
colnames(cm) <- names; rownames(cm) <- names
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "latex", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- names
cmout %>%
  kable("latex", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(0, angle = -90)
# cmout %>%
#   kable("html", escape = F, align = "c", row.names = TRUE) %>%
#   kable_styling("striped", full_width = F)
#   

Anything over 0.5 or under -0.5 is highlighted in blue. The matrix was created using complete pairwise observations.

A few conclusions:

  • Not surprisingly there is a very strong correlation between home runs batted in and home runs given up by pitching.
  • There is a negative correlation between number of triples and home runs. A less powerful team may not have enough power to hit home runs, but they get a lot of triples.
  • There is a strong positive correlation between number of strikeouts and home runs. More swings of the bat results in more home runs.

Other correlation visualization

#chart.Correlation(BTraining)

corr <- cor(cm, use="complete.ob")
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE, lab_size = 3, digits = 2)

cor_res <- cor(BTraining[c(2:17)], use = "na.or.complete")
corrplot(cor_res,
  type = "upper",
  order = "original",
  tl.col = "black",
  tl.srt = 45,
  tl.cex = 0.55)

Let’s read the correlations respective values:

cor_res.df <- data.frame(cor_res)
cor_res.df[1]
##                  TARGET_WINS
## TARGET_WINS       1.00000000
## TEAM_BATTING_H    0.46994665
## TEAM_BATTING_2B   0.31298400
## TEAM_BATTING_3B  -0.12434586
## TEAM_BATTING_HR   0.42241683
## TEAM_BATTING_BB   0.46868793
## TEAM_BATTING_SO  -0.22889273
## TEAM_BASERUN_SB   0.01483639
## TEAM_BASERUN_CS  -0.17875598
## TEAM_BATTING_HBP  0.07350424
## TEAM_PITCHING_H   0.47123431
## TEAM_PITCHING_HR  0.42246683
## TEAM_PITCHING_BB  0.46839882
## TEAM_PITCHING_SO -0.22936481
## TEAM_FIELDING_E  -0.38668800
## TEAM_FIELDING_DP -0.19586601

We notice also that there are some very strong positive correlations between other variables, represented in the correlation graphic above, among themselves. some of these are:

  • TEAM_BATTING_H is strongly correlated in a positive way with TEAM_PITCHING_H
cor_res.df['TEAM_BATTING_H','TEAM_PITCHING_H']
## [1] 0.9991927
  • TEAM_BATTING_HR is strongly correlated in a positive way with TEAM_PITCHING_HR
cor_res.df['TEAM_BATTING_HR','TEAM_PITCHING_HR']
## [1] 0.9999326
  • TEAM_BATTING_BB is strongly correlated in a positive way with TEAM_PITCHING_BB
cor_res.df['TEAM_BATTING_BB','TEAM_PITCHING_BB']
## [1] 0.9998814
  • TEAM_BATTING_SO is strongly correlated in a positive way with TEAM_PITCHING_SO
cor_res.df['TEAM_BATTING_SO','TEAM_PITCHING_SO']
## [1] 0.9997684

Primary insigths

Based on the above data exploration, we could note the following:

  • Missing values

It is confirmed the presence of missing values and these need to be address in a case by case - (will be taken care of in data Preparation section).

  • Zero values

It is confirmed the presence of Zero values as minimum entries in the data - (will be taken care of in data Preparation section).

  • Correlations

There seems to be very strong correlations to the target variable.

In regards to correlations related to other variables, it is confirmed that some variables have stronh correlation between each others.

Data Preparation

The following steps and/or assumptions will be considered. The idea is to make our given training data set more homogeneous and workable. The final goal is to be able to predict the TARGET_WINS with our data.

As noted in the Data Exploration section, the following adjustments have been performed:

# Remove observations with no target
BTraining <- BTraining[which(BTraining$TARGET_WINS!=0), ]

# Reset zero values
BTraining[which(BTraining$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BTraining[which(BTraining$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BTraining[which(BTraining$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BTraining[which(BTraining$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute missing values
BTrainingImpute <- aregImpute(~ TARGET_WINS + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                         TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
                         TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
                         TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BTraining, n.impute = 10)
BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BTraining, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BTraining$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BTraining$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BTraining$TEAM_BATTING_3B <- BTrainingI$TEAM_BATTING_3B
BTraining$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BTraining$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BTraining$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BTraining$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BTraining$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO
BTrainingImpute$rsq
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##        0.7238384        0.9872944        0.9275487        0.6804871 
##  TEAM_BASERUN_CS TEAM_FIELDING_DP TEAM_PITCHING_HR TEAM_PITCHING_SO 
##        0.6379649        0.4685603        0.9835082        0.7863081
# Adjust outliers
BTraining[which(BTraining$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BTraining[which(BTraining$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BTraining[which(BTraining$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

# Create singles
BTraining$TEAM_BATTING_S <- BTraining$TEAM_BATTING_H - BTraining$TEAM_BATTING_2B - BTraining$TEAM_BATTING_3B - BTraining$TEAM_BATTING_HR
summary(BTraining$TEAM_BATTING_S)

# Create log fielding error
BTraining$TEAM_FIELDING_E_LOG <- log(BTraining$TEAM_FIELDING_E)

Model Building

Model 1

The first model includes several variables, selected manually, that have higher than average correlation to the target variable. They cover hitting, walking and fielding errors.

model1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.971  -9.022   0.101   9.062  51.557 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.089556   3.311975   1.235    0.217    
## TEAM_BATTING_H   0.049143   0.002100  23.403  < 2e-16 ***
## TEAM_BATTING_BB  0.016107   0.003136   5.137 3.03e-07 ***
## TEAM_FIELDING_E -0.014493   0.001768  -8.196 4.11e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.7 on 2271 degrees of freedom
## Multiple R-squared:  0.2356, Adjusted R-squared:  0.2346 
## F-statistic: 233.3 on 3 and 2271 DF,  p-value: < 2.2e-16

All variables are significant, but the \(R^2\) value is relatively small at 0.2356.

Model 2

The second model expand the base hit variable, TEAM_BATTING_H, into its components - singles, doubles, triples and home runs.

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
           TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, 
##     data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.014  -8.792   0.093   8.751  59.151 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.022597   3.444892   2.329  0.01996 *  
## TEAM_BATTING_S   0.045595   0.003158  14.437  < 2e-16 ***
## TEAM_BATTING_2B  0.023042   0.007415   3.107  0.00191 ** 
## TEAM_BATTING_3B  0.159706   0.015181  10.520  < 2e-16 ***
## TEAM_BATTING_HR  0.077205   0.007713  10.009  < 2e-16 ***
## TEAM_BATTING_BB  0.012727   0.003200   3.977 7.21e-05 ***
## TEAM_FIELDING_E -0.018777   0.001971  -9.525  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.53 on 2268 degrees of freedom
## Multiple R-squared:  0.2559, Adjusted R-squared:  0.2539 
## F-statistic:   130 on 6 and 2268 DF,  p-value: < 2.2e-16

All variables are still significant and \(R^2\) is slightly improved at 0.2574. Another variation of this model - with log-transformed fielding error variable - produced slightly worse results.

Model 3

The third model includes several variables manually selected to try and cover different aspects of the game:

  • TEAM_BATTING_SO:TEAM_BATTING_H interaction covers offensive successes (hits) and failures (strikeouts).
  • Similarly TEAM_BATTING_BB:TEAM_BATTING_H interaction covers interaction between hits and walks.
  • TEAM_BASERUN_SB covers base running.
  • TEAM_FIELDING_DP and TEAM_FIELDING_E_LOG cover fielding performance.
  • TEAM_PITCHING_HR covers pitching performance.
model3 <- lm(TARGET_WINS ~ TEAM_BATTING_SO:TEAM_BATTING_H + TEAM_BATTING_BB:TEAM_BATTING_H + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_FIELDING_E_LOG, data=BTraining)
summary(model3)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO:TEAM_BATTING_H + TEAM_BATTING_BB:TEAM_BATTING_H + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_HR + 
##     TEAM_FIELDING_E_LOG, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.380  -8.256   0.058   8.256  65.785 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.703e+02  6.666e+00  25.549  < 2e-16 ***
## TEAM_BATTING_SO                -8.773e-02  5.562e-03 -15.772  < 2e-16 ***
## TEAM_BASERUN_SB                 5.315e-02  4.204e-03  12.642  < 2e-16 ***
## TEAM_FIELDING_DP               -1.485e-01  1.322e-02 -11.234  < 2e-16 ***
## TEAM_PITCHING_HR                4.984e-02  7.562e-03   6.592 5.40e-11 ***
## TEAM_FIELDING_E_LOG            -1.399e+01  9.191e-01 -15.222  < 2e-16 ***
## TEAM_BATTING_SO:TEAM_BATTING_H  4.672e-05  4.259e-06  10.971  < 2e-16 ***
## TEAM_BATTING_H:TEAM_BATTING_BB  1.032e-05  1.924e-06   5.361 9.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 2267 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3038 
## F-statistic: 142.7 on 7 and 2267 DF,  p-value: < 2.2e-16

All variables are statistically significant and there is noticeable improvement of the \(R^2\) value at 0.3027.

Model 4

The fourth model started with all variables and used backward elimination to arrive at the optimal model. It started with the following variables: TEAM_BATTING_S, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_HR. It was necessary to remove only one variable - TEAM_BASERUN_CS - to arrive at a model with all significant variables.

#model started with all variables
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)

#then removed TEAM_BASERUN_CS variable, the model becomes
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)

summary(model4)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.710  -8.295   0.157   8.179  51.173 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.462823   5.048779   4.845 1.35e-06 ***
## TEAM_BATTING_S    0.046571   0.003651  12.756  < 2e-16 ***
## TEAM_BATTING_2B   0.022722   0.007067   3.215 0.001323 ** 
## TEAM_BATTING_3B   0.101053   0.015282   6.613 4.69e-11 ***
## TEAM_BATTING_HR   0.112250   0.028906   3.883 0.000106 ***
## TEAM_BATTING_BB   0.048538   0.009773   4.967 7.32e-07 ***
## TEAM_BATTING_SO  -0.025144   0.004426  -5.681 1.51e-08 ***
## TEAM_BASERUN_SB   0.049378   0.004159  11.873  < 2e-16 ***
## TEAM_FIELDING_DP -0.133065   0.012703 -10.475  < 2e-16 ***
## TEAM_FIELDING_E  -0.041039   0.002734 -15.011  < 2e-16 ***
## TEAM_PITCHING_BB -0.031511   0.008133  -3.875 0.000110 ***
## TEAM_PITCHING_H   0.002598   0.000566   4.590 4.67e-06 ***
## TEAM_PITCHING_SO  0.015793   0.003458   4.567 5.22e-06 ***
## TEAM_PITCHING_HR  0.005472   0.026031   0.210 0.833511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 2261 degrees of freedom
## Multiple R-squared:  0.3599, Adjusted R-squared:  0.3562 
## F-statistic: 97.78 on 13 and 2261 DF,  p-value: < 2.2e-16

The \(R^2\) value is 0.3581.

Model 5

Additionally, several models were created by trying out some variables and there interactions. Variables were selected either based on theoretical expectation or correlation information from the first section. The following model has \(R^2\) values of 0.3279, which is relatively close to the fourth model; however, this model has fewer variables and may be preferential because of its simplicity.

model5 <- lm(TARGET_WINS ~ TEAM_BATTING_S  + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BASERUN_SB  + 
           TEAM_FIELDING_E_LOG*TEAM_PITCHING_H, data=BTraining)
summary(model5)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BASERUN_SB + TEAM_FIELDING_E_LOG * 
##     TEAM_PITCHING_H, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.102  -8.859  -0.125   8.612  53.696 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         43.5980264  8.2083937   5.311 1.19e-07 ***
## TEAM_BATTING_S                       0.0462442  0.0030660  15.083  < 2e-16 ***
## TEAM_BATTING_3B                      0.1583804  0.0152079  10.414  < 2e-16 ***
## TEAM_BATTING_HR                      0.0733046  0.0071637  10.233  < 2e-16 ***
## TEAM_BASERUN_SB                      0.0542861  0.0038357  14.153  < 2e-16 ***
## TEAM_FIELDING_E_LOG                 -8.8627398  1.4160051  -6.259 4.62e-10 ***
## TEAM_PITCHING_H                      0.0286916  0.0045198   6.348 2.63e-10 ***
## TEAM_FIELDING_E_LOG:TEAM_PITCHING_H -0.0040987  0.0006387  -6.417 1.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.87 on 2267 degrees of freedom
## Multiple R-squared:  0.3268, Adjusted R-squared:  0.3247 
## F-statistic: 157.2 on 7 and 2267 DF,  p-value: < 2.2e-16

The \(R^2\) value is 0.3294.

Model Selection

Based on \(R^2\) value, the fourth model (\(R^2\)=0.358)was selected for further analysis. This model also has the lowest AIC score.

AIC(model1, model2, model3, model4, model5)
##        df      AIC
## model1  5 18372.69
## model2  8 18317.53
## model3  9 18161.22
## model4 15 17989.07
## model5  9 18091.76
summary(model4)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.710  -8.295   0.157   8.179  51.173 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.462823   5.048779   4.845 1.35e-06 ***
## TEAM_BATTING_S    0.046571   0.003651  12.756  < 2e-16 ***
## TEAM_BATTING_2B   0.022722   0.007067   3.215 0.001323 ** 
## TEAM_BATTING_3B   0.101053   0.015282   6.613 4.69e-11 ***
## TEAM_BATTING_HR   0.112250   0.028906   3.883 0.000106 ***
## TEAM_BATTING_BB   0.048538   0.009773   4.967 7.32e-07 ***
## TEAM_BATTING_SO  -0.025144   0.004426  -5.681 1.51e-08 ***
## TEAM_BASERUN_SB   0.049378   0.004159  11.873  < 2e-16 ***
## TEAM_FIELDING_DP -0.133065   0.012703 -10.475  < 2e-16 ***
## TEAM_FIELDING_E  -0.041039   0.002734 -15.011  < 2e-16 ***
## TEAM_PITCHING_BB -0.031511   0.008133  -3.875 0.000110 ***
## TEAM_PITCHING_H   0.002598   0.000566   4.590 4.67e-06 ***
## TEAM_PITCHING_SO  0.015793   0.003458   4.567 5.22e-06 ***
## TEAM_PITCHING_HR  0.005472   0.026031   0.210 0.833511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 2261 degrees of freedom
## Multiple R-squared:  0.3599, Adjusted R-squared:  0.3562 
## F-statistic: 97.78 on 13 and 2261 DF,  p-value: < 2.2e-16

All variables used in this model have statistical significance at 0.01 level. The F-statistic is high with a p-value nearly 0 and, therefore, is significant. Median value of residuals is close to 0 and they are equally distributed. Standard errors are significantly smaller than estimated coefficients.

Only 4 variables are negatively correlated - strikeouts, double plays, errors, and home runs allowed. Remaining variables are positively correlated. Some correlation is counter-intuitive. For example, double plays are considered successful defensive moves and should increase the winning percentage. Similarly, allowing base hits should decrease winning percentage. The model indicates otherwise and there are probably there factors that influence these variables.

Consider residuals plotted against data index. There is no pattern.

plot(model4$residuals, ylab="Residuals")
abline(h=0)

Plotting fitted values against the residuals is more problematic. Although there is no pattern among residuals, there are some outliers and variability does not appear to be constant across the entire range.

plot(model4$fitted.values, model4$residuals, xlab="Fitted Values", ylab="Residuals")
abline(h=0)

Q-Q plot confirms that residuals are normally distributed.

qqnorm(model4$residuals)
qqline(model4$residuals)

Prediction

Using selected model and evaluation data (transformed similarly to training data), prediction table is as follows. It includes predicted number of wins along with confidence interval(CI).

#BEvaluation <- read.csv("moneyball-evaluation-data.csv")
BEvaluation <- read.csv("https://raw.githubusercontent.com/theoracley/Data621/master/Homework1/moneyball-evaluation-data.csv")

BEvaluation[which(BEvaluation$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute mimssing values
BTrainingImpute <- aregImpute(~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                         TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
                         TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
                         TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BEvaluation, n.impute = 10)
BTrainingImpute
BTrainingImpute$rsq

BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BEvaluation, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BEvaluation$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BEvaluation$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BEvaluation$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BEvaluation$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BEvaluation$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BEvaluation$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BEvaluation$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

# Adjust outliers
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BEvaluation[which(BEvaluation$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

BEvaluation$TEAM_BATTING_S <- BEvaluation$TEAM_BATTING_H - BEvaluation$TEAM_BATTING_2B - BEvaluation$TEAM_BATTING_3B - BEvaluation$TEAM_BATTING_HR

BEvaluation$PREDICT_WIN <- predict(model4, newdata=BEvaluation, interval="confidence")
BTrainingPredict <- cbind(BEvaluation$INDEX, BEvaluation$PREDICT_WIN[, 1], BEvaluation$PREDICT_WIN[, 2], BEvaluation$PREDICT_WIN[, 3])
colnames(BTrainingPredict) <- c("Index", "Predicted Wins", "CI Lower", "CI Upper")
kable(round(BTrainingPredict,0))
Index Predicted Wins CI Lower CI Upper
9 62 60 63
10 64 63 66
14 74 73 75
47 87 86 88
60 65 62 69
63 61 58 63
74 87 85 90
83 76 74 78
98 70 69 71
120 72 71 74
123 68 66 69
135 81 80 83
138 81 79 82
140 82 80 83
151 85 84 87
153 76 75 78
171 74 73 75
184 78 77 79
193 74 72 75
213 90 88 92
217 81 80 83
226 83 82 85
230 80 78 81
241 70 69 72
291 82 80 83
294 88 86 89
300 42 37 46
348 73 71 74
350 83 81 85
357 74 72 75
367 91 89 92
368 85 84 86
372 82 80 83
382 83 82 85
388 79 78 80
396 86 85 88
398 75 75 76
403 90 88 92
407 84 81 87
410 92 90 93
412 82 81 84
414 92 90 94
436 9 3 16
440 109 105 112
476 96 94 98
479 93 91 95
481 105 103 107
501 76 75 78
503 68 67 69
506 79 78 80
519 75 74 76
522 85 83 86
550 76 75 77
554 72 71 74
566 74 74 75
578 78 77 79
596 93 91 94
599 76 75 78
605 64 63 66
607 75 73 78
614 88 86 89
644 74 72 76
692 89 88 90
699 85 83 87
700 84 82 86
716 96 93 99
721 73 72 75
722 79 78 81
729 78 77 80
731 91 89 92
746 85 82 87
763 69 67 70
774 76 75 78
776 89 87 91
788 80 78 82
789 84 83 86
792 83 82 84
811 84 83 85
835 74 72 75
837 76 75 78
861 83 81 85
862 87 86 89
863 96 95 98
871 73 71 75
879 85 84 86
887 80 78 81
892 84 82 85
904 84 83 85
909 91 89 92
925 91 90 93
940 78 76 79
951 91 80 101
976 72 70 73
981 90 88 92
983 88 86 89
984 88 86 90
989 90 88 92
995 104 102 106
1000 86 85 88
1001 87 85 89
1007 78 77 80
1016 73 71 74
1027 83 82 84
1033 84 82 85
1070 79 77 80
1081 59 57 62
1084 47 43 51
1098 76 75 78
1150 87 86 89
1160 60 58 63
1169 85 84 86
1172 86 84 88
1174 95 94 96
1176 92 91 93
1178 80 79 82
1184 78 77 79
1193 86 84 87
1196 81 80 82
1199 73 72 74
1207 80 78 82
1218 92 90 93
1223 68 66 70
1226 68 66 69
1227 66 64 68
1229 68 67 70
1241 87 85 88
1244 90 89 92
1246 76 75 77
1248 93 91 94
1249 91 89 92
1253 85 84 87
1261 79 78 81
1305 80 79 82
1314 87 85 88
1323 88 87 90
1328 66 63 68
1353 74 72 75
1363 76 75 77
1371 89 88 91
1372 82 81 83
1389 64 63 66
1393 70 68 72
1421 90 89 92
1431 71 70 73
1437 71 70 72
1442 71 70 72
1450 77 76 78
1463 79 78 80
1464 78 77 79
1470 83 82 84
1471 82 80 83
1484 81 80 82
1495 62 52 71
1507 68 66 70
1514 77 76 79
1526 70 68 72
1549 90 89 92
1552 80 78 82
1556 92 90 94
1564 73 71 74
1585 105 102 107
1586 109 107 111
1590 94 93 96
1591 105 102 107
1592 98 96 100
1603 89 87 91
1612 81 80 83
1634 81 79 82
1645 73 71 74
1647 80 79 81
1673 94 92 96
1674 91 89 92
1687 80 78 81
1688 94 93 96
1700 82 81 84
1708 73 71 74
1713 78 76 79
1717 69 67 70
1721 73 72 74
1730 80 79 81
1737 91 88 93
1748 90 88 91
1749 86 85 87
1763 85 84 86
1768 85 77 93
1778 92 88 95
1780 92 89 94
1782 46 42 50
1784 59 57 62
1794 118 115 121
1803 69 68 71
1804 79 77 81
1819 75 73 76
1832 76 74 77
1833 78 77 80
1844 66 64 67
1847 77 76 78
1854 85 83 87
1855 79 78 80
1857 84 83 86
1864 75 73 76
1865 80 78 81
1869 74 72 75
1880 89 86 92
1881 81 80 82
1882 85 84 86
1894 78 76 79
1896 78 76 79
1916 86 85 88
1918 74 72 76
1921 110 108 113
1926 96 94 98
1938 82 80 84
1979 63 62 65
1982 66 65 68
1987 83 82 84
1997 78 76 80
2004 95 94 97
2011 77 76 78
2015 79 78 80
2022 78 77 79
2025 74 72 75
2027 81 80 82
2031 72 70 73
2036 73 69 78
2066 75 74 76
2073 82 81 83
2087 80 78 81
2092 82 81 83
2125 71 68 73
2148 82 80 84
2162 93 92 94
2191 78 76 79
2203 89 87 90
2218 80 79 81
2221 75 74 76
2225 83 82 85
2232 77 76 78
2267 93 91 95
2291 71 69 72
2299 89 87 90
2317 86 84 87
2318 83 82 84
2353 81 80 82
2403 61 59 62
2411 87 85 89
2415 81 79 82
2424 85 84 86
2441 71 70 73
2464 84 82 85
2465 80 79 82
2472 50 46 53
2481 96 94 98
2487 17 9 24
2500 69 68 70
2501 77 76 79
2520 84 82 85
2521 85 83 86
2525 76 74 78

Saving our prediction to a prediction file (ourPrediction.csv)

write.csv(BTrainingPredict, file = "ourPrediction.csv") 

APPENDIX:

# Required libraries
library(tidyverse)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(Hmisc)
library(PerformanceAnalytics)
library(ggcorrplot)
library(corrplot)
library(Amelia)
library(reshape)
library(ggplot2)

# Import data
BTraining <- read.csv("moneyball-training-data.csv")

#structure
str(BTraining)

#Summary
BTraining.summary <- data.frame(unclass(summary(BTraining[2:17])),
check.names = FALSE,
row.names = NULL,
stringsAsFactors = FALSE)
BTraining.summary

#Columns having NA values
BTraining.summary$TEAM_BATTING_SO[7]
BTraining.summary$TEAM_BASERUN_SB[7]
BTraining.summary$TEAM_BASERUN_CS[7]
BTraining.summary$TEAM_BATTING_HBP[7]
BTraining.summary$TEAM_PITCHING_SO[7]
BTraining.summary$TEAM_FIELDING_DP[7]

# Get summary table
sumBTraining = data.frame(Variable = character(),
                   Min = integer(),
                   Median = integer(),
                   Mean = double(),
                   SD = double(),
                   Max = integer(),
                   Num_NAs = integer(),
                   Num_Zeros = integer())

for (i in 2:17) {
  sumBTraining <- rbind(sumBTraining, data.frame(Variable = colnames(BTraining)[i],
                                   Min = min(BTraining[,i], na.rm=TRUE),
                                   Median = median(BTraining[,i], na.rm=TRUE),
                                   Mean = mean(BTraining[,i], na.rm=TRUE),
                                   SD = sd(BTraining[,i], na.rm=TRUE),
                                   Max = max(BTraining[,i], na.rm=TRUE),
                                   Num_NAs = sum(is.na(BTraining[,i])),
                                   Num_Zeros = length(which(BTraining[,i]==0)))
                 )
}

# ----This section is for Exploration of Data used for every variable----
kable(sumBTraining[sumBTraining[,1]=="TEAM_BASERUN_SB",2:8], row.names=FALSE)

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BASERUN_SB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), 
                                     axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BASERUN_SB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BASERUN_SB, na.rm=TRUE)), color="red", 
             linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BASERUN_SB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

##-----End of exploratory Data for every variable--------


#Density plots
par(mfrow = c(3, 3))
datasub = melt(BTraining[c(2:17)])
ggplot(datasub, aes(x= value)) +
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free')


# Correlation matrix
cm <- cor(BTraining, use="pairwise.complete.obs")
cm <- cm[2:17,2:17]
names <- c("Wins", "H", "2B", "3B", "HR", "BTraining", "SO", "SB", "CS", "HBP", "P-H", 
           "P-HR", "P-BTraining", "P-SO", "E", "DP")
colnames(cm) <- names; rownames(cm) <- names
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "html", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- names
cmout %>%
  kable("html", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F)


#Box plots
vis <- melt(BTraining) %>%
   dplyr::filter(variable != "INDEX")

ggplot(vis, aes(x = variable, y = value)) +
  geom_boxplot(show.legend = T) +
  stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
  coord_flip() +
  ylim(0, 2200)

#Histograms
ggplot(vis, aes(value)) +
  geom_histogram(fill = "red")  +
  facet_wrap(~ variable, scales = "free")

#missing values
missmap(BTraining[2:17])


# Correlation matrix
cm <- cor(BTraining, use="pairwise.complete.obs")
cm <- cm[2:17,2:17]
names <- c("Wins", "H", "2B", "3B", "HR", "BB", "SO", "SB", "CS", "HBP", "P-H", "P-HR", "P-BB", "P-SO", "E", "DP")
colnames(cm) <- names; rownames(cm) <- names
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "latex", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- names
cmout %>%
  kable("latex", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(0, angle = -90)


#Other visualization of correlation
#1.
corr <- cor(cm, use="complete.ob")
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE, lab_size = 3, digits = 2)

#2.
cor_res <- cor(BTraining[c(2:17)], use = "na.or.complete")
corrplot(cor_res,
  type = "upper",
  order = "original",
  tl.col = "black",
  tl.srt = 45,
  tl.cex = 0.55)

#correlation values
cor_res.df <- data.frame(cor_res)
cor_res.df[1]


# Remove observations with no target
BTraining <- BTraining[which(BTraining$TARGET_WINS!=0), ]

# Reset zero values
BTraining[which(BTraining$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BTraining[which(BTraining$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BTraining[which(BTraining$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BTraining[which(BTraining$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute mimssing values
BTrainingImpute <- aregImpute(~ TARGET_WINS + TEAM_BATTING_H + TEAM_BATTING_2B + 
                         TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
                         TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
                         TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
                         TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BTraining, n.impute = 10)

BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BTraining, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BTraining$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BTraining$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BTraining$TEAM_BATTING_3B <- BTrainingI$TEAM_BATTING_3B
BTraining$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BTraining$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BTraining$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BTraining$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BTraining$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

BTrainingImpute$rsq

# Adjust outliers
BTraining[which(BTraining$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BTraining[which(BTraining$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BTraining[which(BTraining$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

# Creat singles
BTraining$TEAM_BATTING_S <- BTraining$TEAM_BATTING_H - BTraining$TEAM_BATTING_2B - 
  BTraining$TEAM_BATTING_3B - BTraining$TEAM_BATTING_HR
summary(BTraining$TEAM_BATTING_S)

# Create log fielding error
BTraining$TEAM_FIELDING_E_LOG <- log(BTraining$TEAM_FIELDING_E)

# Model building
model1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model1)

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
           TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model2)
model2b <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
            TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E_LOG, data=BTraining)
summary(model2b)

model3 <- lm(TARGET_WINS ~ TEAM_BATTING_SO:TEAM_BATTING_H + TEAM_BATTING_BB:TEAM_BATTING_H + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_FIELDING_E_LOG, data=BTraining)
summary(model3)

model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)
summary(model4)
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)
summary(model4)

model5 <- lm(TARGET_WINS ~ TEAM_BATTING_S  + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BASERUN_SB  + 
           TEAM_FIELDING_E_LOG*TEAM_PITCHING_H, data=BTraining)
summary(model5)

#AIC Score
AIC(model1, model2, model3, model4, model5)
summary(model4)

# Residuals plots
plot(model4$residuals, ylab="Residuals")
abline(h=0)

plot(model4$fitted.values, m4$residuals, xlab="Fitted Values", ylab="Residuals")
abline(h=0)

qqnorm(model4$residuals)
qqline(model4$residuals)

# Test evaluation data for prediction
#BEvaluation <- read.csv("moneyball-evaluation-data.csv")
BEvaluation <- read.csv("https://raw.githubusercontent.com/theoracley/Data621/master/Homework1/moneyball-evaluation-data.csv")

BEvaluation[which(BEvaluation$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute mimssing values
BTrainingImpute <- aregImpute(~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                         TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
                         TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP + 
                         TEAM_FIELDING_E + TEAM_PITCHING_BB + TEAM_PITCHING_H + 
                         TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BEvaluation, n.impute = 10)

BTrainingImpute$rsq

BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BEvaluation, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BEvaluation$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BEvaluation$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BEvaluation$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BEvaluation$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BEvaluation$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BEvaluation$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BEvaluation$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

# Adjust outliers
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BEvaluation[which(BEvaluation$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

BEvaluation$TEAM_BATTING_S <- BEvaluation$TEAM_BATTING_H - BEvaluation$TEAM_BATTING_2B - 
  BEvaluation$TEAM_BATTING_3B - BEvaluation$TEAM_BATTING_HR

BEvaluation$PREDICT_WIN <- predict(model4, newdata=BEvaluation, interval="confidence")

BTrainingPredict <- cbind(BEvaluation$INDEX, BEvaluation$PREDICT_WIN[, 1], BEvaluation$PREDICT_WIN[, 2], 
                   BEvaluation$PREDICT_WIN[, 3])
colnames(BTrainingPredict) <- c("Index", "Predicted Wins", "CI Lower", "CI Upper")
round(BTrainingPredict,0)

#write our prediction to a file
write.csv(BTrainingPredict, file = "ourPrediction.csv")