DATA 621 Homework 1 - Spring2020

Purpose

This report covers an attempt to build a model to predict number of wins of a baseball team in a season based on several offensive and deffensive statistics. Resulting model explained about 36% of variability in the target variable and included most of the provided explanatory variables. Some potentially helpful variables were not included in the data set. For instance, number of At Bats can be used to calculate on-base percentage which may correlate strongly with winning percentage. The model can be revised with additional variables or further analysis.

Data Exploration

The data set describes baseball team statistics for the years 1871 to 2006 inclusive. Each record in the data set represents the performance of the team for the given year adjusted to the current length of the season - 162 games. The data set includes 16 variables (excluding the index) and the training set includes 2,276 records.

Dimensions

For reproducibility purposes, we have put the original data sets in git-hub account and then we will read them from there. Let’s load our data and look at the the dimensions of our data set.

#BTraining <- read.csv("moneyball-training-data.csv")
BTraining <- read.csv("https://raw.githubusercontent.com/theoracley/Data621/master/Homework1/moneyball-training-data.csv")
dim(BTraining)

## [1] 2276   17

As we can notice, the training data set has a total of 17 different variables. The total number or records available are 2276.

Structure

The below structure is currently present in the data, for simplicity purposes, we have loaded and treated this data set as a data frame in which all the variables are integers.

str(BTraining)

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

Summary

Let’s look at the summary of our data.

BTraining.summary <- data.frame(unclass(summary(BTraining[2:17])),
check.names = FALSE,
row.names = NULL,
stringsAsFactors = FALSE)
BTraining.summary

##        TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B  TEAM_BATTING_3B
## 1 Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
## 2 1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
## 3 Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
## 4 Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
## 5 3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
## 6 Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
## 7             <NA>           <NA>            <NA>             <NA>
##    TEAM_BATTING_HR TEAM_BATTING_BB  TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
## 2 1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
## 3 Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
## 4 Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
## 5 3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
## 6 Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
## 7             <NA>            <NA>    NA's   :102     NA's   :131  
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 Min.   :  0.0    Min.   :29.00   Min.   : 1137    Min.   :  0.0  
## 2 1st Qu.: 38.0    1st Qu.:50.50   1st Qu.: 1419    1st Qu.: 50.0  
## 3 Median : 49.0    Median :58.00   Median : 1518    Median :107.0  
## 4 Mean   : 52.8    Mean   :59.36   Mean   : 1779    Mean   :105.7  
## 5 3rd Qu.: 62.0    3rd Qu.:67.00   3rd Qu.: 1682    3rd Qu.:150.0  
## 6 Max.   :201.0    Max.   :95.00   Max.   :30132    Max.   :343.0  
## 7   NA's   :772     NA's   :2085              <NA>             <NA>
##   TEAM_PITCHING_BB  TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 Min.   :   0.0   Min.   :    0.0   Min.   :  65.0    Min.   : 52.0  
## 2 1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0    1st Qu.:131.0  
## 3 Median : 536.5   Median :  813.5   Median : 159.0    Median :149.0  
## 4 Mean   : 553.0   Mean   :  817.7   Mean   : 246.5    Mean   :146.4  
## 5 3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2    3rd Qu.:164.0  
## 6 Max.   :3645.0   Max.   :19278.0   Max.   :1898.0    Max.   :228.0  
## 7             <NA>     NA's   :102               <NA>    NA's   :286

From the statistics above, we can see that there are 6 variables that contain some NA’s values, and we need to deal with them later.

BTraining.summary$TEAM_BATTING_SO[7]

## [1] "NA's   :102  "

BTraining.summary$TEAM_BASERUN_SB[7]

## [1] "NA's   :131  "

BTraining.summary$TEAM_BASERUN_CS[7]

## [1] "NA's   :772  "

BTraining.summary$TEAM_BATTING_HBP[7]

## [1] "NA's   :2085  "

BTraining.summary$TEAM_PITCHING_SO[7]

## [1] "NA's   :102  "

BTraining.summary$TEAM_FIELDING_DP[7]

## [1] "NA's   :286  "

Each variable is presented below with corresponding basic statistics (minimum, median and maximum values, mean and standard deviation, number of records with missing values and zero values), boxplot, density plot with highlighted mean value, and scatterplot against outcome variable (TARGET_WINS) with best fit line. This information is used to check general validity of data and adjust as necessary.

Let’s create a statistical container that will hold this information for each variable

sumBTraining = data.frame(Variable = character(),
                   Min = integer(),
                   Median = integer(),
                   Mean = double(),
                   SD = double(),
                   Max = integer(),
                   Num_NAs = integer(),
                   Num_Zeros = integer())
for (i in 2:17) {
  sumBTraining <- rbind(sumBTraining, data.frame(Variable = colnames(BTraining)[i],
                                   Min = min(BTraining[,i], na.rm=TRUE),
                                   Median = median(BTraining[,i], na.rm=TRUE),
                                   Mean = mean(BTraining[,i], na.rm=TRUE),
                                   SD = sd(BTraining[,i], na.rm=TRUE),
                                   Max = max(BTraining[,i], na.rm=TRUE),
                                   Num_NAs = sum(is.na(BTraining[,i])),
                                   Num_Zeros = length(which(BTraining[,i]==0)))
                 )
}
colnames(sumBTraining) <- c("", "Min", "Median", "Mean", "SD", "Max", "Num of NAs", "Num of Zeros")

Let’s go through every variable and check on the information output.

TEAM_BATTING_H: Number of team base hits (includes singles, doubles, triples and home runs)

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_H",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
891	1454	1469.27	144.5912	2554	0	0

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_H)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_H)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_H, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_H, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values. The range and distribution are reasonable.

TEAM_BATTING_2B: Number of team doubles

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_2B",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
69	238	241.2469	46.80141	458	0	0

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_2B)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_2B)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_2B, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_2B, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values. The range and distribution are reasonable.

TEAM_BATTING_3B: Number of team triples

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_3B",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	47	55.25	27.93856	223	0	2

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_3B)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_3B)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_3B, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_3B, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. There are 2 records with zero values which is unrealistic for a team in a season. One record (index 1347) has 12 variables with missing values, including the outcome variable. This record will be deleted from the data set. Second record (index 1494) has 7 missing variables, but it does have some recorded values in all categories - batting, pitching and fielding. Zero value for TEAM_BATTING_3B can be replaced with the median (because the distribution is right-skewed, median value will provide more realistic estimate).

TEAM_BATTING_HR: Number of team home runs

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_HR",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	102	99.61204	60.54687	264	0	15

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_HR)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_HR)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_HR, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_HR, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range is reasonable. The distribution is interesting because it is multimodal. Most likely this indicates major changes in game dynamics - perhaps, some rule adjustments started favoring batters. Or perhaps, this is an affect of steroid era. There are 15 records with zero values which is unrealistic for this variable. They can be imputed from other values.

TEAM_BATTING_BB: Number of team walks

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_BB",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	512	501.5589	122.6709	878	0	1

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_BB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_BB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_BB, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_BB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. There is one record (index 1347) that has a zero value. This record was discussed above (under TEAM_BATTING_3B) and it will be deleted from the data set.

TEAM_BATTING_HBP: Number of team batters hit by pitch

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_HBP",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
29	58	59.35602	12.96712	95	2085	0

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_HBP)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_HBP)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_HBP, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_HBP, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are 2,085 records - 91.6% of data set - that are missing value. Because this variable is missing for majority of records, it will not be imputed and will be left out from the regression model.

TEAM_BATTING_SO: Number of team strikeouts by batters

kable(sumBTraining[sumBTraining[,1]=="TEAM_BATTING_SO",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	750	735.6053	248.5264	1399	102	20

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BATTING_SO)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BATTING_SO)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BATTING_SO, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BATTING_SO, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are 122 records with missing or zero value (as wtih other variables a zero value is unrealistic). These values can be imputed. Similarly to homeruns, the distribution is multimodal, which is interesting enough for additional analysis. Another area of concern is a noticeable left tail. It is highly unlikely to have games without any strikeouts, so anything lower than 162 (average of 1 strikeout per game) is definitely suspect.

TEAM_BASERUN_SB: Number of team stolen bases

kable(sumBTraining[sumBTraining[,1]=="TEAM_BASERUN_SB",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	101	124.7618	87.79117	697	131	2

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BASERUN_SB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BASERUN_SB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BASERUN_SB, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BASERUN_SB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. The only issue are 133 records with missing or zero value. These values can be imputed in order to use these records in model building.

TEAM_BASERUN_CS: Number of team runners caught stealing

kable(sumBTraining[sumBTraining[,1]=="TEAM_BASERUN_CS",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	49	52.80386	22.95634	201	772	1

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BASERUN_CS)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BASERUN_CS)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BASERUN_CS, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BASERUN_CS, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable; however, there is significant number of missing values - 773, including one zero value. This represents a third of the entire data set. It may be possible to impute this value, but it may be necessary to leave this variable out of model building.

TEAM_FIELDING_E: Number of team fielding errors

kable(sumBTraining[sumBTraining[,1]=="TEAM_FIELDING_E",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
65	159	246.4807	227.771	1898	0	0

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_FIELDING_E)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_FIELDING_E)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_FIELDING_E, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_FIELDING_E, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values. Distribution has a very long right tail. Values in the 1,000 and above range are highly suspect. One of the highest historical number of errors is 867 errors by Washington in 1886 for 122 games. That is equal about 1,151 errors for 162 game season. There are multiple values above that number. This may unfavorably influence a model.

TEAM_FIELDING_DP: Number of team fielding double plays

kable(sumBTraining[sumBTraining[,1]=="TEAM_FIELDING_DP",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
52	149	146.3879	26.22639	228	286	0

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_FIELDING_DP)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_FIELDING_DP)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_FIELDING_DP, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_FIELDING_DP, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. Similar to a few other variables there is a medium number off missing values - 286 records. This value can be imputed.

TEAM_PITCHING_BB: Number of walks given up by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_BB",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	536.5	553.0079	166.3574	3645	0	1

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_BB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_BB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_BB, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_BB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: There are no missing values with the exception of record 1347 which will be deleted from model building. There are some unrealistic outliers. Current record of walks by a team in a season is held by 1949 Boston Red Sox - 835 walks in 155 games. For a 162 game season, this number is 873. This variable will be capped at 1,100 and any value over this will be set to this cap.

TEAM_PITCHING_H: Number of base hits given up by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_H",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
1137	1518	1779.21	1406.843	30132	0	0

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_H)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_H)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_H, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_H, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: Similar to TEAM_PITCHING_BB above, there are no missing value, but there issues with outliers. Based on visualizations, this variable will be capped at 13,000 and any value over this will be set to this cap.

TEAM_PITCHING_HR: Number of home runs given up by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_HR",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	107	105.6986	61.29875	343	0	15

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_HR)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_HR)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_HR, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_HR, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: This variable is more consistent than other pitching variables. The range and distribution are reasonable. Multimodality is interesting similar to a few other variables above. There are 15 zero values which can be imputed as needed.

TEAM_PITCHING_SO: Number of strikeouts by pitchers

kable(sumBTraining[sumBTraining[,1]=="TEAM_PITCHING_SO",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	813.5	817.7305	553.085	19278	102	20

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_PITCHING_SO)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_PITCHING_SO)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_PITCHING_SO, na.rm=TRUE)), color="red", linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_PITCHING_SO, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: This variable has 122 missing or zero values. They can be imputed as needed. There is also an outlier issue. Based on visualizations, this variable will be capped at 2,500 and any value over this will be set to this cap.

TARGET_WINS: Number of wins (Outcome)

kable(sumBTraining[sumBTraining[,1]=="TARGET_WINS",2:8], row.names=FALSE)

Min	Median	Mean	SD	Max	Num of NAs	Num of Zeros
0	82	80.79086	15.75215	146	0	1

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TARGET_WINS)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TARGET_WINS)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TARGET_WINS, na.rm=TRUE)), color="red", linetype="dashed", size=1)

grid.arrange(bp, hp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis: The range and distribution are reasonable. There are no missing values with the exception of record 1347.

Density Plots

par(mfrow = c(3, 3))

datasub = melt(BTraining[c(2:17)])

## Using  as id variables

ggplot(datasub, aes(x= value)) +
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free')

## Warning: Removed 3478 rows containing non-finite values (stat_density).

The density plots show various issues with skew an non-normality

Box Plots

Box plots can provide a visual representation of the variance of the data

vis <- melt(BTraining) %>%
   dplyr::filter(variable != "INDEX")

## Using  as id variables

ggplot(vis, aes(x = variable, y = value)) +
  geom_boxplot(show.legend = T) +
  stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
  coord_flip() +
  ylim(0, 2200)

## Warning: Removed 3671 rows containing non-finite values (stat_boxplot).

## Warning: Removed 3671 rows containing non-finite values (stat_summary).

The box plots reveal that a great majority of the explanatory variables have high variances
Some of the variables contain extreme outliers that this graph does not show because i had to reduce the limits on the graph to get clear box plots
Many of the medians and means are also not aligned which demonstrates the outliers’ effects

Histograms

ggplot(vis, aes(value)) +
  geom_histogram(fill = "red")  +
  facet_wrap(~ variable, scales = "free")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3478 rows containing non-finite values (stat_bin).

The histograms reveal that very few of the variables are normally distributed
- A few variables are multi-modal
- Some of the variable exhibit a lot of skew (e.g. BASERUN_SB)

Missing Values

missmap(BTraining[2:17])

Correlation Matrix

# Correlation matrix
cm <- cor(BTraining, use="pairwise.complete.obs")
cm <- cm[2:17,2:17]
names <- c("Wins", "H", "2B", "3B", "HR", "BB", "SO", "SB", "CS", "HBP", "P-H", "P-HR", "P-BB", "P-SO", "E", "DP")
colnames(cm) <- names; rownames(cm) <- names
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "latex", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- names
cmout %>%
  kable("latex", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(0, angle = -90)

# cmout %>%
#   kable("html", escape = F, align = "c", row.names = TRUE) %>%
#   kable_styling("striped", full_width = F)
#

Anything over 0.5 or under -0.5 is highlighted in blue. The matrix was created using complete pairwise observations.

A few conclusions:

Not surprisingly there is a very strong correlation between home runs batted in and home runs given up by pitching.
There is a negative correlation between number of triples and home runs. A less powerful team may not have enough power to hit home runs, but they get a lot of triples.
There is a strong positive correlation between number of strikeouts and home runs. More swings of the bat results in more home runs.

Other correlation visualization

#chart.Correlation(BTraining)

corr <- cor(cm, use="complete.ob")
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE, lab_size = 3, digits = 2)

cor_res <- cor(BTraining[c(2:17)], use = "na.or.complete")
corrplot(cor_res,
  type = "upper",
  order = "original",
  tl.col = "black",
  tl.srt = 45,
  tl.cex = 0.55)

Let’s read the correlations respective values:

cor_res.df <- data.frame(cor_res)
cor_res.df[1]

##                  TARGET_WINS
## TARGET_WINS       1.00000000
## TEAM_BATTING_H    0.46994665
## TEAM_BATTING_2B   0.31298400
## TEAM_BATTING_3B  -0.12434586
## TEAM_BATTING_HR   0.42241683
## TEAM_BATTING_BB   0.46868793
## TEAM_BATTING_SO  -0.22889273
## TEAM_BASERUN_SB   0.01483639
## TEAM_BASERUN_CS  -0.17875598
## TEAM_BATTING_HBP  0.07350424
## TEAM_PITCHING_H   0.47123431
## TEAM_PITCHING_HR  0.42246683
## TEAM_PITCHING_BB  0.46839882
## TEAM_PITCHING_SO -0.22936481
## TEAM_FIELDING_E  -0.38668800
## TEAM_FIELDING_DP -0.19586601

We notice also that there are some very strong positive correlations between other variables, represented in the correlation graphic above, among themselves. some of these are:

TEAM_BATTING_H is strongly correlated in a positive way with TEAM_PITCHING_H

cor_res.df['TEAM_BATTING_H','TEAM_PITCHING_H']

## [1] 0.9991927

TEAM_BATTING_HR is strongly correlated in a positive way with TEAM_PITCHING_HR

cor_res.df['TEAM_BATTING_HR','TEAM_PITCHING_HR']

## [1] 0.9999326

TEAM_BATTING_BB is strongly correlated in a positive way with TEAM_PITCHING_BB

cor_res.df['TEAM_BATTING_BB','TEAM_PITCHING_BB']

## [1] 0.9998814

TEAM_BATTING_SO is strongly correlated in a positive way with TEAM_PITCHING_SO

cor_res.df['TEAM_BATTING_SO','TEAM_PITCHING_SO']

## [1] 0.9997684

Primary insigths

Based on the above data exploration, we could note the following:

Missing values

It is confirmed the presence of missing values and these need to be address in a case by case - (will be taken care of in data Preparation section).

Zero values

It is confirmed the presence of Zero values as minimum entries in the data - (will be taken care of in data Preparation section).

Correlations

There seems to be very strong correlations to the target variable.

In regards to correlations related to other variables, it is confirmed that some variables have stronh correlation between each others.

Data Preparation

The following steps and/or assumptions will be considered. The idea is to make our given training data set more homogeneous and workable. The final goal is to be able to predict the TARGET_WINS with our data.

As noted in the Data Exploration section, the following adjustments have been performed:

Record 1347 having 0 for outcome variable TARGET_WINS has been removed.
Variable TEAM_BATTING_HBP has been removed.
Any zero value in all variables has been converted to NA.
Any NA value has been imputed using aregImpute function of Hmisc R package. Please see https://www.rdocumentation.org/packages/Hmisc/versions/4.3-1/topics/aregImpute for a full documentation of this function.
Imputation and \(R^2\) business:

# Remove observations with no target
BTraining <- BTraining[which(BTraining$TARGET_WINS!=0), ]

# Reset zero values
BTraining[which(BTraining$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BTraining[which(BTraining$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BTraining[which(BTraining$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BTraining[which(BTraining$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute missing values
BTrainingImpute <- aregImpute(~ TARGET_WINS + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                         TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
                         TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
                         TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BTraining, n.impute = 10)
BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BTraining, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BTraining$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BTraining$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BTraining$TEAM_BATTING_3B <- BTrainingI$TEAM_BATTING_3B
BTraining$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BTraining$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BTraining$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BTraining$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BTraining$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

BTrainingImpute$rsq

##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##        0.7238384        0.9872944        0.9275487        0.6804871 
##  TEAM_BASERUN_CS TEAM_FIELDING_DP TEAM_PITCHING_HR TEAM_PITCHING_SO 
##        0.6379649        0.4685603        0.9835082        0.7863081

Outliers for several variables have been capped: TEAM_PITCHING_SO at 2,500, TEAM_PITCHING_H at 13,000, and TEAM_PITCHING_BB at 1100.
To even out the spread of TEAM_FIELDING_E which has a long right tail with low median value, it has been log-transformed.
A new variable has been created to calculate number of singles batting in. It is equal to number of base hits minus doubles, triples and home runs.

# Adjust outliers
BTraining[which(BTraining$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BTraining[which(BTraining$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BTraining[which(BTraining$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

# Create singles
BTraining$TEAM_BATTING_S <- BTraining$TEAM_BATTING_H - BTraining$TEAM_BATTING_2B - BTraining$TEAM_BATTING_3B - BTraining$TEAM_BATTING_HR
summary(BTraining$TEAM_BATTING_S)

# Create log fielding error
BTraining$TEAM_FIELDING_E_LOG <- log(BTraining$TEAM_FIELDING_E)

Model Building

Model 1

The first model includes several variables, selected manually, that have higher than average correlation to the target variable. They cover hitting, walking and fielding errors.

model1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model1)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.971  -9.022   0.101   9.062  51.557 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.089556   3.311975   1.235    0.217    
## TEAM_BATTING_H   0.049143   0.002100  23.403  < 2e-16 ***
## TEAM_BATTING_BB  0.016107   0.003136   5.137 3.03e-07 ***
## TEAM_FIELDING_E -0.014493   0.001768  -8.196 4.11e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.7 on 2271 degrees of freedom
## Multiple R-squared:  0.2356, Adjusted R-squared:  0.2346 
## F-statistic: 233.3 on 3 and 2271 DF,  p-value: < 2.2e-16

All variables are significant, but the \(R^2\) value is relatively small at 0.2356.

Model 2

The second model expand the base hit variable, TEAM_BATTING_H, into its components - singles, doubles, triples and home runs.

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
           TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model2)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, 
##     data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.014  -8.792   0.093   8.751  59.151 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.022597   3.444892   2.329  0.01996 *  
## TEAM_BATTING_S   0.045595   0.003158  14.437  < 2e-16 ***
## TEAM_BATTING_2B  0.023042   0.007415   3.107  0.00191 ** 
## TEAM_BATTING_3B  0.159706   0.015181  10.520  < 2e-16 ***
## TEAM_BATTING_HR  0.077205   0.007713  10.009  < 2e-16 ***
## TEAM_BATTING_BB  0.012727   0.003200   3.977 7.21e-05 ***
## TEAM_FIELDING_E -0.018777   0.001971  -9.525  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.53 on 2268 degrees of freedom
## Multiple R-squared:  0.2559, Adjusted R-squared:  0.2539 
## F-statistic:   130 on 6 and 2268 DF,  p-value: < 2.2e-16

All variables are still significant and \(R^2\) is slightly improved at 0.2574. Another variation of this model - with log-transformed fielding error variable - produced slightly worse results.

Model 3

The third model includes several variables manually selected to try and cover different aspects of the game:

TEAM_BATTING_SO:TEAM_BATTING_H interaction covers offensive successes (hits) and failures (strikeouts).
Similarly TEAM_BATTING_BB:TEAM_BATTING_H interaction covers interaction between hits and walks.
TEAM_BASERUN_SB covers base running.
TEAM_FIELDING_DP and TEAM_FIELDING_E_LOG cover fielding performance.
TEAM_PITCHING_HR covers pitching performance.

model3 <- lm(TARGET_WINS ~ TEAM_BATTING_SO:TEAM_BATTING_H + TEAM_BATTING_BB:TEAM_BATTING_H + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_FIELDING_E_LOG, data=BTraining)
summary(model3)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO:TEAM_BATTING_H + TEAM_BATTING_BB:TEAM_BATTING_H + 
##     TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_HR + 
##     TEAM_FIELDING_E_LOG, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.380  -8.256   0.058   8.256  65.785 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.703e+02  6.666e+00  25.549  < 2e-16 ***
## TEAM_BATTING_SO                -8.773e-02  5.562e-03 -15.772  < 2e-16 ***
## TEAM_BASERUN_SB                 5.315e-02  4.204e-03  12.642  < 2e-16 ***
## TEAM_FIELDING_DP               -1.485e-01  1.322e-02 -11.234  < 2e-16 ***
## TEAM_PITCHING_HR                4.984e-02  7.562e-03   6.592 5.40e-11 ***
## TEAM_FIELDING_E_LOG            -1.399e+01  9.191e-01 -15.222  < 2e-16 ***
## TEAM_BATTING_SO:TEAM_BATTING_H  4.672e-05  4.259e-06  10.971  < 2e-16 ***
## TEAM_BATTING_H:TEAM_BATTING_BB  1.032e-05  1.924e-06   5.361 9.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 2267 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3038 
## F-statistic: 142.7 on 7 and 2267 DF,  p-value: < 2.2e-16

All variables are statistically significant and there is noticeable improvement of the \(R^2\) value at 0.3027.

Model 4

The fourth model started with all variables and used backward elimination to arrive at the optimal model. It started with the following variables: TEAM_BATTING_S, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_HR. It was necessary to remove only one variable - TEAM_BASERUN_CS - to arrive at a model with all significant variables.

#model started with all variables
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)

#then removed TEAM_BASERUN_CS variable, the model becomes
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)

summary(model4)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.710  -8.295   0.157   8.179  51.173 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.462823   5.048779   4.845 1.35e-06 ***
## TEAM_BATTING_S    0.046571   0.003651  12.756  < 2e-16 ***
## TEAM_BATTING_2B   0.022722   0.007067   3.215 0.001323 ** 
## TEAM_BATTING_3B   0.101053   0.015282   6.613 4.69e-11 ***
## TEAM_BATTING_HR   0.112250   0.028906   3.883 0.000106 ***
## TEAM_BATTING_BB   0.048538   0.009773   4.967 7.32e-07 ***
## TEAM_BATTING_SO  -0.025144   0.004426  -5.681 1.51e-08 ***
## TEAM_BASERUN_SB   0.049378   0.004159  11.873  < 2e-16 ***
## TEAM_FIELDING_DP -0.133065   0.012703 -10.475  < 2e-16 ***
## TEAM_FIELDING_E  -0.041039   0.002734 -15.011  < 2e-16 ***
## TEAM_PITCHING_BB -0.031511   0.008133  -3.875 0.000110 ***
## TEAM_PITCHING_H   0.002598   0.000566   4.590 4.67e-06 ***
## TEAM_PITCHING_SO  0.015793   0.003458   4.567 5.22e-06 ***
## TEAM_PITCHING_HR  0.005472   0.026031   0.210 0.833511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 2261 degrees of freedom
## Multiple R-squared:  0.3599, Adjusted R-squared:  0.3562 
## F-statistic: 97.78 on 13 and 2261 DF,  p-value: < 2.2e-16

The \(R^2\) value is 0.3581.

Model 5

Additionally, several models were created by trying out some variables and there interactions. Variables were selected either based on theoretical expectation or correlation information from the first section. The following model has \(R^2\) values of 0.3279, which is relatively close to the fourth model; however, this model has fewer variables and may be preferential because of its simplicity.

model5 <- lm(TARGET_WINS ~ TEAM_BATTING_S  + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BASERUN_SB  + 
           TEAM_FIELDING_E_LOG*TEAM_PITCHING_H, data=BTraining)
summary(model5)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BASERUN_SB + TEAM_FIELDING_E_LOG * 
##     TEAM_PITCHING_H, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.102  -8.859  -0.125   8.612  53.696 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         43.5980264  8.2083937   5.311 1.19e-07 ***
## TEAM_BATTING_S                       0.0462442  0.0030660  15.083  < 2e-16 ***
## TEAM_BATTING_3B                      0.1583804  0.0152079  10.414  < 2e-16 ***
## TEAM_BATTING_HR                      0.0733046  0.0071637  10.233  < 2e-16 ***
## TEAM_BASERUN_SB                      0.0542861  0.0038357  14.153  < 2e-16 ***
## TEAM_FIELDING_E_LOG                 -8.8627398  1.4160051  -6.259 4.62e-10 ***
## TEAM_PITCHING_H                      0.0286916  0.0045198   6.348 2.63e-10 ***
## TEAM_FIELDING_E_LOG:TEAM_PITCHING_H -0.0040987  0.0006387  -6.417 1.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.87 on 2267 degrees of freedom
## Multiple R-squared:  0.3268, Adjusted R-squared:  0.3247 
## F-statistic: 157.2 on 7 and 2267 DF,  p-value: < 2.2e-16

The \(R^2\) value is 0.3294.

Model Selection

Based on \(R^2\) value, the fourth model (\(R^2\)=0.358)was selected for further analysis. This model also has the lowest AIC score.

AIC(model1, model2, model3, model4, model5)

##        df      AIC
## model1  5 18372.69
## model2  8 18317.53
## model3  9 18161.22
## model4 15 17989.07
## model5  9 18091.76

summary(model4)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = BTraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.710  -8.295   0.157   8.179  51.173 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.462823   5.048779   4.845 1.35e-06 ***
## TEAM_BATTING_S    0.046571   0.003651  12.756  < 2e-16 ***
## TEAM_BATTING_2B   0.022722   0.007067   3.215 0.001323 ** 
## TEAM_BATTING_3B   0.101053   0.015282   6.613 4.69e-11 ***
## TEAM_BATTING_HR   0.112250   0.028906   3.883 0.000106 ***
## TEAM_BATTING_BB   0.048538   0.009773   4.967 7.32e-07 ***
## TEAM_BATTING_SO  -0.025144   0.004426  -5.681 1.51e-08 ***
## TEAM_BASERUN_SB   0.049378   0.004159  11.873  < 2e-16 ***
## TEAM_FIELDING_DP -0.133065   0.012703 -10.475  < 2e-16 ***
## TEAM_FIELDING_E  -0.041039   0.002734 -15.011  < 2e-16 ***
## TEAM_PITCHING_BB -0.031511   0.008133  -3.875 0.000110 ***
## TEAM_PITCHING_H   0.002598   0.000566   4.590 4.67e-06 ***
## TEAM_PITCHING_SO  0.015793   0.003458   4.567 5.22e-06 ***
## TEAM_PITCHING_HR  0.005472   0.026031   0.210 0.833511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 2261 degrees of freedom
## Multiple R-squared:  0.3599, Adjusted R-squared:  0.3562 
## F-statistic: 97.78 on 13 and 2261 DF,  p-value: < 2.2e-16

All variables used in this model have statistical significance at 0.01 level. The F-statistic is high with a p-value nearly 0 and, therefore, is significant. Median value of residuals is close to 0 and they are equally distributed. Standard errors are significantly smaller than estimated coefficients.

Only 4 variables are negatively correlated - strikeouts, double plays, errors, and home runs allowed. Remaining variables are positively correlated. Some correlation is counter-intuitive. For example, double plays are considered successful defensive moves and should increase the winning percentage. Similarly, allowing base hits should decrease winning percentage. The model indicates otherwise and there are probably there factors that influence these variables.

Consider residuals plotted against data index. There is no pattern.

plot(model4$residuals, ylab="Residuals")
abline(h=0)

Plotting fitted values against the residuals is more problematic. Although there is no pattern among residuals, there are some outliers and variability does not appear to be constant across the entire range.

plot(model4$fitted.values, model4$residuals, xlab="Fitted Values", ylab="Residuals")
abline(h=0)

Q-Q plot confirms that residuals are normally distributed.

qqnorm(model4$residuals)
qqline(model4$residuals)

Prediction

Using selected model and evaluation data (transformed similarly to training data), prediction table is as follows. It includes predicted number of wins along with confidence interval(CI).

#BEvaluation <- read.csv("moneyball-evaluation-data.csv")
BEvaluation <- read.csv("https://raw.githubusercontent.com/theoracley/Data621/master/Homework1/moneyball-evaluation-data.csv")

BEvaluation[which(BEvaluation$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute mimssing values
BTrainingImpute <- aregImpute(~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                         TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
                         TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
                         TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BEvaluation, n.impute = 10)
BTrainingImpute
BTrainingImpute$rsq

BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BEvaluation, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BEvaluation$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BEvaluation$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BEvaluation$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BEvaluation$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BEvaluation$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BEvaluation$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BEvaluation$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

# Adjust outliers
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BEvaluation[which(BEvaluation$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

BEvaluation$TEAM_BATTING_S <- BEvaluation$TEAM_BATTING_H - BEvaluation$TEAM_BATTING_2B - BEvaluation$TEAM_BATTING_3B - BEvaluation$TEAM_BATTING_HR

BEvaluation$PREDICT_WIN <- predict(model4, newdata=BEvaluation, interval="confidence")

BTrainingPredict <- cbind(BEvaluation$INDEX, BEvaluation$PREDICT_WIN[, 1], BEvaluation$PREDICT_WIN[, 2], BEvaluation$PREDICT_WIN[, 3])
colnames(BTrainingPredict) <- c("Index", "Predicted Wins", "CI Lower", "CI Upper")
kable(round(BTrainingPredict,0))

Index	Predicted Wins	CI Lower	CI Upper
9	62	60	63
10	64	63	66
14	74	73	75
47	87	86	88
60	65	62	69
63	61	58	63
74	87	85	90
83	76	74	78
98	70	69	71
120	72	71	74
123	68	66	69
135	81	80	83
138	81	79	82
140	82	80	83
151	85	84	87
153	76	75	78
171	74	73	75
184	78	77	79
193	74	72	75
213	90	88	92
217	81	80	83
226	83	82	85
230	80	78	81
241	70	69	72
291	82	80	83
294	88	86	89
300	42	37	46
348	73	71	74
350	83	81	85
357	74	72	75
367	91	89	92
368	85	84	86
372	82	80	83
382	83	82	85
388	79	78	80
396	86	85	88
398	75	75	76
403	90	88	92
407	84	81	87
410	92	90	93
412	82	81	84
414	92	90	94
436	9	3	16
440	109	105	112
476	96	94	98
479	93	91	95
481	105	103	107
501	76	75	78
503	68	67	69
506	79	78	80
519	75	74	76
522	85	83	86
550	76	75	77
554	72	71	74
566	74	74	75
578	78	77	79
596	93	91	94
599	76	75	78
605	64	63	66
607	75	73	78
614	88	86	89
644	74	72	76
692	89	88	90
699	85	83	87
700	84	82	86
716	96	93	99
721	73	72	75
722	79	78	81
729	78	77	80
731	91	89	92
746	85	82	87
763	69	67	70
774	76	75	78
776	89	87	91
788	80	78	82
789	84	83	86
792	83	82	84
811	84	83	85
835	74	72	75
837	76	75	78
861	83	81	85
862	87	86	89
863	96	95	98
871	73	71	75
879	85	84	86
887	80	78	81
892	84	82	85
904	84	83	85
909	91	89	92
925	91	90	93
940	78	76	79
951	91	80	101
976	72	70	73
981	90	88	92
983	88	86	89
984	88	86	90
989	90	88	92
995	104	102	106
1000	86	85	88
1001	87	85	89
1007	78	77	80
1016	73	71	74
1027	83	82	84
1033	84	82	85
1070	79	77	80
1081	59	57	62
1084	47	43	51
1098	76	75	78
1150	87	86	89
1160	60	58	63
1169	85	84	86
1172	86	84	88
1174	95	94	96
1176	92	91	93
1178	80	79	82
1184	78	77	79
1193	86	84	87
1196	81	80	82
1199	73	72	74
1207	80	78	82
1218	92	90	93
1223	68	66	70
1226	68	66	69
1227	66	64	68
1229	68	67	70
1241	87	85	88
1244	90	89	92
1246	76	75	77
1248	93	91	94
1249	91	89	92
1253	85	84	87
1261	79	78	81
1305	80	79	82
1314	87	85	88
1323	88	87	90
1328	66	63	68
1353	74	72	75
1363	76	75	77
1371	89	88	91
1372	82	81	83
1389	64	63	66
1393	70	68	72
1421	90	89	92
1431	71	70	73
1437	71	70	72
1442	71	70	72
1450	77	76	78
1463	79	78	80
1464	78	77	79
1470	83	82	84
1471	82	80	83
1484	81	80	82
1495	62	52	71
1507	68	66	70
1514	77	76	79
1526	70	68	72
1549	90	89	92
1552	80	78	82
1556	92	90	94
1564	73	71	74
1585	105	102	107
1586	109	107	111
1590	94	93	96
1591	105	102	107
1592	98	96	100
1603	89	87	91
1612	81	80	83
1634	81	79	82
1645	73	71	74
1647	80	79	81
1673	94	92	96
1674	91	89	92
1687	80	78	81
1688	94	93	96
1700	82	81	84
1708	73	71	74
1713	78	76	79
1717	69	67	70
1721	73	72	74
1730	80	79	81
1737	91	88	93
1748	90	88	91
1749	86	85	87
1763	85	84	86
1768	85	77	93
1778	92	88	95
1780	92	89	94
1782	46	42	50
1784	59	57	62
1794	118	115	121
1803	69	68	71
1804	79	77	81
1819	75	73	76
1832	76	74	77
1833	78	77	80
1844	66	64	67
1847	77	76	78
1854	85	83	87
1855	79	78	80
1857	84	83	86
1864	75	73	76
1865	80	78	81
1869	74	72	75
1880	89	86	92
1881	81	80	82
1882	85	84	86
1894	78	76	79
1896	78	76	79
1916	86	85	88
1918	74	72	76
1921	110	108	113
1926	96	94	98
1938	82	80	84
1979	63	62	65
1982	66	65	68
1987	83	82	84
1997	78	76	80
2004	95	94	97
2011	77	76	78
2015	79	78	80
2022	78	77	79
2025	74	72	75
2027	81	80	82
2031	72	70	73
2036	73	69	78
2066	75	74	76
2073	82	81	83
2087	80	78	81
2092	82	81	83
2125	71	68	73
2148	82	80	84
2162	93	92	94
2191	78	76	79
2203	89	87	90
2218	80	79	81
2221	75	74	76
2225	83	82	85
2232	77	76	78
2267	93	91	95
2291	71	69	72
2299	89	87	90
2317	86	84	87
2318	83	82	84
2353	81	80	82
2403	61	59	62
2411	87	85	89
2415	81	79	82
2424	85	84	86
2441	71	70	73
2464	84	82	85
2465	80	79	82
2472	50	46	53
2481	96	94	98
2487	17	9	24
2500	69	68	70
2501	77	76	79
2520	84	82	85
2521	85	83	86
2525	76	74	78

Saving our prediction to a prediction file (ourPrediction.csv)

write.csv(BTrainingPredict, file = "ourPrediction.csv")

APPENDIX:

# Required libraries
library(tidyverse)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(Hmisc)
library(PerformanceAnalytics)
library(ggcorrplot)
library(corrplot)
library(Amelia)
library(reshape)
library(ggplot2)

# Import data
BTraining <- read.csv("moneyball-training-data.csv")

#structure
str(BTraining)

#Summary
BTraining.summary <- data.frame(unclass(summary(BTraining[2:17])),
check.names = FALSE,
row.names = NULL,
stringsAsFactors = FALSE)
BTraining.summary

#Columns having NA values
BTraining.summary$TEAM_BATTING_SO[7]
BTraining.summary$TEAM_BASERUN_SB[7]
BTraining.summary$TEAM_BASERUN_CS[7]
BTraining.summary$TEAM_BATTING_HBP[7]
BTraining.summary$TEAM_PITCHING_SO[7]
BTraining.summary$TEAM_FIELDING_DP[7]

# Get summary table
sumBTraining = data.frame(Variable = character(),
                   Min = integer(),
                   Median = integer(),
                   Mean = double(),
                   SD = double(),
                   Max = integer(),
                   Num_NAs = integer(),
                   Num_Zeros = integer())

for (i in 2:17) {
  sumBTraining <- rbind(sumBTraining, data.frame(Variable = colnames(BTraining)[i],
                                   Min = min(BTraining[,i], na.rm=TRUE),
                                   Median = median(BTraining[,i], na.rm=TRUE),
                                   Mean = mean(BTraining[,i], na.rm=TRUE),
                                   SD = sd(BTraining[,i], na.rm=TRUE),
                                   Max = max(BTraining[,i], na.rm=TRUE),
                                   Num_NAs = sum(is.na(BTraining[,i])),
                                   Num_Zeros = length(which(BTraining[,i]==0)))
                 )
}

# ----This section is for Exploration of Data used for every variable----
kable(sumBTraining[sumBTraining[,1]=="TEAM_BASERUN_SB",2:8], row.names=FALSE)

# Boxplot
bp <- ggplot(BTraining, aes(x = 1, y = TEAM_BASERUN_SB)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), 
                                     axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(BTraining, aes(x = TEAM_BASERUN_SB)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(TEAM_BASERUN_SB, na.rm=TRUE)), color="red", 
             linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(data=BTraining, aes(x=TEAM_BASERUN_SB, y=TARGET_WINS)) + 
  geom_point() + geom_smooth(method = "loess") +
  xlab("Scatterplot with Best Fit Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

##-----End of exploratory Data for every variable--------


#Density plots
par(mfrow = c(3, 3))
datasub = melt(BTraining[c(2:17)])
ggplot(datasub, aes(x= value)) +
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free')


# Correlation matrix
cm <- cor(BTraining, use="pairwise.complete.obs")
cm <- cm[2:17,2:17]
names <- c("Wins", "H", "2B", "3B", "HR", "BTraining", "SO", "SB", "CS", "HBP", "P-H", 
           "P-HR", "P-BTraining", "P-SO", "E", "DP")
colnames(cm) <- names; rownames(cm) <- names
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "html", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- names
cmout %>%
  kable("html", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F)


#Box plots
vis <- melt(BTraining) %>%
   dplyr::filter(variable != "INDEX")

ggplot(vis, aes(x = variable, y = value)) +
  geom_boxplot(show.legend = T) +
  stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
  coord_flip() +
  ylim(0, 2200)

#Histograms
ggplot(vis, aes(value)) +
  geom_histogram(fill = "red")  +
  facet_wrap(~ variable, scales = "free")

#missing values
missmap(BTraining[2:17])


# Correlation matrix
cm <- cor(BTraining, use="pairwise.complete.obs")
cm <- cm[2:17,2:17]
names <- c("Wins", "H", "2B", "3B", "HR", "BB", "SO", "SB", "CS", "HBP", "P-H", "P-HR", "P-BB", "P-SO", "E", "DP")
colnames(cm) <- names; rownames(cm) <- names
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "latex", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- names
cmout %>%
  kable("latex", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(0, angle = -90)


#Other visualization of correlation
#1.
corr <- cor(cm, use="complete.ob")
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE, lab_size = 3, digits = 2)

#2.
cor_res <- cor(BTraining[c(2:17)], use = "na.or.complete")
corrplot(cor_res,
  type = "upper",
  order = "original",
  tl.col = "black",
  tl.srt = 45,
  tl.cex = 0.55)

#correlation values
cor_res.df <- data.frame(cor_res)
cor_res.df[1]


# Remove observations with no target
BTraining <- BTraining[which(BTraining$TARGET_WINS!=0), ]

# Reset zero values
BTraining[which(BTraining$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BTraining[which(BTraining$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BTraining[which(BTraining$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BTraining[which(BTraining$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BTraining[which(BTraining$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BTraining[which(BTraining$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BTraining[which(BTraining$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BTraining[which(BTraining$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute mimssing values
BTrainingImpute <- aregImpute(~ TARGET_WINS + TEAM_BATTING_H + TEAM_BATTING_2B + 
                         TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + 
                         TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
                         TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
                         TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BTraining, n.impute = 10)

BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BTraining, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BTraining$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BTraining$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BTraining$TEAM_BATTING_3B <- BTrainingI$TEAM_BATTING_3B
BTraining$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BTraining$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BTraining$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BTraining$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BTraining$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

BTrainingImpute$rsq

# Adjust outliers
BTraining[which(BTraining$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BTraining[which(BTraining$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BTraining[which(BTraining$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

# Creat singles
BTraining$TEAM_BATTING_S <- BTraining$TEAM_BATTING_H - BTraining$TEAM_BATTING_2B - 
  BTraining$TEAM_BATTING_3B - BTraining$TEAM_BATTING_HR
summary(BTraining$TEAM_BATTING_S)

# Create log fielding error
BTraining$TEAM_FIELDING_E_LOG <- log(BTraining$TEAM_FIELDING_E)

# Model building
model1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model1)

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
           TEAM_BATTING_BB + TEAM_FIELDING_E, data=BTraining)
summary(model2)
model2b <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
            TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E_LOG, data=BTraining)
summary(model2b)

model3 <- lm(TARGET_WINS ~ TEAM_BATTING_SO:TEAM_BATTING_H + TEAM_BATTING_BB:TEAM_BATTING_H + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_FIELDING_E_LOG, data=BTraining)
summary(model3)

model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_BASERUN_CS + TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)
summary(model4)
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_S + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
           TEAM_FIELDING_DP + TEAM_FIELDING_E + TEAM_PITCHING_BB + 
           TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data=BTraining)
summary(model4)

model5 <- lm(TARGET_WINS ~ TEAM_BATTING_S  + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BASERUN_SB  + 
           TEAM_FIELDING_E_LOG*TEAM_PITCHING_H, data=BTraining)
summary(model5)

#AIC Score
AIC(model1, model2, model3, model4, model5)
summary(model4)

# Residuals plots
plot(model4$residuals, ylab="Residuals")
abline(h=0)

plot(model4$fitted.values, m4$residuals, xlab="Fitted Values", ylab="Residuals")
abline(h=0)

qqnorm(model4$residuals)
qqline(model4$residuals)

# Test evaluation data for prediction
#BEvaluation <- read.csv("moneyball-evaluation-data.csv")
BEvaluation <- read.csv("https://raw.githubusercontent.com/theoracley/Data621/master/Homework1/moneyball-evaluation-data.csv")

BEvaluation[which(BEvaluation$TEAM_BATTING_H==0),"TEAM_BATTING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_2B==0),"TEAM_BATTING_2B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_3B==0),"TEAM_BATTING_3B"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_HR==0),"TEAM_BATTING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_BB==0),"TEAM_BATTING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BATTING_SO==0),"TEAM_BATTING_SO"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_SB==0),"TEAM_BASERUN_SB"] <- NA
BEvaluation[which(BEvaluation$TEAM_BASERUN_CS==0),"TEAM_BASERUN_CS"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_E==0),"TEAM_FIELDING_E"] <- NA
BEvaluation[which(BEvaluation$TEAM_FIELDING_DP==0),"TEAM_FIELDING_DP"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB==0),"TEAM_PITCHING_BB"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_H==0),"TEAM_PITCHING_H"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_HR==0),"TEAM_PITCHING_HR"] <- NA
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO==0),"TEAM_PITCHING_SO"] <- NA

# Impute mimssing values
BTrainingImpute <- aregImpute(~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
                         TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
                         TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_DP + 
                         TEAM_FIELDING_E + TEAM_PITCHING_BB + TEAM_PITCHING_H + 
                         TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
                       data = BEvaluation, n.impute = 10)

BTrainingImpute$rsq

BTrainingI <- impute.transcan(BTrainingImpute, imputation=10, data=BEvaluation, 
                       list.out=TRUE, pr=FALSE, check=FALSE)
BEvaluation$TEAM_BATTING_HR <- BTrainingI$TEAM_BATTING_HR
BEvaluation$TEAM_BATTING_SO <- BTrainingI$TEAM_BATTING_SO
BEvaluation$TEAM_BASERUN_SB <- BTrainingI$TEAM_BASERUN_SB
BEvaluation$TEAM_BASERUN_CS <- BTrainingI$TEAM_BASERUN_CS
BEvaluation$TEAM_FIELDING_DP <- BTrainingI$TEAM_FIELDING_DP
BEvaluation$TEAM_PITCHING_HR <- BTrainingI$TEAM_PITCHING_HR
BEvaluation$TEAM_PITCHING_SO <- BTrainingI$TEAM_PITCHING_SO

# Adjust outliers
BEvaluation[which(BEvaluation$TEAM_PITCHING_SO>2500),"TEAM_PITCHING_SO"] <- 2500
BEvaluation[which(BEvaluation$TEAM_PITCHING_H>13000),"TEAM_PITCHING_H"] <- 13000
BEvaluation[which(BEvaluation$TEAM_PITCHING_BB>1100),"TEAM_PITCHING_BB"] <- 1100

BEvaluation$TEAM_BATTING_S <- BEvaluation$TEAM_BATTING_H - BEvaluation$TEAM_BATTING_2B - 
  BEvaluation$TEAM_BATTING_3B - BEvaluation$TEAM_BATTING_HR

BEvaluation$PREDICT_WIN <- predict(model4, newdata=BEvaluation, interval="confidence")

BTrainingPredict <- cbind(BEvaluation$INDEX, BEvaluation$PREDICT_WIN[, 1], BEvaluation$PREDICT_WIN[, 2], 
                   BEvaluation$PREDICT_WIN[, 3])
colnames(BTrainingPredict) <- c("Index", "Predicted Wins", "CI Lower", "CI Upper")
round(BTrainingPredict,0)

#write our prediction to a file
write.csv(BTrainingPredict, file = "ourPrediction.csv")