Predicting Wins for Baseball Game Using Multiple Linear Regression

INTRODUCTION:

In this particular problem two data sets i.e. training and testing, are provided for Base Ball game. The idea of this study is to explore the training data set by checking the dimension of data set, carrying out descriptive summary statistics and to plots different features against the target variable and also check out the correlation between them. After that we have to prepare our data set for training a model. In order to do that we have to take care of any missing values in the data set and also to address the outliers. Once our data is prepared then we can go ahead and create different models using the features we found more statistically significant. After training the model we will go ahead and use the testing data set to predict the target variable. So Before any further due let’s kickoff this session by data exploration:

DATA EXPLORATION:

In this phase we explore the data and will get to know the data more before going straight towards the preparation or modeling. data explorations helps us understand what we are dealing with. Since the data set provided is about a Base ball game and usually in sports there is a lot of misconception about what are the real indicators of a team wining. As a data scientist our job is to find out those indicators and how they can effect the overall performance that could lead to win or lose for team using past data. In order to achieve that you have to have solid knowledge about the data you are dealing with. That is why data exploration holds a great impact on the overall modeling. Let’s start our data exploration by loading the data set into our markdown.

Loading Data:

The data set provided has been uploaded to github repository from where it has been loaded into the markdown using the code chunk below. The reason why it has uploaded to the github is to keep the work reproducible.

training <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/moneyball-training-data.csv")

Now our data set has been loaded into the environment. Let’s display the first few rows of our data set.

knitr::kable(head(training))
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
1 39 1445 194 39 13 143 842 NA NA NA 9364 84 927 5456 1011 NA
2 70 1339 219 22 190 685 1075 37 28 NA 1347 191 689 1082 193 155
3 86 1377 232 35 137 602 917 46 27 NA 1377 137 602 917 175 153
4 70 1387 209 38 96 451 922 43 30 NA 1396 97 454 928 164 156
5 82 1297 186 27 102 472 920 49 39 NA 1297 102 472 920 138 168
6 75 1279 200 36 92 443 973 107 59 NA 1279 92 443 973 123 149

Everything looks fine but on a first glance we can see that there is an INDEX column which will not be used in the analysis and will be removed from the data set later on in the data preparation stage.

Dimension of Data Set:

Let’s check out the dimension of our raw data set:

dim(training)
## [1] 2276   17

So we can see that we have got 2276 observations and 17 columns which contains INDEX, our target variable TARGET_WINS and all the features. Now these dimension might not remain the same depending on the steps we carry out during data preparation

Descriptive Summary Statistics:

Before we go ahead with plot the relational plot between target and features. let’s check out the descriptive summary for each column. Descriptive summary will give us the mean, median, min, max and quartiles for each column. Descriptive statistics help us to simplify large amounts of data in a sensible way. For instance, finding the average of a column by going through all the 2276 observations could be very hectic thus we use descriptive summary statistics. The code chunk below uses summary() function from base R to give all the summary statistics of each column

knitr::kable(summary(training))
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
Median :1270.5 Median : 82.00 Median :1454 Median :238.0 Median : 47.00 Median :102.00 Median :512.0 Median : 750.0 Median :101.0 Median : 49.0 Median :58.00 Median : 1518 Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
NA NA NA NA NA NA NA NA’s :102 NA’s :131 NA’s :772 NA’s :2085 NA NA NA NA’s :102 NA NA’s :286

As we can see that the table above gave us the summary for even INDEX column too. So before moving and waiting for data preparation remove that from our data set. The code chunk below remove the redundant INDEX column from our data set.

training <- training[,-1]

Plotting the Target Column:

Before moving ahead with relational plot and correlation between target and features. Let’s check out the distribution of of target column TARGET_WINS.

ggplot()+
  geom_histogram(data = training, mapping = aes(x=TARGET_WINS), bins = 50, color = "black", fill = "grey")+geom_vline(xintercept=mean(training$TARGET_WINS), color='red')+labs(x="Target Wins", y="Count",title ="Distribution of Target Wins")+theme_bw()

As we can see that the distribution of TARGET_WINS is fairly normal with mean around 80.8 (represented by red vertical line), min at 0 and maximum at 146. We can observe min = 0 is a worry-some, since it is very rare for a team to go win less so there might be something that needs to be corrected in the data. We can also revisit the summary statistics for TARGET_WINS column to confirm the figures above.

summary(training$TARGET_WINS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00

Relational Plots Between Target and Feature column:

Now we can go ahead and plot the relational plot and we can also check the correlation between target and features. One thing must be kept in mind that the correlation might change also carry out data preparation since we still have the uncleaned, dirty data set. As we know that there are at least 15 features in the data set and it will be impractical to plot the relational graph between the target column and each feature column so we will use pairs.panel from psych library to plot multiple feature on the same graphic. In the graph below our focus will be on the first row and first column. The first row tell us the correlation between TARGET_WINS and feature column and similarly the first column show the relational plot between the same columns. On the diagonal we have the distribution of each column.

Scatter_Matrix <- pairs.panels(training[, c(1, 2:6)], main = "Scatter Plot Matrix for Training Dataset")

In the matrix plot above the value in the first row after TARGET_WINS distribution plot shows the correlation between TARGET_WINS and TEAM_BATTING_H which is .39. Similarly the scatter plot below the TARGET_WINS distribution is the plot between TARGET_WINS and TEAM_BATTING_H with fitted regression line in red color. In the matrix above we have only considered the frist 5 features because to avoid over crowdness. For the remaining features we have the matrix plot below:

pairs.panels(training[, c(1, 7:11)], main = "Scatter Plot Matrix for Training Dataset")

pairs.panels(training[, c(1, 11:16)], main = "Scatter Plot Matrix for Training Dataset")

From the above matrix plots we can see that diagonals have distributions of each features and we can see that some of these distributions contains outliers that needs to be address in our data preparation stage.

Missing values:

Let’s check out the missing values in the columns. We will check the number of missing values in each column using the code check below:

knitr::kable(colSums(is.na(training)))
x
TARGET_WINS 0
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 102
TEAM_BASERUN_SB 131
TEAM_BASERUN_CS 772
TEAM_BATTING_HBP 2085
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 102
TEAM_FIELDING_E 0
TEAM_FIELDING_DP 286

we can see that some the columns has missing values which also needs to be addressed in our data preparation stage.

DATA PREPARATION:

Every data set has some imperfection and so does ours. As we saw in the data exploration section that our data does contains missing values and outliers so let’s deal with that. We will try to address both missing values and outliers accordingly.

Fixing Missing Values and Outliers:

Removing Missing values:

As we saw earlier that some of the feature columns like TEAM_BATTING_HBP,TEAM_BASERUN_CS and TEAM_FIELDING_DP has a lot of missing values and it would be to the best of our interest to remove them from our data set since replacing them might create the issue of fidelity and bias.

training <- training[, !names(training) %in% c('TEAM_BATTING_HBP','TEAM_BASERUN_CS','TEAM_FIELDING_DP')]

So the features with high number of missing values are being removed from the data set. We can check the dimensions now.

dim(training)
## [1] 2276   13

Impute Missing values:

Let’s check out the remaining features for missing values

knitr::kable(colSums(is.na(training)))
x
TARGET_WINS 0
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 102
TEAM_BASERUN_SB 131
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 102
TEAM_FIELDING_E 0

As we can see that features like TEAM_BATTING_SO,TEAM_BASERUN_SB and TEAM_PITCHING_SO has some missing values and the least of those missing values are 102 in number which comes up to be 4.5% of the total data set. According to some of the data scientists any data set with 3% or lower missing values can be dealt with by removing the those observations from the analysis. Over here we will try to replace the missing values with median of their respective columns since there is skweness in the distribution and medians are insensitive to skewness.

training$TEAM_BATTING_SO[is.na(training$TEAM_BATTING_SO)] <- median(training$TEAM_BATTING_SO, na.rm = TRUE)
training$TEAM_BASERUN_SB[is.na(training$TEAM_BASERUN_SB)] <- median(training$TEAM_BASERUN_SB, na.rm = TRUE)
training$TEAM_PITCHING_SO[is.na(training$TEAM_PITCHING_SO)] <- median(training$TEAM_PITCHING_SO, na.rm = TRUE)
dim(training)
## [1] 2276   13

Fixing Outliers:

Looking at the summary and the plots below we see that PITCHING_H, PITCHING_BB, `PITCHING_O, and FIELDING_E are all skewed by their outliers. We also have some fields with a few missing values. Our plan to fix that is to pick any value that is 3 standard deviations above the mean and impute them as the median.

training$TEAM_PITCHING_H[training$TEAM_PITCHING_H > 3*sd(training$TEAM_PITCHING_H)] <- median(training$TEAM_PITCHING_H)
training$TEAM_PITCHING_BB[training$TEAM_PITCHING_BB > 3*sd(training$TEAM_PITCHING_BB)] <- median(training$TEAM_PITCHING_BB)
training$TEAM_PITCHING_SO[training$TEAM_PITCHING_SO > 3*sd(training$TEAM_PITCHING_SO)] <- median(training$TEAM_PITCHING_SO)
training$TEAM_FIELDING_E[training$TEAM_FIELDING_E > 3*sd(training$TEAM_FIELDING_E)] <- median(training$TEAM_FIELDING_E)

Now that our data set is ready to be used to model here is the final distribution of target and features columns.

Final Distribution Check:

ggplot(melt(training), aes(x=value)) + geom_histogram(color = 'black', fill = 'grey') + facet_wrap(~variable, scale='free') + labs(x='', y='Frequency')+theme_bw()

BUILDING MODELS:

Our data set is ready to be used for modeling. In this section we will create different models and then based on the statistical significance of the features used in the models. Out of these models that we will create, we are going to select one and use that model to predict using our prediction data set in the next section.

Model 1:

Even though we already know that some the features like TEAM_PITCHING_BB, TEAM_PITCHING_HR has weak correlation with our target which TARGET_WINS but we in our first model we will try to fit all the features in the model and then check there statistical significance. The code chunk below will create our first model

m1 <- lm(TARGET_WINS ~., training)

We can check the intercept and coefficients of the model above using summary function. The summary function also give us the p-value of each coefficient which shows the statistical significance of the that coefficient. So let’s apply summary function to our model:

summary(m1)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.862  -8.582   0.353   8.578  56.341 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.414986   5.607021  -0.252  0.80079    
## TEAM_BATTING_H    0.038461   0.003769  10.204  < 2e-16 ***
## TEAM_BATTING_2B  -0.019783   0.009431  -2.098  0.03604 *  
## TEAM_BATTING_3B   0.102623   0.017633   5.820 6.72e-09 ***
## TEAM_BATTING_HR   0.075863   0.027554   2.753  0.00595 ** 
## TEAM_BATTING_BB   0.035305   0.004431   7.968 2.52e-15 ***
## TEAM_BATTING_SO   0.011331   0.004418   2.565  0.01039 *  
## TEAM_BASERUN_SB   0.030714   0.004590   6.692 2.76e-11 ***
## TEAM_PITCHING_H   0.006600   0.001173   5.625 2.08e-08 ***
## TEAM_PITCHING_HR -0.036080   0.024315  -1.484  0.13798    
## TEAM_PITCHING_BB -0.018563   0.007410  -2.505  0.01230 *  
## TEAM_PITCHING_SO -0.007138   0.003582  -1.993  0.04643 *  
## TEAM_FIELDING_E  -0.022174   0.003612  -6.138 9.82e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 2263 degrees of freedom
## Multiple R-squared:   0.27,  Adjusted R-squared:  0.2661 
## F-statistic: 69.74 on 12 and 2263 DF,  p-value: < 2.2e-16

As we cans see that TEAM_PITCHING_HR has a very high p-value which means that it is not statistically significant and it will be in our best interest to remove that from our model. Below are the plots which shows residuals vs fitted values, QQ plot, residuals distributions, standardized residuals and residuals vs leverage.

par(mfrow=c(2,2))
plot(m1)

hist(resid(m1), main="Histogram of Residuals")

Model 2:

As we saw in model 1 that TEAM_PITCHING_HR has a high p-value. Let’s create another model without TEAM_PITCHING_HRand see how that performs.

m2<- lm(TARGET_WINS~TEAM_BATTING_H+TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR+TEAM_BATTING_BB+TEAM_BATTING_SO+TEAM_BASERUN_SB+TEAM_PITCHING_H+TEAM_PITCHING_BB+TEAM_FIELDING_E, training)

Now our model is ready let’s check the summary of our model.

summary(m2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_FIELDING_E, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.792  -8.559   0.415   8.600  58.424 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.296543   5.513356   0.417 0.677053    
## TEAM_BATTING_H    0.037750   0.003753  10.059  < 2e-16 ***
## TEAM_BATTING_2B  -0.017073   0.009419  -1.813 0.070013 .  
## TEAM_BATTING_3B   0.098129   0.017593   5.578 2.73e-08 ***
## TEAM_BATTING_HR   0.039122   0.009739   4.017 6.08e-05 ***
## TEAM_BATTING_BB   0.040121   0.004191   9.573  < 2e-16 ***
## TEAM_BATTING_SO   0.003934   0.002289   1.719 0.085833 .  
## TEAM_BASERUN_SB   0.030545   0.004591   6.653 3.58e-11 ***
## TEAM_PITCHING_H   0.004893   0.001021   4.790 1.78e-06 ***
## TEAM_PITCHING_BB -0.025057   0.007169  -3.495 0.000483 ***
## TEAM_FIELDING_E  -0.021481   0.003610  -5.950 3.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.52 on 2265 degrees of freedom
## Multiple R-squared:  0.2661, Adjusted R-squared:  0.2628 
## F-statistic: 82.11 on 10 and 2265 DF,  p-value: < 2.2e-16

As we can that after removing the TEAM_PITCHING_HR from the model our r-squared has dropped down a bit but at the same time features like TEAM_BATTING_SO and TEAM_BATTING_2B has lost their significance too, with p-values way over .05. So let’s remove these from our analysis and create a new model.

par(mfrow=c(2,2))
plot(m2)

hist(resid(m2), main="Histogram of Residuals")

Model 3:

In our model 3 we have removed all the three weakly correlated and low statistically significant features from our consideration. The following code chunk contains our model.

m3<- lm(TARGET_WINS~TEAM_BATTING_H+TEAM_BATTING_3B+TEAM_BATTING_HR+TEAM_BATTING_BB+TEAM_BASERUN_SB+TEAM_PITCHING_H+TEAM_PITCHING_BB+TEAM_FIELDING_E, training)

Let’s check the summary of our model 3.

summary(m3)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + 
##     TEAM_PITCHING_BB + TEAM_FIELDING_E, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.259  -8.493   0.383   8.484  56.343 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.000771   3.926302   2.547 0.010927 *  
## TEAM_BATTING_H    0.031665   0.002537  12.482  < 2e-16 ***
## TEAM_BATTING_3B   0.098564   0.017475   5.640 1.91e-08 ***
## TEAM_BATTING_HR   0.047017   0.007669   6.131 1.03e-09 ***
## TEAM_BATTING_BB   0.038505   0.004128   9.328  < 2e-16 ***
## TEAM_BASERUN_SB   0.033250   0.004347   7.650 2.96e-14 ***
## TEAM_PITCHING_H   0.004579   0.001012   4.526 6.33e-06 ***
## TEAM_PITCHING_BB -0.024725   0.007151  -3.458 0.000555 ***
## TEAM_FIELDING_E  -0.021393   0.003558  -6.013 2.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.53 on 2267 degrees of freedom
## Multiple R-squared:  0.2645, Adjusted R-squared:  0.2619 
## F-statistic: 101.9 on 8 and 2267 DF,  p-value: < 2.2e-16

As we can see that all the features used in the model 3 are statistically significant. Features like TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BASERUN_SB and TEAM_PITCHING_H which are Base hits by batters, Triples by batters, homeruns by batters, walks by batters, stolen bases, and hits allowed, respectively, are all contributing positively towards the wins. Similarly, TEAM_PITCHING_BB and TEAM_FIELDING_E which are walks allowed and errors, respectively, are contributing negatively towards the wins.

Even though r-square went a touch high but all the features are statistically significant. Below are the plots which shows residuals vs fitted values, QQ plot, residuals distributions, standardized residuals and residuals vs leverage.

par(mfrow=c(2,2))
plot(m3)

hist(resid(m3), main = "Histogram of Residuals")

SELECTING MODELS AND PREDICTING:

In this section we will select the model that we are working with and we will use the predicting data to predict the target variable.

Selecting Models:

We are going with model 3 for prediction since all the feature all statistically significant and there is no weak correlation between the target and features. Another reason why we are not selecting m1 which has all the features and a better r-squared value is the result from analysis of variance (ANOVA). The code chunk below show ANOVA between the biggest model and the samllest model in terms of features used.

anova(m1, m3)
## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + 
##     TEAM_FIELDING_E
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)   
## 1   2263 412102                               
## 2   2267 415188 -4   -3085.9 4.2365 0.00203 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see that the p-value is .00203 which below .05, meaning that the null hypothesis can not be rejected. Which in turn means that the results produce by each model i.e. m1 and m3 are not that varied from each other. So we will go with the smaller model in order to save computing power. Below are some of the main features of model 3 (m3)

sum_m3 <- summary(m3)
RSS <- c(crossprod(m3$residuals))
MSE <- RSS/length(m3$residuals)
print(paste0("Mean Squared Error: ", MSE))
## [1] "Mean Squared Error: 182.419913764793"
print(paste0("Root MSE: ", sqrt(MSE)))
## [1] "Root MSE: 13.5062916363002"
print(paste0("Adjusted R-squared: ", sum_m3$adj.r.squared))
## [1] "Adjusted R-squared: 0.261903459682437"
print(paste0("F-statistic: ",sum_m3$fstatistic[1]))
## [1] "F-statistic: 101.90657831176"

Here is the distribution of the residuals which show no pattern accross the horizontal line.

plot(resid(m3))
abline(h=0, col=2)

Predicting the Data:

Now we have selected our model so we can go ahead and load the testing data set to predict the TARGET_WINS using m3 model. The following code chunk load the data in our environment.

testing <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/moneyball-evaluation-data.csv")

Here is the first few row of our testing data set:

knitr::kable(head(testing))
INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
9 1209 170 33 83 447 1080 62 50 NA 1209 83 447 1080 140 156
10 1221 151 29 88 516 929 54 39 NA 1221 88 516 929 135 164
14 1395 183 29 93 509 816 59 47 NA 1395 93 509 816 156 153
47 1539 309 29 159 486 914 148 57 42 1539 159 486 914 124 154
60 1445 203 68 5 95 416 NA NA NA 3902 14 257 1123 616 130
63 1431 236 53 10 215 377 NA NA NA 2793 20 420 736 572 105

Removing Un-necessary columns:

Let’s remove the columns that we do not need for this analysis:

testing <- testing[, !names(testing) %in% c('INDEX','TEAM_BATTING_HBP','TEAM_BASERUN_CS','TEAM_FIELDING_DP')]
knitr::kable(head(testing))
TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
1209 170 33 83 447 1080 62 1209 83 447 1080 140
1221 151 29 88 516 929 54 1221 88 516 929 135
1395 183 29 93 509 816 59 1395 93 509 816 156
1539 309 29 159 486 914 148 1539 159 486 914 124
1445 203 68 5 95 416 NA 3902 14 257 1123 616
1431 236 53 10 215 377 NA 2793 20 420 736 572
dim(testing)
## [1] 259  12

Fixing Missing Values:

Let’s check out and fix the missing values in our testing data set:

knitr::kable(colSums(is.na(testing)))
x
TEAM_BATTING_H 0
TEAM_BATTING_2B 0
TEAM_BATTING_3B 0
TEAM_BATTING_HR 0
TEAM_BATTING_BB 0
TEAM_BATTING_SO 18
TEAM_BASERUN_SB 13
TEAM_PITCHING_H 0
TEAM_PITCHING_HR 0
TEAM_PITCHING_BB 0
TEAM_PITCHING_SO 18
TEAM_FIELDING_E 0

For Imputation we have used MICE package:

testing <- mice(testing, m=5, maxit = 5, method = 'pmm')
testing <- complete(testing)

Fixing Outliers:

testing$TEAM_PITCHING_H[testing$TEAM_PITCHING_H > 3*sd(testing$TEAM_PITCHING_H)] <- median(testing$TEAM_PITCHING_H)
testing$TEAM_PITCHING_BB[testing$TEAM_PITCHING_BB > 3*sd(testing$TEAM_PITCHING_BB)] <- median(testing$TEAM_PITCHING_BB)
testing$TEAM_PITCHING_SO[testing$TEAM_PITCHING_SO > 3*sd(testing$TEAM_PITCHING_SO)] <- median(testing$TEAM_PITCHING_SO)
testing$TEAM_FIELDING_E[testing$TEAM_FIELDING_E > 3*sd(testing$TEAM_FIELDING_E)] <- median(testing$TEAM_FIELDING_E)

Final Prediction:

The following Code chunk predicts the TARGET_WINS using testing data set:

final <- predict(m3, newdata = testing, interval="prediction")
knitr::kable(head(final,3))
fit lwr upr
66.19992 39.62767 92.77216
67.26743 40.69895 93.83591
73.42938 46.87113 99.98763

The above data set gives us the predicted values. Fit columns contains the fitted values for testing data set while upper and lower column contain the limit values for 95% confidence interval.

CONCLUSION:

In this particular setting a moderately dirty training data set was given to train a model and then use that model to predict the target variable using testing data set. Initially the data was loaded into the environment and explored through descriptive summary statistics and multiple distribution and relational plots. Data exploration was followed by data preparation stage where the training data was wrangled by dealing with the missing values and outliers. Once the training data set was ready, it was then put forward to create models. Three different models were created based on statistical significance of the features used in the model. After creating models, one model was selected based on ANOVA and statistical significance of the features. The selected model was then applied to the testing data set to predict the final output.