Predicting Wins for Baseball Game Using Multiple Linear Regression

INTRODUCTION:

In this particular problem two data sets i.e. training and testing, are provided for Base Ball game. The idea of this study is to explore the training data set by checking the dimension of data set, carrying out descriptive summary statistics and to plots different features against the target variable and also check out the correlation between them. After that we have to prepare our data set for training a model. In order to do that we have to take care of any missing values in the data set and also to address the outliers. Once our data is prepared then we can go ahead and create different models using the features we found more statistically significant. After training the model we will go ahead and use the testing data set to predict the target variable. So Before any further due let’s kickoff this session by data exploration:

DATA EXPLORATION:

In this phase we explore the data and will get to know the data more before going straight towards the preparation or modeling. data explorations helps us understand what we are dealing with. Since the data set provided is about a Base ball game and usually in sports there is a lot of misconception about what are the real indicators of a team wining. As a data scientist our job is to find out those indicators and how they can effect the overall performance that could lead to win or lose for team using past data. In order to achieve that you have to have solid knowledge about the data you are dealing with. That is why data exploration holds a great impact on the overall modeling. Let’s start our data exploration by loading the data set into our markdown.

Loading Data:

The data set provided has been uploaded to github repository from where it has been loaded into the markdown using the code chunk below. The reason why it has uploaded to the github is to keep the work reproducible.

training <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/moneyball-training-data.csv")

Now our data set has been loaded into the environment. Let’s display the first few rows of our data set.

knitr::kable(head(training))

INDEX	TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
1	39	1445	194	39	13	143	842	NA	NA	NA	9364	84	927	5456	1011	NA
2	70	1339	219	22	190	685	1075	37	28	NA	1347	191	689	1082	193	155
3	86	1377	232	35	137	602	917	46	27	NA	1377	137	602	917	175	153
4	70	1387	209	38	96	451	922	43	30	NA	1396	97	454	928	164	156
5	82	1297	186	27	102	472	920	49	39	NA	1297	102	472	920	138	168
6	75	1279	200	36	92	443	973	107	59	NA	1279	92	443	973	123	149

Everything looks fine but on a first glance we can see that there is an INDEX column which will not be used in the analysis and will be removed from the data set later on in the data preparation stage.

Dimension of Data Set:

Let’s check out the dimension of our raw data set:

dim(training)

## [1] 2276   17

So we can see that we have got 2276 observations and 17 columns which contains INDEX, our target variable TARGET_WINS and all the features. Now these dimension might not remain the same depending on the steps we carry out during data preparation

Descriptive Summary Statistics:

Before we go ahead with plot the relational plot between target and features. let’s check out the descriptive summary for each column. Descriptive summary will give us the mean, median, min, max and quartiles for each column. Descriptive statistics help us to simplify large amounts of data in a sensible way. For instance, finding the average of a column by going through all the 2276 observations could be very hectic thus we use descriptive summary statistics. The code chunk below uses summary() function from base R to give all the summary statistics of each column

knitr::kable(summary(training))

INDEX	TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
Min. : 1.0	Min. : 0.00	Min. : 891	Min. : 69.0	Min. : 0.00	Min. : 0.00	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. :29.00	Min. : 1137	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 65.0	Min. : 52.0
1st Qu.: 630.8	1st Qu.: 71.00	1st Qu.:1383	1st Qu.:208.0	1st Qu.: 34.00	1st Qu.: 42.00	1st Qu.:451.0	1st Qu.: 548.0	1st Qu.: 66.0	1st Qu.: 38.0	1st Qu.:50.50	1st Qu.: 1419	1st Qu.: 50.0	1st Qu.: 476.0	1st Qu.: 615.0	1st Qu.: 127.0	1st Qu.:131.0
Median :1270.5	Median : 82.00	Median :1454	Median :238.0	Median : 47.00	Median :102.00	Median :512.0	Median : 750.0	Median :101.0	Median : 49.0	Median :58.00	Median : 1518	Median :107.0	Median : 536.5	Median : 813.5	Median : 159.0	Median :149.0
Mean :1268.5	Mean : 80.79	Mean :1469	Mean :241.2	Mean : 55.25	Mean : 99.61	Mean :501.6	Mean : 735.6	Mean :124.8	Mean : 52.8	Mean :59.36	Mean : 1779	Mean :105.7	Mean : 553.0	Mean : 817.7	Mean : 246.5	Mean :146.4
3rd Qu.:1915.5	3rd Qu.: 92.00	3rd Qu.:1537	3rd Qu.:273.0	3rd Qu.: 72.00	3rd Qu.:147.00	3rd Qu.:580.0	3rd Qu.: 930.0	3rd Qu.:156.0	3rd Qu.: 62.0	3rd Qu.:67.00	3rd Qu.: 1682	3rd Qu.:150.0	3rd Qu.: 611.0	3rd Qu.: 968.0	3rd Qu.: 249.2	3rd Qu.:164.0
Max. :2535.0	Max. :146.00	Max. :2554	Max. :458.0	Max. :223.00	Max. :264.00	Max. :878.0	Max. :1399.0	Max. :697.0	Max. :201.0	Max. :95.00	Max. :30132	Max. :343.0	Max. :3645.0	Max. :19278.0	Max. :1898.0	Max. :228.0
NA	NA	NA	NA	NA	NA	NA	NA’s :102	NA’s :131	NA’s :772	NA’s :2085	NA	NA	NA	NA’s :102	NA	NA’s :286

As we can see that the table above gave us the summary for even INDEX column too. So before moving and waiting for data preparation remove that from our data set. The code chunk below remove the redundant INDEX column from our data set.

training <- training[,-1]

Plotting the Target Column:

Before moving ahead with relational plot and correlation between target and features. Let’s check out the distribution of of target column TARGET_WINS.

ggplot()+
  geom_histogram(data = training, mapping = aes(x=TARGET_WINS), bins = 50, color = "black", fill = "grey")+geom_vline(xintercept=mean(training$TARGET_WINS), color='red')+labs(x="Target Wins", y="Count",title ="Distribution of Target Wins")+theme_bw()

As we can see that the distribution of TARGET_WINS is fairly normal with mean around 80.8 (represented by red vertical line), min at 0 and maximum at 146. We can observe min = 0 is a worry-some, since it is very rare for a team to go win less so there might be something that needs to be corrected in the data. We can also revisit the summary statistics for TARGET_WINS column to confirm the figures above.

summary(training$TARGET_WINS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00

Relational Plots Between Target and Feature column:

Now we can go ahead and plot the relational plot and we can also check the correlation between target and features. One thing must be kept in mind that the correlation might change also carry out data preparation since we still have the uncleaned, dirty data set. As we know that there are at least 15 features in the data set and it will be impractical to plot the relational graph between the target column and each feature column so we will use pairs.panel from psych library to plot multiple feature on the same graphic. In the graph below our focus will be on the first row and first column. The first row tell us the correlation between TARGET_WINS and feature column and similarly the first column show the relational plot between the same columns. On the diagonal we have the distribution of each column.

Scatter_Matrix <- pairs.panels(training[, c(1, 2:6)], main = "Scatter Plot Matrix for Training Dataset")

In the matrix plot above the value in the first row after TARGET_WINS distribution plot shows the correlation between TARGET_WINS and TEAM_BATTING_H which is .39. Similarly the scatter plot below the TARGET_WINS distribution is the plot between TARGET_WINS and TEAM_BATTING_H with fitted regression line in red color. In the matrix above we have only considered the frist 5 features because to avoid over crowdness. For the remaining features we have the matrix plot below:

pairs.panels(training[, c(1, 7:11)], main = "Scatter Plot Matrix for Training Dataset")

pairs.panels(training[, c(1, 11:16)], main = "Scatter Plot Matrix for Training Dataset")

From the above matrix plots we can see that diagonals have distributions of each features and we can see that some of these distributions contains outliers that needs to be address in our data preparation stage.

Missing values:

Let’s check out the missing values in the columns. We will check the number of missing values in each column using the code check below:

knitr::kable(colSums(is.na(training)))

	x
TARGET_WINS	0
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	102
TEAM_BASERUN_SB	131
TEAM_BASERUN_CS	772
TEAM_BATTING_HBP	2085
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	102
TEAM_FIELDING_E	0
TEAM_FIELDING_DP	286

we can see that some the columns has missing values which also needs to be addressed in our data preparation stage.

DATA PREPARATION:

Every data set has some imperfection and so does ours. As we saw in the data exploration section that our data does contains missing values and outliers so let’s deal with that. We will try to address both missing values and outliers accordingly.

Fixing Missing Values and Outliers:

Removing Missing values:

As we saw earlier that some of the feature columns like TEAM_BATTING_HBP,TEAM_BASERUN_CS and TEAM_FIELDING_DP has a lot of missing values and it would be to the best of our interest to remove them from our data set since replacing them might create the issue of fidelity and bias.

training <- training[, !names(training) %in% c('TEAM_BATTING_HBP','TEAM_BASERUN_CS','TEAM_FIELDING_DP')]

So the features with high number of missing values are being removed from the data set. We can check the dimensions now.

dim(training)

## [1] 2276   13

Impute Missing values:

Let’s check out the remaining features for missing values

knitr::kable(colSums(is.na(training)))

	x
TARGET_WINS	0
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	102
TEAM_BASERUN_SB	131
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	102
TEAM_FIELDING_E	0

As we can see that features like TEAM_BATTING_SO,TEAM_BASERUN_SB and TEAM_PITCHING_SO has some missing values and the least of those missing values are 102 in number which comes up to be 4.5% of the total data set. According to some of the data scientists any data set with 3% or lower missing values can be dealt with by removing the those observations from the analysis. Over here we will try to replace the missing values with median of their respective columns since there is skweness in the distribution and medians are insensitive to skewness.

training$TEAM_BATTING_SO[is.na(training$TEAM_BATTING_SO)] <- median(training$TEAM_BATTING_SO, na.rm = TRUE)
training$TEAM_BASERUN_SB[is.na(training$TEAM_BASERUN_SB)] <- median(training$TEAM_BASERUN_SB, na.rm = TRUE)
training$TEAM_PITCHING_SO[is.na(training$TEAM_PITCHING_SO)] <- median(training$TEAM_PITCHING_SO, na.rm = TRUE)

dim(training)

## [1] 2276   13

Fixing Outliers:

Looking at the summary and the plots below we see that PITCHING_H, PITCHING_BB, `PITCHING_O, and FIELDING_E are all skewed by their outliers. We also have some fields with a few missing values. Our plan to fix that is to pick any value that is 3 standard deviations above the mean and impute them as the median.

training$TEAM_PITCHING_H[training$TEAM_PITCHING_H > 3*sd(training$TEAM_PITCHING_H)] <- median(training$TEAM_PITCHING_H)
training$TEAM_PITCHING_BB[training$TEAM_PITCHING_BB > 3*sd(training$TEAM_PITCHING_BB)] <- median(training$TEAM_PITCHING_BB)
training$TEAM_PITCHING_SO[training$TEAM_PITCHING_SO > 3*sd(training$TEAM_PITCHING_SO)] <- median(training$TEAM_PITCHING_SO)
training$TEAM_FIELDING_E[training$TEAM_FIELDING_E > 3*sd(training$TEAM_FIELDING_E)] <- median(training$TEAM_FIELDING_E)

Now that our data set is ready to be used to model here is the final distribution of target and features columns.

Final Distribution Check:

ggplot(melt(training), aes(x=value)) + geom_histogram(color = 'black', fill = 'grey') + facet_wrap(~variable, scale='free') + labs(x='', y='Frequency')+theme_bw()

BUILDING MODELS:

Our data set is ready to be used for modeling. In this section we will create different models and then based on the statistical significance of the features used in the models. Out of these models that we will create, we are going to select one and use that model to predict using our prediction data set in the next section.

Model 1:

Even though we already know that some the features like TEAM_PITCHING_BB, TEAM_PITCHING_HR has weak correlation with our target which TARGET_WINS but we in our first model we will try to fit all the features in the model and then check there statistical significance. The code chunk below will create our first model

m1 <- lm(TARGET_WINS ~., training)

We can check the intercept and coefficients of the model above using summary function. The summary function also give us the p-value of each coefficient which shows the statistical significance of the that coefficient. So let’s apply summary function to our model:

summary(m1)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.862  -8.582   0.353   8.578  56.341 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.414986   5.607021  -0.252  0.80079    
## TEAM_BATTING_H    0.038461   0.003769  10.204  < 2e-16 ***
## TEAM_BATTING_2B  -0.019783   0.009431  -2.098  0.03604 *  
## TEAM_BATTING_3B   0.102623   0.017633   5.820 6.72e-09 ***
## TEAM_BATTING_HR   0.075863   0.027554   2.753  0.00595 ** 
## TEAM_BATTING_BB   0.035305   0.004431   7.968 2.52e-15 ***
## TEAM_BATTING_SO   0.011331   0.004418   2.565  0.01039 *  
## TEAM_BASERUN_SB   0.030714   0.004590   6.692 2.76e-11 ***
## TEAM_PITCHING_H   0.006600   0.001173   5.625 2.08e-08 ***
## TEAM_PITCHING_HR -0.036080   0.024315  -1.484  0.13798    
## TEAM_PITCHING_BB -0.018563   0.007410  -2.505  0.01230 *  
## TEAM_PITCHING_SO -0.007138   0.003582  -1.993  0.04643 *  
## TEAM_FIELDING_E  -0.022174   0.003612  -6.138 9.82e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 2263 degrees of freedom
## Multiple R-squared:   0.27,  Adjusted R-squared:  0.2661 
## F-statistic: 69.74 on 12 and 2263 DF,  p-value: < 2.2e-16

As we cans see that TEAM_PITCHING_HR has a very high p-value which means that it is not statistically significant and it will be in our best interest to remove that from our model. Below are the plots which shows residuals vs fitted values, QQ plot, residuals distributions, standardized residuals and residuals vs leverage.

par(mfrow=c(2,2))
plot(m1)

hist(resid(m1), main="Histogram of Residuals")

Model 2:

As we saw in model 1 that TEAM_PITCHING_HR has a high p-value. Let’s create another model without TEAM_PITCHING_HRand see how that performs.

m2<- lm(TARGET_WINS~TEAM_BATTING_H+TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR+TEAM_BATTING_BB+TEAM_BATTING_SO+TEAM_BASERUN_SB+TEAM_PITCHING_H+TEAM_PITCHING_BB+TEAM_FIELDING_E, training)

Now our model is ready let’s check the summary of our model.

summary(m2)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_FIELDING_E, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.792  -8.559   0.415   8.600  58.424 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.296543   5.513356   0.417 0.677053    
## TEAM_BATTING_H    0.037750   0.003753  10.059  < 2e-16 ***
## TEAM_BATTING_2B  -0.017073   0.009419  -1.813 0.070013 .  
## TEAM_BATTING_3B   0.098129   0.017593   5.578 2.73e-08 ***
## TEAM_BATTING_HR   0.039122   0.009739   4.017 6.08e-05 ***
## TEAM_BATTING_BB   0.040121   0.004191   9.573  < 2e-16 ***
## TEAM_BATTING_SO   0.003934   0.002289   1.719 0.085833 .  
## TEAM_BASERUN_SB   0.030545   0.004591   6.653 3.58e-11 ***
## TEAM_PITCHING_H   0.004893   0.001021   4.790 1.78e-06 ***
## TEAM_PITCHING_BB -0.025057   0.007169  -3.495 0.000483 ***
## TEAM_FIELDING_E  -0.021481   0.003610  -5.950 3.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.52 on 2265 degrees of freedom
## Multiple R-squared:  0.2661, Adjusted R-squared:  0.2628 
## F-statistic: 82.11 on 10 and 2265 DF,  p-value: < 2.2e-16

As we can that after removing the TEAM_PITCHING_HR from the model our r-squared has dropped down a bit but at the same time features like TEAM_BATTING_SO and TEAM_BATTING_2B has lost their significance too, with p-values way over .05. So let’s remove these from our analysis and create a new model.

par(mfrow=c(2,2))
plot(m2)

hist(resid(m2), main="Histogram of Residuals")

Model 3:

In our model 3 we have removed all the three weakly correlated and low statistically significant features from our consideration. The following code chunk contains our model.

m3<- lm(TARGET_WINS~TEAM_BATTING_H+TEAM_BATTING_3B+TEAM_BATTING_HR+TEAM_BATTING_BB+TEAM_BASERUN_SB+TEAM_PITCHING_H+TEAM_PITCHING_BB+TEAM_FIELDING_E, training)

Let’s check the summary of our model 3.

summary(m3)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + 
##     TEAM_PITCHING_BB + TEAM_FIELDING_E, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.259  -8.493   0.383   8.484  56.343 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.000771   3.926302   2.547 0.010927 *  
## TEAM_BATTING_H    0.031665   0.002537  12.482  < 2e-16 ***
## TEAM_BATTING_3B   0.098564   0.017475   5.640 1.91e-08 ***
## TEAM_BATTING_HR   0.047017   0.007669   6.131 1.03e-09 ***
## TEAM_BATTING_BB   0.038505   0.004128   9.328  < 2e-16 ***
## TEAM_BASERUN_SB   0.033250   0.004347   7.650 2.96e-14 ***
## TEAM_PITCHING_H   0.004579   0.001012   4.526 6.33e-06 ***
## TEAM_PITCHING_BB -0.024725   0.007151  -3.458 0.000555 ***
## TEAM_FIELDING_E  -0.021393   0.003558  -6.013 2.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.53 on 2267 degrees of freedom
## Multiple R-squared:  0.2645, Adjusted R-squared:  0.2619 
## F-statistic: 101.9 on 8 and 2267 DF,  p-value: < 2.2e-16

As we can see that all the features used in the model 3 are statistically significant. Features like TEAM_BATTING_H, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BASERUN_SB and TEAM_PITCHING_H which are Base hits by batters, Triples by batters, homeruns by batters, walks by batters, stolen bases, and hits allowed, respectively, are all contributing positively towards the wins. Similarly, TEAM_PITCHING_BB and TEAM_FIELDING_E which are walks allowed and errors, respectively, are contributing negatively towards the wins.

Even though r-square went a touch high but all the features are statistically significant. Below are the plots which shows residuals vs fitted values, QQ plot, residuals distributions, standardized residuals and residuals vs leverage.

par(mfrow=c(2,2))
plot(m3)

hist(resid(m3), main = "Histogram of Residuals")

SELECTING MODELS AND PREDICTING:

In this section we will select the model that we are working with and we will use the predicting data to predict the target variable.

Selecting Models:

We are going with model 3 for prediction since all the feature all statistically significant and there is no weak correlation between the target and features. Another reason why we are not selecting m1 which has all the features and a better r-squared value is the result from analysis of variance (ANOVA). The code chunk below show ANOVA between the biggest model and the samllest model in terms of features used.

anova(m1, m3)

## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + 
##     TEAM_FIELDING_E
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)   
## 1   2263 412102                               
## 2   2267 415188 -4   -3085.9 4.2365 0.00203 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see that the p-value is .00203 which below .05, meaning that the null hypothesis can not be rejected. Which in turn means that the results produce by each model i.e. m1 and m3 are not that varied from each other. So we will go with the smaller model in order to save computing power. Below are some of the main features of model 3 (m3)

sum_m3 <- summary(m3)
RSS <- c(crossprod(m3$residuals))
MSE <- RSS/length(m3$residuals)
print(paste0("Mean Squared Error: ", MSE))

## [1] "Mean Squared Error: 182.419913764793"

print(paste0("Root MSE: ", sqrt(MSE)))

## [1] "Root MSE: 13.5062916363002"

print(paste0("Adjusted R-squared: ", sum_m3$adj.r.squared))

## [1] "Adjusted R-squared: 0.261903459682437"

print(paste0("F-statistic: ",sum_m3$fstatistic[1]))

## [1] "F-statistic: 101.90657831176"

Here is the distribution of the residuals which show no pattern accross the horizontal line.

plot(resid(m3))
abline(h=0, col=2)

Predicting the Data:

Now we have selected our model so we can go ahead and load the testing data set to predict the TARGET_WINS using m3 model. The following code chunk load the data in our environment.

testing <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/moneyball-evaluation-data.csv")

Here is the first few row of our testing data set:

knitr::kable(head(testing))

INDEX	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
9	1209	170	33	83	447	1080	62	50	NA	1209	83	447	1080	140	156
10	1221	151	29	88	516	929	54	39	NA	1221	88	516	929	135	164
14	1395	183	29	93	509	816	59	47	NA	1395	93	509	816	156	153
47	1539	309	29	159	486	914	148	57	42	1539	159	486	914	124	154
60	1445	203	68	5	95	416	NA	NA	NA	3902	14	257	1123	616	130
63	1431	236	53	10	215	377	NA	NA	NA	2793	20	420	736	572	105

Removing Un-necessary columns:

Let’s remove the columns that we do not need for this analysis:

testing <- testing[, !names(testing) %in% c('INDEX','TEAM_BATTING_HBP','TEAM_BASERUN_CS','TEAM_FIELDING_DP')]

knitr::kable(head(testing))

TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E
1209	170	33	83	447	1080	62	1209	83	447	1080	140
1221	151	29	88	516	929	54	1221	88	516	929	135
1395	183	29	93	509	816	59	1395	93	509	816	156
1539	309	29	159	486	914	148	1539	159	486	914	124
1445	203	68	5	95	416	NA	3902	14	257	1123	616
1431	236	53	10	215	377	NA	2793	20	420	736	572

dim(testing)

## [1] 259  12

Fixing Missing Values:

Let’s check out and fix the missing values in our testing data set:

knitr::kable(colSums(is.na(testing)))

	x
TEAM_BATTING_H	0
TEAM_BATTING_2B	0
TEAM_BATTING_3B	0
TEAM_BATTING_HR	0
TEAM_BATTING_BB	0
TEAM_BATTING_SO	18
TEAM_BASERUN_SB	13
TEAM_PITCHING_H	0
TEAM_PITCHING_HR	0
TEAM_PITCHING_BB	0
TEAM_PITCHING_SO	18
TEAM_FIELDING_E	0

For Imputation we have used MICE package:

testing <- mice(testing, m=5, maxit = 5, method = 'pmm')
testing <- complete(testing)

Fixing Outliers:

testing$TEAM_PITCHING_H[testing$TEAM_PITCHING_H > 3*sd(testing$TEAM_PITCHING_H)] <- median(testing$TEAM_PITCHING_H)
testing$TEAM_PITCHING_BB[testing$TEAM_PITCHING_BB > 3*sd(testing$TEAM_PITCHING_BB)] <- median(testing$TEAM_PITCHING_BB)
testing$TEAM_PITCHING_SO[testing$TEAM_PITCHING_SO > 3*sd(testing$TEAM_PITCHING_SO)] <- median(testing$TEAM_PITCHING_SO)
testing$TEAM_FIELDING_E[testing$TEAM_FIELDING_E > 3*sd(testing$TEAM_FIELDING_E)] <- median(testing$TEAM_FIELDING_E)

Final Prediction:

The following Code chunk predicts the TARGET_WINS using testing data set:

final <- predict(m3, newdata = testing, interval="prediction")
knitr::kable(head(final,3))

fit	lwr	upr
66.19992	39.62767	92.77216
67.26743	40.69895	93.83591
73.42938	46.87113	99.98763

The above data set gives us the predicted values. Fit columns contains the fitted values for testing data set while upper and lower column contain the limit values for 95% confidence interval.

CONCLUSION:

In this particular setting a moderately dirty training data set was given to train a model and then use that model to predict the target variable using testing data set. Initially the data was loaded into the environment and explored through descriptive summary statistics and multiple distribution and relational plots. Data exploration was followed by data preparation stage where the training data was wrangled by dealing with the missing values and outliers. Once the training data set was ready, it was then put forward to create models. Three different models were created based on statistical significance of the features used in the model. After creating models, one model was selected based on ANOVA and statistical significance of the features. The selected model was then applied to the testing data set to predict the final output.

APPENDIX:

https://github.com/Umerfarooq122/Predicting-Wins-for-Baseball-Game-Using-Multiple-Linear-Regression/blob/main/Predicting%20Wins%20for%20Baseball%20Game%20Using%20Multiple%20Linear%20Regression.Rmd