Predicting Wins for Baseball Game Using Multiple Linear Regression
INTRODUCTION:
In this problem two data sets, training and testing, are provided for baseball games. The idea of this study is to explore the training data set by checking its dimensions, carrying out descriptive summary statistics, plotting different features against the target variable, and checking the correlations between them. After that we prepare the data set for training a model: we take care of any missing values and address the outliers. Once the data is prepared, we create different models using the features we find most statistically significant. After training the models, we use the testing data set to predict the target variable. So without further ado, let's kick off this session with data exploration:
DATA EXPLORATION:
In this phase we explore the data and get to know it better before going straight to preparation or modeling. Data exploration helps us understand what we are dealing with. The data set provided is about baseball, and in sports there are many misconceptions about the real indicators of a team winning. As data scientists, our job is to find those indicators and determine how they affect the overall performance that leads to a win or a loss for a team, using past data. To achieve that, you have to have solid knowledge of the data you are dealing with, which is why data exploration has a great impact on the overall modeling. Let's start our data exploration by loading the data set into our markdown.
Loading Data:
The data set provided has been uploaded to a GitHub repository, from where it is loaded into the markdown using the code chunk below. The reason it was uploaded to GitHub is to keep the work reproducible.
training <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/moneyball-training-data.csv")

Now our data set has been loaded into the environment. Let's display the first few rows of our data set.
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39 | 1445 | 194 | 39 | 13 | 143 | 842 | NA | NA | NA | 9364 | 84 | 927 | 5456 | 1011 | NA |
| 2 | 70 | 1339 | 219 | 22 | 190 | 685 | 1075 | 37 | 28 | NA | 1347 | 191 | 689 | 1082 | 193 | 155 |
| 3 | 86 | 1377 | 232 | 35 | 137 | 602 | 917 | 46 | 27 | NA | 1377 | 137 | 602 | 917 | 175 | 153 |
| 4 | 70 | 1387 | 209 | 38 | 96 | 451 | 922 | 43 | 30 | NA | 1396 | 97 | 454 | 928 | 164 | 156 |
| 5 | 82 | 1297 | 186 | 27 | 102 | 472 | 920 | 49 | 39 | NA | 1297 | 102 | 472 | 920 | 138 | 168 |
| 6 | 75 | 1279 | 200 | 36 | 92 | 443 | 973 | 107 | 59 | NA | 1279 | 92 | 443 | 973 | 123 | 149 |
Everything looks fine, but at first glance we can see that there is an INDEX column which will not be used in the analysis and will be removed from the data set shortly.
Dimension of Data Set:
Let’s check out the dimension of our raw data set:
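A minimal sketch of the chunk behind this output, assuming base R's dim() was used:

```r
# Number of rows (observations) and columns in the raw training data
dim(training)
```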
## [1] 2276 17
So we can see that we have 2276 observations and 17 columns, which include INDEX, our target variable TARGET_WINS, and all the features. These dimensions might not remain the same depending on the steps we carry out during data preparation.
Descriptive Summary Statistics:
Before we plot the relationship between the target and the features, let's check out the descriptive summary for each column. The descriptive summary gives us the mean, median, min, max and quartiles for each column. Descriptive statistics help us summarize large amounts of data in a sensible way. For instance, finding the average of a column by going through all 2276 observations by hand would be very tedious, so we use descriptive summary statistics instead. The code chunk below uses the summary() function from base R to give the summary statistics of each column.
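A sketch of that chunk; knitr::kable() is assumed here only to explain the table rendering shown below:

```r
# Descriptive summary (min, quartiles, median, mean, max, NA counts) per column
knitr::kable(summary(training))
```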
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
| 1st Qu.: 630.8 | 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | |
| Median :1270.5 | Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | |
| Mean :1268.5 | Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | |
| 3rd Qu.:1915.5 | 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | |
| Max. :2535.0 | Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 |
As we can see, the table above even gives us a summary for the INDEX column. Rather than waiting for the data preparation stage, let's drop it now. The code chunk below removes the redundant INDEX column from our data set.
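A minimal sketch of that removal step, assuming the same column-dropping pattern used later for the high-missingness features:

```r
# Drop the identifier column; it carries no information about wins
training <- training[, !names(training) %in% c('INDEX')]
```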
Plotting the Target Column:
Before moving ahead with the relational plots and the correlation between the target and the features, let's check out the distribution of the target column TARGET_WINS.
ggplot()+
  geom_histogram(data = training, mapping = aes(x=TARGET_WINS), bins = 50, color = "black", fill = "grey")+
  geom_vline(xintercept=mean(training$TARGET_WINS), color='red')+
  labs(x="Target Wins", y="Count", title ="Distribution of Target Wins")+
  theme_bw()

As we can see, the distribution of TARGET_WINS is fairly normal with a mean around 80.8 (represented by the red vertical line), a minimum of 0 and a maximum of 146. The minimum of 0 is worrisome, since it is very rare for a team to go winless, so there might be something that needs to be corrected in the data. We can also revisit the summary statistics for the TARGET_WINS column to confirm the figures above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 71.00 82.00 80.79 92.00 146.00
Relational Plots Between Target and Feature Columns:
Now we can plot the relational plots and check the correlation between the target and the features. One thing to keep in mind is that the correlations might change once we carry out data preparation, since we still have the uncleaned, dirty data set. There are 15 features in the data set, and it would be impractical to plot a separate relational graph between the target column and each feature column, so we will use pairs.panels from the psych library to plot multiple features in the same graphic. In the graph below our focus will be on the first row and first column. The first row tells us the correlation between TARGET_WINS and each feature column, and similarly the first column shows the relational plot between the same columns. On the diagonal we have the distribution of each column.
Scatter_Matrix <- pairs.panels(training[, c(1, 2:6)], main = "Scatter Plot Matrix for Training Dataset")

In the matrix plot above, the value in the first row after the TARGET_WINS distribution plot shows the correlation between TARGET_WINS and TEAM_BATTING_H, which is .39. Similarly, the scatter plot below the TARGET_WINS distribution is the plot between TARGET_WINS and TEAM_BATTING_H, with a fitted regression line in red. In the matrix above we have only considered the first 5 features, to avoid overcrowding. For the remaining features we have the matrix plot below:
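A hypothetical sketch of that second matrix, assuming the remaining feature columns sit in positions 7:16 once INDEX has been dropped:

```r
library(psych)  # for pairs.panels(); loaded once in the original markdown

# TARGET_WINS (column 1) against the remaining features
pairs.panels(training[, c(1, 7:16)],
             main = "Scatter Plot Matrix for Training Dataset")
```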
From the matrix plots above we can see that the diagonals show the distribution of each feature, and some of these distributions contain outliers that need to be addressed in our data preparation stage.
Missing values:
Let's check out the missing values in the columns. We will count the number of missing values in each column using the code chunk below:
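A sketch of that chunk; colSums(is.na(...)) gives the per-column counts, and knitr::kable() is assumed for the table below (its default "x" column name matches the output):

```r
# Count missing values in every column of the training data
knitr::kable(colSums(is.na(training)))
```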
| x | |
|---|---|
| TARGET_WINS | 0 |
| TEAM_BATTING_H | 0 |
| TEAM_BATTING_2B | 0 |
| TEAM_BATTING_3B | 0 |
| TEAM_BATTING_HR | 0 |
| TEAM_BATTING_BB | 0 |
| TEAM_BATTING_SO | 102 |
| TEAM_BASERUN_SB | 131 |
| TEAM_BASERUN_CS | 772 |
| TEAM_BATTING_HBP | 2085 |
| TEAM_PITCHING_H | 0 |
| TEAM_PITCHING_HR | 0 |
| TEAM_PITCHING_BB | 0 |
| TEAM_PITCHING_SO | 102 |
| TEAM_FIELDING_E | 0 |
| TEAM_FIELDING_DP | 286 |
We can see that some of the columns have missing values, which also need to be addressed in our data preparation stage.
DATA PREPARATION:
Every data set has some imperfections and ours is no exception. As we saw in the data exploration section, our data contains missing values and outliers, so let's deal with those. We will address both missing values and outliers accordingly.
Fixing Missing Values and Outliers:
Removing Missing values:
As we saw earlier, some of the feature columns like TEAM_BATTING_HBP, TEAM_BASERUN_CS and TEAM_FIELDING_DP have a lot of missing values, and it is in our best interest to remove them from the data set, since imputing them could create issues of fidelity and bias.
training <- training[, !names(training) %in% c('TEAM_BATTING_HBP','TEAM_BASERUN_CS','TEAM_FIELDING_DP')]

So the features with a high number of missing values have been removed from the data set. We can check the dimensions now.
## [1] 2276 13
Impute Missing values:
Let’s check out the remaining features for missing values
| x | |
|---|---|
| TARGET_WINS | 0 |
| TEAM_BATTING_H | 0 |
| TEAM_BATTING_2B | 0 |
| TEAM_BATTING_3B | 0 |
| TEAM_BATTING_HR | 0 |
| TEAM_BATTING_BB | 0 |
| TEAM_BATTING_SO | 102 |
| TEAM_BASERUN_SB | 131 |
| TEAM_PITCHING_H | 0 |
| TEAM_PITCHING_HR | 0 |
| TEAM_PITCHING_BB | 0 |
| TEAM_PITCHING_SO | 102 |
| TEAM_FIELDING_E | 0 |
As we can see, features like TEAM_BATTING_SO, TEAM_BASERUN_SB and TEAM_PITCHING_SO still have some missing values; the smallest count is 102, which comes to about 4.5% of the total data set. According to some data scientists, a data set with 3% or fewer missing values can be dealt with by simply removing those observations from the analysis. Here we will instead replace the missing values with the median of their respective columns, since the distributions are skewed and the median is insensitive to skewness.
training$TEAM_BATTING_SO[is.na(training$TEAM_BATTING_SO)] <- median(training$TEAM_BATTING_SO, na.rm = TRUE)
training$TEAM_BASERUN_SB[is.na(training$TEAM_BASERUN_SB)] <- median(training$TEAM_BASERUN_SB, na.rm = TRUE)
training$TEAM_PITCHING_SO[is.na(training$TEAM_PITCHING_SO)] <- median(training$TEAM_PITCHING_SO, na.rm = TRUE)

## [1] 2276 13
Fixing Outliers:
Looking at the summary and the plots below we see that TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_FIELDING_E are all skewed by their outliers. We have already dealt with the fields that had a few missing values, so here we handle the outliers: any extreme value (the code below flags values above three standard deviations of the column) is replaced with the column median.
# Replace extreme values (above 3 * sd of the column) with the column median
training$TEAM_PITCHING_H[training$TEAM_PITCHING_H > 3*sd(training$TEAM_PITCHING_H)] <- median(training$TEAM_PITCHING_H)
training$TEAM_PITCHING_BB[training$TEAM_PITCHING_BB > 3*sd(training$TEAM_PITCHING_BB)] <- median(training$TEAM_PITCHING_BB)
training$TEAM_PITCHING_SO[training$TEAM_PITCHING_SO > 3*sd(training$TEAM_PITCHING_SO)] <- median(training$TEAM_PITCHING_SO)
training$TEAM_FIELDING_E[training$TEAM_FIELDING_E > 3*sd(training$TEAM_FIELDING_E)] <- median(training$TEAM_FIELDING_E)

Now that our data set is ready for modeling, here is the final distribution of the target and feature columns.
BUILDING MODELS:
Our data set is ready to be used for modeling. In this section we will create different models and compare them based on the statistical significance of the features they use. Out of the models we create, we will select one and use it to predict on the testing data set in the next section.
Model 1:
Even though we already know that some features like TEAM_PITCHING_BB and TEAM_PITCHING_HR have a weak correlation with our target TARGET_WINS, in our first model we will fit all the features and then check their statistical significance. The code chunk below creates our first model.
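A sketch of that chunk; the formula matches the Call line in the summary output below, while the object name m1 is assumed to follow the m2/m3 naming used later:

```r
# Model 1: regress TARGET_WINS on every remaining feature
m1 <- lm(TARGET_WINS ~ ., data = training)
```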
We can check the intercept and coefficients of the model above using the summary() function. The summary also gives us the p-value of each coefficient, which indicates the statistical significance of that coefficient. So let's apply summary() to our model:
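Assuming the fitted object is named m1 as in the sketch above:

```r
summary(m1)
```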
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.862 -8.582 0.353 8.578 56.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.414986 5.607021 -0.252 0.80079
## TEAM_BATTING_H 0.038461 0.003769 10.204 < 2e-16 ***
## TEAM_BATTING_2B -0.019783 0.009431 -2.098 0.03604 *
## TEAM_BATTING_3B 0.102623 0.017633 5.820 6.72e-09 ***
## TEAM_BATTING_HR 0.075863 0.027554 2.753 0.00595 **
## TEAM_BATTING_BB 0.035305 0.004431 7.968 2.52e-15 ***
## TEAM_BATTING_SO 0.011331 0.004418 2.565 0.01039 *
## TEAM_BASERUN_SB 0.030714 0.004590 6.692 2.76e-11 ***
## TEAM_PITCHING_H 0.006600 0.001173 5.625 2.08e-08 ***
## TEAM_PITCHING_HR -0.036080 0.024315 -1.484 0.13798
## TEAM_PITCHING_BB -0.018563 0.007410 -2.505 0.01230 *
## TEAM_PITCHING_SO -0.007138 0.003582 -1.993 0.04643 *
## TEAM_FIELDING_E -0.022174 0.003612 -6.138 9.82e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 2263 degrees of freedom
## Multiple R-squared: 0.27, Adjusted R-squared: 0.2661
## F-statistic: 69.74 on 12 and 2263 DF, p-value: < 2.2e-16
As we can see, TEAM_PITCHING_HR has a very high p-value, which means that it is not statistically significant, and it is in our best interest to remove it from our model. Below are the plots showing residuals vs fitted values, the QQ plot, the residual distribution, standardized residuals and residuals vs leverage.
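A minimal sketch of one way to produce such diagnostics with base R's plot() method for lm objects (the original may have used a dedicated diagnostics package, since it also shows the residual distribution):

```r
# Standard lm diagnostics for m1: Residuals vs Fitted, Normal Q-Q,
# Scale-Location (standardized residuals) and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(m1)
par(mfrow = c(1, 1))
```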
Model 2:
As we saw in model 1, TEAM_PITCHING_HR has a high p-value. Let's create another model without TEAM_PITCHING_HR (we also drop TEAM_PITCHING_SO, whose significance was only marginal) and see how it performs.
m2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_FIELDING_E, training)

Now that our model is ready, let's check its summary.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_FIELDING_E,
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.792 -8.559 0.415 8.600 58.424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.296543 5.513356 0.417 0.677053
## TEAM_BATTING_H 0.037750 0.003753 10.059 < 2e-16 ***
## TEAM_BATTING_2B -0.017073 0.009419 -1.813 0.070013 .
## TEAM_BATTING_3B 0.098129 0.017593 5.578 2.73e-08 ***
## TEAM_BATTING_HR 0.039122 0.009739 4.017 6.08e-05 ***
## TEAM_BATTING_BB 0.040121 0.004191 9.573 < 2e-16 ***
## TEAM_BATTING_SO 0.003934 0.002289 1.719 0.085833 .
## TEAM_BASERUN_SB 0.030545 0.004591 6.653 3.58e-11 ***
## TEAM_PITCHING_H 0.004893 0.001021 4.790 1.78e-06 ***
## TEAM_PITCHING_BB -0.025057 0.007169 -3.495 0.000483 ***
## TEAM_FIELDING_E -0.021481 0.003610 -5.950 3.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.52 on 2265 degrees of freedom
## Multiple R-squared: 0.2661, Adjusted R-squared: 0.2628
## F-statistic: 82.11 on 10 and 2265 DF, p-value: < 2.2e-16
As we can see, after removing TEAM_PITCHING_HR from the model our R-squared has dropped a bit, and at the same time features like TEAM_BATTING_SO and TEAM_BATTING_2B have lost their significance, with p-values above .05. So let's remove these from our analysis and create a new model.
Model 3:
In model 3 we have also removed the weakly significant TEAM_BATTING_2B and TEAM_BATTING_SO features from consideration, leaving only statistically significant features. The following code chunk contains our model.
m3 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_FIELDING_E, training)

Let's check the summary of model 3.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H +
## TEAM_PITCHING_BB + TEAM_FIELDING_E, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.259 -8.493 0.383 8.484 56.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.000771 3.926302 2.547 0.010927 *
## TEAM_BATTING_H 0.031665 0.002537 12.482 < 2e-16 ***
## TEAM_BATTING_3B 0.098564 0.017475 5.640 1.91e-08 ***
## TEAM_BATTING_HR 0.047017 0.007669 6.131 1.03e-09 ***
## TEAM_BATTING_BB 0.038505 0.004128 9.328 < 2e-16 ***
## TEAM_BASERUN_SB 0.033250 0.004347 7.650 2.96e-14 ***
## TEAM_PITCHING_H 0.004579 0.001012 4.526 6.33e-06 ***
## TEAM_PITCHING_BB -0.024725 0.007151 -3.458 0.000555 ***
## TEAM_FIELDING_E -0.021393 0.003558 -6.013 2.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.53 on 2267 degrees of freedom
## Multiple R-squared: 0.2645, Adjusted R-squared: 0.2619
## F-statistic: 101.9 on 8 and 2267 DF, p-value: < 2.2e-16
As we can see that all the features used in the model 3 are
statistically significant. Features like TEAM_BATTING_H,
TEAM_BATTING_3B, TEAM_BATTING_HR,
TEAM_BATTING_BB, TEAM_BASERUN_SB and
TEAM_PITCHING_H which are Base hits by batters, Triples by
batters, homeruns by batters, walks by batters, stolen bases, and hits
allowed, respectively, are all contributing positively towards the wins.
Similarly, TEAM_PITCHING_BB and
TEAM_FIELDING_E which are walks allowed and errors,
respectively, are contributing negatively towards the wins.
Even though the R-squared dropped a touch, all the features are statistically significant. Below are the plots showing residuals vs fitted values, the QQ plot, the residual distribution, standardized residuals and residuals vs leverage.
SELECTING MODELS AND PREDICTING:
In this section we will select the model that we are working with and we will use the predicting data to predict the target variable.
Selecting Models:
We are going with model 3 for prediction, since all its features are statistically significant and none of them is only weakly related to the target. Before committing, we also compare m3 against m1, which uses all the features and has a slightly higher R-squared, using an analysis of variance (ANOVA). The code chunk below runs the ANOVA between the biggest and the smallest model in terms of features used.
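A sketch of that comparison, assuming the model objects are named m1 and m3 as above:

```r
# F-test comparing the full model (m1) against the reduced model (m3)
anova(m1, m3)
```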
## Analysis of Variance Table
##
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB +
## TEAM_FIELDING_E
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2263 412102
## 2 2267 415188 -4 -3085.9 4.2365 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is .00203, which is below .05, so the F-test does indicate that the extra predictors in m1 explain a statistically significant amount of additional variance. However, the practical improvement is modest (RSS of 412102 vs 415188, and an adjusted R-squared of about 0.266 vs 0.262), while m1 keeps several individually non-significant predictors. So we will go with the smaller, more interpretable model. Below are some of the main figures for model 3 (m3).
sum_m3 <- summary(m3)
RSS <- c(crossprod(m3$residuals))
MSE <- RSS/length(m3$residuals)
print(paste0("Mean Squared Error: ", MSE))

## [1] "Mean Squared Error: 182.419913764793"
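The remaining figures below were presumably printed with similar calls; a sketch, using the sum_m3 summary object created above:

```r
print(paste0("Root MSE: ", sqrt(MSE)))                       # square root of the MSE
print(paste0("Adjusted R-squared: ", sum_m3$adj.r.squared))  # from summary(m3)
print(paste0("F-statistic: ", sum_m3$fstatistic[1]))         # F value only
```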
## [1] "Root MSE: 13.5062916363002"
## [1] "Adjusted R-squared: 0.261903459682437"
## [1] "F-statistic: 101.90657831176"
Here is the distribution of the residuals, which shows no pattern across the horizontal line.
Predicting the Data:
Now that we have selected our model, we can load the testing data set and predict TARGET_WINS using the m3 model. The following code chunk loads the data into our environment.

testing <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/moneyball-evaluation-data.csv")

Here are the first few rows of our testing data set:
| INDEX | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 1209 | 170 | 33 | 83 | 447 | 1080 | 62 | 50 | NA | 1209 | 83 | 447 | 1080 | 140 | 156 |
| 10 | 1221 | 151 | 29 | 88 | 516 | 929 | 54 | 39 | NA | 1221 | 88 | 516 | 929 | 135 | 164 |
| 14 | 1395 | 183 | 29 | 93 | 509 | 816 | 59 | 47 | NA | 1395 | 93 | 509 | 816 | 156 | 153 |
| 47 | 1539 | 309 | 29 | 159 | 486 | 914 | 148 | 57 | 42 | 1539 | 159 | 486 | 914 | 124 | 154 |
| 60 | 1445 | 203 | 68 | 5 | 95 | 416 | NA | NA | NA | 3902 | 14 | 257 | 1123 | 616 | 130 |
| 63 | 1431 | 236 | 53 | 10 | 215 | 377 | NA | NA | NA | 2793 | 20 | 420 | 736 | 572 | 105 |
Removing Un-necessary columns:
Let’s remove the columns that we do not need for this analysis:
testing <- testing[, !names(testing) %in% c('INDEX','TEAM_BATTING_HBP','TEAM_BASERUN_CS','TEAM_FIELDING_DP')]

| TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1209 | 170 | 33 | 83 | 447 | 1080 | 62 | 1209 | 83 | 447 | 1080 | 140 |
| 1221 | 151 | 29 | 88 | 516 | 929 | 54 | 1221 | 88 | 516 | 929 | 135 |
| 1395 | 183 | 29 | 93 | 509 | 816 | 59 | 1395 | 93 | 509 | 816 | 156 |
| 1539 | 309 | 29 | 159 | 486 | 914 | 148 | 1539 | 159 | 486 | 914 | 124 |
| 1445 | 203 | 68 | 5 | 95 | 416 | NA | 3902 | 14 | 257 | 1123 | 616 |
| 1431 | 236 | 53 | 10 | 215 | 377 | NA | 2793 | 20 | 420 | 736 | 572 |
## [1] 259 12
Fixing Missing Values:
Let’s check out and fix the missing values in our testing data set:
| x | |
|---|---|
| TEAM_BATTING_H | 0 |
| TEAM_BATTING_2B | 0 |
| TEAM_BATTING_3B | 0 |
| TEAM_BATTING_HR | 0 |
| TEAM_BATTING_BB | 0 |
| TEAM_BATTING_SO | 18 |
| TEAM_BASERUN_SB | 13 |
| TEAM_PITCHING_H | 0 |
| TEAM_PITCHING_HR | 0 |
| TEAM_PITCHING_BB | 0 |
| TEAM_PITCHING_SO | 18 |
| TEAM_FIELDING_E | 0 |
For imputation we have used the MICE package:
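A minimal sketch of that imputation step, assuming mice's default predictive mean matching and an arbitrary seed:

```r
library(mice)

# Impute the remaining missing values (TEAM_BATTING_SO, TEAM_BASERUN_SB,
# TEAM_PITCHING_SO) via predictive mean matching
imp <- mice(testing, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
testing <- complete(imp)
```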
Fixing Outliers:
# Mirror the training-set treatment: values above 3 * sd of the column are replaced with the column median
testing$TEAM_PITCHING_H[testing$TEAM_PITCHING_H > 3*sd(testing$TEAM_PITCHING_H)] <- median(testing$TEAM_PITCHING_H)
testing$TEAM_PITCHING_BB[testing$TEAM_PITCHING_BB > 3*sd(testing$TEAM_PITCHING_BB)] <- median(testing$TEAM_PITCHING_BB)
testing$TEAM_PITCHING_SO[testing$TEAM_PITCHING_SO > 3*sd(testing$TEAM_PITCHING_SO)] <- median(testing$TEAM_PITCHING_SO)
testing$TEAM_FIELDING_E[testing$TEAM_FIELDING_E > 3*sd(testing$TEAM_FIELDING_E)] <- median(testing$TEAM_FIELDING_E)

Final Prediction:
The following code chunk predicts TARGET_WINS for the testing data set:
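A sketch of that prediction step; interval = "prediction" and the use of head() for the three displayed rows are assumptions (the reported bounds are roughly fit ± 2 × the residual standard error, which matches a prediction interval):

```r
# Predict wins for the evaluation data with m3, with 95% prediction bounds
predictions <- as.data.frame(predict(m3, newdata = testing,
                                     interval = "prediction", level = 0.95))
knitr::kable(head(predictions, 3))
```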
| fit | lwr | upr |
|---|---|---|
| 66.19992 | 39.62767 | 92.77216 |
| 67.26743 | 40.69895 | 93.83591 |
| 73.42938 | 46.87113 | 99.98763 |
The table above gives us the predicted values. The fit column contains the predicted values for the testing data set, while the lwr and upr columns contain the lower and upper bounds of the 95% prediction interval.
CONCLUSION:
In this project a moderately dirty training data set was provided to train a model, which was then used to predict the target variable on a testing data set. The data was first loaded into the environment and explored through descriptive summary statistics and multiple distribution and relational plots. Data exploration was followed by a data preparation stage in which the training data was wrangled by dealing with missing values and outliers. Once the training data set was ready, it was used to build models. Three different models were created, guided by the statistical significance of the features they used. One model was then selected based on ANOVA and the statistical significance of its features, and the selected model was applied to the testing data set to produce the final predictions.