Nick Oliver - Homework 1
1. DATA EXPLORATION
The data for this model comes in two separate datasets. The training set consists of 2,276 rows across 15 variables (excluding the target and index), and the evaluation set we will test our model against contains 259 rows. All variables in the dataset are numeric.
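To make the workflow concrete, a minimal loading sketch is shown below; the file names are assumptions, as the report does not list them.

```r
# Load the training and evaluation data (file names assumed)
trainDf <- read.csv("moneyball-training-data.csv")
evalDf  <- read.csv("moneyball-evaluation-data.csv")

dim(trainDf)  # 2276 rows
dim(evalDf)   # 259 rows
```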
The table below shows some useful descriptive statistics for the training data. Many of the variables have a minimum of 0, but not all.
One interesting piece of information is the range of the TARGET_WINS variable. The minimum is 0, meaning at least one team did not win a single game. The maximum is 146, which indicates no team in the training dataset had a perfect season, since we know a season consists of 162 games.
Also of note is the number of missing values in certain variables, most notably TEAM_BATTING_HBP, the batters-hit-by-pitch variable. With 91% of its values missing, we will remove this variable from our dataset because there simply is not enough information to impute sensible values.
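A quick way to confirm these missing-data percentages, assuming the training data lives in trainDf:

```r
# Proportion of missing values per variable, highest first
missPct <- colMeans(is.na(trainDf))
sort(round(missPct, 2), decreasing = TRUE)
# TEAM_BATTING_HBP should come out around 0.92 (2085 / 2276)
```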
The means and medians of most variables are relatively close in value, which suggests those variables are largely free of extreme outliers, since outliers tend to pull the mean away from the median. TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_FIELDING_E are notable exceptions, with means well above their medians and very large skew values.
| variable | mean | sd | median | min | max | skew | kurtosis | se | missing |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 80.79 | 15.75 | 82.0 | 0 | 146 | -0.40 | 1.03 | 0.33 | 0 |
| TEAM_BATTING_H | 1469.27 | 144.59 | 1454.0 | 891 | 2554 | 1.57 | 7.28 | 3.03 | 0 |
| TEAM_BATTING_2B | 241.25 | 46.80 | 238.0 | 69 | 458 | 0.22 | 0.01 | 0.98 | 0 |
| TEAM_BATTING_3B | 55.25 | 27.94 | 47.0 | 0 | 223 | 1.11 | 1.50 | 0.59 | 0 |
| TEAM_BATTING_HR | 99.61 | 60.55 | 102.0 | 0 | 264 | 0.19 | -0.96 | 1.27 | 0 |
| TEAM_BATTING_BB | 501.56 | 122.67 | 512.0 | 0 | 878 | -1.03 | 2.18 | 2.57 | 0 |
| TEAM_BATTING_SO | 735.61 | 248.53 | 750.0 | 0 | 1399 | -0.30 | -0.32 | 5.33 | 102 |
| TEAM_BASERUN_SB | 124.76 | 87.79 | 101.0 | 0 | 697 | 1.97 | 5.49 | 1.90 | 131 |
| TEAM_BASERUN_CS | 52.80 | 22.96 | 49.0 | 0 | 201 | 1.98 | 7.62 | 0.59 | 772 |
| TEAM_BATTING_HBP | 59.36 | 12.97 | 58.0 | 29 | 95 | 0.32 | -0.11 | 0.94 | 2085 |
| TEAM_PITCHING_H | 1779.21 | 1406.84 | 1518.0 | 1137 | 30132 | 10.33 | 141.84 | 29.49 | 0 |
| TEAM_PITCHING_HR | 105.70 | 61.30 | 107.0 | 0 | 343 | 0.29 | -0.60 | 1.28 | 0 |
| TEAM_PITCHING_BB | 553.01 | 166.36 | 536.5 | 0 | 3645 | 6.74 | 96.97 | 3.49 | 0 |
| TEAM_PITCHING_SO | 817.73 | 553.09 | 813.5 | 0 | 19278 | 22.17 | 671.19 | 11.86 | 102 |
| TEAM_FIELDING_E | 246.48 | 227.77 | 159.0 | 65 | 1898 | 2.99 | 10.97 | 4.77 | 0 |
| TEAM_FIELDING_DP | 146.39 | 26.23 | 149.0 | 52 | 228 | -0.39 | 0.18 | 0.59 | 286 |
Doing some preliminary analysis of the data, we can compare team pitching hits allowed (TEAM_PITCHING_H) against target wins (TARGET_WINS) to validate our assumption that allowing more hits is negatively correlated with the number of wins. While there does appear to be a negative correlation, it is not obviously linear. There is heavy clustering below 5,000 hits allowed, but within that cluster the number of wins varies significantly. There are also some obvious outliers here, with one team allowing nearly 30,000 hits.
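A sketch of this comparison using ggplot2; the plotting choices are mine, not specified in the report.

```r
library(ggplot2)

# Hits allowed vs. wins; note the heavy cluster below 5,000 hits allowed
ggplot(trainDf, aes(x = TEAM_PITCHING_H, y = TARGET_WINS)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Hits allowed (TEAM_PITCHING_H)", y = "Wins (TARGET_WINS)")
```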
Looking at another relationship, between strikeouts and wins, we can see that the strikeout data has fewer outliers but a weaker negative correlation with target wins than one would expect.
Plotting a histogram of the target wins, we can see the distribution closely resembles a normal distribution.
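One way to draw this histogram (the bin count is an arbitrary choice):

```r
# Distribution of TARGET_WINS; roughly bell-shaped, centered near the low 80s
ggplot(trainDf, aes(x = TARGET_WINS)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "Wins (TARGET_WINS)", y = "Count")
```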
Plotting the correlations between TARGET_WINS and the variables
(excluding INDEX and TEAM_BATTING_HBP) we can
see that very few variables are strongly correlated with the target
variable.
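These correlations can be computed directly; a sketch, using pairwise-complete observations to handle the variables that still contain missing values:

```r
# Correlation of every variable with TARGET_WINS,
# after excluding INDEX and TEAM_BATTING_HBP
corDf  <- subset(trainDf, select = -c(INDEX, TEAM_BATTING_HBP))
target <- cor(corDf, use = "pairwise.complete.obs")[, "TARGET_WINS"]
sort(round(target, 2), decreasing = TRUE)
```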
Unsurprisingly, we can see that TEAM_BATTING_HR and TEAM_PITCHING_HR are strongly positively correlated, as one would expect since they are essentially measuring the same thing.
There are some other strong correlations that are less obvious, such as errors (TEAM_FIELDING_E) being strongly correlated with walks by batters (TEAM_BATTING_BB), strikeouts (TEAM_BATTING_SO), and team pitching hits allowed (TEAM_PITCHING_H).
Digging a little deeper, the Pearson correlation coefficient between errors and walks by batters is -0.6559708, indicating a strong negative correlation between the two variables. Comparing errors with team pitching hits allowed, we see a correlation of 0.667759, indicating a strong positive correlation.
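The underlying calls (neither variable in these pairs has missing values, so no special NA handling is needed):

```r
cor(trainDf$TEAM_FIELDING_E, trainDf$TEAM_BATTING_BB)  # -0.6559708
cor(trainDf$TEAM_FIELDING_E, trainDf$TEAM_PITCHING_H)  #  0.667759
```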
Ultimately I know very little about baseball in general so I do not have an intuition regarding why these variables would be strongly correlated with each other.
Lastly, let's take a closer look at the missing data. We've already determined that the batter-hit-by-pitch variable (TEAM_BATTING_HBP) is missing 91% of its data, but what of the other variables?
Using the plot below, we can visualize the missingness of the remaining variables. There are 5 variables that contain varying degrees of missing data. We will use this information to fill in the missing values in our data preparation step.
TEAM_BASERUN_CS is missing the second-largest number of values, but at only 772 missing values out of 2,276 rows this is much less of a concern than the HBP variable we identified earlier. The remaining variables with missing data are each missing less than 25% of their values, so they should be safe to impute.
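One way to produce a missingness plot like the one described, using the naniar package (a choice of mine; the report does not say which package generated its plot):

```r
library(naniar)

# Bar chart of missing-value counts for each remaining variable
gg_miss_var(subset(trainDf, select = -INDEX))
```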
2. DATA PREPARATION
First, we will drop the INDEX and TEAM_BATTING_HBP variables from our dataset. We've already determined that TEAM_BATTING_HBP is missing 91% of its data, and INDEX is simply an index of the row number.
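A minimal sketch of this step:

```r
# Drop the row index and the mostly-missing hit-by-pitch variable
trainDf <- subset(trainDf, select = -c(INDEX, TEAM_BATTING_HBP))
```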
Next we will use the mice library to impute the missing values in the trainDf data.frame. MICE is an acronym for multiple imputation by chained equations. To use MICE, one must assume that the missing values are missing at random, meaning the missingness can be accounted for by variables where there is complete information. Then, as the name implies, MICE runs multiple iterations over the data and generates values to fill in the missing entries.
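A sketch of the imputation; the number of imputations and the seed are assumptions, since the report does not state the settings used.

```r
library(mice)

set.seed(123)                                # for reproducibility (assumed)
imp     <- mice(trainDf, m = 5, printFlag = FALSE)
trainDf <- complete(imp)                     # take a completed data set

sum(is.na(trainDf))                          # should now be 0
```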
We now have a data set free of missing values, with the meaningless INDEX variable removed and TEAM_BATTING_HBP removed as well due to its 91% missing rate. The table below confirms there are no longer any missing values.
One interesting comparison we can make, for a variable that had a large number of missing values, is how the summary statistics changed with the imputed data. TEAM_BASERUN_CS was missing 772 values, or about 34% of its data. We can observe that the mean changed from 52.80 to 75.71 and the median from 49.0 to 57.0; a shift that large suggests the imputed values differ systematically from the observed ones. The min of 0 and max of 201 did not change, which is good.
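A quick spot-check of the imputed variable:

```r
# Summary statistics for TEAM_BASERUN_CS after imputation;
# the mean moved from 52.80 to about 75.71, the median from 49.0 to 57.0
summary(trainDf$TEAM_BASERUN_CS)
```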
| variable | mean | sd | median | min | max | skew | kurtosis | se | missing |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 80.79 | 15.75 | 82.0 | 0 | 146 | -0.40 | 1.03 | 0.33 | 0 |
| TEAM_BATTING_2B | 241.25 | 46.80 | 238.0 | 69 | 458 | 0.22 | 0.01 | 0.98 | 0 |
| TEAM_BATTING_3B | 55.25 | 27.94 | 47.0 | 0 | 223 | 1.11 | 1.50 | 0.59 | 0 |
| TEAM_BATTING_HR | 99.61 | 60.55 | 102.0 | 0 | 264 | 0.19 | -0.96 | 1.27 | 0 |
| TEAM_BATTING_BB | 501.56 | 122.67 | 512.0 | 0 | 878 | -1.03 | 2.18 | 2.57 | 0 |
| TEAM_BATTING_SO | 727.96 | 246.89 | 734.5 | 0 | 1399 | -0.23 | -0.36 | 5.18 | 0 |
| TEAM_BASERUN_SB | 135.62 | 98.91 | 106.0 | 0 | 697 | 1.73 | 3.53 | 2.07 | 0 |
| TEAM_BASERUN_CS | 75.71 | 51.29 | 57.0 | 0 | 201 | 1.45 | 0.94 | 1.08 | 0 |
| TEAM_BATTING_HBP | 63.93 | 11.71 | 69.0 | 29 | 95 | -0.48 | -0.34 | 0.25 | 0 |
| TEAM_PITCHING_H | 1779.21 | 1406.84 | 1518.0 | 1137 | 30132 | 10.33 | 141.84 | 29.49 | 0 |
| TEAM_PITCHING_HR | 105.70 | 61.30 | 107.0 | 0 | 343 | 0.29 | -0.60 | 1.28 | 0 |
| TEAM_PITCHING_BB | 553.01 | 166.36 | 536.5 | 0 | 3645 | 6.74 | 96.97 | 3.49 | 0 |
| TEAM_PITCHING_SO | 811.44 | 542.66 | 803.0 | 0 | 19278 | 22.46 | 692.95 | 11.37 | 0 |
| TEAM_FIELDING_E | 246.48 | 227.77 | 159.0 | 65 | 1898 | 2.99 | 10.97 | 4.77 | 0 |
| TEAM_FIELDING_DP | 141.87 | 29.58 | 146.0 | 52 | 228 | -0.32 | -0.17 | 0.62 | 0 |
3. BUILD MODELS
Now I will build three separate linear models and compare their performance and fit.
For the first model I will use the naive method of selecting every variable in the raw, uncleaned data set. Our intuition tells us this is unlikely to be the best-performing model, as we already know there are issues with the data and we are using every variable.
For the second model I will use the cleaned data set with missing values imputed using the MICE method, selecting variables that result in a better fit using my own intuition. For the third model I will again use the cleaned data set, with variables selected by stepwise AIC minimization (see the sketch below).
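A sketch of the three fits; rawTrainDf stands in for the original uncleaned data, and the hand-picked variable set for model 2 is illustrative, since the report does not list the exact variables chosen.

```r
library(MASS)

# Model 1: every variable in the raw, uncleaned data
lm1 <- lm(TARGET_WINS ~ ., data = rawTrainDf)

# Model 2: hand-picked variables on the cleaned, imputed data (illustrative set)
lm2 <- lm(TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_HR +
            TEAM_BATTING_BB + TEAM_FIELDING_E, data = trainDf)

# Model 3: stepwise selection minimizing AIC, starting from the full cleaned model
lm3 <- stepAIC(lm(TARGET_WINS ~ ., data = trainDf), trace = FALSE)
```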
Below are the \(R^2\), residual standard error, and adjusted \(R^2\) of each model. Surprisingly, the model fit on the raw, non-cleaned, non-imputed training data has the best fit statistics.
| model | \(R^2\) | residual SE | adjusted \(R^2\) |
|---|---|---|---|
| Model 1 | 0.55 | 8.49 | 0.51 |
| Model 2 | 0.19 | 14.16 | 0.19 |
| Model 3 (AIC) | 0.30 | 13.18 | 0.30 |
4. SELECT MODELS
I will then use the evaluation data set to make predictions with each of the three models. To make the first and third models work, I will need to impute the missing values in the evaluation dataset, using the same method as for the training data. If these missing values are not filled in, the predictions will return no value for the rows that contain them.
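A sketch of that step; the column bookkeeping between the raw and cleaned models is glossed over here.

```r
# Impute the evaluation data with the same approach used for training
evalImp <- complete(mice(evalDf, m = 5, printFlag = FALSE))

# Predict wins with each model
preds <- data.frame(
  lm1 = predict(lm1, newdata = evalImp),
  lm2 = predict(lm2, newdata = evalImp),
  aic = predict(lm3, newdata = evalImp)
)
head(preds)
```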
Below is a table containing the predicted TARGET_WINS for each model. One thing that stands out at first glance is that the first model produces negative predictions. Obviously it is not possible to have a negative number of wins, so this model is not very useful. The second and third models, both based on the cleaned and imputed data, do not suffer from these large negative predictions. In general, the AIC-selected model and the second model produce similar results.
| Model 1 (lm1) | Model 2 (lm2) | Model 3 (AIC) |
|---|---|---|
| 48.44 | 71.67 | 60.78 |
| 58.75 | 70.29 | 63.52 |
| 64.74 | 72.05 | 69.32 |
| 79.89 | 83.42 | 87.68 |
| -3976.99 | 69.26 | 66.86 |
| -1630.68 | 69.45 | 84.37 |
| 17.94 | 72.11 | 82.06 |
| -137.22 | 67.43 | 71.00 |
| 23.91 | 77.89 | 72.55 |
| 32.11 | 74.41 | 72.94 |
| 39.95 | 75.16 | 68.54 |
| 66.46 | 85.61 | 85.37 |
| 92.45 | 86.82 | 86.12 |
| 89.70 | 78.95 | 84.50 |
| 86.04 | 74.37 | 84.02 |
We can also see, when plotting the predictions, that there is little obvious difference between the models aside from the clearly outrageous outliers generated by the first model.
We can use the graphs below to check the validity of our models. All three models suffer from a lack of linearity, which indicates that linear regression may not be the best technique for predicting values from this data with the given variables. The two models that included the most variables (models 1 and 3) also suffer from collinearity issues.
Model 1
Model 2
Model 3
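One way to generate diagnostic panels covering linearity, homogeneity of variance, and collinearity is the performance package; this is an assumption about tooling, as the report does not say how its plots were produced.

```r
library(performance)

# Panels for linearity, homogeneity of variance, collinearity,
# influential observations, and normality of residuals
check_model(lm1)
check_model(lm2)
check_model(lm3)
```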
5. CONCLUSION
Overall, none of the models I was able to generate instills much confidence in its ability to predict. The model with the best fit according to the \(R^2\) statistic was trained on data full of missing values and extreme outliers, which led to clearly incorrect negative predictions.
The second and third models both had significantly lower \(R^2\) scores, indicating a poor fit overall. In addition, none of the models performed well when checked for linearity or homogeneity of variance. While the second model did not suffer from collinearity issues, the other two models did.