Nick Oliver - Homework 1
1. DATA EXPLORATION
The data for this model comes in two separate datasets. The training set consists of 2,276 rows across 15 variables (excluding the target and index), and the evaluation set we will test our model against contains 259 rows. All variables in the dataset are numeric.
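To make the workflow concrete, a minimal loading sketch is shown below; the file names are assumptions, as the report does not list them.

```r
# Load the training and evaluation data (file names assumed)
trainDf <- read.csv("moneyball-training-data.csv")
evalDf  <- read.csv("moneyball-evaluation-data.csv")

dim(trainDf)  # 2276 rows
dim(evalDf)   # 259 rows
```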
The table below shows some useful descriptive statistics for the training data. Many of the variables have a minimum of 0, but not all.
One interesting piece of information is the range of the TARGET_WINS variable. The minimum is 0, meaning at least one team did not win a single game. The maximum is 146, which indicates no team in the training dataset had a perfect season, since we know a season consists of 162 games.
Also of note is the number of missing values in certain variables, most notably TEAM_BATTING_HBP, the batters-hit-by-pitch variable. With 91% of its values missing, we will remove this variable from our dataset because there simply is not enough information to impute sensible values.
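A quick way to confirm these missing-data percentages, assuming the training data lives in trainDf:

```r
# Proportion of missing values per variable, highest first
missPct <- colMeans(is.na(trainDf))
sort(round(missPct, 2), decreasing = TRUE)
# TEAM_BATTING_HBP should come out around 0.92 (2085 / 2276)
```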
The means and medians of most variables are relatively close in value, which suggests those variables are largely free of extreme outliers, since outliers tend to pull the mean away from the median. TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_FIELDING_E are notable exceptions, with means well above their medians and very large skew values.
| variable | mean | sd | median | min | max | skew | kurtosis | se | missing |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 80.79 | 15.75 | 82.0 | 0 | 146 | -0.40 | 1.03 | 0.33 | 0 |
| TEAM_BATTING_H | 1469.27 | 144.59 | 1454.0 | 891 | 2554 | 1.57 | 7.28 | 3.03 | 0 |
| TEAM_BATTING_2B | 241.25 | 46.80 | 238.0 | 69 | 458 | 0.22 | 0.01 | 0.98 | 0 |
| TEAM_BATTING_3B | 55.25 | 27.94 | 47.0 | 0 | 223 | 1.11 | 1.50 | 0.59 | 0 |
| TEAM_BATTING_HR | 99.61 | 60.55 | 102.0 | 0 | 264 | 0.19 | -0.96 | 1.27 | 0 |
| TEAM_BATTING_BB | 501.56 | 122.67 | 512.0 | 0 | 878 | -1.03 | 2.18 | 2.57 | 0 |
| TEAM_BATTING_SO | 735.61 | 248.53 | 750.0 | 0 | 1399 | -0.30 | -0.32 | 5.33 | 102 |
| TEAM_BASERUN_SB | 124.76 | 87.79 | 101.0 | 0 | 697 | 1.97 | 5.49 | 1.90 | 131 |
| TEAM_BASERUN_CS | 52.80 | 22.96 | 49.0 | 0 | 201 | 1.98 | 7.62 | 0.59 | 772 |
| TEAM_BATTING_HBP | 59.36 | 12.97 | 58.0 | 29 | 95 | 0.32 | -0.11 | 0.94 | 2085 |
| TEAM_PITCHING_H | 1779.21 | 1406.84 | 1518.0 | 1137 | 30132 | 10.33 | 141.84 | 29.49 | 0 |
| TEAM_PITCHING_HR | 105.70 | 61.30 | 107.0 | 0 | 343 | 0.29 | -0.60 | 1.28 | 0 |
| TEAM_PITCHING_BB | 553.01 | 166.36 | 536.5 | 0 | 3645 | 6.74 | 96.97 | 3.49 | 0 |
| TEAM_PITCHING_SO | 817.73 | 553.09 | 813.5 | 0 | 19278 | 22.17 | 671.19 | 11.86 | 102 |
| TEAM_FIELDING_E | 246.48 | 227.77 | 159.0 | 65 | 1898 | 2.99 | 10.97 | 4.77 | 0 |
| TEAM_FIELDING_DP | 146.39 | 26.23 | 149.0 | 52 | 228 | -0.39 | 0.18 | 0.59 | 286 |
Doing some preliminary analysis of the data, we can compare team pitching hits allowed (TEAM_PITCHING_H) against target wins (TARGET_WINS) to validate our assumption that allowing more hits is negatively correlated with the number of wins. While there does appear to be a negative correlation, it is not obviously linear. There is heavy clustering below 5,000 hits allowed, but within that cluster the number of wins varies significantly. There are also some obvious outliers here, with one team allowing nearly 30,000 hits.
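A sketch of this comparison using ggplot2; the plotting choices are mine, not specified in the report.

```r
library(ggplot2)

# Hits allowed vs. wins; note the heavy cluster below 5,000 hits allowed
ggplot(trainDf, aes(x = TEAM_PITCHING_H, y = TARGET_WINS)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Hits allowed (TEAM_PITCHING_H)", y = "Wins (TARGET_WINS)")
```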
Looking at another relationship, between strikeouts and wins, we can see that the strikeout data has fewer outliers but a weaker negative correlation with target wins than one would expect.
Plotting a histogram of the target wins, we can see the distribution closely resembles a normal distribution.
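One way to draw this histogram (the bin count is an arbitrary choice):

```r
# Distribution of TARGET_WINS; roughly bell-shaped, centered near the low 80s
ggplot(trainDf, aes(x = TARGET_WINS)) +
  geom_histogram(bins = 30, color = "white") +
  labs(x = "Wins (TARGET_WINS)", y = "Count")
```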
Plotting the correlations between TARGET_WINS and the variables
(excluding INDEX and TEAM_BATTING_HBP) we can
see that very few variables are strongly correlated with the target
variable.
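These correlations can be computed directly; a sketch, using pairwise-complete observations to handle the variables that still contain missing values:

```r
# Correlation of every variable with TARGET_WINS,
# after excluding INDEX and TEAM_BATTING_HBP
corDf  <- subset(trainDf, select = -c(INDEX, TEAM_BATTING_HBP))
target <- cor(corDf, use = "pairwise.complete.obs")[, "TARGET_WINS"]
sort(round(target, 2), decreasing = TRUE)
```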
Unsurprisingly, we can see that TEAM_BATTING_HR and TEAM_PITCHING_HR are strongly positively correlated, as one would expect since they are essentially measuring the same thing.
There are some other strong correlations that are less obvious, such as errors (TEAM_FIELDING_E) being strongly correlated with walks by batters (TEAM_BATTING_BB), strikeouts (TEAM_BATTING_SO), and team pitching hits allowed (TEAM_PITCHING_H).
Digging a little deeper, the Pearson correlation coefficient between errors and walks by batters is -0.6559708, indicating a strong negative correlation between the two variables. Comparing errors with team pitching hits allowed, we see a correlation of 0.667759, indicating a strong positive correlation.
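The underlying calls (neither variable in these pairs has missing values, so no special NA handling is needed):

```r
cor(trainDf$TEAM_FIELDING_E, trainDf$TEAM_BATTING_BB)  # -0.6559708
cor(trainDf$TEAM_FIELDING_E, trainDf$TEAM_PITCHING_H)  #  0.667759
```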
Ultimately I know very little about baseball in general so I do not have an intuition regarding why these variables would be strongly correlated with each other.
Lastly, let's take a closer look at the missing data. We've already determined that the batter-hit-by-pitch variable (TEAM_BATTING_HBP) is missing 91% of its data, but what of the other variables?
Using the plot below, we can visualize the missingness of the remaining variables. There are 5 variables that contain varying degrees of missing data. We will use this information to fill in the missing values in our data preparation step.
TEAM_BASERUN_CS is missing the second-largest number of values, but at only 772 missing values out of 2,276 rows this is much less of a concern than the HBP variable we identified earlier. The remaining variables with missing data are each missing less than 25% of their values, so they should be safe to impute.
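One way to produce a missingness plot like the one described, using the naniar package (a choice of mine; the report does not say which package generated its plot):

```r
library(naniar)

# Bar chart of missing-value counts for each remaining variable
gg_miss_var(subset(trainDf, select = -INDEX))
```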
2. DATA PREPARATION
First, we will drop the INDEX and TEAM_BATTING_HBP variables from our dataset. We've already determined that TEAM_BATTING_HBP is missing 91% of its data, and INDEX is simply an index of the row number.
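A minimal sketch of this step:

```r
# Drop the row index and the mostly-missing hit-by-pitch variable
trainDf <- subset(trainDf, select = -c(INDEX, TEAM_BATTING_HBP))
```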
Next we will use the mice library to impute the missing values in the trainDf data.frame. MICE is an acronym for multiple imputation by chained equations. To use MICE, one must assume that the missing values are missing at random, meaning the missingness can be accounted for by variables where there is complete information. Then, as the name implies, MICE runs multiple iterations over the data and generates values to fill in the missing entries.
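A sketch of the imputation; the number of imputations and the seed are assumptions, since the report does not state the settings used.

```r
library(mice)

set.seed(123)                                # for reproducibility (assumed)
imp     <- mice(trainDf, m = 5, printFlag = FALSE)
trainDf <- complete(imp)                     # take a completed data set

sum(is.na(trainDf))                          # should now be 0
```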
We now have a data set free of missing values, with the meaningless INDEX variable removed and TEAM_BATTING_HBP removed as well due to its 91% missing rate. The table below confirms there are no longer any missing values.
One interesting comparison we can make, for a variable that had a large number of missing values, is how the summary statistics changed with the imputed data. TEAM_BASERUN_CS was missing 772 values, or about 34% of its data. We can observe that the mean changed from 52.80 to 75.71 and the median from 49.0 to 57.0; a shift that large suggests the imputed values differ systematically from the observed ones. The min of 0 and max of 201 did not change, which is good.
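A quick spot-check of the imputed variable:

```r
# Summary statistics for TEAM_BASERUN_CS after imputation;
# the mean moved from 52.80 to about 75.71, the median from 49.0 to 57.0
summary(trainDf$TEAM_BASERUN_CS)
```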
| variable | mean | sd | median | min | max | skew | kurtosis | se | missing |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 80.79 | 15.75 | 82.0 | 0 | 146 | -0.40 | 1.03 | 0.33 | 0 |
| TEAM_BATTING_2B | 241.25 | 46.80 | 238.0 | 69 | 458 | 0.22 | 0.01 | 0.98 | 0 |
| TEAM_BATTING_3B | 55.25 | 27.94 | 47.0 | 0 | 223 | 1.11 | 1.50 | 0.59 | 0 |
| TEAM_BATTING_HR | 99.61 | 60.55 | 102.0 | 0 | 264 | 0.19 | -0.96 | 1.27 | 0 |
| TEAM_BATTING_BB | 501.56 | 122.67 | 512.0 | 0 | 878 | -1.03 | 2.18 | 2.57 | 0 |
| TEAM_BATTING_SO | 727.96 | 246.89 | 734.5 | 0 | 1399 | -0.23 | -0.36 | 5.18 | 0 |
| TEAM_BASERUN_SB | 135.62 | 98.91 | 106.0 | 0 | 697 | 1.73 | 3.53 | 2.07 | 0 |
| TEAM_BASERUN_CS | 75.71 | 51.29 | 57.0 | 0 | 201 | 1.45 | 0.94 | 1.08 | 0 |
| TEAM_BATTING_HBP | 63.93 | 11.71 | 69.0 | 29 | 95 | -0.48 | -0.34 | 0.25 | 0 |
| TEAM_PITCHING_H | 1779.21 | 1406.84 | 1518.0 | 1137 | 30132 | 10.33 | 141.84 | 29.49 | 0 |
| TEAM_PITCHING_HR | 105.70 | 61.30 | 107.0 | 0 | 343 | 0.29 | -0.60 | 1.28 | 0 |
| TEAM_PITCHING_BB | 553.01 | 166.36 | 536.5 | 0 | 3645 | 6.74 | 96.97 | 3.49 | 0 |
| TEAM_PITCHING_SO | 811.44 | 542.66 | 803.0 | 0 | 19278 | 22.46 | 692.95 | 11.37 | 0 |
| TEAM_FIELDING_E | 246.48 | 227.77 | 159.0 | 65 | 1898 | 2.99 | 10.97 | 4.77 | 0 |
| TEAM_FIELDING_DP | 141.87 | 29.58 | 146.0 | 52 | 228 | -0.32 | -0.17 | 0.62 | 0 |
3. BUILD MODELS
Now I will build three separate linear models and compare their performance and fit.
For the first model I will use the naive method of selecting every variable in the raw, uncleaned data set. Our intuition tells us this is unlikely to be the best-performing model, as we already know there are issues with the data and we are using every variable.
For the second model I will use the cleaned data set with missing values imputed using the MICE method, selecting variables that result in a better fit using my own intuition. For the third model I will again use the cleaned data set, with variables selected by stepwise AIC minimization (see the sketch below).
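A sketch of the three fits; rawTrainDf stands in for the original uncleaned data, and the hand-picked variable set for model 2 is illustrative, since the report does not list the exact variables chosen.

```r
library(MASS)

# Model 1: every variable in the raw, uncleaned data
lm1 <- lm(TARGET_WINS ~ ., data = rawTrainDf)

# Model 2: hand-picked variables on the cleaned, imputed data (illustrative set)
lm2 <- lm(TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_HR +
            TEAM_BATTING_BB + TEAM_FIELDING_E, data = trainDf)

# Model 3: stepwise selection minimizing AIC, starting from the full cleaned model
lm3 <- stepAIC(lm(TARGET_WINS ~ ., data = trainDf), trace = FALSE)
```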
Below are the \(R^2\), residual standard error, and adjusted \(R^2\) of each model. Surprisingly, the model fit on the raw, non-cleaned, non-imputed training data has the best fit statistics.
| model | \(R^2\) | residual SE | adjusted \(R^2\) |
|---|---|---|---|
| Model 1 | 0.55 | 8.49 | 0.51 |
| Model 2 | 0.19 | 14.16 | 0.19 |
| Model 3 (AIC) | 0.30 | 13.18 | 0.30 |
4. SELECT MODELS
I will then use the evaluation data set to make predictions with each of the three models. To make the first and third models work, I will need to impute the missing values in the evaluation dataset, using the same method as for the training data. If these missing values are not filled in, the predictions will return no value for the rows that contain them.
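A sketch of that step; the column bookkeeping between the raw and cleaned models is glossed over here.

```r
# Impute the evaluation data with the same approach used for training
evalImp <- complete(mice(evalDf, m = 5, printFlag = FALSE))

# Predict wins with each model
preds <- data.frame(
  lm1 = predict(lm1, newdata = evalImp),
  lm2 = predict(lm2, newdata = evalImp),
  aic = predict(lm3, newdata = evalImp)
)
head(preds)
```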
Below is a table containing the predicted TARGET_WINS for each model. One thing that stands out at first glance is that the first model produces negative predictions. Obviously it is not possible to have a negative number of wins, so this model is not very useful. The second and third models, both based on the cleaned and imputed data, do not suffer from these large negative predictions. In general, the AIC-selected model and the second model produce similar results.
| Model 1 (lm1) | Model 2 (lm2) | Model 3 (AIC) |
|---|---|---|
| 48.44 | 71.67 | 60.78 |
| 58.75 | 70.29 | 63.52 |
| 64.74 | 72.05 | 69.32 |
| 79.89 | 83.42 | 87.68 |
| -3976.99 | 69.26 | 66.86 |
| -1630.68 | 69.45 | 84.37 |
| 17.94 | 72.11 | 82.06 |
| -137.22 | 67.43 | 71.00 |
| 23.91 | 77.89 | 72.55 |
| 32.11 | 74.41 | 72.94 |
| 39.95 | 75.16 | 68.54 |
| 66.46 | 85.61 | 85.37 |
| 92.45 | 86.82 | 86.12 |
| 89.70 | 78.95 | 84.50 |
| 86.04 | 74.37 | 84.02 |
We can also see, when plotting the predictions, that there is little obvious difference between the models aside from the clearly outrageous outliers generated by the first model.
We can use the graphs below to check the validity of our models. All three models suffer from a lack of linearity, which indicates that linear regression may not be the best technique for predicting values from this data with the given variables. The two models that included the most variables (models 1 and 3) also suffer from collinearity issues.
Model 1
Model 2
Model 3
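One way to generate diagnostic panels covering linearity, homogeneity of variance, and collinearity is the performance package; this is an assumption about tooling, as the report does not say how its plots were produced.

```r
library(performance)

# Panels for linearity, homogeneity of variance, collinearity,
# influential observations, and normality of residuals
check_model(lm1)
check_model(lm2)
check_model(lm3)
```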
5. CONCLUSION
Overall, none of the models I was able to generate instills much confidence in its ability to predict. The model with the best fit according to the \(R^2\) statistic was trained on data full of missing values and extreme outliers, which led to clearly incorrect negative predictions.
The second and third models both had significantly lower \(R^2\) scores, indicating a poor fit overall. In addition, none of the models performed well when checked for linearity or homogeneity of variance. While the second model did not suffer from collinearity issues, the other two models did.