To see how regression can help us evaluate baseball team performance, this project explores whether a team's success in a given season can be predicted or explained by that season's statistics. Our goal is to build a multiple linear regression model on the training data to predict the number of wins for a team. We will explore, analyze, and model a historical baseball dataset containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and contains the team's performance for the given year, with all statistics adjusted to the scale of a 162-game season.
While correlation does not imply causation, the data suggest that a focus on some variables, such as singles or triples, to the exclusion of doubles might be worth pursuing. The data also suggest that reducing home runs allowed may not be worth giving up a number of ordinary hits.
Because baseball is so numbers-heavy, there are many statistics to consider when searching for the best predictors of team success. There are offensive statistics (offense meaning when a team is batting) and defensive statistics (defense meaning when a team is in the field), and these categories can be broken into many more subcategories. For the purpose of this project, however, we will use the available data to build a multiple linear regression model on the training data to predict the number of wins for a team.
To see how regression will help us predict the number of wins for a team, we don't actually need to understand all the details of the game of baseball, which has over 100 rules. Here we distill the sport to the basic knowledge one needs to attack the data science problem effectively. The goal of a baseball game is to score more runs (points) than the other team. Each team has 9 batters that have an opportunity to hit a ball with a bat in a predetermined order. After the 9th batter has had their turn, the first batter bats again, then the second, and so on. Each time a batter has an opportunity to bat, we call it a plate appearance (PA). At each PA, the other team's pitcher throws the ball and the batter tries to hit it. The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench, or does not (success) and can run around the bases and potentially score a run (reaching all 4 bases). Each team gets nine tries, referred to as innings, to score runs, and each inning ends after three outs (three failures).
The dataset was provided as a CSV file containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive and holds the team's performance for the given year, with all statistics adjusted to the scale of a 162-game season. The game statistics used in this study are the following:
| VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET_WINS | Number of wins | Outcome Variable |
| TEAM_BATTING_H | Base Hits by batters (1B,2B,3B,HR) | Positive Impact on Wins |
| TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins |
| TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins |
| TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins |
| TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins |
| TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins |
| TEAM_BATTING_SO | Strikeouts by batters | Negative Impact on Wins |
| TEAM_BASERUN_SB | Stolen bases | Positive Impact on Wins |
| TEAM_BASERUN_CS | Caught stealing | Negative Impact on Wins |
| TEAM_FIELDING_E | Errors | Negative Impact on Wins |
| TEAM_FIELDING_DP | Double Plays | Positive Impact on Wins |
| TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins |
| TEAM_PITCHING_H | Hits allowed | Negative Impact on Wins |
| TEAM_PITCHING_HR | Homeruns allowed | Negative Impact on Wins |
| TEAM_PITCHING_SO | Strikeouts by pitchers | Positive Impact on Wins |
The initial steps are to download the data and take a quick glimpse at the columns, their data types, and the number of rows and columns. Based on these initial observations, the data contain 2,276 teams with a variety of baseball performance statistics.
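A minimal sketch of this step (mirroring the import code in the appendix):

```r
library(dplyr)  # glimpse()

# Load the training data and inspect columns, types, and row count
train_data <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA621/master/Project1/moneyball-training-data.csv")
glimpse(train_data)
```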
At first glance, the column TEAM_BATTING_HBP has numerous NA values that will need to be addressed before building a model. Figure 1 shows summary statistics for all variables. The noteworthy statistics for the target are: the average number of wins in a season is 81 games, the median is 82 games, and the standard deviation is 16 games.
Figure 1: Summary statistics (N = 2,276)

| Characteristic | Mean (SD) | Median | Minimum | Maximum |
|---|---|---|---|---|
| TARGET_WINS | 81 (16) | 82 | 0 | 146 |
| TEAM_BATTING_H | 1,469 (145) | 1,454 | 891 | 2,554 |
| TEAM_BATTING_2B | 241 (47) | 238 | 69 | 458 |
| TEAM_BATTING_3B | 55 (28) | 47 | 0 | 223 |
| TEAM_BATTING_HR | 100 (61) | 102 | 0 | 264 |
| TEAM_BATTING_BB | 502 (123) | 512 | 0 | 878 |
| TEAM_BATTING_HBP | 59 (13) | 58 | 29 | 95 |
| TEAM_BATTING_SO | 736 (249) | 750 | 0 | 1,399 |
| TEAM_BASERUN_SB | 125 (88) | 101 | 0 | 697 |
| TEAM_BASERUN_CS | 53 (23) | 49 | 0 | 201 |
| TEAM_FIELDING_E | 246 (228) | 159 | 65 | 1,898 |
| TEAM_FIELDING_DP | 146 (26) | 149 | 52 | 228 |
| TEAM_PITCHING_BB | 553 (166) | 536 | 0 | 3,645 |
| TEAM_PITCHING_SO | 818 (553) | 814 | 0 | 19,278 |
Examining the target wins variable in detail gives a rough guideline for how many games a team can be expected to win. Most teams will win close to the average of 81 games, with some variability around that average: some teams win more and some win fewer.
The other variables also play an important role in understanding the data. Figure 1 presents summary statistics for all variables, which is sufficient for getting the gist of each variable's distribution. For example, the average number of base hits by batters per team is 1,469, with a minimum of 891 and a maximum of 2,554. Recall that the dataset contains baseball statistics on 2,276 teams. Missing values were excluded from the summary; they are dealt with in the data preparation section of this report.
A quick look at Figure 2 reveals the distribution of target wins. The distribution is approximately normal, with the majority of values falling in the center. This is confirmed by the QQ plot below the distribution plot: most of the target wins fall on the line, with some points diverging at the ends. This indicates possible outliers, where some teams win or lose more games than expected in the normal range. In the boxplot, several points fall outside the whiskers, which confirms the outliers suspected from the QQ plot.
Now, in order to build our models properly, it is worth checking the other columns for NA values.
| Columns_w_NA | Percent_NA |
|---|---|
| TEAM_BATTING_SO | 4.48% |
| TEAM_BASERUN_SB | 5.76% |
| TEAM_BASERUN_CS | 33.92% |
| TEAM_BATTING_HBP | 91.61% |
| TEAM_PITCHING_SO | 4.48% |
| TEAM_FIELDING_DP | 12.57% |
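These percentages can be computed in one pass rather than by hand; a sketch using dplyr/tidyr, assuming `train_data` is the loaded training frame:

```r
library(dplyr)
library(tidyr)

# Percentage of NA values per column, keeping only columns that have any
train_data %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = "Columns_w_NA", values_to = "Percent_NA") %>%
  filter(Percent_NA > 0)
```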
The following plot shows boxplots, and hence the outliers, for all the variables, both dependent and independent.
As we can see from the graph, only 4 of the 16 variables are normally or close to normally distributed; the other 12 have a significant skew. The response variable TARGET_WINS appears normally distributed. TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR are bimodal. 10 of the 16 variables have a minimum value of 0; this is not a major concern, as the percentage of zeros in each column is less than 1%. The variables TEAM_BATTING_BB, TEAM_BASERUN_CS, TEAM_BASERUN_SB, TEAM_PITCHING_BB, and TEAM_FIELDING_E have a significant number of outliers.
It is possible that not all variables are needed to create an accurate model. In Figure 4, a correlation value is computed for each variable against target wins. Some variables are highly correlated with target wins, while others are not. For example, base hits by batters has a correlation of 0.389, which is relatively high, while caught stealing is barely correlated with target wins at 0.022. There is also a column of p-values indicating whether each correlation is significant. Using a 95% confidence level as a decision rule, any variable with a p-value below 0.05 is significant. Strikeouts by batters (TEAM_BATTING_SO), caught stealing (TEAM_BASERUN_CS), batters hit by pitch (TEAM_BATTING_HBP), and double plays (TEAM_FIELDING_DP) do not meet this rule and could be excluded.
| term | TARGET_WINS |
|---|---|
| INDEX | -0.02105643 |
| TEAM_BATTING_H | 0.38876752 |
| TEAM_BATTING_2B | 0.28910365 |
| TEAM_BATTING_3B | 0.14260841 |
| TEAM_BATTING_HR | 0.17615320 |
| TEAM_BATTING_BB | 0.23255986 |
| TEAM_BATTING_SO | -0.03175071 |
| TEAM_BASERUN_SB | 0.13513892 |
| TEAM_BASERUN_CS | 0.02240407 |
| TEAM_BATTING_HBP | 0.07350424 |
| TEAM_PITCHING_H | -0.10993705 |
| TEAM_PITCHING_HR | 0.18901373 |
| TEAM_PITCHING_BB | 0.12417454 |
| TEAM_PITCHING_SO | -0.07843609 |
| TEAM_FIELDING_E | -0.17648476 |
| TEAM_FIELDING_DP | -0.03485058 |
Before entirely excluding variables, it is a good idea to transform the data, by fixing missing values or combining variables, and then re-examine the viability of those variables for predicting wins.
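As an example of combining variables, singles are not reported directly but can be derived from the existing hit counts (this is the derived-variable code used in the appendix):

```r
# Singles (1B) = total base hits minus doubles, triples, and home runs
singles <- train_data$TEAM_BATTING_H -
  (train_data$TEAM_BATTING_2B + train_data$TEAM_BATTING_3B + train_data$TEAM_BATTING_HR)
```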
From the table we can see positive and negative correlations among the predictors. Looking at the numerical correlations with the response variable, the predictors TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_HR are the most correlated and should be included in our regression.
Examining significant correlations among the independent variables, we also see that four of the pairs have a correlation close to 1, which can lead to multicollinearity issues in our analysis.
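The correlations and their p-values in Figure 4 can be produced together with Hmisc::rcorr, as in the appendix; a sketch:

```r
library(Hmisc)

# Pairwise correlations and p-values on complete observations (INDEX dropped)
cm <- rcorr(as.matrix(train_data[, -1]))

# Each variable's correlation with TARGET_WINS alongside its p-value
data.frame(r = cm$r[, "TARGET_WINS"], p = cm$P[, "TARGET_WINS"])
```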
Missing values need to be handled before building models. They can be handled by dropping the records, dropping the entire variable, or imputation. In this case, the batters-hit-by-pitch variable is dropped altogether prior to model building because it has too many missing values to impute properly. All other variables with missing values will be considered for the model because a majority of their records are present; these variables will be imputed.
First we remove TEAM_BATTING_HBP (batters hit by pitch), which has roughly 92% of its values missing.
Next we look at the patterns and intersections of missingness among the remaining variables using the naniar package. Five variables have missing values, and only 22 observations are missing all five; we simply delete those cases. Overall, the pattern suggests the values are Missing at Random (MAR).
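The missingness plots come from the naniar package; the relevant calls (from the appendix):

```r
library(naniar)

# Upset plot of missingness intersections across the 5 affected variables
gg_miss_upset(train_data, nsets = 5, nintersects = NA)

# Number of missing values per case (row)
gg_miss_case(train_data)
```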
When it comes to fixing missing values, several methods are at our disposal. The first technique is to fill the missing values of each variable with that variable's mean. We use the Hmisc R package to fill the missing data with the mean; mean imputation is a simple baseline that often gives reasonable results. The same procedure is used for the variables with missing values in Model 2, but with the median instead of the mean.
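A minimal sketch of Hmisc's impute() on a toy vector (the numbers here are made up for illustration):

```r
library(Hmisc)

# impute() replaces NAs with a chosen statistic; imputed entries print with '*'
x <- c(120, NA, 150, NA, 130)
impute(x, mean)    # both NAs become 133.33 (the Model 1 approach)
impute(x, median)  # both NAs become 130    (the Model 2 approach)
```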
The second technique is to impute conditionally on the other variables, for example with a decision tree or, as in Model 3 below, k-nearest-neighbour (kNN) imputation. This is slightly more involved but will likely give better results. In mean imputation, a single fixed value replaces every missing entry of a variable, whereas conditional imputation chooses a value based on the rest of the record.
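A sketch of tree-based imputation with rpart; the variable chosen here is illustrative, and this is not the exact procedure used for Model 3 (which uses kNN imputation):

```r
library(rpart)

# Fit a tree for the variable with NAs on the complete rows, then predict
# the missing entries from each team's other statistics
miss <- is.na(train_data$TEAM_BASERUN_SB)
fit  <- rpart(TEAM_BASERUN_SB ~ . - INDEX, data = train_data[!miss, ])
train_data$TEAM_BASERUN_SB[miss] <- predict(fit, newdata = train_data[miss, ])
```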
Model 1 is a full model containing all the variables, with the mean used for missing values. It is a good starting model for determining how well each variable helps predict wins; the mean is generally an adequate guess for a missing value. No selection technique is used in this model: all variables are included manually.
To do that, we used the Hmisc R package, which imputes missing values using a user-defined statistic (the mean, in our case).
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = train_model1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.113 -7.633 -0.018 7.324 45.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.237509 6.113895 9.689 < 2e-16 ***
## TEAM_BATTING_H -0.002070 0.005441 -0.380 0.703633
## TEAM_BATTING_2B -0.023765 0.008808 -2.698 0.007034 **
## TEAM_BATTING_3B 0.168568 0.018723 9.003 < 2e-16 ***
## TEAM_BATTING_HR 0.321816 0.055324 5.817 6.98e-09 ***
## TEAM_BATTING_BB 0.093055 0.018164 5.123 3.30e-07 ***
## TEAM_BATTING_SO -0.049966 0.008917 -5.603 2.40e-08 ***
## TEAM_BASERUN_SB 0.081187 0.006161 13.178 < 2e-16 ***
## TEAM_BASERUN_CS -0.059350 0.014616 -4.061 5.09e-05 ***
## TEAM_PITCHING_H 0.025458 0.002263 11.251 < 2e-16 ***
## TEAM_PITCHING_HR -0.216833 0.052137 -4.159 3.33e-05 ***
## TEAM_PITCHING_BB -0.060526 0.016914 -3.578 0.000354 ***
## TEAM_PITCHING_SO 0.028051 0.007993 3.509 0.000459 ***
## TEAM_FIELDING_E -0.084678 0.005292 -16.001 < 2e-16 ***
## TEAM_FIELDING_DP -0.120255 0.012485 -9.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.96 on 1975 degrees of freedom
## (286 observations deleted due to missingness)
## Multiple R-squared: 0.3858, Adjusted R-squared: 0.3814
## F-statistic: 88.61 on 14 and 1975 DF, p-value: < 2.2e-16
The overall p-value for Model 1 is less than 0.0001, which indicates a significant model for predicting wins. The Adjusted R-Squared and Mean Square Error (MSE) will be the metrics used to determine the best model: a higher Adjusted R-Squared is better and a lower MSE is better. In this case, the Adjusted R-Squared is 0.3814 and the MSE is 168.91, which will be the current benchmark.
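Both metrics can be read directly off the fitted lm object (a sketch, assuming model1 holds the fitted model):

```r
summary(model1)$adj.r.squared  # Adjusted R-squared (0.3814 for Model 1)
mean(residuals(model1)^2)      # Mean Square Error on the training data
```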
The resulting equation for Model 1 (coefficients rounded from the summary output above) is:

TARGET_WINS = 59.24 - 0.002(TEAM_BATTING_H) - 0.024(TEAM_BATTING_2B) + 0.169(TEAM_BATTING_3B) + 0.322(TEAM_BATTING_HR) + 0.093(TEAM_BATTING_BB) - 0.050(TEAM_BATTING_SO) + 0.081(TEAM_BASERUN_SB) - 0.059(TEAM_BASERUN_CS) + 0.025(TEAM_PITCHING_H) - 0.217(TEAM_PITCHING_HR) - 0.061(TEAM_PITCHING_BB) + 0.028(TEAM_PITCHING_SO) - 0.085(TEAM_FIELDING_E) - 0.120(TEAM_FIELDING_DP)
Most of Model 1 makes sense: positive measures of success like triples, home runs, walks by batters, and stolen bases carry positive coefficients, while negative measures like strikeouts by batters, caught stealing, home runs allowed, walks allowed, and errors carry negative coefficients. All of these values are intuitive. On the other hand, base hits and doubles by batters and double plays carry negative coefficients when they should have a positive impact on wins, and hits allowed carries a positive coefficient when it should have a negative impact; these values are counterintuitive. The counterintuitive parts of the model would need further investigation if this model were chosen for deployment. For now, however, the model is kept as a benchmark despite certain measures not making sense.
Earlier exploration of the data revealed possible outliers. Because the mean is highly influenced by outliers, Model 2 attempts to remedy that by using the median to impute missing values. Model 2 is a significant model based on its p-value of less than 0.0001; the Adjusted R-Squared is 0.3772 and the MSE is 169.799.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = train_model2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.141 -7.562 -0.055 7.303 45.741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.081468 6.120416 9.490 < 2e-16 ***
## TEAM_BATTING_H -0.002408 0.005486 -0.439 0.660776
## TEAM_BATTING_2B -0.023924 0.008849 -2.704 0.006915 **
## TEAM_BATTING_3B 0.169910 0.018789 9.043 < 2e-16 ***
## TEAM_BATTING_HR 0.316324 0.055363 5.714 1.27e-08 ***
## TEAM_BATTING_BB 0.089135 0.018298 4.871 1.20e-06 ***
## TEAM_BATTING_SO -0.045012 0.008991 -5.006 6.04e-07 ***
## TEAM_BASERUN_SB 0.078868 0.006138 12.850 < 2e-16 ***
## TEAM_BASERUN_CS -0.060070 0.014307 -4.198 2.81e-05 ***
## TEAM_PITCHING_H 0.026060 0.002291 11.374 < 2e-16 ***
## TEAM_PITCHING_HR -0.213394 0.052207 -4.087 4.54e-05 ***
## TEAM_PITCHING_BB -0.056704 0.017043 -3.327 0.000893 ***
## TEAM_PITCHING_SO 0.024406 0.008105 3.011 0.002635 **
## TEAM_FIELDING_E -0.082493 0.005281 -15.621 < 2e-16 ***
## TEAM_FIELDING_DP -0.121035 0.012572 -9.627 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11 on 1975 degrees of freedom
## (286 observations deleted due to missingness)
## Multiple R-squared: 0.3816, Adjusted R-squared: 0.3772
## F-statistic: 87.05 on 14 and 1975 DF, p-value: < 2.2e-16
The resulting equation for Model 2 (coefficients rounded from the summary output above) is:

TARGET_WINS = 58.08 - 0.002(TEAM_BATTING_H) - 0.024(TEAM_BATTING_2B) + 0.170(TEAM_BATTING_3B) + 0.316(TEAM_BATTING_HR) + 0.089(TEAM_BATTING_BB) - 0.045(TEAM_BATTING_SO) + 0.079(TEAM_BASERUN_SB) - 0.060(TEAM_BASERUN_CS) + 0.026(TEAM_PITCHING_H) - 0.213(TEAM_PITCHING_HR) - 0.057(TEAM_PITCHING_BB) + 0.024(TEAM_PITCHING_SO) - 0.082(TEAM_FIELDING_E) - 0.121(TEAM_FIELDING_DP)
Model 3 is a stepwise regression model built on kNN-imputed values.
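A sketch of this pipeline, mirroring the appendix code (DMwR for kNN imputation, caret's leapSeq method for stepwise subset selection):

```r
library(dplyr)  # select()
library(DMwR)   # knnImputation()
library(caret)  # train(), trainControl()

# Impute missing values from each record's k nearest neighbours, then
# search subset sizes 2..13 under 10-fold cross-validation
knn_data   <- knnImputation(select(train_data, -INDEX))
step.model <- train(TARGET_WINS ~ ., data = knn_data,
                    method    = "leapSeq",
                    tuneGrid  = data.frame(nvmax = 2:13),
                    trControl = trainControl(method = "cv", number = 10))
summary(step.model$finalModel)
```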
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = knn_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.980 -8.547 0.148 8.398 63.362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.203403 4.656401 7.560 5.80e-14 ***
## TEAM_BATTING_H 0.044863 0.002599 17.263 < 2e-16 ***
## TEAM_BATTING_SO -0.010323 0.002091 -4.937 8.52e-07 ***
## TEAM_BASERUN_SB 0.037616 0.003784 9.942 < 2e-16 ***
## TEAM_PITCHING_HR 0.059940 0.007971 7.520 7.85e-14 ***
## TEAM_FIELDING_E -0.027353 0.001582 -17.289 < 2e-16 ***
## TEAM_FIELDING_DP -0.117420 0.013006 -9.028 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.13 on 2269 degrees of freedom
## Multiple R-squared: 0.3069, Adjusted R-squared: 0.3051
## F-statistic: 167.5 on 6 and 2269 DF, p-value: < 2.2e-16
Now let's compare the three models.
As mentioned earlier, the main decision criterion is the Adjusted R-Squared; a higher value indicates better performance. MSE, which measures the average squared difference between actual and predicted values, is used as a secondary criterion; a lower MSE is better. Aside from these criteria, the goal is also a highly interpretable model that makes sense.
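Concretely, for $n$ teams with observed wins $y_i$ and predicted wins $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$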
| Measures | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| r_squared | 0.3858 | 0.3816 | 0.3069 |
| adj_rsquared | 0.3814 | 0.3772 | 0.3051 |
| rse | 10.9600 | 11.0000 | 13.1300 |
| f_stat | 88.6100 | 87.0500 | 167.5000 |
Looking at the three models, we select Model 1 based on the criteria mentioned above: it has the highest Adjusted R-Squared and the lowest residual standard error.
The test data should go through the same cleaning process as the training data.
After cleaning the evaluation dataset, the table below shows the results, including the predicted values and their prediction intervals.
eval_model1 <- drop_na(eval_model1)  # drop evaluation records with missing predictors
predict1 <- predict(model1, newdata = eval_model1, interval = "prediction")  # point predictions with 95% prediction intervals
kable(predict1)  # render fit, lower, and upper bounds as a table
| fit | lwr | upr |
|---|---|---|
| 60.07336 | 38.49515 | 81.65158 |
| 65.67460 | 44.11320 | 87.23601 |
| 71.23288 | 49.68348 | 92.78227 |
| 84.86087 | 63.31623 | 106.40550 |
| 103.93799 | 81.28573 | 126.59026 |
| 71.85566 | 50.14909 | 93.56223 |
| 72.34177 | 50.75179 | 93.93174 |
| 75.22567 | 53.67294 | 96.77841 |
| 70.76950 | 49.23281 | 92.30620 |
| 64.63909 | 43.08353 | 86.19464 |
| 83.07114 | 61.51171 | 104.63057 |
| 85.21038 | 63.63224 | 106.78852 |
| 82.79881 | 61.22043 | 104.37720 |
| 85.81195 | 64.21798 | 107.40592 |
| 74.96797 | 53.40346 | 96.53247 |
| 71.29929 | 49.76256 | 92.83602 |
| 76.99432 | 55.46692 | 98.52171 |
| 68.15789 | 46.58147 | 89.73432 |
| 86.08610 | 64.50586 | 107.66633 |
| 85.29403 | 63.73352 | 106.85453 |
| 82.19979 | 60.64507 | 103.75451 |
| 83.06310 | 61.51523 | 104.61096 |
| 70.06209 | 48.52136 | 91.60282 |
| 78.84225 | 57.29569 | 100.38881 |
| 86.07804 | 64.51782 | 107.63826 |
| 68.95160 | 47.39845 | 90.50475 |
| 82.72954 | 61.13728 | 104.32179 |
| 67.20395 | 45.60529 | 88.80262 |
| 91.75254 | 70.18634 | 113.31875 |
| 86.83842 | 65.30544 | 108.37141 |
| 84.97968 | 63.42602 | 106.53334 |
| 85.02571 | 63.44522 | 106.60620 |
| 79.97374 | 58.43932 | 101.50816 |
| 86.02490 | 64.46247 | 107.58733 |
| 77.33769 | 55.81314 | 98.86224 |
| 90.51117 | 68.94900 | 112.07334 |
| 79.17794 | 57.40528 | 100.95060 |
| 88.95546 | 67.38655 | 110.52437 |
| 81.77011 | 60.20082 | 103.33939 |
| 92.94351 | 71.37142 | 114.51559 |
| 72.58169 | 50.93529 | 94.22809 |
| 68.45749 | 46.91301 | 90.00196 |
| 75.56876 | 53.92197 | 97.21555 |
| 73.74384 | 52.19562 | 95.29207 |
| 81.66188 | 60.09615 | 103.22760 |
| 74.53919 | 52.99974 | 96.07864 |
| 74.49560 | 52.94412 | 96.04708 |
| 69.84047 | 48.31247 | 91.36848 |
| 77.20263 | 55.66987 | 98.73539 |
| 98.68540 | 76.94744 | 120.42337 |
| 80.59269 | 59.00785 | 102.17753 |
| 66.38084 | 44.79346 | 87.96822 |
| 80.95703 | 59.39260 | 102.52145 |
| 93.48978 | 71.92510 | 115.05447 |
| 80.78763 | 59.17746 | 102.39781 |
| 87.42718 | 65.89639 | 108.95796 |
| 84.14919 | 62.57919 | 105.71919 |
| 83.42632 | 61.85401 | 104.99862 |
| 68.24651 | 46.66857 | 89.82444 |
| 74.11485 | 52.52794 | 95.70176 |
| 83.38865 | 61.76978 | 105.00753 |
| 98.26007 | 76.60928 | 119.91086 |
| 89.52138 | 67.90227 | 111.14049 |
| 70.93067 | 49.34972 | 92.51163 |
| 80.64191 | 59.08035 | 102.20346 |
| 91.96033 | 70.36565 | 113.55501 |
| 70.33282 | 48.74187 | 91.92377 |
| 77.83163 | 56.24180 | 99.42146 |
| 87.22196 | 65.68627 | 108.75766 |
| 82.48522 | 60.95591 | 104.01453 |
| 73.67207 | 52.11205 | 95.23209 |
| 76.75416 | 55.19624 | 98.31208 |
| 82.49050 | 60.89795 | 104.08305 |
| 84.99016 | 63.43133 | 106.54898 |
| 93.83859 | 72.25397 | 115.42321 |
| 71.80554 | 50.24320 | 93.36789 |
| 86.85487 | 65.30868 | 108.40105 |
| 79.28589 | 57.70949 | 100.86228 |
| 85.29262 | 63.74212 | 106.84311 |
| 82.46982 | 60.92057 | 104.01906 |
| 94.60183 | 73.02103 | 116.18262 |
| 91.34545 | 69.79067 | 112.90024 |
| 70.93903 | 49.32290 | 92.55516 |
| 83.10417 | 61.50827 | 104.70006 |
| 87.32008 | 65.70347 | 108.93668 |
| 88.43871 | 66.85095 | 110.02647 |
| 93.37461 | 71.63719 | 115.11203 |
| 99.36781 | 77.69373 | 121.04190 |
| 87.76843 | 66.19402 | 109.34283 |
| 91.17261 | 69.57685 | 112.76837 |
| 79.23140 | 57.67886 | 100.78395 |
| 73.24270 | 51.69310 | 94.79230 |
| 82.12689 | 60.59686 | 103.65691 |
| 86.71778 | 65.15951 | 108.27605 |
| 74.89025 | 53.33248 | 96.44801 |
| 66.92861 | 45.21966 | 88.63755 |
| 73.56909 | 52.00107 | 95.13711 |
| 89.88738 | 68.34726 | 111.42750 |
| 89.90138 | 68.34824 | 111.45452 |
| 90.11610 | 68.51220 | 111.72001 |
| 96.60351 | 75.03731 | 118.16970 |
| 91.47764 | 69.93489 | 113.02039 |
| 78.45313 | 56.91791 | 99.98835 |
| 76.26105 | 54.71209 | 97.81001 |
| 85.73440 | 64.17606 | 107.29275 |
| 82.29847 | 60.76377 | 103.83317 |
| 68.98400 | 47.42907 | 90.53894 |
| 68.50717 | 46.91714 | 90.09721 |
| 68.64108 | 47.08802 | 90.19414 |
| 70.55180 | 48.91215 | 92.19145 |
| 72.52800 | 50.97017 | 94.08583 |
| 83.69501 | 62.12356 | 105.26645 |
| 87.31126 | 65.72826 | 108.89426 |
| 77.52117 | 55.98289 | 99.05945 |
| 92.52258 | 70.96967 | 114.07550 |
| 88.00381 | 66.44678 | 109.56083 |
| 81.78382 | 60.23555 | 103.33209 |
| 81.03261 | 59.47052 | 102.59470 |
| 75.48230 | 53.92827 | 97.03633 |
| 83.17266 | 61.59474 | 104.75059 |
| 89.88120 | 68.33256 | 111.42983 |
| 61.40875 | 39.51990 | 83.29760 |
| 75.44101 | 53.89797 | 96.98406 |
| 73.59574 | 52.06000 | 95.13148 |
| 90.24117 | 68.61412 | 111.86822 |
| 78.64452 | 57.09326 | 100.19579 |
| 59.15853 | 37.58302 | 80.73405 |
| 75.04333 | 53.47800 | 96.60867 |
| 93.43360 | 71.86647 | 115.00072 |
| 69.90950 | 48.36080 | 91.45820 |
| 75.33609 | 53.77841 | 96.89378 |
| 72.45306 | 50.91713 | 93.98899 |
| 77.33786 | 55.80938 | 98.86633 |
| 80.48555 | 58.95066 | 102.02043 |
| 77.62882 | 56.07797 | 99.17968 |
| 83.52247 | 61.98974 | 105.05520 |
| 81.74656 | 60.20149 | 103.29163 |
| 78.47525 | 56.93024 | 100.02026 |
| 62.29530 | 40.71983 | 83.87077 |
| 77.26846 | 55.72412 | 98.81279 |
| 67.90964 | 46.34271 | 89.47657 |
| 93.36455 | 71.79030 | 114.93881 |
| 92.34106 | 70.63874 | 114.04339 |
| 72.06385 | 50.49380 | 93.63390 |
| 106.59538 | 84.94074 | 128.25001 |
| 112.15416 | 90.51505 | 133.79326 |
| 98.10654 | 76.54120 | 119.67189 |
| 107.50113 | 85.88524 | 129.11702 |
| 102.26469 | 80.66316 | 123.86623 |
| 96.85838 | 75.27515 | 118.44161 |
| 84.68299 | 63.12854 | 106.23745 |
| 84.18226 | 62.62349 | 105.74102 |
| 71.83081 | 50.28326 | 93.37836 |
| 79.35595 | 57.82623 | 100.88567 |
| 96.43348 | 74.79921 | 118.06776 |
| 96.26514 | 74.69948 | 117.83080 |
| 82.70737 | 61.15702 | 104.25772 |
| 93.53970 | 71.97243 | 115.10696 |
| 81.03998 | 59.49492 | 102.58503 |
| 77.87108 | 56.31519 | 99.42696 |
| 82.52147 | 60.95463 | 104.08831 |
| 69.11141 | 47.56080 | 90.66202 |
| 75.05836 | 53.51372 | 96.60301 |
| 81.15453 | 59.62197 | 102.68710 |
| 96.52924 | 74.81298 | 118.24550 |
| 90.01229 | 68.45875 | 111.56583 |
| 86.62406 | 65.09434 | 108.15378 |
| 87.07450 | 65.53194 | 108.61707 |
| 61.81109 | 40.19830 | 83.42388 |
| 81.44877 | 59.88653 | 103.01101 |
| 68.47654 | 46.90147 | 90.05161 |
| 71.40841 | 49.85422 | 92.96259 |
| 70.37271 | 48.78647 | 91.95896 |
| 62.53699 | 40.96878 | 84.10521 |
| 73.78619 | 52.25298 | 95.31940 |
| 92.46219 | 70.87723 | 114.04715 |
| 84.28704 | 62.74338 | 105.83070 |
| 84.99764 | 63.45073 | 106.54455 |
| 73.05263 | 51.47967 | 94.62559 |
| 78.48441 | 56.93818 | 100.03063 |
| 74.65449 | 53.07283 | 96.23614 |
| 91.52677 | 69.75094 | 113.30260 |
| 79.81631 | 58.27981 | 101.35281 |
| 87.78063 | 66.22539 | 109.33586 |
| 77.60570 | 56.04029 | 99.17110 |
| 76.33916 | 54.79145 | 97.88686 |
| 89.77318 | 68.17844 | 111.36792 |
| 63.69324 | 42.14486 | 85.24162 |
| 70.47613 | 48.90122 | 92.05103 |
| 82.18603 | 60.63547 | 103.73659 |
| 78.34127 | 56.75660 | 99.92595 |
| 93.86369 | 72.24692 | 115.48046 |
| 74.49629 | 52.96747 | 96.02511 |
| 81.45126 | 59.91940 | 102.98313 |
| 73.47833 | 51.93023 | 95.02644 |
| 70.77610 | 49.20815 | 92.34405 |
| 78.55512 | 57.01606 | 100.09419 |
| 69.23494 | 47.66356 | 90.80631 |
| 79.00303 | 57.47314 | 100.53292 |
| 77.51497 | 55.98000 | 99.04995 |
| 79.07279 | 57.52485 | 100.62074 |
| 84.13184 | 62.60172 | 105.66196 |
| 85.48324 | 63.88634 | 107.08014 |
| 92.64399 | 71.02526 | 114.26272 |
| 82.66563 | 61.11035 | 104.22092 |
| 88.02235 | 66.45445 | 109.59025 |
| 79.77963 | 58.25585 | 101.30342 |
| 75.85559 | 54.32689 | 97.38430 |
| 75.98735 | 54.40805 | 97.56665 |
| 79.50247 | 57.96465 | 101.04030 |
| 76.53748 | 54.96602 | 98.10895 |
| 88.38708 | 66.85186 | 109.92231 |
| 88.17293 | 66.63329 | 109.71256 |
| 81.87435 | 60.32467 | 103.42402 |
| 79.60554 | 58.06100 | 101.15007 |
| 60.48526 | 38.90558 | 82.06495 |
| 82.91219 | 61.33159 | 104.49280 |
| 76.71998 | 55.18154 | 98.25842 |
| 83.39090 | 61.85031 | 104.93148 |
| 71.70958 | 50.16819 | 93.25096 |
| 84.15537 | 62.58045 | 105.73028 |
| 80.57078 | 59.01359 | 102.12798 |
| 50.34285 | 28.40255 | 72.28315 |
| 68.12559 | 46.57338 | 89.67781 |
| 78.41185 | 56.83952 | 99.98418 |
| 83.98771 | 62.43525 | 105.54018 |
| 87.86446 | 66.29147 | 109.43745 |
| 72.38459 | 50.79144 | 93.97775 |
Three models were generated from baseball team data spanning 1871 to 2006 to predict the number of wins. Prior to generating the models, the baseball data was analyzed to better understand the relationships between variables. The chosen model was Model 1, which was created using mean imputation for missing values.
knitr::opts_chunk$set(cache =TRUE)
# 0. Libraries
library(corrplot)
library(tidyverse)
library(Hmisc)
library(PerformanceAnalytics)
library(mice)
library(gt)
library(DMwR)
library(caret)
library(bnstruct)
library(VIM)
library(corrr)
library(naniar)     # gg_miss_upset(), gg_miss_case()
library(gtsummary)
library(kableExtra)
# 1. Data Exploration
## Import data
train_data <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA621/master/Project1/moneyball-training-data.csv")
glimpse(train_data)
## Summary table
wins <- train_data %>% select(TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_HBP, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP, TEAM_PITCHING_BB, TEAM_PITCHING_SO)
table1 <- tbl_summary(wins,
statistic = list(all_continuous() ~ "{mean} ({sd}) {median} {min} {max}"), missing = "no")
table1
## Create more variables
singles <- train_data$TEAM_BATTING_H - (train_data$TEAM_BATTING_2B + train_data$TEAM_BATTING_3B + train_data$TEAM_BATTING_HR)
train_data$TEAM_BATTING_SLG <- ((train_data$TEAM_BATTING_HR *4)+ (train_data$TEAM_BATTING_3B*3) + (train_data$TEAM_BATTING_2B*2)+ singles)/train_data$TEAM_BATTING_H
## Plots
### Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), height = c(1,8))
#### Draw the boxplot and the histogram
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(train_data$TARGET_WINS ,main="Figure 2 : Distribution and Probability Plot for TARGET_WINS",cex.main=1, horizontal=TRUE , ylim=c(0,150), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(train_data$TARGET_WINS , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="TARGET WINS", xlim=c(0,150))
### qq plots
qqnorm(train_data$TARGET_WINS, pch = 1, frame = FALSE)
qqline(train_data$TARGET_WINS, col = "steelblue", lwd = 2)
max_obs <- 2276
batting_so_na <- ((102/max_obs) * 100)
baserun_sb_na <- (131/max_obs) * 100
baserun_cs_na <- (772/max_obs) * 100
batting_hbp_na <- (2085/max_obs) * 100
pitching_so_na <- (102/max_obs) * 100
fielding_dp_na <- (286/max_obs) * 100
df_percent_na <- data.frame(Columns_w_NA = c("team_batting_so", "team_baserun_sb", "team_baserun_cs", "team_batting_hbp", "team_pitching_so", "team_fielding_dp"), Percent_NA = c(batting_so_na, baserun_sb_na, baserun_cs_na, batting_hbp_na, pitching_so_na, fielding_dp_na))
### Build a gt table of NA percentages
gt_tbl <- gt(data = df_percent_na)
### Show the gt Table
gt_tbl
### Outliers
ggplot(stack(train_data[,-1]), aes(x = ind, y = values, fill=ind)) +
geom_boxplot(outlier.colour = "red", outlier.alpha=.4) +
coord_cartesian(ylim = c(0, 1000)) +
theme_classic()+
theme(axis.text.x=element_text(angle=45, hjust=1))
### Correlation
COR <- train_data %>%
correlate() %>%
focus(TARGET_WINS)
gt(COR)
COR %>%
mutate(rowname = factor(term, levels = term[order(TARGET_WINS)])) %>% # Order by correlation strength
ggplot(aes(x = rowname, y = TARGET_WINS)) +
geom_bar(stat = "identity") +
ylab("Correlation with TARGET WINS") +
xlab("Variables") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+ ggtitle("Figure 4: Correlation Against Target Win")
#### pairwise.complete.obs ignores NA values and computes correlation on complete observations
#### we might have to run these corrplots again after we handle the NA values
chart.Correlation(train_data[-c(1)], histogram=TRUE, method= "pearson", use="pairwise.complete.obs")
data.corr <- cor(train_data[-c(1)], use="pairwise.complete.obs")
corrplot(data.corr, type = "lower", method="square")
flattenCorrMatrix <- function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor =(cormat)[ut],
p = pmat[ut]
)
}
#### Eliminate INDEX from data frame
data_no_index <- train_data[-c(1)]
cor_matrix <- rcorr(as.matrix(data_no_index))
flattenCorrMatrix(cor_matrix$r, cor_matrix$P)
# 2. Data Preparation
train_data <- train_data[-11]  # drop TEAM_BATTING_HBP (~92% missing)
par(mfrow=c(1,2))
gg_miss_upset(train_data,
nsets = 5,
nintersects = NA)
gg_miss_case(train_data)+
theme_classic()
# 3. Build models
### MODEL 1: MEAN FULL MODEL
### Filling missing values with the mean
train_model1 <- train_data
train_model1$TEAM_BATTING_SO[is.na(train_model1$TEAM_BATTING_SO)] = mean(train_model1$TEAM_BATTING_SO, na.rm=TRUE)
train_model1$TEAM_BASERUN_SB[is.na(train_model1$TEAM_BASERUN_SB)] = mean(train_model1$TEAM_BASERUN_SB, na.rm=TRUE)
train_model1$TEAM_BASERUN_CS[is.na(train_model1$TEAM_BASERUN_CS)] = mean(train_model1$TEAM_BASERUN_CS, na.rm=TRUE)
train_model1$TEAM_PITCHING_SO[is.na(train_model1$TEAM_PITCHING_SO)] = mean(train_model1$TEAM_PITCHING_SO, na.rm=TRUE)
model1 <- lm(TARGET_WINS ~
TEAM_BATTING_H + # Base Hits by batters (1B,2B,3B,HR)
TEAM_BATTING_2B + # Doubles by batters (2B)
TEAM_BATTING_3B + # Triples by batters (3B)
TEAM_BATTING_HR + # Homeruns by batters (4B)
TEAM_BATTING_BB + # Walks by batters
TEAM_BATTING_SO + # Strikeouts by batters
TEAM_BASERUN_SB + # Stolen bases
TEAM_BASERUN_CS + # Caught stealing
TEAM_PITCHING_H + # Hits allowed
TEAM_PITCHING_HR + # Homeruns allowed
TEAM_PITCHING_BB + # Walks allowed
TEAM_PITCHING_SO + # Strikeouts by pitchers
TEAM_FIELDING_E + # Errors
TEAM_FIELDING_DP, # Double Plays
data=train_model1)
summary(model1)
### Mean Square Error (MSE)
mean(model1$residuals^2)
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model1)
### MODEL 2: MEDIAN WITH STEPWISE
train_model2 <- train_data
#### Filling missing values with the median
train_model2$TEAM_BATTING_SO[is.na(train_model2$TEAM_BATTING_SO)] = median(train_model2$TEAM_BATTING_SO, na.rm=TRUE)
train_model2$TEAM_BASERUN_SB[is.na(train_model2$TEAM_BASERUN_SB)] = median(train_model2$TEAM_BASERUN_SB, na.rm=TRUE)
train_model2$TEAM_BASERUN_CS[is.na(train_model2$TEAM_BASERUN_CS)] = median(train_model2$TEAM_BASERUN_CS, na.rm=TRUE)
train_model2$TEAM_PITCHING_SO[is.na(train_model2$TEAM_PITCHING_SO)] = median(train_model2$TEAM_PITCHING_SO, na.rm=TRUE)
model2 <- lm(TARGET_WINS ~
TEAM_BATTING_H + # Base Hits by batters (1B,2B,3B,HR)
TEAM_BATTING_2B + # Doubles by batters (2B)
TEAM_BATTING_3B + # Triples by batters (3B)
TEAM_BATTING_HR + # Homeruns by batters (4B)
TEAM_BATTING_BB + # Walks by batters
TEAM_BATTING_SO + # Strikeouts by batters
TEAM_BASERUN_SB + # Stolen bases
TEAM_BASERUN_CS + # Caught stealing
TEAM_PITCHING_H + # Hits allowed
TEAM_PITCHING_HR + # Homeruns allowed
TEAM_PITCHING_BB + # Walks allowed
TEAM_PITCHING_SO + # Strikeouts by pitchers
TEAM_FIELDING_E + # Errors
TEAM_FIELDING_DP, # Double Plays
data=train_model2)
summary(model2)
mean(model2$residuals^2)
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model2)
### MODEL 3: kNN Imputation
train_data_impute <- select(train_data, -c(INDEX))
#### using default values
knn_data <- knnImputation(train_data_impute)
summary(knn_data)
train.control <- trainControl(method ='cv', number=10)
step.model <- train(TARGET_WINS ~., data = knn_data,
method = "leapSeq",
tuneGrid = data.frame(nvmax = 2:13),
trControl = train.control
)
step.model$results
step.model$bestTune
summary(step.model$finalModel)
coef(step.model$finalModel, 6)
knn_lm <- lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = knn_data)
summary(knn_lm)
hist(knn_lm$residuals, xlab = "Residuals", ylab = "", breaks=100)
qqnorm(knn_lm$residuals)
qqline(knn_lm$residuals)
plot(fitted(knn_lm), residuals(knn_lm))
abline(h=0, lty = 2)
References:
- Dealing with Missing Data using R: https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
- rpart Decision Trees in R: http://www.learnbymarketing.com/tutorials/rpart-decision-trees-in-r/
- Decision Trees in R: https://www.datacamp.com/community/tutorials/decision-trees-R
- Introduction to Data Science (Case Study: Moneyball): https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball