Abstract

To see how regression can help us evaluate baseball team performance, this project explores whether a team's success in a given season can be predicted or explained by that season's statistics. Our goal is to build a multiple linear regression model on the training data to predict the number of wins for a team. We will explore, analyze, and model a historical baseball data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and includes the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season.

While correlation does not imply causation, the results suggest that emphasizing certain variables, such as singles or triples, over doubles may be worth pursuing. The data also suggest that focusing on preventing home runs allowed may not be worthwhile if it comes at the cost of allowing a larger number of ordinary hits.

Introduction

Because baseball is so numbers-heavy, there are many different statistics to consider when searching for the best predictors of team success. There are offensive statistics (offense meaning when a team is batting) and defensive statistics (defense meaning when a team is in the field), and these categories can be broken into many more subcategories. However, for the purpose of this project, we will use the available data to build a multiple linear regression model on the training data to predict the number of wins for a team.

To see how regression will help us predict the number of wins for a team, we don't actually need to understand every detail of the game of baseball, which has over 100 rules. Here, we distill the sport to the basic knowledge one needs to effectively attack the data science problem. The goal of a baseball game is to score more runs (points) than the other team. Each team has 9 batters who have an opportunity to hit a ball with a bat in a predetermined order. After the 9th batter has had their turn, the first batter bats again, then the second, and so on. Each time a batter has an opportunity to bat, we call it a plate appearance (PA). At each PA, the other team's pitcher throws the ball and the batter tries to hit it. The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench, or the batter doesn't (success) and can run around the bases and potentially score a run (reach all 4 bases). Each team gets nine tries, referred to as innings, to score runs, and each inning ends after three outs (three failures).

Data Exploration

The dataset we will be using was provided as a CSV file containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and holds the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season. The game statistics used in this study are the following:

VARIABLE NAME | DEFINITION | THEORETICAL EFFECT
INDEX | Identification Variable (do not use) | None
TARGET_WINS | Number of wins | Outcome Variable
TEAM_BATTING_H | Base Hits by batters (1B, 2B, 3B, HR) | Positive Impact on Wins
TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins
TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins
TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins
TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins
TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins
TEAM_BATTING_SO | Strikeouts by batters | Negative Impact on Wins
TEAM_BASERUN_SB | Stolen bases | Positive Impact on Wins
TEAM_BASERUN_CS | Caught stealing | Negative Impact on Wins
TEAM_FIELDING_E | Errors | Negative Impact on Wins
TEAM_FIELDING_DP | Double Plays | Positive Impact on Wins
TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins
TEAM_PITCHING_H | Hits allowed | Negative Impact on Wins
TEAM_PITCHING_HR | Homeruns allowed | Negative Impact on Wins
TEAM_PITCHING_SO | Strikeouts by pitchers | Positive Impact on Wins

The initial steps are to load the data and take a quick glimpse at the columns, their data types, and the number of rows. Based on initial observations, the data contains 2,276 teams with a variety of baseball performance statistics.
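The data can be loaded and inspected as in the appendix:

train_data <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA621/master/Project1/moneyball-training-data.csv")
glimpse(train_data)   # dplyr::glimpse shows each column's type alongside its first values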

At first glance, the column TEAM_BATTING_HBP has numerous NA values that will need to be addressed before building a model. Figure 1 shows summary statistics of the target wins. The noteworthy statistics are that the average number of wins in a season is 81 games, the median number of wins is 82 games, and the standard deviation is 16 games.

Characteristic (N = 2,276) | Mean (SD) | Median | Minimum | Maximum
TARGET_WINS | 81 (16) | 82 | 0 | 146
TEAM_BATTING_H | 1,469 (145) | 1,454 | 891 | 2,554
TEAM_BATTING_2B | 241 (47) | 238 | 69 | 458
TEAM_BATTING_3B | 55 (28) | 47 | 0 | 223
TEAM_BATTING_HR | 100 (61) | 102 | 0 | 264
TEAM_BATTING_BB | 502 (123) | 512 | 0 | 878
TEAM_BATTING_HBP | 59 (13) | 58 | 29 | 95
TEAM_BATTING_SO | 736 (249) | 750 | 0 | 1,399
TEAM_BASERUN_SB | 125 (88) | 101 | 0 | 697
TEAM_BASERUN_CS | 53 (23) | 49 | 0 | 201
TEAM_FIELDING_E | 246 (228) | 159 | 65 | 1,898
TEAM_FIELDING_DP | 146 (26) | 149 | 52 | 228
TEAM_PITCHING_BB | 553 (166) | 536 | 0 | 3,645
TEAM_PITCHING_SO | 818 (553) | 814 | 0 | 19,278

Examining the target wins variable in detail gives a clear baseline for how many games a team can be expected to win. Most teams will win close to the average number of games (81), with some variability around that average as some teams win more or fewer than 81 games.

The other variables also play an important role in understanding the data. In Figure 1, summary statistics are presented for all the variables; this is sufficient for getting the gist of each variable's distribution. For example, the average number of base hits by batters per team is 1,469, with a minimum of 891 and a maximum of 2,554. Remember that the dataset contains baseball statistics on 2,276 teams. Missing values were excluded from the summary; they will be dealt with in the data preparation section of this report.

A quick look at Figure 2 reveals the distribution of the target wins. The distribution is approximately normal, with a majority of the target wins falling in the center of the distribution. The approximate normality is confirmed by the QQ plot below the distribution plot: most of the target wins fall on the line, with some data points diverging at the ends. This indicates the possibility of outliers, where some teams win or lose more games than expected in the normal range. In the boxplot, there are points that fall outside the whiskers, which confirms the suspicion of outliers seen in the QQ plot.

Now, in order to build our models properly, it's worth checking the other columns for NA values.
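A quick way to build such a table (a minimal base-R sketch; the appendix computes the same percentages by hand from the NA counts):

na_pct <- colMeans(is.na(train_data)) * 100   # percent of missing values per column
sort(na_pct[na_pct > 0])                      # keep only columns that have NAs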

Columns_w_NA | Percent_NA
team_batting_so | 4.48
team_baserun_sb | 5.76
team_baserun_cs | 33.92
team_batting_hbp | 91.61
team_pitching_so | 4.48
team_fielding_dp | 12.57

Outliers

The following diagram shows the outliers for all the variables, both dependent and independent.
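The boxplots can be reproduced with the appendix code (the y-axis is truncated at 1,000 so the bulk of each distribution stays visible):

ggplot(stack(train_data[, -1]), aes(x = ind, y = values, fill = ind)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = .4) +
  coord_cartesian(ylim = c(0, 1000)) +   # zoom in; a few extreme outliers extend far beyond 1,000
  theme(axis.text.x = element_text(angle = 45, hjust = 1))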

As we can see from the graph, only 4 of the 16 variables are normally or close to normally distributed; the other 12 variables have a significant skew. The response variable TARGET_WINS appears normally distributed. BATTING_HR, BATTING_SO, and PITCHING_HR are bimodal. 10 of the 16 variables have a minimum value of 0; this is not a major concern, as the percentage of zeros in each column is less than 1%. The variables BATTING_BB, BASERUN_CS, BASERUN_SB, PITCHING_BB, and FIELDING_E have a significant number of outliers.

Correlations among predictors and Variable Selection

It is possible that not all variables will be needed to create an accurate model. In Figure 4, a correlation value is computed for each variable against target wins. Some variables are highly correlated with target wins, while others are not. For example, base hits by batters has a correlation of 0.38877, which is relatively high, while caught stealing is barely correlated with target wins at 0.0224. There is also a column of p-values indicating whether the correlations are significant. Using a 95% confidence level as a decision rule, any variable with a p-value less than 0.05 is significant. It appears that strikeouts by batters (TEAM_BATTING_SO), caught stealing (TEAM_BASERUN_CS), batters hit by pitch (TEAM_BATTING_HBP), and double plays (TEAM_FIELDING_DP) do not meet our decision rule and could be excluded from use.
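The correlations in the table below were computed with the corrr package, as in the appendix:

COR <- train_data %>%
  correlate() %>%       # pairwise Pearson correlations
  focus(TARGET_WINS)    # keep only each variable's correlation with the response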

term TARGET_WINS
INDEX -0.02105643
TEAM_BATTING_H 0.38876752
TEAM_BATTING_2B 0.28910365
TEAM_BATTING_3B 0.14260841
TEAM_BATTING_HR 0.17615320
TEAM_BATTING_BB 0.23255986
TEAM_BATTING_SO -0.03175071
TEAM_BASERUN_SB 0.13513892
TEAM_BASERUN_CS 0.02240407
TEAM_BATTING_HBP 0.07350424
TEAM_PITCHING_H -0.10993705
TEAM_PITCHING_HR 0.18901373
TEAM_PITCHING_BB 0.12417454
TEAM_PITCHING_SO -0.07843609
TEAM_FIELDING_E -0.17648476
TEAM_FIELDING_DP -0.03485058

Before entirely excluding variables, it is a good idea to transform the data by fixing missing values or combining variables, and then reexamine the viability of those variables for predicting wins.
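One example of combining variables, taken from the appendix, derives singles from total hits and a slugging-style composite from the individual hit types:

singles <- train_data$TEAM_BATTING_H -
  (train_data$TEAM_BATTING_2B + train_data$TEAM_BATTING_3B + train_data$TEAM_BATTING_HR)
train_data$TEAM_BATTING_SLG <- (4*train_data$TEAM_BATTING_HR + 3*train_data$TEAM_BATTING_3B +
  2*train_data$TEAM_BATTING_2B + singles) / train_data$TEAM_BATTING_H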

From the table we can see that there are positive and negative correlations among the predictors. Looking at the numerical correlations with the response variable, the predictors BATTING_H, BATTING_HR, BATTING_BB, PITCHING_H, and PITCHING_HR are more strongly correlated and should be included in our regression.

Examining significant correlations among the independent variables, we also see that four of the pairs have a correlation close to 1. This can lead to multicollinearity issues in our analysis.
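Those pairs can be flagged programmatically with the flattenCorrMatrix() helper defined in the appendix (the 0.9 threshold here is chosen for illustration):

cor_matrix <- rcorr(as.matrix(train_data[-c(1)]))    # Hmisc::rcorr on all predictors, dropping INDEX
flat <- flattenCorrMatrix(cor_matrix$r, cor_matrix$P)
subset(flat, abs(cor) > 0.9)                         # predictor pairs with correlation close to 1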

Data Preparation

Missing values need to be handled before building models. They can be handled by dropping the records, dropping the entire variable, or imputation. In this case, it was determined that the batters-hit-by-pitch variable should be dropped altogether prior to model building, because it has too many missing values to impute properly. All other variables with missing values will be considered for the model, because a majority of their records are present; these variables will be imputed.

First we will remove TEAM_BATTING_HBP (hit by pitch), which has roughly 92% missing values.

We will look at the patterns and intersections of missingness among the variables using the naniar package. Five variables have missing values, and only 22 observations are missing all five at once; we simply delete those cases. Overall, the pattern suggests that the values are Missing at Random (MAR).

When it comes to fixing missing values, there are several methods at our disposal. The first technique is to fill the missing values with the mean of each variable; we use the Hmisc R package to do this, and mean imputation is often a reasonable baseline. The same procedure is used for the other variables with missing values in Model 2, but using the median instead of the mean.
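A minimal sketch of both imputations with Hmisc::impute, shown for one variable (the appendix achieves the same effect by replacing NAs directly with base R):

train_model1$TEAM_BASERUN_SB <- with(train_model1, impute(TEAM_BASERUN_SB, mean))     # Model 1: mean
train_model2$TEAM_BASERUN_SB <- with(train_model2, impute(TEAM_BASERUN_SB, median))   # Model 2: median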

The second technique for imputing missing values is to use a model, such as a decision tree or k-nearest neighbors (the approach used for Model 3). This is slightly more involved, but will likely give better results. In mean imputation, a single fixed value replaces all missing values of a variable, whereas in model-based imputation the filled-in value depends on the other variables in that record; with a decision tree, for example, a tree is fit for each variable with missing values.
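As an illustrative sketch only (the report's models actually use mean, median, and kNN imputation; the variable chosen here is hypothetical), decision tree imputation of one variable with rpart might look like:

library(rpart)
miss <- is.na(train_data$TEAM_BASERUN_SB)
# fit a regression tree for the variable on the complete rows, excluding INDEX
tree_fit <- rpart(TEAM_BASERUN_SB ~ ., data = train_data[!miss, -1], method = "anova")
# predict the missing values from each record's other statistics
train_data$TEAM_BASERUN_SB[miss] <- predict(tree_fit, newdata = train_data[miss, -1])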

Build Models

MODEL 1: MEAN FULL MODEL

This is a full model containing all the variables, with the mean used for missing values. It is a good starting model to determine how well each variable helps predict wins; the mean is generally an adequate guess for missing values. In this model, no selection technique is used: all variables are included manually.

To do this, we used the Hmisc R package, which imputes missing values using a user-defined statistic (the mean in our case).

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_model1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.113  -7.633  -0.018   7.324  45.214 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      59.237509   6.113895   9.689  < 2e-16 ***
## TEAM_BATTING_H   -0.002070   0.005441  -0.380 0.703633    
## TEAM_BATTING_2B  -0.023765   0.008808  -2.698 0.007034 ** 
## TEAM_BATTING_3B   0.168568   0.018723   9.003  < 2e-16 ***
## TEAM_BATTING_HR   0.321816   0.055324   5.817 6.98e-09 ***
## TEAM_BATTING_BB   0.093055   0.018164   5.123 3.30e-07 ***
## TEAM_BATTING_SO  -0.049966   0.008917  -5.603 2.40e-08 ***
## TEAM_BASERUN_SB   0.081187   0.006161  13.178  < 2e-16 ***
## TEAM_BASERUN_CS  -0.059350   0.014616  -4.061 5.09e-05 ***
## TEAM_PITCHING_H   0.025458   0.002263  11.251  < 2e-16 ***
## TEAM_PITCHING_HR -0.216833   0.052137  -4.159 3.33e-05 ***
## TEAM_PITCHING_BB -0.060526   0.016914  -3.578 0.000354 ***
## TEAM_PITCHING_SO  0.028051   0.007993   3.509 0.000459 ***
## TEAM_FIELDING_E  -0.084678   0.005292 -16.001  < 2e-16 ***
## TEAM_FIELDING_DP -0.120255   0.012485  -9.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 1975 degrees of freedom
##   (286 observations deleted due to missingness)
## Multiple R-squared:  0.3858, Adjusted R-squared:  0.3814 
## F-statistic: 88.61 on 14 and 1975 DF,  p-value: < 2.2e-16

The overall p-value for Model 1 is less than 0.0001, which indicates a significant model for predicting wins. The Adjusted R-Squared and Mean Square Error (MSE) will be the metrics used to determine the best model: a higher Adjusted R-Squared is better, and a lower MSE is better. In this case, the Adjusted R-Squared is 0.3814 and the MSE is 168.91, which will be the current benchmark.
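Both metrics can be read off the fitted lm object; as in the appendix, the MSE is the mean of the squared residuals:

summary(model1)$adj.r.squared   # Adjusted R-Squared
mean(model1$residuals^2)        # Mean Square Error (MSE)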

The resulting equation for Model 1 is:

TARGET_WINS = 59.238 - 0.002*TEAM_BATTING_H - 0.024*TEAM_BATTING_2B + 0.169*TEAM_BATTING_3B + 0.322*TEAM_BATTING_HR + 0.093*TEAM_BATTING_BB - 0.050*TEAM_BATTING_SO + 0.081*TEAM_BASERUN_SB - 0.059*TEAM_BASERUN_CS + 0.025*TEAM_PITCHING_H - 0.217*TEAM_PITCHING_HR - 0.061*TEAM_PITCHING_BB + 0.028*TEAM_PITCHING_SO - 0.085*TEAM_FIELDING_E - 0.120*TEAM_FIELDING_DP

Most of Model 1 makes sense: positive measures of success like triples, homeruns, walks by batters, and stolen bases have positive coefficients, while negative measures of success like strikeouts by batters, caught stealing, homeruns allowed, walks allowed, and errors have negative coefficients. All of these values make intuitive sense. On the other hand, base hits, doubles, and double plays have negative coefficients when they should have a positive impact on wins, and hits allowed has a positive coefficient when it should have a negative impact on wins. These counterintuitive coefficients may need further investigation if this model were chosen for deployment; for now, the model is kept as a benchmark despite certain measures not making sense.

MODEL 2: MEDIAN WITH STEPWISE

Earlier exploration of the data revealed the possibility of outliers. Because the mean is highly influenced by outliers, this model attempts to remedy that by using the median to impute missing values. Model 2 is a significant model based on its p-value of less than 0.0001; the Adjusted R-Squared is 0.3772 and the MSE is 169.799.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_model2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.141  -7.562  -0.055   7.303  45.741 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      58.081468   6.120416   9.490  < 2e-16 ***
## TEAM_BATTING_H   -0.002408   0.005486  -0.439 0.660776    
## TEAM_BATTING_2B  -0.023924   0.008849  -2.704 0.006915 ** 
## TEAM_BATTING_3B   0.169910   0.018789   9.043  < 2e-16 ***
## TEAM_BATTING_HR   0.316324   0.055363   5.714 1.27e-08 ***
## TEAM_BATTING_BB   0.089135   0.018298   4.871 1.20e-06 ***
## TEAM_BATTING_SO  -0.045012   0.008991  -5.006 6.04e-07 ***
## TEAM_BASERUN_SB   0.078868   0.006138  12.850  < 2e-16 ***
## TEAM_BASERUN_CS  -0.060070   0.014307  -4.198 2.81e-05 ***
## TEAM_PITCHING_H   0.026060   0.002291  11.374  < 2e-16 ***
## TEAM_PITCHING_HR -0.213394   0.052207  -4.087 4.54e-05 ***
## TEAM_PITCHING_BB -0.056704   0.017043  -3.327 0.000893 ***
## TEAM_PITCHING_SO  0.024406   0.008105   3.011 0.002635 ** 
## TEAM_FIELDING_E  -0.082493   0.005281 -15.621  < 2e-16 ***
## TEAM_FIELDING_DP -0.121035   0.012572  -9.627  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11 on 1975 degrees of freedom
##   (286 observations deleted due to missingness)
## Multiple R-squared:  0.3816, Adjusted R-squared:  0.3772 
## F-statistic: 87.05 on 14 and 1975 DF,  p-value: < 2.2e-16

The resulting equation for Model 2 is:

TARGET_WINS = 58.081 - 0.002*TEAM_BATTING_H - 0.024*TEAM_BATTING_2B + 0.170*TEAM_BATTING_3B + 0.316*TEAM_BATTING_HR + 0.089*TEAM_BATTING_BB - 0.045*TEAM_BATTING_SO + 0.079*TEAM_BASERUN_SB - 0.060*TEAM_BASERUN_CS + 0.026*TEAM_PITCHING_H - 0.213*TEAM_PITCHING_HR - 0.057*TEAM_PITCHING_BB + 0.024*TEAM_PITCHING_SO - 0.082*TEAM_FIELDING_E - 0.121*TEAM_FIELDING_DP

MODEL 3: KNN IMPUTATION

For Model 3, missing values are imputed with k-nearest neighbors (kNN), and a stepwise regression model is built on the imputed data.
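Condensed from the appendix, the imputation and stepwise selection steps are:

knn_data <- knnImputation(select(train_data, -INDEX))   # DMwR::knnImputation, default k = 10
step.model <- train(TARGET_WINS ~ ., data = knn_data,
                    method = "leapSeq",                  # stepwise selection via the leaps package
                    tuneGrid = data.frame(nvmax = 2:13), # try models with 2 to 13 predictors
                    trControl = trainControl(method = "cv", number = 10))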

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = knn_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.980  -8.547   0.148   8.398  63.362 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.203403   4.656401   7.560 5.80e-14 ***
## TEAM_BATTING_H    0.044863   0.002599  17.263  < 2e-16 ***
## TEAM_BATTING_SO  -0.010323   0.002091  -4.937 8.52e-07 ***
## TEAM_BASERUN_SB   0.037616   0.003784   9.942  < 2e-16 ***
## TEAM_PITCHING_HR  0.059940   0.007971   7.520 7.85e-14 ***
## TEAM_FIELDING_E  -0.027353   0.001582 -17.289  < 2e-16 ***
## TEAM_FIELDING_DP -0.117420   0.013006  -9.028  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.13 on 2269 degrees of freedom
## Multiple R-squared:  0.3069, Adjusted R-squared:  0.3051 
## F-statistic: 167.5 on 6 and 2269 DF,  p-value: < 2.2e-16

Now let's run the model…

Select A Model

As mentioned earlier, the main decision criterion is the Adjusted R-Squared; a higher Adjusted R-Squared indicates better performance. MSE, which measures the difference between actual and predicted values, is used as a secondary criterion; a lower MSE is better. Aside from these criteria, the goal is also to have a highly interpretable model that makes sense.

Measures | Model 1 | Model 2 | Model 3
r_squared | 0.3858 | 0.3816 | 0.3069
adj_rsquared | 0.3814 | 0.3772 | 0.3051
rse | 10.96 | 11.00 | 13.13
f_stat | 88.61 | 87.05 | 167.5

Looking at the three models, we select Model 1: per the criteria above, it has the highest Adjusted R-Squared and the lowest residual standard error of the three.

Prediction

The test data must go through the same cleaning process as the training data.

After cleaning the evaluation dataset, the table below shows the results, including the predicted values and their prediction intervals.

eval_model1 <- drop_na(eval_model1)   # drop evaluation rows with remaining missing values
predict1 <- predict(model1, newdata = eval_model1, interval = "prediction")   # point predictions plus prediction intervals
kable(predict1)
fit lwr upr
60.07336 38.49515 81.65158
65.67460 44.11320 87.23601
71.23288 49.68348 92.78227
84.86087 63.31623 106.40550
103.93799 81.28573 126.59026
71.85566 50.14909 93.56223
72.34177 50.75179 93.93174
75.22567 53.67294 96.77841
70.76950 49.23281 92.30620
64.63909 43.08353 86.19464
83.07114 61.51171 104.63057
85.21038 63.63224 106.78852
82.79881 61.22043 104.37720
85.81195 64.21798 107.40592
74.96797 53.40346 96.53247
71.29929 49.76256 92.83602
76.99432 55.46692 98.52171
68.15789 46.58147 89.73432
86.08610 64.50586 107.66633
85.29403 63.73352 106.85453
82.19979 60.64507 103.75451
83.06310 61.51523 104.61096
70.06209 48.52136 91.60282
78.84225 57.29569 100.38881
86.07804 64.51782 107.63826
68.95160 47.39845 90.50475
82.72954 61.13728 104.32179
67.20395 45.60529 88.80262
91.75254 70.18634 113.31875
86.83842 65.30544 108.37141
84.97968 63.42602 106.53334
85.02571 63.44522 106.60620
79.97374 58.43932 101.50816
86.02490 64.46247 107.58733
77.33769 55.81314 98.86224
90.51117 68.94900 112.07334
79.17794 57.40528 100.95060
88.95546 67.38655 110.52437
81.77011 60.20082 103.33939
92.94351 71.37142 114.51559
72.58169 50.93529 94.22809
68.45749 46.91301 90.00196
75.56876 53.92197 97.21555
73.74384 52.19562 95.29207
81.66188 60.09615 103.22760
74.53919 52.99974 96.07864
74.49560 52.94412 96.04708
69.84047 48.31247 91.36848
77.20263 55.66987 98.73539
98.68540 76.94744 120.42337
80.59269 59.00785 102.17753
66.38084 44.79346 87.96822
80.95703 59.39260 102.52145
93.48978 71.92510 115.05447
80.78763 59.17746 102.39781
87.42718 65.89639 108.95796
84.14919 62.57919 105.71919
83.42632 61.85401 104.99862
68.24651 46.66857 89.82444
74.11485 52.52794 95.70176
83.38865 61.76978 105.00753
98.26007 76.60928 119.91086
89.52138 67.90227 111.14049
70.93067 49.34972 92.51163
80.64191 59.08035 102.20346
91.96033 70.36565 113.55501
70.33282 48.74187 91.92377
77.83163 56.24180 99.42146
87.22196 65.68627 108.75766
82.48522 60.95591 104.01453
73.67207 52.11205 95.23209
76.75416 55.19624 98.31208
82.49050 60.89795 104.08305
84.99016 63.43133 106.54898
93.83859 72.25397 115.42321
71.80554 50.24320 93.36789
86.85487 65.30868 108.40105
79.28589 57.70949 100.86228
85.29262 63.74212 106.84311
82.46982 60.92057 104.01906
94.60183 73.02103 116.18262
91.34545 69.79067 112.90024
70.93903 49.32290 92.55516
83.10417 61.50827 104.70006
87.32008 65.70347 108.93668
88.43871 66.85095 110.02647
93.37461 71.63719 115.11203
99.36781 77.69373 121.04190
87.76843 66.19402 109.34283
91.17261 69.57685 112.76837
79.23140 57.67886 100.78395
73.24270 51.69310 94.79230
82.12689 60.59686 103.65691
86.71778 65.15951 108.27605
74.89025 53.33248 96.44801
66.92861 45.21966 88.63755
73.56909 52.00107 95.13711
89.88738 68.34726 111.42750
89.90138 68.34824 111.45452
90.11610 68.51220 111.72001
96.60351 75.03731 118.16970
91.47764 69.93489 113.02039
78.45313 56.91791 99.98835
76.26105 54.71209 97.81001
85.73440 64.17606 107.29275
82.29847 60.76377 103.83317
68.98400 47.42907 90.53894
68.50717 46.91714 90.09721
68.64108 47.08802 90.19414
70.55180 48.91215 92.19145
72.52800 50.97017 94.08583
83.69501 62.12356 105.26645
87.31126 65.72826 108.89426
77.52117 55.98289 99.05945
92.52258 70.96967 114.07550
88.00381 66.44678 109.56083
81.78382 60.23555 103.33209
81.03261 59.47052 102.59470
75.48230 53.92827 97.03633
83.17266 61.59474 104.75059
89.88120 68.33256 111.42983
61.40875 39.51990 83.29760
75.44101 53.89797 96.98406
73.59574 52.06000 95.13148
90.24117 68.61412 111.86822
78.64452 57.09326 100.19579
59.15853 37.58302 80.73405
75.04333 53.47800 96.60867
93.43360 71.86647 115.00072
69.90950 48.36080 91.45820
75.33609 53.77841 96.89378
72.45306 50.91713 93.98899
77.33786 55.80938 98.86633
80.48555 58.95066 102.02043
77.62882 56.07797 99.17968
83.52247 61.98974 105.05520
81.74656 60.20149 103.29163
78.47525 56.93024 100.02026
62.29530 40.71983 83.87077
77.26846 55.72412 98.81279
67.90964 46.34271 89.47657
93.36455 71.79030 114.93881
92.34106 70.63874 114.04339
72.06385 50.49380 93.63390
106.59538 84.94074 128.25001
112.15416 90.51505 133.79326
98.10654 76.54120 119.67189
107.50113 85.88524 129.11702
102.26469 80.66316 123.86623
96.85838 75.27515 118.44161
84.68299 63.12854 106.23745
84.18226 62.62349 105.74102
71.83081 50.28326 93.37836
79.35595 57.82623 100.88567
96.43348 74.79921 118.06776
96.26514 74.69948 117.83080
82.70737 61.15702 104.25772
93.53970 71.97243 115.10696
81.03998 59.49492 102.58503
77.87108 56.31519 99.42696
82.52147 60.95463 104.08831
69.11141 47.56080 90.66202
75.05836 53.51372 96.60301
81.15453 59.62197 102.68710
96.52924 74.81298 118.24550
90.01229 68.45875 111.56583
86.62406 65.09434 108.15378
87.07450 65.53194 108.61707
61.81109 40.19830 83.42388
81.44877 59.88653 103.01101
68.47654 46.90147 90.05161
71.40841 49.85422 92.96259
70.37271 48.78647 91.95896
62.53699 40.96878 84.10521
73.78619 52.25298 95.31940
92.46219 70.87723 114.04715
84.28704 62.74338 105.83070
84.99764 63.45073 106.54455
73.05263 51.47967 94.62559
78.48441 56.93818 100.03063
74.65449 53.07283 96.23614
91.52677 69.75094 113.30260
79.81631 58.27981 101.35281
87.78063 66.22539 109.33586
77.60570 56.04029 99.17110
76.33916 54.79145 97.88686
89.77318 68.17844 111.36792
63.69324 42.14486 85.24162
70.47613 48.90122 92.05103
82.18603 60.63547 103.73659
78.34127 56.75660 99.92595
93.86369 72.24692 115.48046
74.49629 52.96747 96.02511
81.45126 59.91940 102.98313
73.47833 51.93023 95.02644
70.77610 49.20815 92.34405
78.55512 57.01606 100.09419
69.23494 47.66356 90.80631
79.00303 57.47314 100.53292
77.51497 55.98000 99.04995
79.07279 57.52485 100.62074
84.13184 62.60172 105.66196
85.48324 63.88634 107.08014
92.64399 71.02526 114.26272
82.66563 61.11035 104.22092
88.02235 66.45445 109.59025
79.77963 58.25585 101.30342
75.85559 54.32689 97.38430
75.98735 54.40805 97.56665
79.50247 57.96465 101.04030
76.53748 54.96602 98.10895
88.38708 66.85186 109.92231
88.17293 66.63329 109.71256
81.87435 60.32467 103.42402
79.60554 58.06100 101.15007
60.48526 38.90558 82.06495
82.91219 61.33159 104.49280
76.71998 55.18154 98.25842
83.39090 61.85031 104.93148
71.70958 50.16819 93.25096
84.15537 62.58045 105.73028
80.57078 59.01359 102.12798
50.34285 28.40255 72.28315
68.12559 46.57338 89.67781
78.41185 56.83952 99.98418
83.98771 62.43525 105.54018
87.86446 66.29147 109.43745
72.38459 50.79144 93.97775

Conclusion

Three models were generated from baseball team data from 1871 to 2006 to predict the number of wins. Prior to generating the models, the baseball data was analyzed to better understand the relationships between variables. The chosen model was Model 1, which was built on mean-imputed data and includes all available predictors.

Appendix

knitr::opts_chunk$set(cache =TRUE)

# 0. Libraries

library(tidyverse)
library(corrplot)
library(Hmisc)
library(PerformanceAnalytics)
library(mice)
library(gt)
library(DMwR)
library(caret)
library(bnstruct)
library(VIM)
library(corrr)
library(gtsummary)
library(knitr)       # kable()
library(kableExtra)
library(naniar)      # gg_miss_upset(), gg_miss_case()
# 1. Data Exploration

## Import data 

train_data <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA621/master/Project1/moneyball-training-data.csv")
glimpse(train_data)


## Summary table 

wins <- train_data %>% select(TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_HBP, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP, TEAM_PITCHING_BB, TEAM_PITCHING_SO)
table1 <- tbl_summary(wins,
          statistic = list(all_continuous() ~ "{mean} ({sd}) {median} {min} {max}"), missing = "no") 
table1


## Create more variables

singles <- train_data$TEAM_BATTING_H - (train_data$TEAM_BATTING_2B + train_data$TEAM_BATTING_3B + train_data$TEAM_BATTING_HR)

train_data$TEAM_BATTING_SLG <- ((train_data$TEAM_BATTING_HR *4)+ (train_data$TEAM_BATTING_3B*3) + (train_data$TEAM_BATTING_2B*2)+ singles)/train_data$TEAM_BATTING_H
## Plots

### Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))

####  Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(train_data$TARGET_WINS ,main="Figure 2 : Distribution and Probability Plot for TARGET_WINS",cex.main=1, horizontal=TRUE , ylim=c(0,150), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(train_data$TARGET_WINS , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="TARGET WINS", xlim=c(0,150))
### qq plots

qqnorm(train_data$TARGET_WINS, pch = 1, frame = FALSE)
qqline(train_data$TARGET_WINS, col = "steelblue", lwd = 2)


max_obs <- 2276
batting_so_na <- ((102/max_obs) * 100)
baserun_sb_na <- (131/max_obs) * 100
baserun_cs_na <- (772/max_obs) * 100
batting_hbp_na <- (2085/max_obs) * 100
pitching_so_na <- (102/max_obs) * 100
fielding_dp_na <- (286/max_obs) * 100

df_percent_na <- data.frame(Columns_w_NA = c("team_batting_so", "team_baserun_sb", "team_baserun_cs", "team_batting_hbp", "team_pitching_so", "team_fielding_dp"), Percent_NA = c(batting_so_na, baserun_sb_na, baserun_cs_na, batting_hbp_na, pitching_so_na, fielding_dp_na))


### Build a gt table of the NA percentages

gt_tbl <- gt(data = df_percent_na)


### Show the gt Table

gt_tbl
### Outliers

ggplot(stack(train_data[,-1]), aes(x = ind, y = values, fill=ind)) + 
  geom_boxplot(outlier.colour = "red",  outlier.alpha=.4) +
  coord_cartesian(ylim = c(0, 1000)) +
  theme_classic()+
  theme(axis.text.x=element_text(angle=45, hjust=1)) 


### Correlation

COR <- train_data %>% 
  correlate() %>% 
  focus(TARGET_WINS)
gt(COR)

COR %>% 
  mutate(rowname = factor(term, levels = term[order(TARGET_WINS)])) %>%  # Order by correlation strength
  ggplot(aes(x = rowname, y = TARGET_WINS)) +
    geom_bar(stat = "identity") +
    ylab("Correlation with TARGET WINS") +
    xlab("Variables") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+ ggtitle("Figure 4: Correlation Against Target Win")
#### pairwise.complete.obs ignores NA values and computes correlation on complete observations
#### we might have to run these corrplots again after we handle the NA values

chart.Correlation(train_data[-c(1)], histogram=TRUE, method="pearson", use="pairwise.complete.obs")

data.corr <- cor(train_data[-c(1)], use="pairwise.complete.obs")

corrplot(data.corr, type = "lower", method="square")

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}

#### Eliminate INDEX from data frame
data_no_index <- train_data[-c(1)]

cor_matrix <- rcorr(as.matrix(data_no_index))

flattenCorrMatrix(cor_matrix$r, cor_matrix$P)
# 2. Data Preparation

## Remove TEAM_BATTING_HBP (column 11), which is ~92% missing
train_data <- train_data[-11]


par(mfrow=c(1,2))
gg_miss_upset(train_data, 
              nsets = 5,
              nintersects = NA)
gg_miss_case(train_data)+
  theme_classic()


# 3. Build models

### MODEL 1: MEAN FULL MODEL

### Filling missing values with the column mean

train_model1 <- train_data
train_model1$TEAM_BATTING_SO[is.na(train_model1$TEAM_BATTING_SO)] = mean(train_model1$TEAM_BATTING_SO, na.rm=TRUE)
train_model1$TEAM_BASERUN_SB[is.na(train_model1$TEAM_BASERUN_SB)] = mean(train_model1$TEAM_BASERUN_SB, na.rm=TRUE)
train_model1$TEAM_BASERUN_CS[is.na(train_model1$TEAM_BASERUN_CS)] = mean(train_model1$TEAM_BASERUN_CS, na.rm=TRUE)
train_model1$TEAM_PITCHING_SO[is.na(train_model1$TEAM_PITCHING_SO)] = mean(train_model1$TEAM_PITCHING_SO, na.rm=TRUE)

model1 <- lm(TARGET_WINS ~ 
               TEAM_BATTING_H +   # Base Hits by batters (1B,2B,3B,HR)
               TEAM_BATTING_2B +  # Doubles by batters (2B) 
               TEAM_BATTING_3B +  # Triples by batters (3B)
               TEAM_BATTING_HR +  # Homeruns by batters (4B) 
               TEAM_BATTING_BB +  # Walks by batters
               TEAM_BATTING_SO +  # Strikeouts by batters 
               TEAM_BASERUN_SB +  # Stolen bases
               TEAM_BASERUN_CS +  # Caught stealing 
               TEAM_PITCHING_H +  # Hits allowed
               TEAM_PITCHING_HR + # Homeruns allowed
               TEAM_PITCHING_BB + # Walks allowed
               TEAM_PITCHING_SO + # Strikeouts by pitchers
               TEAM_FIELDING_E +  # Errors
               TEAM_FIELDING_DP,  # Double Plays
             data=train_model1)
summary(model1)

### Mean Square Error (MSE)

mean(model1$residuals^2)

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model1)
### MODEL 2: MEDIAN WITH STEPWISE

train_model2 <- train_data

#### Filling missing values with the column median

train_model2$TEAM_BATTING_SO[is.na(train_model2$TEAM_BATTING_SO)] = median(train_model2$TEAM_BATTING_SO, na.rm=TRUE)
train_model2$TEAM_BASERUN_SB[is.na(train_model2$TEAM_BASERUN_SB)] = median(train_model2$TEAM_BASERUN_SB, na.rm=TRUE)
train_model2$TEAM_BASERUN_CS[is.na(train_model2$TEAM_BASERUN_CS)] = median(train_model2$TEAM_BASERUN_CS, na.rm=TRUE)
train_model2$TEAM_PITCHING_SO[is.na(train_model2$TEAM_PITCHING_SO)] = median(train_model2$TEAM_PITCHING_SO, na.rm=TRUE)


model2 <- lm(TARGET_WINS ~ 
               TEAM_BATTING_H +   # Base Hits by batters (1B,2B,3B,HR)
               TEAM_BATTING_2B +  # Doubles by batters (2B) 
               TEAM_BATTING_3B +  # Triples by batters (3B)
               TEAM_BATTING_HR +  # Homeruns by batters (4B) 
               TEAM_BATTING_BB +  # Walks by batters
               TEAM_BATTING_SO +  # Strikeouts by batters 
               TEAM_BASERUN_SB +  # Stolen bases
               TEAM_BASERUN_CS +  # Caught stealing 
               TEAM_PITCHING_H +  # Hits allowed
               TEAM_PITCHING_HR + # Homeruns allowed
               TEAM_PITCHING_BB + # Walks allowed
               TEAM_PITCHING_SO + # Strikeouts by pitchers
               TEAM_FIELDING_E +  # Errors
               TEAM_FIELDING_DP,  # Double Plays
             data=train_model2)
summary(model2)

mean(model2$residuals^2)

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model2)
### MODEL 3: KNN IMPUTATION

train_data_impute <- select(train_data, -c(INDEX))


#### using default values

knn_data <- knnImputation(train_data_impute)

summary(knn_data)

train.control <- trainControl(method ='cv', number=10)

step.model <- train(TARGET_WINS ~., data = knn_data,
                    method = "leapSeq",
                    tuneGrid = data.frame(nvmax = 2:13),
                    trControl = train.control
                    )

step.model$results

step.model$bestTune

summary(step.model$finalModel)

coef(step.model$finalModel, 6)


knn_lm <- lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = knn_data)

summary(knn_lm)


hist(knn_lm$residuals, xlab = "Residuals", ylab = "", breaks=100)
qqnorm(knn_lm$residuals)
qqline(knn_lm$residuals)

plot(fitted(knn_lm), residuals(knn_lm))
abline(h=0, lty = 2)

References

- Dealing with Missing Data using R: https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
- Decision Trees in R (rpart): http://www.learnbymarketing.com/tutorials/rpart-decision-trees-in-r/
- Decision Trees in R: https://www.datacamp.com/community/tutorials/decision-trees-R
- Introduction to Data Science (Case Study: Moneyball): https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball