Abstract

To see how regression can help us evaluate baseball team performance, this project explores whether a team's success in a given season can be predicted or explained by that season's statistics. Our goal is to build a multiple linear regression model on the training data to predict the number of wins for a team. We will explore, analyze, and model a historical baseball data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and includes the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season.

While correlation does not imply causation, the results suggest that emphasizing certain variables, such as singles or triples, over doubles may be worth pursuing. The data also suggest that focusing on preventing home runs allowed may not be worthwhile if it comes at the cost of allowing a larger number of ordinary hits.

Introduction

Because baseball is so numbers-heavy, there are many different statistics to consider when searching for the best predictors of team success. There are offensive statistics (offense meaning when a team is batting) and defensive statistics (defense meaning when a team is in the field), and these categories can be broken into many more subcategories. However, for the purpose of this project, we will use the available data to build a multiple linear regression model on the training data to predict the number of wins for a team.

To see how regression will help us predict the number of wins for a team, we don't actually need to understand every detail of the game of baseball, which has over 100 rules. Here, we distill the sport to the basic knowledge one needs to effectively attack the data science problem. The goal of a baseball game is to score more runs (points) than the other team. Each team has 9 batters who have an opportunity to hit a ball with a bat in a predetermined order. After the 9th batter has had their turn, the first batter bats again, then the second, and so on. Each time a batter has an opportunity to bat, we call it a plate appearance (PA). At each PA, the other team's pitcher throws the ball and the batter tries to hit it. The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench, or the batter doesn't (success) and can run around the bases and potentially score a run (reach all 4 bases). Each team gets nine tries, referred to as innings, to score runs, and each inning ends after three outs (three failures).

Data Exploration

The dataset we will be using was provided as a CSV file containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and holds the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season. The game statistics used in this study are the following:

VARIABLE NAME | DEFINITION | THEORETICAL EFFECT
INDEX | Identification Variable (do not use) | None
TARGET_WINS | Number of wins | Outcome Variable
TEAM_BATTING_H | Base Hits by batters (1B, 2B, 3B, HR) | Positive Impact on Wins
TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins
TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins
TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins
TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins
TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins
TEAM_BATTING_SO | Strikeouts by batters | Negative Impact on Wins
TEAM_BASERUN_SB | Stolen bases | Positive Impact on Wins
TEAM_BASERUN_CS | Caught stealing | Negative Impact on Wins
TEAM_FIELDING_E | Errors | Negative Impact on Wins
TEAM_FIELDING_DP | Double Plays | Positive Impact on Wins
TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins
TEAM_PITCHING_H | Hits allowed | Negative Impact on Wins
TEAM_PITCHING_HR | Homeruns allowed | Negative Impact on Wins
TEAM_PITCHING_SO | Strikeouts by pitchers | Positive Impact on Wins

The initial steps are to load the data and take a quick glimpse at the columns, their data types, and the number of rows. Based on initial observations, the data contains 2,276 teams with a variety of baseball performance statistics.
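The data can be loaded and inspected as in the appendix:

train_data <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA621/master/Project1/moneyball-training-data.csv")
glimpse(train_data)   # dplyr::glimpse shows each column's type alongside its first values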

At first glance, the column TEAM_BATTING_HBP has numerous NA values that will need to be addressed before building a model. Figure 1 shows summary statistics of the target wins. The noteworthy statistics are that the average number of wins in a season is 81 games, the median number of wins is 82 games, and the standard deviation is 16 games.

Characteristic (N = 2,276) | Mean (SD) | Median | Minimum | Maximum
TARGET_WINS | 81 (16) | 82 | 0 | 146
TEAM_BATTING_H | 1,469 (145) | 1,454 | 891 | 2,554
TEAM_BATTING_2B | 241 (47) | 238 | 69 | 458
TEAM_BATTING_3B | 55 (28) | 47 | 0 | 223
TEAM_BATTING_HR | 100 (61) | 102 | 0 | 264
TEAM_BATTING_BB | 502 (123) | 512 | 0 | 878
TEAM_BATTING_HBP | 59 (13) | 58 | 29 | 95
TEAM_BATTING_SO | 736 (249) | 750 | 0 | 1,399
TEAM_BASERUN_SB | 125 (88) | 101 | 0 | 697
TEAM_BASERUN_CS | 53 (23) | 49 | 0 | 201
TEAM_FIELDING_E | 246 (228) | 159 | 65 | 1,898
TEAM_FIELDING_DP | 146 (26) | 149 | 52 | 228
TEAM_PITCHING_BB | 553 (166) | 536 | 0 | 3,645
TEAM_PITCHING_SO | 818 (553) | 814 | 0 | 19,278

Examining the target wins variable in detail gives a clear baseline for how many games a team can be expected to win. Most teams will win close to the average number of games (81), with some variability around that average as some teams win more or fewer than 81 games.

The other variables also play an important role in understanding the data. In Figure 1, summary statistics are presented for all the variables; this is sufficient for getting the gist of each variable's distribution. For example, the average number of base hits by batters per team is 1,469, with a minimum of 891 and a maximum of 2,554. Remember that the dataset contains baseball statistics on 2,276 teams. Missing values were excluded from the summary; they will be dealt with in the data preparation section of this report.

A quick look at Figure 2 reveals the distribution of the target wins. The distribution is approximately normal, with a majority of the target wins falling in the center of the distribution. The approximate normality is confirmed by the QQ plot below the distribution plot: most of the target wins fall on the line, with some data points diverging at the ends. This indicates the possibility of outliers, where some teams win or lose more games than expected in the normal range. In the boxplot, there are points that fall outside the whiskers, which confirms the suspicion of outliers seen in the QQ plot.

Now, in order to build our models properly, it's worth checking the other columns for NA values.
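A quick way to build such a table (a minimal base-R sketch; the appendix computes the same percentages by hand from the NA counts):

na_pct <- colMeans(is.na(train_data)) * 100   # percent of missing values per column
sort(na_pct[na_pct > 0])                      # keep only columns that have NAs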

Columns_w_NA | Percent_NA
team_batting_so | 4.48
team_baserun_sb | 5.76
team_baserun_cs | 33.92
team_batting_hbp | 91.61
team_pitching_so | 4.48
team_fielding_dp | 12.57

Outliers

The following diagram shows the outliers for all the variables, both dependent and independent.
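The boxplots can be reproduced with the appendix code (the y-axis is truncated at 1,000 so the bulk of each distribution stays visible):

ggplot(stack(train_data[, -1]), aes(x = ind, y = values, fill = ind)) +
  geom_boxplot(outlier.colour = "red", outlier.alpha = .4) +
  coord_cartesian(ylim = c(0, 1000)) +   # zoom in; a few extreme outliers extend far beyond 1,000
  theme(axis.text.x = element_text(angle = 45, hjust = 1))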

As we can see from the graph, only 4 of the 16 variables are normally or close to normally distributed; the other 12 variables have a significant skew. The response variable TARGET_WINS appears normally distributed. BATTING_HR, BATTING_SO, and PITCHING_HR are bimodal. 10 of the 16 variables have a minimum value of 0; this is not a major concern, as the percentage of zeros in each column is less than 1%. The variables BATTING_BB, BASERUN_CS, BASERUN_SB, PITCHING_BB, and FIELDING_E have a significant number of outliers.

Correlations among predictors and Variable Selection

It is possible that not all variables will be needed to create an accurate model. In Figure 4, a correlation value is computed for each variable against target wins. Some variables are highly correlated with target wins, while others are not. For example, base hits by batters has a correlation of 0.38877, which is relatively high, while caught stealing is barely correlated with target wins at 0.0224. There is also a column of p-values indicating whether the correlations are significant. Using a 95% confidence level as a decision rule, any variable with a p-value less than 0.05 is significant. It appears that strikeouts by batters (TEAM_BATTING_SO), caught stealing (TEAM_BASERUN_CS), batters hit by pitch (TEAM_BATTING_HBP), and double plays (TEAM_FIELDING_DP) do not meet our decision rule and could be excluded from use.
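The correlations in the table below were computed with the corrr package, as in the appendix:

COR <- train_data %>%
  correlate() %>%       # pairwise Pearson correlations
  focus(TARGET_WINS)    # keep only each variable's correlation with the response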

term TARGET_WINS
INDEX -0.02105643
TEAM_BATTING_H 0.38876752
TEAM_BATTING_2B 0.28910365
TEAM_BATTING_3B 0.14260841
TEAM_BATTING_HR 0.17615320
TEAM_BATTING_BB 0.23255986
TEAM_BATTING_SO -0.03175071
TEAM_BASERUN_SB 0.13513892
TEAM_BASERUN_CS 0.02240407
TEAM_BATTING_HBP 0.07350424
TEAM_PITCHING_H -0.10993705
TEAM_PITCHING_HR 0.18901373
TEAM_PITCHING_BB 0.12417454
TEAM_PITCHING_SO -0.07843609
TEAM_FIELDING_E -0.17648476
TEAM_FIELDING_DP -0.03485058

Before entirely excluding variables, it is a good idea to transform the data by fixing missing values or combining variables, and then reexamine the viability of those variables for predicting wins.
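One example of combining variables, taken from the appendix, derives singles from total hits and a slugging-style composite from the individual hit types:

singles <- train_data$TEAM_BATTING_H -
  (train_data$TEAM_BATTING_2B + train_data$TEAM_BATTING_3B + train_data$TEAM_BATTING_HR)
train_data$TEAM_BATTING_SLG <- (4*train_data$TEAM_BATTING_HR + 3*train_data$TEAM_BATTING_3B +
  2*train_data$TEAM_BATTING_2B + singles) / train_data$TEAM_BATTING_H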

From the table we can see that there are positive and negative correlations among the predictors. Looking at the numerical correlations with the response variable, the predictors BATTING_H, BATTING_HR, BATTING_BB, PITCHING_H, and PITCHING_HR are more strongly correlated and should be included in our regression.

Examining significant correlations among the independent variables, we also see that four of the pairs have a correlation close to 1. This can lead to multicollinearity issues in our analysis.
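Those pairs can be flagged programmatically with the flattenCorrMatrix() helper defined in the appendix (the 0.9 threshold here is chosen for illustration):

cor_matrix <- rcorr(as.matrix(train_data[-c(1)]))    # Hmisc::rcorr on all predictors, dropping INDEX
flat <- flattenCorrMatrix(cor_matrix$r, cor_matrix$P)
subset(flat, abs(cor) > 0.9)                         # predictor pairs with correlation close to 1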

Data Preparation

Missing values need to be handled before building models. They can be handled by dropping the records, dropping the entire variable, or imputation. In this case, it was determined that the batters-hit-by-pitch variable should be dropped altogether prior to model building, because it has too many missing values to impute properly. All other variables with missing values will be considered for the model, because a majority of their records are present; these variables will be imputed.

First we will remove TEAM_BATTING_HBP (hit by pitch), which has roughly 92% missing values.

We will look at the patterns and intersections of missingness among the variables using the naniar package. Five variables have missing values, and only 22 observations are missing all five at once; we simply delete those cases. Overall, the pattern suggests that the values are Missing at Random (MAR).

When it comes to fixing missing values, there are several methods at our disposal. The first technique is to fill the missing values with the mean of each variable; we use the Hmisc R package to do this, and mean imputation is often a reasonable baseline. The same procedure is used for the other variables with missing values in Model 2, but using the median instead of the mean.
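A minimal sketch of both imputations with Hmisc::impute, shown for one variable (the appendix achieves the same effect by replacing NAs directly with base R):

train_model1$TEAM_BASERUN_SB <- with(train_model1, impute(TEAM_BASERUN_SB, mean))     # Model 1: mean
train_model2$TEAM_BASERUN_SB <- with(train_model2, impute(TEAM_BASERUN_SB, median))   # Model 2: median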

The second technique for imputing missing values is to use a model, such as a decision tree or k-nearest neighbors (the approach used for Model 3). This is slightly more involved, but will likely give better results. In mean imputation, a single fixed value replaces all missing values of a variable, whereas in model-based imputation the filled-in value depends on the other variables in that record; with a decision tree, for example, a tree is fit for each variable with missing values.
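As an illustrative sketch only (the report's models actually use mean, median, and kNN imputation; the variable chosen here is hypothetical), decision tree imputation of one variable with rpart might look like:

library(rpart)
miss <- is.na(train_data$TEAM_BASERUN_SB)
# fit a regression tree for the variable on the complete rows, excluding INDEX
tree_fit <- rpart(TEAM_BASERUN_SB ~ ., data = train_data[!miss, -1], method = "anova")
# predict the missing values from each record's other statistics
train_data$TEAM_BASERUN_SB[miss] <- predict(tree_fit, newdata = train_data[miss, -1])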

Build Models

MODEL 1: MEAN FULL MODEL

This is a full model containing all the variables, with the mean used for missing values. It is a good starting model to determine how well each variable helps predict wins; the mean is generally an adequate guess for missing values. In this model, no selection technique is used: all variables are included manually.

To do this, we used the Hmisc R package, which imputes missing values using a user-defined statistic (the mean in our case).

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_model1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.113  -7.633  -0.018   7.324  45.214 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      59.237509   6.113895   9.689  < 2e-16 ***
## TEAM_BATTING_H   -0.002070   0.005441  -0.380 0.703633    
## TEAM_BATTING_2B  -0.023765   0.008808  -2.698 0.007034 ** 
## TEAM_BATTING_3B   0.168568   0.018723   9.003  < 2e-16 ***
## TEAM_BATTING_HR   0.321816   0.055324   5.817 6.98e-09 ***
## TEAM_BATTING_BB   0.093055   0.018164   5.123 3.30e-07 ***
## TEAM_BATTING_SO  -0.049966   0.008917  -5.603 2.40e-08 ***
## TEAM_BASERUN_SB   0.081187   0.006161  13.178  < 2e-16 ***
## TEAM_BASERUN_CS  -0.059350   0.014616  -4.061 5.09e-05 ***
## TEAM_PITCHING_H   0.025458   0.002263  11.251  < 2e-16 ***
## TEAM_PITCHING_HR -0.216833   0.052137  -4.159 3.33e-05 ***
## TEAM_PITCHING_BB -0.060526   0.016914  -3.578 0.000354 ***
## TEAM_PITCHING_SO  0.028051   0.007993   3.509 0.000459 ***
## TEAM_FIELDING_E  -0.084678   0.005292 -16.001  < 2e-16 ***
## TEAM_FIELDING_DP -0.120255   0.012485  -9.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 1975 degrees of freedom
##   (286 observations deleted due to missingness)
## Multiple R-squared:  0.3858, Adjusted R-squared:  0.3814 
## F-statistic: 88.61 on 14 and 1975 DF,  p-value: < 2.2e-16

The overall p-value for Model 1 is less than 0.0001, which indicates a significant model for predicting wins. The Adjusted R-Squared and Mean Square Error (MSE) will be the metrics used to determine the best model: a higher Adjusted R-Squared is better, and a lower MSE is better. In this case, the Adjusted R-Squared is 0.3814 and the MSE is 168.91, which will be the current benchmark.
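Both metrics can be read off the fitted lm object; as in the appendix, the MSE is the mean of the squared residuals:

summary(model1)$adj.r.squared   # Adjusted R-Squared
mean(model1$residuals^2)        # Mean Square Error (MSE)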

The resulting equation for Model 1 is:

TARGET_WINS = 59.238 - 0.002*TEAM_BATTING_H - 0.024*TEAM_BATTING_2B + 0.169*TEAM_BATTING_3B + 0.322*TEAM_BATTING_HR + 0.093*TEAM_BATTING_BB - 0.050*TEAM_BATTING_SO + 0.081*TEAM_BASERUN_SB - 0.059*TEAM_BASERUN_CS + 0.025*TEAM_PITCHING_H - 0.217*TEAM_PITCHING_HR - 0.061*TEAM_PITCHING_BB + 0.028*TEAM_PITCHING_SO - 0.085*TEAM_FIELDING_E - 0.120*TEAM_FIELDING_DP

Most of Model 1 makes sense: positive measures of success like triples, homeruns, walks by batters, and stolen bases have positive coefficients, while negative measures of success like strikeouts by batters, caught stealing, homeruns allowed, walks allowed, and errors have negative coefficients. All of these values make intuitive sense. On the other hand, base hits, doubles, and double plays have negative coefficients when they should have a positive impact on wins, and hits allowed has a positive coefficient when it should have a negative impact on wins. These counterintuitive coefficients may need further investigation if this model were chosen for deployment; for now, the model is kept as a benchmark despite certain measures not making sense.

MODEL 2: MEDIAN WITH STEPWISE

Earlier exploration of the data revealed the possibility of outliers. Because the mean is highly influenced by outliers, this model attempts to remedy that by using the median to impute missing values. Model 2 is a significant model based on its p-value of less than 0.0001; the Adjusted R-Squared is 0.3772 and the MSE is 169.799.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_model2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.141  -7.562  -0.055   7.303  45.741 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      58.081468   6.120416   9.490  < 2e-16 ***
## TEAM_BATTING_H   -0.002408   0.005486  -0.439 0.660776    
## TEAM_BATTING_2B  -0.023924   0.008849  -2.704 0.006915 ** 
## TEAM_BATTING_3B   0.169910   0.018789   9.043  < 2e-16 ***
## TEAM_BATTING_HR   0.316324   0.055363   5.714 1.27e-08 ***
## TEAM_BATTING_BB   0.089135   0.018298   4.871 1.20e-06 ***
## TEAM_BATTING_SO  -0.045012   0.008991  -5.006 6.04e-07 ***
## TEAM_BASERUN_SB   0.078868   0.006138  12.850  < 2e-16 ***
## TEAM_BASERUN_CS  -0.060070   0.014307  -4.198 2.81e-05 ***
## TEAM_PITCHING_H   0.026060   0.002291  11.374  < 2e-16 ***
## TEAM_PITCHING_HR -0.213394   0.052207  -4.087 4.54e-05 ***
## TEAM_PITCHING_BB -0.056704   0.017043  -3.327 0.000893 ***
## TEAM_PITCHING_SO  0.024406   0.008105   3.011 0.002635 ** 
## TEAM_FIELDING_E  -0.082493   0.005281 -15.621  < 2e-16 ***
## TEAM_FIELDING_DP -0.121035   0.012572  -9.627  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11 on 1975 degrees of freedom
##   (286 observations deleted due to missingness)
## Multiple R-squared:  0.3816, Adjusted R-squared:  0.3772 
## F-statistic: 87.05 on 14 and 1975 DF,  p-value: < 2.2e-16

The resulting equation for Model 2 is:

TARGET_WINS = 58.081 - 0.002*TEAM_BATTING_H - 0.024*TEAM_BATTING_2B + 0.170*TEAM_BATTING_3B + 0.316*TEAM_BATTING_HR + 0.089*TEAM_BATTING_BB - 0.045*TEAM_BATTING_SO + 0.079*TEAM_BASERUN_SB - 0.060*TEAM_BASERUN_CS + 0.026*TEAM_PITCHING_H - 0.213*TEAM_PITCHING_HR - 0.057*TEAM_PITCHING_BB + 0.024*TEAM_PITCHING_SO - 0.082*TEAM_FIELDING_E - 0.121*TEAM_FIELDING_DP

MODEL 3: KNN IMPUTATION

For Model 3, missing values are imputed with k-nearest neighbors (kNN), and a stepwise regression model is built on the imputed data.
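Condensed from the appendix, the imputation and stepwise selection steps are:

knn_data <- knnImputation(select(train_data, -INDEX))   # DMwR::knnImputation, default k = 10
step.model <- train(TARGET_WINS ~ ., data = knn_data,
                    method = "leapSeq",                  # stepwise selection via the leaps package
                    tuneGrid = data.frame(nvmax = 2:13), # try models with 2 to 13 predictors
                    trControl = trainControl(method = "cv", number = 10))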

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = knn_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.980  -8.547   0.148   8.398  63.362 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.203403   4.656401   7.560 5.80e-14 ***
## TEAM_BATTING_H    0.044863   0.002599  17.263  < 2e-16 ***
## TEAM_BATTING_SO  -0.010323   0.002091  -4.937 8.52e-07 ***
## TEAM_BASERUN_SB   0.037616   0.003784   9.942  < 2e-16 ***
## TEAM_PITCHING_HR  0.059940   0.007971   7.520 7.85e-14 ***
## TEAM_FIELDING_E  -0.027353   0.001582 -17.289  < 2e-16 ***
## TEAM_FIELDING_DP -0.117420   0.013006  -9.028  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.13 on 2269 degrees of freedom
## Multiple R-squared:  0.3069, Adjusted R-squared:  0.3051 
## F-statistic: 167.5 on 6 and 2269 DF,  p-value: < 2.2e-16

Now let's run the model…

Select A Model

As mentioned earlier, the main decision criterion is the Adjusted R-Squared; a higher Adjusted R-Squared indicates better performance. MSE, which measures the difference between actual and predicted values, is used as a secondary criterion; a lower MSE is better. Aside from these criteria, the goal is also to have a highly interpretable model that makes sense.

Measures | Model 1 | Model 2 | Model 3
r_squared | 0.3858 | 0.3816 | 0.3069
adj_rsquared | 0.3814 | 0.3772 | 0.3051
rse | 10.96 | 11.00 | 13.13
f_stat | 88.61 | 87.05 | 167.5

Looking at the three models, we select Model 1: per the criteria above, it has the highest Adjusted R-Squared and the lowest residual standard error of the three.

Prediction

The test data must go through the same cleaning process as the training data.

After cleaning the evaluation dataset, the table below shows the results, including the predicted values and their prediction intervals.

eval_model1 <- drop_na(eval_model1)   # drop evaluation rows with remaining missing values
predict1 <- predict(model1, newdata = eval_model1, interval = "prediction")   # point predictions plus prediction intervals
kable(predict1)
fit lwr upr
60.07336 38.49515 81.65158
65.67460 44.11320 87.23601
71.23288 49.68348 92.78227
84.86087 63.31623 106.40550
103.93799 81.28573 126.59026
71.85566 50.14909 93.56223
72.34177 50.75179 93.93174
75.22567 53.67294 96.77841
70.76950 49.23281 92.30620
64.63909 43.08353 86.19464
83.07114 61.51171 104.63057
85.21038 63.63224 106.78852
82.79881 61.22043 104.37720
85.81195 64.21798 107.40592
74.96797 53.40346 96.53247
71.29929 49.76256 92.83602
76.99432 55.46692 98.52171
68.15789 46.58147 89.73432
86.08610 64.50586 107.66633
85.29403 63.73352 106.85453
82.19979 60.64507 103.75451
83.06310 61.51523 104.61096
70.06209 48.52136 91.60282
78.84225 57.29569 100.38881
86.07804 64.51782 107.63826
68.95160 47.39845 90.50475
82.72954 61.13728 104.32179
67.20395 45.60529 88.80262
91.75254 70.18634 113.31875
86.83842 65.30544 108.37141
84.97968 63.42602 106.53334
85.02571 63.44522 106.60620
79.97374 58.43932 101.50816
86.02490 64.46247 107.58733
77.33769 55.81314 98.86224
90.51117 68.94900 112.07334
79.17794 57.40528 100.95060
88.95546 67.38655 110.52437
81.77011 60.20082 103.33939
92.94351 71.37142 114.51559
72.58169 50.93529 94.22809
68.45749 46.91301 90.00196
75.56876 53.92197 97.21555
73.74384 52.19562 95.29207
81.66188 60.09615 103.22760
74.53919 52.99974 96.07864
74.49560 52.94412 96.04708
69.84047 48.31247 91.36848
77.20263 55.66987 98.73539
98.68540 76.94744 120.42337
80.59269 59.00785 102.17753
66.38084 44.79346 87.96822
80.95703 59.39260 102.52145
93.48978 71.92510 115.05447
80.78763 59.17746 102.39781
87.42718 65.89639 108.95796
84.14919 62.57919 105.71919
83.42632 61.85401 104.99862
68.24651 46.66857 89.82444
74.11485 52.52794 95.70176
83.38865 61.76978 105.00753
98.26007 76.60928 119.91086
89.52138 67.90227 111.14049
70.93067 49.34972 92.51163
80.64191 59.08035 102.20346
91.96033 70.36565 113.55501
70.33282 48.74187 91.92377
77.83163 56.24180 99.42146
87.22196 65.68627 108.75766
82.48522 60.95591 104.01453
73.67207 52.11205 95.23209
76.75416 55.19624 98.31208
82.49050 60.89795 104.08305
84.99016 63.43133 106.54898
93.83859 72.25397 115.42321
71.80554 50.24320 93.36789
86.85487 65.30868 108.40105
79.28589 57.70949 100.86228
85.29262 63.74212 106.84311
82.46982 60.92057 104.01906
94.60183 73.02103 116.18262
91.34545 69.79067 112.90024
70.93903 49.32290 92.55516
83.10417 61.50827 104.70006
87.32008 65.70347 108.93668
88.43871 66.85095 110.02647
93.37461 71.63719 115.11203
99.36781 77.69373 121.04190
87.76843 66.19402 109.34283
91.17261 69.57685 112.76837
79.23140 57.67886 100.78395
73.24270 51.69310 94.79230
82.12689 60.59686 103.65691
86.71778 65.15951 108.27605
74.89025 53.33248 96.44801
66.92861 45.21966 88.63755
73.56909 52.00107 95.13711
89.88738 68.34726 111.42750
89.90138 68.34824 111.45452
90.11610 68.51220 111.72001
96.60351 75.03731 118.16970
91.47764 69.93489 113.02039
78.45313 56.91791 99.98835
76.26105 54.71209 97.81001
85.73440 64.17606 107.29275
82.29847 60.76377 103.83317
68.98400 47.42907 90.53894
68.50717 46.91714 90.09721
68.64108 47.08802 90.19414
70.55180 48.91215 92.19145
72.52800 50.97017 94.08583
83.69501 62.12356 105.26645
87.31126 65.72826 108.89426
77.52117 55.98289 99.05945
92.52258 70.96967 114.07550
88.00381 66.44678 109.56083
81.78382 60.23555 103.33209
81.03261 59.47052 102.59470
75.48230 53.92827 97.03633
83.17266 61.59474 104.75059
89.88120 68.33256 111.42983
61.40875 39.51990 83.29760
75.44101 53.89797 96.98406
73.59574 52.06000 95.13148
90.24117 68.61412 111.86822
78.64452 57.09326 100.19579
59.15853 37.58302 80.73405
75.04333 53.47800 96.60867
93.43360 71.86647 115.00072
69.90950 48.36080 91.45820
75.33609 53.77841 96.89378
72.45306 50.91713 93.98899
77.33786 55.80938 98.86633
80.48555 58.95066 102.02043
77.62882 56.07797 99.17968
83.52247 61.98974 105.05520
81.74656 60.20149 103.29163
78.47525 56.93024 100.02026
62.29530 40.71983 83.87077
77.26846 55.72412 98.81279
67.90964 46.34271 89.47657
93.36455 71.79030 114.93881
92.34106 70.63874 114.04339
72.06385 50.49380 93.63390
106.59538 84.94074 128.25001
112.15416 90.51505 133.79326
98.10654 76.54120 119.67189
107.50113 85.88524 129.11702
102.26469 80.66316 123.86623
96.85838 75.27515 118.44161
84.68299 63.12854 106.23745
84.18226 62.62349 105.74102
71.83081 50.28326 93.37836
79.35595 57.82623 100.88567
96.43348 74.79921 118.06776
96.26514 74.69948 117.83080
82.70737 61.15702 104.25772
93.53970 71.97243 115.10696
81.03998 59.49492 102.58503
77.87108 56.31519 99.42696
82.52147 60.95463 104.08831
69.11141 47.56080 90.66202
75.05836 53.51372 96.60301
81.15453 59.62197 102.68710
96.52924 74.81298 118.24550
90.01229 68.45875 111.56583
86.62406 65.09434 108.15378
87.07450 65.53194 108.61707
61.81109 40.19830 83.42388
81.44877 59.88653 103.01101
68.47654 46.90147 90.05161
71.40841 49.85422 92.96259
70.37271 48.78647 91.95896
62.53699 40.96878 84.10521
73.78619 52.25298 95.31940
92.46219 70.87723 114.04715
84.28704 62.74338 105.83070
84.99764 63.45073 106.54455
73.05263 51.47967 94.62559
78.48441 56.93818 100.03063
74.65449 53.07283 96.23614
91.52677 69.75094 113.30260
79.81631 58.27981 101.35281
87.78063 66.22539 109.33586
77.60570 56.04029 99.17110
76.33916 54.79145 97.88686
89.77318 68.17844 111.36792
63.69324 42.14486 85.24162
70.47613 48.90122 92.05103
82.18603 60.63547 103.73659
78.34127 56.75660 99.92595
93.86369 72.24692 115.48046
74.49629 52.96747 96.02511
81.45126 59.91940 102.98313
73.47833 51.93023 95.02644
70.77610 49.20815 92.34405
78.55512 57.01606 100.09419
69.23494 47.66356 90.80631
79.00303 57.47314 100.53292
77.51497 55.98000 99.04995
79.07279 57.52485 100.62074
84.13184 62.60172 105.66196
85.48324 63.88634 107.08014
92.64399 71.02526 114.26272
82.66563 61.11035 104.22092
88.02235 66.45445 109.59025
79.77963 58.25585 101.30342
75.85559 54.32689 97.38430
75.98735 54.40805 97.56665
79.50247 57.96465 101.04030
76.53748 54.96602 98.10895
88.38708 66.85186 109.92231
88.17293 66.63329 109.71256
81.87435 60.32467 103.42402
79.60554 58.06100 101.15007
60.48526 38.90558 82.06495
82.91219 61.33159 104.49280
76.71998 55.18154 98.25842
83.39090 61.85031 104.93148
71.70958 50.16819 93.25096
84.15537 62.58045 105.73028
80.57078 59.01359 102.12798
50.34285 28.40255 72.28315
68.12559 46.57338 89.67781
78.41185 56.83952 99.98418
83.98771 62.43525 105.54018
87.86446 66.29147 109.43745
72.38459 50.79144 93.97775

Conclusion

Three models were generated from baseball team data from 1871 to 2006 to predict the number of wins. Prior to generating the models, the baseball data was analyzed to better understand the relationships between variables. The chosen model was Model 1, which was built on mean-imputed data and includes all available predictors.

Appendix

knitr::opts_chunk$set(cache =TRUE)

# 0. Libraries

library(tidyverse)
library(corrplot)
library(Hmisc)
library(PerformanceAnalytics)
library(mice)
library(gt)
library(DMwR)
library(caret)
library(bnstruct)
library(VIM)
library(corrr)
library(gtsummary)
library(knitr)       # kable()
library(kableExtra)
library(naniar)      # gg_miss_upset(), gg_miss_case()
# 1. Data Exploration

## Import data 

train_data <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA621/master/Project1/moneyball-training-data.csv")
glimpse(train_data)


## Summary table 

wins <- train_data %>% select(TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_HBP, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_FIELDING_DP, TEAM_PITCHING_BB, TEAM_PITCHING_SO)
table1 <- tbl_summary(wins,
          statistic = list(all_continuous() ~ "{mean} ({sd}) {median} {min} {max}"), missing = "no") 
table1


## Create more variables

singles <- train_data$TEAM_BATTING_H - (train_data$TEAM_BATTING_2B + train_data$TEAM_BATTING_3B + train_data$TEAM_BATTING_HR)

train_data$TEAM_BATTING_SLG <- ((train_data$TEAM_BATTING_HR *4)+ (train_data$TEAM_BATTING_3B*3) + (train_data$TEAM_BATTING_2B*2)+ singles)/train_data$TEAM_BATTING_H
## Plots

### Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))

####  Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(train_data$TARGET_WINS ,main="Figure 2 : Distribution and Probability Plot for TARGET_WINS",cex.main=1, horizontal=TRUE , ylim=c(0,150), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(train_data$TARGET_WINS , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="TARGET WINS", xlim=c(0,150))
### qq plots

qqnorm(train_data$TARGET_WINS, pch = 1, frame = FALSE)
qqline(train_data$TARGET_WINS, col = "steelblue", lwd = 2)


max_obs <- 2276
batting_so_na <- ((102/max_obs) * 100)
baserun_sb_na <- (131/max_obs) * 100
baserun_cs_na <- (772/max_obs) * 100
batting_hbp_na <- (2085/max_obs) * 100
pitching_so_na <- (102/max_obs) * 100
fielding_dp_na <- (286/max_obs) * 100

df_percent_na <- data.frame(Columns_w_NA = c("team_batting_so", "team_baserun_sb", "team_baserun_cs", "team_batting_hbp", "team_pitching_so", "team_fielding_dp"), Percent_NA = c(batting_so_na, baserun_sb_na, baserun_cs_na, batting_hbp_na, pitching_so_na, fielding_dp_na))


### Build a gt table of the NA percentages

gt_tbl <- gt(data = df_percent_na)


### Show the gt Table

gt_tbl
### Outliers

ggplot(stack(train_data[,-1]), aes(x = ind, y = values, fill=ind)) + 
  geom_boxplot(outlier.colour = "red",  outlier.alpha=.4) +
  coord_cartesian(ylim = c(0, 1000)) +
  theme_classic()+
  theme(axis.text.x=element_text(angle=45, hjust=1)) 


### Correlation

COR <- train_data %>% 
  correlate() %>% 
  focus(TARGET_WINS)
gt(COR)

COR %>% 
  mutate(rowname = factor(term, levels = term[order(TARGET_WINS)])) %>%  # Order by correlation strength
  ggplot(aes(x = rowname, y = TARGET_WINS)) +
    geom_bar(stat = "identity") +
    ylab("Correlation with TARGET WINS") +
    xlab("Variables") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+ ggtitle("Figure 4: Correlation Against Target Win")
#### pairwise.complete.obs ignores NA values and computes correlation on complete observations
#### we might have to run these corrplots again after we handle the NA values

chart.Correlation(train_data[-c(1)], histogram=TRUE, method="pearson", use="pairwise.complete.obs")

data.corr <- cor(train_data[-c(1)], use="pairwise.complete.obs")

corrplot(data.corr, type = "lower", method="square")

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}

#### Eliminate INDEX from data frame
data_no_index <- train_data[-c(1)]

cor_matrix <- rcorr(as.matrix(data_no_index))

flattenCorrMatrix(cor_matrix$r, cor_matrix$P)
# 2. Data Preparation

## Remove TEAM_BATTING_HBP (column 11), which is ~92% missing
train_data <- train_data[-11]


par(mfrow=c(1,2))
gg_miss_upset(train_data, 
              nsets = 5,
              nintersects = NA)
gg_miss_case(train_data)+
  theme_classic()


# 3. Build models

### MODEL 1: MEAN FULL MODEL

### Filling missing values with the column mean

train_model1 <- train_data
train_model1$TEAM_BATTING_SO[is.na(train_model1$TEAM_BATTING_SO)] = mean(train_model1$TEAM_BATTING_SO, na.rm=TRUE)
train_model1$TEAM_BASERUN_SB[is.na(train_model1$TEAM_BASERUN_SB)] = mean(train_model1$TEAM_BASERUN_SB, na.rm=TRUE)
train_model1$TEAM_BASERUN_CS[is.na(train_model1$TEAM_BASERUN_CS)] = mean(train_model1$TEAM_BASERUN_CS, na.rm=TRUE)
train_model1$TEAM_PITCHING_SO[is.na(train_model1$TEAM_PITCHING_SO)] = mean(train_model1$TEAM_PITCHING_SO, na.rm=TRUE)

model1 <- lm(TARGET_WINS ~ 
               TEAM_BATTING_H +   # Base Hits by batters (1B,2B,3B,HR)
               TEAM_BATTING_2B +  # Doubles by batters (2B) 
               TEAM_BATTING_3B +  # Triples by batters (3B)
               TEAM_BATTING_HR +  # Homeruns by batters (4B) 
               TEAM_BATTING_BB +  # Walks by batters
               TEAM_BATTING_SO +  # Strikeouts by batters 
               TEAM_BASERUN_SB +  # Stolen bases
               TEAM_BASERUN_CS +  # Caught stealing 
               TEAM_PITCHING_H +  # Hits allowed
               TEAM_PITCHING_HR + # Homeruns allowed
               TEAM_PITCHING_BB + # Walks allowed
               TEAM_PITCHING_SO + # Strikeouts by pitchers
               TEAM_FIELDING_E +  # Errors
               TEAM_FIELDING_DP,  # Double Plays
             data=train_model1)
summary(model1)

### Mean Square Error (MSE)

mean(model1$residuals^2)

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model1)
### MODEL 2: MEDIAN WITH STEPWISE

train_model2 <- train_data

#### Filling missing values with the column median

train_model2$TEAM_BATTING_SO[is.na(train_model2$TEAM_BATTING_SO)] = median(train_model2$TEAM_BATTING_SO, na.rm=TRUE)
train_model2$TEAM_BASERUN_SB[is.na(train_model2$TEAM_BASERUN_SB)] = median(train_model2$TEAM_BASERUN_SB, na.rm=TRUE)
train_model2$TEAM_BASERUN_CS[is.na(train_model2$TEAM_BASERUN_CS)] = median(train_model2$TEAM_BASERUN_CS, na.rm=TRUE)
train_model2$TEAM_PITCHING_SO[is.na(train_model2$TEAM_PITCHING_SO)] = median(train_model2$TEAM_PITCHING_SO, na.rm=TRUE)


model2 <- lm(TARGET_WINS ~ 
               TEAM_BATTING_H +   # Base Hits by batters (1B,2B,3B,HR)
               TEAM_BATTING_2B +  # Doubles by batters (2B) 
               TEAM_BATTING_3B +  # Triples by batters (3B)
               TEAM_BATTING_HR +  # Homeruns by batters (4B) 
               TEAM_BATTING_BB +  # Walks by batters
               TEAM_BATTING_SO +  # Strikeouts by batters 
               TEAM_BASERUN_SB +  # Stolen bases
               TEAM_BASERUN_CS +  # Caught stealing 
               TEAM_PITCHING_H +  # Hits allowed
               TEAM_PITCHING_HR + # Homeruns allowed
               TEAM_PITCHING_BB + # Walks allowed
               TEAM_PITCHING_SO + # Strikeouts by pitchers
               TEAM_FIELDING_E +  # Errors
               TEAM_FIELDING_DP,  # Double Plays
             data=train_model2)
summary(model2)

mean(model2$residuals^2)

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model2)
### MODEL 3: KNN IMPUTATION

train_data_impute <- select(train_data, -c(INDEX))


#### using default values

knn_data <- knnImputation(train_data_impute)

summary(knn_data)

train.control <- trainControl(method ='cv', number=10)

step.model <- train(TARGET_WINS ~., data = knn_data,
                    method = "leapSeq",
                    tuneGrid = data.frame(nvmax = 2:13),
                    trControl = train.control
                    )

step.model$results

step.model$bestTune

summary(step.model$finalModel)

coef(step.model$finalModel, 6)


knn_lm <- lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = knn_data)

summary(knn_lm)


hist(knn_lm$residuals, xlab = "Residuals", ylab = "", breaks=100)
qqnorm(knn_lm$residuals)
qqline(knn_lm$residuals)

plot(fitted(knn_lm), residuals(knn_lm))
abline(h=0, lty = 2)

References

- Dealing with Missing Data using R: https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
- Decision Trees in R (rpart): http://www.learnbymarketing.com/tutorials/rpart-decision-trees-in-r/
- Decision Trees in R: https://www.datacamp.com/community/tutorials/decision-trees-R
- Introduction to Data Science (Case Study: Moneyball): https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball