1 OVERVIEW

In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

1.1 Objective:

To build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables provided (or variables that we will derive from the variables provided).

2 DATA EXPLORATION

2.1 Data Summary

The dataset consists of two data files: training and evaluation. The training dataset contains 17 columns, while the evaluation dataset contains 16. The evaluation dataset is missing column TARGET_WINS. We will start by exploring the training data set since it will be the one used to generate the regression model.

First we see that all data is numeric.

An important aspect of any dataset is to determine how much, if any, data is missing. We look at all the variables to see which if any have missing data. We look at the basic descriptive statistics as well as the missing data and their percentages:

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	na_count	na_count_perc
TARGET_WINS	2	191	80.92670	12.115013	82	81.11765	13.3434	43	116	73	-0.1698314	-0.2952783	0.8766116	0	0.0
TEAM_BATTING_H	3	191	1478.62827	76.147869	1477	1477.42484	74.1300	1308	1667	359	0.1302702	-0.3710350	5.5098664	0	0.0
TEAM_BATTING_2B	4	191	297.19895	26.329335	296	296.62745	25.2042	201	373	172	0.0915189	0.4778716	1.9051238	0	0.0
TEAM_BATTING_3B	5	191	30.74346	9.043878	29	30.13072	8.8956	12	61	49	0.7007420	0.7446217	0.6543921	0	0.0
TEAM_BATTING_HR	6	191	178.05236	32.413243	175	176.81046	35.5824	116	260	144	0.2980673	-0.7172373	2.3453399	0	0.0
TEAM_BATTING_BB	7	191	543.31937	74.842133	535	541.31373	74.1300	365	775	410	0.3115199	-0.1474175	5.4153867	0	0.0
TEAM_BATTING_SO	8	191	1051.02618	104.156382	1050	1046.95425	97.8516	805	1399	594	0.3985050	0.3955105	7.5364913	102	4.5
TEAM_BASERUN_SB	9	191	90.90576	29.916401	87	89.06536	29.6520	31	177	146	0.5553966	-0.1414909	2.1646748	131	5.8
TEAM_BASERUN_CS	10	191	39.94241	11.898334	38	39.49020	11.8608	12	74	62	0.3468509	0.0006392	0.8609332	772	33.9
TEAM_BATTING_HBP	11	191	59.35602	12.967123	58	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681	2085	91.6
TEAM_PITCHING_H	12	191	1479.70157	75.788625	1480	1478.50327	72.6474	1312	1667	355	0.1279056	-0.3894781	5.4838725	0	0.0
TEAM_PITCHING_HR	13	191	178.17801	32.391678	175	176.93464	35.5824	116	260	144	0.2989191	-0.7190905	2.3437795	0	0.0
TEAM_PITCHING_BB	14	191	543.71728	74.916681	537	541.74510	72.6474	367	775	408	0.3144366	-0.1338563	5.4207808	0	0.0
TEAM_PITCHING_SO	15	191	1051.81675	104.347208	1052	1047.80392	97.8516	805	1399	594	0.3945586	0.3903991	7.5502990	102	4.5
TEAM_FIELDING_E	16	191	107.05236	16.632162	106	106.58170	17.7912	65	145	80	0.1780432	-0.3567367	1.2034610	0	0.0
TEAM_FIELDING_DP	17	191	152.33508	17.611682	152	152.04575	19.2738	113	204	91	0.2164822	-0.2115741	1.2743366	286	12.6

2.2 Missing and Invalid Data

The summary shows that several variables have missing values. TEAM_BATTING_HBP has the most, with 2085 missing values, or 91.6% of its records. TEAM_BASERUN_CS follows with 772 missing values (33.9%).
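The missing-value counts and percentages can be reproduced with a few lines of base R. The following is a minimal sketch on a small made-up data frame; the name `toy` and all of its values are illustrative assumptions, not the assignment data.

```r
# Toy data frame standing in for the training data (values are invented).
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 95, 61),
  TEAM_BATTING_SO  = c(900, NA, 1010, 850),
  TEAM_BATTING_HBP = c(NA, NA, NA, 55)
)

# Count the NAs in each column and express them as a percentage of all rows.
na_count      <- colSums(is.na(toy))
na_count_perc <- round(100 * na_count / nrow(toy), 1)

na_summary <- data.frame(na_count, na_count_perc)
print(na_summary)
```

Columns whose percentage is very high are candidates for removal, while columns with a low percentage can be imputed.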

With missing data assessed, we can look at the descriptive statistics in more detail. The difference between mean and median is fairly small for all data columns; the maximum difference is in fact only 4.77%. A small gap between mean and median suggests that the distributions are roughly symmetric, with little skew. To visualize this we plot a histogram for each variable.
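As an illustration of the mean versus median comparison, the sketch below computes a relative difference for each column of a toy data frame. The data are invented, and dividing by the mean is an assumption, since the report does not state how its percentage was defined.

```r
# Illustrative values only; the choice of the mean as denominator is an assumption.
toy <- data.frame(
  a = c(10, 12, 14, 16, 18),   # symmetric: mean equals median
  b = c(1, 2, 3, 4, 40)        # right-skewed: mean well above median
)

# Relative gap between mean and median, in percent, for every column.
rel_diff_pct <- sapply(toy, function(x) {
  100 * abs(mean(x) - median(x)) / mean(x)
})
print(round(rel_diff_pct, 2))
```

A value near zero points to a symmetric distribution, while a large value points to skew or to outliers pulling the mean away from the median.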

The histograms do show fairly symmetric data, but they also show the potential presence of outliers in at least two of the predictors. Histograms are not the best way to visualize outliers, so we identify the predictors that seem to have outliers by looking at scatter plots and box plots. Four variables appear to have outliers: TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_PITCHING_BB and TEAM_FIELDING_E. We highlight these variables in the density plots because most of their data is concentrated at the lower end of the scale, with a long tail extending to high values.

2.3 Correlation Plot

Looking at the correlation of each variable with the number of wins provides some interesting insights. Some correlations make sense from what we might assume with subject knowledge of baseball: for example, batting statistics such as the number of hits have a clear positive correlation with wins, while other statistics such as stolen bases, though still positive, are not so strongly related. The pitching statistics are more surprising. We would assume that a team that allows the opposing team more hits, home runs or walks would lose more games (and win fewer), but that is not consistently what the data shows; in the output below, TEAM_PITCHING_HR and TEAM_PITCHING_BB are positively correlated with wins. Perhaps there are outliers swaying the correlation.

Regardless, we can use some of these correlations to drive initial models later, in terms of likely fields to choose for an effective model.

##      TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## [1,]           1      0.3887675       0.2891036       0.1426084
##      TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## [1,]       0.1761532       0.2325599              NA              NA
##      TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## [1,]              NA               NA      -0.1099371        0.1890137
##      TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## [1,]        0.1241745               NA      -0.1764848               NA
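The NA entries in this output are what `cor` returns by default for any column that contains missing values. The sketch below, on invented data, contrasts the default behaviour with `use = "pairwise.complete.obs"`, which computes each correlation from the rows where both values are present. Whether the authors used this option later is not stated, so treat it as one possible approach.

```r
# Invented data: one predictor is complete, the other has a missing value.
toy <- data.frame(
  TARGET_WINS     = c(60, 70, 80, 90, 100),
  TEAM_BATTING_H  = c(1300, 1400, 1450, 1500, 1600),
  TEAM_BATTING_SO = c(1000, NA, 900, 850, 800)
)

predictors <- toy[, setdiff(names(toy), "TARGET_WINS")]

# Default behaviour: a column with any NA yields an NA correlation.
cor_default <- cor(toy$TARGET_WINS, predictors)

# Pairwise deletion: each correlation uses the rows where both values exist.
cor_pairwise <- cor(toy$TARGET_WINS, predictors, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```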

3 DATA PREPARATION

3.1 Variable Creation / Removal

The first task in data preparation is to deal with missing data. In the Data Exploration section we found one variable, TEAM_BATTING_HBP, with an exceptionally high percentage of missing data, so we begin by eliminating this variable. As noted in the appendix, TEAM_BASERUN_CS, the variable with the next highest share of missing values, was removed as well. We also removed the “INDEX” column, as it is not used.

The next task is to handle missing data in the other variables. Here, because the percentages of missing data are lower, we can replace missing data with the median. We prefer the median to the mean because the mean is more sensitive to outliers. This gives us a clean dataset without missing values.

Note, we also consider zeros to be missing data. Since each row is a season of data for a given baseball team, it would be extraordinarily unlikely that any of these statistics would have zero as an actual value. Therefore we are assuming zero is another indicator of missing value and we will transform them into a median value.
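The imputation described above can be sketched as follows. The helper name `impute_median` and the toy values are hypothetical, and computing the median from the non-missing, non-zero values only is an assumption, since the report does not say which values the median was taken over.

```r
# Replace NAs and zeros in a numeric vector with the median of the
# remaining (non-missing, non-zero) values.
impute_median <- function(x) {
  missing <- is.na(x) | x == 0
  x[missing] <- median(x[!missing])
  x
}

# Invented values for two of the columns that have missing data.
toy <- data.frame(
  TEAM_BATTING_SO = c(900, NA, 0, 1000, 800),
  TEAM_BASERUN_SB = c(50, 60, NA, 70, 200)
)

clean <- as.data.frame(lapply(toy, impute_median))
print(clean)
```

In the second column the extreme value of 200 barely moves the imputed value, which is the reason given above for preferring the median over the mean.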

In the exploratory phase we also identified several variables with outliers. Outliers will be replaced with the median, again because it is less influenced by these outliers. The cut-off used to tag an outlier could be 3 standard deviations from the mean, or 1.5 times the interquartile range beyond the quartiles. In this case, because the pitching variables have batting counterparts (hits, walks and strikeouts allowed versus achieved), we instead base each cut-off on the maximum of the corresponding batting variable, as detailed in the appendix.
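A minimal sketch of this replacement step is shown below. The helper name `cap_to_median` and the data values are hypothetical; the cap of 3000 is the pitching-hits limit discussed in the appendix. Taking the median over the values at or below the cap is an assumption.

```r
# Replace values above a cap with the median of the values at or below it.
cap_to_median <- function(x, cap) {
  too_big <- !is.na(x) & x > cap
  x[too_big] <- median(x[!too_big], na.rm = TRUE)
  x
}

# Invented pitching-hits values, two of them unrealistically large.
pitching_h <- c(1400, 1500, 1600, 12000, 30132)

capped <- cap_to_median(pitching_h, cap = 3000)
print(capped)
```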

TEAM_PITCHING_H

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1137    1419    1518    1562    1636    2544

From the summary we now see that the maximum is 2544, which is a much more reasonable number. The gap between mean and median is only 44, which is small relative to values of this size and indicates a more symmetric, closer to normal distribution than before.

TEAM_PITCHING_SO

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   181.0   633.0   816.0   796.8   948.2  1399.0

From the summary we now see that the maximum is 1399, which is a much more reasonable number. The gap between mean and median is only about -19, again indicating a more symmetric, closer to normal distribution than before.

New Variables

With a clean dataset, we can now start looking at what predictor variables can be combined and what new statistics can be derived.

Batting Hit Singles

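On the batting side we can start by adding a variable for singles. The dataset has variables for total hits and for doubles, triples and home runs, so singles are what remains after subtracting the other three kinds of hits from the total.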

TEAM_BATTING_HS = TEAM_BATTING_H - (TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR)

TEAM_BATTING_HS Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     606     990    1050    1072    1129    2112

There are other popular baseball statistics which are regularly calculated. The data given is limited, so it won’t be possible to make all these calculations. But we can use the data given to calculate some statistics that resemble some of the common baseball measurements.

The number of times a batter reaches base can be calculated as Times On Base:

Times On Base

TOB = Base Hits + Walks + Hits by Pitch

TOB = ( TEAM_BATTING_H - TEAM_BATTING_HR ) + TEAM_BATTING_BB + TEAM_BATTING_HBP

TOB = TEAM_BATTING_TOB

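In our case we do not have TEAM_BATTING_HBP. We deleted this predictor because it did not contain enough data, so this term is not included in calculating TOB. Note that the mean of 1369 reported below equals the mean of TEAM_BATTING_H minus the mean of TEAM_BATTING_HR (1469 - 100), which suggests that the walks term may also have been left out of the computed variable.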

TEAM_BATTING_TOB Summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     788    1269    1338    1369    1442    2518

On Base Percentage

If we divide this statistic by the number of plate appearances, we get a ratio for On Base Percentage. Plate appearances is not a statistic that was given, but we can assume it is similar to the number of times a batter reaches base plus the number of strikeouts.

OBP = TOB / ( Base Hits + Walks + Hits by Pitch + Strikeouts )

OBP = TEAM_BATTING_TOB / ( ( TEAM_BATTING_H - TEAM_BATTING_HR ) + TEAM_BATTING_BB + TEAM_BATTING_HBP + TEAM_BATTING_SO )

OBP = TEAM_BATTING_OBP

As before, TEAM_BATTING_HBP is unavailable, so we do not include it.

TEAM_BATTING_OBP Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3801  0.4658  0.5122  0.5287  0.5743  0.9469
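As a worked example of the two formulas as written (with the hit-by-pitch term dropped), the sketch below computes both statistics for a single invented team-season. The input values are assumptions chosen only to make the arithmetic easy to follow.

```r
# One invented team-season.
team <- data.frame(
  TEAM_BATTING_H  = 1500,
  TEAM_BATTING_HR = 150,
  TEAM_BATTING_BB = 500,
  TEAM_BATTING_SO = 850
)

# Times on base without the hit-by-pitch term.
team$TEAM_BATTING_TOB <- (team$TEAM_BATTING_H - team$TEAM_BATTING_HR) +
  team$TEAM_BATTING_BB

# Approximate on-base percentage: times on base over
# times on base plus strikeouts.
team$TEAM_BATTING_OBP <- team$TEAM_BATTING_TOB /
  (team$TEAM_BATTING_TOB + team$TEAM_BATTING_SO)

print(team)
```

Because strikeouts are the only outs counted in the denominator, this ratio runs well above a conventional on-base percentage, which is consistent with the summary values above 0.38.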

Batting Average

This statistic is calculated as the number of hits divided by the number of at-bats. At-bats are not given in our dataset, so we approximate them as the sum of hits, walks and strikeouts, similar to the denominator of the previous calculated variable:

AVG = Hits / ( Hits + Walks + Strikeouts )

AVG = TEAM_BATTING_H / ( TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO )

AVG = TEAM_BATTING_BAVG

TEAM_BATTING_BAVG Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4131  0.4923  0.5290  0.5464  0.5846  0.9471

Slugging Percentage

A shortcoming of the previous statistic is that it weights all kinds of hits equally. To account for the fact that some hits are more valuable, we calculate a slugging percentage that weights each kind of hit by the number of bases it earns (1 for a single up to 4 for a home run).

SLG = ( Single Hits + 2 * Double Hits + 3 * Triple Hits + 4 * Home Runs ) / ( Hits + Walks + Strikeouts )

TEAM_BATTING_SLG = ( ( TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR ) + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR ) / ( TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO )

TEAM_BATTING_SLG Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5861  0.7288  0.7731  0.7842  0.8291  1.2690
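The batting average and slugging percentage proxies can be checked by hand on a single invented team-season. The sketch below follows the formulas as written; the input values and the intermediate names `singles`, `total_bases` and `at_bats_proxy` are illustrative assumptions.

```r
# One invented team-season.
team <- data.frame(
  TEAM_BATTING_H  = 1500,
  TEAM_BATTING_2B = 250,
  TEAM_BATTING_3B = 50,
  TEAM_BATTING_HR = 150,
  TEAM_BATTING_BB = 500,
  TEAM_BATTING_SO = 850
)

# Singles are the hits left after removing doubles, triples and home runs.
singles <- team$TEAM_BATTING_H - team$TEAM_BATTING_2B -
  team$TEAM_BATTING_3B - team$TEAM_BATTING_HR

# Weight each kind of hit by the number of bases it earns.
total_bases <- singles + 2 * team$TEAM_BATTING_2B +
  3 * team$TEAM_BATTING_3B + 4 * team$TEAM_BATTING_HR

# Proxy for at-bats used in the text: hits + walks + strikeouts.
at_bats_proxy <- team$TEAM_BATTING_H + team$TEAM_BATTING_BB + team$TEAM_BATTING_SO

bavg <- team$TEAM_BATTING_H / at_bats_proxy
slg  <- total_bases / at_bats_proxy
print(round(c(BAVG = bavg, SLG = slg), 4))
```

Since every hit counts for at least one base, the slugging proxy can never be smaller than the batting average proxy for the same team.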

Strikeout Efficiency

Measures how successful a pitcher is at striking out batters:

PEFF = Strike Outs / ( Hits + Strike Outs )

TEAM_PITCHING_PEFF = TEAM_PITCHING_SO / ( TEAM_PITCHING_H + TEAM_PITCHING_SO )

TEAM_PITCHING_PEFF Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1065  0.2813  0.3503  0.3358  0.3944  0.5038

Training and Test

Lastly, before we create models, let’s divide data into test and training sets, with 80% for training, 20% for test. This way we have a method to validate our models.
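An 80/20 split of this kind can be sketched with `sample`. The data frame `toy` is invented and the seed is arbitrary; the report does not state which seed or splitting function was used.

```r
set.seed(621)  # arbitrary seed, chosen only to make the sketch reproducible

# Invented data frame standing in for the cleaned data.
toy <- data.frame(
  id          = 1:100,
  TARGET_WINS = round(rnorm(100, mean = 81, sd = 12))
)

# Draw 80% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```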

4 BUILD MODELS

4.1 Batting only model

We start with the batting variables. The model reported below uses total hits (TEAM_BATTING_H) as its only predictor.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.157  -8.688   0.679   9.599  45.949 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.472589   3.369651   5.185  2.4e-07 ***
## TEAM_BATTING_H  0.043178   0.002281  18.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.4 on 1818 degrees of freedom
## Multiple R-squared:  0.1646, Adjusted R-squared:  0.1642 
## F-statistic: 358.2 on 1 and 1818 DF,  p-value: < 2.2e-16

4.2 Pitching only model

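Next we turn to the pitching variables. The first model reported below uses TEAM_PITCHING_H and TEAM_PITCHING_HR; the models that follow use HITS_NOHR, TEAM_BATTING_BB and TEAM_FIELDING_E.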

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.379  -9.340   0.787   9.917  67.847 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      52.476611   2.691696  19.496  < 2e-16 ***
## TEAM_PITCHING_H   0.015148   0.001626   9.319  < 2e-16 ***
## TEAM_PITCHING_HR  0.045438   0.005863   7.750 1.51e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.2 on 1817 degrees of freedom
## Multiple R-squared:  0.06883,    Adjusted R-squared:  0.06781 
## F-statistic: 67.16 on 2 and 1817 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.111  -8.461   0.808  10.497  42.679 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.86903    3.00594   12.60   <2e-16 ***
## HITS_NOHR    0.03144    0.00218   14.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.92 on 1818 degrees of freedom
## Multiple R-squared:  0.1027, Adjusted R-squared:  0.1022 
## F-statistic:   208 on 1 and 1818 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR + TEAM_BATTING_BB + TEAM_FIELDING_E, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.256  -9.176   0.192   9.555  53.546 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     11.419903   3.533612   3.232  0.00125 ** 
## HITS_NOHR        0.047475   0.002322  20.445  < 2e-16 ***
## TEAM_BATTING_BB  0.018140   0.003524   5.148 2.92e-07 ***
## TEAM_FIELDING_E -0.018679   0.002141  -8.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 1816 degrees of freedom
## Multiple R-squared:  0.2172, Adjusted R-squared:  0.2159 
## F-statistic:   168 on 3 and 1816 DF,  p-value: < 2.2e-16

## Warning in abline(hitsNoHR_bb_e_mod): only using the first two of 4
## regression coefficients

The best model in terms of residuals and R-squared uses hits, walks (BB) and fielding errors. The diagnostic plots look good except for short tails in the QQ plot.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.767  -8.959   0.003   9.071  50.788 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.277272   3.639218   1.175 0.240018    
## TEAM_BATTING_H   0.050269   0.002305  21.813  < 2e-16 ***
## TEAM_BATTING_BB  0.012509   0.003507   3.567 0.000371 ***
## TEAM_FIELDING_E -0.014193   0.002022  -7.018 3.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 1816 degrees of freedom
## Multiple R-squared:  0.237,  Adjusted R-squared:  0.2357 
## F-statistic:   188 on 3 and 1816 DF,  p-value: < 2.2e-16

## Warning in abline(hits_bb_e_mod): only using the first two of 4 regression
## coefficients
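The held-out test set gives a way to validate a fitted model. The sketch below fits the three-predictor model on simulated data and computes the root mean squared error on rows the model has not seen. All data are simulated, the coefficients used in the simulation are invented, and RMSE is only one possible validation metric; the report does not state which one was used.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Simulated data: wins depend linearly on hits, walks and errors plus noise.
n <- 200
sim <- data.frame(
  TEAM_BATTING_H  = rnorm(n, 1470, 140),
  TEAM_BATTING_BB = rnorm(n, 500, 120),
  TEAM_FIELDING_E = rnorm(n, 160, 40)
)
sim$TARGET_WINS <- 5 + 0.05 * sim$TEAM_BATTING_H +
  0.012 * sim$TEAM_BATTING_BB - 0.014 * sim$TEAM_FIELDING_E +
  rnorm(n, sd = 13)

# First 160 rows for training, last 40 held out for testing.
train <- sim[1:160, ]
test  <- sim[161:200, ]

fit <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E,
          data = train)

# Root mean squared error on the held-out rows.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$TARGET_WINS - pred)^2))
print(rmse)
```

A test RMSE close to the residual standard error on the training data suggests that the model is not overfitting.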

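4.3 Box-Cox Model

We also attempted a Box-Cox transformation of the response. It did not achieve a better R-squared (0.2317 versus 0.237 for the untransformed model with the same predictors). The residual standard errors are not directly comparable, because the Box-Cox model is fitted on a transformed scale. The QQ plot looks somewhat better, as the negative quantiles are much closer to the line.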

## 
## Call:
## lm(formula = TARGET_WINS_BC ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -194.979  -38.736   -1.184   37.665  223.257 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -64.017547  15.331986  -4.175 3.12e-05 ***
## TEAM_BATTING_H    0.211530   0.009709  21.787  < 2e-16 ***
## TEAM_BATTING_BB   0.053146   0.014776   3.597 0.000331 ***
## TEAM_FIELDING_E  -0.051346   0.008520  -6.026 2.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.99 on 1816 degrees of freedom
## Multiple R-squared:  0.2317, Adjusted R-squared:  0.2305 
## F-statistic: 182.6 on 3 and 1816 DF,  p-value: < 2.2e-16
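For reference, a Box-Cox transformation can be sketched in base R by choosing the exponent that maximizes the profile log-likelihood of the linear model. The data below are simulated, the helper names `box_cox` and `profile_loglik` are hypothetical, and the search interval of -2 to 2 is an assumption; the report does not state which exponent or function was used to create TARGET_WINS_BC.

```r
set.seed(2)  # arbitrary seed

# Simulated positive response with a single predictor (invented data).
x <- runif(150, 1200, 1700)
y <- 10 + 0.05 * x + rnorm(150, sd = 8)

# Box-Cox transform for a given lambda (log transform when lambda is 0).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of the linear model under the transform.
profile_loglik <- function(lambda) {
  yt  <- box_cox(y, lambda)
  rss <- sum(resid(lm(yt ~ x))^2)
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

# Pick the lambda with the highest profile log-likelihood, then refit.
lambda_hat <- optimize(profile_loglik, interval = c(-2, 2), maximum = TRUE)$maximum
fit_bc <- lm(box_cox(y, lambda_hat) ~ x)
print(lambda_hat)
```

Because the transformed response lives on a different scale, statistics such as the residual standard error should only be compared after transforming predictions back to wins.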

5 SELECT MODELS

5.1 Compare Model Statistics

Metric	Batting Model	Pitching Model	BoxCox Model
RSE	14.40	15.20	57.99
R^2	0.1646	0.06883	0.2317
Adj. R^2	0.1642	0.06781	0.2305
F Stat.	358.2	67.16	182.6
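A comparison table like this one can be assembled directly from fitted `lm` objects. The sketch below uses R's built-in `mtcars` data and two throwaway models purely for illustration; the helper name `model_metrics` is hypothetical.

```r
# Collect the comparison metrics from a fitted lm object.
model_metrics <- function(fit) {
  s <- summary(fit)
  c(RSE    = s$sigma,
    R2     = s$r.squared,
    Adj_R2 = s$adj.r.squared,
    F_stat = unname(s$fstatistic["value"]))
}

# Two throwaway models on built-in data, standing in for the report's models.
fits <- list(
  small = lm(mpg ~ wt, data = mtcars),
  large = lm(mpg ~ wt + hp, data = mtcars)
)

# One column per model, one row per metric.
comparison <- sapply(fits, model_metrics)
print(round(comparison, 4))
```

When the models have different numbers of predictors, adjusted R-squared is the fairer row to compare, and RSE is only comparable between models fitted to the same untransformed response.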

5.2 Pick the best regression model

5.3 Conclusion

6 APPENDIX

Find counts of NAs. Notice that HBP and CS (hit by pitch and caught stealing) have very large numbers of missing values, so we probably shouldn’t use them; knowing baseball, they probably have a minor impact on wins regardless, so we remove those columns.

##        TARGET_WINS     TEAM_BATTING_H    TEAM_BATTING_2B 
##                  0                  0                  0 
##    TEAM_BATTING_3B    TEAM_BATTING_HR    TEAM_BATTING_BB 
##                  0                  0                  0 
##    TEAM_BATTING_SO    TEAM_BASERUN_SB    TEAM_BASERUN_CS 
##                  0                  0                  0 
##    TEAM_PITCHING_H   TEAM_PITCHING_HR   TEAM_PITCHING_BB 
##                  0                  0                  0 
##   TEAM_PITCHING_SO    TEAM_FIELDING_E   TEAM_FIELDING_DP 
##                  0                  0                  0 
##    TEAM_BATTING_HS   TEAM_BATTING_TOB   TEAM_BATTING_OBP 
##                  0                  0                  0 
##  TEAM_BATTING_BAVG   TEAM_BATTING_SLG TEAM_PITCHING_PEFF 
##                  0                  0                  0

Convert NAs to -1 for now so that we can test for zeros, since a zero for any of these statistics is clearly inaccurate and equivalent to an NA.

The number of fields with zeros is not so bad, but let’s not replace them until we check for outliers, and convert the NAs back to -1.

##        TARGET_WINS     TEAM_BATTING_H    TEAM_BATTING_2B 
##                  1                  0                  0 
##    TEAM_BATTING_3B    TEAM_BATTING_HR    TEAM_BATTING_BB 
##                  0                  0                  0 
##    TEAM_BATTING_SO    TEAM_BASERUN_SB    TEAM_PITCHING_H 
##                  0                  0                  0 
##   TEAM_PITCHING_HR   TEAM_PITCHING_BB   TEAM_PITCHING_SO 
##                  0                  0                  0 
##    TEAM_FIELDING_E   TEAM_FIELDING_DP    TEAM_BATTING_HS 
##                  0                  0                  0 
##   TEAM_BATTING_TOB   TEAM_BATTING_OBP  TEAM_BATTING_BAVG 
##                  0                  0                  0 
##   TEAM_BATTING_SLG TEAM_PITCHING_PEFF 
##                  0                  0

Running summary, we can see that there are some really big outliers (e.g., 19278 strikeouts in one year would be 119 per game). At initial glance (we can always revisit), it looks like Pitching_Hits, Pitching_BB, Pitching_SO and Fielding Errors all have outliers that are not realistic values.

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  8.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.29  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  3.00   Min.   : 12.0   Min.   :  66.0   Min.   : 14.0  
##  1st Qu.: 42.75   1st Qu.:451.0   1st Qu.: 562.0   1st Qu.: 67.0  
##  Median :103.00   Median :512.0   Median : 754.0   Median :101.0  
##  Mean   :100.29   Mean   :501.8   Mean   : 743.1   Mean   :123.5  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:151.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   :1137    Min.   :  3.0    Min.   : 119.0   Min.   : 181.0  
##  1st Qu.:1419    1st Qu.: 52.0    1st Qu.: 476.0   1st Qu.: 633.0  
##  Median :1518    Median :108.0    Median : 537.0   Median : 816.0  
##  Mean   :1562    Mean   :106.4    Mean   : 553.2   Mean   : 796.8  
##  3rd Qu.:1636    3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.: 948.2  
##  Max.   :2544    Max.   :343.0    Max.   :3645.0   Max.   :1399.0  
##  TEAM_FIELDING_E  TEAM_FIELDING_DP TEAM_BATTING_HS TEAM_BATTING_TOB
##  Min.   :  65.0   Min.   : 52.0    Min.   : 606    Min.   : 788    
##  1st Qu.: 127.0   1st Qu.:134.0    1st Qu.: 990    1st Qu.:1269    
##  Median : 159.0   Median :149.0    Median :1050    Median :1338    
##  Mean   : 246.5   Mean   :146.7    Mean   :1072    Mean   :1369    
##  3rd Qu.: 249.2   3rd Qu.:161.2    3rd Qu.:1129    3rd Qu.:1442    
##  Max.   :1898.0   Max.   :228.0    Max.   :2112    Max.   :2518    
##  TEAM_BATTING_OBP TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
##  Min.   :0.3801   Min.   :0.4131    Min.   :0.5861   Min.   :0.1065    
##  1st Qu.:0.4658   1st Qu.:0.4923    1st Qu.:0.7288   1st Qu.:0.2813    
##  Median :0.5122   Median :0.5290    Median :0.7731   Median :0.3503    
##  Mean   :0.5287   Mean   :0.5464    Mean   :0.7842   Mean   :0.3358    
##  3rd Qu.:0.5743   3rd Qu.:0.5846    3rd Qu.:0.8291   3rd Qu.:0.3944    
##  Max.   :0.9469   Max.   :0.9471    Max.   :1.2690   Max.   :0.5038

## [1] 2544 2514 2498 2485 2477 2460

## [1] 3645 2876 2840 2396 2169 1750

## [1] 1399 1387 1386 1386 1385 1371

## [1] 1898 1890 1740 1728 1567 1553
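The four vectors above look like the six largest values of each suspect column. Assuming that is how they were produced, a listing of this kind can be obtained as follows; the vector `x` is illustrative and not taken from the data files.

```r
# Illustrative values standing in for a column such as TEAM_PITCHING_BB.
x <- c(537, 611, 3645, 476, 2876, 553, 2840, 119)

# Six largest values, in descending order.
top6 <- head(sort(x, decreasing = TRUE), 6)
print(top6)
```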

Let’s plot to see where outliers are

There are a lot of rows with pitching hits greater than 3000. The maximum for a batting team was 2554 hits, so a limit of 3000 hits (18.5 on average per game) for a pitching team does not seem unreasonable as a maximum. We see 86 such rows; these should probably be tossed and replaced with the median (or the mean?).

## [1] 0

Using the same idea for bases on balls, the maximum for batting was 878, so a maximum of 1200 for pitching (about 7.4 per game) might be reasonable; there are 10 rows like this.

## [1] 10

For strikeouts, the batting maximum is 1399, so perhaps 1800 is a reasonable pitching maximum; we have 9 rows like this.

## [1] 0

For fielding errors, there is no corresponding number to validate against (as there is with pitching compared to batting), but if we use a plausible limit of about 10 per game (which is still really high), we can eliminate clear outliers.

## [1] 4

Now that we have identified them, let’s replace all these values

Now let’s replace all zeros, plus the -1’s we had put in earlier for the NAs.

We will use median for those values.

Compute the correlation of each variable with TARGET_WINS:

##      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## [1,]      0.3887675       0.2891036       0.1403293       0.1500749
##      TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## [1,]       0.2239747     -0.05029994       0.1195729       0.1808751
##      TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## [1,]        0.1622116        0.1831058      -0.08228224      -0.1310447
##      TEAM_FIELDING_DP TEAM_BATTING_HS TEAM_BATTING_TOB TEAM_BATTING_OBP
## [1,]       -0.0300863       0.2309099        0.3001907       0.04434842
##      TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
## [1,]        0.06481931        0.1994892         -0.1364644

Do some basic plots to start looking at data:

Fit a couple of really simple models just to explore: one with everything, and one with just home runs against wins.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = mbTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.791  -8.467   0.144   8.271  59.525 
## 
## Coefficients: (2 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.969e+00  2.934e+01   0.272 0.785923    
## TEAM_BATTING_H      7.102e-02  8.184e-03   8.678  < 2e-16 ***
## TEAM_BATTING_2B     2.900e-02  2.389e-02   1.214 0.224900    
## TEAM_BATTING_3B     1.469e-01  4.915e-02   2.988 0.002841 ** 
## TEAM_BATTING_HR    -1.435e-01  8.102e-02  -1.771 0.076694 .  
## TEAM_BATTING_BB     8.640e-03  1.190e-02   0.726 0.467913    
## TEAM_BATTING_SO    -4.247e-02  1.023e-02  -4.150 3.44e-05 ***
## TEAM_BASERUN_SB     1.781e-02  4.255e-03   4.185 2.96e-05 ***
## TEAM_PITCHING_H     2.148e-02  5.526e-03   3.888 0.000104 ***
## TEAM_PITCHING_HR    1.471e-02  2.472e-02   0.595 0.551822    
## TEAM_PITCHING_BB   -1.705e-02  6.027e-03  -2.828 0.004719 ** 
## TEAM_PITCHING_SO   -2.931e-02  1.181e-02  -2.483 0.013115 *  
## TEAM_FIELDING_E    -1.727e-02  2.503e-03  -6.899 6.79e-12 ***
## TEAM_FIELDING_DP   -1.183e-01  1.367e-02  -8.655  < 2e-16 ***
## TEAM_BATTING_HS            NA         NA      NA       NA    
## TEAM_BATTING_TOB           NA         NA      NA       NA    
## TEAM_BATTING_OBP   -1.571e+03  2.873e+02  -5.470 5.00e-08 ***
## TEAM_BATTING_BAVG   1.587e+03  3.068e+02   5.174 2.49e-07 ***
## TEAM_BATTING_SLG   -1.005e+02  5.859e+01  -1.715 0.086468 .  
## TEAM_PITCHING_PEFF  1.611e+02  4.151e+01   3.880 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 2258 degrees of freedom
## Multiple R-squared:  0.3162, Adjusted R-squared:  0.311 
## F-statistic: 61.41 on 17 and 2258 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR, data = mbTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.898  -9.784   0.797  10.227  68.018 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     76.839461   0.636084 120.801  < 2e-16 ***
## TEAM_BATTING_HR  0.039399   0.005443   7.239 6.18e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.58 on 2274 degrees of freedom
## Multiple R-squared:  0.02252,    Adjusted R-squared:  0.02209 
## F-statistic:  52.4 on 1 and 2274 DF,  p-value: 6.176e-13

Let’s plot hits against wins, as that field should have the highest correlation.

Plot Residuals

Data 621 Homework 1: Moneyball

Tommy Jenkins, Violeta Stoyanova, Todd Weigel, Peter Kowalchuk, Eleanor R-Secoquian

September, 2019