1 OVERVIEW

In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

1.1 Objective:

To build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables provided (or variables that we will derive from the variables provided).

2 DATA EXPLORATION

2.1 Data Summary

The dataset consists of two data files: training and evaluation. The training dataset contains 17 columns, while the evaluation dataset contains 16. The evaluation dataset is missing column TARGET_WINS. We will start by exploring the training data set since it will be the one used to generate the regression model.

First we see that all data is numeric.

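An important aspect of any dataset is how much data, if any, is missing. We look at the basic descriptive statistics for every variable, together with the count and percentage of missing values. Note that n is 191 for every variable, which suggests the descriptive statistics were computed on complete rows only, while na_count and na_count_perc refer to the full training set: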

variable vars n mean sd median trimmed mad min max range skew kurtosis se na_count na_count_perc
TARGET_WINS 2 191 80.92670 12.115013 82 81.11765 13.3434 43 116 73 -0.1698314 -0.2952783 0.8766116 0 0.0
TEAM_BATTING_H 3 191 1478.62827 76.147869 1477 1477.42484 74.1300 1308 1667 359 0.1302702 -0.3710350 5.5098664 0 0.0
TEAM_BATTING_2B 4 191 297.19895 26.329335 296 296.62745 25.2042 201 373 172 0.0915189 0.4778716 1.9051238 0 0.0
TEAM_BATTING_3B 5 191 30.74346 9.043878 29 30.13072 8.8956 12 61 49 0.7007420 0.7446217 0.6543921 0 0.0
TEAM_BATTING_HR 6 191 178.05236 32.413243 175 176.81046 35.5824 116 260 144 0.2980673 -0.7172373 2.3453399 0 0.0
TEAM_BATTING_BB 7 191 543.31937 74.842133 535 541.31373 74.1300 365 775 410 0.3115199 -0.1474175 5.4153867 0 0.0
TEAM_BATTING_SO 8 191 1051.02618 104.156382 1050 1046.95425 97.8516 805 1399 594 0.3985050 0.3955105 7.5364913 102 4.5
TEAM_BASERUN_SB 9 191 90.90576 29.916401 87 89.06536 29.6520 31 177 146 0.5553966 -0.1414909 2.1646748 131 5.8
TEAM_BASERUN_CS 10 191 39.94241 11.898334 38 39.49020 11.8608 12 74 62 0.3468509 0.0006392 0.8609332 772 33.9
TEAM_BATTING_HBP 11 191 59.35602 12.967123 58 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681 2085 91.6
TEAM_PITCHING_H 12 191 1479.70157 75.788625 1480 1478.50327 72.6474 1312 1667 355 0.1279056 -0.3894781 5.4838725 0 0.0
TEAM_PITCHING_HR 13 191 178.17801 32.391678 175 176.93464 35.5824 116 260 144 0.2989191 -0.7190905 2.3437795 0 0.0
TEAM_PITCHING_BB 14 191 543.71728 74.916681 537 541.74510 72.6474 367 775 408 0.3144366 -0.1338563 5.4207808 0 0.0
TEAM_PITCHING_SO 15 191 1051.81675 104.347208 1052 1047.80392 97.8516 805 1399 594 0.3945586 0.3903991 7.5502990 102 4.5
TEAM_FIELDING_E 16 191 107.05236 16.632162 106 106.58170 17.7912 65 145 80 0.1780432 -0.3567367 1.2034610 0 0.0
TEAM_FIELDING_DP 17 191 152.33508 17.611682 152 152.04575 19.2738 113 204 91 0.2164822 -0.2115741 1.2743366 286 12.6
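A minimal sketch of how the missing-value counts and percentages in the last two columns could be computed in base R. The data frame `toy` and its values are invented for illustration; only the column names come from the dataset, and this is not necessarily how the table above was produced.

```r
# Toy data frame standing in for the training data (values are made up).
toy <- data.frame(
  TARGET_WINS      = c(70, 85, 92, 60),
  TEAM_BATTING_SO  = c(900, NA, 1010, 850),
  TEAM_BATTING_HBP = c(NA, NA, 45, NA)
)

# Count missing values per column and express them as a percentage of rows.
na_count      <- colSums(is.na(toy))
na_count_perc <- round(100 * na_count / nrow(toy), 1)

# One row per variable, mirroring the last two columns of the summary table.
missing_summary <- data.frame(na_count, na_count_perc)
print(missing_summary)
```

Sorting `missing_summary` by `na_count_perc` makes the variables that are candidates for removal stand out.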

2.2 Missing and Invalid Data

From this result we can see how several variables have a number of missing values. The maximum number of missing values was 2085 in the TEAM_BATTING_HBP variable. This is a significant amount of missing data representing 91.6% of that data.

With missing data assessed, we can look into descriptive statistics in more detail. Interestingly we find that the difference between means and medians is fairly small for all data columns. The maximum difference is in fact only 4.77%. This means that we expect the distributions of these variables to be fairly symmetric. To visualize this we plot histograms for each variable.
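The mean versus median comparison can be sketched as a small helper. This is an illustration only: the function name `mean_median_gap` is hypothetical, the two vectors are made up, and expressing the gap as a percentage of the median is an assumption, since the text does not say which denominator was used.

```r
# Relative gap between mean and median, as a percentage of the median.
mean_median_gap <- function(x) {
  x <- x[!is.na(x)]
  100 * abs(mean(x) - median(x)) / median(x)
}

symmetric <- c(78, 80, 82, 84, 86)    # mean equals median
skewed    <- c(78, 80, 82, 84, 400)   # one large value pulls the mean up

gaps <- c(symmetric = mean_median_gap(symmetric),
          skewed    = mean_median_gap(skewed))
print(round(gaps, 2))
```

A small gap points to a roughly symmetric distribution, while a large gap signals skew or extreme values.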

The plot of distributions does show fairly symmetric data, but it also shows the potential presence of outliers in several of the predictors. Histograms are not the best way to visualize outliers. Instead we identify the predictors which seem to have outliers by looking at the scatter plots and box plots. Four variables appear to have outliers: TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_PITCHING_BB and TEAM_FIELDING_E. We highlight these variables in the density plots, where most of the data is concentrated at the lower end of the scale with a long tail toward high values.

2.3 Correlation Plot

Looking at the correlation of each variable with the number of wins provides some interesting insights. Some correlations make sense given what we might assume from subject knowledge of baseball: for example, hits (0.39) and doubles (0.29) both have clear positive correlations with wins, and other statistics such as stolen bases, while still positive, are not so strongly related. What is surprising, though, are the pitching statistics. We would assume that a team that allowed the opposing team more hits, home runs, or walks would lose more games (and win fewer), but that is not consistently what the data shows us: home runs allowed and walks allowed both correlate positively with wins. Perhaps there are outliers swaying the correlation.

Regardless, we can use some of these correlations to drive initial models later, in terms of likely fields to choose for an effective model.

##      TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## [1,]           1      0.3887675       0.2891036       0.1426084
##      TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## [1,]       0.1761532       0.2325599              NA              NA
##      TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## [1,]              NA               NA      -0.1099371        0.1890137
##      TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## [1,]        0.1241745               NA      -0.1764848               NA
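The NA entries above are what `cor` returns by default when a column contains missing values. A minimal sketch, with a made-up data frame `toy`, of how pairwise deletion via the `use` argument yields a value for those columns instead:

```r
# Toy data with one missing strikeout value (all values are made up).
toy <- data.frame(
  TARGET_WINS     = c(70, 85, 92, 60, 78),
  TEAM_BATTING_H  = c(1400, 1500, 1580, 1350, 1460),
  TEAM_BATTING_SO = c(900, NA, 1010, 850, 940)
)

# Default behaviour: a single NA in a column makes its correlation NA.
cor_default <- cor(toy$TARGET_WINS, toy[, -1])

# Pairwise deletion: each correlation uses the rows where both values exist.
cor_pairwise <- cor(toy$TARGET_WINS, toy[, -1],
                    use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

With pairwise deletion each correlation may be based on a different number of rows, so values for columns with many missing entries should be read with some caution.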

3 DATA PREPARATION

3.1 Variable Creation / Removal

The first task under data preparation is to eliminate all missing data. In the Data Exploration section we found one variable, TEAM_BATTING_HBP, with an exceptionally high percentage of missing data, so we begin by eliminating this variable. We also remove the “INDEX” column as it is not used.

Next task is to handle missing data in the other variables. Here, because the percentages of missing data are lower, we can replace missing data with the median. We prefer replacing with median instead of mean because the latter is more sensitive to outliers. So we get a clean dataset without missing values.

Note, we also consider zeros to be missing data. Since each row is a season of data for a given baseball team, it would be extraordinarily unlikely that any of these statistics would have zero as an actual value. Therefore we are assuming zero is another indicator of missing value and we will transform them into a median value.
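The imputation rule described above (treat both NA and zero as missing, then fill with the median) can be sketched as follows. The helper name `impute_median` and the sample vector are hypothetical, and taking the median over the valid values only is an assumption.

```r
# Replace NA and zero entries of a numeric vector with the median of the
# remaining (valid) values.
impute_median <- function(x) {
  invalid <- is.na(x) | x == 0
  x[invalid] <- median(x[!invalid])
  x
}

stolen_bases <- c(90, 0, 120, NA, 60)   # made-up values
imputed <- impute_median(stolen_bases)
print(imputed)
```

Applied column by column, for example with `lapply` over the predictor columns, this gives a data frame with no missing or zero entries.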

In the exploratory phase we also identified several variables with outliers. Outliers will be replaced with the median. Again we choose the median because it is less influenced by these outliers. Common cut-offs for tagging an outlier are 3 standard deviations from the mean or 1.5 times the interquartile range. In this case, because most of these variables have a batting counterpart, as seen in the exploratory phase, we base the cut-off on the maximum reading of the corresponding variable.
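For comparison, the two generic cut-off rules mentioned above can be sketched in base R. The function names `flag_sd` and `flag_iqr` are hypothetical and the strikeout values are made up, apart from the extreme value 19278 quoted in the appendix.

```r
# Flag values more than k standard deviations from the mean.
flag_sd <- function(x, k = 3) abs(x - mean(x)) > k * sd(x)

# Flag values more than k * IQR outside the quartiles (the box plot rule).
flag_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  x < q[1] - k * IQR(x) | x > q[2] + k * IQR(x)
}

# 21 plausible season totals plus one implausible value.
pitching_so <- c(seq(600, 1000, by = 20), 19278)

print(which(flag_sd(pitching_so)))
print(which(flag_iqr(pitching_so)))
```

Both rules depend on the data themselves, so a few extreme values can inflate the standard deviation and hide other outliers. That is one reason a domain-based cut-off can be easier to defend here.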

TEAM_PITCHING_H

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1137    1419    1518    1562    1636    2544

From the summary we now see that the maximum is 2544, which is a much more reasonable number. The gap between mean and median is only 44, indicating a more symmetric, closer to normal distribution than before.

TEAM_PITCHING_SO

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   181.0   633.0   816.0   796.8   948.2  1399.0

From the summary we now see that the maximum is 1399, which is a much more reasonable number. The gap between mean and median is only -19, indicating a more symmetric, closer to normal distribution than before.

New Variables

With a clean dataset, we can now start looking at what predictor variables can be combined and what new statistics can be derived.

Batting Hit Singles

On the batting side we can start by adding a variable for singles, since the dataset has variables for total hits and for doubles, triples, and home runs, but not for singles.

TEAM_BATTING_HS = TEAM_BATTING_H - (TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR)

TEAM_BATTING_HS Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     606     990    1050    1072    1129    2112

There are other popular baseball statistics which are regularly calculated. The data given is limited, so it won’t be possible to make all these calculations. But we can use the data given to calculate some statistics that resemble some of the common baseball measurements.

The number of times a batter reaches base can be calculated as Times On Base:

Times On Base

TOB = Base Hits + Walks + Hits by Pitch
TOB = ( TEAM_BATTING_H - TEAM_BATTING_HR ) + TEAM_BATTING_BB + TEAM_BATTING_HBP
TOB = TEAM_BATTING_TOB

In our case we do not have TEAM_BATTING_HBP. We deleted this predictor since it didn’t contain enough data, so we will not include this term in calculating TOB.

TEAM_BATTING_TOB Summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     788    1269    1338    1369    1442    2518

On Base Percentage

If we divide this statistic by the number of plate appearances, we get a ratio for On Base Percentage. Plate appearances are not given in the dataset, but we can assume they are similar to the number of times a batter reaches base plus the number of strikeouts.

OBP = TOB / ( Base Hits + Walks + Hits by Pitch + Strikeouts )
OBP = TEAM_BATTING_TOB / (( TEAM_BATTING_H - TEAM_BATTING_HR ) + TEAM_BATTING_BB + TEAM_BATTING_HBP + TEAM_BATTING_SO )
OBP = TEAM_BATTING_OBP

Same as before TEAM_BATTING_HBP is missing so we do not include it.

TEAM_BATTING_OBP Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3801  0.4658  0.5122  0.5287  0.5743  0.9469
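A worked example of the two formulas as they are written in the text, with the HBP term left out. The single team season below is made up; only the column names come from the dataset, and the result is not checked against the summaries shown.

```r
# One made-up team season.
team <- data.frame(
  TEAM_BATTING_H  = 1500,
  TEAM_BATTING_HR = 150,
  TEAM_BATTING_BB = 500,
  TEAM_BATTING_SO = 1000
)

# Times on base without the HBP term (that column was dropped).
team$TEAM_BATTING_TOB <- (team$TEAM_BATTING_H - team$TEAM_BATTING_HR) +
  team$TEAM_BATTING_BB

# Plate appearances approximated as times on base plus strikeouts.
team$TEAM_BATTING_OBP <- team$TEAM_BATTING_TOB /
  (team$TEAM_BATTING_TOB + team$TEAM_BATTING_SO)

print(team[, c("TEAM_BATTING_TOB", "TEAM_BATTING_OBP")])
```

Here TOB is (1500 - 150) + 500 = 1850 and the ratio is 1850 / 2850, roughly 0.65. Because the denominator leaves out batted-ball outs, this ratio runs well above a conventional on-base percentage.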

Batting Average

This statistic is calculated as the number of hits divided by the number of at bats. With our dataset we approximate at bats as the sum of hits, walks, and strikeouts, in the same spirit as the previous calculated variable, since the number of Hits by Pitch is not available:

AVG = Hits / (Hits + Walks + Strikeouts)
AVG = TEAM_BATTING_H / ( TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO )
AVG = TEAM_BATTING_BAVG

TEAM_BATTING_BAVG Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4131  0.4923  0.5290  0.5464  0.5846  0.9471

Slugging Percentage

A shortcoming of the previous statistic is that it weights any kind of hits equally. To account for the fact that some hits are more beneficial or carry higher weight we can calculate a slugging percentage by multiplying each kind of hit by an increasing number.

SLG = ( Single Hits + 2 * Double Hits + 3 * Triple Hits + 4 * Home Runs ) / (Hits + Walks + Strikeouts)
TEAM_BATTING_SLG = ( ( TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR ) + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR ) / ( TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO )

TEAM_BATTING_SLG Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5861  0.7288  0.7731  0.7842  0.8291  1.2690
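A worked example of the singles, batting average, and slugging formulas as written above, using one made-up team season. The intermediate names `singles`, `denominator`, and `total_bases` are introduced here for illustration only.

```r
# One made-up team season; only the column names come from the dataset.
team <- data.frame(
  TEAM_BATTING_H  = 1500, TEAM_BATTING_2B = 300, TEAM_BATTING_3B = 50,
  TEAM_BATTING_HR = 150,  TEAM_BATTING_BB = 500, TEAM_BATTING_SO = 1000
)

# Singles are total hits minus extra-base hits.
singles <- team$TEAM_BATTING_H -
  (team$TEAM_BATTING_2B + team$TEAM_BATTING_3B + team$TEAM_BATTING_HR)

# Shared denominator used in the text: hits + walks + strikeouts.
denominator <- team$TEAM_BATTING_H + team$TEAM_BATTING_BB + team$TEAM_BATTING_SO

# Each kind of hit weighted by the number of bases it is worth.
total_bases <- singles + 2 * team$TEAM_BATTING_2B +
  3 * team$TEAM_BATTING_3B + 4 * team$TEAM_BATTING_HR

team$TEAM_BATTING_HS   <- singles
team$TEAM_BATTING_BAVG <- team$TEAM_BATTING_H / denominator
team$TEAM_BATTING_SLG  <- total_bases / denominator

print(team[, c("TEAM_BATTING_HS", "TEAM_BATTING_BAVG", "TEAM_BATTING_SLG")])
```

For this row, singles are 1000, the average is 1500 / 3000 = 0.5, and slugging is 2350 / 3000, roughly 0.78. Since a home run counts four times, the slugging ratio can exceed 1, as the maximum in the summary above does.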

Strikeout Efficiency

Measures how successful a pitcher is at striking out batters:

PEFF = Strike Outs / (Hits + Strike Outs)
TEAM_PITCHING_PEFF = TEAM_PITCHING_SO / (TEAM_PITCHING_H + TEAM_PITCHING_SO)

TEAM_PITCHING_PEFF Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1065  0.2813  0.3503  0.3358  0.3944  0.5038

Training and Test

Lastly, before we create models, let’s divide data into test and training sets, with 80% for training, 20% for test. This way we have a method to validate our models.
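A minimal sketch of an 80/20 split in base R. The seed value and the data frame `toy` are assumptions made for the example; the text does not say how the split was drawn.

```r
set.seed(42)  # arbitrary seed, used here only to make the split reproducible

# Toy data standing in for the cleaned training data (values are simulated).
toy <- data.frame(id = 1:100, TARGET_WINS = rnorm(100, mean = 81, sd = 12))

# Sample 80% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Fixing the seed keeps the same rows in each set between runs, so model statistics stay comparable when the report is re-knit.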

4 BUILD MODELS

4.1 Batting only model

We model wins on the batting variables. The fit reported below uses total hits (TEAM_BATTING_H) as its only predictor.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.157  -8.688   0.679   9.599  45.949 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.472589   3.369651   5.185  2.4e-07 ***
## TEAM_BATTING_H  0.043178   0.002281  18.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.4 on 1818 degrees of freedom
## Multiple R-squared:  0.1646, Adjusted R-squared:  0.1642 
## F-statistic: 358.2 on 1 and 1818 DF,  p-value: < 2.2e-16
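The slope of about 0.043 means that, in this fit, 100 additional hits over a season are associated with roughly 4.3 additional wins. A sketch of fitting the same kind of model and checking it on held-out rows follows. The data are simulated with a made-up relationship, and the use of RMSE as the validation measure is an assumption, since the text does not name one.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-in for the cleaned data (the relationship is made up).
n    <- 200
hits <- round(runif(n, 1200, 1700))
wins <- 17 + 0.043 * hits + rnorm(n, sd = 10)
toy  <- data.frame(TARGET_WINS = wins, TEAM_BATTING_H = hits)

train <- toy[1:160, ]
test  <- toy[161:200, ]

# Fit on the training rows only.
fit <- lm(TARGET_WINS ~ TEAM_BATTING_H, data = train)

# Predict the held-out rows and compute the root mean squared error.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$TARGET_WINS - pred)^2))

print(coef(fit))
print(rmse)
```

A test RMSE close to the residual standard error on the training data suggests the model is not overfitting.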

4.2 Pitching only model

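We model wins on the pitching variables. The fit reported below uses hits allowed (TEAM_PITCHING_H) and home runs allowed (TEAM_PITCHING_HR).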

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.379  -9.340   0.787   9.917  67.847 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      52.476611   2.691696  19.496  < 2e-16 ***
## TEAM_PITCHING_H   0.015148   0.001626   9.319  < 2e-16 ***
## TEAM_PITCHING_HR  0.045438   0.005863   7.750 1.51e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.2 on 1817 degrees of freedom
## Multiple R-squared:  0.06883,    Adjusted R-squared:  0.06781 
## F-statistic: 67.16 on 2 and 1817 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.111  -8.461   0.808  10.497  42.679 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.86903    3.00594   12.60   <2e-16 ***
## HITS_NOHR    0.03144    0.00218   14.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.92 on 1818 degrees of freedom
## Multiple R-squared:  0.1027, Adjusted R-squared:  0.1022 
## F-statistic:   208 on 1 and 1818 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR + TEAM_BATTING_BB + TEAM_FIELDING_E, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.256  -9.176   0.192   9.555  53.546 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     11.419903   3.533612   3.232  0.00125 ** 
## HITS_NOHR        0.047475   0.002322  20.445  < 2e-16 ***
## TEAM_BATTING_BB  0.018140   0.003524   5.148 2.92e-07 ***
## TEAM_FIELDING_E -0.018679   0.002141  -8.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 1816 degrees of freedom
## Multiple R-squared:  0.2172, Adjusted R-squared:  0.2159 
## F-statistic:   168 on 3 and 1816 DF,  p-value: < 2.2e-16
## Warning in abline(hitsNoHR_bb_e_mod): only using the first two of 4
## regression coefficients

The best model in terms of residuals and R-squared uses hits, walks (BB), and fielding errors. The diagnostic plots look good except for short tails in the Q-Q plot.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.767  -8.959   0.003   9.071  50.788 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.277272   3.639218   1.175 0.240018    
## TEAM_BATTING_H   0.050269   0.002305  21.813  < 2e-16 ***
## TEAM_BATTING_BB  0.012509   0.003507   3.567 0.000371 ***
## TEAM_FIELDING_E -0.014193   0.002022  -7.018 3.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 1816 degrees of freedom
## Multiple R-squared:  0.237,  Adjusted R-squared:  0.2357 
## F-statistic:   188 on 3 and 1816 DF,  p-value: < 2.2e-16
## Warning in abline(hits_bb_e_mod): only using the first two of 4 regression
## coefficients

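4.3 Box-Cox Model

We also tried a Box-Cox transformation of the response. It did not achieve better results in R-squared (0.2317 versus 0.237 for the untransformed model with the same predictors). Note that the residual standard error is not directly comparable because the response is on a transformed scale. The Q-Q plot seems to look a bit better, as the negative quantiles are much closer to the line.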

## 
## Call:
## lm(formula = TARGET_WINS_BC ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -194.979  -38.736   -1.184   37.665  223.257 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -64.017547  15.331986  -4.175 3.12e-05 ***
## TEAM_BATTING_H    0.211530   0.009709  21.787  < 2e-16 ***
## TEAM_BATTING_BB   0.053146   0.014776   3.597 0.000331 ***
## TEAM_FIELDING_E  -0.051346   0.008520  -6.026 2.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.99 on 1816 degrees of freedom
## Multiple R-squared:  0.2317, Adjusted R-squared:  0.2305 
## F-statistic: 182.6 on 3 and 1816 DF,  p-value: < 2.2e-16
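For readers unfamiliar with the transformation, here is a base R sketch of Box-Cox applied to a strictly positive response, with lambda chosen by maximizing the profile log-likelihood. The helper names `box_cox` and `bc_loglik`, the search interval, and the simulated data are all assumptions; the text does not state how TARGET_WINS_BC was computed or which lambda was used.

```r
# Box-Cox transform of a strictly positive response for a given lambda.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for a linear model (up to a constant).
bc_loglik <- function(lambda, y, X) {
  z   <- box_cox(y, lambda)
  rss <- sum(lm.fit(X, z)$residuals^2)
  n   <- length(y)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

set.seed(7)  # arbitrary seed for reproducibility
x <- runif(300, 1200, 1700)
y <- 20 + 0.04 * x + rnorm(300, sd = 5)   # simulated, strictly positive
X <- cbind(1, x)                          # design matrix with intercept

# Search a conventional range of lambda values for the maximum.
best       <- optimize(bc_loglik, interval = c(-2, 3), y = y, X = X,
                       maximum = TRUE)
lambda_hat <- best$maximum
y_bc       <- box_cox(y, lambda_hat)

print(lambda_hat)
```

The transformation requires a positive response, so any zero in TARGET_WINS has to be handled before it is applied.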

5 SELECT MODELS

5.1 Compare Model Statistics

Metric     Batting Model   Pitching Model   BoxCox Model
RSE        14.40           15.20            57.99
R^2        0.1646          0.06883          0.2317
Adj. R^2   0.1642          0.06781          0.2305
F Stat.    358.2           67.16            182.6
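A comparison table like the one above can be assembled directly from the fitted models. This is a sketch: the helper `model_metrics` is hypothetical and the two fits use simulated data.

```r
# Pull the comparison metrics out of a fitted lm object.
model_metrics <- function(fit) {
  s <- summary(fit)
  c(RSE    = s$sigma,
    R2     = s$r.squared,
    Adj_R2 = s$adj.r.squared,
    F_stat = unname(s$fstatistic["value"]))
}

set.seed(3)  # arbitrary seed for reproducibility
toy   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 80 + 10 * toy$x1 + rnorm(100, sd = 10)   # only x1 drives y

fits <- list(with_x1 = lm(y ~ x1, data = toy),
             with_x2 = lm(y ~ x2, data = toy))

# One column per model, one row per metric.
comparison <- sapply(fits, model_metrics)
print(round(comparison, 4))
```

RSE is only comparable between models fitted to the same response scale, so the Box-Cox column should be compared on R-squared and diagnostic plots rather than on RSE.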

5.2 Pick the best regression model

5.3 Conclusion

6 APPENDIX

We find the counts of NAs. Notice that HBP and CS (hit by pitch and caught stealing) have very large numbers of missing values and probably should not be used. Knowing baseball, they probably have a minor impact on wins regardless, so we remove those columns.

##        TARGET_WINS     TEAM_BATTING_H    TEAM_BATTING_2B 
##                  0                  0                  0 
##    TEAM_BATTING_3B    TEAM_BATTING_HR    TEAM_BATTING_BB 
##                  0                  0                  0 
##    TEAM_BATTING_SO    TEAM_BASERUN_SB    TEAM_BASERUN_CS 
##                  0                  0                  0 
##    TEAM_PITCHING_H   TEAM_PITCHING_HR   TEAM_PITCHING_BB 
##                  0                  0                  0 
##   TEAM_PITCHING_SO    TEAM_FIELDING_E   TEAM_FIELDING_DP 
##                  0                  0                  0 
##    TEAM_BATTING_HS   TEAM_BATTING_TOB   TEAM_BATTING_OBP 
##                  0                  0                  0 
##  TEAM_BATTING_BAVG   TEAM_BATTING_SLG TEAM_PITCHING_PEFF 
##                  0                  0                  0

Convert NAs to -1 for now so we can test for zeros, since a zero for any of these statistics is clearly inaccurate and equivalent to an NA.

The number of zero fields is not so bad, but let's not replace them until we have checked for outliers and dealt with the NAs coded as -1.

##        TARGET_WINS     TEAM_BATTING_H    TEAM_BATTING_2B 
##                  1                  0                  0 
##    TEAM_BATTING_3B    TEAM_BATTING_HR    TEAM_BATTING_BB 
##                  0                  0                  0 
##    TEAM_BATTING_SO    TEAM_BASERUN_SB    TEAM_PITCHING_H 
##                  0                  0                  0 
##   TEAM_PITCHING_HR   TEAM_PITCHING_BB   TEAM_PITCHING_SO 
##                  0                  0                  0 
##    TEAM_FIELDING_E   TEAM_FIELDING_DP    TEAM_BATTING_HS 
##                  0                  0                  0 
##   TEAM_BATTING_TOB   TEAM_BATTING_OBP  TEAM_BATTING_BAVG 
##                  0                  0                  0 
##   TEAM_BATTING_SLG TEAM_PITCHING_PEFF 
##                  0                  0

Running summary, we can see that there are some really big outliers (e.g., 19278 strikeouts in one year, which would be 119 per game). At initial glance (we can always revisit), it looks like Pitching_Hits, Pitching_BB, Pitching_SO and Fielding Errors all have outliers that are not realistic values.

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  8.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.29  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  3.00   Min.   : 12.0   Min.   :  66.0   Min.   : 14.0  
##  1st Qu.: 42.75   1st Qu.:451.0   1st Qu.: 562.0   1st Qu.: 67.0  
##  Median :103.00   Median :512.0   Median : 754.0   Median :101.0  
##  Mean   :100.29   Mean   :501.8   Mean   : 743.1   Mean   :123.5  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:151.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   :1137    Min.   :  3.0    Min.   : 119.0   Min.   : 181.0  
##  1st Qu.:1419    1st Qu.: 52.0    1st Qu.: 476.0   1st Qu.: 633.0  
##  Median :1518    Median :108.0    Median : 537.0   Median : 816.0  
##  Mean   :1562    Mean   :106.4    Mean   : 553.2   Mean   : 796.8  
##  3rd Qu.:1636    3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.: 948.2  
##  Max.   :2544    Max.   :343.0    Max.   :3645.0   Max.   :1399.0  
##  TEAM_FIELDING_E  TEAM_FIELDING_DP TEAM_BATTING_HS TEAM_BATTING_TOB
##  Min.   :  65.0   Min.   : 52.0    Min.   : 606    Min.   : 788    
##  1st Qu.: 127.0   1st Qu.:134.0    1st Qu.: 990    1st Qu.:1269    
##  Median : 159.0   Median :149.0    Median :1050    Median :1338    
##  Mean   : 246.5   Mean   :146.7    Mean   :1072    Mean   :1369    
##  3rd Qu.: 249.2   3rd Qu.:161.2    3rd Qu.:1129    3rd Qu.:1442    
##  Max.   :1898.0   Max.   :228.0    Max.   :2112    Max.   :2518    
##  TEAM_BATTING_OBP TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
##  Min.   :0.3801   Min.   :0.4131    Min.   :0.5861   Min.   :0.1065    
##  1st Qu.:0.4658   1st Qu.:0.4923    1st Qu.:0.7288   1st Qu.:0.2813    
##  Median :0.5122   Median :0.5290    Median :0.7731   Median :0.3503    
##  Mean   :0.5287   Mean   :0.5464    Mean   :0.7842   Mean   :0.3358    
##  3rd Qu.:0.5743   3rd Qu.:0.5846    3rd Qu.:0.8291   3rd Qu.:0.3944    
##  Max.   :0.9469   Max.   :0.9471    Max.   :1.2690   Max.   :0.5038
## [1] 2544 2514 2498 2485 2477 2460
## [1] 3645 2876 2840 2396 2169 1750
## [1] 1399 1387 1386 1386 1385 1371
## [1] 1898 1890 1740 1728 1567 1553

Let’s plot to see where outliers are

There are a lot of rows with pitching hits greater than 3000. The maximum for the batting side was 2554 hits, so a limit of 3000 hits (18.5 per game on average) for the pitching side does not seem unreasonable as a maximum. We see 86 such rows; these should probably be tossed and replaced with the median (or perhaps the mean).

## [1] 0

Using the same idea for bases on balls: the maximum for batting was 878, so a maximum of 1200 for pitching (about 7.4 per game on average) might be reasonable. There are 10 rows like this.

## [1] 10

For strikeouts, the batting maximum is 1399, so perhaps 1800 is a reasonable pitching maximum, and we have 9 rows like this.

## [1] 0

For fielding errors, there is no corresponding number to validate against (unlike pitching, which can be compared to batting), but if we use a plausible limit of about 10 per game, which is still really high, we can eliminate clear outliers.

## [1] 4
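The cut-offs discussed above can be checked per game and applied with a small helper. This is a sketch: the name `cap_with_median` and the sample vector are hypothetical, the cut-off values are the ones quoted in the text, and taking the median over the values that pass the cut-off is an assumption.

```r
games <- 162  # season length the statistics are adjusted to

# Cut-offs quoted in the text, with fielding errors set at 10 per game.
cutoffs <- c(TEAM_PITCHING_H  = 3000,
             TEAM_PITCHING_BB = 1200,
             TEAM_PITCHING_SO = 1800,
             TEAM_FIELDING_E  = 10 * games)
print(round(cutoffs / games, 1))   # per-game equivalents as a sanity check

# Replace values above a cut-off with the median of the values at or below it.
cap_with_median <- function(x, cutoff) {
  too_high <- !is.na(x) & x > cutoff
  x[too_high] <- median(x[!too_high], na.rm = TRUE)
  x
}

pitching_bb <- c(480, 520, 537, 610, 3645)   # made-up rows plus one extreme
n_flagged   <- sum(pitching_bb > cutoffs[["TEAM_PITCHING_BB"]])
cleaned     <- cap_with_median(pitching_bb, cutoffs[["TEAM_PITCHING_BB"]])

print(n_flagged)
print(cleaned)
```

Counting the flagged rows before replacing them, as done above for each variable, shows how much of the data a given cut-off would alter.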

Now that we have identified them, let's replace all these values.

Now let's replace all zeros, plus the -1's we had put in earlier for the NAs.

We will use median for those values.

Do some correlation

##      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## [1,]      0.3887675       0.2891036       0.1403293       0.1500749
##      TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## [1,]       0.2239747     -0.05029994       0.1195729       0.1808751
##      TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## [1,]        0.1622116        0.1831058      -0.08228224      -0.1310447
##      TEAM_FIELDING_DP TEAM_BATTING_HS TEAM_BATTING_TOB TEAM_BATTING_OBP
## [1,]       -0.0300863       0.2309099        0.3001907       0.04434842
##      TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
## [1,]        0.06481931        0.1994892         -0.1364644

Do some basic plots to start looking at data:

Fit a couple really simple models just to explore, one with everything, one with just home runs against wins.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = mbTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.791  -8.467   0.144   8.271  59.525 
## 
## Coefficients: (2 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.969e+00  2.934e+01   0.272 0.785923    
## TEAM_BATTING_H      7.102e-02  8.184e-03   8.678  < 2e-16 ***
## TEAM_BATTING_2B     2.900e-02  2.389e-02   1.214 0.224900    
## TEAM_BATTING_3B     1.469e-01  4.915e-02   2.988 0.002841 ** 
## TEAM_BATTING_HR    -1.435e-01  8.102e-02  -1.771 0.076694 .  
## TEAM_BATTING_BB     8.640e-03  1.190e-02   0.726 0.467913    
## TEAM_BATTING_SO    -4.247e-02  1.023e-02  -4.150 3.44e-05 ***
## TEAM_BASERUN_SB     1.781e-02  4.255e-03   4.185 2.96e-05 ***
## TEAM_PITCHING_H     2.148e-02  5.526e-03   3.888 0.000104 ***
## TEAM_PITCHING_HR    1.471e-02  2.472e-02   0.595 0.551822    
## TEAM_PITCHING_BB   -1.705e-02  6.027e-03  -2.828 0.004719 ** 
## TEAM_PITCHING_SO   -2.931e-02  1.181e-02  -2.483 0.013115 *  
## TEAM_FIELDING_E    -1.727e-02  2.503e-03  -6.899 6.79e-12 ***
## TEAM_FIELDING_DP   -1.183e-01  1.367e-02  -8.655  < 2e-16 ***
## TEAM_BATTING_HS            NA         NA      NA       NA    
## TEAM_BATTING_TOB           NA         NA      NA       NA    
## TEAM_BATTING_OBP   -1.571e+03  2.873e+02  -5.470 5.00e-08 ***
## TEAM_BATTING_BAVG   1.587e+03  3.068e+02   5.174 2.49e-07 ***
## TEAM_BATTING_SLG   -1.005e+02  5.859e+01  -1.715 0.086468 .  
## TEAM_PITCHING_PEFF  1.611e+02  4.151e+01   3.880 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 2258 degrees of freedom
## Multiple R-squared:  0.3162, Adjusted R-squared:  0.311 
## F-statistic: 61.41 on 17 and 2258 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR, data = mbTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.898  -9.784   0.797  10.227  68.018 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     76.839461   0.636084 120.801  < 2e-16 ***
## TEAM_BATTING_HR  0.039399   0.005443   7.239 6.18e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.58 on 2274 degrees of freedom
## Multiple R-squared:  0.02252,    Adjusted R-squared:  0.02209 
## F-statistic:  52.4 on 1 and 2274 DF,  p-value: 6.176e-13

Let's plot hits against wins, as that field should have the highest correlation.

Plot Residuals