In this homework assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
The objective is to build a multiple linear regression model on the training data to predict the number of wins for a team. We can only use the variables provided (or variables that we derive from them).
The dataset consists of two data files: training and evaluation. The training dataset contains 17 columns, while the evaluation dataset contains 16. The evaluation dataset is missing column TARGET_WINS. We will start by exploring the training data set since it will be the one used to generate the regression model.
First we see that all data is numeric.
An important aspect of any dataset is to determine how much, if any, data is missing. We look at all the variables to see which if any have missing data. We look at the basic descriptive statistics as well as the missing data and their percentages:
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count | na_count_perc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_WINS | 2 | 191 | 80.92670 | 12.115013 | 82 | 81.11765 | 13.3434 | 43 | 116 | 73 | -0.1698314 | -0.2952783 | 0.8766116 | 0 | 0.0 |
| TEAM_BATTING_H | 3 | 191 | 1478.62827 | 76.147869 | 1477 | 1477.42484 | 74.1300 | 1308 | 1667 | 359 | 0.1302702 | -0.3710350 | 5.5098664 | 0 | 0.0 |
| TEAM_BATTING_2B | 4 | 191 | 297.19895 | 26.329335 | 296 | 296.62745 | 25.2042 | 201 | 373 | 172 | 0.0915189 | 0.4778716 | 1.9051238 | 0 | 0.0 |
| TEAM_BATTING_3B | 5 | 191 | 30.74346 | 9.043878 | 29 | 30.13072 | 8.8956 | 12 | 61 | 49 | 0.7007420 | 0.7446217 | 0.6543921 | 0 | 0.0 |
| TEAM_BATTING_HR | 6 | 191 | 178.05236 | 32.413243 | 175 | 176.81046 | 35.5824 | 116 | 260 | 144 | 0.2980673 | -0.7172373 | 2.3453399 | 0 | 0.0 |
| TEAM_BATTING_BB | 7 | 191 | 543.31937 | 74.842133 | 535 | 541.31373 | 74.1300 | 365 | 775 | 410 | 0.3115199 | -0.1474175 | 5.4153867 | 0 | 0.0 |
| TEAM_BATTING_SO | 8 | 191 | 1051.02618 | 104.156382 | 1050 | 1046.95425 | 97.8516 | 805 | 1399 | 594 | 0.3985050 | 0.3955105 | 7.5364913 | 102 | 4.5 |
| TEAM_BASERUN_SB | 9 | 191 | 90.90576 | 29.916401 | 87 | 89.06536 | 29.6520 | 31 | 177 | 146 | 0.5553966 | -0.1414909 | 2.1646748 | 131 | 5.8 |
| TEAM_BASERUN_CS | 10 | 191 | 39.94241 | 11.898334 | 38 | 39.49020 | 11.8608 | 12 | 74 | 62 | 0.3468509 | 0.0006392 | 0.8609332 | 772 | 33.9 |
| TEAM_BATTING_HBP | 11 | 191 | 59.35602 | 12.967123 | 58 | 58.86275 | 11.8608 | 29 | 95 | 66 | 0.3185754 | -0.1119828 | 0.9382681 | 2085 | 91.6 |
| TEAM_PITCHING_H | 12 | 191 | 1479.70157 | 75.788625 | 1480 | 1478.50327 | 72.6474 | 1312 | 1667 | 355 | 0.1279056 | -0.3894781 | 5.4838725 | 0 | 0.0 |
| TEAM_PITCHING_HR | 13 | 191 | 178.17801 | 32.391678 | 175 | 176.93464 | 35.5824 | 116 | 260 | 144 | 0.2989191 | -0.7190905 | 2.3437795 | 0 | 0.0 |
| TEAM_PITCHING_BB | 14 | 191 | 543.71728 | 74.916681 | 537 | 541.74510 | 72.6474 | 367 | 775 | 408 | 0.3144366 | -0.1338563 | 5.4207808 | 0 | 0.0 |
| TEAM_PITCHING_SO | 15 | 191 | 1051.81675 | 104.347208 | 1052 | 1047.80392 | 97.8516 | 805 | 1399 | 594 | 0.3945586 | 0.3903991 | 7.5502990 | 102 | 4.5 |
| TEAM_FIELDING_E | 16 | 191 | 107.05236 | 16.632162 | 106 | 106.58170 | 17.7912 | 65 | 145 | 80 | 0.1780432 | -0.3567367 | 1.2034610 | 0 | 0.0 |
| TEAM_FIELDING_DP | 17 | 191 | 152.33508 | 17.611682 | 152 | 152.04575 | 19.2738 | 113 | 204 | 91 | 0.2164822 | -0.2115741 | 1.2743366 | 286 | 12.6 |
From this result we can see how several variables have a number of missing values. The maximum number of missing values was 2085 in the TEAM_BATTING_HBP variable. This is a significant amount of missing data representing 91.6% of that data.
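The missing-value columns in the table above can be reproduced with a couple of base R calls. The following is a minimal sketch on a made-up data frame; the object name `df` and its values are assumptions for illustration, not the actual training data.

```r
# Toy data frame standing in for the training data (values are invented).
df <- data.frame(
  TARGET_WINS      = c(70, 85, 92, 64),
  TEAM_BATTING_SO  = c(900, NA, 1010, 870),
  TEAM_BATTING_HBP = c(NA, NA, 55, NA)
)

# Number of missing values per column, and the same as a percentage of rows.
na_count      <- colSums(is.na(df))
na_count_perc <- round(100 * na_count / nrow(df), 1)

na_summary <- data.frame(na_count, na_count_perc)
print(na_summary)
```

Sorting `na_summary` by `na_count_perc` makes candidates for removal, such as TEAM_BATTING_HBP, easy to spot.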
With missing data assessed, we can look into the descriptive statistics in more detail. Interestingly, we find that the difference between means and medians is fairly small for all data columns. The maximum difference is in fact only 4.77%. This means that we can expect the distributions of this data to be fairly symmetric. To visualize this, we plot a histogram for each variable.
The plot of distributions does show fairly symmetric data, but it also shows the potential presence of outliers in at least two of the predictors. Histograms are not the best way to visualize outliers, so we instead identify the predictors that seem to have outliers by looking at scatter plots and box plots. Four variables appear to have outliers: TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_PITCHING_BB and TEAM_FIELDING_E. We highlight these variables from the density plots, where most of the data is concentrated at the lower end of the scale and tails off toward high values.
Looking at the correlation of each variable with the number of wins provides some interesting data. Some correlations make sense given subject knowledge of baseball: for example, the number of hits, among other batting statistics, has a significant positive correlation with wins, while other statistics such as stolen bases, although still positively correlated, are not so strongly related. What is surprising, though, are the pitching statistics. We would assume that a team that allowed the opposing team more hits would lose more games (and win fewer), but that is not what the data shows us. Perhaps there are outliers swaying the correlation.
Regardless, we can use some of these correlations to drive initial models later, in terms of likely fields to choose for an effective model.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## [1,] 1 0.3887675 0.2891036 0.1426084
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## [1,] 0.1761532 0.2325599 NA NA
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## [1,] NA NA -0.1099371 0.1890137
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## [1,] 0.1241745 NA -0.1764848 NA
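The NA entries in the output above are what `cor()` returns by default when either variable contains a missing value. The sketch below, on invented vectors, shows that behavior and one common alternative, `use = "complete.obs"`. Whether the original analysis used this option is not stated, so treat it as an assumption.

```r
# Invented values: one strikeout figure is missing.
wins <- c(70, 85, 92, 64, 78)
so   <- c(900, NA, 1010, 870, 950)

# Default behavior: any NA in the inputs yields NA.
r_default <- cor(wins, so)

# Restrict the computation to rows where both values are present.
r_complete <- cor(wins, so, use = "complete.obs")

print(c(default = r_default, complete = r_complete))
```

Once the missing values are imputed during data preparation, the default call no longer returns NA.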
The first task under data preparation is to eliminate all missing data. In the Data Exploration section we found one variable, TEAM_BATTING_HBP, with an exceptionally high percentage of missing data, so we begin by eliminating this variable. We also remove the “INDEX” column, as it is not used.
Next task is to handle missing data in the other variables. Here, because the percentages of missing data are lower, we can replace missing data with the median. We prefer replacing with median instead of mean because the latter is more sensitive to outliers. So we get a clean dataset without missing values.
Note, we also consider zeros to be missing data. Since each row is a season of data for a given baseball team, it would be extraordinarily unlikely that any of these statistics would have zero as an actual value. Therefore we are assuming zero is another indicator of missing value and we will transform them into a median value.
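The imputation rule described above (zeros treated as missing, missing values replaced by the median) can be sketched as a small helper. The function name `impute_median` and the sample vector are hypothetical. Computing the median only from observed, non-zero values is an assumption about the intended order of operations.

```r
# Replace zeros and NAs in a numeric vector with the median of the
# remaining observed values.
impute_median <- function(x) {
  x[!is.na(x) & x == 0] <- NA               # treat zeros as missing
  x[is.na(x)] <- median(x, na.rm = TRUE)    # fill gaps with the median
  x
}

# Invented stolen-base counts with one zero and one NA.
sb <- c(50, 0, NA, 70, 120)
sb_clean <- impute_median(sb)
print(sb_clean)
```

Applied column by column (for example with `lapply`), this would yield a data frame without missing values.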
In the exploratory phase we also identified several variables with outliers. Outliers will be replaced with the median; again, we choose the median because it is less influenced by these outliers. The cut-off used to tag a reading as an outlier could be 3 standard deviations from the mean, or 1.5 times the interquartile range. In this case, however, the pitching variables have batting counterparts, as seen in the exploratory phase, so we base the cut-off on the maximum reading of those counterpart variables.
TEAM_PITCHING_H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1137 1419 1518 1562 1636 2544
From the summary we now see that the maximum is 2544, which is a much more reasonable number. The gap between mean and median is 44, about 3% of the median, which suggests a more symmetric distribution than before.
TEAM_PITCHING_SO
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 181.0 633.0 816.0 796.8 948.2 1399.0
From the summary we now see that the maximum is 1399, which is a much more reasonable number. The mean is about 19 below the median (roughly 2% of the median), which suggests a more symmetric distribution than before.
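The outlier treatment applied to these two variables can be illustrated with a short helper that replaces every value above a cut-off with the median. The name `replace_above`, the sample values, and the choice to compute the median from the non-outlying values only are assumptions made for this sketch.

```r
# Replace values above `cutoff` with the median of the values at or below it.
replace_above <- function(x, cutoff) {
  outlier <- !is.na(x) & x > cutoff
  x[outlier] <- median(x[!outlier], na.rm = TRUE)
  x
}

# Invented pitching-hits values with two implausible readings.
pitching_h <- c(1400, 1500, 1600, 9000, 12000)
pitching_h_clean <- replace_above(pitching_h, cutoff = 3000)
print(pitching_h_clean)
```

Calling `summary()` on the cleaned vector is a quick check that the maximum now falls below the chosen cut-off.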
New Variables
With a clean dataset, we can now start looking at what predictor variables can be combined and what new statistics can be derived.
Batting Hit Singles
On the batting side we can start by adding a variable for single hits, since the dataset has variables for total hits and for doubles, triples and home runs.
TEAM_BATTING_HS = TEAM_BATTING_H - (TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR)
TEAM_BATTING_HS Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 606 990 1050 1072 1129 2112
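As a minimal sketch, the singles variable can be derived with one vectorized expression. The data frame `team` and its two rows are invented; only the column names come from the dataset.

```r
# Two invented team-seasons.
team <- data.frame(
  TEAM_BATTING_H  = c(1400, 1500),
  TEAM_BATTING_2B = c(250, 300),
  TEAM_BATTING_3B = c(40, 30),
  TEAM_BATTING_HR = c(110, 150)
)

# Singles = all hits minus doubles, triples and home runs.
team$TEAM_BATTING_HS <- with(
  team,
  TEAM_BATTING_H - (TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR)
)

print(team$TEAM_BATTING_HS)
```

Because the new column is an exact linear combination of existing columns, it cannot be estimated separately in a model that also includes all four of them.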
There are other popular baseball statistics which are regularly calculated. The data given is limited, so it won’t be possible to make all these calculations. But we can use the data given to calculate some statistics that resemble some of the common baseball measurements.
The number of times a batter reaches base can be calculated as Times On Base:
Times On Base
TOB = Base Hits + Walks + Hits by Pitch
TOB = ( TEAM_BATTING_H - TEAM_BATTING_HR ) + TEAM_BATTING_BB + TEAM_BATTING_HBP
TOB = TEAM_BATTING_TOB
In our case we do not have TEAM_BATTING_HBP. We deleted this predictor since it didn’t contain enough data, so we will not include this term in calculating TOB.
TEAM_BATTING_TOB Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 788 1269 1338 1369 1442 2518
On Base Percentage
If we divide this statistic by the number of times a batter appears at the plate, we have a ratio for On Base Percentage. Plate appearances is not a statistic that was given, but we can assume it is similar to the number of times a batter reaches base (hits and walks) plus the number of strikeouts.
OBP = TOB / ( Base Hits + Walks + Hits by Pitch + Strikeouts )
OBP = TEAM_BATTING_TOB / ( ( TEAM_BATTING_H - TEAM_BATTING_HR ) + TEAM_BATTING_BB + TEAM_BATTING_HBP + TEAM_BATTING_SO )
OBP = TEAM_BATTING_OBP
As before, TEAM_BATTING_HBP is missing, so we do not include it.
TEAM_BATTING_OBP Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3801 0.4658 0.5122 0.5287 0.5743 0.9469
Batting Average
This statistic is calculated as the number of hits divided by the number of times at bat or at the plate. With our dataset we will compute times at bat as the sum of hits, walks and strikeouts, as we did for the previous calculated variable, since the number of Hits by Pitch is not available:
AVG = Hits / ( Hits + Walks + Strikeouts )
AVG = TEAM_BATTING_H / ( TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO )
AVG = TEAM_BATTING_BAVG
TEAM_BATTING_BAVG Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4131 0.4923 0.5290 0.5464 0.5846 0.9471
Slugging Percentage
A shortcoming of the previous statistic is that it weights any kind of hits equally. To account for the fact that some hits are more beneficial or carry higher weight we can calculate a slugging percentage by multiplying each kind of hit by an increasing number.
SLG = ( Single Hits + 2 * Double Hits + 3 * Triple Hits + 4 * Home Runs ) / ( Hits + Walks + Strikeouts )
TEAM_BATTING_SLG = ( ( TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR ) + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR ) / ( TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO )
TEAM_BATTING_SLG Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5861 0.7288 0.7731 0.7842 0.8291 1.2690
Strikeout Efficiency
Measures how successful a pitcher is at striking out batters:
PEFF = Strike Outs / ( Hits + Strike Outs )
TEAM_PITCHING_PEFF = TEAM_PITCHING_SO / ( TEAM_PITCHING_H + TEAM_PITCHING_SO )
TEAM_PITCHING_PEFF Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1065 0.2813 0.3503 0.3358 0.3944 0.5038
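The ratio statistics defined in this section can be computed together once the counts are available. The sketch below follows the formulas as written in the text (without the Hits by Pitch term) on a single invented team-season. The data frame `team` and the helper names `reach`, `at_plate` and `singles` are assumptions for illustration.

```r
# One invented team-season.
team <- data.frame(
  TEAM_BATTING_H = 1500, TEAM_BATTING_2B = 300, TEAM_BATTING_3B = 30,
  TEAM_BATTING_HR = 150, TEAM_BATTING_BB = 500, TEAM_BATTING_SO = 1000,
  TEAM_PITCHING_H = 1500, TEAM_PITCHING_SO = 1000
)

# Times on base, without the unavailable Hits by Pitch term.
reach <- (team$TEAM_BATTING_H - team$TEAM_BATTING_HR) + team$TEAM_BATTING_BB
# Approximate plate appearances: hits + walks + strikeouts.
at_plate <- team$TEAM_BATTING_H + team$TEAM_BATTING_BB + team$TEAM_BATTING_SO
# Singles: hits that are not doubles, triples or home runs.
singles <- team$TEAM_BATTING_H - team$TEAM_BATTING_2B -
  team$TEAM_BATTING_3B - team$TEAM_BATTING_HR

team$TEAM_BATTING_TOB  <- reach
team$TEAM_BATTING_OBP  <- reach / (reach + team$TEAM_BATTING_SO)
team$TEAM_BATTING_BAVG <- team$TEAM_BATTING_H / at_plate
team$TEAM_BATTING_SLG  <- (singles + 2 * team$TEAM_BATTING_2B +
  3 * team$TEAM_BATTING_3B + 4 * team$TEAM_BATTING_HR) / at_plate
team$TEAM_PITCHING_PEFF <- team$TEAM_PITCHING_SO /
  (team$TEAM_PITCHING_H + team$TEAM_PITCHING_SO)

print(round(team[, 9:13], 4))
```

All of these expressions are vectorized, so the same code works unchanged on a data frame with one row per team-season.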
Training and Test
Lastly, before we create models, let’s divide data into test and training sets, with 80% for training, 20% for test. This way we have a method to validate our models.
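One simple way to produce an 80/20 split is to sample row indices without replacement. This is a sketch on an invented ten-row data frame; the seed value and the object names are arbitrary assumptions, and the original analysis may have used a different splitting method.

```r
set.seed(42)  # arbitrary seed, used only to make the split reproducible

# Invented data: ten rows standing in for the cleaned dataset.
df <- data.frame(id = 1:10, TARGET_WINS = c(70, 85, 92, 64, 78, 81, 88, 59, 95, 73))

n         <- nrow(df)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- df[train_idx, ]   # 80% of the rows
test  <- df[-train_idx, ]  # the remaining 20%

print(c(train = nrow(train), test = nrow(test)))
```

The model outputs in this report are consistent with such a split: 1818 residual degrees of freedom with two coefficients implies 1820 training rows, and floor(0.8 × 2276) = 1820.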
Combine all batting variables.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.157 -8.688 0.679 9.599 45.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.472589 3.369651 5.185 2.4e-07 ***
## TEAM_BATTING_H 0.043178 0.002281 18.927 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.4 on 1818 degrees of freedom
## Multiple R-squared: 0.1646, Adjusted R-squared: 0.1642
## F-statistic: 358.2 on 1 and 1818 DF, p-value: < 2.2e-16
Combine all pitching variables.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H + TEAM_PITCHING_HR,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.379 -9.340 0.787 9.917 67.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.476611 2.691696 19.496 < 2e-16 ***
## TEAM_PITCHING_H 0.015148 0.001626 9.319 < 2e-16 ***
## TEAM_PITCHING_HR 0.045438 0.005863 7.750 1.51e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 1817 degrees of freedom
## Multiple R-squared: 0.06883, Adjusted R-squared: 0.06781
## F-statistic: 67.16 on 2 and 1817 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.111 -8.461 0.808 10.497 42.679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.86903 3.00594 12.60 <2e-16 ***
## HITS_NOHR 0.03144 0.00218 14.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.92 on 1818 degrees of freedom
## Multiple R-squared: 0.1027, Adjusted R-squared: 0.1022
## F-statistic: 208 on 1 and 1818 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ HITS_NOHR + TEAM_BATTING_BB + TEAM_FIELDING_E,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.256 -9.176 0.192 9.555 53.546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.419903 3.533612 3.232 0.00125 **
## HITS_NOHR 0.047475 0.002322 20.445 < 2e-16 ***
## TEAM_BATTING_BB 0.018140 0.003524 5.148 2.92e-07 ***
## TEAM_FIELDING_E -0.018679 0.002141 -8.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.94 on 1816 degrees of freedom
## Multiple R-squared: 0.2172, Adjusted R-squared: 0.2159
## F-statistic: 168 on 3 and 1816 DF, p-value: < 2.2e-16
## Warning in abline(hitsNoHR_bb_e_mod): only using the first two of 4
## regression coefficients
The best model in terms of residuals and R-squared uses hits, walks (BB), and fielding errors. The plots look good except for short tails in the QQ plot.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_FIELDING_E, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.767 -8.959 0.003 9.071 50.788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.277272 3.639218 1.175 0.240018
## TEAM_BATTING_H 0.050269 0.002305 21.813 < 2e-16 ***
## TEAM_BATTING_BB 0.012509 0.003507 3.567 0.000371 ***
## TEAM_FIELDING_E -0.014193 0.002022 -7.018 3.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.77 on 1816 degrees of freedom
## Multiple R-squared: 0.237, Adjusted R-squared: 0.2357
## F-statistic: 188 on 3 and 1816 DF, p-value: < 2.2e-16
## Warning in abline(hits_bb_e_mod): only using the first two of 4 regression
## coefficients
We attempted a Box-Cox transformation of the response, which did not improve R-squared (note that the residual standard error is not directly comparable, because the response is on a transformed scale). The QQ plot looks a bit better, as the negative quantiles are much closer to the line.
Box Cox
##
## Call:
## lm(formula = TARGET_WINS_BC ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_FIELDING_E)
##
## Residuals:
## Min 1Q Median 3Q Max
## -194.979 -38.736 -1.184 37.665 223.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.017547 15.331986 -4.175 3.12e-05 ***
## TEAM_BATTING_H 0.211530 0.009709 21.787 < 2e-16 ***
## TEAM_BATTING_BB 0.053146 0.014776 3.597 0.000331 ***
## TEAM_FIELDING_E -0.051346 0.008520 -6.026 2.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.99 on 1816 degrees of freedom
## Multiple R-squared: 0.2317, Adjusted R-squared: 0.2305
## F-statistic: 182.6 on 3 and 1816 DF, p-value: < 2.2e-16
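For readers who want to reproduce a Box-Cox fit like the one above, the sketch below uses `boxcox()` from the MASS package on simulated data. The text does not say which function produced TARGET_WINS_BC, so the use of MASS, the lambda grid, and all variable names here are assumptions. Note that Box-Cox requires a strictly positive response.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated data whose square root is linear in x, so lambda near 0.5 is expected.
d <- data.frame(x = runif(200, 1, 10))
d$y <- (2 + 0.5 * d$x + rnorm(200, sd = 0.2))^2

# y = TRUE keeps the response in the fit, which boxcox() needs.
fit <- lm(y ~ x, data = d, y = TRUE)

# Profile log-likelihood over a grid of lambda values; pick the maximizer.
bc     <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the transformation (log when lambda is effectively zero) and refit.
d$y_bc <- if (abs(lambda) < 1e-8) log(d$y) else (d$y^lambda - 1) / lambda
fit_bc <- lm(y_bc ~ x, data = d)

print(lambda)
```

Because the transformed response is on a different scale, only scale-free measures such as R-squared or the shape of the QQ plot should be compared between `fit` and `fit_bc`.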
| Metric | Batting Model | Pitching Model | BoxCox Model |
|---|---|---|---|
| RSE | 14.40 | 15.20 | 57.99 |
| R^2 | 0.1646 | 0.06883 | 0.2317 |
| Adj. R^2 | 0.1642 | 0.06781 | 0.2305 |
| F Stat. | 358.2 | 67.16 | 182.6 |
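The metrics in the comparison table can be pulled directly from `summary()` of each fitted model rather than copied by hand. The sketch below uses the built-in `mtcars` data purely as a stand-in; the helper name `model_metrics` and the two example models are hypothetical.

```r
# Extract the comparison metrics from a fitted lm object.
model_metrics <- function(fit) {
  s <- summary(fit)
  c(
    RSE   = s$sigma,                          # residual standard error
    R2    = s$r.squared,
    AdjR2 = s$adj.r.squared,
    Fstat = unname(s$fstatistic["value"])
  )
}

# Stand-in models on a built-in dataset.
fit_a <- lm(mpg ~ wt, data = mtcars)
fit_b <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- sapply(list(ModelA = fit_a, ModelB = fit_b), model_metrics)
print(round(comparison, 4))
```

The result is a matrix with one column per model, which maps directly onto the layout of the table above.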
Find counts of NAs. Notice that HBP and CS (hit by pitch and caught stealing) have very large numbers of missing values, so they probably should not be used. Knowing baseball, they probably have only a minor impact on wins regardless, so we remove those columns.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 0 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0 0
## TEAM_BATTING_HS TEAM_BATTING_TOB TEAM_BATTING_OBP
## 0 0 0
## TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
## 0 0 0
Convert NAs to -1 for now so we can test for zeros, since a zero for any of these stats is clearly inaccurate and equivalent to an NA.
The number of zero fields is not so bad, but let’s not replace them until we check for outliers and convert NAs back to -1.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 1 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 0 0 0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 0 0 0
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_HS
## 0 0 0
## TEAM_BATTING_TOB TEAM_BATTING_OBP TEAM_BATTING_BAVG
## 0 0 0
## TEAM_BATTING_SLG TEAM_PITCHING_PEFF
## 0 0
Running summary, we can see that there are some really big outliers (e.g., 19278 strikeouts in one year, which would be 119 per game). At an initial glance (we can always revisit), it looks like Pitching_Hits, Pitching_BB, Pitching_SO and Fielding Errors all have outliers that are not realistic values.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 8.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.29
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 3.00 Min. : 12.0 Min. : 66.0 Min. : 14.0
## 1st Qu.: 42.75 1st Qu.:451.0 1st Qu.: 562.0 1st Qu.: 67.0
## Median :103.00 Median :512.0 Median : 754.0 Median :101.0
## Mean :100.29 Mean :501.8 Mean : 743.1 Mean :123.5
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:151.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. :1137 Min. : 3.0 Min. : 119.0 Min. : 181.0
## 1st Qu.:1419 1st Qu.: 52.0 1st Qu.: 476.0 1st Qu.: 633.0
## Median :1518 Median :108.0 Median : 537.0 Median : 816.0
## Mean :1562 Mean :106.4 Mean : 553.2 Mean : 796.8
## 3rd Qu.:1636 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 948.2
## Max. :2544 Max. :343.0 Max. :3645.0 Max. :1399.0
## TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_HS TEAM_BATTING_TOB
## Min. : 65.0 Min. : 52.0 Min. : 606 Min. : 788
## 1st Qu.: 127.0 1st Qu.:134.0 1st Qu.: 990 1st Qu.:1269
## Median : 159.0 Median :149.0 Median :1050 Median :1338
## Mean : 246.5 Mean :146.7 Mean :1072 Mean :1369
## 3rd Qu.: 249.2 3rd Qu.:161.2 3rd Qu.:1129 3rd Qu.:1442
## Max. :1898.0 Max. :228.0 Max. :2112 Max. :2518
## TEAM_BATTING_OBP TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
## Min. :0.3801 Min. :0.4131 Min. :0.5861 Min. :0.1065
## 1st Qu.:0.4658 1st Qu.:0.4923 1st Qu.:0.7288 1st Qu.:0.2813
## Median :0.5122 Median :0.5290 Median :0.7731 Median :0.3503
## Mean :0.5287 Mean :0.5464 Mean :0.7842 Mean :0.3358
## 3rd Qu.:0.5743 3rd Qu.:0.5846 3rd Qu.:0.8291 3rd Qu.:0.3944
## Max. :0.9469 Max. :0.9471 Max. :1.2690 Max. :0.5038
## [1] 2544 2514 2498 2485 2477 2460
## [1] 3645 2876 2840 2396 2169 1750
## [1] 1399 1387 1386 1386 1385 1371
## [1] 1898 1890 1740 1728 1567 1553
Let’s plot to see where outliers are
There are a lot of rows with pitching hits greater than 3000. The maximum for the batting side was 2554 hits, so a limit of 3000 hits (about 18.5 per game) for a pitching team does not seem unreasonable as a maximum. We see 86 such rows; these should probably be tossed and replaced with the median (or perhaps the mean).
## [1] 0
Using the same idea for bases on balls, the batting maximum was 878, so a maximum of 1200 for pitching (about 7.4 per game) might be reasonable. There are 10 rows like this.
## [1] 10
For strikeouts, the batting maximum is 1399, so perhaps 1800 is a reasonable pitching maximum, and we have 9 rows like this.
## [1] 0
For fielding errors, there is no corresponding number to validate against (as there is when comparing pitching with batting), but if we use a plausible limit of about 10 per game (which is still really high), we can eliminate clear outliers.
## [1] 4
Now that we have identified them, let’s replace all these values.
Now let’s replace all zeros, plus the -1’s we had put in earlier for the NAs.
We will use median for those values.
Do some correlation
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## [1,] 0.3887675 0.2891036 0.1403293 0.1500749
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## [1,] 0.2239747 -0.05029994 0.1195729 0.1808751
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## [1,] 0.1622116 0.1831058 -0.08228224 -0.1310447
## TEAM_FIELDING_DP TEAM_BATTING_HS TEAM_BATTING_TOB TEAM_BATTING_OBP
## [1,] -0.0300863 0.2309099 0.3001907 0.04434842
## TEAM_BATTING_BAVG TEAM_BATTING_SLG TEAM_PITCHING_PEFF
## [1,] 0.06481931 0.1994892 -0.1364644
Do some basic plots to start looking at data:
Fit a couple of really simple models just to explore: one with everything, and one with just home runs against wins.
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = mbTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.791 -8.467 0.144 8.271 59.525
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.969e+00 2.934e+01 0.272 0.785923
## TEAM_BATTING_H 7.102e-02 8.184e-03 8.678 < 2e-16 ***
## TEAM_BATTING_2B 2.900e-02 2.389e-02 1.214 0.224900
## TEAM_BATTING_3B 1.469e-01 4.915e-02 2.988 0.002841 **
## TEAM_BATTING_HR -1.435e-01 8.102e-02 -1.771 0.076694 .
## TEAM_BATTING_BB 8.640e-03 1.190e-02 0.726 0.467913
## TEAM_BATTING_SO -4.247e-02 1.023e-02 -4.150 3.44e-05 ***
## TEAM_BASERUN_SB 1.781e-02 4.255e-03 4.185 2.96e-05 ***
## TEAM_PITCHING_H 2.148e-02 5.526e-03 3.888 0.000104 ***
## TEAM_PITCHING_HR 1.471e-02 2.472e-02 0.595 0.551822
## TEAM_PITCHING_BB -1.705e-02 6.027e-03 -2.828 0.004719 **
## TEAM_PITCHING_SO -2.931e-02 1.181e-02 -2.483 0.013115 *
## TEAM_FIELDING_E -1.727e-02 2.503e-03 -6.899 6.79e-12 ***
## TEAM_FIELDING_DP -1.183e-01 1.367e-02 -8.655 < 2e-16 ***
## TEAM_BATTING_HS NA NA NA NA
## TEAM_BATTING_TOB NA NA NA NA
## TEAM_BATTING_OBP -1.571e+03 2.873e+02 -5.470 5.00e-08 ***
## TEAM_BATTING_BAVG 1.587e+03 3.068e+02 5.174 2.49e-07 ***
## TEAM_BATTING_SLG -1.005e+02 5.859e+01 -1.715 0.086468 .
## TEAM_PITCHING_PEFF 1.611e+02 4.151e+01 3.880 0.000107 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.07 on 2258 degrees of freedom
## Multiple R-squared: 0.3162, Adjusted R-squared: 0.311
## F-statistic: 61.41 on 17 and 2258 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR, data = mbTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.898 -9.784 0.797 10.227 68.018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.839461 0.636084 120.801 < 2e-16 ***
## TEAM_BATTING_HR 0.039399 0.005443 7.239 6.18e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.58 on 2274 degrees of freedom
## Multiple R-squared: 0.02252, Adjusted R-squared: 0.02209
## F-statistic: 52.4 on 1 and 2274 DF, p-value: 6.176e-13
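The full-model output above reports two coefficients as "not defined because of singularities". This is what `lm()` does when a predictor is an exact linear combination of others, which is the case for derived variables such as TEAM_BATTING_HS. A minimal sketch on simulated data, with hypothetical column names:

```r
set.seed(7)

# Simulated hits and home runs, plus a response that depends on hits.
d <- data.frame(
  h  = rnorm(50, mean = 1450, sd = 80),
  hr = rnorm(50, mean = 100, sd = 30)
)
d$wins <- 10 + 0.05 * d$h + rnorm(50, sd = 5)

# Derived variable: an exact linear combination of two existing columns.
d$h_nohr <- d$h - d$hr

fit <- lm(wins ~ h + hr + h_nohr, data = d)

# The redundant term cannot be estimated and is reported as NA.
print(coef(fit))
```

Dropping either the derived variable or one of its components from the formula removes the NA and leaves the fitted values unchanged.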
Let’s plot hits against wins, as that field should have the highest correlation.
Plot Residuals