DATA EXPLORATION
The moneyball training set consists of statistics from professional baseball teams from 1871 to 2006. There are 2276 records, each containing the performance statistics of one team in one year. The statistics are adjusted to a season with 162 games.
The following is a list of the statistics that will be used to build a linear model to predict the number of wins:
- base hits by batters
- doubles hit by batters
- triples hit by batters
- homeruns hit by batters
- walks by batters
- batters hit by pitch
- strikeouts by batters
- stolen bases
- caught stealing bases
- errors
- double plays
- walks allowed
- hits allowed
- homeruns allowed
- strikeouts by pitchers
The following are summary statistics for each of the variables described above:
## base_hits doubles_hit triples_hit homeruns_hit
## Min. : 891 Min. : 69.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00
## Median :1454 Median :238.0 Median : 47.00 Median :102.00
## Mean :1469 Mean :241.2 Mean : 55.25 Mean : 99.61
## 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00
## Max. :2554 Max. :458.0 Max. :223.00 Max. :264.00
##
## walks_by_batters strikeouts_at_bat stolen_bases caught_stealing_bases
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0 1st Qu.: 38.0
## Median :512.0 Median : 750.0 Median :101.0 Median : 49.0
## Mean :501.6 Mean : 735.6 Mean :124.8 Mean : 52.8
## 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0 3rd Qu.: 62.0
## Max. :878.0 Max. :1399.0 Max. :697.0 Max. :201.0
## NA's :102 NA's :131 NA's :772
## batters_hit_by_pitch hits_allowed homeruns_allowed walks_allowed
## Min. :29.00 Min. : 1137 Min. : 0.0 Min. : 0.0
## 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0
## Median :58.00 Median : 1518 Median :107.0 Median : 536.5
## Mean :59.36 Mean : 1779 Mean :105.7 Mean : 553.0
## 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0
## Max. :95.00 Max. :30132 Max. :343.0 Max. :3645.0
## NA's :2085
## strikeouts_by_pitchers errors double_plays
## Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 813.5 Median : 159.0 Median :149.0
## Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
Hits Allowed
The mean is larger than the median for hits allowed and the distribution is skewed to the right, which can be seen in the histogram below. In addition, the maximum number of hits allowed is 30,132, which is about 17 times greater than the mean. The boxplot below displays a large number of outliers.
Errors
The mean is larger than the median for errors and the distribution is skewed to the right, which can be seen in the histogram below. In addition, the maximum number of errors is 1898, which is about 7.7 times greater than the mean. The boxplot below displays a large number of outliers.
Strikeouts By Pitchers
The mean is larger than the median for strikeouts by pitchers and the distribution is skewed to the right, which can be seen in the histogram below. In addition, the maximum number of strikeouts by pitchers is 19,278, which is about 23.6 times greater than the mean. The boxplot below displays the outliers.
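The max-to-mean ratios quoted for these three variables can be rechecked with a few lines of R. This is only a sketch: the numbers are copied by hand from the summary table above, and the data frame name `stats` is a made-up label.

```r
# Max and mean values copied from the summary statistics above.
stats <- data.frame(
  variable = c("hits_allowed", "errors", "strikeouts_by_pitchers"),
  max      = c(30132, 1898, 19278),
  mean     = c(1779, 246.5, 817.7)
)

# How many times larger the maximum is than the mean, to one decimal place.
stats$max_over_mean <- round(stats$max / stats$mean, 1)
print(stats)
```

A ratio this far above 1, together with a mean above the median, is a quick numeric signal of the right skew that the histograms show.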
Missing Data
The following variables have missing values:
- Strikeouts at Bat (102 missing data points)
- Batters hit by pitch (2085 missing data points)
- Caught stealing (772 missing data points)
- Stolen bases (131 missing data points)
- Double Plays (286 missing data points)
- Hits allowed (41 missing data points)
- Strikeouts by pitchers (107 missing data points)
- Errors (46 missing data points)
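Counts like the ones listed above can be obtained in one line of base R. The sketch below uses a small made-up data frame (`toy`) in place of the real data; only the column names are borrowed from this report.

```r
# Toy data standing in for the real data set; the values are invented.
toy <- data.frame(
  stolen_bases = c(101, NA, 66, 156),
  double_plays = c(149, 131, NA, NA),
  errors       = c(159, 127, 249, 180)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
missing_counts <- colSums(is.na(toy))

# Show only the variables that actually have missing values.
print(missing_counts[missing_counts > 0])
```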
DATA PREPARATION
Hits Allowed
These outliers reflect an issue with the data, as 30,132 hits allowed per season is not possible. That would mean that a team averaged 186 hits against them per game! The most ever hit in a game is 33. Because outliers this extreme would distort a linear regression model, I will remove all hits allowed values above 5000.
Errors
Because outliers this extreme would distort a linear regression model, I will remove all errors values above 1000.
Strikeouts By Pitchers
These outliers reflect an issue with the data, as 19,278 strikeouts by pitchers in one season is not possible. That would mean that a team averaged 119 strikeouts per game! Because outliers this extreme would distort a linear regression model, I will remove all strikeouts by pitchers values above 3000.
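The per-game arithmetic and the three cutoffs can be sketched as follows. The report does not say whether "remove" means dropping the row or blanking the value, so this sketch assumes the value is set to NA and the row is kept. The `toy` data frame and the `cutoffs` vector are illustrative names, not the original code.

```r
# Toy data containing one impossible value per column (taken from the summary maxima).
toy <- data.frame(
  hits_allowed           = c(1419, 1518, 30132, 1682),
  errors                 = c(127, 1898, 159, 249),
  strikeouts_by_pitchers = c(615, 813, 968, 19278)
)

# Sanity check: 30,132 hits spread over a 162-game season.
hits_per_game <- max(toy$hits_allowed) / 162

# Cutoffs stated in the text above.
cutoffs <- c(hits_allowed = 5000, errors = 1000, strikeouts_by_pitchers = 3000)

# Assumption: values above the cutoff become NA rather than the row being deleted.
cleaned <- toy
for (col in names(cutoffs)) {
  too_big <- !is.na(cleaned[[col]]) & cleaned[[col]] > cutoffs[[col]]
  cleaned[[col]][too_big] <- NA
}
print(cleaned)
```

Setting the values to NA keeps the other statistics in those records available to the model, at the cost of needing an imputation step afterwards.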
Missing Data
Because there is such a high number of missing values for batters hit by pitch, I will remove this variable from the model. I do not think it would be helpful to impute values since there are so many missing values, and removing all records that have missing values for that field would leave the data set very small. I will set the other missing values to be equal to the median for that variable. I am choosing the median and not the mean so that the value is less affected by outliers.
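A minimal sketch of this step, assuming a data frame with the same column names as the report (the values and the helper name `impute_median` are made up for illustration):

```r
# Toy data; the values are invented.
toy <- data.frame(
  stolen_bases         = c(66, 101, NA, 697),
  double_plays         = c(131, NA, 149, 164),
  batters_hit_by_pitch = c(NA, NA, NA, 58)
)

# Drop the variable that is mostly missing.
toy$batters_hit_by_pitch <- NULL

# Replace each NA with the median of the observed values in that column.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
imputed <- as.data.frame(lapply(toy, impute_median))
print(imputed)
```

In the toy `stolen_bases` column the observed values are 66, 101 and 697, so the median is 101 while the mean is 288. The single extreme value pulls the mean far more than the median, which is the reason given above for preferring the median.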
Correlation of Variables
The following are the correlation values between each of the variables. The closer the correlation is to 1 or -1, the more highly correlated the variables.
## base_hits doubles_hit triples_hit homeruns_hit
## base_hits 1.0000000000 0.56284968 0.427696575 -0.006544685
## doubles_hit 0.5628496778 1.00000000 -0.107305824 0.435397293
## triples_hit 0.4276965751 -0.10730582 1.000000000 -0.635566946
## homeruns_hit -0.0065446845 0.43539729 -0.635566946 1.000000000
## walks_by_batters -0.0724640128 0.25572610 -0.287235841 0.513734810
## strikeouts_at_bat -0.4526861592 0.15173438 -0.655709613 0.693007648
## stolen_bases 0.1078237673 -0.18340432 0.485740156 -0.406889074
## caught_stealing_bases 0.0008261984 -0.04584955 0.136181182 -0.225458666
## hits_allowed 0.4168663020 0.07090646 0.390189788 -0.310363264
## homeruns_allowed 0.0728531193 0.45455082 -0.567836679 0.969371396
## walks_allowed 0.0941930273 0.17805420 -0.002224148 0.136927564
## strikeouts_by_pitchers -0.3690341239 0.10402157 -0.484234492 0.500529828
## errors 0.1581345432 -0.23624431 0.590984739 -0.599371756
## double_plays 0.1248087998 0.25696798 -0.227771884 0.391652434
## walks_by_batters strikeouts_at_bat stolen_bases
## base_hits -0.07246401 -0.45268616 0.10782377
## doubles_hit 0.25572610 0.15173438 -0.18340432
## triples_hit -0.28723584 -0.65570961 0.48574016
## homeruns_hit 0.51373481 0.69300765 -0.40688907
## walks_by_batters 1.00000000 0.37148892 -0.04268402
## strikeouts_at_bat 0.37148892 1.00000000 -0.21178758
## stolen_bases -0.04268402 -0.21178758 1.00000000
## caught_stealing_bases -0.04581766 -0.10250193 0.23324171
## hits_allowed -0.43118480 -0.43626075 0.13302827
## homeruns_allowed 0.45955207 0.63286033 -0.38005624
## walks_allowed 0.48936126 0.03498809 0.12928969
## strikeouts_by_pitchers 0.12691827 0.84559720 -0.14166094
## errors -0.47291113 -0.46665075 0.46028697
## double_plays 0.32963974 0.11089804 -0.27023400
## caught_stealing_bases hits_allowed homeruns_allowed
## base_hits 0.0008261984 0.41686630 0.07285312
## doubles_hit -0.0458495544 0.07090646 0.45455082
## triples_hit 0.1361811823 0.39018979 -0.56783668
## homeruns_hit -0.2254586663 -0.31036326 0.96937140
## walks_by_batters -0.0458176601 -0.43118480 0.45955207
## strikeouts_at_bat -0.1025019312 -0.43626075 0.63286033
## stolen_bases 0.2332417104 0.13302827 -0.38005624
## caught_stealing_bases 1.0000000000 -0.01546387 -0.22818525
## hits_allowed -0.0154638734 1.00000000 -0.21913286
## homeruns_allowed -0.2281852483 -0.21913286 1.00000000
## walks_allowed -0.0472289272 -0.06517793 0.22193750
## strikeouts_by_pitchers -0.1131632877 -0.10786921 0.52877504
## errors 0.0119941153 0.58140536 -0.54169911
## double_plays -0.1020021365 -0.16433848 0.38959550
## walks_allowed strikeouts_by_pitchers errors
## base_hits 0.094193027 -0.36903412 0.15813454
## doubles_hit 0.178054204 0.10402157 -0.23624431
## triples_hit -0.002224148 -0.48423449 0.59098474
## homeruns_hit 0.136927564 0.50052983 -0.59937176
## walks_by_batters 0.489361263 0.12691827 -0.47291113
## strikeouts_at_bat 0.034988093 0.84559720 -0.46665075
## stolen_bases 0.129289686 -0.14166094 0.46028697
## caught_stealing_bases -0.047228927 -0.11316329 0.01199412
## hits_allowed -0.065177930 -0.10786921 0.58140536
## homeruns_allowed 0.221937505 0.52877504 -0.54169911
## walks_allowed 1.000000000 0.10899535 -0.06247060
## strikeouts_by_pitchers 0.108995347 1.00000000 -0.18945657
## errors -0.062470604 -0.18945657 1.00000000
## double_plays 0.192348657 0.01453383 -0.27955636
## double_plays
## base_hits 0.12480880
## doubles_hit 0.25696798
## triples_hit -0.22777188
## homeruns_hit 0.39165243
## walks_by_batters 0.32963974
## strikeouts_at_bat 0.11089804
## stolen_bases -0.27023400
## caught_stealing_bases -0.10200214
## hits_allowed -0.16433848
## homeruns_allowed 0.38959550
## walks_allowed 0.19234866
## strikeouts_by_pitchers 0.01453383
## errors -0.27955636
## double_plays 1.00000000
There is a positive correlation between base hits and doubles, doubles and homeruns hit, triples and errors, homeruns hit and walks by batters, homeruns hit and strikeouts at bat, homeruns hit and homeruns allowed, strikeouts at bat and homeruns allowed, strikeouts at bat and strikeouts by pitchers, hits allowed and errors, and homeruns allowed and strikeouts by pitchers.
There is a negative correlation between triples hit and strikeouts at bat, triples hit and homeruns hit, triples hit and homeruns allowed, homeruns hit and stolen bases, homeruns hit and errors, and homeruns allowed and errors.
The most highly correlated pairs are homeruns hit and homeruns allowed (0.97) and strikeouts at bat and strikeouts by pitchers (0.85). This is a surprise, as I would not have expected the number of homeruns a team hit to be related to the number of homeruns a team allowed. Nor would I have expected the number of strikeouts at bat to be related to the number of strikeouts by pitchers for the same team.
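Reading the strongest pairs off a large matrix by eye is error-prone, so it can help to flatten the matrix into a ranked list. The sketch below uses simulated data in which `homeruns_allowed` is deliberately built to track `homeruns_hit`. The data and the name `ranked` are assumptions for illustration only.

```r
set.seed(1)
n <- 200
homeruns_hit <- rnorm(n, mean = 100, sd = 30)
toy <- data.frame(
  homeruns_hit     = homeruns_hit,
  homeruns_allowed = homeruns_hit + rnorm(n, sd = 5),  # constructed to correlate
  stolen_bases     = rnorm(n, mean = 120, sd = 40)     # independent of the others
)

# Correlation matrix; pairwise.complete.obs tolerates NAs if any are present.
cm <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(cm), arr.ind = TRUE, useNames = FALSE)
ranked <- data.frame(
  var1 = rownames(cm)[idx[, 1]],
  var2 = colnames(cm)[idx[, 2]],
  r    = cm[idx],
  row.names = NULL
)
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```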
To address these correlations, I will combine homeruns hit and homeruns allowed by adding their values. In this way, the influence of these variables will be included only once in the regression model.
I will combine the variables for strikeouts at bat and strikeouts by pitchers in the same way. After creating the new values, I will remove the original columns in the data table.
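A sketch of the combination step is shown here. The new column names `homeruns` and `strikeouts` match the predictor names that appear in the regression output later in this report; the toy values are invented.

```r
# Toy data with the four original columns; values are invented.
toy <- data.frame(
  homeruns_hit           = c(102, 42),
  homeruns_allowed       = c(107, 50),
  strikeouts_at_bat      = c(750, 548),
  strikeouts_by_pitchers = c(813, 615)
)

# Add each correlated pair into a single combined variable.
toy$homeruns   <- toy$homeruns_hit + toy$homeruns_allowed
toy$strikeouts <- toy$strikeouts_at_bat + toy$strikeouts_by_pitchers

# Remove the original columns so each effect enters the model only once.
toy[c("homeruns_hit", "homeruns_allowed",
      "strikeouts_at_bat", "strikeouts_by_pitchers")] <- NULL
print(toy)
```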
Build Models
Breaking the data set into a training set and a testing set
The data is shuffled randomly. 60% of the data is in the training set and 40% of the data is in the testing set.
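One way to make such a split is to shuffle the row indices and cut them at 60%. This is a sketch only: the seed is arbitrary and the index names are made up. With 2276 records, a 60% share rounds to 1366 rows, which agrees with the model output below (1353 residual degrees of freedom plus 13 estimated coefficients).

```r
set.seed(42)                 # arbitrary seed so the shuffle is repeatable
n <- 2276                    # number of records in the data set

shuffled  <- sample(n)       # random permutation of the row indices 1..n
n_train   <- round(0.6 * n)  # size of the 60% training share
train_idx <- shuffled[seq_len(n_train)]
test_idx  <- shuffled[-seq_len(n_train)]

c(train = length(train_idx), test = length(test_idx))
```

The indices would then be used as `moneyball_df[train_idx, ]` and `moneyball_df[test_idx, ]`. Changing the seed and repeating the split gives a different training set, which is the idea behind refitting the model on a reshuffled sample.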
Backward Elimination - Linear Regression Model
A linear regression model will be built using backward elimination. Initially all of the variables will be present, and then they will be removed one at a time. The variable with the highest p value, which shows the weakest evidence of an effect on wins, will be eliminated first. Variables will be removed until every predictor has a p value below 0.05.
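The procedure described above can be written as a short loop. This is a sketch under stated assumptions, not the code used to produce the output in this report: the function name `backward_eliminate` and the simulated data are hypothetical, and the loop applies the stopping rule (all p values below 0.05) mechanically.

```r
# Refit, drop the predictor with the largest p value, and repeat
# until every remaining predictor has a p value below alpha.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p values for the predictors only (row 1 is the intercept).
    p <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    if (max(p) < alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(which.max(p)))
  }
}

# Simulated data: two real effects and one pure-noise predictor.
set.seed(7)
n <- 300
toy <- data.frame(
  base_hits        = rnorm(n, 1450, 100),
  walks_by_batters = rnorm(n, 500, 80),
  noise            = rnorm(n)
)
toy$TARGET_WINS <- 0.05 * toy$base_hits + 0.05 * toy$walks_by_batters +
  rnorm(n, sd = 5)

final_fit <- backward_eliminate(toy, "TARGET_WINS")
print(summary(final_fit)$coefficients)
```

Doing the elimination by hand, as in the rest of this section, shows each intermediate model. The loop is just a compact way to repeat the same steps on a reshuffled training set.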
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.725 -8.242 0.025 8.510 50.995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.3760589 6.7813968 -0.498 0.61868
## base_hits 0.0501055 0.0049514 10.119 < 2e-16 ***
## doubles_hit -0.0288531 0.0119762 -2.409 0.01612 *
## triples_hit 0.0554820 0.0218123 2.544 0.01108 *
## walks_by_batters 0.0566052 0.0046477 12.179 < 2e-16 ***
## stolen_bases 0.0151043 0.0057817 2.612 0.00909 **
## caught_stealing_bases 0.0145006 0.0196246 0.739 0.46010
## hits_allowed 0.0017314 0.0012509 1.384 0.16657
## walks_allowed -0.0205572 0.0031911 -6.442 1.63e-10 ***
## errors 0.0037014 0.0037091 0.998 0.31850
## double_plays -0.1074706 0.0170759 -6.294 4.18e-10 ***
## homeruns 0.0304119 0.0059930 5.075 4.42e-07 ***
## strikeouts 0.0004875 0.0013014 0.375 0.70803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 1353 degrees of freedom
## Multiple R-squared: 0.3172, Adjusted R-squared: 0.3112
## F-statistic: 52.38 on 12 and 1353 DF, p-value: < 2.2e-16
Strikeouts has the highest p value, and therefore the weakest evidence of an effect on wins, so it will be removed next.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + caught_stealing_bases +
## hits_allowed + walks_allowed + errors + double_plays + homeruns,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.926 -8.168 0.015 8.531 51.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.634379 4.934814 -0.331 0.7405
## base_hits 0.049156 0.004253 11.558 < 2e-16 ***
## doubles_hit -0.027588 0.011486 -2.402 0.0164 *
## triples_hit 0.054606 0.021680 2.519 0.0119 *
## walks_by_batters 0.056389 0.004610 12.231 < 2e-16 ***
## stolen_bases 0.015507 0.005679 2.731 0.0064 **
## caught_stealing_bases 0.014778 0.019604 0.754 0.4511
## hits_allowed 0.001745 0.001250 1.396 0.1629
## walks_allowed -0.020505 0.003187 -6.434 1.72e-10 ***
## errors 0.003865 0.003682 1.050 0.2941
## double_plays -0.108657 0.016774 -6.478 1.30e-10 ***
## homeruns 0.031759 0.004792 6.628 4.90e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 1354 degrees of freedom
## Multiple R-squared: 0.3171, Adjusted R-squared: 0.3116
## F-statistic: 57.17 on 11 and 1354 DF, p-value: < 2.2e-16
Caught stealing bases now has the highest p value and will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + hits_allowed +
## walks_allowed + errors + double_plays + homeruns, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.166 -8.264 0.091 8.650 51.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.730250 4.786084 -0.153 0.87875
## base_hits 0.049018 0.004248 11.539 < 2e-16 ***
## doubles_hit -0.026937 0.011452 -2.352 0.01881 *
## triples_hit 0.055477 0.021645 2.563 0.01048 *
## walks_by_batters 0.056497 0.004607 12.263 < 2e-16 ***
## stolen_bases 0.016536 0.005512 3.000 0.00275 **
## hits_allowed 0.001774 0.001249 1.420 0.15573
## walks_allowed -0.020730 0.003173 -6.534 9.03e-11 ***
## errors 0.003313 0.003608 0.918 0.35864
## double_plays -0.108603 0.016772 -6.475 1.32e-10 ***
## homeruns 0.031118 0.004715 6.600 5.88e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 1355 degrees of freedom
## Multiple R-squared: 0.3169, Adjusted R-squared: 0.3118
## F-statistic: 62.85 on 10 and 1355 DF, p-value: < 2.2e-16
Errors now has the highest p value and will be removed next.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + hits_allowed +
## walks_allowed + double_plays + homeruns, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.301 -8.171 0.101 8.560 52.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.224937 4.754069 -0.047 0.962269
## base_hits 0.048509 0.004212 11.518 < 2e-16 ***
## doubles_hit -0.027239 0.011447 -2.380 0.017465 *
## triples_hit 0.060894 0.020825 2.924 0.003512 **
## walks_by_batters 0.056017 0.004577 12.238 < 2e-16 ***
## stolen_bases 0.018273 0.005177 3.530 0.000429 ***
## hits_allowed 0.002275 0.001124 2.024 0.043187 *
## walks_allowed -0.020835 0.003170 -6.572 7.06e-11 ***
## double_plays -0.107625 0.016737 -6.430 1.76e-10 ***
## homeruns 0.030500 0.004666 6.536 8.90e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 1356 degrees of freedom
## Multiple R-squared: 0.3164, Adjusted R-squared: 0.3119
## F-statistic: 69.74 on 9 and 1356 DF, p-value: < 2.2e-16
All of the remaining variables have p values lower than 0.05, which indicates that they are significant. A p value below 0.05 indicates strong evidence against the null hypothesis that the variable in question does not affect wins.
The adjusted R squared value is .3119. 31.19% of the variability in wins is accounted for by the model.
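The adjusted R squared in the final summary output above can be reproduced from the other numbers printed there, which is a useful check when copying values into the text. The sketch uses the standard formula, adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1), with the multiple R squared (0.3164), the residual degrees of freedom (1356) and the number of predictors (9) taken from that output. The variable names are made up.

```r
r2       <- 0.3164   # Multiple R-squared of the final model
df_resid <- 1356     # residual degrees of freedom, equal to n - p - 1
p        <- 9        # number of predictors left in the model

n      <- df_resid + p + 1                    # rows in the training set
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid   # adjusted R-squared
round(adj_r2, 4)
```

The result agrees with the value printed on the `Adjusted R-squared` line of the final model.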
The F statistic is 69.74, which is high, and further indicates that these variables are significant. The following graphs look for patterns in the residuals.
The plot of the residuals vs. the fitted values shows no pattern, indicating that the variability in the residuals is nearly constant. The residuals mostly follow the line on the QQ plot, indicating that they are nearly normal.
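Diagnostic plots of this kind can be drawn directly from a fitted `lm` object. The sketch below fits a one-predictor model to simulated data purely to have something to plot. The data are invented and this is not the model from this report.

```r
# Simulated data with a linear relationship and constant-variance noise.
set.seed(3)
toy <- data.frame(base_hits = rnorm(200, 1450, 100))
toy$TARGET_WINS <- 10 + 0.05 * toy$base_hits + rnorm(200, sd = 8)

fit <- lm(TARGET_WINS ~ base_hits, data = toy)

# Two panels side by side: residuals vs. fitted values, and the normal Q-Q plot.
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 2)
par(op)
```

A funnel or curve in the first panel would suggest non-constant variance or a missing term. Points drifting away from the line in the second panel would suggest residuals that are not normal.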
Backward Elimination - Using a different training set by shuffling the data again
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ ., data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.595 -7.988 0.243 8.286 45.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0651145 6.7664401 1.044 0.296607
## base_hits 0.0394944 0.0045883 8.608 < 2e-16 ***
## doubles_hit -0.0223176 0.0112770 -1.979 0.048014 *
## triples_hit 0.0735854 0.0210341 3.498 0.000483 ***
## walks_by_batters 0.0499024 0.0044370 11.247 < 2e-16 ***
## stolen_bases 0.0298152 0.0055956 5.328 1.16e-07 ***
## caught_stealing_bases 0.0179297 0.0197616 0.907 0.364409
## hits_allowed 0.0070161 0.0012732 5.511 4.28e-08 ***
## walks_allowed -0.0186313 0.0029102 -6.402 2.11e-10 ***
## errors -0.0112403 0.0038088 -2.951 0.003221 **
## double_plays -0.1178421 0.0166536 -7.076 2.37e-12 ***
## homeruns 0.0341264 0.0060112 5.677 1.67e-08 ***
## strikeouts -0.0005672 0.0013583 -0.418 0.676301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 1353 degrees of freedom
## Multiple R-squared: 0.3193, Adjusted R-squared: 0.3133
## F-statistic: 52.89 on 12 and 1353 DF, p-value: < 2.2e-16
Strikeouts has the highest p value, and therefore the weakest evidence of an effect on wins, so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + caught_stealing_bases +
## hits_allowed + walks_allowed + errors + double_plays + homeruns,
## data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.603 -7.987 0.254 8.318 45.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.086723 4.829802 1.053 0.292440
## base_hits 0.040517 0.003879 10.444 < 2e-16 ***
## doubles_hit -0.023653 0.010811 -2.188 0.028852 *
## triples_hit 0.074710 0.020855 3.582 0.000352 ***
## walks_by_batters 0.050223 0.004369 11.497 < 2e-16 ***
## stolen_bases 0.029311 0.005462 5.366 9.44e-08 ***
## caught_stealing_bases 0.017640 0.019743 0.893 0.371776
## hits_allowed 0.006997 0.001272 5.501 4.52e-08 ***
## walks_allowed -0.018688 0.002906 -6.430 1.76e-10 ***
## errors -0.011392 0.003790 -3.005 0.002702 **
## double_plays -0.116546 0.016357 -7.125 1.68e-12 ***
## homeruns 0.032599 0.004769 6.835 1.24e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.89 on 1354 degrees of freedom
## Multiple R-squared: 0.3192, Adjusted R-squared: 0.3137
## F-statistic: 57.71 on 11 and 1354 DF, p-value: < 2.2e-16
Caught stealing bases now has the highest p value and will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + hits_allowed +
## walks_allowed + errors + double_plays + homeruns, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.915 -8.101 0.238 8.322 45.651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.140513 4.683224 1.311 0.190022
## base_hits 0.040392 0.003876 10.420 < 2e-16 ***
## doubles_hit -0.023030 0.010788 -2.135 0.032955 *
## triples_hit 0.074971 0.020851 3.596 0.000335 ***
## walks_by_batters 0.050224 0.004368 11.498 < 2e-16 ***
## stolen_bases 0.030450 0.005310 5.734 1.21e-08 ***
## hits_allowed 0.007064 0.001270 5.564 3.17e-08 ***
## walks_allowed -0.018874 0.002898 -6.512 1.04e-10 ***
## errors -0.012042 0.003720 -3.237 0.001236 **
## double_plays -0.116302 0.016353 -7.112 1.85e-12 ***
## homeruns 0.031783 0.004681 6.790 1.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.89 on 1355 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3138
## F-statistic: 63.41 on 10 and 1355 DF, p-value: < 2.2e-16
Doubles hit has the highest remaining p value (0.033, which is already below the 0.05 threshold) and will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + triples_hit +
## walks_by_batters + stolen_bases + hits_allowed + walks_allowed +
## errors + double_plays + homeruns, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.226 -8.163 0.217 8.645 45.811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.215473 4.587258 1.791 0.07353 .
## base_hits 0.035794 0.003227 11.091 < 2e-16 ***
## triples_hit 0.079039 0.020791 3.802 0.00015 ***
## walks_by_batters 0.048516 0.004300 11.283 < 2e-16 ***
## stolen_bases 0.031134 0.005308 5.866 5.61e-09 ***
## hits_allowed 0.006897 0.001269 5.435 6.49e-08 ***
## walks_allowed -0.018304 0.002890 -6.334 3.25e-10 ***
## errors -0.011620 0.003719 -3.124 0.00182 **
## double_plays -0.116127 0.016374 -7.092 2.12e-12 ***
## homeruns 0.029464 0.004559 6.463 1.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.91 on 1356 degrees of freedom
## Multiple R-squared: 0.3165, Adjusted R-squared: 0.312
## F-statistic: 69.77 on 9 and 1356 DF, p-value: < 2.2e-16
All of the remaining variables have p values lower than 0.05, which indicates that they are significant. A p value below 0.05 indicates strong evidence against the null hypothesis that the variable in question does not affect wins.
The adjusted R squared value is .312. 31.2% of the variability in wins is accounted for by the model.
The F statistic is 69.77, which is high, and further indicates that these variables are significant.
Although the first and second models have very similar adjusted R squared values, they differ in their predictors: the first model includes doubles hit and removes errors, while the second model removes doubles hit and includes errors.
The following graphs look for patterns in the residuals.
The plot of the residuals vs. the fitted values shows no pattern, indicating that the variability in the residuals is nearly constant. The residuals mostly follow the line on the QQ plot, indicating that they are nearly normal.
Building a Logarithmic model - Using backward elimination
In order to build a logarithmic model, every zero value must be replaced with a non-zero value, because the logarithm of zero is undefined. I imputed 0.001 for every zero value in the data set.
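A small sketch of the replacement and the transform, using an invented vector of win totals (`wins` and `wins_adj` are illustrative names):

```r
# Invented win totals, including one zero.
wins <- c(0, 68, 81, 95)

# Replace zeros with a small positive constant before taking logs.
wins_adj <- ifelse(wins == 0, 0.001, wins)
log_wins <- log(wins_adj)

round(log_wins, 2)
```

Note that log(0.001) is about -6.91 while the logs of typical win totals are between 4 and 5, so any record that had a zero ends up far from the rest on the log scale. That is one possible reason for very large negative residuals in a log model, offered here as a hedged observation rather than a diagnosis of this data.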
##
## Call:
## lm(formula = log(`moneyball_df$TARGET_WINS`) ~ ., data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6455 -0.1119 0.0158 0.1237 0.8514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.972e+00 1.698e-01 11.617 < 2e-16 ***
## base_hits 1.325e-03 1.240e-04 10.689 < 2e-16 ***
## doubles_hit -1.169e-03 2.998e-04 -3.898 0.000102 ***
## triples_hit 1.462e-03 5.460e-04 2.677 0.007515 **
## walks_by_batters 1.314e-03 1.163e-04 11.298 < 2e-16 ***
## stolen_bases -8.361e-05 1.447e-04 -0.578 0.563603
## caught_stealing_bases 1.596e-03 4.913e-04 3.250 0.001184 **
## hits_allowed -1.375e-05 3.132e-05 -0.439 0.660685
## walks_allowed -3.251e-04 7.988e-05 -4.070 4.98e-05 ***
## errors 1.447e-04 9.285e-05 1.559 0.119315
## double_plays -1.738e-03 4.275e-04 -4.066 5.06e-05 ***
## homeruns 7.335e-05 1.500e-04 0.489 0.624998
## strikeouts 2.110e-04 3.258e-05 6.475 1.32e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3305 on 1353 degrees of freedom
## Multiple R-squared: 0.2472, Adjusted R-squared: 0.2405
## F-statistic: 37.02 on 12 and 1353 DF, p-value: < 2.2e-16
Hits allowed has the highest p value, and therefore the weakest evidence of an effect on wins, so it will be removed.
##
## Call:
## lm(formula = log(`moneyball_df$TARGET_WINS`) ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + caught_stealing_bases +
## walks_allowed + errors + double_plays + homeruns + strikeouts,
## data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6490 -0.1112 0.0156 0.1254 0.8257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.967e+00 1.693e-01 11.618 < 2e-16 ***
## base_hits 1.312e-03 1.205e-04 10.891 < 2e-16 ***
## doubles_hit -1.170e-03 2.997e-04 -3.905 9.90e-05 ***
## triples_hit 1.469e-03 5.456e-04 2.692 0.00720 **
## walks_by_batters 1.329e-03 1.118e-04 11.878 < 2e-16 ***
## stolen_bases -7.233e-05 1.424e-04 -0.508 0.61157
## caught_stealing_bases 1.590e-03 4.909e-04 3.239 0.00123 **
## walks_allowed -3.317e-04 7.844e-05 -4.229 2.51e-05 ***
## errors 1.273e-04 8.394e-05 1.517 0.12956
## double_plays -1.723e-03 4.260e-04 -4.045 5.52e-05 ***
## homeruns 7.071e-05 1.499e-04 0.472 0.63711
## strikeouts 2.105e-04 3.256e-05 6.467 1.39e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3304 on 1354 degrees of freedom
## Multiple R-squared: 0.2471, Adjusted R-squared: 0.2409
## F-statistic: 40.39 on 11 and 1354 DF, p-value: < 2.2e-16
Homeruns now has the highest p value and will be removed.
##
## Call:
## lm(formula = log(`moneyball_df$TARGET_WINS`) ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + stolen_bases + caught_stealing_bases +
## walks_allowed + errors + double_plays + strikeouts, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6372 -0.1093 0.0154 0.1245 0.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.930e+00 1.499e-01 12.877 < 2e-16 ***
## base_hits 1.335e-03 1.101e-04 12.132 < 2e-16 ***
## doubles_hit -1.172e-03 2.996e-04 -3.912 9.59e-05 ***
## triples_hit 1.401e-03 5.260e-04 2.663 0.00785 **
## walks_by_batters 1.341e-03 1.084e-04 12.372 < 2e-16 ***
## stolen_bases -8.297e-05 1.406e-04 -0.590 0.55508
## caught_stealing_bases 1.552e-03 4.842e-04 3.206 0.00138 **
## walks_allowed -3.315e-04 7.842e-05 -4.228 2.52e-05 ***
## errors 1.190e-04 8.203e-05 1.450 0.14724
## double_plays -1.681e-03 4.163e-04 -4.038 5.70e-05 ***
## strikeouts 2.198e-04 2.598e-05 8.459 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3303 on 1355 degrees of freedom
## Multiple R-squared: 0.2469, Adjusted R-squared: 0.2414
## F-statistic: 44.43 on 10 and 1355 DF, p-value: < 2.2e-16
Stolen bases now has the highest p value and will be removed.
##
## Call:
## lm(formula = log(`moneyball_df$TARGET_WINS`) ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + caught_stealing_bases +
## walks_allowed + errors + double_plays + strikeouts, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6461 -0.1145 0.0137 0.1247 0.8220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.935e+00 1.496e-01 12.935 < 2e-16 ***
## base_hits 1.330e-03 1.097e-04 12.125 < 2e-16 ***
## doubles_hit -1.145e-03 2.959e-04 -3.869 0.000115 ***
## triples_hit 1.351e-03 5.191e-04 2.602 0.009361 **
## walks_by_batters 1.330e-03 1.068e-04 12.459 < 2e-16 ***
## caught_stealing_bases 1.478e-03 4.674e-04 3.162 0.001603 **
## walks_allowed -3.377e-04 7.771e-05 -4.346 1.49e-05 ***
## errors 1.022e-04 7.694e-05 1.328 0.184312
## double_plays -1.632e-03 4.080e-04 -4.001 6.65e-05 ***
## strikeouts 2.181e-04 2.581e-05 8.448 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3302 on 1356 degrees of freedom
## Multiple R-squared: 0.2467, Adjusted R-squared: 0.2417
## F-statistic: 49.36 on 9 and 1356 DF, p-value: < 2.2e-16
Errors now has the highest p value and will be removed.
##
## Call:
## lm(formula = log(`moneyball_df$TARGET_WINS`) ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + caught_stealing_bases +
## walks_allowed + double_plays + strikeouts, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6546 -0.1135 0.0137 0.1239 0.8015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.972e+00 1.470e-01 13.414 < 2e-16 ***
## base_hits 1.328e-03 1.097e-04 12.103 < 2e-16 ***
## doubles_hit -1.186e-03 2.944e-04 -4.028 5.93e-05 ***
## triples_hit 1.676e-03 4.579e-04 3.660 0.000262 ***
## walks_by_batters 1.293e-03 1.030e-04 12.551 < 2e-16 ***
## caught_stealing_bases 1.415e-03 4.652e-04 3.043 0.002388 **
## walks_allowed -3.299e-04 7.751e-05 -4.257 2.22e-05 ***
## double_plays -1.666e-03 4.073e-04 -4.091 4.56e-05 ***
## strikeouts 2.203e-04 2.577e-05 8.550 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3303 on 1357 degrees of freedom
## Multiple R-squared: 0.2458, Adjusted R-squared: 0.2413
## F-statistic: 55.27 on 8 and 1357 DF, p-value: < 2.2e-16
All of the remaining variables have p values lower than 0.05, which indicates that they are significant. A p value below 0.05 indicates strong evidence against the null hypothesis that the variable in question does not affect wins.
The adjusted R squared value is .2413. 24.13% of the variability in the logarithm of wins is accounted for by the model.
The F statistic is 55.27, which is high, and further indicates that these variables are significant.
The graph of the residuals vs. fitted values shows a narrowing at higher values, indicating that the variability of the residuals is not constant.
The logarithmic model has a lower adjusted R squared value than the previous two models and residuals whose variability is not constant. The logarithmic model is not an appropriate choice for this data.
Building a model in which cases with NA are not used to create the model. (This model will not impute missing values.)
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ ., data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2556 -6.2474 -0.1754 5.7225 21.6508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.901225 25.041549 2.512 0.01359 *
## base_hits 2.655188 2.188344 1.213 0.22783
## doubles_hit 0.064527 0.042877 1.505 0.13547
## triples_hit -0.134966 0.113580 -1.188 0.23750
## walks_by_batters -7.646664 5.965554 -1.282 0.20285
## stolen_bases 0.017067 0.039014 0.437 0.66272
## caught_stealing_bases 0.053478 0.090586 0.590 0.55627
## batters_hit_by_pitch 0.152920 0.072221 2.117 0.03668 *
## hits_allowed -2.636974 2.186906 -1.206 0.23071
## walks_allowed 7.713110 5.962717 1.294 0.19877
## errors -0.117572 0.055448 -2.120 0.03642 *
## double_plays -0.155739 0.048248 -3.228 0.00168 **
## homeruns 0.036707 0.016777 2.188 0.03098 *
## strikeouts -0.023554 0.005463 -4.312 3.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.666 on 101 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.5635, Adjusted R-squared: 0.5073
## F-statistic: 10.03 on 13 and 101 DF, p-value: 3.496e-13
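The "(1251 observations deleted due to missingness)" line in the output above comes from the default behaviour of `lm()`, which drops any row with an NA in a variable used by the model. With 101 residual degrees of freedom and 14 estimated coefficients, 115 rows were actually used, and 115 + 1251 = 1366 matches the training set size. The sketch below shows the same behaviour on an invented data frame; all values are made up.

```r
# Toy data: two of the six rows have a missing predictor.
toy <- data.frame(
  TARGET_WINS = c(70, 85, 90, 60, 78, 88),
  base_hits   = c(1400, 1500, NA, 1350, 1450, 1520),
  errors      = c(160, 120, 110, NA, 150, 115)
)

# lm() uses na.omit by default, so incomplete rows are silently dropped.
fit <- lm(TARGET_WINS ~ ., data = toy)

n_used    <- nobs(fit)              # rows that entered the fit
n_dropped <- nrow(toy) - n_used     # rows lost to missingness
c(used = n_used, dropped = n_dropped)
```

Because only complete rows survive, a single sparse column such as batters hit by pitch can remove most of the data, which is why this model is fitted on so few observations.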
Stolen bases has the highest p value, and therefore the weakest evidence of an effect on wins, so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + caught_stealing_bases +
## batters_hit_by_pitch + hits_allowed + walks_allowed + errors +
## double_plays + homeruns + strikeouts, data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2313 -6.0879 -0.1629 5.6109 21.7651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.673048 24.784825 2.488 0.01445 *
## base_hits 2.782749 2.160214 1.288 0.20060
## doubles_hit 0.059483 0.041134 1.446 0.15122
## triples_hit -0.137049 0.113029 -1.213 0.22812
## walks_by_batters -8.004596 5.885704 -1.360 0.17683
## caught_stealing_bases 0.075784 0.074573 1.016 0.31192
## batters_hit_by_pitch 0.152631 0.071931 2.122 0.03627 *
## hits_allowed -2.761923 2.159560 -1.279 0.20382
## walks_allowed 8.071025 5.882856 1.372 0.17309
## errors -0.117328 0.055225 -2.125 0.03604 *
## double_plays -0.159351 0.047347 -3.366 0.00108 **
## homeruns 0.035766 0.016573 2.158 0.03327 *
## strikeouts -0.023338 0.005418 -4.307 3.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.631 on 102 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.5627, Adjusted R-squared: 0.5112
## F-statistic: 10.94 on 12 and 102 DF, p-value: 1.118e-13
Caught stealing bases now has the highest p-value (0.31), making it the least significant predictor of wins, so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## triples_hit + walks_by_batters + batters_hit_by_pitch + hits_allowed +
## walks_allowed + errors + double_plays + homeruns + strikeouts,
## data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.4917 -6.2209 -0.6594 5.5976 20.9959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.035386 24.752476 2.547 0.01236 *
## base_hits 2.971698 2.152539 1.381 0.17040
## doubles_hit 0.053178 0.040670 1.308 0.19393
## triples_hit -0.115205 0.110984 -1.038 0.30169
## walks_by_batters -8.568031 5.860463 -1.462 0.14678
## batters_hit_by_pitch 0.162077 0.071339 2.272 0.02517 *
## hits_allowed -2.948784 2.152060 -1.370 0.17360
## walks_allowed 8.634008 5.857644 1.474 0.14354
## errors -0.108195 0.054497 -1.985 0.04977 *
## double_plays -0.160013 0.047350 -3.379 0.00103 **
## homeruns 0.032113 0.016181 1.985 0.04985 *
## strikeouts -0.023396 0.005419 -4.317 3.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.633 on 103 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.5582, Adjusted R-squared: 0.511
## F-statistic: 11.83 on 11 and 103 DF, p-value: 5.055e-14
Triples hit now has the highest p-value (0.30), making it the least significant predictor of wins, so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## walks_by_batters + batters_hit_by_pitch + hits_allowed +
## walks_allowed + errors + double_plays + homeruns + strikeouts,
## data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.856 -6.048 -0.822 5.826 20.573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.220005 24.431087 2.751 0.007001 **
## base_hits 3.066089 2.151419 1.425 0.157109
## doubles_hit 0.057542 0.040467 1.422 0.158037
## walks_by_batters -8.882568 5.854804 -1.517 0.132265
## batters_hit_by_pitch 0.169982 0.070958 2.396 0.018385 *
## hits_allowed -3.050389 2.150634 -1.418 0.159073
## walks_allowed 8.951859 5.851814 1.530 0.129113
## errors -0.101130 0.054091 -1.870 0.064347 .
## double_plays -0.164837 0.047139 -3.497 0.000694 ***
## homeruns 0.036854 0.015529 2.373 0.019468 *
## strikeouts -0.024520 0.005312 -4.616 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.636 on 104 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.5536, Adjusted R-squared: 0.5107
## F-statistic: 12.9 on 10 and 104 DF, p-value: 2.238e-14
Hits allowed now has the highest p-value (0.159, just above base hits and doubles hit), so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## walks_by_batters + batters_hit_by_pitch + walks_allowed +
## errors + double_plays + homeruns + strikeouts, data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9722 -6.0459 -0.5518 5.6908 20.7530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.270713 24.539299 2.701 0.008070 **
## base_hits 0.014667 0.015617 0.939 0.349823
## doubles_hit 0.060748 0.040598 1.496 0.137566
## walks_by_batters -0.636816 0.697126 -0.913 0.363078
## batters_hit_by_pitch 0.162150 0.071083 2.281 0.024559 *
## walks_allowed 0.710361 0.697020 1.019 0.310479
## errors -0.112352 0.053766 -2.090 0.039067 *
## double_plays -0.160772 0.047278 -3.401 0.000952 ***
## homeruns 0.038795 0.015543 2.496 0.014114 *
## strikeouts -0.024765 0.005334 -4.643 1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.678 on 105 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.545, Adjusted R-squared: 0.506
## F-statistic: 13.97 on 9 and 105 DF, p-value: 1.468e-14
Walks by batters now has the highest p-value (0.36), making it the least significant predictor of wins, so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ base_hits + doubles_hit +
## batters_hit_by_pitch + walks_allowed + errors + double_plays +
## homeruns + strikeouts, data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0718 -6.4072 -0.6232 5.5719 20.3302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.172644 24.431713 2.790 0.00625 **
## base_hits 0.013765 0.015574 0.884 0.37877
## doubles_hit 0.061320 0.040561 1.512 0.13356
## batters_hit_by_pitch 0.162746 0.071025 2.291 0.02392 *
## walks_allowed 0.073752 0.012990 5.677 1.20e-07 ***
## errors -0.111288 0.053712 -2.072 0.04070 *
## double_plays -0.165150 0.046998 -3.514 0.00065 ***
## homeruns 0.037338 0.015448 2.417 0.01736 *
## strikeouts -0.024555 0.005325 -4.611 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.671 on 106 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.5414, Adjusted R-squared: 0.5067
## F-statistic: 15.64 on 8 and 106 DF, p-value: 5.273e-15
Base hits now has the highest p-value (0.38), making it the least significant predictor of wins, so it will be removed.
##
## Call:
## lm(formula = `moneyball_df$TARGET_WINS` ~ doubles_hit + batters_hit_by_pitch +
## walks_allowed + errors + double_plays + homeruns + strikeouts,
## data = train4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5842 -6.4536 -0.8166 5.3234 20.9979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.394660 16.109757 5.239 8.16e-07 ***
## doubles_hit 0.080141 0.034488 2.324 0.022028 *
## batters_hit_by_pitch 0.166386 0.070833 2.349 0.020660 *
## walks_allowed 0.075152 0.012880 5.835 5.82e-08 ***
## errors -0.112546 0.053638 -2.098 0.038237 *
## double_plays -0.164899 0.046949 -3.512 0.000652 ***
## homeruns 0.042743 0.014172 3.016 0.003200 **
## strikeouts -0.026619 0.004781 -5.567 1.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.662 on 107 degrees of freedom
## (1251 observations deleted due to missingness)
## Multiple R-squared: 0.538, Adjusted R-squared: 0.5077
## F-statistic: 17.8 on 7 and 107 DF, p-value: 1.729e-15
All of the remaining variables have p-values lower than 0.05, which indicates that they are statistically significant. A p-value below 0.05 is evidence against the null hypothesis that the variable in question has no effect on wins.
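The elimination steps for model 4 all follow one rule: refit, find the predictor with the highest p-value, drop it, and repeat until every remaining p-value is below 0.05. The block below is a minimal sketch of that rule as a loop, not the code used in this report. It assumes all predictors are numeric (so coefficient names match term names), uses the built-in `mtcars` data as a stand-in, and `backward_eliminate` is a hypothetical helper name.

```r
# Backward elimination sketch: repeatedly drop the predictor with the
# largest p-value until every remaining predictor is below `alpha`.
# `backward_eliminate` is a hypothetical helper, not part of this report.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    # Ignore the intercept row and keep the p-value column, with names.
    keep  <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])
    # Stop when nothing is left to drop or everything is significant.
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- names(pvals)[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in example on a built-in data set (not the moneyball data).
full    <- lm(mpg ~ ., data = mtcars)
reduced <- backward_eliminate(full)
summary(reduced)
```

Dropping one variable at a time matters because p-values change after each refit, as the successive summaries for model 4 show.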
The adjusted R-squared value is 0.5077, so 50.77% of the variability in wins is accounted for by the model. This is considerably higher than for the previous 3 models.
The F-statistic is 17.8 on 7 and 107 degrees of freedom (p-value 1.729e-15), which is high and further indicates that these variables are jointly significant.
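As a quick consistency check, the adjusted R-squared and the F-statistic can both be recomputed from the multiple R-squared, the number of predictors, and the residual degrees of freedom printed in the model 4 summary. The sketch below uses the standard formulas with the rounded values from that summary, so the results should agree with the reported 0.5077 and 17.8 only up to rounding.

```r
# Values read from the final model 4 summary.
r2       <- 0.538            # Multiple R-squared
p        <- 7                # number of predictors
df_resid <- 107              # residual degrees of freedom
n        <- df_resid + p + 1 # observations actually used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic and its upper-tail p-value.
f_stat  <- (r2 / p) / ((1 - r2) / df_resid)
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

n
round(adj_r2, 4)
round(f_stat, 2)
p_value
```

The value of `n` recovered here is 115, which is the number of complete cases left after the 1251 observations were deleted due to missingness.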
The residual plots show no pattern and the residuals look nearly normal.
Using the Test Set To Make and Evaluate Predictions from Each of the 4 Models Built
Prediction from Model 1
The root mean square error from model 1 is
## [1] 14.20675
The number of wins in a season predicted by model 1 is off by about 14, as measured by the root mean square error.
Prediction from Model 2
The root mean square error from model 2 is
## [1] 14.50733
The number of wins in a season predicted by model 2 is off by about 14.5, as measured by the root mean square error.
Prediction from Model 3 - Logarithmic Model
The root mean square error from model 3 is
## [1] 31.08032
The number of wins in a season predicted by model 3 is off by about 31, as measured by the root mean square error.
Prediction from Model 4
The root mean square error from model 4 is
## [1] 27.28408
The number of wins in a season predicted by model 4 is off by about 27, as measured by the root mean square error.
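The root mean square error used above is the square root of the mean squared difference between predicted and actual wins. The sketch below wraps that calculation in a function and compares it with the mean absolute error, which is the literal average miss. The function names `rmse` and `mae` are hypothetical helpers, and the win totals are made up for illustration.

```r
# Root mean square error between predicted and actual values.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Mean absolute error, shown for comparison. It is never larger
# than the RMSE, because squaring gives large misses more weight.
mae <- function(predicted, actual) {
  mean(abs(predicted - actual))
}

# Made-up win totals, for illustration only (not from the moneyball data).
actual    <- c(80, 90, 70, 85)
predicted <- c(83, 86, 70, 97)

rmse(predicted, actual)  # 6.5
mae(predicted, actual)   # 4.75
```

One miss of 12 wins is enough to pull the RMSE well above the mean absolute error, which is worth keeping in mind when reading the RMSE as a typical error.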
PREDICTING WINS in the EVALUATION DATA
The model I will use to predict values in the evaluation set is model 1.
Although model 4 had a higher adjusted R-squared value, missing values left only 115 of the 1366 training cases available to fit it (1251 were deleted due to missingness), and its root mean square error on the test set was much higher. I expect models 1 and 2 to predict values in a similar manner, but model 1 has a slightly lower root mean square error. Model 3, the logarithmic model, had the highest root mean square error and is not an appropriate choice.
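The loss of cases in model 4 comes from the default behaviour of `lm()`, which drops any row with an NA in a variable used by the model. A minimal sketch on a made-up data frame shows how to compare the rows supplied with the rows actually used. The column names and values are invented for illustration.

```r
# Toy data frame (made up): 8 rows, with NAs scattered over two predictors.
toy <- data.frame(
  wins = c(70, 75, 80, 85, 90, 95, 88, 72),
  x1   = c(1, 2, NA, 4, 5, 6, 7, 3),
  x2   = c(2, NA, 1, 3, NA, 5, 4, 6)
)

# With the default na.action (na.omit), incomplete rows are silently dropped.
fit <- lm(wins ~ x1 + x2, data = toy)

nrow(toy)                 # rows supplied: 8
sum(complete.cases(toy))  # rows with no NA in any column: 5
nobs(fit)                 # rows lm() actually used: 5
```

Checking `nobs()` against `nrow()` before comparing models makes it clear when two models were fit on very different numbers of cases.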
The following steps are taken to prepare the evaluation set to make predictions:
- The column titles are changed.
- Outliers above the chosen cutoffs in the evaluation set (hits allowed above 5000, errors above 1000, strikeouts by pitchers above 3000) are set to NA.
- Homeruns allowed and homeruns hit are summed into a single homeruns variable.
- Strikeouts at bat and strikeouts by pitchers are summed into a single strikeouts variable.
- The NA values are replaced with the median values from the training set.
The predicted number of wins is stored in predictwins_eval.
## Index base_hits doubles_hit triples_hit
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
##
## homeruns_hit walks_by_batters strikeouts_at_bat stolen_bases
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0
## Median :101.00 Median :509.0 Median : 686.0 Median : 92.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## NA's :18 NA's :13
## caught_stealing_bases batters_hit_by_pitch hits_allowed
## Min. : 0.00 Min. :42.00 Min. :1155
## 1st Qu.: 38.00 1st Qu.:53.50 1st Qu.:1423
## Median : 49.50 Median :62.00 Median :1506
## Mean : 52.32 Mean :62.37 Mean :1604
## 3rd Qu.: 63.00 3rd Qu.:67.50 3rd Qu.:1658
## Max. :154.00 Max. :96.00 Max. :4120
## NA's :87 NA's :240 NA's :6
## homeruns_allowed walks_allowed strikeouts_by_pitchers errors
## Min. : 0.0 Min. : 136.0 Min. : 0.0 Min. : 73.0
## 1st Qu.: 52.0 1st Qu.: 471.0 1st Qu.: 610.8 1st Qu.:130.8
## Median :104.0 Median : 526.0 Median : 745.0 Median :159.5
## Mean :102.1 Mean : 552.4 Mean : 761.5 Mean :221.1
## 3rd Qu.:142.5 3rd Qu.: 606.5 3rd Qu.: 933.5 3rd Qu.:244.2
## Max. :336.0 Max. :2008.0 Max. :1462.0 Max. :994.0
## NA's :19 NA's :7
## double_plays
## Min. : 69.0
## 1st Qu.:131.0
## Median :148.0
## Mean :146.1
## 3rd Qu.:164.0
## Max. :204.0
## NA's :31
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 63 66 75 84 65 67 82 74 69 71 69 80 81 80 81 75 74 79
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 71 90 82 85 83 72 79 84 57 72 85 73 94 86 86 90 81 88
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 76 90 84 92 83 90 48 106 92 95 100 75 69 77 73 83 78 71
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 75 75 93 71 64 79 83 78 88 84 80 90 77 81 78 91 84 67
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 78 88 79 86 83 82 70 74 83 87 96 75 89 77 80 82 88 91
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 79 73 74 82 82 83 91 105 88 88 79 74 83 85 82 71 58 76
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 87 62 83 80 91 87 77 76 85 79 73 74 89 70 67 64 68 83
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 90 74 92 91 87 78 78 86 86 74 73 74 81 80 65 72 91 75
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 72 72 79 79 78 80 81 78 58 68 77 69 88 68 96 75 108 111
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 95 108 102 90 83 78 70 79 89 88 80 93 84 75 80 71 74 81
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 83 89 85 86 82 92 93 61 60 114 74 81 72 74 78 68 77 86
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## 81 87 78 80 74 84 78 85 78 77 81 77 105 92 83 66 68 84
## 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## 78 89 74 78 78 71 82 71 92 74 80 82 82 76 79 90 81 88
## 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## 81 75 84 75 94 70 88 84 82 78 62 84 78 83 70 81 81 68
## 253 254 255 256 257 258 259
## 96 10 68 80 78 81 75
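The evaluation set is imputed with medians taken from the training set rather than from the evaluation set itself, so the preprocessing is fixed before any evaluation data is seen. The sketch below shows that step as a reusable function. `impute_with_train_median` is a hypothetical helper, and the two small data frames are made up for illustration.

```r
# Replace NAs in `newdata` with medians computed on `train`.
# `impute_with_train_median` is a hypothetical helper for illustration.
impute_with_train_median <- function(newdata, train) {
  shared <- intersect(names(newdata), names(train))
  for (col in shared) {
    if (!is.numeric(train[[col]])) next
    med <- median(train[[col]], na.rm = TRUE)
    newdata[[col]][is.na(newdata[[col]])] <- med
  }
  newdata
}

# Made-up data, for illustration only (not from the moneyball files).
train_toy <- data.frame(errors = c(100, 150, 200), double_plays = c(140, NA, 160))
eval_toy  <- data.frame(errors = c(NA, 120),       double_plays = c(NA, 155))

# Both NAs are filled with 150, the training median of each column.
impute_with_train_median(eval_toy, train_toy)
```

A loop over shared column names avoids repeating one assignment per variable and makes it harder to skip a column by accident.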
APPENDIX
# Load the training data and drop the first (index) column
moneyball_df <- read.csv("https://raw.githubusercontent.com/swigodsky/Data621/master/moneyball-training-data.csv", stringsAsFactors = FALSE)
#nrow(moneyball_df)
moneyball_df <- moneyball_df[-1]
# Drop the wins column and give the predictors descriptive names
moneyball_df_wins_removed <- moneyball_df[-1]
colnames(moneyball_df_wins_removed) <- c("base_hits","doubles_hit","triples_hit","homeruns_hit","walks_by_batters","strikeouts_at_bat","stolen_bases","caught_stealing_bases","batters_hit_by_pitch","hits_allowed","homeruns_allowed","walks_allowed","strikeouts_by_pitchers","errors","double_plays")
summary(moneyball_df_wins_removed)
# Exploratory plots for the three variables with extreme values
plot(moneyball_df_wins_removed$hits_allowed, ylab="number of hits allowed", main="Hits Allowed In 1 Season")
hist(moneyball_df_wins_removed$hits_allowed, xlab="number hits allowed", main="Hits Allowed In 1 Season")
boxplot(moneyball_df_wins_removed$hits_allowed, main="Hits Allowed In 1 Season")
plot(moneyball_df_wins_removed$errors, ylab="number of errors", main="Number of Errors In 1 Season")
hist(moneyball_df_wins_removed$errors, xlab="number of errors", main="Number of Errors In 1 Season")
boxplot(moneyball_df_wins_removed$errors, main="Number of Errors In 1 Season")
plot(moneyball_df_wins_removed$strikeouts_by_pitchers, ylab="number of strikeouts by pitchers", main="Strikeouts By Pitchers In 1 Season")
hist(moneyball_df_wins_removed$strikeouts_by_pitchers, xlab="number of strikeouts by pitchers", main="Strikeouts By Pitchers In 1 Season")
boxplot(moneyball_df_wins_removed$strikeouts_by_pitchers, main="Strikeouts By Pitchers In 1 Season")
library(dplyr)
library(tidyr)
library(ggplot2)
# Boxplots of all predictors
par(mar = c(2, 5, 4, 2)+ 0.1)
par(cex.axis=.6)
boxplot(moneyball_df_wins_removed, las=2, horizontal=TRUE, ylim=c(0,3000))
# Map of missing values by record and variable
par(mar = c(2, 5, 4, 2)+ 0.1)
par(cex.axis=.5)
image(is.na(moneyball_df_wins_removed), axes=FALSE, col=gray(1:0), main='Missing Data')
axis(2, at=0:14/14, labels=colnames(moneyball_df_wins_removed), las=2)
axis(1, at=0:2275/2275, labels=FALSE)
# Set outliers above the chosen cutoffs to NA, then re-plot
moneyball_df_wins_removed$hits_allowed[moneyball_df_wins_removed$hits_allowed > 5000] <- NA
plot(moneyball_df_wins_removed$hits_allowed, ylab="number of hits allowed", main="Hits Allowed In 1 Season")
hist(moneyball_df_wins_removed$hits_allowed, xlab="number hits allowed", main="Hits Allowed In 1 Season")
boxplot(moneyball_df_wins_removed$hits_allowed, main="Hits Allowed In 1 Season")
moneyball_df_wins_removed$errors[moneyball_df_wins_removed$errors > 1000] <- NA
plot(moneyball_df_wins_removed$errors, ylab="number of errors", main="Number of Errors In 1 Season")
hist(moneyball_df_wins_removed$errors, xlab="number of errors", main="Number of Errors In 1 Season")
boxplot(moneyball_df_wins_removed$errors, main="Number of Errors In 1 Season")
moneyball_df_wins_removed$strikeouts_by_pitchers[moneyball_df_wins_removed$strikeouts_by_pitchers > 3000] <- NA
plot(moneyball_df_wins_removed$strikeouts_by_pitchers, ylab="number of strikeouts by pitchers", main="Strikeouts By Pitchers In 1 Season")
hist(moneyball_df_wins_removed$strikeouts_by_pitchers, xlab="number of strikeouts by pitchers", main="Strikeouts By Pitchers In 1 Season")
boxplot(moneyball_df_wins_removed$strikeouts_by_pitchers, main="Strikeouts By Pitchers In 1 Season")
# Drop batters_hit_by_pitch, then replace NAs with the column median
moneyball_df_train <- subset(moneyball_df_wins_removed, select=-(batters_hit_by_pitch))
moneyball_df_train$strikeouts_at_bat[is.na(moneyball_df_train$strikeouts_at_bat)] <- median(moneyball_df_train$strikeouts_at_bat, na.rm=TRUE)
moneyball_df_train$caught_stealing_bases[is.na(moneyball_df_train$caught_stealing_bases)] <- median(moneyball_df_train$caught_stealing_bases, na.rm=TRUE)
moneyball_df_train$stolen_bases[is.na(moneyball_df_train$stolen_bases)] <- median(moneyball_df_train$stolen_bases, na.rm=TRUE)
moneyball_df_train$double_plays[is.na(moneyball_df_train$double_plays)] <- median(moneyball_df_train$double_plays, na.rm=TRUE)
moneyball_df_train$hits_allowed[is.na(moneyball_df_train$hits_allowed)] <- median(moneyball_df_train$hits_allowed, na.rm=TRUE)
moneyball_df_train$strikeouts_by_pitchers[is.na(moneyball_df_train$strikeouts_by_pitchers)] <- median(moneyball_df_train$strikeouts_by_pitchers, na.rm=TRUE)
moneyball_df_train$errors[is.na(moneyball_df_train$errors)] <- median(moneyball_df_train$errors, na.rm=TRUE)
correlation <- cor(moneyball_df_train, method = "pearson")
correlation
# Combine the homerun and strikeout variables into single predictors
library(dplyr)
moneyball_df_train <- moneyball_df_train %>%
  mutate(homeruns = homeruns_hit+homeruns_allowed) %>%
  mutate(strikeouts = strikeouts_at_bat+strikeouts_by_pitchers)
moneyball_df_train <- select(moneyball_df_train, -c(homeruns_hit, homeruns_allowed, strikeouts_at_bat, strikeouts_by_pitchers))
# 60/40 train/test split (no seed is set, so the split differs between runs)
n <- nrow(moneyball_df_train)
moneyball_df_train <- cbind(moneyball_df_train, moneyball_df$TARGET_WINS)
shuffle_df <- moneyball_df_train[sample(n),]
train_indeces <- 1:round(0.6*n)
train <- shuffle_df[train_indeces,]
test_indeces <- (round(.6*n)+1):n
test <- shuffle_df[test_indeces,]
# Model 1: full model, then backward elimination
wins_lm <- lm(`moneyball_df$TARGET_WINS` ~ ., data=train)
summary(wins_lm)
wins_lm <- update(wins_lm, .~. -strikeouts, data = train)
summary(wins_lm)
wins_lm <- update(wins_lm, .~. -caught_stealing_bases, data = train)
summary(wins_lm)
wins_lm <- update(wins_lm, .~. -errors, data = train)
summary(wins_lm)
plot(fitted(wins_lm), resid(wins_lm))
qqnorm(resid(wins_lm))
qqline(resid(wins_lm))
# Model 2: same procedure on a second random split
shuffle_df2 <- moneyball_df_train[sample(n),]
train_indeces2 <- 1:round(0.6*n)
train2 <- shuffle_df2[train_indeces2,]
test_indeces2 <- (round(.6*n)+1):n
test2 <- shuffle_df2[test_indeces2,]
wins_lm2 <- lm(`moneyball_df$TARGET_WINS` ~ ., data=train2)
summary(wins_lm2)
wins_lm2 <- update(wins_lm2, .~. -strikeouts, data = train2)
summary(wins_lm2)
wins_lm2 <- update(wins_lm2, .~. -caught_stealing_bases, data = train2)
summary(wins_lm2)
wins_lm2 <- update(wins_lm2, .~. -doubles_hit, data = train2)
summary(wins_lm2)
plot(fitted(wins_lm2), resid(wins_lm2))
qqnorm(resid(wins_lm2))
qqline(resid(wins_lm2))
# Model 3: log of wins as the response (zeros replaced by 0.001 so log() is defined)
train3 <- train
train3[train3==0] <- 0.001
wins_lm3 <- lm(log(`moneyball_df$TARGET_WINS`) ~ ., data=train3)
summary(wins_lm3)
wins_lm3 <- update(wins_lm3, .~. -hits_allowed, data = train3)
summary(wins_lm3)
wins_lm3 <- update(wins_lm3, .~. -homeruns, data = train3)
summary(wins_lm3)
wins_lm3 <- update(wins_lm3, .~. -stolen_bases, data = train3)
summary(wins_lm3)
wins_lm3 <- update(wins_lm3, .~. -errors, data = train3)
summary(wins_lm3)
plot(fitted(wins_lm3), resid(wins_lm3))
qqnorm(resid(wins_lm3))
qqline(resid(wins_lm3))
# Model 4: keeps batters_hit_by_pitch and does not impute missing values
moneyball_df_train4 <- moneyball_df_wins_removed %>%
  mutate(homeruns = homeruns_hit+homeruns_allowed) %>%
  mutate(strikeouts = strikeouts_at_bat+strikeouts_by_pitchers)
moneyball_df_train4 <- cbind(moneyball_df_train4, moneyball_df$TARGET_WINS)
moneyball_df_train4 <- select(moneyball_df_train4, -c(homeruns_hit, homeruns_allowed, strikeouts_at_bat, strikeouts_by_pitchers))
shuffle_df4 <- moneyball_df_train4[sample(n),]
train_indeces4 <- 1:round(0.6*n)
train4 <- shuffle_df4[train_indeces4,]
test_indeces4 <- (round(.6*n)+1):n
test4 <- shuffle_df4[test_indeces4,]
wins_lm4 <- lm(`moneyball_df$TARGET_WINS` ~ ., data=train4)
summary(wins_lm4)
wins_lm4 <- update(wins_lm4, .~. -stolen_bases, data = train4)
summary(wins_lm4)
wins_lm4 <- update(wins_lm4, .~. -caught_stealing_bases, data = train4)
summary(wins_lm4)
wins_lm4 <- update(wins_lm4, .~. -triples_hit, data = train4)
summary(wins_lm4)
wins_lm4 <- update(wins_lm4, .~. -hits_allowed, data = train4)
summary(wins_lm4)
wins_lm4 <- update(wins_lm4, .~. -walks_by_batters, data = train4)
summary(wins_lm4)
wins_lm4 <- update(wins_lm4, .~. -base_hits, data = train4)
summary(wins_lm4)
plot(fitted(wins_lm4), resid(wins_lm4))
qqnorm(resid(wins_lm4))
qqline(resid(wins_lm4))
# Test-set predictions and RMSE for models 1 to 3
predictwins <- predict(wins_lm, newdata=test, type="response")
predictwins <- floor(predictwins)
error <- predictwins-test$`moneyball_df$TARGET_WINS`
predictwins <- cbind(predictwins, test$`moneyball_df$TARGET_WINS`, error)
#head(predictwins)
rmse <- sqrt(mean(error^2))
rmse
predictwins2 <- predict(wins_lm2, newdata=test2, type="response")
predictwins2 <- floor(predictwins2)
error2 <- predictwins2-test2$`moneyball_df$TARGET_WINS`
rmse2 <- sqrt(mean(error2^2))
rmse2
test3 <- test
test3[test3==0] <- 0.001
predictwins3 <- predict(wins_lm3, newdata=test3, type="response")
# Note: floor() is applied here on the log scale, before exp(), which
# coarsens the back-transformed predictions and may inflate rmse3
predictwins3 <- floor(predictwins3)
error3 <- exp(predictwins3)-test3$`moneyball_df$TARGET_WINS`
rmse3 <- sqrt(mean(error3^2))
rmse3
# Training-set medians, used to fill NAs in test4 and in the evaluation set
med_base_hits <- median(moneyball_df_train$base_hits, na.rm=TRUE)
med_doubles_hit <- median(moneyball_df_train$doubles_hit, na.rm=TRUE)
med_triples_hit <- median(moneyball_df_train$triples_hit, na.rm=TRUE)
med_walks_by_batters <- median(moneyball_df_train$walks_by_batters, na.rm=TRUE)
med_stolen_bases <- median(moneyball_df_train$stolen_bases, na.rm=TRUE)
med_caught_stealing_bases <- median(moneyball_df_train$caught_stealing_bases, na.rm=TRUE)
med_batters_hit_by_pitch <- median(moneyball_df_wins_removed$batters_hit_by_pitch, na.rm=TRUE)
med_hits_allowed <- median(moneyball_df_train$hits_allowed, na.rm=TRUE)
med_walks_allowed <- median(moneyball_df_train$walks_allowed, na.rm=TRUE)
med_errors <- median(moneyball_df_train$errors, na.rm=TRUE)
med_double_plays <- median(moneyball_df_train$double_plays, na.rm=TRUE)
med_homeruns <- median(moneyball_df_train$homeruns, na.rm=TRUE)
med_strikeouts <- median(moneyball_df_train$strikeouts, na.rm=TRUE)
test4$base_hits[is.na(test4$base_hits)] <- med_base_hits
test4$doubles_hit[is.na(test4$doubles_hit)] <- med_doubles_hit
test4$triples_hit[is.na(test4$triples_hit)] <- med_triples_hit
test4$walks_by_batters[is.na(test4$walks_by_batters)] <- med_walks_by_batters
test4$stolen_bases[is.na(test4$stolen_bases)] <- med_stolen_bases
test4$caught_stealing_bases[is.na(test4$caught_stealing_bases)] <- med_caught_stealing_bases
test4$batters_hit_by_pitch[is.na(test4$batters_hit_by_pitch)] <- med_batters_hit_by_pitch
test4$hits_allowed[is.na(test4$hits_allowed)] <- med_hits_allowed
test4$errors[is.na(test4$errors)] <- med_errors
test4$double_plays[is.na(test4$double_plays)] <- med_double_plays
test4$homeruns[is.na(test4$homeruns)] <- med_homeruns
test4$strikeouts[is.na(test4$strikeouts)] <- med_strikeouts
predictwins4 <- predict(wins_lm4, newdata=test4, type="response")
predictwins4 <- floor(predictwins4)
error4 <- predictwins4-test4$`moneyball_df$TARGET_WINS`
rmse4 <- sqrt(mean(error4^2))
rmse4
# Load and prepare the evaluation set, then predict wins with model 1
moneyball_eval_names <- read.csv("https://raw.githubusercontent.com/swigodsky/Data621/master/moneyball-evaluation-data.csv", stringsAsFactors = FALSE)
colnames(moneyball_eval_names) <- c("Index","base_hits","doubles_hit","triples_hit","homeruns_hit","walks_by_batters","strikeouts_at_bat","stolen_bases","caught_stealing_bases","batters_hit_by_pitch","hits_allowed","homeruns_allowed","walks_allowed","strikeouts_by_pitchers","errors","double_plays")
moneyball_eval_names$hits_allowed[moneyball_eval_names$hits_allowed > 5000] <- NA
moneyball_eval_names$errors[moneyball_eval_names$errors > 1000] <- NA
moneyball_eval_names$strikeouts_by_pitchers[moneyball_eval_names$strikeouts_by_pitchers > 3000] <- NA
summary(moneyball_eval_names)
moneyball_eval_names <- moneyball_eval_names %>%
  mutate(homeruns = homeruns_hit+homeruns_allowed) %>%
  mutate(strikeouts = strikeouts_at_bat+strikeouts_by_pitchers)
moneyball_eval_names <- select(moneyball_eval_names, -c(homeruns_hit, homeruns_allowed, strikeouts_at_bat, strikeouts_by_pitchers))
moneyball_eval_names$base_hits[is.na(moneyball_eval_names$base_hits)] <- med_base_hits
moneyball_eval_names$doubles_hit[is.na(moneyball_eval_names$doubles_hit)] <- med_doubles_hit
moneyball_eval_names$triples_hit[is.na(moneyball_eval_names$triples_hit)] <- med_triples_hit
moneyball_eval_names$walks_by_batters[is.na(moneyball_eval_names$walks_by_batters)] <- med_walks_by_batters
moneyball_eval_names$stolen_bases[is.na(moneyball_eval_names$stolen_bases)] <- med_stolen_bases
moneyball_eval_names$caught_stealing_bases[is.na(moneyball_eval_names$caught_stealing_bases)] <- med_caught_stealing_bases
moneyball_eval_names$batters_hit_by_pitch[is.na(moneyball_eval_names$batters_hit_by_pitch)] <- med_batters_hit_by_pitch
moneyball_eval_names$hits_allowed[is.na(moneyball_eval_names$hits_allowed)] <- med_hits_allowed
moneyball_eval_names$errors[is.na(moneyball_eval_names$errors)] <- med_errors
moneyball_eval_names$double_plays[is.na(moneyball_eval_names$double_plays)] <- med_double_plays
moneyball_eval_names$homeruns[is.na(moneyball_eval_names$homeruns)] <- med_homeruns
moneyball_eval_names$strikeouts[is.na(moneyball_eval_names$strikeouts)] <- med_strikeouts
predictwins_eval <- predict(wins_lm, newdata=moneyball_eval_names, type="response")
predictwins_eval <- floor(predictwins_eval)
predictwins_eval