Backward elimination is used to identify the most useful predictor variables: all predictors are first added to the model, and the Akaike Information Criterion (AIC) of that full model is then reviewed. The AIC provides information about which predictors should be removed from the model to create the best fit.
At each step, the predictor whose removal yields the lowest AIC is dropped, and the model is refit to reassess the importance of the remaining predictor variables. Variables are removed until no further removal lowers the AIC.
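The removal loop described above can be sketched in a few lines. This is a minimal Python illustration rather than the R routine that produced the output in this report: `backward_eliminate` and `toy_aic` are hypothetical names, and the toy scorer uses made-up numbers instead of the baseball data.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def backward_eliminate(
    predictors: Iterable[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    current = frozenset(predictors)
    current_aic = aic_of(current)
    while current:
        # Score every model that omits exactly one of the current predictors.
        candidates = [(aic_of(current - {p}), p) for p in sorted(current)]
        best_aic, drop = min(candidates)
        if best_aic >= current_aic:
            break  # no single removal lowers the AIC, so stop
        current, current_aic = current - {drop}, best_aic
    return sorted(current), current_aic


# Toy scorer (made-up numbers, NOT the baseball data): "x1" and "x2" reduce
# the misfit, "noise" does not, and every predictor costs a penalty of 2.
def toy_aic(model: FrozenSet[str]) -> float:
    misfit = 100.0 - 30.0 * ("x1" in model) - 20.0 * ("x2" in model)
    return misfit + 2.0 * len(model)


kept, aic = backward_eliminate(["x1", "x2", "noise"], toy_aic)
print(kept, aic)
```

In the toy run only `noise` is dropped, because removing either informative predictor would raise the score by more than the penalty it saves.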
The AIC considers both the fit of the model and the number of parameters. Each additional parameter adds a penalty to the AIC, so an extra predictor lowers the AIC only if it improves the fit by more than that penalty; predictors that do not should be removed to improve the model.
The absolute magnitude of the AIC value is not of importance; only differences between models fitted to the same data are meaningful. The model with the lowest AIC value indicates the set of predictors that gives the best fit.
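As a worked check of how fit and parameter count combine, the sketch below recomputes AIC values from the residual sums of squares (RSS) reported in the output of this report. It assumes the AIC column follows the convention used by R's `step()` for linear models, n·ln(RSS/n) + 2k, where k counts the estimated coefficients including the intercept. The sample size of 1979 is inferred from the residual degrees of freedom plus the number of coefficients (for example 1965 + 14), and `step_aic` is a hypothetical helper name.

```python
import math


def step_aic(rss: float, n_obs: int, n_params: int) -> float:
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


N_OBS = 1979  # inferred: residual df + estimated coefficients

# (label, RSS from the "<none>" rows, coefficients including the intercept)
models = [
    ("14 predictors", 224413, 15),
    ("13 predictors", 224419, 14),
    ("11 predictors", 230539, 12),
]
for label, rss, k in models:
    print(f"{label}: AIC = {step_aic(rss, N_OBS, k):.2f}")
```

Under that assumption the three values should agree, up to rounding of the printed RSS, with the reported AICs of 9392.45, 9390.5 and 9439.74. Dropping BATTING_2B barely changes the RSS but saves a penalty of 2, which is why the AIC falls by about 2.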
For this model the dependent variable is TARGET_WINS, and all predictors in the dataset are used in the initial model to determine their importance in predicting TARGET_WINS.
The first step table lists three variables (BsR, BATTING_1B and PITCHING_SO) whose individual removal would lower the AIC below the current value of 9395.3. Removing BsR gives the largest drop, to 9393.3, and removing PITCHING_SO in the following step lowers it further to 9392.4, thus improving the model.
## Start: AIC=9395.25
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO +
## PITCHING_SO_BB + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BATTING_1B + BATTING_TB + BATTING_BB_SO + FIELDING_E +
## FIELDING_DP + BASERUN_SB + BsR
##
##
## Step: AIC=9395.25
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO +
## PITCHING_SO_BB + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BATTING_1B + BATTING_BB_SO + FIELDING_E + FIELDING_DP +
## BASERUN_SB + BsR
##
## Df Sum of Sq RSS AIC
## - BsR 1 0 224278 9393.3
## - BATTING_1B 1 99 224377 9394.1
## - PITCHING_SO 1 135 224413 9394.4
## <none> 224278 9395.3
## - BATTING_2B 1 232 224510 9395.3
## - BATTING_3B 1 262 224540 9395.6
## - BATTING_HR 1 709 224987 9399.5
## - PITCHING_HR 1 1076 225354 9402.7
## - BATTING_BB_SO 1 2041 226319 9411.2
## - BATTING_SO 1 2738 227016 9417.3
## - PITCHING_SO_BB 1 3217 227495 9421.4
## - PITCHING_BB 1 3923 228201 9427.6
## - PITCHING_H 1 4925 229203 9436.2
## - BATTING_BB 1 5771 230048 9443.5
## - FIELDING_DP 1 12307 236585 9499.0
## - BASERUN_SB 1 23639 247917 9591.6
## - FIELDING_E 1 55261 279539 9829.1
##
## Step: AIC=9393.26
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO +
## PITCHING_SO_BB + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BATTING_1B + BATTING_BB_SO + FIELDING_E + FIELDING_DP +
## BASERUN_SB
##
## Df Sum of Sq RSS AIC
## - PITCHING_SO 1 135 224413 9392.4
## <none> 224278 9393.3
## - BATTING_1B 1 926 225205 9399.4
## - PITCHING_HR 1 1084 225362 9400.8
## - BATTING_HR 1 1521 225799 9404.6
## - BATTING_BB_SO 1 2147 226425 9410.1
## - BATTING_SO 1 2742 227020 9415.3
## - PITCHING_SO_BB 1 3419 227697 9421.2
## - PITCHING_BB 1 3941 228219 9425.7
## - BATTING_3B 1 4219 228497 9428.1
## - BATTING_2B 1 4346 228624 9429.2
## - PITCHING_H 1 4954 229232 9434.5
## - BATTING_BB 1 9479 233758 9473.2
## - FIELDING_DP 1 12377 236655 9497.6
## - BASERUN_SB 1 23808 248086 9590.9
## - FIELDING_E 1 55387 279665 9828.0
##
## Step: AIC=9392.45
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO_BB +
## BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO +
## BATTING_1B + BATTING_BB_SO + FIELDING_E + FIELDING_DP + BASERUN_SB
##
## Df Sum of Sq RSS AIC
## <none> 224413 9392.4
## - PITCHING_HR 1 994 225408 9399.2
## - BATTING_HR 1 1570 225983 9404.2
## - BATTING_1B 1 1664 226078 9405.1
## - BATTING_BB_SO 1 3126 227540 9417.8
## - PITCHING_SO_BB 1 3788 228201 9423.6
## - PITCHING_BB 1 3806 228219 9423.7
## - BATTING_3B 1 4132 228546 9426.6
## - BATTING_2B 1 5545 229958 9438.7
## - PITCHING_H 1 7736 232150 9457.5
## - BATTING_BB 1 9396 233810 9471.6
## - BATTING_SO 1 9952 234366 9476.3
## - FIELDING_DP 1 12574 236987 9498.3
## - BASERUN_SB 1 23890 248304 9590.6
## - FIELDING_E 1 55267 279681 9826.1
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO_BB + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BATTING_1B + BATTING_BB_SO + FIELDING_E + FIELDING_DP +
## BASERUN_SB, data = backReg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.357 -6.938 0.012 6.944 47.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.120603 7.137530 4.921 9.35e-07 ***
## PITCHING_H 0.071381 0.008675 8.228 3.41e-16 ***
## PITCHING_HR -0.191683 0.064976 -2.950 0.003215 **
## PITCHING_BB -0.149522 0.025908 -5.771 9.12e-09 ***
## PITCHING_SO_BB 17.478624 3.035769 5.758 9.88e-09 ***
## BATTING_2B -0.081610 0.011716 -6.966 4.43e-12 ***
## BATTING_3B 0.114012 0.018959 6.014 2.16e-09 ***
## BATTING_HR 0.249909 0.067422 3.707 0.000216 ***
## BATTING_BB 0.264966 0.029219 9.068 < 2e-16 ***
## BATTING_SO -0.069789 0.007478 -9.333 < 2e-16 ***
## BATTING_1B -0.039170 0.010263 -3.817 0.000139 ***
## BATTING_BB_SO -18.235017 3.486053 -5.231 1.87e-07 ***
## FIELDING_E -0.101922 0.004634 -21.993 < 2e-16 ***
## FIELDING_DP -0.130336 0.012425 -10.490 < 2e-16 ***
## BASERUN_SB 0.068748 0.004754 14.460 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 1964 degrees of freedom
## Multiple R-squared: 0.4122, Adjusted R-squared: 0.408
## F-statistic: 98.38 on 14 and 1964 DF, p-value: < 2.2e-16
The AIC values suggest that removing the first variable listed (BATTING_2B) will drop the AIC from 9392.4 to 9390.5, thus improving the model.
## Start: AIC=9392.45
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO_BB +
## BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO +
## BATTING_TB + BATTING_BB_SO + FIELDING_E + FIELDING_DP + BASERUN_SB
##
## Df Sum of Sq RSS AIC
## - BATTING_2B 1 5 224419 9390.5
## <none> 224413 9392.4
## - PITCHING_HR 1 994 225408 9399.2
## - BATTING_TB 1 1664 226078 9405.1
## - BATTING_HR 1 3111 227524 9417.7
## - BATTING_BB_SO 1 3126 227540 9417.8
## - PITCHING_SO_BB 1 3788 228201 9423.6
## - PITCHING_BB 1 3806 228219 9423.7
## - BATTING_3B 1 7028 231442 9451.5
## - PITCHING_H 1 7736 232150 9457.5
## - BATTING_BB 1 9396 233810 9471.6
## - BATTING_SO 1 9952 234366 9476.3
## - FIELDING_DP 1 12574 236987 9498.3
## - BASERUN_SB 1 23890 248304 9590.6
## - FIELDING_E 1 55267 279681 9826.1
##
## Step: AIC=9390.5
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO_BB +
## BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO + BATTING_TB +
## BATTING_BB_SO + FIELDING_E + FIELDING_DP + BASERUN_SB
##
## Df Sum of Sq RSS AIC
## <none> 224419 9390.5
## - PITCHING_HR 1 1009 225428 9397.4
## - BATTING_BB_SO 1 3128 227547 9415.9
## - PITCHING_SO_BB 1 3783 228202 9421.6
## - BATTING_HR 1 3844 228263 9422.1
## - PITCHING_BB 1 5435 229854 9435.9
## - BATTING_TB 1 5745 230164 9438.5
## - BATTING_SO 1 9985 234404 9474.6
## - BATTING_BB 1 12356 236775 9494.6
## - FIELDING_DP 1 12634 237053 9496.9
## - PITCHING_H 1 12655 237073 9497.1
## - BATTING_3B 1 13743 238162 9506.1
## - BASERUN_SB 1 23989 248408 9589.5
## - FIELDING_E 1 55391 279810 9825.1
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO_BB + BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO +
## BATTING_TB + BATTING_BB_SO + FIELDING_E + FIELDING_DP + BASERUN_SB,
## data = backReg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.327 -6.948 0.009 6.929 47.394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.659404 6.688631 5.331 1.09e-07 ***
## PITCHING_H 0.072522 0.006890 10.526 < 2e-16 ***
## PITCHING_HR -0.192645 0.064809 -2.973 0.00299 **
## PITCHING_BB -0.152450 0.022099 -6.899 7.06e-12 ***
## PITCHING_SO_BB 17.462216 3.034088 5.755 1.00e-08 ***
## BATTING_3B 0.235903 0.021505 10.970 < 2e-16 ***
## BATTING_HR 0.413415 0.071255 5.802 7.62e-09 ***
## BATTING_BB 0.267951 0.025761 10.401 < 2e-16 ***
## BATTING_SO -0.069852 0.007470 -9.350 < 2e-16 ***
## BATTING_TB -0.041007 0.005782 -7.093 1.83e-12 ***
## BATTING_BB_SO -18.168351 3.471604 -5.233 1.84e-07 ***
## FIELDING_E -0.101961 0.004630 -22.023 < 2e-16 ***
## FIELDING_DP -0.130086 0.012368 -10.518 < 2e-16 ***
## BASERUN_SB 0.068801 0.004747 14.493 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 1965 degrees of freedom
## Multiple R-squared: 0.4122, Adjusted R-squared: 0.4083
## F-statistic: 106 on 13 and 1965 DF, p-value: < 2.2e-16
Model 3 excludes the variable BATTING_2B based on the results of model 2. As with model 2, the AIC values are reviewed and any variable whose removal would lower the AIC is removed.
The output of model 3 indicates that removing any more predictor variables will not improve the model by lowering the AIC. Thus, this is the end of the process of using the AIC to identify the importance of the predictor variables.
The R-squared value of 0.41 indicates model 3 explains 41% of the variability in TARGET_WINS.
The R-squared value ranges from 0% to 100%, and a higher R-squared value is desirable. A value of 0% indicates the model explains none of the variability in TARGET_WINS, and a value of 100% indicates the model explains all of it.
The Adjusted R-squared value for model 3 is 0.41. The Adjusted R-squared value differs from R-squared in that it adjusts for the number of predictors of TARGET_WINS in the model, whereas the R-squared value never decreases as the number of predictors increases.
The Adjusted R-squared value helps with determining whether including fewer predictors of TARGET_WINS improves the model. A review of models 1 and 2, which started from more predictors than model 3, yields essentially the same Adjusted R-squared values (0.408 and 0.4083). Thus model 3 is still the best of the 3 models based on AIC and the Adjusted R-squared values.
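The adjustment can be reproduced from the summary output using the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of predictors. The sketch below is illustrative only: `adjusted_r2` is a hypothetical helper, and n = 1979 is inferred from the residual degrees of freedom rather than stated in the report.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 1979  # inferred: residual df + estimated coefficients

# Multiple R-squared and predictor counts as printed in the summaries.
for label, r2, p in [("model 1", 0.4122, 14),
                     ("model 3", 0.4122, 13),
                     ("model 4", 0.3962, 11)]:
    print(f"{label}: adjusted R-squared = {adjusted_r2(r2, N_OBS, p):.4f}")
```

With the same R-squared of 0.4122, the 13-predictor model receives a slightly higher adjusted value than the 14-predictor model because it pays a smaller penalty.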
The F-statistic generated by model 3 is 106. The F-statistic tests whether the predictors, taken together, have a linear relationship with TARGET_WINS, and it is compared here across the 3 models.
A higher F-statistic indicates stronger evidence of a linear relationship. A review of models 1 and 2 shows a lower F-statistic of 98.4 for model 1 and the same F-statistic of 106 for models 2 and 3.
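The reported F-statistics can be recovered from R-squared and the degrees of freedom with the usual overall F formula, F = (R²/p) / ((1 − R²)/(n − p − 1)). The following is a small sketch, assuming n = 1979 (inferred from the degrees of freedom in the output); `overall_f` is a hypothetical helper name.

```python
def overall_f(r2: float, n_obs: int, n_predictors: int) -> float:
    """Overall F-statistic of a linear model, computed from its R-squared."""
    explained = r2 / n_predictors
    unexplained = (1.0 - r2) / (n_obs - n_predictors - 1)
    return explained / unexplained


N_OBS = 1979  # inferred: residual df + estimated coefficients

for label, r2, p in [("14 predictors", 0.4122, 14),
                     ("13 predictors", 0.4122, 13),
                     ("11 predictors", 0.3962, 11)]:
    print(f"{label}: F = {overall_f(r2, N_OBS, p):.1f}")
```

Because p appears in the denominator of the explained term, a model with the same R-squared and fewer predictors gets a larger F. This is worth keeping in mind when comparing F-statistics between models of different sizes.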
Analysis of the AIC, R-squared, Adjusted R-squared and F-statistic indicates model 3 is the best model.
A standard error close to zero is desirable for each predictor. A review of the standard errors of the predictors in model 3 shows 2 predictors above 1 (PITCHING_SO_BB = 3.03409, BATTING_BB_SO = 3.47160). Since these 2 predictors are above 1, they will be removed and a new model will be developed to determine whether the removal improves the model.
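For context, a standard error is expressed in the same units as its coefficient, so its size depends on the scale of the predictor. The t value in the coefficient table is the estimate divided by its standard error, which puts predictors on a common footing. The sketch below recomputes that ratio for four rows copied from the model 3 coefficient table; the dictionary name `model3_rows` is hypothetical.

```python
# (estimate, standard error) copied from the model 3 coefficient table
model3_rows = {
    "PITCHING_SO_BB": (17.462216, 3.034088),
    "BATTING_BB_SO": (-18.168351, 3.471604),
    "BATTING_TB": (-0.041007, 0.005782),
    "FIELDING_E": (-0.101961, 0.004630),
}

# t value = estimate / standard error
t_values = {name: est / se for name, (est, se) in model3_rows.items()}

for name, t in t_values.items():
    print(f"{name:15s} t = {t:8.3f}")
```

The two ratio predictors have standard errors above 1, but their estimates are correspondingly large, so their t values are of the same order as those of several count-based predictors.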
## Start: AIC=9390.5
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO_BB +
## BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO + BATTING_TB +
## BATTING_BB_SO + FIELDING_E + FIELDING_DP + BASERUN_SB
##
## Df Sum of Sq RSS AIC
## <none> 224419 9390.5
## - PITCHING_HR 1 1009 225428 9397.4
## - BATTING_BB_SO 1 3128 227547 9415.9
## - PITCHING_SO_BB 1 3783 228202 9421.6
## - BATTING_HR 1 3844 228263 9422.1
## - PITCHING_BB 1 5435 229854 9435.9
## - BATTING_TB 1 5745 230164 9438.5
## - BATTING_SO 1 9985 234404 9474.6
## - BATTING_BB 1 12356 236775 9494.6
## - FIELDING_DP 1 12634 237053 9496.9
## - PITCHING_H 1 12655 237073 9497.1
## - BATTING_3B 1 13743 238162 9506.1
## - BASERUN_SB 1 23989 248408 9589.5
## - FIELDING_E 1 55391 279810 9825.1
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO_BB + BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO +
## BATTING_TB + BATTING_BB_SO + FIELDING_E + FIELDING_DP + BASERUN_SB,
## data = backReg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.327 -6.948 0.009 6.929 47.394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.659404 6.688631 5.331 1.09e-07 ***
## PITCHING_H 0.072522 0.006890 10.526 < 2e-16 ***
## PITCHING_HR -0.192645 0.064809 -2.973 0.00299 **
## PITCHING_BB -0.152450 0.022099 -6.899 7.06e-12 ***
## PITCHING_SO_BB 17.462216 3.034088 5.755 1.00e-08 ***
## BATTING_3B 0.235903 0.021505 10.970 < 2e-16 ***
## BATTING_HR 0.413415 0.071255 5.802 7.62e-09 ***
## BATTING_BB 0.267951 0.025761 10.401 < 2e-16 ***
## BATTING_SO -0.069852 0.007470 -9.350 < 2e-16 ***
## BATTING_TB -0.041007 0.005782 -7.093 1.83e-12 ***
## BATTING_BB_SO -18.168351 3.471604 -5.233 1.84e-07 ***
## FIELDING_E -0.101961 0.004630 -22.023 < 2e-16 ***
## FIELDING_DP -0.130086 0.012368 -10.518 < 2e-16 ***
## BASERUN_SB 0.068801 0.004747 14.493 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 1965 degrees of freedom
## Multiple R-squared: 0.4122, Adjusted R-squared: 0.4083
## F-statistic: 106 on 13 and 1965 DF, p-value: < 2.2e-16
Removing the predictors PITCHING_SO_BB and BATTING_BB_SO because their standard errors were greater than 1 did not improve the model. Thus, model 3 remains the model used to predict TARGET_WINS.
The AIC of model 4 (9439.7) is higher than that of model 3 (9390.5), and its R-squared (0.3962 versus 0.4122) and Adjusted R-squared (0.3928 versus 0.4083) values are lower.
## Start: AIC=9439.74
## TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB + BATTING_3B +
## BATTING_HR + BATTING_BB + BATTING_SO + BATTING_TB + FIELDING_E +
## FIELDING_DP + BASERUN_SB
##
## Df Sum of Sq RSS AIC
## <none> 230539 9439.7
## - PITCHING_HR 1 557 231096 9442.5
## - BATTING_HR 1 3266 233805 9465.6
## - PITCHING_BB 1 6193 236732 9490.2
## - BATTING_TB 1 7342 237881 9499.8
## - BATTING_SO 1 8308 238847 9507.8
## - BATTING_BB 1 8405 238944 9508.6
## - FIELDING_DP 1 12391 242930 9541.4
## - PITCHING_H 1 12767 243307 9544.4
## - BATTING_3B 1 14512 245052 9558.6
## - BASERUN_SB 1 22827 253366 9624.6
## - FIELDING_E 1 52398 282937 9843.0
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + PITCHING_HR + PITCHING_BB +
## BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO + BATTING_TB +
## FIELDING_E + FIELDING_DP + BASERUN_SB, data = backReg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.360 -7.340 0.106 7.180 48.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.309297 5.220749 10.211 < 2e-16 ***
## PITCHING_H 0.072688 0.006964 10.437 < 2e-16 ***
## PITCHING_HR -0.141288 0.064825 -2.180 0.0294 *
## PITCHING_BB -0.160966 0.022144 -7.269 5.20e-13 ***
## BATTING_3B 0.242163 0.021762 11.128 < 2e-16 ***
## BATTING_HR 0.379487 0.071888 5.279 1.44e-07 ***
## BATTING_BB 0.204611 0.024162 8.468 < 2e-16 ***
## BATTING_SO -0.018536 0.002202 -8.419 < 2e-16 ***
## BATTING_TB -0.045664 0.005769 -7.915 4.10e-15 ***
## FIELDING_E -0.098442 0.004656 -21.144 < 2e-16 ***
## FIELDING_DP -0.128709 0.012518 -10.282 < 2e-16 ***
## BASERUN_SB 0.066725 0.004781 13.956 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.83 on 1967 degrees of freedom
## Multiple R-squared: 0.3962, Adjusted R-squared: 0.3928
## F-statistic: 117.3 on 11 and 1967 DF, p-value: < 2.2e-16
Visualizations of the residual values are used to determine whether model 3 adheres to a linear relationship between TARGET_WINS and the predictors. Residual values are the differences between the observed values of TARGET_WINS and the values fitted by the model.
These plots show whether there is a linear relationship, and how strong it is, for each of the predictor variables. The R-squared value of 0.41 means the model accounts for 41% of the variability in TARGET_WINS. Since the residuals show no systematic pattern, a linear relationship is a reasonable assumption.
The standardized residual plot shows whether the residuals follow a normal distribution, that is, whether they are symmetric and bell shaped. Since the points fall along the straight line, the residuals are approximately symmetric and bell shaped.
The Cook’s distance plot examines whether individual observations are influential to the quality of the model. The visualization suggests observations 1,377, 1,577 and 54 could be influential to the model. However, the decision was made to retain these observations in the model.
## Test stat Pr(>|Test stat|)
## PITCHING_H 0.4624 0.6438768
## PITCHING_HR -0.1880 0.8508727
## PITCHING_BB 0.5850 0.5585954
## PITCHING_SO_BB 0.4194 0.6749598
## BATTING_3B -0.2581 0.7963247
## BATTING_HR -0.4505 0.6524050
## BATTING_BB -0.4316 0.6660611
## BATTING_SO -0.3623 0.7171860
## BATTING_TB -1.4395 0.1501725
## BATTING_BB_SO -0.6886 0.4911782
## FIELDING_E 2.7630 0.0057809 **
## FIELDING_DP 3.8828 0.0001067 ***
## BASERUN_SB 0.8885 0.3744033
## Tukey test -1.4832 0.1380154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 346 1577
## rstudent unadjusted p-value Bonferonni p
## 1577 4.497814 7.2666e-06 0.014381
## StudRes Hat CookD
## 339 -1.1826729 0.09193725 0.01011321
## 346 3.9734582 0.01738419 0.01980267
## 634 0.9527759 0.13383258 0.01001922
## 1577 4.4978142 0.01829720 0.02667166
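The influence figures in the last two tables can be cross-checked by hand. The sketch below assumes that `StudRes` is the externally studentized residual, that `Hat` is the leverage, and that the model has 14 coefficients fitted to 1979 observations (both counts inferred from the model 3 summary). It converts the studentized residual to its internally studentized form and applies the usual Cook's distance formula D = (r²/p)·h/(1 − h). The Bonferroni p-value is taken to be the unadjusted p-value multiplied by the number of observations. Function names are hypothetical.

```python
def cooks_distance(rstudent: float, hat: float, n_obs: int, n_coef: int) -> float:
    """Cook's distance from an externally studentized residual and leverage."""
    df_resid = n_obs - n_coef
    # Squared internally studentized residual recovered from rstudent.
    r2_internal = rstudent**2 * df_resid / (df_resid - 1 + rstudent**2)
    return r2_internal / n_coef * hat / (1.0 - hat)


def bonferroni_p(p_value: float, n_tests: int) -> float:
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


N_OBS, N_COEF = 1979, 14  # inferred from the model 3 summary

# (observation, StudRes, Hat) copied from the influence table
for obs, stud_res, hat in [(339, -1.1826729, 0.09193725),
                           (346, 3.9734582, 0.01738419),
                           (634, 0.9527759, 0.13383258),
                           (1577, 4.4978142, 0.01829720)]:
    d = cooks_distance(stud_res, hat, N_OBS, N_COEF)
    print(f"obs {obs}: Cook's D = {d:.5f}")

print(f"Bonferroni p for obs 1577 = {bonferroni_p(7.2666e-06, N_OBS):.6f}")
```

Observations 346 and 1577 stand out through large residuals with modest leverage, while 339 and 634 have small residuals but comparatively high leverage.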
Model 3 is the best of the models considered. Results for transformed data are not presented here because evaluating their output did not yield better results than those presented. The same predictor variables were removed as in the model presented here.