## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
From the summaries The maximum values for TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO and TEAM_FIELDING_E seem abnormally large. There may be outliers in the columns. We can confirm this finding by checking the distributions of the values:
From the boxplots, there are indeed values in TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO and TEAM_FIELDING_E that are extremly off from the majority of the data. We would handle these outliers later with the other problems.
From the summaries, we see that are are missing values for TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, EAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP. Now we check the portion of missing data in each field:
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0.00000000 0.00000000 0.00000000 0.00000000
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0.00000000 0.00000000 0.04481547 0.05755712
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 0.33919156 0.91608084 0.00000000 0.00000000
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0.00000000 0.04481547 0.00000000 0.12565905
91.6% of the data in TEAM_BATTING_HBP are missing. Since the minimum of TEAM_BATTING_HBP is 29, it is not plausible that the missing values are all 0. We will drop this field from our analysis as there are too many missing values and there is no good way of imputing the values.
For TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_SO, and TEAM_FIELDING_DP, we will do imputation and may be handled with other problems.
Now let’s look at the correlations between the variables
The following variables are nearly perfectly correlated
We take a more careful look at the correlation between the TARGET_WINS and other variables, and compare it with the theoretical effects
## correlation_with_TARGET_WINS Theoretical_Effect
## TEAM_BATTING_H 0.4699 Positive
## TEAM_BATTING_2B 0.3130 Positive
## TEAM_BATTING_3B -0.1243 Positive
## TEAM_BATTING_HR 0.4224 Positive
## TEAM_BATTING_BB 0.4687 Positive
## TEAM_BATTING_SO -0.2289 Negative
## TEAM_BASERUN_SB 0.0148 Positive
## TEAM_BASERUN_CS -0.1788 Negative
## TEAM_BATTING_HBP 0.0735 Positive
## TEAM_PITCHING_H 0.4712 Negative
## TEAM_PITCHING_HR 0.4225 Negative
## TEAM_PITCHING_BB 0.4684 Negative
## TEAM_PITCHING_SO -0.2294 Positive
## TEAM_FIELDING_E -0.3867 Negative
## TEAM_FIELDING_DP -0.1959 Positive
**The following variables do not have correlation matching with the theoretical effect:
For TEAM_BATTING_3B and TEAM_FIELDING_DP, we would need to perform deeper analysis on this finding.
For TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, we may consider dropping the variables since they are amlost perfectly correlated with one other variable. Also, TEAM_PITCHING_H, TEAM_PITCHING_BB and TEAM_PITCHING_SO have outlier problem as found above.
We check the normalities of the variables to determine if transformation is needed.
TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO and TEAM_FIELDING_E are largely right skewed because of the outliers. We will check the distributions later again when the outliers are handled.
TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_PITCHING_HR are bimodal that may need to be transformed.
There are some variables such as TEAM_BASERUN_SB that are slightly right skewed. We may keep them as it for easier interpretation of the result.
TEAM_BATTING_HBP is dropped for the following reason(s)
TEAM_PITCHING_HR is dropped for the following reason(s)
TEAM_PITCHING_H,TEAM_PITCHING_BB,TEAM_PITCHING_SO are dropped for the following reason(s)
Since TEAM_FIELDING_E is extremely right skewed, we will transform the variable using Box-Cox transformation
The optimal lamba from the following result plot is near -1, so we will transform the variable using the power of -1
The following density plot and box plot show that the distribution is closer to normal and there are no extreme ourliers.
Since TEAM_BATTING_H also counts the number of TEAM_BATTING_2B, TEAM_BATTING_3B and TEAM_BATTING_3B.
Instead of using TEAM_BATTING_H, we will create a new variable TEAM_BATTING_1B by subtracting TEAM_BATTING_2B, TEAM_BATTING_3B and TEAM_BATTING_3B from TEAM_BATTING_H
The distribution of the new variable is slightly right skewed. We will keep it as it for now unless transformation is necessary when developing a model.
The only problem now is the missing values in TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS and TEAM_FIELDING_DP
We will impute the missing values by using linear regression models. We will not go deep into evaluating these models in this project. The purpose here is to impute missing values have better explained variance than simply using the means or medians.
In the imputation models, TARGET_WINS is not included as an independent variable since the values are not provided in the test data set. We will need to use the same models to impute missing values in the test data set.
We will perform separate imputations in the following order, based on the number of missing values: 1. TEAM_BATTING_SO 2. TEAM_BASERUN_SB 3. TEAM_FIELDING_DP 4. TEAM_BASERUN_CS
Variables that are already imputed are included in the models after and un-imputed variables are not included. We will keep the trained models and use them to impute missing values in the test data set.
Before imputations:
## train_prepared_df.TEAM_BATTING_SO train_prepared_df.TEAM_BASERUN_SB
## 4.481547 5.755712
## train_prepared_df.TEAM_BASERUN_CS train_prepared_df.TEAM_FIELDING_DP
## 33.919156 12.565905
After imputations:
## train_prepared_df.TEAM_BATTING_SO train_prepared_df.TEAM_BASERUN_SB
## 0 0
## train_prepared_df.TEAM_BASERUN_CS train_prepared_df.TEAM_FIELDING_DP
## 0 0
As you can all see, the 4 variables, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP, all are having 0 missing values. Meaning they are no longer presented with missing values.
The R-squared values of the models are:
## [1] "TEAM_BATTING_SO - R-squared:0.7045"
## [1] "TEAM_BASERUN_SB - R-squared:0.1571"
## [1] "TEAM_BASERUN_CS - R-squared:0.599"
## [1] "TEAM_FIELDING_DP - R-squared:0.1227"
The numbers in the summary of our prepared data set all look plausible, we are ready for our model development
## TARGET_WINS TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## Min. : 0.00 Min. : 69.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00
## Median : 82.00 Median :238.0 Median : 47.00 Median :102.00
## Mean : 80.79 Mean :241.2 Mean : 55.25 Mean : 99.61
## 3rd Qu.: 92.00 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00
## Max. :146.00 Max. :458.0 Max. :223.00 Max. :264.00
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:451.0 1st Qu.: 542.0 1st Qu.: 67.0 1st Qu.: 43.00
## Median :512.0 Median : 732.0 Median :101.0 Median : 57.00
## Mean :501.6 Mean : 727.6 Mean :123.4 Mean : 69.27
## 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:152.0 3rd Qu.: 86.89
## Max. :878.0 Max. :1399.0 Max. :697.0 Max. :272.21
## TEAM_FIELDING_DP TEAM_FIELDING_E_Transformed TEAM_BATTING_1B
## Min. : 52.0 Min. :0.0005269 Min. : 709.0
## 1st Qu.:130.0 1st Qu.:0.0040120 1st Qu.: 990.8
## Median :146.0 Median :0.0062893 Median :1050.0
## Mean :144.9 Mean :0.0059708 Mean :1073.2
## 3rd Qu.:162.0 3rd Qu.:0.0078740 3rd Qu.:1129.0
## Max. :228.0 Max. :0.0153846 Max. :2112.0
TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_FIELDING_DP, TEAM_FIELDING_E_Transformed
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_prepared_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.884 -8.302 0.085 8.208 66.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.222e+01 5.369e+00 2.277 0.022887 *
## TEAM_BATTING_2B 2.067e-02 7.396e-03 2.795 0.005240 **
## TEAM_BATTING_3B 1.437e-01 1.713e-02 8.389 < 2e-16 ***
## TEAM_BATTING_HR 8.623e-02 9.275e-03 9.297 < 2e-16 ***
## TEAM_BATTING_BB 2.181e-02 2.942e-03 7.414 1.72e-13 ***
## TEAM_BATTING_SO -9.109e-03 2.383e-03 -3.823 0.000136 ***
## TEAM_BASERUN_SB 1.534e-02 8.865e-03 1.730 0.083685 .
## TEAM_BASERUN_CS 1.215e-02 2.325e-02 0.522 0.601455
## TEAM_FIELDING_DP -1.368e-01 1.369e-02 -9.993 < 2e-16 ***
## TEAM_FIELDING_E_Transformed 2.373e+03 2.113e+02 11.232 < 2e-16 ***
## TEAM_BATTING_1B 4.254e-02 3.716e-03 11.450 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 2265 degrees of freedom
## Multiple R-squared: 0.3109, Adjusted R-squared: 0.3079
## F-statistic: 102.2 on 10 and 2265 DF, p-value: < 2.2e-16
In the full model, The sign of TEAM_BASERUN_CS and TEAM_FIELDING_DP do not match with the theoretical effects.
By looking at the below confidence intervals with 5% significance level, The positive sign for TEAM_BASERUN_CS may just happen by chance. The high p-value of TEAM_BASERUN_CS also indicates that the variable is not significant here. With the fact that more than 30% of the values of TEAM_BASERUN_CS are imputed. We suggest to drop this variable from our model.
For TEAM_FIELDING_DP, confidence interval is below 0 and the correlation between TARGET_WINS and TEAM_FIELDING_DP is -0.2. In the correlation matrix, it doesn’t show any strong correlation with any other variables. One explanation is that double play happens when there is already a runner on a base. So a higher TEAM_FIELDING_DP means the team let the an opponent player stay on a base more frequently. A good team will get their opponents out before double play can happen. We may need to observe the behaviors of the baseball players directly to find out the true reason of the negative correlation. But, in this analysis, we will compare the performance of the models with or without TEAM_FIELDING_DP using the test data set to determine whether we should keep this variable in our model.
## 2.5 % 97.5 %
## (Intercept) 1.695863e+00 2.275411e+01
## TEAM_BATTING_2B 6.165689e-03 3.517413e-02
## TEAM_BATTING_3B 1.101057e-01 1.772853e-01
## TEAM_BATTING_HR 6.804495e-02 1.044223e-01
## TEAM_BATTING_BB 1.604367e-02 2.758226e-02
## TEAM_BATTING_SO -1.378154e-02 -4.435984e-03
## TEAM_BASERUN_SB -2.043957e-03 3.272634e-02
## TEAM_BASERUN_CS -3.345244e-02 5.774641e-02
## TEAM_FIELDING_DP -1.636101e-01 -1.099324e-01
## TEAM_FIELDING_E_Transformed 1.958557e+03 2.787093e+03
## TEAM_BATTING_1B 3.525880e-02 4.983134e-02
exclude TEAM_BASERUN_CS
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BASERUN_CS, data = train_prepared_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.884 -8.330 0.106 8.252 66.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.315e+01 5.068e+00 2.595 0.009519 **
## TEAM_BATTING_2B 2.029e-02 7.360e-03 2.757 0.005876 **
## TEAM_BATTING_3B 1.463e-01 1.637e-02 8.938 < 2e-16 ***
## TEAM_BATTING_HR 8.526e-02 9.083e-03 9.386 < 2e-16 ***
## TEAM_BATTING_BB 2.184e-02 2.941e-03 7.428 1.55e-13 ***
## TEAM_BATTING_SO -9.211e-03 2.374e-03 -3.879 0.000108 ***
## TEAM_BASERUN_SB 1.937e-02 4.353e-03 4.451 8.97e-06 ***
## TEAM_FIELDING_DP -1.373e-01 1.365e-02 -10.056 < 2e-16 ***
## TEAM_FIELDING_E_Transformed 2.357e+03 2.091e+02 11.274 < 2e-16 ***
## TEAM_BATTING_1B 4.225e-02 3.672e-03 11.505 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 2266 degrees of freedom
## Multiple R-squared: 0.3109, Adjusted R-squared: 0.3081
## F-statistic: 113.6 on 9 and 2266 DF, p-value: < 2.2e-16
The significance of TEAM_BASERUN_SB increases and the Adjusted R-squared increases. Dropping TEAM_BASERUN_CS gives us a better result. We will also verify this in our test data set.
exclude TEAM_BASERUN_CS and TEAM_FIELDING_DP
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BASERUN_CS - TEAM_FIELDING_DP,
## data = train_prepared_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.701 -8.440 0.110 8.463 64.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.731e+00 5.047e+00 0.343 0.7316
## TEAM_BATTING_2B 1.915e-02 7.520e-03 2.546 0.0110 *
## TEAM_BATTING_3B 1.527e-01 1.672e-02 9.136 < 2e-16 ***
## TEAM_BATTING_HR 7.170e-02 9.179e-03 7.811 8.59e-15 ***
## TEAM_BATTING_BB 1.560e-02 2.937e-03 5.310 1.20e-07 ***
## TEAM_BATTING_SO -7.351e-03 2.419e-03 -3.039 0.0024 **
## TEAM_BASERUN_SB 3.324e-02 4.219e-03 7.880 5.06e-15 ***
## TEAM_FIELDING_E_Transformed 2.160e+03 2.127e+02 10.152 < 2e-16 ***
## TEAM_BATTING_1B 3.671e-02 3.710e-03 9.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.39 on 2267 degrees of freedom
## Multiple R-squared: 0.2801, Adjusted R-squared: 0.2776
## F-statistic: 110.3 on 8 and 2267 DF, p-value: < 2.2e-16
The significance of TEAM_BATTING_2B and TEAM_BATTING_SO decreases and the Adjusted R-squared also decreases. It seems that it is better to keep TEAM_FIELDING_DP in our model.
exclude all variables with imputed values (TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS and TEAM_FIELDING_DP)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_SO - TEAM_BASERUN_SB -
## TEAM_BASERUN_CS - TEAM_FIELDING_DP, data = train_prepared_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.320 -8.342 0.075 8.584 64.966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.779e+00 3.601e+00 -1.882 0.0599 .
## TEAM_BATTING_2B 1.457e-02 7.590e-03 1.919 0.0551 .
## TEAM_BATTING_3B 1.923e-01 1.608e-02 11.960 < 2e-16 ***
## TEAM_BATTING_HR 5.544e-02 8.527e-03 6.502 9.72e-11 ***
## TEAM_BATTING_BB 2.237e-02 2.831e-03 7.901 4.27e-15 ***
## TEAM_FIELDING_E_Transformed 1.816e+03 2.082e+02 8.722 < 2e-16 ***
## TEAM_BATTING_1B 4.272e-02 3.085e-03 13.848 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.56 on 2269 degrees of freedom
## Multiple R-squared: 0.2604, Adjusted R-squared: 0.2584
## F-statistic: 133.1 on 6 and 2269 DF, p-value: < 2.2e-16
As expected, including less variables produce a model with lower R-squared value. The Adjusted R-squared is also lower in this case. However, this model is much simpler and all the variables have no missing values. We will compare the performance of this model to the other ones we have built using the test data set.
The plots for all four models all look the same.
The residual plots and standardized residual plots show that the residuals are independent and approximately constant with mean 0 within the cloud of data. Basically, we see in all 4 residuals vs fitted plots (Column 1) that there is approximately symmetrical variation around 0, i.e. homoscedasticity. Nor did we detect any non-linear patterns.
The Q-Q plots show that the residuals are approximately normal except the two tails but the problem is mild.
The Residuals vs Leverage plots show no point outside of the Cook’s distance. There is no strong influence point.
We conclude that all four models follow the assumptions of OLS regression and so they are valid.
The F-statistic is used to measure the significance of one or more variables if they are added to a base model.
The base model is our model with TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS and TEAM_FIELDING_DP excluded
We compare it with the model with TEAM_BATTING_SO and TEAM_BASERUN_SB added
The F-value with degree of freedom 2 is 31.068, indicating that adding TEAM_BATTING_SO and TEAM_BASERUN_SB does make an improvement to the model.
Next use our improved model as our new base model check if adding TEAM_FIELDING_DP will make another improvement.
The result shows that the effect of TEAM_FIELDING_DP is significant
Last, let’s check if adding TEAM_BASERUN_CS will help
The variable TEAM_BASERUN_CS insignificant as we have checked before using the p-value of the coefficient.
From the F-Statistics, the optimal model is the model with the following predictors:
The R-squared or Adjusted R-squared is used to measure how well a model fit in the train data. Since our models have different number of predictor variables, it is better to compare the Adjusted R-squared of the 4 models
The RMSD (Root Mean Square Deviation) is a measurement for the difference between a model’s predicted values and the actual values.
The Adjusted R-squared and RMSD for all 4 models are as following:
The second model has the highest Adjusted_R_Squared value and an RMSD slightly higher than the smallest RMSD produced by the full model. We conclude that the optimal model is the same as we found from the F-Statistic tests.
Now, let’s check if the models are producing plausible prediction values using the test data set.
let’s look at the summary of our test data set:
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## Min. : 819 Min. : 44.0 Min. : 14.00 Min. : 0.00
## 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00 1st Qu.: 44.50
## Median :1455 Median :239.0 Median : 52.00 Median :101.00
## Mean :1469 Mean :241.3 Mean : 55.91 Mean : 95.63
## 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00 3rd Qu.:135.50
## Max. :2170 Max. :376.0 Max. :155.00 Max. :242.00
##
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## Min. : 15.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0 1st Qu.: 38.00
## Median :509.0 Median : 686.0 Median : 92.0 Median : 49.50
## Mean :499.0 Mean : 709.3 Mean :123.7 Mean : 52.32
## 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8 3rd Qu.: 63.00
## Max. :792.0 Max. :1268.0 Max. :580.0 Max. :154.00
## NA's :18 NA's :13 NA's :87
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## Min. :42.00 Min. : 1155 Min. : 0.0 Min. : 136.0
## 1st Qu.:53.50 1st Qu.: 1426 1st Qu.: 52.0 1st Qu.: 471.0
## Median :62.00 Median : 1515 Median :104.0 Median : 526.0
## Mean :62.37 Mean : 1813 Mean :102.1 Mean : 552.4
## 3rd Qu.:67.50 3rd Qu.: 1681 3rd Qu.:142.5 3rd Qu.: 606.5
## Max. :96.00 Max. :22768 Max. :336.0 Max. :2008.0
## NA's :240
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 73.0 Min. : 69.0
## 1st Qu.: 613.0 1st Qu.: 131.0 1st Qu.:131.0
## Median : 745.0 Median : 163.0 Median :148.0
## Mean : 799.7 Mean : 249.7 Mean :146.1
## 3rd Qu.: 938.0 3rd Qu.: 252.0 3rd Qu.:164.0
## Max. :9963.0 Max. :1568.0 Max. :204.0
## NA's :18 NA's :31
First, let’s transform and impute the variables in the test data set.
The following is the summary of the prepared test data
## TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## Min. : 44.0 Min. : 14.00 Min. : 0.00 Min. : 15.0
## 1st Qu.:210.0 1st Qu.: 35.00 1st Qu.: 44.50 1st Qu.:436.5
## Median :239.0 Median : 52.00 Median :101.00 Median :509.0
## Mean :241.3 Mean : 55.91 Mean : 95.63 Mean :499.0
## 3rd Qu.:278.5 3rd Qu.: 72.00 3rd Qu.:135.50 3rd Qu.:565.5
## Max. :376.0 Max. :155.00 Max. :242.00 Max. :792.0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 73.0
## 1st Qu.: 534.9 1st Qu.: 59.0 1st Qu.: 43.00 1st Qu.: 131.0
## Median : 677.0 Median : 92.0 Median : 57.00 Median : 163.0
## Mean : 699.7 Mean :121.8 Mean : 69.77 Mean : 249.7
## 3rd Qu.: 904.5 3rd Qu.:149.0 3rd Qu.: 88.12 3rd Qu.: 252.0
## Max. :1268.0 Max. :580.0 Max. :240.08 Max. :1568.0
## TEAM_FIELDING_DP TEAM_FIELDING_E_Transformed TEAM_BATTING_1B
## Min. : 69.0 Min. :0.0006378 Min. : 657.0
## 1st Qu.:131.0 1st Qu.:0.0039685 1st Qu.: 990.5
## Median :146.0 Median :0.0061350 Median :1059.0
## Mean :144.7 Mean :0.0057873 Mean :1076.5
## 3rd Qu.:162.0 3rd Qu.:0.0076336 3rd Qu.:1134.0
## Max. :204.0 Max. :0.0136986 Max. :1846.0
The follow plot shows the distribution of the actual Target_Win value from the train data set and the distributions of the predicted values from the 4 models
The distribution of the predicted values for all 4 models are similar, with the same mean and variance. The means are also close to the mean of the Target_Win from our train data set. The variances are different because the test data has smaller sample size. All conclude that all 4 models produce plausible prediction values.
Based on above findings, the optimal model is
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BASERUN_CS, data = train_prepared_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.884 -8.330 0.106 8.252 66.698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.315e+01 5.068e+00 2.595 0.009519 **
## TEAM_BATTING_2B 2.029e-02 7.360e-03 2.757 0.005876 **
## TEAM_BATTING_3B 1.463e-01 1.637e-02 8.938 < 2e-16 ***
## TEAM_BATTING_HR 8.526e-02 9.083e-03 9.386 < 2e-16 ***
## TEAM_BATTING_BB 2.184e-02 2.941e-03 7.428 1.55e-13 ***
## TEAM_BATTING_SO -9.211e-03 2.374e-03 -3.879 0.000108 ***
## TEAM_BASERUN_SB 1.937e-02 4.353e-03 4.451 8.97e-06 ***
## TEAM_FIELDING_DP -1.373e-01 1.365e-02 -10.056 < 2e-16 ***
## TEAM_FIELDING_E_Transformed 2.357e+03 2.091e+02 11.274 < 2e-16 ***
## TEAM_BATTING_1B 4.225e-02 3.672e-03 11.505 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 2266 degrees of freedom
## Multiple R-squared: 0.3109, Adjusted R-squared: 0.3081
## F-statistic: 113.6 on 9 and 2266 DF, p-value: < 2.2e-16