In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team.
The training data has 17 columns and 2,276 rows.
The explanatory columns are broken down into four categories: batting, baserunning, pitching, and fielding.
Below you will see a preview of the columns and the first few observations broken down into these four categories.
The variable TARGET_WINS is the number of wins of a professional baseball team for a given year. The year is not part of the data set. This is the dependent variable that our models will attempt to predict.
As you can see, the distribution of the number of wins is unimodal and slightly skewed to the left (the mean, 80.79, is below the median, 82.00), with some outliers in the tails. Otherwise it looks approximately normal. The minimum number of wins for a team is 0 and the maximum is 146.
The boxplot above shows that there are suspected outliers at both ends.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 71.00 82.00 80.79 92.00 146.00
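The summary and the suspected outliers described above can be reproduced along these lines. This is a minimal sketch on simulated data: the vector `wins` is a made-up stand-in for TARGET_WINS, not the homework file.

```r
# Simulated stand-in for TARGET_WINS (assumption: the real data is not loaded here).
set.seed(1)
wins <- c(round(rnorm(200, mean = 81, sd = 15)), 0, 146)

# Min, quartiles, median, mean and max, in the same layout as the report.
print(summary(wins))

# boxplot.stats() returns the points a boxplot would draw as suspected outliers
# (beyond 1.5 times the interquartile range from the hinges).
outliers <- boxplot.stats(wins)$out
print(outliers)

# A mean below the median is a quick, informal hint of left skew.
mean_below_median <- mean(wins) < median(wins)
```

On the real data, `wins` would be replaced by `data$TARGET_WINS`.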
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
TEAM_BATTING_H | Base Hits by batters (1B,2B,3B,HR) | Positive Impact on Wins |
TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins |
TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins |
TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins |
TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins |
TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins |
TEAM_BATTING_SO | Strikeouts by batters | Negative Impact on Wins |
As you can see, two variables have some N/A values (n < 2276). In particular, TEAM_BATTING_HBP has only 191 values that are not missing.
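The missing-value counts quoted in this report can be tabulated per column with `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`data_toy`) in place of the training data.

```r
# Made-up rows; only the column names follow the data dictionary.
data_toy <- data.frame(
  TEAM_BATTING_H   = c(1400, 1500, 1350, 1600),
  TEAM_BATTING_SO  = c(900, NA, 1000, 850),
  TEAM_BATTING_HBP = c(NA, NA, 50, NA)
)

# Number of missing and non-missing values per column.
na_count  <- colSums(is.na(data_toy))
n_present <- nrow(data_toy) - na_count

print(rbind(missing = na_count, present = n_present))
```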
See Pairings of all Variables to view correlation table and scatter plots of all variables.
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
TEAM_BASERUN_SB | Stolen bases | Positive Impact on Wins |
TEAM_BASERUN_CS | Caught stealing | Negative Impact on Wins |
As you can see, both variables have some N/A values (n < 2276).
See Pairings of all Variables to view correlation table and scatter plots of all variables.
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins |
TEAM_PITCHING_H | Hits allowed | Negative Impact on Wins |
TEAM_PITCHING_HR | Homeruns allowed | Negative Impact on Wins |
TEAM_PITCHING_SO | Strikeouts by pitchers | Positive Impact on Wins |
As you can see, one variable has some N/A values (n < 2276).
See Pairings of all Variables to view correlation table and scatter plots of all variables.
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
TEAM_FIELDING_E | Errors | Negative Impact on Wins |
TEAM_FIELDING_DP | Double Plays | Positive Impact on Wins |
As you can see, one variable has some N/A values (n < 2276).
See Pairings of all Variables to view correlation table and scatter plots of all variables.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS 1.00 0.47 0.31
## TEAM_BATTING_H 0.47 1.00 0.56
## TEAM_BATTING_2B 0.31 0.56 1.00
## TEAM_BATTING_3B -0.12 0.21 0.04
## TEAM_BATTING_HR 0.42 0.40 0.25
## TEAM_BATTING_BB 0.47 0.20 0.20
## TEAM_BATTING_SO -0.23 -0.34 -0.06
## TEAM_BASERUN_SB 0.01 0.07 -0.19
## TEAM_BASERUN_CS -0.18 -0.09 -0.20
## TEAM_BATTING_HBP 0.07 -0.03 0.05
## TEAM_PITCHING_H 0.47 1.00 0.56
## TEAM_PITCHING_HR 0.42 0.39 0.25
## TEAM_PITCHING_BB 0.47 0.20 0.20
## TEAM_PITCHING_SO -0.23 -0.34 -0.07
## TEAM_FIELDING_E -0.39 -0.25 -0.19
## TEAM_FIELDING_DP -0.20 0.02 -0.02
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS -0.12 0.42 0.47
## TEAM_BATTING_H 0.21 0.40 0.20
## TEAM_BATTING_2B 0.04 0.25 0.20
## TEAM_BATTING_3B 1.00 -0.22 -0.21
## TEAM_BATTING_HR -0.22 1.00 0.46
## TEAM_BATTING_BB -0.21 0.46 1.00
## TEAM_BATTING_SO -0.19 0.21 0.22
## TEAM_BASERUN_SB 0.17 -0.19 -0.09
## TEAM_BASERUN_CS 0.23 -0.28 -0.21
## TEAM_BATTING_HBP -0.17 0.11 0.05
## TEAM_PITCHING_H 0.21 0.40 0.20
## TEAM_PITCHING_HR -0.22 1.00 0.46
## TEAM_PITCHING_BB -0.21 0.46 1.00
## TEAM_PITCHING_SO -0.19 0.21 0.22
## TEAM_FIELDING_E -0.07 0.02 -0.08
## TEAM_FIELDING_DP 0.13 -0.06 -0.08
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## TARGET_WINS -0.23 0.01 -0.18
## TEAM_BATTING_H -0.34 0.07 -0.09
## TEAM_BATTING_2B -0.06 -0.19 -0.20
## TEAM_BATTING_3B -0.19 0.17 0.23
## TEAM_BATTING_HR 0.21 -0.19 -0.28
## TEAM_BATTING_BB 0.22 -0.09 -0.21
## TEAM_BATTING_SO 1.00 -0.07 -0.06
## TEAM_BASERUN_SB -0.07 1.00 0.62
## TEAM_BASERUN_CS -0.06 0.62 1.00
## TEAM_BATTING_HBP 0.22 -0.06 -0.07
## TEAM_PITCHING_H -0.34 0.07 -0.09
## TEAM_PITCHING_HR 0.21 -0.19 -0.28
## TEAM_PITCHING_BB 0.22 -0.09 -0.21
## TEAM_PITCHING_SO 1.00 -0.07 -0.06
## TEAM_FIELDING_E 0.31 0.04 0.21
## TEAM_FIELDING_DP -0.12 -0.13 -0.01
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## TARGET_WINS 0.07 0.47 0.42
## TEAM_BATTING_H -0.03 1.00 0.39
## TEAM_BATTING_2B 0.05 0.56 0.25
## TEAM_BATTING_3B -0.17 0.21 -0.22
## TEAM_BATTING_HR 0.11 0.40 1.00
## TEAM_BATTING_BB 0.05 0.20 0.46
## TEAM_BATTING_SO 0.22 -0.34 0.21
## TEAM_BASERUN_SB -0.06 0.07 -0.19
## TEAM_BASERUN_CS -0.07 -0.09 -0.28
## TEAM_BATTING_HBP 1.00 -0.03 0.11
## TEAM_PITCHING_H -0.03 1.00 0.39
## TEAM_PITCHING_HR 0.11 0.39 1.00
## TEAM_PITCHING_BB 0.05 0.20 0.46
## TEAM_PITCHING_SO 0.22 -0.34 0.21
## TEAM_FIELDING_E 0.04 -0.25 0.02
## TEAM_FIELDING_DP -0.07 0.01 -0.06
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## TARGET_WINS 0.47 -0.23 -0.39
## TEAM_BATTING_H 0.20 -0.34 -0.25
## TEAM_BATTING_2B 0.20 -0.07 -0.19
## TEAM_BATTING_3B -0.21 -0.19 -0.07
## TEAM_BATTING_HR 0.46 0.21 0.02
## TEAM_BATTING_BB 1.00 0.22 -0.08
## TEAM_BATTING_SO 0.22 1.00 0.31
## TEAM_BASERUN_SB -0.09 -0.07 0.04
## TEAM_BASERUN_CS -0.21 -0.06 0.21
## TEAM_BATTING_HBP 0.05 0.22 0.04
## TEAM_PITCHING_H 0.20 -0.34 -0.25
## TEAM_PITCHING_HR 0.46 0.21 0.02
## TEAM_PITCHING_BB 1.00 0.22 -0.08
## TEAM_PITCHING_SO 0.22 1.00 0.31
## TEAM_FIELDING_E -0.08 0.31 1.00
## TEAM_FIELDING_DP -0.08 -0.12 0.04
## TEAM_FIELDING_DP
## TARGET_WINS -0.20
## TEAM_BATTING_H 0.02
## TEAM_BATTING_2B -0.02
## TEAM_BATTING_3B 0.13
## TEAM_BATTING_HR -0.06
## TEAM_BATTING_BB -0.08
## TEAM_BATTING_SO -0.12
## TEAM_BASERUN_SB -0.13
## TEAM_BASERUN_CS -0.01
## TEAM_BATTING_HBP -0.07
## TEAM_PITCHING_H 0.01
## TEAM_PITCHING_HR -0.06
## TEAM_PITCHING_BB -0.08
## TEAM_PITCHING_SO -0.12
## TEAM_FIELDING_E 0.04
## TEAM_FIELDING_DP 1.00
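The call that produced the matrix above is not shown, so the following is only a sketch of how the `use` argument of `cor()` decides which rows contribute when some columns contain N/A values. The data frame `toy` and its columns are made up.

```r
# Column z is missing in half of the rows.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  z = c(NA, NA, NA, 1, 3, 2)
)

# Only rows with no missing value in ANY column are used (rows 4 to 6 here).
r_complete <- cor(toy, use = "complete.obs")

# Each pair of columns uses every row where both of them are present.
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(r_complete, 2))
print(round(r_pairwise, 2))
```

With `"complete.obs"`, a single sparse column (for example one with only 191 non-missing values) restricts every correlation in the matrix to those rows, so the choice matters when reading the table.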
There are 6 explanatory variables with missing values.
Simple linear regressions on TARGET_WINS suggest that four of these six variables are not significant at the 5% level: TEAM_BATTING_SO, TEAM_BATTING_HBP, TEAM_BASERUN_CS, and TEAM_FIELDING_DP. See section Simple Linear Regression of each Variable for more details. The recommendation is to drop these from the models.
The remaining two, TEAM_PITCHING_SO and TEAM_BASERUN_SB, are significant predictors of TARGET_WINS in simple linear regression.
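The screening described here (one simple regression per predictor, then a look at the p-value) can be sketched as a loop. Everything below is illustrative: the helper `screen_predictor`, the data frame `toy`, and its columns `signal` and `noise` are made up.

```r
# Simulated data: `signal` drives the response, `noise` does not and has missing values.
set.seed(42)
n <- 200
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$TARGET_WINS <- 80 + 5 * toy$signal + rnorm(n, sd = 5)
toy$noise[1:20] <- NA

# Fit response ~ predictor and collect the numbers quoted for each variable.
screen_predictor <- function(var, df, response = "TARGET_WINS") {
  fit <- lm(reformulate(var, response = response), data = df)
  s <- summary(fit)
  data.frame(
    predictor = var,
    n_missing = sum(is.na(df[[var]])),
    estimate  = s$coefficients[2, "Estimate"],
    p_value   = s$coefficients[2, "Pr(>|t|)"],
    r_squared = s$r.squared
  )
}

screen <- do.call(rbind, lapply(c("signal", "noise"), screen_predictor, df = toy))
print(screen)
```

Note that a predictor that is not significant on its own can still matter in a multiple regression, so this kind of screen is a heuristic rather than a proof.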
This section investigates more closely the relationship of each explanatory variable with the response variable TARGET_WINS. It also gives us some insight into how to handle the variables with missing values.
Base hits by batter (1B, 2B, 3B, HR). This variable is the sum of 1B, 2B, 3B, and HR.
No Missing values.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.042353 (contributes 0.04 points to TARGET_WINS with every unit increase). The p-value is < 2e-16 *** (predictor is significant). The R-squared is 0.1511 (explains about 15% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BATTING_H, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.768 -8.757 0.856 9.762 46.016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.562326 3.107523 5.973 2.69e-09 ***
## TEAM_BATTING_H 0.042353 0.002105 20.122 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.52 on 2274 degrees of freedom
## Multiple R-squared: 0.1511, Adjusted R-squared: 0.1508
## F-statistic: 404.9 on 1 and 2274 DF, p-value: < 2.2e-16
Doubles by batters (2B).
No Missing values.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.097305 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.08358 (explains about 8% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.453 -9.572 0.636 10.135 57.351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.316365 1.660403 34.52 <2e-16 ***
## TEAM_BATTING_2B 0.097305 0.006757 14.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.08 on 2274 degrees of freedom
## Multiple R-squared: 0.08358, Adjusted R-squared: 0.08318
## F-statistic: 207.4 on 1 and 2274 DF, p-value: < 2.2e-16
Triples by batter (3B).
No missing values.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.0804 (points contributed to TARGET_WINS with every unit increase). The p-value is 8.22e-12 *** (predictor is significant). The R-squared is 0.02034 (explains about 2% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BATTING_3B, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_3B, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.349 -9.120 1.104 10.683 60.727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.3485 0.7245 105.382 < 2e-16 ***
## TEAM_BATTING_3B 0.0804 0.0117 6.871 8.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 2274 degrees of freedom
## Multiple R-squared: 0.02034, Adjusted R-squared: 0.01991
## F-statistic: 47.21 on 1 and 2274 DF, p-value: 8.217e-12
Homeruns by batter (4B).
No missing values.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.04583 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.03103 (explains about 3% of variability of response variable).
Walks by batter.
No missing values.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.029863 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.05408 (explains about 5% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BATTING_BB, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.813 -9.747 0.509 9.766 78.276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.812815 1.352265 48.67 <2e-16 ***
## TEAM_BATTING_BB 0.029863 0.002619 11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.32 on 2274 degrees of freedom
## Multiple R-squared: 0.05408, Adjusted R-squared: 0.05367
## F-statistic: 130 on 1 and 2274 DF, p-value: < 2.2e-16
Strikeouts by batter.
102 missing values.
Simple linear regression result below suggests that this variable is not significant.
The recommendation is to drop this variable from the model.
Negative relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is -0.001990 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.139 (predictor is not significant at 5% level). The R-squared is 0.001008 (explains less than 1% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BATTING_SO, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.228 -9.308 0.963 10.609 63.772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.228036 1.043434 78.81 <2e-16 ***
## TEAM_BATTING_SO -0.001990 0.001344 -1.48 0.139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.57 on 2172 degrees of freedom
## (102 observations deleted due to missingness)
## Multiple R-squared: 0.001008, Adjusted R-squared: 0.0005482
## F-statistic: 2.192 on 1 and 2172 DF, p-value: 0.1389
Batters hit by pitch.
2085 missing values.
Simple linear regression result suggests that this variable is not significant.
Recommendation is to drop this variable from the model.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.06867 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.312 (predictor is not significant at 5% level). The R-squared is 0.005403 (explains less than 1% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BATTING_HBP, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HBP, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.078 -9.677 0.999 9.594 34.892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.85048 4.11728 18.665 <2e-16 ***
## TEAM_BATTING_HBP 0.06867 0.06778 1.013 0.312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.11 on 189 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.005403, Adjusted R-squared: 0.0001405
## F-statistic: 1.027 on 1 and 189 DF, p-value: 0.3122
Walks allowed.
No missing values.
Positive relationship (expected theoretical impact negative). No obvious curved patterns observed from visually inspecting the scatter plots only. Noticeable outliers that may have strong influence on the linear regression line. Most points on the scatter plot are below 1000. A second scatter plot of points under 1000 continues to show a positive relationship.
All Points: Linear regression predictor coefficient is 0.01176 (points contributed to TARGET_WINS with every unit increase). The p-value is 2.78e-09 *** (predictor is significant). The R-squared is 0.01542 (explains about 1.5% of variability of response variable).
Under 1000: Linear regression predictor coefficient is 0.028531 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.04401 (explains about 4.4% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_PITCHING_BB, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_BB, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.289 -9.376 0.944 10.632 70.171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.28864 1.13779 65.292 < 2e-16 ***
## TEAM_PITCHING_BB 0.01176 0.00197 5.968 2.78e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.63 on 2274 degrees of freedom
## Multiple R-squared: 0.01542, Adjusted R-squared: 0.01499
## F-statistic: 35.61 on 1 and 2274 DF, p-value: 2.785e-09
Points under 1000.
summary(lm(TARGET_WINS ~ TEAM_PITCHING_BB, data=data[data$TEAM_PITCHING_BB<1000,]))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_BB, data = data[data$TEAM_PITCHING_BB <
## 1000, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.321 -9.493 0.873 10.157 76.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.320528 1.556703 41.96 <2e-16 ***
## TEAM_PITCHING_BB 0.028531 0.002799 10.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.22 on 2257 degrees of freedom
## Multiple R-squared: 0.04401, Adjusted R-squared: 0.04359
## F-statistic: 103.9 on 1 and 2257 DF, p-value: < 2.2e-16
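Instead of choosing a cutoff such as 1000 by eye, influential points can also be flagged with Cook's distance. The sketch below uses simulated data with one extreme point; the 4/n threshold is a common rule of thumb and an assumption here, not something taken from the homework sheet.

```r
# 100 ordinary points plus one extreme value of x.
set.seed(7)
x <- c(runif(100, 400, 700), 3500)
y <- c(80 + 0.02 * (x[1:100] - 550) + rnorm(100, sd = 5), 40)
toy <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = toy)

# Cook's distance measures how much the fit changes when a point is left out.
d <- cooks.distance(fit_all)
flagged <- d > 4 / nrow(toy)

# Refit without the flagged points and compare the coefficients.
fit_trim <- lm(y ~ x, data = toy[!flagged, ])
print(rbind(all = coef(fit_all), trimmed = coef(fit_trim)))
```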
Hits allowed.
No missing values.
Negative relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only. Most points on the scatter plot are below 5000. A second scatter plot that only looks at points below 5000 suggests that the relationship is positive.
All points: Linear regression predictor coefficient is -0.0012309 (points contributed to TARGET_WINS with every unit increase). The p-value is 1.46e-07 *** (predictor is significant). The R-squared is 0.01209 (explains about 1.2% of variability of response variable).
Points under 5000: Linear regression predictor coefficient is 0.003825 (points contributed to TARGET_WINS with every unit increase). The p-value is 3.07e-07 *** (predictor is significant). The R-squared is 0.01167 (explains about 1.2% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_PITCHING_H, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.165 -9.950 0.905 10.773 68.838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.9809701 0.5293050 156.773 < 2e-16 ***
## TEAM_PITCHING_H -0.0012309 0.0002334 -5.274 1.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.66 on 2274 degrees of freedom
## Multiple R-squared: 0.01209, Adjusted R-squared: 0.01165
## F-statistic: 27.82 on 1 and 2274 DF, p-value: 1.457e-07
Points under 5000.
summary(lm(TARGET_WINS ~ TEAM_PITCHING_H, data=data[data$TEAM_PITCHING_H<5000,]))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H, data = data[data$TEAM_PITCHING_H <
## 5000, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.761 -9.384 0.886 10.534 53.153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.766091 1.251857 59.724 < 2e-16 ***
## TEAM_PITCHING_H 0.003825 0.000745 5.135 3.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.9 on 2233 degrees of freedom
## Multiple R-squared: 0.01167, Adjusted R-squared: 0.01123
## F-statistic: 26.36 on 1 and 2233 DF, p-value: 3.071e-07
Home runs allowed.
No missing values.
Positive relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.048572 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.03573 (explains about 3.6% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_PITCHING_HR, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_HR, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.657 -9.956 0.636 10.055 67.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.656920 0.646540 117.018 <2e-16 ***
## TEAM_PITCHING_HR 0.048572 0.005292 9.179 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.47 on 2274 degrees of freedom
## Multiple R-squared: 0.03573, Adjusted R-squared: 0.0353
## F-statistic: 84.25 on 1 and 2274 DF, p-value: < 2.2e-16
Strikeouts by pitcher.
102 missing values.
Simple linear regression result suggests that this variable is significant.
Negative relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only. Outliers may have a strong influence on the regression line. Most points are under 2000. A second scatter plot that only looks at points below 2000 also shows a negative relationship.
Linear regression predictor coefficient is -0.0022085 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.000252 *** (predictor is significant). The R-squared is 0.006152 (explains less than 1% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_PITCHING_SO, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_SO, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.570 -9.402 0.970 10.484 63.430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.5704787 0.5945630 138.876 < 2e-16 ***
## TEAM_PITCHING_SO -0.0022085 0.0006023 -3.667 0.000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.53 on 2172 degrees of freedom
## (102 observations deleted due to missingness)
## Multiple R-squared: 0.006152, Adjusted R-squared: 0.005695
## F-statistic: 13.45 on 1 and 2172 DF, p-value: 0.0002515
Scatter plot of points below 2000.
summary(lm(TARGET_WINS ~ TEAM_PITCHING_SO, data=data[data$TEAM_PITCHING_SO<2000,]))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_SO, data = data[data$TEAM_PITCHING_SO <
## 2000, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.154 -9.417 0.811 10.437 61.846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.154074 1.121929 75.008 < 2e-16 ***
## TEAM_PITCHING_SO -0.004136 0.001347 -3.071 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.41 on 2163 degrees of freedom
## (102 observations deleted due to missingness)
## Multiple R-squared: 0.004341, Adjusted R-squared: 0.00388
## F-statistic: 9.43 on 1 and 2163 DF, p-value: 0.002161
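The output above still reports 102 observations deleted due to missingness, which is consistent with how bracket indexing treats N/A values: a comparison with NA is NA, and indexing by NA yields an all-NA row. A small sketch with made-up rows shows the difference from `subset()`.

```r
# Made-up rows; two of them have a missing TEAM_PITCHING_SO.
toy <- data.frame(
  TARGET_WINS      = c(70, 85, 90, 60, 75, 95),
  TEAM_PITCHING_SO = c(800, NA, 1200, 2500, 950, NA)
)

# Bracket indexing keeps one all-NA row for every NA in the condition.
by_bracket <- toy[toy$TEAM_PITCHING_SO < 2000, ]

# subset() treats NA in the condition as FALSE and drops those rows.
by_subset <- subset(toy, TEAM_PITCHING_SO < 2000)

print(nrow(by_bracket))
print(nrow(by_subset))
```

`lm()` drops the all-NA rows anyway, so the fitted coefficients are the same either way; only the row counts differ.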
Stolen bases.
131 missing values.
Results of simple linear regression below suggest that this variable is significant.
Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.02273 (points contributed to TARGET_WINS with every unit increase). The p-value is 3.3e-10 *** (predictor is significant). The R-squared is 0.01826 (explains about 1.8% of variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BASERUN_SB, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BASERUN_SB, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.009 -8.986 1.013 10.082 52.309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.00909 0.54912 142.061 < 2e-16 ***
## TEAM_BASERUN_SB 0.02273 0.00360 6.314 3.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.63 on 2143 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.01826, Adjusted R-squared: 0.0178
## F-statistic: 39.86 on 1 and 2143 DF, p-value: 3.299e-10
Caught stealing.
772 missing values. Results of simple linear regression below suggest that this variable is not significant.
Recommendation is to drop this variable from the model.
Positive relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is 0.01314 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.385 (predictor is not significant at the 5% level). The R-squared is 0.0005019 (explains less than 1% variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_BASERUN_CS, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BASERUN_CS, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.152 -8.727 0.573 9.185 53.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.15192 0.87106 92.017 <2e-16 ***
## TEAM_BASERUN_CS 0.01314 0.01513 0.869 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.46 on 1502 degrees of freedom
## (772 observations deleted due to missingness)
## Multiple R-squared: 0.0005019, Adjusted R-squared: -0.0001635
## F-statistic: 0.7543 on 1 and 1502 DF, p-value: 0.3853
Errors.
No missing values.
Negative relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is -0.012205 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.03115 (explains about 3.1% variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.461 -10.078 0.697 10.318 73.808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.799234 0.479030 174.94 <2e-16 ***
## TEAM_FIELDING_E -0.012205 0.001427 -8.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.51 on 2274 degrees of freedom
## Multiple R-squared: 0.03115, Adjusted R-squared: 0.03072
## F-statistic: 73.1 on 1 and 2274 DF, p-value: < 2.2e-16
Double plays.
286 missing values. Results of simple linear regression below suggest that this variable is not significant.
Recommendation is to drop this variable from the model.
Negative relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.
Linear regression predictor coefficient is -0.01853 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.12 (predictor is not significant at 5% level). The R-squared is 0.001215 (explains less than 1% variability of response variable).
summary(lm(TARGET_WINS ~ TEAM_FIELDING_DP, data=data))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_DP, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.642 -9.062 0.813 9.803 46.747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.71655 1.77202 47.244 <2e-16 ***
## TEAM_FIELDING_DP -0.01853 0.01192 -1.555 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.94 on 1988 degrees of freedom
## (286 observations deleted due to missingness)
## Multiple R-squared: 0.001215, Adjusted R-squared: 0.0007122
## F-statistic: 2.417 on 1 and 1988 DF, p-value: 0.1201
As discussed in the MISSING DATA section, predictors with missing data that are not significant in a simple linear regression on the response variable are dropped from the variable selection process. Details of each regression are shown in section Simple Linear Regressions of each Variable. Predictors with missing data that are significant are considered in the selection process, but incomplete observations are dropped. Of the six variables with missing data, four are dropped. If both remaining variables are in the model, at most 233 of the 2,276 observations are ignored (102 missing for TEAM_PITCHING_SO plus 131 missing for TEAM_BASERUN_SB).
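The 233 figure is the sum of the two missing counts (102 + 131), which is an upper bound: rows missing both values are only dropped once. The exact number can be checked with `complete.cases()`, as in this sketch on made-up rows.

```r
# Made-up rows; row 5 is missing both values.
toy <- data.frame(
  TEAM_PITCHING_SO = c(900, NA, 1000, 850, NA, 700),
  TEAM_BASERUN_SB  = c(100, 120, NA, 90, NA, 80)
)

vars <- c("TEAM_PITCHING_SO", "TEAM_BASERUN_SB")

# Rows actually dropped when both variables are in the model.
n_dropped <- nrow(toy) - sum(complete.cases(toy[, vars]))

# Upper bound obtained by adding the per-column missing counts.
upper_bound <- sum(is.na(toy$TEAM_PITCHING_SO)) + sum(is.na(toy$TEAM_BASERUN_SB))

print(c(dropped = n_dropped, upper_bound = upper_bound))
```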
DROP VARIABLES:
IGNORE OBSERVATIONS:
Below is a list of variables whose effect on the response variable is opposite to the theoretical effect noted on the homework sheet.
See Simple Linear Regressions of each Variable for more details.
PITCHING VARS | BATTING VARS | OTHER VARS |
---|---|---|
TEAM_PITCHING_SO | TEAM_BATTING_BB | TEAM_FIELDING_E |
TEAM_PITCHING_BB | TEAM_BATTING_HR | TEAM_BASERUN_SB |
TEAM_PITCHING_HR | TEAM_BATTING_H | |
TEAM_PITCHING_H | TEAM_BATTING_2B | |
| | TEAM_BATTING_3B | |
Because of the strong correlation between the PITCHING and BATTING variables, the model will use variables from either the PITCHING group or the BATTING group, but not from both. The variables in the OTHER VARS category do not show particularly strong correlations with any of the other variables in the possible selection list.
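Strongly correlated pairs like these can be listed directly from a correlation matrix. The sketch below uses simulated columns and a cutoff of 0.8; both the data and the cutoff are assumptions chosen for illustration.

```r
# Simulated columns: pitching hits is built to track batting hits closely.
set.seed(3)
n <- 100
bat_h <- rnorm(n, 1450, 100)
toy <- data.frame(
  TEAM_BATTING_H  = bat_h,
  TEAM_PITCHING_H = bat_h + rnorm(n, sd = 10),
  TEAM_FIELDING_E = rnorm(n, 150, 30)
)

r <- cor(toy)

# Keep only the lower triangle so each pair is reported once.
r[upper.tri(r, diag = TRUE)] <- NA
idx <- which(abs(r) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(high_pairs)
```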
Model 1 is a linear-linear regression model with no variable transformations.
Because TEAM_BATTING_H includes TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, model 1 will only use TEAM_BATTING_H.
Selected variables from model 1:
model1 <- lm(data$TARGET_WINS ~ data$TEAM_BATTING_H + data$TEAM_BATTING_BB + data$TEAM_FIELDING_E + data$TEAM_BASERUN_SB, data=data)
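A note on the formula style: when terms are written as `data$COLUMN`, they refer to the original vectors, so `predict()` cannot substitute columns from a new data frame. Writing bare column names together with `data =` avoids this. The sketch below uses made-up columns `h`, `bb` and `wins`; it is not a refit of model 1.

```r
# Simulated training data with hypothetical column names.
set.seed(11)
toy <- data.frame(h = rnorm(50, 1450, 100), bb = rnorm(50, 500, 80))
toy$wins <- 0.05 * toy$h + 0.02 * toy$bb + rnorm(50, sd = 5)

# Bare column names: lm() looks them up in `data`.
fit <- lm(wins ~ h + bb, data = toy)

# predict() can now match the same names in a new data frame.
new_teams <- data.frame(h = c(1400, 1550), bb = c(480, 520))
pred <- predict(fit, newdata = new_teams)
print(pred)
```

This also gives shorter coefficient names (`h` instead of `data$h`) in the summary output.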
Model 1 linear model summary below shows that all predictors (except the intercept) are significant.
The median residual is close to zero (0.018).
The Adjusted R-squared is 0.3055 (explains 30.55% of variability of response variable).
summary(model1)
##
## Call:
## lm(formula = data$TARGET_WINS ~ data$TEAM_BATTING_H + data$TEAM_BATTING_BB +
## data$TEAM_FIELDING_E + data$TEAM_BASERUN_SB, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.476 -8.431 0.018 8.469 44.890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.484126 3.182849 0.466 0.641
## data$TEAM_BATTING_H 0.046404 0.002057 22.556 < 2e-16 ***
## data$TEAM_BATTING_BB 0.023591 0.003061 7.707 1.96e-14 ***
## data$TEAM_FIELDING_E -0.034254 0.002140 -16.008 < 2e-16 ***
## data$TEAM_BASERUN_SB 0.050277 0.003564 14.108 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.31 on 2140 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.3067, Adjusted R-squared: 0.3055
## F-statistic: 236.7 on 4 and 2140 DF, p-value: < 2.2e-16
For a given predictor (p), multicollinearity can be assessed by computing a score called the variance inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
The VIFs of all predictors in model 1 are under 5. The smallest possible value of VIF is 1 (no collinearity). As a rule of thumb, a VIF of 5 or higher indicates a problematic amount of collinearity.
VIF(model1)
## data$TEAM_BATTING_H data$TEAM_BATTING_BB data$TEAM_FIELDING_E
## 1.132248 1.285256 1.867236
## data$TEAM_BASERUN_SB
## 1.385953
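For reference, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data; the helper `vif_manual` and the columns `x1`, `x2`, `x3` are made up, and this is not the package function called above.

```r
# Simulated predictors: x2 is partly driven by x1, x3 is independent.
set.seed(5)
n <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- rnorm(n)
toy <- data.frame(y = 1 + x1 + x2 + x3 + rnorm(n), x1, x2, x3)

# Regress each predictor on the others and convert R-squared to a VIF.
vif_manual <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
print(round(vifs, 3))
```

The correlated pair `x1` and `x2` gets noticeably higher values than `x3`, which stays close to the minimum of 1.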
The article linked below discusses what standardized residual plots should look like for an acceptable model.
Standardized residual plots should have these characteristics:
As you can see, the standardized residual plot of model 1 seems to display the characteristics described above. Most residuals lie between -2 and 2. They are symmetrically distributed around the horizontal zero line (indicating no systematic over- or under-prediction), and I do not see any curved patterns.
model1.predict <- predict(model1)
model1.stdres <- rstandard(model1)
plot(model1.predict, model1.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins", main="Model 1")
abline(0, 0)
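The visual check can be backed by a number: for roughly normal errors, about 95% of standardized residuals should fall between -2 and 2. The sketch below computes that share on a simulated fit; the data and the model are made up and are not model 1.

```r
# Simulated linear relationship with normal errors.
set.seed(9)
toy <- data.frame(x = rnorm(300))
toy$y <- 2 + 3 * toy$x + rnorm(300)

fit <- lm(y ~ x, data = toy)

# Standardized residuals, as plotted above.
stdres <- rstandard(fit)

# Share of observations inside the [-2, 2] band.
share_within_2 <- mean(abs(stdres) <= 2)
print(share_within_2)
```

A share far below 0.95, or points clustered on one side of zero for part of the fitted range, would be a reason to look at the model again.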
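The intercept (1.484126) is not significant. Each coefficient below is the expected change in TARGET_WINS for a one-unit increase in that predictor, holding the other predictors constant.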
TEAM_BATTING_H (0.046404): A unit increase in TEAM_BATTING_H (base hits by batters) increases TARGET_WINS by 0.046404 points.
TEAM_BATTING_BB (0.023591): A unit increase in TEAM_BATTING_BB (walks by batters) increases TARGET_WINS by 0.023591 points.
TEAM_FIELDING_E (-0.034254): A unit increase in TEAM_FIELDING_E (errors) decreases TARGET_WINS by 0.034254 points.
TEAM_BASERUN_SB (0.050277): A unit increase in TEAM_BASERUN_SB (stolen bases) increases TARGET_WINS by 0.050277 points.
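As a worked example of these interpretations, the coefficients can be applied to a hypothetical change in a team's season. The coefficient values are the model 1 estimates reported above; the changes in `delta` (100 more hits, 20 more errors) are made up.

```r
# Model 1 slope estimates as reported in the summary output.
coef_model1 <- c(
  TEAM_BATTING_H  = 0.046404,
  TEAM_BATTING_BB = 0.023591,
  TEAM_FIELDING_E = -0.034254,
  TEAM_BASERUN_SB = 0.050277
)

# Hypothetical change in each predictor, other things held equal.
delta <- c(
  TEAM_BATTING_H  = 100,
  TEAM_BATTING_BB = 0,
  TEAM_FIELDING_E = 20,
  TEAM_BASERUN_SB = 0
)

# Predicted change in wins: 100 * 0.046404 - 20 * 0.034254.
change_in_wins <- sum(coef_model1 * delta[names(coef_model1)])
print(change_in_wins)
```

Under the model, 100 extra hits are worth about 4.6 wins, and 20 extra errors cost about 0.7 wins, for a net change of roughly 4 wins.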
Model 2 is a linear-linear regression model with no variable transformations.
Because TEAM_BATTING_H includes TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, model 2 will not use TEAM_BATTING_H. Instead, it will use TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and a calculated column for TEAM_BATTING_1B.
Selected variables for model 2:
data$TEAM_BATTING_1B <- data$TEAM_BATTING_H - data$TEAM_BATTING_2B - data$TEAM_BATTING_3B - data$TEAM_BATTING_HR
The first attempt at building model2 shows that TEAM_BATTING_2B is not a significant predictor (p-value 0.991).
So we are going to remove TEAM_BATTING_2B from the model.
model2 <- lm(data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_2B + data$TEAM_BATTING_3B + data$TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_FIELDING_E +TEAM_BASERUN_SB , data=data)
summary(model2)
##
## Call:
## lm(formula = data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_2B +
## data$TEAM_BATTING_3B + data$TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_FIELDING_E + TEAM_BASERUN_SB, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.935 -8.318 0.001 8.066 49.247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.306e+00 3.420e+00 0.382 0.702
## data$TEAM_BATTING_1B 5.051e-02 3.290e-03 15.352 < 2e-16 ***
## data$TEAM_BATTING_2B 8.322e-05 7.193e-03 0.012 0.991
## data$TEAM_BATTING_3B 1.292e-01 1.563e-02 8.263 2.45e-16 ***
## data$TEAM_BATTING_HR 9.374e-02 7.339e-03 12.772 < 2e-16 ***
## TEAM_BATTING_BB 2.101e-02 3.157e-03 6.655 3.60e-11 ***
## TEAM_FIELDING_E -3.713e-02 2.322e-03 -15.988 < 2e-16 ***
## TEAM_BASERUN_SB 4.699e-02 3.809e-03 12.337 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.12 on 2137 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.328, Adjusted R-squared: 0.3258
## F-statistic: 149 on 7 and 2137 DF, p-value: < 2.2e-16
The updated version of model 2, which excludes TEAM_BATTING_2B, shows that all predictors are significant except the intercept.
The residual median of model 2 is -0.003 (which is closer to zero than model 1 at 0.018).
The Adjusted R-squared is 0.3261 (explains 32.61% of the variability of the response variable), which is larger than model 1 at 0.3055.
model2 <- lm(data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_3B + data$TEAM_BATTING_HR + data$TEAM_BATTING_BB +
data$TEAM_FIELDING_E + data$TEAM_BASERUN_SB , data=data)
summary(model2)
##
## Call:
## lm(formula = data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_3B +
## data$TEAM_BATTING_HR + data$TEAM_BATTING_BB + data$TEAM_FIELDING_E +
## data$TEAM_BASERUN_SB, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.939 -8.312 -0.003 8.064 49.255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.306120 3.418851 0.382 0.702
## data$TEAM_BATTING_1B 0.050521 0.003052 16.552 < 2e-16 ***
## data$TEAM_BATTING_3B 0.129173 0.015520 8.323 < 2e-16 ***
## data$TEAM_BATTING_HR 0.093776 0.006480 14.471 < 2e-16 ***
## data$TEAM_BATTING_BB 0.021014 0.003150 6.670 3.25e-11 ***
## data$TEAM_FIELDING_E -0.037129 0.002292 -16.197 < 2e-16 ***
## data$TEAM_BASERUN_SB 0.046989 0.003805 12.349 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.12 on 2138 degrees of freedom
## (131 observations deleted due to missingness)
## Multiple R-squared: 0.328, Adjusted R-squared: 0.3261
## F-statistic: 173.9 on 6 and 2138 DF, p-value: < 2.2e-16
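The adjusted R-squared values in the two model 2 summaries can be reproduced from the reported multiple R-squared (0.328), the number of rows used (2276 - 131 = 2145) and the number of predictors, using 1 - (1 - R²)(n - 1)/(n - p - 1). The helper name `adj_r2` below is made up for this sketch, and the inputs are the rounded figures from the output above.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n
# and number of predictors p (intercept not counted).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_used <- 2276 - 131   # rows left after dropping 131 incomplete observations

with_2b    <- adj_r2(0.328, n_used, 7)  # first attempt, 7 predictors
without_2b <- adj_r2(0.328, n_used, 6)  # updated model, 6 predictors

print(round(c(with_2b = with_2b, without_2b = without_2b), 4))
```

Dropping TEAM_BATTING_2B leaves the multiple R-squared unchanged at the reported precision, so the penalty for one fewer predictor is what moves the adjusted value from 0.3258 to 0.3261.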
None of the VIF values of the predictors are above 5. This means that the model does not have a problematic amount of collinearity.
VIF(model2)
## data$TEAM_BATTING_1B data$TEAM_BATTING_3B data$TEAM_BATTING_HR
## 1.945005 2.567734 2.126136
## data$TEAM_BATTING_BB data$TEAM_FIELDING_E data$TEAM_BASERUN_SB
## 1.403139 2.208898 1.628440
The standardized residual plot of model 2 has a very similar shape and pattern to model 1 although the range is larger (-4 to 4).
model2.predict <- predict(model2)
model2.stdres <- rstandard(model2)
plot(model2.predict, model2.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins", main="Model 2")
abline(0, 0)
Model 3 is a linear-log model. It takes model 1 and applies a log transformation to the predictors of model 1.
The response variable TARGET_WINS looks approximately normal already and does not have any noticeable marked skewness. So we’re leaving the response variable as is.
TEAM_BATTING_BB is skewed to the left but has an approximate normal distribution. TEAM_BATTING_H is skewed to the right but has an approximate normal shape. TEAM_FIELDING_E is markedly skewed to the right. TEAM_BASERUN_SB is also markedly skewed to the right.
Remove incomplete cases, as well as rows where TEAM_BATTING_BB or TEAM_BASERUN_SB is zero (the log of zero is not finite). This leaves 2143 rows.
# Keep only the response and the four model 1 predictors.
data2 <- data[, c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_BB", "TEAM_FIELDING_E", "TEAM_BASERUN_SB")]
data2 <- data2[complete.cases(data2), ]
# Drop rows with zero walks or zero stolen bases, since log(0) is -Inf.
data2 <- data2[!(data2$TEAM_BATTING_BB == 0), ]
data2 <- data2[!(data2$TEAM_BASERUN_SB == 0), ]
nrow(data2)
## [1] 2143
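The zero-valued rows matter because `log(0)` evaluates to `-Inf` in R, and `lm()` cannot fit a model with non-finite predictor values. A minimal sketch with a made-up vector `x` (not taken from the data set):

```r
# Made-up counts (assumption), including two zeros.
x <- c(0, 12, 45, 0, 130)

# The zeros turn into -Inf after the log transformation.
print(log(x))

# Keep only strictly positive values before taking logs.
keep <- x > 0
log_x <- log(x[keep])
print(log_x)
```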
All predictors of model 3 are significant, including the intercept.
The residual median is -0.314 (still close to zero).
However, the adjusted R-squared is only 0.2653, which is lower compared to model 1 and model 2.
model3 <- lm(data2$TARGET_WINS ~ log(data2$TEAM_BATTING_H) + log(data2$TEAM_BATTING_BB) + log(data2$TEAM_FIELDING_E) +
log(data2$TEAM_BASERUN_SB), data=data2)
summary(model3)
##
## Call:
## lm(formula = data2$TARGET_WINS ~ log(data2$TEAM_BATTING_H) +
## log(data2$TEAM_BATTING_BB) + log(data2$TEAM_FIELDING_E) +
## log(data2$TEAM_BASERUN_SB), data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.732 -8.706 -0.314 8.425 50.891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -469.1348 24.0273 -19.525 <2e-16 ***
## log(data2$TEAM_BATTING_H) 67.9951 3.1409 21.648 <2e-16 ***
## log(data2$TEAM_BATTING_BB) 11.3472 1.3197 8.598 <2e-16 ***
## log(data2$TEAM_FIELDING_E) -8.3150 0.7074 -11.754 <2e-16 ***
## log(data2$TEAM_BASERUN_SB) 5.8308 0.4935 11.815 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.55 on 2138 degrees of freedom
## Multiple R-squared: 0.2667, Adjusted R-squared: 0.2653
## F-statistic: 194.4 on 4 and 2138 DF, p-value: < 2.2e-16
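In a linear-log model the coefficients are not per-unit effects. A 1% increase in a predictor changes the predicted TARGET_WINS by the coefficient times log(1.01), which is roughly the coefficient divided by 100. The sketch below applies this to the estimates printed above; the vector names `beta` and `effect_1pct` are made up for this example.

```r
# Model 3 coefficient estimates, copied from the summary output above.
beta <- c(TEAM_BATTING_H  = 67.9951,
          TEAM_BATTING_BB = 11.3472,
          TEAM_FIELDING_E = -8.3150,
          TEAM_BASERUN_SB = 5.8308)

# Change in predicted wins for a 1% increase in each predictor,
# holding the other predictors fixed.
effect_1pct <- beta * log(1.01)
print(round(effect_1pct, 3))
```

For example, a 1% increase in TEAM_BATTING_H corresponds to roughly 0.68 more predicted wins under this model.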
As expected, none of the VIF values of the predictors are above 5. This means that the model does not have a problematic amount of collinearity.
VIF(model3)
## log(data2$TEAM_BATTING_H) log(data2$TEAM_BATTING_BB)
## 1.058550 1.287283
## log(data2$TEAM_FIELDING_E) log(data2$TEAM_BASERUN_SB)
## 1.764209 1.361466
The standardized residual plot of model 3 has a very similar shape and pattern to both model 1 and 2. The range of spread is very similar to model 2 (-4 to 4); however, the points on the plot look more spread out.
The residual standard error of model 3 is highest among the 3 models (12.55), followed by model 1 (12.31), and model 2 (12.12).
So, model 2 has the best fit in terms of residuals.
model3.predict <- predict(model3)
model3.stdres <- rstandard(model3)
plot(model3.predict, model3.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins", main="Model 3")
abline(0, 0)
NOTE: In the group submission, this is referred to as Model 4.
Because model 2 has the lowest residual standard error (12.12) and the highest adjusted R-squared (0.3261), this model is chosen to do the prediction on the evaluation data.
Read data to be predicted.
Calculate TEAM_BATTING_1B in the evaluation data.
eval_data$TEAM_BATTING_1B <- eval_data$TEAM_BATTING_H - eval_data$TEAM_BATTING_2B - eval_data$TEAM_BATTING_3B - eval_data$TEAM_BATTING_HR
Predict target_wins based on model 2 and write to CSV file.
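# Refit model 2 with bare column names: terms written as data$... make predict() ignore newdata.
model2.eval <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E + TEAM_BASERUN_SB, data=data)
predicted_targetWins <- predict(model2.eval, newdata=data.frame(eval_data))
write.csv(predicted_targetWins, file = "Model2-Predcited_TargetWins.csv")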
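One thing worth checking before relying on the predictions: when a formula refers to columns as `data$COLUMN`, as the model 2 call earlier does, `predict()` looks those terms up in the original training data frame and does not use `newdata`. The sketch below shows the difference on synthetic data; `train`, `new_rows` and both fitted models are assumptions made for illustration only.

```r
# Synthetic training data and new rows to score (assumptions).
set.seed(3)
train <- data.frame(x = 1:10)
train$y <- 2 * train$x + rnorm(10)
new_rows <- data.frame(x = c(20, 30, 40))

# Same model, written two ways.
fit_dollar <- lm(train$y ~ train$x, data = train)  # terms hard-wired to `train`
fit_bare   <- lm(y ~ x, data = train)              # terms looked up in `data`

# The first call warns and returns predictions for the training rows.
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_rows))
pred_bare   <- predict(fit_bare, newdata = new_rows)

print(length(pred_dollar))  # one value per training row
print(length(pred_bare))    # one value per row of new_rows
```

Comparing the number of predictions with `nrow()` of the evaluation data is a quick way to confirm that the evaluation rows were actually scored.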