Overview

In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record gives the team's performance for the given year, with all statistics adjusted to match a 162-game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team.

Training Data Set

The training data has 17 columns and 2,276 rows.

The explanatory columns are broken down into four categories:

Batting
Base run
Pitching
Fielding

Below you will see a preview of the columns and the first few observations broken down into these four categories.

Response Variable

TARGET_WINS

The variable TARGET_WINS is the number of wins of a professional baseball team for a given year. The year is not part of the data set. This is the dependent variable that our models will attempt to predict.

Above are the first few rows of the dependent variable.

DESCRIPTIVE STATISTICS

PLOTS

As you can see, the distribution of the number of wins is unimodal and skewed to the left with some outliers towards the tail. It looks approximately normal. The minimum number of wins for a team is 0 and the maximum is 146. The mean is 80.79.

The boxplot above shows that there are suspected outliers at both ends.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00
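The summary and the outlier remarks above can be checked with a few lines of R. The sketch below is a minimal illustration on a made-up vector (`wins` is a hypothetical stand-in for `data$TARGET_WINS`, not the real data). It prints the same kind of summary, compares mean and median as an informal skew check, and applies the usual 1.5 x IQR boxplot rule to flag suspected outliers.

```r
# Toy stand-in for data$TARGET_WINS (hypothetical values, not the real data).
wins <- c(0, 45, 62, 71, 75, 80, 82, 84, 88, 92, 95, 101, 146)

# Min, quartiles, median and mean, in the same layout as the report.
print(summary(wins))

# A mean below the median is a quick, informal hint of left skew.
left_skew_hint <- mean(wins) < median(wins)

# Boxplot rule: points more than 1.5 * IQR beyond the quartiles are flagged.
q <- quantile(wins, c(0.25, 0.75))
fence <- 1.5 * IQR(wins)
suspected <- wins[wins < q[1] - fence | wins > q[2] + fence]
print(suspected)
```

On the real data the same check agrees with the printed summary: the mean (80.79) is below the median (82.00), which matches the left skew described above.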

Explanatory Variables

1. BATTING VARIABLES (7)

DESCRIPTION AND THEORETICAL EFFECT

VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
TEAM_BATTING_H	Base Hits by batters (1B,2B,3B,HR)	Positive Impact on Wins
TEAM_BATTING_2B	Doubles by batters (2B)	Positive Impact on Wins
TEAM_BATTING_3B	Triples by batters (3B)	Positive Impact on Wins
TEAM_BATTING_HR	Homeruns by batters (4B)	Positive Impact on Wins
TEAM_BATTING_BB	Walks by batters	Positive Impact on Wins
TEAM_BATTING_HBP	Batters hit by pitch (get a free base)	Positive Impact on Wins
TEAM_BATTING_SO	Strikeouts by batters	Negative Impact on Wins

Above is a view of the first few rows of the batting variables.

DESCRIPTIVE STATISTICS

As you can see, two variables have some N/A values (n < 2276). In particular, TEAM_BATTING_HBP has only 191 non-missing values.
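Missing-value counts like these can be tabulated directly. The following is a minimal sketch on a toy data frame (the values are hypothetical; only the column names come from the report), counting missing and non-missing entries per column.

```r
# Toy data frame standing in for the training data (hypothetical values).
toy <- data.frame(
  TEAM_BATTING_H   = c(1400, 1450, 1500, 1380),
  TEAM_BATTING_SO  = c(800, NA, 950, 1020),
  TEAM_BATTING_HBP = c(NA, NA, 52, NA)
)

# Number of missing and non-missing values per column.
n_missing <- colSums(is.na(toy))
n_present <- nrow(toy) - n_missing
print(rbind(n_missing, n_present))
```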

PLOTS

CORRELATION OF BATTING VARIABLES

See Pairings of all Variables to view correlation table and scatter plots of all variables.

2. BASE RUN VARIABLES (2)

DESCRIPTION AND THEORETICAL EFFECT

VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
TEAM_BASERUN_SB	Stolen bases	Positive Impact on Wins
TEAM_BASERUN_CS	Caught stealing	Negative Impact on Wins

Above is a view of the first few rows of the base run variables.

DESCRIPTIVE STATISTICS

As you can see, both variables have some N/A values (n < 2276).

PLOTS

CORRELATION OF BASE RUN VARIABLES

See Pairings of all Variables to view correlation table and scatter plots of all variables.

3. PITCHING VARIABLES (4)

DESCRIPTION AND THEORETICAL EFFECT

VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
TEAM_PITCHING_BB	Walks allowed	Negative Impact on Wins
TEAM_PITCHING_H	Hits allowed	Negative Impact on Wins
TEAM_PITCHING_HR	Homeruns allowed	Negative Impact on Wins
TEAM_PITCHING_SO	Strikeouts by pitchers	Positive Impact on Wins

Above is a view of the first few rows of the pitching variables.

DESCRIPTIVE STATISTICS

As you can see, one variable has some N/A values (n < 2276).

PLOTS

CORRELATION OF PITCHING VARIABLES

See Pairings of all Variables to view correlation table and scatter plots of all variables.

4. FIELDING VARIABLES (2)

DESCRIPTION AND THEORETICAL EFFECT

VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
TEAM_FIELDING_E	Errors	Negative Impact on Wins
TEAM_FIELDING_DP	Double Plays	Positive Impact on Wins

Above is a view of the first few rows of the fielding variables.

DESCRIPTIVE STATISTICS

As you can see, one variable has some N/A values (n < 2276).

PLOTS

CORRELATION OF FIELDING VARIABLES

See Pairings of all Variables to view correlation table and scatter plots of all variables.

Pairings of all Variables

CORRELATION OF ALL VARIABLES

##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## TARGET_WINS             1.00           0.47            0.31
## TEAM_BATTING_H          0.47           1.00            0.56
## TEAM_BATTING_2B         0.31           0.56            1.00
## TEAM_BATTING_3B        -0.12           0.21            0.04
## TEAM_BATTING_HR         0.42           0.40            0.25
## TEAM_BATTING_BB         0.47           0.20            0.20
## TEAM_BATTING_SO        -0.23          -0.34           -0.06
## TEAM_BASERUN_SB         0.01           0.07           -0.19
## TEAM_BASERUN_CS        -0.18          -0.09           -0.20
## TEAM_BATTING_HBP        0.07          -0.03            0.05
## TEAM_PITCHING_H         0.47           1.00            0.56
## TEAM_PITCHING_HR        0.42           0.39            0.25
## TEAM_PITCHING_BB        0.47           0.20            0.20
## TEAM_PITCHING_SO       -0.23          -0.34           -0.07
## TEAM_FIELDING_E        -0.39          -0.25           -0.19
## TEAM_FIELDING_DP       -0.20           0.02           -0.02
##                  TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## TARGET_WINS                -0.12            0.42            0.47
## TEAM_BATTING_H              0.21            0.40            0.20
## TEAM_BATTING_2B             0.04            0.25            0.20
## TEAM_BATTING_3B             1.00           -0.22           -0.21
## TEAM_BATTING_HR            -0.22            1.00            0.46
## TEAM_BATTING_BB            -0.21            0.46            1.00
## TEAM_BATTING_SO            -0.19            0.21            0.22
## TEAM_BASERUN_SB             0.17           -0.19           -0.09
## TEAM_BASERUN_CS             0.23           -0.28           -0.21
## TEAM_BATTING_HBP           -0.17            0.11            0.05
## TEAM_PITCHING_H             0.21            0.40            0.20
## TEAM_PITCHING_HR           -0.22            1.00            0.46
## TEAM_PITCHING_BB           -0.21            0.46            1.00
## TEAM_PITCHING_SO           -0.19            0.21            0.22
## TEAM_FIELDING_E            -0.07            0.02           -0.08
## TEAM_FIELDING_DP            0.13           -0.06           -0.08
##                  TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## TARGET_WINS                -0.23            0.01           -0.18
## TEAM_BATTING_H             -0.34            0.07           -0.09
## TEAM_BATTING_2B            -0.06           -0.19           -0.20
## TEAM_BATTING_3B            -0.19            0.17            0.23
## TEAM_BATTING_HR             0.21           -0.19           -0.28
## TEAM_BATTING_BB             0.22           -0.09           -0.21
## TEAM_BATTING_SO             1.00           -0.07           -0.06
## TEAM_BASERUN_SB            -0.07            1.00            0.62
## TEAM_BASERUN_CS            -0.06            0.62            1.00
## TEAM_BATTING_HBP            0.22           -0.06           -0.07
## TEAM_PITCHING_H            -0.34            0.07           -0.09
## TEAM_PITCHING_HR            0.21           -0.19           -0.28
## TEAM_PITCHING_BB            0.22           -0.09           -0.21
## TEAM_PITCHING_SO            1.00           -0.07           -0.06
## TEAM_FIELDING_E             0.31            0.04            0.21
## TEAM_FIELDING_DP           -0.12           -0.13           -0.01
##                  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## TARGET_WINS                  0.07            0.47             0.42
## TEAM_BATTING_H              -0.03            1.00             0.39
## TEAM_BATTING_2B              0.05            0.56             0.25
## TEAM_BATTING_3B             -0.17            0.21            -0.22
## TEAM_BATTING_HR              0.11            0.40             1.00
## TEAM_BATTING_BB              0.05            0.20             0.46
## TEAM_BATTING_SO              0.22           -0.34             0.21
## TEAM_BASERUN_SB             -0.06            0.07            -0.19
## TEAM_BASERUN_CS             -0.07           -0.09            -0.28
## TEAM_BATTING_HBP             1.00           -0.03             0.11
## TEAM_PITCHING_H             -0.03            1.00             0.39
## TEAM_PITCHING_HR             0.11            0.39             1.00
## TEAM_PITCHING_BB             0.05            0.20             0.46
## TEAM_PITCHING_SO             0.22           -0.34             0.21
## TEAM_FIELDING_E              0.04           -0.25             0.02
## TEAM_FIELDING_DP            -0.07            0.01            -0.06
##                  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## TARGET_WINS                  0.47            -0.23           -0.39
## TEAM_BATTING_H               0.20            -0.34           -0.25
## TEAM_BATTING_2B              0.20            -0.07           -0.19
## TEAM_BATTING_3B             -0.21            -0.19           -0.07
## TEAM_BATTING_HR              0.46             0.21            0.02
## TEAM_BATTING_BB              1.00             0.22           -0.08
## TEAM_BATTING_SO              0.22             1.00            0.31
## TEAM_BASERUN_SB             -0.09            -0.07            0.04
## TEAM_BASERUN_CS             -0.21            -0.06            0.21
## TEAM_BATTING_HBP             0.05             0.22            0.04
## TEAM_PITCHING_H              0.20            -0.34           -0.25
## TEAM_PITCHING_HR             0.46             0.21            0.02
## TEAM_PITCHING_BB             1.00             0.22           -0.08
## TEAM_PITCHING_SO             0.22             1.00            0.31
## TEAM_FIELDING_E             -0.08             0.31            1.00
## TEAM_FIELDING_DP            -0.08            -0.12            0.04
##                  TEAM_FIELDING_DP
## TARGET_WINS                 -0.20
## TEAM_BATTING_H               0.02
## TEAM_BATTING_2B             -0.02
## TEAM_BATTING_3B              0.13
## TEAM_BATTING_HR             -0.06
## TEAM_BATTING_BB             -0.08
## TEAM_BATTING_SO             -0.12
## TEAM_BASERUN_SB             -0.13
## TEAM_BASERUN_CS             -0.01
## TEAM_BATTING_HBP            -0.07
## TEAM_PITCHING_H              0.01
## TEAM_PITCHING_HR            -0.06
## TEAM_PITCHING_BB            -0.08
## TEAM_PITCHING_SO            -0.12
## TEAM_FIELDING_E              0.04
## TEAM_FIELDING_DP             1.00
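A note on reading this matrix. Each batting statistic correlates at or near 1.00 with its pitching counterpart, and TARGET_WINS versus TEAM_BATTING_H is listed as 0.47, whereas the full-data regression of the same pair later in the report gives an R-squared of 0.1511, which corresponds to a correlation of about 0.39. One explanation consistent with this, offered as an assumption because the report does not show the call, is that the matrix was computed on complete cases only (`use = "complete.obs"`). That would restrict it to the 191 rows that have TEAM_BATTING_HBP. The sketch below uses a toy data frame with hypothetical values to contrast that option with pairwise deletion.

```r
# Toy data: column b is mostly missing, like TEAM_BATTING_HBP in the report.
toy <- data.frame(
  y = c(10, 12, 15, 11, 18, 20),
  a = c(1, 2, 3, 4, 5, 6),
  b = c(NA, NA, 7, NA, 9, 8)
)

# Listwise deletion: every correlation uses only rows with no NA at all
# (here rows 3, 5 and 6).
r_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both columns are present,
# so the y-a correlation uses all six rows.
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(r_complete, 2))
print(round(r_pairwise, 2))
```

With pairwise deletion each cell is comparable to the corresponding simple regression, at the cost of different cells being based on different numbers of rows.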

SCATTER PLOTS OF ALL VARIABLES

MISSING DATA

There are 6 explanatory variables with missing values.

Simple linear regressions on TARGET_WINS for the four variables listed below suggest that these predictors are not significant at the 5% level. See the section Simple Linear Regressions of each Variable for details. The recommendation is to drop them from the models.

TEAM_BATTING_HBP - 2085 missing values (drop)
TEAM_BATTING_SO - 102 missing values (drop)
TEAM_BASERUN_CS - 772 missing values (drop)
TEAM_FIELDING_DP - 286 missing values (drop)

Simple linear regressions on TARGET_WINS for the two variables listed below suggest that these predictors are significant.

TEAM_BASERUN_SB - 131 missing values (ignore observations with missing data)
TEAM_PITCHING_SO - 102 missing values (ignore observations with missing data)
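The handling described above (drop four columns, then ignore incomplete observations for the remaining ones) could be carried out roughly as follows. This is a sketch on a toy data frame with hypothetical values. The names `drop_cols`, `kept` and `model_data` are illustrative and do not come from the report.

```r
# Toy data frame (hypothetical values) with some of the report's columns.
toy <- data.frame(
  TARGET_WINS      = c(70, 85, 90, 60, 77),
  TEAM_BATTING_HBP = c(NA, NA, 50, NA, NA),
  TEAM_BASERUN_SB  = c(100, NA, 120, 80, 95),
  TEAM_PITCHING_SO = c(900, 1000, NA, 850, 990)
)

# Columns recommended for removal; names absent from the data are ignored.
drop_cols <- c("TEAM_BATTING_HBP", "TEAM_BATTING_SO",
               "TEAM_BASERUN_CS", "TEAM_FIELDING_DP")
kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Keep only observations that are complete on the remaining columns.
model_data <- kept[complete.cases(kept), ]
print(model_data)
```

Note that `lm()` applies `na.omit` by default, so the explicit filter mainly makes the number of usable rows visible before fitting.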

Simple Linear Regressions of each Variable

This section investigates more closely the relationship of each explanatory variable with the response variable TARGET_WINS. It also gives us some insight into how to handle the variables with missing values.
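The per-variable fits in this section all follow the same pattern, so the slope, p-value and R-squared can also be collected in one table. The sketch below uses simulated data (the predictors `x1` and `x2` and the helper `fit_one` are hypothetical); with the real data the same helper would be applied to the TEAM_* column names.

```r
# Simulated stand-in data: x1 drives the response, x2 is pure noise.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$TARGET_WINS <- 80 + 5 * toy$x1 + rnorm(n, sd = 3)

# Fit TARGET_WINS ~ var and pull out the quantities quoted in this section.
fit_one <- function(var, df) {
  fit <- lm(reformulate(var, response = "TARGET_WINS"), data = df)
  s <- summary(fit)
  data.frame(
    variable  = var,
    slope     = coef(s)[var, "Estimate"],
    p_value   = coef(s)[var, "Pr(>|t|)"],
    r_squared = s$r.squared,
    n_used    = nobs(fit)  # rows left after lm() drops missing values
  )
}

results <- do.call(rbind, lapply(c("x1", "x2"), fit_one, df = toy))
print(results)
```

The `n_used` column makes it easy to see how many observations each fit lost to missing values.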

BATTING VARIABLES

TEAM_BATTING_H

Base hits by batter (1B, 2B, 3B, HR). This variable is the sum of 1B, 2B, 3B, and HR.

No Missing values.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.042353 (contributes 0.04 points to TARGET_WINS with every unit increase). The p-value is < 2e-16 *** (predictor is significant). The R-squared is 0.1511 (explains about 15% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BATTING_H, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.768  -8.757   0.856   9.762  46.016 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    18.562326   3.107523   5.973 2.69e-09 ***
## TEAM_BATTING_H  0.042353   0.002105  20.122  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.52 on 2274 degrees of freedom
## Multiple R-squared:  0.1511, Adjusted R-squared:  0.1508 
## F-statistic: 404.9 on 1 and 2274 DF,  p-value: < 2.2e-16
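To make the coefficient interpretation concrete, the fitted line from the output above can be evaluated by hand. The sketch below hard-codes the printed estimates. The hit totals 1400 and 1500 are arbitrary illustrative inputs, and `predict_wins` is a hypothetical helper name.

```r
# Estimates copied from the summary output above.
b0 <- 18.562326   # intercept
b1 <- 0.042353    # slope for TEAM_BATTING_H

# Fitted number of wins for a given season total of base hits.
predict_wins <- function(hits) b0 + b1 * hits

print(predict_wins(1500))                       # fitted wins at 1500 hits
print(predict_wins(1500) - predict_wins(1400))  # effect of 100 extra hits

# With a single predictor, R-squared is the squared correlation,
# so |r| can be recovered from the reported R-squared.
r_abs <- sqrt(0.1511)
print(r_abs)  # about 0.39
```

In other words, the model associates roughly 4.2 additional wins with every 100 additional base hits.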

TEAM_BATTING_2B

Doubles by batters (2B).

No Missing values.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.097305 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.08358 (explains about 8% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BATTING_2B, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.453  -9.572   0.636  10.135  57.351 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     57.316365   1.660403   34.52   <2e-16 ***
## TEAM_BATTING_2B  0.097305   0.006757   14.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.08 on 2274 degrees of freedom
## Multiple R-squared:  0.08358,    Adjusted R-squared:  0.08318 
## F-statistic: 207.4 on 1 and 2274 DF,  p-value: < 2.2e-16

TEAM_BATTING_3B

Triples by batter (3B).

No missing values.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.0804 (points contributed to TARGET_WINS with every unit increase). The p-value is 8.22e-12 *** (predictor is significant). The R-squared is 0.02034 (explains about 2% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BATTING_3B, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_3B, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.349  -9.120   1.104  10.683  60.727 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      76.3485     0.7245 105.382  < 2e-16 ***
## TEAM_BATTING_3B   0.0804     0.0117   6.871 8.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.59 on 2274 degrees of freedom
## Multiple R-squared:  0.02034,    Adjusted R-squared:  0.01991 
## F-statistic: 47.21 on 1 and 2274 DF,  p-value: 8.217e-12

TEAM_BATTING_HR

Homeruns by batter (4B).

No missing values.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.04583 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.03103 (explains about 3% of variability of response variable).

TEAM_BATTING_BB

Walks by batter.

No missing values.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.029863 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.05408 (explains about 5% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BATTING_BB, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.813  -9.747   0.509   9.766  78.276 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     65.812815   1.352265   48.67   <2e-16 ***
## TEAM_BATTING_BB  0.029863   0.002619   11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.32 on 2274 degrees of freedom
## Multiple R-squared:  0.05408,    Adjusted R-squared:  0.05367 
## F-statistic:   130 on 1 and 2274 DF,  p-value: < 2.2e-16

TEAM_BATTING_SO (102 Missing)

Strikeouts by batter.

102 missing values.

Simple linear regression result below suggests that this variable is not significant.

The recommendation is to drop this variable from the model.

Negative relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is -0.001990 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.139 (predictor is not significant at 5% level). The R-squared is 0.001008 (explains less than 1% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BATTING_SO, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_SO, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.228  -9.308   0.963  10.609  63.772 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     82.228036   1.043434   78.81   <2e-16 ***
## TEAM_BATTING_SO -0.001990   0.001344   -1.48    0.139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.57 on 2172 degrees of freedom
##   (102 observations deleted due to missingness)
## Multiple R-squared:  0.001008,   Adjusted R-squared:  0.0005482 
## F-statistic: 2.192 on 1 and 2172 DF,  p-value: 0.1389

TEAM_BATTING_HBP (2085 Missing)

Batters hit by pitch.

2085 missing values.

Simple linear regression result suggests that this variable is not significant.

Recommendation is to drop this variable from the model.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.06867 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.312 (predictor is not significant at 5% level). The R-squared is 0.005403 (explains less than 1% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BATTING_HBP, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HBP, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.078  -9.677   0.999   9.594  34.892 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      76.85048    4.11728  18.665   <2e-16 ***
## TEAM_BATTING_HBP  0.06867    0.06778   1.013    0.312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.11 on 189 degrees of freedom
##   (2085 observations deleted due to missingness)
## Multiple R-squared:  0.005403,   Adjusted R-squared:  0.0001405 
## F-statistic: 1.027 on 1 and 189 DF,  p-value: 0.3122

PITCHING VARIABLES

TEAM_PITCHING_BB

Walks allowed.

No missing values.

Positive relationship (expected theoretical impact negative). No obvious curved patterns observed from visually inspecting the scatter plots only. Noticeable outliers that may have strong influence on the linear regression line. Most points on the scatter plot are below 1000. A second scatter plot of points under 1000 continues to show a positive relationship.

All points: Linear regression predictor coefficient is 0.01176 (points contributed to TARGET_WINS with every unit increase). The p-value is 2.78e-09 *** (predictor is significant). The R-squared is 0.01542 (explains about 1.5% of variability of response variable).

Under 1000: Linear regression predictor coefficient is 0.028531 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.04401 (explains about 4.4% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_PITCHING_BB, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_BB, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.289  -9.376   0.944  10.632  70.171 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      74.28864    1.13779  65.292  < 2e-16 ***
## TEAM_PITCHING_BB  0.01176    0.00197   5.968 2.78e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.63 on 2274 degrees of freedom
## Multiple R-squared:  0.01542,    Adjusted R-squared:  0.01499 
## F-statistic: 35.61 on 1 and 2274 DF,  p-value: 2.785e-09

Points under 1000.

summary(lm(TARGET_WINS ~ TEAM_PITCHING_BB, data=data[data$TEAM_PITCHING_BB<1000,]))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_BB, data = data[data$TEAM_PITCHING_BB < 
##     1000, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.321  -9.493   0.873  10.157  76.942 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      65.320528   1.556703   41.96   <2e-16 ***
## TEAM_PITCHING_BB  0.028531   0.002799   10.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.22 on 2257 degrees of freedom
## Multiple R-squared:  0.04401,    Adjusted R-squared:  0.04359 
## F-statistic: 103.9 on 1 and 2257 DF,  p-value: < 2.2e-16

TEAM_PITCHING_H

Hits allowed.

No missing values.

Negative relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only. Most points on the scatter plot are below 5000. A second scatter plot that only looks at points below 5000 suggests that the relationship is positive.

All points: Linear regression predictor coefficient is -0.0012309 (points contributed to TARGET_WINS with every unit increase). The p-value is 1.46e-07 *** (predictor is significant). The R-squared is 0.01209 (explains about 1.2% of variability of response variable).

Points under 5000: Linear regression predictor coefficient is 0.003825 (points contributed to TARGET_WINS with every unit increase). The p-value is 3.07e-07 *** (predictor is significant). The R-squared is 0.01167 (explains about 1.2% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_PITCHING_H, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.165  -9.950   0.905  10.773  68.838 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     82.9809701  0.5293050 156.773  < 2e-16 ***
## TEAM_PITCHING_H -0.0012309  0.0002334  -5.274 1.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.66 on 2274 degrees of freedom
## Multiple R-squared:  0.01209,    Adjusted R-squared:  0.01165 
## F-statistic: 27.82 on 1 and 2274 DF,  p-value: 1.457e-07

Points under 5000.

summary(lm(TARGET_WINS ~ TEAM_PITCHING_H, data=data[data$TEAM_PITCHING_H<5000,]))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_H, data = data[data$TEAM_PITCHING_H < 
##     5000, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.761  -9.384   0.886  10.534  53.153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     74.766091   1.251857  59.724  < 2e-16 ***
## TEAM_PITCHING_H  0.003825   0.000745   5.135 3.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.9 on 2233 degrees of freedom
## Multiple R-squared:  0.01167,    Adjusted R-squared:  0.01123 
## F-statistic: 26.36 on 1 and 2233 DF,  p-value: 3.071e-07
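The sign change between the two fits above is what a handful of high-leverage points can produce. The sketch below reproduces the effect on a toy data set (all values are hypothetical, and the cutoff of 100 is chosen for the toy data only): the bulk of the points has a slope of exactly 1, and two extreme points are enough to turn the full-data slope negative.

```r
# Bulk of the data: y rises by exactly 1 per unit of x.
x_bulk <- 1:20
y_bulk <- 50 + x_bulk

# Two extreme x values with very low y (high-leverage outliers).
toy <- data.frame(
  x = c(x_bulk, 200, 250),
  y = c(y_bulk, 10, 5)
)

# Fit on all points, then on the points below the cutoff.
fit_all <- lm(y ~ x, data = toy)
fit_sub <- lm(y ~ x, data = toy[toy$x < 100, ])

print(coef(fit_all))  # slope is negative
print(coef(fit_sub))  # slope is 1
```

A check of this kind supports treating the extreme TEAM_PITCHING_H values separately before relying on the sign of the coefficient.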

TEAM_PITCHING_HR

Home runs allowed.

No missing values.

Positive relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.048572 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.03573 (explains about 3.6% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_PITCHING_HR, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_HR, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.657  -9.956   0.636  10.055  67.477 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      75.656920   0.646540 117.018   <2e-16 ***
## TEAM_PITCHING_HR  0.048572   0.005292   9.179   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.47 on 2274 degrees of freedom
## Multiple R-squared:  0.03573,    Adjusted R-squared:  0.0353 
## F-statistic: 84.25 on 1 and 2274 DF,  p-value: < 2.2e-16

TEAM_PITCHING_SO (102 Missing)

Strikeouts by pitcher.

102 missing values.

Simple linear regression result suggests that this variable is significant.

Negative relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only. Outliers may have a strong influence on the regression line. Most points are under 2000. A second scatter plot that only looks at points below 2000 also shows a negative relationship.

Linear regression predictor coefficient is -0.0022085 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.000252 *** (predictor is significant). The R-squared is 0.006152 (explains less than 1% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_PITCHING_SO, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_SO, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.570  -9.402   0.970  10.484  63.430 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      82.5704787  0.5945630 138.876  < 2e-16 ***
## TEAM_PITCHING_SO -0.0022085  0.0006023  -3.667 0.000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.53 on 2172 degrees of freedom
##   (102 observations deleted due to missingness)
## Multiple R-squared:  0.006152,   Adjusted R-squared:  0.005695 
## F-statistic: 13.45 on 1 and 2172 DF,  p-value: 0.0002515

Scatter plot of points below 2000.

summary(lm(TARGET_WINS ~ TEAM_PITCHING_SO, data=data[data$TEAM_PITCHING_SO<2000,]))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_PITCHING_SO, data = data[data$TEAM_PITCHING_SO < 
##     2000, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.154  -9.417   0.811  10.437  61.846 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      84.154074   1.121929  75.008  < 2e-16 ***
## TEAM_PITCHING_SO -0.004136   0.001347  -3.071  0.00216 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.41 on 2163 degrees of freedom
##   (102 observations deleted due to missingness)
## Multiple R-squared:  0.004341,   Adjusted R-squared:  0.00388 
## F-statistic:  9.43 on 1 and 2163 DF,  p-value: 0.002161
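One detail of the subset call above: TEAM_PITCHING_SO contains missing values, and bracket indexing with a logical vector that contains NA returns an all-NA row for each NA. `lm()` then discards those rows, which is consistent with the subset output still reporting 102 observations deleted due to missingness. The fitted model is the same either way, but `subset()` or `which()` removes the NA rows up front. A minimal sketch on toy data (the data frame and its values are hypothetical):

```r
# Toy data with one missing x and one value above the cutoff.
toy <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 3000, 20))

# Logical indexing keeps a row of NAs where the condition is NA.
by_bracket <- toy[toy$x < 2000, ]

# subset() and which() both drop rows where the condition is NA.
by_subset <- subset(toy, x < 2000)
by_which  <- toy[which(toy$x < 2000), ]

print(nrow(by_bracket))  # 3 rows, one of them all NA
print(nrow(by_subset))   # 2 rows
print(nrow(by_which))    # 2 rows
```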

BASE RUN VARIABLES

TEAM_BASERUN_SB (131 Missing)

Stolen bases.

131 missing values.

Results of simple linear regression below suggest that this variable is significant.

Positive relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.02273 (points contributed to TARGET_WINS with every unit increase). The p-value is 3.3e-10 *** (predictor is significant). The R-squared is 0.01826 (explains about 1.8% of variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BASERUN_SB, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BASERUN_SB, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.009  -8.986   1.013  10.082  52.309 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     78.00909    0.54912 142.061  < 2e-16 ***
## TEAM_BASERUN_SB  0.02273    0.00360   6.314  3.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.63 on 2143 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.01826,    Adjusted R-squared:  0.0178 
## F-statistic: 39.86 on 1 and 2143 DF,  p-value: 3.299e-10

TEAM_BASERUN_CS (772 Missing)

Caught stealing.

772 missing values. Results of simple linear regression below suggest that this variable is not significant.

Recommendation is to drop this variable from the model.

Positive relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is 0.01314 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.385 (predictor is not significant at the 5% level). The R-squared is 0.0005019 (explains less than 1% variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_BASERUN_CS, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BASERUN_CS, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.152  -8.727   0.573   9.185  53.217 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     80.15192    0.87106  92.017   <2e-16 ***
## TEAM_BASERUN_CS  0.01314    0.01513   0.869    0.385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.46 on 1502 degrees of freedom
##   (772 observations deleted due to missingness)
## Multiple R-squared:  0.0005019,  Adjusted R-squared:  -0.0001635 
## F-statistic: 0.7543 on 1 and 1502 DF,  p-value: 0.3853

FIELDING VARIABLES

TEAM_FIELDING_E

Errors.

No missing values.

Negative relationship (expected theoretical effect is negative). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is -0.012205 (points contributed to TARGET_WINS with every unit increase). The p-value is <2e-16 *** (predictor is significant). The R-squared is 0.03115 (explains about 3.1% variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_FIELDING_E, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.461 -10.078   0.697  10.318  73.808 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     83.799234   0.479030  174.94   <2e-16 ***
## TEAM_FIELDING_E -0.012205   0.001427   -8.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.51 on 2274 degrees of freedom
## Multiple R-squared:  0.03115,    Adjusted R-squared:  0.03072 
## F-statistic:  73.1 on 1 and 2274 DF,  p-value: < 2.2e-16

TEAM_FIELDING_DP (286 missing)

Double plays.

286 missing values. Results of simple linear regression below suggest that this variable is not significant.

Recommendation is to drop this variable from the model.

Negative relationship (expected theoretical effect is positive). No obvious curved patterns observed from visually inspecting the scatter plots only.

Linear regression predictor coefficient is -0.01853 (points contributed to TARGET_WINS with every unit increase). The p-value is 0.12 (predictor is not significant at 5% level). The R-squared is 0.001215 (explains less than 1% variability of response variable).

summary(lm(TARGET_WINS ~ TEAM_FIELDING_DP, data=data))

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_DP, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.642  -9.062   0.813   9.803  46.747 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      83.71655    1.77202  47.244   <2e-16 ***
## TEAM_FIELDING_DP -0.01853    0.01192  -1.555     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 1988 degrees of freedom
##   (286 observations deleted due to missingness)
## Multiple R-squared:  0.001215,   Adjusted R-squared:  0.0007122 
## F-statistic: 2.417 on 1 and 1988 DF,  p-value: 0.1201

Linear Regression Model

Variable Selection

MISSING DATA

As discussed in the MISSING DATA section, predictors with missing data that are not significant in a simple linear regression against the response variable are dropped from the variable selection process. Details of each regression are shown in the section Simple Linear Regressions of each Variable. Predictors with missing data that are significant are considered in the selection process, but incomplete observations are dropped. Of the six variables with missing data, four are dropped. At most 233 observations (131 + 102) out of the 2,276 total are ignored when one or both of the two remaining variables are in the model.

DROP VARIABLES:

TEAM_BATTING_HBP: 2085 missing values (p-value 0.312)
TEAM_BATTING_SO: 102 missing values (p-value 0.139)
TEAM_BASERUN_CS: 772 missing (p-value 0.385)
TEAM_FIELDING_DP: 286 missing (p-value 0.12)

IGNORE OBSERVATIONS:

TEAM_BASERUN_SB: 131 missing values
TEAM_PITCHING_SO: 102 missing values
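
The missing-value counts and the number of usable rows can be checked along the following lines. This is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical stand-ins for the training data, not the actual records.

```r
# Hypothetical toy data frame standing in for the training data
toy <- data.frame(
  TARGET_WINS     = c(70, 85, 90, 60, 77),
  TEAM_BASERUN_SB = c(100, NA, 120, 80, NA),
  TEAM_BASERUN_CS = c(NA, NA, 40, NA, 55)
)

# Number of missing values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Rows that remain usable when TEAM_BASERUN_SB is kept in the model
# and TEAM_BASERUN_CS is dropped as a variable
kept <- complete.cases(toy[, c("TARGET_WINS", "TEAM_BASERUN_SB")])
n_kept <- sum(kept)
print(n_kept)
```

Dropping a sparse column keeps more rows than filtering on complete cases across every column, which is the reason for treating the two groups of variables differently.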

STRONGLY CORRELATED VARIABLES

We need to be careful not to put strongly correlated variables in the same model.

TEAM_BATTING_H and TEAM_PITCHING_H are perfectly correlated (cor 1.0).
TEAM_BATTING_HR and TEAM_PITCHING_HR are perfectly correlated (cor 1.0).
TEAM_BATTING_BB and TEAM_PITCHING_BB are perfectly correlated (cor 1.0).
TEAM_BATTING_SO and TEAM_PITCHING_SO are perfectly correlated (cor 1.0).
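
Pairs like these can be flagged from a correlation matrix. The sketch below uses made-up columns (`bat_h`, `pit_h`, `errors` are hypothetical names and values) and an arbitrary cut-off of 0.9; `use = "pairwise.complete.obs"` is one way to handle columns with missing values.

```r
# Hypothetical toy columns; the real columns live in `data`
toy <- data.frame(bat_h = c(1400, 1500, NA, 1450, 1600, 1380))
toy$pit_h  <- toy$bat_h * 1.02 + 5          # exact linear function of bat_h
toy$errors <- c(150, 140, 160, NA, 155, 145)

# Pairwise correlations, using all rows where both columns are present
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 3))

# Flag each pair once (upper triangle) when |correlation| exceeds the cut-off
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
print(high)
```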

OPPOSITE EFFECT

Below is a list of variables whose effect on the response variable in the training data is opposite to the theoretical effect noted on the homework sheet.

See Simple Linear Regressions of each Variable for more details.

TEAM_PITCHING_BB: training data effect is positive on response variable, but expected theoretical effect is negative.
TEAM_PITCHING_HR: training data effect is positive on response variable, but expected theoretical effect is negative.
TEAM_PITCHING_SO: training data effect is negative on response variable, but expected theoretical effect is positive.
TEAM_BASERUN_CS (dropped): training data effect is positive on response variable, but expected theoretical effect is negative.
TEAM_FIELDING_DP (dropped): training data effect is negative on response variable, but expected theoretical effect is positive.

POSSIBLE VARIABLES

PITCHING VARS	BATTING VARS	OTHER VARS
TEAM_PITCHING_SO	TEAM_BATTING_BB	TEAM_FIELDING_E
TEAM_PITCHING_BB	TEAM_BATTING_HR	TEAM_BASERUN_SB
TEAM_PITCHING_HR	TEAM_BATTING_H
TEAM_PITCHING_H	TEAM_BATTING_2B
	TEAM_BATTING_3B

Because of the strong correlation between the PITCHING and BATTING variables, the model will use variables from either the PITCHING group or the BATTING group, but not from both categories. The variables in the OTHER VARS category do not show particularly strong correlations with any of the other variables in the possible selection list.

Model 1

Model 1 is a linear-linear regression model with no variable transformations.

Because TEAM_BATTING_H includes TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, model 1 will only use TEAM_BATTING_H.

Selected variables for model 1:

TEAM_BATTING_H
TEAM_BATTING_BB
TEAM_FIELDING_E
TEAM_BASERUN_SB

model1 <- lm(data$TARGET_WINS ~ data$TEAM_BATTING_H + data$TEAM_BATTING_BB + data$TEAM_FIELDING_E + data$TEAM_BASERUN_SB, data=data)

Model 1 linear model summary below shows that all predictors (except the intercept) are significant.

The residuals median is close to zero (0.018).

The Adjusted R-squared is 0.3055 (explains 30.55% of variability of response variable).

summary(model1)

## 
## Call:
## lm(formula = data$TARGET_WINS ~ data$TEAM_BATTING_H + data$TEAM_BATTING_BB + 
##     data$TEAM_FIELDING_E + data$TEAM_BASERUN_SB, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.476  -8.431   0.018   8.469  44.890 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.484126   3.182849   0.466    0.641    
## data$TEAM_BATTING_H   0.046404   0.002057  22.556  < 2e-16 ***
## data$TEAM_BATTING_BB  0.023591   0.003061   7.707 1.96e-14 ***
## data$TEAM_FIELDING_E -0.034254   0.002140 -16.008  < 2e-16 ***
## data$TEAM_BASERUN_SB  0.050277   0.003564  14.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.31 on 2140 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.3067, Adjusted R-squared:  0.3055 
## F-statistic: 236.7 on 4 and 2140 DF,  p-value: < 2.2e-16

VIF of Model 1

Source: http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/

For a given predictor (p), multicollinearity can be assessed by computing a score called the variance inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

The VIFs of all predictors in model 1 are under 5. The smallest possible value of VIF is 1. As a rule of thumb, a VIF of 5 or higher indicates a problematic amount of collinearity.

VIF(model1)

##  data$TEAM_BATTING_H data$TEAM_BATTING_BB data$TEAM_FIELDING_E 
##             1.132248             1.285256             1.867236 
## data$TEAM_BASERUN_SB 
##             1.385953
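
For reference, the VIF of predictor j equals 1 / (1 - R²_j), where R²_j is the R-squared from regressing predictor j on the other predictors. The sketch below computes this by hand in base R on simulated data; `manual_vif` and the simulated columns are hypothetical and are not the `VIF()` function used above.

```r
set.seed(42)
# Simulated predictors (hypothetical; not the baseball data)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)   # moderately correlated with x1
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors
manual_vif <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(sim, c("x1", "x2", "x3"))
print(round(vifs, 3))
```

Because R²_j cannot be negative, each value is at least 1, which matches the lower bound stated above.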

Standardized Residual Analysis of Model 1

The article linked below discusses what standardized residual plots should look like for an acceptable model.

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/

Standardized residual plots should have these characteristics:

they’re pretty symmetrically distributed, tending to cluster towards the middle of the plot.
they’re clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150).
in general, there aren’t any clear patterns.

As you can see, the standardized residual plot of model 1 seems to display the characteristics described above. Generally, the residuals are between -2 and 2 (not a big range). They are symmetrically distributed around the horizontal zero line (indicating no systematic over- or under-prediction). I do not see any curved patterns.

model1.predict <- predict(model1)
model1.stdres <- rstandard(model1)

plot(model1.predict, model1.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins",  main="Model 1") 
abline(0, 0)
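
The visual impression that most standardized residuals fall between -2 and 2 can also be checked numerically. A minimal sketch on simulated data (the variables `x`, `y` and the model `fit` are hypothetical, not model 1):

```r
set.seed(7)
# Simulated data (hypothetical), fitted with lm() like the models in this report
n <- 500
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

std_res <- rstandard(fit)

# Share of standardized residuals outside [-2, 2]; roughly 5% is expected
# when the errors are approximately normal
share_outside <- mean(abs(std_res) > 2)
print(share_outside)

# Largest absolute standardized residual
print(max(abs(std_res)))
```

The same two lines applied to `model1.stdres` would give the corresponding figures for model 1.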

Interpretation of Coefficients of Model 1

The intercept (1.484126) is not significant.
TEAM_BATTING_H (0.046404): A unit increase in TEAM_BATTING_H (base hits by batters) increases TARGET_WINS by 0.046404 points.
TEAM_BATTING_BB (0.02359): A unit increase in TEAM_BATTING_BB (walks by batters) increases TARGET_WINS by 0.02359 points.
TEAM_FIELDING_E (-0.034254): A unit increase in TEAM_FIELDING_E (errors) decreases TARGET_WINS by 0.034254 points.
TEAM_BASERUN_SB (0.050277): A unit increase in TEAM_BASERUN_SB (stolen bases) increases TARGET_WINS by 0.050277 points.

Model 2

Model 2 is a linear-linear regression model with no variable transformations.

Because TEAM_BATTING_H includes TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, model 2 will not use TEAM_BATTING_H. Instead, it will use TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and a calculated column for TEAM_BATTING_1B.

Selected variables for model 2:

TEAM_BATTING_1B (calculated)
TEAM_BATTING_2B (removed from final version of model 2)
TEAM_BATTING_3B
TEAM_BATTING_HR
TEAM_BATTING_BB
TEAM_FIELDING_E
TEAM_BASERUN_SB

CALCULATE TEAM_BATTING_1B

data$TEAM_BATTING_1B <- data$TEAM_BATTING_H - data$TEAM_BATTING_2B - data$TEAM_BATTING_3B - data$TEAM_BATTING_HR

The first attempt at building model2 shows that TEAM_BATTING_2B is not a significant predictor (p-value 0.991).

So we are going to remove TEAM_BATTING_2B from the model.

model2 <- lm(data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_2B + data$TEAM_BATTING_3B + data$TEAM_BATTING_HR + TEAM_BATTING_BB + 
             TEAM_FIELDING_E +TEAM_BASERUN_SB , data=data)
summary(model2)

## 
## Call:
## lm(formula = data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_2B + 
##     data$TEAM_BATTING_3B + data$TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E + TEAM_BASERUN_SB, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.935  -8.318   0.001   8.066  49.247 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.306e+00  3.420e+00   0.382    0.702    
## data$TEAM_BATTING_1B  5.051e-02  3.290e-03  15.352  < 2e-16 ***
## data$TEAM_BATTING_2B  8.322e-05  7.193e-03   0.012    0.991    
## data$TEAM_BATTING_3B  1.292e-01  1.563e-02   8.263 2.45e-16 ***
## data$TEAM_BATTING_HR  9.374e-02  7.339e-03  12.772  < 2e-16 ***
## TEAM_BATTING_BB       2.101e-02  3.157e-03   6.655 3.60e-11 ***
## TEAM_FIELDING_E      -3.713e-02  2.322e-03 -15.988  < 2e-16 ***
## TEAM_BASERUN_SB       4.699e-02  3.809e-03  12.337  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.12 on 2137 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.328,  Adjusted R-squared:  0.3258 
## F-statistic:   149 on 7 and 2137 DF,  p-value: < 2.2e-16

The updated version of model 2, which excludes TEAM_BATTING_2B, shows that all predictors are significant except the intercept.

The residual median of model 2 is -0.003 (which is closer to zero than model 1 at 0.018).

The Adjusted R-squared is 0.3261 (explains 32.61% of variability of response variable), which is larger than model 1 at 0.3055.

model2 <- lm(data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_3B + data$TEAM_BATTING_HR + data$TEAM_BATTING_BB + 
             data$TEAM_FIELDING_E + data$TEAM_BASERUN_SB , data=data)
summary(model2)

## 
## Call:
## lm(formula = data$TARGET_WINS ~ data$TEAM_BATTING_1B + data$TEAM_BATTING_3B + 
##     data$TEAM_BATTING_HR + data$TEAM_BATTING_BB + data$TEAM_FIELDING_E + 
##     data$TEAM_BASERUN_SB, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.939  -8.312  -0.003   8.064  49.255 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.306120   3.418851   0.382    0.702    
## data$TEAM_BATTING_1B  0.050521   0.003052  16.552  < 2e-16 ***
## data$TEAM_BATTING_3B  0.129173   0.015520   8.323  < 2e-16 ***
## data$TEAM_BATTING_HR  0.093776   0.006480  14.471  < 2e-16 ***
## data$TEAM_BATTING_BB  0.021014   0.003150   6.670 3.25e-11 ***
## data$TEAM_FIELDING_E -0.037129   0.002292 -16.197  < 2e-16 ***
## data$TEAM_BASERUN_SB  0.046989   0.003805  12.349  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.12 on 2138 degrees of freedom
##   (131 observations deleted due to missingness)
## Multiple R-squared:  0.328,  Adjusted R-squared:  0.3261 
## F-statistic: 173.9 on 6 and 2138 DF,  p-value: < 2.2e-16

VIF of Model 2

None of the VIF values of the predictors is above 5. This means that the model does not have a problematic amount of collinearity.

VIF(model2)

## data$TEAM_BATTING_1B data$TEAM_BATTING_3B data$TEAM_BATTING_HR 
##             1.945005             2.567734             2.126136 
## data$TEAM_BATTING_BB data$TEAM_FIELDING_E data$TEAM_BASERUN_SB 
##             1.403139             2.208898             1.628440

Standardized Residual Analysis of Model 2

The standardized residual plot of model 2 has a very similar shape and pattern to model 1 although the range is larger (-4 to 4).

model2.predict <- predict(model2)
model2.stdres <- rstandard(model2)

plot(model2.predict, model2.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins",  main="Model 2") 
abline(0, 0)

Interpretation of Coefficients of Model 2

The intercept (1.306120) is not significant.
The coefficients for TEAM_BATTING_1B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, and TEAM_BASERUN_SB are the number of points by which TARGET_WINS increases when the predictor increases by one unit.
The coefficient for TEAM_FIELDING_E has a negative sign. This means that a unit increase in TEAM_FIELDING_E (errors) decreases TARGET_WINS by 0.037129 points.

Model 3

Model 3 is a linear-log model. It takes model 1 and applies a log transformation to the predictors of model 1.

The response variable TARGET_WINS looks approximately normal already and does not have any noticeable marked skewness. So we’re leaving the response variable as is.

TEAM_BATTING_BB is skewed to the left but has an approximate normal distribution. TEAM_BATTING_H is skewed to the right but has an approximate normal shape. TEAM_FIELDING_E is markedly skewed to the right. TEAM_BASERUN_SB is also markedly skewed to the right.

Remove incomplete cases. We have 2143 complete cases.

data2 <- data[, c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_BB", "TEAM_FIELDING_E", "TEAM_BASERUN_SB")]
data2 <- data2[complete.cases(data2),]
data2<-data2[!(data2$TEAM_BATTING_BB==0),]
data2<-data2[!(data2$TEAM_BASERUN_SB==0),]
nrow(data2)

## [1] 2143
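
Rows with zero values are removed because the logarithm of zero is not finite, and `lm()` cannot fit a model with infinite predictor values. A minimal sketch on made-up data (`toy`, `wins`, and `walks` are hypothetical):

```r
# Made-up values (hypothetical); the zero stands in for a team with no recorded walks
toy <- data.frame(wins = c(60, 75, 82, 90), walks = c(0, 450, 520, 610))

# The log of zero is -Inf, which lm() rejects
print(log(toy$walks))

# Keep only rows with strictly positive values before taking logs
toy_pos <- toy[toy$walks > 0, ]
fit <- lm(wins ~ log(walks), data = toy_pos)
print(coef(fit))
```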

All predictors of model 3 are significant, including the intercept.

The residual median is -0.314 (still close to zero).

However, the adjusted R-squared is only 0.2653, which is lower compared to model 1 and model 2.

model3 <- lm(data2$TARGET_WINS ~ log(data2$TEAM_BATTING_H) + log(data2$TEAM_BATTING_BB) + log(data2$TEAM_FIELDING_E) + 
               log(data2$TEAM_BASERUN_SB), data=data2)
summary(model3)

## 
## Call:
## lm(formula = data2$TARGET_WINS ~ log(data2$TEAM_BATTING_H) + 
##     log(data2$TEAM_BATTING_BB) + log(data2$TEAM_FIELDING_E) + 
##     log(data2$TEAM_BASERUN_SB), data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.732  -8.706  -0.314   8.425  50.891 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -469.1348    24.0273 -19.525   <2e-16 ***
## log(data2$TEAM_BATTING_H)    67.9951     3.1409  21.648   <2e-16 ***
## log(data2$TEAM_BATTING_BB)   11.3472     1.3197   8.598   <2e-16 ***
## log(data2$TEAM_FIELDING_E)   -8.3150     0.7074 -11.754   <2e-16 ***
## log(data2$TEAM_BASERUN_SB)    5.8308     0.4935  11.815   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.55 on 2138 degrees of freedom
## Multiple R-squared:  0.2667, Adjusted R-squared:  0.2653 
## F-statistic: 194.4 on 4 and 2138 DF,  p-value: < 2.2e-16

VIF of Model 3

As expected, none of the VIF values of the predictors is above 5. This means that the model does not have a problematic amount of collinearity.

VIF(model3)

##  log(data2$TEAM_BATTING_H) log(data2$TEAM_BATTING_BB) 
##                   1.058550                   1.287283 
## log(data2$TEAM_FIELDING_E) log(data2$TEAM_BASERUN_SB) 
##                   1.764209                   1.361466

Standardized Residual Analysis of Model 3

The standardized residual plot of model 3 has a very similar shape and pattern to both model 1 and 2. The range of spread is very similar to model 2 (-4 to 4); however, the points on the plot look more spread out.

The residual standard error of model 3 is the highest among the 3 models (12.55), followed by model 1 (12.31) and model 2 (12.12).

So, model 2 has the best fit in terms of residuals.

model3.predict <- predict(model3)
model3.stdres <- rstandard(model3)

plot(model3.predict, model3.stdres, ylab="Standardized Residuals", xlab="Predicted Target Wins",  main="Model 3") 
abline(0, 0)
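
Residual standard error and adjusted R-squared can be collected into one table so that candidate models are compared side by side. The sketch below uses simulated data; `compare_models`, `fit_a`, and `fit_b` are hypothetical names, and the numbers are not those of models 1 to 3.

```r
set.seed(3)
# Simulated data (hypothetical); fit_a and fit_b stand in for competing models
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 1.5 * x1 + 0.8 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

fit_a <- lm(y ~ x1, data = sim)
fit_b <- lm(y ~ x1 + x2, data = sim)

# One row per model: residual standard error and adjusted R-squared
compare_models <- function(models) {
  data.frame(
    model  = names(models),
    rse    = sapply(models, sigma),
    adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
    row.names = NULL
  )
}

comparison <- compare_models(list(fit_a = fit_a, fit_b = fit_b))
print(comparison)
```

Calling the same helper on `list(model1 = model1, model2 = model2, model3 = model3)` would reproduce the figures quoted in this section, with the caveat that model 3 is fitted on slightly fewer rows.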

Interpretation of Coefficients of Model 3

The intercept of model 3 is significant (-469.1348).
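A 1% increase in TEAM_BATTING_H increases TARGET_WINS by about 0.68 (67.9951/100).
A 1% increase in TEAM_BATTING_BB increases TARGET_WINS by about 0.11 (11.3472/100).
A 1% increase in TEAM_FIELDING_E decreases TARGET_WINS by about 0.08 (8.3150/100).
A 1% increase in TEAM_BASERUN_SB increases TARGET_WINS by about 0.06 (5.8308/100).
Because the predictors are log-transformed, each coefficient divided by 100 approximates the change in TARGET_WINS for a 1% increase in that predictor.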
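
In a linear-log model, the change in the predicted response for a 1% increase in a predictor is the coefficient times log(1.01), which is close to the coefficient divided by 100. The sketch below works this out for the coefficients copied from the model 3 summary; the vector names `beta` and `effect_1pct` are introduced here for illustration.

```r
# Coefficients copied from the model 3 summary
beta <- c(
  TEAM_BATTING_H  = 67.9951,
  TEAM_BATTING_BB = 11.3472,
  TEAM_FIELDING_E = -8.3150,
  TEAM_BASERUN_SB = 5.8308
)

# Exact change in predicted TARGET_WINS for a 1% increase in a predictor:
# beta * (log(1.01 * x) - log(x)) = beta * log(1.01)
effect_1pct <- beta * log(1.01)
print(round(effect_1pct, 3))

# Common approximation: beta / 100
print(round(beta / 100, 3))
```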

Model Chosen

NOTE: In the group submission, this is referred to as Model 4.

Because model 2 has the lowest residual standard error (12.12) and the highest adjusted R-squared (0.3261), this model is chosen to do the prediction on the evaluation data.

Read data to be predicted.

Calculate TEAM_BATTING_1B in evaluation data.

eval_data$TEAM_BATTING_1B <- eval_data$TEAM_BATTING_H - eval_data$TEAM_BATTING_2B - eval_data$TEAM_BATTING_3B - eval_data$TEAM_BATTING_HR

Predict target_wins based on model 2 and write to CSV file.

predicted_targetWins <- predict(model2, newdata=data.frame(eval_data))
write.csv(predicted_targetWins, file = "Model2-Predcited_TargetWins.csv")
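
One point worth verifying before using the predictions: when a formula refers to columns as `data$COLUMN`, `predict()` cannot substitute the columns of `newdata`, so it may return fitted values for the training rows instead. Writing the formula with bare column names and `data = ...` avoids this. The sketch below illustrates the difference on made-up data; `train` and `new_rows` are hypothetical, and whether this affects the CSV above should be checked by comparing `length(predicted_targetWins)` with `nrow(eval_data)`.

```r
# Made-up training and evaluation data (hypothetical names and values)
train <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
new_rows <- data.frame(x = c(10, 20))

# Formula with `train$` prefixes: newdata cannot replace these columns
fit_prefixed <- lm(train$y ~ train$x)
pred_prefixed <- suppressWarnings(predict(fit_prefixed, newdata = new_rows))
print(length(pred_prefixed))   # number of training rows, not nrow(new_rows)

# Formula with bare column names: newdata is used as intended
fit_bare <- lm(y ~ x, data = train)
pred_bare <- predict(fit_bare, newdata = new_rows)
print(length(pred_bare))       # matches nrow(new_rows)
```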

Data 621 Homework 1

S. Tinapunan - Group 2 Member

September 21, 2019
