In this assignment, we will be analyzing approximately 2200 records from year of 1871 to 2006, Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 515465 27.6 1138110 60.8 NA 669294 35.8
## Vcells 948150 7.3 8388608 64.0 16384 1839880 14.1
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.1.8
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## corrplot 0.92 loaded
##
## Loading required package: carData
##
##
## Attaching package: 'car'
##
##
## The following object is masked from 'package:dplyr':
##
## recode
##
##
## The following object is masked from 'package:purrr':
##
## some
##
##
## The following object is masked from 'package:psych':
##
## logit
##
##
## Loading required package: lattice
##
##
## Attaching package: 'caret'
##
##
## The following object is masked from 'package:purrr':
##
## lift
head(df1)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
summary(df1)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
# From the Summary we have noticed that there are 6 variables containing N/A values, hence we need to fill them with values in order to perform the analyses
df1_med <- df1%>%
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
df2_med <- df2%>%
mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
# Data Imputation, perfomed by filling all the N/A values with the median of the column
As we can see the names are really complicated, in order to better understand the text, as well as making the graphs easier to read, it is advisable to make the names easier to read
train <- df1_med
test <- df2_med
train$INDEX <- NULL
test$INDEX <- NULL
clean <- function(train){
name_list <- names(train)
name_list <- gsub("TEAM_", "", name_list)
names(train) <- name_list
train
}
df1 <- clean(train)
df2 <- clean(test)
# Now the TEAM part of the variables have been removed
Calculate correlation between TEAM_WINS and all other variables, to determine which variable is worthy of creating a barchart for further analyze
# Calculate correlation between TEAM_WINS and all other variables
correlations <- cor(df1)
# Extract correlations for TEAM_WINS
wins_cor <- correlations["TARGET_WINS", ]
# Print correlations for TEAM_WINS only
print(wins_cor)
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## 1.00000000 0.38876752 0.28910365 0.14260841 0.17615320 0.23255986
## BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR
## -0.03058135 0.12361087 0.01595982 0.01651641 -0.10993705 0.18901373
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## 0.12417454 -0.07579967 -0.17648476 -0.03008630
# Then, we can calculate that correlations between all 16 different variables
DT::datatable(cor(drop_na(df1[,])), options = list(pagelength=5))
From the first correlation table, we can conclude that TEAM_BATTING_H(Base Hits by Batters), TEAM_BATTING_2B (Doubles by Batters), and TEAM_BATTING_BB(Walks by Batters) post a realtively strong correlation with Wins. Hence we are going to further plot a boxplot to better visualize the relationship
cor_res <- cor(df1, use = "complete.obs")
df1 %>%
cor(., use = "complete.obs") %>%
corrplot(., method = "color", type = "upper", tl.col = "black", diag = FALSE)
??corrplot
From the corrplot above, we can see that there are variables that show a strong positive correlation. BATTING_HR has a relatively strong correlation relationship with BATTING_BB AND BATTING_SO, it also has a very strong relationship with PITCHING_HR. BATTING_3B has a relative strong negative correlation with BATTING_HR and BATTING_SO.
df1 %>% # By creating a Distribution graph, we are able to analyze the skewness of the distrubution of values
gather(variable, value, TARGET_WINS:FIELDING_DP) %>%
ggplot(., aes(value)) +
geom_density(fill = "red", color="red") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
From the distribution graphs above, we can see that most of the data tend to have a distribution skewed to the right, with only several data have a relatively normal distribution, among those, Batting_2B, Batting_BB and Target_Wins are obvious in its relative normal distributions.
ggplot(stack(df1), aes(x = ind, y = values)) +
geom_boxplot() + coord_cartesian(ylim = c(0, 2500)) +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = 'grey'))
The outliers graph matches the distribution graph, indicating there are multiple independent variables have large amount of outliers .One potential impact of outliers is that they can distort the estimated relationships between variables in a statistical model.Especially in linear regression, outliers can pull the regression line away from the true trend of the data and result in biased estimates of the coefficients. Therefore we have to consider the effect of outliers while we performing linear relationship analyses in the end.
Even though we have used medians to fill up all the missing variables, however, considering the tremendous amount of missing values in BATTING_HBP(Batters hit by pitch), we can drop the variable out of the study for accuracy purposes.
HBP_removed <- function(df1){
# Drop the hit by pitcher variable
df1 %>% select(-BATTING_HBP)
}
df1 <- HBP_removed(df1)
df2 <- HBP_removed(df2) %>% na.omit()
Since we have already removed the data that has a tremendous amount of N/A values, and filled median to the columns with N/A values, hence it should be ready to start building linear relationship analytical models.
To initially use all the variables available, to discover the significance of all values based on our base data, and we can build other models upon the result we obtain from this model.
model1 <- lm (formula = TARGET_WINS ~ . , data = df1)
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.753 -8.626 0.120 8.395 58.561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.6421579 5.3902272 4.386 1.21e-05 ***
## BATTING_H 0.0489152 0.0036949 13.239 < 2e-16 ***
## BATTING_2B -0.0209575 0.0091783 -2.283 0.022501 *
## BATTING_3B 0.0644788 0.0168040 3.837 0.000128 ***
## BATTING_HR 0.0527325 0.0274915 1.918 0.055219 .
## BATTING_BB 0.0104483 0.0058377 1.790 0.073621 .
## BATTING_SO -0.0084323 0.0025461 -3.312 0.000941 ***
## BASERUN_SB 0.0254236 0.0043565 5.836 6.12e-09 ***
## BASERUN_CS -0.0110027 0.0157842 -0.697 0.485829
## PITCHING_H -0.0008456 0.0003674 -2.302 0.021444 *
## PITCHING_HR 0.0129626 0.0243894 0.531 0.595135
## PITCHING_BB 0.0007798 0.0041571 0.188 0.851231
## PITCHING_SO 0.0028156 0.0009219 3.054 0.002284 **
## FIELDING_E -0.0195325 0.0024609 -7.937 3.23e-15 ***
## FIELDING_DP -0.1217801 0.0129421 -9.410 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.07 on 2261 degrees of freedom
## Multiple R-squared: 0.3154, Adjusted R-squared: 0.3111
## F-statistic: 74.4 on 14 and 2261 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model1)
The overall estimates predicts well, however, since there are several variables that are not statistically significant (Batting_BB, Baserun_CS, Pitching_HR, Pitching_BB) From the above residual plots:
The variability of the points is approximately the same in the mid values of x with the decrese of variations towards the two end points which depicts that the plot is unbaised and homoscedastic except for few outliers.
Normal QQ plot fulfills the assumptions of normality.
But since few coefficients of the model are not significant, let’s see if assumptions of other models are true.
The imple
model2 <- lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_HR
+ BATTING_BB + BATTING_SO + BASERUN_SB +
PITCHING_SO + PITCHING_H
+ FIELDING_E + FIELDING_DP,
data = df1)
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_HR +
## BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_SO + PITCHING_H +
## FIELDING_E + FIELDING_DP, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.339 -8.693 0.131 8.491 56.732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.5750482 5.2324529 3.932 8.67e-05 ***
## BATTING_H 0.0543597 0.0034375 15.814 < 2e-16 ***
## BATTING_2B -0.0245164 0.0091592 -2.677 0.007489 **
## BATTING_HR 0.0555147 0.0091930 6.039 1.81e-09 ***
## BATTING_BB 0.0124752 0.0033780 3.693 0.000227 ***
## BATTING_SO -0.0098879 0.0024371 -4.057 5.13e-05 ***
## BASERUN_SB 0.0290169 0.0040966 7.083 1.87e-12 ***
## PITCHING_SO 0.0031793 0.0006720 4.731 2.37e-06 ***
## PITCHING_H -0.0010283 0.0003158 -3.256 0.001146 **
## FIELDING_E -0.0177728 0.0023798 -7.468 1.15e-13 ***
## FIELDING_DP -0.1241527 0.0129589 -9.580 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.11 on 2265 degrees of freedom
## Multiple R-squared: 0.3102, Adjusted R-squared: 0.3072
## F-statistic: 101.9 on 10 and 2265 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2)
The P Values are indicating that all of the variables are considered statistically significant. From the above redisual plots: 1. The variability of the points is approximately the same throughout the values of x which depicts that this plot is also unbaised and homoscedastic with very few(minimum) outliers.
In this regression model, we will only keep those that are really statistically significant in Model 1
model3 <- lm(formula = TARGET_WINS ~ BATTING_2B + BATTING_3B + BASERUN_SB +
BASERUN_CS + FIELDING_E + FIELDING_DP, data = df1)
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_2B + BATTING_3B + BASERUN_SB +
## BASERUN_CS + FIELDING_E + FIELDING_DP, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.143 -9.668 0.286 9.289 64.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.668048 2.664735 23.142 < 2e-16 ***
## BATTING_2B 0.097707 0.006733 14.512 < 2e-16 ***
## BATTING_3B 0.137972 0.013559 10.176 < 2e-16 ***
## BASERUN_SB 0.027056 0.004166 6.495 1.02e-10 ***
## BASERUN_CS -0.046707 0.016505 -2.830 0.0047 **
## FIELDING_E -0.020846 0.001578 -13.211 < 2e-16 ***
## FIELDING_DP -0.053614 0.013035 -4.113 4.04e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.18 on 2269 degrees of freedom
## Multiple R-squared: 0.1918, Adjusted R-squared: 0.1896
## F-statistic: 89.73 on 6 and 2269 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model3)
From this model, we can see that the p values are all extremly small, only with the exceptance of BASERUN_CS, however, even though the p-value of this variable is relatively big, it is still well below 0.05, at 0.0047
stargazer( model1, model2, model3, type = "text")
##
## =================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------------
## TARGET_WINS
## (1) (2) (3)
## -------------------------------------------------------------------------------------------------
## BATTING_H 0.049*** 0.054***
## (0.004) (0.003)
##
## BATTING_2B -0.021** -0.025*** 0.098***
## (0.009) (0.009) (0.007)
##
## BATTING_3B 0.064*** 0.138***
## (0.017) (0.014)
##
## BATTING_HR 0.053* 0.056***
## (0.027) (0.009)
##
## BATTING_BB 0.010* 0.012***
## (0.006) (0.003)
##
## BATTING_SO -0.008*** -0.010***
## (0.003) (0.002)
##
## BASERUN_SB 0.025*** 0.029*** 0.027***
## (0.004) (0.004) (0.004)
##
## BASERUN_CS -0.011 -0.047***
## (0.016) (0.017)
##
## PITCHING_H -0.001** -0.001***
## (0.0004) (0.0003)
##
## PITCHING_HR 0.013
## (0.024)
##
## PITCHING_BB 0.001
## (0.004)
##
## PITCHING_SO 0.003*** 0.003***
## (0.001) (0.001)
##
## FIELDING_E -0.020*** -0.018*** -0.021***
## (0.002) (0.002) (0.002)
##
## FIELDING_DP -0.122*** -0.124*** -0.054***
## (0.013) (0.013) (0.013)
##
## Constant 23.642*** 20.575*** 61.668***
## (5.390) (5.232) (2.665)
##
## -------------------------------------------------------------------------------------------------
## Observations 2,276 2,276 2,276
## R2 0.315 0.310 0.192
## Adjusted R2 0.311 0.307 0.190
## Residual Std. Error 13.074 (df = 2261) 13.111 (df = 2265) 14.180 (df = 2269)
## F Statistic 74.399*** (df = 14; 2261) 101.865*** (df = 10; 2265) 89.734*** (df = 6; 2269)
## =================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
vif(model1)
## BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 3.798921 2.455952 2.933662 36.876862 6.825564 5.090936
## BASERUN_SB BASERUN_CS PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO
## 1.842541 1.165281 3.555278 29.749612 6.365752 3.305543
## FIELDING_E FIELDING_DP
## 4.181673 1.342312
vif(model2)
## BATTING_H BATTING_2B BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## 3.269276 2.431685 4.099876 2.272356 4.637931 1.619950
## PITCHING_SO PITCHING_H FIELDING_E FIELDING_DP
## 1.746238 2.612103 3.888127 1.338094
vif(model3)
## BATTING_2B BATTING_3B BASERUN_SB BASERUN_CS FIELDING_E FIELDING_DP
## 1.123365 1.623667 1.432052 1.083042 1.461507 1.157426
From the VIF(Variance Inflation Factors) function above, we can see that initially, in model 1, BATTING_HR and PITCHING_BB has a significantly high value,meaning that they were significantly impacting the model due to the high colinear relationship, and BATTING_BB, BATTING_SO, PITCHING_H, PITCHING_BB and FEILDING_E had relative colinear relationship that were mildly impacting the model.
Moving to the second model, after eliminating variables that were less statistically significant, we had successfully lowered the colinearity between variables, with only a few variables remained noticable.
In model 3, all 6 variables were left in the acceptable range of colinearity, hence the model would not be significantly impacted.
# Make predictions for model 1
predictions1 <- model1 %>% predict(df1)
# Model performance
data.frame(
RMSE = RMSE(predictions1, df1$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions1,df1$TARGET_WINS,na.rm = TRUE)
)
## RMSE R2
## 1 13.03069 0.3153854
# Make predictions for model 2
predictions2 <- model2 %>% predict(df1)
# Model performance
data.frame(
RMSE = RMSE(predictions2, df1$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions2,df1$TARGET_WINS,na.rm = TRUE)
)
## RMSE R2
## 1 13.07977 0.3102187
# Make predictions for model 3
predictions3 <- model3 %>% predict(df1)
# Model performance
data.frame(
RMSE = RMSE(predictions3, df1$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions3,df1$TARGET_WINS,na.rm = TRUE)
)
## RMSE R2
## 1 14.15824 0.1917802
The RMSE across three models are similar, however, model 3 has a significantly lower R2 comparing with the previous models, considering that it only had 6 variables left with a lower R2 and a F-statistic, it is not advisable to use model 3 as our concluding model2.
Further exploring model 1 and model 2, considering they had similar R2 and RMSE, one deciding confounding factor that seperate them apart would be Model 1 contained variables that had high Variance Inflation Factors, meaning they post high impact on the model due to high colinearity, hence it would be most advisable to use model 2 as our concluding factor because of the high R2 and F-Statistic Value