library(knitr)
library(tidyverse)
library(reshape2)
library(VIM)
library(corrplot)
library(naniar)
library(skimr)
library(funModeling)
library(fastDummies)
In this homework assignment, we are asked to explore, analyze, and model a baseball data set containing approximately 2200 records. Each record represents a professional baseball team in a given year from 1871 to 2006 inclusive. Each record holds the team's performance for that year, with all statistics adjusted to match a 162-game season.
Our objective is to build a multiple linear regression model on the training data that predicts a team's number of wins from the variables given or from variables derived from them. Below is a short description of the variables of interest in the data set:
test <- read_csv("https://raw.githubusercontent.com/agersowitz/DATA-621/main/mb_eval.csv")
## Parsed with column specification:
## cols(
## INDEX = col_double(),
## TEAM_BATTING_H = col_double(),
## TEAM_BATTING_2B = col_double(),
## TEAM_BATTING_3B = col_double(),
## TEAM_BATTING_HR = col_double(),
## TEAM_BATTING_BB = col_double(),
## TEAM_BATTING_SO = col_double(),
## TEAM_BASERUN_SB = col_double(),
## TEAM_BASERUN_CS = col_double(),
## TEAM_BATTING_HBP = col_double(),
## TEAM_PITCHING_H = col_double(),
## TEAM_PITCHING_HR = col_double(),
## TEAM_PITCHING_BB = col_double(),
## TEAM_PITCHING_SO = col_double(),
## TEAM_FIELDING_E = col_double(),
## TEAM_FIELDING_DP = col_double()
## )
test <- data.frame(test)
yr <- read_csv("https://raw.githubusercontent.com/agersowitz/DATA-621/main/year%20predict.csv")
## Parsed with column specification:
## cols(
## Year = col_double(),
## H = col_double(),
## `2B` = col_double(),
## `3B` = col_double(),
## HR = col_double(),
## SB = col_double(),
## BB = col_double(),
## SO = col_double()
## )
yr <- data.frame(yr)
train <- read_csv("https://raw.githubusercontent.com/agersowitz/DATA-621/main/mb_train.csv")
## Parsed with column specification:
## cols(
## INDEX = col_double(),
## TARGET_WINS = col_double(),
## TEAM_BATTING_H = col_double(),
## TEAM_BATTING_2B = col_double(),
## TEAM_BATTING_3B = col_double(),
## TEAM_BATTING_HR = col_double(),
## TEAM_BATTING_BB = col_double(),
## TEAM_BATTING_SO = col_double(),
## TEAM_BASERUN_SB = col_double(),
## TEAM_BASERUN_CS = col_double(),
## TEAM_BATTING_HBP = col_double(),
## TEAM_PITCHING_H = col_double(),
## TEAM_PITCHING_HR = col_double(),
## TEAM_PITCHING_BB = col_double(),
## TEAM_PITCHING_SO = col_double(),
## TEAM_FIELDING_E = col_double(),
## TEAM_FIELDING_DP = col_double()
## )
train <- data.frame(train)
# Replace NA with the column median where appropriate
train <- train %>% replace_na(list(
  TEAM_BATTING_SO = median(train$TEAM_BATTING_SO, na.rm = TRUE),
  TEAM_BASERUN_SB = median(train$TEAM_BASERUN_SB, na.rm = TRUE),
  TEAM_BASERUN_CS = median(train$TEAM_BASERUN_CS, na.rm = TRUE),
  TEAM_PITCHING_SO = median(train$TEAM_PITCHING_SO, na.rm = TRUE),
  TEAM_FIELDING_DP = median(train$TEAM_FIELDING_DP, na.rm = TRUE)
))
# TARGET_WINS is the response variable and is not imputed
#Drop column with too many NA
train$TEAM_BATTING_HBP <- NULL
Baseball, perhaps more than any other sport, is defined by its eras. In the Dead-Ball era the game centered on “small-ball” plays such as stolen bases, bunts, and singles. The modern game is drastically different: it treats the three true outcomes (home runs, walks, strikeouts) as the most important counting statistics. Unfortunately, our data set does not indicate the year in which each record's statistics were recorded. This is particularly troubling because the data set spans more than 100 years of baseball, over which the game has evolved considerably.
This model’s approach is to predict the year in which each record's statistics were recorded, using per-game rates of key statistics (doubles, triples, walks, and strikeouts) taken from baseball-reference.com. We fit a linear model to these rates by year and then use it to predict the year for each record in our data set. The year-prediction model has an R-squared above 0.96, so it explains most of the variation in year. Its residual standard error of about 8.5 years suggests the predictions are better suited to placing a team in a broad era than to pinpointing a specific season.
Once we have a predicted year, we create dummy variables based on widely agreed-upon eras in the history of baseball. Finally, our model is built on the interaction of these eras with the counting statistics given to us. For example, TEAM_BATTING_H*era_Modern is 0 if the team was not predicted to have played in the modern era, and equals the team's number of hits if it was. This creates features such as “number of hits in the modern era” in place of a single “number of hits” feature, which lets us see which statistics mattered more in which eras.
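As a small illustration of the interaction features described above, the sketch below builds the product columns by hand on a made-up data frame and checks that R's formula interface gives the same fit with `hits:era`. The names `toy`, `wins`, `hits`, `manual`, and `by_formula`, and all of the values, are assumptions for illustration only and are not taken from the assignment data.

```r
# Toy data: values are invented purely for illustration
toy <- data.frame(
  wins = c(70, 85, 92, 60, 78, 88, 95, 66),
  hits = c(1400, 1500, 1550, 1300, 1450, 1520, 1600, 1350),
  era  = factor(c("DeadBall", "DeadBall", "DeadBall", "DeadBall",
                  "Modern", "Modern", "Modern", "Modern"))
)

# Manual interaction columns: hits if the team is in that era, otherwise 0
toy$hits_db <- toy$hits * (toy$era == "DeadBall")
toy$hits_m  <- toy$hits * (toy$era == "Modern")

# One shared intercept plus a separate hits slope per era
manual     <- lm(wins ~ hits_db + hits_m, data = toy)

# The same model written with an interaction term and no main effects
by_formula <- lm(wins ~ hits:era, data = toy)

# Compare fitted values from the two specifications
all.equal(unname(fitted(manual)), unname(fitted(by_formula)))
```

Writing the columns out by hand, as this report does, keeps each coefficient name explicit; the formula version is shorter but names its coefficients `hits:eraDeadBall` and so on.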
Reference Links:
Statistics by year: https://www.baseball-reference.com/leagues/MLB/bat.shtml
Three True Outcomes: https://www.mlb.com/glossary/idioms/three-true-outcomes#:~:text=The%20%22three%20true%20outcomes%22%20in,the%20pitcher%20or%20the%20catcher.
Eras of Baseball: https://thesportjournal.org/article/examining-perceptions-of-baseballs-eras/#:~:text=A%20common%20list%20presented%20at,%2D2005)%20(17).
##correlation_table(train, target = "TARGET_WINS")
#true<-lm(TARGET_WINS ~ hr_era+hr_era_p, data = train)
# Convert season totals to per-game rates, named to match the columns of yr
train$X2B=(train$TEAM_BATTING_2B/162)
train$X3B=(train$TEAM_BATTING_3B/162)
train$BB=((train$TEAM_BATTING_BB/162)+(train$TEAM_PITCHING_BB/162))/2
train$SO=((train$TEAM_BATTING_SO/162)+(train$TEAM_PITCHING_SO/162))/2
year<-lm(Year ~ X2B+X3B+BB+SO, data = yr)
summary(year)
##
## Call:
## lm(formula = Year ~ X2B + X3B + BB + SO, data = yr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.269 -4.950 -0.753 5.447 36.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1853.9638 7.3246 253.113 < 2e-16 ***
## X2B 37.8715 3.4045 11.124 < 2e-16 ***
## X3B -113.4110 8.5250 -13.303 < 2e-16 ***
## BB 9.9854 1.1284 8.849 2.79e-15 ***
## SO 10.5675 0.7793 13.561 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.526 on 145 degrees of freedom
## Multiple R-squared: 0.9625, Adjusted R-squared: 0.9615
## F-statistic: 930.9 on 4 and 145 DF, p-value: < 2.2e-16
#plot(year)
predicted_year<- predict(year, newdata = train)
train<-cbind(train,predicted_year)
# Assign an era from the predicted year. Years that fall between two ranges
# (e.g. 1993 to 1994) match no condition and default to "DeadBall".
train$era = ifelse(train$predicted_year>= 1994,"Modern",
ifelse(train$predicted_year> 1977 & train$predicted_year<1993, "FreeAgency",
ifelse(train$predicted_year> 1961 & train$predicted_year<1976, "Expansion",
ifelse(train$predicted_year> 1942 & train$predicted_year<1960, "Integration",
ifelse(train$predicted_year> 1920 & train$predicted_year<1941, "LiveBall",
"DeadBall")))))
train<-dummy_cols(train,select_columns=c("era"))
#skim(train)
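The nested `ifelse()` calls above leave small gaps between ranges; for example, a predicted year of 1993.5 matches no range and falls through to "DeadBall". A minimal sketch of one alternative uses `cut()` with contiguous breaks. The break points and the example years below are assumptions chosen to close the gaps. They are not the boundaries used to produce the results in this report, so the model output would change if they were adopted.

```r
# Hypothetical predicted years, including values that sit in the gaps
predicted_year <- c(1905.2, 1930.0, 1941.5, 1950.3, 1960.4, 1970.1,
                    1976.5, 1985.0, 1993.5, 2001.7)

# Contiguous, left-closed intervals: [-Inf, 1920), [1920, 1942), ...
# Break points are assumed for illustration.
era <- cut(predicted_year,
           breaks = c(-Inf, 1920, 1942, 1961, 1977, 1994, Inf),
           labels = c("DeadBall", "LiveBall", "Integration",
                      "Expansion", "FreeAgency", "Modern"),
           right = FALSE)

# Every year now lands in exactly one era
data.frame(predicted_year, era)
```

Because `cut()` returns a factor, the result could be passed to `dummy_cols()` in the same way as the character column built above.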
train$H_era_m <- (train$TEAM_BATTING_H)*train$era_Modern
train$H_era_fa <- (train$TEAM_BATTING_H)*train$era_FreeAgency
train$H_era_e <- (train$TEAM_BATTING_H)*train$era_Expansion
train$H_era_i <- (train$TEAM_BATTING_H)*train$era_Integration
train$H_era_lb <- (train$TEAM_BATTING_H)*train$era_LiveBall
train$H_era_db <- (train$TEAM_BATTING_H)*train$era_DeadBall
train$H_era_m_p <- (train$TEAM_PITCHING_H)*train$era_Modern
train$H_era_fa_p <- (train$TEAM_PITCHING_H)*train$era_FreeAgency
train$H_era_e_p <- (train$TEAM_PITCHING_H)*train$era_Expansion
train$H_era_i_p <- (train$TEAM_PITCHING_H)*train$era_Integration
train$H_era_lb_p <- (train$TEAM_PITCHING_H)*train$era_LiveBall
train$H_era_db_p <- (train$TEAM_PITCHING_H)*train$era_DeadBall
train$bb_era_m <- (train$TEAM_BATTING_BB)*train$era_Modern
train$bb_era_fa <- (train$TEAM_BATTING_BB)*train$era_FreeAgency
train$bb_era_e <- (train$TEAM_BATTING_BB)*train$era_Expansion
train$bb_era_i <- (train$TEAM_BATTING_BB)*train$era_Integration
train$bb_era_lb <- (train$TEAM_BATTING_BB)*train$era_LiveBall
train$bb_era_db <- (train$TEAM_BATTING_BB)*train$era_DeadBall
train$bb_era_m_p <- (train$TEAM_PITCHING_BB)*train$era_Modern
train$bb_era_fa_p <- (train$TEAM_PITCHING_BB)*train$era_FreeAgency
train$bb_era_e_p <- (train$TEAM_PITCHING_BB)*train$era_Expansion
train$bb_era_i_p <- (train$TEAM_PITCHING_BB)*train$era_Integration
train$bb_era_lb_p <- (train$TEAM_PITCHING_BB)*train$era_LiveBall
train$bb_era_db_p <- (train$TEAM_PITCHING_BB)*train$era_DeadBall
train$hr_era_m <- (train$TEAM_BATTING_HR)*train$era_Modern
train$hr_era_fa <- (train$TEAM_BATTING_HR)*train$era_FreeAgency
train$hr_era_e <- (train$TEAM_BATTING_HR)*train$era_Expansion
train$hr_era_i <- (train$TEAM_BATTING_HR)*train$era_Integration
train$hr_era_lb <- (train$TEAM_BATTING_HR)*train$era_LiveBall
train$hr_era_db <- (train$TEAM_BATTING_HR)*train$era_DeadBall
train$hr_era_m_p <- (train$TEAM_PITCHING_HR)*train$era_Modern
train$hr_era_fa_p <- (train$TEAM_PITCHING_HR)*train$era_FreeAgency
train$hr_era_e_p <- (train$TEAM_PITCHING_HR)*train$era_Expansion
train$hr_era_i_p <- (train$TEAM_PITCHING_HR)*train$era_Integration
train$hr_era_lb_p <- (train$TEAM_PITCHING_HR)*train$era_LiveBall
train$hr_era_db_p <- (train$TEAM_PITCHING_HR)*train$era_DeadBall
train$so_era_m <- (train$TEAM_BATTING_SO)*train$era_Modern
train$so_era_fa <- (train$TEAM_BATTING_SO)*train$era_FreeAgency
train$so_era_e <- (train$TEAM_BATTING_SO)*train$era_Expansion
train$so_era_i <- (train$TEAM_BATTING_SO)*train$era_Integration
train$so_era_lb <- (train$TEAM_BATTING_SO)*train$era_LiveBall
train$so_era_db <- (train$TEAM_BATTING_SO)*train$era_DeadBall
train$so_era_m_p <- (train$TEAM_PITCHING_SO)*train$era_Modern
train$so_era_fa_p <- (train$TEAM_PITCHING_SO)*train$era_FreeAgency
train$so_era_e_p <- (train$TEAM_PITCHING_SO)*train$era_Expansion
train$so_era_i_p <- (train$TEAM_PITCHING_SO)*train$era_Integration
train$so_era_lb_p <- (train$TEAM_PITCHING_SO)*train$era_LiveBall
train$so_era_db_p <- (train$TEAM_PITCHING_SO)*train$era_DeadBall
train$x2b_era_m <- (train$TEAM_BATTING_2B)*train$era_Modern
train$x2b_era_fa <- (train$TEAM_BATTING_2B)*train$era_FreeAgency
train$x2b_era_e <- (train$TEAM_BATTING_2B)*train$era_Expansion
train$x2b_era_i <- (train$TEAM_BATTING_2B)*train$era_Integration
train$x2b_era_lb <- (train$TEAM_BATTING_2B)*train$era_LiveBall
train$x2b_era_db <- (train$TEAM_BATTING_2B)*train$era_DeadBall
train$x3b_era_m <- (train$TEAM_BATTING_3B)*train$era_Modern
train$x3b_era_fa <- (train$TEAM_BATTING_3B)*train$era_FreeAgency
train$x3b_era_e <- (train$TEAM_BATTING_3B)*train$era_Expansion
train$x3b_era_i <- (train$TEAM_BATTING_3B)*train$era_Integration
train$x3b_era_lb <- (train$TEAM_BATTING_3B)*train$era_LiveBall
train$x3b_era_db <- (train$TEAM_BATTING_3B)*train$era_DeadBall
train$sb_era_m <- (train$TEAM_BASERUN_SB)*train$era_Modern
train$sb_era_fa <- (train$TEAM_BASERUN_SB)*train$era_FreeAgency
train$sb_era_e <- (train$TEAM_BASERUN_SB)*train$era_Expansion
train$sb_era_i <- (train$TEAM_BASERUN_SB)*train$era_Integration
train$sb_era_lb <- (train$TEAM_BASERUN_SB)*train$era_LiveBall
train$sb_era_db <- (train$TEAM_BASERUN_SB)*train$era_DeadBall
train$cs_era_m <- (train$TEAM_BASERUN_CS)*train$era_Modern
train$cs_era_fa <- (train$TEAM_BASERUN_CS)*train$era_FreeAgency
train$cs_era_e <- (train$TEAM_BASERUN_CS)*train$era_Expansion
train$cs_era_i <- (train$TEAM_BASERUN_CS)*train$era_Integration
train$cs_era_lb <- (train$TEAM_BASERUN_CS)*train$era_LiveBall
train$cs_era_db <- (train$TEAM_BASERUN_CS)*train$era_DeadBall
train$e_era_m <- (train$TEAM_FIELDING_E)*train$era_Modern
train$e_era_fa <- (train$TEAM_FIELDING_E)*train$era_FreeAgency
train$e_era_e <- (train$TEAM_FIELDING_E)*train$era_Expansion
train$e_era_i <- (train$TEAM_FIELDING_E)*train$era_Integration
train$e_era_lb <- (train$TEAM_FIELDING_E)*train$era_LiveBall
train$e_era_db <- (train$TEAM_FIELDING_E)*train$era_DeadBall
train$dp_era_m <- (train$TEAM_FIELDING_DP)*train$era_Modern
train$dp_era_fa <- (train$TEAM_FIELDING_DP)*train$era_FreeAgency
train$dp_era_e <- (train$TEAM_FIELDING_DP)*train$era_Expansion
train$dp_era_i <- (train$TEAM_FIELDING_DP)*train$era_Integration
train$dp_era_lb <- (train$TEAM_FIELDING_DP)*train$era_LiveBall
train$dp_era_db <- (train$TEAM_FIELDING_DP)*train$era_DeadBall
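The interaction columns above all follow one pattern, so they could also be generated in a loop. The sketch below does this on a small made-up data frame. The helper name `add_era_interactions` and the toy values are hypothetical, and the generated column names (`<stat>_<era column>`) differ from the hand-written names used in this report.

```r
# Multiply every statistic by every era dummy and store the product
add_era_interactions <- function(df, stats, era_cols) {
  for (s in stats) {
    for (e in era_cols) {
      # e.g. TEAM_BATTING_H * era_Modern -> "TEAM_BATTING_H_era_Modern"
      df[[paste(s, e, sep = "_")]] <- df[[s]] * df[[e]]
    }
  }
  df
}

# Toy data: values are invented purely for illustration
toy <- data.frame(
  TEAM_BATTING_H  = c(1400, 1500, 1600),
  TEAM_BATTING_HR = c(20, 150, 200),
  era_DeadBall    = c(1, 0, 0),
  era_Modern      = c(0, 1, 1)
)

toy <- add_era_interactions(toy,
                            stats    = c("TEAM_BATTING_H", "TEAM_BATTING_HR"),
                            era_cols = c("era_DeadBall", "era_Modern"))
names(toy)
```

Generating the columns this way avoids copy-and-paste slips, at the cost of longer column names in the model formula.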
era<-lm((TARGET_WINS) ~ H_era_m+H_era_fa+H_era_e+H_era_i+H_era_lb+H_era_db+
H_era_m_p+H_era_fa_p+H_era_e_p+H_era_i_p+H_era_lb_p+H_era_db_p+
bb_era_m+bb_era_fa+bb_era_e+bb_era_i+bb_era_lb+bb_era_db+
bb_era_m_p+bb_era_fa_p+bb_era_e_p+bb_era_i_p+bb_era_lb_p+bb_era_db_p+
hr_era_m+hr_era_fa+hr_era_e+hr_era_i+hr_era_lb+hr_era_db+
hr_era_m_p+hr_era_fa_p+hr_era_e_p+hr_era_i_p+hr_era_lb_p+hr_era_db_p+
so_era_m+so_era_fa+so_era_e+so_era_i+so_era_lb+so_era_db+
so_era_m_p+so_era_fa_p+so_era_e_p+so_era_i_p+so_era_lb_p+so_era_db_p+
x2b_era_m+x2b_era_fa+x2b_era_e+x2b_era_i+x2b_era_lb+x2b_era_db+
x3b_era_m+x3b_era_fa+x3b_era_e+x3b_era_i+x3b_era_lb+x3b_era_db+
e_era_m+e_era_fa+e_era_e+e_era_i+e_era_lb+e_era_db+
dp_era_m+dp_era_fa+dp_era_e+dp_era_i+dp_era_lb+dp_era_db+
sb_era_m+sb_era_fa+sb_era_e+sb_era_i+sb_era_lb+sb_era_db
, data = train)
summary(era)
##
## Call:
## lm(formula = (TARGET_WINS) ~ H_era_m + H_era_fa + H_era_e + H_era_i +
## H_era_lb + H_era_db + H_era_m_p + H_era_fa_p + H_era_e_p +
## H_era_i_p + H_era_lb_p + H_era_db_p + bb_era_m + bb_era_fa +
## bb_era_e + bb_era_i + bb_era_lb + bb_era_db + bb_era_m_p +
## bb_era_fa_p + bb_era_e_p + bb_era_i_p + bb_era_lb_p + bb_era_db_p +
## hr_era_m + hr_era_fa + hr_era_e + hr_era_i + hr_era_lb +
## hr_era_db + hr_era_m_p + hr_era_fa_p + hr_era_e_p + hr_era_i_p +
## hr_era_lb_p + hr_era_db_p + so_era_m + so_era_fa + so_era_e +
## so_era_i + so_era_lb + so_era_db + so_era_m_p + so_era_fa_p +
## so_era_e_p + so_era_i_p + so_era_lb_p + so_era_db_p + x2b_era_m +
## x2b_era_fa + x2b_era_e + x2b_era_i + x2b_era_lb + x2b_era_db +
## x3b_era_m + x3b_era_fa + x3b_era_e + x3b_era_i + x3b_era_lb +
## x3b_era_db + e_era_m + e_era_fa + e_era_e + e_era_i + e_era_lb +
## e_era_db + dp_era_m + dp_era_fa + dp_era_e + dp_era_i + dp_era_lb +
## dp_era_db + sb_era_m + sb_era_fa + sb_era_e + sb_era_i +
## sb_era_lb + sb_era_db, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.346 -7.716 0.025 7.535 55.109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.5149454 6.3583950 4.170 3.16e-05 ***
## H_era_m 0.0249513 0.0120527 2.070 0.038552 *
## H_era_fa 0.0067291 0.0293094 0.230 0.818433
## H_era_e -0.0205919 0.0247588 -0.832 0.405668
## H_era_i 0.0388812 0.0084105 4.623 4.00e-06 ***
## H_era_lb 0.0208088 0.0076115 2.734 0.006310 **
## H_era_db 0.0521409 0.0047745 10.921 < 2e-16 ***
## H_era_m_p 0.0038292 0.0045766 0.837 0.402855
## H_era_fa_p 0.0437280 0.0249128 1.755 0.079357 .
## H_era_e_p 0.0573490 0.0217278 2.639 0.008363 **
## H_era_i_p -0.0003111 0.0011637 -0.267 0.789204
## H_era_lb_p 0.0063022 0.0022585 2.791 0.005308 **
## H_era_db_p -0.0009699 0.0005150 -1.883 0.059817 .
## bb_era_m 0.0391196 0.0147350 2.655 0.007991 **
## bb_era_fa 0.1571089 0.0752044 2.089 0.036814 *
## bb_era_e 0.4843324 0.1324373 3.657 0.000261 ***
## bb_era_i 0.0038739 0.0235637 0.164 0.869432
## bb_era_lb -0.0276587 0.0265984 -1.040 0.298518
## bb_era_db -0.0130167 0.0117329 -1.109 0.267369
## bb_era_m_p 0.0083550 0.0084392 0.990 0.322272
## bb_era_fa_p -0.1143946 0.0716431 -1.597 0.110470
## bb_era_e_p -0.4422586 0.1287088 -3.436 0.000601 ***
## bb_era_i_p -0.0001063 0.0221287 -0.005 0.996168
## bb_era_lb_p 0.0394072 0.0214444 1.838 0.066249 .
## bb_era_db_p -0.0122104 0.0080862 -1.510 0.131182
## hr_era_m 0.1449235 0.0442409 3.276 0.001070 **
## hr_era_fa -0.3311064 0.1668553 -1.984 0.047336 *
## hr_era_e -0.5880747 0.5072753 -1.159 0.246468
## hr_era_i -0.0284609 0.2527823 -0.113 0.910365
## hr_era_lb 0.3610345 0.1587916 2.274 0.023084 *
## hr_era_db 0.0325140 0.0606093 0.536 0.591700
## hr_era_m_p -0.0744317 0.0363347 -2.049 0.040629 *
## hr_era_fa_p 0.3691369 0.1595200 2.314 0.020757 *
## hr_era_e_p 0.6626919 0.4950401 1.339 0.180819
## hr_era_i_p 0.0916041 0.2415589 0.379 0.704561
## hr_era_lb_p -0.2160877 0.1439625 -1.501 0.133499
## hr_era_db_p 0.0427567 0.0495391 0.863 0.388182
## so_era_m -0.0191559 0.0078753 -2.432 0.015079 *
## so_era_fa 0.0299327 0.0283780 1.055 0.291640
## so_era_e -0.1383335 0.0563040 -2.457 0.014091 *
## so_era_i -0.0306074 0.0140997 -2.171 0.030054 *
## so_era_lb -0.0835988 0.0171974 -4.861 1.25e-06 ***
## so_era_db 0.0058190 0.0083171 0.700 0.484225
## so_era_m_p -0.0033599 0.0039780 -0.845 0.398408
## so_era_fa_p -0.0571447 0.0265302 -2.154 0.031352 *
## so_era_e_p 0.1313924 0.0548716 2.395 0.016725 *
## so_era_i_p 0.0254026 0.0130496 1.947 0.051708 .
## so_era_lb_p 0.0856253 0.0148685 5.759 9.66e-09 ***
## so_era_db_p -0.0027319 0.0060722 -0.450 0.652822
## x2b_era_m 0.0440077 0.0305506 1.440 0.149873
## x2b_era_fa -0.0668614 0.0417534 -1.601 0.109445
## x2b_era_e -0.0262660 0.0408917 -0.642 0.520726
## x2b_era_i -0.0306346 0.0346856 -0.883 0.377222
## x2b_era_lb 0.0327525 0.0325232 1.007 0.314022
## x2b_era_db 0.0072839 0.0194255 0.375 0.707721
## x3b_era_m 0.0676848 0.0794700 0.852 0.394472
## x3b_era_fa -0.0304846 0.1048836 -0.291 0.771345
## x3b_era_e 0.1594218 0.0949959 1.678 0.093451 .
## x3b_era_i 0.2696816 0.0841463 3.205 0.001370 **
## x3b_era_lb 0.2008185 0.0724950 2.770 0.005651 **
## x3b_era_db 0.0770310 0.0265484 2.902 0.003750 **
## e_era_m -0.0204063 0.0119897 -1.702 0.088900 .
## e_era_fa 0.0260124 0.0175436 1.483 0.138290
## e_era_e -0.1116542 0.0222972 -5.008 5.95e-07 ***
## e_era_i -0.0451841 0.0128702 -3.511 0.000456 ***
## e_era_lb -0.0905786 0.0094137 -9.622 < 2e-16 ***
## e_era_db -0.0234638 0.0037249 -6.299 3.60e-10 ***
## dp_era_m -0.1112532 0.0369904 -3.008 0.002663 **
## dp_era_fa -0.0806575 0.0390467 -2.066 0.038977 *
## dp_era_e -0.0746975 0.0351085 -2.128 0.033480 *
## dp_era_i -0.0763914 0.0295604 -2.584 0.009823 **
## dp_era_lb -0.1341195 0.0286437 -4.682 3.01e-06 ***
## dp_era_db -0.1420045 0.0247892 -5.728 1.15e-08 ***
## sb_era_m 0.0182889 0.0190365 0.961 0.336795
## sb_era_fa 0.0549168 0.0181246 3.030 0.002474 **
## sb_era_e 0.0314596 0.0158083 1.990 0.046707 *
## sb_era_i 0.0622565 0.0132599 4.695 2.83e-06 ***
## sb_era_lb 0.0911188 0.0105364 8.648 < 2e-16 ***
## sb_era_db 0.0275920 0.0067918 4.063 5.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.3 on 2197 degrees of freedom
## Multiple R-squared: 0.4115, Adjusted R-squared: 0.3906
## F-statistic: 19.7 on 78 and 2197 DF, p-value: < 2.2e-16
plot(era)
hist(era$residuals)
This model shows some promise, with an R-squared of 0.4115 (adjusted R-squared 0.3906). However, it needs cleaning and more advanced feature selection: many of its 78 predictors are not statistically significant. The residuals seem close to normally distributed, but the Q-Q plot has heavy tails, which appear to be driven by high-leverage outliers. While this model takes advantage of publicly available data to achieve more accurate results, it still needs refining. In the next model we will take some of the elements from model 2 and apply them to a more thorough feature selection, and deal with other outstanding issues such as collinearity.
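One common way to quantify the collinearity mentioned above is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all of the other predictors. The sketch below computes it with base R on simulated data. The helper `vif_manual` and the simulated variables `x1`, `x2`, and `x3` are assumptions for illustration and are not part of the models fitted in this report.

```r
# VIF for each column of a data frame of numeric predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    # Regress predictor v on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                  # generated independently of x1 and x2

vif <- vif_manual(data.frame(x1, x2, x3))
round(vif, 1)
```

A VIF near 1 indicates little collinearity, while large values (a frequently cited rule of thumb is above 5 or 10) flag predictors whose coefficients are poorly determined.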