Libraries

library(knitr)
library(tidyverse)
library(reshape2)
library(VIM)
library(corrplot)
library(naniar)
library(skimr)
library(funModeling)
library(fastDummies)

Data

In this homework assignment, we are asked to explore, analyze and model a baseball data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Our objective is to build a multiple linear regression model on the training data to predict each team's number of wins from the variables provided or from variables derived from them. Below is a short description of the variables of interest in the data set:

Load Data

test <- read_csv("https://raw.githubusercontent.com/agersowitz/DATA-621/main/mb_eval.csv")

## Parsed with column specification:
## cols(
##   INDEX = col_double(),
##   TEAM_BATTING_H = col_double(),
##   TEAM_BATTING_2B = col_double(),
##   TEAM_BATTING_3B = col_double(),
##   TEAM_BATTING_HR = col_double(),
##   TEAM_BATTING_BB = col_double(),
##   TEAM_BATTING_SO = col_double(),
##   TEAM_BASERUN_SB = col_double(),
##   TEAM_BASERUN_CS = col_double(),
##   TEAM_BATTING_HBP = col_double(),
##   TEAM_PITCHING_H = col_double(),
##   TEAM_PITCHING_HR = col_double(),
##   TEAM_PITCHING_BB = col_double(),
##   TEAM_PITCHING_SO = col_double(),
##   TEAM_FIELDING_E = col_double(),
##   TEAM_FIELDING_DP = col_double()
## )

test <- data.frame(test)

yr <- read_csv("https://raw.githubusercontent.com/agersowitz/DATA-621/main/year%20predict.csv")

## Parsed with column specification:
## cols(
##   Year = col_double(),
##   H = col_double(),
##   `2B` = col_double(),
##   `3B` = col_double(),
##   HR = col_double(),
##   SB = col_double(),
##   BB = col_double(),
##   SO = col_double()
## )

yr <- data.frame(yr)

train <- read_csv("https://raw.githubusercontent.com/agersowitz/DATA-621/main/mb_train.csv")

## Parsed with column specification:
## cols(
##   INDEX = col_double(),
##   TARGET_WINS = col_double(),
##   TEAM_BATTING_H = col_double(),
##   TEAM_BATTING_2B = col_double(),
##   TEAM_BATTING_3B = col_double(),
##   TEAM_BATTING_HR = col_double(),
##   TEAM_BATTING_BB = col_double(),
##   TEAM_BATTING_SO = col_double(),
##   TEAM_BASERUN_SB = col_double(),
##   TEAM_BASERUN_CS = col_double(),
##   TEAM_BATTING_HBP = col_double(),
##   TEAM_PITCHING_H = col_double(),
##   TEAM_PITCHING_HR = col_double(),
##   TEAM_PITCHING_BB = col_double(),
##   TEAM_PITCHING_SO = col_double(),
##   TEAM_FIELDING_E = col_double(),
##   TEAM_FIELDING_DP = col_double()
## )

train <- data.frame(train)

skim(train)

Data summary
Name	train
Number of rows	2276
Number of columns	17
_______________________
Column type frequency:
numeric	17
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
INDEX	0	1.00	1268.46	736.35	1	630.75	1270.5	1915.50	2535	▇▇▇▇▇
TARGET_WINS	0	1.00	80.79	15.75	0	71.00	82.0	92.00	146	▁▁▇▅▁
TEAM_BATTING_H	0	1.00	1469.27	144.59	891	1383.00	1454.0	1537.25	2554	▁▇▂▁▁
TEAM_BATTING_2B	0	1.00	241.25	46.80	69	208.00	238.0	273.00	458	▁▆▇▂▁
TEAM_BATTING_3B	0	1.00	55.25	27.94	0	34.00	47.0	72.00	223	▇▇▂▁▁
TEAM_BATTING_HR	0	1.00	99.61	60.55	0	42.00	102.0	147.00	264	▇▆▇▅▁
TEAM_BATTING_BB	0	1.00	501.56	122.67	0	451.00	512.0	580.00	878	▁▁▇▇▁
TEAM_BATTING_SO	102	0.96	735.61	248.53	0	548.00	750.0	930.00	1399	▁▆▇▇▁
TEAM_BASERUN_SB	131	0.94	124.76	87.79	0	66.00	101.0	156.00	697	▇▃▁▁▁
TEAM_BASERUN_CS	772	0.66	52.80	22.96	0	38.00	49.0	62.00	201	▃▇▁▁▁
TEAM_BATTING_HBP	2085	0.08	59.36	12.97	29	50.50	58.0	67.00	95	▂▇▇▅▁
TEAM_PITCHING_H	0	1.00	1779.21	1406.84	1137	1419.00	1518.0	1682.50	30132	▇▁▁▁▁
TEAM_PITCHING_HR	0	1.00	105.70	61.30	0	50.00	107.0	150.00	343	▇▇▆▁▁
TEAM_PITCHING_BB	0	1.00	553.01	166.36	0	476.00	536.5	611.00	3645	▇▁▁▁▁
TEAM_PITCHING_SO	102	0.96	817.73	553.09	0	615.00	813.5	968.00	19278	▇▁▁▁▁
TEAM_FIELDING_E	0	1.00	246.48	227.77	65	127.00	159.0	249.25	1898	▇▁▁▁▁
TEAM_FIELDING_DP	286	0.87	146.39	26.23	52	131.00	149.0	164.00	228	▁▂▇▆▁

As the first step in data exploration, I use the skim function from the skimr package. It reports the missing-value count, mean, percentiles, and a histogram of the distribution for every data field in a single output.

We can see that several fields are missing data. We will replace the missing values with the median of each field, except for TEAM_BATTING_HBP, which is dropped because about 92% of its values are missing.
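The median replacement described above can be sketched on a small made-up data frame. The `toy` data and the `impute_median` helper are illustrative assumptions; they are not part of the assignment data or of any package.

```r
# Hypothetical helper: replace NA in the named columns with that column's median.
impute_median <- function(df, cols) {
  for (col in cols) {
    col_median <- median(df[[col]], na.rm = TRUE)  # median of the observed values only
    df[[col]][is.na(df[[col]])] <- col_median
  }
  df
}

# Made-up data for illustration only
toy <- data.frame(SB = c(10, NA, 30, 50), CS = c(NA, 4, 6, 8))
toy_filled <- impute_median(toy, c("SB", "CS"))
toy_filled
```

Passing the column names explicitly keeps a mostly empty column such as TEAM_BATTING_HBP out of the imputation.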

str(train)

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : num  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : num  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : num  1445 1339 1377 1387 1297 ...
##  $ TEAM_BATTING_2B : num  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : num  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : num  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : num  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : num  842 1075 917 922 920 ...
##  $ TEAM_BASERUN_SB : num  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : num  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : num  9364 1347 1377 1396 1297 ...
##  $ TEAM_PITCHING_HR: num  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: num  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: num  5456 1082 917 928 920 ...
##  $ TEAM_FIELDING_E : num  1011 193 175 164 138 ...
##  $ TEAM_FIELDING_DP: num  NA 155 153 156 168 149 186 136 169 159 ...

All the variables are numeric, and TEAM_BATTING_HBP has a lot of missing values. Let’s look at the summary of the data.

summary(train)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

g = melt(train)
ggplot(g, aes(x= value)) + 
   geom_density(fill='blue') + 
   facet_wrap(~variable, scales = 'free') +
   theme_light()

Count the rows with no missing values (complete cases)

sum(complete.cases(train))

## [1] 191

Express the complete rows as a percentage of all rows

sum(complete.cases(train))/(nrow(train)) *100

## [1] 8.391916
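Because `complete.cases` flags rows without any missing value, the share of rows with at least one missing value is the complement of the figure above. A minimal sketch on made-up data (the `toy` data frame is an assumption for illustration) shows both numbers along with the per-column missing percentage:

```r
# Made-up data for illustration only
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = c(1, 2, 3, 4))

complete_pct   <- mean(complete.cases(toy)) * 100  # rows with no NA at all
incomplete_pct <- 100 - complete_pct               # rows with at least one NA
missing_pct_by_col <- colMeans(is.na(toy)) * 100   # share of NA within each column

complete_pct
incomplete_pct
missing_pct_by_col
```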

missing_plot <- aggr(train, col=c('blue','red'),numbers=TRUE, sortVars=TRUE,labels=names(train), cex.axis=.7,gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable      Count
##  TEAM_BATTING_HBP 0.91608084
##   TEAM_BASERUN_CS 0.33919156
##  TEAM_FIELDING_DP 0.12565905
##   TEAM_BASERUN_SB 0.05755712
##   TEAM_BATTING_SO 0.04481547
##  TEAM_PITCHING_SO 0.04481547
##             INDEX 0.00000000
##       TARGET_WINS 0.00000000
##    TEAM_BATTING_H 0.00000000
##   TEAM_BATTING_2B 0.00000000
##   TEAM_BATTING_3B 0.00000000
##   TEAM_BATTING_HR 0.00000000
##   TEAM_BATTING_BB 0.00000000
##   TEAM_PITCHING_H 0.00000000
##  TEAM_PITCHING_HR 0.00000000
##  TEAM_PITCHING_BB 0.00000000
##   TEAM_FIELDING_E 0.00000000

Six of the variables have missing values.

Outliers

ggplot(stack(train[,-1]), aes(x = ind, y = values, fill=ind)) + 
  geom_boxplot(outlier.colour = "red",  outlier.alpha=.3) +
  coord_cartesian(ylim = c(0, 1000)) +
  theme_light()+
  theme(axis.text.x=element_text(angle=45, hjust=1))

Correlations

train %>% 
  cor(., use = "complete.obs") %>%
  corrplot(., method = "color", type = "upper", tl.col = "black", tl.cex=.8, diag = FALSE)

cleanup <- function(df, outlier_mult = 1.5) {  
  #outlier_mult is the multiple of the IQR that a value must lie away from the median to be considered an outlier
  
  #Change NA to median where appropriate
  df<- df %>% replace_na(list(TEAM_BATTING_SO = median(df$TEAM_BATTING_SO[(is.na(df$TEAM_BATTING_SO) == FALSE)]),
                               TEAM_BASERUN_SB = median(df$TEAM_BASERUN_SB[(is.na(df$TEAM_BASERUN_SB) == FALSE)]),
                               TEAM_BASERUN_CS = median(df$TEAM_BASERUN_CS[(is.na(df$TEAM_BASERUN_CS) == FALSE)]),
                               TEAM_PITCHING_SO = median(df$TEAM_PITCHING_SO[(is.na(df$TEAM_PITCHING_SO) == FALSE)]),
                               TEAM_FIELDING_DP = median(df$TEAM_FIELDING_DP[(is.na(df$TEAM_FIELDING_DP) == FALSE)])
                               ))
  
  #Drop column with too many NA
  df$TEAM_BATTING_HBP  <- NULL
  
  #Replace outliers with the column median
  i = 1 #Skip the index
  k = 0 #Count of outliers changed
  while (i < length(colnames(df))) {   #Column cycle
    
    i = i+1
    iqr= IQR(df[,i])
    med = median(df[,i])
    max_range = c(med - iqr * outlier_mult, med + iqr * outlier_mult)  #Defines the maximum range, outside of which values are considered outliers
      
    j = 0
    while (j < length(df[,i])) { #Row cycle
      j = j + 1
      
      if (df[j,i] < max_range[1] || df[j,i] > max_range[2] ) {
        
        df[j,i] <- med #Sets outliers to median column value 
        k = k + 1
      }
    }
  }
  
  print(paste("set",k, "outliers to median"))
  return(df)
}
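The row-by-row loop in `cleanup` can also be written with vectorized comparisons. The following is a sketch of the same rule (values farther than `outlier_mult` times the IQR from the median are set to the median); `replace_outliers` and the `toy` data are hypothetical names used only for illustration.

```r
# Hypothetical helper: set values outside median +/- outlier_mult * IQR to the median.
replace_outliers <- function(x, outlier_mult = 1.5) {
  med <- median(x, na.rm = TRUE)
  half_width <- IQR(x, na.rm = TRUE) * outlier_mult
  is_outlier <- !is.na(x) & abs(x - med) > half_width
  x[is_outlier] <- med
  x
}

# Made-up data; the first column plays the role of INDEX and is skipped
toy <- data.frame(INDEX = 1:5, x = c(1, 2, 3, 4, 100))
toy[-1] <- lapply(toy[-1], replace_outliers)
toy
```

Applying the helper with `lapply` over every column except the first mirrors the "skip the index" step of the loop version.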

#cleanup(train)

#Change NA to median where appropriate
train <- train %>% replace_na(list(TEAM_BATTING_SO = median(train$TEAM_BATTING_SO, na.rm = TRUE),
                                   TEAM_BASERUN_SB = median(train$TEAM_BASERUN_SB, na.rm = TRUE),
                                   TEAM_BASERUN_CS = median(train$TEAM_BASERUN_CS, na.rm = TRUE),
                                   TEAM_PITCHING_SO = median(train$TEAM_PITCHING_SO, na.rm = TRUE),
                                   TEAM_FIELDING_DP = median(train$TEAM_FIELDING_DP, na.rm = TRUE)
                                   ))

#Drop column with too many NA
train$TEAM_BATTING_HBP <- NULL

Build Models

INTERACTION TERMS BY ERA

##correlation_table(train, target = "TARGET_WINS")

#true<-lm(TARGET_WINS ~ hr_era+hr_era_p, data = train)

train$X2B=(train$TEAM_BATTING_2B/162)
train$X3B=(train$TEAM_BATTING_3B/162)
train$BB=((train$TEAM_BATTING_BB/162)+(train$TEAM_PITCHING_BB/162))/2
train$SO=((train$TEAM_BATTING_SO/162)+(train$TEAM_PITCHING_SO/162))/2

year<-lm(Year ~ X2B+X3B+BB+SO, data = yr)

#summary(year)
#plot(year)

predicted_year<- predict(year, newdata = train)

train<-cbind(train,predicted_year)



#Assign an era from the predicted year. Predicted years that fall in the gaps
#between the ranges (for example 1993 to 1994) end up in the final "DeadBall" branch.
train$era = ifelse(train$predicted_year>= 1994,"Modern",
                   ifelse(train$predicted_year> 1977 & train$predicted_year<1993, "FreeAgency",
                   ifelse(train$predicted_year> 1961 & train$predicted_year<1976, "Expansion",
                   ifelse(train$predicted_year> 1942 & train$predicted_year<1960, "Integration",
                    ifelse(train$predicted_year> 1920 & train$predicted_year<1941, "LiveBall",
                          "DeadBall")))))
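If contiguous eras are wanted, so that no predicted year falls through to the default label, `cut` is one option. This is only a sketch: the break points below (1920, 1942, 1961, 1977, 1994) are an assumption loosely based on the ranges above, and `assign_era` is a hypothetical helper. Using it would change the era assignments and therefore the fitted model.

```r
# Assumed break points; each era covers [lower, upper) because right = FALSE
era_breaks <- c(-Inf, 1920, 1942, 1961, 1977, 1994, Inf)
era_labels <- c("DeadBall", "LiveBall", "Integration",
                "Expansion", "FreeAgency", "Modern")

# Hypothetical helper: map a (possibly fractional) predicted year to an era label
assign_era <- function(year) {
  as.character(cut(year, breaks = era_breaks, labels = era_labels, right = FALSE))
}

example_eras <- assign_era(c(1900, 1930, 1950, 1970, 1985, 1993.5, 2000))
example_eras
```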

train<-dummy_cols(train,select_columns=c("era"))

#skim(train)


train$H_era_m <- (train$TEAM_BATTING_H)*train$era_Modern
train$H_era_fa <- (train$TEAM_BATTING_H)*train$era_FreeAgency
train$H_era_e <- (train$TEAM_BATTING_H)*train$era_Expansion
train$H_era_i <- (train$TEAM_BATTING_H)*train$era_Integration
train$H_era_lb <- (train$TEAM_BATTING_H)*train$era_LiveBall
train$H_era_db <- (train$TEAM_BATTING_H)*train$era_DeadBall

train$H_era_m_p <- (train$TEAM_PITCHING_H)*train$era_Modern
train$H_era_fa_p <- (train$TEAM_PITCHING_H)*train$era_FreeAgency
train$H_era_e_p <- (train$TEAM_PITCHING_H)*train$era_Expansion
train$H_era_i_p <- (train$TEAM_PITCHING_H)*train$era_Integration
train$H_era_lb_p <- (train$TEAM_PITCHING_H)*train$era_LiveBall
train$H_era_db_p <- (train$TEAM_PITCHING_H)*train$era_DeadBall

train$bb_era_m <- (train$TEAM_BATTING_BB)*train$era_Modern
train$bb_era_fa <- (train$TEAM_BATTING_BB)*train$era_FreeAgency
train$bb_era_e <- (train$TEAM_BATTING_BB)*train$era_Expansion
train$bb_era_i <- (train$TEAM_BATTING_BB)*train$era_Integration
train$bb_era_lb <- (train$TEAM_BATTING_BB)*train$era_LiveBall
train$bb_era_db <- (train$TEAM_BATTING_BB)*train$era_DeadBall

train$bb_era_m_p <- (train$TEAM_PITCHING_BB)*train$era_Modern
train$bb_era_fa_p <- (train$TEAM_PITCHING_BB)*train$era_FreeAgency
train$bb_era_e_p <- (train$TEAM_PITCHING_BB)*train$era_Expansion
train$bb_era_i_p <- (train$TEAM_PITCHING_BB)*train$era_Integration
train$bb_era_lb_p <- (train$TEAM_PITCHING_BB)*train$era_LiveBall
train$bb_era_db_p <- (train$TEAM_PITCHING_BB)*train$era_DeadBall

train$hr_era_m <- (train$TEAM_BATTING_HR)*train$era_Modern
train$hr_era_fa <- (train$TEAM_BATTING_HR)*train$era_FreeAgency
train$hr_era_e <- (train$TEAM_BATTING_HR)*train$era_Expansion
train$hr_era_i <- (train$TEAM_BATTING_HR)*train$era_Integration
train$hr_era_lb <- (train$TEAM_BATTING_HR)*train$era_LiveBall
train$hr_era_db <- (train$TEAM_BATTING_HR)*train$era_DeadBall

train$hr_era_m_p <- (train$TEAM_PITCHING_HR)*train$era_Modern
train$hr_era_fa_p <- (train$TEAM_PITCHING_HR)*train$era_FreeAgency
train$hr_era_e_p <- (train$TEAM_PITCHING_HR)*train$era_Expansion
train$hr_era_i_p <- (train$TEAM_PITCHING_HR)*train$era_Integration
train$hr_era_lb_p <- (train$TEAM_PITCHING_HR)*train$era_LiveBall
train$hr_era_db_p <- (train$TEAM_PITCHING_HR)*train$era_DeadBall

train$so_era_m <- (train$TEAM_BATTING_SO)*train$era_Modern
train$so_era_fa <- (train$TEAM_BATTING_SO)*train$era_FreeAgency
train$so_era_e <- (train$TEAM_BATTING_SO)*train$era_Expansion
train$so_era_i <- (train$TEAM_BATTING_SO)*train$era_Integration
train$so_era_lb <- (train$TEAM_BATTING_SO)*train$era_LiveBall
train$so_era_db <- (train$TEAM_BATTING_SO)*train$era_DeadBall

train$so_era_m_p <- (train$TEAM_PITCHING_SO)*train$era_Modern
train$so_era_fa_p <- (train$TEAM_PITCHING_SO)*train$era_FreeAgency
train$so_era_e_p <- (train$TEAM_PITCHING_SO)*train$era_Expansion
train$so_era_i_p <- (train$TEAM_PITCHING_SO)*train$era_Integration
train$so_era_lb_p <- (train$TEAM_PITCHING_SO)*train$era_LiveBall
train$so_era_db_p <- (train$TEAM_PITCHING_SO)*train$era_DeadBall

train$x2b_era_m <- (train$TEAM_BATTING_2B)*train$era_Modern
train$x2b_era_fa <- (train$TEAM_BATTING_2B)*train$era_FreeAgency
train$x2b_era_e <- (train$TEAM_BATTING_2B)*train$era_Expansion
train$x2b_era_i <- (train$TEAM_BATTING_2B)*train$era_Integration
train$x2b_era_lb <- (train$TEAM_BATTING_2B)*train$era_LiveBall
train$x2b_era_db <- (train$TEAM_BATTING_2B)*train$era_DeadBall

train$x3b_era_m <- (train$TEAM_BATTING_3B)*train$era_Modern
train$x3b_era_fa <- (train$TEAM_BATTING_3B)*train$era_FreeAgency
train$x3b_era_e <- (train$TEAM_BATTING_3B)*train$era_Expansion
train$x3b_era_i <- (train$TEAM_BATTING_3B)*train$era_Integration
train$x3b_era_lb <- (train$TEAM_BATTING_3B)*train$era_LiveBall
train$x3b_era_db <- (train$TEAM_BATTING_3B)*train$era_DeadBall

train$sb_era_m <- (train$TEAM_BASERUN_SB)*train$era_Modern
train$sb_era_fa <- (train$TEAM_BASERUN_SB)*train$era_FreeAgency
train$sb_era_e <- (train$TEAM_BASERUN_SB)*train$era_Expansion
train$sb_era_i <- (train$TEAM_BASERUN_SB)*train$era_Integration
train$sb_era_lb <- (train$TEAM_BASERUN_SB)*train$era_LiveBall
train$sb_era_db <- (train$TEAM_BASERUN_SB)*train$era_DeadBall

train$cs_era_m <- (train$TEAM_BASERUN_CS)*train$era_Modern
train$cs_era_fa <- (train$TEAM_BASERUN_CS)*train$era_FreeAgency
train$cs_era_e <- (train$TEAM_BASERUN_CS)*train$era_Expansion
train$cs_era_i <- (train$TEAM_BASERUN_CS)*train$era_Integration
train$cs_era_lb <- (train$TEAM_BASERUN_CS)*train$era_LiveBall
train$cs_era_db <- (train$TEAM_BASERUN_CS)*train$era_DeadBall


train$e_era_m <- (train$TEAM_FIELDING_E)*train$era_Modern
train$e_era_fa <- (train$TEAM_FIELDING_E)*train$era_FreeAgency
train$e_era_e <- (train$TEAM_FIELDING_E)*train$era_Expansion
train$e_era_i <- (train$TEAM_FIELDING_E)*train$era_Integration
train$e_era_lb <- (train$TEAM_FIELDING_E)*train$era_LiveBall
train$e_era_db <- (train$TEAM_FIELDING_E)*train$era_DeadBall


train$dp_era_m <- (train$TEAM_FIELDING_DP)*train$era_Modern
train$dp_era_fa <- (train$TEAM_FIELDING_DP)*train$era_FreeAgency
train$dp_era_e <- (train$TEAM_FIELDING_DP)*train$era_Expansion
train$dp_era_i <- (train$TEAM_FIELDING_DP)*train$era_Integration
train$dp_era_lb <- (train$TEAM_FIELDING_DP)*train$era_LiveBall
train$dp_era_db <- (train$TEAM_FIELDING_DP)*train$era_DeadBall
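The interaction columns above all follow one pattern: a statistic multiplied by an era dummy. As a sketch, the same columns could be generated in a loop. The helper `add_era_interactions`, its arguments, and the `toy` data are hypothetical and shown only to illustrate the pattern on two eras.

```r
# Hypothetical helper: for each statistic and era, add <prefix>_era_<era><suffix>
# equal to the statistic multiplied by the matching era_<Era> dummy column.
add_era_interactions <- function(df, stat_cols, era_codes, suffix = "") {
  for (prefix in names(stat_cols)) {
    for (era in names(era_codes)) {
      new_name <- paste0(prefix, "_era_", era_codes[[era]], suffix)
      df[[new_name]] <- df[[stat_cols[[prefix]]]] * df[[paste0("era_", era)]]
    }
  }
  df
}

# Made-up data with two eras for illustration only
toy <- data.frame(TEAM_BATTING_H  = c(1400, 1500),
                  TEAM_PITCHING_H = c(1450, 1550),
                  era_Modern      = c(1, 0),
                  era_DeadBall    = c(0, 1))
era_codes <- c(Modern = "m", DeadBall = "db")

toy <- add_era_interactions(toy, c(H = "TEAM_BATTING_H"), era_codes)
toy <- add_era_interactions(toy, c(H = "TEAM_PITCHING_H"), era_codes, suffix = "_p")
names(toy)
```

Keeping the naming scheme in one place makes it harder to mistype a single column among the dozens created by hand.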



era<-lm(TARGET_WINS ~ H_era_m+H_era_fa+H_era_e+H_era_i+H_era_lb+H_era_db+
                      H_era_m_p+H_era_fa_p+H_era_e_p+H_era_i_p+H_era_lb_p+H_era_db_p+
          
                      bb_era_m+bb_era_fa+bb_era_e+bb_era_i+bb_era_lb+bb_era_db+
                      bb_era_m_p+bb_era_fa_p+bb_era_e_p+bb_era_i_p+bb_era_lb_p+bb_era_db_p+
          
                      hr_era_m+hr_era_fa+hr_era_e+hr_era_i+hr_era_lb+hr_era_db+
                      hr_era_m_p+hr_era_fa_p+hr_era_e_p+hr_era_i_p+hr_era_lb_p+hr_era_db_p+
          
                      so_era_m+so_era_fa+so_era_e+so_era_i+so_era_lb+so_era_db+
                      so_era_m_p+so_era_fa_p+so_era_e_p+so_era_i_p+so_era_lb_p+so_era_db_p+
          
                      x2b_era_m+x2b_era_fa+x2b_era_e+x2b_era_i+x2b_era_lb+x2b_era_db+
          
                      x3b_era_m+x3b_era_fa+x3b_era_e+x3b_era_i+x3b_era_lb+x3b_era_db+
          
                      e_era_m+e_era_fa+e_era_e+e_era_i+e_era_lb+e_era_db+
          
                      dp_era_m+dp_era_fa+dp_era_e+dp_era_i+dp_era_lb+dp_era_db+
          
                      sb_era_m+sb_era_fa+sb_era_e+sb_era_i+sb_era_lb+sb_era_db,
          data = train)

summary(era)

## 
## Call:
## lm(formula = TARGET_WINS ~ H_era_m + H_era_fa + H_era_e + H_era_i + 
##     H_era_lb + H_era_db + H_era_m_p + H_era_fa_p + H_era_e_p + 
##     H_era_i_p + H_era_lb_p + H_era_db_p + bb_era_m + bb_era_fa + 
##     bb_era_e + bb_era_i + bb_era_lb + bb_era_db + bb_era_m_p + 
##     bb_era_fa_p + bb_era_e_p + bb_era_i_p + bb_era_lb_p + bb_era_db_p + 
##     hr_era_m + hr_era_fa + hr_era_e + hr_era_i + hr_era_lb + 
##     hr_era_db + hr_era_m_p + hr_era_fa_p + hr_era_e_p + hr_era_i_p + 
##     hr_era_lb_p + hr_era_db_p + so_era_m + so_era_fa + so_era_e + 
##     so_era_i + so_era_lb + so_era_db + so_era_m_p + so_era_fa_p + 
##     so_era_e_p + so_era_i_p + so_era_lb_p + so_era_db_p + x2b_era_m + 
##     x2b_era_fa + x2b_era_e + x2b_era_i + x2b_era_lb + x2b_era_db + 
##     x3b_era_m + x3b_era_fa + x3b_era_e + x3b_era_i + x3b_era_lb + 
##     x3b_era_db + e_era_m + e_era_fa + e_era_e + e_era_i + e_era_lb + 
##     e_era_db + dp_era_m + dp_era_fa + dp_era_e + dp_era_i + dp_era_lb + 
##     dp_era_db + sb_era_m + sb_era_fa + sb_era_e + sb_era_i + 
##     sb_era_lb + sb_era_db, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.346  -7.716   0.025   7.535  55.109 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.5149454  6.3583950   4.170 3.16e-05 ***
## H_era_m      0.0249513  0.0120527   2.070 0.038552 *  
## H_era_fa     0.0067291  0.0293094   0.230 0.818433    
## H_era_e     -0.0205919  0.0247588  -0.832 0.405668    
## H_era_i      0.0388812  0.0084105   4.623 4.00e-06 ***
## H_era_lb     0.0208088  0.0076115   2.734 0.006310 ** 
## H_era_db     0.0521409  0.0047745  10.921  < 2e-16 ***
## H_era_m_p    0.0038292  0.0045766   0.837 0.402855    
## H_era_fa_p   0.0437280  0.0249128   1.755 0.079357 .  
## H_era_e_p    0.0573490  0.0217278   2.639 0.008363 ** 
## H_era_i_p   -0.0003111  0.0011637  -0.267 0.789204    
## H_era_lb_p   0.0063022  0.0022585   2.791 0.005308 ** 
## H_era_db_p  -0.0009699  0.0005150  -1.883 0.059817 .  
## bb_era_m     0.0391196  0.0147350   2.655 0.007991 ** 
## bb_era_fa    0.1571089  0.0752044   2.089 0.036814 *  
## bb_era_e     0.4843324  0.1324373   3.657 0.000261 ***
## bb_era_i     0.0038739  0.0235637   0.164 0.869432    
## bb_era_lb   -0.0276587  0.0265984  -1.040 0.298518    
## bb_era_db   -0.0130167  0.0117329  -1.109 0.267369    
## bb_era_m_p   0.0083550  0.0084392   0.990 0.322272    
## bb_era_fa_p -0.1143946  0.0716431  -1.597 0.110470    
## bb_era_e_p  -0.4422586  0.1287088  -3.436 0.000601 ***
## bb_era_i_p  -0.0001063  0.0221287  -0.005 0.996168    
## bb_era_lb_p  0.0394072  0.0214444   1.838 0.066249 .  
## bb_era_db_p -0.0122104  0.0080862  -1.510 0.131182    
## hr_era_m     0.1449235  0.0442409   3.276 0.001070 ** 
## hr_era_fa   -0.3311064  0.1668553  -1.984 0.047336 *  
## hr_era_e    -0.5880747  0.5072753  -1.159 0.246468    
## hr_era_i    -0.0284609  0.2527823  -0.113 0.910365    
## hr_era_lb    0.3610345  0.1587916   2.274 0.023084 *  
## hr_era_db    0.0325140  0.0606093   0.536 0.591700    
## hr_era_m_p  -0.0744317  0.0363347  -2.049 0.040629 *  
## hr_era_fa_p  0.3691369  0.1595200   2.314 0.020757 *  
## hr_era_e_p   0.6626919  0.4950401   1.339 0.180819    
## hr_era_i_p   0.0916041  0.2415589   0.379 0.704561    
## hr_era_lb_p -0.2160877  0.1439625  -1.501 0.133499    
## hr_era_db_p  0.0427567  0.0495391   0.863 0.388182    
## so_era_m    -0.0191559  0.0078753  -2.432 0.015079 *  
## so_era_fa    0.0299327  0.0283780   1.055 0.291640    
## so_era_e    -0.1383335  0.0563040  -2.457 0.014091 *  
## so_era_i    -0.0306074  0.0140997  -2.171 0.030054 *  
## so_era_lb   -0.0835988  0.0171974  -4.861 1.25e-06 ***
## so_era_db    0.0058190  0.0083171   0.700 0.484225    
## so_era_m_p  -0.0033599  0.0039780  -0.845 0.398408    
## so_era_fa_p -0.0571447  0.0265302  -2.154 0.031352 *  
## so_era_e_p   0.1313924  0.0548716   2.395 0.016725 *  
## so_era_i_p   0.0254026  0.0130496   1.947 0.051708 .  
## so_era_lb_p  0.0856253  0.0148685   5.759 9.66e-09 ***
## so_era_db_p -0.0027319  0.0060722  -0.450 0.652822    
## x2b_era_m    0.0440077  0.0305506   1.440 0.149873    
## x2b_era_fa  -0.0668614  0.0417534  -1.601 0.109445    
## x2b_era_e   -0.0262660  0.0408917  -0.642 0.520726    
## x2b_era_i   -0.0306346  0.0346856  -0.883 0.377222    
## x2b_era_lb   0.0327525  0.0325232   1.007 0.314022    
## x2b_era_db   0.0072839  0.0194255   0.375 0.707721    
## x3b_era_m    0.0676848  0.0794700   0.852 0.394472    
## x3b_era_fa  -0.0304846  0.1048836  -0.291 0.771345    
## x3b_era_e    0.1594218  0.0949959   1.678 0.093451 .  
## x3b_era_i    0.2696816  0.0841463   3.205 0.001370 ** 
## x3b_era_lb   0.2008185  0.0724950   2.770 0.005651 ** 
## x3b_era_db   0.0770310  0.0265484   2.902 0.003750 ** 
## e_era_m     -0.0204063  0.0119897  -1.702 0.088900 .  
## e_era_fa     0.0260124  0.0175436   1.483 0.138290    
## e_era_e     -0.1116542  0.0222972  -5.008 5.95e-07 ***
## e_era_i     -0.0451841  0.0128702  -3.511 0.000456 ***
## e_era_lb    -0.0905786  0.0094137  -9.622  < 2e-16 ***
## e_era_db    -0.0234638  0.0037249  -6.299 3.60e-10 ***
## dp_era_m    -0.1112532  0.0369904  -3.008 0.002663 ** 
## dp_era_fa   -0.0806575  0.0390467  -2.066 0.038977 *  
## dp_era_e    -0.0746975  0.0351085  -2.128 0.033480 *  
## dp_era_i    -0.0763914  0.0295604  -2.584 0.009823 ** 
## dp_era_lb   -0.1341195  0.0286437  -4.682 3.01e-06 ***
## dp_era_db   -0.1420045  0.0247892  -5.728 1.15e-08 ***
## sb_era_m     0.0182889  0.0190365   0.961 0.336795    
## sb_era_fa    0.0549168  0.0181246   3.030 0.002474 ** 
## sb_era_e     0.0314596  0.0158083   1.990 0.046707 *  
## sb_era_i     0.0622565  0.0132599   4.695 2.83e-06 ***
## sb_era_lb    0.0911188  0.0105364   8.648  < 2e-16 ***
## sb_era_db    0.0275920  0.0067918   4.063 5.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.3 on 2197 degrees of freedom
## Multiple R-squared:  0.4115, Adjusted R-squared:  0.3906 
## F-statistic:  19.7 on 78 and 2197 DF,  p-value: < 2.2e-16

plot(era)
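A model with one slope per era and a shared intercept can also be expressed through R's formula interface with a `numeric:factor` term, without building the product columns by hand. The sketch below uses simulated data (all names and values are assumptions for illustration) and compares the two approaches; the fitted values should agree.

```r
set.seed(1)

# Simulated data for illustration only
toy <- data.frame(
  era  = factor(rep(c("DeadBall", "LiveBall", "Modern"), each = 20)),
  hits = rnorm(60, mean = 1450, sd = 100)
)
toy$wins <- 20 + 0.04 * toy$hits + rnorm(60, sd = 5)

# Manual approach: one product column per era
toy$H_era_db <- toy$hits * (toy$era == "DeadBall")
toy$H_era_lb <- toy$hits * (toy$era == "LiveBall")
toy$H_era_m  <- toy$hits * (toy$era == "Modern")
manual <- lm(wins ~ H_era_db + H_era_lb + H_era_m, data = toy)

# Formula approach: hits:era expands to one slope per era level
compact <- lm(wins ~ hits:era, data = toy)

same_fit <- all.equal(unname(fitted(manual)), unname(fitted(compact)))
same_fit
```

Because `hits` does not appear as a main effect, the `hits:era` term is coded with a full set of era indicators, which matches the hand-built columns.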

HW1

Adam Gersowitz

3/1/2021
