Overview

You will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).

Deliverables:
- A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
- Assigned predictions (the number of wins for the team) for the evaluation data set.
- Include your R statistical programming code in an Appendix.

## start w a brand new environment
# rm(list=ls(all.names=TRUE))
## detect all package conflicts
library (conflicted)

1. DATA EXPLORATION (25 POINTS)

Describe the size and the variables in the moneyball training data set.
a. Mean / Standard Deviation / Median
b. Bar Chart or Box Plot of the data and/or Histograms
c. Is the data correlated to the target variable (or to other variables?)
d. Are any of the variables missing and need to be imputed “fixed”?

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286
##      INDEX      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :   9   Min.   : 819   Min.   : 44.0   Min.   : 14.00  
##  1st Qu.: 708   1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00  
##  Median :1249   Median :1455   Median :239.0   Median : 52.00  
##  Mean   :1264   Mean   :1469   Mean   :241.3   Mean   : 55.91  
##  3rd Qu.:1832   3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00  
##  Max.   :2525   Max.   :2170   Max.   :376.0   Max.   :155.00  
##                                                                
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   : 15.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 44.50   1st Qu.:436.5   1st Qu.: 545.0   1st Qu.: 59.0  
##  Median :101.00   Median :509.0   Median : 686.0   Median : 92.0  
##  Mean   : 95.63   Mean   :499.0   Mean   : 709.3   Mean   :123.7  
##  3rd Qu.:135.50   3rd Qu.:565.5   3rd Qu.: 912.0   3rd Qu.:151.8  
##  Max.   :242.00   Max.   :792.0   Max.   :1268.0   Max.   :580.0  
##                                   NA's   :18       NA's   :13     
##  TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.00   Min.   :42.00    Min.   : 1155   Min.   :  0.0   
##  1st Qu.: 38.00   1st Qu.:53.50    1st Qu.: 1426   1st Qu.: 52.0   
##  Median : 49.50   Median :62.00    Median : 1515   Median :104.0   
##  Mean   : 52.32   Mean   :62.37    Mean   : 1813   Mean   :102.1   
##  3rd Qu.: 63.00   3rd Qu.:67.50    3rd Qu.: 1681   3rd Qu.:142.5   
##  Max.   :154.00   Max.   :96.00    Max.   :22768   Max.   :336.0   
##  NA's   :87       NA's   :240                                      
##  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   : 136.0   Min.   :   0.0   Min.   :  73.0   Min.   : 69.0   
##  1st Qu.: 471.0   1st Qu.: 613.0   1st Qu.: 131.0   1st Qu.:131.0   
##  Median : 526.0   Median : 745.0   Median : 163.0   Median :148.0   
##  Mean   : 552.4   Mean   : 799.7   Mean   : 249.7   Mean   :146.1   
##  3rd Qu.: 606.5   3rd Qu.: 938.0   3rd Qu.: 252.0   3rd Qu.:164.0   
##  Max.   :2008.0   Max.   :9963.0   Max.   :1568.0   Max.   :204.0   
##                   NA's   :18                        NA's   :31
namediff_trainvstest <- setdiff (colnames(train), colnames(test))
print (namediff_trainvstest)
## [1] "TARGET_WINS"

Check missing data

vis_miss(train)

psych::describe (train, na.rm=F)
##                  vars   n    mean     sd median trimmed    mad  min  max
## INDEX               1 191 1383.59 765.24   1380 1408.07 968.14   41 2534
## TARGET_WINS         2 191   80.93  12.12     82   81.12  13.34   43  116
## TEAM_BATTING_H      3 191 1478.63  76.15   1477 1477.42  74.13 1308 1667
## TEAM_BATTING_2B     4 191  297.20  26.33    296  296.63  25.20  201  373
## TEAM_BATTING_3B     5 191   30.74   9.04     29   30.13   8.90   12   61
## TEAM_BATTING_HR     6 191  178.05  32.41    175  176.81  35.58  116  260
## TEAM_BATTING_BB     7 191  543.32  74.84    535  541.31  74.13  365  775
## TEAM_BATTING_SO     8 191 1051.03 104.16   1050 1046.95  97.85  805 1399
## TEAM_BASERUN_SB     9 191   90.91  29.92     87   89.07  29.65   31  177
## TEAM_BASERUN_CS    10 191   39.94  11.90     38   39.49  11.86   12   74
## TEAM_BATTING_HBP   11 191   59.36  12.97     58   58.86  11.86   29   95
## TEAM_PITCHING_H    12 191 1479.70  75.79   1480 1478.50  72.65 1312 1667
## TEAM_PITCHING_HR   13 191  178.18  32.39    175  176.93  35.58  116  260
## TEAM_PITCHING_BB   14 191  543.72  74.92    537  541.75  72.65  367  775
## TEAM_PITCHING_SO   15 191 1051.82 104.35   1052 1047.80  97.85  805 1399
## TEAM_FIELDING_E    16 191  107.05  16.63    106  106.58  17.79   65  145
## TEAM_FIELDING_DP   17 191  152.34  17.61    152  152.05  19.27  113  204
##                  range  skew kurtosis    se
## INDEX             2493 -0.13    -1.14 55.37
## TARGET_WINS         73 -0.17    -0.30  0.88
## TEAM_BATTING_H     359  0.13    -0.37  5.51
## TEAM_BATTING_2B    172  0.09     0.48  1.91
## TEAM_BATTING_3B     49  0.70     0.74  0.65
## TEAM_BATTING_HR    144  0.30    -0.72  2.35
## TEAM_BATTING_BB    410  0.31    -0.15  5.42
## TEAM_BATTING_SO    594  0.40     0.40  7.54
## TEAM_BASERUN_SB    146  0.56    -0.14  2.16
## TEAM_BASERUN_CS     62  0.35     0.00  0.86
## TEAM_BATTING_HBP    66  0.32    -0.11  0.94
## TEAM_PITCHING_H    355  0.13    -0.39  5.48
## TEAM_PITCHING_HR   144  0.30    -0.72  2.34
## TEAM_PITCHING_BB   408  0.31    -0.13  5.42
## TEAM_PITCHING_SO   594  0.39     0.39  7.55
## TEAM_FIELDING_E     80  0.18    -0.36  1.20
## TEAM_FIELDING_DP    91  0.22    -0.21  1.27

The data come in two CSV files: one holds the training data and the other the evaluation (test) data. The mean, standard deviation, minimum, and maximum of each variable are shown above; note that the psych::describe output covers only the 191 complete cases. Comparing the two datasets, the training data contains 2276 observations with 17 variables, while the test data contains 259 observations with 16 variables: the same columns as the training data except for TARGET_WINS. The outcome variable is TARGET_WINS. The INDEX variable is an identifier and will be dropped from the analysis. The 15 predictors fall into four categories: Batting (7 columns), Baserun (2 columns), Fielding (2 columns), and Pitching (4 columns). Regarding missing values, TEAM_BATTING_HBP is missing in 2085 of the 2276 training rows (about 92%; 240 of 259 in the test data), and TEAM_BASERUN_CS is missing in 772 training rows (about 34%; 87 in the test data). Because of these large missing proportions, we will exclude both from the analysis. The other variables with missing values are TEAM_BATTING_SO (102 in training, 18 in test), TEAM_BASERUN_SB (131, 13), TEAM_PITCHING_SO (102, 18), and TEAM_FIELDING_DP (286, 31). For these variables, we will impute the missing values with the column median.
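
The missing-value counts discussed above can be tabulated directly rather than read off the summary() output. The following is a minimal sketch on a made-up data frame; the `toy` values and the 0.30 cutoff are assumptions for illustration and are not part of the original analysis.

```r
# Toy stand-in for the training data: the column names mirror the real ones,
# but the values are invented for illustration.
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 65, 77),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, 49),
  TEAM_BATTING_HBP = c(NA, NA, NA, NA, 58)
)

# Count and share of missing values per column.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)
missing_summary <- data.frame(na_count, na_share)
print(missing_summary)

# Flag columns whose missing share exceeds a chosen cutoff (0.30 is an assumption).
too_sparse <- names(na_share)[na_share > 0.30]
print(too_sparse)
```

Running the same three lines on `train` and `test` would give the per-column counts for both files in one table each.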

Plotting the outcome variable shows a distribution that is close to normal with a slight left skew: the mean is 80.79, just below the median of 82, and the values range from 0 to 146. In the analysis below we will not transform this variable; we will model it on its original scale.
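
One way to back up a visual judgment of the outcome's shape is to compare the mean with the median and compute a moment-based skewness. The sketch below uses simulated values in place of TARGET_WINS, and `skewness` is a hypothetical helper name; packages such as psych may apply small-sample corrections, so their values can differ slightly.

```r
set.seed(42)
# Simulated stand-in for TARGET_WINS (not the real data).
wins <- round(rnorm(500, mean = 81, sd = 15))

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# A negative value indicates a longer left tail, a positive value a longer right tail.
print(c(mean = mean(wins), median = median(wins), skew = skewness(wins)))
```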

train_melted <- melt(as.data.frame(train))  
## No id variables; using all as measure variables
## melt() returns two columns: variable and value
ggplot(train_melted,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()
## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

training_corr_plot <- train[ , !(names(train) %in% c( "INDEX" ))]
training_corr_plot <- training_corr_plot[complete.cases(training_corr_plot), ]
MatrixCorrelation <- cor(training_corr_plot) 
corrplot(MatrixCorrelation, method = "ellipse")
# print( MatrixCorrelation)
## buggy in knit function, but no problem in regular viewing
# colnames(train)

Plotting each predictor shows that two pitching variables, TEAM_PITCHING_H (maximum 30132) and TEAM_PITCHING_SO (maximum 19278), have very wide ranges driven by extreme values. The remaining predictors have much narrower ranges. The correlation matrix shows that the variables positively correlated with the outcome include: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_HBP, TEAM_BASERUN_SB, TEAM_FIELDING_DP, TEAM_PITCHING_SO.
The variables negatively correlated with the outcome include: TEAM_BATTING_SO, TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, TEAM_PITCHING_HR. Regarding collinearity, TEAM_BATTING_HR and TEAM_PITCHING_HR are strongly associated with each other. We will handle this by log-transforming TEAM_BATTING_HR, to reduce its overall effect on the model, and by using backward selection, which can take care of some of the collinearity issues. For the rest of the predictors, there does not seem to be an association strong enough to affect the interpretation of the results.
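
Strongly associated predictor pairs can also be listed programmatically instead of read off the correlation plot. This is a sketch on simulated data, assuming a cutoff of 0.9 for "strongly associated"; the shortened column names and the cutoff are illustrative choices only.

```r
set.seed(1)
n <- 200
hr_bat <- rnorm(n, mean = 100, sd = 30)
toy <- data.frame(
  BATTING_HR  = hr_bat,
  PITCHING_HR = hr_bat + rnorm(n, sd = 3),   # nearly a copy of BATTING_HR
  FIELDING_E  = rnorm(n, mean = 150, sd = 40)
)

cors <- cor(toy, use = "complete.obs")

# Keep each pair once (upper triangle) and list those above the cutoff.
cutoff <- 0.9   # assumed threshold
idx <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
print(high_pairs)
```

Applied to `MatrixCorrelation`, the same indexing would return every predictor pair above the chosen cutoff.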

2. DATA PREPARATION (25 POINTS)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
a. Fix missing values (maybe with a Mean or Median value)
b. Create flags to suggest if a variable was missing
c. Transform data by putting it into buckets
d. Mathematical transforms such as log or square root
e. Combine variables (such as ratios or adding or multiplying) to create new variables

train1 <- train[, -c(1)]
glimpse((train1))
## Observations: 2,276
## Variables: 16
## $ TARGET_WINS      <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 6...
## $ TEAM_BATTING_H   <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 127...
## $ TEAM_BATTING_2B  <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, ...
## $ TEAM_BATTING_3B  <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 3...
## $ TEAM_BATTING_HR  <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96,...
## $ TEAM_BATTING_BB  <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, ...
## $ TEAM_BATTING_SO  <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 92...
## $ TEAM_BASERUN_SB  <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, ...
## $ TEAM_BASERUN_CS  <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 7...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H  <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 128...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96,...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 9...
## $ TEAM_FIELDING_E  <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 1...

The index column was removed.

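## NOTE: myvars is a logical vector, so names(train1) %in% myvars matches nothing
## and no column is dropped (the output below still shows both); train1[ , !myvars] would drop them
myvars <- names (train1) %in% c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS")  ## add INDEX will be BUGGY
train_original <- train1[ , !(names(train1) %in% myvars )]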
head(train_original)
##   TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1          39           1445             194              39
## 2          70           1339             219              22
## 3          86           1377             232              35
## 4          70           1387             209              38
## 5          82           1297             186              27
## 6          75           1279             200              36
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              13             143             842              NA
## 2             190             685            1075              37
## 3             137             602             917              46
## 4              96             451             922              43
## 5             102             472             920              49
## 6              92             443             973             107
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1              NA               NA            9364               84
## 2              28               NA            1347              191
## 3              27               NA            1377              137
## 4              30               NA            1396               97
## 5              39               NA            1297              102
## 6              59               NA            1279               92
##   TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1              927             5456            1011               NA
## 2              689             1082             193              155
## 3              602              917             175              153
## 4              454              928             164              156
## 5              472              920             138              168
## 6              443              973             123              149

We removed the INDEX variable from the training data. We also set out to exclude the two variables with disproportionately many missing values, TEAM_BATTING_HBP and TEAM_BASERUN_CS; however, the output above still lists both columns, so in the recorded run they were retained. For the remaining variables, which have a smaller proportion of missing values, we impute the missing values with the column median.

Dealing with missing values

Uses median for imputation

## NOTE: as above, myvars is a logical vector, so this filter keeps every column
myvars <- names (train1) %in% c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS")  ## add INDEX will be BUGGY
train_imputed <- train1[ , !(names(train1) %in% myvars )]
## replace each column's missing values with that column's median
for (i in seq_len(ncol(train_imputed))) {
  train_imputed[is.na(train_imputed[, i]), i] <- median(train_imputed[, i], na.rm = TRUE)
}
print(summary(train_imputed))
head(train_imputed)
# is.na(train_imputed)
## double checked: every missing value is already imputed

All missing values are now imputed.
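
The assignment also suggests missing-value flags, and the evaluation data needs the same treatment as the training data before prediction. A possible sketch, assuming the medians are learned from the training set and reused for the test set; `impute_with` and the `_was_na` suffix are hypothetical names, and the data are invented.

```r
# Toy train/test frames; values are invented for illustration.
train_toy <- data.frame(SB = c(10, NA, 30, 50), DP = c(150, 160, NA, 140))
test_toy  <- data.frame(SB = c(NA, 20),         DP = c(155, NA))

# Learn one median per column from the training data only.
train_medians <- sapply(train_toy, median, na.rm = TRUE)

# Hypothetical helper: add a 0/1 missing flag per column,
# then fill the NAs with the supplied medians.
impute_with <- function(df, medians) {
  for (col in names(medians)) {
    df[[paste0(col, "_was_na")]] <- as.integer(is.na(df[[col]]))
    df[[col]][is.na(df[[col]])] <- medians[[col]]
  }
  df
}

train_filled <- impute_with(train_toy, train_medians)
test_filled  <- impute_with(test_toy,  train_medians)
print(test_filled)
```

Reusing the training medians keeps the evaluation rows from influencing the values they are filled with.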

Transforming variables

# ## BUGGY, it runs fine in regular session, but buggy in knitting, object "train_imputed" not found
train_t <- train_imputed 
train_t$TEAM_BATTING_HR_tr <- log(train_t$TEAM_BATTING_HR +1)
train_t$TEAM_BATTING_SO_tr <- log(train_t$TEAM_BATTING_SO +1)
train_t$TEAM_PITCHING_SO_tr <- log(train_t$TEAM_PITCHING_SO +1)


test_t<- test
test_t$TEAM_BATTING_HR_tr <- log(test_t$TEAM_BATTING_HR +1)
test_t$TEAM_BATTING_SO_tr <- log(test_t$TEAM_BATTING_SO +1)
test_t$TEAM_PITCHING_SO_tr <- log(test_t$TEAM_PITCHING_SO +1)
colnames((train_t))
##  [1] "TARGET_WINS"         "TEAM_BATTING_H"      "TEAM_BATTING_2B"    
##  [4] "TEAM_BATTING_3B"     "TEAM_BATTING_HR"     "TEAM_BATTING_BB"    
##  [7] "TEAM_BATTING_SO"     "TEAM_BASERUN_SB"     "TEAM_BASERUN_CS"    
## [10] "TEAM_BATTING_HBP"    "TEAM_PITCHING_H"     "TEAM_PITCHING_HR"   
## [13] "TEAM_PITCHING_BB"    "TEAM_PITCHING_SO"    "TEAM_FIELDING_E"    
## [16] "TEAM_FIELDING_DP"    "TEAM_BATTING_HR_tr"  "TEAM_BATTING_SO_tr" 
## [19] "TEAM_PITCHING_SO_tr"
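## drop the original, untransformed variables from the training data (by name rather than position)
train_t3 <- train_t[ , !(names(train_t) %in% c("TEAM_BATTING_HR", "TEAM_BATTING_SO", "TEAM_PITCHING_SO"))]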
colnames (train_t3)
##  [1] "TARGET_WINS"         "TEAM_BATTING_H"      "TEAM_BATTING_2B"    
##  [4] "TEAM_BATTING_3B"     "TEAM_BATTING_BB"     "TEAM_BASERUN_SB"    
##  [7] "TEAM_BASERUN_CS"     "TEAM_BATTING_HBP"    "TEAM_PITCHING_H"    
## [10] "TEAM_PITCHING_HR"    "TEAM_PITCHING_BB"    "TEAM_FIELDING_E"    
## [13] "TEAM_FIELDING_DP"    "TEAM_BATTING_HR_tr"  "TEAM_BATTING_SO_tr" 
## [16] "TEAM_PITCHING_SO_tr"
colnames(test_t)
##  [1] "INDEX"               "TEAM_BATTING_H"      "TEAM_BATTING_2B"    
##  [4] "TEAM_BATTING_3B"     "TEAM_BATTING_HR"     "TEAM_BATTING_BB"    
##  [7] "TEAM_BATTING_SO"     "TEAM_BASERUN_SB"     "TEAM_BASERUN_CS"    
## [10] "TEAM_BATTING_HBP"    "TEAM_PITCHING_H"     "TEAM_PITCHING_HR"   
## [13] "TEAM_PITCHING_BB"    "TEAM_PITCHING_SO"    "TEAM_FIELDING_E"    
## [16] "TEAM_FIELDING_DP"    "TEAM_BATTING_HR_tr"  "TEAM_BATTING_SO_tr" 
## [19] "TEAM_PITCHING_SO_tr"
## drop the original, untransformed variables from the testing data as well
test_t3 <- test_t[ , !(names(test_t) %in% c("TEAM_BATTING_HR", "TEAM_BATTING_SO", "TEAM_PITCHING_SO"))]
colnames (test_t3)
##  [1] "INDEX"               "TEAM_BATTING_H"      "TEAM_BATTING_2B"    
##  [4] "TEAM_BATTING_3B"     "TEAM_BATTING_BB"     "TEAM_BASERUN_SB"    
##  [7] "TEAM_BASERUN_CS"     "TEAM_BATTING_HBP"    "TEAM_PITCHING_H"    
## [10] "TEAM_PITCHING_HR"    "TEAM_PITCHING_BB"    "TEAM_FIELDING_E"    
## [13] "TEAM_FIELDING_DP"    "TEAM_BATTING_HR_tr"  "TEAM_BATTING_SO_tr" 
## [16] "TEAM_PITCHING_SO_tr"

The individual check of each predictor revealed that three variables are not normally distributed, and two of them are collinear. We therefore log-transformed these variables: TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_SO. We applied the transformation to both the training and the testing datasets. The final training data is train_t3, which has been cleaned and has all missing values imputed.
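
The `log(x + 1)` form used above keeps zero counts finite, and base R offers `log1p()` for the same computation. A small sketch with made-up counts (the vector `x` is illustrative only):

```r
x <- c(0, 13, 96, 190, 264)   # made-up home-run counts, including a zero

by_hand  <- log(x + 1)
built_in <- log1p(x)

# Both forms agree; log1p() is simply the built-in spelling.
print(all.equal(by_hand, built_in))
print(round(built_in, 3))

# expm1() inverts the transform, which helps when reading values
# back on the original scale.
print(all.equal(expm1(built_in), x))
```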

3. BUILD MODELS (25 POINTS)

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

MODEL: Backward selection

full.model.imputed <- lm (TARGET_WINS ~   . , data=train_t3)
reduced.full.model.imputed<- step (full.model.imputed, direction = "backward") 
## Start:  AIC=11701.87
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + TEAM_BATTING_HR_tr + TEAM_BATTING_SO_tr + 
##     TEAM_PITCHING_SO_tr
## 
##                       Df Sum of Sq    RSS   AIC
## - TEAM_PITCHING_BB     1       3.1 383672 11700
## - TEAM_BATTING_HBP     1      66.0 383734 11700
## - TEAM_BATTING_HR_tr   1     143.5 383812 11701
## - TEAM_BATTING_2B      1     181.4 383850 11701
## - TEAM_BASERUN_CS      1     271.1 383940 11702
## <none>                             383668 11702
## - TEAM_BATTING_BB      1    1361.9 385030 11708
## - TEAM_PITCHING_SO_tr  1    2110.9 385779 11712
## - TEAM_BATTING_SO_tr   1    2841.6 386510 11717
## - TEAM_PITCHING_H      1    2958.6 386627 11717
## - TEAM_BATTING_3B      1    3189.7 386858 11719
## - TEAM_PITCHING_HR     1    4341.1 388010 11726
## - TEAM_BASERUN_SB      1    7530.3 391199 11744
## - TEAM_FIELDING_DP     1   12522.7 396191 11773
## - TEAM_FIELDING_E      1   16884.9 400553 11798
## - TEAM_BATTING_H       1   22867.1 406536 11832
## 
## Step:  AIC=11699.89
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     TEAM_BATTING_HR_tr + TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
## 
##                       Df Sum of Sq    RSS   AIC
## - TEAM_BATTING_HBP     1      66.0 383738 11698
## - TEAM_BATTING_HR_tr   1     142.8 383814 11699
## - TEAM_BATTING_2B      1     194.3 383866 11699
## - TEAM_BASERUN_CS      1     270.0 383942 11700
## <none>                             383672 11700
## - TEAM_PITCHING_H      1    3175.6 386847 11717
## - TEAM_BATTING_3B      1    3211.8 386883 11717
## - TEAM_PITCHING_SO_tr  1    3272.5 386944 11717
## - TEAM_BATTING_BB      1    3328.2 387000 11718
## - TEAM_BATTING_SO_tr   1    4166.2 387838 11722
## - TEAM_PITCHING_HR     1    4340.9 388012 11724
## - TEAM_BASERUN_SB      1    8066.3 391738 11745
## - TEAM_FIELDING_DP     1   12656.2 396328 11772
## - TEAM_FIELDING_E      1   18497.3 402169 11805
## - TEAM_BATTING_H       1   25614.6 409286 11845
## 
## Step:  AIC=11698.28
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_HR_tr + 
##     TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
## 
##                       Df Sum of Sq    RSS   AIC
## - TEAM_BATTING_HR_tr   1     148.3 383886 11697
## - TEAM_BATTING_2B      1     187.1 383925 11697
## - TEAM_BASERUN_CS      1     274.8 384012 11698
## <none>                             383738 11698
## - TEAM_PITCHING_H      1    3175.8 386913 11715
## - TEAM_BATTING_3B      1    3199.0 386937 11715
## - TEAM_PITCHING_SO_tr  1    3253.2 386991 11716
## - TEAM_BATTING_BB      1    3331.6 387069 11716
## - TEAM_BATTING_SO_tr   1    4143.6 387881 11721
## - TEAM_PITCHING_HR     1    4384.1 388122 11722
## - TEAM_BASERUN_SB      1    8063.1 391801 11744
## - TEAM_FIELDING_DP     1   12702.9 396440 11770
## - TEAM_FIELDING_E      1   18456.9 402194 11803
## - TEAM_BATTING_H       1   25606.2 409344 11843
## 
## Step:  AIC=11697.16
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + 
##     TEAM_PITCHING_SO_tr
## 
##                       Df Sum of Sq    RSS   AIC
## - TEAM_BATTING_2B      1     180.6 384066 11696
## - TEAM_BASERUN_CS      1     288.1 384174 11697
## <none>                             383886 11697
## - TEAM_PITCHING_H      1    3067.9 386954 11713
## - TEAM_BATTING_3B      1    3143.9 387030 11714
## - TEAM_BATTING_BB      1    3281.1 387167 11714
## - TEAM_PITCHING_SO_tr  1    4090.8 387977 11719
## - TEAM_BATTING_SO_tr   1    5234.2 389120 11726
## - TEAM_BASERUN_SB      1    8497.7 392384 11745
## - TEAM_PITCHING_HR     1    9184.6 393070 11749
## - TEAM_FIELDING_DP     1   14264.8 398151 11778
## - TEAM_FIELDING_E      1   18308.6 402194 11801
## - TEAM_BATTING_H       1   26014.9 409901 11844
## 
## Step:  AIC=11696.23
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + 
##     TEAM_PITCHING_SO_tr
## 
##                       Df Sum of Sq    RSS   AIC
## - TEAM_BASERUN_CS      1       308 384375 11696
## <none>                             384066 11696
## - TEAM_BATTING_BB      1      3233 387300 11713
## - TEAM_PITCHING_H      1      3349 387415 11714
## - TEAM_BATTING_3B      1      3431 387498 11714
## - TEAM_PITCHING_SO_tr  1      4360 388426 11720
## - TEAM_BATTING_SO_tr   1      5639 389705 11727
## - TEAM_BASERUN_SB      1      8814 392881 11746
## - TEAM_PITCHING_HR     1      9009 393075 11747
## - TEAM_FIELDING_DP     1     14265 398331 11777
## - TEAM_FIELDING_E      1     18177 402243 11800
## - TEAM_BATTING_H       1     38474 422540 11912
## 
## Step:  AIC=11696.06
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
## 
##                       Df Sum of Sq    RSS   AIC
## <none>                             384375 11696
## - TEAM_PITCHING_H      1      3388 387763 11714
## - TEAM_BATTING_BB      1      3459 387834 11714
## - TEAM_BATTING_3B      1      3468 387843 11714
## - TEAM_PITCHING_SO_tr  1      4276 388651 11719
## - TEAM_BATTING_SO_tr   1      5530 389905 11727
## - TEAM_BASERUN_SB      1      8520 392894 11744
## - TEAM_PITCHING_HR     1      9894 394269 11752
## - TEAM_FIELDING_DP     1     14339 398713 11777
## - TEAM_FIELDING_E      1     17987 402362 11798
## - TEAM_BATTING_H       1     38278 422653 11910
summary(reduced.full.model.imputed)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + 
##     TEAM_PITCHING_SO_tr, data = train_t3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.487  -8.516   0.193   8.346  62.838 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.846e+01  7.110e+00   8.222 3.33e-16 ***
## TEAM_BATTING_H       3.935e-02  2.620e-03  15.019  < 2e-16 ***
## TEAM_BATTING_3B      7.164e-02  1.585e-02   4.521 6.48e-06 ***
## TEAM_BATTING_BB      1.497e-02  3.315e-03   4.515 6.66e-06 ***
## TEAM_BASERUN_SB      2.899e-02  4.092e-03   7.086 1.84e-12 ***
## TEAM_PITCHING_H     -1.708e-03  3.824e-04  -4.468 8.27e-06 ***
## TEAM_PITCHING_HR     5.374e-02  7.038e-03   7.636 3.29e-14 ***
## TEAM_FIELDING_E     -2.944e-02  2.859e-03 -10.295  < 2e-16 ***
## TEAM_FIELDING_DP    -1.179e-01  1.283e-02  -9.192  < 2e-16 ***
## TEAM_BATTING_SO_tr  -1.739e+01  3.046e+00  -5.709 1.29e-08 ***
## TEAM_PITCHING_SO_tr  1.278e+01  2.546e+00   5.020 5.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.03 on 2265 degrees of freedom
## Multiple R-squared:  0.3191, Adjusted R-squared:  0.3161 
## F-statistic: 106.1 on 10 and 2265 DF,  p-value: < 2.2e-16
summary(reduced.full.model.imputed)$r.squared
## [1] 0.3190839
summary(reduced.full.model.imputed)$adj.r.squared
## [1] 0.3160776

Backward elimination with step() started from the full model with all 15 predictors. In each round it computed the AIC of the model that results from removing each remaining predictor in turn, and dropped the single predictor whose removal gave the lowest AIC. A new round then started from the reduced model, and the process stopped once no removal lowered the AIC any further. No cross-validation is involved; the comparison is based on AIC alone. Backward selection removed these variables, in order: TEAM_PITCHING_BB, TEAM_BATTING_HBP, TEAM_BATTING_HR_tr, TEAM_BATTING_2B, and TEAM_BASERUN_CS.

This is the final model resulting from the backward selection process: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
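
The AIC values printed by step() for a linear model can be reproduced by hand, which makes the selection criterion less of a black box. The sketch below assumes the form n * log(RSS / n) + 2 * p used by extractAIC() for lm fits; `step_aic` is a hypothetical helper name, the RSS is the rounded value from the output above, and the cross-check uses the built-in mtcars data rather than the baseball data.

```r
# AIC as reported by step() for a linear model: n * log(RSS / n) + 2 * p,
# where p counts the estimated coefficients including the intercept.
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

# Final model in the step() output: RSS = 384375 (rounded), n = 2276,
# 10 predictors plus the intercept.
aic_final <- step_aic(rss = 384375, n = 2276, p = 11)
print(round(aic_final, 2))

# Cross-check the formula against extractAIC() on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
manual <- step_aic(sum(resid(fit)^2), nrow(mtcars), length(coef(fit)))
print(all.equal(manual, extractAIC(fit)[2]))
```

The hand-computed value should land close to the AIC=11696.06 shown in the last step, up to rounding of the printed RSS.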

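For comparison with the backward elimination on the “cleaned training data”, I also ran the same analysis on the “original training data”. Because lm() drops rows with any missing value, this fit uses only the complete cases, which is why its AIC values are on a much smaller scale.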

full.model.originaldata <- lm (TARGET_WINS ~   . , data=train_original)
reduced.full.model.originaldata<- step (full.model.originaldata, direction = "backward")    
## Start:  AIC=831.31
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BATTING_SO   1      1.24 12547 829.33
## - TEAM_PITCHING_SO  1      1.48 12547 829.33
## - TEAM_BASERUN_CS   1      1.71 12548 829.34
## - TEAM_BATTING_HR   1     15.23 12561 829.54
## - TEAM_PITCHING_HR  1     15.79 12562 829.55
## - TEAM_PITCHING_H   1     33.63 12580 829.82
## - TEAM_BATTING_H    1     34.42 12580 829.83
## - TEAM_BATTING_2B   1     54.41 12600 830.14
## - TEAM_BASERUN_SB   1     95.22 12641 830.76
## - TEAM_BATTING_BB   1    107.84 12654 830.95
## - TEAM_PITCHING_BB  1    110.48 12656 830.99
## - TEAM_BATTING_3B   1    122.16 12668 831.16
## <none>                          12546 831.31
## - TEAM_BATTING_HBP  1    198.21 12744 832.31
## - TEAM_FIELDING_DP  1    628.49 13174 838.65
## - TEAM_FIELDING_E   1   1237.79 13784 847.28
## 
## Step:  AIC=829.33
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BASERUN_CS   1      1.59 12549 827.35
## - TEAM_BATTING_HR   1     15.82 12563 827.57
## - TEAM_PITCHING_HR  1     16.39 12564 827.58
## - TEAM_BATTING_2B   1     53.47 12601 828.14
## - TEAM_PITCHING_H   1     88.45 12636 828.67
## - TEAM_BATTING_H    1     90.30 12637 828.70
## - TEAM_BASERUN_SB   1     94.19 12641 828.76
## - TEAM_BATTING_BB   1    107.95 12655 828.97
## - TEAM_PITCHING_BB  1    110.60 12658 829.01
## - TEAM_BATTING_3B   1    122.20 12669 829.18
## <none>                          12547 829.33
## - TEAM_BATTING_HBP  1    197.11 12744 830.31
## - TEAM_FIELDING_DP  1    630.68 13178 836.70
## - TEAM_FIELDING_E   1   1240.80 13788 845.34
## - TEAM_PITCHING_SO  1   1312.89 13860 846.34
## 
## Step:  AIC=827.35
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BATTING_HR   1     16.06 12565 825.60
## - TEAM_PITCHING_HR  1     16.64 12565 825.61
## - TEAM_BATTING_2B   1     53.05 12602 826.16
## - TEAM_PITCHING_H   1     90.24 12639 826.72
## - TEAM_BATTING_H    1     92.13 12641 826.75
## - TEAM_BATTING_BB   1    110.31 12659 827.03
## - TEAM_PITCHING_BB  1    113.00 12662 827.07
## - TEAM_BASERUN_SB   1    123.42 12672 827.22
## - TEAM_BATTING_3B   1    129.33 12678 827.31
## <none>                          12549 827.35
## - TEAM_BATTING_HBP  1    197.23 12746 828.33
## - TEAM_FIELDING_DP  1    635.62 13184 834.79
## - TEAM_PITCHING_SO  1   1311.88 13861 844.35
## - TEAM_FIELDING_E   1   1322.05 13871 844.49
## 
## Step:  AIC=825.6
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BATTING_2B   1     55.48 12620 824.44
## - TEAM_PITCHING_H   1     89.26 12654 824.95
## - TEAM_BATTING_H    1     91.97 12657 824.99
## - TEAM_BATTING_BB   1    104.58 12669 825.18
## - TEAM_PITCHING_BB  1    107.19 12672 825.22
## <none>                          12565 825.60
## - TEAM_BATTING_3B   1    137.48 12702 825.68
## - TEAM_BASERUN_SB   1    146.90 12712 825.82
## - TEAM_BATTING_HBP  1    200.36 12765 826.62
## - TEAM_FIELDING_DP  1    628.95 13194 832.93
## - TEAM_PITCHING_HR  1    853.54 13418 836.15
## - TEAM_PITCHING_SO  1   1316.68 13882 842.63
## - TEAM_FIELDING_E   1   1333.15 13898 842.86
## 
## Step:  AIC=824.44
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_PITCHING_H   1     84.47 12705 823.71
## - TEAM_BATTING_H    1     87.79 12708 823.76
## - TEAM_BATTING_BB   1     98.92 12719 823.93
## - TEAM_PITCHING_BB  1    101.48 12722 823.97
## - TEAM_BASERUN_SB   1    109.27 12730 824.09
## <none>                          12620 824.44
## - TEAM_BATTING_3B   1    147.01 12767 824.65
## - TEAM_BATTING_HBP  1    204.39 12825 825.51
## - TEAM_FIELDING_DP  1    649.12 13269 832.02
## - TEAM_PITCHING_HR  1    812.92 13433 834.36
## - TEAM_PITCHING_SO  1   1262.90 13883 840.66
## - TEAM_FIELDING_E   1   1379.34 14000 842.25
## 
## Step:  AIC=823.71
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BATTING_BB   1     32.85 12738 822.21
## - TEAM_PITCHING_BB  1     43.42 12748 822.37
## - TEAM_BASERUN_SB   1    105.16 12810 823.29
## <none>                          12705 823.71
## - TEAM_BATTING_3B   1    153.13 12858 824.00
## - TEAM_BATTING_HBP  1    183.82 12888 824.46
## - TEAM_BATTING_H    1    504.11 13209 829.15
## - TEAM_FIELDING_DP  1    602.80 13308 830.57
## - TEAM_PITCHING_HR  1    850.25 13555 834.09
## - TEAM_PITCHING_SO  1   1259.72 13964 839.77
## - TEAM_FIELDING_E   1   1419.39 14124 841.94
## 
## Step:  AIC=822.21
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BASERUN_SB + 
##     TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BASERUN_SB   1    109.99 12848 821.85
## <none>                          12738 822.21
## - TEAM_BATTING_3B   1    156.45 12894 822.54
## - TEAM_BATTING_HBP  1    186.58 12924 822.98
## - TEAM_BATTING_H    1    485.67 13223 827.35
## - TEAM_FIELDING_DP  1    623.19 13361 829.33
## - TEAM_PITCHING_HR  1    843.83 13581 832.46
## - TEAM_PITCHING_SO  1   1267.25 14005 838.32
## - TEAM_FIELDING_E   1   1395.02 14133 840.06
## - TEAM_PITCHING_BB  1   2364.81 15102 852.73
## 
## Step:  AIC=821.85
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## - TEAM_BATTING_3B   1    133.47 12981 821.82
## <none>                          12848 821.85
## - TEAM_BATTING_HBP  1    177.11 13025 822.46
## - TEAM_BATTING_H    1    566.11 13414 828.09
## - TEAM_FIELDING_DP  1    737.46 13585 830.51
## - TEAM_PITCHING_HR  1    756.49 13604 830.78
## - TEAM_PITCHING_SO  1   1257.91 14106 837.69
## - TEAM_FIELDING_E   1   1330.40 14178 838.67
## - TEAM_PITCHING_BB  1   2371.12 15219 852.20
## 
## Step:  AIC=821.82
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq   RSS    AIC
## <none>                          12981 821.82
## - TEAM_BATTING_HBP  1    228.70 13210 823.16
## - TEAM_BATTING_H    1    449.87 13431 826.33
## - TEAM_FIELDING_DP  1    813.17 13794 831.43
## - TEAM_PITCHING_HR  1    990.20 13971 833.86
## - TEAM_PITCHING_SO  1   1316.56 14298 838.27
## - TEAM_FIELDING_E   1   1334.60 14316 838.52
## - TEAM_PITCHING_BB  1   2583.00 15564 854.49
summary(reduced.full.model.originaldata)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_original)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2248  -5.6294  -0.0212   5.0439  21.3065 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60.95454   19.10292   3.191 0.001670 ** 
## TEAM_BATTING_H    0.02541    0.01009   2.518 0.012648 *  
## TEAM_BATTING_HBP  0.08712    0.04852   1.796 0.074211 .  
## TEAM_PITCHING_HR  0.08945    0.02394   3.736 0.000249 ***
## TEAM_PITCHING_BB  0.05672    0.00940   6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136    0.00728  -4.308 2.68e-05 ***
## TEAM_FIELDING_E  -0.17218    0.03970  -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904    0.03516  -3.386 0.000869 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.422 on 183 degrees of freedom
##   (2085 observations deleted due to missingness)
## Multiple R-squared:  0.5345, Adjusted R-squared:  0.5167 
## F-statistic: 30.02 on 7 and 183 DF,  p-value: < 2.2e-16
summary(reduced.full.model.originaldata)$r.squared
## [1] 0.5345121
summary(reduced.full.model.originaldata)$adj.r.squared
## [1] 0.5167065

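The backward elimination process above ends with the following model: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP. Note that this fit uses only 191 complete records (2085 observations were deleted due to missingness), so its R-squared of 0.5345 (adjusted 0.5167) describes that subset rather than the full training set.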
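As a sanity check on the trace above, the AIC that step() prints for a linear model can be reproduced by hand. The sketch below assumes the extractAIC() form for lm fits, n * log(RSS / n) + 2k, where k counts the estimated coefficients including the intercept. The helper name step_aic is hypothetical, and n = 191 is inferred from the 183 residual degrees of freedom plus 8 coefficients.

```r
# Hypothetical helper: AIC in the form step() reports for a linear model
step_aic <- function(n, rss, k) n * log(rss / n) + 2 * k

n   <- 183 + 8   # residual df + estimated coefficients (intercept + 7 predictors)
rss <- 12981     # RSS of the final step, rounded as printed in the trace
aic_final <- step_aic(n, rss, k = 8)
round(aic_final, 2)
```

Because the printed RSS is rounded, the result should only be expected to agree with the reported AIC of 821.82 to about two decimal places.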

MODEL: I also built an alternative model that uses the 7 columns most highly correlated (in absolute value) with TARGET_WINS as features.

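# Correlation of every column with the target (TARGET_WINS itself scores 1)
cors <- sapply(train_t3, cor, y=train_t3$TARGET_WINS)
# Keep the 8 largest |r|: the target plus its 7 most correlated predictors
mask <- (rank(-abs(cors)) <= 8 )
best7.pred <- train_t3[, mask]

# Same columns without the target (7 predictors, despite the name)
best5.pred <- subset(best7.pred, select = c(-TARGET_WINS) )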
summary(best7.pred)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_BB TEAM_PITCHING_HR TEAM_FIELDING_E  TEAM_BATTING_HR_tr
##  Min.   :  0.0   Min.   :  0.0    Min.   :  65.0   Min.   :0.000     
##  1st Qu.:451.0   1st Qu.: 50.0    1st Qu.: 127.0   1st Qu.:3.761     
##  Median :512.0   Median :107.0    Median : 159.0   Median :4.635     
##  Mean   :501.6   Mean   :105.7    Mean   : 246.5   Mean   :4.325     
##  3rd Qu.:580.0   3rd Qu.:150.0    3rd Qu.: 249.2   3rd Qu.:4.997     
##  Max.   :878.0   Max.   :343.0    Max.   :1898.0   Max.   :5.580
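# summary(best7.pred)$r.squared
# summary(best7.pred)$adj.r.squared
  # Both fail with "$ operator is invalid for atomic vectors": best7.pred is a
  # data frame, not a fitted lm, so its summary() carries no R-squared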

The correlation ranking above selects the following top 7 predictors: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_FIELDING_E, and TEAM_BATTING_HR_tr (the log transformation of TEAM_BATTING_HR).
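The ranking step can be sketched in a form that excludes the target before ranking, so the requested count equals the number of predictors returned. This is only an illustration on the built-in mtcars data, with mpg standing in for TARGET_WINS; the names target, predictors, k, and top_k are assumptions introduced here, not objects from the report.

```r
# Illustration on the built-in mtcars data; mpg stands in for TARGET_WINS
target     <- "mpg"
predictors <- setdiff(names(mtcars), target)

# Correlation of each predictor with the target (target excluded up front)
cors <- sapply(mtcars[predictors], cor, y = mtcars[[target]])

# Names of the k predictors with the largest absolute correlation
k     <- 3
top_k <- names(sort(abs(cors), decreasing = TRUE))[seq_len(k)]
top_k
```

Dropping the target first avoids the off-by-one in a mask such as rank(-abs(cors)) <= 8, where the target's own correlation of 1 takes one of the eight slots.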

4. SELECT MODELS (25 POINTS)

Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set.
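The training-set metrics listed in (a) through (d) can be pulled from any lm fit with base R. The sketch below is a minimal illustration on the built-in mtcars data rather than the moneyball data; the object names (fit, mse, rmse, r2, adjr2, fstat) are assumptions, and the fitted moneyball model would be substituted for fit.

```r
# Illustration on mtcars; substitute the fitted moneyball model for `fit`
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

mse   <- mean(residuals(fit)^2)          # (a) mean squared error on the training data
rmse  <- sqrt(mse)                       #     same quantity on the scale of the response
r2    <- s$r.squared                     # (b) R-squared
adjr2 <- s$adj.r.squared                 #     adjusted for the number of predictors
fstat <- unname(s$fstatistic["value"])   # (c) overall F-statistic

c(MSE = mse, RMSE = rmse, R2 = r2, AdjR2 = adjr2, F = fstat)

# (d) residual plots: residuals vs fitted, and a normal Q-Q plot
# plot(fit, which = 1:2)
```

Note that mean(residuals(fit)^2) divides by n, whereas the residual standard error in summary() divides by the residual degrees of freedom, so the two are close but not identical.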

glimpse(test)
## Observations: 259
## Variables: 16
## $ INDEX            <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, ...
## $ TEAM_BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 138...
## $ TEAM_BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, ...
## $ TEAM_BATTING_3B  <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 5...
## $ TEAM_BATTING_HR  <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 1...
## $ TEAM_BATTING_BB  <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 4...
## $ TEAM_BATTING_SO  <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689,...
## $ TEAM_BASERUN_SB  <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 6...
## $ TEAM_BASERUN_CS  <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 2...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H  <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 162...
## $ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, ...
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, ...
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734...
## $ TEAM_FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, ...
## $ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 1...
head(test)
##   INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1     9           1209             170              33              83
## 2    10           1221             151              29              88
## 3    14           1395             183              29              93
## 4    47           1539             309              29             159
## 5    60           1445             203              68               5
## 6    63           1431             236              53              10
##   TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1             447            1080              62              50
## 2             516             929              54              39
## 3             509             816              59              47
## 4             486             914             148              57
## 5              95             416              NA              NA
## 6             215             377              NA              NA
##   TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1               NA            1209               83              447
## 2               NA            1221               88              516
## 3               NA            1395               93              509
## 4               42            1539              159              486
## 5               NA            3902               14              257
## 6               NA            2793               20              420
##   TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1             1080             140              156
## 2              929             135              164
## 3              816             156              153
## 4              914             124              154
## 5             1123             616              130
## 6              736             572              105

Based on the above statistical models, I have selected a final model with the following variables: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_FIELDING_E, and TEAM_BATTING_HR_tr. These are the predictors from the top-correlation model; three of them (TEAM_BATTING_H, TEAM_PITCHING_HR, and TEAM_FIELDING_E) were also retained by the backward elimination model. Next, I will use this model to make predictions on the test data.

Model Final

model.final <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                    TEAM_BATTING_BB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_BATTING_HR_tr,
                  data = train_t3)
summary(model.final)
# summary(model.final)$r.squared
# summary(model.final)$adj.r.squared
# Refit without TEAM_BATTING_HR_tr so the model can be applied to the test data:
# the test set did not go through the same cleaning steps as the training set,
# so it has no TEAM_BATTING_HR_tr column.
model.final.again <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                          TEAM_BATTING_BB + TEAM_PITCHING_HR + TEAM_FIELDING_E,
                        data = train_t3)
summary(model.final.again)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_PITCHING_HR + TEAM_FIELDING_E, 
##     data = train_t3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.847  -8.842   0.127   8.707  63.502 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.741473   3.463381   2.524  0.01167 *  
## TEAM_BATTING_H    0.045499   0.003159  14.402  < 2e-16 ***
## TEAM_BATTING_2B  -0.023334   0.008996  -2.594  0.00956 ** 
## TEAM_BATTING_3B   0.109297   0.015636   6.990 3.60e-12 ***
## TEAM_BATTING_BB   0.012734   0.003185   3.998 6.58e-05 ***
## TEAM_PITCHING_HR  0.029989   0.006926   4.330 1.55e-05 ***
## TEAM_FIELDING_E  -0.019339   0.001918 -10.083  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.53 on 2269 degrees of freedom
## Multiple R-squared:  0.2644, Adjusted R-squared:  0.2625 
## F-statistic:   136 on 6 and 2269 DF,  p-value: < 2.2e-16
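Since the assignment asks about multicollinearity, one way to check it without extra packages is the variance inflation factor, VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the other predictors. The helper vif_base below is hypothetical and is shown on the built-in mtcars data only as a sketch; for this report, model.final.again would be passed in instead.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R2_j), where R2_j comes from
# regressing predictor j on all the other predictors in the model
vif_base <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on mtcars; for the report, pass model.final.again instead
fit  <- lm(mpg ~ wt + hp + disp, data = mtcars)
vifs <- vif_base(fit)
vifs
```

A common rule of thumb treats values above roughly 5 to 10 as a sign that a coefficient's sign and size should be interpreted with caution.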
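### Make Predictions
# The earlier version of this call passed a stray third positional argument (30),
# which predict() for lm fits matches to se.fit; with it removed, predict()
# returns a plain vector with one prediction per test row.
testing <- test
testing$Predicted_Wins <- predict(model.final.again,
                                  newdata = testing[, c("TEAM_BATTING_H", "TEAM_BATTING_2B",
                                                        "TEAM_BATTING_3B", "TEAM_BATTING_BB",
                                                        "TEAM_PITCHING_HR", "TEAM_FIELDING_E")])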