Overview
You will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
Deliverables:
- A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
- Assigned predictions (the number of wins for the team) for the evaluation data set.
- Include your R statistical programming code in an Appendix.
## start w a brand new environment
# rm(list=ls(all.names=TRUE))
## detect all package conflicts
library (conflicted)
Describe the size and the variables in the moneyball training data set. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data and/or Histograms c. Is the data correlated to the target variable (or to other variables?)
d. Are any of the variables missing and need to be imputed “fixed”?
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0
## Median :101.00 Median :509.0 Median : 686.0 Median : 92.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## NA's :18 NA's :13
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.00 Min. :42.00 Min. : 1155 Min. : 0.0
## 1st Qu.: 38.00 1st Qu.:53.50 1st Qu.: 1426 1st Qu.: 52.0
## Median : 49.50 Median :62.00 Median : 1515 Median :104.0
## Mean : 52.32 Mean :62.37 Mean : 1813 Mean :102.1
## 3rd Qu.: 63.00 3rd Qu.:67.50 3rd Qu.: 1681 3rd Qu.:142.5
## Max. :154.00 Max. :96.00 Max. :22768 Max. :336.0
## NA's :87 NA's :240
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 136.0 Min. : 0.0 Min. : 73.0 Min. : 69.0
## 1st Qu.: 471.0 1st Qu.: 613.0 1st Qu.: 131.0 1st Qu.:131.0
## Median : 526.0 Median : 745.0 Median : 163.0 Median :148.0
## Mean : 552.4 Mean : 799.7 Mean : 249.7 Mean :146.1
## 3rd Qu.: 606.5 3rd Qu.: 938.0 3rd Qu.: 252.0 3rd Qu.:164.0
## Max. :2008.0 Max. :9963.0 Max. :1568.0 Max. :204.0
## NA's :18 NA's :31
namediff_trainvstest <- setdiff (colnames(train), colnames(test))
print (namediff_trainvstest)
## [1] "TARGET_WINS"
Check missing data
vis_miss(train)
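# with na.rm = FALSE, psych::describe() drops every row that has a missing value,
# so the statistics below cover only the 191 complete cases
psych::describe(train, na.rm = FALSE)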
## vars n mean sd median trimmed mad min max
## INDEX 1 191 1383.59 765.24 1380 1408.07 968.14 41 2534
## TARGET_WINS 2 191 80.93 12.12 82 81.12 13.34 43 116
## TEAM_BATTING_H 3 191 1478.63 76.15 1477 1477.42 74.13 1308 1667
## TEAM_BATTING_2B 4 191 297.20 26.33 296 296.63 25.20 201 373
## TEAM_BATTING_3B 5 191 30.74 9.04 29 30.13 8.90 12 61
## TEAM_BATTING_HR 6 191 178.05 32.41 175 176.81 35.58 116 260
## TEAM_BATTING_BB 7 191 543.32 74.84 535 541.31 74.13 365 775
## TEAM_BATTING_SO 8 191 1051.03 104.16 1050 1046.95 97.85 805 1399
## TEAM_BASERUN_SB 9 191 90.91 29.92 87 89.07 29.65 31 177
## TEAM_BASERUN_CS 10 191 39.94 11.90 38 39.49 11.86 12 74
## TEAM_BATTING_HBP 11 191 59.36 12.97 58 58.86 11.86 29 95
## TEAM_PITCHING_H 12 191 1479.70 75.79 1480 1478.50 72.65 1312 1667
## TEAM_PITCHING_HR 13 191 178.18 32.39 175 176.93 35.58 116 260
## TEAM_PITCHING_BB 14 191 543.72 74.92 537 541.75 72.65 367 775
## TEAM_PITCHING_SO 15 191 1051.82 104.35 1052 1047.80 97.85 805 1399
## TEAM_FIELDING_E 16 191 107.05 16.63 106 106.58 17.79 65 145
## TEAM_FIELDING_DP 17 191 152.34 17.61 152 152.05 19.27 113 204
## range skew kurtosis se
## INDEX 2493 -0.13 -1.14 55.37
## TARGET_WINS 73 -0.17 -0.30 0.88
## TEAM_BATTING_H 359 0.13 -0.37 5.51
## TEAM_BATTING_2B 172 0.09 0.48 1.91
## TEAM_BATTING_3B 49 0.70 0.74 0.65
## TEAM_BATTING_HR 144 0.30 -0.72 2.35
## TEAM_BATTING_BB 410 0.31 -0.15 5.42
## TEAM_BATTING_SO 594 0.40 0.40 7.54
## TEAM_BASERUN_SB 146 0.56 -0.14 2.16
## TEAM_BASERUN_CS 62 0.35 0.00 0.86
## TEAM_BATTING_HBP 66 0.32 -0.11 0.94
## TEAM_PITCHING_H 355 0.13 -0.39 5.48
## TEAM_PITCHING_HR 144 0.30 -0.72 2.34
## TEAM_PITCHING_BB 408 0.31 -0.13 5.42
## TEAM_PITCHING_SO 594 0.39 0.39 7.55
## TEAM_FIELDING_E 80 0.18 -0.36 1.20
## TEAM_FIELDING_DP 91 0.22 -0.21 1.27
The data come in two CSV files: one contains the training data, the other the evaluation (test) data. The mean, standard deviation, minimum, and maximum of each variable are shown above; note that the psych::describe table covers only the 191 complete cases. Comparing the two data sets, the training data contain 2276 observations and 17 variables, while the test data contain 259 observations and 16 variables. The columns are identical except that the test data lack TARGET_WINS, which is the outcome variable. There is one INDEX variable, which is an identifier and will be dropped from the analysis. The remaining 15 predictors fall into four categories: batting (7 columns), base running (2 columns), fielding (2 columns), and pitching (4 columns). Regarding missing values, in the training data TEAM_BATTING_HBP is missing in 2085 of the 2276 rows and TEAM_BASERUN_CS in 772 rows (240 and 87 rows, respectively, in the test data). Because of these large proportions, we will exclude both variables from the analysis. The other variables with missing values in the training data are TEAM_BATTING_SO (102), TEAM_BASERUN_SB (131), TEAM_PITCHING_SO (102), and TEAM_FIELDING_DP (286); in the test data each of these has between 13 and 31 missing values. For these variables, we will impute the missing values with the column median.
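The missing-value counts discussed above can be tabulated directly rather than read off the summary() output. The following is a minimal sketch on a small made-up data frame (the values in `toy` are hypothetical, not taken from the real file); only base R is assumed.

```r
# Toy stand-in for the training data (hypothetical values, not the real file)
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 65, 77),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, NA),
  TEAM_BATTING_HBP = c(NA, NA, NA, 55, NA)
)

# Number and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- round(colMeans(is.na(toy)), 2)

# One row per variable, most incomplete first
missing_report <- data.frame(na_count, na_share)
print(missing_report[order(-missing_report$na_count), ])
```

Running the same two lines on the real training and test data frames would give the per-column counts side by side, which makes it easier to keep the two data sets apart when deciding what to drop and what to impute.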
Plotting the outcome variable shows a distribution that is close to normal with a slight skew: the mean is 80.79 and the median is 82, with values ranging from 0 to 146 (interquartile range 71 to 92). In the analysis below we do not transform this variable; we analyze it on its original scale.
train_melted <- melt(as.data.frame(train))
## No id variables; using all as measure variables
# 2 VR-Index, Value
ggplot(train_melted,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()
## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).
training_corr_plot <- train[ , !(names(train) %in% c( "INDEX" ))]
training_corr_plot <- training_corr_plot[complete.cases(training_corr_plot), ]
MatrixCorrelation <- cor(training_corr_plot)
corrplot(MatrixCorrelation, method = "ellipse")
# print( MatrixCorrelation)
## buggy in knit function, but no problem in regular viewing
# colnames(train)
Plotting each predictor shows that some of the pitching variables, notably TEAM_PITCHING_H (maximum 30132) and TEAM_PITCHING_SO (maximum 19278), span a very wide range with extreme values, while the remaining predictors have comparatively narrow ranges. The correlation matrix shows that the variables positively correlated with the outcome include: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_HBP, TEAM_BASERUN_SB, TEAM_FIELDING_DP, TEAM_PITCHING_SO.
The variables negatively correlated with the outcome include: TEAM_BATTING_SO, TEAM_BASERUN_CS, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, TEAM_PITCHING_HR. Regarding collinearity, TEAM_BATTING_HR and TEAM_PITCHING_HR are strongly correlated with each other. We address this by log-transforming TEAM_BATTING_HR, which reduces the influence of its extreme values, and by using backward selection, which can drop one of a pair of redundant predictors and so mitigates some of the collinearity. Among the remaining predictors, there does not appear to be an association strong enough to affect the interpretation of the results.
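Strongly correlated predictor pairs can also be listed numerically instead of being read off the correlation plot. The sketch below uses the built-in mtcars data as a stand-in for the cleaned training data, and the 0.8 cutoff is an assumed threshold, not one taken from this analysis.

```r
# mtcars ships with R; it stands in for the cleaned training data here
X  <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
cm <- cor(X, use = "complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the cutoff
cutoff <- 0.8   # assumed threshold, adjust to taste
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Applied to the training predictors, a table like this would show whether TEAM_BATTING_HR and TEAM_PITCHING_HR are the only pair above the chosen cutoff.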
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations. a. Fix missing values (maybe with a Mean or Median value) b. Create flags to suggest if a variable was missing c. Transform data by putting it into buckets d. Mathematical transforms such as log or square root e. Combine variables (such as ratios or adding or multiplying) to create new variables
train1 <- train[, -c(1)]
glimpse((train1))
## Observations: 2,276
## Variables: 16
## $ TARGET_WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 6...
## $ TEAM_BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 127...
## $ TEAM_BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, ...
## $ TEAM_BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 3...
## $ TEAM_BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96,...
## $ TEAM_BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, ...
## $ TEAM_BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 92...
## $ TEAM_BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, ...
## $ TEAM_BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 7...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 128...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96,...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 9...
## $ TEAM_FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 1...
The index column was removed.
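# myvars is a logical mask over names(train1): TRUE for the two sparse columns
myvars <- names (train1) %in% c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS") ## add INDEX will be BUGGY
# Caution: names(train1) %in% myvars compares column names with TRUE/FALSE, so
# nothing matches and no column is dropped; train1[ , !myvars] would drop them
train_original <- train1[ , !(names(train1) %in% myvars )]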
head(train_original)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 39 1445 194 39
## 2 70 1339 219 22
## 3 86 1377 232 35
## 4 70 1387 209 38
## 5 82 1297 186 27
## 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
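The INDEX variable was removed from the training data. We also intended to exclude the two variables with a disproportionately large share of missing values, TEAM_BATTING_HBP and TEAM_BASERUN_CS. As the head() output above shows, however, both columns are still present in train_original, because the filter compares the column names against a logical vector and therefore matches nothing. For the imputed data set, missing values are replaced with the column median; since the two columns were not dropped, they are median-imputed as well.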
# Same filter as above: myvars is a logical mask, so no column is actually dropped
myvars <- names (train1) %in% c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS") ## add INDEX will be BUGGY
train_imputed <- train1[ , !(names(train1) %in% myvars )]
# Replace the missing values in every column with that column's median
for (i in seq_len(ncol(train_imputed))) {
  col_median <- median(train_imputed[[i]], na.rm = TRUE)
  train_imputed[is.na(train_imputed[[i]]), i] <- col_median
}
print(summary(train_imputed))
head(train_imputed)
# is.na(train_imputed)
## double checked, every missing value is imputed already
All missing values are now imputed.
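Dropping columns by name and imputing the rest with the median can be written so that each step is easy to check. This is a minimal sketch on a made-up data frame: `toy` and its values are hypothetical, and `impute_median` is an illustrative helper, not a function used elsewhere in this report.

```r
# Toy stand-in for train1 (hypothetical values)
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 65),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43),
  TEAM_BASERUN_CS  = c(NA, 28, NA, 30),
  TEAM_BATTING_HBP = c(NA, NA, NA, 55)
)

# Drop columns by name; setdiff() keeps everything not listed
drop_cols <- c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS")
kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]

# Illustrative helper: replace NAs in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

imputed <- as.data.frame(lapply(kept, impute_median))
stopifnot(!anyNA(imputed))   # fails loudly if any NA survives
print(imputed)
```

Selecting by name with setdiff() avoids the logical-mask pitfall, and the anyNA() check replaces a manual inspection of is.na().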
# ## BUGGY, it runs fine in regular session, but buggy in knitting, object "train_imputed" not found
train_t <- train_imputed
train_t$TEAM_BATTING_HR_tr <- log(train_t$TEAM_BATTING_HR +1)
train_t$TEAM_BATTING_SO_tr <- log(train_t$TEAM_BATTING_SO +1)
train_t$TEAM_PITCHING_SO_tr <- log(train_t$TEAM_PITCHING_SO +1)
test_t<- test
test_t$TEAM_BATTING_HR_tr <- log(test_t$TEAM_BATTING_HR +1)
test_t$TEAM_BATTING_SO_tr <- log(test_t$TEAM_BATTING_SO +1)
test_t$TEAM_PITCHING_SO_tr <- log(test_t$TEAM_PITCHING_SO +1)
colnames((train_t))
## [1] "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [4] "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB"
## [7] "TEAM_BATTING_SO" "TEAM_BASERUN_SB" "TEAM_BASERUN_CS"
## [10] "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [16] "TEAM_FIELDING_DP" "TEAM_BATTING_HR_tr" "TEAM_BATTING_SO_tr"
## [19] "TEAM_PITCHING_SO_tr"
# drop the original, untransformed variables from the training data set
train_t3<- train_t[ ,-c(5,7,14)]
colnames (train_t3)
## [1] "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [4] "TEAM_BATTING_3B" "TEAM_BATTING_BB" "TEAM_BASERUN_SB"
## [7] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [10] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_FIELDING_E"
## [13] "TEAM_FIELDING_DP" "TEAM_BATTING_HR_tr" "TEAM_BATTING_SO_tr"
## [16] "TEAM_PITCHING_SO_tr"
colnames(test_t)
## [1] "INDEX" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [4] "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB"
## [7] "TEAM_BATTING_SO" "TEAM_BASERUN_SB" "TEAM_BASERUN_CS"
## [10] "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [16] "TEAM_FIELDING_DP" "TEAM_BATTING_HR_tr" "TEAM_BATTING_SO_tr"
## [19] "TEAM_PITCHING_SO_tr"
# drop the original, untransformed variables from the test data set as well
test_t3<- test_t[ ,-c(5,7,14)]
colnames (test_t3)
## [1] "INDEX" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [4] "TEAM_BATTING_3B" "TEAM_BATTING_BB" "TEAM_BASERUN_SB"
## [7] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [10] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_FIELDING_E"
## [13] "TEAM_FIELDING_DP" "TEAM_BATTING_HR_tr" "TEAM_BATTING_SO_tr"
## [16] "TEAM_PITCHING_SO_tr"
Checking each predictor individually showed that three variables are not normally distributed, and two of them are collinear. We log-transformed each of these variables (TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_SO), using log(x + 1) so that zero counts remain defined. The same transformation was applied to both the training and the test data. The final training data set is train_t3, in which all missing values have been imputed.
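Because the training data carry TARGET_WINS and the test data carry INDEX, dropping columns by position is fragile; applying the transformation by column name keeps the two data sets in step. The sketch below is one way to do that, under assumptions: `add_log_features` is a hypothetical helper, and the toy values are made up.

```r
log_cols <- c("TEAM_BATTING_HR", "TEAM_BATTING_SO", "TEAM_PITCHING_SO")

# Hypothetical helper: add <name>_tr = log(x + 1) and drop the raw column
add_log_features <- function(df, cols) {
  for (col in cols) {
    df[[paste0(col, "_tr")]] <- log1p(df[[col]])  # log1p(x) is log(x + 1)
    df[[col]] <- NULL
  }
  df
}

# Toy stand-ins for the imputed training data and the test data
toy_train <- data.frame(
  TARGET_WINS      = c(70, 82),
  TEAM_BATTING_H   = c(1445, 1339),
  TEAM_BATTING_HR  = c(0, 190),
  TEAM_BATTING_SO  = c(842, 1075),
  TEAM_PITCHING_SO = c(5456, 1082)
)
toy_test <- toy_train[, setdiff(names(toy_train), "TARGET_WINS")]

train_tr <- add_log_features(toy_train, log_cols)
test_tr  <- add_log_features(toy_test, log_cols)
print(names(train_tr))
print(names(test_tr))
```

Using one function for both data sets guarantees that the model and the data it later predicts on see identically named and identically transformed columns.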
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
full.model.imputed <- lm (TARGET_WINS ~ . , data=train_t3)
reduced.full.model.imputed<- step (full.model.imputed, direction = "backward")
## Start: AIC=11701.87
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E +
## TEAM_FIELDING_DP + TEAM_BATTING_HR_tr + TEAM_BATTING_SO_tr +
## TEAM_PITCHING_SO_tr
##
## Df Sum of Sq RSS AIC
## - TEAM_PITCHING_BB 1 3.1 383672 11700
## - TEAM_BATTING_HBP 1 66.0 383734 11700
## - TEAM_BATTING_HR_tr 1 143.5 383812 11701
## - TEAM_BATTING_2B 1 181.4 383850 11701
## - TEAM_BASERUN_CS 1 271.1 383940 11702
## <none> 383668 11702
## - TEAM_BATTING_BB 1 1361.9 385030 11708
## - TEAM_PITCHING_SO_tr 1 2110.9 385779 11712
## - TEAM_BATTING_SO_tr 1 2841.6 386510 11717
## - TEAM_PITCHING_H 1 2958.6 386627 11717
## - TEAM_BATTING_3B 1 3189.7 386858 11719
## - TEAM_PITCHING_HR 1 4341.1 388010 11726
## - TEAM_BASERUN_SB 1 7530.3 391199 11744
## - TEAM_FIELDING_DP 1 12522.7 396191 11773
## - TEAM_FIELDING_E 1 16884.9 400553 11798
## - TEAM_BATTING_H 1 22867.1 406536 11832
##
## Step: AIC=11699.89
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP +
## TEAM_BATTING_HR_tr + TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_HBP 1 66.0 383738 11698
## - TEAM_BATTING_HR_tr 1 142.8 383814 11699
## - TEAM_BATTING_2B 1 194.3 383866 11699
## - TEAM_BASERUN_CS 1 270.0 383942 11700
## <none> 383672 11700
## - TEAM_PITCHING_H 1 3175.6 386847 11717
## - TEAM_BATTING_3B 1 3211.8 386883 11717
## - TEAM_PITCHING_SO_tr 1 3272.5 386944 11717
## - TEAM_BATTING_BB 1 3328.2 387000 11718
## - TEAM_BATTING_SO_tr 1 4166.2 387838 11722
## - TEAM_PITCHING_HR 1 4340.9 388012 11724
## - TEAM_BASERUN_SB 1 8066.3 391738 11745
## - TEAM_FIELDING_DP 1 12656.2 396328 11772
## - TEAM_FIELDING_E 1 18497.3 402169 11805
## - TEAM_BATTING_H 1 25614.6 409286 11845
##
## Step: AIC=11698.28
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H +
## TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_HR_tr +
## TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_HR_tr 1 148.3 383886 11697
## - TEAM_BATTING_2B 1 187.1 383925 11697
## - TEAM_BASERUN_CS 1 274.8 384012 11698
## <none> 383738 11698
## - TEAM_PITCHING_H 1 3175.8 386913 11715
## - TEAM_BATTING_3B 1 3199.0 386937 11715
## - TEAM_PITCHING_SO_tr 1 3253.2 386991 11716
## - TEAM_BATTING_BB 1 3331.6 387069 11716
## - TEAM_BATTING_SO_tr 1 4143.6 387881 11721
## - TEAM_PITCHING_HR 1 4384.1 388122 11722
## - TEAM_BASERUN_SB 1 8063.1 391801 11744
## - TEAM_FIELDING_DP 1 12702.9 396440 11770
## - TEAM_FIELDING_E 1 18456.9 402194 11803
## - TEAM_BATTING_H 1 25606.2 409344 11843
##
## Step: AIC=11697.16
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H +
## TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr +
## TEAM_PITCHING_SO_tr
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_2B 1 180.6 384066 11696
## - TEAM_BASERUN_CS 1 288.1 384174 11697
## <none> 383886 11697
## - TEAM_PITCHING_H 1 3067.9 386954 11713
## - TEAM_BATTING_3B 1 3143.9 387030 11714
## - TEAM_BATTING_BB 1 3281.1 387167 11714
## - TEAM_PITCHING_SO_tr 1 4090.8 387977 11719
## - TEAM_BATTING_SO_tr 1 5234.2 389120 11726
## - TEAM_BASERUN_SB 1 8497.7 392384 11745
## - TEAM_PITCHING_HR 1 9184.6 393070 11749
## - TEAM_FIELDING_DP 1 14264.8 398151 11778
## - TEAM_FIELDING_E 1 18308.6 402194 11801
## - TEAM_BATTING_H 1 26014.9 409901 11844
##
## Step: AIC=11696.23
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr +
## TEAM_PITCHING_SO_tr
##
## Df Sum of Sq RSS AIC
## - TEAM_BASERUN_CS 1 308 384375 11696
## <none> 384066 11696
## - TEAM_BATTING_BB 1 3233 387300 11713
## - TEAM_PITCHING_H 1 3349 387415 11714
## - TEAM_BATTING_3B 1 3431 387498 11714
## - TEAM_PITCHING_SO_tr 1 4360 388426 11720
## - TEAM_BATTING_SO_tr 1 5639 389705 11727
## - TEAM_BASERUN_SB 1 8814 392881 11746
## - TEAM_PITCHING_HR 1 9009 393075 11747
## - TEAM_FIELDING_DP 1 14265 398331 11777
## - TEAM_FIELDING_E 1 18177 402243 11800
## - TEAM_BATTING_H 1 38474 422540 11912
##
## Step: AIC=11696.06
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E +
## TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
##
## Df Sum of Sq RSS AIC
## <none> 384375 11696
## - TEAM_PITCHING_H 1 3388 387763 11714
## - TEAM_BATTING_BB 1 3459 387834 11714
## - TEAM_BATTING_3B 1 3468 387843 11714
## - TEAM_PITCHING_SO_tr 1 4276 388651 11719
## - TEAM_BATTING_SO_tr 1 5530 389905 11727
## - TEAM_BASERUN_SB 1 8520 392894 11744
## - TEAM_PITCHING_HR 1 9894 394269 11752
## - TEAM_FIELDING_DP 1 14339 398713 11777
## - TEAM_FIELDING_E 1 17987 402362 11798
## - TEAM_BATTING_H 1 38278 422653 11910
summary(reduced.full.model.imputed)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr +
## TEAM_PITCHING_SO_tr, data = train_t3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.487 -8.516 0.193 8.346 62.838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.846e+01 7.110e+00 8.222 3.33e-16 ***
## TEAM_BATTING_H 3.935e-02 2.620e-03 15.019 < 2e-16 ***
## TEAM_BATTING_3B 7.164e-02 1.585e-02 4.521 6.48e-06 ***
## TEAM_BATTING_BB 1.497e-02 3.315e-03 4.515 6.66e-06 ***
## TEAM_BASERUN_SB 2.899e-02 4.092e-03 7.086 1.84e-12 ***
## TEAM_PITCHING_H -1.708e-03 3.824e-04 -4.468 8.27e-06 ***
## TEAM_PITCHING_HR 5.374e-02 7.038e-03 7.636 3.29e-14 ***
## TEAM_FIELDING_E -2.944e-02 2.859e-03 -10.295 < 2e-16 ***
## TEAM_FIELDING_DP -1.179e-01 1.283e-02 -9.192 < 2e-16 ***
## TEAM_BATTING_SO_tr -1.739e+01 3.046e+00 -5.709 1.29e-08 ***
## TEAM_PITCHING_SO_tr 1.278e+01 2.546e+00 5.020 5.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 2265 degrees of freedom
## Multiple R-squared: 0.3191, Adjusted R-squared: 0.3161
## F-statistic: 106.1 on 10 and 2265 DF, p-value: < 2.2e-16
summary(reduced.full.model.imputed)$r.squared
## [1] 0.3190839
summary(reduced.full.model.imputed)$adj.r.squared
## [1] 0.3160776
Backward elimination with step() started from the full model with 15 predictors fit to the cleaned training data. In each round, the model is refit with each remaining predictor removed in turn and the resulting AIC values are compared; the predictor whose removal gives the lowest AIC is dropped, and a new round starts from the reduced model. The process stops when no removal lowers the AIC further. Here, backward selection removed these variables, in order: TEAM_PITCHING_BB, TEAM_BATTING_HBP, TEAM_BATTING_HR_tr, TEAM_BATTING_2B, and TEAM_BASERUN_CS.
The final model from the backward selection process is: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_SO_tr + TEAM_PITCHING_SO_tr
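The AIC values printed by step() for a linear model come from extractAIC(), which computes n * log(RSS / n) + 2 * p, where p counts the estimated coefficients including the intercept. The sketch below reproduces that value by hand on the built-in mtcars data (a stand-in, not the baseball data) and then applies the same formula to the RSS reported for the final model above, which comes out close to the 11696 shown in the last step.

```r
# mtcars ships with R and stands in for the training data in this sketch
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nobs(fit)                 # observations used in the fit
rss <- sum(residuals(fit)^2)     # residual sum of squares
p   <- length(coef(fit))         # estimated coefficients, intercept included

aic_by_hand <- n * log(rss / n) + 2 * p
aic_step    <- extractAIC(fit)[2]   # extractAIC() returns c(edf, AIC)
stopifnot(isTRUE(all.equal(aic_by_hand, aic_step)))

# One round of backward elimination: AIC after dropping each term in turn
print(drop1(fit))

# Same formula with the numbers reported for the final model above
# (RSS = 384375, n = 2276, 10 predictors plus the intercept)
aic_final <- 2276 * log(384375 / 2276) + 2 * 11
print(round(aic_final, 1))
```

Note that this scale differs from AIC() by an additive constant, so values from step() should only be compared with each other, and only between models fit to the same rows.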
In comparison to the backward elimination on the “cleaned training data”, I also performed the same analysis on the “original training data”.
full.model.originaldata <- lm (TARGET_WINS ~ . , data=train_original)
reduced.full.model.originaldata<- step (full.model.originaldata, direction = "backward")
## Start: AIC=831.31
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_SO 1 1.24 12547 829.33
## - TEAM_PITCHING_SO 1 1.48 12547 829.33
## - TEAM_BASERUN_CS 1 1.71 12548 829.34
## - TEAM_BATTING_HR 1 15.23 12561 829.54
## - TEAM_PITCHING_HR 1 15.79 12562 829.55
## - TEAM_PITCHING_H 1 33.63 12580 829.82
## - TEAM_BATTING_H 1 34.42 12580 829.83
## - TEAM_BATTING_2B 1 54.41 12600 830.14
## - TEAM_BASERUN_SB 1 95.22 12641 830.76
## - TEAM_BATTING_BB 1 107.84 12654 830.95
## - TEAM_PITCHING_BB 1 110.48 12656 830.99
## - TEAM_BATTING_3B 1 122.16 12668 831.16
## <none> 12546 831.31
## - TEAM_BATTING_HBP 1 198.21 12744 832.31
## - TEAM_FIELDING_DP 1 628.49 13174 838.65
## - TEAM_FIELDING_E 1 1237.79 13784 847.28
##
## Step: AIC=829.33
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS +
## TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BASERUN_CS 1 1.59 12549 827.35
## - TEAM_BATTING_HR 1 15.82 12563 827.57
## - TEAM_PITCHING_HR 1 16.39 12564 827.58
## - TEAM_BATTING_2B 1 53.47 12601 828.14
## - TEAM_PITCHING_H 1 88.45 12636 828.67
## - TEAM_BATTING_H 1 90.30 12637 828.70
## - TEAM_BASERUN_SB 1 94.19 12641 828.76
## - TEAM_BATTING_BB 1 107.95 12655 828.97
## - TEAM_PITCHING_BB 1 110.60 12658 829.01
## - TEAM_BATTING_3B 1 122.20 12669 829.18
## <none> 12547 829.33
## - TEAM_BATTING_HBP 1 197.11 12744 830.31
## - TEAM_FIELDING_DP 1 630.68 13178 836.70
## - TEAM_FIELDING_E 1 1240.80 13788 845.34
## - TEAM_PITCHING_SO 1 1312.89 13860 846.34
##
## Step: AIC=827.35
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_HR 1 16.06 12565 825.60
## - TEAM_PITCHING_HR 1 16.64 12565 825.61
## - TEAM_BATTING_2B 1 53.05 12602 826.16
## - TEAM_PITCHING_H 1 90.24 12639 826.72
## - TEAM_BATTING_H 1 92.13 12641 826.75
## - TEAM_BATTING_BB 1 110.31 12659 827.03
## - TEAM_PITCHING_BB 1 113.00 12662 827.07
## - TEAM_BASERUN_SB 1 123.42 12672 827.22
## - TEAM_BATTING_3B 1 129.33 12678 827.31
## <none> 12549 827.35
## - TEAM_BATTING_HBP 1 197.23 12746 828.33
## - TEAM_FIELDING_DP 1 635.62 13184 834.79
## - TEAM_PITCHING_SO 1 1311.88 13861 844.35
## - TEAM_FIELDING_E 1 1322.05 13871 844.49
##
## Step: AIC=825.6
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_2B 1 55.48 12620 824.44
## - TEAM_PITCHING_H 1 89.26 12654 824.95
## - TEAM_BATTING_H 1 91.97 12657 824.99
## - TEAM_BATTING_BB 1 104.58 12669 825.18
## - TEAM_PITCHING_BB 1 107.19 12672 825.22
## <none> 12565 825.60
## - TEAM_BATTING_3B 1 137.48 12702 825.68
## - TEAM_BASERUN_SB 1 146.90 12712 825.82
## - TEAM_BATTING_HBP 1 200.36 12765 826.62
## - TEAM_FIELDING_DP 1 628.95 13194 832.93
## - TEAM_PITCHING_HR 1 853.54 13418 836.15
## - TEAM_PITCHING_SO 1 1316.68 13882 842.63
## - TEAM_FIELDING_E 1 1333.15 13898 842.86
##
## Step: AIC=824.44
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_PITCHING_H 1 84.47 12705 823.71
## - TEAM_BATTING_H 1 87.79 12708 823.76
## - TEAM_BATTING_BB 1 98.92 12719 823.93
## - TEAM_PITCHING_BB 1 101.48 12722 823.97
## - TEAM_BASERUN_SB 1 109.27 12730 824.09
## <none> 12620 824.44
## - TEAM_BATTING_3B 1 147.01 12767 824.65
## - TEAM_BATTING_HBP 1 204.39 12825 825.51
## - TEAM_FIELDING_DP 1 649.12 13269 832.02
## - TEAM_PITCHING_HR 1 812.92 13433 834.36
## - TEAM_PITCHING_SO 1 1262.90 13883 840.66
## - TEAM_FIELDING_E 1 1379.34 14000 842.25
##
## Step: AIC=823.71
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_BB 1 32.85 12738 822.21
## - TEAM_PITCHING_BB 1 43.42 12748 822.37
## - TEAM_BASERUN_SB 1 105.16 12810 823.29
## <none> 12705 823.71
## - TEAM_BATTING_3B 1 153.13 12858 824.00
## - TEAM_BATTING_HBP 1 183.82 12888 824.46
## - TEAM_BATTING_H 1 504.11 13209 829.15
## - TEAM_FIELDING_DP 1 602.80 13308 830.57
## - TEAM_PITCHING_HR 1 850.25 13555 834.09
## - TEAM_PITCHING_SO 1 1259.72 13964 839.77
## - TEAM_FIELDING_E 1 1419.39 14124 841.94
##
## Step: AIC=822.21
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BASERUN_SB +
## TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BASERUN_SB 1 109.99 12848 821.85
## <none> 12738 822.21
## - TEAM_BATTING_3B 1 156.45 12894 822.54
## - TEAM_BATTING_HBP 1 186.58 12924 822.98
## - TEAM_BATTING_H 1 485.67 13223 827.35
## - TEAM_FIELDING_DP 1 623.19 13361 829.33
## - TEAM_PITCHING_HR 1 843.83 13581 832.46
## - TEAM_PITCHING_SO 1 1267.25 14005 838.32
## - TEAM_FIELDING_E 1 1395.02 14133 840.06
## - TEAM_PITCHING_BB 1 2364.81 15102 852.73
##
## Step: AIC=821.85
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_3B 1 133.47 12981 821.82
## <none> 12848 821.85
## - TEAM_BATTING_HBP 1 177.11 13025 822.46
## - TEAM_BATTING_H 1 566.11 13414 828.09
## - TEAM_FIELDING_DP 1 737.46 13585 830.51
## - TEAM_PITCHING_HR 1 756.49 13604 830.78
## - TEAM_PITCHING_SO 1 1257.91 14106 837.69
## - TEAM_FIELDING_E 1 1330.40 14178 838.67
## - TEAM_PITCHING_BB 1 2371.12 15219 852.20
##
## Step: AIC=821.82
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## <none> 12981 821.82
## - TEAM_BATTING_HBP 1 228.70 13210 823.16
## - TEAM_BATTING_H 1 449.87 13431 826.33
## - TEAM_FIELDING_DP 1 813.17 13794 831.43
## - TEAM_PITCHING_HR 1 990.20 13971 833.86
## - TEAM_PITCHING_SO 1 1316.56 14298 838.27
## - TEAM_FIELDING_E 1 1334.60 14316 838.52
## - TEAM_PITCHING_BB 1 2583.00 15564 854.49
summary(reduced.full.model.originaldata)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_original)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2248 -5.6294 -0.0212 5.0439 21.3065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.95454 19.10292 3.191 0.001670 **
## TEAM_BATTING_H 0.02541 0.01009 2.518 0.012648 *
## TEAM_BATTING_HBP 0.08712 0.04852 1.796 0.074211 .
## TEAM_PITCHING_HR 0.08945 0.02394 3.736 0.000249 ***
## TEAM_PITCHING_BB 0.05672 0.00940 6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136 0.00728 -4.308 2.68e-05 ***
## TEAM_FIELDING_E -0.17218 0.03970 -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904 0.03516 -3.386 0.000869 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.422 on 183 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5167
## F-statistic: 30.02 on 7 and 183 DF, p-value: < 2.2e-16
summary(reduced.full.model.originaldata)$r.squared
## [1] 0.5345121
summary(reduced.full.model.originaldata)$adj.r.squared
## [1] 0.5167065
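The backward elimination above retains the following model: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP. Note that this fit uses only 191 complete rows (2085 observations were deleted due to missingness), so its R-squared of 0.5345 is not directly comparable with models fitted on the full training set.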
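The row loss reported in the summary above can be reproduced on a small scale. The following is a minimal sketch on simulated data (the `toy` data frame and its columns are hypothetical, not the baseball data): it shows how a single sparsely populated predictor makes `lm()` drop most rows, and one way to run backward elimination on an explicit complete-case subset so the number of rows is known up front.

```r
# Simulated stand-in data; x3 plays the role of a mostly-missing column
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n)
toy$x3[1:150] <- NA

# lm() silently drops every row with an NA in any model variable
full <- lm(y ~ x1 + x2 + x3, data = toy)
rows_used <- nobs(full)   # 50 of the 200 rows remain

# Make the complete-case subset explicit before running step()
complete <- na.omit(toy)
full_cc <- lm(y ~ ., data = complete)
reduced <- step(full_cc, direction = "backward", trace = 0)
formula(reduced)
```

Comparing `rows_used` with `nrow(toy)` before interpreting R-squared is a cheap check that the model was fitted on the data one expects.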
MODEL: As an alternative, I also fit a model that uses the 7 columns most highly correlated with TARGET_WINS as features.
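# correlation of every column with the target (TARGET_WINS correlates 1 with itself)
cors <- sapply(train_t3, cor, y = train_t3$TARGET_WINS)
# keep the 8 columns with the highest |correlation|: TARGET_WINS plus the top 7 predictors
mask <- (rank(-abs(cors)) <= 8)
best7.pred <- train_t3[, mask]
# despite its name, best5.pred holds the 7 predictors without the target
best5.pred <- subset(best7.pred, select = -TARGET_WINS)
summary(best7.pred)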
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_BB TEAM_PITCHING_HR TEAM_FIELDING_E TEAM_BATTING_HR_tr
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. :0.000
## 1st Qu.:451.0 1st Qu.: 50.0 1st Qu.: 127.0 1st Qu.:3.761
## Median :512.0 Median :107.0 Median : 159.0 Median :4.635
## Mean :501.6 Mean :105.7 Mean : 246.5 Mean :4.325
## 3rd Qu.:580.0 3rd Qu.:150.0 3rd Qu.: 249.2 3rd Qu.:4.997
## Max. :878.0 Max. :343.0 Max. :1898.0 Max. :5.580
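# best7.pred is a data frame, not a fitted model, so its summary() has no
# r.squared or adj.r.squared; the calls below fail with
# "$ operator is invalid for atomic vectors"
# summary(best7.pred)$r.squared
# summary(best7.pred)$adj.r.squared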
The correlation ranking above selects the following top 7 predictors: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_FIELDING_E, and TEAM_BATTING_HR_tr (log transformation).
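The same ranking can be written so that the target is excluded explicitly rather than by asking for one extra column. This is a small sketch on the built-in `mtcars` data (used only as a stand-in, with `mpg` playing the role of TARGET_WINS); the variable names `top_k` and `top` are illustrative.

```r
# Correlation of every column with the response
cors <- sapply(mtcars, cor, y = mtcars$mpg)

# Drop the response itself, then rank the rest by absolute correlation
cors <- cors[names(cors) != "mpg"]
top_k <- 3
top <- names(sort(abs(cors), decreasing = TRUE))[seq_len(top_k)]

# Data frame holding only the selected predictors
best_pred <- mtcars[, top]
```

Removing the response before ranking avoids the off-by-one bookkeeping (asking for 8 columns to get 7 predictors). If any column contains missing values, `cor()` returns NA for it unless a `use` argument such as `"complete.obs"` is supplied.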
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model. For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set.
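As a rough sketch of how the requested metrics can be pulled from a fitted `lm` object, the example below uses the built-in `mtcars` data as a stand-in; the model and variable names are illustrative and not part of the analysis above.

```r
# Stand-in model; in the report this would be the chosen baseball model
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# (a) mean squared error of the training residuals, and its square root
mse <- mean(residuals(fit)^2)
rmse <- sqrt(mse)

# (b) R-squared and adjusted R-squared
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# (c) overall F-statistic with its p-value
f_stat <- unname(s$fstatistic["value"])
f_pval <- unname(pf(s$fstatistic["value"], s$fstatistic["numdf"],
                    s$fstatistic["dendf"], lower.tail = FALSE))

metrics <- c(MSE = mse, RMSE = rmse, R2 = r2, AdjR2 = adj_r2, F = f_stat)
```

For (d), `plot(fitted(fit), residuals(fit))` followed by `abline(h = 0, lty = 2)` gives a basic residuals-versus-fitted plot, and `plot(fit, which = 1:2)` adds a normal Q-Q plot.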
glimpse(test)
## Observations: 259
## Variables: 16
## $ INDEX <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, ...
## $ TEAM_BATTING_H <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 138...
## $ TEAM_BATTING_2B <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, ...
## $ TEAM_BATTING_3B <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 5...
## $ TEAM_BATTING_HR <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 1...
## $ TEAM_BATTING_BB <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 4...
## $ TEAM_BATTING_SO <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689,...
## $ TEAM_BASERUN_SB <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 6...
## $ TEAM_BASERUN_CS <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 2...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 162...
## $ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, ...
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, ...
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734...
## $ TEAM_FIELDING_E <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, ...
## $ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 1...
head(test)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 1 9 1209 170 33 83
## 2 10 1221 151 29 88
## 3 14 1395 183 29 93
## 4 47 1539 309 29 159
## 5 60 1445 203 68 5
## 6 63 1431 236 53 10
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1 447 1080 62 50
## 2 516 929 54 39
## 3 509 816 59 47
## 4 486 914 148 57
## 5 95 416 NA NA
## 6 215 377 NA NA
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1 NA 1209 83 447
## 2 NA 1221 88 516
## 3 NA 1395 93 509
## 4 42 1539 159 486
## 5 NA 3902 14 257
## 6 NA 2793 20 420
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 1080 140 156
## 2 929 135 164
## 3 816 156 153
## 4 914 124 154
## 5 1123 616 130
## 6 736 572 105
Based on the models above, I selected a final model with the following variables: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_BB, TEAM_PITCHING_HR, TEAM_FIELDING_E, and TEAM_BATTING_HR_tr. These are the seven predictors most highly correlated with TARGET_WINS; three of them (TEAM_BATTING_H, TEAM_PITCHING_HR, and TEAM_FIELDING_E) were also retained by backward elimination. Next, I will use this model to make predictions on the test data.
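Because the chosen variables include a derived column (TEAM_BATTING_HR_tr), any data scored with the model needs the same transformation applied first. The sketch below illustrates this on made-up numbers; the data frames, the column names `hr` and `hr_tr`, and the choice of `log1p()` as the transformation are all assumptions for illustration, not the report's actual preprocessing.

```r
# Hypothetical training data with a raw count and a response
train_toy <- data.frame(hr = c(0, 10, 50, 120, 200),
                        wins = c(60, 70, 80, 88, 95))

# Derive the transformed column (log(x + 1) handles zero counts)
train_toy$hr_tr <- log1p(train_toy$hr)
fit <- lm(wins ~ hr_tr, data = train_toy)

# New data only has the raw column, so repeat the same transformation
new_toy <- data.frame(hr = c(5, 150))
new_toy$hr_tr <- log1p(new_toy$hr)

# predict() looks up predictors by column name in newdata
new_toy$pred <- predict(fit, newdata = new_toy)
```

Wrapping the transformation in a small function and applying it to both data sets is one way to keep training and evaluation preprocessing in step.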
model.final <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                    TEAM_BATTING_BB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_BATTING_HR_tr,
                  data = train_t3)
summary(model.final)
# summary(model.final)$r.squared
# summary(model.final)$adj.r.squared
# Refit without TEAM_BATTING_HR_tr so the model can score the test data: the test
# dataset did not go through the same cleaning and transformation as the training dataset
model.final.again <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                          TEAM_BATTING_BB + TEAM_PITCHING_HR + TEAM_FIELDING_E,
                        data = train_t3)
summary(model.final.again)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_PITCHING_HR + TEAM_FIELDING_E,
## data = train_t3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.847 -8.842 0.127 8.707 63.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.741473 3.463381 2.524 0.01167 *
## TEAM_BATTING_H 0.045499 0.003159 14.402 < 2e-16 ***
## TEAM_BATTING_2B -0.023334 0.008996 -2.594 0.00956 **
## TEAM_BATTING_3B 0.109297 0.015636 6.990 3.60e-12 ***
## TEAM_BATTING_BB 0.012734 0.003185 3.998 6.58e-05 ***
## TEAM_PITCHING_HR 0.029989 0.006926 4.330 1.55e-05 ***
## TEAM_FIELDING_E -0.019339 0.001918 -10.083 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.53 on 2269 degrees of freedom
## Multiple R-squared: 0.2644, Adjusted R-squared: 0.2625
## F-statistic: 136 on 6 and 2269 DF, p-value: < 2.2e-16
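The negative coefficient on TEAM_BATTING_2B may be a sign of collinearity with TEAM_BATTING_H, since doubles are a subset of hits. One way to check is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The helper below, `vif_manual`, is a hypothetical name written for this sketch and is demonstrated on the built-in `mtcars` data rather than the baseball data.

```r
# Variance inflation factor for each predictor, computed from first principles
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
round(vifs, 2)
```

Applied to the report's model, the call would look like `vif_manual(train_t3, c("TEAM_BATTING_H", "TEAM_BATTING_2B", ...))` with the six predictor names. Values well above the common rule-of-thumb range of 5 to 10 would suggest that individual coefficients should be interpreted with caution.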
### Make predictions on the evaluation data
testing <- test
# predict() takes the new data via newdata; the stray third argument (30) was the bug
predictors <- c("TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
                "TEAM_BATTING_BB", "TEAM_PITCHING_HR", "TEAM_FIELDING_E")
testing$Predicted_Wins <- predict(model.final.again, newdata = testing[, predictors])