1. OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment.

You should have your own thoughts on what to tell the boss. These are just ideas.

  a. Mean / Standard Deviation / Median
  b. Bar Chart or Box Plot of the data
  c. Is the data correlated to the target variable (or to other variables)?
  d. Are any of the variables missing and need to be imputed (“fixed”)?

2. DATA EXPLORATION

Table 1. Moneyball dataset
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
1 39 1445 194 39 13 143 842 NA NA NA 9364 84 927 5456 1011 NA
2 70 1339 219 22 190 685 1075 37 28 NA 1347 191 689 1082 193 155
3 86 1377 232 35 137 602 917 46 27 NA 1377 137 602 917 175 153
4 70 1387 209 38 96 451 922 43 30 NA 1396 97 454 928 164 156
5 82 1297 186 27 102 472 920 49 39 NA 1297 102 472 920 138 168
6 75 1279 200 36 92 443 973 107 59 NA 1279 92 443 973 123 149

Visualize the data

vars n mean sd median trimmed mad min max range skew kurtosis se
INDEX 1 2276 1268.46353 736.34904 1270.5 1268.56970 952.5705 1 2535 2534 0.0042149 -1.2167564 15.4346788
TARGET_WINS 2 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 3 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 4 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 5 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 6 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 7 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 8 2174 735.60534 248.52642 750.0 742.31322 284.6592 0 1399 1399 -0.2978001 -0.3207992 5.3301912
TEAM_BASERUN_SB 9 2145 124.76177 87.79117 101.0 110.81188 60.7866 0 697 697 1.9724140 5.4896754 1.8955584
TEAM_BASERUN_CS 10 1504 52.80386 22.95634 49.0 50.35963 17.7912 0 201 201 1.9762180 7.6203818 0.5919414
TEAM_BATTING_HBP 11 191 59.35602 12.96712 58.0 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
TEAM_PITCHING_H 12 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_HR 13 2276 105.69859 61.29875 107.0 103.15697 74.1300 0 343 343 0.2877877 -0.6046311 1.2848886
TEAM_PITCHING_BB 14 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 15 2174 817.73045 553.08503 813.5 796.93391 257.2311 0 19278 19278 22.1745535 671.1891292 11.8621151
TEAM_FIELDING_E 16 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 17 1990 146.38794 26.22639 149.0 147.57789 23.7216 52 228 176 -0.3889390 0.1817397 0.5879114

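The training set has 2276 records and 17 variables, including the INDEX identifier and the TARGET_WINS response. TARGET_WINS averages about 81 with a standard deviation (SD) of about 16. The n column shows that six variables have missing values: TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO and TEAM_FIELDING_DP.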
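As a rough illustration of where the n, mean, sd and median columns come from, the sketch below recomputes them in base R for two columns, using only the six rows shown in Table 1 (so the values differ from the full-data summary above). The helper name `summarise_col` and the data frame `head_df` are our own illustrative names; the report itself uses `describeBy` from the psych package.

```r
# First six rows of two columns, copied from Table 1 (not the full training set)
head_df <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107)
)

# n counts non-missing values, mirroring the "n" column of the summary table
summarise_col <- function(x) {
  c(n      = sum(!is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE))
}

# One row per variable, one column per statistic
stats <- t(sapply(head_df, summarise_col))
print(round(stats, 2))
```

A variable whose n is below the number of records has missing values, which is how the summary table flags columns that need attention.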

Let’s take a look at the distribution of data points across our variables. The charts below display the range of each variable: the median, the interquartile range (the “box”, where the middle half of the data points lie), and the outliers. The variance of some of the explanatory variables greatly exceeds the variance of the response variable, TARGET_WINS.

During the exploration phase, we examined a correlation heatmap to understand the relationships between the variables in our dataset. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to determine their individual effects on the dependent variable.

The heatmap is computed on complete cases only (`use = "na.or.complete"`), that is, on at most the 191 records in which TEAM_BATTING_HBP is present, so the coefficients describe that subset rather than all 2276 records. Within it, we identified four pairs of variables with a correlation coefficient of 1.00, indicating a perfect or near-perfect linear relationship:

  • TEAM_BATTING_H ~ TEAM_PITCHING_H
  • TEAM_BATTING_HR ~ TEAM_PITCHING_HR
  • TEAM_BATTING_BB ~ TEAM_PITCHING_BB
  • TEAM_BATTING_SO ~ TEAM_PITCHING_SO

These strong correlations complicate model interpretation, because the effect of each variable in a pair cannot be separated from the effect of the other. To mitigate this, we carefully considered which variables to include in our modeling process.
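A minimal sketch of how highly correlated pairs can be listed from a correlation matrix. It uses three columns from the six rows shown in Table 1; the 0.99 cutoff and the names `toy`, `cm` and `high_pairs` are illustrative assumptions, not part of the report's code. With `use = "na.or.complete"`, any row containing an NA is dropped before the correlations are computed, so the first row, whose pitching hits value is far from its batting hits value, does not enter the calculation.

```r
# Three columns from the first six rows of Table 1
toy <- data.frame(
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_PITCHING_H = c(9364, 1347, 1377, 1396, 1297, 1279),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107)
)

# Complete-case correlation: row 1 is dropped because TEAM_BASERUN_SB is NA
cm <- cor(toy, use = "na.or.complete")

# Keep each pair once (upper triangle) when |r| exceeds the assumed cutoff
idx <- which(abs(cm) > 0.99 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Because rows with missing values are excluded, a coefficient from this kind of matrix describes only the complete records and can differ from the correlation over all records.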

Checking for skewness in the data

In the process of exploring our dataset, we investigated the distribution of several variables to understand their skewness. Skewness measures how asymmetrically the data points are distributed around the mean.

Several variables exhibit significant skewness, notably ‘TEAM_FIELDING_E’, ‘TEAM_PITCHING_BB’, ‘TEAM_PITCHING_SO’ and ‘TEAM_PITCHING_H’ (skew values of roughly 3.0, 6.7, 22.2 and 10.3 in the summary table above). Addressing this skewness is important for building functional predictive models. Transformations such as Box-Cox, which we use in Model 2, aim to bring the distribution closer to normal and improve the model’s performance.
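To make the idea of skewness concrete, here is a small sketch that computes a moment-based sample skewness in base R and shows that a log transform lowers it. The function name `skewness_g1` is hypothetical, and the input is only the six TEAM_FIELDING_E values shown in Table 1; packages such as psych and moments use closely related formulas, so their values may differ slightly from this one.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness_g1 <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# TEAM_FIELDING_E for the first six rows of Table 1
errors <- c(1011, 193, 175, 164, 138, 123)

skew_raw <- skewness_g1(errors)       # right-skewed: one very large value
skew_log <- skewness_g1(log(errors))  # the log pulls the large value in

print(c(raw = skew_raw, log = skew_log))
```

On this tiny sample the log transform reduces the skewness only modestly; it is shown here as a simple stand-in for the Box-Cox family, not as the transformation used in the report.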

## Warning: Removed 3478 rows containing non-finite values (`stat_density()`).

Relationship between Predictors and Target Variable

The plots below show each predictor against TARGET_WINS together with a fitted linear regression line. TEAM_PITCHING_H and TEAM_PITCHING_SO appear highly heteroscedastic, while TEAM_BATTING_HBP appears the most homoscedastic.

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3478 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 3478 rows containing missing values (`geom_point()`).

3. DATA PREPARATION

Missing Values

We encountered missing values in certain columns of our dataset. To preserve the integrity of our analysis, we imputed missing values rather than filtering out entire records. This keeps the dataset complete.

For the columns with missing values (‘TEAM_BATTING_SO’, ‘TEAM_BASERUN_SB’, ‘TEAM_BASERUN_CS’, ‘TEAM_PITCHING_SO’ and ‘TEAM_FIELDING_DP’), we replaced the missing values with the median of the respective column. The median is less sensitive to extreme values than the mean, making it a suitable choice for imputation. ‘TEAM_BATTING_HBP’ has 91.61% missing values. Because this could impact the analysis and compromise model quality, we removed this variable from our dataset. We also removed the ‘INDEX’ variable because, while it identifies records, it contributes no meaningful information to our predictive model. Finally, we dropped ‘TEAM_PITCHING_H’ and ‘TEAM_PITCHING_HR’, which duplicate the information in their batting counterparts; ‘TEAM_PITCHING_BB’ and ‘TEAM_PITCHING_SO’ were retained.
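The median imputation and the missing-value percentage can be sketched in base R as follows. The toy data are two columns from the six rows shown in Table 1, and `impute_median`, `toy` and `imputed` are illustrative names of our own; the report's code does the same thing with `mutate_all` over the full data.

```r
# Two columns with a missing first value, copied from Table 1
toy <- data.frame(
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, 49, 107),
  TEAM_FIELDING_DP = c(NA, 155, 153, 156, 168, 149)
)

# Replace every NA in a column with that column's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
imputed <- as.data.frame(lapply(toy, impute_median))
print(imputed)

# Share of missing TEAM_BATTING_HBP values, from the n column of the
# summary table: 191 of 2276 records are present
hbp_missing_pct <- (2276 - 191) / 2276 * 100
print(round(hbp_missing_pct, 2))
```

Each column is imputed with its own median, so an extreme value in one column cannot affect the fill value used in another.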

Check if any NA values remain

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : num  842 1075 917 922 920 ...
##  $ TEAM_BASERUN_SB : int  101 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : num  49 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  58 58 58 58 58 58 58 58 58 58 ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: num  5456 1082 917 928 920 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: num  149 155 153 156 168 149 186 136 169 159 ...

4. BUILD MODELS

Model 1 - Everything model

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.580  -8.599   0.038   8.394  59.983 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.1689621  5.3601622   4.322 1.61e-05 ***
## TEAM_BATTING_H    0.0480780  0.0036475  13.181  < 2e-16 ***
## TEAM_BATTING_2B  -0.0215776  0.0091791  -2.351  0.01882 *  
## TEAM_BATTING_3B   0.0734621  0.0163497   4.493 7.37e-06 ***
## TEAM_BATTING_HR   0.0642203  0.0097483   6.588 5.53e-11 ***
## TEAM_BATTING_BB   0.0138940  0.0049198   2.824  0.00478 ** 
## TEAM_BATTING_SO  -0.0076953  0.0025078  -3.069  0.00218 ** 
## TEAM_BASERUN_SB   0.0268580  0.0042914   6.259 4.63e-10 ***
## TEAM_BASERUN_CS  -0.0126634  0.0157644  -0.803  0.42189    
## TEAM_PITCHING_BB -0.0021431  0.0031500  -0.680  0.49636    
## TEAM_PITCHING_SO  0.0026548  0.0008759   3.031  0.00247 ** 
## TEAM_FIELDING_E  -0.0219496  0.0022200  -9.887  < 2e-16 ***
## TEAM_FIELDING_DP -0.1213765  0.0129517  -9.371  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.08 on 2263 degrees of freedom
## Multiple R-squared:  0.3136, Adjusted R-squared:   0.31 
## F-statistic: 86.17 on 12 and 2263 DF,  p-value: < 2.2e-16

Residuals vs. fitted plot

Get the model residuals

Plot the result

Plot the residuals and Q-Q line

Model 2 - Box-Cox transformation model

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = mb_bc_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.834  -8.135   0.043   8.041  62.218 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.759e+05  5.691e+04 -11.876  < 2e-16 ***
## TEAM_BATTING_H    8.807e+05  7.400e+04  11.901  < 2e-16 ***
## TEAM_BATTING_2B  -2.195e-01  8.417e-02  -2.607  0.00919 ** 
## TEAM_BATTING_3B   1.203e-01  1.716e-02   7.014 3.04e-12 ***
## TEAM_BATTING_HR   4.095e-02  1.011e-02   4.051 5.27e-05 ***
## TEAM_BATTING_BB   2.790e-02  3.987e-03   6.999 3.39e-12 ***
## TEAM_BATTING_SO  -1.160e-02  2.547e-03  -4.553 5.58e-06 ***
## TEAM_BASERUN_SB   2.633e-02  4.266e-03   6.171 8.03e-10 ***
## TEAM_BASERUN_CS  -4.648e-03  1.563e-02  -0.297  0.76626    
## TEAM_PITCHING_BB -8.590e-03  2.958e-03  -2.904  0.00372 ** 
## TEAM_PITCHING_SO  4.074e-03  8.769e-04   4.646 3.59e-06 ***
## TEAM_FIELDING_E  -1.297e+03  1.289e+02 -10.061  < 2e-16 ***
## TEAM_FIELDING_DP -2.409e-03  2.457e-04  -9.806  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.14 on 2263 degrees of freedom
## Multiple R-squared:  0.3083, Adjusted R-squared:  0.3046 
## F-statistic: 84.04 on 12 and 2263 DF,  p-value: < 2.2e-16

Residuals vs. fitted plot

Get the model residuals

Plot the result

Plot the residuals and Q-Q line

Model 3 - Stepwise model

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.401  -8.562   0.000   8.400  60.235 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.868430   5.234963   4.368 1.31e-05 ***
## TEAM_BATTING_H    0.047827   0.003636  13.152  < 2e-16 ***
## TEAM_BATTING_2B  -0.021930   0.009170  -2.392 0.016858 *  
## TEAM_BATTING_3B   0.074444   0.016320   4.562 5.35e-06 ***
## TEAM_BATTING_HR   0.065398   0.009606   6.808 1.26e-11 ***
## TEAM_BATTING_BB   0.011733   0.003377   3.474 0.000523 ***
## TEAM_BATTING_SO  -0.007162   0.002390  -2.996 0.002763 ** 
## TEAM_BASERUN_SB   0.025968   0.004191   6.196 6.88e-10 ***
## TEAM_PITCHING_SO  0.002217   0.000597   3.713 0.000210 ***
## TEAM_FIELDING_E  -0.022093   0.002027 -10.900  < 2e-16 ***
## TEAM_FIELDING_DP -0.121783   0.012944  -9.409  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.08 on 2265 degrees of freedom
## Multiple R-squared:  0.3133, Adjusted R-squared:  0.3103 
## F-statistic: 103.3 on 10 and 2265 DF,  p-value: < 2.2e-16

Residuals vs. fitted plot

Get the model residuals

Plot the result

Plot the residuals and Q-Q line

Normal Q-Q plot: used to check the assumption that the residuals are normally distributed. If most of the residuals fall along the straight dashed line, the assumption is reasonably satisfied.
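As a hedged illustration of what the Q-Q plot measures, the sketch below fits a stand-in model on R's built-in `cars` data (an assumption for the example, not the moneyball data) and summarises how closely the residual quantiles follow the theoretical normal quantiles. The names `fit`, `res`, `qq` and `qq_cor` are illustrative.

```r
# Stand-in linear model on the built-in cars data (not the report's data)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Theoretical (x) and sample (y) quantiles, without drawing the plot
qq <- qqnorm(res, plot.it = FALSE)

# A correlation close to 1 means the points lie near a straight line
qq_cor <- cor(qq$x, qq$y)
print(round(qq_cor, 3))

# To draw the plot itself:
# qqnorm(res); qqline(res)
```

The correlation is only a numeric summary of the plot; systematic curvature in the tails is still easier to judge by looking at the plot.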

5. SELECT MODEL

Create tidy output for all three models

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.314         0.310  13.1      86.2 5.53e-175    12 -9076. 18179. 18259.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.308         0.305  13.1      84.0 3.47e-171    12 -9084. 18197. 18277.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.313         0.310  13.1      103. 9.27e-177    10 -9076. 18176. 18245.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Compare Model 1 to Model 2:

Comparing Model1 and Model2 we can conclude that Model1 performed better than Model2: RSE is higher for Model2 (13.14 vs. 13.08), and the F-statistic and Adj. R2 are lower for Model2 (84.04 vs. 86.17 and 0.3046 vs. 0.3100). Overall, moving from Model1 to Model2 is a negative change.

Compare Model 2 to Model 3:

Comparing Model2 and Model3 we can conclude that Model3 performed better than Model2: RSE is lower for Model3 (13.08 vs. 13.14), and the F-statistic and Adj. R2 are higher for Model3 (103.3 vs. 84.04 and 0.3103 vs. 0.3046). Overall we see a positive change.

Compare Model 1 to Model 3:

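Comparing Model1 and Model3 we can conclude that Model3 performed better than Model1: RSE didn’t change (13.08), while the F-statistic and Adj. R2 are higher for Model3 (103.3 vs. 86.17 and 0.3103 vs. 0.3100). Model3 also has lower AIC (18176 vs. 18179) and BIC (18245 vs. 18259) with two fewer predictors. Overall we see a positive change.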

Based on the analysis above we decided that Model3 is the best model to choose.
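The adjusted R-squared values used in this comparison can be checked by hand from the reported R-squared, the number of records and the number of predictors. The sketch below does that arithmetic with the figures printed in the three model summaries; the variable names are illustrative.

```r
# Values copied from the three model summaries above
n  <- 2276                                                # training records
r2 <- c(model1 = 0.3136, model2 = 0.3083, model3 = 0.3133)
p  <- c(model1 = 12,     model2 = 12,     model3 = 10)    # predictors

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

Model 3 has a slightly lower raw R-squared than Model 1 but a slightly higher adjusted value, because the adjustment penalises Model 1 for its two extra predictors.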

The evaluation dataset was cleaned in the same way as the training data, so we can now apply Model3 to it.

## Rows: 259
## Columns: 13
## $ TEAM_BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B  <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR  <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB  <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO  <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB  <dbl> 62, 54, 59, 148, 92, 92, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS  <dbl> 50.0, 39.0, 47.0, 57.0, 49.5, 49.5, 49.5, 49.5, 49.5,…
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <dbl> 156, 164, 153, 154, 130, 105, 148, 104, 132, 145, 183…
## $ TARGET_WINS      <dbl> 64.26984, 65.77449, 75.20381, 85.78582, 66.48780, 69.…

Comparing summary statistics of the known TARGET_WINS in the training data (first) with the predicted TARGET_WINS for the evaluation data (second)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.87   75.57   80.90   80.66   86.18  111.58

6. CONCLUSION

In summary, our exploration and modeling of the baseball team dataset have provided insight into the factors influencing team performance, as measured by the number of wins, ‘TARGET_WINS’.

After preparing the data, which included imputing missing values and removing variables that caused multicollinearity, we fitted and evaluated three linear regression models. Model 3 became our preferred choice: it matches the residual standard error of Model 1 (13.08) with fewer predictors, and it has the highest F-statistic and adjusted R-squared and the lowest AIC and BIC of the three. With an adjusted R-squared of about 0.31, the model explains roughly a third of the variation in wins, so its predictions should be treated as approximate. The evaluation dataset contains no actual outcomes, so we compared the distribution of the predicted ‘TARGET_WINS’ with the known training values instead: the means (80.66 vs. 80.79) and medians (80.90 vs. 82.00) are close, although the predictions span a narrower range (26.87 to 111.58 vs. 0 to 146).

This model provides a useful tool for predicting the number of wins from team statistics, aiding decision-making in baseball team management. Going forward, however, continued monitoring and refinement of the model will be needed to maintain its accuracy and relevance.

7. APPENDIX

Appendix: All code for this report

suppressMessages(library(tidyverse))
suppressMessages(library(dplyr))
suppressMessages(library(kableExtra))
suppressMessages(library(knitr))
suppressMessages(library(caret))
suppressMessages(library(corrplot))
suppressMessages(library(mlbench))
suppressMessages(library(randomForest))
suppressMessages(library(highcharter))
suppressMessages(library(reshape))
suppressMessages(library(DataExplorer))
suppressMessages(library(broom))
suppressMessages(library(GGally))
suppressMessages(library(MASS))
suppressMessages(library(ggpubr))
suppressMessages(library(moments))
suppressMessages(library(car))
suppressMessages(library(psych))

eval_df <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA621/main/moneyball-evaluation-data.csv')
train_df <-read.csv('https://raw.githubusercontent.com/uplotnik/DATA621/main/moneyball-training-data.csv')
knitr::kable(
  head(train_df), caption = "Table 1. Moneyball dataset")%>%
  kable_styling("striped", full_width = F)
suppressWarnings({ knitr::kable(describeBy(train_df))})
new<-train_df %>% dplyr::select(-INDEX)
gather_df <- new %>% 
  gather(key = 'variable', value = 'value')
# Build one box plot series per variable
dat <- data_to_boxplot(gather_df, value, variable, name = "value")
highchart() %>%
  hc_xAxis(type = "category") %>% hc_add_theme(hc_theme_economist())%>% 
  hc_add_series_list(dat)
hchart(gather_df, "scatter", hcaes(x = variable, y = value, group = variable)) %>% 
hc_title(
    text = "Closer look to Outliers",
    margin = 20,
    align = "left")%>% hc_add_theme(hc_theme_economist())
suppressWarnings({df<-train_df  %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(),
               names_to = "variables", values_to="missing") %>%  
  count(variables, missing) %>%mutate(percent = n / nrow(train_df) * 100)}) 
suppressWarnings({ 
missing<- df%>%
   hchart('bar', hcaes(x = 'variables', y = 'n', group = 'missing')) %>% 
 # hc_colors(c("#0073C2FF", "#EFC000FF")) %>% hc_add_theme(hc_theme_economist())%>%
hc_title(
    text = "Missing Values",
    margin = 20,
    align = "left")%>% hc_add_theme(hc_theme_economist())
missing}) 
hchart(cor(train_df, use = "na.or.complete")) %>% hc_title(
    text = "Correlation Plot Among The Variables",
    margin = 20,
    align = "left")%>% hc_add_theme(hc_theme_economist())
suppressWarnings({ 
train_df%>%
  gather(variable, value, - INDEX) %>%ggplot(., aes(value)) + 
geom_density(fill = "lightblue", color="blue") + theme (legend.position="none")+
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank())
 
})
suppressWarnings({train_df %>%
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
   geom_point(fill = "lightblue", color="lightblue") + 
   facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = "value", y = "Wins")+
  geom_smooth(method = "lm",  color = "blue",se=F, size=0.2)})
plot_missing(train_df)
training <- train_df %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
eval <- eval_df %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
str(training)
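# Drop the identifier, the mostly missing HBP column, and two pitching columns
# that duplicate their batting counterparts. TEAM_PITCHING_BB and
# TEAM_PITCHING_SO are listed without a minus sign, so they stay in the data.
t <- training %>% dplyr::select(-INDEX, -TEAM_BATTING_HBP, -TEAM_PITCHING_H, -TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO)

# Apply the same column selection to the evaluation data
eval <- eval %>% dplyr::select(-INDEX, -TEAM_BATTING_HBP, -TEAM_PITCHING_H, -TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO)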
plot_missing(t)
model1 <- lm(TARGET_WINS ~., t)
summary(model1)
perf1<-augment(model1)
ggplot(perf1, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals1 = model1$residuals
hist(model_residuals1)
qqnorm(model_residuals1)
qqline(model_residuals1)

mbtrain_boxcox <- preProcess(t, c("BoxCox"))
mb_bc_transformed <- predict(mbtrain_boxcox, t)
model2 <- lm(TARGET_WINS ~ ., mb_bc_transformed)
summary(model2)
perf2<-augment(model2)
ggplot(perf2, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals2 = model2$residuals

hist(model_residuals2)

qqnorm(model_residuals2)
qqline(model_residuals2)

model3 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model3)
perf3<-augment(model3)
ggplot(perf3, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals3 = model3$residuals
hist(model_residuals3)
qqnorm(model_residuals3)
qqline(model_residuals3)

broom::glance(model1)
broom::glance(model2)
broom::glance(model3)
eval$TARGET_WINS <- predict(model3,eval)

glimpse(eval)
summary(t$TARGET_WINS)
summary(eval$TARGET_WINS)