DATA 621 Homework 1

1. OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment.

You should have your own thoughts on what to tell the boss. These are just ideas:
a. Mean / Standard Deviation / Median
b. Bar Chart or Box Plot of the data
c. Is the data correlated to the target variable (or to other variables)?
d. Are any of the variables missing and need to be imputed (“fixed”)?

2. DATA EXPLORATION

Table 1. Moneyball dataset
INDEX	TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
1	39	1445	194	39	13	143	842	NA	NA	NA	9364	84	927	5456	1011	NA
2	70	1339	219	22	190	685	1075	37	28	NA	1347	191	689	1082	193	155
3	86	1377	232	35	137	602	917	46	27	NA	1377	137	602	917	175	153
4	70	1387	209	38	96	451	922	43	30	NA	1396	97	454	928	164	156
5	82	1297	186	27	102	472	920	49	39	NA	1297	102	472	920	138	168
6	75	1279	200	36	92	443	973	107	59	NA	1279	92	443	973	123	149

Visualize the data

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
INDEX	1	2276	1268.46353	736.34904	1270.5	1268.56970	952.5705	1	2535	2534	0.0042149	-1.2167564	15.4346788
TARGET_WINS	2	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_H	3	2276	1469.26977	144.59120	1454.0	1459.04116	114.1602	891	2554	1663	1.5713335	7.2785261	3.0307891
TEAM_BATTING_2B	4	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	5	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	6	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	7	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	8	2174	735.60534	248.52642	750.0	742.31322	284.6592	0	1399	1399	-0.2978001	-0.3207992	5.3301912
TEAM_BASERUN_SB	9	2145	124.76177	87.79117	101.0	110.81188	60.7866	0	697	697	1.9724140	5.4896754	1.8955584
TEAM_BASERUN_CS	10	1504	52.80386	22.95634	49.0	50.35963	17.7912	0	201	201	1.9762180	7.6203818	0.5919414
TEAM_BATTING_HBP	11	191	59.35602	12.96712	58.0	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681
TEAM_PITCHING_H	12	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_HR	13	2276	105.69859	61.29875	107.0	103.15697	74.1300	0	343	343	0.2877877	-0.6046311	1.2848886
TEAM_PITCHING_BB	14	2276	553.00791	166.35736	536.5	542.62459	98.5929	0	3645	3645	6.7438995	96.9676398	3.4870317
TEAM_PITCHING_SO	15	2174	817.73045	553.08503	813.5	796.93391	257.2311	0	19278	19278	22.1745535	671.1891292	11.8621151
TEAM_FIELDING_E	16	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	17	1990	146.38794	26.22639	149.0	147.57789	23.7216	52	228	176	-0.3889390	0.1817397	0.5879114

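The training set has 2276 records and 17 variables. TARGET_WINS averages about 81 wins with a standard deviation (SD) of about 16. The n column shows that six variables have missing values, most severely TEAM_BATTING_HBP, which is present in only 191 records.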
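As a minimal sketch of how such a summary can be produced in base R, the snippet below computes the count of non-missing values, mean, SD and median per column. It uses only the six rows shown in Table 1 (three columns), and `summarise_column` is a hypothetical helper name, not a function from the report.

```r
# First six rows of Table 1, three columns only, typed in by hand for illustration.
sample_df <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75),
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107)
)

# summarise_column() is a hypothetical helper, not part of the report's code.
summarise_column <- function(x) {
  c(
    n      = sum(!is.na(x)),            # non-missing observations
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE)
  )
}

# sapply() returns one column per variable; transpose to get one row per variable.
summary_table <- t(sapply(sample_df, summarise_column))
print(round(summary_table, 2))
```

The report itself relies on psych's describeBy() for the full table; this sketch only shows where the n, mean, sd and median columns come from.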

Let’s take a look at the distribution of data points across our variables. The box plots below display the range of each variable, including the outliers, the median, and the “box” where the middle half of the data points lie. The variance of some of the explanatory variables greatly exceeds the variance of the response variable, TARGET_WINS.

During the exploration phase, we examined a correlation heatmap to understand the relationships between different variables in our dataset. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it challenging to determine their individual effects on the dependent variable.

In our dataset, we identified four pairs of variables with a correlation coefficient that rounds to 1.00, indicating a near-perfect linear relationship. The correlations were computed on complete cases only (use = "na.or.complete"), so they reflect at most the 191 records in which TEAM_BATTING_HBP is present:

TEAM_BATTING_H ~ TEAM_PITCHING_H
TEAM_BATTING_HR ~ TEAM_PITCHING_HR
TEAM_BATTING_BB ~ TEAM_PITCHING_BB
TEAM_BATTING_SO ~ TEAM_PITCHING_SO

These strong correlations make the model harder to interpret, because the effect of one variable in a pair cannot be separated from the effect of the other. To mitigate this, we carefully considered which variables to include in our modeling process. We also checked the variables for skewness.
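One way to list such pairs programmatically is sketched below in base R. The data are rows 2 to 6 of Table 1 for a handful of columns, and the 0.99 threshold is an assumption chosen for illustration, not a value taken from the report.

```r
# Rows 2 to 6 of Table 1 (row 1 is left out because of its extreme pitching values).
rows_2_to_6 <- data.frame(
  TARGET_WINS      = c(70, 86, 70, 82, 75),
  TEAM_BATTING_HR  = c(190, 137, 96, 102, 92),
  TEAM_PITCHING_HR = c(191, 137, 97, 102, 92),
  TEAM_BATTING_BB  = c(685, 602, 451, 472, 443),
  TEAM_PITCHING_BB = c(689, 602, 454, 472, 443)
)

cor_matrix <- cor(rows_2_to_6, use = "pairwise.complete.obs")

# Assumed cut-off for "near-perfect" correlation.
threshold <- 0.99

# Look only at the upper triangle so each pair is reported once.
hits <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[hits[, "row"]],
  var2 = colnames(cor_matrix)[hits[, "col"]],
  r    = cor_matrix[hits]
)
print(high_pairs)
```

On the full training data the same approach would be applied to the output of cor(train_df, ...), and one variable from each flagged pair becomes a candidate for removal.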

Checking for skewness in the data

In the process of exploring our dataset, we investigated the distribution of several variables to understand their skewness. Skewness measures how asymmetrically the data points are distributed around the mean: a positive value indicates a long right tail and a negative value a long left tail.

We noticed several variables that exhibit significant skewness, such as ‘TEAM_FIELDING_E’, ‘TEAM_PITCHING_H’, ‘TEAM_PITCHING_BB’ and ‘TEAM_PITCHING_SO’. Addressing the skewness of these variables is important for building functional predictive models. Transformations such as the Box-Cox transformation used in Model 2 aim to normalize the distribution and improve the model’s performance.
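As a rough illustration of what the skewness figures measure, the sketch below computes the moment-based sample skewness in base R. The helper name `sample_skewness` is hypothetical, and this plain moment estimator can differ slightly from the bias-adjusted value that psych reports in the table above.

```r
# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^1.5
}

# TEAM_FIELDING_E for the six rows of Table 1; the first value is an extreme outlier.
fielding_errors <- c(1011, 193, 175, 164, 138, 123)

skew_raw <- sample_skewness(fielding_errors)
skew_log <- sample_skewness(log(fielding_errors))

cat("skewness, raw scale:", round(skew_raw, 3), "\n")
cat("skewness, log scale:", round(skew_log, 3), "\n")
```

A log transform pulls in the right tail, so the skewness on the log scale is smaller, although with a single dominant outlier in six points it remains clearly positive.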

## Warning: Removed 3478 rows containing non-finite values (`stat_density()`).

Relationship between Predictors and Target Variable

Below we see how each predictor relates to TARGET_WINS, with a fitted linear regression line in each panel. PITCHING_H and PITCHING_SO appear highly heteroscedastic, while BATTING_HBP appears the most homoscedastic.

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 3478 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 3478 rows containing missing values (`geom_point()`).

3. DATA PREPARATION

Missing Values

We encountered missing values in certain columns of our dataset. In order to preserve the integrity of our analysis, we employed a strategy of imputing missing values rather than filtering out entire records. This helps to maintain a complete dataset.

For columns with missing values, such as ‘TEAM_BASERUN_SB’ and ‘TEAM_BATTING_SO’, we opted to replace the missing values with the median of the respective column. The median is less sensitive to extreme values than the mean, making it a suitable choice for imputation. ‘TEAM_BATTING_HBP’ is missing in 91.61% of records. Because this could impact the analysis and compromise model quality, we made the decision to remove this variable from our dataset. We also decided to remove the ‘INDEX’ variable because, while it serves as an identifier for records, it does not contribute meaningful information to our predictive model.

Check whether any NA values remain
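The imputation rule and the missing-value percentage can be sketched in base R as follows. The helper names `percent_missing` and `impute_median` are hypothetical; the counts 2276 and 191 come from the summary table, and the sample vector is TEAM_BASERUN_SB from the six rows of Table 1.

```r
# Share of missing values in a vector, as a percentage.
percent_missing <- function(x) 100 * mean(is.na(x))

# Replace NA entries with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# TEAM_BATTING_HBP: 191 non-missing values out of 2276 records.
hbp_missing_pct <- 100 * (2276 - 191) / 2276
cat("TEAM_BATTING_HBP missing:", round(hbp_missing_pct, 2), "%\n")

# TEAM_BASERUN_SB for the six rows of Table 1; the first value is missing.
stolen_bases <- c(NA, 37, 46, 43, 49, 107)
cat("Missing before imputation:", round(percent_missing(stolen_bases), 1), "%\n")

stolen_bases_imputed <- impute_median(stolen_bases)
print(stolen_bases_imputed)
```

In the report the same rule is applied to every column at once with mutate_all(), and the evaluation set is imputed with its own medians.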

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : num  842 1075 917 922 920 ...
##  $ TEAM_BASERUN_SB : int  101 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : num  49 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  58 58 58 58 58 58 58 58 58 58 ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: num  5456 1082 917 928 920 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: num  149 155 153 156 168 149 186 136 169 159 ...

4. BUILD MODELS

Model 1 - Everything model

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.580  -8.599   0.038   8.394  59.983 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.1689621  5.3601622   4.322 1.61e-05 ***
## TEAM_BATTING_H    0.0480780  0.0036475  13.181  < 2e-16 ***
## TEAM_BATTING_2B  -0.0215776  0.0091791  -2.351  0.01882 *  
## TEAM_BATTING_3B   0.0734621  0.0163497   4.493 7.37e-06 ***
## TEAM_BATTING_HR   0.0642203  0.0097483   6.588 5.53e-11 ***
## TEAM_BATTING_BB   0.0138940  0.0049198   2.824  0.00478 ** 
## TEAM_BATTING_SO  -0.0076953  0.0025078  -3.069  0.00218 ** 
## TEAM_BASERUN_SB   0.0268580  0.0042914   6.259 4.63e-10 ***
## TEAM_BASERUN_CS  -0.0126634  0.0157644  -0.803  0.42189    
## TEAM_PITCHING_BB -0.0021431  0.0031500  -0.680  0.49636    
## TEAM_PITCHING_SO  0.0026548  0.0008759   3.031  0.00247 ** 
## TEAM_FIELDING_E  -0.0219496  0.0022200  -9.887  < 2e-16 ***
## TEAM_FIELDING_DP -0.1213765  0.0129517  -9.371  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.08 on 2263 degrees of freedom
## Multiple R-squared:  0.3136, Adjusted R-squared:   0.31 
## F-statistic: 86.17 on 12 and 2263 DF,  p-value: < 2.2e-16

Residuals vs. fitted plot

Get the model residuals

Plot the result

Plot the residuals and Q-Q line

Model 2 - Box-Cox transformation model

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = mb_bc_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.834  -8.135   0.043   8.041  62.218 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.759e+05  5.691e+04 -11.876  < 2e-16 ***
## TEAM_BATTING_H    8.807e+05  7.400e+04  11.901  < 2e-16 ***
## TEAM_BATTING_2B  -2.195e-01  8.417e-02  -2.607  0.00919 ** 
## TEAM_BATTING_3B   1.203e-01  1.716e-02   7.014 3.04e-12 ***
## TEAM_BATTING_HR   4.095e-02  1.011e-02   4.051 5.27e-05 ***
## TEAM_BATTING_BB   2.790e-02  3.987e-03   6.999 3.39e-12 ***
## TEAM_BATTING_SO  -1.160e-02  2.547e-03  -4.553 5.58e-06 ***
## TEAM_BASERUN_SB   2.633e-02  4.266e-03   6.171 8.03e-10 ***
## TEAM_BASERUN_CS  -4.648e-03  1.563e-02  -0.297  0.76626    
## TEAM_PITCHING_BB -8.590e-03  2.958e-03  -2.904  0.00372 ** 
## TEAM_PITCHING_SO  4.074e-03  8.769e-04   4.646 3.59e-06 ***
## TEAM_FIELDING_E  -1.297e+03  1.289e+02 -10.061  < 2e-16 ***
## TEAM_FIELDING_DP -2.409e-03  2.457e-04  -9.806  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.14 on 2263 degrees of freedom
## Multiple R-squared:  0.3083, Adjusted R-squared:  0.3046 
## F-statistic: 84.04 on 12 and 2263 DF,  p-value: < 2.2e-16
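Model 2 was fit on data transformed with caret's preProcess() using the "BoxCox" method, which, as far as we understand it, estimates a separate lambda for each strictly positive column. The transformation itself can be sketched in base R as below; the helper `box_cox` is hypothetical and the lambda value is an arbitrary assumption for illustration, not the value caret estimated.

```r
# Box-Cox transform for strictly positive data:
# (x^lambda - 1) / lambda, with log(x) as the limiting case when lambda is 0.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# TEAM_FIELDING_E for the six rows of Table 1.
fielding_errors <- c(1011, 193, 175, 164, 138, 123)

# lambda = -0.5 is an assumed value chosen only to show the effect.
transformed <- box_cox(fielding_errors, lambda = -0.5)
print(round(transformed, 4))
```

The transform is monotone, so the ordering of teams is preserved while large values are compressed. This also explains why some Model 2 coefficients are on a very different scale from those of Model 1.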

Residuals vs. fitted plot

Get the model residuals

Plot the result

Plot the residuals and Q-Q line

Model 3 - Stepwise model

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.401  -8.562   0.000   8.400  60.235 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.868430   5.234963   4.368 1.31e-05 ***
## TEAM_BATTING_H    0.047827   0.003636  13.152  < 2e-16 ***
## TEAM_BATTING_2B  -0.021930   0.009170  -2.392 0.016858 *  
## TEAM_BATTING_3B   0.074444   0.016320   4.562 5.35e-06 ***
## TEAM_BATTING_HR   0.065398   0.009606   6.808 1.26e-11 ***
## TEAM_BATTING_BB   0.011733   0.003377   3.474 0.000523 ***
## TEAM_BATTING_SO  -0.007162   0.002390  -2.996 0.002763 ** 
## TEAM_BASERUN_SB   0.025968   0.004191   6.196 6.88e-10 ***
## TEAM_PITCHING_SO  0.002217   0.000597   3.713 0.000210 ***
## TEAM_FIELDING_E  -0.022093   0.002027 -10.900  < 2e-16 ***
## TEAM_FIELDING_DP -0.121783   0.012944  -9.409  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.08 on 2265 degrees of freedom
## Multiple R-squared:  0.3133, Adjusted R-squared:  0.3103 
## F-statistic: 103.3 on 10 and 2265 DF,  p-value: < 2.2e-16

Residuals vs. fitted plot

Get the model residuals

Plot the result

Plot the residuals and Q-Q line

Normal Q-Q plot: used to check the assumption that the residuals are normally distributed. If the majority of the residuals follow the straight reference line, the assumption is reasonably satisfied.
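A numeric companion to the visual check is the correlation between the theoretical and the sample quantiles, which is close to 1 when the points follow the line. The sketch below uses simulated residuals, not the residuals of the fitted models; the seed, the sample size and the SD of 13 (roughly the residual standard error reported above) are assumptions.

```r
# Simulated stand-in for model residuals; the real ones would be model3$residuals.
set.seed(621)
simulated_residuals <- rnorm(200, mean = 0, sd = 13)

# qqnorm() can return the plotted coordinates without drawing anything.
qq <- qqnorm(simulated_residuals, plot.it = FALSE)

# Correlation between theoretical (x) and sample (y) quantiles.
qq_correlation <- cor(qq$x, qq$y)
cat("Q-Q correlation:", round(qq_correlation, 4), "\n")
```

Values well below 1 would point to heavy tails or skew in the residuals, which is what a curved Q-Q plot shows visually.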

5. SELECT MODEL

Create tidy output for all three models

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.314         0.310  13.1      86.2 5.53e-175    12 -9076. 18179. 18259.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.308         0.305  13.1      84.0 3.47e-171    12 -9084. 18197. 18277.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.313         0.310  13.1      103. 9.27e-177    10 -9076. 18176. 18245.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Compare Model 1 to Model 2:

Comparing Model1 and Model2, we can conclude that Model1 performed better than Model2: the RSE is higher for Model2 (13.14 vs. 13.08), and the F-statistic and Adj. R2 are lower for Model2 (84.04 vs. 86.17 and 0.3046 vs. 0.3100). Overall we see a negative change.

Compare Model 2 to Model 3:

Comparing Model2 and Model3 we can conclude that Model3 performed better than Model2: RSE number is lower for Model3, F-statistic and Adj. R2 are higher for Model3. Overall we see a positive change.

Compare Model 1 to Model 3:

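Comparing Model1 and Model3, we can conclude that Model3 performed better than Model1: the RSE didn’t change (13.08), while the F-statistic and Adj. R2 are higher for Model3 (103.3 vs. 86.17 and 0.3103 vs. 0.3100). Model3 also has the lower AIC and BIC while using two fewer predictors. Overall we see a positive change.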

Based on the analysis above we decided that Model3 is the best model to choose.
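The adjusted R-squared values used in this comparison can be reproduced from the reported R-squared, the number of records (2276) and the number of predictors in each model. The sketch below assumes the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1); `adj_r2` is a hypothetical helper, and the inputs are the rounded figures from the model summaries, so the results agree only to about four decimals.

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_records <- 2276

# R-squared and predictor counts as reported in the three model summaries.
models <- data.frame(
  model = c("Model 1", "Model 2", "Model 3"),
  r2    = c(0.3136, 0.3083, 0.3133),
  p     = c(12, 12, 10)
)
models$adj_r2 <- adj_r2(models$r2, n_records, models$p)
print(models)
```

Because the penalty grows with the number of predictors, Model 3 ends up with a slightly higher adjusted R-squared than Model 1 even though its raw R-squared is marginally lower.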

We have already cleaned the evaluation dataset in the same way as the training data, so we can now use Model3 to predict TARGET_WINS for it.

## Rows: 259
## Columns: 13
## $ TEAM_BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B  <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR  <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB  <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO  <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB  <dbl> 62, 54, 59, 148, 92, 92, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS  <dbl> 50.0, 39.0, 47.0, 57.0, 49.5, 49.5, 49.5, 49.5, 49.5,…
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <dbl> 156, 164, 153, 154, 130, 105, 148, 104, 132, 145, 183…
## $ TARGET_WINS      <dbl> 64.26984, 65.77449, 75.20381, 85.78582, 66.48780, 69.…

Comparing summary statistics of the known TARGET_WINS in the training data (first block) with the predicted TARGET_WINS for the evaluation data (second block)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.87   75.57   80.90   80.66   86.18  111.58

6. CONCLUSION

In summary, our exploration and modeling of the baseball team dataset have provided valuable insights into the factors influencing team performance, as measured by the number of wins, ‘TARGET_WINS’.

Through a careful data preparation process, including handling missing values and reducing multicollinearity, we built three candidate models. Model 3 became our preferred choice based on a combination of performance metrics: its residual standard error matches that of Model 1 and is lower than that of Model 2, and it has the highest F-statistic and adjusted R-squared while using fewer predictors. Its adjusted R-squared of about 0.31 means the model explains roughly a third of the variation in wins, so its predictions should be treated as approximate. The evaluation dataset does not include actual outcomes. After applying Model 3 to it, we observed that the predicted ‘TARGET_WINS’ have a mean and median (80.66 and 80.90) close to those of the training data (80.79 and 82.00), although over a narrower range (26.87 to 111.58 versus 0 to 146), which suggests the predictions are plausible.

This model provides a useful tool for predicting the number of wins based on various team statistics, aiding decision-making in baseball team management. Going forward, however, continued monitoring and refinement of the model will help keep it accurate and relevant.

7. APPENDIX

Appendix: All code for this report

suppressMessages(library(tidyverse))
suppressMessages(library(dplyr))
suppressMessages(library(kableExtra))
suppressMessages(library(knitr))
suppressMessages(library(caret))
suppressMessages(library(corrplot))
suppressMessages(library(mlbench))
suppressMessages(library(randomForest))
suppressMessages(library(highcharter))
suppressMessages(library(reshape))
suppressMessages(library(DataExplorer))
suppressMessages(library(broom))
suppressMessages(library(GGally))
suppressMessages(library(MASS))
suppressMessages(library(ggpubr))
suppressMessages(library(moments))
suppressMessages(library(car))
suppressMessages(library(psych))

eval_df <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA621/main/moneyball-evaluation-data.csv')
train_df <-read.csv('https://raw.githubusercontent.com/uplotnik/DATA621/main/moneyball-training-data.csv')
knitr::kable(
  head(train_df), caption = "Table1.Moneyball dataset")%>%
  kable_styling("striped", full_width = F)
suppressWarnings({ knitr::kable(describeBy(train_df))})
new<-train_df %>% dplyr::select(-INDEX)
gather_df <- new %>% 
  gather(key = 'variable', value = 'value')
dat <- data_to_boxplot(gather_df, value, variable, name = "value")
highchart() %>%
  hc_xAxis(type = "category") %>% hc_add_theme(hc_theme_economist())%>% 
  hc_add_series_list(dat)
hchart(gather_df, "scatter", hcaes(x = variable, y = value, group = variable)) %>% 
hc_title(
    text = "Closer look to Outliers",
    margin = 20,
    align = "left")%>% hc_add_theme(hc_theme_economist())
suppressWarnings({df<-train_df  %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(),
               names_to = "variables", values_to="missing") %>%  
  count(variables, missing) %>%mutate(percent = n / nrow(train_df) * 100)}) 
suppressWarnings({ 
missing<- df%>%
   hchart('bar', hcaes(x = 'variables', y = 'n', group = 'missing')) %>% 
 # hc_colors(c("#0073C2FF", "#EFC000FF")) %>% hc_add_theme(hc_theme_economist())%>%
hc_title(
    text = "Missing Values",
    margin = 20,
    align = "left")%>% hc_add_theme(hc_theme_economist())
missing}) 
hchart(cor(train_df, use = "na.or.complete")) %>% hc_title(
    text = "Correlation Plot Among The Variables",
    margin = 20,
    align = "left")%>% hc_add_theme(hc_theme_economist())
suppressWarnings({ 
train_df%>%
  gather(variable, value, - INDEX) %>%ggplot(., aes(value)) + 
geom_density(fill = "lightblue", color="blue") + theme (legend.position="none")+
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank())
 
})
suppressWarnings({train_df %>%
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
   geom_point(fill = "lightblue", color="lightblue") + 
   facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = "value", y = "Wins")+
  geom_smooth(method = "lm",  color = "blue",se=F, size=0.2)})
plot_missing(train_df)
training <- train_df %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
eval <- eval_df %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
str(training)
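# Drop the identifier, the mostly missing HBP column, and two pitching columns that
# mirror their batting counterparts. TEAM_PITCHING_BB and TEAM_PITCHING_SO are listed
# without a minus sign, so they are kept, as the model summaries confirm.
t <- training %>% dplyr::select(-INDEX, -TEAM_BATTING_HBP, -TEAM_PITCHING_H, -TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO)

# Apply the same column selection to the evaluation data.
eval <- eval %>% dplyr::select(-INDEX, -TEAM_BATTING_HBP, -TEAM_PITCHING_H, -TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO)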
plot_missing(t)
model1 <- lm(TARGET_WINS ~., t)
summary(model1)
perf1<-augment(model1)
ggplot(perf1, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals1 = model1$residuals
hist(model_residuals1)
qqnorm(model_residuals1)
qqline(model_residuals1)

mbtrain_boxcox <- preProcess(t, c("BoxCox"))
mb_bc_transformed <- predict(mbtrain_boxcox, t)
model2 <- lm(TARGET_WINS ~ ., mb_bc_transformed)
summary(model2)
perf2<-augment(model2)
ggplot(perf2, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals2 = model2$residuals

hist(model_residuals2)

qqnorm(model_residuals2)
qqline(model_residuals2)

model3 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model3)
perf3<-augment(model3)
ggplot(perf3, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals3 = model3$residuals
hist(model_residuals3)
qqnorm(model_residuals3)
qqline(model_residuals3)

broom::glance(model1)
broom::glance(model2)
broom::glance(model3)
eval$TARGET_WINS <- predict(model3,eval)

glimpse(eval)
summary(t$TARGET_WINS)
summary(eval$TARGET_WINS)

DATA 621 Homework 1

Critical Thinking Group 5: Uliana Plotnikova, Laura Puebla Aguila, Renida Kasa

2024-02-22
