In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and contains the team's performance for the given year, with all statistics adjusted to match the performance of a 162-game season.
Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest, while too little detail will make the manager think that you aren't doing your job. Some suggestions are given below; please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas:

a. Mean / Standard Deviation / Median
b. Bar chart or box plot of the data
c. Is the data correlated to the target variable (or to other variables)?
d. Are any of the variables missing and need to be imputed ("fixed")?
Table 1. Moneyball dataset (first six rows)

| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39 | 1445 | 194 | 39 | 13 | 143 | 842 | NA | NA | NA | 9364 | 84 | 927 | 5456 | 1011 | NA |
| 2 | 70 | 1339 | 219 | 22 | 190 | 685 | 1075 | 37 | 28 | NA | 1347 | 191 | 689 | 1082 | 193 | 155 |
| 3 | 86 | 1377 | 232 | 35 | 137 | 602 | 917 | 46 | 27 | NA | 1377 | 137 | 602 | 917 | 175 | 153 |
| 4 | 70 | 1387 | 209 | 38 | 96 | 451 | 922 | 43 | 30 | NA | 1396 | 97 | 454 | 928 | 164 | 156 |
| 5 | 82 | 1297 | 186 | 27 | 102 | 472 | 920 | 49 | 39 | NA | 1297 | 102 | 472 | 920 | 138 | 168 |
| 6 | 75 | 1279 | 200 | 36 | 92 | 443 | 973 | 107 | 59 | NA | 1279 | 92 | 443 | 973 | 123 | 149 |
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 1 | 2276 | 1268.46353 | 736.34904 | 1270.5 | 1268.56970 | 952.5705 | 1 | 2535 | 2534 | 0.0042149 | -1.2167564 | 15.4346788 |
| TARGET_WINS | 2 | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 14.8260 | 0 | 146 | 146 | -0.3987232 | 1.0274757 | 0.3301823 |
| TEAM_BATTING_H | 3 | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 114.1602 | 891 | 2554 | 1663 | 1.5713335 | 7.2785261 | 3.0307891 |
| TEAM_BATTING_2B | 4 | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 47.4432 | 69 | 458 | 389 | 0.2151018 | 0.0061609 | 0.9810087 |
| TEAM_BATTING_3B | 5 | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 23.7216 | 0 | 223 | 223 | 1.1094652 | 1.5032418 | 0.5856226 |
| TEAM_BATTING_HR | 6 | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 78.5778 | 0 | 264 | 264 | 0.1860421 | -0.9631189 | 1.2691285 |
| TEAM_BATTING_BB | 7 | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 94.8864 | 0 | 878 | 878 | -1.0257599 | 2.1828544 | 2.5713150 |
| TEAM_BATTING_SO | 8 | 2174 | 735.60534 | 248.52642 | 750.0 | 742.31322 | 284.6592 | 0 | 1399 | 1399 | -0.2978001 | -0.3207992 | 5.3301912 |
| TEAM_BASERUN_SB | 9 | 2145 | 124.76177 | 87.79117 | 101.0 | 110.81188 | 60.7866 | 0 | 697 | 697 | 1.9724140 | 5.4896754 | 1.8955584 |
| TEAM_BASERUN_CS | 10 | 1504 | 52.80386 | 22.95634 | 49.0 | 50.35963 | 17.7912 | 0 | 201 | 201 | 1.9762180 | 7.6203818 | 0.5919414 |
| TEAM_BATTING_HBP | 11 | 191 | 59.35602 | 12.96712 | 58.0 | 58.86275 | 11.8608 | 29 | 95 | 66 | 0.3185754 | -0.1119828 | 0.9382681 |
| TEAM_PITCHING_H | 12 | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 174.9468 | 1137 | 30132 | 28995 | 10.3295111 | 141.8396985 | 29.4889618 |
| TEAM_PITCHING_HR | 13 | 2276 | 105.69859 | 61.29875 | 107.0 | 103.15697 | 74.1300 | 0 | 343 | 343 | 0.2877877 | -0.6046311 | 1.2848886 |
| TEAM_PITCHING_BB | 14 | 2276 | 553.00791 | 166.35736 | 536.5 | 542.62459 | 98.5929 | 0 | 3645 | 3645 | 6.7438995 | 96.9676398 | 3.4870317 |
| TEAM_PITCHING_SO | 15 | 2174 | 817.73045 | 553.08503 | 813.5 | 796.93391 | 257.2311 | 0 | 19278 | 19278 | 22.1745535 | 671.1891292 | 11.8621151 |
| TEAM_FIELDING_E | 16 | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 62.2692 | 65 | 1898 | 1833 | 2.9904656 | 10.9702717 | 4.7743279 |
| TEAM_FIELDING_DP | 17 | 1990 | 146.38794 | 26.22639 | 149.0 | 147.57789 | 23.7216 | 52 | 228 | 176 | -0.3889390 | 0.1817397 | 0.5879114 |
There are 2276 records with 17 variables. The average number of target wins is around 81, with a standard deviation (SD) of about 16.
Let's take a look at the distribution of data points across our variables. The charts below display the ranges of the measured variables, including the outliers, the median, and the interquartile "box" where the majority of the data points lie. The variance of some of the explanatory variables greatly exceeds the variance of the response variable, TARGET_WINS.
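As a minimal sketch of how such box plots can be produced with ggplot2 (assuming `train_df` is the training data as loaded in the appendix; the report's interactive versions use highcharter instead):

```r
# Minimal box-plot sketch, assuming train_df as loaded in the appendix;
# each variable gets its own box showing median, IQR, and outliers.
library(tidyverse)

train_df %>%
  dplyr::select(-INDEX) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(na.rm = TRUE) +
  coord_flip() +
  labs(x = NULL, y = NULL)
```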
During the exploration phase, we examined a correlation heatmap to understand the relationships between the variables in our dataset. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it challenging to determine their individual effects on the dependent variable.

In our dataset, we identified four pairs of variables with a correlation coefficient of 1.00, indicating a perfect linear relationship: TEAM_BATTING_H with TEAM_PITCHING_H, TEAM_BATTING_HR with TEAM_PITCHING_HR, TEAM_BATTING_BB with TEAM_PITCHING_BB, and TEAM_BATTING_SO with TEAM_PITCHING_SO (the batting and pitching values coincide in most of the rows shown in Table 1).

Such strong correlations create challenges for model interpretation, making it difficult to ascertain the effect of each variable individually. To mitigate this, we checked for skewness and carefully considered which variables to include in our modeling process.
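As an illustrative sketch, such near-perfectly correlated pairs can be flagged programmatically; the 0.95 cutoff and the pairwise-complete correlation option are our own choices here, not taken from the original analysis:

```r
# Sketch: flag near-perfectly correlated variable pairs in train_df.
# The 0.95 cutoff is an illustrative choice.
cm <- cor(train_df[, -1], use = "pairwise.complete.obs")  # drop INDEX
cm[upper.tri(cm, diag = TRUE)] <- NA                      # count each pair once
idx <- which(abs(cm) > 0.95, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = round(cm[idx], 2))
```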
In the process of exploring our dataset, we investigated the distribution of several variables to understand their skewness. Skewness indicates that the data points are not distributed symmetrically around the mean.

We noticed several heavily skewed variables, such as TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_BB, and TEAM_PITCHING_SO. Addressing this skewness is important for building sound predictive models; transformation methods such as the Box-Cox transformation (applied in Model 2 below) aim to normalize the distributions and improve model performance.
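A quick way to quantify this, using the moments package already loaded in the appendix:

```r
# Sketch: per-variable skewness via moments::skewness; values far from 0
# indicate asymmetry (train_df as loaded in the appendix, INDEX dropped).
library(moments)
sort(sapply(train_df[, -1], skewness, na.rm = TRUE), decreasing = TRUE)
```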
Below we see how each variable's data points are distributed around a fitted linear regression line against TARGET_WINS. TEAM_PITCHING_H and TEAM_PITCHING_SO are highly heteroscedastic, while TEAM_BATTING_HBP is the most homoscedastic.
We encountered missing values in certain columns of our dataset. To preserve the integrity of our analysis, we imputed the missing values rather than filtering out entire records, which keeps the dataset complete.

For columns with missing values, such as TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_SO, and TEAM_FIELDING_DP, we replaced the missing values with the median of the respective column. The median is less sensitive to extreme values, making it a suitable choice for imputation. TEAM_BATTING_HBP is 91.61% missing; because this could impact the analysis and compromise model quality, we removed the variable from our dataset. We also removed the INDEX variable: while it serves as a record identifier, it contributes no meaningful information to our predictive model.
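A compact sketch of this imputation step (mirroring the appendix code; INDEX and TEAM_BATTING_HBP are dropped in a later step):

```r
# Sketch of the median imputation described above: replace each NA
# with the median of its column (mirrors the appendix code).
library(dplyr)
training <- train_df %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
```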
**Check whether any NAs remain**
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : num 842 1075 917 922 920 ...
## $ TEAM_BASERUN_SB : int 101 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : num 49 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int 58 58 58 58 58 58 58 58 58 58 ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: num 5456 1082 917 928 920 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: num 149 155 153 156 168 149 186 136 169 159 ...
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.580 -8.599 0.038 8.394 59.983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1689621 5.3601622 4.322 1.61e-05 ***
## TEAM_BATTING_H 0.0480780 0.0036475 13.181 < 2e-16 ***
## TEAM_BATTING_2B -0.0215776 0.0091791 -2.351 0.01882 *
## TEAM_BATTING_3B 0.0734621 0.0163497 4.493 7.37e-06 ***
## TEAM_BATTING_HR 0.0642203 0.0097483 6.588 5.53e-11 ***
## TEAM_BATTING_BB 0.0138940 0.0049198 2.824 0.00478 **
## TEAM_BATTING_SO -0.0076953 0.0025078 -3.069 0.00218 **
## TEAM_BASERUN_SB 0.0268580 0.0042914 6.259 4.63e-10 ***
## TEAM_BASERUN_CS -0.0126634 0.0157644 -0.803 0.42189
## TEAM_PITCHING_BB -0.0021431 0.0031500 -0.680 0.49636
## TEAM_PITCHING_SO 0.0026548 0.0008759 3.031 0.00247 **
## TEAM_FIELDING_E -0.0219496 0.0022200 -9.887 < 2e-16 ***
## TEAM_FIELDING_DP -0.1213765 0.0129517 -9.371 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.08 on 2263 degrees of freedom
## Multiple R-squared: 0.3136, Adjusted R-squared: 0.31
## F-statistic: 86.17 on 12 and 2263 DF, p-value: < 2.2e-16
Residual diagnostics for Model 1: residuals vs. fitted plot, histogram of the residuals, and normal Q-Q plot (see the appendix code).

For Model 2, we applied a Box-Cox transformation to the data (via caret::preProcess) and refit the full model; its summary follows.
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = mb_bc_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.834 -8.135 0.043 8.041 62.218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.759e+05 5.691e+04 -11.876 < 2e-16 ***
## TEAM_BATTING_H 8.807e+05 7.400e+04 11.901 < 2e-16 ***
## TEAM_BATTING_2B -2.195e-01 8.417e-02 -2.607 0.00919 **
## TEAM_BATTING_3B 1.203e-01 1.716e-02 7.014 3.04e-12 ***
## TEAM_BATTING_HR 4.095e-02 1.011e-02 4.051 5.27e-05 ***
## TEAM_BATTING_BB 2.790e-02 3.987e-03 6.999 3.39e-12 ***
## TEAM_BATTING_SO -1.160e-02 2.547e-03 -4.553 5.58e-06 ***
## TEAM_BASERUN_SB 2.633e-02 4.266e-03 6.171 8.03e-10 ***
## TEAM_BASERUN_CS -4.648e-03 1.563e-02 -0.297 0.76626
## TEAM_PITCHING_BB -8.590e-03 2.958e-03 -2.904 0.00372 **
## TEAM_PITCHING_SO 4.074e-03 8.769e-04 4.646 3.59e-06 ***
## TEAM_FIELDING_E -1.297e+03 1.289e+02 -10.061 < 2e-16 ***
## TEAM_FIELDING_DP -2.409e-03 2.457e-04 -9.806 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.14 on 2263 degrees of freedom
## Multiple R-squared: 0.3083, Adjusted R-squared: 0.3046
## F-statistic: 84.04 on 12 and 2263 DF, p-value: < 2.2e-16
Residual diagnostics for Model 2: residuals vs. fitted plot, histogram of the residuals, and normal Q-Q plot.

For Model 3, we ran stepwise variable selection (MASS::stepAIC, both directions) starting from Model 1; its summary follows.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.401 -8.562 0.000 8.400 60.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.868430 5.234963 4.368 1.31e-05 ***
## TEAM_BATTING_H 0.047827 0.003636 13.152 < 2e-16 ***
## TEAM_BATTING_2B -0.021930 0.009170 -2.392 0.016858 *
## TEAM_BATTING_3B 0.074444 0.016320 4.562 5.35e-06 ***
## TEAM_BATTING_HR 0.065398 0.009606 6.808 1.26e-11 ***
## TEAM_BATTING_BB 0.011733 0.003377 3.474 0.000523 ***
## TEAM_BATTING_SO -0.007162 0.002390 -2.996 0.002763 **
## TEAM_BASERUN_SB 0.025968 0.004191 6.196 6.88e-10 ***
## TEAM_PITCHING_SO 0.002217 0.000597 3.713 0.000210 ***
## TEAM_FIELDING_E -0.022093 0.002027 -10.900 < 2e-16 ***
## TEAM_FIELDING_DP -0.121783 0.012944 -9.409 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.08 on 2265 degrees of freedom
## Multiple R-squared: 0.3133, Adjusted R-squared: 0.3103
## F-statistic: 103.3 on 10 and 2265 DF, p-value: < 2.2e-16
Residual diagnostics for Model 3: residuals vs. fitted plot, histogram of the residuals, and normal Q-Q plot.
The normal Q-Q plot is used to check the normality-of-residuals assumption: if the majority of the residuals follow the straight dashed line, the assumption is satisfied.
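For example, a small sketch using ggpubr (loaded in the appendix); the base `qqnorm()`/`qqline()` pair used in the appendix is equivalent:

```r
# Sketch: normal Q-Q plot of Model 3 residuals with a reference line.
library(ggpubr)
ggqqplot(residuals(model3))
```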
Tidy fit statistics for all three models (via broom::glance):
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.314 0.310 13.1 86.2 5.53e-175 12 -9076. 18179. 18259.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.308 0.305 13.1 84.0 3.47e-171 12 -9084. 18197. 18277.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.313 0.310 13.1 103. 9.27e-177 10 -9076. 18176. 18245.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Compare Model 1 and Model 2: the residual standard error (RSE) is higher for Model 2, and the F-statistic and adjusted R² are lower for Model 2. Overall we see a negative change, so Model 1 performed better than Model 2.

Compare Model 2 to Model 3: the RSE is lower for Model 3, and the F-statistic and adjusted R² are higher for Model 3. Overall we see a positive change, so Model 3 performed better than Model 2.

Compare Model 1 to Model 3: the RSE is essentially unchanged, while the F-statistic and adjusted R² are higher for Model 3. Overall we see a positive change, so Model 3 performed better than Model 1.
Based on the analysis above we decided that Model3 is the best model to choose.
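The comparison can also be tabulated directly from the `broom::glance()` output shown above, for example:

```r
# Sketch: side-by-side fit statistics for the three models.
library(broom)
library(dplyr)
bind_rows(glance(model1), glance(model2), glance(model3)) %>%
  mutate(model = c("Model 1", "Model 2", "Model 3"), .before = 1) %>%
  dplyr::select(model, adj.r.squared, sigma, statistic, AIC)
```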
We have already cleaned the evaluation dataset, so we can now feed it to Model 3.
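A sketch of this scoring step (mirroring the appendix code; `eval` is the imputed evaluation data):

```r
# Score the cleaned evaluation set with Model 3 and inspect the result.
library(dplyr)
eval$TARGET_WINS <- predict(model3, newdata = eval)
glimpse(eval)
```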
## Rows: 259
## Columns: 13
## $ TEAM_BATTING_H <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB <dbl> 62, 54, 59, 148, 92, 92, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS <dbl> 50.0, 39.0, 47.0, 57.0, 49.5, 49.5, 49.5, 49.5, 49.5,…
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <dbl> 156, 164, 153, 154, 130, 105, 148, 104, 132, 145, 183…
## $ TARGET_WINS <dbl> 64.26984, 65.77449, 75.20381, 85.78582, 66.48780, 69.…
Comparing summary statistics of the known TARGET_WINS (training data, first block) with the predicted TARGET_WINS (evaluation data, second block):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 71.00 82.00 80.79 92.00 146.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.87 75.57 80.90 80.66 86.18 111.58
In summary, our exploration and modeling of the baseball team dataset have provided valuable insights into the factors influencing team performance, as measured by the number of wins, ‘TARGET_WINS’.
Through a meticulous data preparation process, including handling missing values and addressing multicollinearity, we crafted a thorough predictive model. After evaluating three different models, Model 3 became our preferred choice due to its combination of performance metrics: a lower residual standard error and a higher F-statistic and adjusted R-squared. This model fits the data reasonably well while satisfying the assumptions of linear regression. After applying Model 3 to the evaluation dataset, we observed that the distribution of predicted TARGET_WINS closely resembles the distribution of observed wins in the training data, suggesting the model produces plausible predictions.
This model provides a valuable tool for predicting the number of wins from various team statistics, aiding decision-making in baseball team management. Moving forward, however, continuous monitoring and refinement of the model will contribute to its ongoing accuracy and relevance.
suppressMessages(library(tidyverse))
suppressMessages(library(dplyr))
suppressMessages(library(kableExtra))
suppressMessages(library(knitr))
suppressMessages(library(caret))
suppressMessages(library(corrplot))
suppressMessages(library(mlbench))
suppressMessages(library(randomForest))
suppressMessages(library(highcharter))
suppressMessages(library(reshape))
suppressMessages(library(DataExplorer))
suppressMessages(library(broom))
suppressMessages(library(GGally))
suppressMessages(library(MASS))
suppressMessages(library(ggpubr))
suppressMessages(library(moments))
suppressMessages(library(car))
suppressMessages(library(psych))
# Load the evaluation and training data sets from GitHub
eval_df <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA621/main/moneyball-evaluation-data.csv')
train_df <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA621/main/moneyball-training-data.csv')
knitr::kable(
head(train_df), caption = "Table1.Moneyball dataset")%>%
kable_styling("striped", full_width = F)
# Summary statistics for every variable (psych::describe)
suppressWarnings({ knitr::kable(describe(train_df)) })
new<-train_df %>% dplyr::select(-INDEX)
gather_df <- new %>%
gather(key = 'variable', value = 'value')
# Build the box-plot series for highcharter
dat <- data_to_boxplot(gather_df, value, variable, name = "value")
highchart() %>%
hc_xAxis(type = "category") %>% hc_add_theme(hc_theme_economist())%>%
hc_add_series_list(dat)
hchart(gather_df, "scatter", hcaes(x = variable, y = value, group = variable)) %>%
hc_title(
text = "Closer look to Outliers",
margin = 20,
align = "left")%>% hc_add_theme(hc_theme_economist())
suppressWarnings({df<-train_df %>%
summarise_all(list(~is.na(.)))%>%
pivot_longer(everything(),
names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%mutate(percent = n / nrow(train_df) * 100)})
suppressWarnings({
missing<- df%>%
hchart('bar', hcaes(x = 'variables', y = 'n', group = 'missing')) %>%
# hc_colors(c("#0073C2FF", "#EFC000FF")) %>% hc_add_theme(hc_theme_economist())%>%
hc_title(
text = "Missing Values",
margin = 20,
align = "left")%>% hc_add_theme(hc_theme_economist())
missing})
hchart(cor(train_df, use = "na.or.complete")) %>% hc_title(
text = "Correlation Plot Among The Variables",
margin = 20,
align = "left")%>% hc_add_theme(hc_theme_economist())
suppressWarnings({
train_df%>%
gather(variable, value, - INDEX) %>%ggplot(., aes(value)) +
geom_density(fill = "lightblue", color="blue") + theme (legend.position="none")+
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank())
})
suppressWarnings({train_df %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "lightblue", color="lightblue") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = "value", y = "Wins")+
geom_smooth(method = "lm", color = "blue",se=F, size=0.2)})
plot_missing(train_df)
# Replace NAs in every column with that column's median
training <- train_df %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
eval <- eval_df %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
str(training)
# Drop the identifier and the redundant / mostly-missing columns
t <- training %>% dplyr::select(-INDEX, -TEAM_BATTING_HBP, -TEAM_PITCHING_H, -TEAM_PITCHING_HR)
eval <- eval %>% dplyr::select(-INDEX, -TEAM_BATTING_HBP, -TEAM_PITCHING_H, -TEAM_PITCHING_HR)
plot_missing(t)
model1 <- lm(TARGET_WINS ~., t)
summary(model1)
perf1<-augment(model1)
ggplot(perf1, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals1 = model1$residuals
hist(model_residuals1)
qqnorm(model_residuals1)
qqline(model_residuals1)
# Model 2: Box-Cox-transform the variables via caret, then refit the full model
mbtrain_boxcox <- preProcess(t, c("BoxCox"))
mb_bc_transformed <- predict(mbtrain_boxcox, t)
model2 <- lm(TARGET_WINS ~ ., mb_bc_transformed)
summary(model2)
perf2<-augment(model2)
ggplot(perf2, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals2 = model2$residuals
hist(model_residuals2)
qqnorm(model_residuals2)
qqline(model_residuals2)
# Model 3: stepwise variable selection (both directions) starting from Model 1
model3 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model3)
perf3<-augment(model3)
ggplot(perf3, aes(x= .fitted, y=.resid))+ geom_point()+geom_hline(yintercept=0)
model_residuals3 = model3$residuals
hist(model_residuals3)
qqnorm(model_residuals3)
qqline(model_residuals3)
broom::glance(model1)
broom::glance(model2)
broom::glance(model3)
# Score the evaluation set with Model 3
eval$TARGET_WINS <- predict(model3, eval)
glimpse(eval)
summary(t$TARGET_WINS)
summary(eval$TARGET_WINS)