R write-up

Data Exploration:

The dataset I analyzed contains information about professional baseball teams from 1871 to 2006, with 17 integer columns and 2,276 observations. One column, TEAM_BATTING_HBP, has a significant amount of missing data (2,085 of its 2,276 values). Some columns had a large number of outliers, particularly TEAM_PITCHING_H, which records hits allowed. The response variable TARGET_WINS was approximately normally distributed, while some of the other variables were skewed. The initial correlation matrix revealed high correlation between some variables, such as TEAM_BATTING_HR and TEAM_PITCHING_HR, but most of the matrix could not be computed because of missing data. After cleaning the data, I noticed high correlation between some predictors and thus avoided including them together in the regression model. Most of the variables, however, were not strongly correlated with the response TARGET_WINS.

Data Preparation

I deleted the columns TEAM_BATTING_HBP and TEAM_BASERUN_CS, since they had by far the most missing values (2,085 and 772 of 2,276 observations, respectively). For the other columns with missing data, I used the MICE package in R to impute the missing values. Specifically, I used classification and regression trees (CART) on TEAM_FIELDING_DP and predictive mean matching on TEAM_BASERUN_SB, since they had the most missing values after removing the other columns. I then dropped the remaining observations with missing values, along with the observations where TARGET_WINS was zero, since I wanted to perform a Box-Cox transformation, which requires a strictly positive response.

Build Models

I created five linear regression models. The first model included all predictor variables against the response. Then, I used stepwise selection to remove insignificant predictors. Next, I applied the Box-Cox transformation and raised the response to the power of 1.3536, which maximized the log-likelihood of the transformed data and improved the model slightly. The coefficients of the model had both positive and negative slopes, since some predictors increase and others decrease a team’s chance of winning. For instance, in my final model, TEAM_BATTING_H had a slope of 0.263, meaning that each additional base hit is associated with an increase of 0.263 in the transformed response TARGET_WINS^1.3536, holding the other predictors fixed. This outcome was expected, as a hit can increase a team’s chances of scoring and ultimately winning the game.
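
Because the reported slope applies to the transformed response, its size in actual wins depends on the team's baseline. The following is a rough sketch, assuming the simple power form y^lambda used in the appendix code and an illustrative baseline of 81 wins (a hypothetical value, not taken from the analysis). The helper name `wins_per_unit` is also hypothetical. It uses the first-order approximation slope / (lambda * wins^(lambda - 1)).

```r
lambda <- 1.3536   # power applied to TARGET_WINS in the write-up
slope_h <- 0.263   # reported TEAM_BATTING_H slope on the transformed scale

# Hypothetical helper: approximate change in wins per one-unit change in a
# predictor, using the derivative of y^lambda at a given baseline.
wins_per_unit <- function(slope, baseline_wins, lambda) {
  slope / (lambda * baseline_wins^(lambda - 1))
}

effect_at_81 <- wins_per_unit(slope_h, baseline_wins = 81, lambda = lambda)
print(round(effect_at_81, 3))
```

At this baseline the approximation gives roughly 0.04 wins per additional hit, which is of the same order as the TEAM_BATTING_H estimates in the untransformed models in the appendix.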

Model Selection:

For my final model, I selected the model with the Box-Cox transformation. This model included all significant variables and had the lowest root-mean-squared error (RMSE) of the models, with a score of 12.43907. The diagnostic checks for this model suggested that the regression assumptions were reasonably met: the residuals were scattered with no distinct pattern in the residual plot, and the QQ plot was consistent with normality. Additionally, the F-statistic for the model was 158 and the adjusted R-squared value was 0.33.
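
For a model fitted on a power-transformed response, an RMSE in wins has to be computed after back-transforming the predictions. The sketch below shows one way to do that on hypothetical toy data (the data frame `toy`, its column `hits`, and the 140/60 split are all made up for illustration). It is not the code that produced the RMSE reported above.

```r
set.seed(1)
lambda <- 1.3536  # power used for the response in the write-up

# Hypothetical toy data standing in for a training and a held-out set.
toy <- data.frame(hits = runif(200, 1200, 1700))
toy$wins <- 20 + 0.04 * toy$hits + rnorm(200, sd = 5)
train <- toy[1:140, ]
test <- toy[141:200, ]

# Fit on the transformed response, predict, then undo the transformation.
fit <- lm(I(wins^lambda) ~ hits, data = train)
pred_transformed <- predict(fit, newdata = test)
pred_wins <- pred_transformed^(1 / lambda)

# RMSE on the original scale (wins).
rmse <- sqrt(mean((test$wins - pred_wins)^2))
print(rmse)
```

Computing the error on the original scale keeps it comparable with the RMSE of the untransformed models.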

The equation of the model is:

Y^1.3536 = 83.63 + 0.263 * TEAM_BATTING_H + 0.49 * TEAM_BATTING_HR + 0.0709 * TEAM_BATTING_BB - 0.09 * TEAM_BATTING_SO + 0.293 * TEAM_BASERUN_SB - 0.188 * TEAM_FIELDING_E - 0.693 * TEAM_FIELDING_DP

To make predictions with this model, I had to apply the inverse of the transformation to get the predicted TARGET_WINS on its original scale, which is easier to interpret, i.e., raise the model output to the power 1/1.3536.
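
As a worked example of the back-transformation, the sketch below plugs one team into the reported equation. It assumes the exponent is 1.3536 (the value used in the appendix code) and that the TEAM_FIELDING_DP term is a product with -0.693. The input values are the second row of the `str(training)` output in the appendix, and the variable names `coefs` and `team` are hypothetical.

```r
lambda <- 1.3536  # exponent as used in the appendix code

# Coefficients as reported in the model equation of the write-up.
coefs <- c(
  intercept        = 83.63,
  TEAM_BATTING_H   = 0.263,
  TEAM_BATTING_HR  = 0.49,
  TEAM_BATTING_BB  = 0.0709,
  TEAM_BATTING_SO  = -0.09,
  TEAM_BASERUN_SB  = 0.293,
  TEAM_FIELDING_E  = -0.188,
  TEAM_FIELDING_DP = -0.693
)

# Second row of the str(training) output in the appendix (observed wins: 70).
team <- c(
  TEAM_BATTING_H   = 1339,
  TEAM_BATTING_HR  = 190,
  TEAM_BATTING_BB  = 685,
  TEAM_BATTING_SO  = 1075,
  TEAM_BASERUN_SB  = 37,
  TEAM_FIELDING_E  = 193,
  TEAM_FIELDING_DP = 155
)

# Linear predictor on the transformed scale, then the inverse power transform.
y_transformed <- coefs[["intercept"]] + sum(coefs[names(team)] * team)
predicted_wins <- y_transformed^(1 / lambda)
print(round(predicted_wins, 1))
```

For this team the equation gives about 75 predicted wins against 70 observed, which is within the model's reported RMSE.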

Source Citations:

Here were some websites that helped me with my analysis and the data imputation:

Wu, Songhao. “Multi-Collinearity in Regression.” Medium, Towards Data Science, 5 June 2021, https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea.

“Imputation in R: Top 3 Ways for Imputing Missing Data.” Machine Learning, R Programming, 8 Oct. 2021, https://appsilon.com/imputation-in-r/.

Appendix:

Here is my R code stored as an appendix:

Introduction

(Data Exploration):

The training dataset contains seventeen columns and two thousand two hundred seventy-six observations about professional baseball teams from 1871 to 2006.

## Step 1 call in your libraries and import the data from csv and read it into R
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(corrplot)

## corrplot 0.92 loaded

training <- read.csv('https://raw.githubusercontent.com/AldataSci/Baseball-Data/main/moneyball-training-data.csv')

Looking at the structure of the dataset, we can see that all columns are integers and that one column, TEAM_BATTING_HBP, shows only NA values in the first rows of the data.

str(training)

## 'data.frame':    2276 obs. of  17 variables:
##  $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

A quick glance at the summary statistics of the columns.

## One of the columns has 2,085 missing values out of 2,276 rows:
## TEAM_BATTING_HBP, the column for batters hit by pitch (may have to remove this column)
summary(training)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

We can see that TEAM_BATTING_HBP contains 2,085 missing values, followed by TEAM_BASERUN_CS with 772, so I may have to omit those columns from the dataset.

## Easier to see all the missing values
sapply(training,function(x) sum(is.na(x)))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0              102 
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
##              131              772             2085                0 
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##                0                0              102                0 
## TEAM_FIELDING_DP 
##              286

From the boxplot, the column TEAM_PITCHING_H has a lot of outliers, so I may consider removing it from the model so that it does not skew the fit.
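
The visual impression from a boxplot can be backed up with a count per column. A minimal sketch, assuming the same 1.5 * IQR rule that boxplots use by default; the helper `count_outliers` and the data frame `toy` are hypothetical, and this step is not part of the original analysis.

```r
# Hypothetical helper: number of points a default boxplot would flag.
# boxplot.stats() ignores NA values and applies the 1.5 * IQR rule.
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up toy data: one well-behaved column and one with an extreme value.
toy <- data.frame(
  steady = c(10, 11, 12, 13, 14, 15, 16),
  spiky  = c(10, 11, 12, 13, 14, 15, 300)
)

outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Applied to the training data, the same call would be `sapply(training, count_outliers)`.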

## Let's try the ggplot method and melt-method..
data_long <- melt(training)

## No id variables; using all as measure variables

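## plot the boxplots with ggplot; there are a lot of outliers in TEAM_PITCHING_H
## (fill is set as a fixed colour outside aes() so the boxes are actually red and no legend is drawn)
gg <- ggplot(data_long, aes(x = variable, y = value)) + geom_boxplot(fill = "red") + coord_flip() + xlab("Columns")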
gg

gg + coord_cartesian(ylim = c(0,2000)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

data_gathered <- training %>%
  gather(variable,value)

The histograms show various distributions: the response variable TARGET_WINS is approximately normally distributed, but some of the others, such as TEAM_FIELDING_E, are skewed.

## each panel can have its own scale when we use scales = "free"
histograms <- ggplot(data_gathered, aes(x = value)) + geom_histogram() +
  facet_wrap(~variable, scales = "free")
histograms

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The correlation matrix shows a lot of question marks, which indicate correlations that could not be computed because of missing data in the columns.

## Let's create a correlation matrix with our data.. 
sum(is.na(training))

## [1] 3478

## there is a lot of missing data in these columns, so I am going to have to remove some of them
corrplot(cor(training))
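
The question marks arise because base R's `cor()` uses `use = "everything"` by default and returns NA for any pair that involves a column with missing values. The sketch below illustrates this on hypothetical toy data (the data frame `toy` is made up) and shows `use = "pairwise.complete.obs"` as one possible alternative. It is not what the analysis here did, which instead removed and imputed columns.

```r
# Made-up toy data with one missing value in column b.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# Default behaviour: pairs involving b cannot be computed.
cor_default <- cor(toy)

# Alternative: use every pair of rows where both values are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

Pairwise deletion computes each correlation on a different subset of rows, so the resulting matrix is best treated as exploratory.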

Part II Data Preparation:

Removal of NA values

I’ve removed the TEAM_BATTING_HBP and TEAM_BASERUN_CS columns since they contained a lot of missing values.

## Cleaning the data and imputing some of it: I am going to remove the columns TEAM_BATTING_HBP (batters hit by pitch) and TEAM_BASERUN_CS (caught stealing), both of which happen relatively rarely, since they have a lot of missing data, and I will impute the remaining columns that have missing values.

Training <- training %>%
  dplyr::select(-c(TEAM_BATTING_HBP,TEAM_BASERUN_CS))

sapply(Training,function(x) sum(is.na(x)))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0              102 
##  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##              131                0                0                0 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##              102                0              286

Imputation using MICE

I am going to try imputing the missing values with the MICE package using three methods: predictive mean matching (pmm), classification and regression trees (cart), and lasso linear regression (lasso.norm). For each variable, I will check which method produces a distribution that most closely resembles that of the observed data and use that method to impute the missing values.

## Now I will impute the data with the mice package..
library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

mice_imputed <- data.frame(
original = Training$TEAM_FIELDING_DP,
imp_pmm = complete(mice(Training,method ="pmm"))$TEAM_FIELDING_DP,
imp_cart = complete(mice(Training,method ="cart"))$TEAM_FIELDING_DP,
imp_lasso = complete(mice(Training,method ="lasso.norm"))$TEAM_FIELDING_DP
)

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO  TEAM_FIELDING_DP

head(mice_imputed)

I am going to compare the distribution of each imputed version with the original and figure out which one most closely resembles it.

## compare the distribution of each imputation and see which one resembles the original the most..
## I think imp_cart looks most similar to the original histogram, so I will use those values.
par(mfrow=c(2,2))
hist(mice_imputed$original)
hist(mice_imputed$imp_pmm)
hist(mice_imputed$imp_cart)
hist(mice_imputed$imp_lasso)

## replace the values with the imputed values..
Training$TEAM_FIELDING_DP <- mice_imputed$imp_cart

## now I will impute the rest of the columns with the same method..
sapply(Training,function(x) sum(is.na(x)))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0              102 
##  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##              131                0                0                0 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##              102                0                0

## I will impute TEAM_BASERUN_SB, which is stolen bases..
mice_imputed2 <- data.frame(
original = Training$TEAM_BASERUN_SB,
imp_pmm = complete(mice(Training,method ="pmm"))$TEAM_BASERUN_SB,
imp_cart = complete(mice(Training,method ="cart"))$TEAM_BASERUN_SB,
imp_lasso = complete(mice(Training,method ="lasso.norm"))$TEAM_BASERUN_SB
)

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO

head(mice_imputed2)

## I will impute this column with imp_pmm since it resembles the original histogram..
par(mfrow=c(2,2))
hist(mice_imputed2$original)
hist(mice_imputed2$imp_pmm)
hist(mice_imputed2$imp_cart)
hist(mice_imputed2$imp_lasso)

## impute TEAM_BASERUN_SB with these values since the distributions look similar
Training$TEAM_BASERUN_SB <- mice_imputed2$imp_pmm

## looking at the missing values again, I think I should be fine this time..
sapply(Training,function(x) sum(is.na(x)))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0              102 
##  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##                0                0                0                0 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##              102                0                0

## drop the rows that still have missing values, then look at the correlation matrix again to see if I can glean any valuable information..
Training <- na.omit(Training)

corrplot(cor(Training),method = "color")

Part III (Model-Creation)

## I am going to split the training data set into training and testing datasets...
## 70% in Training and 30% in Testing..
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

set.seed(123)
index <- createDataPartition(Training$TARGET_WINS,p=0.7,list = FALSE)

Ttraining <- Training[index,]
Ttest <- Training[-index,]

Model I (All the Predictors minus the Index)

## It went up only a little bit.. but that's fine.. 
mod1 <- lm(TARGET_WINS ~ .-INDEX,data=Ttraining)
summary(mod1)

## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = Ttraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.598  -8.275  -0.002   8.180  65.562 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      33.2154867  6.2560068   5.309 1.26e-07 ***
## TEAM_BATTING_H    0.0442560  0.0042495  10.414  < 2e-16 ***
## TEAM_BATTING_2B  -0.0276487  0.0109659  -2.521  0.01179 *  
## TEAM_BATTING_3B   0.0611680  0.0197807   3.092  0.00202 ** 
## TEAM_BATTING_HR   0.0642616  0.0300527   2.138  0.03265 *  
## TEAM_BATTING_BB   0.0108344  0.0065623   1.651  0.09895 .  
## TEAM_BATTING_SO  -0.0150882  0.0029569  -5.103 3.77e-07 ***
## TEAM_BASERUN_SB   0.0412726  0.0048644   8.485  < 2e-16 ***
## TEAM_PITCHING_H  -0.0001039  0.0004585  -0.227  0.82071    
## TEAM_PITCHING_HR  0.0270565  0.0262724   1.030  0.30325    
## TEAM_PITCHING_BB -0.0014802  0.0045355  -0.326  0.74419    
## TEAM_PITCHING_SO  0.0033969  0.0010182   3.336  0.00087 ***
## TEAM_FIELDING_E  -0.0348440  0.0031879 -10.930  < 2e-16 ***
## TEAM_FIELDING_DP -0.1184901  0.0158941  -7.455 1.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.59 on 1510 degrees of freedom
## Multiple R-squared:  0.3657, Adjusted R-squared:  0.3602 
## F-statistic: 66.96 on 13 and 1510 DF,  p-value: < 2.2e-16

Model II (Getting rid of the non-significant variables)

## I will get rid of the non-significant variables TEAM_PITCHING_H, TEAM_PITCHING_HR and TEAM_PITCHING_BB; the adjusted R-squared goes up slightly (0.3602 to 0.361). Since the remaining predictors are all significant, I will look at the diagnostics..
mod2 <- lm(TARGET_WINS ~ .-INDEX-TEAM_PITCHING_H-TEAM_PITCHING_HR-TEAM_PITCHING_BB,data=Ttraining)
summary(mod2)

## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX - TEAM_PITCHING_H - TEAM_PITCHING_HR - 
##     TEAM_PITCHING_BB, data = Ttraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.450  -8.196  -0.005   8.102  65.939 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      32.2438320  6.1261755   5.263 1.62e-07 ***
## TEAM_BATTING_H    0.0446899  0.0041885  10.670  < 2e-16 ***
## TEAM_BATTING_2B  -0.0282364  0.0109156  -2.587 0.009780 ** 
## TEAM_BATTING_3B   0.0646486  0.0192677   3.355 0.000812 ***
## TEAM_BATTING_HR   0.0927076  0.0112779   8.220 4.32e-16 ***
## TEAM_BATTING_BB   0.0090047  0.0037847   2.379 0.017471 *  
## TEAM_BATTING_SO  -0.0146237  0.0028015  -5.220 2.04e-07 ***
## TEAM_BASERUN_SB   0.0414658  0.0046268   8.962  < 2e-16 ***
## TEAM_PITCHING_SO  0.0030944  0.0005988   5.168 2.68e-07 ***
## TEAM_FIELDING_E  -0.0350488  0.0025508 -13.740  < 2e-16 ***
## TEAM_FIELDING_DP -0.1173787  0.0158475  -7.407 2.14e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.58 on 1513 degrees of freedom
## Multiple R-squared:  0.3652, Adjusted R-squared:  0.361 
## F-statistic: 87.04 on 10 and 1513 DF,  p-value: < 2.2e-16

plot(fitted(mod2),residuals(mod2),xlab="Fitted",ylab="Residuals")

## attempt a Box-Cox transformation; rows with zero wins are dropped first because the response must be positive..
Ttraining <- Ttraining %>%
  filter(TARGET_WINS != 0)
Ttest <- Ttest %>%
  filter(TARGET_WINS != 0)

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

set.seed(123)
bcox <-boxcox(mod2,plotit = T)

val <- cbind(bcox$x,bcox$y)

## sort the values by log-likelihood in descending order.. the lambda value that maximizes the log-likelihood of the transformed data is about 1.3535
head(val[order(-bcox$y),])

##          [,1]      [,2]
## [1,] 1.353535 -2769.905
## [2,] 1.393939 -2769.937
## [3,] 1.313131 -2770.073
## [4,] 1.434343 -2770.166
## [5,] 1.272727 -2770.447
## [6,] 1.474747 -2770.588

Model III (Box-Cox Transformation)

## Let's use the lambda value on our model to see if it improves the model, even if only a little bit.
bmod3 <- lm(TARGET_WINS ^(1.3536) ~ .-INDEX-TEAM_PITCHING_H-TEAM_PITCHING_HR-TEAM_PITCHING_BB,data=Ttraining)
summary(bmod3)

## 
## Call:
## lm(formula = TARGET_WINS^(1.3536) ~ . - INDEX - TEAM_PITCHING_H - 
##     TEAM_PITCHING_HR - TEAM_PITCHING_BB, data = Ttraining)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -312.67  -53.02   -1.40   51.29  444.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      79.192463  39.101153   2.025 0.043010 *  
## TEAM_BATTING_H    0.282125   0.026966  10.462  < 2e-16 ***
## TEAM_BATTING_2B  -0.181950   0.069413  -2.621 0.008848 ** 
## TEAM_BATTING_3B   0.403303   0.121614   3.316 0.000934 ***
## TEAM_BATTING_HR   0.599837   0.071339   8.408  < 2e-16 ***
## TEAM_BATTING_BB   0.059411   0.023887   2.487 0.012984 *  
## TEAM_BATTING_SO  -0.095918   0.017734  -5.409 7.37e-08 ***
## TEAM_BASERUN_SB   0.250584   0.029311   8.549  < 2e-16 ***
## TEAM_PITCHING_SO  0.019955   0.003795   5.258 1.66e-07 ***
## TEAM_FIELDING_E  -0.204932   0.016580 -12.360  < 2e-16 ***
## TEAM_FIELDING_DP -0.756543   0.099965  -7.568 6.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.35 on 1512 degrees of freedom
## Multiple R-squared:  0.3456, Adjusted R-squared:  0.3413 
## F-statistic: 79.87 on 10 and 1512 DF,  p-value: < 2.2e-16

## compare the residual plots (mod2 first, then bmod3).. the Box-Cox fit looks a bit better
plot(fitted(mod2),residuals(mod2),xlab="Fitted",ylab="Residuals")

plot(fitted(bmod3),residuals(bmod3),xlab="Fitted",ylab="Residuals")

Model Four (Removing TEAM_BATTING_3B)

## This looks good I think, here I also removed TEAM_BATTING_3B..
## Note: this model is fit on Training (2,174 rows), not on the Ttraining split used for bmod3.
bmod4 <- lm(TARGET_WINS ^(1.3536) ~ .-INDEX-TEAM_PITCHING_H-TEAM_PITCHING_HR-TEAM_PITCHING_BB-TEAM_BATTING_3B,data=Training)
summary(bmod4)

## 
## Call:
## lm(formula = TARGET_WINS^(1.3536) ~ . - INDEX - TEAM_PITCHING_H - 
##     TEAM_PITCHING_HR - TEAM_PITCHING_BB - TEAM_BATTING_3B, data = Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.89  -53.97    0.03   51.63  428.18 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      65.983524  33.139593   1.991 0.046598 *  
## TEAM_BATTING_H    0.298351   0.021164  14.097  < 2e-16 ***
## TEAM_BATTING_2B  -0.132452   0.056678  -2.337 0.019534 *  
## TEAM_BATTING_HR   0.484318   0.056101   8.633  < 2e-16 ***
## TEAM_BATTING_BB   0.076188   0.019604   3.886 0.000105 ***
## TEAM_BATTING_SO  -0.097033   0.015011  -6.464 1.25e-10 ***
## TEAM_BASERUN_SB   0.266067   0.023915  11.125  < 2e-16 ***
## TEAM_PITCHING_SO  0.018152   0.003661   4.959 7.65e-07 ***
## TEAM_FIELDING_E  -0.193911   0.013677 -14.178  < 2e-16 ***
## TEAM_FIELDING_DP -0.750365   0.084105  -8.922  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.71 on 2164 degrees of freedom
## Multiple R-squared:  0.3362, Adjusted R-squared:  0.3334 
## F-statistic: 121.8 on 9 and 2164 DF,  p-value: < 2.2e-16

Model Five (Removing more of the less significant variables)

## Here I also removed TEAM_BATTING_2B and TEAM_PITCHING_SO, and I'm curious how the smaller model does..
bmod5 <- lm(TARGET_WINS ^(1.3536) ~ .-INDEX-TEAM_PITCHING_H-TEAM_PITCHING_HR-TEAM_PITCHING_BB-TEAM_BATTING_3B-TEAM_BATTING_2B-TEAM_PITCHING_SO,data=Training)
summary(bmod5)

## 
## Call:
## lm(formula = TARGET_WINS^(1.3536) ~ . - INDEX - TEAM_PITCHING_H - 
##     TEAM_PITCHING_HR - TEAM_PITCHING_BB - TEAM_BATTING_3B - TEAM_BATTING_2B - 
##     TEAM_PITCHING_SO, data = Training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -314.47  -54.48   -0.52   51.95  413.36 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      83.89418   31.68828   2.647 0.008168 ** 
## TEAM_BATTING_H    0.26447    0.01575  16.796  < 2e-16 ***
## TEAM_BATTING_HR   0.45287    0.05604   8.081 1.06e-15 ***
## TEAM_BATTING_BB   0.07403    0.01970   3.758 0.000176 ***
## TEAM_BATTING_SO  -0.07785    0.01349  -5.770 9.07e-09 ***
## TEAM_BASERUN_SB   0.25831    0.02374  10.881  < 2e-16 ***
## TEAM_FIELDING_E  -0.17418    0.01322 -13.176  < 2e-16 ***
## TEAM_FIELDING_DP -0.74332    0.08451  -8.796  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.17 on 2166 degrees of freedom
## Multiple R-squared:  0.3278, Adjusted R-squared:  0.3257 
## F-statistic: 150.9 on 7 and 2166 DF,  p-value: < 2.2e-16

Looking at the diagnostics

I think the model reasonably satisfies the linear regression assumptions, although a few outliers stand out in the Cook’s distance (residuals vs. leverage) panel.

par(mfrow=c(2,2))
plot(bmod5)
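
To put numbers on the points that the Cook’s distance panel highlights, cooks.distance() can be applied to a fitted model. This is a minimal sketch on made-up toy data with one deliberately extreme row; the 4/n cutoff is a common rule of thumb that I am assuming here, not a threshold used in this analysis.

```r
set.seed(42)
## toy data with one deliberately extreme observation (illustrative only)
toy <- data.frame(x = c(rnorm(50), 8))
toy$y <- 2 + 3 * toy$x + rnorm(51)
toy$y[51] <- -20

fit <- lm(y ~ x, data = toy)

cd <- cooks.distance(fit)          # one Cook's distance per observation
cutoff <- 4 / nrow(toy)            # rule-of-thumb threshold (an assumption)
influential <- which(cd > cutoff)  # positions of the flagged observations
influential
```

The same two calls should work on bmod5, with nrow() applied to the data the model was fit on.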

(Part IV) Model selection (using RMSE)

In this section I calculated the root mean squared error (RMSE) on the held-out data for each model I found interesting. bmod5 had the lowest RMSE (12.507), narrowly ahead of bmod4 (12.518), so I chose it as the final model.
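
For reference, the RMSE is the square root of the mean squared difference between predicted and observed wins, and predictions from the Box-Cox models have to be raised to the power 1/lambda before they are compared with TARGET_WINS. A minimal base R sketch with made-up numbers follows; rmse_manual and back_transform are hypothetical helper names, not functions from caret.

```r
## hypothetical helpers (not part of caret or the original analysis)
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))
back_transform <- function(pred, lambda) pred^(1 / lambda)

lambda <- 1.3536                               # power used for the Box-Cox models

obs <- c(60, 75, 82, 91)                       # made-up win totals
pred_transformed <- c(62, 73, 85, 88)^lambda   # made-up predictions on the transformed scale

pred_wins <- back_transform(pred_transformed, lambda)  # back on the wins scale
rmse_manual(pred_wins, obs)
```

A negative prediction on the transformed scale would become NaN after back-transforming, so that is worth checking before computing the RMSE.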

## I will then use mod1, mod2, bmod4 and bmod5 and compare their RMSEs

## import the caret library..

library(caret)

predictions_1 <- predict(mod1,Ttest)
head(predictions_1)

##        1        2        3        4        5        6 
## 62.93778 75.29586 67.27069 66.65399 69.42688 86.56737

rmse <- RMSE(predictions_1,Ttest$TARGET_WINS)
rmse

## [1] 12.57593

## create the next predictions with mod2

predictions_2 <- predict(mod2,Ttest)
head(predictions_2)

##        1        2        3        4        5        6 
## 61.46682 75.30854 67.27508 66.78664 69.27285 86.69152

rmse2 <- RMSE(predictions_2,Ttest$TARGET_WINS)
rmse2

## [1] 12.57123

## predict with bmod4 (these predictions are on the transformed scale)
predictions_3 <- predict(bmod4,Ttest)

## make sure to inverse the box-cox transformation
inv_box_pred <- predictions_3 ^(1/1.3536)
rmse3 <- RMSE(inv_box_pred,Ttest$TARGET_WINS)
head(inv_box_pred)

##        1        2        3        4        5        6 
## 64.98320 76.10651 67.83114 64.94323 71.38245 87.60343

rmse3

## [1] 12.51824

predictions_4 <- predict(bmod5,Ttest)

## make sure to inverse the box-cox transformation
inv_box_pred2 <- predictions_4 ^(1/1.3536)
rmse4 <- RMSE(inv_box_pred2,Ttest$TARGET_WINS)
head(inv_box_pred2)

rmse4

## [1] 12.50728
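
Collecting the four RMSE values in one named vector makes the comparison easier to read. The numbers are copied from the output above; rmse_all and best_model are made-up names.

```r
## RMSE values copied from the output above
rmse_all <- c(mod1 = 12.57593, mod2 = 12.57123, bmod4 = 12.51824, bmod5 = 12.50728)

sort(rmse_all)                            # smallest RMSE first
best_model <- names(which.min(rmse_all))  # model with the lowest RMSE
best_model
```

The spread between the best and worst model is under 0.07 wins, so the ranking should be read with some caution.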

Cleaning the testing dataset

I cleaned the testing dataset in a manner similar to the training dataset: I deleted the mostly empty columns, imputed the missing values in some of the others, and omitted the remaining rows with missing values.

## Load the evaluation data.. predictions will be made with bmod5
Test <- read.csv("https://raw.githubusercontent.com/AldataSci/Baseball-Data/main/moneyball-evaluation-data.csv")

## Before I do that, I have to clean the test data for the linear regression model.. I will clean it in a manner that resembles the training set

str(Test)

## 'data.frame':    259 obs. of  16 variables:
##  $ INDEX           : int  9 10 14 47 60 63 74 83 98 120 ...
##  $ TEAM_BATTING_H  : int  1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
##  $ TEAM_BATTING_2B : int  170 151 183 309 203 236 219 158 177 212 ...
##  $ TEAM_BATTING_3B : int  33 29 29 29 68 53 55 42 78 42 ...
##  $ TEAM_BATTING_HR : int  83 88 93 159 5 10 37 33 23 58 ...
##  $ TEAM_BATTING_BB : int  447 516 509 486 95 215 568 356 466 452 ...
##  $ TEAM_BATTING_SO : int  1080 929 816 914 416 377 527 609 689 584 ...
##  $ TEAM_BASERUN_SB : int  62 54 59 148 NA NA 365 185 150 52 ...
##  $ TEAM_BASERUN_CS : int  50 39 47 57 NA NA NA NA NA NA ...
##  $ TEAM_BATTING_HBP: int  NA NA NA 42 NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
##  $ TEAM_PITCHING_HR: int  83 88 93 159 14 20 40 39 25 62 ...
##  $ TEAM_PITCHING_BB: int  447 516 509 486 257 420 613 418 497 482 ...
##  $ TEAM_PITCHING_SO: int  1080 929 816 914 1123 736 569 715 734 622 ...
##  $ TEAM_FIELDING_E : int  140 135 156 124 616 572 490 328 226 184 ...
##  $ TEAM_FIELDING_DP: int  156 164 153 154 130 105 NA 104 132 145 ...

## count the missing values per column before removing the HBP column again and imputing the rest
sapply(Test,function(x) sum(is.na(x)))

##            INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##                0                0               18               13 
##  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
##               87              240                0                0 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0               18                0               31

## remove HBP and CS
Test <- Test %>%
  dplyr::select(-c(TEAM_BATTING_HBP,TEAM_BASERUN_CS))

sapply(Test,function(x) sum(is.na(x)))

##            INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##                0                0               18               13 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##                0                0                0               18 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0               31

## now we impute.. (printFlag = FALSE suppresses the per-iteration log)

library(mice)
mice_imputed3 <- data.frame(
  original = Test$TEAM_FIELDING_DP,
  imp_pmm = complete(mice(Test, method = "pmm", printFlag = FALSE))$TEAM_FIELDING_DP,
  imp_cart = complete(mice(Test, method = "cart", printFlag = FALSE))$TEAM_FIELDING_DP,
  imp_lasso = complete(mice(Test, method = "lasso.norm", printFlag = FALSE))$TEAM_FIELDING_DP
)

## Warning: Number of logged events: 13

head(mice_imputed3)

par(mfrow=c(2,2))
hist(mice_imputed3$original)
hist(mice_imputed3$imp_pmm)
hist(mice_imputed3$imp_cart)
hist(mice_imputed3$imp_lasso)

## Since imp_cart looks similar to the original distribution, I will use that..

Test$TEAM_FIELDING_DP <- mice_imputed3$imp_cart

## now we impute the next column, which is TEAM_BASERUN_SB (printFlag = FALSE suppresses the per-iteration log)

mice_imputed4 <- data.frame(
  original = Test$TEAM_BASERUN_SB,
  imp_pmm = complete(mice(Test, method = "pmm", printFlag = FALSE))$TEAM_BASERUN_SB,
  imp_cart = complete(mice(Test, method = "cart", printFlag = FALSE))$TEAM_BASERUN_SB,
  imp_lasso = complete(mice(Test, method = "lasso.norm", printFlag = FALSE))$TEAM_BASERUN_SB
)

head(mice_imputed4)

par(mfrow=c(2,2))
hist(mice_imputed4$original)
hist(mice_imputed4$imp_pmm)
hist(mice_imputed4$imp_cart)
hist(mice_imputed4$imp_lasso)

## I will use imp_pmm again and replace that column with the imputed values..
Test$TEAM_BASERUN_SB <- mice_imputed4$imp_pmm

sapply(Test,function(x) sum(is.na(x)))

##            INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##                0                0               18                0 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##                0                0                0               18 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0                0

## Then I will remove the remaining rows with missing values (TEAM_BATTING_SO and TEAM_PITCHING_SO), since the other columns have been imputed..

Testt <- na.omit(Test)


sapply(Testt,function(x) sum(is.na(x)))

##            INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##                0                0                0                0 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##                0                0                0                0 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0                0

Creating predictions with the cleaned test data

Finally, I used the final model (bmod5) to create predictions for the cleaned test dataset.

set.seed(123)
pred <- predict(bmod5,newdata=Testt)


## I have to invert the transformation to get back to the wins scale..
actual_predictions <- pred ^ (1/1.3536)

actual_predictions

##         1         2         3         4         5         6         7         8 
##  61.84930  64.43498  74.71737  88.19905  71.15099  76.57790  85.16479  76.95295 
##         9        10        11        12        13        14        15        16 
##  69.17214  74.36038  70.32894  82.04584  80.99146  83.66965  86.00930  77.66413 
##        17        18        20        21        22        23        24        25 
##  74.94783  79.61878  91.46610  82.44225  85.31416  80.32776  73.51264  83.33830 
##        26        27        28        29        30        31        32        33 
##  88.81144  63.09041  75.91467  85.16738  77.44445  91.25169  85.65728  82.64912 
##        34        35        36        37        38        39        40        41 
##  84.42701  79.44160  87.46804  76.48140  88.93335  85.42981  91.18396  85.64544 
##        42        43        44        45        46        47        48        49 
##  91.76689  23.64848 100.51423  89.60579  92.80141  97.23755  77.01990  69.03631 
##        50        51        52        53        54        55        56        57 
##  80.01837  77.64821  86.72308  76.39701  73.34806  76.14399  78.68639  90.99818 
##        58        61        62        63        64        65        66        67 
##  75.93369  87.12887  73.28574  88.79488  86.77272  84.82156 101.04718  74.03182 
##        68        70        71        72        73        74        75        76 
##  79.35332  86.07246  82.53252  70.80341  77.58168  88.99942  81.46213  83.78643 
##        77        78        81        82        83        84        85        86 
##  81.71207  84.26288  87.11174  87.83810  96.43844  75.29771  84.43779  81.95886 
##        87        88        89        90        91        92        93        97 
##  83.73404  83.53460  89.73692  91.45050  81.27488  85.39572  74.67256  86.79716 
##        98        99       100       101       102       103       104       105 
##  99.80526  85.64957  85.61635  79.25594  75.83839  84.33216  84.14362  79.56430 
##       106       107       108       109       110       111       112       113 
##  75.79221  61.47419  78.44368  87.36180  59.25517  84.82429  86.60988  93.01689 
##       114       115       116       117       118       119       120       121 
##  91.15159  81.01480  79.54025  85.31510  81.82653  75.25450  79.85416  92.94476 
##       125       126       127       128       129       130       131       132 
##  67.46779  87.05470  89.59379  76.34769  92.82324  90.90268  86.60316  81.53925 
##       133       134       135       136       137       138       139       140 
##  81.64148  84.19961  86.66708  77.09248  73.83076  78.15329  88.35539  81.91686 
##       141       143       144       145       146       147       148       149 
##  65.21392  89.82173  73.32767  72.57789  72.25746  77.83180  79.70360  79.06384 
##       150       151       152       153       154       155       156       157 
##  83.91487  82.32363  81.53671  46.85312  69.74669  77.09588  70.44905  90.07318 
##       158       159       161       162       163       164       165       166 
##  78.88035  89.68833 100.24991 104.58684  93.14933 101.64941  96.43412  88.21536 
##       167       168       169       170       172       173       174       175 
##  80.37345  81.87318  73.94455  82.34513  87.99642  81.13663  93.90529  84.45892 
##       176       177       178       179       180       181       182       183 
##  73.66354  78.85652  70.54919  74.54723  79.60696  88.59108  88.99611  86.38980 
##       184       185       186       187       188       189       190       193 
##  85.76187  86.42747  93.43328  86.39272  55.91878  69.65853 112.76496  77.21880 
##       194       195       196       197       198       199       200       201 
##  78.42843  81.81643  69.99719  79.60943  84.32617  79.51418  83.04248  73.85928 
##       202       203       204       205       206       207       208       209 
##  78.57583  72.45275  89.65209  82.31471  83.37808  78.56086  78.62720  82.68337 
##       210       211       212       213       214       215       216       217 
##  69.77133 104.77259  94.23390  79.46385  65.61489  67.69880  82.34726  77.13837 
##       218       219       220       221       222       223       224       225 
##  92.84837  78.25119  78.75911  78.51747  74.94712  82.58438  73.03858  78.85815 
##       226       227       228       229       230       232       233       234 
##  74.54286  82.42076  79.83949  82.24726  70.94105  90.70332  78.27249  89.05509 
##       235       236       237       238       239       240       241       242 
##  80.57293  75.23115  83.21572  76.73696  89.69549  71.04698  87.70732  86.29062 
##       243       244       245       246       247       248       249       250 
##  84.21911  81.98194  61.94749  88.47766  81.58155  85.63590  73.41545  83.97253 
##       251       252       253       254       255       256       257       258 
##  81.32226  65.04006  88.95430  28.93765  69.74016  77.79276  83.41562  85.06883 
##       259 
##  78.92672
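
Because na.omit() dropped some rows, the names on the prediction vector skip a few values (for example 19, 59 and 60). Pairing each prediction with its INDEX keeps track of which record it belongs to. This is a sketch on made-up data frames; toy_train, toy_test, toy_fit and submission are hypothetical names, and the single predictor H stands in for the real ones.

```r
## made-up stand-ins for the training and evaluation data (illustrative only)
toy_train <- data.frame(INDEX = 1:6,
                        H = c(1200, 1350, 1400, 1500, 1450, 1300),
                        WINS = c(60, 72, 78, 90, 85, 70))
toy_test <- data.frame(INDEX = c(9, 10, 14, 47),
                       H = c(1209, NA, 1395, 1539))

lambda <- 1.3536
toy_fit <- lm(WINS^lambda ~ H, data = toy_train)

toy_clean <- na.omit(toy_test)                 # drops the row with a missing H
pred <- predict(toy_fit, newdata = toy_clean)  # predictions on the transformed scale

## keep INDEX next to the back-transformed prediction
submission <- data.frame(INDEX = toy_clean$INDEX,
                         TARGET_WINS = as.numeric(pred^(1 / lambda)))
submission
```

With the real objects, the same pattern would pair Testt$INDEX with actual_predictions.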

## And that is all!! done...

Homework # 1 (Write-up)

Al Haque

2023-02-25