Authors: Ajay Arora, Romerl Elizes

Date: 10/31/2021

I. Purpose

In this Initial Experiment, we compare the Recursive function against the Step function for 5 Generalized Linear Model families: Gaussian, Poisson, Gamma, Inverse Gaussian, and Binomial. We conduct only one iteration of each exercise for this Midterm. The purpose of this experiment is to demonstrate that the exercise can be executed readily on this initial data set. The data set used is a well-known baseball data set called moneyball. Each record describes a team and its baseball statistics for a particular year. The data set contains over 2,000 rows. The target variable for this data set is TARGET_WINS.

Summary of Initial Experiments

A table of the results may be found in the Tabular Results section at the bottom of this document.

The Data Set has only 13 independent variables to work with. The overall goal of the Final Project is to work with Data Sets of at least 50 independent variables.
The Recursive function reduced the number of viable variables to 7 in the Gaussian, Gamma, and Inverse Gaussian models.
The Recursive implementation of Poisson arrived at 9 optimal variables with 2 Recursive calls.
The Recursive implementation of Binomial arrived at 10 optimal variables according to the results table, with 2 Recursive calls; the printed model in Section IV retains 11 predictors.
The Recursive function needed only 2 or 3 calls on every model to arrive at its optimal number of variables. Even though the Step function was competitive or better in terms of calls on most models, it never went below 10 variables (Gamma), whereas the Recursive function reached 7.
One Step model, Inverse Gaussian, made no Step iterations because its optimal number of variables was the full 13.
The Binomial Step model made only one Step iteration and kept 12 variables; the Gaussian and Poisson Step models each made 2 iterations and kept 11.
The calculated R² of the Step function is slightly better than that of the Recursive function for all 5 models.
The AIC of the Step function is slightly better than that of the Recursive function for 4 of the 5 models; the exception is Inverse Gaussian, where the Recursive AIC is lower.
The BIC of the Recursive function is slightly better than that of the Step function for all 5 models.

Based on these findings, we will proceed with additional Experiments on 4 other data sets, increasing to 50 variables for the final Data Set. The results of these initial experiments indicate that our Recursive function is competitive with the Step function across the generalized linear models, and that the remaining experiments are worth running to validate our findings.

II. Data Preparation

This section covers the data preparation activities needed for this experiment.

A. Retrieve Data

In this subsection, we retrieve the data from the csv files and define the global variables needed for this exercise.

library(mice)  # provides mice() and complete(), used for imputation in this section
bb_train <- read.table(file="moneyball-training-data.csv", header = TRUE, sep = ",")
target_var = "TARGET_WINS"
cand_variables = c()
totalruns = 1
recursivecalls = 0
linearModel = ""
numBegVariables = 0
model1R2 = 0
model1Fstat = 0
model1skew = 0
model1AIC = 0
model1BIC = 0
model1Variables = 0
model2R2 = 0
model2Fstat = 0
model2skew = 0
model2AIC = 0
model2BIC = 0
model2Variables = 0
steptime = 0
recursivetime = 0

B. Remove Unneeded Variables

We drop the variables that are not needed for this exercise.

remove1 <- c('INDEX', 'TEAM_BATTING_HBP', 'TEAM_PITCHING_HR')
newvars1 <- names( bb_train) %in% remove1
bb_train <- bb_train[!newvars1]

C. Impute missing values.

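We impute missing values with the mice package using predictive mean matching (method = 'pmm'), producing 5 imputed data sets with 5 iterations each. The completed data set is taken from the first imputation, which is the default for complete().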

bb_train_imputed <- mice(bb_train, m=5, maxit = 5, method = 'pmm')

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP

bb_train_imputed <- complete(bb_train_imputed)
summary(bb_train_imputed)

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 546.0   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 735.0   Median :106.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 728.7   Mean   :135.4  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:170.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_BASERUN_CS  TEAM_PITCHING_H TEAM_PITCHING_BB TEAM_PITCHING_SO 
##  Min.   :  0.00   Min.   : 1137   Min.   :   0.0   Min.   :    0.0  
##  1st Qu.: 42.00   1st Qu.: 1419   1st Qu.: 476.0   1st Qu.:  611.0  
##  Median : 56.00   Median : 1518   Median : 536.5   Median :  805.0  
##  Mean   : 74.06   Mean   : 1779   Mean   : 553.0   Mean   :  811.3  
##  3rd Qu.: 85.25   3rd Qu.: 1682   3rd Qu.: 611.0   3rd Qu.:  958.0  
##  Max.   :201.00   Max.   :30132   Max.   :3645.0   Max.   :19278.0  
##  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 127.0   1st Qu.:125.0   
##  Median : 159.0   Median :146.0   
##  Mean   : 246.5   Mean   :141.6   
##  3rd Qu.: 249.2   3rd Qu.:162.0   
##  Max.   :1898.0   Max.   :228.0

dim(bb_train_imputed)

## [1] 2276   14
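
Because mice draws its imputations at random, repeated runs can produce slightly different completed data sets and therefore slightly different model results. A minimal sketch of one way to make the imputation repeatable, assuming the seed argument of mice() is used; the built-in airquality data set and the seed value 698 are stand-ins chosen for illustration, not part of the experiment above.

```r
# Sketch only: airquality (built into R) stands in for the moneyball data.
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Same settings as the report, plus a fixed seed and a silenced iteration log.
  imp_a <- mice(airquality, m = 5, maxit = 5, method = "pmm",
                seed = 698, printFlag = FALSE)
  imp_b <- mice(airquality, m = 5, maxit = 5, method = "pmm",
                seed = 698, printFlag = FALSE)

  done_a <- complete(imp_a)  # first imputed data set by default
  done_b <- complete(imp_b)

  same_result <- identical(done_a, done_b)  # the two runs should match
  no_missing  <- !anyNA(done_a)             # no NA values should remain
}
```

Setting printFlag = FALSE also suppresses the iteration log shown above, which keeps the knitted report shorter.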

III. Defining Recursive Functions

A. Recursive Function 1 for GLM Gaussian

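We define the recursive function for the GLM Gaussian models. On each call it keeps only the predictors whose p-value is below 0.05, refits the model on those predictors, and stops once every remaining predictor is significant.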

recursiveGLMa <- function(targetVariable,lmresult,datainput) 
{
  returnVal <- TRUE
  lmresult.summary <- summary(lmresult)
  coefnames <- names(lmresult.summary$coefficients[,4])
  lencoefnames = length(coefnames)

  canresult1 <- lmresult.summary$coefficients[2:lencoefnames,4]
  lenres1 <- length(canresult1)
  canresult2 <- canresult1[canresult1 < 0.05]
  lenres2 <- length(canresult2)
  recursivecalls <<- recursivecalls + 1
  if (lenres2 == lenres1) {
    linearModel <<- lmresult
    return(FALSE)
  }
  coefnames2 <- names(canresult2)

  ExVar <- toString(paste(coefnames2, "+ ", collapse = ''))
  ExVar <- substr(ExVar, 1, nchar(ExVar)-3)

  model1 <- paste(targetVariable," ~ ",ExVar)
  fit1 <- lm(eval(parse(text = model1)),data = datainput)
  returnVal <- recursiveGLMa(targetVariable,fit1,datainput)

  return(returnVal)
}
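
The function above assembles the reduced formula as a string and evaluates it with eval(parse()). A minimal sketch of the same filtering step using base R's reformulate() instead, which builds the formula object directly; the toy data frame and the names pvals, keep, and reduced are illustrative assumptions, not the authors' code.

```r
# Sketch only: simulated data with two real effects (x1, x3) and one noise column (x2).
set.seed(42)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x3 + rnorm(200)

fit0 <- lm(y ~ ., data = toy)

# Column 4 of the coefficient table holds the p-values; row 1 is the intercept.
pvals <- summary(fit0)$coefficients[-1, 4]
keep  <- names(pvals)[pvals < 0.05]

# Build "y ~ <kept predictors>" without pasting and parsing strings.
reduced <- reformulate(keep, response = "y")
fit1 <- lm(reduced, data = toy)
```

One side effect of passing a formula object is that the Call line in the printed summary shows the object name rather than eval(parse(text = model1)).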

B. Recursive Function 2 for GLM Poisson

We define the recursive function for the GLM Poisson models.

recursiveGLMb <- function(targetVariable,lmresult,datainput) 
{
  returnVal <- TRUE
  lmresult.summary <- summary(lmresult)
  coefnames <- names(lmresult.summary$coefficients[,4])
  lencoefnames = length(coefnames)

  canresult1 <- lmresult.summary$coefficients[2:lencoefnames,4]
  lenres1 <- length(canresult1)
  canresult2 <- canresult1[canresult1 < 0.05]
  lenres2 <- length(canresult2)
  recursivecalls <<- recursivecalls + 1
  if (lenres2 == lenres1) {
    linearModel <<- lmresult
    return(FALSE)
  }
  coefnames2 <- names(canresult2)

  ExVar <- toString(paste(coefnames2, "+ ", collapse = ''))
  ExVar <- substr(ExVar, 1, nchar(ExVar)-3)

  model1 <- paste(targetVariable," ~ ",ExVar)
  fit1 <- glm(eval(parse(text = model1)),data = datainput, family=poisson)
  returnVal <- recursiveGLMb(targetVariable,fit1,datainput)

  return(returnVal)
}

C. Recursive Function 3 for GLM Gamma

We define the recursive function for the GLM Gamma models.

recursiveGLMc <- function(targetVariable,lmresult,datainput) 
{
  returnVal <- TRUE
  lmresult.summary <- summary(lmresult)
  coefnames <- names(lmresult.summary$coefficients[,4])
  lencoefnames = length(coefnames)

  canresult1 <- lmresult.summary$coefficients[2:lencoefnames,4]
  lenres1 <- length(canresult1)
  canresult2 <- canresult1[canresult1 < 0.05]
  lenres2 <- length(canresult2)
  recursivecalls <<- recursivecalls + 1
  if (lenres2 == lenres1) {
    linearModel <<- lmresult
    return(FALSE)
  }
  coefnames2 <- names(canresult2)

  ExVar <- toString(paste(coefnames2, "+ ", collapse = ''))
  ExVar <- substr(ExVar, 1, nchar(ExVar)-3)

  model1 <- paste(targetVariable," ~ ",ExVar)
  fit1 <- glm(eval(parse(text = model1)),data = datainput, family=Gamma)
  returnVal <- recursiveGLMc(targetVariable,fit1,datainput)

  return(returnVal)
}

D. Recursive Function 4 for GLM Inverse Gaussian

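We define the recursive function for the GLM Inverse Gaussian models. This version also stops early, with a message, when no predictors remain or none are significant.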

recursiveGLMd <- function(targetVariable,lmresult,datainput) 
{
  returnVal <- TRUE
  lmresult.summary <- summary(lmresult)
  coefnames <- names(lmresult.summary$coefficients[,4])
  lencoefnames = length(coefnames)

  canresult1 <- lmresult.summary$coefficients[2:lencoefnames,4]
  lenres1 <- length(canresult1)
  canresult2 <- canresult1[canresult1 < 0.05]
  lenres2 <- length(canresult2)
  recursivecalls <<- recursivecalls + 1
  if (lenres2 == lenres1) {
    linearModel <<- lmresult
    return(FALSE)
  }
  if (lenres1 < 1) {
    print("-------------------------------------------")
    print("RECR function NOT OPTIMAL WITH THIS DATASET")
    print("-------------------------------------------")
    linearModel <<- lmresult
    return(FALSE)
  }
  if (lenres2 < 1) {
    print("-------------------------------------------")
    print("RECR function NOT OPTIMAL WITH THIS DATASET")
    print("-------------------------------------------")
    linearModel <<- lmresult
    return(FALSE)
  }
  coefnames2 <- names(canresult2)

  ExVar <- toString(paste(coefnames2, "+ ", collapse = ''))
  ExVar <- substr(ExVar, 1, nchar(ExVar)-3)

  model1 <- paste(targetVariable," ~ ",ExVar)
  fit1 <- glm(eval(parse(text = model1)),data = datainput, family=inverse.gaussian)
  returnVal <- recursiveGLMd(targetVariable,fit1,datainput)

  return(returnVal)
}

E. Recursive Function 5 for GLM Binomial

We define the recursive function for the GLM Binomial models. This version also stops early, with a message, when none of the predictors are significant.

recursiveGLMe <- function(targetVariable,lmresult,datainput) 
{
  returnVal <- TRUE
  lmresult.summary <- summary(lmresult)
  coefnames <- names(lmresult.summary$coefficients[,4])
  lencoefnames = length(coefnames)

  canresult1 <- lmresult.summary$coefficients[2:lencoefnames,4]
  lenres1 <- length(canresult1)
  canresult2 <- canresult1[canresult1 < 0.05]
  lenres2 <- length(canresult2)
  recursivecalls <<- recursivecalls + 1
  if (lenres2 == lenres1) {
    linearModel <<- lmresult
    return(FALSE)
  }
  if (lenres2 < 1) {
    print("-------------------------------------------")
    print("RECR function NOT OPTIMAL WITH THIS DATASET")
    print("-------------------------------------------")
    linearModel <<- lmresult
    return(FALSE)
  }
  coefnames2 <- names(canresult2)

  ExVar <- toString(paste(coefnames2, "+ ", collapse = ''))
  ExVar <- substr(ExVar, 1, nchar(ExVar)-3)
  #ExVar
  
  model1 <- paste(targetVariable," ~ ",ExVar)
  fit1 <- glm(eval(parse(text = model1)),data = datainput, family = binomial)
  returnVal <- recursiveGLMe(targetVariable,fit1,datainput)

  return(returnVal)
}
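
The five functions above differ only in the model call and in which guards they include. A minimal sketch of how they might be folded into one function that takes the family as a parameter and returns the call count instead of updating global variables; prune_glm, its arguments, and the toy data are hypothetical names introduced for illustration, and this is not the authors' implementation.

```r
# Hypothetical helper: one pruning routine for any glm family, with no global state.
prune_glm <- function(target, fit, data, family = gaussian(),
                      alpha = 0.05, calls = 1) {
  coefs <- summary(fit)$coefficients
  # Named p-values with the intercept removed (safe even if one row is left).
  pvals <- setNames(coefs[, 4], rownames(coefs))[-1]
  keep  <- names(pvals)[pvals < alpha]

  # Stop when nothing would be dropped, or when nothing would be left to fit.
  if (length(keep) == length(pvals) || length(keep) == 0) {
    return(list(model = fit, calls = calls))
  }

  refit <- glm(reformulate(keep, response = target),
               family = family, data = data)
  prune_glm(target, refit, data, family, alpha, calls + 1)
}

# Usage on simulated Poisson data: only x1 carries signal.
set.seed(7)
toy <- data.frame(x1 = runif(300), x2 = runif(300), x3 = runif(300))
toy$y <- rpois(300, lambda = exp(0.5 + 1.2 * toy$x1))

full   <- glm(y ~ ., family = poisson(), data = toy)
result <- prune_glm("y", full, toy, family = poisson())
kept   <- names(coef(result$model))[-1]
```

The calls value counts every entry into the function, including the final one that finds nothing to drop, which mirrors how recursivecalls is incremented above. With family = gaussian(), glm() gives the same coefficient estimates as the lm() call used in the Gaussian version.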

IV. Run Experiments
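
The code that produced the output in this section is not reproduced in the report. As a rough sketch of how a Step run might be invoked and measured, assuming base R's step() with its default backward search when no scope is supplied: the toy data and the names step_fit, step_time, step_calls, and step_vars are illustrative, and counting Step calls from the anova component is an assumption about how the table values could be obtained.

```r
# Sketch only: simulated data where x1 and x2 matter and x3 is noise.
set.seed(11)
toy <- data.frame(x1 = rnorm(250), x2 = rnorm(250), x3 = rnorm(250))
toy$y <- 10 + 2 * toy$x1 - toy$x2 + rnorm(250)

full <- lm(y ~ ., data = toy)

# trace = 0 silences the per-step log; system.time() records the run time.
step_time <- system.time(step_fit <- step(full, trace = 0))

# The anova component has one row for the starting model plus one row per step.
step_calls <- nrow(step_fit$anova) - 1
step_vars  <- length(coef(step_fit)) - 1  # predictors kept, intercept excluded
```

Under this counting, a model that step() leaves unchanged reports 0 calls and keeps every predictor.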

A. Run glm Gaussian Experiment

## [1] "Step Model"

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = bb_train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.971  -8.518   0.188   8.272  47.657 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.8023987  5.0821245   6.848 9.61e-12 ***
## TEAM_BATTING_H    0.0429297  0.0035710  12.022  < 2e-16 ***
## TEAM_BATTING_2B  -0.0189732  0.0088853  -2.135 0.032841 *  
## TEAM_BATTING_3B   0.0254420  0.0162434   1.566 0.117418    
## TEAM_BATTING_HR   0.0820832  0.0093910   8.741  < 2e-16 ***
## TEAM_BATTING_BB   0.0068444  0.0030684   2.231 0.025804 *  
## TEAM_BATTING_SO  -0.0158206  0.0024253  -6.523 8.46e-11 ***
## TEAM_BASERUN_SB   0.0544231  0.0043586  12.486  < 2e-16 ***
## TEAM_PITCHING_H   0.0011557  0.0003371   3.428 0.000619 ***
## TEAM_PITCHING_SO  0.0012394  0.0006649   1.864 0.062442 .  
## TEAM_FIELDING_E  -0.0414281  0.0026877 -15.414  < 2e-16 ***
## TEAM_FIELDING_DP -0.1004216  0.0126872  -7.915 3.83e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.67 on 2264 degrees of freedom
## Multiple R-squared:  0.3566, Adjusted R-squared:  0.3535 
## F-statistic: 114.1 on 11 and 2264 DF,  p-value: < 2.2e-16

## [1] "Recursive Model"
## 
## Call:
## lm(formula = eval(parse(text = model1)), data = datainput)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.327  -8.561   0.387   8.416  49.196 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      42.1426399  4.5049030   9.355  < 2e-16 ***
## TEAM_BATTING_H    0.0390702  0.0024869  15.710  < 2e-16 ***
## TEAM_BATTING_HR   0.0814626  0.0086417   9.427  < 2e-16 ***
## TEAM_BATTING_SO  -0.0170569  0.0021410  -7.967 2.55e-15 ***
## TEAM_BASERUN_SB   0.0600712  0.0040554  14.813  < 2e-16 ***
## TEAM_PITCHING_H   0.0013156  0.0002949   4.462 8.53e-06 ***
## TEAM_FIELDING_E  -0.0437100  0.0024305 -17.984  < 2e-16 ***
## TEAM_FIELDING_DP -0.0999059  0.0126179  -7.918 3.75e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.7 on 2268 degrees of freedom
## Multiple R-squared:  0.3522, Adjusted R-squared:  0.3502 
## F-statistic: 176.1 on 7 and 2268 DF,  p-value: < 2.2e-16

B. Run glm Poisson Experiment

## [1] "Step Model"

## 
## Call:
## glm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, family = poisson, data = bb_train_imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8744  -0.9603   0.0172   0.9167   5.0635  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       3.783e+00  4.594e-02  82.356   <2e-16 ***
## TEAM_BATTING_H    5.766e-04  3.252e-05  17.729   <2e-16 ***
## TEAM_BATTING_2B  -2.782e-04  7.857e-05  -3.540   0.0004 ***
## TEAM_BATTING_3B   3.326e-04  1.432e-04   2.323   0.0202 *  
## TEAM_BATTING_HR   9.565e-04  8.244e-05  11.602   <2e-16 ***
## TEAM_BATTING_BB   6.842e-05  2.689e-05   2.544   0.0110 *  
## TEAM_BATTING_SO  -1.933e-04  2.174e-05  -8.892   <2e-16 ***
## TEAM_BASERUN_SB   6.746e-04  3.804e-05  17.736   <2e-16 ***
## TEAM_PITCHING_H   6.204e-06  3.611e-06   1.718   0.0858 .  
## TEAM_PITCHING_SO  2.609e-05  6.709e-06   3.890   0.0001 ***
## TEAM_FIELDING_E  -5.549e-04  2.554e-05 -21.728   <2e-16 ***
## TEAM_FIELDING_DP -1.218e-03  1.120e-04 -10.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7442.7  on 2275  degrees of freedom
## Residual deviance: 4874.1  on 2264  degrees of freedom
## AIC: 19027
## 
## Number of Fisher Scoring iterations: 4

## [1] "Recursive Model"
## 
## Call:
## glm(formula = eval(parse(text = model1)), family = poisson, data = datainput)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.7319  -0.9660   0.0175   0.9417   4.9917  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       3.804e+00  4.172e-02  91.167  < 2e-16 ***
## TEAM_BATTING_H    5.841e-04  3.125e-05  18.690  < 2e-16 ***
## TEAM_BATTING_2B  -2.702e-04  7.837e-05  -3.448 0.000565 ***
## TEAM_BATTING_3B   3.495e-04  1.414e-04   2.472 0.013452 *  
## TEAM_BATTING_HR   1.012e-03  7.904e-05  12.800  < 2e-16 ***
## TEAM_BATTING_SO  -2.051e-04  2.139e-05  -9.587  < 2e-16 ***
## TEAM_BASERUN_SB   6.771e-04  3.615e-05  18.731  < 2e-16 ***
## TEAM_PITCHING_SO  3.174e-05  5.654e-06   5.613 1.99e-08 ***
## TEAM_FIELDING_E  -5.528e-04  1.834e-05 -30.148  < 2e-16 ***
## TEAM_FIELDING_DP -1.158e-03  1.092e-04 -10.600  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7442.7  on 2275  degrees of freedom
## Residual deviance: 4883.1  on 2266  degrees of freedom
## AIC: 19032
## 
## Number of Fisher Scoring iterations: 4

C. Run glm Gamma Experiment

## [1] "Step Model"

## 
## Call:
## glm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     family = Gamma, data = bb_train_imputedG)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.26225  -0.11048   0.00294   0.10193   0.54108  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.927e-02  7.658e-04  25.168  < 2e-16 ***
## TEAM_BATTING_H   -6.903e-06  5.603e-07 -12.319  < 2e-16 ***
## TEAM_BATTING_2B   3.333e-06  1.436e-06   2.321   0.0204 *  
## TEAM_BATTING_3B  -4.162e-06  2.576e-06  -1.615   0.1064    
## TEAM_BATTING_HR  -1.115e-05  1.470e-06  -7.587 4.75e-14 ***
## TEAM_BATTING_SO   2.217e-06  4.126e-07   5.374 8.49e-08 ***
## TEAM_BASERUN_SB  -7.883e-06  6.494e-07 -12.138  < 2e-16 ***
## TEAM_PITCHING_BB -5.325e-07  3.313e-07  -1.607   0.1081    
## TEAM_PITCHING_SO -2.820e-07  1.343e-07  -2.100   0.0359 *  
## TEAM_FIELDING_E   7.104e-06  3.584e-07  19.822  < 2e-16 ***
## TEAM_FIELDING_DP  1.432e-05  2.017e-06   7.096 1.71e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.02803698)
## 
##     Null deviance: 103.169  on 2275  degrees of freedom
## Residual deviance:  71.785  on 2265  degrees of freedom
## AIC: 18573
## 
## Number of Fisher Scoring iterations: 5

## [1] 0.3041924

## [1] "Recursive Model"
## 
## Call:
## glm(formula = eval(parse(text = model1)), family = Gamma, data = datainput)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.27407  -0.11196   0.00168   0.10282   0.82901  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.909e-02  7.605e-04  25.094  < 2e-16 ***
## TEAM_BATTING_H   -7.016e-06  5.311e-07 -13.211  < 2e-16 ***
## TEAM_BATTING_2B   2.881e-06  1.427e-06   2.019   0.0436 *  
## TEAM_BATTING_HR  -1.070e-05  1.392e-06  -7.688 2.22e-14 ***
## TEAM_BATTING_SO   2.068e-06  3.677e-07   5.624 2.10e-08 ***
## TEAM_BASERUN_SB  -8.466e-06  6.154e-07 -13.758  < 2e-16 ***
## TEAM_FIELDING_E   6.931e-06  3.422e-07  20.256  < 2e-16 ***
## TEAM_FIELDING_DP  1.356e-05  1.985e-06   6.832 1.08e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.0282547)
## 
##     Null deviance: 103.17  on 2275  degrees of freedom
## Residual deviance:  72.22  on 2268  degrees of freedom
## AIC: 18580
## 
## Number of Fisher Scoring iterations: 5

D. Run glm Inverse Gaussian Experiment

## [1] "Step Model"

## 
## Call:
## glm(formula = TARGET_WINS ~ ., family = inverse.gaussian, data = bb_train_imputedI)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.97418  -0.01251   0.00066   0.01155   0.06592  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.024e-04  2.091e-05  14.465  < 2e-16 ***
## TEAM_BATTING_H   -1.581e-07  1.470e-08 -10.753  < 2e-16 ***
## TEAM_BATTING_2B   6.425e-08  3.658e-08   1.756   0.0792 .  
## TEAM_BATTING_3B  -5.368e-08  6.331e-08  -0.848   0.3966    
## TEAM_BATTING_HR  -2.700e-07  3.843e-08  -7.027 2.78e-12 ***
## TEAM_BATTING_BB   3.579e-08  2.399e-08   1.492   0.1358    
## TEAM_BATTING_SO   5.328e-08  1.085e-08   4.912 9.67e-07 ***
## TEAM_BASERUN_SB  -1.568e-07  1.798e-08  -8.717  < 2e-16 ***
## TEAM_BASERUN_CS  -1.711e-08  4.259e-08  -0.402   0.6880    
## TEAM_PITCHING_H   6.518e-09  2.691e-09   2.423   0.0155 *  
## TEAM_PITCHING_BB -3.735e-08  1.712e-08  -2.182   0.0292 *  
## TEAM_PITCHING_SO -6.924e-09  4.835e-09  -1.432   0.1523    
## TEAM_FIELDING_E   1.543e-07  1.219e-08  12.654  < 2e-16 ***
## TEAM_FIELDING_DP  2.968e-07  5.333e-08   5.565 2.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 0.0003897628)
## 
##     Null deviance: 2.3540  on 2275  degrees of freedom
## Residual deviance: 1.9764  on 2262  degrees of freedom
## AIC: 20363
## 
## Number of Fisher Scoring iterations: 8

## [1] "Recursive Model"
## 
## Call:
## glm(formula = eval(parse(text = model1)), family = inverse.gaussian, 
##     data = datainput)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.97705  -0.01238   0.00059   0.01148   0.06833  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.837e-04  1.770e-05  16.030  < 2e-16 ***
## TEAM_BATTING_H   -1.339e-07  8.988e-09 -14.902  < 2e-16 ***
## TEAM_BATTING_HR  -2.459e-07  3.550e-08  -6.927 5.57e-12 ***
## TEAM_BATTING_SO   5.316e-08  8.958e-09   5.935 3.39e-09 ***
## TEAM_BASERUN_SB  -1.765e-07  1.401e-08 -12.602  < 2e-16 ***
## TEAM_PITCHING_BB -1.750e-08  7.469e-09  -2.342   0.0192 *  
## TEAM_FIELDING_E   1.629e-07  8.333e-09  19.554  < 2e-16 ***
## TEAM_FIELDING_DP  3.358e-07  5.071e-08   6.622 4.42e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 0.0003861428)
## 
##     Null deviance: 2.3540  on 2275  degrees of freedom
## Residual deviance: 1.9816  on 2268  degrees of freedom
## AIC: 20357
## 
## Number of Fisher Scoring iterations: 5

E. Run glm Binomial Experiment

## [1] "Step Model"

## 
## Call:
## glm(formula = BI_TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, family = binomial, data = bb_train_imputedB)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0646  -0.9933   0.3852   0.9480   3.0153  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.222e+00  1.022e+00  -5.107 3.28e-07 ***
## TEAM_BATTING_H    4.295e-03  7.484e-04   5.739 9.55e-09 ***
## TEAM_BATTING_2B  -2.502e-03  1.637e-03  -1.528 0.126490    
## TEAM_BATTING_3B   1.080e-02  3.127e-03   3.455 0.000551 ***
## TEAM_BATTING_HR   1.462e-02  1.778e-03   8.222  < 2e-16 ***
## TEAM_BATTING_BB   4.905e-03  1.148e-03   4.274 1.92e-05 ***
## TEAM_BATTING_SO  -3.221e-03  4.849e-04  -6.642 3.09e-11 ***
## TEAM_BASERUN_SB   8.325e-03  9.360e-04   8.894  < 2e-16 ***
## TEAM_PITCHING_H   4.361e-04  8.472e-05   5.148 2.63e-07 ***
## TEAM_PITCHING_BB -2.608e-03  8.872e-04  -2.939 0.003288 ** 
## TEAM_PITCHING_SO  4.860e-04  1.950e-04   2.492 0.012692 *  
## TEAM_FIELDING_E  -5.527e-03  6.206e-04  -8.905  < 2e-16 ***
## TEAM_FIELDING_DP -1.413e-02  2.318e-03  -6.094 1.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3147.8  on 2275  degrees of freedom
## Residual deviance: 2601.0  on 2263  degrees of freedom
## AIC: 2627
## 
## Number of Fisher Scoring iterations: 5

## [1] "Recursive Model"
## 
## Call:
## glm(formula = eval(parse(text = model1)), family = binomial, 
##     data = datainput)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0662  -0.9898   0.3880   0.9533   2.9819  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.620e+00  9.336e-01  -4.949 7.47e-07 ***
## TEAM_BATTING_H    3.537e-03  5.487e-04   6.447 1.14e-10 ***
## TEAM_BATTING_3B   1.120e-02  3.113e-03   3.598 0.000321 ***
## TEAM_BATTING_HR   1.478e-02  1.772e-03   8.343  < 2e-16 ***
## TEAM_BATTING_BB   4.846e-03  1.142e-03   4.245 2.18e-05 ***
## TEAM_BATTING_SO  -3.374e-03  4.723e-04  -7.144 9.05e-13 ***
## TEAM_BASERUN_SB   8.331e-03  9.340e-04   8.920  < 2e-16 ***
## TEAM_PITCHING_H   4.374e-04  8.379e-05   5.220 1.79e-07 ***
## TEAM_PITCHING_BB -2.603e-03  8.818e-04  -2.952 0.003160 ** 
## TEAM_PITCHING_SO  4.551e-04  1.918e-04   2.373 0.017651 *  
## TEAM_FIELDING_E  -5.381e-03  6.104e-04  -8.816  < 2e-16 ***
## TEAM_FIELDING_DP -1.417e-02  2.317e-03  -6.116 9.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3147.8  on 2275  degrees of freedom
## Residual deviance: 2603.3  on 2264  degrees of freedom
## AIC: 2627.3
## 
## Number of Fisher Scoring iterations: 5

V. Tabular Results

Model	Variables	STEPR2	STEPFit	STEPSkew	STEPAIC	STEPBIC	STEPCalls	STEPVariables	RECRR2	RECRFit	RECRSkew	RECRAIC	RECRBIC	RECRCalls	RECRVariables
Gaussian	13	0.3566129	114.0797198	-0.0311825	18030.032	18104.524	2	11	0.3521932	176.1491024	-0.0206568	18037.613	18089.185	3	7
Poisson	13	0.3451182	0.0000000	-0.0923911	19027.121	19095.883	2	11	0.3439106	0.0000000	-0.1112038	19032.109	19089.410	2	9
Gamma	13	0.3041924	1.0000000	1.1704365	18572.586	18641.349	3	10	0.2999763	1.0000000	0.8365347	18580.408	18631.980	2	7
Inverse Gaussian	13	0.1604279	1.0000000	4.0345558	20362.598	20448.550	0	13	0.1582047	1.0000000	2.6080747	20356.617	20408.188	3	7
Binomial	13	0.1737008	0.0000008	-2.9224244	2627.005	2701.498	1	12	0.1729562	0.0000007	-4.9740247	2627.349	2696.111	2	10
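
The R2 values reported for the four glm() models agree with 1 minus the ratio of residual deviance to null deviance in the printed summaries; for the Gamma Step model, 1 - 71.785/103.169 is approximately 0.3042. McFadden's R² as usually defined uses log-likelihoods rather than deviances, so the two need not coincide. A minimal sketch of how the deviance-based value, AIC, and BIC can be read from a fitted model, using simulated data; the names deviance_r2, k, and n are illustrative.

```r
# Sketch only: a small simulated Poisson model, not the moneyball data.
set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- rpois(200, lambda = exp(1 + 0.5 * toy$x))
fit <- glm(y ~ x, family = poisson(), data = toy)

# Deviance-based pseudo R-squared.
deviance_r2 <- 1 - fit$deviance / fit$null.deviance

# AIC charges 2 per estimated parameter, BIC charges log(n) per parameter.
k <- attr(logLik(fit), "df")  # number of estimated parameters
n <- nobs(fit)                # number of observations
bic_minus_aic <- BIC(fit) - AIC(fit)  # equals k * (log(n) - 2)
```

With 2276 observations, log(n) is about 7.73, so BIC penalizes each extra parameter far more heavily than AIC does. That is consistent with the table: the larger Step models tend to win on AIC, while the smaller Recursive models win on BIC.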

DATA698 - Mid Term - Initial glm Experiments