Blog 1

Model Comparisons

This post walks through two R packages, stargazer and memisc, that summarize data and compare regression models side by side, using a single example dataset.

Data

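library(stargazer)  # summary and regression tables
library(psych)      # describe()
library(memisc)     # mtable()

blog1 <- read.csv("https://raw.githubusercontent.com/irene908/DATA621/main/Blog1.csv")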

Create a copy of the data in which PrizeMoney is replaced by its natural log:

Logblog1 <- blog1
Logblog1$logPrizeMoney <- log(blog1$PrizeMoney)
Logblog1$PrizeMoney <- NULL

Stargazer package

The stargazer package formats summary statistics and regression results as publication-style tables, with several models side by side.

https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf

Its summary table does not report skew or kurtosis.

With type = "html" the function emits raw HTML, so the table is rendered only after the R Markdown document is knitted.

stargazer(Logblog1, type = "html", nobs = TRUE, mean.sd = TRUE, median = TRUE, iqr = TRUE)
Statistic        N    Mean    St. Dev.  Min     Pctl(25)  Median  Pctl(75)  Max
DrivingAccuracy  196  63.380  5.413     49.750  59.758    63.240  66.965    78.430
Scrambling       196  57.494  3.162     49.020  55.260    57.650  59.457    66.450
PuttsPerRound    196  29.201  0.442     27.960  28.910    29.190  29.477    30.190
logPrizeMoney    196  10.378  0.980     7.714   9.762     10.509  10.967    13.404
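For a quick look without knitting, stargazer also accepts type = "text", which prints an ASCII version of the same table to the console. The sketch below uses the built-in mtcars data as a stand-in (an assumption, not the blog's dataset) and captures the printed lines in a hypothetical variable named preview. When knitting with type = "html", the chunk usually also needs the option results = 'asis' so the HTML is passed through untouched.

```r
library(stargazer)

# mtcars is a built-in stand-in here, not the blog's data
preview <- capture.output(
  stargazer(mtcars[, c("mpg", "hp", "wt")], type = "text",
            median = TRUE, iqr = TRUE)
)

# Show the ASCII table in the console
cat(preview, sep = "\n")
```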

describe() from the psych package reports skew and kurtosis as well.

s <- describe(blog1[,c(1:4)])[,c(2:4,8,9,11,12)]
s
##                   n     mean       sd     min       max skew kurtosis
## PrizeMoney      196 50891.17 63902.95 2240.00 662771.00 5.29    42.57
## DrivingAccuracy 196    63.38     5.41   49.75     78.43 0.09     0.03
## Scrambling      196    57.49     3.16   49.02     66.45 0.00     0.09
## PuttsPerRound   196    29.20     0.44   27.96     30.19 0.13    -0.10
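A skew of 5.29 for PrizeMoney points to a long right tail, which is presumably why the models below also use its log. The sketch here illustrates the effect on simulated data: prize is a hypothetical log-normal stand-in, not the blog's data, and skewness() is a small helper written for this example using the plain moment-based formula, so its values may differ slightly from what describe() reports.

```r
# Moment-based sample skewness (g1); describe() may apply a slightly
# different small-sample convention, so values need not match exactly.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
# Hypothetical right-skewed stand-in for PrizeMoney (not the blog's data)
prize <- rlnorm(196, meanlog = 10.4, sdlog = 1)

skew_raw <- skewness(prize)       # strongly positive
skew_log <- skewness(log(prize))  # much closer to zero
c(raw = skew_raw, log = skew_log)
```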

Models

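Below are three models. Two of them use the log-transformed dataset. Because Logblog1 holds only the three predictors besides logPrizeMoney, the formula logPrizeMoney ~ . in lm2 expands to the same terms that lm3 spells out, so models 2 and 3 are the same fit.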

lm1 <- lm(PrizeMoney ~., blog1)
lm2 <- lm(logPrizeMoney ~., Logblog1)
lm3 <- lm(logPrizeMoney ~ DrivingAccuracy + Scrambling + PuttsPerRound, Logblog1)
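A quick way to check whether a `.` formula and an explicit formula describe the same model is to compare their coefficients. The sketch below does this on simulated data; sim is a hypothetical stand-in that only borrows the column names of Logblog1, and the numbers in it are made up.

```r
set.seed(42)
n <- 196

# Hypothetical stand-in data with the same column names as Logblog1
sim <- data.frame(
  DrivingAccuracy = rnorm(n, 63, 5),
  Scrambling      = rnorm(n, 57, 3),
  PuttsPerRound   = rnorm(n, 29, 0.4)
)
sim$logPrizeMoney <- 7 + 0.1 * sim$Scrambling + rnorm(n)

# "." expands to every column other than the response
fit_dot      <- lm(logPrizeMoney ~ ., data = sim)
fit_explicit <- lm(logPrizeMoney ~ DrivingAccuracy + Scrambling + PuttsPerRound,
                   data = sim)

# TRUE when both formulas yield the same coefficients
same_fit <- isTRUE(all.equal(coef(fit_dot), coef(fit_explicit)))
same_fit
```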

Stargazer result

stargazer(lm1, lm2, lm3, type="html")
                                Dependent variable:
                                PrizeMoney     logPrizeMoney
                                (1)            (2)         (3)
DrivingAccuracy                 -1,353.794     0.010       0.010
                                (918.371)      (0.014)     (0.014)
Scrambling                      6,992.504***   0.100***    0.100***
                                (1,725.205)    (0.026)     (0.026)
PuttsPerRound                   5,530.681      -0.117      -0.117
                                (11,361.840)   (0.170)     (0.170)
Constant                        -426,837.100   7.381       7.381
                                (372,993.200)  (5.571)     (5.571)
Observations                    196            196         196
R2                              0.091          0.139       0.139
Adjusted R2                     0.077          0.125       0.125
Residual Std. Error (df = 192)  61,386.980     0.917       0.917
F Statistic (df = 3; 192)       6.437***       10.290***   10.290***
Note: *p<0.1; **p<0.05; ***p<0.01
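The coefficients in columns (2) and (3) are on the log scale, so they are easier to read once converted to percent changes in PrizeMoney. As a rough worked example, assuming the usual log-linear interpretation and using the Scrambling estimate of 0.100 from the table (beta_scrambling and pct_change are names made up for this sketch):

```r
# Scrambling coefficient from models (2) and (3) in the stargazer table
beta_scrambling <- 0.100

# Approximate percent change in PrizeMoney for a one-point increase in
# Scrambling, holding the other predictors fixed
pct_change <- 100 * (exp(beta_scrambling) - 1)
round(pct_change, 1)  # 10.5
```

Under that reading, a one-point increase in Scrambling goes with roughly a 10.5% higher PrizeMoney in the log models.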

memisc package

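memisc is another package that displays summary statistics and compares multiple models side by side. Its significance stars use different cutoffs (0.001, 0.01, 0.05) from stargazer's (0.01, 0.05, 0.1), so the same number of stars does not mean the same thing in the two tables.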

https://cran.r-project.org/web/packages/memisc/memisc.pdf

The summary.stats argument selects which summary statistics appear at the bottom of the table.

lm_memisc <- mtable("Model 1"=lm1,"Model 2"=lm2,"Model 3"=lm3, summary.stats = c('R-squared','F','p','N'))
lm_memisc
## 
## Calls:
## Model 1: lm(formula = PrizeMoney ~ ., data = blog1)
## Model 2: lm(formula = logPrizeMoney ~ ., data = Logblog1)
## Model 3: lm(formula = logPrizeMoney ~ DrivingAccuracy + Scrambling + PuttsPerRound, 
##     data = Logblog1)
## 
## =================================================================
##                       Model 1         Model 2        Model 3     
##                    --------------  -------------  -------------  
##                      PrizeMoney    logPrizeMoney  logPrizeMoney  
## -----------------------------------------------------------------
##   (Intercept)      -426837.131        7.381          7.381       
##                    (372993.195)      (5.571)        (5.571)      
##   DrivingAccuracy    -1353.794        0.010          0.010       
##                       (918.371)      (0.014)        (0.014)      
##   Scrambling          6992.504***     0.100***       0.100***    
##                      (1725.205)      (0.026)        (0.026)      
##   PuttsPerRound       5530.681       -0.117         -0.117       
##                     (11361.840)      (0.170)        (0.170)      
## -----------------------------------------------------------------
##   R-squared              0.091        0.139          0.139       
##   F                      6.437       10.290         10.290       
##   p                      0.000        0.000          0.000       
##   N                    196          196            196           
## =================================================================
##   Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
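The four statistics requested through summary.stats can also be pulled out of a fitted model by hand, which helps when checking what the table reports. This is a minimal base R sketch, not how memisc computes them internally; fit is a stand-in model on the built-in mtcars data, and stats is a name chosen for the example.

```r
# Stand-in model on built-in data, not one of the blog's models
fit <- lm(mpg ~ wt + hp, data = mtcars)
sm  <- summary(fit)

# fstatistic is a named vector: value, numdf, dendf
f <- sm$fstatistic

stats <- c(
  "R-squared" = sm$r.squared,
  "F"         = unname(f["value"]),
  "p"         = unname(pf(f["value"], f["numdf"], f["dendf"],
                          lower.tail = FALSE)),
  "N"         = nobs(fit)
)
round(stats, 3)
```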

A subset of the table can be displayed by indexing it with coefficient names (rows) and model names (columns).

lm_memisc[c("DrivingAccuracy","Scrambling"), c("Model 2","Model 3")]
## 
## Calls:
## Model 2: lm(formula = logPrizeMoney ~ ., data = Logblog1)
## Model 3: lm(formula = logPrizeMoney ~ DrivingAccuracy + Scrambling + PuttsPerRound, 
##     data = Logblog1)
## 
## ===========================================
##                     Model 2     Model 3    
## -------------------------------------------
##   DrivingAccuracy    0.010       0.010     
##                     (0.014)     (0.014)    
##   Scrambling         0.100***    0.100***  
##                     (0.026)     (0.026)    
## -------------------------------------------
##   R-squared          0.139       0.139     
##   F                 10.290      10.290     
##   p                  0.000       0.000     
##   N                196         196         
## ===========================================
##   Significance: *** = p < 0.001;   
##                 ** = p < 0.01;   
##                 * = p < 0.05