A Modern Approach to Regression with R

by Simon Sheather

notes by Bonnie Cooper

The following are notes from readings in ‘A Modern Approach to Regression with R’ by Simon Sheather for the course DATA621, ‘Business Analytics and Data Mining’, as part of the Master of Science in Data Science program at CUNY SPS.

R libraries used:

library( broom )
library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( gclus )
library( car )
library( VGAM )

Introduction

building valid regression models for real-world data.

Building Valid Models

It makes sense to base inferences or conclusions only on valid models
Any conclusion is only as sound as the model on which it is based.

Motivating Examples

1) NFL Field Goals

nflfg_csv <- read.csv( 'FieldGoals2003to2006.csv' )
nflfg_df <- data.frame( nflfg_csv )
glimpse( nflfg_df )
## Rows: 76
## Columns: 10
## $ Name      <chr> "Adam Vinatieri", "Adam Vinatieri", "Adam Vinatieri", "Adam …
## $ Yeart     <int> 2003, 2004, 2005, 2006, 2003, 2004, 2005, 2006, 2003, 2004, …
## $ Teamt     <chr> "NE", "NE", "NE", "IND", "PHI", "PHI", "PHI", "PHI", "DEN", …
## $ FGAt      <int> 34, 33, 25, 19, 29, 32, 22, 12, 31, 34, 32, 15, 23, 28, 24, …
## $ FGt       <dbl> 73.5, 93.9, 80.0, 89.4, 82.7, 84.3, 72.7, 83.3, 87.0, 85.2, …
## $ Team.t.1. <chr> "NE", "NE", "NE", "NE", "PHI", "PHI", "PHI", "PHI", "DEN", "…
## $ FGAtM1    <int> 30, 34, 33, 25, 34, 29, 32, 22, 36, 31, 34, 32, 28, 23, 28, …
## $ FGtM1     <dbl> 90.0, 73.5, 93.9, 80.0, 88.2, 82.7, 84.3, 72.7, 72.2, 87.0, …
## $ FGAtM2    <int> NA, 30, 34, 33, NA, 34, 29, 32, NA, 36, 31, 34, NA, 28, 23, …
## $ FGtM2     <dbl> NA, 90.0, 73.5, 93.9, NA, 88.2, 82.7, 84.3, NA, 72.2, 87.0, …
head( nflfg_df )
##             Name Yeart Teamt FGAt  FGt Team.t.1. FGAtM1 FGtM1 FGAtM2 FGtM2
## 1 Adam Vinatieri  2003    NE   34 73.5        NE     30  90.0     NA    NA
## 2 Adam Vinatieri  2004    NE   33 93.9        NE     34  73.5     30  90.0
## 3 Adam Vinatieri  2005    NE   25 80.0        NE     33  93.9     34  73.5
## 4 Adam Vinatieri  2006   IND   19 89.4        NE     25  80.0     33  93.9
## 5    David Akers  2003   PHI   29 82.7       PHI     34  88.2     NA    NA
## 6    David Akers  2004   PHI   32 84.3       PHI     29  82.7     34  88.2
unique( nflfg_df$Yeart )
## [1] 2003 2004 2005 2006
unique( nflfg_df$Name )
##  [1] "Adam Vinatieri"       "David Akers"          "Jason Elam"          
##  [4] "Jason Hanson"         "Jay Feely"            "Jeff Reed"           
##  [7] "Jeff Wilkins"         "John Carney"          "John Hall"           
## [10] "Kris Brown"           "Matt Stover"          "Mike Vanderjagt"     
## [13] "Neil Rackers"         "Olindo Mare"          "Phil Dawson"         
## [16] "Rian Lindell"         "Ryan Longwell"        "Sebastian Janikowski"
## [19] "Shayne Graham"

the incorrect approach just looks at the correlation between field goal percentages one year and the previous:

nflfg_yeardif <- nflfg_df %>%
  dplyr::select( c( Name, Yeart, FGt ) ) %>%
  pivot_wider( names_from = Yeart, values_from = FGt )
nflfg_yeardif
## # A tibble: 19 x 5
##    Name                 `2003` `2004` `2005` `2006`
##    <chr>                 <dbl>  <dbl>  <dbl>  <dbl>
##  1 Adam Vinatieri         73.5   93.9   80     89.4
##  2 David Akers            82.7   84.3   72.7   83.3
##  3 Jason Elam             87     85.2   75     86.6
##  4 Jason Hanson           95.6   85.7   79.1   82.3
##  5 Jay Feely              70.3   78.2   83.3   76.4
##  6 Jeff Reed              71.8   84.8   82.7   68.7
##  7 Jeff Wilkins           92.8   79.1   87     88.4
##  8 John Carney            73.3   81.4   78.1   88.2
##  9 John Hall              75.7   72.7   85.7   81.8
## 10 Kris Brown             81.8   70.8   76.4   73.3
## 11 Matt Stover            86.8   90.6   88.2  100  
## 12 Mike Vanderjagt       100     80     92     80  
## 13 Neil Rackers           75     75.8   95.2   68.4
## 14 Olindo Mare            75.8   75     83.3   63.6
## 15 Phil Dawson            85.7   82.7   93.1   88.2
## 16 Rian Lindell           70.8   85.7   82.8   87.5
## 17 Ryan Longwell          88.4   85.7   74     83.3
## 18 Sebastian Janikowski   88     89.2   66.6   84.6
## 19 Shayne Graham          88     87     87.5   84.2
firstY <- nflfg_yeardif %>%
  dplyr::select( c( `2003`, `2004`, `2005` ) ) %>%
  pivot_longer( cols = everything(), names_to = 'Year1', values_to = 'val1' )

NextY <- nflfg_yeardif %>%
  dplyr::select( c( `2004`, `2005`, `2006` ) ) %>%
  pivot_longer( cols = everything(), names_to = 'Year2', values_to = 'val2' )


BothY <- cbind( firstY, NextY )

ggplot( BothY, aes( x = val1, y = val2 ) ) +
  geom_point( fill = NA, shape = 21, alpha = 0.5, size = 4 ) +
  labs( title = 'Current by Previous Year' ) +
  xlab( 'Field Goal Percentage in Year t-1' ) +
  ylab( 'Field Goal Percentage in Year t' )

Overall, the correlation is very weak. However, this does not take into account the abilities of each of the athletes. Another approach would be to fit a linear regression to each athlete’s performance across the years.

name_labs <- rep( nflfg_yeardif$Name, each = 3 )
BothY$Names <- name_labs

ggplot( BothY, aes( x = val1, y = val2, fill = Names ) ) +
  geom_point( shape = 21, alpha = 0.5, size = 4 ) +
  labs( title = 'Current by Previous Year' ) +
  xlab( 'Field Goal Percentage in Year t-1' ) +
  ylab( 'Field Goal Percentage in Year t' )

Now to look at the linear regression coefficients by athlete:

by_athlete <- BothY %>%
  group_by( Names )

lm_byAthlete <- do( by_athlete, tidy( lm( val2 ~ val1, data = . ) ) )

lm_byAthlete <- lm_byAthlete %>%
  dplyr::select( c( Names, term, estimate ) ) %>%
  pivot_wider( names_from = term, values_from = estimate )


lm_byAthlete
## # A tibble: 19 x 3
## # Groups:   Names [19]
##    Names                `(Intercept)`     val1
##    <chr>                        <dbl>    <dbl>
##  1 Adam Vinatieri               144.  -0.681  
##  2 David Akers                  124.  -0.555  
##  3 Jason Elam                   121.  -0.465  
##  4 Jason Hanson                  61.3  0.242  
##  5 Jay Feely                     84.8 -0.0711 
##  6 Jeff Reed                    126.  -0.593  
##  7 Jeff Wilkins                 131.  -0.533  
##  8 John Carney                  104.  -0.275  
##  9 John Hall                     79.7  0.00468
## 10 Kris Brown                   112.  -0.509  
## 11 Matt Stover                  190.  -1.10   
## 12 Mike Vanderjagt              141.  -0.632  
## 13 Neil Rackers                 148.  -0.833  
## 14 Olindo Mare                  233.  -2.03   
## 15 Phil Dawson                  109.  -0.241  
## 16 Rian Lindell                  92.8 -0.0941 
## 17 Ryan Longwell                 91.0 -0.121  
## 18 Sebastian Janikowski         108.  -0.346  
## 19 Shayne Graham                130.  -0.500

Allowing for a different intercept for each athlete (different abilities), it can be shown that if a kicker had a high field goal percentage the previous year, then they are predicted to have a lower goal percentage the current year.

2) Newspaper Circulation demonstrates the use of dummy variables along with transformations to overcome skewness

circ_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA621/master/circulation.csv'
circ_df <- read.table( circ_url, sep = '\t', header = TRUE )
circ_df <- rename( circ_df, Tabloid_hasComp = Tabloid.with.a.Serious.Competitor )
glimpse( circ_df )
## Rows: 89
## Columns: 4
## $ Newspaper       <chr> "Akron (OH) Beacon Journal ", "Albuquerque (NM) Journa…
## $ Sunday          <int> 185915, 154413, 165607, 622065, 233767, 465807, 227806…
## $ Weekday         <int> 134401, 109693, 111594, 371853, 183312, 301186, 179270…
## $ Tabloid_hasComp <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …

The feature Tabloid.with.a.Serious.Competitor is a dummy variable: it only takes a 0/1 (false/true) value to indicate an outcome.

Take a look at Sunday ~ Weekday circulation grouped by the dummy variable:

ggplot( circ_df, aes( x = Weekday, y = Sunday, color = factor( Tabloid_hasComp ) ) ) +
  geom_point( size = 3)

We can make the relationship much more linear (and the variance more constant) by plotting the log values:

ggplot( circ_df, aes( x = log(Weekday), y = log(Sunday), color = factor( Tabloid_hasComp ) ) ) +
  geom_point( size = 3)

3) Menu Pricing in an NYC Restaurant highlights the use of multiple regression. This is a classic dataset: pricing East/West of 5th Ave.
Produce a regression model to predict the price of dinner.

nyc_csv <- read.csv( 'nyc.csv' )
nyc_df <- data.frame( nyc_csv )
glimpse( nyc_df )
## Rows: 168
## Columns: 7
## $ Case       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Restaurant <chr> "Daniella Ristorante", "Tello's Ristorante", "Biricchino", …
## $ Price      <int> 43, 32, 34, 41, 54, 52, 34, 34, 39, 44, 45, 47, 52, 35, 47,…
## $ Food       <int> 22, 20, 21, 20, 24, 22, 22, 20, 22, 21, 19, 21, 21, 19, 20,…
## $ Decor      <int> 18, 19, 13, 20, 19, 22, 16, 18, 19, 17, 17, 19, 19, 17, 18,…
## $ Service    <int> 20, 19, 18, 17, 21, 21, 21, 21, 22, 19, 20, 21, 20, 19, 21,…
## $ East       <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
dta <- nyc_df %>%
  dplyr::select( c( Price, Food, Decor, Service ) )
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )

check out the distributions of the East/West variable:

ggplot( nyc_df, aes( x = factor( East ) , y = Price ) ) +
  geom_boxplot() +
  xlab( 'Direction off 5th Ave' ) +
  ggtitle( 'Price ~ East vs West' ) +
  scale_x_discrete( labels = c( 'West', 'East' ) )

mod <- lm( Price ~ Service + Decor + Food + East, nyc_df )
summary( mod )
## 
## Call:
## lm(formula = Price ~ Service + Decor + Food + East, data = nyc_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0465  -3.8837   0.0373   3.3942  17.7491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.023800   4.708359  -5.102 9.24e-07 ***
## Service      -0.002727   0.396232  -0.007   0.9945    
## Decor         1.910087   0.217005   8.802 1.87e-15 ***
## Food          1.538120   0.368951   4.169 4.96e-05 ***
## East          2.068050   0.946739   2.184   0.0304 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6187 
## F-statistic: 68.76 on 4 and 163 DF,  p-value: < 2.2e-16

4) Wine Critics’ Ratings

wine_csv <- read.csv( 'Bordeaux.csv' )
wine_df <- data.frame( wine_csv )
glimpse( wine_df )
## Rows: 72
## Columns: 9
## $ Wine             <chr> "Lafite", "Latour", "Margaux", "Mouton", "Haut Brion"…
## $ Price            <int> 2850, 2850, 2900, 2500, 2500, 3650, 4200, 10500, 880,…
## $ ParkerPoints     <int> 100, 98, 100, 97, 98, 100, 100, 100, 97, 96, 90, 87, …
## $ CoatesPoints     <dbl> 19.5, 18.5, 19.5, 17.0, 18.5, 19.5, 18.5, 18.5, 16.5,…
## $ P95andAbove      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ FirstGrowth      <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CultWine         <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Pomerol          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ VintageSuperstar <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
dta <- wine_df %>%
  dplyr::select( c( Price, ParkerPoints, CoatesPoints ) )
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )

wine_df <- wine_df %>%
  mutate_at( vars( P95andAbove, FirstGrowth, 
                   CultWine, Pomerol, VintageSuperstar ),
             list( factor ) )
firstG <- ggplot( wine_df, aes( y = Price, x = FirstGrowth ) ) +
  geom_boxplot()
p95 <- ggplot( wine_df, aes( y = Price, x = P95andAbove ) ) +
  geom_boxplot()
cult <- ggplot( wine_df, aes( y = Price, x = CultWine ) ) +
  geom_boxplot()
pom <- ggplot( wine_df, aes( y = Price, x = Pomerol ) ) +
  geom_boxplot()
VS <- ggplot( wine_df, aes( y = Price, x = VintageSuperstar ) ) +
  geom_boxplot()

grid.arrange( firstG, p95, cult, pom, VS, ncol = 3 )

dta <- wine_df %>%
  dplyr::select( c( Price, ParkerPoints, CoatesPoints ) ) %>%
  mutate_at( vars( Price, ParkerPoints, CoatesPoints ),
             list( log ) )
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of log(features)" )

Over the chapters in this book, we will dive into these data sets much more deeply to model various aspects and build predictive capabilities

Simple Linear Regression

modeling the relationship between two variables as a straight line, that is, when \(Y\) is modeled as a linear function of \(X\).

Introduction to Least Squares Estimates

production_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA621/master/production.txt'
production_df <- read.table( production_url, sep = '\t', header = TRUE )
glimpse( production_df )
## Rows: 20
## Columns: 3
## $ Case    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ RunTime <int> 195, 215, 243, 162, 185, 231, 234, 166, 253, 196, 220, 168, 20…
## $ RunSize <int> 175, 189, 344, 88, 114, 338, 271, 173, 284, 277, 337, 58, 146,…
ggplot( production_df, aes( x = RunSize, y = RunTime ) ) +
  geom_point( fill = NA, shape = 21, alpha = 0.5, size = 4 ) +
  xlab( 'Run Size' ) +
  ylab( 'Run Time' )

We wish to develop an equation to model the relationship between Y, the run time, and X, the run size.

If the regression of Y on X is linear: \[Y_i = \mbox{E}(Y|X=x) + e_i = \beta_0 + \beta_1 x + e_i\] where \(e_i\) is the random error in \(Y_i\)
all unexplained variation is called random error. Random error does not depend on X, nor does it contain any information about Y (otherwise it would be systematic error).
residuals: the difference between the actual value of y and the predicted value of y.
we wish to describe a line of best fit which minimizes the residuals.

fit the production data with a linear model and visualize the residuals:

mod <- lm( RunTime ~ RunSize, production_df )
production_df$pred <- predict( mod )
production_df$res <- residuals( mod )
glimpse( production_df )
## Rows: 20
## Columns: 5
## $ Case    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ RunTime <int> 195, 215, 243, 162, 185, 231, 234, 166, 253, 196, 220, 168, 20…
## $ RunSize <int> 175, 189, 344, 88, 114, 338, 271, 173, 284, 277, 337, 58, 146,…
## $ pred    <dbl> 195.1152, 198.7447, 238.9273, 172.5611, 179.3014, 237.3719, 22…
## $ res     <dbl> -0.1152469, 16.2553496, 4.0726679, -10.5610965, 5.6985827, -6.…
ggplot( production_df, aes( x = RunSize, y = RunTime ) ) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = RunSize, yend = pred), alpha = .2) + 
  geom_point() +
  geom_point(aes(y = pred), shape = 1) +
  theme_classic()  
## `geom_smooth()` using formula 'y ~ x'

Least Squares: the coefficients of the line of best fit are chosen so as to minimize the sum of the squared residuals: \[\mbox{RSS} = \sum^n_{i=1}\hat{e}^2_i = \sum^n_{i=1}(y_i-\hat{y}_i)^2 = \sum^n_{i=1}(y_i - b_0 - b_1x_i)^2\]

summary( mod )
## 
## Call:
## lm(formula = RunTime ~ RunSize, data = production_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.597 -11.079   3.329   8.302  29.627 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
## RunSize       0.25924    0.03714    6.98 1.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.25 on 18 degrees of freedom
## Multiple R-squared:  0.7302, Adjusted R-squared:  0.7152 
## F-statistic: 48.72 on 1 and 18 DF,  p-value: 1.615e-06

The equation for the best fit line is given by: \[y = \hat{\beta}_0 + \hat{\beta}_1 x = 149.7 + 0.26x\]

  • The intercept, 149.7, is where the line of best fit crosses the y-axis.
  • The slope, 0.26, estimates the change in RunTime per unit change in RunSize: a 0.26 increase in RunTime for every unit increase of RunSize.
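
a quick check of the fitted equation (a minimal sketch assuming the model mod fit above; RunSize = 200 is just an illustrative value):

# evaluate the fitted equation at a hypothetical RunSize of 200
149.74770 + 0.25924 * 200                              # ~201.6, by hand
predict( mod, newdata = data.frame( RunSize = 200 ) )  # the same value from the model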

Estimating the variance of the error \(\sigma^2 = Var(e)\)
The residuals can be used to estimate \(\sigma^2\) \[S^2 = \frac{\mbox{RSS}}{n-2} = \frac{1}{n-2}\sum^n_{i=1}\hat{e}_i^2\]
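
a minimal check of this estimate, assuming the production model mod fit above:

RSS <- sum( residuals( mod )^2 )
n <- nrow( production_df )
S2 <- RSS / ( n - 2 )   # estimate of sigma^2
sqrt( S2 )              # should match the 'Residual standard error' of ~16.25
sigma( mod )^2          # the same estimate extracted directly from the lm fit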

Exercise 2.1

playbill_df <- read.csv( 'playbill.csv' )
glimpse( playbill_df )
## Rows: 18
## Columns: 3
## $ Production  <chr> "42nd Street", "Avenue Q", "Beauty and Beast", "Bombay Dre…
## $ CurrentWeek <int> 684966, 502367, 594474, 529298, 570254, 319959, 579126, 13…
## $ LastWeek    <int> 695437, 498969, 598576, 528994, 562964, 282778, 583177, 15…

fit a linear model to describe the relationship CurrentWeek ~ LastWeek:

playbill_mod <- lm( CurrentWeek ~ LastWeek, playbill_df )
playbill_mod_sum <- summary( playbill_mod )
playbill_mod_sum
## 
## Call:
## lm(formula = CurrentWeek ~ LastWeek, data = playbill_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36926  -7525  -2581   7782  35443 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.805e+03  9.929e+03   0.685    0.503    
## LastWeek    9.821e-01  1.443e-02  68.071   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18010 on 16 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9963 
## F-statistic:  4634 on 1 and 16 DF,  p-value: < 2.2e-16

visualize the results:

ggplot( playbill_df, aes( x = LastWeek, y = CurrentWeek ) ) +
  geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  xlab( 'Gross Box Office Results Last Week' ) +
  ylab( 'Gross Box Office Results Current Week' ) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  theme_classic() 
## `geom_smooth()` using formula 'y ~ x'

a) Find the 95% confidence interval for the slope of the regression model. Is 1 a plausible value for the slope? Give a reason to support your answer

The 95% confidence interval is given by the estimated slope coefficient \(\pm\) 2*SE:

#can be calculated directly from the coefficients
slope_coeff <- playbill_mod_sum$coefficients[ 2,1 ]
slope_SE <- playbill_mod_sum$coefficients[ 2,2 ]
slope_CI <- c( slope_coeff - 2*slope_SE, slope_coeff + 2*slope_SE )
slope_CI
## [1] 0.953227 1.010936
#alternatively can use the function confint
confint( playbill_mod, 'LastWeek', level = 0.95 )
##              2.5 %   97.5 %
## LastWeek 0.9514971 1.012666

We can say with 95% confidence that the slope of the regression line is within the range 0.953 to 1.011. Given this range, we can conclude that 1 is a plausible value for the slope of the line, as it is within the CI.

b) Test the null hypothesis that \(H_0: \beta_0 = 10000\) against a two-sided alternative. Interpret your results.

Taking a look at the confidence interval for the intercept:

confint( playbill_mod, '(Intercept)', level = 0.95 )
##                 2.5 %  97.5 %
## (Intercept) -14244.33 27854.1

10,000 is well within the 95% confidence interval. Therefore we fail to reject the null hypothesis that \(\beta_0 = 10000\).

c) Use the fitted regression model to estimate the gross box office results for the current week (in $) for a production with $400,000 in gross box office the previous week. Find a 95% prediction interval for the gross box office results for the current week (in $) for a production with $400,000 in gross box office the previous week. Is $450,000 a feasible value for the gross box office results in the current week, for a production with $400,000 in gross box office the previous week? Give a reason to support your answer.

lw400k <- data.frame( 'LastWeek' = 400000 )
pred <- predict( playbill_mod, newdata = lw400k, interval = 'predict' )
res <- paste( '1) Estimated GBO w/ LastWeek(400000):', round( pred[ 1 ], 0 ), 
              '\n2) 95% Prediction Interval: Lwr =', round( pred[ 2 ], 0 ), 
              'Upr =', round( pred[ 3 ], 0 ) )
cat( res, sep = '\n' )
## 1) Estimated GBO w/ LastWeek(400000): 399637 
## 2) 95% Prediction Interval: Lwr = 359833 Upr = 439442
From the prediction interval calculated above, we see that $450,000 is not feasible, because it lies well above the upper limit of the prediction interval (Upr = $439,442).

d) Some promoters of Broadway plays use the prediction rule that next week’s gross box office results will be equal to this week’s gross box office results. Comment on the appropriateness of this rule.

The rule is approximately correct. The slope of the regression line that models the current week’s gross box office results is approximately 1 (slightly less, at 0.982), 1.0 is well within the confidence interval, and the intercept is not significantly different from 0.

Exercise 2.2

A story by James R. Hagerty entitled ‘With Buyers Sidelined, Home Prices Slide’ published in the Wall Street Journal contained data on so-called fundamental housing indicators in major real estate markets across the US. The author argues that, ‘prices are generally falling and loan payments are piling up.’ Thus, we shall consider data presented in the article.

loading the data:

path <- '/home/bonzilla/Documents/MSDS/DATA621/indicators.txt'
indicators_df <- read.csv( path, sep = '\t', header = TRUE )
glimpse( indicators_df )
## Rows: 18
## Columns: 3
## $ MetroArea           <dbl> NA, NA, NA, NA, NA, NA, -6.1, -4.8, -6.4, -3.4, NA…
## $ PriceChange         <dbl> 1.20, -3.40, -0.90, 0.80, -0.70, -9.70, 4.90, 3.05…
## $ LoanPaymentsOverdue <dbl> 4.55, 3.31, 2.99, 4.26, 3.56, 4.71, NA, NA, NA, NA…

Fit the following model: \(Y = \beta_0 + \beta_1x + \epsilon\)
Where:

  • Y = Percent change in average price
  • x = Percent of mortgage loans 30 days or more overdue

indicators_mod <- lm( PriceChange ~ LoanPaymentsOverdue, data = indicators_df )
indicators_mod_sum <- summary( indicators_mod )
indicators_mod_sum
## 
## Call:
## lm(formula = PriceChange ~ LoanPaymentsOverdue, data = indicators_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.873 -2.890 -1.074  2.811  6.746 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            8.982      5.764   1.558   0.1578  
## LoanPaymentsOverdue   -3.193      1.546  -2.065   0.0728 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.385 on 8 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.3476, Adjusted R-squared:  0.2661 
## F-statistic: 4.263 on 1 and 8 DF,  p-value: 0.07282

visualize the data as well:

ggplot( indicators_df, aes( x = LoanPaymentsOverdue, y = PriceChange ) ) +
  geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  xlab( 'Percentage of Mortgage Loans >= 30days Overdue' ) +
  ylab( 'Percent Change in Average Price' ) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  theme_classic() 
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).

a) Find a 95% confidence interval for the slope of the regression model

confint( indicators_mod, 'LoanPaymentsOverdue', level = 0.95 )
##                         2.5 %    97.5 %
## LoanPaymentsOverdue -6.759119 0.3731191

Judging by the range of the 95% confidence interval, a negative linear association between the percent change in average price and the percentage of overdue mortgage loans is not statistically significant, because the 95% CI extends into positive slope values.

b) use the fitted regression model to estimate \(E(Y|X=4)\).

Find a 95% CI for \(E(Y|X=4)\). Is 0% a feasible value for \(E(Y|X=4)\)? Give a reason to support your answer.

use the model to predict the expected value at X = 4:

exp4 <- data.frame( 'LoanPaymentsOverdue' = 4 )
pred <- predict( indicators_mod, newdata = exp4, interval = 'predict' )
res <- paste( '1) E(Y|X=4):', round( pred[ 1 ], 0 ), 
              '\n2) 95% Prediction Interval: Lwr =', round( pred[ 2 ], 0 ), 
              'Upr =', round( pred[ 3 ], 0 ) )
cat( res, sep = '\n' )
## 1) E(Y|X=4): -4 
## 2) 95% Prediction Interval: Lwr = -14 Upr = 7

A value of 0% falls well within the interval computed for X = 4 (note that the call above uses interval = 'predict'; interval = 'confidence' would give the narrower interval for the expected value \(E(Y|X=4)\)). Consistent with part a), with only this data there is no statistically significant evidence of a negative relationship between changes in housing prices and the percentage of overdue loans.

Inferences about the Slope and Intercept

Find confidence intervals and perform hypothesis tests about either the slope or the intercept of a regression line.

We need to make some assumptions:

  • The relationship between the response and predictor is linear
  • the errors are independent
  • the errors are normally distributed.

intercept \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\] slope \[\hat{\beta}_1 = \frac{SXY}{SXX}\] where:

  • SXY: the sum of the products of the differences between each x and its mean and the differences between each y and its mean
  • SXX: the sum of squares of the differences between each x and their mean

we can also write \[\hat{\beta}_1 = \sum^n_{i=1}\frac{(x_i-\bar{x})}{SXX}y_i\] from this emerge 3 useful relations:

  1. \[E(\hat{\beta}_1|X) = \beta_1\]

  2. \[Var(\hat{\beta}_1|X) = \frac{\sigma^2}{SXX}\]

  3. \[\hat{\beta}_1|X \sim N \left( \beta_1, \frac{\sigma^2}{SXX} \right)\]
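
a minimal check of these formulas against the production data (assuming mod and production_df from above):

x <- production_df$RunSize
y <- production_df$RunTime
SXX <- sum( ( x - mean( x ) )^2 )
SXY <- sum( ( x - mean( x ) ) * ( y - mean( y ) ) )
b1 <- SXY / SXX                   # ~0.2592, matching coef( mod )['RunSize']
b0 <- mean( y ) - b1 * mean( x )  # ~149.75, matching the intercept
c( b0, b1 )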

find a 95% Confidence Interval for the production example:

glimpse( production_df )
## Rows: 20
## Columns: 5
## $ Case    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ RunTime <int> 195, 215, 243, 162, 185, 231, 234, 166, 253, 196, 220, 168, 20…
## $ RunSize <int> 175, 189, 344, 88, 114, 338, 271, 173, 284, 277, 337, 58, 146,…
## $ pred    <dbl> 195.1152, 198.7447, 238.9273, 172.5611, 179.3014, 237.3719, 22…
## $ res     <dbl> -0.1152469, 16.2553496, 4.0726679, -10.5610965, 5.6985827, -6.…
summary( mod )
## 
## Call:
## lm(formula = RunTime ~ RunSize, data = production_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.597 -11.079   3.329   8.302  29.627 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
## RunSize       0.25924    0.03714    6.98 1.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.25 on 18 degrees of freedom
## Multiple R-squared:  0.7302, Adjusted R-squared:  0.7152 
## F-statistic: 48.72 on 1 and 18 DF,  p-value: 1.615e-06
confint( mod )
##                   2.5 %      97.5 %
## (Intercept) 132.2509062 167.2444999
## RunSize       0.1812107   0.3372755

can find manually:

#for the slope
#from the summary( mod ) output
modsum <- summary( mod )
slope <- modsum$coefficients[ 2,1 ]
SE_slope <- modsum$coefficients[ 2,2 ]
n <- nrow( production_df )
tdist <- qt( 0.975, n-2 )
CI <- c( slope - tdist*SE_slope, slope + tdist*SE_slope )
CI
## [1] 0.1812107 0.3372755
#for the intercept
int <- modsum$coefficients[ 1,1 ]
SE_int <- modsum$coefficients[ 1,2 ]
n <- nrow( production_df )
tdist <- qt( 0.975, n-2 )
CI <- c( int - tdist*SE_int, int + tdist*SE_int )
CI
## [1] 132.2509 167.2445

Confidence Intervals for the Population Regression Line

\[E(Y|X = x^*) = \beta_0 + \beta_1x^*\] is estimated by \[\hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1x^*\] which leads to \[Var(\hat{y}^*|X) = \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX} \right)\]

Prediction Intervals for the Actual Value of Y

finding a prediction interval for the actual value of Y at \(x^*\), a given value of X.

  • \(E(Y|X = x^*)\) the expected value or average value of Y for a given value \(x^*\): this is what you expect, a fixed mean, that lies on the regression line. \(E(Y|X = x^*) \neq Y\) because Y can take many values
  • \(E(Y|X = x^*) \neq Y^*\), because \(Y^*\) can take many values and is probably not on the regression line at all.
  • Confidence Intervals are always reported for a parameter. Prediction Intervals are reported for random values (\(Y^*\))

Confidence Intervals \(\neq\) Prediction Intervals: The variability in the error of predicting a single value of Y will exceed the variability for estimating the expected value of Y, because the prediction must account for the random error \(e^*\) of the system.
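
Concretely, the variance of the prediction error picks up an extra \(\sigma^2\) term relative to the variance of the estimated mean: \[Var(Y^* - \hat{y}^*|X) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX} \right)\]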

# compare these CI values.....
head( predict( mod, interval = 'confidence' ) )
##        fit      lwr      upr
## 1 195.1152 187.2000 203.0305
## 2 198.7447 191.0450 206.4443
## 3 238.9273 225.4549 252.3998
## 4 172.5611 160.8529 184.2693
## 5 179.3014 169.0456 189.5572
## 6 237.3719 224.2825 250.4613
# ....with these PI values:
head( predict( mod, interval = 'prediction' ) )
## Warning in predict.lm(mod, interval = "prediction"): predictions on current data refer to _future_ responses
##        fit      lwr      upr
## 1 195.1152 160.0646 230.1659
## 2 198.7447 163.7421 233.7472
## 3 238.9273 202.2204 275.6343
## 4 172.5611 136.4643 208.6578
## 5 179.3014 143.6493 214.9536
## 6 237.3719 200.8038 273.9400

We see that the estimates are the same. However, the interval span is much larger for the prediction than the confidence intervals.

ANOVA

To test whether there is a linear association between Y and X, we have to test \(H_0 : \beta_1 = 0 \mbox{ against } H_A: \beta_1 \neq 0\). This can be evaluated with the F-test statistic \[F = \frac{\frac{SSreg}{1}}{\frac{RSS}{(n-2)}}\] where:

  • SSreg: the sum of squares explained by the regression model, \(\sum^n_{i=1}(\hat{y}_i-\bar{y})^2\) (the regression estimates minus the mean)
  • RSS: the residual sum of squares, \(\sum^n_{i=1}(y_i-\hat{y}_i)^2\) (the observed y minus the regression estimates)
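
a minimal sketch, assuming the production model mod from earlier; anova() on a single lm fit reports SSreg, RSS and this F statistic directly:

SSreg <- sum( ( fitted( mod ) - mean( production_df$RunTime ) )^2 )
RSS <- sum( residuals( mod )^2 )
( SSreg / 1 ) / ( RSS / ( nrow( production_df ) - 2 ) )  # ~48.72, matching summary( mod )
anova( mod )  # the same F statistic appears in the RunSize row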

Dummy Variable Regression

testing the means of two groups

changeover_df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/changeover.txt', sep = '\t', header = TRUE )
changeover_df <- changeover_df %>%
  mutate( Method = as.factor( Method ),
          New = as.factor( New ) )
glimpse( changeover_df )
## Rows: 120
## Columns: 3
## $ Method     <fct> Existing, Existing, Existing, Existing, Existing, Existing,…
## $ Changeover <int> 19, 24, 39, 12, 29, 19, 23, 22, 12, 29, 22, 23, 12, 40, 16,…
## $ New        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
ggplot( changeover_df, aes( x = Method, y = Changeover ) ) +
  geom_boxplot() +
  theme_classic() 

We wish to develop an equation to model the relationship between Changeover and the Method. Does the new method decrease the changeover time?

co_mod <- lm( Changeover ~ Method, data = changeover_df )
summary( co_mod )
## 
## Call:
## lm(formula = Changeover ~ Method, data = changeover_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.861  -5.861  -1.861   4.312  25.312 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.8611     0.8905  20.058   <2e-16 ***
## MethodNew    -3.1736     1.4080  -2.254    0.026 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.556 on 118 degrees of freedom
## Multiple R-squared:  0.04128,    Adjusted R-squared:  0.03315 
## F-statistic: 5.081 on 1 and 118 DF,  p-value: 0.02604

We wish to test whether the coefficient of the dummy variable is significantly different than 0.

From the model summary output, we see that the t-value is -2.254 and that the p-value is significant at \(\alpha = 0.05\). Therefore, there is significant evidence of a reduction in means for the new method.

Find the confidence interval:

confint( co_mod)
##                 2.5 %     97.5 %
## (Intercept) 16.097721 19.6245015
## MethodNew   -5.961776 -0.3854461

Note that the CI for the New method does not span 0; this means that we can reject the null hypothesis that there is no difference between the means of the old and new methods.

Exercise 2.3

The manager of the purchasing department of a large company would like to develop a regression model to predict the average amount of time it takes to process a given number of invoices. Over a 30 day period, data are collected on the number of invoices processed and the total time taken (in hours). The following model was fit to the data: \(\mbox{processing time} = \beta_0 + \beta_1 \mbox{#invoices} + \epsilon\)

invoices_df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/invoices.txt', sep = '\t', header = TRUE )
glimpse( invoices_df ) 
## Rows: 30
## Columns: 3
## $ Day      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ Invoices <int> 149, 60, 188, 23, 201, 58, 77, 222, 181, 30, 110, 83, 60, 25,…
## $ Time     <dbl> 2.1, 1.8, 2.3, 0.8, 2.7, 1.0, 1.7, 3.1, 2.8, 1.0, 1.5, 1.2, 0…
ggplot( invoices_df, aes( x = Invoices, y = Time ) ) +
  geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  theme_classic() +
  ggtitle( 'Scatterplot of Invoices data' )

a) Find a 95% confidence interval for the start-up time, i.e. \(\beta_0\)

invoices_mod <- lm( Time ~ Invoices, data = invoices_df )
invoices_sum <- summary( invoices_mod )
invoices_sum
## 
## Call:
## lm(formula = Time ~ Invoices, data = invoices_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59516 -0.27851  0.03485  0.19346  0.53083 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.6417099  0.1222707   5.248 1.41e-05 ***
## Invoices    0.0112916  0.0008184  13.797 5.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3298 on 28 degrees of freedom
## Multiple R-squared:  0.8718, Adjusted R-squared:  0.8672 
## F-statistic: 190.4 on 1 and 28 DF,  p-value: 5.175e-14
confint( invoices_mod )
##                   2.5 %     97.5 %
## (Intercept) 0.391249620 0.89217014
## Invoices    0.009615224 0.01296806
b) Suppose that a best practice benchmark for the average processing time for an additional invoice is 0.01 hours (or 0.6 minutes). Test the null hypothesis against the two-sided alternative. Interpret your result.

The confidence interval for the Invoices coefficient spans the value \(\beta_1 = 0.01\). Therefore, we cannot reject the null hypothesis that the average processing time per additional invoice differs from the benchmark.

c) Find a point estimate and a 95% prediction interval for the time taken to process 130 invoices. The point estimate is ~2.11 hours; note that the call below uses interval = 'confidence', so interval = 'prediction' would be needed to obtain the wider prediction interval that was actually requested.

predict( invoices_mod, newdata = data.frame( 'Invoices' = 130 ), interval = 'confidence' )
##        fit      lwr      upr
## 1 2.109624 1.986293 2.232954

Diagnostics and Transformations for Simple Linear Regression

Numerical regression output should always be supplemented by an analysis to ensure that an appropriate model has been fitted to the data.

Residuals: using plots of residuals to determine whether the proposed regression model is a valid model. If a linear model is appropriate, the residuals should resemble random errors; there should be no pattern. If the residuals vary with respect to x, then this indicates that an incorrect model has been fit.

Regression Diagnostics: Tools for checking the validity of a model.

  • Is the model valid? Does the model fit the data? Use plots of standardized residuals
  • Are there any data points with unusually large leverage?
  • Are any of the data points extreme outliers?
  • If a leverage point exists, is it a bad one?
  • Is the assumption of constant variance of the errors reasonable?
  • If the data are collected over time, examine whether the data are correlated over time.
  • Is the assumption that the errors are normally distributed reasonable?

Leverage Points

Leverage Points: data which exercise considerable influence on the fitted model

a look at Huber’s good/bad leverage points

x <- c( -4, -3, -2, -1, 0, 10 )
YBad <- c( 2.48, 0.73, -0.04, -1.44, -1.32, 0 )
YGood <- c( 2.48, 0.73, -0.04, -1.44, -1.32, -11.4)

huber <- data.frame( x, YBad, YGood ) %>%
  pivot_longer( cols = c( YBad, YGood ), names_to = 'Y', values_to = 'val' ) %>%
  ggplot( aes( x = x, y = val, color = Y  ) ) +
  geom_point() +
  geom_smooth( method = 'lm' )
huber  
## `geom_smooth()` using formula 'y ~ x'

In the above plot, x = 10 is a leverage point in both data series: the value of x is far from the other values and the value of Y has a very large effect on the least squares regression line.

A numerical rule to identify leverage points is based on (1) how far \(x_i\) is from the bulk of the x’s and (2) the extent to which the fitted regression line is attracted to a given point.
\[\mbox{Leverage } \rightarrow h_{ii} = \frac{1}{n}+\frac{(x_i - \bar{x})^2}{\sum_{j=1}^n(x_j-\bar{x})^2}\] The Rule: classify \(x_i\) as a point of high leverage in a simple linear regression model if: \[h_{ii} \gt 2 \cdot average( h_{ii} ) = 2 \cdot \frac{2}{n} = \frac{4}{n}\]
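
a minimal sketch of this rule applied to the Huber toy data above (assuming the vectors x and YGood are still in the workspace):

toy_mod <- lm( YGood ~ x )
lev <- hatvalues( toy_mod )      # the h_ii values
lev
which( lev > 4 / length( x ) )   # flags the 6th point ( x = 10 ) as high leverage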

Strategies for dealing with Bad leverage points:

  • Remove invalid data points
  • Fit a different regression model

Standardized Residuals: divide the residuals by an estimate of their standard deviation \[r_i = \frac{\hat{e}_i}{s \sqrt{1-h_{ii}}}\] where \(s\) is an estimate of \(\sigma\)
Common practice: in small/moderate datasets, label points as outliers if they fall outside the interval of -2 to 2 standard deviations. For large datasets, use the range -4 to 4.
Identification and examination of any outliers is a key part of regression analysis

Bad Leverage point: a leverage point whose standardized residual falls outside the interval -2 to 2
Good Leverage point: a leverage point whose standardized residual falls inside the interval -2 to 2

a relatively small number of outlying points can have a relatively large effect on the fitted model.

path <- '/home/bonzilla/Documents/MSDS/DATA621/bonds.csv'
bonds <- read.csv( path, sep = '\t' )
glimpse( bonds )
## Rows: 35
## Columns: 3
## $ Case       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ CouponRate <dbl> 7.000, 9.000, 7.000, 4.125, 13.125, 8.000, 8.750, 12.625, 9…
## $ BidPrice   <dbl> 92.94, 101.44, 92.66, 94.50, 118.94, 96.75, 100.88, 117.25,…
ggplot( bonds, aes( x = CouponRate, y = BidPrice ) ) +
  geom_point() +
  geom_smooth( method = 'lm' )
## `geom_smooth()` using formula 'y ~ x'

The linear model doesn’t do a great job here. However, this is largely because of the influence of three data points to the left.

bonds_mod <- lm( BidPrice ~ CouponRate, bonds )
standard_res <- rstandard( bonds_mod )
bonds$standard_res <- standard_res

ggplot( bonds, aes( x = CouponRate, y = standard_res ) ) +
  geom_point() +
  geom_hline( yintercept = -2, color = 'blue' ) +
  geom_hline( yintercept = 2, color = 'blue' )

As it turns out, the 3 points to the left are ‘flower’ bonds, which can be justifiably treated differently from the rest of the data set.
  • Points should not be routinely deleted from an analysis just because they do not fit the model
  • Outliers often point out an important feature of the problem not considered before

Cook’s Distance: the product of the square of the ith standardized residual divided by 2 and a monotonic function of the leverage that increases as the ith leverage value increases: \[D_i = \frac{r_i^2}{2} \cdot \frac{h_{ii}}{1-h_{ii}}\] Use the recommended cut-off value of \(\frac{4}{n-2}\)

bonds_cooks <- cooks.distance( bonds_mod )
bonds$cooks <- bonds_cooks

ggplot( bonds, aes( x = CouponRate, y = cooks ) ) +
  geom_point() +
  geom_hline( yintercept = 4/(35-2), color = 'blue' )

Constant Variance

Do the errors/residuals have constant variance?
Ignoring nonconstant variance when it exists invalidates all inferential tools (p-values, CIs, etc.)

cleaning_df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/cleaning.csv', sep = '\t', header = TRUE )
glimpse( cleaning_df )
## Rows: 53
## Columns: 3
## $ Case  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ Crews <int> 16, 10, 12, 16, 16, 4, 2, 4, 6, 2, 12, 8, 16, 2, 2, 2, 6, 10, 16…
## $ Rooms <int> 51, 37, 37, 46, 45, 11, 6, 19, 29, 14, 47, 37, 60, 6, 11, 10, 19…
ggplot( cleaning_df, aes( x = Crews, y = Rooms ) ) +
  geom_point() +
  geom_smooth( method = 'lm', se = F ) +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

cleaning_mod <- lm( data = cleaning_df, Rooms ~ Crews )
aug_cleaning_mod <- augment( cleaning_mod )
p1 <- ggplot( aug_cleaning_mod, aes( x = Crews, y = .std.resid) ) +
  geom_point() +
  theme_classic()
p1

plot( cleaning_mod )

The plot of standardized residuals against X shows that the variance is not constant. We will see how to deal with this in the next section

Transformations

How to use transformations to overcome problems due to non-constant variance and/or nonlinearity.

When both Y and X are measured in the same units then it is often natural to consider the same transformation for both X and Y.

sqmod <- lm( data = cleaning_df, sqrt(Rooms) ~ sqrt(Crews) )
aug_sqmod <- augment( sqmod )
glimpse( aug_sqmod )
## Rows: 53
## Columns: 7
## $ `sqrt(Rooms)` <dbl> 7.141428, 6.082763, 6.082763, 6.782330, 6.708204, 3.3166…
## $ `sqrt(Crews)` <dbl> 4.000000, 3.162278, 3.464102, 4.000000, 4.000000, 2.0000…
## $ .fitted       <dbl> 7.806449, 6.213452, 6.787395, 7.806449, 7.806449, 4.0032…
## $ .hat          <dbl> 0.05378764, 0.02187744, 0.02935793, 0.05378764, 0.053787…
## $ .sigma        <dbl> 0.5920552, 0.5996065, 0.5913094, 0.5811269, 0.5782587, 0…
## $ .cooksd       <dbl> 0.0376522689, 0.0005534837, 0.0219254135, 0.0892939179, …
## $ .std.resid    <dbl> -1.1509690, -0.2224671, -1.2040810, -1.7724704, -1.90076…
p1 <- ggplot( aug_sqmod, aes( x = `sqrt(Crews)`, y = `sqrt(Rooms)`) ) +
  geom_point() +
  geom_smooth( method = 'lm', se = F ) +
  theme_classic()
p2 <- ggplot( aug_sqmod, aes( x = `sqrt(Crews)`, y = .std.resid) ) +
  geom_point() +
  theme_classic()

grid.arrange( p1, p2, ncol = 2 )
## `geom_smooth()` using formula 'y ~ x'

Note that the standardized residuals no longer have the funnel shape they took on in the last section

plot( sqmod )

Using Logarithms to estimate percentage effects

path <- '/home/bonzilla/Documents/MSDS/DATA621/confood1.csv'
confood1 <- read.csv( path, sep = '\t' )
glimpse( confood1 )
## Rows: 52
## Columns: 3
## $ Week  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ Sales <int> 611, 673, 710, 478, 425, 601, 607, 637, 507, 467, 411, 538, 2670…
## $ Price <dbl> 0.67, 0.66, 0.67, 0.66, 0.66, 0.65, 0.67, 0.67, 0.68, 0.68, 0.78…
ggplot( confood1, aes( x = Price, y = Sales ) ) +
  geom_point() +
  geom_smooth( method = 'lm', se = F) +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

When studying the relationship between price and quantity in economics, it is common practice to take the logarithms of both since interest lies in predicting the effect of a 1% increase in price on quantity sold: \[log(Q) = \beta_0 + \beta_1 log(P) + \epsilon\] Here the slope of the model approximately equals the ratio of the percentage changes in Q & P. (Price Elasticity)

ggplot( confood1, aes( x = log(Price), y = log(Sales ) ) ) +
  geom_point() +
  geom_smooth( method = 'lm', se = F) +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'
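
to estimate the elasticity itself, a minimal sketch fitting the log-log model to the confood1 data loaded above:

elas_mod <- lm( log( Sales ) ~ log( Price ), data = confood1 )
coef( elas_mod )[ 2 ]   # slope ~ the % change in Sales for a 1% change in Price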

Using transformations to overcome problems due to nonlinearity:

Inverse response plots and Box-Cox procedures

3 scenarios:

  • Response variable needs to be transformed
path <- '/home/bonzilla/Documents/MSDS/DATA621/restrans.csv'
restrans <- read.csv( path, sep = '\t' )
ggplot( restrans, aes( x = x, y = y ) ) +
  geom_point() +
  theme_classic()

restrans_mod <- lm( data = restrans, y ~ x )
plot( restrans_mod )

plainly this is not a good candidate for a linear model

inverseResponsePlot( restrans_mod, lambda = c(0, 0.33, 1), 
    robust=FALSE, xlab=NULL, id=FALSE )

##      lambda       RSS
## 1 0.3321126  265.8749
## 2 0.0000000 3583.8067
## 3 0.3300000  265.9846
## 4 1.0000000 7136.8828

Transform a response variable using the Box-Cox method: the Box-Cox procedure aims to find the transformation that makes the transformed variable as close to normally distributed as possible.
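
a minimal sketch of the Box-Cox estimate for this response, assuming the car package loaded above and its powerTransform() function:

# estimate the Box-Cox lambda for y in the restrans data
summary( powerTransform( y ~ x, data = restrans ) )
# an estimated lambda near 1/3 supports the cube-root transform used below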

restrans_mod_trans <- lm( data = restrans, y^(1/3) ~ x )
restrans_mod_trans_aug <- augment( restrans_mod_trans )
glimpse( restrans_mod_trans_aug )
## Rows: 250
## Columns: 7
## $ `y^(1/3)`  <dbl> 3.218615, 2.724020, 2.624105, 3.659263, 2.520624, 3.681569,…
## $ x          <dbl> 3.32099, 2.69991, 2.70481, 3.73984, 2.57108, 3.68802, 2.399…
## $ .fitted    <dbl> 3.318152, 2.699276, 2.704159, 3.735516, 2.570903, 3.683880,…
## $ .hat       <dbl> 0.007926593, 0.004152817, 0.004162786, 0.013328615, 0.00400…
## $ .sigma     <dbl> 0.05139567, 0.05176348, 0.05153536, 0.05155666, 0.05168821,…
## $ .cooksd    <dbl> 1.493621e-02, 4.799079e-04, 5.035510e-03, 1.490123e-02, 1.9…
## $ .std.resid <dbl> -1.933587271, 0.479754346, -1.552168450, -1.485319465, -0.9…
ggplot( restrans_mod_trans_aug, aes( x = x, y = `y^(1/3)`)) +
  geom_point() +
  geom_smooth( method = 'lm', se = F ) +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

Check out how damn linear the data series is after the transform.
Next visualize the diagnostics:

plot( restrans_mod_trans )

  • Predictor variable needs to be transformed: again, consider scaled power transformations.
    Here only X is transformed, so we fit models of the form:
    \(E(Y|X=x) = \alpha_0 + \alpha_1 \psi_S(x,\lambda)\)

  • Both response and predictor need to be transformed: use the multivariate generalization of the Box-Cox transformation method

How To Think About It:

Multiple Linear Regression

It is common for more than one factor to influence an outcome.
Checking Multiple Linear Regression Model Validity…

Polynomial Regression

Modeling Salary from years of experience

path <- '/home/bonzilla/Documents/MSDS/DATA621/profsalary.csv'
profsalary <- read.csv( path, sep = '\t' )
glimpse( profsalary )
## Rows: 143
## Columns: 3
## $ Case       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Salary     <int> 71, 69, 73, 69, 65, 75, 66, 66, 67, 69, 76, 72, 69, 45, 72,…
## $ Experience <int> 26, 19, 22, 17, 13, 25, 35, 16, 16, 16, 26, 16, 25, 4, 17, …
p1 <- ggplot( profsalary, aes( x = Experience, y = Salary ) ) +
  geom_point() +
  geom_smooth( method = 'lm', se = F ) +
  theme_classic()
p1
## `geom_smooth()` using formula 'y ~ x'

very clearly a linear model is inappropriate here.

quadmod <- lm( Salary ~ Experience + I( Experience^2 ), data = profsalary )
quadmodaug <- augment( quadmod )
glimpse( quadmodaug )
## Rows: 143
## Columns: 9
## $ Salary            <int> 71, 69, 73, 69, 65, 75, 66, 66, 67, 69, 76, 72, 69, …
## $ Experience        <int> 26, 19, 22, 17, 13, 25, 35, 16, 16, 16, 26, 16, 25, …
## $ `I(Experience^2)` <I<dbl>>  676,  361,  484,  289,  169,  625, 1225,  256,  …
## $ .fitted           <dbl> 73.35796, 70.04661, 72.10555, 68.14082, 63.04965, 73…
## $ .resid            <dbl> -2.35796374, -1.04660895, 0.89444594, 0.85918200, 1.…
## $ .hat              <dbl> 0.01194999, 0.01314270, 0.01247932, 0.01305830, 0.01…
## $ .sigma            <dbl> 2.820221, 2.825976, 2.826358, 2.826437, 2.822485, 2.…
## $ .cooksd           <dbl> 2.858281e-03, 6.208200e-04, 4.299597e-04, 4.156184e-…
## $ .std.resid        <dbl> -0.842012837, -0.373962729, 0.319486160, 0.306980277…
p2 <- p1 + geom_line( data = quadmodaug, aes( x = Experience, y = .fitted ), color = 'red' )
p3 <- ggplot( quadmodaug, aes( x = Experience, y = .std.resid ) ) +
  geom_point() +
  theme_classic()
p4 <- ggplot( quadmodaug, aes( x = Experience, y = .cooksd ) ) +
  geom_point() +
  theme_classic()
grid.arrange( p2, p3, p4, ncol = 2 )
## `geom_smooth()` using formula 'y ~ x'

go ahead and look at the diagnostics:

plot( quadmod )

Estimation and Inference in Multiple Linear Regression

Multiple Linear Regression: \[E(Y|X_1=x_1,X_2=x_2,\dots , X_p=x_p) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p\] Thus, \[Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} +\epsilon_i\] Degrees of Freedom = Sample Size - Number of mean parameters

For linear regression, \(R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST}\)
However, adding irrelevant predictor variables to the regression equation often increases \(R^2\). To compensate, we define an adjusted coefficient of determination: \(R^2_{adj} = 1 - \frac{RSS/(n-p-1)}{SST/(n-1)}\)
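
a minimal check of the adjusted \(R^2\) formula, reusing the simple production model mod from earlier (so p = 1):

n <- nrow( production_df ); p <- 1
RSS <- sum( residuals( mod )^2 )
SST <- sum( ( production_df$RunTime - mean( production_df$RunTime ) )^2 )
1 - ( RSS / ( n - p - 1 ) ) / ( SST / ( n - 1 ) )  # ~0.7152
summary( mod )$adj.r.squared                       # matches the summary output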

The F-test is also used to test for a linear association between Y and ANY of the p x-variables. If the F-test is significant, then the next natural question to ask is: for which of the p x-variables is there evidence of a linear association?

Testing whether a subset of predictors have regression coefficients = 0: Ex: Menu pricing at a new Italian restaurant in NYC

nyc_csv <- read.csv( 'nyc.csv' )
nyc_df <- data.frame( nyc_csv )
glimpse( nyc_df )
## Rows: 168
## Columns: 7
## $ Case       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Restaurant <chr> "Daniella Ristorante", "Tello's Ristorante", "Biricchino", …
## $ Price      <int> 43, 32, 34, 41, 54, 52, 34, 34, 39, 44, 45, 47, 52, 35, 47,…
## $ Food       <int> 22, 20, 21, 20, 24, 22, 22, 20, 22, 21, 19, 21, 21, 19, 20,…
## $ Decor      <int> 18, 19, 13, 20, 19, 22, 16, 18, 19, 17, 17, 19, 19, 17, 18,…
## $ Service    <int> 20, 19, 18, 17, 21, 21, 21, 21, 22, 19, 20, 21, 20, 19, 21,…
## $ East       <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
nyc_mod <- lm( data = nyc_df, Price ~ Food + Decor + Service + East )
summary( nyc_mod )
## 
## Call:
## lm(formula = Price ~ Food + Decor + Service + East, data = nyc_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0465  -3.8837   0.0373   3.3942  17.7491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.023800   4.708359  -5.102 9.24e-07 ***
## Food          1.538120   0.368951   4.169 4.96e-05 ***
## Decor         1.910087   0.217005   8.802 1.87e-15 ***
## Service      -0.002727   0.396232  -0.007   0.9945    
## East          2.068050   0.946739   2.184   0.0304 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6187 
## F-statistic: 68.76 on 4 and 163 DF,  p-value: < 2.2e-16

Decor has the largest regression coefficient and is the most statistically significant in the model, whereas Service’s coefficient has the smallest magnitude and does not appear to be statistically significant.

Analysis of Covariance

Possible outcomes:

  • coincident regression lines
  • parallel regression lines
  • regression lines with equal intercepts but different slopes
  • unrelated regression lines

Ex: Amount spent on travel

path <- '/home/bonzilla/Documents/MSDS/DATA621/travel.csv'
travel <- read.csv( path, sep = '\t' )
glimpse( travel )
## Rows: 925
## Columns: 4
## $ Amount  <int> 997, 997, 951, 649, 1265, 1059, 837, 924, 852, 963, 1046, 979,…
## $ Age     <int> 44, 43, 41, 59, 25, 38, 46, 42, 48, 39, 36, 44, 40, 30, 55, 26…
## $ Segment <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…
## $ C       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
ggplot( travel, aes( x = Age, y = Amount, color = factor(Segment) ) ) +
  geom_point() +
  theme_classic()

travel_mod <- lm( data = travel, Amount ~ Age + C + C*Age )
summary( travel_mod )
## 
## Call:
## lm(formula = Amount ~ Age + C + C * Age, data = travel)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -143.298  -30.541   -0.034   31.108  130.743 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1814.5445     8.6011   211.0   <2e-16 ***
## Age           -20.3175     0.1878  -108.2   <2e-16 ***
## C           -1821.2337    12.5736  -144.8   <2e-16 ***
## Age:C          40.4461     0.2724   148.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.63 on 921 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9599 
## F-statistic:  7379 on 3 and 921 DF,  p-value: < 2.2e-16

Reduced Model:

travel_mod_reduced <- lm( data = travel, Amount ~ Age )
summary( travel_mod_reduced )
## 
## Call:
## lm(formula = Amount ~ Age, data = travel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -545.06 -199.03    6.34  198.74  497.39 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 957.9103    31.3056  30.599   <2e-16 ***
## Age          -1.1140     0.6784  -1.642    0.101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 237.7 on 923 degrees of freedom
## Multiple R-squared:  0.002913,   Adjusted R-squared:  0.001833 
## F-statistic: 2.697 on 1 and 923 DF,  p-value: 0.1009

perform ANOVA to look at the F-test & compare the two models:

anova( travel_mod_reduced, travel_mod )
## Analysis of Variance Table
## 
## Model 1: Amount ~ Age
## Model 2: Amount ~ Age + C + C * Age
##   Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
## 1    923 52158945                                 
## 2    921  2089377  2  50069568 11035 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is very strong evidence against the reduced model. Thus, we prefer the unrelated regression lines model (full) to the coincident lines model (reduced)

Going back to the NYC Italian restaurants:

nyc_full_mod <- lm( Price ~ Food + Decor + Service + East + Food:East + Decor:East + Service:East, nyc_df )
summary( nyc_full_mod )
## 
## Call:
## lm(formula = Price ~ Food + Decor + Service + East + Food:East + 
##     Decor:East + Service:East, data = nyc_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5099  -3.7996  -0.1413   3.6522  17.1656 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -26.9949     8.4672  -3.188  0.00172 ** 
## Food           1.0068     0.5704   1.765  0.07946 .  
## Decor          1.8881     0.2984   6.327  2.4e-09 ***
## Service        0.7438     0.6443   1.155  0.25001    
## East           6.1253    10.2499   0.598  0.55095    
## Food:East      1.2077     0.7743   1.560  0.12079    
## Decor:East    -0.2500     0.4570  -0.547  0.58510    
## Service:East  -1.2719     0.8171  -1.557  0.12151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.713 on 160 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.622 
## F-statistic: 40.27 on 7 and 160 DF,  p-value: < 2.2e-16

NYC ‘final’ mod:

nyc_final_mod <- lm( Price ~ Food + Decor + East, nyc_df )
summary( nyc_full_mod )
## 
## Call:
## lm(formula = Price ~ Food + Decor + Service + East + Food:East + 
##     Decor:East + Service:East, data = nyc_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5099  -3.7996  -0.1413   3.6522  17.1656 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -26.9949     8.4672  -3.188  0.00172 ** 
## Food           1.0068     0.5704   1.765  0.07946 .  
## Decor          1.8881     0.2984   6.327  2.4e-09 ***
## Service        0.7438     0.6443   1.155  0.25001    
## East           6.1253    10.2499   0.598  0.55095    
## Food:East      1.2077     0.7743   1.560  0.12079    
## Decor:East    -0.2500     0.4570  -0.547  0.58510    
## Service:East  -1.2719     0.8171  -1.557  0.12151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.713 on 160 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.622 
## F-statistic: 40.27 on 7 and 160 DF,  p-value: < 2.2e-16

now to look at the F-test for the ANOVA:

anova( nyc_full_mod, nyc_final_mod )
## Analysis of Variance Table
## 
## Model 1: Price ~ Food + Decor + Service + East + Food:East + Decor:East + 
##     Service:East
## Model 2: Price ~ Food + Decor + East
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    160 5222.2                           
## 2    164 5366.5 -4   -144.36 1.1057 0.3558

Given the ANOVA p-val, there is little evidence to reject the null hypothesis (no difference between models). Therefore, we can adopt the simpler reduced model without losing much predictive power.

Diagnostics and Transformations for Multiple Linear Regression

Variable Selection

Logistic Regression

Logistic Regression - here we consider the situation where the response variable is a binary yes/no outcome (or a count of successes out of a number of trials). Such responses follow a binomial distribution, in which case the appropriate model is a logistic regression model.

Logistic Regression based on a single predictor

Binomial Distributions - \(Y \sim \mbox{Bin}(m,\theta)\) a process where:

  • there are \(m\) identical trials
  • each trial results in one of two outcomes, either a success or a failure
  • the probability for a success is the same for all trials
  • the trials are independent

Binomial process trials are called Bernoulli trials.
Consider the sample proportion of ‘successes’, \(\frac{y_i}{m_i}\), an unbiased estimate of \(\theta\) that varies between 0 and 1.
Least squares regression is inappropriate for binomial responses because the variance of the response is not constant.
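
To see why, note that \(\mbox{Var}(y_i/m_i) = \theta(x_i)(1-\theta(x_i))/m_i\), so the variance of a sample proportion depends on both the success probability and the number of trials; a quick illustration:

# variance of the sample proportion y/m for a Bin(m, theta) response
prop_var <- function( theta, m ) theta * ( 1 - theta ) / m
prop_var( 0.1, 20 )    # 0.0045
prop_var( 0.5, 20 )    # 0.0125 -- largest near theta = 0.5
prop_var( 0.5, 100 )   # 0.0025 -- shrinks as the number of trials grows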

zagatNYC <- read.csv( 'MichelinFood.csv', sep = '\t' )
glimpse( zagatNYC )
## Rows: 14
## Columns: 5
## $ Food          <int> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
## $ InMichelin    <int> 0, 0, 0, 2, 5, 8, 15, 4, 12, 6, 11, 1, 6, 4
## $ NotInMichelin <int> 1, 1, 8, 13, 13, 25, 11, 8, 6, 1, 1, 1, 1, 0
## $ mi            <int> 1, 1, 8, 15, 18, 33, 26, 12, 18, 7, 12, 2, 7, 4
## $ proportion    <dbl> 0.00, 0.00, 0.00, 0.13, 0.28, 0.24, 0.58, 0.33, 0.67, 0.…
ggplot( aes( y  = proportion, x = Food ), data = zagatNYC ) +
  geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  theme_classic() +
  ggtitle( 'Sample Proportion ~ Food Rating' )

The shape of the data makes it clear that a straight-line fit does not adequately describe the relationship; a sigmoidal (S-shaped) curve is a much better approximation.

The Logistic Function and Odds.

A popular choice for the sigmoidal function is the logistic function: \[\theta (x) = \frac{\exp(\beta_0 + \beta_1x)}{1+\exp(\beta_0 + \beta_1x)} = \frac{1}{1+\exp(-(\beta_0 + \beta_1x))}\] Solving for the linear predictor gives \[\beta_0 + \beta_1x = \log \left(\frac{\theta(x)}{1 - \theta(x)} \right)\] where \[\mbox{logit}(\theta(x)) = \log \left( \frac{\theta(x)}{1 - \theta(x)} \right)\] is the log of the odds, called the logit.

if \(\theta = P(\mbox{success})\), \[\mbox{Odds in favor of success} = \frac{P(\mbox{success})}{1-P(\mbox{success})} = \frac{\theta}{1-\theta}\] \[\mbox{Odds against success} = \frac{1-P(\mbox{success})}{P(\mbox{success})} = \frac{1-\theta}{\theta}\]
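
A small numerical sketch of these relationships, converting a probability to odds, to the log odds (logit), and back:

theta <- 0.75
odds  <- theta / ( 1 - theta )    # 3: success is three times as likely as failure
logit <- log( odds )              # about 1.099
1 / ( 1 + exp( -logit ) )         # recovers theta = 0.75 via the logistic function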

ggplot( aes( y  = proportion, x = Food ), data = zagatNYC ) +
  geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
  geom_smooth( method = 'glm', col = 'red', method.args = list( family = 'binomial'), se = F ) +
  theme_classic() +
  ylab( 'P(included in Michelin)' ) +
  xlab( 'Zagat Food Rating' ) +
  ggtitle( 'Probability of Inclusion in Michelin Guide ~ Food Rating' )
## `geom_smooth()` using formula 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

zagatNYC_mod <- glm( cbind( InMichelin, NotInMichelin ) ~ Food, family = binomial, zagatNYC )
summary( zagatNYC_mod )
## 
## Call:
## glm(formula = cbind(InMichelin, NotInMichelin) ~ Food, family = binomial, 
##     data = zagatNYC)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4850  -0.7987  -0.1679   0.5913   1.5889  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.84154    1.86236  -5.821 5.84e-09 ***
## Food          0.50124    0.08768   5.717 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.427  on 13  degrees of freedom
## Residual deviance: 11.368  on 12  degrees of freedom
## AIC: 41.491
## 
## Number of Fisher Scoring iterations: 4

From this result, if the Zagat food rating is increased by one unit, then the odds of being included in the Michelin guide are multiplied by \(\exp(\hat{\beta_1}) = \exp(0.50124) \approx 1.6508\).
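
Where that factor comes from, computed directly from the fitted model:

exp( coef( zagatNYC_mod )[ 'Food' ] )   # odds multiplier per one-unit increase in Food, about 1.65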

Find the estimated probabilities and the odds obtained using the logistic model

zagatNYC_odds <- data.frame( 'Food' = zagatNYC$Food, 'pInc' = fitted( zagatNYC_mod ) ) %>%
  mutate( odds = pInc / ( 1 - pInc ) )
head( zagatNYC_odds )
##   Food       pInc       odds
## 1   15 0.03479091 0.03604494
## 2   16 0.05615999 0.05950160
## 3   17 0.08943808 0.09822295
## 4   18 0.13952045 0.16214266
## 5   19 0.21114424 0.26765886
## 6   20 0.30644222 0.44184094

Likelihood for Logistic Regression with a Single Predictor

How likelihood can be used to estimate the parameters of logistic regression
The likelihood function is the probability of the observed data regarded as a function of the unknown parameters, which enter through the probability of success \(\theta(x)\).

Since \(1 - \theta(x_i) = \frac{1}{1+\exp(\beta_0 + \beta_1x_i)}\), \[\log(1 - \theta(x_i)) = \log \left( \frac{1}{1+\exp(\beta_0 + \beta_1x_i)} \right) = -\log\left(1+\exp(\beta_0 + \beta_1x_i)\right)\] and the parameters \(\beta_0\) and \(\beta_1\) can be estimated by maximizing the log-likelihood, typically via the Newton-Raphson method.
One way of testing the hypothesis \(H_0 : \beta_1 = 0\) is to use the Wald test statistic.
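
A minimal sketch of the Wald statistic for \(H_0 : \beta_1 = 0\), computed from the fitted model's coefficient table (this is the z value that summary() reports):

est <- coef( summary( zagatNYC_mod ) )[ 'Food', 'Estimate' ]
se  <- coef( summary( zagatNYC_mod ) )[ 'Food', 'Std. Error' ]
z_wald <- est / se                               # about 5.72
2 * pnorm( abs( z_wald ), lower.tail = FALSE )   # two-sided p-value, about 1e-08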

Explanation of Deviance

In logistic regression, deviance replaces the concept of the residual sum of squares.
\[G^2 = 2\sum_{i=1}^n \left[ y_i\mbox{log} \left(\frac{y_i}{\hat{y_i}}\right) + (m_i-y_i)\mbox{log}\left(\frac{m_i-y_i}{m_i-\hat{y_i}}\right) \right]\] where \(\hat{y_i} = m_i\hat{\theta}(x_i)\) are the fitted counts. We wish to test:

  • \(H_0\) - the logistic regression model is appropriate
  • \(H_A\) - the logistic regression model is inappropriate
zagatNYC_mod_sum <- summary( zagatNYC_mod )
zagatNYC_mod_sum
## 
## Call:
## glm(formula = cbind(InMichelin, NotInMichelin) ~ Food, family = binomial, 
##     data = zagatNYC)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4850  -0.7987  -0.1679   0.5913   1.5889  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.84154    1.86236  -5.821 5.84e-09 ***
## Food          0.50124    0.08768   5.717 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.427  on 13  degrees of freedom
## Residual deviance: 11.368  on 12  degrees of freedom
## AIC: 41.491
## 
## Number of Fisher Scoring iterations: 4
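
To carry out the deviance goodness-of-fit test of \(H_0\) above, compare the residual deviance \(G^2\) to a \(\chi^2\) distribution on the residual degrees of freedom (a minimal sketch; it relies on the grouped counts \(m_i\) being reasonably large):

G2 <- deviance( zagatNYC_mod )   # 11.368, the residual deviance above
pchisq( G2, df = df.residual( zagatNYC_mod ), lower.tail = FALSE )
# a large p-value (about 0.5), so no evidence that the logistic model is inappropriate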

Using differences in Deviance Values to Compare Models

Compare the null and residual deviances to test \(H_0 : \beta_1 = 0\) against \(H_A : \beta_1 \ne 0\)

zagatNYC_mod_anova <- anova( zagatNYC_mod, test = 'F' )
## Warning in anova.glm(zagatNYC_mod, test = "F"): using F test with a 'binomial'
## family is inappropriate
zagatNYC_mod_anova
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(InMichelin, NotInMichelin)
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev      F    Pr(>F)    
## NULL                    13     61.427                     
## Food  1   50.059        12     11.368 50.059 1.492e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
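
As the warning notes, an F test is not appropriate with a binomial family; the usual alternative is a chi-square test on the drop in deviance:

anova( zagatNYC_mod, test = 'Chisq' )
# or by hand, from the null and residual deviances above:
pchisq( 61.427 - 11.368, df = 1, lower.tail = FALSE )   # about 1.5e-12, matching the table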

\(R^2\) for Logistic Regression

\[R^2_{dev} = 1 - \frac{G^2_{H_A}}{G^2_{H_0}}\]

zagatNYC_mod_anova$`Resid. Dev`
## [1] 61.42704 11.36843
R2dev <- 1 - zagatNYC_mod_anova$`Resid. Dev`[2]/zagatNYC_mod_anova$`Resid. Dev`[1]
R2dev
## [1] 0.8149279

We can also look at the Pearson goodness-of-fit statistic \(X^2\), the sum of the squared Pearson residuals, to evaluate the fit.
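
A minimal sketch of that check: compute \(X^2\) from the Pearson residuals and refer it to the same \(\chi^2\) distribution used for the deviance.

X2 <- sum( residuals( zagatNYC_mod, type = 'pearson' )^2 )
pchisq( X2, df = df.residual( zagatNYC_mod ), lower.tail = FALSE )   # should again be a large p-value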

Residuals for Logistic Regression

Diagnostic procedures for logistic regression.

We can look at:

  • Response residuals - can be difficult to interpret because of nonconstant variance
  • Standardized Pearson residuals
  • Standardized deviance residuals
zagatNYC_3res <- data.frame( 'Food' = zagatNYC$Food, 'Response' = zagatNYC$proportion ) %>%
  mutate( work_resid = zagatNYC_mod$residuals,                        # working residuals ($residuals on a glm object)
          resp_resid = residuals( zagatNYC_mod, type = 'response' ),  # response residuals: observed proportion minus fitted probability
          pear_resid = residuals( zagatNYC_mod, type = 'pearson' ) )  # Pearson residuals
head( zagatNYC_3res )
##   Food Response  work_resid   resp_resid  pear_resid
## 1   15     0.00 -1.03604494 -0.034790906 -0.18985506
## 2   16     0.00 -1.05950160 -0.056159992 -0.24392950
## 3   17     0.00 -1.09822295 -0.089438079 -0.88644436
## 4   18     0.13 -0.05153588 -0.006187113 -0.06915833
## 5   19     0.28  0.40005165  0.066633542  0.69269290
## 6   20     0.24 -0.30121091 -0.064017976 -0.79770679

Binary Logistic Regression

the binary form of the zagat data:

zagatBinary <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/MichelinNY.csv' )
glimpse( zagatBinary )
## Rows: 164
## Columns: 6
## $ InMichelin      <int> 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, …
## $ Restaurant.Name <chr> "14 Wall Street", "212", "26 Seats", "44", "A", "A.O.C…
## $ Food            <int> 19, 17, 23, 19, 23, 18, 24, 23, 27, 20, 25, 23, 23, 27…
## $ Decor           <int> 20, 17, 17, 23, 12, 17, 21, 22, 27, 17, 26, 20, 27, 25…
## $ Service         <int> 19, 16, 21, 16, 19, 17, 22, 21, 27, 19, 27, 20, 23, 27…
## $ Price           <int> 50, 43, 35, 52, 24, 36, 51, 61, 179, 42, 71, 50, 82, 9…
ggplot( aes( y  = InMichelin, x = Food, col = factor(InMichelin) ), data = zagatBinary ) +
  geom_jitter( fill = NA, shape = 21, alpha = 0.5, size = 3, width = 0.1, height = 0.1 ) +
  geom_smooth( method = 'glm', col = 'red', method.args = list( family = 'binomial'), se = F ) +
  theme_classic() +
  ylab( 'Included in Michelin?' ) +
  xlab( 'Zagat Food Rating' ) +
  ggtitle( 'In Michelin Guide ~ Food Rating' )
## `geom_smooth()` using formula 'y ~ x'

zagatbinary_mod <- glm( InMichelin ~ Food, family = binomial, zagatBinary )
summary( zagatbinary_mod )
## 
## Call:
## glm(formula = InMichelin ~ Food, family = binomial, data = zagatBinary)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3484  -0.8555  -0.4329   0.9028   1.9847  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.84154    1.86234  -5.821 5.83e-09 ***
## Food          0.50124    0.08767   5.717 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.79  on 163  degrees of freedom
## Residual deviance: 175.73  on 162  degrees of freedom
## AIC: 179.73
## 
## Number of Fisher Scoring iterations: 4

Notice that the coefficient estimates are essentially the same, but the AIC and deviance are completely different, because for binary data the deviance is computed from the individual 0/1 responses rather than from the grouped binomial counts.

Deviance for the case of Binary Data

For binary data the residual deviance itself is not well approximated by a \(\chi^2\) distribution, but the distribution of the difference in deviances between two nested models is still approximately \(\chi^2\).
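
For example, the drop in deviance from the null model tests \(H_0 : \beta_1 = 0\) for the binary fit (values from the summary above):

pchisq( 225.79 - 175.73, df = 1, lower.tail = FALSE )   # essentially zero
# equivalently: anova( zagatbinary_mod, test = 'Chisq' )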

Residuals for Binary Data

zagatBinary_res <- data.frame( 'Food' = zagatBinary$Food ) %>%
  mutate( stPearson = resid(zagatbinary_mod, type='pearson')/sqrt(1 - hatvalues(zagatbinary_mod)),
          stDeviance = rstandard( zagatbinary_mod ) )

p1 <- ggplot( data = zagatBinary_res, aes( x = stDeviance, y = Food ) ) +  
  geom_point( fill = NA, shape = 21, alpha = 0.3, size = 3 ) +
  theme_classic() +
  ylab( 'Food Rating' ) +
  ggtitle( 'Standardized Deviance Residuals' )
p2 <- ggplot( data = zagatBinary_res, aes( x = stPearson, y = Food ) ) +  
  geom_point( fill = NA, shape = 21, alpha = 0.3, size = 3 ) +
  theme_classic() +
  ylab( 'Food Rating' ) +
  ggtitle( 'Standardized Pearson Residuals' )
  
grid.arrange( p1, p2, ncol = 2 )

The two types of standardized residuals show very similar patterns, but with a binary response the residuals are largely determined by whether \(y = 0\) or \(y = 1\), so residual plots are very difficult to interpret for binary data!

Transforming Predictors in Logistic Regression for Binary Data.

When the predictor variable \(X\) is normally distributed with a different variance for the two values of \(Y\), the log odds are a quadratic function of \(x\)
When the predictor variable \(X\) is normally distributed with the same variance for the two values of \(Y\), the log odds are a linear function of \(x\)
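
A sketch of where these facts come from: suppose \(X \mid Y = j \sim N(\mu_j, \sigma_j^2)\) for \(j = 0, 1\) and \(P(Y = 1) = \pi\). Then by Bayes’ theorem \[\log \left( \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} \right) = \log \left( \frac{\pi}{1-\pi} \right) + \log \left( \frac{\sigma_0}{\sigma_1} \right) + \frac{(x-\mu_0)^2}{2\sigma_0^2} - \frac{(x-\mu_1)^2}{2\sigma_1^2},\] which is a quadratic function of \(x\) in general; when \(\sigma_0 = \sigma_1\) the \(x^2\) terms cancel and the log odds are linear in \(x\).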

Transforming skewed predictor variables….

p1 <- ggplot( data = zagatBinary, aes( x = factor( InMichelin ), y = Food ) ) +
  geom_boxplot() +
  theme_classic()
p2 <- ggplot( data = zagatBinary, aes( x = factor( InMichelin ), y = Decor  ) ) +
  geom_boxplot() +
  theme_classic()
p3 <- ggplot( data = zagatBinary, aes( x = factor( InMichelin ), y = Service ) ) +
  geom_boxplot() +
  theme_classic()
p4 <- ggplot( data = zagatBinary, aes( x = factor( InMichelin ), y = Price ) ) +
  geom_boxplot() +
  theme_classic()

grid.arrange( p1, p2, p3, p4, ncol = 2 )

Let’s consider the following logistic regression model with these 4 predictors: \[\theta(x)=\frac{1}{1+ \exp(-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5\log(x_4)))}\]

Marginal Model plots for Binary Data

Residual plots are difficult to interpret for binary data, but marginal model plots offer a useful alternative…

#glimpse( zagatBinary )
zagatbinary_fullmod <- glm( InMichelin ~ Food + Decor + Service + Price + log(Price), family = binomial, zagatBinary )
mmps( zagatbinary_fullmod )

The plots for Service and Decor don’t track the data as well as those for the other features. Here we add a Service:Decor interaction term to the model:

zagatbinary_fullmod2 <- glm( InMichelin ~ Food + Decor + Service + Price + log( Price ) + Service:Decor, family = binomial, zagatBinary )
mmps( zagatbinary_fullmod2 )
## Warning in mmps(zagatbinary_fullmod2): Interactions and/or factors skipped

anova( zagatbinary_fullmod, zagatbinary_fullmod2 )
## Analysis of Deviance Table
## 
## Model 1: InMichelin ~ Food + Decor + Service + Price + log(Price)
## Model 2: InMichelin ~ Food + Decor + Service + Price + log(Price) + Service:Decor
##   Resid. Df Resid. Dev Df Deviance
## 1       158     136.43            
## 2       157     129.82  1   6.6106
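
The drop in deviance of 6.61 on 1 df can be turned into a p-value (or requested from anova() directly):

pchisq( 6.6106, df = 1, lower.tail = FALSE )   # about 0.01, so the Service:Decor term improves the fit
# equivalently: anova( zagatbinary_fullmod, zagatbinary_fullmod2, test = 'Chisq' )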

Look at the summary for our new model:

summary( zagatbinary_fullmod2 )
## 
## Call:
## glm(formula = InMichelin ~ Food + Decor + Service + Price + log(Price) + 
##     Service:Decor, family = binomial, data = zagatBinary)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5235  -0.6045  -0.1292   0.5332   2.3757  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -70.85308   15.45786  -4.584 4.57e-06 ***
## Food            0.66996    0.18276   3.666 0.000247 ***
## Decor           1.29788    0.49299   2.633 0.008471 ** 
## Service         0.91971    0.48829   1.884 0.059632 .  
## Price          -0.07456    0.04416  -1.688 0.091347 .  
## log(Price)     10.96400    3.22845   3.396 0.000684 ***
## Decor:Service  -0.06551    0.02512  -2.608 0.009119 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.79  on 163  degrees of freedom
## Residual deviance: 129.82  on 157  degrees of freedom
## AIC: 143.82
## 
## Number of Fisher Scoring iterations: 6

Price is not significant at the 5% level here, so let’s simplify the model by dropping it (while keeping log(Price)):

zagatbinary_fullmod3 <- glm( InMichelin ~ Food + Decor + Service + log( Price ) + Service:Decor, family = binomial, zagatBinary )
summary( zagatbinary_fullmod3 )
## 
## Call:
## glm(formula = InMichelin ~ Food + Decor + Service + log(Price) + 
##     Service:Decor, family = binomial, data = zagatBinary)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6555  -0.6270  -0.1598   0.5398   2.2935  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -63.76436   14.09848  -4.523 6.10e-06 ***
## Food            0.64274    0.17825   3.606 0.000311 ***
## Decor           1.50597    0.47883   3.145 0.001660 ** 
## Service         1.12633    0.47068   2.393 0.016711 *  
## log(Price)      7.29827    1.81062   4.031 5.56e-05 ***
## Decor:Service  -0.07613    0.02448  -3.110 0.001873 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.79  on 163  degrees of freedom
## Residual deviance: 131.23  on 158  degrees of freedom
## AIC: 143.23
## 
## Number of Fisher Scoring iterations: 6
mmps( zagatbinary_fullmod3 )
## Warning in mmps(zagatbinary_fullmod3): Interactions and/or factors skipped

The marginal model plots show reasonable agreement with the data, and all of the model terms are significant at the \(\alpha = 0.05\) level.
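
As a closing check (a sketch using the deviances reported above), dropping Price from the model that includes the interaction costs very little fit:

anova( zagatbinary_fullmod3, zagatbinary_fullmod2, test = 'Chisq' )
# deviance difference of 131.23 - 129.82 = 1.41 on 1 df:
pchisq( 131.23 - 129.82, df = 1, lower.tail = FALSE )   # about 0.23, so Price can safely be dropped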