Regression Notes

Simple Linear Regression (SLR)

Y is modeled as a linear function of X; the relationship between the two variables is assumed to follow a straight line.

library(pacman)
p_load(openintro,car,alr4,faraway,Stat2Data, ISLR,tidyverse)
bdims_males <- subset(bdims, sex == 1)
plot(wgt~hgt, data = bdims_males, xlab ="Height(cm)", ylab = "Weight(kg)", cex=0.5)

Model \[Y_i= \beta_0 +\beta_1x_i + e_i\]

Fitted or predicted value (the equation of the least squares line): \[\hat{y}_i=\hat{\beta_0} + \hat{\beta_1}x_i\]

Residuals: \[\hat{e_i} = y_i - \hat{y_i}\]

lm1 <- lm(wgt~hgt, data = bdims_males) #SLR

summary(lm1)
## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims_males)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.187  -5.924  -1.536   5.441  38.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -60.95336   14.05436  -4.337 2.11e-05 ***
## hgt           0.78257    0.07901   9.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.902 on 245 degrees of freedom
## Multiple R-squared:  0.2859, Adjusted R-squared:  0.283 
## F-statistic: 98.11 on 1 and 245 DF,  p-value: < 2.2e-16

\[\hat{y}=\hat{\beta_0} + \hat{\beta_1}x_i = -60.95 + 0.78x\] \[\hat{weight}=-60.95 + 0.78height\]

plot(wgt~hgt, data = bdims_males, cex = 0.5) #scatter plot
abline(lm1) #least squares line

cor(bdims_males$wgt, bdims_males$hgt)^2 # R^2, coeff of determination
## [1] 0.2859487
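As a quick check, the fitted values and residuals defined above can be extracted from the fitted object, and R^2 can also be computed as 1 - RSS/SST; a minimal sketch:

yhat <- fitted(lm1)   # fitted values, y-hat
ehat <- resid(lm1)    # residuals, y - y-hat
rss <- sum(ehat^2)    # residual sum of squares
sst <- sum((bdims_males$wgt - mean(bdims_males$wgt))^2)  # total sum of squares
1 - rss/sst           # ~0.286, matches the squared correlation above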

SLR Inference

confint(lm1) #95% CI
##                   2.5 %     97.5 %
## (Intercept) -88.6361527 -33.270576
## hgt           0.6269509   0.938186
#manual calculation
n<-nrow(bdims_males)
tcrit <- qt(0.975,df=n-2)
0.78 - tcrit*0.079
## [1] 0.6243942
0.78 + tcrit * 0.079
## [1] 0.9356058
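The small differences from confint() come from rounding 0.78257 and 0.07901; using the unrounded values from the coefficient table reproduces it (a quick sketch):

est <- coef(summary(lm1))["hgt", "Estimate"]     # unrounded slope estimate
se <- coef(summary(lm1))["hgt", "Std. Error"]    # unrounded standard error
est + c(-1, 1) * tcrit * se                      # matches confint(lm1): 0.627 to 0.938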

Prediction Intervals

head(trees)
lm2 <- lm(Volume ~Girth, data = trees) # SLR
summary(lm2)
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

The equation of the least squares line: \[\hat{y} = -36.94 + 5.066x\]

plot(Volume~Girth, data = trees)
abline(lm2)

#95% CI
new <- data.frame(Girth = 17) # new value for x, x* = 17
round(predict(lm2, newdata = new, interval = "confidence"),2)
##     fit   lwr   upr
## 1 49.18 46.72 51.63

The interpretation is that the predicted mean volume, for cherry trees that have a 17 inch diameter, is 49.18 cubic feet. Additionally, we are 95% confident that the population mean volume, for cherry trees that have a 17 inch diameter, is between 46.72 and 51.63 cubic feet.

# 95% Prediction Interval
round(predict(lm2, newdata = new, interval = "prediction"),2)
##     fit   lwr   upr
## 1 49.18 40.14 58.21

The interpretation is that the predicted volume, for a cherry tree that has a 17 inch diameter, is 49.18 cubic feet. Additionally, the 95% prediction interval is 40.14 to 58.21 cubic feet, meaning the actual volume of an individual cherry tree with a 17 inch diameter is likely to fall in that range.

Dummy Variables

head(marioKart,5)
ind <- which(marioKart$totalPr>100) # row indices of outliers (totalPr > 100)
marioKart2 <- marioKart[-ind,]
#create dummy variable
n<-nrow(marioKart2)
marioKart2$cond01 <- rep(0,n) # initialize the dummy variable cond01 to 0 for all n rows
marioKart2$cond01[marioKart2$cond == "new"] <- 1 # assign 1 for new
#fit model
lm3 <- lm(totalPr~cond01, data = marioKart2)
summary(lm3)
## 
## Call:
## lm(formula = totalPr ~ cond01, data = marioKart2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8911  -5.8311   0.1289   4.1289  22.1489 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.871      0.814  52.668  < 2e-16 ***
## cond01        10.900      1.258   8.662 1.06e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.371 on 139 degrees of freedom
## Multiple R-squared:  0.3506, Adjusted R-squared:  0.3459 
## F-statistic: 75.03 on 1 and 139 DF,  p-value: 1.056e-14
#plot
plot(totalPr~cond01, data = marioKart2, xaxt = 'n', xlim = c(-0.1, 1.1), xlab = "Condition", ylab = "Total Price")
axis(1, at = c(0,1), labels = c("Used(0)","New(1)"))
abline(lm3)

\[\hat{y} = 42.87 + 10.9x\]

confint(lm3)
##                 2.5 %   97.5 %
## (Intercept) 41.261713 44.48048
## cond01       8.411621 13.38754

We are 95% confident that the average price of a new game is between $8.41 and $13.39 more than that of a used game.
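An equivalent fit can be obtained without the manual 0/1 coding, since lm() builds dummy variables from factors automatically; a minimal sketch (releveling so that "used" is the baseline, matching cond01 above):

marioKart2$cond <- relevel(factor(marioKart2$cond), ref = "used") # make "used" the reference level
lm3b <- lm(totalPr ~ cond, data = marioKart2)
coef(lm3b) # intercept ~42.87 and condnew ~10.90, same as the manual dummy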

Regression Diagnostics

# Residual plots, to check the assumptions of linear regression
lm4 <- lm(wgt~hgt, data = bdims_males)
par(mfrow = c(1,2)) # split plot into 2 panes
plot(wgt~hgt, data = bdims_males, xlab = 'Height(cm)', ylab = 'Weight(kg)')
abline(lm4) #least square line
plot(predict(lm4),resid(lm4), xlab = 'Fitted Values', ylab = 'Residuals') #residual plot
abline(h=0)

## QQ plots to check normality
set.seed(100)
x<- rnorm(100)
par(mfrow = c(1,2))
hist(x)
qqnorm(x)
qqline(x)

The data follow a normal distribution.

set.seed(100)
y <- rexp(100) #random number from exp dist
par(mfrow = c(1,2))
hist(y)
qqnorm(y)
qqline(y)

Data do not follow a normal distribution.

# check normality of errors
par(mfrow = c(1,2))
hist(resid(lm4))
qqnorm(resid(lm4))
qqline(resid(lm4))

The points follow a straight line, indicating the residuals are approximately normal.
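For reference, the built-in plot() method for lm objects produces the same kinds of diagnostic plots:

par(mfrow = c(1,2))
plot(lm4, which = 1:2) # 1: residuals vs fitted values, 2: normal Q-Q plot of residuals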

Transformations for SLR

Transformations are used to linearize the relationship when the data are skewed and to mitigate problems due to non-constant variance.

Log Transformation

plot(BrainWt~BodyWt, data = brains)

Variables are extremely skewed.

# Log transformation
lm5 <- lm(log(BrainWt)~log(BodyWt), data = brains)
plot(log(BrainWt)~log(BodyWt), data = brains)
abline(lm5)

# Residual plot
plot(predict(lm5), resid(lm5), xlab = "Fitted Values", ylab = "Residuals")
abline(h=0,  lty = 2, col = "red")

The assumptions of linearity and constant variance appear to be well satisfied.

summary(lm5)
## 
## Call:
## lm(formula = log(BrainWt) ~ log(BodyWt), data = brains)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71550 -0.49228 -0.06162  0.43598  1.94833 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.13479    0.09604   22.23   <2e-16 ***
## log(BodyWt)  0.75169    0.02846   26.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared:  0.9208, Adjusted R-squared:  0.9195 
## F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16

\[\log(\widehat{BrainWt}) = 2.135 + 0.752\log(BodyWt)\]

Interpretation: 1% increase in body weight (kg) is associated with an approximate 0.75% increase in brain weight.

Square root transformation

Regression equation for the transformed model: \[\widehat{\sqrt Y} = 0.2001 + 1.9016\sqrt X\]
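The data behind this equation are not shown in these notes; as a sketch only, with dat, y, and x as hypothetical placeholders, such a model would be fit and used like this:

lm_sqrt <- lm(sqrt(y) ~ sqrt(x), data = dat) # dat, y, x are placeholders for the actual data
summary(lm_sqrt)
predict(lm_sqrt, newdata = data.frame(x = 10))^2 # back-transform a prediction by squaring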

Multiple Linear Regression (MLR)

Use the pairs() function to examine pairwise relationships among the variables. If a predictor has an obvious nonlinear (e.g. quadratic) association with the response, consider a polynomial regression model.
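A minimal sketch with the savings data used below (the quadratic term is purely illustrative, not a claim that pop15 needs one):

pairs(savings) # scatterplot matrix of all pairwise relationships
lm_quad <- lm(sr ~ pop15 + I(pop15^2), data = savings) # quadratic term for pop15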

F-Test for MLR

Multiple Predictors

head(savings)
str(savings)
## 'data.frame':    50 obs. of  5 variables:
##  $ sr   : num  11.43 12.07 13.17 5.75 12.88 ...
##  $ pop15: num  29.4 23.3 23.8 41.9 42.2 ...
##  $ pop75: num  2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ...
##  $ dpi  : num  2330 1508 2108 189 728 ...
##  $ ddpi : num  2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ...

Null hypothesis: \[H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\]

Alternative hypothesis: \[H_A: \text{at least one } \beta_i \neq 0\]

# full model
lm_full <- lm(sr~., data = savings)
#null or reduced model
lm_null <- lm(sr~1, data = savings)
anova(lm_null,lm_full)

Since the p-value < 0.001 we reject the null hypothesis. Thus, we conclude that at least one predictor is associated with the savings rate (sr).

summary(lm_full)
## 
## Call:
## lm(formula = sr ~ ., data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904
# F-test statistics using the formula
n <- nrow(savings) # total data points
p<- 4 #number of predictors
rss <- sum(resid(lm_full)^2) #residual sum of squares
rss
## [1] 650.713
sst <- sum(resid(lm_null)^2) # total sum of squares
sst
## [1] 983.6282
fstat <- ((sst-rss)/p)/(rss/(n-p-1))
fstat
## [1] 5.755681
pvalue <- (1 -pf(fstat, df1=p, df2 = n-p-1))
pvalue
## [1] 0.0007903779

One Predictor

Examine one predictor at a time using the t-tests in the summary: remove the least significant predictor, refit, and repeat the process.
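For example, dpi has the largest p-value in the full model above, so it would be the first predictor dropped (a sketch using update()):

lm_step1 <- update(lm_full, . ~ . - dpi) # refit the model without dpi
summary(lm_step1) # re-examine the remaining p-values and repeat if needed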

Partial F-Test for subset of predictors

\[H_0: \beta_{pop75} = \beta_{dpi} = 0\]

lm_partial <- lm(sr~pop15 + ddpi, data = savings) # without predictors from H0
anova(lm_partial, lm_full)

Since the p-value > 0.05, we do not reject the null hypothesis. That means we can drop both variables, pop75 and dpi, from the full model.

Categorical Predictors and Interactions

Categorical Predictors with Two levels

lm5 <- lm(Balance ~ Income + Student, data = Credit)
contrasts(Credit$Student) # details of dummy variable
##     Yes
## No    0
## Yes   1
summary(lm5)
## 
## Call:
## lm(formula = Balance ~ Income + Student, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -762.37 -331.38  -45.04  323.60  818.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 211.1430    32.4572   6.505 2.34e-10 ***
## Income        5.9843     0.5566  10.751  < 2e-16 ***
## StudentYes  382.6705    65.3108   5.859 9.78e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 391.8 on 397 degrees of freedom
## Multiple R-squared:  0.2775, Adjusted R-squared:  0.2738 
## F-statistic: 76.22 on 2 and 397 DF,  p-value: < 2.2e-16

Regression equation for the fit: \[\hat{Balance} = 211.14 + 5.98income + 382.67student\] where student is 0 (no) or 1 (yes).

ggplot(Credit, aes(Income, Balance, colour = Student)) + geom_point() + geom_abline(intercept = 211.14, slope = 5.98, color = "#F8766D") + geom_abline(intercept = 593.81, slope = 5.98, color = "#00BFC4")

#Check Interaction
lm6 <- lm(Balance ~ Income + Student + Income:Student, data = Credit)
summary(lm6)
## 
## Call:
## lm(formula = Balance ~ Income + Student + Income:Student, data = Credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773.39 -325.70  -41.13  321.65  814.04 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       200.6232    33.6984   5.953 5.79e-09 ***
## Income              6.2182     0.5921  10.502  < 2e-16 ***
## StudentYes        476.6758   104.3512   4.568 6.59e-06 ***
## Income:StudentYes  -1.9992     1.7313  -1.155    0.249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared:  0.2799, Adjusted R-squared:  0.2744 
## F-statistic:  51.3 on 3 and 396 DF,  p-value: < 2.2e-16

Equation for the fit: \[\hat{Balance} = 200.62 + 6.22income + 476.68student - 2.00income \times student\]

Categorical Predictors with More Than Two Levels

head(Wage)
# box plot
ggplot(Wage, aes(education, logwage)) + geom_boxplot() + coord_flip()

#model
lm7 <- lm(logwage~ education, data = Wage)
summary(lm7)
## 
## Call:
## lm(formula = logwage ~ education, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63580 -0.16716  0.00765  0.18079  1.24259 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.39759    0.01891 232.502  < 2e-16 ***
## education2. HS Grad          0.12295    0.02137   5.754 9.57e-09 ***
## education3. Some College     0.23821    0.02248  10.597  < 2e-16 ***
## education4. College Grad     0.37373    0.02231  16.752  < 2e-16 ***
## education5. Advanced Degree  0.56036    0.02414  23.212  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3096 on 2995 degrees of freedom
## Multiple R-squared:  0.2262, Adjusted R-squared:  0.2251 
## F-statistic: 218.8 on 4 and 2995 DF,  p-value: < 2.2e-16

Fitted Regression Model: \[\log(\hat{Wage}) = 4.398 + 0.122HSGrad + 0.238SomeCollege + 0.373CollegeGrad + 0.56AdvancedDegree\]

For the baseline category (< HS Grad) all dummy variables are 0, so

\[\log(\hat{Wage}) = 4.398\]

#include interaction between education and age
lm8 <- lm(logwage~age + education + age:education, data = Wage)
summary(lm8)
## 
## Call:
## lm(formula = logwage ~ age + education + age:education, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65539 -0.15483  0.00691  0.17417  1.24528 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.1921197  0.0640086  65.493  < 2e-16 ***
## age                              0.0049162  0.0014664   3.353 0.000811 ***
## education2. HS Grad              0.0979291  0.0731558   1.339 0.180791    
## education3. Some College         0.0644316  0.0775180   0.831 0.405937    
## education4. College Grad         0.4160484  0.0792801   5.248 1.65e-07 ***
## education5. Advanced Degree      0.6467308  0.0918866   7.038 2.40e-12 ***
## age:education2. HS Grad          0.0005434  0.0016738   0.325 0.745466    
## age:education3. Some College     0.0043591  0.0017917   2.433 0.015033 *  
## age:education4. College Grad    -0.0011018  0.0018093  -0.609 0.542593    
## age:education5. Advanced Degree -0.0022699  0.0020470  -1.109 0.267563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3022 on 2990 degrees of freedom
## Multiple R-squared:  0.2642, Adjusted R-squared:  0.262 
## F-statistic: 119.3 on 9 and 2990 DF,  p-value: < 2.2e-16
ggplot(Wage, aes(age, logwage)) + geom_point(size = 0.3, alpha = 0.6) + facet_wrap(~education) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

lm9 <- lm(logwage~age + education , data = Wage)
summary(lm9)
## 
## Call:
## lm(formula = logwage ~ age + education, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64758 -0.15373  0.00796  0.17330  1.24577 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.1645351  0.0273609 152.208  < 2e-16 ***
## age                         0.0055762  0.0004821  11.566  < 2e-16 ***
## education2. HS Grad         0.1205914  0.0209082   5.768 8.86e-09 ***
## education3. Some College    0.2432633  0.0219999  11.057  < 2e-16 ***
## education4. College Grad    0.3682739  0.0218360  16.865  < 2e-16 ***
## education5. Advanced Degree 0.5424496  0.0236742  22.913  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.303 on 2994 degrees of freedom
## Multiple R-squared:  0.2592, Adjusted R-squared:  0.258 
## F-statistic: 209.6 on 5 and 2994 DF,  p-value: < 2.2e-16

Check Interaction Effect

#adjusted R^2
summary(lm7)$adj.r.squared #with education only
## [1] 0.2251165
summary(lm8)$adj.r.squared #interaction included
## [1] 0.261979
summary(lm9)$adj.r.squared # with age and education, excluding interaction
## [1] 0.2580081

Model lm8 has the highest adjusted R^2 and is the best fitting model, so the interaction effects appear meaningful.

# Use F-test
anova(lm9,lm8)

Since the p-value < 0.001, we reject the null hypothesis. That means the model with the interaction is superior.

Interaction Between Quantitative Variables

Advertising <- read.csv("Advertising.csv")
head(Advertising)
# Fit the model with an interaction between TV and radio
lm10 <- lm(sales ~ TV + radio + TV:radio, data = Advertising)
summary(lm10)
## 
## Call:
## lm(formula = sales ~ TV + radio + TV:radio, data = Advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

Box Cox Transformation
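As a sketch, MASS::boxcox() profiles the likelihood over a power transformation lambda of the response for a fitted lm (here applied to lm2 from the trees data above); lambda near 0 suggests a log transform and lambda near 0.5 a square root.

library(MASS)
boxcox(lm2, lambda = seq(-2, 2, 0.1)) # peak of the curve indicates the suggested lambda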

Multicollinearity
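As a sketch, variance inflation factors from the car package (loaded above) flag collinear predictors; values above roughly 5 to 10 are a common warning sign.

vif(lm_full) # VIF for each predictor in the savings model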

Variable Selection

Adjusted R^2, AIC, BIC
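As a sketch, step() performs stepwise selection on a fitted model; the penalty k = 2 (the default) corresponds to AIC and k = log(n) to BIC.

step(lm_full, direction = "backward") # backward elimination by AIC
step(lm_full, direction = "backward", k = log(nrow(savings))) # backward elimination by BIC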

Simple Logistic Regression

data("Election16")
par(mfrow = c(1,2), mar = c(4,4, 2,2), cex = 0.8)
plot(TrumpWin~BA, xlab = "% College Grads", ylab = "Trump Win?(1=yes,0=no)", data = Election16)
boxplot(BA~TrumpWin, data = Election16)

# Fit a Logistic Regression
glm1 <- glm(TrumpWin~BA, data = Election16, family = binomial)
summary(glm1)
## 
## Call:
## glm(formula = TrumpWin ~ BA, family = binomial, data = Election16)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9138  -0.3059   0.1718   0.5829   1.4483  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  17.9973     5.1098   3.522 0.000428 ***
## BA           -0.5985     0.1735  -3.449 0.000562 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 34.433  on 48  degrees of freedom
## AIC: 38.433
## 
## Number of Fisher Scoring iterations: 6
# predict the probability of a Trump win when % college grads is 31.4 (CA)
new_xx <- data.frame(BA= 31.4)
predict(glm1, newdata = new_xx, type = "response") # Type="response" for predicted prob
##        1 
## 0.310666
#Superimpose the fitted logistic curve on the scatterplot
xx_grd <- seq(18,41, by = 0.1)
new_xxx <- data.frame(BA = xx_grd)
preds_tw <- predict(glm1, newdata = new_xxx, type = "response")
plot(TrumpWin~BA, xlab = "% of College Grads", ylab = "Predict Prob of Trump Win", data = Election16)
lines( xx_grd,preds_tw)

Simple Logistic Regression: \[ p(x) = Pr(Y=1|X=x) = \frac{e^{\beta_0 + \beta_1x}}{1+e^{\beta_0 + \beta_1x}}\]

confint(glm1)
## Waiting for profiling to be done...
##                 2.5 %     97.5 %
## (Intercept)  9.809403 30.2884563
## BA          -1.016162 -0.3211666
#Using ggplot
ggplot(Election16, aes(BA, TrumpWin)) + geom_point() + geom_smooth(method = "glm", method.args = list(family = "binomial"), se = F)
## `geom_smooth()` using formula 'y ~ x'

The fitted logistic regression in log form \[logit(\hat{p}(x)) = 17.9973 - 0.5985x\]

A one-unit increase in BA is associated with a decrease of 0.5985 in the log-odds of a Trump win.

in probability form \[ p(x) = Pr(Y=1|X=x) = \frac{e^{17.9973 - 0.5985x}}{1+e^{17.9973 - 0.5985x}}\]

in odds form

\[odds = \frac{p(x)}{1-p(x)} = e^{\beta_0 + \beta_1x}\]
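Exponentiating the slope gives the multiplicative effect on the odds scale; a quick check (values approximate):

exp(coef(glm1)["BA"]) # ~0.55: the odds of a Trump win are multiplied by ~0.55 per one-unit increase in BA
exp(confint(glm1)["BA", ]) # 95% CI on the odds-ratio scale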

Model Selection

Compare AIC; the smaller the value, the better the fitted model.
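A quick sketch comparing AIC for the fitted model against an intercept-only baseline:

glm_null <- glm(TrumpWin ~ 1, data = Election16, family = binomial) # intercept-only model
AIC(glm_null, glm1) # smaller AIC indicates the better fitting model (here glm1)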