In simple linear regression, Y is modeled as a linear function of X: the relationship between the two variables is described by a straight line.
library(pacman)
p_load(openintro,car,alr4,faraway,Stat2Data, ISLR,tidyverse)
bdims_males <- subset(bdims, sex == 1)
plot(wgt~hgt, data = bdims_males, xlab ="Height(cm)", ylab = "Weight(kg)", cex=0.5)
Model \[Y_i= \beta_0 +\beta_1x_i + e_i\]
Fitted or predicted value (equation of the straight line): \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\]
Residuals: \[\hat{e_i} = y_i - \hat{y_i}\]
lm1 <- lm(wgt~hgt, data = bdims_males) #SLR
summary(lm1)
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims_males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.187 -5.924 -1.536 5.441 38.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.95336 14.05436 -4.337 2.11e-05 ***
## hgt 0.78257 0.07901 9.905 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.902 on 245 degrees of freedom
## Multiple R-squared: 0.2859, Adjusted R-squared: 0.283
## F-statistic: 98.11 on 1 and 245 DF, p-value: < 2.2e-16
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = -60.95 + 0.78x\] \[\hat{weight} = -60.95 + 0.78\,height\]
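As a quick check of the fitted equation, the predicted weight for a male of an arbitrary illustrative height, say 180 cm, can be computed directly from the model (roughly 80 kg):
predict(lm1, newdata = data.frame(hgt = 180)) # same as coef(lm1)[1] + coef(lm1)[2]*180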
plot(wgt~hgt, data = bdims_males, cex = 0.5) #scatter plot
abline(lm1) #least squares line
cor(bdims_males$wgt, bdims_males$hgt)^2 # R^2, coeff of determination
## [1] 0.2859487
confint(lm1) #95% CI
## 2.5 % 97.5 %
## (Intercept) -88.6361527 -33.270576
## hgt 0.6269509 0.938186
#manual calculation
n<-nrow(bdims_males)
tcrit <- qt(0.975,df=n-2)
0.78 - tcrit*0.079
## [1] 0.6243942
0.78 + tcrit * 0.079
## [1] 0.9356058
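The manual interval differs slightly from confint() because the slope and its standard error were rounded; a quick sketch using the unrounded values stored in the fitted model reproduces the confint() result:
est <- coef(summary(lm1))["hgt", "Estimate"] # unrounded slope
se <- coef(summary(lm1))["hgt", "Std. Error"] # unrounded standard error
c(est - tcrit*se, est + tcrit*se) # matches the hgt row of confint(lm1)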
head(trees)
lm2 <- lm(Volume ~Girth, data = trees) # SLR
summary(lm2)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
The equation of the least squares line: \[\hat{y} = -36.94 + 5.066x\]
plot(Volume~Girth, data = trees)
abline(lm2)
#95% CI
new <- data.frame(Girth = 17) # new value for x, x* = 17
round(predict(lm2, newdata = new, interval = "confidence"),2)
## fit lwr upr
## 1 49.18 46.72 51.63
The interpretation is that the predicted mean volume for cherry trees with a 17-inch diameter is 49.18 cubic feet. Additionally, we are 95% confident that the population mean volume for cherry trees with a 17-inch diameter is between 46.72 and 51.63 cubic feet.
# 95% Prediction Interval
round(predict(lm2, newdata = new, interval = "prediction"),2)
## fit lwr upr
## 1 49.18 40.14 58.21
The interpretation is that the predicted volume for a cherry tree with a 17-inch diameter is 49.18 cubic feet. Additionally, the 95% prediction interval is 40.14 to 58.21 cubic feet, meaning the actual volume of an individual cherry tree with a 17-inch diameter is likely to fall between those values.
head(marioKart,5)
ind <- which(marioKart$totalPr>100) # row indices of outliers (total price > $100)
marioKart2 <- marioKart[-ind,]
#create dummy variable
n<-nrow(marioKart2)
marioKart2$cond01 <- rep(0,n) #create n rows under cond01 with 0 values
marioKart2$cond01[marioKart2$cond == "new"] <- 1 # assign 1 for new
#fit model
lm3 <- lm(totalPr~cond01, data = marioKart2)
summary(lm3)
##
## Call:
## lm(formula = totalPr ~ cond01, data = marioKart2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8911 -5.8311 0.1289 4.1289 22.1489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.871 0.814 52.668 < 2e-16 ***
## cond01 10.900 1.258 8.662 1.06e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.371 on 139 degrees of freedom
## Multiple R-squared: 0.3506, Adjusted R-squared: 0.3459
## F-statistic: 75.03 on 1 and 139 DF, p-value: 1.056e-14
#plot
plot(totalPr~cond01, data = marioKart2, xaxt = 'n', xlim = c(-0.1, 1.1), xlab = "Condition", ylab = "Total Price")
axis(1, at = c(0,1), labels = c("Used(0)","New(1)"))
abline(lm3)
\[\hat{y} = 42.87 + 10.9x\]
confint(lm3)
## 2.5 % 97.5 %
## (Intercept) 41.261713 44.48048
## cond01 8.411621 13.38754
We are 95% confident that the average price of a new game is between $8.41 and $13.39 more than that of a used game.
# Residual Plots, to check the assumptions of linear regression
lm4 <- lm(wgt~hgt, data = bdims_males)
par(mfrow = c(1,2)) # split plot into 2 panes
plot(wgt~hgt, data = bdims_males, xlab = 'Height(cm)', ylab = 'Weight(kg)')
abline(lm4) #least square line
plot(predict(lm4),resid(lm4), xlab = 'Fitted Values', ylab = 'Residuals') #residual plot
abline(h=0)
## QQ plots to check normality
set.seed(100)
x<- rnorm(100)
par(mfrow = c(1,2))
hist(x)
qqnorm(x)
qqline(x)
The data follow a normal distribution.
set.seed(100)
y <- rexp(100) #random number from exp dist
par(mfrow = c(1,2))
hist(y)
qqnorm(y)
qqline(y)
Data do not follow a normal distribution.
# check normality of errors
par(mfrow = c(1,2))
hist(resid(lm4))
qqnorm(resid(lm4))
qqline(resid(lm4))
The points follow a straight line, indicating the residuals are approximately normal.
Transformations are used to linearize the relationship when the data are skewed and to overcome problems due to non-constant variance.
Log Transformation
plot(BrainWt~BodyWt, data = brains)
Variables are extremely skewed.
# Log transformation
lm5 <- lm(log(BrainWt)~log(BodyWt), data = brains)
plot(log(BrainWt)~log(BodyWt), data = brains)
abline(lm5)
# Residual plot
plot(predict(lm5), resid(lm5), xlab = "Fitted Values", ylab = "Residuals")
abline(h=0, lty = 2, col = "red")
The assumptions of linearity and constant variance appear to be well satisfied.
summary(lm5)
##
## Call:
## lm(formula = log(BrainWt) ~ log(BodyWt), data = brains)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71550 -0.49228 -0.06162 0.43598 1.94833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.13479 0.09604 22.23 <2e-16 ***
## log(BodyWt) 0.75169 0.02846 26.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
## F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16
\[\log(\hat{BrainWt}) = 2.135 + 0.752\log(BodyWt)\]
Interpretation: a 1% increase in body weight (kg) is associated with an approximate 0.75% increase in brain weight.
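This interpretation can be checked numerically: multiplying body weight by 1.01 multiplies the predicted brain weight by 1.01 raised to the fitted slope.
1.01^coef(lm5)["log(BodyWt)"] # approximately 1.0075, i.e. about a 0.75% increase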
Square root transformation
Regression equation for the transformed model: \[\widehat{\sqrt{Y}} = 0.2001 + 1.9016\sqrt{X}\]
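The dataset behind this equation is not shown here; as a minimal sketch, assuming a data frame dat with response y and predictor x (hypothetical names), the model is fit and checked the same way as the log model:
lm_sqrt <- lm(sqrt(y) ~ sqrt(x), data = dat) # fit on the square-root scale
plot(predict(lm_sqrt), resid(lm_sqrt), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2, col = "red")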
Use the pairs() function to examine pairwise relationships among a group of variables. If a variable shows an obvious nonlinear (quadratic) association with the response, consider a polynomial regression model, as sketched below.
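A minimal sketch using the savings data from the faraway package (introduced in the next section); the quadratic term is for illustration only and does not imply that ddpi actually shows curvature:
pairs(savings) # scatterplot matrix of all pairwise relationships
lm_quad <- lm(sr ~ ddpi + I(ddpi^2), data = savings) # quadratic (polynomial) term via I()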
Multiple Predictors
head(savings)
str(savings)
## 'data.frame': 50 obs. of 5 variables:
## $ sr : num 11.43 12.07 13.17 5.75 12.88 ...
## $ pop15: num 29.4 23.3 23.8 41.9 42.2 ...
## $ pop75: num 2.87 4.41 4.43 1.67 0.83 2.85 1.34 0.67 1.06 1.14 ...
## $ dpi : num 2330 1508 2108 189 728 ...
## $ ddpi : num 2.87 3.93 3.82 0.22 4.56 2.43 2.67 6.51 3.08 2.8 ...
Null hypothesis: \[H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\]
Alternative hypothesis: \[H_A: \text{at least one } \beta_i \neq 0\]
# full model
lm_full <- lm(sr~., data = savings)
#null or reduced model
lm_null <- lm(sr~1, data = savings)
anova(lm_null,lm_full)
Since the p-value < 0.001, we reject the null hypothesis and conclude that at least one predictor is associated with the savings rate (sr).
summary(lm_full)
##
## Call:
## lm(formula = sr ~ ., data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
# F-test statistics using the formula
n <- nrow(savings) # total data points
p<- 4 #number of predictors
rss <- sum(resid(lm_full)^2) #residual sum of squares
rss
## [1] 650.713
sst <- sum(resid(lm_null)^2) # total sum of squares
sst
## [1] 983.6282
fstat <- ((sst-rss)/p)/(rss/(n-p-1))
fstat
## [1] 5.755681
pvalue <- (1 -pf(fstat, df1=p, df2 = n-p-1))
pvalue
## [1] 0.0007903779
One Predictor
Remove non-significant predictors one at a time and refit the model, repeating the process (one step is sketched below).
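For instance, dpi has the largest p-value in the full model above, so it would be removed first; a sketch of one elimination step:
lm_reduced <- update(lm_full, . ~ . - dpi) # drop the least significant predictor
summary(lm_reduced) # inspect the remaining predictors and repeat if needed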
Partial F-Test for a subset of predictors
\[H_0: \beta_{pop75} = \beta_{dpi} = 0\]
lm_partial <- lm(sr~pop15 + ddpi, data = savings) # without predictors from H0
anova(lm_partial, lm_full)
Since the p-value > 0.05, we do not reject the null hypothesis; this means both variables, pop75 and dpi, can be dropped from the full model.
Categorical Predictors with Two levels
lm5 <- lm(Balance ~ Income + Student, data = Credit)
contrasts(Credit$Student) # details of dummy variable
## Yes
## No 0
## Yes 1
summary(lm5)
##
## Call:
## lm(formula = Balance ~ Income + Student, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -762.37 -331.38 -45.04 323.60 818.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211.1430 32.4572 6.505 2.34e-10 ***
## Income 5.9843 0.5566 10.751 < 2e-16 ***
## StudentYes 382.6705 65.3108 5.859 9.78e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.8 on 397 degrees of freedom
## Multiple R-squared: 0.2775, Adjusted R-squared: 0.2738
## F-statistic: 76.22 on 2 and 397 DF, p-value: < 2.2e-16
Regression equation for the fit: \[\hat{Balance} = 211.14 + 5.98\,income + 382.67\,student\] where student is 0 (no) or 1 (yes).
ggplot(Credit, aes(Income, Balance, colour = Student)) + geom_point() + geom_abline(intercept = 211.14, slope = 5.98, color = "#F8766D") + geom_abline(intercept = 593.81, slope = 5.98, color = "#00BFC4")
#Check Interaction
lm6 <- lm(Balance ~ Income + Student + Income:Student, data = Credit)
summary(lm6)
##
## Call:
## lm(formula = Balance ~ Income + Student + Income:Student, data = Credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.39 -325.70 -41.13 321.65 814.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.6232 33.6984 5.953 5.79e-09 ***
## Income 6.2182 0.5921 10.502 < 2e-16 ***
## StudentYes 476.6758 104.3512 4.568 6.59e-06 ***
## Income:StudentYes -1.9992 1.7313 -1.155 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared: 0.2799, Adjusted R-squared: 0.2744
## F-statistic: 51.3 on 3 and 396 DF, p-value: < 2.2e-16
Equation for the fit: \[\hat{Balance} = 200.62 + 6.22\,income + 476.68\,student - 2.00\,student:income\] Note that the interaction term is not statistically significant (p = 0.249).
Categorical Predictors with More Than Two Levels
head(Wage)
# box plot
ggplot(Wage, aes(education, logwage)) + geom_boxplot() + coord_flip()
#model
lm7 <- lm(logwage~ education, data = Wage)
summary(lm7)
##
## Call:
## lm(formula = logwage ~ education, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63580 -0.16716 0.00765 0.18079 1.24259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.39759 0.01891 232.502 < 2e-16 ***
## education2. HS Grad 0.12295 0.02137 5.754 9.57e-09 ***
## education3. Some College 0.23821 0.02248 10.597 < 2e-16 ***
## education4. College Grad 0.37373 0.02231 16.752 < 2e-16 ***
## education5. Advanced Degree 0.56036 0.02414 23.212 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3096 on 2995 degrees of freedom
## Multiple R-squared: 0.2262, Adjusted R-squared: 0.2251
## F-statistic: 218.8 on 4 and 2995 DF, p-value: < 2.2e-16
Fitted Regression Model: \[\log(\hat{Wage}) = 4.398 + 0.122HSGrad + 0.238SomeCollege + 0.373CollegeGrad + 0.56AdvancedDegree\]
For the baseline level, < HS Grad, all indicator variables equal 0, so \[\log(\hat{Wage}) = 4.398\]
#include interaction between education and age
lm8 <- lm(logwage~age + education + age:education, data = Wage)
summary(lm8)
##
## Call:
## lm(formula = logwage ~ age + education + age:education, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65539 -0.15483 0.00691 0.17417 1.24528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1921197 0.0640086 65.493 < 2e-16 ***
## age 0.0049162 0.0014664 3.353 0.000811 ***
## education2. HS Grad 0.0979291 0.0731558 1.339 0.180791
## education3. Some College 0.0644316 0.0775180 0.831 0.405937
## education4. College Grad 0.4160484 0.0792801 5.248 1.65e-07 ***
## education5. Advanced Degree 0.6467308 0.0918866 7.038 2.40e-12 ***
## age:education2. HS Grad 0.0005434 0.0016738 0.325 0.745466
## age:education3. Some College 0.0043591 0.0017917 2.433 0.015033 *
## age:education4. College Grad -0.0011018 0.0018093 -0.609 0.542593
## age:education5. Advanced Degree -0.0022699 0.0020470 -1.109 0.267563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3022 on 2990 degrees of freedom
## Multiple R-squared: 0.2642, Adjusted R-squared: 0.262
## F-statistic: 119.3 on 9 and 2990 DF, p-value: < 2.2e-16
ggplot(Wage, aes(age, logwage)) + geom_point(size = 0.3, alpha = 0.6) + facet_wrap(~education) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
lm9 <- lm(logwage~age + education , data = Wage)
summary(lm9)
##
## Call:
## lm(formula = logwage ~ age + education, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64758 -0.15373 0.00796 0.17330 1.24577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1645351 0.0273609 152.208 < 2e-16 ***
## age 0.0055762 0.0004821 11.566 < 2e-16 ***
## education2. HS Grad 0.1205914 0.0209082 5.768 8.86e-09 ***
## education3. Some College 0.2432633 0.0219999 11.057 < 2e-16 ***
## education4. College Grad 0.3682739 0.0218360 16.865 < 2e-16 ***
## education5. Advanced Degree 0.5424496 0.0236742 22.913 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.303 on 2994 degrees of freedom
## Multiple R-squared: 0.2592, Adjusted R-squared: 0.258
## F-statistic: 209.6 on 5 and 2994 DF, p-value: < 2.2e-16
Check Interaction Effect
#adjusted R^2
summary(lm7)$adj.r.squared #with education only
## [1] 0.2251165
summary(lm8)$adj.r.squared #interaction included
## [1] 0.261979
summary(lm9)$adj.r.squared # with age and education exluding interaction
## [1] 0.2580081
Model lm8 has the highest adjusted R^2, so it is the best-fitting model and the interaction effects are meaningful.
# Use F-test
anova(lm9,lm8)
Since the p-value < 0.001, we reject the null hypothesis; this means the model with the interaction terms is superior.
Interaction Between Quantitative Variables
Advertising <- read.csv("Advertising.csv")
head(Advertising)
#Fit the model
lm10 <- lm(sales~ TV + radio + TV:radio, data = Advertising)
summary(lm10)
##
## Call:
## lm(formula = sales ~ TV + radio + TV:radio, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
Box Cox Transformation
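A minimal sketch, applying boxcox() from the MASS package to the earlier trees model lm2 to profile the log-likelihood over the transformation parameter lambda:
MASS::boxcox(lm2, lambda = seq(-2, 2, by = 0.1)) # the lambda maximizing the curve suggests a power transformation of the response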
Multicollinearity
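A common check is the variance inflation factor from the car package (loaded above); a quick sketch on the full savings model:
vif(lm_full) # VIFs well above 5-10 suggest problematic multicollinearity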
Variable Selection
Adjusted R^2, AIC, BIC
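A sketch of stepwise selection with step(), which compares models by AIC by default (use k = log(n) for BIC):
step(lm_full, direction = "backward") # backward elimination on the savings model
# BIC version: step(lm_full, direction = "backward", k = log(nrow(savings)))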
data("Election16")
par(mfrow = c(1,2), mar = c(4,4, 2,2), cex = 0.8)
plot(TrumpWin~BA, xlab = "% College Grads", ylab = "Trump Win?(1=yes,0=no)", data = Election16)
boxplot(BA~TrumpWin, data = Election16)
# Fit a Logistic Regression
glm1 <- glm(TrumpWin~BA, data = Election16, family = binomial)
summary(glm1)
##
## Call:
## glm(formula = TrumpWin ~ BA, family = binomial, data = Election16)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9138 -0.3059 0.1718 0.5829 1.4483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.9973 5.1098 3.522 0.000428 ***
## BA -0.5985 0.1735 -3.449 0.000562 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 34.433 on 48 degrees of freedom
## AIC: 38.433
##
## Number of Fisher Scoring iterations: 6
#predict trump win when college grad is 31.4%(CA)
new_xx <- data.frame(BA= 31.4)
predict(glm1, newdata = new_xx, type = "response") # Type="response" for predicted prob
## 1
## 0.310666
#Superimpose the fitted logistic curve on the scatterplot
xx_grd <- seq(18,41, by = 0.1)
new_xxx <- data.frame(BA = xx_grd)
preds_tw <- predict(glm1, newdata = new_xxx, type = "response")
plot(TrumpWin~BA, xlab = "% of College Grads", ylab = "Predict Prob of Trump Win", data = Election16)
lines( xx_grd,preds_tw)
Simple Logistic Regression: \[ p(x) = Pr(Y=1|X=x) = \frac{e^{\beta_0 + \beta_1x}}{1+e^{\beta_0 + \beta_1x}}\]
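Plugging the fitted coefficients into this formula reproduces the predicted probability obtained from predict() above, as a quick check:
eta <- 17.9973 - 0.5985*31.4 # linear predictor for BA = 31.4
exp(eta)/(1 + exp(eta)) # approximately 0.31, matching predict()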
confint(glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 9.809403 30.2884563
## BA -1.016162 -0.3211666
#Using ggplot
ggplot(Election16, aes(BA, TrumpWin)) + geom_point() + geom_smooth(method = "glm", method.args = list(family = "binomial"), se = F)
## `geom_smooth()` using formula 'y ~ x'
The fitted logistic regression in log form \[logit(\hat{p}(x)) = 17.9973 - 0.5985x\]
A one-unit increase in BA is associated with a 0.5985 decrease in the log-odds of a Trump win.
in probability form \[ p(x) = Pr(Y=1|X=x) = \frac{e^{17.9973 - 0.5985x}}{1+e^{17.9973 - 0.5985x}}\]
in odds form
\[\text{odds} = \frac{p(x)}{1-p(x)} = e^{\beta_0 + \beta_1x}\]
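Exponentiating the slope gives the multiplicative change in the odds per one-unit increase in BA (a quick sketch):
exp(coef(glm1)["BA"]) # approximately 0.55: each additional percentage point of college grads multiplies the odds of a Trump win by about 0.55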
Model Selection
Compare AIC values: the smaller the AIC, the better the fitted model (an illustrative comparison is sketched below).
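For example, the AIC of glm1 (38.43, shown above) could be compared with that of a model containing an additional predictor; the column name HS (% high-school graduates) is assumed here, so check names(Election16) first:
glm2 <- glm(TrumpWin ~ BA + HS, data = Election16, family = binomial) # HS is an assumed column name
AIC(glm1, glm2) # smaller AIC indicates the better-fitting model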