Learning Objectives

In this lesson students will review the main concepts of Simple Linear Regression.

  • Fitting a model
  • Graphing a model
  • Checking model fit and diagnostics
  • Model inference

I. Motivating Example

Import the Data

These laptop price data come from Kaggle.

### use raw file from github
laptop_price <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/laptop_price.csv")

Step Zero: Wrangle Data

## TIDYVERSE
#install.packages("tidyverse")
library(tidyverse)

### convert to dollars
laptop_price<-laptop_price%>%
  mutate(Price_dollar=Price_euros*1.09)

# TAKE OFF LABELS for GB
unique(laptop_price$Ram)
## [1] "8GB"  "16GB" "4GB"  "2GB"  "12GB" "6GB"  "32GB" "24GB" "64GB"
#str_remove()
laptop_price$Ram_GB<-str_remove(laptop_price$Ram, "GB")
laptop_price$Ram_GB<-as.numeric(laptop_price$Ram_GB)
unique(laptop_price$Ram_GB)
## [1]  8 16  4  2 12  6 32 24 64

Step One: Look at the Data

  • Explanatory (X): RAM in GB
  • Response (Y): Price in Dollars

Make a Scatterplot

# SCATTERPLOT
ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_point(alpha=.25)

# JITTER ADDS NOISE
ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_jitter(alpha=.25)

Step Two: Assume a Model Form

Non-Parametric

ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_jitter(alpha=.25)+
  geom_smooth(method="loess", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 6
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 16

Parametric

ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_jitter(alpha=.25)+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

We’re going to go with a simple linear model.

II. Fitting A Simple Linear Regression (SLR) Model

This model has the form: \[Y_i=\beta_0+\beta_1 x_i +\varepsilon_i\]

This might look familiar to \(y=mx+b\), which is known as slope-intercept form.
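
The coefficients \(\beta_0\) and \(\beta_1\) are estimated by least squares, i.e., by choosing the line that minimizes the sum of squared residuals. For SLR this has a closed-form solution:

\[\hat{\beta}_1=\frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad \hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{x}\]

This is what lm() computes in the next step.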

Step Three: Estimate Model Coefficients

mod1<-lm(Price_dollar~Ram_GB, data=laptop_price)
summary(mod1)
## 
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = laptop_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3084.4  -322.7  -104.3   249.7  3122.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   291.57      27.26   10.70   <2e-16 ***
## Ram_GB        111.34       2.78   40.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 510.1 on 1301 degrees of freedom
## Multiple R-squared:  0.5521, Adjusted R-squared:  0.5517 
## F-statistic:  1603 on 1 and 1301 DF,  p-value: < 2.2e-16

We can estimate the parameters:

  • Y-intercept: \(\hat{\beta}_0 = 291.57\)
  • Slope: \(\hat{\beta}_1 = 111.34\)
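
As a quick check, we can also compute these estimates by hand from the least-squares formulas; the values should match the lm() output up to rounding.

# Least-squares estimates computed by hand (should match summary(mod1))
x <- laptop_price$Ram_GB
y <- laptop_price$Price_dollar

b1_hat <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat*mean(x)

c(b0_hat, b1_hat)   # approximately 291.57 and 111.34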

Step Four: Error Estimation

Standard Errors

We can estimate the error associated with estimating these parameters:

  • Standard Error for Y-intercept: \(SE(\hat{\beta}_0) = 27.26\)
  • Standard Error for Slope: \(SE(\hat{\beta}_1) = 2.78\)
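
These values appear in the “Std. Error” column of the coefficient table, so we can also pull them out of the fitted model directly (a quick sketch):

# Standard errors from the coefficient table of summary(mod1)
summary(mod1)$coefficients[, "Std. Error"]

# broom::tidy(mod1) reports the same values in its std.error column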

Estimating the variance, \(\sigma^2\)

We can estimate the \(\sigma^2\) with \(s^2\): \[s^2=\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]

The quantity \(s^2\) is called the mean squared error (MSE).

The denominator \(n-2\) is referred to as the residual degrees of freedom.

Rather than \(s^2\) we often use \(s\): \[s=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}\]

We can find this value in the R output, but we can also check it:

sqrt(sum(mod1$residuals^2)/1301)
## [1] 510.1363
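
Hardcoding the 1301 residual degrees of freedom is easy to get wrong if the data change; df.residual() returns \(n-2\) for us:

# Same check without hardcoding the degrees of freedom
sqrt(sum(mod1$residuals^2)/df.residual(mod1))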

We can also use the broom package! The glance() function returns model-level metrics, including \(s\) (reported in the sigma column).

library(broom)

mod1%>%
  glance()%>%
  pull(sigma)
## [1] 510.1363

III. Assessing Model Assumptions

We assume that the \(x_i\)’s are non-random (or observable with negligible error) and that the \(\epsilon_i\)’s are random such that

  • \(E[\epsilon_i]=0\) (mean 0)
  • \(Var[\epsilon_i]=\sigma^2\) for all \(i=1, ..., n\) (homogeneity)
  • \(Cor(\epsilon_i, \epsilon_j)=0\) for all \(i \neq j\) (uncorrelated, which does not necessarily imply independence)

Tools for assessment:

  • Residual Plot
  • QQ (Quantile-Quantile) Plot
  • Leverage vs Residual Plot

Residual Plot from Scratch

ggplot(data=laptop_price, aes(x=Ram_GB, y=mod1$residuals))+
  geom_jitter(alpha=.25)+
  geom_hline(yintercept = 0, lty=2, linewidth=1, color="blue")

QQ Plot from Scratch

qqnorm(mod1$residuals)
qqline(mod1$residuals)

Plot Function

plot(mod1)
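
By default, plot() on an lm object steps through four diagnostic plots (residuals vs fitted, normal QQ, scale-location, and residuals vs leverage). A common convenience is to display them together in a 2-by-2 panel:

# Arrange the four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(mod1)
par(mfrow = c(1, 1))   # reset the plotting layout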

Remove the Outlier

laptop_price_wo<-laptop_price%>%
  filter(Ram_GB<40)

mod2<-lm(Price_dollar~Ram_GB, data=laptop_price_wo)
summary(mod2)
## 
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = laptop_price_wo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2595.51  -312.70   -97.74   237.54  3122.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  247.098     27.714   8.916   <2e-16 ***
## Ram_GB       116.954      2.873  40.702   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 502.4 on 1300 degrees of freedom
## Multiple R-squared:  0.5603, Adjusted R-squared:   0.56 
## F-statistic:  1657 on 1 and 1300 DF,  p-value: < 2.2e-16
plot(mod2)

IV. Inference

Step Five: Hypothesis Testing for Parameters

Hypothesis Testing for Slope

Hypotheses:

  • Null Hypothesis: \(H_0: \beta_1 = 0\)
  • Alternative Hypothesis (one of):
      • \(H_A: \beta_1 \neq 0\) (two-sided)
      • \(H_A: \beta_1 > 0\) (one-sided upper)
      • \(H_A: \beta_1 < 0\) (one-sided lower)

Test Statistic:

\[t=\frac{\hat{\beta}_1}{SE({\hat{\beta}_1})}\]

where \[SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Reference Distribution:

T-distribution with \(n-2\) degrees of freedom.

P-value:
beta1 <- mod2$coefficients[2]
beta1
##   Ram_GB 
## 116.9538
beta1SE <- sqrt((sum(mod2$residuals^2))/1300/sum((laptop_price_wo$Ram_GB - mean(laptop_price_wo$Ram_GB))^2))
beta1SE
## [1] 2.873447
testStat<- beta1/beta1SE

pt(-abs(testStat), 1300, lower.tail=TRUE)*2
##        Ram_GB 
## 3.286723e-234
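
This hand computation can be checked against the coefficient table, which reports the same t statistic and two-sided p-value for the slope:

# t statistic and p-value for the slope, straight from the fitted model
summary(mod2)$coefficients["Ram_GB", ]

# broom::tidy(mod2) returns the same information in tidy form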

Step Six: Confidence Interval for Parameters

Confidence intervals take the form

\[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]

Confidence Interval for Slope

\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2} \times \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Point Estimate: \(\hat{\beta}_1\)
beta1 <- mod2$coefficients[2]
beta1
##   Ram_GB 
## 116.9538
Critical Value: \(t_{df=n-2, \alpha/2}\)

We define the significance level \(\alpha\), so we need to find the \(1-\alpha/2\) quantile. Typically the default significance level is \(\alpha=0.05\), so the quantile we look for is \(0.975\) (the \(97.5\)th percentile).

Recall that the degrees of freedom are \(n-2\).

qt(0.975, df=1302-2)
## [1] 1.96179
Standard Error: \(SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\)
beta1SE <- sqrt((sum(mod2$residuals^2))/1300/sum((laptop_price_wo$Ram_GB - mean(laptop_price_wo$Ram_GB))^2))
beta1SE
## [1] 2.873447

Pull this all together:

beta1 +c(-1,1)*qt(0.975, df=1302-2)*beta1SE
## [1] 111.3167 122.5909

We can check this with the confint function in R:

confint(mod2)
##                2.5 %   97.5 %
## (Intercept) 192.7293 301.4672
## Ram_GB      111.3167 122.5909
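
confint() uses a 95% level by default; other levels simply change the critical value. For example, a 90% interval:

# 90% confidence intervals for the coefficients
confint(mod2, level = 0.90)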

V. Prediction

Step Seven: Confidence and Prediction Intervals for an Observation

Using the Model for Prediction

newdata=data.frame(Ram_GB=10)
predict(mod2, newdata)
##        1 
## 1416.636

Confidence Interval for a mean response

predict(mod2, newdata, interval="confidence")
##        fit      lwr      upr
## 1 1416.636 1387.764 1445.509

Prediction Interval for a new response

predict(mod2, newdata, interval="predict")
##        fit      lwr    upr
## 1 1416.636 430.6721 2402.6
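
The prediction interval is much wider than the confidence interval because it accounts for the variability of a single new observation around the regression line, not just the uncertainty in the estimated mean. For a new value \(x_h\), the two intervals are

\[\hat{Y}_h \pm t_{df=n-2, \alpha/2} \times s\sqrt{\frac{1}{n}+\frac{(x_h-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}} \quad \text{(confidence interval for the mean response)}\]

\[\hat{Y}_h \pm t_{df=n-2, \alpha/2} \times s\sqrt{1+\frac{1}{n}+\frac{(x_h-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}} \quad \text{(prediction interval for a new response)}\]

The extra 1 under the square root is what makes the prediction interval wider.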

Adding Confidence and Prediction Bands

confBand<-predict(mod2, interval="confidence")
predBand<-predict(mod2, interval="predict")
## Warning in predict.lm(mod2, interval = "predict"): predictions on current data refer to _future_ responses
colnames(predBand)<-c("fit2", "lwr2", "upr2")


newDF<-cbind(laptop_price_wo, confBand, predBand)

ggplot(newDF, aes(x=Ram_GB, y=Price_dollar))+
  geom_point(alpha=.3)+
  geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1],
              color="blue", lty=2, linewidth=1)+
  geom_line(aes(y=lwr), color="green", lty=2, linewidth=1)+
  geom_line(aes(y=upr), color="green", lty=2, linewidth=1)+
  geom_line(aes(y=lwr2), color="red", lty=2, linewidth=1)+
  geom_line(aes(y=upr2), color="red", lty=2, linewidth=1)+
  theme_bw()

VI. Variability

Steps Eight and Nine: Analysis of Variance

anova(mod2)
## Analysis of Variance Table
## 
## Response: Price_dollar
##             Df    Sum Sq   Mean Sq F value    Pr(>F)    
## Ram_GB       1 418086847 418086847  1656.6 < 2.2e-16 ***
## Residuals 1300 328085970    252374                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Researchers are often interested in the \(R^2\) value, which is known as the coefficient of determination.

\[R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
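
Using the sums of squares from the ANOVA table above, we can compute \(R^2\) directly:

# R^2 = SS(Reg) / (SS(Reg) + SS(Res)), using the sums of squares from anova(mod2)
418086847 / (418086847 + 328085970)   # approximately 0.5603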

In the case of simple linear regression we can also square the correlation.

r_xy<-cor(laptop_price_wo$Ram_GB, laptop_price_wo$Price_dollar)
r_xy^2
## [1] 0.5603083

We can also use the glance() function from the broom package:

mod2%>%
  glance()%>%
  pull(r.squared)
## [1] 0.5603083

VII. Training and Testing

It is circular to assess model fit on the same data used to build the model. Instead, we split the data into a training set and a testing set: we fit the model using only the training set, then use the test set as a proxy for new observations to understand how the model will perform on data it has not seen.

### TEST TRAIN

dim(laptop_price_wo)
## [1] 1302   15
# 1302

## TEST with 1/3 
1302/3
## [1] 434
# 434

set.seed(123)
testInd<-sample(1:1302, size=434, replace=FALSE)
train<-laptop_price_wo[-testInd, ]
test<-laptop_price_wo[testInd, ]

## TRAIN
trainMod<-lm(Price_dollar~Ram_GB, data=train)
summary(trainMod)
## 
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2598.51  -320.76   -95.68   259.99  3119.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  249.230     35.281   7.064 3.32e-12 ***
## Ram_GB       116.981      3.575  32.719  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 514.6 on 866 degrees of freedom
## Multiple R-squared:  0.5528, Adjusted R-squared:  0.5523 
## F-statistic:  1071 on 1 and 866 DF,  p-value: < 2.2e-16
## TEST
testPred<-predict(trainMod, test)

sqrt(mean((testPred-test$Price_dollar)^2))
## [1] 476.985
#install.packages("caret")
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
RMSE(testPred, test$Price_dollar)
## [1] 476.985
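
As a final check, we can compute the same RMSE on the training data; a training RMSE much smaller than the test RMSE would be a sign of overfitting.

# Training RMSE for comparison with the test RMSE above
trainPred <- predict(trainMod, train)
RMSE(trainPred, train$Price_dollar)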