Learning Objectives

In this lesson students will review the main concepts of Simple Linear Regression.

  • Fitting a model
  • Graphing a model
  • Checking model fit and diagnostics
  • Model inference

I. Motivating Example

Import the Data

These laptop price data come from Kaggle.

### use raw file from github
laptop_price <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/laptop_price.csv")

Step Zero: Wrangle Data

## TIDYVERSE
#install.packages("tidyverse")
library(tidyverse)

### convert to dollars
laptop_price<-laptop_price%>%
  mutate(Price_dollar=Price_euros*1.09)

# TAKE OFF LABELS for GB
unique(laptop_price$Ram)
## [1] "8GB"  "16GB" "4GB"  "2GB"  "12GB" "6GB"  "32GB" "24GB" "64GB"
#str_remove()
laptop_price$Ram_GB<-str_remove(laptop_price$Ram, "GB")
laptop_price$Ram_GB<-as.numeric(laptop_price$Ram_GB)
unique(laptop_price$Ram_GB)
## [1]  8 16  4  2 12  6 32 24 64

Step One: Look at the Data

  • Explanatory (X): RAM in GB
  • Response (Y): Price in Dollars

Make a Scatterplot

# SCATTERPLOT
ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_point(alpha=.25)

# JITTER ADDS NOISE
ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_jitter(alpha=.25)

Step Two: Assume a Model Form

Non-Parametric

ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_jitter(alpha=.25)+
  geom_smooth(method="loess", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 6
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 16

Parametric

ggplot(data=laptop_price, 
       aes(y=Price_dollar, x=Ram_GB))+
  geom_jitter(alpha=.25)+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

We’re going to go with a simple linear model.

II. Fitting A Simple Linear Regression (SLR) Model

This model has the form: \[Y_i=\beta_0+\beta_1 x_i +\varepsilon_i\]

This might look familiar to \(y=mx+b\), which is known as slope-intercept form.
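
The coefficients \(\beta_0\) and \(\beta_1\) are estimated by least squares, i.e., by choosing the line that minimizes the sum of squared residuals. For SLR this has a closed-form solution:

\[\hat{\beta}_1=\frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad \hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{x}\]

This is what lm() computes in the next step.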

Step Three: Estimate Model Coefficients

mod1<-lm(Price_dollar~Ram_GB, data=laptop_price)
summary(mod1)
## 
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = laptop_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3084.4  -322.7  -104.3   249.7  3122.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   291.57      27.26   10.70   <2e-16 ***
## Ram_GB        111.34       2.78   40.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 510.1 on 1301 degrees of freedom
## Multiple R-squared:  0.5521, Adjusted R-squared:  0.5517 
## F-statistic:  1603 on 1 and 1301 DF,  p-value: < 2.2e-16

We can estimate the parameters:

  • Y-intercept: \(\hat{\beta}_0 = 291.57\)
  • Slope: \(\hat{\beta}_1 = 111.34\)
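
As a quick check, we can also compute these estimates by hand from the least-squares formulas; the values should match the lm() output up to rounding.

# Least-squares estimates computed by hand (should match summary(mod1))
x <- laptop_price$Ram_GB
y <- laptop_price$Price_dollar

b1_hat <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat*mean(x)

c(b0_hat, b1_hat)   # approximately 291.57 and 111.34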

Step Four: Error Estimation

Standard Errors

We can estimate the error associated with estimating these parameters:

  • Standard Error for Y-intercept: \(SE(\hat{\beta}_0) = 27.26\)
  • Standard Error for Slope: \(SE(\hat{\beta}_1) = 2.78\)
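
These values appear in the “Std. Error” column of the coefficient table, so we can also pull them out of the fitted model directly (a quick sketch):

# Standard errors from the coefficient table of summary(mod1)
summary(mod1)$coefficients[, "Std. Error"]

# broom::tidy(mod1) reports the same values in its std.error column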

Estimating the variance, \(\sigma^2\)

We can estimate the \(\sigma^2\) with \(s^2\): \[s^2=\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]

The quantity \(s^2\) is called the mean squared error (MSE).

The denominator \(n-2\) is referred to as the residual degrees of freedom.

Rather than \(s^2\) we often use \(s\): \[s=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}\]

We can find this value in the R output, but we can also check it:

sqrt(sum(mod1$residuals^2)/1301)
## [1] 510.1363
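
Hardcoding the 1301 residual degrees of freedom is easy to get wrong if the data change; df.residual() returns \(n-2\) for us:

# Same check without hardcoding the degrees of freedom
sqrt(sum(mod1$residuals^2)/df.residual(mod1))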

We can also use the broom package! The glance() function returns model-level metrics, including \(s\) (reported in the sigma column).

library(broom)

mod1%>%
  glance()%>%
  pull(sigma)
## [1] 510.1363

III. Assessing Model Assumptions

We assume that the \(x_i\)’s are non-random (or observable with negligible error) and that the \(\epsilon_i\)’s are random such that

  • \(E[\epsilon_i]=0\) (mean 0)
  • \(Var[\epsilon_i]=\sigma^2\) for all \(i=1, ..., n\) (homogeneity)
  • \(Cor(\epsilon_i, \epsilon_j)=0\) for all \(i \neq j\) (uncorrelated, which does not necessarily imply independence)

Tools for assessment:

  • Residual Plot
  • QQ (Quantile-Quantile) Plot
  • Leverage vs Residual Plot

Residual Plot from Scratch

ggplot(data=laptop_price, aes(x=Ram_GB, y=mod1$residuals))+
  geom_jitter(alpha=.25)+
  geom_hline(yintercept = 0, lty=2, linewidth=1, color="blue")

QQ Plot from Scratch

qqnorm(mod1$residuals)
qqline(mod1$residuals)

Plot Function

plot(mod1)
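
By default, plot() on an lm object steps through four diagnostic plots (residuals vs fitted, normal QQ, scale-location, and residuals vs leverage). A common convenience is to display them together in a 2-by-2 panel:

# Arrange the four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(mod1)
par(mfrow = c(1, 1))   # reset the plotting layout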

Remove the Outlier

laptop_price_wo<-laptop_price%>%
  filter(Ram_GB<40)

mod2<-lm(Price_dollar~Ram_GB, data=laptop_price_wo)
summary(mod2)
## 
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = laptop_price_wo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2595.51  -312.70   -97.74   237.54  3122.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  247.098     27.714   8.916   <2e-16 ***
## Ram_GB       116.954      2.873  40.702   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 502.4 on 1300 degrees of freedom
## Multiple R-squared:  0.5603, Adjusted R-squared:   0.56 
## F-statistic:  1657 on 1 and 1300 DF,  p-value: < 2.2e-16
plot(mod2)

IV. Inference

Step Five: Hypothesis Testing for Parameters

Hypothesis Testing for Slope

Hypotheses:

  • Null Hypothesis: \(H_0: \beta_1 = 0\)
  • Alternative Hypothesis (one of):
      • \(H_A: \beta_1 \neq 0\) (two-sided)
      • \(H_A: \beta_1 > 0\) (one-sided upper)
      • \(H_A: \beta_1 < 0\) (one-sided lower)

Test Statistic:

\[t=\frac{\hat{\beta}_1}{SE({\hat{\beta}_1})}\]

where \[SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Reference Distribution:

T-distribution with \(n-2\) degrees of freedom.

P-value:
beta1 <- mod2$coefficients[2]
beta1
##   Ram_GB 
## 116.9538
beta1SE <- sqrt((sum(mod2$residuals^2))/1300/sum((laptop_price_wo$Ram_GB - mean(laptop_price_wo$Ram_GB))^2))
beta1SE
## [1] 2.873447
testStat<- beta1/beta1SE

pt(-abs(testStat), 1300, lower.tail=TRUE)*2
##        Ram_GB 
## 3.286723e-234
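
This hand computation can be checked against the coefficient table, which reports the same t statistic and two-sided p-value for the slope:

# t statistic and p-value for the slope, straight from the fitted model
summary(mod2)$coefficients["Ram_GB", ]

# broom::tidy(mod2) returns the same information in tidy form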

Step Six: Confidence Interval for Parameters

Confidence intervals take the form

\[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]

Confidence Interval for Slope

\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2} \times \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Point Estimate: \(\hat{\beta}_1\)
beta1 <- mod2$coefficients[2]
beta1
##   Ram_GB 
## 116.9538
Critical Value: \(t_{df=n-2, \alpha/2}\)

We define the significance level \(\alpha\), so we need to find the \(1-\alpha/2\) quantile. Typically the default significance level is \(\alpha=0.05\), so the quantile we look for is \(0.975\) (the \(97.5\)th percentile).

Recall that the degrees of freedom are \(n-2\).

qt(0.975, df=1302-2)
## [1] 1.96179
Standard Error: \(SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\)
beta1SE <- sqrt((sum(mod2$residuals^2))/1300/sum((laptop_price_wo$Ram_GB - mean(laptop_price_wo$Ram_GB))^2))
beta1SE
## [1] 2.873447

Pull this all together:

beta1 +c(-1,1)*qt(0.975, df=1302-2)*beta1SE
## [1] 111.3167 122.5909

We can check this with the confint function in R:

confint(mod2)
##                2.5 %   97.5 %
## (Intercept) 192.7293 301.4672
## Ram_GB      111.3167 122.5909
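
confint() uses a 95% level by default; other levels simply change the critical value. For example, a 90% interval:

# 90% confidence intervals for the coefficients
confint(mod2, level = 0.90)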

V. Prediction

Step Seven: Confidence and Prediction Intervals for an Observation

Using the Model for Prediction

newdata=data.frame(Ram_GB=10)
predict(mod2, newdata)
##        1 
## 1416.636

Confidence Interval for a mean response

predict(mod2, newdata, interval="confidence")
##        fit      lwr      upr
## 1 1416.636 1387.764 1445.509

Prediction Interval for a new response

predict(mod2, newdata, interval="predict")
##        fit      lwr    upr
## 1 1416.636 430.6721 2402.6
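
The prediction interval is much wider than the confidence interval because it accounts for the variability of a single new observation around the regression line, not just the uncertainty in the estimated mean. For a new value \(x_h\), the two intervals are

\[\hat{Y}_h \pm t_{df=n-2, \alpha/2} \times s\sqrt{\frac{1}{n}+\frac{(x_h-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}} \quad \text{(confidence interval for the mean response)}\]

\[\hat{Y}_h \pm t_{df=n-2, \alpha/2} \times s\sqrt{1+\frac{1}{n}+\frac{(x_h-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}} \quad \text{(prediction interval for a new response)}\]

The extra 1 under the square root is what makes the prediction interval wider.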

Adding Confidence and Prediction Bands

confBand<-predict(mod2, interval="confidence")
predBand<-predict(mod2, interval="predict")
## Warning in predict.lm(mod2, interval = "predict"): predictions on current data refer to _future_ responses
colnames(predBand)<-c("fit2", "lwr2", "upr2")


newDF<-cbind(laptop_price_wo, confBand, predBand)

ggplot(newDF, aes(x=Ram_GB, y=Price_dollar))+
  geom_point(alpha=.3)+
  geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1],
              color="blue", lty=2, linewidth=1)+
  geom_line(aes(y=lwr), color="green", lty=2, linewidth=1)+
  geom_line(aes(y=upr), color="green", lty=2, linewidth=1)+
  geom_line(aes(y=lwr2), color="red", lty=2, linewidth=1)+
  geom_line(aes(y=upr2), color="red", lty=2, linewidth=1)+
  theme_bw()

VI. Variability

Steps Eight and Nine: Analysis of Variance

anova(mod2)
## Analysis of Variance Table
## 
## Response: Price_dollar
##             Df    Sum Sq   Mean Sq F value    Pr(>F)    
## Ram_GB       1 418086847 418086847  1656.6 < 2.2e-16 ***
## Residuals 1300 328085970    252374                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Researchers are often interested in the \(R^2\) value, which is known as the coefficient of determination.

\[R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
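
Using the sums of squares from the ANOVA table above, we can compute \(R^2\) directly:

# R^2 = SS(Reg) / (SS(Reg) + SS(Res)), using the sums of squares from anova(mod2)
418086847 / (418086847 + 328085970)   # approximately 0.5603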

In the case of simple linear regression we can also square the correlation.

r_xy<-cor(laptop_price_wo$Ram_GB, laptop_price_wo$Price_dollar)
r_xy^2
## [1] 0.5603083

We can also use the glance() function from the broom package:

mod2%>%
  glance()%>%
  pull(r.squared)
## [1] 0.5603083

VII. Training and Testing

It is circular to assess model fit on the same data used to build the model. Instead, we split the data into a training set and a testing set: we fit the model using only the training set, then use the test set as a proxy for new observations to understand how the model will perform on data it has not seen.

### TEST TRAIN

dim(laptop_price_wo)
## [1] 1302   15
# 1302

## TEST with 1/3 
1302/3
## [1] 434
# 434

set.seed(123)
testInd<-sample(1:1302, size=434, replace=FALSE)
train<-laptop_price_wo[-testInd, ]
test<-laptop_price_wo[testInd, ]

## TRAIN
trainMod<-lm(Price_dollar~Ram_GB, data=train)
summary(trainMod)
## 
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2598.51  -320.76   -95.68   259.99  3119.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  249.230     35.281   7.064 3.32e-12 ***
## Ram_GB       116.981      3.575  32.719  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 514.6 on 866 degrees of freedom
## Multiple R-squared:  0.5528, Adjusted R-squared:  0.5523 
## F-statistic:  1071 on 1 and 866 DF,  p-value: < 2.2e-16
## TEST
testPred<-predict(trainMod, test)

sqrt(mean((testPred-test$Price_dollar)^2))
## [1] 476.985
#install.packages("caret")
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
RMSE(testPred, test$Price_dollar)
## [1] 476.985
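
As a final check, we can compute the same RMSE on the training data; a training RMSE much smaller than the test RMSE would be a sign of overfitting.

# Training RMSE for comparison with the test RMSE above
trainPred <- predict(trainMod, train)
RMSE(trainPred, train$Price_dollar)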