In this lesson students will review the main concepts of Simple Linear Regression.
These laptop price data come from Kaggle.
### use raw file from github
laptop_price <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/laptop_price.csv")
## TIDYVERSE
#install.packages("tidyverse")
library(tidyverse)
### convert to dollars
laptop_price<-laptop_price%>%
mutate(Price_dollar=Price_euros*1.09)
# TAKE OFF LABELS for GB
unique(laptop_price$Ram)
## [1] "8GB" "16GB" "4GB" "2GB" "12GB" "6GB" "32GB" "24GB" "64GB"
#str_remove()
laptop_price$Ram_GB<-str_remove(laptop_price$Ram, "GB")
laptop_price$Ram_GB<-as.numeric(laptop_price$Ram_GB)
unique(laptop_price$Ram_GB)
## [1] 8 16 4 2 12 6 32 24 64
# SCATTERPLOT
ggplot(data=laptop_price,
aes(y=Price_dollar, x=Ram_GB))+
geom_point(alpha=.25)
# JITTER ADDS NOISE
ggplot(data=laptop_price,
aes(y=Price_dollar, x=Ram_GB))+
geom_jitter(alpha=.25)
ggplot(data=laptop_price,
aes(y=Price_dollar, x=Ram_GB))+
geom_jitter(alpha=.25)+
geom_smooth(method="loess", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 6
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 16
ggplot(data=laptop_price,
aes(y=Price_dollar, x=Ram_GB))+
geom_jitter(alpha=.25)+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
We're going to fit a simple linear model.
This model has the form: \[Y_i=\beta_0+\beta_1 x_i +\varepsilon_i\]
This might look familiar: it resembles \(Y=mx+b\), which is called slope-intercept form.
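For reference, the least-squares estimates that lm() computes minimize the sum of squared residuals and have closed forms:
\[\hat{\beta}_1=\frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad \hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{x}\]
We fit this model in R with lm():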
mod1<-lm(Price_dollar~Ram_GB, data=laptop_price)
summary(mod1)
##
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = laptop_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3084.4 -322.7 -104.3 249.7 3122.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.57 27.26 10.70 <2e-16 ***
## Ram_GB 111.34 2.78 40.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 510.1 on 1301 degrees of freedom
## Multiple R-squared: 0.5521, Adjusted R-squared: 0.5517
## F-statistic: 1603 on 1 and 1301 DF, p-value: < 2.2e-16
We can estimate the parameters:
* Estimated Y-intercept: \(\hat{\beta}_0 = 291.57\)
* Estimated Slope: \(\hat{\beta}_1 = 111.34\)
We can also estimate the error associated with estimating these parameters:
* Standard Error for the Y-intercept: \(SE(\hat{\beta}_0) = 27.26\)
* Standard Error for the Slope: \(SE(\hat{\beta}_1) = 2.78\)
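As a quick cross-check, these estimates and standard errors can be pulled straight from the fitted model (a small sketch; coef() applied to the summary object returns the full coefficient table):
# Coefficient table: estimates, standard errors, t values, and p-values
coef(summary(mod1))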
We can estimate the \(\sigma^2\) with \(s^2\): \[s^2=\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]
The quantity \(s^2\) is called the mean squared error (MSE).
The denominator \(n-2\) is referred to as the residual degrees of freedom.
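We can compute the MSE directly from the fitted model (a quick sketch; df.residual() returns the residual degrees of freedom, which is 1301 here):
# MSE = sum of squared residuals / residual degrees of freedom
sum(mod1$residuals^2)/df.residual(mod1)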
Rather than \(s^2\) we often use \(s\): \[s=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}\]
We can find this value in the R output, but we can also check it:
sqrt(sum(mod1$residuals^2)/1301)
## [1] 510.1363
We can also use the broom package! Using the glance() function in broom accesses the model metrics.
library(broom)
mod1%>%
glance()%>%
pull(sigma)
## [1] 510.1363
We assume that the \(x_i\)'s are non-random (or observable with negligible error) and that the \(\epsilon_i\)'s are random such that
* \(E[\epsilon_i]=0\) for all \(i=1, ..., n\) (mean 0)
* \(Var[\epsilon_i]=\sigma^2\) for all \(i=1, ..., n\) (homogeneity of variance)
* \(Cor(\epsilon_i, \epsilon_j)=0\) for all \(i \neq j\) (uncorrelated, but does not necessarily imply independence)
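Before examining the residual plots, here is a quick numeric sanity check (a small sketch; for a least-squares fit with an intercept, the residuals always average to zero by construction, so this only confirms the mechanics):
# The residuals should average to (essentially) zero
mean(mod1$residuals)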
ggplot(data=laptop_price, aes(x=Ram_GB, y=mod1$residuals))+
geom_jitter(alpha=.25)+
geom_hline(yintercept = 0, lty=2, lwd=1, color="blue")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
qqnorm(mod1$residuals)
qqline(mod1$residuals)
plot(mod1)
The single 64GB laptop sits far from the rest of the RAM values, making it a high-leverage point, so we refit the model without it:
laptop_price_wo<-laptop_price%>%
  filter(Ram_GB<40)
mod2<-lm(Price_dollar~Ram_GB, data=laptop_price_wo)
summary(mod2)
##
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = laptop_price_wo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2595.51 -312.70 -97.74 237.54 3122.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 247.098 27.714 8.916 <2e-16 ***
## Ram_GB 116.954 2.873 40.702 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 502.4 on 1300 degrees of freedom
## Multiple R-squared: 0.5603, Adjusted R-squared: 0.56
## F-statistic: 1657 on 1 and 1300 DF, p-value: < 2.2e-16
plot(mod2)
To test whether RAM is a useful predictor of price, we test \(H_0: \beta_1 = 0\) against \(H_A: \beta_1 \neq 0\) using the test statistic
\[t=\frac{\hat{\beta}_1}{SE({\hat{\beta}_1})}\]
where \[SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]
Under the null hypothesis, this statistic follows a t-distribution with \(n-2\) degrees of freedom.
beta1 <- mod2$coefficients[2]
beta1
## Ram_GB
## 116.9538
beta1SE <- sqrt((sum(mod2$residuals^2))/1300/sum((laptop_price_wo$Ram_GB - mean(laptop_price_wo$Ram_GB))^2))
beta1SE
## [1] 2.873447
testStat<- beta1/beta1SE
pt(-abs(testStat), 1300, lower.tail=TRUE)*2
## Ram_GB
## 3.286723e-234
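These hand calculations should agree with the Ram_GB row of the model's coefficient table (a quick cross-check; coef() on the summary object returns the estimates, standard errors, t values, and p-values):
# Compare with the built-in t-test for the slope
coef(summary(mod2))["Ram_GB", ]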
Confidence intervals take the form
\[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]
\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2} \times \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]
beta1 <- mod2$coefficients[2]
beta1
## Ram_GB
## 116.9538
We define the significance level \(\alpha\); therefore, we need to find the \(1-\alpha/2\) quantile. Typically, the default significance level is \(\alpha=0.05\), so the percentile we look for is \(97.5\%\).
Recall that the degrees of freedom are \(n-2\).
qt(0.975, df=1302-2)
## [1] 1.96179
beta1SE <- sqrt((sum(mod2$residuals^2))/1300/sum((laptop_price_wo$Ram_GB - mean(laptop_price_wo$Ram_GB))^2))
beta1SE
## [1] 2.873447
beta1 +c(-1,1)*qt(0.975, df=1302-2)*beta1SE
## [1] 111.3167 122.5909
We can check this with the confint function in R:
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 192.7293 301.4672
## Ram_GB 111.3167 122.5909
Suppose we want to estimate the price of a laptop with 10GB of RAM:
newdata=data.frame(Ram_GB=10)
predict(mod2, newdata)
## 1
## 1416.636
predict(mod2, newdata, interval="confidence")
## fit lwr upr
## 1 1416.636 1387.764 1445.509
predict(mod2, newdata, interval="predict")
## fit lwr upr
## 1 1416.636 430.6721 2402.6
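To see where the extra width of the prediction interval comes from, we can compute it by hand with the standard formula (a sketch; the pieces are the residual standard error \(s\), the sample size \(n\), and the spread of the \(x\) values):
\[\hat{Y}_0 \pm t_{df=n-2, \alpha/2} \times s\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]
x0   <- 10
n    <- nrow(laptop_price_wo)                  # 1302 laptops
xbar <- mean(laptop_price_wo$Ram_GB)
Sxx  <- sum((laptop_price_wo$Ram_GB - xbar)^2)
s    <- sqrt(sum(mod2$residuals^2)/(n - 2))    # residual standard error
yhat <- mod2$coefficients[1] + mod2$coefficients[2]*x0
yhat + c(-1, 1)*qt(0.975, df = n - 2)*s*sqrt(1 + 1/n + (x0 - xbar)^2/Sxx)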
We can also compute confidence and prediction bands at every observed RAM value and add them to the scatterplot:
confBand<-predict(mod2, interval="confidence")
predBand<-predict(mod2, interval="predict")
## Warning in predict.lm(mod2, interval = "predict"): predictions on current data refer to _future_ responses
colnames(predBand)<-c("fit2", "lwr2", "upr2")
newDF<-cbind(laptop_price_wo, confBand, predBand)
ggplot(newDF, aes(x=Ram_GB, y=Price_dollar))+
geom_point(alpha=.3)+
geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1],
color="blue", lty=2, lwd=1)+
geom_line(aes(y=lwr), color="green", lty=2, lwd=1)+
geom_line(aes(y=upr), color="green", lty=2, lwd=1)+
geom_line(aes(y=lwr2), color="red", lty=2, lwd=1)+
geom_line(aes(y=upr2), color="red", lty=2, lwd=1)+
theme_bw()
We can also look at the analysis of variance (ANOVA) table for the regression:
anova(mod2)
## Analysis of Variance Table
##
## Response: Price_dollar
## Df Sum Sq Mean Sq F value Pr(>F)
## Ram_GB 1 418086847 418086847 1656.6 < 2.2e-16 ***
## Residuals 1300 328085970 252374
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Researchers are often interested in the \(R^2\) value, which is known as the coefficient of determination.
\[R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
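We can plug the sums of squares from the ANOVA table above into this formula (the values below are taken from the anova(mod2) output):
# SS(Reg) and SS(Res) from the ANOVA table
SSReg <- 418086847
SSRes <- 328085970
SSReg/(SSReg + SSRes)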
In the case of simple linear regression we can also square the correlation.
r_xy<-cor(laptop_price_wo$Ram_GB, laptop_price_wo$Price_dollar)
r_xy^2
## [1] 0.5603083
We can also use the glance() function from the broom package:
mod2%>%
glance()%>%
pull(r.squared)
## [1] 0.5603083
It is circular logic to assess model fit on data used to create the model. Instead, we will split the data into a training set and a testing set. We fit the model using the training set only. Then we use the test set data as a proxy for new observations to understand how the model will perform.
### TEST TRAIN
dim(laptop_price_wo)
## [1] 1302 15
# 1302
## TEST with 1/3
1302/3
## [1] 434
# 434
set.seed(123)
testInd<-sample(1:1302, size=434, replace=FALSE)
train<-laptop_price_wo[-testInd, ]
test<-laptop_price_wo[testInd, ]
## TRAIN
trainMod<-lm(Price_dollar~Ram_GB, data=train)
summary(trainMod)
##
## Call:
## lm(formula = Price_dollar ~ Ram_GB, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2598.51 -320.76 -95.68 259.99 3119.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 249.230 35.281 7.064 3.32e-12 ***
## Ram_GB 116.981 3.575 32.719 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 514.6 on 866 degrees of freedom
## Multiple R-squared: 0.5528, Adjusted R-squared: 0.5523
## F-statistic: 1071 on 1 and 866 DF, p-value: < 2.2e-16
## TEST
testPred<-predict(trainMod, test)
sqrt(mean((testPred-test$Price_dollar)^2))
## [1] 476.985
#install.packages("caret")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
RMSE(testPred, test$Price_dollar)
## [1] 476.985
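For comparison, we can also compute the RMSE on the training set (a small sketch using the same caret RMSE() helper; the training RMSE is optimistic by construction, while the test RMSE is the honest estimate of out-of-sample performance):
# RMSE on the data the model was fit to
trainPred<-predict(trainMod, train)
RMSE(trainPred, train$Price_dollar)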