The data set give the locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964.
library(tidyverse)
data(quakes)
# SCATTERPLOT
ggplot(quakes, aes(x = mag, y = stations)) +
geom_point()
# JITTER ADDS NOISE
ggplot(quakes, aes(x = mag, y = stations)) +
geom_jitter()
ggplot(quakes, aes(x = mag, y = stations)) +
geom_jitter()+
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(quakes, aes(x = mag, y = stations)) +
geom_jitter()+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
We’re going to go with a simple linear model
This model has the form: \[Y_i=\beta_0+\beta_1 x_i +\varepsilon_i\]
This might look familar to \(Y=mx+b\), which was called slope-intercept form.
mod <- lm(stations ~ mag, data = quakes)
summary(mod)
##
## Call:
## lm(formula = stations ~ mag, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.871 -7.102 -0.474 6.783 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243 4.1899 -43.06 <2e-16 ***
## mag 46.2822 0.9034 51.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared: 0.7245, Adjusted R-squared: 0.7242
## F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16
We can estimate the parameters:
We can estimate the error associated estimating these parameters: * Standard Error for Y-intercept: \(SE(\hat{\beta}_0) = 4.1899\) * Standard Error for Slope: \(SE(\hat{\beta}_1) = 0.9034\)
We can estimate the \(\sigma^2\) with \(s^2\): \[s^2=\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]
The quantity \(s^2\) is called the mean squared error (MSE).
The denominator \(n-2\) is referred to as the residual regrees of freedom.
Rather than \(s^2\) we often use \(s\): \[s=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}\]
We can find this value in the R output, but we can also check it:
sqrt(sum(mod$residuals^2)/998)
## [1] 11.50061
\[t=\frac{\hat{\beta}_1}{SE({\hat{\beta}_1})}\]
where \[SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]
T-distribution with \(n-2\) degrees of freedom.
beta1 <- mod$coefficients[2]
beta1
## mag
## 46.28221
beta1SE <- sqrt((sum(mod$residuals^2))/998/sum((quakes$mag - mean(quakes$mag))^2))
beta1SE
## [1] 0.9033955
testStat<- beta1/beta1SE
testStat
## mag
## 51.2314
pt(-abs(testStat), 998, lower.tail=TRUE)*2
## mag
## 1.212549e-281
Confidence intervals take the form
\[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]
\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2} \times \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]
beta1 <- mod$coefficients[2]
beta1
## mag
## 46.28221
We define the significance level \(\alpha\), therefore we need to find the \(1-\alpha/2\) quantile. Typically, the default significance level used is \(\alpha=0.05\), thus the percentile we look for is \(97.5%\).
Recall that the degrees of freedom are \(n-2\).
qt(0.975, df=1000-2)
## [1] 1.962344
beta1SE <- sqrt((sum(mod$residuals^2))/998/sum((quakes$mag - mean(quakes$mag))^2))
beta1SE
## [1] 0.9033955
beta1 +c(-1,1)*qt(0.975, df=1000-2)*beta1SE
## [1] 44.50944 48.05498
We can check this with the confint function in R:
confint(mod)
## 2.5 % 97.5 %
## (Intercept) -188.64628 -172.20238
## mag 44.50944 48.05498
newdata=data.frame(mag=7)
predict(mod, newdata)
## 1
## 143.5511
predict(mod, newdata, interval="confidence")
## fit lwr upr
## 1 143.5511 139.2727 147.8296
predict(mod, newdata, interval="predict")
## fit lwr upr
## 1 143.5511 120.581 166.5213
confBand<-predict(mod, interval="confidence")
predBand<-predict(mod, interval="predict")
## Warning in predict.lm(mod, interval = "predict"): predictions on current data refer to _future_ responses
colnames(predBand)<-c("fit2", "lwr2", "upr2")
newDF<-cbind(quakes, confBand, predBand)
ggplot(newDF, aes(x=mag, y=stations))+
geom_point(alpha=.3)+
geom_abline(slope=mod$coefficients[2], intercept=mod$coefficients[1],
color="blue", lty=2, lwd=1)+
geom_line(aes(y=lwr), color="green", lty=2, lwd=1)+
geom_line(aes(y=upr), color="green", lty=2, lwd=1)+
geom_line(aes(y=lwr2), color="red", lty=2, lwd=1)+
geom_line(aes(y=upr2), color="red", lty=2, lwd=1)+
theme_bw()
anova(mod)
## Analysis of Variance Table
##
## Response: stations
## Df Sum Sq Mean Sq F value Pr(>F)
## mag 1 347148 347148 2624.7 < 2.2e-16 ***
## Residuals 998 132000 132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod)