MATH239: SLR EXAMPLES

Earthquake Data

The data set give the locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964.

lat = Latitude of event
long = Longitude
depth = Depth (km)
mag = Richter magnitude
stations = Number of statitions reporting

Import the Data

library(tidyverse)
data(quakes)

Step Zero: Relationships Between Variables

Explanatory (X): Magnitude
Repsonse (Y): Stations

Step One: Look at the Data

Make a Scatterplot

# SCATTERPLOT
ggplot(quakes, aes(x = mag, y = stations)) +
  geom_point()

# JITTER ADDS NOISE
ggplot(quakes, aes(x = mag, y = stations)) +
  geom_jitter()

Step Two: Assume a Model Form

Non-Parametric

ggplot(quakes, aes(x = mag, y = stations)) +
  geom_jitter()+
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Parametric

ggplot(quakes, aes(x = mag, y = stations)) +
  geom_jitter()+
  geom_smooth(method="lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

We’re going to go with a simple linear model

Fitting A Simple Linear Regression (SLR) Model

This model has the form: \[Y_i=\beta_0+\beta_1 x_i +\varepsilon_i\]

This might look familar to \(Y=mx+b\), which was called slope-intercept form.

Step Three: Estimate Model Coefficients

mod <- lm(stations ~ mag, data = quakes)
summary(mod)

## 
## Call:
## lm(formula = stations ~ mag, data = quakes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.871  -7.102  -0.474   6.783  50.244 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242 
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16

We can estimate the parameters:

Y-intercept: \(\hat{\beta}_0 = -180.42\)
Slope: \(\hat{\beta}_1 = 46.28\)

Step Four: Error Estimation

Standard Errors

We can estimate the error associated estimating these parameters: * Standard Error for Y-intercept: \(SE(\hat{\beta}_0) = 4.1899\) * Standard Error for Slope: \(SE(\hat{\beta}_1) = 0.9034\)

Estimating the variance, \(\sigma^2\)

We can estimate the \(\sigma^2\) with \(s^2\): \[s^2=\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]

The quantity \(s^2\) is called the mean squared error (MSE).

The denominator \(n-2\) is referred to as the residual regrees of freedom.

Rather than \(s^2\) we often use \(s\): \[s=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}\]

We can find this value in the R output, but we can also check it:

sqrt(sum(mod$residuals^2)/998)

## [1] 11.50061

Inference

Step Five: Hypothesis Testing for Parameters

Hypothesis Testing for Slope

Hypotheses:

Null Hypothesis: \(H_0: \beta_1 = 0\)
Alternative Hypothesis:
\(H_A: \beta_1 \neq 0\) (two-sided)
\(H_A: \beta_1 > 0\) (one-sided upper)
\(H_A: \beta_1 < 0\) (one-sided lower)

Test Statistic:

\[t=\frac{\hat{\beta}_1}{SE({\hat{\beta}_1})}\]

where \[SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Reference Distribution:

T-distribution with \(n-2\) degrees of freedom.

P-value:

beta1 <- mod$coefficients[2]
beta1

##      mag 
## 46.28221

beta1SE <- sqrt((sum(mod$residuals^2))/998/sum((quakes$mag - mean(quakes$mag))^2))
beta1SE

## [1] 0.9033955

testStat<- beta1/beta1SE
testStat

##     mag 
## 51.2314

pt(-abs(testStat), 998, lower.tail=TRUE)*2

##           mag 
## 1.212549e-281

Step Six: Confidence Interval for Parameters

Confidence intervals take the form

\[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]

Confidence Interval for Slope

\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2} \times \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Point Estiamte: \(\hat{\beta}_1\)

beta1 <- mod$coefficients[2]
beta1

##      mag 
## 46.28221

Critical Value: \(t_{df=n-2, \alpha/2}\)

We define the significance level \(\alpha\), therefore we need to find the \(1-\alpha/2\) quantile. Typically, the default significance level used is \(\alpha=0.05\), thus the percentile we look for is \(97.5%\).

Recall that the degrees of freedom are \(n-2\).

qt(0.975, df=1000-2)

## [1] 1.962344

Standard Error: \(SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\)

beta1SE <- sqrt((sum(mod$residuals^2))/998/sum((quakes$mag - mean(quakes$mag))^2))
beta1SE

## [1] 0.9033955

Pull this all together:

beta1 +c(-1,1)*qt(0.975, df=1000-2)*beta1SE

## [1] 44.50944 48.05498

We can check this with the confint function in R:

confint(mod)

##                  2.5 %     97.5 %
## (Intercept) -188.64628 -172.20238
## mag           44.50944   48.05498

Step Seven: Confidence and Prediction Intervals for an Observation

Using the Model for Prediction

newdata=data.frame(mag=7)
predict(mod, newdata)

##        1 
## 143.5511

Confidence Interval for a mean response

predict(mod, newdata, interval="confidence")

##        fit      lwr      upr
## 1 143.5511 139.2727 147.8296

Prediction Interval for a new response

predict(mod, newdata, interval="predict")

##        fit     lwr      upr
## 1 143.5511 120.581 166.5213

Adding Confidence and Prediction Bands

confBand<-predict(mod, interval="confidence")
predBand<-predict(mod, interval="predict")

## Warning in predict.lm(mod, interval = "predict"): predictions on current data refer to _future_ responses

colnames(predBand)<-c("fit2", "lwr2", "upr2")


newDF<-cbind(quakes, confBand, predBand)

ggplot(newDF, aes(x=mag, y=stations))+
  geom_point(alpha=.3)+
  geom_abline(slope=mod$coefficients[2], intercept=mod$coefficients[1],
              color="blue", lty=2, lwd=1)+
  geom_line(aes(y=lwr), color="green", lty=2, lwd=1)+
  geom_line(aes(y=upr), color="green", lty=2, lwd=1)+
  geom_line(aes(y=lwr2), color="red", lty=2, lwd=1)+
  geom_line(aes(y=upr2), color="red", lty=2, lwd=1)+
  theme_bw()

Step Eight and Nine: Analysis of Variance

anova(mod)

## Analysis of Variance Table
## 
## Response: stations
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## mag         1 347148  347148  2624.7 < 2.2e-16 ***
## Residuals 998 132000     132                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Assumptions

Normality Assumptions:

Mean 0
Constant Variance; Homogeneity
Independent; uncorrelated

Tools for assessment:

Residual Plot
QQ (Quantile-Quantile) Plot
Leverage vs Residual Plot

plot(mod)