Simple Linear Regression Notes

Earthquake Data

The data set give the locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964.

lat = Latitude of event
long = Longitude
depth = Depth (km)
mag = Richter magnitude
stations = Number of statitions reporting

Import the Data

library(tidyverse)
data(quakes)

Step Zero: Relationships Between Variables

What do you think should be the response and explantory variables?

# space for answer

Step One: Look at the Data

Make a Scatterplot

# SCATTERPLOT


# JITTER ADDS NOISE

Step Two: Assume a Model Form

Non-Parametric

# Add a smoothed line to visualize a non-parametric model to your scatterplor

Parametric

# Add the line for best fit to your scatterplor

We’re going to go with a simple linear model

Fitting A Simple Linear Regression (SLR) Model

This model has the form: \[Y_i=\beta_0+\beta_1 x_i +\varepsilon_i\]

This might look familar to \(Y=mx+b\), which was called slope-intercept form.

Step Three: Estimate Model Coefficients

# Use the lm function to create a SLR model and store the output


# Look at the summary of that output

We can estimate the parameters:

Y-intercept: \(\hat{\beta}_0 = -180.42\)
Slope: \(\hat{\beta}_1 = 46.28\)

Step Four: Error Estimation

Standard Errors

We can estimate the error associated estimating these parameters: * Standard Error for Y-intercept: \(SE(\hat{\beta}_0) = 4.1899\) * Standard Error for Slope: \(SE(\hat{\beta}_1) = 0.9034\)

Estimating the variance, \(\sigma^2\)

We can estimate the \(\sigma^2\) with \(s^2\): \[s^2=\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]

The quantity \(s^2\) is called the mean squared error (MSE).

The denominator \(n-2\) is referred to as the residual regrees of freedom.

Rather than \(s^2\) we often use \(s\): \[s=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}\]

We can find this value in the R output, but we can also check it:

# Write code to varify s

Inference

Step Five: Hypothesis Testing for Parameters

Hypothesis Testing for Slope

Hypotheses:

Null Hypothesis: \(H_0: \beta_1 = 0\)
Alternative Hypothesis:
\(H_A: \beta_1 \neq 0\) (two-sided)
\(H_A: \beta_1 > 0\) (one-sided upper)
\(H_A: \beta_1 < 0\) (one-sided lower)

Test Statistic:

\[t=\frac{\hat{\beta}_1}{SE({\hat{\beta}_1})}\]

where \[SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Reference Distribution:

T-distribution with \(n-2\) degrees of freedom.

P-value:

#beta1 <- mod$coefficients[2]
#beta1SE <- sqrt((sum(mod$residuals^2))/998/sum((quakes$mag - mean(quakes$mag))^2))

# Use the above components to create the test statistic


# Use the test statistic to find the p-value

Step Six: Confidence Interval for Parameters

Confidence intervals take the form

\[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]

Confidence Interval for Slope

\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2} \times \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\]

Point Estiamte: \(\hat{\beta}_1\)

#beta1 <- mod$coefficients[2]
#beta1

Critical Value: \(t_{df=n-2, \alpha/2}\)

We define the significance level \(\alpha\), therefore we need to find the \(1-\alpha/2\) quantile. Typically, the default significance level used is \(\alpha=0.05\), thus the percentile we look for is \(97.5%\).

Recall that the degrees of freedom are \(n-2\).

# Find the critical value

Standard Error: \(SE(\hat{\beta}_1)=\sqrt{\frac{s^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}\)

#beta1SE <- sqrt((sum(mod$residuals^2))/998/sum((quakes$mag - mean(quakes$mag))^2))
#beta1SE

Pull this all together:

# Find the confidence interval for the slope

We can check this with the confint function in R:

# note that we can also use a special function in R instead of making a confidence interval from scratch

Step Seven: Confidence and Prediction Intervals for an Observation

Using the Model for Prediction

# Predict for an earthquake with magnitude 7

Confidence Interval for a mean response

# Create a confidence interval

Prediction Interval for a new response

# Create a prediction interval

What is the difference between confidence and prediction intervals?

# space for notes here..

Adding Confidence and Prediction Bands

#confBand<-predict(mod, interval="confidence")
#predBand<-predict(mod, interval="predict")

#colnames(predBand)<-c("fit2", "lwr2", "upr2")


#newDF<-cbind(quakes, confBand, predBand)

# Create a scatterplot that include confidence and prediction bands..

Step Eight and Nine: Analysis of Variance

# print the anova table

Model Assumptions

Normality Assumptions:

Mean 0
Constant Variance; Homogeneity
Independent; uncorrelated

Tools for assessment:

Residual Plot
QQ (Quantile-Quantile) Plot
Leverage vs Residual Plot

# Look at diagnostic plots