Intro to Simple Linear Regression

Welcome to Sandy Cheek’s R Guide highlighting the material covered in chapter 3: Simple Linear Regression. This publication will serve as a teaching and reference tool for you as you go through the material in chapter 3 and apply the content in R. Below are links to the code blocks introduced in this chapter for quick reference.

EDA
plot
lm
abline
Residual Plot
Residual Histogram
Q-Q Plot
confint
Response Confidence Interval
Response Prediction Interval
summary model
cor
t-test
F-test

Dataset Introduction

Before we dig into Simple Linear Regression, let’s become familiar with the data set that we will be using in this R-Guide. We will be using the “quakes” data set that is preloaded in R. When you downloaded R or R-Studio onto your computer, a handful of data sets came with it as well. These data sets are very easy to call, so we’ll start our first R-Guide using one of the preloaded data sets.

To call the quakes data set, simply use the data command.

data(quakes)

Once we’ve called our data set, it’s always good practice to perform EDA, Exploratory Data Analysis, before conducting rigorous analytics. EDA allows us to confirm that the correct data is loaded into R while also giving us a preview of the data set. Below are two common EDA functions to help you get a feel for the data, both of which have the data set as the only arguement. The head function shows the first six rows and all of the columns of the data while the summary function provides summary statistics for each variable.

EDA

head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12
summary(quakes)
##       lat              long           depth            mag      
##  Min.   :-38.59   Min.   :165.7   Min.   : 40.0   Min.   :4.00  
##  1st Qu.:-23.47   1st Qu.:179.6   1st Qu.: 99.0   1st Qu.:4.30  
##  Median :-20.30   Median :181.4   Median :247.0   Median :4.60  
##  Mean   :-20.64   Mean   :179.5   Mean   :311.4   Mean   :4.62  
##  3rd Qu.:-17.64   3rd Qu.:183.2   3rd Qu.:543.0   3rd Qu.:4.90  
##  Max.   :-10.72   Max.   :188.1   Max.   :680.0   Max.   :6.40  
##     stations     
##  Min.   : 10.00  
##  1st Qu.: 18.00  
##  Median : 27.00  
##  Mean   : 33.42  
##  3rd Qu.: 42.00  
##  Max.   :132.00

Further, the quakes data set contains 1000 observations of earthquake data that occurred near the tropical island of Fiji. Each observation has the latitude, longitude, depth, magnitude, and number of stations that reported the seismic activity. For this R-Guide, we will be focusing on the magnitude, measured by the logarithmic Richter scale, of the earthquake and number of stations that reported each earthquake.

Simple Linear Regression Model Basics

In a simple linear regression (SLR) model, we are keenly interested in seeing if there is a linear relationship between a predictor variable (sometimes called the independent variable) and a response variable (sometimes called the dependent variable). From our data set, we’re going to be examining if there is a linear relationship between an earthquake’s magnitude and the number of stations that reported the activity. The intuition here is that as the magnitude of a quake changes, so does the number of stations that report it in some sort of predictable manner.

First, let’s get a better feel for the quakes data with a scatter plot. A scatter plot shows us the general shape of the data and can provide us hints to what the relationship between the predictor variable and response variable might be.

Start with the attach command, which tells R that we will be referring to the quakes data set in our future code and allows us to avoid re-calling the data set in future code. The plot function requires only 2 components: the X-axis variable and the Y-axis variable. If we want to spice things up, we can include code for axis labels and a chart title in the next arguments, ylab, xlab, and main respectfully. Just be sure you put the text in quotes.

plot

attach(quakes)
plot(mag, stations, 
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting")

One thing that you may notice about the data is it almost appears to be in columns. This is most likely due to the fact that the magntitude scale is not continuous, but rather recorded at 0.1 intervals. You can also see that when there are multiple similar values, they overlap making it harder for us to understand the true distribution. To solve these issues, we can do two things: add noise to our magnitude values, and color our observations to reflect the actual density in the plot.

By using the jitter command for magnitude, we can add some slight variation to each x value so that they don’t stack on top of each other. Simply include the variable name as the first arguement and the max amount of horizontal jitter as the second arguement. I’ve also added the pch arguement, which changes the shape of each observation. Pch 20 is a smaller circle that is shaded a darker color when the density at a certain point increases. Finally, the col arguement includes the color of each observation including the shade in the center of each circle. If multiple circles overlap, that point will appear darker as a result of the increased density.

plot(jitter(mag, amount = 0.05), stations, 
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))

As you can now see, we have added random noise to our points to better understand the shape of the data as a continuous segment while also having a better idea of observation density with the shading of the data.

Now we have a pretty fancy scatter plot. Again, we’re trying to develop some initial ideas about the relationship between the the magnitude of an earthquake and the number of stations that report that quake. Just by looking at the scatter plot, we can develop some preliminary thoughts on the association between magnitude and stations. Generally speaking, as magnitude increases, so does the number of stations that report it. We can also take note of the spread of the data, or how closely packed data points are to each other.

We start by giving our model a specific name that we can reference later: Quake.mod. Then, the only commands we need are response ~ predictor. If we type our model after we define it, R will provide us with the magnitude beta coefficient and our intercept term.

lm

Quake.mod <- lm(stations ~ mag)
Quake.mod
## 
## Call:
## lm(formula = stations ~ mag)
## 
## Coefficients:
## (Intercept)          mag  
##     -180.42        46.28

Using the coefficients from our command above, our linear model gives us the following: \[ \widehat{Stations}= -180.42 + 46.28(Magnitude).\]

So, what does this equation mean? Well, as we discussed before, it provides a numerical relationship, based on our sample data, between the magnitude of an earthquake and the number of stations that reported the quake. From the slope coefficient, we learn that a change of 1 on the richter scale will, on average, change the number of reporting stations by 46.28. Since our slope is positive, our model predicts that there is a positive association between magnitude and the number of stations that report an earthquake. Our intercept coefficient tells us that if the magnitude of the earthquake was zero, negative 180.42 stations would report it. That obviously doesn’t make sense, so we provide two guidelines when conceptualizing the intercept coefficient. 1. If the data set does not include predictor values of zero or around zero, the intercept cannot be interpreted. 2. The value can also only be interpreted if it is reasonable and plausible.

Now that we have our regression line, we can include it in our scatter plot to confirm that the model fits the data. We start with our fancy scatter plot and include the abline code with our model parameters (intercept, slope) to add our regression line. I also colored the line red and increased its width using the lwd arguement with some simple code to help it stand out against the data.

abline

plot(jitter(mag, amount = 0.05), stations, 
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))

abline(-180.42, 46.28, col="red", lwd = 2)

We can see that the regression line follows the data fairly well. We will discuss how well the model fits the data in future sections.

In the next section, we will explore the assumptions of the linear model to make sure we are only using this model during appropriate circumstances.

SLR Model Assumptions

An essential skill of adept statisticians is knowing under what conditions, each model can be used. In this section of the Chapter 3 R-Guide, we will discuss the four assumptions of the SLR model (homoscedasticity, normally distributed errors, and independent errors) and how to check if these assumptions hold for a given data set. Usually when using a SLR, we make the assumption that the mean of our residuals is equal to zero. Lucky for us, R mimizes our residuals to zero as a part of the function’s programming, so we can skip this assumption check.

Assumption 1: Homoscedasticity

Homoscedasticity means that the variance of our residuals is constant across all earthquake magnitudes. Put another way, the variance of our residuals is independent of our predictor variable. There are a few ways that we can check the variance of our residuals starting with the original scatter plot without jitter and adding guide lines using the abline command in hot pink for variation reference.

plot(mag, stations, 
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))
abline(-185, 56, col="hotpink", lwd = 2)
abline(-270, 56, col="hotpink", lwd = 2)

As we compare the spread of the data, the variation appears to be relatively constant across the plot except for the lowest earthquake magnitudes. We will keep this in mind as we further check for homoskedasticity in the residuals.

Our primary option for checking the variance in residuals is a residual plot with our model’s fitted values on the X-axis and residual size on the Y-axis. This residual plot allows us more clearly to see changes in the variance of the residuals across all magnitudes compared to the scatter plot. Since R is pretty neat, we can directly compute our residuals and fitted values with some simple code. We will create another function for our residuals (QuakeResiduals) and fitted values (QuakeFittedValues) and then construct our residual plot with a horizontal line at Y equals zero as a reference point for the variation in our residuals.

Residual Plot

QuakeResiduals <- Quake.mod$residuals
QuakeFittedValues <- Quake.mod$fitted.values
plot(QuakeFittedValues, QuakeResiduals, 
     pch = 20,
     xlab= "Magnitude",
     ylab= "Residual",
     main= "Residual Plot",
     col = rgb(0.1, 0.2, 0.8, 0.3))
abline(0,0,col="brown", lwd = 2.5)

As we compare variation for different magnitudes, it appears that the residual variation is slightly lower for magnitudes 4.0-4.25 compared to magnitudes greater than 4.25. While this assumption is not perfectly met, the data is reasonable enough to allow us to proceed and not completely disregard our model’s findings.

Assumption 3: Errors are normally distributed

Our third assumption for SLR models is that our residuals follow a normal distribution. Again, there are two useful visual tests that we can perform in R to help us determine if this assumption is met for our data set and model.

Recall the following characteristics about the normal distribution:

  1. Bell-Shaped Curve

  2. Majority of data within one and two standard deviations of the mean

  3. Symmetrical in shape

An easy way to see if our residuals follows a normal distribution is by showing the residuals in a histogram. The histogram command is very straightforward with only one argument: the data source itself. However, additional arguments for the number of columns (breaks=), axis and chart titles, and colors are available as well.

Residual Histogram

hist(QuakeResiduals, breaks=25,
    xlab="Residual Value",
    ylab="Frequency",
    main="Histogram of Residuals",
    col="moccasin")

From the histogram, we see that the distribution of residuals is bell-shaped, symmetric, and with the majority of the data centered around the mean.

Another way to examine the normality of the residuals is by using a Q-Q Plot. A Q-Q Plot shows if the residuals follow a normal distribution by having the normal distribution quantiles on the X-axis and the residual data’s quantiles on the Y-axis. If the residuals perfectly follow a normal distribution, then the data would lie on the straight qqline with the majority of the data appearing around the 0 quantile. Again, the code to construct a Q-Q plot only requires the residual data.

Q-Q Plot

qqnorm(QuakeResiduals, col="royalblue")
qqline(QuakeResiduals, col="orange")

From the Q-Q plot, we can see that the residuals mostly follow the straight line with some deviation at both ends. Since this distribution has thinner tails than the normal distribution, we are not overly concerned. Further, the majority of the data is located around the zero quantile, which shows us that the residuals are normally distributed.

Assumption 4: Each error term is independent of other error terms

Chekcing this assumption requires more intuitive thinking versus R output. For our data set, there is little reason to believe that the residual number of stations reporting an earthquake for a given magnitude would be dependent on the residual of another predictor/response variable combination. Given this, we can claim that this assumption is met and continue with our linear model.

Based on our knowledge of earthquake reporting, we assume that independence between residuals holds.

As you may have realized, checking the assumptions of the SLR model has objective and subjective components, which ultimately can leave to decision to proceed with the model in your hands. Now that we understand the basics of our SLR Model and the appropriate data concerns for its use, the next section explores how we can quantify the accuracy of our model.

Model Accuracy and Precision

Confidence Intervals for Regression Coefficients

The first way we can assess the accuracy of our model is by measuring the accuracy of the point estimates of our regression coefficients. This section shows how to use confidence intervals to establish the accuracy of the regression coefficients.

The confidence intervals used in SLR are very similar to the confidence intervals explored in a mathematical statistics course, so we will focus on computing the intervals and analyzing their results in the context of our quakes data set.

Recall that our model takes the following form as shown in the first section: \[ \widehat{Stations}= -180.42 + 46.28(Magnitude).\]

Since our model parameters for the slope and intercept are indeed point estimates, we are interested in knowing the relevant range that these parameters could fall into given a different sample, i.e. the precision of our regression coefficients.

We can use the confint function to produce a confidence interval with a given confidence level for each of our regression coefficients. All we need to do is include our linear model and the desired level of significance.

confint

confint(Quake.mod, level=.95)
##                  2.5 %     97.5 %
## (Intercept) -188.64628 -172.20238
## mag           44.50944   48.05498

The code output gives us two 95% confidence intervals: one for the ointercept and one for the slope. We can interpret the results in a similar way to confidence intervals from previous courses as well. For our linear model’s intercept, we are 95% confident that the true intercept for all possible quakes samples will be in the range [-188.6463, -172.2023]. The 2.5% and 97.5% column headings represent the quantiles which are capturing the middle 95% of the data. For our linear model’s slope, we are 95% confident that the true slope of the linear relationship between magnitude and stations will be in the range [44.5094,48.0550].

If we want a visualization of these intervals, we can plot our data again with lines depicting the interval bounds for the intercept and the slope while keeping the other coefficient constant at the respected mean.

In the plot below, the black lines represent the confidence interval for the intercept by simply plotting the corresponding interval bounds with the mean slope of 46.28 in black and comparing them to the original model in red.

plot(jitter(mag, amount = 0.05), stations, 
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))
abline(-188.6463, 46.28, col = 'black', lwd = 2)
abline(-180.42, 46.28, col="red", lwd = 2)
abline(-172.2024, 46.28, col = 'black', lwd = 2)

Based on the plot above and our confint output, we see that our margin of error for the intercept is 8.22 stations. We should keep this in mind as we think about the model’s accuracy, even though we will not be directly interpreting the intercept.

We can visualize the same for the slope. Again, the black lines represent the confidence interval for slope by simply plotting the corresponding interval bounds with the mean the intercept of -180.42 in black and comparing them to the original model in red.

plot(jitter(mag, amount = 0.05), stations, 
     pch = 20,
     ylab = "# of Stations Reporting",
     xlab = "Magnitude",
     main = "Fiji Earthquakes Magnitude and Reporting",
     col = rgb(0.1, 0.2, 0.8, 0.3))
abline(-180.42, 44.5094, col = "black", lwd = 2)
abline(-180.42, 46.28, col="red", lwd = 2)
abline(-180.42, 48.0550, col = "black", lwd = 2)

This again helps us see the accuracy of our slope point estimate, which has a margin of error of 1.775.

Now that we have an idea of the accuracy of our predictor variable coefficients, we can explore the true accuracy of our model’s prediction of the number of stations that report a given earthquake.

Confidence and Prediction Intervals for Response Variable

There are two ways that we will discuss analyzing the precision of our model through our response variable, stations: a confidence interval for the mean number of stations and a prediction interval for a single stations value.

We will begin with the confidence interval for the average number of stations reporting an earthquake. The intuition for this accuracy measure is as follows: by looking at all of the station values for a specific magnitude, say 4.5, this interval can provide us with a range of the average number of stations that will report an earthquake with magnitude 4.5.

To produce this confidence interval, we will need to produce a new data frame that only contains stations with the magnitude of 4.5. We will conveniently call it Quake4.5.

Quake4.5 <- data.frame(mag=4.5)

Now we will use Quake4.5 to produce the confidence interval. To construct the interval, we will create a predict function using our Quake.mod model and our Quake4.5 data frame called Stations.confidenceint. In the code, we also need to specify the interval as a confidence interval along with our confidence level.

Response Confidence Interval

Stations.confidenceint <- predict(Quake.mod, Quake4.5, interval="confidence", level = .95)
Stations.confidenceint
##        fit      lwr      upr
## 1 27.84562 27.10072 28.59052

From the R output, we learn that our model predicts that the average number of stations that will report a 4.5 magnitude earthquake is 27.8456. We are also 95% confident that the mean number of stations reporting a 4.5 magnitude earthquake will be in the range [27.1007,28.5905].

Our second option for assessing the accuracy of the response variable is a prediction interval for a single stations value. This time, instead of producing an interval for the mean number of stations, we will produce an interval that predicts a single estimate of the number of stations that will report an earthquake with a specific earthquake magnitude.

The code is very similar; we will again look at predicting the number of stations based on a magnitude of 4.5. However, we must specify that we want our interval to be a prediction interval since we are interested in a single value estimate, instead of an average estimate.

Response Prediction Interval

Stations.predictionint <- predict(Quake.mod, Quake4.5, interval="prediction", level = .95)
Stations.predictionint
##        fit      lwr      upr
## 1 27.84562 5.265176 50.42607

Now that we’ve altered our interval type, we’re given a different interval. We can say that we are 95% confident that the number of stations that will report a 4.5 magnitude earthquake is in the range [5.2652,50.4261]. You might be thinking, why has the interval width increased almost 3000%? The increase in the variability of the prediction is solely due to predicting a single response value compared to the mean response value. Predicting an average naturally decreases variation, while predicting a single value can vary greatly based on the range of potential outcomes. So, remember, prediction intervals will always be at least as wide as confidence intervals for the response variable.

As a quick note, we see from the fit column that the center of each interval is 27.8456 stations for a 4.5 magnitude earthquake. This is because the response estimate does not change between the two intervals, only the variation about the estimate. This is always a good check to make sure you have constructed comparable intervals.

To round out our discussions on model accuracy, we will step back a bit to get a measure of how well the model and the predictor variable, magnitude, predicts the response variable, stations.

Correlation

If we truly want a measure of the usefulness of our model, we can begin by looking at R-Squared. R-Squared is known as the simple coefficient of determination. When comparing two variables, R-squared represents the proportion of total variation in the response variable that is explained by the SLR model. Naturally, a higher R-squared shows that the predictor variable predicts the response variable well.

In our case, we can see the proportion of variation in the number of stations reporting that can be explained by our Quake.mod. R-Squared is provided by the summary command, for which the only arguement is the model name. Multiple R-squared, located at the bottom of the summary output can be intrepreted as follows: 72.45% of the total variation in the number of stations reporting a quake can be explained by our linear model. R-squared values range from 0-1, so 72.45 percent is noteworthy for sure.

Summary(Model)

summary(Quake.mod)
## 
## Call:
## lm(formula = stations ~ mag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.871  -7.102  -0.474   6.783  50.244 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242 
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16

Further, we can take the square root of R-Squared and get the correlation coefficient, r. r measures the strength of the linear relationship between the response and predictor variables. r can range from -1 to 1 with -1 being a very strong negative relationship while +1 being a strong positive relationship. If r is 0, that indicates that there is a very weak or no relationship between the two variables. The cor function will produce our r value by inputting our predictor and response variables..

cor

cor(mag, stations)
## [1] 0.8511824

With a correlation of 0.8512, we can say that there is a fairly strong linear relationship between magnitude of earthquake and the number of stations reporting it. As a quick check, our r value should always reflect our slope coefficient such that if the slope is positive, r should be positive as well.

At this point, you may feel inclined to state the following: “Based on our analysis and the strong correlation of 0.8512, I believe that a higher magnitude of an earthquake will cause a high number of stations to report it.”

However, this statement is so errant that the implication or mention of causation at any point will lead you to being shunned by the statistics community forever. So don’t even think about it.

We will know move to testing model significance using hypothesis tests.

Testing Significance

An important part of doing any task is evaluating if what you are doing is actually important. We are able to perform this self-analysis in our SLR Model as well through hypothesis tests on the t and F distribution. For both tests, we will use an alpha of 0.05.

Hypothesis Testing for Regression Coefficients

Hypothesis tests for the SLR model follow a very similar format to that of hypothesis tests discussed in mathematical statistics courses, so we will tailor our approach and analysis beyond the introductory level.

Our goal of hypothesis testing is to test if there is linear relationship between our response and predictor variable. We test this through examining the slope of the regression model.

With our null hypothesis being the slope equals zero, we will fail to reject the null when we believe there is no linear relationship between quake magnitude and the number of stations reporting. On the contrary, when we believe the slope does not equal zero we will reject the null and conclude there is a linear relationship. (Hypothetically, we could perform a one-sided test, but the standard two-sided test will serve our purpose of testing significance the best.)

We can use the summary function we introduced in the last section to analyze the hypothesis test results.

summary(model) t-test

summary(Quake.mod)
## 
## Call:
## lm(formula = stations ~ mag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.871  -7.102  -0.474   6.783  50.244 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242 
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16

This time, we’ll focus our attention on the Coefficients section of the R output. For both the slope and intercept coefficients, we’re provided the test statistic and the p-value in the last two columns of the hypothesis test. Both the slope and intercept have large test statistics (-43.06 and 51.23 respectively) and consequently small p-values (both essentially 0), which means that we will reject the null hypothesis for both the slope and the intercept and conclude that both parameters are significant to our SLR Model. Therefore, the model is worth conducting. If instead one or both of our coefficients produced a p-value larger than our alpha significance level, we would fail to reject the null and remove the insignificant parameter(s) from the model.

F Distribution & F-Test

Another check that we can perform to test the significance of our model parameters is an F-test. Before we discuss this test, let’s review the F-distribution briefly.

Here’s a cute picture to help conceptualize the distribution.

curve(df(x, df1=100, df2=10), from=0, to=5,
      xlab="F-Statistic",
      ylab="Probability",
      main="F-Distribution",
      col = "navy")

The F test produces an F-statistic that is a ratio of the explained variation in the response variable from the model to the unexplained variation. When the explained variation is larger in respect the unexplained variation, the test produces a large F-stat. From the distribution above, a larger F-stat leads to a smaller p-value, and consequently rejecting the null and concluding that the slope is significant.

Just like the previous hypotheses test, the F-test will tell us if our regression coefficients are significant, i.e. do we need to include them in the SLR Model. Our null and alternative hypothesis stay the same as our previous test.

For our quakes linear model, when the proportion that our model explains the variation in number of stations reporting is large compared to the unexplained variation, we believe that our model predictors are significant and impact the number of stations. Thus, we would fail to reject the null and conclude that magnitude is a significant predictor in the model.

Again, our good friend the summary function can provide us with the F-test results.

Summary(model) F-test

summary(Quake.mod)
## 
## Call:
## lm(formula = stations ~ mag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.871  -7.102  -0.474   6.783  50.244 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242 
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16

F-test results are located in the final line of the R output. Here, we see our F-stat of 2625 with 998 degrees of freedom and our p-value of essentially 0. With a huge F-stat and a small p-value, we will reject the null hypothesis and conclude that magnitude is a significant predictor of stations.

The p-value from our F-test should exactly match the p-value for the slope from our previous t-test hypothesis. This is because the F distribution produces the same results as the t distribution when applied to SLR Models.

Top