Chapter Four was on Multiple Linear Regression. We learned about creating regression formulas where the response variable depends on more than just one predictor variable. We used the FTest to determine whether the response variable was linked to any of the predictors. We learned that in Multiple Linear Regression, there can be a few conclusions and observations made. One is called Groupwise Means. This is when you isolate one of the predictor variables and see if it has any affect on the response variable. In the example we used, we wanted to see if there was a difference in weight between men who played football and men who ran cross country. We made a regression model that considered the sport they played and their height. We then isolated the sport to see if there was a difference in weight only based of the sport and not considering height. Then we swithced and just looked at the height and weight correlation. This is called Single Linear Regression which was covered in chapter three. Next, we looked at parallel lines also known as multiple linear regression. Here we want to see if the weight of a male althlete depends on either their height or sport. Lastly, we looked at the relationship between sport and height using an interaction model. I will dissect all these further in the comming sections.
When we look at a model using groupwise means, we look at just the predictor that has a 1 or 0 value associated with it. For example, either you play football (1) or you are a cross country runner(0).
head(quakes)
## lat long depth mag stations
## 1 -20.42 181.62 562 4.8 41
## 2 -20.62 181.03 650 4.2 15
## 3 -26.00 184.10 42 5.4 43
## 4 -17.97 181.66 626 4.1 19
## 5 -20.42 181.96 649 4.0 11
## 6 -19.68 184.31 195 4.0 12
data("quakes")
attach(quakes)
I loaded the quakes data to do some analysis on. I want to see if longitude and depth of an earthquake effect the magnitude of the earthquake. Depth refers to how deep under the earths surface the earthquake occured. Scientifically, this number ranges from 0 - 700 km. Longitude refers to east of west on a map. All these earthquakes occured in a cube near Fiji since 1964. So we are looking at the longitude and it will only vary by a few degrees.
First, let’s make a linear regression model that depends on just the depth of the earthquake. This is a singular regression model. This is much like what we did in chapter three.
Let’s start by simply plotting the data and having a quick look.
plot(depth, mag, ylab = "Magnitude of Earthquake", xlab = "Depth of Earthquake(km)
", main = "Magnitude vs. Depth of Earthquakes")
It looks like there is a slight trend downward. This would imply that the closer the earthquake is to the surface, the higher the magnitude will be. From my very limited knowledge about earthquakes, I would say that this makes sense.
We can now create a linear regression model where the magnitude of the earthquake only depends on the depth. here is the linear regression model:
mySLR <- lm(mag ~depth)
mySLR
##
## Call:
## lm(formula = mag ~ depth)
##
## Coefficients:
## (Intercept) depth
## 4.754599 -0.000431
summary(mySLR)
##
## Call:
## lm(formula = mag ~ depth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72012 -0.29642 -0.03694 0.19818 1.70014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.755e+00 2.179e-02 218.168 < 2e-16 ***
## depth -4.310e-04 5.756e-05 -7.488 1.54e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3921 on 998 degrees of freedom
## Multiple R-squared: 0.05319, Adjusted R-squared: 0.05225
## F-statistic: 56.07 on 1 and 998 DF, p-value: 1.535e-13
We can see here that the p-value is 1.535 * 10 ^-13. Thus, the pvalue < .05 and so we know that the depth of the hurricane is significant.
plot(depth, mag, ylab = "Magnitude of Earthquake", xlab = "Depth of Earthquake(km)
", main = "Magnitude vs. Depth of Earthquakes")
abline(4.754599, - .000431)
Now we can plot the data with our linear regression line. We see that there is a slight trend downward. This tells us that magnitudes of earthquakes that happen near the surface are higher than the magnitudes of earthquakes that happen deep below the surface. This lines up with out general observations and it makes sense.
Now, let’s do another single linear regression that just analyzes the longitude of the hurricane and see if that alone affects the magnitude. Again, we will start with a simple plot:
plot(long, mag, ylab = "Magnitude of Earthquake", xlab = "Longitude of Earthquake(km)
", main = "Magnitude vs. Longitude of Earthquakes")
This pot shows that the magnitudes of earthquakes that happen between 165 - 175 are typically higher. It also seems more likely to have a earthquake between 180 - 185 longitude. This is interesting data for people living in Fiji. We see that the earthquakes are more often above a 5.75 magnitude when the longitude is less than 175. And when the longitude is less than 175 there are few earthquakes with magnitude less than 4.25. Magnitudes less than 4.25 are common if the longitude is above 180. So we would expect there to be a correlation between longitude and magnitude.
Now, let’s make our single linear regression and anaylze the results:
slr <- lm(mag ~ long)
slr
##
## Call:
## lm(formula = mag ~ long)
##
## Coefficients:
## (Intercept) long
## 6.68148 -0.01148
summary(slr)
##
## Call:
## lm(formula = mag ~ long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77512 -0.29556 -0.06355 0.22608 1.64360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.681481 0.371501 17.985 < 2e-16 ***
## long -0.011485 0.002069 -5.551 3.64e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3969 on 998 degrees of freedom
## Multiple R-squared: 0.02995, Adjusted R-squared: 0.02898
## F-statistic: 30.82 on 1 and 998 DF, p-value: 3.637e-08
It looks like the p-value is 3.637 * 10 ^ -8. Thus, the p-value < .05 and therefore we can see that the longitude is significant.
Now, let us plot this with the data:
plot(long, mag, ylab = "Magnitude of Earthquake", xlab = "Longitude of Earthquake(km)
", main = "Magnitude vs. Longitude of Earthquakes")
abline(6.68148 , -.01148)
The Singular Linear Regression Model for the depth is:
magnitude = 4.754599 - .000431(depth)
The Singular Linear Regression Model for the longitude is:
magnitude = 6.68148 + -.01148*(longitude)
Now, we can make a multiple linear regression model so analyze the impact that the depth and longitude have on the hurricane’s magnitude. We have one response variable, the magnitude, and two predictor variables (the longitude and the depth). We want to make a model that incorporates both of these predictors in estimating the magnitude of a hurricane.
The general formula for this is:
\(\hat{magnitude}\) = \(\sf{\beta _ {0}}\) + \(\sf{\beta _ {2}}\) (longitude) + \(\sf{\beta _ {1}}\) (depth)
so \(\sf{\beta _ {2}}\) is the slope coefficient in front of longitude and \(\sf{\beta _ {1}}\) is the slope coefficient in front of depth. This slope tells us how much the magnitude will change if the longitude changes by one degree or the dept changes by 1km. These numbers will probably be small since the magnitude is on such a small scale. It ranges from about 0 to 10.
This is the multiple linear regression in R. It is sometimes called parallel lines depending on the outcome. Let’s take a look and see what we find out.
myLR <- lm(mag ~ long + depth)
myLR
##
## Call:
## lm(formula = mag ~ long + depth)
##
## Coefficients:
## (Intercept) long depth
## 6.4424070 -0.0094717 -0.0003925
summary(myLR)
##
## Call:
## lm(formula = mag ~ long + depth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79367 -0.29018 -0.05471 0.21323 1.59508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.442e+00 3.650e-01 17.650 < 2e-16 ***
## long -9.472e-03 2.045e-03 -4.632 4.10e-06 ***
## depth -3.925e-04 5.758e-05 -6.816 1.62e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3882 on 997 degrees of freedom
## Multiple R-squared: 0.07314, Adjusted R-squared: 0.07128
## F-statistic: 39.34 on 2 and 997 DF, p-value: < 2.2e-16
We see that the intercept, or \(\sf{\beta _ {0}}\) = 6.4424070. This intercept tells us that a hurricane that happens at 0 degrees longitude and right on the surface would have a magnitude of 6.44. This is a major extrapolation because all of our longitudes fall between 165 and 195. So we know that this model is not sufficient for estimating longitudes of 0 degrees. This would be a useful intercept if it wasn’t such an extrapolation on our data.
We see that \(\sf{\beta _ {1}}\) = -.0003925 and that \(\sf{\beta _ {2}}\) = -0.0094717
This means our multiple linear regression equation is: \(\hat{magnitude}\) = 6.4424070 -0.0094717 (longitude) - -.0003925 (depth)
We can see from the summary that the p-value is 2.210^-16 which is less than .05 so we know that either the longitude or the depth is significant. We see the p-value next to longitude is 4.10 10^-4 which means the longitude is significant and we see the p-value next to the depth is 1.62 * 10 ^ -11. So we know the depth is also significant.
We see that our final multiple linear regression model is \(\hat{magnitude}\) = 6.4424070 -0.0094717 (longitude) - -.0003925 (depth) and that both \(\sf{\beta _ {1}}\) and \(\sf{\beta _ {2}}\) are significant. We saw this when we did the single linear regression models above.
We want to know if we could make a regression model with less perdictors. In order to check this out, we use our complete model and a reduced model and run an anova test on them and analyze the results.
For our complete model we have:
myComplete <- lm(mag ~ long + depth)
For our reduced1 model we have: (this model only depends on longitude)
red1 <- lm(mag ~ long)
For our reduced2 model we have: (this model only depends on depth)
red2 <- lm(mag ~ depth)
Now, we can run anova testing on these models.
First, we test our complete model with the reduced1 model:
anova(myComplete,red1)
## Analysis of Variance Table
##
## Model 1: mag ~ long + depth
## Model 2: mag ~ long
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 997 150.21
## 2 998 157.21 -1 -6.9993 46.457 1.619e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We set up our hypothesis testing to be:
\(\sf{H _ {0}}\) : Don’t include longitude \(\sf{H _ {1}}\) : Include longitude
We get our p-value to be = 1.619 * 10 ^ -11 This means p-vale < alpha and we reject the null. So we see that the longitude should be included.
Now, we can test our second model:
anova(myComplete,red2)
## Analysis of Variance Table
##
## Model 1: mag ~ long + depth
## Model 2: mag ~ depth
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 997 150.21
## 2 998 153.44 -1 -3.2327 21.457 4.098e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We again see the p-value = 4.098 * 10 ^ -6 < .05 and so we reject the null. We conclude that depth should be included. # Interaction Model We are now going to look at an interaction model of this data. In order to do that, we will use the function in R called lm(mag ~ depth * long, quakes). The general interaction model is: lm(y ~ x1 * x2) and we want to know if the relationship between y and x1 depends on the value/category of x2.
In our case we want to know if the relationship between magnitude and depth depends on the longitude.
The general equation is:
\(\hat{magnitude}\) = \(\sf{\beta _ {0}}\) + \(\sf{\beta _ {2}}\) (longitude) + \(\sf{\beta _ {1}}\) (depth) + \(\sf{\beta _ {3}}\) (longitude*depth)
myInteraction <- lm(mag ~ depth * long, quakes)
myInteraction
##
## Call:
## lm(formula = mag ~ depth * long, data = quakes)
##
## Coefficients:
## (Intercept) depth long depth:long
## 5.4723372 0.0057888 -0.0040743 -0.0000343
Our equation becomes:
\(\hat{magnitude}\) = 5.4623372 + .0057888(depth) - .0040743(longitude) - .0000343(depth * longitude)
summary(myInteraction)
##
## Call:
## lm(formula = mag ~ depth * long, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81459 -0.28777 -0.06269 0.21399 1.60562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.472e+00 5.703e-01 9.595 <2e-16 ***
## depth 5.789e-03 2.797e-03 2.070 0.0387 *
## long -4.074e-03 3.182e-03 -1.280 0.2007
## depth:long -3.430e-05 1.552e-05 -2.211 0.0273 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3874 on 996 degrees of freedom
## Multiple R-squared: 0.07767, Adjusted R-squared: 0.07489
## F-statistic: 27.96 on 3 and 996 DF, p-value: < 2.2e-16
From the summary, we see that the p-value is 2.2* 10 ^ -16.
We see that the depth or the longitude helps determine the magnitude.
However, from the data above, we see a few interesting things: (Intercept) 5.472e+00 5.703e-01 9.595 <2e-16 ** depth 5.789e-03 2.797e-03 2.070 0.0387
long -4.074e-03 3.182e-03 -1.280 0.2007
depth:long -3.430e-05 1.552e-05 -2.211 0.0273 *
We see that the p-value on the depth is .0387 which is less than .05 so it is significant. the p-value on longitude is .2007 which is not less than .05, so it is not significant. the p-value on depth*long is .0273 which is less than .05 so it is significant.
This means we can do a partial f-test to determine if we could drop one of the predictors. I wil lexplain this shortly.
Our final equation for the interaction model is: \(\hat{magnitude}\) = 5.4623372 + .0057888(depth) - .0040743(longitude) - .0000343(depth * longitude) We believe that there may be an unnecessary predictor and would like to investigate this.
We can use anova to determine whether both predictors are needed. In our case, we have only two predictors. Anova could also be used in the case where there are more than two predictors. In this case, there are only two.
Our complete model will use all the predictors. Out reduced model will use only depth as a predictor(since the p-value on length is insignificant, we are going to try dropping it).
myComplete <- lm(mag ~ depth * long, quakes)
myR1 <- lm(mag ~ depth )
myR2 <- lm(mag ~ long )
We can do what we did above where we compare the complete model to each reduced model.
First we look at the model that only includes depth (excludes longitude)
anova(myComplete, myR1)
## Analysis of Variance Table
##
## Model 1: mag ~ depth * long
## Model 2: mag ~ depth
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 996 149.48
## 2 998 153.44 -2 -3.9662 13.214 2.168e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the p-value = 2.16 * 10 ^ -6 < .05 and therefore we should reject the null. This means that we want to include depth in our model.
Now we can look at the model that excludes depth and only looks at longitude:
anova(myComplete, myR2)
## Analysis of Variance Table
##
## Model 1: mag ~ depth * long
## Model 2: mag ~ long
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 996 149.48
## 2 998 157.21 -2 -7.7328 25.763 1.234e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the p-value = 1.234 * 10 ^ -11 < .05 and therefore we should reject the null. This means that we want to include longitude in our model.
The reduced model here is the multiple linear regression model that we created above.
myComplete <- lm(mag ~ depth * long, quakes)
red <- lm(mag ~ depth + long)
We can run an anove complete to get:
anova(myComplete, red)
## Analysis of Variance Table
##
## Model 1: mag ~ depth * long
## Model 2: mag ~ depth + long
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 996 149.48
## 2 997 150.21 -1 -0.73346 4.8872 0.02728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is .02728 < .05. Thus we want to include both the depth and longitude in our model.