R Guide Chapter Four

Introduction

Chapter Four was on Multiple Linear Regression. We learned about creating regression formulas where the response variable depends on more than just one predictor variable. We used the FTest to determine whether the response variable was linked to any of the predictors. We learned that in Multiple Linear Regression, there can be a few conclusions and observations made. One is called Groupwise Means. This is when you isolate the Categorical variable and see if it has any affect on the response variable. In the example we used, we wanted to see if there was a difference in weight between men who played football and men who ran cross country. We made a regression model that considered the sport they played and their height. We then isolated the sport to see if there was a difference in weight only based of the sport and not considering height. Then we swithced and just looked at the height and weight correlation. When you isolate just the predictor variable it is called Single Linear Regression which was covered in chapter three. Next, we looked at parallel lines also known as multiple linear regression. Here we wanted to see if the weight of a male althlete depends on either their height or sport. Lastly, we looked at the relationship between sport and height using an interaction model. I will dissect all these methods further in the coming sections.

Groupwise Means

When we look at a model using groupwise means, we look at just the categorical predictor variable which has a numeric value (such as 1 or 0) associated with it. For example, either you play football (1) or you are a cross country runner(0). The model may look like this: \(\hat{weight}\) = \(\sf{\beta _ {0}}\) + \(\sf{\beta _ {1}}\) (height) + \(\sf{\beta _ {1}}\) (sport = 0 or 1) We did this in class, I just wanted to address a categorical variable since my example only uses quantitative predictor variables.

Earthquake Data

Let’s go ahead and load my data on earthquakes.

head(quakes)

##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

data("quakes")
attach(quakes)

I loaded the quakes data to do some analysis on. I want to see if longitude and depth of an earthquake effect the magnitude of the earthquake. Depth refers to how deep under the earths surface the earthquake occured. Scientifically, this number ranges from 0 - 700 km. Longitude refers to east of west on a map. All these earthquakes occured in a cube near Fiji since 1964. So we are looking at the longitude and it will only vary by a few degrees.

Single Linear Regression

First, let’s make a linear regression model that depends on just the depth of the earthquake. This is a singular regression model. This is much like what we did in chapter three.

Let’s start by simply plotting the data and having a quick look.

plot(depth, mag, ylab = "Magnitude of Earthquake", xlab = "Depth of Earthquake(km)
    ", main = "Magnitude vs. Depth of Earthquakes")

It looks like there is a slight trend downward. This would imply that the closer the earthquake is to the surface, the higher the magnitude will be. From my very limited knowledge about earthquakes, I would say that this makes sense.

We can now create a linear regression model where the magnitude of the earthquake only depends on the depth. here is the linear regression model:

mySLR <- lm(mag ~depth)
mySLR

## 
## Call:
## lm(formula = mag ~ depth)
## 
## Coefficients:
## (Intercept)        depth  
##    4.754599    -0.000431

summary(mySLR)

## 
## Call:
## lm(formula = mag ~ depth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72012 -0.29642 -0.03694  0.19818  1.70014 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.755e+00  2.179e-02 218.168  < 2e-16 ***
## depth       -4.310e-04  5.756e-05  -7.488 1.54e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3921 on 998 degrees of freedom
## Multiple R-squared:  0.05319,    Adjusted R-squared:  0.05225 
## F-statistic: 56.07 on 1 and 998 DF,  p-value: 1.535e-13

We can see here that the p-value is 1.535 * 10 ^-13. Thus, the pvalue < .05 and so we know that the depth of the earthquake is significant.

We can now plot the data with the linear regression line.

plot(depth, mag, ylab = "Magnitude of Earthquake", xlab = "Depth of Earthquake(km)
    ", main = "Magnitude vs. Depth of Earthquakes")
abline(4.754599, - .000431)

We see that there is a slight trend downward. This tells us that magnitudes of earthquakes that happen near the surface are higher than the magnitudes of earthquakes that happen deep below the surface. This lines up with our general observations and it makes sense.

Now, let’s do another single linear regression that just analyzes the longitude of the earthquake and see if that alone affects the magnitude. Again, we will start with a simple plot:

plot(long, mag, ylab = "Magnitude of Earthquake", xlab = "Longitude of Earthquake(km)
    ", main = "Magnitude vs. Longitude of Earthquakes")

This pot shows that the magnitudes of earthquakes that happen between 165 - 175 are typically higher. It also seems more likely to have a earthquake between 180 - 185 longitude. This is interesting data for people living in Fiji. We see that the earthquakes are more often above a 5.75 magnitude when the longitude is less than 175. And when the longitude is less than 175 there are few earthquakes with magnitude less than 4.25. Magnitudes less than 4.25 are common if the longitude is above 180. So we would expect there to be a correlation between longitude and magnitude.

Now, let’s make our single linear regression and anaylze the results:

slr <- lm(mag ~ long)
slr

## 
## Call:
## lm(formula = mag ~ long)
## 
## Coefficients:
## (Intercept)         long  
##     6.68148     -0.01148

summary(slr)

## 
## Call:
## lm(formula = mag ~ long)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77512 -0.29556 -0.06355  0.22608  1.64360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.681481   0.371501  17.985  < 2e-16 ***
## long        -0.011485   0.002069  -5.551 3.64e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3969 on 998 degrees of freedom
## Multiple R-squared:  0.02995,    Adjusted R-squared:  0.02898 
## F-statistic: 30.82 on 1 and 998 DF,  p-value: 3.637e-08

It looks like the p-value is 3.637 * 10 ^ -8. Thus, the p-value < .05 and therefore we can see that the longitude is significant.

Now, let us plot this with the data:

plot(long, mag, ylab = "Magnitude of Earthquake", xlab = "Longitude of Earthquake(km)
    ", main = "Magnitude vs. Longitude of Earthquakes")
abline(6.68148 , -.01148)

Summary of Singular Linear Regression

The Singular Linear Regression Model for the depth is:

magnitude = 4.754599 + -.000431*(depth)

The Singular Linear Regression Model for the longitude is:

magnitude = 6.68148 + -.01148*(longitude)

Multiple Linear Regression

Now, we can make a multiple linear regression model so analyze the impact that the depth and longitude have on the earthquakes’s magnitude. We have one response variable, the magnitude, and two predictor variables (the longitude and the depth). We want to make a model that incorporates both of these predictors in estimating the magnitude of a earthquake.

The general formula for this is:

\(\hat{magnitude}\) = \(\sf{\beta _ {0}}\) + \(\sf{\beta _ {2}}\) (longitude) + \(\sf{\beta _ {1}}\) (depth)

so \(\sf{\beta _ {2}}\) is the slope coefficient in front of longitude and \(\sf{\beta _ {1}}\) is the slope coefficient in front of depth. This slope tells us how much the magnitude will change if the longitude changes by one degree or the dept changes by 1km. These numbers will probably be small since the magnitude is on such a small scale. It ranges from about 0 to 10.

This is the multiple linear regression in R. It is sometimes called parallel lines depending on the outcome. Let’s take a look and see what we find out.

myLR <- lm(mag ~ long + depth)
myLR

## 
## Call:
## lm(formula = mag ~ long + depth)
## 
## Coefficients:
## (Intercept)         long        depth  
##   6.4424070   -0.0094717   -0.0003925

summary(myLR)

## 
## Call:
## lm(formula = mag ~ long + depth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79367 -0.29018 -0.05471  0.21323  1.59508 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.442e+00  3.650e-01  17.650  < 2e-16 ***
## long        -9.472e-03  2.045e-03  -4.632 4.10e-06 ***
## depth       -3.925e-04  5.758e-05  -6.816 1.62e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3882 on 997 degrees of freedom
## Multiple R-squared:  0.07314,    Adjusted R-squared:  0.07128 
## F-statistic: 39.34 on 2 and 997 DF,  p-value: < 2.2e-16

We see that the intercept, or \(\sf{\beta _ {0}}\) = 6.4424070. This intercept tells us that a earthquake that happens at 0 degrees longitude and right on the surface would have a magnitude of 6.44. This is a major extrapolation because all of our longitudes fall between 165 and 195. So we know that this model is not sufficient for estimating longitudes of 0 degrees. This would be a useful intercept if it wasn’t such an extrapolation on our data.

We see that \(\sf{\beta _ {1}}\) = -.0003925 and that \(\sf{\beta _ {2}}\) = -0.0094717

This means our multiple linear regression equation is: \(\hat{magnitude}\) = 6.4424070 -0.0094717 (longitude) - -.0003925 (depth)

We can see from the summary that the p-value is 2.2(10^-16) which is less than .05 so we know that either the longitude or the depth is significant. We see the p-value next to longitude is 4.10(10^-4) which means the longitude is significant and we see the p-value next to the depth is 1.62 * 10 ^ -11. So we know the depth is also significant.

Summary of Multiple Linear Regression

We see that our final multiple linear regression model is \(\hat{magnitude}\) = 6.4424070 -0.0094717 (longitude) - -.0003925 (depth) and that both \(\sf{\beta _ {1}}\) and \(\sf{\beta _ {2}}\) are significant. We saw this when we did the single linear regression models above.

Anova on Multiple Regression

We want to know if we could make a regression model with less perdictors. In order to check this out, we use our complete model and a reduced model and run an anova test on them and analyze the results.

For our complete model we have:

myComplete <- lm(mag ~ long + depth)

For our reduced1 model we have: (this model only depends on longitude)

red1 <- lm(mag ~ long)

For our reduced2 model we have: (this model only depends on depth)

red2 <- lm(mag ~ depth)

Now, we can run anova testing on these models.

First, we test our complete model with the reduced1 model:

anova(myComplete,red1)

## Analysis of Variance Table
## 
## Model 1: mag ~ long + depth
## Model 2: mag ~ long
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    997 150.21                                  
## 2    998 157.21 -1   -6.9993 46.457 1.619e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We set up our hypothesis testing to be:

\(\sf{H _ {0}}\) : Don’t include longitude \(\sf{H _ {1}}\) : Include longitude

We get our p-value to be = 1.619 * 10 ^ -11 This means p-vale < alpha and we reject the null. So we see that the longitude should be included.

Now, we can test our second model:

anova(myComplete,red2)

## Analysis of Variance Table
## 
## Model 1: mag ~ long + depth
## Model 2: mag ~ depth
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    997 150.21                                  
## 2    998 153.44 -1   -3.2327 21.457 4.098e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We again see the p-value = 4.098 * 10 ^ -6 < .05 and so we reject the null. We conclude that depth should be included.

Interaction Model

We are now going to look at an interaction model of this data. In order to do that, we will use the function in R called lm(mag ~ depth * long, quakes). The general interaction model is: lm(y ~ x1 * x2) and we want to know if the relationship between y and x1 depends on the value/category of x2.

In our case we want to know if the relationship between magnitude and depth depends on the longitude.

The general equation is:

\(\hat{magnitude}\) = \(\sf{\beta _ {0}}\) + \(\sf{\beta _ {2}}\) (longitude) + \(\sf{\beta _ {1}}\) (depth) + \(\sf{\beta _ {3}}\) (longitude*depth)

myInteraction <- lm(mag ~ depth * long, quakes)
myInteraction

## 
## Call:
## lm(formula = mag ~ depth * long, data = quakes)
## 
## Coefficients:
## (Intercept)        depth         long   depth:long  
##   5.4723372    0.0057888   -0.0040743   -0.0000343

Our equation becomes:

\(\hat{magnitude}\) = 5.4623372 + .0057888(depth) - .0040743(longitude) - .0000343(depth * longitude)

summary(myInteraction)

## 
## Call:
## lm(formula = mag ~ depth * long, data = quakes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81459 -0.28777 -0.06269  0.21399  1.60562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.472e+00  5.703e-01   9.595   <2e-16 ***
## depth        5.789e-03  2.797e-03   2.070   0.0387 *  
## long        -4.074e-03  3.182e-03  -1.280   0.2007    
## depth:long  -3.430e-05  1.552e-05  -2.211   0.0273 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3874 on 996 degrees of freedom
## Multiple R-squared:  0.07767,    Adjusted R-squared:  0.07489 
## F-statistic: 27.96 on 3 and 996 DF,  p-value: < 2.2e-16

From the summary, we see that the p-value is 2.2* 10 ^ -16.

We see that the depth or the longitude helps determine the magnitude.

However, from the data above, we see a few interesting things: (Intercept) 5.472e+00 5.703e-01 9.595 <2e-16 ** depth 5.789e-03 2.797e-03 2.070 0.0387
long -4.074e-03 3.182e-03 -1.280 0.2007
depth:long -3.430e-05 1.552e-05 -2.211 0.0273 *

We see that the p-value on the depth is .0387 which is less than .05 so it is significant. the p-value on longitude is .2007 which is not less than .05, so it is not significant. the p-value on depth*long is .0273 which is less than .05 so it is significant.

This means we can do a partial f-test to determine if we could drop one of the predictors. I wil lexplain this shortly.

Summary of the Interaction Model

Our final equation for the interaction model is: \(\hat{magnitude}\) = 5.4623372 + .0057888(depth) - .0040743(longitude) - .0000343(depth * longitude) We believe that there may be an unnecessary predictor and would like to investigate this.

T- Test on the Multiple Linear Regression Model

We can also do the T-Test on this Multiple Linear Regression Model. In order to do the T-Test, we can use the interactive model. Again, the interaction model looks like this:

iner <- lm(mag~long*depth)

Then we run the summary call on this interaction model and observe the p-value on the interaction term. If the p-value is significant, we know what the relationship between the magnitude and the longitude depends on the depth of the earthquake.

summary(iner)

## 
## Call:
## lm(formula = mag ~ long * depth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81459 -0.28777 -0.06269  0.21399  1.60562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.472e+00  5.703e-01   9.595   <2e-16 ***
## long        -4.074e-03  3.182e-03  -1.280   0.2007    
## depth        5.789e-03  2.797e-03   2.070   0.0387 *  
## long:depth  -3.430e-05  1.552e-05  -2.211   0.0273 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3874 on 996 degrees of freedom
## Multiple R-squared:  0.07767,    Adjusted R-squared:  0.07489 
## F-statistic: 27.96 on 3 and 996 DF,  p-value: < 2.2e-16

The interaction term p-value is .0273 which is less than .05. Therefore, the longitude and magnitude’s relationship does depend on the depth of the earthquake. Since the interaction term is significant. This means that the interaction term is necessary in the model and that we are not working with a “parallel lines” model. This means that the relationship between the magnitude and longitude depends on the depth. It also means that the “depth slope”" does not equal the “longitude slope”.

Partial F-Test on the Multiple Linear Regression Model

We can also use a Partial-F Test to determine whether the relationship between magnitude and longitude depends on the depth of the earthquake. In order to do this, we use anova on the interaction model and the reduced model. It looks like this:

complete <- lm(mag~long*depth)
reduced <- lm(mag ~ long + depth)
anova(reduced, complete)

## Analysis of Variance Table
## 
## Model 1: mag ~ long + depth
## Model 2: mag ~ long * depth
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    997 150.21                              
## 2    996 149.48  1   0.73346 4.8872 0.02728 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value is .02728. This is less than .05, so again we see that the relationship between magnitude and longitude depends on the depth of the earthquake.

Anova for Interaction Model

We can use anova to determine whether both predictors are needed. In our case, we have only two predictors. Anova could also be used in the case where there are more than two predictors. In this case, there are only two.

Our complete model will use all the predictors. Out reduced model will use only depth as a predictor(since the p-value on length is insignificant, we are going to try dropping it).

myComplete <- lm(mag ~ depth * long, quakes)
myR1 <- lm(mag ~ depth )
myR2 <- lm(mag ~ long )

We can do what we did above where we compare the complete model to each reduced model.

First we look at the model that only includes depth (excludes longitude)

anova(myComplete, myR1)

## Analysis of Variance Table
## 
## Model 1: mag ~ depth * long
## Model 2: mag ~ depth
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    996 149.48                                  
## 2    998 153.44 -2   -3.9662 13.214 2.168e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value = 2.16 * 10 ^ -6 < .05 and therefore we should reject the null. This means that we want to include depth in our model.

Now we can look at the model that excludes depth and only looks at longitude:

anova(myComplete, myR2)

## Analysis of Variance Table
## 
## Model 1: mag ~ depth * long
## Model 2: mag ~ long
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    996 149.48                                  
## 2    998 157.21 -2   -7.7328 25.763 1.234e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value = 1.234 * 10 ^ -11 < .05 and therefore we should reject the null. This means that we want to include longitude in our model.

Complete F-Test

We can also do a complete F-test to determine whether the magnitude depends on the depth and/or longitude of the earthquake.

mod <- lm ( mag ~ depth + long)
summary(mod)

## 
## Call:
## lm(formula = mag ~ depth + long)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79367 -0.29018 -0.05471  0.21323  1.59508 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.442e+00  3.650e-01  17.650  < 2e-16 ***
## depth       -3.925e-04  5.758e-05  -6.816 1.62e-11 ***
## long        -9.472e-03  2.045e-03  -4.632 4.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3882 on 997 degrees of freedom
## Multiple R-squared:  0.07314,    Adjusted R-squared:  0.07128 
## F-statistic: 39.34 on 2 and 997 DF,  p-value: < 2.2e-16

We see that the p-value is less than .05. Thus, the magnitude depends on the depth and/or longitude of the earthquake.

Summary of Chapter Four

In chapter four, we learned about Multiple Linear Regression. When doing Multiple Linear Regression, you often have quantitative predictor variables and categorical predictor variables. In my example, I only have quantitative predictor variables. When doing Multiple Linear Regression, there are a few types of problems you can solve. You can break the problem down into Single Linear Regression, Multiple Linear Regression and the interaction model. From each model, you can determine various things about the data. Using the t-test, partial f-test or complete f-test, you can determine which predictors are significant in the model. If predictors are not significant, one can remove them from the model to obtain the same accuracy with less variables. We can also look at the significance of the interaction term. If the interaction term is not significant, we know that the two predictors have the same slope. They would still have different intercepts, but the same slope. We can then call this model “parallel lines model”. This was not the case in my data because the interaction term was significant.