Multiple Regression Plotting and Interactions

Prelude.

Multiple linear regression comes with several complicating factors, not the least of which is interaction effects. We also have the need to isolate certain variables and figure out what their effect is on the response variable. We left off on this at the end of the last lab. We will tackle both here, starting with interaction effects.

Interaction Effects.

Understanding an interaction effect in a linear regression model is usually difficult when using just the basic output tables and looking at the coefficients. We can include an interaction effect in our model and see if it is significant, but visualizing that effect is a different story. There are now packages that help with interpretation. The interactions package provides several functions that can help analysts probe more deeply.

The tools described here require at least one variable to be continuous. A separate vignette describes cat_plot, which handles the plotting of interactions in which all the focal predictors are categorical variables.

First, we use example data from state.x77 that is built into R. We will fit a model of Income (per capita) as a function of some predictors (here, illiteracy, murder rate, and HS graduation rate) as we’ve done before, but introduce some new tricks to view the output. Note the symbols used: Illiteracy*Murder, means we include the main effects of Illiteracy and Murder, as well as the interaction between the two, and we include HS Grad as a variable too. The backticks around HS Grad (HS Grad) are put there because it’s a non-syntactic name (in this case, it has a space which is not allowed).

## 
## Call:
## lm(formula = Income ~ Illiteracy * Murder + `HS Grad`, data = states)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -916.27 -244.42   28.42  228.14 1221.16 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1414.46     737.84   1.917  0.06160 .  
## Illiteracy          753.07     385.90   1.951  0.05724 .  
## Murder              130.60      44.67   2.923  0.00540 ** 
## `HS Grad`            40.76      10.92   3.733  0.00053 ***
## Illiteracy:Murder   -97.04      35.86  -2.706  0.00958 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 459.5 on 45 degrees of freedom
## Multiple R-squared:  0.4864, Adjusted R-squared:  0.4407 
## F-statistic: 10.65 on 4 and 45 DF,  p-value: 3.689e-06

We see that illiteracy is barely non-sig. for slope, but the other variables have slopes sig. diff. from 0.

anova(fiti)

## Analysis of Variance Table
## 
## Response: Income
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## Illiteracy         1 3534351 3534351 16.7367 0.0001755 ***
## Murder             1  217848  217848  1.0316 0.3152121    
## `HS Grad`          1 3699681 3699681 17.5196 0.0001303 ***
## Illiteracy:Murder  1 1546371 1546371  7.3227 0.0095844 ** 
## Residuals         45 9502841  211174                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova shows us that all variables are contributing to the explanation of variance, except Murder. Yet, the interaction is still sig.

plot(states$Income~states$Murder)

If we plot this, we see there is no obvious relationship.

#Let's try a new function for summarizing model output
summ(fiti)

## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(4,45) = 10.65, p = 0.00
## R² = 0.49
## Adj. R² = 0.44 
## 
## Standard errors: OLS
## ----------------------------------------------------------
##                              Est.     S.E.   t val.      p
## ----------------------- --------- -------- -------- ------
## (Intercept)               1414.46   737.84     1.92   0.06
## Illiteracy                 753.07   385.90     1.95   0.06
## Murder                     130.60    44.67     2.92   0.01
## HS Grad                     40.76    10.92     3.73   0.00
## Illiteracy:Murder          -97.04    35.86    -2.71   0.01
## ----------------------------------------------------------

Our interaction term is significant, suggesting some more probing is warranted to see what’s going on. It’s worth recalling that you shouldn’t focus too much on the main effects of terms included in the interaction since they are conditional on the other variable(s) in the interaction being held constant at 0.

Plotting Interactions.

A versatile, and oftentimes most interpretable, method for understanding interaction effects is via plotting. The package interactions provides interact_plot as a relatively pain-free method to get good-looking plots of interactions using ggplot2 on the backend.

Keep in mind that the default behavior of interact_plot is to mean-center all continuous variables not involved in the interaction so that the predicted values are more easily interpreted, i.e., you are holding other variables constant (here, at their mean) while you look at the effects of the variables of interest. You can disable this by adding centered = “none”. You can choose specific variables by providing their names in a vector to the centered argument. Look up the arguments for the function interact_plot in the help before proceeding.

1. What do the arguments mean in the interact_plot() code below:

interact_plot(fiti, pred = Illiteracy, modx = Murder, plot.points = TRUE)

“fiti” is our linear regression model function. “pred = Illiteracy” assigns the predictor variable as being Illiteracy. “modx = Murder” calls out Murder as the variable that will be moderated to different levels in the interaction, to see how the model is affected. “plot.points = TRUE” ensures that the actual points are plotted as well as the model lines, colored by the value of the moderator variable.

By default, with a continuous moderator (here, Murder) you get three lines: 1 standard deviation above and below the mean and the mean itself. If you specify modx.values = “plus-minus”, the mean of the moderator is not plotted, just the two +/- SD lines. You may also choose “terciles” to split the data into three equal-sized groups, representing the upper, middle, and lower thirds of the distribution of the moderator, and get the line that represents the median of the moderator within each of those groups.

If your moderator is a factor, each level will be plotted and you should leave modx.values = NULL, the default. Here is an example using the built in iris data we’ve used before.

fitiris <- lm (Petal.Length ~ Petal.Width * Species, data = iris)
interact_plot(fitiris, pred = Petal.Width, modx = Species, plot.points = TRUE)

2. Review what you have read about interaction effects and describe how you would interpret the two plots you just made, one with the States data dn the other with the Iris data.

“Here we see that the relationship between mpg and car wt is variable by cylinders. I know virtually nothing about cars, but more cylinders typically means more gas, although, here we can see that at some level of wt (around 4800 lbs), heavy vehicles are more fuel efficient when they have more cylinders. Compare this output to what you saw in the tree example (volume~ht*girth). Similar, right? Here we were able to provide levels of cyl over which to predict, which makes the plot more relevant.”

From the plot with the States data, we see an interacting effect between murder rate and illiteracy. When murder is low (1 SD below the mean), income increases as illiteracy increases. However, when murder rates are high (1 SD above the mean), increasing illiteracy decreases income. This shows that when illiteracy is below a certain point (~ 1.4), murder rate has a positive effect on income. Beyond that point, increased murder rate has a negative impact on income.

In the Iris data, we can see an interaction between petal width and species, but only for versicolor and virginica species. Setosa petal length is predicted well by petal width, but for versicolor and virginica species, species and petal width predict petal length. This suggests that versicolor and virginica irises have other influencing factors that are determining petal length as well as petal width.

Plotting partial residuals in multiple regression.

In more complex regressions, plotting the observed data can sometimes be relatively uninformative because the points seem to be all over the place when plotting the observed values with only one of many predictors, especially if it is one without a strong relationship.

You’ll recall from last lab, we used the cars data and went through a rather laborious process to get the partial effects using residuals. The process is fitting the full model, fitting each univariate model (including models of the predictors, x1~x2 and x2~x1), and pulling the residuals from all. I leave the code here for you to explore. The output plots are nice examples.

#Setting up the data
mtcars_added_variable_plots <- mtcars
model_wt_and_cyl <- lm(mpg ~ cyl + wt, data = mtcars)
model_wt <- lm(mpg ~ wt, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$wt_residuals <- residuals(model_wt) 
model_cyl <- lm(mpg ~ cyl, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$cyl_residuals <- residuals(model_cyl) 
model_wt_against_cyl <- lm(wt ~ cyl, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$wt_against_cyl_residuals <- residuals(model_wt_against_cyl) 
model_cyl_against_wt <- lm(cyl ~ wt, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$cyl_against_wt_residuals <- residuals(model_cyl_against_wt)

one <- ggplot(mtcars_added_variable_plots, aes(x = cyl_against_wt_residuals, y = wt_residuals)) +
  geom_point() + geom_smooth(method = 'lm', color = '#e76254') +
    labs(title = "Mtcars", subtitle = "Partial Regression: mpg ~ wt + cyl") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "holding wt constant") +
  xlab("Cylinder \n (cyl_against_wt_residuals)") +
  ylab("MPG \n (wt_residuals)")
one

## `geom_smooth()` using formula = 'y ~ x'

two <- ggplot(mtcars_added_variable_plots, aes(x = wt_against_cyl_residuals, y = cyl_residuals)) +
  geom_point() + geom_smooth(method = 'lm', color = '#ef8a47') +
  labs(title = "Mtcars", subtitle = "Partial Regression: mpg ~ wt + cyl") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "holding cyl constant") +
  xlab("Weight \n (wt_against_cyl_residuals)") +
  ylab("MPG \n (cyl_residuals)")
two

## `geom_smooth()` using formula = 'y ~ x'

And we can do it a simpler way using the car package.

c3<-lm(mpg~cyl+wt, data=mtcars)
avPlots(c3)

##### New Function. I hunted around for some new means for doing this and the interactions package seems to be useful.

For an example, let’s take a look at a model using a transformed version of the mtcars data. We will use the mpg dataset predicting the city miles per gallon (cty) based on several variables, including model year, type of car, fuel type, drive type, and an interaction between engine displacement (displ) and number of cylinders in the engine (cyl). Use str() to figure out what form all the variables are in within the dataset.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Now that you know what’s in the dataset, let’s take a look at the regression output associated with the model below.

data(cars)
fitc <- lm(cty ~ year + cyl * displ + class + fl + drv, data = mpg)
summary(fitc)

## 
## Call:
## lm(formula = cty ~ year + cyl * displ + class + fl + drv, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9772 -0.7164  0.0018  0.7690  5.9314 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -200.97599   47.00954  -4.275 2.86e-05 ***
## year               0.11813    0.02347   5.033 1.01e-06 ***
## cyl               -1.85648    0.27745  -6.691 1.86e-10 ***
## displ             -3.56467    0.65943  -5.406 1.70e-07 ***
## classcompact      -2.60177    0.92972  -2.798 0.005597 ** 
## classmidsize      -2.62996    0.93273  -2.820 0.005253 ** 
## classminivan      -4.40817    1.03844  -4.245 3.24e-05 ***
## classpickup       -4.37322    0.93416  -4.681 5.02e-06 ***
## classsubcompact   -2.38384    0.92943  -2.565 0.010997 *  
## classsuv          -4.27352    0.86777  -4.925 1.67e-06 ***
## fld                6.34343    1.69499   3.742 0.000233 ***
## fle               -4.57060    1.65992  -2.754 0.006396 ** 
## flp               -1.91733    1.58649  -1.209 0.228158    
## flr               -0.78873    1.56551  -0.504 0.614901    
## drvf               1.39617    0.39660   3.520 0.000525 ***
## drvr               0.48740    0.46113   1.057 0.291694    
## cyl:displ          0.36206    0.07934   4.564 8.42e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.526 on 217 degrees of freedom
## Multiple R-squared:  0.8803, Adjusted R-squared:  0.8715 
## F-statistic: 99.73 on 16 and 217 DF,  p-value: < 2.2e-16

summ(fitc)

## MODEL INFO:
## Observations: 234
## Dependent Variable: cty
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(16,217) = 99.73, p = 0.00
## R² = 0.88
## Adj. R² = 0.87 
## 
## Standard errors: OLS
## -------------------------------------------------------
##                            Est.    S.E.   t val.      p
## --------------------- --------- ------- -------- ------
## (Intercept)             -200.98   47.01    -4.28   0.00
## year                       0.12    0.02     5.03   0.00
## cyl                       -1.86    0.28    -6.69   0.00
## displ                     -3.56    0.66    -5.41   0.00
## classcompact              -2.60    0.93    -2.80   0.01
## classmidsize              -2.63    0.93    -2.82   0.01
## classminivan              -4.41    1.04    -4.24   0.00
## classpickup               -4.37    0.93    -4.68   0.00
## classsubcompact           -2.38    0.93    -2.56   0.01
## classsuv                  -4.27    0.87    -4.92   0.00
## fld                        6.34    1.69     3.74   0.00
## fle                       -4.57    1.66    -2.75   0.01
## flp                       -1.92    1.59    -1.21   0.23
## flr                       -0.79    1.57    -0.50   0.61
## drvf                       1.40    0.40     3.52   0.00
## drvr                       0.49    0.46     1.06   0.29
## cyl:displ                  0.36    0.08     4.56   0.00
## -------------------------------------------------------

You’ll note from the output that class, fuel type, and drive train are all categorical predictors. As such there is a coefficient estimated for each level of the categorical predictor, relative to the first level of that predictor!

Let’s look at the interaction plot, since we now know how to do that:

interact_plot(fitc, pred = displ, modx = cyl, plot.points = TRUE,
              modx.values = c(4, 5, 6, 8))

Hmm, doesn’t that look…bad? You can kind of see the pattern of the interaction, but the predicted lines don’t seem to match the data very well. But we included a lot of variables besides cyl and displ in this model and they may be responsible for some of this variation. Remember, to make this plot, we are not using the other variables.

Here is where we really need partial residual plots–they are designed to help with deciphering complex model visualizations. You can learn more about the technique and theory in Fox and Weisberg (2018). Another place to generate partial residual plots is in Fox’s effects package.

Using the same function and the argument partial.residuals = TRUE what is plotted instead is the observed data with the effects of all the control variables accounted for. In other words, the value cty for the observed data is based only on the values of displ, cyl, and the model error. The variables class, fl, and drv are held constant, but we allow the variables in the plot to vary across their ranges. Let’s take a look.

interact_plot(fitc, pred = displ, modx = cyl, partial.residuals = TRUE,
              modx.values = c(4, 5, 6, 8))

Compare this to the plot above.

By plotting the partial residual points, the partial regression lines fit the data much better. We can see there is a shift in city miles per gallon at about 5 displacement from fewer cylinders to more cylinders being more efficient.

Now we can understand how the observed data and the model relate to each other much better. One insight is how the model really underestimates the values at the low end of displacement and cylinders. You can also see how much the cylinders and displacement seem to be correlated each other, which makes it difficult to say how much we can learn from this kind of model.

Confidence Intervals

There seemed to be some messy confidence intervaals in the previous HW. Let’s try using a different method, using this same function interact_plot.

Another way to get a sense of the precision of the estimates from you model is by plotting confidence bands. To get started, just set interval = TRUE. To decide how wide the confidence interval should be, express the percentile as a number, e.g., int.width = 0.8 corresponds to an 80% interval.

#For the Income response var.
interact_plot(fiti, pred = Illiteracy, modx = Murder, interval = TRUE,
              int.width = 0.8)

#For the mpg
interact_plot(fitc, pred = displ, modx = cyl, interval = TRUE,
              modx.values = c(4, 5, 6, 8))

HW

For HW you will use the dataset Bugs. It includes a response variable: insect species counts/m^2, and 3 predictors: precipitation (inches/yr), site (3 levels), temperature (deg. C).

3. Fit a model and explore the interaction(s) and partial regression plots

Fit a model of Species Counts as a function of the main effects (temp, precip, and site) plus the interaction between site and temperature. Whether or not your interaction is significant, make an interaction plot for the variables you included in the interaction term. Explore the partial regression plots associated with your model and present one of bug counts and temperature with site as the third variable, and one of bug counts and precipitation with site as a third variable.

Using model selection, identify the most supported model(s). Using the best model, make predictions using predict() and plot the predicted bug counts as a function of precipitation.

Interpret your findings using appropriate statistical inference.

Assessing assumptions.

From the histograms and boxplots of the continuous variables, species count, precipitation, and temperature are normally distributed and have roughly equal variance.

Fit the model.

bugs.lm <- lm(SpeciesCount ~ Precipitation + Temperature * Site, data = bugs)
summary(bugs.lm)

## 
## Call:
## lm(formula = SpeciesCount ~ Precipitation + Temperature * Site, 
##     data = bugs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7314  -2.3804   0.9819   3.4426  11.2354 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -8.55198    7.12893  -1.200   0.2333    
## Precipitation          0.23573    0.01152  20.470   <2e-16 ***
## Temperature            0.94568    0.41232   2.294   0.0241 *  
## Sitesouth             -4.63676   10.63574  -0.436   0.6639    
## Sitewest               6.20029   10.49740   0.591   0.5562    
## Temperature:Sitesouth  0.50788    0.64342   0.789   0.4319    
## Temperature:Sitewest  -0.15525    0.64641  -0.240   0.8107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.259 on 93 degrees of freedom
## Multiple R-squared:  0.8387, Adjusted R-squared:  0.8283 
## F-statistic: 80.58 on 6 and 93 DF,  p-value: < 2.2e-16

summ(bugs.lm)

## MODEL INFO:
## Observations: 100
## Dependent Variable: SpeciesCount
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(6,93) = 80.58, p = 0.00
## R² = 0.84
## Adj. R² = 0.83 
## 
## Standard errors: OLS
## -----------------------------------------------------------
##                                Est.    S.E.   t val.      p
## --------------------------- ------- ------- -------- ------
## (Intercept)                   -8.55    7.13    -1.20   0.23
## Precipitation                  0.24    0.01    20.47   0.00
## Temperature                    0.95    0.41     2.29   0.02
## Sitesouth                     -4.64   10.64    -0.44   0.66
## Sitewest                       6.20   10.50     0.59   0.56
## Temperature:Sitesouth          0.51    0.64     0.79   0.43
## Temperature:Sitewest          -0.16    0.65    -0.24   0.81
## -----------------------------------------------------------

Precipitation and temperature are significant predictors of species count (\(t = 20.47, p < 0.001\) and \(t = 2.294, p < 0.0241\), respectively). Neither site nor the interaction between site and temperature contributed significantly.

There is evidence for an interaction between site and temperature as the slopes of the lines differ and they intersect. However, there is a wide spread of points outside the confidence intervals for all three sites. I made an interaction plot of the partial residuals as well to better visualize how the other variables are influencing the model.

interact_plot(bugs.lm, pred = Temperature, modx = Site, interval = TRUE, int.width = 0.95, partial.residuals = TRUE, x.label = "Temperature (deg C)", y.label = "Species Count", main.title = "Temperature and Site Interaction Effects on Species Count (Residuals)")

The partial regressions by site of temperature fit the data much better when the residuals are plotted, although the south site still has a great deal of spread beyond the confidence intervals of the model. We can much more clearly see the intersection of the north and west sites, showing there is some interacting effect of temperature and site.

To visualize whether any interaction existed between precipitation and site, I made another partial regression plot.

interact_plot(bugs.lm, pred = Precipitation, modx = Site, interval = TRUE, int.width = 0.95, plot.points = TRUE, x.label = "Precipitation (inches/yr)", y.label = "Species Count", main.title = "Precipitation and Site Interaction Effects on Species Count")

## Warning: Precipitation and Site are not included in an interaction with one another
## in the model.

interact_plot(bugs.lm, pred = Precipitation, modx = Site, interval = TRUE, int.width = 0.95, partial.residuals = TRUE, x.label = "Precipitation (inches/yr)", y.label = "Species Count", main.title = "Precipitation and Site Interaction Effects on Species Count (Residuals)")

## Warning: Precipitation and Site are not included in an interaction with one another
## in the model.

We can see from the parallel lines for each site that the slope of the precipitation partial regression does not change as site changes, so there is no significant interaction between them.

By making added-variable plots for each variable (holding the others constant), we can visualize how precipitation has a strong impact on species count, temperature has a moderate impact, and site has a minimal slope depending on the site considered (each is compared to the North site as a standard). Temperature-site interactions can also be shown to have little influence.

Model Selection

First I created models removing the interaction, then sequentially removing a variable.

bugs.lmPTS <- lm (SpeciesCount ~ Precipitation + Temperature + Site, data = bugs)
summary(bugs.lmPTS)

bugs.lmPT <- lm (SpeciesCount ~ Precipitation + Temperature, data = bugs)
summary(bugs.lmPT)

bugs.lmPS <- lm(SpeciesCount ~ Precipitation + Site, data = bugs)
summary(bugs.lmPS)

bugs.lmTS <- lm(SpeciesCount ~ Temperature + Site, data = bugs)
summary(bugs.lmTS)

bugs.lmP <- lm(SpeciesCount ~ Precipitation, data = bugs)
summary(bugs.lmP)

bugs.lmT <- lm(SpeciesCount ~ Temperature, data = bugs)
summary(bugs.lmT)

AIC(bugs.lm, bugs.lmPTS, bugs.lmPT, bugs.lmPS, bugs.lmTS, bugs.lmP, bugs.lmT)

##            df      AIC
## bugs.lm     8 659.3376
## bugs.lmPTS  6 656.4087
## bugs.lmPT   4 660.5995
## bugs.lmPS   5 669.6544
## bugs.lmTS   5 825.0373
## bugs.lmP    3 669.5956
## bugs.lmT    3 830.4953

summary(bugs.lm)

## 
## Call:
## lm(formula = SpeciesCount ~ Precipitation + Temperature * Site, 
##     data = bugs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7314  -2.3804   0.9819   3.4426  11.2354 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -8.55198    7.12893  -1.200   0.2333    
## Precipitation          0.23573    0.01152  20.470   <2e-16 ***
## Temperature            0.94568    0.41232   2.294   0.0241 *  
## Sitesouth             -4.63676   10.63574  -0.436   0.6639    
## Sitewest               6.20029   10.49740   0.591   0.5562    
## Temperature:Sitesouth  0.50788    0.64342   0.789   0.4319    
## Temperature:Sitewest  -0.15525    0.64641  -0.240   0.8107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.259 on 93 degrees of freedom
## Multiple R-squared:  0.8387, Adjusted R-squared:  0.8283 
## F-statistic: 80.58 on 6 and 93 DF,  p-value: < 2.2e-16

summary(bugs.lmPTS)

## 
## Call:
## lm(formula = SpeciesCount ~ Precipitation + Temperature + Site, 
##     data = bugs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.6833  -2.7723   0.7364   3.8346  11.0881 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -10.38855    4.73099  -2.196 0.030537 *  
## Precipitation   0.23651    0.01143  20.696  < 2e-16 ***
## Temperature     1.05053    0.26559   3.955 0.000147 ***
## Sitesouth       3.58387    1.54526   2.319 0.022524 *  
## Sitewest        3.92149    1.55765   2.518 0.013490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.226 on 95 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8301 
## F-statistic: 121.9 on 4 and 95 DF,  p-value: < 2.2e-16

I made models associated with each combination of variables, successively dropping one, then ran an AIC to compare goodness-of-fit. The original model (\(SpeciesCount \sim Precipitation + Temperature * Site\)) as a slightly higher AIC than the model with all three variables but no interaction term. Comparing the \(R^2\) values of each model (\(R^2 = 0.828, F_{(6,93)} = 80.58, p < 0.001\) for the original, \(R^2 = 0.830, F_{(4,95)} = 121.9, p < 0.001\) without the interaction), the interactionless model does explain a small fraction more of the variation in species count, even when accounting for \(R^2\) inflation using adjusted \(R^2\) values. Interestingly, however, when there is an interaction term in the model, sites by themselves are not significant, but when no interaction is included, site becomes significant.

Hence, my final model is:

\[Species Count \sim Precipitation + Temperature + Site \]

Assessing model fit.

Diagnostic plots:

par(mfrow=c(2,2))
plot(bugs.lmPTS)

There are a few points - notably points 21, 22, and 26 - that appear to be skewing the model. Overall, it appears the residuals are normally distributed, with some tailedness in the QQ plot on the lower end, and the plotted residuals show a mostly even spread. The model fits the data well (\(F_{(4,95)} = 121.9, p < 0.001\)) and explains a large proportion of the variance in the data (\(R^2 = 0.830\)).

Checking the model with new data.

predicted_bug<-predict(bugs.lmPTS)
print(predicted_bug)

Plotting predicted counts as a function of precipitation.

Using correlation test to determine relationship:

cor.test(bugs$Precipitation, predicted_bug)

## 
##  Pearson's product-moment correlation
## 
## data:  bugs$Precipitation and predicted_bug
## t = 47.751, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9691571 0.9859673
## sample estimates:
##       cor 
## 0.9791787

ggplot(data = bugs, aes(Precipitation, SpeciesCount)) +
  geom_point() +
  geom_smooth(aes(Precipitation, predicted_bug), se = FALSE) +
  ggtitle("Species Count by Precipitation")+
  xlab("Precipitation (inches/yr)")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

From the correlation, we can see that the model produces predicted values that are highly correlated with precipitation (\(t=47.74, p < 0.001, r = 0.979\)).

In conclusion, bug species count is predicted by precipitation, temperature, and site (\(F_{(4,95)} = 121.9, p < 0.001\)). As precipitation and temperature increase, bug species counts increase. From our data, there is no significant interaction between temperature and site, though site had a significant impact when considered in the model alone.