Markdown Author: Jessie Bell, 2023

Libraries Used: tidyverse, gt, ggplot2, reshape2, jtools, interactions, ggpubr, car, GGally, ggpmisc

Answers: GGgreen

Interactions

I. Interaction Effects

states <- data.frame(state.x77)

names(states)
## [1] "Population" "Income"     "Illiteracy" "Life.Exp"   "Murder"    
## [6] "HS.Grad"    "Frost"      "Area"
fiti <- lm(Income ~ Illiteracy * Murder + HS.Grad, data = states)

summ(fiti) #this is a part of the jtools package
Observations 50
Dependent variable Income
Type OLS linear regression
F(4,45) 10.65
0.49
Adj. R² 0.44
Est. S.E. t val. p
(Intercept) 1414.46 737.84 1.92 0.06
Illiteracy 753.07 385.90 1.95 0.06
Murder 130.60 44.67 2.92 0.01
HS.Grad 40.76 10.92 3.73 0.00
Illiteracy:Murder -97.04 35.86 -2.71 0.01
Standard errors: OLS
## NICE!

 #We see that illiteracy is barely non-sig. for slope, but the other variables have slopes sig. diff. from 0.

Our interaction term is significant, suggesting some more probing is warranted to see what’s going on. It’s worth recalling that you shouldn’t focus too much on the main effects of terms included in the interaction since they are conditional on the other variable(s) in the interaction being held constant at 0.

anova(fiti)
## Analysis of Variance Table
## 
## Response: Income
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## Illiteracy         1 3534351 3534351 16.7367 0.0001755 ***
## Murder             1  217848  217848  1.0316 0.3152121    
## HS.Grad            1 3699681 3699681 17.5196 0.0001303 ***
## Illiteracy:Murder  1 1546371 1546371  7.3227 0.0095844 ** 
## Residuals         45 9502841  211174                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The anova shows us that all variables are contributing to the explanation of variance, except Murder.
#Yet, the interaction is still sig.
ggplot(states, aes(Murder, Income))+
  geom_point( color = "plum")#If we plot this, we see there is no obvious relationship
**Figure 1: Trying to plot the interaction to see if there is a relationship. Does not seem to be any relationship between murder and income.**

Figure 1: Trying to plot the interaction to see if there is a relationship. Does not seem to be any relationship between murder and income.

II. Plotting Interactions

A versatile, and oftentimes most interpretable, method for understanding interaction effects is via plotting. The package interactions provides interact_plot as a relatively pain-free method to get good-looking plots of interactions using ggplot2 on the back-end.

Keep in mind that the default behavior of interact_plot is to mean-center all continuous variables not involved in the interaction so that the predicted values are more easily interpreted, i.e., you are holding other variables constant (here, at their mean) while you look at the effects of the variables of interest. You can disable this by adding centered = “none”. You can choose specific variables by providing their names in a vector to the centered argument. Look up the arguments for the function interact_plot in the help before proceeding.

Continuous Moderator

  1. What do the arguments mean in the interact_plot() code below?

    Answer: The first part calls the dataframe you are curious about. The second part uses Illiteracy as a predictor (a continuous column in the data. The moderator is called as Murder (another column in the data with continuous values). We are graphing how Murder influences/interacts with Illiteracy.

interact_plot(fiti, pred=Illiteracy, modx=Murder, colors = c("plum", "purple", "red"), plot.points = T) #OKAY WEIRD. This command does not use a data.frame but a list created by the linear model function lm(). notice here we used fiti and not states. 
**Figure 2: Plotting interactions. The moderator below is Murder. You get three lines: 1 standard deviation above and below the mean and the mean itself. If you specify modx.values = “plus-minus”, the mean of the moderator is not plotted, just the two +/- SD lines. You may also choose “terciles” to split the data into three equal-sized groups, representing the upper, middle, and lower thirds of the distribution of the moderator, and get the line that represents the median of the moderator within each of those groups.**

Figure 2: Plotting interactions. The moderator below is Murder. You get three lines: 1 standard deviation above and below the mean and the mean itself. If you specify modx.values = “plus-minus”, the mean of the moderator is not plotted, just the two +/- SD lines. You may also choose “terciles” to split the data into three equal-sized groups, representing the upper, middle, and lower thirds of the distribution of the moderator, and get the line that represents the median of the moderator within each of those groups.

Categorical Moderator

fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)

interact_plot(fitiris, pred=Petal.Width, modx = Species, plot.points=T, colors=c("#c178f7", "#00ba38", "#00b0f6"))
**Figure 3: Plotting interactions for a categorical moderator. If your moderator is a factor, each level wil be plotted and you should leave modx.values = NULL, the default. Here is an example using the built in iris dataset.**

Figure 3: Plotting interactions for a categorical moderator. If your moderator is a factor, each level wil be plotted and you should leave modx.values = NULL, the default. Here is an example using the built in iris dataset.

  1. Review what you have read about interaction effects (e.g., Zuur et al. 2010 or other readings) and describe how you would interpret the two plots you just made, one with the States data and the other with the Iris data.

    Answer: Interaction effects refer to situations where the effect of one explanatory variable on the response variable varies depending on the level of another explanatory variable. Figure 3 shows Non-parallel lines and suggest an interaction effect.

Discrete Numerical Moderator

c4<-lm(mpg~cyl*wt, data=mtcars)

interact_plot(c4, pred = wt, modx = cyl, plot.points = TRUE,
              modx.values = c(4, 5, 6, 8), colors=c("pink", "red", "steelblue", "brown"))
**Figure 4: Plotting interactions for a discrete numerical moderator. Here we see that the relationship between mpg and car wt is variable by cylinders. I know virtually nothing about cars, but more cylinders typically means more gas, although, here we can see that at some level of wt (around 4800 lbs), heavy vehicles are more fuel efficient when they have more cylinders. Compare this output to what you saw in the tree example (volume~ht*girth). Similar, right? Here we were able to provide levels of cyl over which to predict, which makes the plot more relevant.**

**Figure 4: Plotting interactions for a discrete numerical moderator. Here we see that the relationship between mpg and car wt is variable by cylinders. I know virtually nothing about cars, but more cylinders typically means more gas, although, here we can see that at some level of wt (around 4800 lbs), heavy vehicles are more fuel efficient when they have more cylinders. Compare this output to what you saw in the tree example (volume~ht*girth). Similar, right? Here we were able to provide levels of cyl over which to predict, which makes the plot more relevant.**

III. Plotting Partial Residuals in MR

In more complex regressions, plotting the observed data can sometimes be relatively uninformative because the points seem to be all over the place when plotting the observed values with only one of many predictors, especially if it is one without a strong relationship.

You’ll recall from last lab, we used the cars data and went through a rather laborious process to get the partial effects using residuals. The process is fitting the full model, fitting each univariate model (including models of the predictors, x1~x2 and x2~x1), and pulling the residuals from all. I leave the code here for you to explore. The output plots are nice examples.

#Setting up the data
mtcars_added_variable_plots <- mtcars
model_wt_and_cyl <- lm(mpg ~ cyl + wt, data = mtcars)
model_wt <- lm(mpg ~ wt, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$wt_residuals <- residuals(model_wt) 
model_cyl <- lm(mpg ~ cyl, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$cyl_residuals <- residuals(model_cyl) 
model_wt_against_cyl <- lm(wt ~ cyl, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$wt_against_cyl_residuals <- residuals(model_wt_against_cyl) #
model_cyl_against_wt <- lm(cyl ~ wt, data = mtcars_added_variable_plots)
mtcars_added_variable_plots$cyl_against_wt_residuals <- residuals(model_cyl_against_wt) #

one <- ggplot(mtcars_added_variable_plots, aes(x = cyl_against_wt_residuals, y = wt_residuals)) +
  geom_point() + geom_smooth(method = 'lm', color = '#00bfc4') +
    labs(title = "Mtcars", subtitle = "Parial Regression: mpg ~ wt + cyl") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "holding wt constant") +
  xlab("Cylinder \n (cyl_against_wt_residuals)") +
  ylab("MPG \n (wt_residuals)")


two <- ggplot(mtcars_added_variable_plots, aes(x = wt_against_cyl_residuals, y = cyl_residuals)) +
  geom_point() + geom_smooth(method = 'lm', color = '#fc4f68') +
  labs(title = "Mtcars", subtitle = "Parial Regression: mpg ~ wt + cyl") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "holding cyl constant") +
  xlab("Weight \n (wt_against_cyl_residuals)") +
  ylab("MPG \n (cyl_residuals)")

ggarrange(one, two)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
**Figure 5: Plot the partial residuals SUPER SLOW.**

Figure 5: Plot the partial residuals SUPER SLOW.

#or do it even simpler using the CAR package!

c3 <- lm(mpg~cyl+wt, data=mtcars)
three <- avPlots(c3)
**Figure 6: PLOT THE PARTIAL RESIDUALS WAY FASTER.**

Figure 6: PLOT THE PARTIAL RESIDUALS WAY FASTER.

Lets try this another elevated way.

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
fitc <- lm(cty ~ year + cyl * displ + class + fl + drv, data = mpg)
summary(fitc)
## 
## Call:
## lm(formula = cty ~ year + cyl * displ + class + fl + drv, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9772 -0.7164  0.0018  0.7690  5.9314 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -200.97599   47.00954  -4.275 2.86e-05 ***
## year               0.11813    0.02347   5.033 1.01e-06 ***
## cyl               -1.85648    0.27745  -6.691 1.86e-10 ***
## displ             -3.56467    0.65943  -5.406 1.70e-07 ***
## classcompact      -2.60177    0.92972  -2.798 0.005597 ** 
## classmidsize      -2.62996    0.93273  -2.820 0.005253 ** 
## classminivan      -4.40817    1.03844  -4.245 3.24e-05 ***
## classpickup       -4.37322    0.93416  -4.681 5.02e-06 ***
## classsubcompact   -2.38384    0.92943  -2.565 0.010997 *  
## classsuv          -4.27352    0.86777  -4.925 1.67e-06 ***
## fld                6.34343    1.69499   3.742 0.000233 ***
## fle               -4.57060    1.65992  -2.754 0.006396 ** 
## flp               -1.91733    1.58649  -1.209 0.228158    
## flr               -0.78873    1.56551  -0.504 0.614901    
## drvf               1.39617    0.39660   3.520 0.000525 ***
## drvr               0.48740    0.46113   1.057 0.291694    
## cyl:displ          0.36206    0.07934   4.564 8.42e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.526 on 217 degrees of freedom
## Multiple R-squared:  0.8803, Adjusted R-squared:  0.8715 
## F-statistic: 99.73 on 16 and 217 DF,  p-value: < 2.2e-16
summ(fitc)
Observations 234
Dependent variable cty
Type OLS linear regression
F(16,217) 99.73
0.88
Adj. R² 0.87
Est. S.E. t val. p
(Intercept) -200.98 47.01 -4.28 0.00
year 0.12 0.02 5.03 0.00
cyl -1.86 0.28 -6.69 0.00
displ -3.56 0.66 -5.41 0.00
classcompact -2.60 0.93 -2.80 0.01
classmidsize -2.63 0.93 -2.82 0.01
classminivan -4.41 1.04 -4.24 0.00
classpickup -4.37 0.93 -4.68 0.00
classsubcompact -2.38 0.93 -2.56 0.01
classsuv -4.27 0.87 -4.92 0.00
fld 6.34 1.69 3.74 0.00
fle -4.57 1.66 -2.75 0.01
flp -1.92 1.59 -1.21 0.23
flr -0.79 1.57 -0.50 0.61
drvf 1.40 0.40 3.52 0.00
drvr 0.49 0.46 1.06 0.29
cyl:displ 0.36 0.08 4.56 0.00
Standard errors: OLS

You’ll note from the output that class, fuel type, and drive train are all categorical predictors. As such there is a coefficient estimated for each level of the categorical predictor, relative to the first level of that predictor!

Let’s look at the interaction plot, since we now know how to do that:

interact_plot(fitc, pred = displ, modx = cyl, plot.points = TRUE,
              modx.values = c(4, 5, 6, 8))
**Figure 7: PLOT THE PARTIAL RESIDUALS WAY FASTER, BUT THIS IS BAD. See the fixed version in Figure 8.**

Figure 7: PLOT THE PARTIAL RESIDUALS WAY FASTER, BUT THIS IS BAD. See the fixed version in Figure 8.

THAT LOOKS BAD Let’s fix it.

Hmm, doesn’t that look…bad? You can kind of see the pattern of the interaction, but the predicted lines don’t seem to match the data very well. But we included a lot of variables besides cyl and displ in this model and they may be responsible for some of this variation. Remember, to make this plot, we are not using the other variables.

Here is where we really need partial residual plots–they are designed to help with deciphering complex model visualizations. You can learn more about the technique and theory in Fox and Weisberg (2018). Another place to generate partial residual plots is in Fox’s effects package.

Using the same function and the argument partial.residuals = TRUE what is plotted instead is the observed data with the effects of all the control variables accounted for. In other words, the value cty for the observed data is based only on the values of displ, cyl, and the model error. The varibales class, fl, and drv are held constant, but we allow the variables in the plot to vary across their ranges. Let’s take a look.

interact_plot(fitc, pred = displ, modx = cyl, partial.residuals = TRUE,
              modx.values = c(4, 5, 6, 8))
**Figure 8: The good interaction plot. Now we can understand how the observed data and the model relate to each other much better. One insight is how the model really underestimates the values at the low end of displacement and cylinders. You can also see how much the cylinders and displacement seem to be correlated each other, which makes it difficult to say how much we can learn from this kind of model.**

Figure 8: The good interaction plot. Now we can understand how the observed data and the model relate to each other much better. One insight is how the model really underestimates the values at the low end of displacement and cylinders. You can also see how much the cylinders and displacement seem to be correlated each other, which makes it difficult to say how much we can learn from this kind of model.

IV. Confidence Intervals

There seemed to be some messy confidence intervaals in the previous HW. Let’s try using a different method, using this same function interact_plot.

Another way to get a sense of the precision of the estimates from you model is by plotting confidence bands. To get started, just set interval = TRUE. To decide how wide the confidence interval should be, express the percentile as a number, e.g., int.width = 0.8 corresponds to an 80% interval.

a <- interact_plot(fiti, pred = Illiteracy, modx = Murder, interval = TRUE,
              int.width = 0.8, colors=c("#f3640d", "#ac7322", "#fc4f68"))+
  theme_pubclean()

b <- interact_plot(fitc, pred = displ, modx = cyl, interval = TRUE,
              modx.values = c(4, 5, 6, 8), colors=c("#f8766d", "#b79f00", "#00bf7d", "#f564e3"))+
  theme_pubclean()
ggarrange(a, b)
**Figure 9: Adding confidence intervals the right way.**

Figure 9: Adding confidence intervals the right way.

HW

  • Species Counts \(/ m^2\)
  • Precipitation \(in/yr\)
  • Site (3 levels)
  • Temperature \(^{\circ}\)C
bugz <- read.csv("Bugs.csv")

data exploration



Checking assumptions

z <- ggplot(bugz, aes(Temperature))+
  geom_histogram(bins=30, fill = "purple")+
  labs(x=expression("Temperature"*~degree*C)) #not too normal eh?

y <- ggplot(bugz, aes(Precipitation))+
  geom_histogram(bins=30, fill = "#00b0f6")+
  labs(x=expression("Precipitation (in/yr)")) #looks bimodal

ggarrange(z,y, labels=c("A", "B"))
**Figure 10: Looking at the numerical explanatories to get a sense of what our data are.**

Figure 10: Looking at the numerical explanatories to get a sense of what our data are.

ggpairs(bugz, columns = 2:5, aes(colour=Site))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
**Figure 11: Here you can see that north site has higher average temperatures, and the south site has higher average precipitation. It also seems that precipitation is strongly correlated with species count. I think it will play a big role in our model.**

Figure 11: Here you can see that north site has higher average temperatures, and the south site has higher average precipitation. It also seems that precipitation is strongly correlated with species count. I think it will play a big role in our model.

#south has highest precip

#north has highest temperature

a



Fit a model of Species Counts as a function of the main effects (temp, precip, and site) plus the interaction between site and temperature. Whether or not your interaction is significant, make an interaction plot for the variables you included in the interaction term.

fitbugz <- lm(SpeciesCount ~ Temperature + Precipitation + Site + Site*Temperature, data = bugz)

#the model above is the same as lm(SpeciesCount ~ Precipitation + Temperature * Site)

interact_plot(fitbugz, pred=Temperature, modx = Site, plot.points=T, colors=c("#c178f7", "#00ba38", "#00b0f6"))
**Figure 12: Interaction plot for Site * Temperature. Here you can see that North and West sites are parallel, suggesting that no interaction occurs between them. There is an interaction between South with both North and West sites. This indicates that temperature and species count depends on site.**

Figure 12: Interaction plot for Site * Temperature. Here you can see that North and West sites are parallel, suggesting that no interaction occurs between them. There is an interaction between South with both North and West sites. This indicates that temperature and species count depends on site.

b



Explore the partial regression plots associated with your model and present one of bug counts and temperature with site as the third variable, and one of bug counts and precipitation with site as a third variable.

partial<-fitbugz
avPlots(partial, pch=1, col="#7cae00", col.lines = "#00b6eb")
**Figure 13: Partial regression plots.**

Figure 13: Partial regression plots.

#Setting up the data

model_precip <- lm(SpeciesCount ~ Precipitation, data = bugz)

bugz$precip_residuals <- residuals(model_precip) 

model_temp <- lm(SpeciesCount ~ Temperature, data = bugz)

bugz$temp_residuals <- residuals(model_temp)

model_temp_against_precip <- lm(Temperature ~ Precipitation, data = bugz)

bugz$temp_vs_precip_residuals <- residuals(model_temp_against_precip) #

model_prec_against_temp <- lm(Precipitation ~ Temperature, data =bugz)

bugz$precip_vs_temp_residuals <- residuals(model_prec_against_temp) #


a <- ggplot(bugz, aes(x = precip_vs_temp_residuals, y = temp_residuals)) +
  geom_point(color="#00bfc4") + 
  geom_smooth(method = 'lm', color = '#7cae00') +
  labs(title = "Bugz 1", subtitle = "Partial Regression: SpeciesCount ~ Temperature + Precipitation") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "holding temperature constant") +
  xlab("Temperature \n (precip_against_temp_residuals)") +
  ylab("Species Count \n (temp_residuals)")+
  theme(plot.title = element_text(size = 10, color = "black"), 
      plot.subtitle = element_text(size = 7, color = "darkgrey"), 
      plot.caption = element_text(hjust = .5)) 

b <- ggplot(bugz, aes(x = precip_vs_temp_residuals, y = precip_residuals, col=Site)) +
  geom_point() + 
  geom_smooth(method = 'lm', color = '#7cae00') +
  labs(title = "Bugz 2", subtitle = "Partial Regression: SpeciesCount ~ Temperature + Precipitation") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "holding precipitation constant") +
  xlab("Precipitation \n (temp_against_precip_residuals)") +
  ylab("Species Count \n (temp_residuals)")+
  theme(plot.title = element_text(size = 10, color = "black"), 
        plot.subtitle = element_text(size = 7, color = "darkgrey"), 
        plot.caption = element_text(hjust = .5)) 

ggarrange(a, b, labels=c("A", "B"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
**Figure 14: Partial regression plots, the long way. These are the same as the first two plots in figure 11, horizontally.**

Figure 14: Partial regression plots, the long way. These are the same as the first two plots in figure 11, horizontally.

c



Using model selection, identify the most supported model(s). Using the best model, make predictions using predict() and plot the predicted bug counts as a function of precipitation.

Answer: The optimal model using the step() function to determine the lowest AIC is:

\(lm(SpeciesCount\)~ \(Precipitation * Temperature * Site, data = bugz)\)

fitbugz <- lm(SpeciesCount ~ Precipitation * Temperature * Site, data=bugz)


summ(fitbugz) # adj R^2 =0.83, it seems precip and temp are significant, but site is not. 
Observations 100
Dependent variable SpeciesCount
Type OLS linear regression
F(11,88) 61.36
0.88
Adj. R² 0.87
Est. S.E. t val. p
(Intercept) -1.16 12.99 -0.09 0.93
Precipitation 0.17 0.12 1.44 0.15
Temperature 0.70 0.76 0.93 0.36
Sitesouth 26.87 26.63 1.01 0.32
Sitewest 10.36 19.25 0.54 0.59
Precipitation:Temperature 0.00 0.01 0.30 0.76
Precipitation:Sitesouth -0.21 0.21 -1.02 0.31
Precipitation:Sitewest -0.03 0.16 -0.21 0.84
Temperature:Sitesouth -2.59 1.70 -1.52 0.13
Temperature:Sitewest -0.46 1.16 -0.40 0.69
Precipitation:Temperature:Sitesouth 0.02 0.01 1.71 0.09
Precipitation:Temperature:Sitewest 0.00 0.01 0.26 0.80
Standard errors: OLS
anova(fitbugz) #any significance? Seems the interaction term is not. 
## Analysis of Variance Table
## 
## Response: SpeciesCount
##                                Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Precipitation                   1 18122.7 18122.7 612.2744 < 2.2e-16 ***
## Temperature                     1   464.6   464.6  15.6959 0.0001508 ***
## Site                            2   314.3   157.2   5.3099 0.0066500 ** 
## Precipitation:Temperature       1     0.2     0.2   0.0084 0.9271630    
## Precipitation:Site              2   973.0   486.5  16.4362 8.607e-07 ***
## Temperature:Site                2    13.6     6.8   0.2292 0.7956086    
## Precipitation:Temperature:Site  2    91.1    45.6   1.5390 0.2203193    
## Residuals                      88  2604.7    29.6                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step(fitbugz) #tells us that the model with the interaction term is the best. 
## Start:  AIC=349.99
## SpeciesCount ~ Precipitation * Temperature * Site
## 
##                                  Df Sum of Sq    RSS    AIC
## - Precipitation:Temperature:Site  2    91.104 2695.8 349.43
## <none>                                        2604.7 349.99
## 
## Step:  AIC=349.43
## SpeciesCount ~ Precipitation + Temperature + Site + Precipitation:Temperature + 
##     Precipitation:Site + Temperature:Site
## 
##                             Df Sum of Sq    RSS    AIC
## - Temperature:Site           2     13.57 2709.4 345.93
## <none>                                   2695.8 349.43
## - Precipitation:Temperature  1     63.48 2759.3 349.76
## - Precipitation:Site         2    947.15 3643.0 375.54
## 
## Step:  AIC=345.93
## SpeciesCount ~ Precipitation + Temperature + Site + Precipitation:Temperature + 
##     Precipitation:Site
## 
##                             Df Sum of Sq    RSS    AIC
## <none>                                   2709.4 345.93
## - Precipitation:Temperature  1     76.20 2785.6 346.70
## - Precipitation:Site         2    972.99 3682.4 372.61
## 
## Call:
## lm(formula = SpeciesCount ~ Precipitation + Temperature + Site + 
##     Precipitation:Temperature + Precipitation:Site, data = bugz)
## 
## Coefficients:
##               (Intercept)              Precipitation  
##                  7.419382                   0.082216  
##               Temperature                  Sitesouth  
##                  0.196165                 -13.119366  
##                  Sitewest  Precipitation:Temperature  
##                  2.593992                   0.007041  
##   Precipitation:Sitesouth     Precipitation:Sitewest  
##                  0.141678                   0.011598
# the optimal model given AIC is using all as interaction effects. 

optimal_fit_bugz <- lm(SpeciesCount ~ Precipitation * Temperature * Site, data=bugz)

summ(optimal_fit_bugz)
Observations 100
Dependent variable SpeciesCount
Type OLS linear regression
F(11,88) 61.36
0.88
Adj. R² 0.87
Est. S.E. t val. p
(Intercept) -1.16 12.99 -0.09 0.93
Precipitation 0.17 0.12 1.44 0.15
Temperature 0.70 0.76 0.93 0.36
Sitesouth 26.87 26.63 1.01 0.32
Sitewest 10.36 19.25 0.54 0.59
Precipitation:Temperature 0.00 0.01 0.30 0.76
Precipitation:Sitesouth -0.21 0.21 -1.02 0.31
Precipitation:Sitewest -0.03 0.16 -0.21 0.84
Temperature:Sitesouth -2.59 1.70 -1.52 0.13
Temperature:Sitewest -0.46 1.16 -0.40 0.69
Precipitation:Temperature:Sitesouth 0.02 0.01 1.71 0.09
Precipitation:Temperature:Sitewest 0.00 0.01 0.26 0.80
Standard errors: OLS
par(mfrow = c(2, 2))
plot(optimal_fit_bugz) #standard graphical output
**Figure 15: [A] Fitted values versus residuals (homogeneity). [B] Histogram of the residuals (normality). [C] Residuals versus length (independence). [D] Residuals versus month. Here you can see evidence of non-normality (B), and violations in heteroscedasticity (D).**

Figure 15: [A] Fitted values versus residuals (homogeneity). [B] Histogram of the residuals (normality). [C] Residuals versus length (independence). [D] Residuals versus month. Here you can see evidence of non-normality (B), and violations in heteroscedasticity (D).

predicted_values <- predict(optimal_fit_bugz) #predict what values should be given the model

#attach those values to dataframe
bugz$predicted_SpeciesCount <- predicted_values



ggplot(bugz, aes(Precipitation, SpeciesCount))+
  geom_point(aes(color="Observed"))+
  geom_point(aes(y=predicted_SpeciesCount, color = "Predicted"))+
  scale_color_manual(values=c("#c77cff", "#00bfc4"))+
  guides(col= guide_legend(title= "Legend"))+
  labs(title="Predicted and observed values for insect densities", x="Precipitation (in/yr)", y=expression(paste("Insect Densities(count / ", m^{2}, ")")))
**Figure 16: Predicted vs. observed values using the predict() function to determine what y would be using our best fit multiple regression model.**

Figure 16: Predicted vs. observed values using the predict() function to determine what y would be using our best fit multiple regression model.

summary(optimal_fit_bugz)
## 
## Call:
## lm(formula = SpeciesCount ~ Precipitation * Temperature * Site, 
##     data = bugz)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.328  -1.921   0.541   2.387  16.754 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                         -1.159419  12.992845  -0.089   0.9291  
## Precipitation                        0.167106   0.116398   1.436   0.1546  
## Temperature                          0.699308   0.755406   0.926   0.3571  
## Sitesouth                           26.874418  26.627113   1.009   0.3156  
## Sitewest                            10.358072  19.245113   0.538   0.5918  
## Precipitation:Temperature            0.002041   0.006797   0.300   0.7647  
## Precipitation:Sitesouth             -0.214245   0.209915  -1.021   0.3102  
## Precipitation:Sitewest              -0.034236   0.164439  -0.208   0.8356  
## Temperature:Sitesouth               -2.589917   1.701685  -1.522   0.1316  
## Temperature:Sitewest                -0.457134   1.155247  -0.396   0.6933  
## Precipitation:Temperature:Sitesouth  0.022841   0.013320   1.715   0.0899 .
## Precipitation:Temperature:Sitewest   0.002536   0.009839   0.258   0.7972  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.44 on 88 degrees of freedom
## Multiple R-squared:  0.8847, Adjusted R-squared:  0.8703 
## F-statistic: 61.36 on 11 and 88 DF,  p-value: < 2.2e-16

d



Interpret your findings using appropriate statistical inference.

\(Y_i\) = \(\beta_0\) + \(\beta_1\)\(Precipitation\) * \(\beta_2\)\(Temperature\) * \(\beta_3\)\(Site\) + \(\epsilon_i\) where \(\epsilon_i\) ~ N(0, \(\sigma^2\))

Using Bugs.csv data, we estimate \(\hat\beta_0\) = -1.16, \(\hat\beta_1\) = 0.17, \(\hat\beta_2\) = 0.70, \(\hat\beta_3\) = Sites, and \(\sigma^2\)= 5.44. Thus, according to our linear regression model, insect species counts have an estimated species count of -1.16/\(m^2\) when precipitation = 0, and the species count improves by an estimated 0.17 for every inch of precipitation each year. Species count increases 0.70 for each unit of temperature increase (this seems like it would not be linear because temperature would negatively affect if too hot), and species count is greater in the west site than the south site. With an adjusted \(R^2\) of 0.87, the regression model explains a pretty good amount (87%) of the year-to-year variability in insect species counts, and the model is statistically significant at the 0.001 level.

I would also like to suggest that these data violate the assumption of normality (Fig. 15b) and I think that another model (GLM) would make more sense than a simple linear model.

References

by: Jacob Long located here

by: Kathryn Sobocinski Located in local folder ESCI_502_HW