Intro

You’ve seen that, in Tennessee, the percentage of households with broadband internet access in a county depends, at least partly, on the county’s median household income.

What else might it depend on?

One possibility is “density,” meaning the county’s number of households per square mile. For technical reasons, offering broadband internet access in high-density areas is easier and more profitable for internet service providers than offering broadband internet access in low-density areas. In fact, broadband access might depend on density even more than it depends on household income.

You can use multiple regression to learn whether broadband access depends on household income, density, or both. All you do is add the “density” variable to the independent variables side of the R code for a bivariate regression analysis.

Here’s the code, and what it does:

# Installing required packages
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("visreg"))
  install.packages("visreg")
## Warning: package 'visreg' was built under R version 4.2.3
library(dplyr)
library(ggplot2)
library(visreg)

# Read the data
mydata <- read.csv("MultipleRegression.csv") #Edit YOURFILENAME.csv

# Look at the DV and IV
ggplot(mydata, aes(x = Pct_HH_with_internet)) + geom_histogram(color = "black", fill = "#1f78b4")

ggplot(mydata, aes(x = Median_HH_Income)) + geom_histogram(color = "black", fill = "#1f78b4")

ggplot(mydata, aes(x = Density)) + geom_histogram(color = "black", fill = "#1f78b4")

# Run the regression
options(scipen = 999)
myreg <- lm(Pct_HH_with_internet ~ Median_HH_Income+
              Density,
            data = mydata)
summary(myreg)
## 
## Call:
## lm(formula = Pct_HH_with_internet ~ Median_HH_Income + Density, 
##     data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4208  -2.7407   0.5792   3.0349   7.9126 
## 
## Coefficients:
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)      51.61425290  1.99840050  25.828 < 0.0000000000000002 ***
## Median_HH_Income  0.00042283  0.00004345   9.731 0.000000000000000839 ***
## Density           0.02031969  0.00539919   3.763             0.000295 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.215 on 92 degrees of freedom
## Multiple R-squared:  0.636,  Adjusted R-squared:  0.6281 
## F-statistic: 80.38 on 2 and 92 DF,  p-value: < 0.00000000000000022

According to the regression output, both Median_HH_Income and Density are significant predictors of broadband access. That outcome indicates that a county’s household income and population density are both uniquely important factors in the county’s level of broadband internet access. A researcher who was interested in learning how to boost broadband access would be wise to look at both variables, not just one or the other.

Checking for an interaction

There might be an even more sophisticated way of understanding why some counties have more broadband access than others. It could be that household income and population density “interact” as factors in broadband access. For example, it could be that household income is a stronger determinant of broadband access in low-density counties than in high-density counties. The idea makes logical sense. For various technical reasons, internet service providers can offer broadband internet services at a lower cost in high-density areas than in lower-density areas. And if broadband access is cheaper in high-density areas, then having broadband access in such areas probably would depend less on household income than it does in low-density areas.

Checking for such an interaction is surprisingly easy. All you have to do is create a third variable by multiplying the two independent variables, then add that third variable to the regression model. Here’s the code:

# Create the interaction term
mydata$Interaction <- mydata$Median_HH_Income * mydata$Density

# Add the interaction term to the model
options(scipen = 999)
myreg <- lm(Pct_HH_with_internet ~ Median_HH_Income+
              Density+
              Interaction,
            data = mydata)
summary(myreg)
## 
## Call:
## lm(formula = Pct_HH_with_internet ~ Median_HH_Income + Density + 
##     Interaction, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2708  -2.6634   0.4668   2.6983   7.7340 
## 
## Coefficients:
##                       Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)      44.1133054170  3.1494814938  14.007 < 0.0000000000000002 ***
## Median_HH_Income  0.0005790469  0.0000666757   8.685    0.000000000000144 ***
## Density           0.1309000418  0.0372041843   3.518              0.00068 ***
## Interaction      -0.0000020277  0.0000006756  -3.001              0.00347 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.043 on 91 degrees of freedom
## Multiple R-squared:  0.6688, Adjusted R-squared:  0.6579 
## F-statistic: 61.25 on 3 and 91 DF,  p-value: < 0.00000000000000022

The significant p-value next to the interaction term indicates that household income and density do, indeed, interact, as described above.

Graphing the interaction

One way to better understand such an interaction is to graph it. You can do it by showing the relationship between broadband access and household income at different levels of population density.

There are a couple of ways to make such a graph. One of the easiest ways is to use the visreg package for R. The first block of code in this demo installed and loaded the visreg package. This code makes the interaction graph:

myreg2 <- lm(Pct_HH_with_internet ~ Median_HH_Income*Density,
             data = mydata)
visreg(myreg2,"Median_HH_Income",by="Density", overlay=T, partial=FALSE,band=F)

In the graph, the red and green lines show the relationship between household income and broadband access in low-density and medium-density counties. The blue line shows the same relationship, but in high-density counties.

What stands out the most, perhaps, is that the red and green lines rise faster than the blue line does. The steeper the line is, the stronger the relationship is between broadband access and household income. So, the chart is showing you that the relationship between household income and broadband access is stronger in low- and medium-density counties than it is in high-density counties.

And that is what an interaction means.

What about the outliers?

You might remember that outliers are problematic in regression analysis, and this particular dataset has a couple of them. Here’s what happens when you identify and remove the outliers in the data, then run the whole analysis again:

# Find and delete outliers
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)

mydata <- mydata[-c(94, 19, 79), ]

# Run the regression again, minus the outliers
options(scipen = 999)
myreg <- lm(Pct_HH_with_internet ~ Median_HH_Income+
              Density,
            data = mydata)
summary(myreg)
## 
## Call:
## lm(formula = Pct_HH_with_internet ~ Median_HH_Income + Density, 
##     data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2324  -2.6738   0.6028   2.7638   7.6536 
## 
## Coefficients:
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)      48.59272929  2.43702954  19.939 < 0.0000000000000002 ***
## Median_HH_Income  0.00047705  0.00005531   8.625    0.000000000000224 ***
## Density           0.03563057  0.00834936   4.267    0.000049236818677 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.934 on 89 degrees of freedom
## Multiple R-squared:  0.6444, Adjusted R-squared:  0.6364 
## F-statistic: 80.65 on 2 and 89 DF,  p-value: < 0.00000000000000022
# Graph the interaction again, minus the outliers 
myreg2 <- lm(Pct_HH_with_internet ~ Median_HH_Income*Density,
             data = mydata)
visreg(myreg2,"Median_HH_Income",by="Density", overlay=T, partial=FALSE,band=F)

The interaction term remained significant, even without the outliers. The interaction graph is less dramatic, but it still shows that household income is a stronger determinant of broadband access in lower-density counties than in higher-density ones.