Bivariate Regression with Continuous Variables

Research Question

For this first example, we will be looking at a record of all taxi trips taken in March, 2020.

Research Question: Is there a relationship between the duration (in time) of a taxi ride and the tip amount paid?

Note: We will only look at riders who paid via credit card, because we know that cash tip amounts were not recorded in this dataset.

# subset to only riders who paid via credit card
nyc.taxi.credit.dta <- nyc.taxi.dta[nyc.taxi.dta$payment_type == 1,]
dim(nyc.taxi.credit.dta)
## [1] 2248516      25

Checking Assumptions

We need to check that all of our data values make sense in the context of our analysis. We also need to check that the 3 main assumptions of regressions are met by doing the following:

  1. Linearity - use a scatterplot of the data to confirm an approximately linear relationship (see: visualization lecture or Graphs section)

  2. Homoscedasticity - subset to remove major outliers

  3. Normality - use a density plot to confirm that the independent variable has an approximately normal distribution

summary(nyc.taxi.credit.dta$tip_amount) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -61.42    1.66    2.26    2.90    3.26  800.00   37487
# negative tip amounts are impossible and likely a typo so we will remove them
nyc.taxi.credit.dta <- nyc.taxi.credit.dta[nyc.taxi.credit.dta$tip_amount >= 0,]

summary(as.numeric(nyc.taxi.credit.dta$duration, "minutes")) 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -47402.62      6.50     10.57     15.46     16.80   1439.93     37494
# a ride of 0 or fewer minutes does not make sense / trips longer than ~30 minutes appear to be outliers so we will limit our analysis to only trips between 1-30 min
nyc.taxi.credit.dta <- nyc.taxi.credit.dta[as.numeric(nyc.taxi.credit.dta$duration, "minutes") >= 1 & as.numeric(nyc.taxi.credit.dta$duration, "minutes") <= 30,]

# check the approximately normal distribution of our independent variable
plot(density(as.numeric(nyc.taxi.credit.dta$duration, "minutes"), na.rm = TRUE))

The distribution of our independent variable, taxi trip duration, appears to somewhat approximate a normal distribution - although there is a noticeable right skew due to the fact that some trips were unusually long. Since a trip cannot have a negative duration, the tail is not mirrored on the lower end of our range. However, the distribution is close enough to normal that we can proceed with our analysis.

Regression

Null Hypothesis: There is no association between duration (in time) of a taxi ride and the tip amount paid. In other words, there is no change in tip amount for an additional unit of time spent riding in a taxi.

Before running our bivariate regression, we need to consider the units of our continuous variables:

  • duration can be represented in units of seconds, minutes, or hours.

  • tip_amount can be represented in units of cents or dollars.
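
The unit conversions listed above can be sketched on a single made-up trip. The unit strings used in the models below ("seconds", "minutes", "hours") suggest that duration is stored as a lubridate-style duration, but that is an assumption; this sketch uses base R's difftime instead, whose unit names are "secs", "mins", and "hours". The timestamps and object names are hypothetical.

```r
# Two hypothetical pickup/dropoff timestamps (not taken from the taxi data)
pickup  <- as.POSIXct("2020-03-01 08:00:00", tz = "UTC")
dropoff <- as.POSIXct("2020-03-01 08:12:30", tz = "UTC")

# difftime() stores the elapsed time together with a unit
duration <- difftime(dropoff, pickup)

# as.numeric() with a units argument expresses the same duration in any unit
dur.sec   <- as.numeric(duration, units = "secs")
dur.min   <- as.numeric(duration, units = "mins")
dur.hours <- as.numeric(duration, units = "hours")

# a hypothetical tip of 2.50 dollars expressed in cents
tip.usd   <- 2.50
tip.cents <- tip.usd * 100

print(c(seconds = dur.sec, minutes = dur.min, hours = dur.hours))
```

The underlying quantity never changes; only the number used to describe it does, which is why the choice of unit affects the size of a regression coefficient but not the fit.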

Model 1: duration in seconds and tip_amount in cents
At a ride duration of 0 seconds, the base tip amount would be 88 cents. For every 1 second increase in ride duration, the tip increases by .25 cents.

sec.cent.model <- lm(tip_amount*100 ~ as.numeric(duration, "seconds"), data = nyc.taxi.credit.dta)
summary(sec.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "seconds"), 
##     data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -532    -51      4     36  79645 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     88.254673   0.257041   343.3   <2e-16 ***
## as.numeric(duration, "seconds")  0.246308   0.000326   755.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182.4 on 2061472 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.2169 
## F-statistic: 5.709e+05 on 1 and 2061472 DF,  p-value: < 2.2e-16


Model 2: duration in minutes and tip_amount in cents
At a ride duration of 0 minutes, the base tip amount would be 88 cents. For every 1 minute increase in ride duration, the tip increases by 15 cents.

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes"), data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes"), 
##     data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -532    -51      4     36  79645 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     88.25467    0.25704   343.3   <2e-16 ***
## as.numeric(duration, "minutes") 14.77849    0.01956   755.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182.4 on 2061472 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.2169 
## F-statistic: 5.709e+05 on 1 and 2061472 DF,  p-value: < 2.2e-16


Model 3: duration in minutes and tip_amount in dollars
At a ride duration of 0 minutes, the base tip amount would be .88 dollars. For every 1 minute increase in ride duration, the tip increases by .15 dollars.

min.dollar.model <- lm(tip_amount ~  as.numeric(duration, "minutes"), data = nyc.taxi.credit.dta)
summary(min.dollar.model)
## 
## Call:
## lm(formula = tip_amount ~ as.numeric(duration, "minutes"), data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5.32  -0.51   0.04   0.36 796.45 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     0.8825467  0.0025704   343.3   <2e-16 ***
## as.numeric(duration, "minutes") 0.1477849  0.0001956   755.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.824 on 2061472 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.2169 
## F-statistic: 5.709e+05 on 1 and 2061472 DF,  p-value: < 2.2e-16


Model 4: duration in hours and tip_amount in dollars
At a ride duration of 0 hours, the base tip amount would be .88 dollars. For every 1 hour increase in ride duration, the tip increases by 8.9 dollars.

hours.dollar.model <- lm(tip_amount ~ as.numeric(duration, "hours"), data = nyc.taxi.credit.dta)
summary(hours.dollar.model)
## 
## Call:
## lm(formula = tip_amount ~ as.numeric(duration, "hours"), data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5.32  -0.51   0.04   0.36 796.45 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.88255    0.00257   343.3   <2e-16 ***
## as.numeric(duration, "hours")  8.86710    0.01174   755.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.824 on 2061472 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.2169 
## F-statistic: 5.709e+05 on 1 and 2061472 DF,  p-value: < 2.2e-16


All the above regressions represent the same slope - the same proportional change in tip amount per unit of time - simply represented in different units. They are also all significant, because their p-values are < .05. This means that we can reject the null hypothesis and begin interpreting our coefficients in meaningful units.

However, the models using seconds as the unit of duration have impractically small coefficients and the models using hours have impractically large coefficients, both of which would be difficult to interpret. Models using the units of minutes make the most sense to use for our interpretation.

The difference in ease of interpretation between models that use dollars vs cents as monetary units is primarily subjective - choose which one makes the most sense in the context of the analysis.

Overall, the best models to interpret would be min.cent.model or min.dollar.model.
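
The point that changing units rescales the coefficient without changing the fit can be checked with a small simulation. The data below are randomly generated, not the taxi records, and the helper name slope.summary is hypothetical. This is a minimal sketch, assuming tips rise linearly with duration plus noise.

```r
set.seed(1)
n <- 500
dur.min <- runif(n, min = 1, max = 30)                 # simulated ride durations (minutes)
tip.usd <- 0.9 + 0.15 * dur.min + rnorm(n, sd = 1.5)   # simulated tips (dollars)

# The same relationship fitted in three unit combinations
fit.min.usd  <- lm(tip.usd ~ dur.min)
fit.sec.cent <- lm(I(tip.usd * 100) ~ I(dur.min * 60))
fit.hour.usd <- lm(tip.usd ~ I(dur.min / 60))

# Hypothetical helper: pull the slope, its t value, and R-squared from a fit
slope.summary <- function(fit) {
  s <- summary(fit)
  c(slope     = unname(coef(fit)[2]),
    t.value   = s$coefficients[2, "t value"],
    r.squared = s$r.squared)
}

results <- rbind(min.usd  = slope.summary(fit.min.usd),
                 sec.cent = slope.summary(fit.sec.cent),
                 hour.usd = slope.summary(fit.hour.usd))
print(results)
```

The slope column differs only by the conversion factors (times 100/60 for cents per second, times 60 for dollars per hour), while the t value and R-squared columns are identical across the three rows.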

Graphing

Because both of our variables in this bivariate regression model are continuous, we can graph the regression line over a scatterplot of the data points.

  • In base R plots, wrapping the abline() function around your model will add a regression line to an existing plot: abline(lm(dep ~ ind))

  • In ggplot, the function geom_smooth() can be used to add a regression line to a plot as in the example below:

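# load plotting packages (ggplot2 for the plot, scales for dollar_format)
library(ggplot2)
library(scales)

# randomly sample 1000 trips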
set.seed(350)
rows.to.include <- sample(1:nrow(nyc.taxi.credit.dta), 1000)
sub.nyc.taxi.credit.dta <- nyc.taxi.credit.dta[rows.to.include,]

# scatterplot
ggplot(sub.nyc.taxi.credit.dta) +
  geom_point(aes(x = as.numeric(duration, "minutes"), 
                 y = tip_amount),pch = 1) +
  geom_smooth(method = "lm",formula = y ~ x, 
              aes(x=as.numeric(duration, "minutes"), y=tip_amount, group=1))+
  ggtitle("NYC Taxi tip amount by ride length in March 2020") +
  xlab("Duration of ride in minutes") +
  ylab("Value of tip in USD") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))+ 
  scale_x_continuous(limits = c(0,30), breaks = seq(0,30,5))+
  scale_y_continuous(labels = dollar_format(prefix = "$"),
                     limits = c(0,21), breaks = seq(0,21,2)) 



Multivariate Regression with Continuous Variables

Identify Base Bivariate Regression

Before constructing any multivariate regression, we need to analyze the basic bivariate regression between our dependent and independent variable. This will provide a standard of comparison for subsequent models containing our independent variable, dependent variable, and control(s).

Null Hypothesis: There is no association between duration (in time) of a taxi ride and the tip amount paid, all else held constant.

In other words, there is no change in tip amount for an additional unit of time spent riding in a taxi, when all other measured factors remain unchanged.

We have chosen to express our continuous variables in minutes for duration and cents for tip_amount.

Bivariate Model
At a ride duration of 0 minutes, the base tip amount would be 88 cents. For every 1 minute increase in ride duration, the tip increases by 15 cents.

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes"), data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes"), 
##     data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -532    -51      4     36  79645 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     88.25467    0.25704   343.3   <2e-16 ***
## as.numeric(duration, "minutes") 14.77849    0.01956   755.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182.4 on 2061472 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.2169 
## F-statistic: 5.709e+05 on 1 and 2061472 DF,  p-value: < 2.2e-16

Selecting Control Variables

We want to identify which variables in our dataset might impact the expected value of the tip for a given taxi ride.

What factors might impact the tip amount?
1. amount of traffic
2. perceived inconvenience to the driver
3. urgency of the trip to the passenger
4. generosity of the rider
5. perceived quality of the service
6. total fare cost

What variables measure or approximate these factors?

  • pickup time - the time of day of the ride impacts 1-3 & can be used as a proxy measure

  • day of the week - the day of the week impacts 1-3 & can be used as a proxy measure

  • number of passengers - the number of passengers impacts 2 and 4 & can be used as a proxy measure

  • fare amount - tip is usually related to the total cost (6)

Note: we do not have a variable that measures perceived quality of the service (5), which is a limitation of our model

summary(nyc.taxi.credit.dta$fare_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.50    9.00   10.59   12.50  700.00   37494
summary(nyc.taxi.credit.dta$pickup.time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   14.00   13.83   19.00   23.00   37494
summary(nyc.taxi.credit.dta$day.of.week)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.00    4.00    3.87    5.00    7.00   37494
summary(nyc.taxi.credit.dta$passenger_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.00    1.00    1.46    1.00    9.00   37494

Multivariate Regression

Multivariate Model 1: Controlling for day of the week
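The intercept of 99 cents is the expected tip for a ride duration of 0 minutes at day.of.week = 0, a value outside the 1-7 coding shown in the summary above. If Sunday is coded as 1, the expected tip for a 0-minute ride on a Sunday is about 99 - 3 = 96 cents. For every 1 minute increase in ride duration, the tip increases by 15 cents, all else held constant.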

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes") + day.of.week, data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes") + 
##     day.of.week, data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -540    -51      3     37  79637 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     99.27903    0.36089   275.1   <2e-16 ***
## as.numeric(duration, "minutes") 14.79525    0.01955   756.6   <2e-16 ***
## day.of.week                     -2.89522    0.06656   -43.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182.3 on 2061471 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2176, Adjusted R-squared:  0.2176 
## F-statistic: 2.866e+05 on 2 and 2061471 DF,  p-value: < 2.2e-16

Control coefficient: For every subsequent day (starting on Sunday), the expected tip decreases by 3 cents, all else held constant.
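
Because the intercept always describes the case where every predictor equals 0, it can help to ask the model directly for the expected value at a meaningful combination of values. The sketch below does this with predict() on simulated data; the data frame sim, its column names, and the coefficients used to generate it are all hypothetical.

```r
set.seed(2)
n <- 1000
sim <- data.frame(
  dur.min     = runif(n, 1, 30),                  # simulated duration in minutes
  day.of.week = sample(1:7, n, replace = TRUE)    # simulated day code, 1-7
)
sim$tip.cents <- 100 + 15 * sim$dur.min - 3 * sim$day.of.week + rnorm(n, sd = 50)

fit <- lm(tip.cents ~ dur.min + day.of.week, data = sim)

# The intercept is the prediction at dur.min = 0 and day.of.week = 0,
# even though 0 lies outside the 1-7 coding of day.of.week
at.zero <- predict(fit, newdata = data.frame(dur.min = 0, day.of.week = 0))

# Expected tip for a 10-minute ride on the day coded 1
at.day1 <- predict(fit, newdata = data.frame(dur.min = 10, day.of.week = 1))

print(c(intercept = unname(coef(fit)[1]), at.zero = unname(at.zero),
        at.day1 = unname(at.day1)))
```

The first two printed values match, which confirms that the intercept is simply the prediction with all predictors set to 0.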


Multivariate Model 2: Controlling for day of the month
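The intercept of 74 cents is the expected tip for a ride duration of 0 minutes at day = 0. Assuming day is coded 1-31, the expected tip for a 0-minute ride on the first day of the month is about 74 + 1.5 ≈ 76 cents. For every 1 minute increase in ride duration, the tip increases by 15 cents, all else held constant.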

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes") + day, data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes") + 
##     day, data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -568    -51      4     37  79645 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     74.27795    0.33599   221.1   <2e-16 ***
## as.numeric(duration, "minutes") 14.91368    0.01965   758.9   <2e-16 ***
## day                              1.49385    0.02316    64.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182.2 on 2061471 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2184, Adjusted R-squared:  0.2184 
## F-statistic: 2.881e+05 on 2 and 2061471 DF,  p-value: < 2.2e-16

Control coefficient: Starting on the first of the month, the expected tip increases by an average of 3 cents every two days, all else held constant.


Multivariate Model 3: Controlling for passenger count
When no passengers are recorded (passenger_count = 0, the minimum in the summary above), the expected tip amount would be 87 cents for a ride duration of 0 minutes; with one passenger it would be about 88 cents. For every 1 minute increase in ride duration, the tip increases by 15 cents, all else held constant.

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes") + passenger_count, data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes") + 
##     passenger_count, data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -535    -51      4     36  79646 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     87.15268    0.30536 285.413  < 2e-16 ***
## as.numeric(duration, "minutes") 14.77798    0.01956 755.545  < 2e-16 ***
## passenger_count                  0.75838    0.11345   6.685 2.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182.4 on 2061471 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.2169 
## F-statistic: 2.855e+05 on 2 and 2061471 DF,  p-value: < 2.2e-16

Control coefficient: The expected tip increases by an average of less than one cent for each additional passenger, all else held constant.


Multivariate Model 4: Controlling for fare amount
At a fare cost of $0 and a ride duration of 0 minutes, the base tip amount would be 70 cents. For every 4 minute increase in ride duration, the tip increases by 1 cent all else held constant.

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes") + fare_amount , data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes") + 
##     fare_amount, data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8742    -42     17     37  79621 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     70.03218    0.23463 298.479   <2e-16 ***
## as.numeric(duration, "minutes")  0.25934    0.02805   9.247   <2e-16 ***
## fare_amount                     17.37697    0.02601 668.191   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165.4 on 2061471 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.3563, Adjusted R-squared:  0.3563 
## F-statistic: 5.705e+05 on 2 and 2061471 DF,  p-value: < 2.2e-16

Control coefficient: The expected tip increases by an average of 17 cents for each additional dollar of the total fare amount, all else held constant.

Notice that the relationship between tip amount and duration basically disappears when we control for fare amount. This shows that the original relationship we observed between tip amount and duration was actually misattributed. Fare amount is the strongest predictor of tip amount, and duration was acting as a proxy for it in our earlier models.

We can see, based on the R-squared value of our model, that together fare amount and duration explain 36% of the overall variance in tip amount.
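
The way a control variable can absorb an apparent effect can be reproduced with simulated data. In this minimal sketch, the fare is generated to rise with duration and the tip is generated to depend on the fare only; all object names and numbers are hypothetical and chosen only to resemble the pattern described above.

```r
set.seed(3)
n <- 2000
dur.min   <- runif(n, 1, 30)                          # simulated duration in minutes
fare.usd  <- 2.5 + 0.8 * dur.min + rnorm(n, sd = 2)   # fare rises with duration
tip.cents <- 70 + 17 * fare.usd + rnorm(n, sd = 100)  # tip depends on fare only

fit.biv  <- lm(tip.cents ~ dur.min)              # duration alone
fit.ctrl <- lm(tip.cents ~ dur.min + fare.usd)   # duration, controlling for fare

print(coef(fit.biv)["dur.min"])
print(coef(fit.ctrl)[c("dur.min", "fare.usd")])
```

In the bivariate fit, duration picks up the effect of the fare it is correlated with; once fare is in the model, the duration coefficient shrinks toward 0.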

Multivariate Regression - multiple controls

Multivariate Model 5: all controls
Fare amount continues to be the strongest predictor of tip amount when we control for duration, date, day of the week, and passenger count, each of which has less than a 1 cent impact on the expected tip amount for an additional unit.

The base tip amount would be 71 cents when every predictor equals 0 (fare cost of $0, ride duration of 0 minutes, passenger count of 0, and the out-of-range values day.of.week = 0 and day = 0). For every 1 dollar increase in fare, the tip increases by 17 cents, all else held constant. When fare cost is held constant, a 1 unit increase in any other variable is predicted to alter the tip amount by less than 1 cent.

min.cent.model <- lm(tip_amount*100 ~  as.numeric(duration, "minutes") + fare_amount + day.of.week + day + passenger_count, data = nyc.taxi.credit.dta)
summary(min.cent.model)
## 
## Call:
## lm(formula = tip_amount * 100 ~ as.numeric(duration, "minutes") + 
##     fare_amount + day.of.week + day + passenger_count, data = nyc.taxi.credit.dta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8749    -42     17     37  79622 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     71.29117    0.39430 180.803  < 2e-16 ***
## as.numeric(duration, "minutes")  0.22901    0.02854   8.025 1.02e-15 ***
## fare_amount                     17.39586    0.02630 661.521  < 2e-16 ***
## day.of.week                     -0.25398    0.06174  -4.113 3.90e-05 ***
## day                             -0.17050    0.02162  -7.887 3.10e-15 ***
## passenger_count                  0.88331    0.10290   8.584  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165.4 on 2061468 degrees of freedom
##   (37494 observations deleted due to missingness)
## Multiple R-squared:  0.3563, Adjusted R-squared:  0.3563 
## F-statistic: 2.283e+05 on 5 and 2061468 DF,  p-value: < 2.2e-16


We can see, based on the R-squared value of our model, that together all of the independent variables included in our model (fare amount, duration, date, day of the week, and passenger count) explain 36% of the variance in tip amount. This matches the R-squared of Multivariate Model 4, which included only fare amount and duration (0.3563 in both), so we can conclude that the other controls explain very little of the variance in tip amount.

Additionally, 64% of the variance in tip amount is unexplained by the variables included in our model. This remaining variance may be explained by quality of service or by any of the many other factors for which we do not have measurements.
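
One way to compare how much variance nested models explain is to pull R-squared out of each summary and to run anova() on the pair. This is a sketch on simulated data, with hypothetical names and a deliberately weak second predictor; it is not a re-analysis of the taxi records.

```r
set.seed(4)
n <- 2000
sim <- data.frame(
  fare.usd        = runif(n, 3, 40),                 # simulated fare in dollars
  passenger.count = sample(1:4, n, replace = TRUE)   # simulated passenger count
)
sim$tip.cents <- 70 + 17 * sim$fare.usd + 1 * sim$passenger.count +
  rnorm(n, sd = 150)

fit.small <- lm(tip.cents ~ fare.usd, data = sim)
fit.full  <- lm(tip.cents ~ fare.usd + passenger.count, data = sim)

# R-squared never decreases when a predictor is added; the size of the gain matters
r2 <- c(small = summary(fit.small)$r.squared,
        full  = summary(fit.full)$r.squared)
print(round(r2, 4))

# F test of whether the added predictor improves the fit
print(anova(fit.small, fit.full))
```

With millions of rows, as in the taxi data, even a negligible gain in R-squared can come with a very small p-value, so the size of the gain is usually the more informative number.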



Bivariate Regression with Categorical Variables

Research Question

For this example, we will be revisiting the ANES response data, which includes the variable trust.govt containing responses to the question:
“How much of the time do you think you can trust the government in Washington to do what is right?”

Research Question: Is there a relationship between the age of voters and how much of the time they trust the government to do what is right?

Recode Dependent Variable

We want the numeric values representing our categorical variables to be meaningful and reflect the implied magnitude of response options. In this case, I took the response options to mean the proportion of time when a respondent trusts the government with equal value spacing between ordered responses:

  • Always - 100% of the time  
  • Most of the time - 75% of the time
  • About half the time - 50% of the time
  • Some of the time - 25% of the time  
  • Never - 0% of the time

In this case, our units are proportions of (an abstract concept of) “time”

summary(anes2020.dta$n.trust.govt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.000   4.000   3.476   4.000   5.000      37
table(anes2020.dta$trust.govt)
## 
##              1. Always    2. Most of the time 3. About half the time 
##                     88                   1133                   2569 
##    4. Some of the time               5. Never 
##                   3674                    779
# recode categorical responses to interpreted proportion values
anes2020.dta$n.trust.govt[anes2020.dta$n.trust.govt == 1] <- 1
anes2020.dta$n.trust.govt[anes2020.dta$n.trust.govt == 2] <- 3/4
anes2020.dta$n.trust.govt[anes2020.dta$n.trust.govt == 3] <- 1/2
anes2020.dta$n.trust.govt[anes2020.dta$n.trust.govt == 4] <- 1/4
anes2020.dta$n.trust.govt[anes2020.dta$n.trust.govt == 5] <- 0

summary(anes2020.dta$n.trust.govt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.250   0.250   0.381   0.500   1.000      37
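
The line-by-line recode above works because no new value collides with a code that is tested later. As a hedged alternative, the same mapping can be written in one step, either arithmetically or with a lookup vector. The vector codes below is a hypothetical stand-in for the original 1-5 coding of n.trust.govt.

```r
# Hypothetical response codes in the original 1-5 coding (1 = Always, 5 = Never)
codes <- c(1, 2, 3, 4, 5, NA, 4)

# Option 1: arithmetic rescaling, valid because the categories are equally spaced
prop.arith <- (5 - codes) / 4

# Option 2: lookup vector indexed by the original code
lookup      <- c(1, 3/4, 1/2, 1/4, 0)
prop.lookup <- lookup[codes]

print(prop.arith)
print(prop.lookup)
```

Both options leave missing responses as NA and cannot recode a value twice, since every element is mapped from its original code in a single pass.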

Regression 1

Null Hypothesis: There is no association between trust in the government and age. In other words, there is no change in the proportion of time one trusts the government for an additional unit of age.

trust.model1 <- lm(n.trust.govt ~  n.age, data = anes2020.dta)
summary(trust.model1)
## 
## Call:
## lm(formula = n.trust.govt ~ n.age, data = anes2020.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43050 -0.14753 -0.08854  0.13370  0.67534 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2916954  0.0078012   37.39   <2e-16 ***
## n.age       0.0017350  0.0001435   12.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2193 on 7902 degrees of freedom
##   (376 observations deleted due to missingness)
## Multiple R-squared:  0.01816,    Adjusted R-squared:  0.01804 
## F-statistic: 146.2 on 1 and 7902 DF,  p-value: < 2.2e-16

The p-value of this regression indicates that the results are statistically significant, but, in order to interpret the results, we have to consider what the real meanings of the values are in terms of their units. This model is telling us that at an age of 0, average trust in government has a value of 0.29 and each additional year of age is associated with a 0.0017 increase in this value…

This makes no sense! Newborns do not understand what the government is or does! Do we really want to say that there is a 0.0017 increase in trust in government for every individual subsequent year of age?

Regression 2

Instead, we can use ordered age groups as our numeric values representing age.

table(anes2020.dta$age)
## 
## 18-30 31-40 41-50 51-60   60+ 
##  1143  1377  1219  1347  2840
#recode age groups as ordered set of numeric values
anes2020.dta$ord.age[anes2020.dta$age == "18-30"] <- 0
anes2020.dta$ord.age[anes2020.dta$age == "31-40"] <- 1
anes2020.dta$ord.age[anes2020.dta$age == "41-50"] <- 2
anes2020.dta$ord.age[anes2020.dta$age == "51-60"] <- 3
anes2020.dta$ord.age[anes2020.dta$age == "60+"] <- 4

# run a regression with this new ordered age variable as the independent
trust.model2 <- lm(n.trust.govt ~  ord.age, data = anes2020.dta)
summary(trust.model2)
## 
## Call:
## lm(formula = n.trust.govt ~ ord.age, data = anes2020.dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4119 -0.1619 -0.0839  0.1271  0.6661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.333898   0.004751   70.28   <2e-16 ***
## ord.age     0.019501   0.001675   11.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2195 on 7902 degrees of freedom
##   (376 observations deleted due to missingness)
## Multiple R-squared:  0.01687,    Adjusted R-squared:  0.01675 
## F-statistic: 135.6 on 1 and 7902 DF,  p-value: < 2.2e-16

In this model, the small p-value indicates that the relationship is statistically significant, and, in this case, the units of the variables lend themselves more easily to meaningful interpretation. The results indicate that, at an ord.age value of 0, the predicted value of trust in government is .33. In other words, the age group 18-30 can be expected to trust the government about 33% of the time on average - or slightly more often than “some of the time”.
Additionally, each one unit increase in age group is associated with a 0.019 unit increase in trust in government. In other words, moving up one age group is associated with a 1.9 percentage point increase in the proportion of time one trusts the government.

What if we don’t want to assume that the increase in government trust is uniform for every age group? We would want to analyze each age group’s individual relationship with trust in government.

Regression 3

To investigate each age group individually in our model, we can recode membership in each age group as a binary/dummy variable. If someone belongs to the age group, they will be coded as 1 for the dummy variable and 0 if they do not.
Out of the 5 age group binary variables, each person should have one and only one value of 1.

summary(anes2020.dta$n.age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   37.00   52.00   51.57   66.00   80.00     354
table(anes2020.dta$age)
## 
## 18-30 31-40 41-50 51-60   60+ 
##  1143  1377  1219  1347  2840
# recode each age group to a dummy variable
anes2020.dta$age1830[anes2020.dta$age == "18-30"] <- 1
anes2020.dta$age1830[anes2020.dta$age != "18-30" & !is.na(anes2020.dta$age)] <- 0

anes2020.dta$age3140[anes2020.dta$age == "31-40"] <- 1
anes2020.dta$age3140[anes2020.dta$age != "31-40" & !is.na(anes2020.dta$age)] <- 0

anes2020.dta$age4150[anes2020.dta$age == "41-50"] <- 1
anes2020.dta$age4150[anes2020.dta$age != "41-50" & !is.na(anes2020.dta$age)] <- 0

anes2020.dta$age5160[anes2020.dta$age == "51-60"] <- 1
anes2020.dta$age5160[anes2020.dta$age != "51-60" & !is.na(anes2020.dta$age)] <- 0

anes2020.dta$age60[anes2020.dta$age == "60+"] <- 1
anes2020.dta$age60[anes2020.dta$age != "60+" & !is.na(anes2020.dta$age)] <- 0

# confirm all observations have a value of 1 for one and only one of our new dummy variables by checking that the sum of all the variables is 1 for each respondent
table(with(anes2020.dta, age1830 + age3140 + age4150 + age5160 + age60))
## 
##    1 
## 7926
# create a model that includes all age group binary variables
trust.model3 <- lm(n.trust.govt ~  age1830 + age3140 + age4150 + age5160 + age60, data = anes2020.dta)
summary(trust.model3)
## 
## Call:
## lm(formula = n.trust.govt ~ age1830 + age3140 + age4150 + age5160 + 
##     age60, data = anes2020.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41125 -0.16125 -0.08758  0.12592  0.66242 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.411250   0.004125  99.686  < 2e-16 ***
## age1830     -0.073674   0.007702  -9.565  < 2e-16 ***
## age3140     -0.063614   0.007215  -8.817  < 2e-16 ***
## age4150     -0.037175   0.007524  -4.941 7.94e-07 ***
## age5160     -0.015759   0.007275  -2.166   0.0303 *  
## age60              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2195 on 7899 degrees of freedom
##   (376 observations deleted due to missingness)
## Multiple R-squared:  0.01707,    Adjusted R-squared:  0.01657 
## F-statistic: 34.29 on 4 and 7899 DF,  p-value: < 2.2e-16

Notice that in the above model, the coefficient for age60 is returned as NA. This is because the five dummy variables always sum to 1, which makes them perfectly collinear with the intercept, so one group has to be left out to serve as the “zero value” for our intercept. In this case, R automatically drops age60, the last variable in the formula, and uses it as our comparison group - meaning that in this model, the value of the intercept can be interpreted as the predicted average trust in government value among individuals over 60.
Each coefficient in this model is significant and negative, because membership in each remaining age group is associated with less trust in government than that of age60.
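
The NA coefficient can be reproduced on a small simulated example with three groups. Everything here (group labels, means, object names) is hypothetical; the sketch only illustrates that when a full set of dummy variables is entered alongside an intercept, lm() leaves the last one undefined and the intercept becomes that group's mean.

```r
set.seed(5)
grp         <- sample(c("a", "b", "c"), 300, replace = TRUE)
group.means <- c(a = 0.30, b = 0.35, c = 0.40)
y           <- unname(group.means[grp]) + rnorm(300, sd = 0.2)

# one dummy variable per group
d.a <- as.numeric(grp == "a")
d.b <- as.numeric(grp == "b")
d.c <- as.numeric(grp == "c")

# the three dummies sum to 1 for every row, duplicating the intercept column
print(table(d.a + d.b + d.c))

fit.all <- lm(y ~ d.a + d.b + d.c)
print(coef(fit.all))          # d.c is NA: it became the comparison group
print(mean(y[grp == "c"]))    # equals the intercept
```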

If we want to use a different age group as our comparison group, we can indicate this by omitting that group from our regression. For this to work, the age groups have to be mutually exclusive and exhaustive - meaning each respondent belongs to one and only one of the groups.
Below, I have omitted age1830 as our comparison group:

# create a model that includes all age group binary variables except age1830
trust.model4 <- lm(n.trust.govt ~  age3140 + age4150 + age5160 + age60, data = anes2020.dta)
summary(trust.model4)
## 
## Call:
## lm(formula = n.trust.govt ~ age3140 + age4150 + age5160 + age60, 
##     data = anes2020.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41125 -0.16125 -0.08758  0.12592  0.66242 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.337577   0.006504  51.903  < 2e-16 ***
## age3140     0.010060   0.008795   1.144    0.253    
## age4150     0.036499   0.009049   4.033 5.55e-05 ***
## age5160     0.057915   0.008843   6.549 6.16e-11 ***
## age60       0.073674   0.007702   9.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2195 on 7899 degrees of freedom
##   (376 observations deleted due to missingness)
## Multiple R-squared:  0.01707,    Adjusted R-squared:  0.01657 
## F-statistic: 34.29 on 4 and 7899 DF,  p-value: < 2.2e-16

This model indicates that the predicted average trust in government among individuals 18-30 is 33.76% (the value of the intercept). Each coefficient in this model is positive, because membership in each older age group is associated with more trust in government than that of age1830.
The predicted proportion of time that each age group trusts the government on average according to this model is as follows:

  • individuals 18-30 are predicted to trust the government 33.76% of the time on average

  • individuals 31-40 - not significantly different from the intercept

  • individuals 41-50 are predicted to trust the government (.3376 + .0365 =) 37.41% of the time on average

  • individuals 51-60 are predicted to trust the government (.3376 + .0579 =) 39.55% of the time on average

  • individuals 60+ are predicted to trust the government (.3376 + .0737 =) 41.13% of the time on average
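
The arithmetic in the list above can be reproduced directly from the coefficient table. This is a minimal sketch that types in the estimates printed by summary(trust.model4) rather than refitting the model; the object names coefs and pred are our own.

```r
# estimates copied from the summary(trust.model4) output
coefs <- c(intercept = 0.337577,
           age3140   = 0.010060,
           age4150   = 0.036499,
           age5160   = 0.057915,
           age60     = 0.073674)

# comparison group = intercept; every other group = intercept + its coefficient
pred <- c(age1830 = coefs[["intercept"]],
          coefs[["intercept"]] + coefs[-1])

# express as a percentage of the time
round(100 * pred, 2)
# age1830 age3140 age4150 age5160   age60
#   33.76   34.76   37.41   39.55   41.13
```

The value for age3140 is shown for completeness only; its coefficient is not significantly different from zero, so we cannot distinguish that group from the 18-30 group.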

The global p-value (seen in the last line of the regression output) allows us to reject the overall null hypothesis that there is no association between age group and trust in government. However, note that the individual coefficients in this regression have different p-values, indicating different levels of significance. In this case we can say that, compared to the average predicted trust in government value of age group 18-30, all older age groups are associated with a higher trust in government value except for age group 31-40 (p-value = .25), which is not significantly different from age group 18-30.

Multivariate Regression with Categorical Variables

Identify Base Bivariate Regression

Before constructing any multivariate regression, we need to analyze the basic bivariate regression between our dependent and independent variable. This will provide a standard of comparison for subsequent models containing our independent variable, dependent variable, and control(s).

Null Hypothesis: There is no association between trust in the government and age, all else held constant. In other words, there is no change in the proportion of time one trusts the government between each subsequent age group.

Bivariate Regression Model: We will be using ordered age groups as our numeric values representing age, our independent variable. Recall, we have coded age groups 18-30, 31-40, 41-50, 51-60, and 60+ on a scale of 0-4.

We will be using a scale value of trust in government (0 - Never to 1 - Always) as our dependent variable.

trust.model <- lm(n.trust.govt ~  ord.age, data = anes2020.dta)
summary(trust.model)
## 
## Call:
## lm(formula = n.trust.govt ~ ord.age, data = anes2020.dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4119 -0.1619 -0.0839  0.1271  0.6661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.333898   0.004751   70.28   <2e-16 ***
## ord.age     0.019501   0.001675   11.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2195 on 7902 degrees of freedom
##   (376 observations deleted due to missingness)
## Multiple R-squared:  0.01687,    Adjusted R-squared:  0.01675 
## F-statistic: 135.6 on 1 and 7902 DF,  p-value: < 2.2e-16

The results indicate that age group 18-30 (our comparison group, coded 0) can be expected to trust the government about 33% of the time on average - or slightly more often than “some of the time” - and each subsequent age group is associated with an increase of about 2% (0.0195 on the 0-1 scale) in the proportion of time one trusts the government - or trusting the government slightly more often.
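
Because ord.age enters the model as a single numeric variable, the model implies one straight-line prediction across the five age groups. As a rough sketch, we can compute those predictions by hand from the estimates printed above; the object names are our own, and the labels simply restate the 0-4 coding.

```r
# estimates copied from the summary(trust.model) output
intercept <- 0.333898
slope     <- 0.019501

# ord.age is coded 0-4 for the five age groups
age_step <- 0:4
linear_pred <- intercept + slope * age_step
names(linear_pred) <- c("18-30", "31-40", "41-50", "51-60", "60+")

# predicted trust in government, as a percentage of the time
round(100 * linear_pred, 1)
```

Each step up in age group adds the same amount to the prediction, which is the main difference from a model with a separate binary variable for each age group.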

Selecting Controls

What factors might impact trust in government?

  • education - trust in government might be impacted by one’s exposure to the government’s role in education and/or information taught about the government

  • race - trust in government might be impacted by the government’s historical treatment of one’s community

  • party - trust in government might be impacted by whether one identifies with the party in power

  • ideology - different ideologies espouse different beliefs about the amount of power and responsibility that should be entrusted to the government which might impact trust in government

  • 2020 vote choice - trust in government might be impacted by whether one voted for the candidate in power

table(anes2020.dta$education)
## 
## -2. Missing, other specify not coded for preliminary release 
##                                                           97 
##                          1. Less than high school credential 
##                                                          376 
##                                    2. High school credential 
##                                                         1336 
##               3. Some post-high school, no bachelor's degree 
##                                                         2790 
##                                         4. Bachelor's degree 
##                                                         2055 
##                                           5. Graduate degree 
##                                                         1592
table(anes2020.dta$race)
## 
##                                                 1. White, non-Hispanic 
##                                                                   5963 
##                                                 2. Black, non-Hispanic 
##                                                                    726 
##                                                            3. Hispanic 
##                                                                    762 
## 4. Asian or Native Hawaiian/other Pacific Islander, non-Hispanic alone 
##                                                                    284 
##     5. Native American/Alaska Native or other race, non-Hispanic alone 
##                                                                    172 
##                                        6. Multiple races, non-Hispanic 
##                                                                    271
table(anes2020.dta$party)
## 
##      1. Democratic party      2. Republican party 4. None or 'independent' 
##                     1861                     1336                     1029 
##       5. Other {SPECIFY} 
##                       33
table(anes2020.dta$ideology)
## 
##                 -4. Technical error                          1. Liberal 
##                                   1                                1320 
##                     2. Conservative 3. Moderate {VOL, video/phone only} 
##                                1534                                  20
table(anes2020.dta$vote.choice2020)
## 
##                 1. Joe Biden              2. Donald Trump 
##                         4026                         3134 
##              3. Jo Jorgensen             4. Howie Hawkins 
##                          135                           56 
## 5. Other candidate {SPECIFY} 
##                          181

Multivariate Regression

We want to make intentional choices about the values we use to recode our control variables - remember that the value we assign as 0 will serve as our comparison group in the regression.

Multivariate Model 1: Controlling for race and education

# for education, "high school or less" will be our comparison group (assigned a value of 0) and each unit is an increase in education level group (1 - some college, 2 - college degree or greater)
anes2020.dta$edu1[anes2020.dta$n.education %in% c(1:2)] <- 0
anes2020.dta$edu1[anes2020.dta$n.education == 3] <- 1
anes2020.dta$edu1[anes2020.dta$n.education %in% c(4:5)] <- 2

# race/ethnicity does not have an implied value or order, so we know that we need to code each response category as its own binary/dummy variable
anes2020.dta$white[anes2020.dta$n.race == 1] <- 1
anes2020.dta$white[anes2020.dta$n.race != 1 & !is.na(anes2020.dta$n.race)] <- 0

anes2020.dta$black[anes2020.dta$n.race == 2] <- 1
anes2020.dta$black[anes2020.dta$n.race != 2 & !is.na(anes2020.dta$n.race)] <- 0

anes2020.dta$hisp[anes2020.dta$n.race == 3] <- 1
anes2020.dta$hisp[anes2020.dta$n.race != 3 & !is.na(anes2020.dta$n.race)] <- 0

anes2020.dta$other[!is.na(anes2020.dta$n.race)] <- 0
anes2020.dta$other[anes2020.dta$n.race %in% c(4:6)] <- 1

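# check that all groups are mutually exclusive - each respondent with a recorded race should belong to exactly one group, so the binary variables should sum to 1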
table(with(anes2020.dta, white + black + hisp + other))
## 
##    1 
## 8178
# create model with our education and race/ethnicity controls
trust.model2 <- lm(n.trust.govt ~  ord.age + edu1 + black + hisp + other, data = anes2020.dta)
summary(trust.model2)
## 
## Call:
## lm(formula = n.trust.govt ~ ord.age + edu1 + black + hisp + other, 
##     data = anes2020.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45587 -0.15405 -0.08072  0.13691  0.67900 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.340443   0.006859  49.635  < 2e-16 ***
## ord.age      0.020764   0.001724  12.043  < 2e-16 ***
## edu1        -0.009724   0.003250  -2.991 0.002785 ** 
## black       -0.016291   0.008875  -1.835 0.066474 .  
## hisp         0.032367   0.008886   3.643 0.000272 ***
## other        0.017303   0.008909   1.942 0.052167 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2194 on 7760 degrees of freedom
##   (514 observations deleted due to missingness)
## Multiple R-squared:  0.02127,    Adjusted R-squared:  0.02064 
## F-statistic: 33.73 on 5 and 7760 DF,  p-value: < 2.2e-16


Only the coefficients on age, education, and Hispanic are significant at the 5% level - meaning that we can only be confident that there is a meaningful relationship between each of these variables and trust in government. All other relationships displayed in the model could possibly be due to random noise.

Age (primary independent variable / discrete ordered) - When education and race are held constant, there is a significant positive relationship between age group and trust in government. Because age enters the model as a single ordered variable, this is one slope shared by all groups: for each subsequent age group, there is an associated 2% increase in trust in government.

Education (discrete ordered) - For a one level increase in education, there is an associated decrease of about 1% (0.0097 on the 0-1 scale) in trust in government. This relationship is significant, but the magnitude is so small that it is basically meaningless when viewed in the context of our categorical dependent variable.

Hispanic (binary) - Compared to white voters, Hispanic voters are predicted to express 3% more trust in government on average.


Multivariate Model 2: Controlling for party, ideology, and vote choice

# we keep only the two main response options for ideology - conservative and liberal - so we can call this variable 'conservative' because it is a binary variable - 1 for conservative and 0 for not conservative aka liberal (the few moderate and technical error responses are left as NA)
anes2020.dta$conservative[anes2020.dta$ideology == "1. Liberal"] <- 0
anes2020.dta$conservative[anes2020.dta$ideology == "2. Conservative"] <- 1

# for party, Democrat will be our comparison group 
anes2020.dta$party1[anes2020.dta$party == "1. Democratic party"] <- 0
anes2020.dta$party1[anes2020.dta$party == "2. Republican party"] <- 1
anes2020.dta$party1[anes2020.dta$party == "4. None or 'independent'"] <- 2

# for vote choice in the 2020 presidential election, selecting joe biden is our comparison group
anes2020.dta$cand2020[anes2020.dta$vote.choice2020 == "1. Joe Biden"] <- 0
anes2020.dta$cand2020[anes2020.dta$vote.choice2020 == "2. Donald Trump"] <- 1
anes2020.dta$cand2020[anes2020.dta$vote.choice2020 == "3. Jo Jorgensen"|
                     anes2020.dta$vote.choice2020 == "4. Howie Hawkins"|
                     anes2020.dta$vote.choice2020 == "5. Other candidate {SPECIFY}"] <- 2

trust.model1a <- lm(n.trust.govt ~  ord.age + party1 + conservative  + cand2020, data = anes2020.dta)
summary(trust.model1a)
## 
## Call:
## lm(formula = n.trust.govt ~ ord.age + party1 + conservative + 
##     cand2020, data = anes2020.dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4586 -0.1794  0.0414  0.1296  0.6831 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.324841   0.016464  19.730  < 2e-16 ***
## ord.age       0.028124   0.004668   6.024 2.25e-09 ***
## party1       -0.003959   0.007958  -0.497    0.619    
## conservative  0.021261   0.013703   1.552    0.121    
## cand2020      0.003106   0.011364   0.273    0.785    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2281 on 1217 degrees of freedom
##   (7058 observations deleted due to missingness)
## Multiple R-squared:  0.03298,    Adjusted R-squared:  0.02981 
## F-statistic: 10.38 on 4 and 1217 DF,  p-value: 2.885e-08


An issue with interpreting the above model is that R assumes that our values for party1 and cand2020 are continuous, and thus it only generates one coefficient for each variable to apply to the difference in government trust between each party (or candidate). This is not accurate. What we actually want is a binary variable for each party and candidate.

Wrapping a variable in the function as.factor() tells R to treat each factor level as a binary variable in our linear model - without having to explicitly recode each level as its own variable. However, keep track of which variable values represent which groups - R will still exclude the first level (here, the 0 value) to be used as the comparison group.

trust.model1a <- lm(n.trust.govt ~  ord.age + as.factor(party1) + conservative  + as.factor(cand2020), data = anes2020.dta)
summary(trust.model1a)
## 
## Call:
## lm(formula = n.trust.govt ~ ord.age + as.factor(party1) + conservative + 
##     as.factor(cand2020), data = anes2020.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47499 -0.17309 -0.00095  0.14229  0.68616 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.326977   0.016317  20.039  < 2e-16 ***
## ord.age               0.025952   0.004648   5.584  2.9e-08 ***
## as.factor(party1)1    0.021745   0.019261   1.129   0.2591    
## as.factor(party1)2   -0.013135   0.015805  -0.831   0.4061    
## conservative          0.004784   0.013977   0.342   0.7322    
## as.factor(cand2020)1  0.043631   0.016997   2.567   0.0104 *  
## as.factor(cand2020)2 -0.064930   0.026658  -2.436   0.0150 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.226 on 1215 degrees of freedom
##   (7058 observations deleted due to missingness)
## Multiple R-squared:  0.05257,    Adjusted R-squared:  0.04789 
## F-statistic: 11.24 on 6 and 1215 DF,  p-value: 3.072e-12


The only significant relationships with trust in this model exist for age and candidate choice. All other relationships displayed in the model could possibly be due to random noise.

Age (primary independent variable / discrete ordered) - When party, ideology, and 2020 vote are held constant, there is a significant positive relationship between age group and trust in government. For each subsequent age group, there is an associated 2.6% increase in trust in government.

Candidate Choice (binary) - Compared to Biden 2020 voters, Trump voters are expected to trust the government 4.4% more on average. Compared to Biden 2020 voters, third-party voters are expected to trust the government 6.5% less on average.
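
As a sketch of what as.factor() is doing in the model above, the example below uses simulated data (not the ANES data); the data frame toy, its values, and the variable party_f are made up for illustration. It also shows relevel(), a base R function that lets us pick a different comparison group without recoding the variable.

```r
set.seed(1)

# hypothetical data: party coded 0, 1, 2 as in party1
toy <- data.frame(party1 = sample(0:2, 500, replace = TRUE))
toy$trust <- 0.33 + 0.02 * (toy$party1 == 1) - 0.01 * (toy$party1 == 2) +
  rnorm(500, sd = 0.2)

# numeric coding: one slope for "a one unit increase in party"
fit_numeric <- lm(trust ~ party1, data = toy)

# factor coding: one binary variable per level, first level (0) omitted
fit_factor <- lm(trust ~ as.factor(party1), data = toy)

length(coef(fit_numeric))  # intercept + 1 slope
length(coef(fit_factor))   # intercept + 2 group coefficients

# choose level 2 as the comparison group instead of level 0
toy$party_f <- relevel(as.factor(toy$party1), ref = "2")
fit_releveled <- lm(trust ~ party_f, data = toy)
names(coef(fit_releveled))
```

Changing the comparison group changes which differences the coefficients report, but not the predicted value for any group.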


Multivariate Model 3: all controls

trust.model.F <- lm(n.trust.govt ~  ord.age + as.factor(party1) + conservative + as.factor(cand2020) + edu1 + black + hisp + other , data = anes2020.dta)
summary(trust.model.F)
## 
## Call:
## lm(formula = n.trust.govt ~ ord.age + as.factor(party1) + conservative + 
##     as.factor(cand2020) + edu1 + black + hisp + other, data = anes2020.dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51991 -0.16445  0.00156  0.14620  0.65961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.300001   0.022593  13.278  < 2e-16 ***
## ord.age               0.029259   0.004811   6.082  1.6e-09 ***
## as.factor(party1)1    0.029018   0.020083   1.445  0.14874    
## as.factor(party1)2   -0.007759   0.016332  -0.475  0.63483    
## conservative          0.003336   0.014244   0.234  0.81488    
## as.factor(cand2020)1  0.050343   0.017512   2.875  0.00411 ** 
## as.factor(cand2020)2 -0.059686   0.026971  -2.213  0.02709 *  
## edu1                 -0.001290   0.008437  -0.153  0.87847    
## black                 0.022399   0.020916   1.071  0.28442    
## hisp                  0.049436   0.020411   2.422  0.01558 *  
## other                 0.058509   0.022677   2.580  0.01000 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2261 on 1191 degrees of freedom
##   (7078 observations deleted due to missingness)
## Multiple R-squared:  0.06185,    Adjusted R-squared:  0.05398 
## F-statistic: 7.853 on 10 and 1191 DF,  p-value: 2.656e-12


Coefficients that are significant at the 5% level correspond with age, candidate choice in 2020, and identifying as Hispanic or a race/ethnicity other than white/black/Hispanic. All other relationships displayed in the model could possibly be due to random noise.

Age (primary independent variable / discrete ordered) - All else held constant, there is a significant positive relationship between age group and trust in government. For each subsequent age group, there is an associated 3% increase in trust in government.

Candidate Choice (binary) - Compared to Biden 2020 voters, Trump voters are expected to trust the government 5% more on average. Compared to Biden 2020 voters, third-party voters are expected to trust the government 6% less on average.

Hispanic & Other (binary) - Compared to white voters, Hispanic voters are predicted to express 5% more trust in government on average. Compared to white voters, voters of a race/ethnicity other than white, black, or Hispanic are predicted to express 6% more trust in government on average.

Comparing Multivariate Models

Comparing our Models - R-squared

Out of all of our models, the final model including all of our control variables had the largest adjusted R-squared, meaning it had the most explanatory power. However, the adjusted R-squared value was only .054, meaning that age and all of our control variables only explained 5.4% of the variance in trust in government and 94.6% of the variance can be attributed to other factors.
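
For reference, adjusted R-squared penalizes R-squared for the number of predictors in the model. The sketch below applies the standard formula to the values printed in our outputs; the helper name adj_r2 is our own, and n is recovered as the residual degrees of freedom plus the number of estimated coefficients.

```r
# adjusted R-squared from R-squared (r2), sample size (n),
# and number of predictors excluding the intercept (p)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# bivariate model: R-squared = 0.01687, n = 7902 + 2, p = 1
adj_r2(0.01687, n = 7904, p = 1)

# model with all controls: R-squared = 0.06185, n = 1191 + 11, p = 10
adj_r2(0.06185, n = 1202, p = 10)
```

Both results match the adjusted R-squared values reported by summary() up to rounding of the printed R-squared.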

Comparing Significance

The global p-value of each of our models is significant, but notice that the significance of our multivariate models (including controls) is actually lower than that of our bivariate model (only includes age and trust variables). This does not mean that these models are “worse” - R drops every respondent with a missing value on any variable in the model, so as we include more control variables, our n value decreases (from 7,904 in the bivariate model to 1,202 in the model with all controls), which can reduce the overall significance of the model (while at the same time increasing R-squared).
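
We can recover the n value behind each model from its printed output: the number of observations used is the residual degrees of freedom plus the number of estimated coefficients (including the intercept). This is a small sketch using numbers copied from the summaries above; the data frame models and its column names are our own.

```r
# values copied from each model's summary() output
models <- data.frame(
  model    = c("bivariate", "race + education",
               "party + ideology + vote", "all controls"),
  resid_df = c(7902, 7760, 1215, 1191),  # residual degrees of freedom
  n_coef   = c(2, 6, 7, 11),             # rows in the coefficient table
  dropped  = c(376, 514, 7058, 7078)     # "deleted due to missingness"
)

# observations actually used to fit each model
models$n_used <- models$resid_df + models$n_coef

# observations used + observations dropped = rows in the data
models$n_total <- models$n_used + models$dropped

models[, c("model", "n_used", "dropped", "n_total")]
```

Every model starts from the same 8,280 respondents; the models that control for party, ideology, and vote choice keep only about 1,200 of them. With a fitted model object, nobs() reports the same n_used value directly.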

In multivariate model 1, when we controlled for education and race, there were significant coefficients for Hispanic and education level. However, the significance of education level disappeared in our combined model (model 3). This suggests that the significance of education level may actually have been a misattribution of a variable included in model 3 that was not included in model 1 - possibly party or candidate choice - for which the distribution of education is skewed. The much smaller sample used in model 3 could also contribute.

Comparing Coefficients

In multivariate model 2, when we controlled for party, ideology, and 2020 vote choice, there is an associated 2.6% increase in trust in government for each subsequent age group. This is greater than the associated 2% increase for each age group in our bivariate model. This suggests that one of the control variables introduced in model 2 may have been a confounding variable in the bivariate model, and once it was held constant, there was a larger apparent increase in trust in government for each age group on average. Keep in mind that the two models are also estimated on different samples, since model 2 drops respondents with missing values on the controls.

Overall Conclusions

The fact that all of our models explain less than 10% of the variance in trust in government indicates that none of our variables are very good predictors of this value, and that responses are primarily influenced by unmeasured factors. This could suggest that responses to the trust in government question are selected at random or that most voters default to the neutral response without considering the meaning of the question, but all we know for certain is that we cannot come to any strong overall conclusions about voters’ trust in government based on these models.

The fact that some of our coefficients are statistically significant indicates that they have a relationship with trust in government that is distinguishable from random noise. However, the low explanatory power of our models combined with the small magnitudes of the coefficients associated with each significant relationship (all less than 10% when the difference between each categorical answer is an implied 25%) indicates that this relationship, while statistically significant, does not have much meaning when translated to real world outcomes.