knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# List of packages
packages <- c("tidyverse", "broom", "modelsummary", "sjPlot", "car", "knitr", "gt", "fst")

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "broom"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "broom"        "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "sjPlot"       "modelsummary" "broom"        "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "car"          "carData"      "sjPlot"       "modelsummary" "broom"       
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[6]]
##  [1] "knitr"        "car"          "carData"      "sjPlot"       "modelsummary"
##  [6] "broom"        "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[7]]
##  [1] "gt"           "knitr"        "car"          "carData"      "sjPlot"      
##  [6] "modelsummary" "broom"        "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "fst"          "gt"           "knitr"        "car"          "carData"     
##  [6] "sjPlot"       "modelsummary" "broom"        "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"

Introduction

Simple linear regression allows us to model how an outcome variable (Y) is related to a single predictor variable (X). A linear regression measures both the strength of a relationship but also models how one variable relates to another.

In this tutorial, we’ll implement several simple linear regression models to answer research questions using real data. We’ll focus on:

  1. Understanding the regression equation and output

  2. Interpreting coefficients properly

  3. Creating visualizations to communicate findings

  4. Presenting results in professional tables

Throughout, we’ll follow a systematic workflow:

  1. Explore and prepare variables

  2. Visualize relationships

  3. Fit regression models

  4. Interpret the coefficients

  5. Present and visualize results professionally

Occupational Prestige: The Duncan Data

For our first example, we will use the classic Duncan dataset from the car package, which contains data on the prestige of 45 U.S. occupations in 1950, along with other characteristics.

Data Exploration

data("Duncan", package = "car")

str(Duncan)
## 'data.frame':    45 obs. of  4 variables:
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 3 2 ...
##  $ income   : int  62 72 75 55 64 21 64 80 67 72 ...
##  $ education: int  86 76 92 90 86 84 93 100 87 86 ...
##  $ prestige : int  82 83 90 76 90 87 93 90 52 88 ...
summary(Duncan)
##    type        income        education         prestige    
##  bc  :21   Min.   : 7.00   Min.   :  7.00   Min.   : 3.00  
##  prof:18   1st Qu.:21.00   1st Qu.: 26.00   1st Qu.:16.00  
##  wc  : 6   Median :42.00   Median : 45.00   Median :41.00  
##            Mean   :41.87   Mean   : 52.56   Mean   :47.69  
##            3rd Qu.:64.00   3rd Qu.: 84.00   3rd Qu.:81.00  
##            Max.   :81.00   Max.   :100.00   Max.   :97.00

The dataset includes information on:

  • type: Type of occupation (prof = professional, wc = white collar, bc = blue collar)

  • income: Income level scale (as a percentage of earnings)

  • education: Education level scale (as a percentage in occupations who were HS graduates)

  • prestige: Occupational prestige score (NORC rating scale of the occupation)

Let’s visualize the relationship between income and occupational prestige:

ggplot(Duncan, aes(x = income, y = prestige)) +
  geom_point(aes(color = type), size = 3) +
  geom_smooth(method = "lm", color = "black", linetype = "dashed") +
  labs(
    title = "Income and Occupational Prestige",
    subtitle = "Duncan occupational prestige data (1950)",
    x = "Income Score (0-100)",
    y = "Prestige Score (0-100)",
    color = "Occupation Type"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

The plot suggests a positive relationship between income and prestige, with variation by occupation type. But let’s model and speak more substantive regarding this relationship.

Running the Regression

Now let’s fit a simple linear regression model to predict prestige based on income:

model_duncan <- lm(prestige ~ income, data = Duncan)

summary(model_duncan)
## 
## Call:
## lm(formula = prestige ~ income, data = Duncan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.566  -9.421   0.257   9.167  61.855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4566     5.1901   0.473    0.638    
## income        1.0804     0.1074  10.062 7.14e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.4 on 43 degrees of freedom
## Multiple R-squared:  0.7019, Adjusted R-squared:  0.695 
## F-statistic: 101.3 on 1 and 43 DF,  p-value: 7.144e-13

What do you note?

Interpreting the Results

Let’s examine the results using broom for clarity:

tidy(model_duncan, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)     2.46     5.19      0.473 6.38e- 1   -8.01      12.9 
## 2 income          1.08     0.107    10.1   7.14e-13    0.864      1.30

Interpreting the Duncan Occupational Prestige Model

Looking at this regression output for our model predicting occupational prestige based on income scores:

Intercept Interpretation:

  • The intercept (2.46) represents the expected prestige score when income scale score equals zero

Income Score Coefficient:

  • Key finding: For each additional point on the income scale (0-100), occupational prestige increases by about 1.08 points on average

  • Real-world interpretation:

    • Occupations differing by 10 points in income score would be expected to differ by 10.8 points in prestige

    • This is a substantively large relationship – income has nearly a one-to-one relationship with prestige

  • This relationship is highly statistically significant (p-value < 0.001)

  • The large t-statistic (10.06) indicates this relationship is very unlikely to occur by random chance

Confidence Interval Interpretation:

  • The 95% confidence interval ranges from 0.86 to 1.30

  • This means if we were to repeatedly sample occupations and calculate this relationship, about 95% of those intervals would contain the true population relationship between income and prestige

  • Since this interval is entirely positive and doesn’t include zero, we can be confident in the positive relationship between income and prestige

Model Fit:

  • The R-squared value of 0.702 indicates that income alone explains about 70.2% of the variation in occupational prestige

  • This is an really strong relationship for a single-predictor model in social science research, and you likely not encounter that in your own project

  • The strong relationship suggests that economic rewards are closely tied to social status in occupational hierarchies (based on the data, and with the scale assumptions)

Sociological Significance:

  • This finding has important implications for understanding social stratification

  • The close relationship between income and prestige supports theories that economic and social status dimensions of inequality are tightly coupled

  • Income appears to be a powerful predictor of social standing, at least in the occupational domain

Remember this analysis is based on 1950s occupational data, so contemporary relationships might differ as occupation structures have evolved.

Professional Visualization

Let’s create a simple visualization of our findings:

plot_model(model_duncan, show.values = TRUE) +
  labs(
    title = "Effect of Income on Occupational Prestige",
    subtitle = "One point increase in income score associated with 1.08 point increase in prestige"
  )

Creating a Professional Table

Let’s create a professional-looking regression table with modelsummary:

modelsummary(model_duncan,
             fmt = 2,
             estimate = "{estimate} [{conf.low}, {conf.high}]",
             statistic = NULL,
             coef_rename = c("(Intercept)" = "Intercept", "income" = "Income Score"),
             gof_map = c("nobs", "r.squared", "adj.r.squared", "rmse", "aic", "bic"),
             title = "Regression of Occupational Prestige on Income",
             notes = "Data: Duncan occupational prestige data (1950)")
Regression of Occupational Prestige on Income
(1)
Data: Duncan occupational prestige data (1950)
Intercept 2.46 [-8.01, 12.92]
Income Score 1.08 [0.86, 1.30]
Num.Obs. 45
R2 0.702
R2 Adj. 0.695
RMSE 17.01
AIC 388.8
BIC 394.2

We can also do it with the sjplot package:

tab_model(model_duncan, auto.label=TRUE)
  prestige
Predictors Estimates CI p
(Intercept) 2.46 -8.01 – 12.92 0.638
income 1.08 0.86 – 1.30 <0.001
Observations 45
R2 / R2 adjusted 0.702 / 0.695

ESS Data Application: Populist Attitudes

For our second example, we’ll examine populist attitudes using data from the European Social Survey (ESS). Specifically, we’ll analyze how these attitudes might differ by gender in Spain.

Understanding Scales

Before diving into the analysis, let’s discuss what we mean by “scales” in sociological research. A scale is a composite measure created by combining multiple items that are believed to tap into the same underlying construct. Scales offer several advantages over single-item measures:

In this example, we will create a “populist attitudes” scale based on trust items from the ESS, following the methods similar to that used by Norris and Inglehart (2019) in their widely-cited book “Cultural Backlash.” Their conceptualization of populism (and eventual measurement as seen below) has been criticized. However, it does generate a nice 0-100 scale leveraging some trust items.

Data Preparation and Scale Construction

First, let’s load the ESS data for Spain and examine the trust variables we’ll use for our populist scale

spain_data <- read.fst("spain_data.fst")

head(spain_data[, c("trstplt", "trstprl", "trstprt", "gndr")])
##   trstplt trstprl trstprt gndr
## 1       1       5      NA    2
## 2       2       4      NA    1
## 3       0       5      NA    1
## 4       3       5      NA    1
## 5      88       7      NA    1
## 6       5       9      NA    1

Note here we already seen there are lots of NA values to remove. But also values to deal with that should be treated as NA. The standard is to check the survey documentation – again, for the ESS, found here: https://www.europeansocialsurvey.org/data-portal

We can do a simple check. For instance:

table(spain_data$trstprt)
## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88   99 
## 4887 1784 2097 2210 1855 2628  952  558  274   66   58   40  278   36

These three variables measure respondents’ trust in politicians (trstplt), parliament (trstprl), and political parties (trstprt) on a scale from 0 (no trust at all) to 10 (complete trust). According to Norris and Inglehart’s conceptualization of populism, anti-trust in political institutions/parties/political elites is a key component of populist attitudes.

Now, let’s construct our populist scale:

# Step 1: Clean the trust variables (set values > 10 to NA)
spain_data$trstplt <- ifelse(spain_data$trstplt > 10, NA, spain_data$trstplt)
spain_data$trstprl <- ifelse(spain_data$trstprl > 10, NA, spain_data$trstprl)
spain_data$trstprt <- ifelse(spain_data$trstprt > 10, NA, spain_data$trstprt)

# Step 2: Create a trust scale (0-100)
spain_data$trust <- scales::rescale(
  spain_data$trstplt + spain_data$trstprl + spain_data$trstprt, 
  na.rm = TRUE, 
  to = c(0, 100)
)

# Step 3: Flip the scale to create a populist measure (higher = more populist)
# This reverses the direction so that lower trust = higher populism
spain_data$populist <- scales::rescale(spain_data$trust, na.rm = TRUE, to = c(100, 0))

# Recode gender as a factor (1 = Male, 2 = Female)
spain_data$gender <- factor(spain_data$gndr, 
                           levels = c(1, 2), 
                           labels = c("Male", "Female"))

# Examine the properties of our new scale
summary(spain_data$populist, na.rm = TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   50.00   66.67   67.88   86.67  100.00    2775

The scale transformation works as follows:

  1. We sum the three trust items to create an overall trust score

  2. We use scales::rescale() to transform this sum to a 0-100 scale

  3. We create the populist scale by inverting the trust scale (so low trust = high populism)

Visualizing the Populist Scale Distribution

Let’s examine the distribution of populist attitudes in Spain:

ggplot(spain_data, aes(x = populist)) +
  # Add density histogram
  geom_histogram(
    aes(y = ..density..),
    bins = 30, 
    fill = "#3498db", 
    color = "white", 
    alpha = 0.8
  ) +
  # Add density curve
  geom_density(
    color = "#e74c3c", 
    size = 1, 
    fill = "#e74c3c", 
    alpha = 0.2
  ) +
  # Add mean line
  geom_vline(
    aes(xintercept = mean(populist, na.rm = TRUE)),
    color = "#2c3e50",
    linetype = "dashed",
    size = 1
  ) +
  # Add annotations
  annotate(
    "text",
    x = mean(spain_data$populist, na.rm = TRUE) + 5,
    y = 0.025,
    label = paste("Mean =", round(mean(spain_data$populist, na.rm = TRUE), 1)),
    color = "#2c3e50",
    fontface = "bold",
    hjust = 0
  ) +
  # Add labels
  labs(
    title = "Distribution of Populist Attitudes in Spain",
    subtitle = "Based on inverted trust in politicians, parliament, and parties (ESS data)",
    x = "Populist Attitudes Scale (0-100)",
    y = "Density",
    caption = "Higher values indicate stronger populist attitudes (lower trust in political institutions)"
  ) +
  # Apply theme
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray50"),
    plot.caption = element_text(color = "gray50", hjust = 0),
    panel.grid.minor = element_blank()
  )

Visualizing Populist Attitudes by Gender

Let’s explore how populist attitudes vary by gender:

# Calculate mean populist score and confidence intervals by gender
populist_summary <- spain_data %>%
  filter(!is.na(gender) & !is.na(populist)) %>%
  group_by(gender) %>%
  summarize(
    mean_populist = mean(populist, na.rm = TRUE),
    sd_populist = sd(populist, na.rm = TRUE),
    n = n(),
    se_populist = sd_populist / sqrt(n),
    ci_lower = mean_populist - 1.96 * se_populist,
    ci_upper = mean_populist + 1.96 * se_populist
  )

populist_summary
## # A tibble: 2 × 7
##   gender mean_populist sd_populist     n se_populist ci_lower ci_upper
##   <fct>          <dbl>       <dbl> <int>       <dbl>    <dbl>    <dbl>
## 1 Male            67.8        21.6  8368       0.236     67.3     68.3
## 2 Female          68.0        21.4  8307       0.234     67.5     68.4

Running the Regression

Now, let’s fit a regression model to test whether/how gender predicts populist attitudes:

# Make Male the reference category
spain_data$gender <- relevel(factor(spain_data$gender), ref = "Female")

# Run the regression
model_populist <- lm(populist ~ gender, data = spain_data)

# View the model summary
summary(model_populist)
## 
## Call:
## lm(formula = populist ~ gender, data = spain_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.984 -17.796  -1.129  18.682  32.204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  67.9844     0.2356 288.545   <2e-16 ***
## genderMale   -0.1889     0.3326  -0.568     0.57    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.47 on 16673 degrees of freedom
##   (2777 observations deleted due to missingness)
## Multiple R-squared:  1.934e-05,  Adjusted R-squared:  -4.064e-05 
## F-statistic: 0.3224 on 1 and 16673 DF,  p-value: 0.5702

Examining Results

Let’s examine the results using the broom package:

tab_model(model_populist)
  populist
Predictors Estimates CI p
(Intercept) 67.98 67.52 – 68.45 <0.001
gender [Male] -0.19 -0.84 – 0.46 0.570
Observations 16675
R2 / R2 adjusted 0.000 / -0.000

Interpreting the Gender and Populist Attitudes Results

Let’s examine the regression results for our model predicting populist attitudes based on gender:

  • Gender coefficient: Males score on average 0.19 points lower than females on the populist attitudes scale. However, this difference is not statistically significant (p=0.570). We thus fail to reject the null.

  • Confidence interval: The 95% confidence interval for the gender effect ranges from -0.84 to 0.46. Because this interval includes zero, we cannot reject the null hypothesis that there is no difference in populist attitudes between males and females.

  • Model fit: The R² value of 0.000 indicates that gender explains essentially none of the variation in populist attitudes among Spanish respondents. This confirms that gender is not a meaningful predictor of populist attitudes in this sample.

Why this result is important:

Unlike our previous examples where we found significant relationships (income predicting occupational prestige), here we have a clear case of a non-significant result. This highlights an important aspect of statistical inference - sometimes our variables simply do not have the relationships we might expect.

Substantive interpretation:

These results suggest that populist attitudes in Spain are not differentiated by gender. Both men and women show similar levels of trust/distrust in political institutions. This finding contrasts with some theories that suggest gender might be associated with different political attitudes.

Interpreting the Results

Visualization with sjPlot

plot_model(model_populist, 
           type = "est",
           show.values = TRUE, 
           value.offset = 0.3,
           vline.color = "steelblue",
           title = "Effect of Gender on Populist Attitudes",
           axis.labels = c("Female vs. Male"),
           axis.title = "Difference in Populist Attitudes Score") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.y = element_text(face = "bold")
  )

Second ESS Data Application: Authoritarian Values

Now let’s examine authoritarian values using the ESS data. We’ll create another scale to measure authoritarian values and analyze how they vary across age groups. We will create the authoritarian values scale from Schwartz human values items, found under the corresponding tab: https://ess.sikt.no/en/datafile/242aaa39-3bbb-40f5-98bf-bfb1ce53d8ef

hungary_data <- read.fst("hungary_data.fst")

Scale Construction

# Create authoritarian values scale from Schwartz human values items
hungary_data <- hungary_data %>%
  mutate(
    # Rename variables for clarity
    behave = ipbhprp,   # Proper behavior
    secure = impsafe,   # Security
    safety = ipstrgv,   # Strong government
    tradition = imptrad, # Tradition
    rules = ipfrule     # Following rules
  ) %>%
  # Set invalid values to NA
  mutate(across(c("behave", "secure", "safety", "tradition", "rules"),
                ~ na_if(.x, 7) %>% na_if(8) %>% na_if(9))) %>%
  # Apply reverse coding (original scale is 1=very much like me, 6=not like me at all)
  # After reversing: higher scores = more authoritarian
  mutate(across(c("behave", "secure", "safety", "tradition", "rules"), ~ 7 - .x))

# Calculate the authoritarian scale (0-100)
hungary_data$auth <- scales::rescale(
  hungary_data$behave + 
  hungary_data$secure + 
  hungary_data$safety + 
  hungary_data$tradition + 
  hungary_data$rules, 
  to = c(0, 100), 
  na.rm = TRUE
)

# Examine the distribution
summary(hungary_data$auth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   60.00   72.00   71.01   80.00  100.00    1031
# Visualize the distribution
ggplot(hungary_data, aes(x = auth)) +
  geom_histogram(bins = 30, fill = "#e74c3c", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Authoritarian Values Scale",
    subtitle = "Hungary (ESS data)",
    x = "Authoritarian Values (Higher = More Authoritarian)",
    y = "Count"
  ) +
  theme_minimal()

Now let’s create a generational variable for our analysis:

# Create generational cohorts based on year of birth
hungary_data <- hungary_data %>%
  mutate(
    # Recode year of birth, setting invalid values to NA
    birth_year = ifelse(yrbrn < 1928 | yrbrn > 2005, NA, yrbrn),
    
    # Create generation variable
    generation = case_when(
      birth_year %in% 1928:1945 ~ "Interwar",
      birth_year %in% 1946:1964 ~ "Baby Boomers",
      birth_year %in% 1965:1980 ~ "Generation X",
      birth_year %in% 1981:1996 ~ "Millennials",
      birth_year %in% 1997:2005 ~ "Gen Z",
      TRUE ~ NA_character_
    ),
    
    # Convert to factor with correct order
    generation = factor(generation, 
                      levels = c("Interwar", "Baby Boomers", 
                               "Generation X", "Millennials", "Gen Z"))
  )

# Check the distribution
table(hungary_data$generation, useNA = "ifany")
## 
##     Interwar Baby Boomers Generation X  Millennials        Gen Z         <NA> 
##         2945         5294         4537         2948          465          453

Visualization by Generation

Let’s visualize authoritarian values by generation:

# Calculate mean authoritarian score by generation
auth_summary <- hungary_data %>%
  filter(!is.na(generation)) %>%
  group_by(generation) %>%
  summarize(
    mean_auth = mean(auth, na.rm = TRUE),
    sd_auth = sd(auth, na.rm = TRUE),
    n = n(),
    se_auth = sd_auth / sqrt(n)
  )

ggplot(auth_summary, aes(x = generation, y = mean_auth, fill = generation)) +
  geom_col(alpha = 0.8) +
  labs(
    title = "Authoritarian Values by Generation",
    subtitle = "Hungary (ESS data)",
    x = "Generation",
    y = "Mean Authoritarian Values Score (0-100)",
    caption = "Error bars show standard error of the mean"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_fill_viridis_d()

Running the Regression

Now let’s fit a regression model, using the Interwar generation as the reference category:

model_auth <- lm(auth ~ generation, data = hungary_data)

summary(model_auth)
## 
## Call:
## lm(formula = auth ~ generation, data = hungary_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.605  -9.487   0.610  10.513  32.952 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             75.6053     0.2832  266.95   <2e-16 ***
## generationBaby Boomers  -3.9441     0.3524  -11.19   <2e-16 ***
## generationGeneration X  -6.1187     0.3623  -16.89   <2e-16 ***
## generationMillennials   -8.2150     0.3983  -20.62   <2e-16 ***
## generationGen Z         -8.5577     0.7590  -11.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.79 on 15201 degrees of freedom
##   (1436 observations deleted due to missingness)
## Multiple R-squared:  0.03277,    Adjusted R-squared:  0.03252 
## F-statistic: 128.8 on 4 and 15201 DF,  p-value: < 2.2e-16

Interpreting Results

Let’s examine the results using broom:

tidy(model_auth, conf.int = TRUE)
## # A tibble: 5 × 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)              75.6      0.283     267.  0           75.1      76.2 
## 2 generationBaby Boome…    -3.94     0.352     -11.2 5.85e-29    -4.63     -3.25
## 3 generationGeneration…    -6.12     0.362     -16.9 2.13e-63    -6.83     -5.41
## 4 generationMillennials    -8.22     0.398     -20.6 3.06e-93    -9.00     -7.43
## 5 generationGen Z          -8.56     0.759     -11.3 2.27e-29   -10.0      -7.07

Interpreting Regression with Categorical Predictors: Generations and Authoritarian Values

Understanding the Omitted Reference Category

In this regression model, we’re using categorical predictors with the Interwar generation serving as our reference (omitted) category. This is fundamental to interpreting the results correctly:

  • The omitted category (Interwar generation) becomes the baseline against which all other categories are compared

  • The intercept represents the predicted value for members of this reference category

  • Each coefficient shows the difference between a given category and the reference category

Interpreting the Results

Intercept (75.6):

  • This represents the predicted authoritarian values score for the Interwar generation (our reference category)

  • Members of the Interwar generation have an average authoritarian values score of approximately 76 points (75.6, specifically) on our 0-100 scale

  • This makes the Interwar generation the most authoritarian of all the generational cohorts in our Hungarian sample

Generational Differences:

  • Baby Boomers: Score 3.94 points lower on the authoritarian scale compared to the Interwar generation

  • Generation X: Score 6.11 points lower compared to the Interwar generation

  • Millennials: Score 8.22 points lower compared to the Interwar generation

  • Gen Z: Score 8.56 points lower compared to the Interwar generation

All of these differences are statistically significant (p < 0.001), indicating that there is strong evidence for generational differences in authoritarian values.

Model Visualization

plot_model(model_auth, show.values = TRUE, auto.label = TRUE) +
  labs(
    title = "Effects of Generation on Authoritarian Values",
    subtitle = "Reference category: Interwar Generation"
  )

tab_model(model_auth)
  auth
Predictors Estimates CI p
(Intercept) 75.61 75.05 – 76.16 <0.001
generation [Baby Boomers] -3.94 -4.63 – -3.25 <0.001
generation [Generation X] -6.12 -6.83 – -5.41 <0.001
generation [Millennials] -8.22 -9.00 – -7.43 <0.001
generation [Gen Z] -8.56 -10.05 – -7.07 <0.001
Observations 15206
R2 / R2 adjusted 0.033 / 0.033

Let’s also create a professional regression table:

modelsummary(model_auth,
             stars = TRUE,
             coef_rename = c("(Intercept)" = "Intercept (Interwar)", 
                           "generationBaby Boomers" = "Baby Boomers",
                           "generationGeneration X" = "Generation X",
                           "generationMillennials" = "Millennials",
                           "generationGen Z" = "Gen Z"),
             gof_map = c("nobs", "r.squared", "adj.r.squared"),
             title = "Regression of Authoritarian Values on Generation",
             notes = "Data: European Social Survey (Hungary)")
Regression of Authoritarian Values on Generation
(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Data: European Social Survey (Hungary)
Intercept (Interwar) 75.605***
(0.283)
Baby Boomers -3.944***
(0.352)
Generation X -6.119***
(0.362)
Millennials -8.215***
(0.398)
Gen Z -8.558***
(0.759)
Num.Obs. 15206
R2 0.033
R2 Adj. 0.033

GSS Data: Confidence in Scientific Community

For our final example, we will analyze how education relates to confidence in the scientific community using General Social Survey (GSS) data.

Data Preparation

First, let’s load and prepare the GSS data:

gss_data <- read_fst("gss2022.fst")
table(gss_data$consci, useNA = "ifany")
## 
##                  a great deal                     only some 
##                         19781                         22537 
##                    hardly any                    don't know 
##                          3434                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0 
##                          <NA> 
##                         26638
table(gss_data$degree, useNA = "ifany")
## 
##         less than high school                   high school 
##                         14192                         36446 
##      associate/junior college                    bachelor's 
##                          4355                         11248 
##                      graduate                    don't know 
##                          5953                             0 
##                           iap            I don't have a job 
##                             0                             0 
##                   dk, na, iap                     no answer 
##                             0                             0 
##    not imputable_(2147483637)    not imputable_(2147483638) 
##                             0                             0 
##                       refused                skipped on web 
##                             0                             0 
##                    uncodeable not available in this release 
##                             0                             0 
##    not available in this year                  see codebook 
##                             0                             0 
##                          <NA> 
##                           196

Now, let’s clean and recode these variables more efficiently:

gss_data <- gss_data %>%
  mutate(
    high_confidence = case_when(
      consci == "a great deal" ~ 1, 
      consci %in% c("only some", "hardly any") ~ 0,   
      TRUE ~ NA_real_
    )
  )

gss_data <- gss_data %>%
  mutate(
    education = case_when(
      degree %in% c("less than high school", "high school") ~ "High School or Less",
      degree == "associate/junior college" ~ "Some College",
      degree %in% c("bachelor's", "graduate") ~ "Bachelor's or Higher",
      TRUE ~ NA_character_
    ),
    education = factor(education, 
                     levels = c("High School or Less", "Some College", "Bachelor's or Higher"))
  )

gss_clean <- gss_data %>%
  filter(!is.na(education) & !is.na(high_confidence))

table(gss_clean$education, gss_clean$high_confidence, useNA = "ifany")
##                       
##                            0     1
##   High School or Less  19655 12201
##   Some College          1522  1194
##   Bachelor's or Higher  4732  6349

Data Visualization

Let’s visualize the relationship between education and confidence in science:

conf_summary <- gss_clean %>%
  group_by(education) %>%
  summarize(
    prop_high_conf = mean(high_confidence, na.rm = TRUE),
    n = n(),
    se = sqrt((prop_high_conf * (1 - prop_high_conf)) / n),
    ci_lower = prop_high_conf - 1.96 * se,
    ci_upper = prop_high_conf + 1.96 * se
  )
conf_summary
## # A tibble: 3 × 6
##   education            prop_high_conf     n      se ci_lower ci_upper
##   <fct>                         <dbl> <int>   <dbl>    <dbl>    <dbl>
## 1 High School or Less           0.383 31856 0.00272    0.378    0.388
## 2 Some College                  0.440  2716 0.00952    0.421    0.458
## 3 Bachelor's or Higher          0.573 11081 0.00470    0.564    0.582
ggplot(conf_summary, aes(x = education, y = prop_high_conf, fill = education)) +
  geom_col(alpha = 0.8) +
  scale_fill_viridis_d() +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  labs(
    title = "Proportion with 'Great Deal' of Confidence in Science",
    subtitle = "By education level with 95% confidence intervals",
    x = "Education Level",
    y = "Proportion"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none"
  )

Regression Model

Let’s fit a model to quantify the relationship between education and confidence in scientific community:

model_conf <- lm(high_confidence ~ education,
                          data = gss_clean)

summary(model_conf)
## 
## Call:
## lm(formula = high_confidence ~ education, data = gss_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.573 -0.383 -0.383  0.617  0.617 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.383005   0.002739 139.843  < 2e-16 ***
## educationSome College         0.056612   0.009771   5.794 6.93e-09 ***
## educationBachelor's or Higher 0.189958   0.005391  35.235  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4888 on 45650 degrees of freedom
## Multiple R-squared:  0.02649,    Adjusted R-squared:  0.02645 
## F-statistic:   621 on 2 and 45650 DF,  p-value: < 2.2e-16
plot_model(model_conf)

Professional Tables and Visualizations

Let’s create professional tables and visualizations using sjPlot:

tab_model(model_conf,
          title = "Relationship Between Education and High Confidence in Science",
          pred.labels = c("(Intercept)", 
                         "Some College (vs. HS or Less)", 
                         "Bachelor's or Higher (vs. HS or Less)"),
          dv.labels = "Great Deal of Confidence (vs. Less)",
          string.pred = "Predictors",
          string.ci = "95% CI",
          string.p = "p-value",
          p.style = "numeric",
          transform = NULL,  # Show odds ratios
          show.reflvl = TRUE,
          show.aic = TRUE)
Relationship Between Education and High Confidence in Science
  Great Deal of Confidence (vs. Less)
Predictors Estimates 95% CI p-value
(Intercept) 0.38 0.38 – 0.39 <0.001
Some College (vs. HS or Less) 0.06 0.04 – 0.08 <0.001
Bachelor’s or Higher (vs. HS or Less) 0.19 0.18 – 0.20 <0.001
Observations 45653
R2 / R2 adjusted 0.026 / 0.026
AIC 64211.327

Interpreting the Linear Regression Results

Looking at these results from our linear model examining the relationship between education and high confidence in science:

Intercept Interpretation:

  • The intercept (0.38) represents the proportion of respondents with “High School or Less” education who express a “Great Deal” of confidence in science

  • This tells us that about 38% of those with lower educational attainment still have high confidence in scientific institutions

Education Effects:

  • Some College: Respondents with some college education are 6 percentage points more likely to have high confidence in science compared to those with high school or less (0.38 + 0.06 = 0.44 or 44%)

  • Bachelor’s or Higher: Those with at least a bachelor’s degree are 19 percentage points more likely to have high confidence than those with high school or less (0.38 + 0.19 = 0.57 or 57%)

  • All of these differences are statistically significant (p < 0.001)

Educational Gradient:

  • There is a clear “step pattern” in the relationship between education and science confidence

  • The effect of having a bachelor’s degree or higher (19 percentage points) is more than three times the effect of having some college (6 percentage points)

Model Fit: - The R² value of 0.026 indicates that education alone explains about 2.6% of the variation in confidence in science

  • While this might seem low, it’s actually a meaningful effect size for a single predictor of attitudes in social science research

In the next session, we will consider multivariate regression models to account (or “adjust”) for other predictors!

END