knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
# List of packages
packages <- c("tidyverse", "broom", "modelsummary", "sjPlot", "car", "knitr", "gt", "fst")
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "broom" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "broom" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "sjPlot" "modelsummary" "broom" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "car" "carData" "sjPlot" "modelsummary" "broom"
## [6] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [11] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [16] "stats" "graphics" "grDevices" "utils" "datasets"
## [21] "methods" "base"
##
## [[6]]
## [1] "knitr" "car" "carData" "sjPlot" "modelsummary"
## [6] "broom" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[7]]
## [1] "gt" "knitr" "car" "carData" "sjPlot"
## [6] "modelsummary" "broom" "lubridate" "forcats" "stringr"
## [11] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [16] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [21] "utils" "datasets" "methods" "base"
##
## [[8]]
## [1] "fst" "gt" "knitr" "car" "carData"
## [6] "sjPlot" "modelsummary" "broom" "lubridate" "forcats"
## [11] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [16] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [21] "grDevices" "utils" "datasets" "methods" "base"
Simple linear regression allows us to model how an outcome variable (Y) is related to a single predictor variable (X). A linear regression measures both the strength of a relationship but also models how one variable relates to another.
In this tutorial, we’ll implement several simple linear regression models to answer research questions using real data. We’ll focus on:
Understanding the regression equation and output
Interpreting coefficients properly
Creating visualizations to communicate findings
Presenting results in professional tables
Throughout, we’ll follow a systematic workflow:
Explore and prepare variables
Visualize relationships
Fit regression models
Interpret the coefficients
Present and visualize results professionally
For our first example, we will use the classic Duncan dataset from
the car
package, which contains data on the prestige of 45
U.S. occupations in 1950, along with other characteristics.
## 'data.frame': 45 obs. of 4 variables:
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 3 2 ...
## $ income : int 62 72 75 55 64 21 64 80 67 72 ...
## $ education: int 86 76 92 90 86 84 93 100 87 86 ...
## $ prestige : int 82 83 90 76 90 87 93 90 52 88 ...
## type income education prestige
## bc :21 Min. : 7.00 Min. : 7.00 Min. : 3.00
## prof:18 1st Qu.:21.00 1st Qu.: 26.00 1st Qu.:16.00
## wc : 6 Median :42.00 Median : 45.00 Median :41.00
## Mean :41.87 Mean : 52.56 Mean :47.69
## 3rd Qu.:64.00 3rd Qu.: 84.00 3rd Qu.:81.00
## Max. :81.00 Max. :100.00 Max. :97.00
The dataset includes information on:
type
: Type of occupation (prof = professional, wc =
white collar, bc = blue collar)
income
: Income level scale (as a percentage of
earnings)
education
: Education level scale (as a percentage in
occupations who were HS graduates)
prestige
: Occupational prestige score (NORC rating
scale of the occupation)
Let’s visualize the relationship between income and occupational prestige:
ggplot(Duncan, aes(x = income, y = prestige)) +
geom_point(aes(color = type), size = 3) +
geom_smooth(method = "lm", color = "black", linetype = "dashed") +
labs(
title = "Income and Occupational Prestige",
subtitle = "Duncan occupational prestige data (1950)",
x = "Income Score (0-100)",
y = "Prestige Score (0-100)",
color = "Occupation Type"
) +
theme_minimal() +
scale_color_brewer(palette = "Set1")
The plot suggests a positive relationship between income and prestige, with variation by occupation type. But let’s model and speak more substantive regarding this relationship.
Now let’s fit a simple linear regression model to predict prestige based on income:
##
## Call:
## lm(formula = prestige ~ income, data = Duncan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.566 -9.421 0.257 9.167 61.855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4566 5.1901 0.473 0.638
## income 1.0804 0.1074 10.062 7.14e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.4 on 43 degrees of freedom
## Multiple R-squared: 0.7019, Adjusted R-squared: 0.695
## F-statistic: 101.3 on 1 and 43 DF, p-value: 7.144e-13
What do you note?
Let’s examine the results using broom for clarity:
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.46 5.19 0.473 6.38e- 1 -8.01 12.9
## 2 income 1.08 0.107 10.1 7.14e-13 0.864 1.30
Looking at this regression output for our model predicting occupational prestige based on income scores:
Intercept Interpretation:
Income Score Coefficient:
Key finding: For each additional point on the income scale (0-100), occupational prestige increases by about 1.08 points on average
Real-world interpretation:
Occupations differing by 10 points in income score would be expected to differ by 10.8 points in prestige
This is a substantively large relationship – income has nearly a one-to-one relationship with prestige
This relationship is highly statistically significant (p-value < 0.001)
The large t-statistic (10.06) indicates this relationship is very unlikely to occur by random chance
Confidence Interval Interpretation:
The 95% confidence interval ranges from 0.86 to 1.30
This means if we were to repeatedly sample occupations and calculate this relationship, about 95% of those intervals would contain the true population relationship between income and prestige
Since this interval is entirely positive and doesn’t include zero, we can be confident in the positive relationship between income and prestige
Model Fit:
The R-squared value of 0.702 indicates that income alone explains about 70.2% of the variation in occupational prestige
This is an really strong relationship for a single-predictor model in social science research, and you likely not encounter that in your own project
The strong relationship suggests that economic rewards are closely tied to social status in occupational hierarchies (based on the data, and with the scale assumptions)
Sociological Significance:
This finding has important implications for understanding social stratification
The close relationship between income and prestige supports theories that economic and social status dimensions of inequality are tightly coupled
Income appears to be a powerful predictor of social standing, at least in the occupational domain
Remember this analysis is based on 1950s occupational data, so contemporary relationships might differ as occupation structures have evolved.
Let’s create a simple visualization of our findings:
plot_model(model_duncan, show.values = TRUE) +
labs(
title = "Effect of Income on Occupational Prestige",
subtitle = "One point increase in income score associated with 1.08 point increase in prestige"
)
Let’s create a professional-looking regression table with modelsummary:
modelsummary(model_duncan,
fmt = 2,
estimate = "{estimate} [{conf.low}, {conf.high}]",
statistic = NULL,
coef_rename = c("(Intercept)" = "Intercept", "income" = "Income Score"),
gof_map = c("nobs", "r.squared", "adj.r.squared", "rmse", "aic", "bic"),
title = "Regression of Occupational Prestige on Income",
notes = "Data: Duncan occupational prestige data (1950)")
(1) | |
---|---|
Data: Duncan occupational prestige data (1950) | |
Intercept | 2.46 [-8.01, 12.92] |
Income Score | 1.08 [0.86, 1.30] |
Num.Obs. | 45 |
R2 | 0.702 |
R2 Adj. | 0.695 |
RMSE | 17.01 |
AIC | 388.8 |
BIC | 394.2 |
We can also do it with the sjplot package:
prestige | |||
---|---|---|---|
Predictors | Estimates | CI | p |
(Intercept) | 2.46 | -8.01 – 12.92 | 0.638 |
income | 1.08 | 0.86 – 1.30 | <0.001 |
Observations | 45 | ||
R2 / R2 adjusted | 0.702 / 0.695 |
For our second example, we’ll examine populist attitudes using data from the European Social Survey (ESS). Specifically, we’ll analyze how these attitudes might differ by gender in Spain.
Before diving into the analysis, let’s discuss what we mean by “scales” in sociological research. A scale is a composite measure created by combining multiple items that are believed to tap into the same underlying construct. Scales offer several advantages over single-item measures:
In this example, we will create a “populist attitudes” scale based on trust items from the ESS, following the methods similar to that used by Norris and Inglehart (2019) in their widely-cited book “Cultural Backlash.” Their conceptualization of populism (and eventual measurement as seen below) has been criticized. However, it does generate a nice 0-100 scale leveraging some trust items.
First, let’s load the ESS data for Spain and examine the trust variables we’ll use for our populist scale
spain_data <- read.fst("spain_data.fst")
head(spain_data[, c("trstplt", "trstprl", "trstprt", "gndr")])
## trstplt trstprl trstprt gndr
## 1 1 5 NA 2
## 2 2 4 NA 1
## 3 0 5 NA 1
## 4 3 5 NA 1
## 5 88 7 NA 1
## 6 5 9 NA 1
Note here we already seen there are lots of NA values to remove. But also values to deal with that should be treated as NA. The standard is to check the survey documentation – again, for the ESS, found here: https://www.europeansocialsurvey.org/data-portal
We can do a simple check. For instance:
##
## 0 1 2 3 4 5 6 7 8 9 10 77 88 99
## 4887 1784 2097 2210 1855 2628 952 558 274 66 58 40 278 36
These three variables measure respondents’ trust in politicians (trstplt), parliament (trstprl), and political parties (trstprt) on a scale from 0 (no trust at all) to 10 (complete trust). According to Norris and Inglehart’s conceptualization of populism, anti-trust in political institutions/parties/political elites is a key component of populist attitudes.
Now, let’s construct our populist scale:
# Step 1: Clean the trust variables (set values > 10 to NA)
spain_data$trstplt <- ifelse(spain_data$trstplt > 10, NA, spain_data$trstplt)
spain_data$trstprl <- ifelse(spain_data$trstprl > 10, NA, spain_data$trstprl)
spain_data$trstprt <- ifelse(spain_data$trstprt > 10, NA, spain_data$trstprt)
# Step 2: Create a trust scale (0-100)
spain_data$trust <- scales::rescale(
spain_data$trstplt + spain_data$trstprl + spain_data$trstprt,
na.rm = TRUE,
to = c(0, 100)
)
# Step 3: Flip the scale to create a populist measure (higher = more populist)
# This reverses the direction so that lower trust = higher populism
spain_data$populist <- scales::rescale(spain_data$trust, na.rm = TRUE, to = c(100, 0))
# Recode gender as a factor (1 = Male, 2 = Female)
spain_data$gender <- factor(spain_data$gndr,
levels = c(1, 2),
labels = c("Male", "Female"))
# Examine the properties of our new scale
summary(spain_data$populist, na.rm = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 50.00 66.67 67.88 86.67 100.00 2775
The scale transformation works as follows:
We sum the three trust items to create an overall trust score
We use scales::rescale()
to transform this sum to a
0-100 scale
We create the populist scale by inverting the trust scale (so low trust = high populism)
Let’s examine the distribution of populist attitudes in Spain:
ggplot(spain_data, aes(x = populist)) +
# Add density histogram
geom_histogram(
aes(y = ..density..),
bins = 30,
fill = "#3498db",
color = "white",
alpha = 0.8
) +
# Add density curve
geom_density(
color = "#e74c3c",
size = 1,
fill = "#e74c3c",
alpha = 0.2
) +
# Add mean line
geom_vline(
aes(xintercept = mean(populist, na.rm = TRUE)),
color = "#2c3e50",
linetype = "dashed",
size = 1
) +
# Add annotations
annotate(
"text",
x = mean(spain_data$populist, na.rm = TRUE) + 5,
y = 0.025,
label = paste("Mean =", round(mean(spain_data$populist, na.rm = TRUE), 1)),
color = "#2c3e50",
fontface = "bold",
hjust = 0
) +
# Add labels
labs(
title = "Distribution of Populist Attitudes in Spain",
subtitle = "Based on inverted trust in politicians, parliament, and parties (ESS data)",
x = "Populist Attitudes Scale (0-100)",
y = "Density",
caption = "Higher values indicate stronger populist attitudes (lower trust in political institutions)"
) +
# Apply theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "gray50"),
plot.caption = element_text(color = "gray50", hjust = 0),
panel.grid.minor = element_blank()
)
Let’s explore how populist attitudes vary by gender:
# Calculate mean populist score and confidence intervals by gender
populist_summary <- spain_data %>%
filter(!is.na(gender) & !is.na(populist)) %>%
group_by(gender) %>%
summarize(
mean_populist = mean(populist, na.rm = TRUE),
sd_populist = sd(populist, na.rm = TRUE),
n = n(),
se_populist = sd_populist / sqrt(n),
ci_lower = mean_populist - 1.96 * se_populist,
ci_upper = mean_populist + 1.96 * se_populist
)
populist_summary
## # A tibble: 2 × 7
## gender mean_populist sd_populist n se_populist ci_lower ci_upper
## <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Male 67.8 21.6 8368 0.236 67.3 68.3
## 2 Female 68.0 21.4 8307 0.234 67.5 68.4
Now, let’s fit a regression model to test whether/how gender predicts populist attitudes:
# Make Male the reference category
spain_data$gender <- relevel(factor(spain_data$gender), ref = "Female")
# Run the regression
model_populist <- lm(populist ~ gender, data = spain_data)
# View the model summary
summary(model_populist)
##
## Call:
## lm(formula = populist ~ gender, data = spain_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.984 -17.796 -1.129 18.682 32.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9844 0.2356 288.545 <2e-16 ***
## genderMale -0.1889 0.3326 -0.568 0.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.47 on 16673 degrees of freedom
## (2777 observations deleted due to missingness)
## Multiple R-squared: 1.934e-05, Adjusted R-squared: -4.064e-05
## F-statistic: 0.3224 on 1 and 16673 DF, p-value: 0.5702
Let’s examine the regression results for our model predicting populist attitudes based on gender:
Gender coefficient: Males score on average 0.19 points lower than females on the populist attitudes scale. However, this difference is not statistically significant (p=0.570). We thus fail to reject the null.
Confidence interval: The 95% confidence interval for the gender effect ranges from -0.84 to 0.46. Because this interval includes zero, we cannot reject the null hypothesis that there is no difference in populist attitudes between males and females.
Model fit: The R² value of 0.000 indicates that gender explains essentially none of the variation in populist attitudes among Spanish respondents. This confirms that gender is not a meaningful predictor of populist attitudes in this sample.
Why this result is important:
Unlike our previous examples where we found significant relationships (income predicting occupational prestige), here we have a clear case of a non-significant result. This highlights an important aspect of statistical inference - sometimes our variables simply do not have the relationships we might expect.
Substantive interpretation:
These results suggest that populist attitudes in Spain are not differentiated by gender. Both men and women show similar levels of trust/distrust in political institutions. This finding contrasts with some theories that suggest gender might be associated with different political attitudes.
plot_model(model_populist,
type = "est",
show.values = TRUE,
value.offset = 0.3,
vline.color = "steelblue",
title = "Effect of Gender on Populist Attitudes",
axis.labels = c("Female vs. Male"),
axis.title = "Difference in Populist Attitudes Score") +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text.y = element_text(face = "bold")
)
For our final example, we will analyze how education relates to confidence in the scientific community using General Social Survey (GSS) data.
First, let’s load and prepare the GSS data:
##
## a great deal only some
## 19781 22537
## hardly any don't know
## 3434 0
## iap I don't have a job
## 0 0
## dk, na, iap no answer
## 0 0
## not imputable_(2147483637) not imputable_(2147483638)
## 0 0
## refused skipped on web
## 0 0
## uncodeable not available in this release
## 0 0
## not available in this year see codebook
## 0 0
## <NA>
## 26638
##
## less than high school high school
## 14192 36446
## associate/junior college bachelor's
## 4355 11248
## graduate don't know
## 5953 0
## iap I don't have a job
## 0 0
## dk, na, iap no answer
## 0 0
## not imputable_(2147483637) not imputable_(2147483638)
## 0 0
## refused skipped on web
## 0 0
## uncodeable not available in this release
## 0 0
## not available in this year see codebook
## 0 0
## <NA>
## 196
Now, let’s clean and recode these variables more efficiently:
gss_data <- gss_data %>%
mutate(
high_confidence = case_when(
consci == "a great deal" ~ 1,
consci %in% c("only some", "hardly any") ~ 0,
TRUE ~ NA_real_
)
)
gss_data <- gss_data %>%
mutate(
education = case_when(
degree %in% c("less than high school", "high school") ~ "High School or Less",
degree == "associate/junior college" ~ "Some College",
degree %in% c("bachelor's", "graduate") ~ "Bachelor's or Higher",
TRUE ~ NA_character_
),
education = factor(education,
levels = c("High School or Less", "Some College", "Bachelor's or Higher"))
)
gss_clean <- gss_data %>%
filter(!is.na(education) & !is.na(high_confidence))
table(gss_clean$education, gss_clean$high_confidence, useNA = "ifany")
##
## 0 1
## High School or Less 19655 12201
## Some College 1522 1194
## Bachelor's or Higher 4732 6349
Let’s visualize the relationship between education and confidence in science:
conf_summary <- gss_clean %>%
group_by(education) %>%
summarize(
prop_high_conf = mean(high_confidence, na.rm = TRUE),
n = n(),
se = sqrt((prop_high_conf * (1 - prop_high_conf)) / n),
ci_lower = prop_high_conf - 1.96 * se,
ci_upper = prop_high_conf + 1.96 * se
)
conf_summary
## # A tibble: 3 × 6
## education prop_high_conf n se ci_lower ci_upper
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 High School or Less 0.383 31856 0.00272 0.378 0.388
## 2 Some College 0.440 2716 0.00952 0.421 0.458
## 3 Bachelor's or Higher 0.573 11081 0.00470 0.564 0.582
ggplot(conf_summary, aes(x = education, y = prop_high_conf, fill = education)) +
geom_col(alpha = 0.8) +
scale_fill_viridis_d() +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
labs(
title = "Proportion with 'Great Deal' of Confidence in Science",
subtitle = "By education level with 95% confidence intervals",
x = "Education Level",
y = "Proportion"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "none"
)
Let’s fit a model to quantify the relationship between education and confidence in scientific community:
##
## Call:
## lm(formula = high_confidence ~ education, data = gss_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.573 -0.383 -0.383 0.617 0.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.383005 0.002739 139.843 < 2e-16 ***
## educationSome College 0.056612 0.009771 5.794 6.93e-09 ***
## educationBachelor's or Higher 0.189958 0.005391 35.235 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4888 on 45650 degrees of freedom
## Multiple R-squared: 0.02649, Adjusted R-squared: 0.02645
## F-statistic: 621 on 2 and 45650 DF, p-value: < 2.2e-16
Let’s create professional tables and visualizations using
sjPlot
:
tab_model(model_conf,
title = "Relationship Between Education and High Confidence in Science",
pred.labels = c("(Intercept)",
"Some College (vs. HS or Less)",
"Bachelor's or Higher (vs. HS or Less)"),
dv.labels = "Great Deal of Confidence (vs. Less)",
string.pred = "Predictors",
string.ci = "95% CI",
string.p = "p-value",
p.style = "numeric",
transform = NULL, # Show odds ratios
show.reflvl = TRUE,
show.aic = TRUE)
Great Deal of Confidence (vs. Less) | |||
---|---|---|---|
Predictors | Estimates | 95% CI | p-value |
(Intercept) | 0.38 | 0.38 – 0.39 | <0.001 |
Some College (vs. HS or Less) | 0.06 | 0.04 – 0.08 | <0.001 |
Bachelor’s or Higher (vs. HS or Less) | 0.19 | 0.18 – 0.20 | <0.001 |
Observations | 45653 | ||
R2 / R2 adjusted | 0.026 / 0.026 | ||
AIC | 64211.327 |
Looking at these results from our linear model examining the relationship between education and high confidence in science:
Intercept Interpretation:
The intercept (0.38) represents the proportion of respondents with “High School or Less” education who express a “Great Deal” of confidence in science
This tells us that about 38% of those with lower educational attainment still have high confidence in scientific institutions
Education Effects:
Some College: Respondents with some college education are 6 percentage points more likely to have high confidence in science compared to those with high school or less (0.38 + 0.06 = 0.44 or 44%)
Bachelor’s or Higher: Those with at least a bachelor’s degree are 19 percentage points more likely to have high confidence than those with high school or less (0.38 + 0.19 = 0.57 or 57%)
All of these differences are statistically significant (p < 0.001)
Educational Gradient:
There is a clear “step pattern” in the relationship between education and science confidence
The effect of having a bachelor’s degree or higher (19 percentage points) is more than three times the effect of having some college (6 percentage points)
Model Fit: - The R² value of 0.026 indicates that education alone explains about 2.6% of the variation in confidence in science
In the next session, we will consider multivariate regression models to account (or “adjust”) for other predictors!