pkgs <- c("tidyverse", "openintro", "broom", "dplyr", "ggplot2")
to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(to_install)) install.packages(to_install)
library(tidyverse); library(openintro); library(broom)
data('hfi', package = 'openintro')
# Clean subset used throughout
hfi_clean <- hfi %>%
select(year, countries, region, pf_score, pf_expression_control, hf_score) %>%
filter(!is.na(pf_score), !is.na(pf_expression_control))
glimpse(hfi_clean)
## Rows: 1,378
## Columns: 6
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, …
## $ countries <chr> "Albania", "Algeria", "Angola", "Argentina", "Ar…
## $ region <chr> "Eastern Europe", "Middle East & North Africa", …
## $ pf_score <dbl> 7.596281, 5.281772, 6.111324, 8.099696, 6.912804…
## $ pf_expression_control <dbl> 5.25, 4.00, 2.50, 5.50, 4.25, 7.75, 8.00, 0.25, …
## $ hf_score <dbl> 7.568140, 5.135886, 5.640662, 6.469848, 7.241402…
Answer: The dataset has 1458 rows (country-year observations) and 123 columns (variables).
dim(hfi)
## [1] 1458 123
Answer: A scatterplot is appropriate for two quantitative variables. The plot below shows a clear positive, roughly linear trend: higher media expression control scores (i.e., fewer controls) are associated with higher personal freedom scores. I would be comfortable fitting a simple linear model, and I’ll check assumptions with diagnostics.
ggplot(hfi_clean, aes(x = pf_expression_control, y = pf_score)) +
geom_point(alpha = 0.5) +
labs(x = "Expression Control (0–10; higher = fewer controls)",
y = "Personal Freedom Score (0–10)",
title = "pf_score vs pf_expression_control") +
theme_minimal()
Answer: The correlation is computed below. A value near 0.8 indicates a strong positive linear association with modest vertical spread and no severe outliers.
hfi_clean %>%
summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))
Brief description: Form ≈ linear; direction positive; strength strong; unusual points minimal.
The original lab uses
DATA606::plot_ssinteractively (console-only). To keep the knitted document lightweight, I instead compute the residual sum of squares (RSS) from the least-squares fit and explain it.
m1 <- lm(pf_score ~ pf_expression_control, data = hfi_clean)
rss <- sum(residuals(m1)^2)
c(Residual_Sum_of_Squares = rss)
## Residual_Sum_of_Squares
## 952.1532
Answer: The least-squares line minimizes the sum of squared residuals. Any other hand-picked line yields a larger sum of squares; therefore this RSS is the smallest achievable under a linear fit.
summary(m1)
##
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8467 -0.5704 0.1452 0.6066 3.2060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61707 0.05745 80.36 <2e-16 ***
## pf_expression_control 0.49143 0.01006 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8318 on 1376 degrees of freedom
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
tidy(m1)
Equation: \(\hat{pf\_score} = 4.617 + 0.491 \times pf\_expression\_control\).
Slope meaning: For a 1-point
increase in pf_expression_control, the
expected personal freedom score increases by about
0.491 points, on average.
\(R^2\): About
63.4% of the variation in pf_score is
explained by this predictor.
ggplot(hfi_clean, aes(x = pf_expression_control, y = pf_score)) +
geom_point(alpha = 0.5) +
stat_smooth(method = "lm", se = FALSE) +
labs(x = "Expression Control (0–10; higher = fewer controls)",
y = "Personal Freedom Score (0–10)",
title = "Least-squares line (no SE ribbon)") +
theme_minimal()
Prediction for pf_expression_control = 6.7:
pred_67 <- predict(m1, newdata = tibble(pf_expression_control = 6.7))
pred_67
## 1
## 7.909663
If someone only used the regression line (and not the raw data),
they’d predict 7.91. The residual
depends on a specific country’s observed value; for any observation with
x = 6.7, residual = observed pf_score − 7.91.
Positive residuals mean the line underestimates;
negative residuals mean it overestimates.
augment(m1) %>%
ggplot(aes(.fitted, .resid)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals",
title = "Residuals vs Fitted") +
theme_minimal()
Answer: The residual cloud is centered near 0 without strong curvature, supporting linearity.
# Histogram (binwidth chosen for residual scale)
augment(m1) %>%
ggplot(aes(x = .resid)) +
geom_histogram(bins = 30) +
labs(title = "Histogram of residuals", x = "Residuals") +
theme_minimal()
# Normal Q–Q
augment(m1) %>%
ggplot(aes(sample = .resid)) +
stat_qq() + stat_qq_line() +
labs(title = "Normal Q–Q plot of residuals") +
theme_minimal()
Answer: The histogram is roughly symmetric and the Q–Q points fall close to the line with mild tail deviations, so the nearly normal condition is adequate for inference.
# Reuse residuals vs fitted as the primary check
augment(m1) %>%
ggplot(aes(.fitted, .resid)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE, method = "loess") +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals",
title = "Residual spread across fitted values") +
theme_minimal()
Answer: Residual spread is fairly uniform across fitted values; any heteroskedasticity is minor. The constant variability assumption looks reasonable.
Below I relate total human freedom score (hf_score) to
pf_expression_control.
m2 <- lm(hf_score ~ pf_expression_control, data = hfi %>% filter(!is.na(hf_score), !is.na(pf_expression_control)))
summary(m2)
##
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi %>%
## filter(!is.na(hf_score), !is.na(pf_expression_control)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6198 -0.4908 0.1031 0.4703 2.2933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.153687 0.046070 111.87 <2e-16 ***
## pf_expression_control 0.349862 0.008067 43.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.667 on 1376 degrees of freedom
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5772
## F-statistic: 1881 on 1 and 1376 DF, p-value: < 2.2e-16
ggplot(hfi, aes(x = pf_expression_control, y = hf_score)) +
geom_point(alpha = 0.5, na.rm = TRUE) +
stat_smooth(method = "lm", se = TRUE, na.rm = TRUE) +
labs(x = "Expression Control (0–10)", y = "Human Freedom Score (0–10)",
title = "hf_score vs pf_expression_control") +
theme_minimal()
At a glance: The trend remains positive and fairly linear.
Answer: The \(R^2\)
for pf_score ~ pf_expression_control is
0.634, while for
hf_score ~ pf_expression_control it is
0.578. Thus, pf_expression_control explains a larger
share of variation in pf_score; differences are modest, suggesting other
factors also matter.
As an additional example, consider predicting pf_score
from year (not causally meaningful but illustrative).
m3 <- lm(pf_score ~ year, data = hfi %>% filter(!is.na(pf_score), !is.na(year)))
summary(m3)
##
## Call:
## lm(formula = pf_score ~ year, data = hfi %>% filter(!is.na(pf_score),
## !is.na(year)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8316 -0.9688 -0.0321 1.2253 2.4239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.32687 28.89108 3.888 0.000106 ***
## year -0.05225 0.01436 -3.639 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.369 on 1376 degrees of freedom
## Multiple R-squared: 0.00953, Adjusted R-squared: 0.008811
## F-statistic: 13.24 on 1 and 1376 DF, p-value: 0.0002842
# Diagnostics
augment(m3) %>% ggplot(aes(.fitted, .resid)) +
geom_point(alpha = 0.5) + geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted (pf_score ~ year)") + theme_minimal()
augment(m3) %>% ggplot(aes(sample = .resid)) +
stat_qq() + stat_qq_line() + labs(title = "Q–Q (pf_score ~ year)") + theme_minimal()
Answer: The year effect is small;
diagnostics suggest linearity is weak and \(R^2\) is low, reinforcing that
institutional predictors like expression control are
more informative.