DATA606 Lab 8 — Introduction to Linear Regression

1 Getting Started

1.1 Load packages and data

pkgs <- c("tidyverse", "openintro", "broom", "dplyr", "ggplot2")
to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(to_install)) install.packages(to_install)
library(tidyverse); library(openintro); library(broom)

data('hfi', package = 'openintro')

# Clean subset used throughout
hfi_clean <- hfi %>%
  select(year, countries, region, pf_score, pf_expression_control, hf_score) %>%
  filter(!is.na(pf_score), !is.na(pf_expression_control))
glimpse(hfi_clean)

## Rows: 1,378
## Columns: 6
## $ year                  <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, …
## $ countries             <chr> "Albania", "Algeria", "Angola", "Argentina", "Ar…
## $ region                <chr> "Eastern Europe", "Middle East & North Africa", …
## $ pf_score              <dbl> 7.596281, 5.281772, 6.111324, 8.099696, 6.912804…
## $ pf_expression_control <dbl> 5.25, 4.00, 2.50, 5.50, 4.25, 7.75, 8.00, 0.25, …
## $ hf_score              <dbl> 7.568140, 5.135886, 5.640662, 6.469848, 7.241402…

2 Exercise 1 — Dimensions of the dataset

Answer: The dataset has 1458 rows (country-year observations) and 123 columns (variables).

dim(hfi)

## [1] 1458  123

3 Exercise 2 — Visualizing pf_score vs pf_expression_control

Answer: A scatterplot is appropriate for two quantitative variables. The plot below shows a clear positive, roughly linear trend: higher media expression control scores (i.e., fewer controls) are associated with higher personal freedom scores. I would be comfortable fitting a simple linear model, and I’ll check assumptions with diagnostics.

ggplot(hfi_clean, aes(x = pf_expression_control, y = pf_score)) +
  geom_point(alpha = 0.5) +
  labs(x = "Expression Control (0–10; higher = fewer controls)",
       y = "Personal Freedom Score (0–10)",
       title = "pf_score vs pf_expression_control") +
  theme_minimal()

4 Exercise 3 — Correlation and description

Answer: The correlation is computed below. A value near 0.8 indicates a strong positive linear association with modest vertical spread and no severe outliers.

hfi_clean %>%
  summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))

Brief description: Form ≈ linear; direction positive; strength strong; unusual points minimal.

5 Exercise 4 — Sum of squared residuals (without interactive plot)

The original lab uses DATA606::plot_ss interactively (console-only). To keep the knitted document lightweight, I instead compute the residual sum of squares (RSS) from the least-squares fit and explain it.

m1 <- lm(pf_score ~ pf_expression_control, data = hfi_clean)
rss <- sum(residuals(m1)^2)
c(Residual_Sum_of_Squares = rss)

## Residual_Sum_of_Squares 
##                952.1532

Answer: The least-squares line minimizes the sum of squared residuals. Any other hand-picked line yields a larger sum of squares; therefore this RSS is the smallest achievable under a linear fit.

6 Exercise 5 — Fit the model and interpret

summary(m1)

## 
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8467 -0.5704  0.1452  0.6066  3.2060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.61707    0.05745   80.36   <2e-16 ***
## pf_expression_control  0.49143    0.01006   48.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8318 on 1376 degrees of freedom
## Multiple R-squared:  0.6342, Adjusted R-squared:  0.634 
## F-statistic:  2386 on 1 and 1376 DF,  p-value: < 2.2e-16

tidy(m1)

Equation: \(\hat{pf\_score} = 4.617 + 0.491 \times pf\_expression\_control\).

Slope meaning: For a 1-point increase in pf_expression_control, the expected personal freedom score increases by about 0.491 points, on average.

\(R^2\): About 63.4% of the variation in pf_score is explained by this predictor.

7 Exercise 6 — Line on scatterplot & a specific prediction

ggplot(hfi_clean, aes(x = pf_expression_control, y = pf_score)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Expression Control (0–10; higher = fewer controls)",
       y = "Personal Freedom Score (0–10)",
       title = "Least-squares line (no SE ribbon)") +
  theme_minimal()

Prediction for pf_expression_control = 6.7:

pred_67 <- predict(m1, newdata = tibble(pf_expression_control = 6.7))
pred_67

##        1 
## 7.909663

If someone only used the regression line (and not the raw data), they’d predict 7.91. The residual depends on a specific country’s observed value; for any observation with x = 6.7, residual = observed pf_score − 7.91. Positive residuals mean the line underestimates; negative residuals mean it overestimates.

8 Exercise 7 — Linearity check via residuals vs fitted

augment(m1) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residuals vs Fitted") +
  theme_minimal()

Answer: The residual cloud is centered near 0 without strong curvature, supporting linearity.

9 Exercise 8 — Nearly normal residuals

# Histogram (binwidth chosen for residual scale)
augment(m1) %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of residuals", x = "Residuals") +
  theme_minimal()

# Normal Q–Q
augment(m1) %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() + stat_qq_line() +
  labs(title = "Normal Q–Q plot of residuals") +
  theme_minimal()

Answer: The histogram is roughly symmetric and the Q–Q points fall close to the line with mild tail deviations, so the nearly normal condition is adequate for inference.

10 Exercise 9 — Constant variability

# Reuse residuals vs fitted as the primary check
augment(m1) %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE, method = "loess") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residual spread across fitted values") +
  theme_minimal()

Answer: Residual spread is fairly uniform across fitted values; any heteroskedasticity is minor. The constant variability assumption looks reasonable.

11 More Practice

11.1 MP1 — Choose another freedom relationship

Below I relate total human freedom score (hf_score) to pf_expression_control.

m2 <- lm(hf_score ~ pf_expression_control, data = hfi %>% filter(!is.na(hf_score), !is.na(pf_expression_control)))
summary(m2)

## 
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi %>% 
##     filter(!is.na(hf_score), !is.na(pf_expression_control)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6198 -0.4908  0.1031  0.4703  2.2933 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.153687   0.046070  111.87   <2e-16 ***
## pf_expression_control 0.349862   0.008067   43.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.667 on 1376 degrees of freedom
## Multiple R-squared:  0.5775, Adjusted R-squared:  0.5772 
## F-statistic:  1881 on 1 and 1376 DF,  p-value: < 2.2e-16

ggplot(hfi, aes(x = pf_expression_control, y = hf_score)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  stat_smooth(method = "lm", se = TRUE, na.rm = TRUE) +
  labs(x = "Expression Control (0–10)", y = "Human Freedom Score (0–10)",
       title = "hf_score vs pf_expression_control") +
  theme_minimal()

At a glance: The trend remains positive and fairly linear.

11.2 MP2 — Compare \(R^2\) values

Answer: The \(R^2\) for pf_score ~ pf_expression_control is 0.634, while for hf_score ~ pf_expression_control it is 0.578. Thus, pf_expression_control explains a larger share of variation in pf_score; differences are modest, suggesting other factors also matter.

11.3 MP3 — Most surprising relationship & diagnostics

As an additional example, consider predicting pf_score from year (not causally meaningful but illustrative).

m3 <- lm(pf_score ~ year, data = hfi %>% filter(!is.na(pf_score), !is.na(year)))
summary(m3)

## 
## Call:
## lm(formula = pf_score ~ year, data = hfi %>% filter(!is.na(pf_score), 
##     !is.na(year)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8316 -0.9688 -0.0321  1.2253  2.4239 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 112.32687   28.89108   3.888 0.000106 ***
## year         -0.05225    0.01436  -3.639 0.000284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.369 on 1376 degrees of freedom
## Multiple R-squared:  0.00953,    Adjusted R-squared:  0.008811 
## F-statistic: 13.24 on 1 and 1376 DF,  p-value: 0.0002842

# Diagnostics
augment(m3) %>% ggplot(aes(.fitted, .resid)) +
  geom_point(alpha = 0.5) + geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (pf_score ~ year)") + theme_minimal()

augment(m3) %>% ggplot(aes(sample = .resid)) +
  stat_qq() + stat_qq_line() + labs(title = "Q–Q (pf_score ~ year)") + theme_minimal()

Answer: The year effect is small; diagnostics suggest linearity is weak and \(R^2\) is low, reinforcing that institutional predictors like expression control are more informative.