Descriptive Statistics and Visualisation

# Part 4 Descriptive Statistics & Visualisation

setwd("C:/Users/hanse/Downloads/Applied Analytics/A2")

# 1) Parameters
FILE_NAME  <- "API_SP.POP.TOTL_DS2_en_csv_v2_1234617.csv"
COUNTRY_A  <- "France"
COUNTRY_B  <- "United Kingdom"
YEAR_START <- 2000
YEAR_END   <- 2024

set.seed(123)
theme_set(theme_minimal(base_size = 12))
# skip 4, there are 4 rows are,

# Data Source   World Development Indicators        
#           
# Last Updated Date 7/1/2025        
#           
# Country Name  Country Code    Indicator Name  Indicator Code  1960 ...

# 2) Read & tidy (World Bank dataset format)
raw <- readr::read_csv(FILE_NAME, skip = 4, show_col_types = FALSE) %>%  
  dplyr::select(!dplyr::starts_with("Unnamed")) %>%
  dplyr::filter(`Indicator Code` == "SP.POP.TOTL")

long <- raw %>%
  tidyr::pivot_longer(
    cols = tidyselect::matches("^[0-9]{4}$"),
    names_to = "Year", values_to = "Pop"
  ) %>%
  dplyr::mutate(
    Country = `Country Name`,
    Year    = as.integer(Year),
    Pop     = suppressWarnings(as.numeric(Pop))
  ) %>%
  dplyr::select(Country, Year, Pop) %>%
  dplyr::filter(
    dplyr::between(Year, YEAR_START, YEAR_END),
    Country %in% c(COUNTRY_A, COUNTRY_B)
  )

# 3) Data issues & cleaning
# Missing values: World Bank dataset is complete for these countries/years.
#   If missing occurs, we REPORT it and DROP missing rows.
# Outliers: we do FLAG unusual year-to-year growth using the IQR rule on log(Pop),
#   but we decided to DO NOT remove outliers for descriptives ( for reporting focus).

missing_tbl <- long %>%
  dplyr::group_by(Country) %>%
  dplyr::summarise(
    years_expected = YEAR_END - YEAR_START + 1,
    n_missing      = sum(is.na(Pop)),
    years_missing  = paste(Year[is.na(Pop)], collapse = ", "),
    .groups = "drop"
  )
cat("\n=== Missingness summary ===\n")

## 
## === Missingness summary ===

print(knitr::kable(missing_tbl, caption = "Missing population values by country"))

## 
## 
## Table: Missing population values by country
## 
## |Country        | years_expected| n_missing|years_missing |
## |:--------------|--------------:|---------:|:-------------|
## |France         |             25|         0|              |
## |United Kingdom |             25|         0|              |

# Drop rows with missing Pop
study <- dplyr::filter(long, !is.na(Pop))

# 4) Derived features
study_feats <- study %>%
  dplyr::group_by(Country) %>%
  dplyr::arrange(Year, .by_group = TRUE) %>%
  dplyr::mutate(
    logPop  = log(Pop),
    dlogPop = log(Pop) - dplyr::lag(log(Pop)),          # log ≈ growth (log points)
    yoy_pct = 100 * (Pop / dplyr::lag(Pop) - 1)         # Year-over-year % change
  ) %>%
  dplyr::ungroup()

# Flag potential outliers in log using the IQR rule (report only, do not remove)
growth_outliers <- study_feats %>%
  dplyr::filter(!is.na(dlogPop)) %>%
  dplyr::group_by(Country) %>%
  dplyr::mutate(
    Q1  = quantile(dlogPop, 0.25, na.rm = TRUE),
    Q3  = quantile(dlogPop, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    is_outlier = (dlogPop < Q1 - 1.5*IQR) | (dlogPop > Q3 + 1.5*IQR)
  ) %>%
  dplyr::filter(is_outlier) %>%
  dplyr::select(Country, Year, Pop, dlogPop, yoy_pct)
cat("\n=== Potential growth outliers (IQR on log) — reporting only ===\n")

## 
## === Potential growth outliers (IQR on log) — reporting only ===

print(knitr::kable(growth_outliers, digits = 4,
                   caption = "Flagged unusual year-to-year growth (without removed)"))

## 
## 
## Table: Flagged unusual year-to-year growth (without removed)
## 
## |Country        | Year|      Pop| dlogPop| yoy_pct|
## |:--------------|----:|--------:|-------:|-------:|
## |United Kingdom | 2023| 68492000|   0.013|  1.3135|

# 5) Descriptive statistics
# Helper for  CAGR between first & last year
cagr_fun <- function(x, y1, y2) {
  if (length(x) < 2 || any(is.na(c(x[1], x[length(x)])))) return(NA_real_)
  n <- y2 - y1
  if (n <= 0) return(NA_real_)
  (x[length(x)] / x[1])^(1/n) - 1
}

# 5a) Population levels summary
table_levels <- study_feats %>%
  dplyr::group_by(Country) %>%
  dplyr::summarise(
    Min        = min(Pop, na.rm = TRUE),
    Q1         = quantile(Pop, 0.25, na.rm = TRUE),
    Median     = median(Pop, na.rm = TRUE),
    Q3         = quantile(Pop, 0.75, na.rm = TRUE),
    Max        = max(Pop, na.rm = TRUE),
    Mean       = mean(Pop, na.rm = TRUE),
    SD         = sd(Pop, na.rm = TRUE),
    n          = dplyr::n(),
    first_year = min(Year), last_year = max(Year),
    first_pop  = Pop[Year == min(Year)][1],
    last_pop   = Pop[Year == max(Year)][1],
    CAGR_pct   = 100 * cagr_fun(Pop, min(Year), max(Year)),
    .groups = "drop"
  )

cat("\n=== Population levels (persons) ===\n")

## 
## === Population levels (persons) ===

print(knitr::kable(table_levels, digits = 3,
                   caption = "Population level summary with quartiles & CAGR (%)"))

## 
## 
## Table: Population level summary with quartiles & CAGR (%)
## 
## |Country        |      Min|       Q1|   Median|       Q3|      Max|     Mean|      SD|  n| first_year| last_year| first_pop| last_pop| CAGR_pct|
## |:--------------|--------:|--------:|--------:|--------:|--------:|--------:|-------:|--:|----------:|---------:|---------:|--------:|--------:|
## |France         | 60918661| 63622342| 65657659| 67158348| 68516699| 65293315| 2304162| 25|       2000|      2024|  60918661| 68516699|    0.491|
## |United Kingdom | 58892514| 60846820| 63711000| 66289000| 69226000| 63631923| 3162915| 25|       2000|      2024|  58892514| 69226000|    0.676|

# 5b) Year-over-year % growth summary
table_growth <- study_feats %>%
  dplyr::group_by(Country) %>%
  dplyr::summarise(
    Min     = min(yoy_pct, na.rm = TRUE),
    Q1      = quantile(yoy_pct, 0.25, na.rm = TRUE),
    Median  = median(yoy_pct, na.rm = TRUE),
    Q3      = quantile(yoy_pct, 0.75, na.rm = TRUE),
    Max     = max(yoy_pct, na.rm = TRUE),
    Mean    = mean(yoy_pct, na.rm = TRUE),
    SD      = sd(yoy_pct, na.rm = TRUE),
    n       = sum(!is.na(yoy_pct)),
    Missing = sum(is.na(yoy_pct)),
    .groups = "drop"
  )

cat("\n=== Year-over-year growth (%) ===\n")

## 
## === Year-over-year growth (%) ===

print(knitr::kable(table_growth, digits = 3,
                   caption = "YoY population growth (%) summary with quartiles"))

## 
## 
## Table: YoY population growth (%) summary with quartiles
## 
## |Country        |   Min|    Q1| Median|    Q3|   Max|  Mean|    SD|  n| Missing|
## |:--------------|-----:|-----:|------:|-----:|-----:|-----:|-----:|--:|-------:|
## |France         | 0.264| 0.335|  0.484| 0.640| 0.755| 0.491| 0.166| 24|       1|
## |United Kingdom | 0.170| 0.509|  0.720| 0.785| 1.314| 0.676| 0.242| 24|       1|

# 6) Visualisations ( the plots)
# 6a) Population over time (levels)
p1 <- ggplot(
  dplyr::filter(study_feats, Country %in% c(COUNTRY_A, COUNTRY_B)),
  aes(x = Year, y = Pop, color = Country, group = Country)
) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "Population over time (levels): United Kingdom vs France",
       y = "Population (persons)", x = "Year")
print(p1)

# 6b) Population over time (log scale, easier to compare growth rates)
p2 <- ggplot(
  dplyr::filter(study_feats, Country %in% c(COUNTRY_A, COUNTRY_B)),
  aes(x = Year, y = Pop, color = Country, group = Country)
) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_log10() +
  labs(title = "Population over time (log scale): United Kingdom vs France",
       y = "Population (log10 scale)", x = "Year")
print(p2)

# 6c) YoY % growth over time (drop initial NA growth)
p3 <- ggplot(
  dplyr::filter(study_feats, Country %in% c(COUNTRY_A, COUNTRY_B), !is.na(yoy_pct)),
  aes(x = Year, y = yoy_pct, color = Country, group = Country)
) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "Year-over-year population growth (%): United Kingdom vs France",
       y = "YoY growth (%)", x = "Year")
print(p3)

# 6d) Distribution of YoY % growth (boxplot)
p4 <- ggplot(
  dplyr::filter(study_feats, Country %in% c(COUNTRY_A, COUNTRY_B), !is.na(yoy_pct)),
  aes(x = Country, y = yoy_pct, fill = Country)
) +
  geom_boxplot(width = 0.6, outlier.alpha = 0.7) +
  labs(title = "Distribution of annual population growth (%)",
       y = "YoY growth (%)", x = NULL)
print(p4)

# 6e) Normality check: QQ plot of log population (one panel per country)
p5 <- ggplot(
  dplyr::filter(study_feats, Country %in% c(COUNTRY_A, COUNTRY_B)),
  aes(sample = logPop, color = Country)
) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ Country, nrow = 1, scales = "free") +
  labs(title = "QQ plot of log(Population): normality check by country",
       x = "Theoretical Quantiles", y = "Sample Quantiles")
print(p5)

# 6f) QQ plot of log growth ( remove that with NA)
p6 <- ggplot(
  dplyr::filter(study_feats, Country %in% c(COUNTRY_A, COUNTRY_B), !is.na(dlogPop)),
  aes(sample = dlogPop, color = Country)
) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ Country, nrow = 1, scales = "free") +
  labs(title = "QQ plot of log population (annual growth, log points)",
       x = "Theoretical Quantiles", y = "Sample Quantiles")
print(p6)

# Variables: Country (factor), Year (integer), Pop (numeric persons),
#   derived: logPop (numeric), dlogPop (numeric log points), yoy_pct (numeric %).
# Missing data: we listed per country and dropped (no imputation) to keep analyses clean.
# Outliers: flagged using IQR rule on annual log,
# Visualisation: UK & France show steady growth with similar log-scale trajectories,
#   YoY growth is low and stable, we can see QQ plots of log(Pop) to look closer to linear than raw scale,supporting log transformation for later t-tests.

We use the World Bank total population series (SP.POP.TOTL) for France and the United Kingdom, covering 2000 to 2024, 25 years for each country.

Our working fields are Country, Year, and Pop (persons). We also create:

logPop = log(Pop) to compare proportional changes
dlogPop = annual change in log population (approximate growth in log points)
yoy_pct = year-over-year percent growth

There are no missing values for either country over this window. For potential anomalies, we flag unusual year-to-year moves using an IQR rule on Delta log(Pop). One spike stands ou is UK 2023 with YoY about 1.31 percent. We report it but DO NOT remove it.

Descriptive statistics (levels)

France: min 60.9M, median 65.7M, max 68.5M, mean 65.29M, SD 2.30M, CAGR about 0.49 percent per year.

United Kingdom: min 58.9M, median 63.7M, max 69.2M, mean 63.63M, SD 3.16M, CAGR about 0.68 percent per year.

From the first to the last year, France moves 60.9M to 68.5M, and the UK moves 58.9M to 69.2M. Within our window, the UK ends slightly higher than France by 2024.

what is CAGR (Compound Annual Growth Rate) and purpose we use it?

\[ \text{CAGR} = (E/B)^{1/n} - 1 \ \] CAGR computes the single yearly growth rate that would turn a beginning value into an ending value over n years if growth compounded smoothly. it summarizes a whole time-path into one clean “average per year” growth number.

Descriptive statistics (growth rates)

France YoY percent: median 0.48, mean 0.49, IQR about 0.34 to 0.64, SD 0.17.

UK YoY percent: median 0.72, mean 0.68, IQR about 0.51 to 0.79, SD 0.24, with a high growth year in 2023 about 1.31.

Visualisations

Population over time, levels and log: both lines rise steadily. On the log scale, the paths are close to straight lines, which suggests roughly constant proportional growth. The UK line is a bit steeper, consistent with a higher CAGR.

YoY percent over time: growth is low and fairly stable for both countries, mostly below 1 percent. France eases down after the mid 2000s. The UK climbs through the mid 2000s, softens around 2016 to 2019, dips in 2020, then rebounds sharply in 2022 to 2023, the flagged outlier.

YoY percent distribution, boxplot: the UK median and spread sit above France, matching the tables.

QQ plots, normality checks: For log(Pop), points hug the reference lines, so using logs looks sensible for modeling means and trends.

For delta log(Pop), the points are fairly straight with mild tails, fine for descriptive comparisons of annual changes.

Overall

Across 2000 to 2024, both countries grow, but the UK grows a bit faster on average, about 0.68 percent versus 0.49 percent per year. By 2024, the UK is slightly larger than France within our sample window. Year to year movements are small and stable, aside from the UK jump in 2023. Working on the log scale gives us nicer distributional shape and a clearer comparison of proportional changes. Because these are time series, simple correlations in levels can be driven by common trends. That is why in the next parts explanations ( part 7 Regression Analysis) we try to implement a comparison of intercepts and trends explicitly and use time series aware inference, HAC and GLS with AR(1), to keep the conclusions robust.

Hypothesis Testing

# Part 5 Hypothesis Testing & Confidence Interval
# welch two-sample t-test: United Kingdom (country A) vs France ( country B)
# scale log(Pop)

# Preconditions & Hypotheses
# To make sure the key objects created in Part 4 are already in session before Part 5/7 runs. If any are missing, it stops early with a clear message instead of failing later with an error.(Aligning on the same years avoids biased results from unequal time coverage and lets us compute clean differences, t-tests, and regressions on matched samples)

required_objs <- c("study_feats","COUNTRY_A","COUNTRY_B","YEAR_START","YEAR_END")
missing <- required_objs[!vapply(required_objs, exists, logical(1))]
if (length(missing)) stop("Missing objects from Part 4: ", paste(missing, collapse=", "))

# Hypotheses:
# H0: μ_A = μ_B   (same mean log population over YEAR_START to YEAR_END)
# H1: μ_A ≠ μ_B


# Build aligned samples (years where BOTH have data)
panel <- study_feats %>%
  filter(Country %in% c(COUNTRY_A, COUNTRY_B)) %>%
  select(Country, Year, Pop, logPop) %>%
  pivot_wider(names_from = Country, values_from = c(Pop, logPop)) %>%
  filter(!is.na(.data[[paste0("Pop_", COUNTRY_A)]]),
         !is.na(.data[[paste0("Pop_", COUNTRY_B)]])) %>%
  arrange(Year)

A_pop <- panel[[paste0("Pop_",    COUNTRY_A)]]
B_pop <- panel[[paste0("Pop_",    COUNTRY_B)]]
A_log <- panel[[paste0("logPop_", COUNTRY_A)]]
B_log <- panel[[paste0("logPop_", COUNTRY_B)]]

cat("\nYears used (both countries):", nrow(panel), "\n")

## 
## Years used (both countries): 25

# Assumption checks (log scale)
sw_A <- shapiro.test(A_log)
sw_B <- shapiro.test(B_log)
assump_tbl <- tibble(
  Country   = c(COUNTRY_A, COUNTRY_B),
  Shapiro_W = c(unname(sw_A$statistic), unname(sw_B$statistic)),
  Shapiro_p = c(sw_A$p.value, sw_B$p.value),
  SD_log    = c(sd(A_log), sd(B_log))
)
cat("\nAssumptions (log scale):\n")

## 
## Assumptions (log scale):

print(kable(assump_tbl %>% mutate(across(where(is.numeric), ~signif(., 4))),
            caption = "Shapiro–Wilk normality & SD (log scale)"))

## 
## 
## Table: Shapiro–Wilk normality & SD (log scale)
## 
## |Country        | Shapiro_W| Shapiro_p|  SD_log|
## |:--------------|---------:|---------:|-------:|
## |France         |    0.9420|    0.1649| 0.03556|
## |United Kingdom |    0.9499|    0.2495| 0.04974|

# Welch two-sample t-test (log scale)

tt <- t.test(A_log, B_log, var.equal = FALSE)
cat("\nWelch two-sample t-test on log(Pop):\n"); print(tt)

## 
## Welch two-sample t-test on log(Pop):

## 
##  Welch Two Sample t-test
## 
## data:  A_log and B_log
## t = 2.1552, df = 43.45, p-value = 0.03672
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00170186 0.05101329
## sample estimates:
## mean of x mean of y 
##  17.99380  17.96744

# Difference & ratio (A/B) with 95% CI
diff_log <- mean(A_log) - mean(B_log)
diff_CI  <- unname(tt$conf.int)
ratio_pt <- exp(diff_log)
ratio_CI <- exp(diff_CI)

summary_tbl <- tibble(
  Contrast      = paste(COUNTRY_A, "−", COUNTRY_B),
  MeanDiff_log  = diff_log,
  CI_Lower_log  = diff_CI[1],
  CI_Upper_log  = diff_CI[2],
  t_stat        = unname(tt$statistic),
  df_welch      = unname(tt$parameter),
  p_value       = tt$p.value,
  Ratio_point   = ratio_pt,
  Ratio_Lower   = ratio_CI[1],
  Ratio_Upper   = ratio_CI[2]
)
cat("\n=== Welch result summary (log difference & ratio A/B) ===\n")

## 
## === Welch result summary (log difference & ratio A/B) ===

print(kable(summary_tbl %>% mutate(across(where(is.numeric), ~signif(., 4))),
            align = paste0("l", strrep("r", ncol(summary_tbl)-1)),
            caption = "Two-sample Welch t-test: mean diff (log) and ratio (A/B) with 95% CI"))

## 
## 
## Table: Two-sample Welch t-test: mean diff (log) and ratio (A/B) with 95% CI
## 
## |Contrast                | MeanDiff_log| CI_Lower_log| CI_Upper_log| t_stat| df_welch| p_value| Ratio_point| Ratio_Lower| Ratio_Upper|
## |:-----------------------|------------:|------------:|------------:|------:|--------:|-------:|-----------:|-----------:|-----------:|
## |France − United Kingdom |      0.02636|     0.001702|      0.05101|  2.155|    43.45| 0.03672|       1.027|       1.002|       1.052|

# DECISION & SUMMARY
alpha <- 0.05
pval  <- tt$p.value
diff  <- diff_log
ci    <- diff_CI
ratio <- exp(diff)

decision <- if (pval < alpha) "REJECT H0" else "FAIL TO REJECT H0"
direction <- if (diff > 0) {
  paste(COUNTRY_A, "has the larger mean (on average) on the log scale.")
} else if (diff < 0) {
  paste(COUNTRY_B, "has the larger mean (on average) on the log scale.")
} else {
  "The sample means are equal."
}

cat("\n========= DECISION (Two-sided Welch t-test) ===========\n")

## 
## ========= DECISION (Two-sided Welch t-test) ===========

cat("H0: μ_", COUNTRY_A, " = μ_", COUNTRY_B, "   vs   H1: μ_", COUNTRY_A, " ≠ μ_", COUNTRY_B, "\n", sep = "")

## H0: μ_France = μ_United Kingdom   vs   H1: μ_France ≠ μ_United Kingdom

cat("α = ", alpha, " | t = ", round(unname(tt$statistic), 3),
    " | df = ", round(unname(tt$parameter), 1),
    " | p = ", signif(pval, 4), "\n", sep = "")

## α = 0.05 | t = 2.155 | df = 43.5 | p = 0.03672

cat("95% CI for (μ_A − μ_B) on log scale: [",
    round(ci[1], 4), ", ", round(ci[2], 4), "]\n", sep = "")

## 95% CI for (μ_A − μ_B) on log scale: [0.0017, 0.051]

cat("Interpretation on original scale (A/B ratio): point ≈ ",
    round(ratio, 3), ", 95% CI ≈ [", round(exp(ci[1]), 3), ", ",
    round(exp(ci[2]), 3), "]\n", sep = "")

## Interpretation on original scale (A/B ratio): point ≈ 1.027, 95% CI ≈ [1.002, 1.052]

cat("Decision: ", decision, ". ", direction, "\n", sep = "")

## Decision: REJECT H0. France has the larger mean (on average) on the log scale.

cat("===================================================\n")

## ===================================================

# Group mean CIs (log scale)
mean_ci <- function(x, conf = 0.95) {
  x <- x[is.finite(x)]; n <- length(x); se <- sd(x)/sqrt(n)
  tcrit <- qt(1 - (1-conf)/2, df = n - 1)
  c(lower = mean(x) - tcrit*se, upper = mean(x) + tcrit*se)
}
ci_A <- mean_ci(A_log, 0.95); ci_B <- mean_ci(B_log, 0.95)
group_tbl <- tibble(
  Country   = c(COUNTRY_A, COUNTRY_B),
  n         = c(length(A_log), length(B_log)),
  Mean_log  = c(mean(A_log), mean(B_log)),
  SD_log    = c(sd(A_log), sd(B_log)),
  SE_log    = SD_log / sqrt(n),
  CI_Lower  = c(ci_A["lower"], ci_B["lower"]),
  CI_Upper  = c(ci_A["upper"])
)
cat("\n=== Group means with 95% CI (log scale) ===\n")

## 
## === Group means with 95% CI (log scale) ===

print(kable(group_tbl %>% mutate(across(where(is.numeric), ~signif(., 4))),
            align = paste0("l", strrep("r", ncol(group_tbl)-1)),
            caption = "Group means, SD/SE, and 95% CI on log scale"))

## 
## 
## Table: Group means, SD/SE, and 95% CI on log scale
## 
## |Country        |  n| Mean_log|  SD_log|   SE_log| CI_Lower| CI_Upper|
## |:--------------|--:|--------:|-------:|--------:|--------:|--------:|
## |France         | 25|    17.99| 0.03556| 0.007112|    17.98|    18.01|
## |United Kingdom | 25|    17.97| 0.04974| 0.009949|    17.95|    18.01|

# PLOTS 
# Shared data for plots
df_log <- dplyr::bind_rows(
  data.frame(Country = COUNTRY_A, value = A_log),
  data.frame(Country = COUNTRY_B, value = B_log)
)
means_tbl <- tibble(
  Country    = c(COUNTRY_A, COUNTRY_B),
  mean_self  = c(mean(A_log, na.rm = TRUE), mean(B_log, na.rm = TRUE)),
  mean_other = c(mean(B_log, na.rm = TRUE),  mean(A_log, na.rm = TRUE))
)
xlim_all <- range(c(A_log, B_log), na.rm = TRUE)
pad <- diff(xlim_all) * 0.06; if (!is.finite(pad) || pad == 0) pad <- 0.1
xlim_all <- xlim_all + c(-pad, pad)

# 1. Histogram of log(Pop) with both mean lines
p_hist <- ggplot(df_log, aes(x = value)) +
  geom_histogram(fill = "grey85", color = "grey40", bins = 12) +
  geom_vline(data = means_tbl, aes(xintercept = mean_self),  color = "blue", linewidth = 1) +
  geom_vline(data = means_tbl, aes(xintercept = mean_other), color = "red",  linewidth = 1) +
  facet_wrap(~ Country, nrow = 1, scales = "free_y") +
  coord_cartesian(xlim = xlim_all) +
  labs(
    title = "Histogram of log population — United Kingdom vs France",
    x = "log(Population)", y = "Count",
    subtitle = "Blue = country’s own mean, Red = other country’s mean (shared x-range)"
  ) +
  theme_minimal(base_size = 12)
print(p_hist)

# 2. Sampling distribution of the SAMPLE MEAN (single country)
#     (yellow line = mean of simulated sample mean, red line = true mean)
set.seed(12345)
TARGET_COUNTRY <- COUNTRY_A   # just change to COUNTRY_B if want France
x_pop <- study_feats %>%
  filter(Country == TARGET_COUNTRY) %>%
  arrange(Year) %>%
  pull(logPop)

stopifnot(length(x_pop) >= 10)
true_mean <- mean(x_pop, na.rm = TRUE)

Ns <- c(10, 50, 100, 250)   # sample mean sizes to show
B  <- 5000                    # number of simulated samples per n

# simulate sample means for each n 
smean_list <- vector("list", length(Ns))
names(smean_list) <- paste0("n = ", Ns)
for (k in seq_along(Ns)) {
  n_k <- Ns[k]
  m_k <- numeric(B)
  for (i in 1:B) {
    samp <- sample(x_pop, n_k, replace = TRUE)
    m_k[i] <- mean(samp)
  }
  smean_list[[k]] <- tibble(n = paste0("n = ", n_k), mean_samp = m_k)
}

smean_df <- dplyr::bind_rows(smean_list) %>%
  mutate(n = factor(n, levels = paste0("n = ", Ns)))

# per (n) yellow line & 95% percentile CI of the sampling distribution
smean_ci <- smean_df %>%
  group_by(n) %>%
  summarise(
    mean_of_means = mean(mean_samp),
    p2.5  = quantile(mean_samp, 0.025),
    p97.5 = quantile(mean_samp, 0.975),
    .groups = "drop"
  )

cat("\n=== Sampling distribution of sample mean (", TARGET_COUNTRY, ") ===\n", sep = "") # CLT

## 
## === Sampling distribution of sample mean (France) ===

print(kable(smean_ci %>% mutate(across(where(is.numeric), ~signif(., 4))),
            caption = paste0("Sampling means of log(Pop) for ", TARGET_COUNTRY, " by n")))

## 
## 
## Table: Sampling means of log(Pop) for France by n
## 
## |n       | mean_of_means|  p2.5| p97.5|
## |:-------|-------------:|-----:|-----:|
## |n = 10  |         17.99| 17.97| 18.01|
## |n = 50  |         17.99| 17.98| 18.00|
## |n = 100 |         17.99| 17.99| 18.00|
## |n = 250 |         17.99| 17.99| 18.00|

p_smean <- ggplot(smean_df, aes(x = mean_samp)) +
  geom_histogram(bins = 30, fill = "grey85", color = "grey40") +
  geom_density(linewidth = 0.9, alpha = 0.4) +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1) +
  geom_vline(data = smean_ci, aes(xintercept = mean_of_means), color = "yellow", linewidth = 1) +
  geom_vline(data = smean_ci, aes(xintercept = p2.5),  color = "black", linetype = 2) +
  geom_vline(data = smean_ci, aes(xintercept = p97.5), color = "black", linetype = 2) +
  facet_wrap(~ n, ncol = 2, scales = "free_y") +
  labs(
    title = paste0("Sampling distribution of the mean log(Pop): ", TARGET_COUNTRY),
    subtitle = "Red = true mean, Yellow = mean of simulated sample means, Black dashed = 95% percentile CI",
    x = "Sample mean of log(Population)", y = "Count / density"
  ) +
  theme_minimal(base_size = 12)
print(p_smean)

# QQ plot of log(Pop) by country (normality check)
p_qq <- ggplot(df_log, aes(sample = value)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ Country, nrow = 1, scales = "free") +
  labs(title = "QQ plot of log(Population) by country",
       x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_minimal(base_size = 12)
print(p_qq)

# Box plot of log(Pop) (with mean points)
p_box <- ggplot(df_log, aes(x = Country, y = value, fill = Country)) +
  geom_boxplot(width = 0.55, alpha = 0.7, outlier.alpha = 0.6, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 3, fill = "white", color = "black") +
  labs(title = "Box plot of log(Population): UK vs France",
       x = NULL, y = "log(Population)") +
  theme_minimal(base_size = 12)
print(p_box)

# Additional Create Fancy plots violin + boxplot + jitter (raincloud-ish) on log(Pop)
set.seed(42)
p_violin <- ggplot(df_log, aes(x = Country, y = value, fill = Country)) +
  geom_violin(trim = FALSE, alpha = 0.35, color = NA, show.legend = FALSE) +
  geom_boxplot(width = 0.12, outlier.shape = NA, alpha = 0.6, show.legend = FALSE) +
  geom_jitter(width = 0.07, alpha = 0.7, height = 0, size = 1.8, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3.2, fill = "white") +
  labs(title = "Distribution of log(Population): violin + box + jitter",
       x = NULL, y = "log(Population)",
       subtitle = "Mean shown as white diamond, combines distribution shape and summary") +
  theme_minimal(base_size = 12)
print(p_violin)

# Overall Summaries
group <- tibble(
  Country   = c(COUNTRY_A, COUNTRY_B),
  n         = as.character(c(length(A_log), length(B_log))),
  Mean_log  = c(mean(A_log), mean(B_log)),
  SD_log    = c(sd(A_log), sd(B_log))
) %>%
  mutate(
    SE_log   = SD_log / sqrt(as.numeric(n)),
    CI_lower = Mean_log - qt(0.975, df = as.numeric(n) - 1) * SE_log,
    CI_upper = Mean_log + qt(0.975, df = as.numeric(n) - 1) * SE_log
  )

diff <- tibble(
  Country   = paste(COUNTRY_A, "−", COUNTRY_B),
  n         = paste(length(A_log), "&", length(B_log)),
  Mean_log  = mean(A_log) - mean(B_log),
  SD_log    = NA_real_,
  SE_log    = sqrt(var(A_log)/length(A_log) + var(B_log)/length(B_log)),
  CI_lower  = unname(tt$conf.int[1]),
  CI_upper  = unname(tt$conf.int[2]),
  t_stat    = unname(tt$statistic),
  df        = unname(tt$parameter),
  p_val     = unname(tt$p.value),
  Ratio      = exp(Mean_log),
  Ratio_CI_L = exp(CI_lower),
  Ratio_CI_U = exp(CI_upper)
)

summary <- bind_rows(group, diff)

# Double check
cat("\n=== summary (groups + contrast) ==\n")

## 
## === summary (groups + contrast) ==

print(knitr::kable(
  summary %>% mutate(across(where(is.numeric), ~signif(., 4))),
  align = paste0("l", strrep("r", ncol(summary)-1)),
  caption = "summary: group means/CIs and Welch contrast (log scale)"
))

## 
## 
## Table: summary: group means/CIs and Welch contrast (log scale)
## 
## |Country                 |       n| Mean_log|  SD_log|   SE_log|  CI_lower| CI_upper| t_stat|    df|   p_val| Ratio| Ratio_CI_L| Ratio_CI_U|
## |:-----------------------|-------:|--------:|-------:|--------:|---------:|--------:|------:|-----:|-------:|-----:|----------:|----------:|
## |France                  |      25| 17.99000| 0.03556| 0.007112| 17.980000| 18.01000|     NA|    NA|      NA|    NA|         NA|         NA|
## |United Kingdom          |      25| 17.97000| 0.04974| 0.009949| 17.950000| 17.99000|     NA|    NA|      NA|    NA|         NA|         NA|
## |France − United Kingdom | 25 & 25|  0.02636|      NA| 0.012230|  0.001702|  0.05101|  2.155| 43.45| 0.03672| 1.027|      1.002|      1.052|

# Additional mini-summary for the sampling-mean block (the “yellow line” table) for comparison
cat("\n=== Sampling mean table by n for ", TARGET_COUNTRY, " ===\n", sep = "")

## 
## === Sampling mean table by n for France ===

print(knitr::kable(
  smean_ci %>% mutate(across(where(is.numeric), ~signif(., 4))),
  align = paste0("l", strrep("r", ncol(smean_ci)-1)),
  caption = paste0("Sampling distribution of the sample mean (log scale) — ", TARGET_COUNTRY)
))

## 
## 
## Table: Sampling distribution of the sample mean (log scale) — France
## 
## |n       | mean_of_means|  p2.5| p97.5|
## |:-------|-------------:|-----:|-----:|
## |n = 10  |         17.99| 17.97| 18.01|
## |n = 50  |         17.99| 17.98| 18.00|
## |n = 100 |         17.99| 17.99| 18.00|
## |n = 250 |         17.99| 17.99| 18.00|

Question?

Do France and the United Kingdom have the same mean log population in 2000–2024?

What we tested

H0: the mean log population is the same in France and the UK.

H1: the means are different.

Assumptions we checked

We lined up the two series by year so both countries use the same 25 years. On the log scale the distributions: Shapiro–Wilk p = 0.165 for France and 0.250 for the UK. The spreads are not quite the same (SD about 0.0356 vs 0.0497), so we used Welch’s two-sample t-test, which allows unequal variances.

Test and results

Test: Welch t-test on log(Pop).

Numbers: t = 2.155, df ≈ 43.5, p = 0.0367.

95% CI for μ_FR − μ_UK on the log scale: [0.0017, 0.0510].

Mean difference (log): 0.0264.

Interpretation

A log difference of 0.0264 means a ratio of exp(0.0264) ≈ 1.027. So, on average, France is about 2.7% larger than the UK over this period. The 95% CI for the ratio is [1.002, 1.052], which does not include 1.00, so the difference is statistically significant at the 5% level. It is small in size.

Decision

With p = 0.0367 we HAVE enough evidence to reject H0 at α = 0.05. Our data suggest France has a slightly higher mean log population than the UK in 2000–2024.

The reason why we work on the log scale - Logs make the distributions closer to normal and keep the spread more stable. - A difference in logs turns into a percentage ratio, which is easier to explain.

Sensitivity and robustness

We flagged one unusually high UK growth year in 2023. Because the effect we found is small, results can move a bit if we exclude years with flagged spikes. We expect the direction to stay the same, but the p-value could drift toward 0.05.

Regression Analysis

# Part 7 Regression analysis (time-series)
# methods use OLS with HAC (Newey–West) and GLS with AR(1) errors
# hypothesis tests, CIs, diagnostics, fitted-vs-actual plot

theme_set(theme_minimal(base_size = 12))
set.seed(123)

# Preconditions 
need <- c("study_feats","COUNTRY_A","COUNTRY_B","YEAR_START","YEAR_END")
miss <- need[!vapply(need, exists, logical(1))]
if (length(miss)) stop("Missing from Part 4: ", paste(miss, collapse=", "))

# Build aligned panel (common years for A & B) 
panel <- study_feats %>%
  filter(Country %in% c(COUNTRY_A, COUNTRY_B)) %>%
  select(Country, Year, logPop) %>%
  tidyr::pivot_wider(names_from = Country, values_from = logPop) %>%
  filter(!is.na(.data[[COUNTRY_A]]), !is.na(.data[[COUNTRY_B]])) %>%
  tidyr::pivot_longer(cols = c(all_of(COUNTRY_A), all_of(COUNTRY_B)),
                      names_to = "Country", values_to = "logPop") %>%
  arrange(Year, Country)

if (dplyr::n_distinct(panel$Country) < 2 || dplyr::n_distinct(panel$Year) < 3)
  stop("Need at least 3 common years per country for regression.")

cat("\nYears used: ", paste(sort(unique(panel$Year)), collapse = ", "), "\n", sep = "")

## 
## Years used: 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024

# Relationship visualisation &
# scatter of Country_B vs Country_A (both on log scale)
rel_wide <- panel %>%
  select(Year, Country, logPop) %>%
  tidyr::pivot_wider(names_from = Country, values_from = logPop) %>%
  arrange(Year)

# Quick correlation
r_rel <- suppressWarnings(cor(rel_wide[[COUNTRY_A]], rel_wide[[COUNTRY_B]], use = "complete.obs"))
cat("\n=== Scatter plot: correlation (log scale) ===\n",
    "Pearson r = ", round(r_rel, 4), "\n", sep = "")

## 
## === Scatter plot: correlation (log scale) ===
## Pearson r = 0.986

# Base R scatter
# ggplot
p_scatter <- ggplot(rel_wide,
                    aes(x = .data[[COUNTRY_A]], y = .data[[COUNTRY_B]])) +
  geom_point(alpha = 0.9) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1) +
  labs(title = paste("log(Pop):", COUNTRY_A, "vs", COUNTRY_B),
       subtitle = paste("Visual check of linearity, Pearson r ≈", round(r_rel, 3)),
       x = paste0("log(Pop) — ", COUNTRY_A),
       y = paste0("log(Pop) — ", COUNTRY_B)) +
  theme_minimal(base_size = 12)
print(p_scatter)

# Center time, set A as baseline 
panel <- panel %>%
  mutate(Year_c = Year - mean(Year),
         Country = factor(Country, levels = c(COUNTRY_A, COUNTRY_B)))

# OLS with country & trend and interaction 
# logPop_it = β0 + β1*1{B} + β2*Year_c + β3*(Year_c*1{B}) + e_it
m_ols <- lm(logPop ~ Country + Year_c + Country:Year_c, data = panel)

# 1. Overall model significance
# Classical F-test (assumes it homoskedastic, no autocorrelation)
cat("\n=== Overall model F-test (classical) ===\n")

## 
## === Overall model F-test (classical) ===

ols_sum <- summary(m_ols)
f_val   <- unname(ols_sum$fstatistic["value"])
df1     <- unname(ols_sum$fstatistic["numdf"])
df2     <- unname(ols_sum$fstatistic["dendf"])
p_overall <- pf(f_val, df1, df2, lower.tail = FALSE)
cat(sprintf("F(%d, %d) = %.3f, p = %.4g\n", df1, df2, f_val, p_overall))

## F(3, 46) = 1369.925, p = 5.655e-45

# HAC-robust Wald F (robust to heteroskedasticity & autocorrelation)
cat("\n=== Overall model test (HAC-robust Wald F test) ===\n")

## 
## === Overall model test (HAC-robust Wald F test) ===

Tn_tmp  <- length(unique(panel$Year))
hac_tmp <- sandwich::NeweyWest(m_ols, lag = max(1, floor(sqrt(Tn_tmp))), prewhite = FALSE, adjust = TRUE)
print(lmtest::waldtest(m_ols, vcov = hac_tmp, test = "F"))

## Wald test
## 
## Model 1: logPop ~ Country + Year_c + Country:Year_c
## Model 2: logPop ~ 1
##   Res.Df Df      F    Pr(>F)    
## 1     46                        
## 2     49 -3 784.11 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# 2. Linear vs quadratic trend
# Classical nested ANOVA (assumes it homoskedastic, no autocorrelation)
# anova(m_ols, m_quad)
# Compares linear vs linear & quadratic, tests by adding curvature improves fit.
panel_quad <- panel %>% mutate(Year_c2 = Year_c^2)
m_quad <- lm(
  logPop ~ Country + Year_c + I(Year_c^2) + Country:Year_c + Country:I(Year_c^2),
  data = panel_quad
)

cat("\n=== Linear vs Quadratic trend (classical ANOVA) ===\n")

## 
## === Linear vs Quadratic trend (classical ANOVA) ===

cmp <- anova(m_ols, m_quad)   # quadratic adds curvature
print(cmp)

## Analysis of Variance Table
## 
## Model 1: logPop ~ Country + Year_c + Country:Year_c
## Model 2: logPop ~ Country + Year_c + I(Year_c^2) + Country:Year_c + Country:I(Year_c^2)
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1     46 0.00108941                                   
## 2     44 0.00032395  2 0.00076546 51.984 2.584e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p_quad_classical <- cmp$`Pr(>F)`[2]

# HAC-robust curvature test (robust to heteroskedasticity & autocorrelation)
# jointly tests the quadratic terms = 0 using HAC covariance.
hac_quad <- sandwich::NeweyWest(m_quad, lag = max(1, floor(sqrt(Tn_tmp))), prewhite = FALSE, adjust = TRUE)
nm_qA <- "I(Year_c^2)"                               # curvature for baseline country
nm_qB <- paste0("Country", COUNTRY_B, ":I(Year_c^2)") # additional curvature for B vs A
cat("\n=== Curvature test (HAC-robust Wald F test) ===\n")

## 
## === Curvature test (HAC-robust Wald F test) ===

wald_curv <- car::linearHypothesis(m_quad, c(nm_qA, nm_qB), vcov = hac_quad, test = "F")
print(wald_curv)

## 
## Linear hypothesis test:
## I(Year_c^2) = 0
## CountryUnited Kingdom:I(Year_c^2) = 0
## 
## Model 1: restricted model
## Model 2: logPop ~ Country + Year_c + I(Year_c^2) + Country:Year_c + Country:I(Year_c^2)
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1     46                        
## 2     44  2 136.82 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat("\n == Trend type= \n")

## 
##  == Trend type=

if (!is.na(p_quad_classical) && p_quad_classical < 0.05 && wald_curv$`Pr(>F)`[2] < 0.05) {
  cat("Evidence of curvature: a quadratic time trend fits better than a straight line.\n")
} else if (!is.na(p_quad_classical) && p_quad_classical < 0.05) {
  cat("Some classical evidence of curvature, but HAC-robust test is not strong.\n")
} else if (wald_curv$`Pr(>F)`[2] < 0.05) {
  cat("HAC-robust test suggests curvature, consider keeping quadratic terms.\n")
} else {
  cat("There is NO strong evidence of curvature: a linear trend looks adequate for this window.\n")
}

## Evidence of curvature: a quadratic time trend fits better than a straight line.

#  OLS outputs (SEs & CIs) 
cat("\n===  OLS summary() (SEs)===\n")

## 
## ===  OLS summary() (SEs)===

print(summary(m_ols))

## 
## Call:
## lm(formula = logPop ~ Country + Year_c + Country:Year_c, data = panel)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0115253 -0.0035743  0.0008018  0.0035654  0.0065692 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                  17.9937964  0.0009733 18487.44  < 2e-16 ***
## CountryUnited Kingdom        -0.0263576  0.0013765   -19.15  < 2e-16 ***
## Year_c                        0.0047684  0.0001350    35.33  < 2e-16 ***
## CountryUnited Kingdom:Year_c  0.0019734  0.0001909    10.34  1.4e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004866 on 46 degrees of freedom
## Multiple R-squared:  0.9889, Adjusted R-squared:  0.9882 
## F-statistic:  1370 on 3 and 46 DF,  p-value: < 2.2e-16

cat("\n=== OLS confint() 95% CI (classical SEs) ===\n")

## 
## === OLS confint() 95% CI (classical SEs) ===

ols_ci_classical <- as.data.frame(confint(m_ols))
ols_ci_classical$term <- rownames(ols_ci_classical)
rownames(ols_ci_classical) <- NULL
names(ols_ci_classical)[1:2] <- c("2.5 %", "97.5 %")
print(knitr::kable(ols_ci_classical[, c("term","2.5 %","97.5 %")],
                   digits = 4, caption = "OLS 95% CIs (classical)"))

## 
## 
## Table: OLS 95% CIs (classical)
## 
## |term                         |   2.5 %|  97.5 %|
## |:----------------------------|-------:|-------:|
## |(Intercept)                  | 17.9918| 17.9958|
## |CountryUnited Kingdom        | -0.0291| -0.0236|
## |Year_c                       |  0.0045|  0.0050|
## |CountryUnited Kingdom:Year_c |  0.0016|  0.0024|

# HAC (Newey–West) robust SEs
Tn  <- length(unique(panel$Year))
hac <- sandwich::NeweyWest(m_ols, lag = max(1, floor(sqrt(Tn))), prewhite = FALSE, adjust = TRUE)

# tidy directly from coeftest()
ols_hac_table <- broom::tidy(lmtest::coeftest(m_ols, vcov. = hac))
cat("\n=== OLS with HAC (Newey–West) robust SEs ===\n")

## 
## === OLS with HAC (Newey–West) robust SEs ===

print(knitr::kable(ols_hac_table, digits = 4, caption = "OLS (HAC-robust) coefficients"))

## 
## 
## Table: OLS (HAC-robust) coefficients
## 
## |term                         | estimate| std.error| statistic| p.value|
## |:----------------------------|--------:|---------:|---------:|-------:|
## |(Intercept)                  |  17.9938|    0.0019| 9636.1889|       0|
## |CountryUnited Kingdom        |  -0.0264|    0.0019|  -13.9188|       0|
## |Year_c                       |   0.0048|    0.0003|   16.5607|       0|
## |CountryUnited Kingdom:Year_c |   0.0020|    0.0003|    6.1090|       0|

# HAC-robust Wald tests for named constraints 
nm_b   <- paste0("Country", COUNTRY_B)
nm_int <- paste0("Country", COUNTRY_B, ":Year_c")

wald_level <- car::linearHypothesis(m_ols, nm_b, vcov = hac, test = "Chisq")
wald_trend <- car::linearHypothesis(m_ols, nm_int, vcov = hac, test = "Chisq")
wald_joint <- car::linearHypothesis(m_ols, c(nm_b, nm_int), vcov = hac, test = "Chisq")

cat("\n=== Wald tests (HAC-robust) ===\n")

## 
## === Wald tests (HAC-robust) ===

cat("H0 (level): ", nm_b, " = 0  | p = ", signif(wald_level$`Pr(>Chisq)`[2], 4), "\n", sep = "")

## H0 (level): CountryUnited Kingdom = 0  | p = 4.87e-44

cat("H0 (trend): ", nm_int, " = 0 | p = ", signif(wald_trend$`Pr(>Chisq)`[2], 4), "\n", sep = "")

## H0 (trend): CountryUnited Kingdom:Year_c = 0 | p = 1.003e-09

cat("H0 (jointly): ", nm_b, " = 0 & ", nm_int, " = 0 | p = ",
    signif(wald_joint$`Pr(>Chisq)`[2], 4), "\n", sep = "")

## H0 (jointly): CountryUnited Kingdom = 0 & CountryUnited Kingdom:Year_c = 0 | p = 5.067e-43

#  95% CIs (HAC) for all terms 
zcrit   <- qnorm(0.975)
coefvec <- coef(m_ols)
se_hac  <- sqrt(diag(hac))
ci_hac <- tibble(
  term      = names(coefvec),
  estimate  = unname(coefvec),
  std.error = unname(se_hac),
  conf.low  = estimate - zcrit * std.error,
  conf.high = estimate + zcrit * std.error
)
cat("\n=== 95% HAC-robust CIs ===\n")

## 
## === 95% HAC-robust CIs ===

print(knitr::kable(ci_hac, digits = 4, caption = "HAC-robust 95% CIs"))

## 
## 
## Table: HAC-robust 95% CIs
## 
## |term                         | estimate| std.error| conf.low| conf.high|
## |:----------------------------|--------:|---------:|--------:|---------:|
## |(Intercept)                  |  17.9938|    0.0019|  17.9901|   17.9975|
## |CountryUnited Kingdom        |  -0.0264|    0.0019|  -0.0301|   -0.0226|
## |Year_c                       |   0.0048|    0.0003|   0.0042|    0.0053|
## |CountryUnited Kingdom:Year_c |   0.0020|    0.0003|   0.0013|    0.0026|

# GLS with AR(1) errors (per-country)
m_gls <- nlme::gls(
  logPop ~ Country + Year_c + Country:Year_c,
  correlation = nlme::corAR1(form = ~ Year | Country),
  method = "REML",
  control = nlme::glsControl(msMaxIter = 200, tolerance = 1e-7),
  data = panel
)

# GLS outputs and 95% intervals for coefficients
cat("\n=== GLS(AR1) summary() ===\n")

## 
## === GLS(AR1) summary() ===

print(summary(m_gls))

## Generalized least squares fit by REML
##   Model: logPop ~ Country + Year_c + Country:Year_c 
##   Data: panel 
##         AIC       BIC  logLik
##   -420.1361 -409.1642 216.068
## 
## Correlation Structure: AR(1)
##  Formula: ~Year | Country 
##  Parameter estimate(s):
##       Phi 
## 0.9999912 
## 
## Coefficients:
##                                  Value Std.Error  t-value p-value
## (Intercept)                  17.983820 0.4916152 36.58109  0.0000
## CountryUnited Kingdom        -0.011765 0.6952489 -0.01692  0.9866
## Year_c                        0.004897 0.0004204 11.64864  0.0000
## CountryUnited Kingdom:Year_c  0.001839 0.0005946  3.09217  0.0034
## 
##  Correlation: 
##                              (Intr) CntrUK Year_c
## CountryUnited Kingdom        -0.707              
## Year_c                        0.000  0.000       
## CountryUnited Kingdom:Year_c  0.000  0.000 -0.707
## 
## Standardized residuals:
##           Min            Q1           Med            Q3           Max 
## -2.586116e-02 -7.842552e-03  1.030132e-06  2.103868e-02  3.339098e-02 
## 
## Residual standard error: 0.4916411 
## Degrees of freedom: 50 total; 46 residual

cat("\n=== GLS(AR1) 95% intervals for coefficients ===\n")

## 
## === GLS(AR1) 95% intervals for coefficients ===

gls_ci_df <- tryCatch({
  gls_int <- nlme::intervals(m_gls, which = "coef", level = 0.95)$coef  # fixed effects only
  data.frame(
    term     = rownames(gls_int),
    `2.5 %`  = gls_int[, "lower"],
    estimate = gls_int[, "est."],
    `97.5 %` = gls_int[, "upper"],
    row.names = NULL
  )
}, error = function(e) {
  # fallback to Wald CIs from tTable
  tt <- summary(m_gls)$tTable
  zcrit <- qnorm(0.975)
  data.frame(
    term     = rownames(tt),
    `2.5 %`  = tt[, "Value"] - zcrit * tt[, "Std.Error"],
    estimate = tt[, "Value"],
    `97.5 %` = tt[, "Value"] + zcrit * tt[, "Std.Error"],
    row.names = NULL
  )
})
print(knitr::kable(gls_ci_df, digits = 4, caption = "GLS(AR1) 95% coefficient intervals"))

## 
## 
## Table: GLS(AR1) 95% coefficient intervals
## 
## |term                         |  X2.5..| estimate| X97.5..|
## |:----------------------------|-------:|--------:|-------:|
## |(Intercept)                  | 16.9943|  17.9838| 18.9734|
## |CountryUnited Kingdom        | -1.4112|  -0.0118|  1.3877|
## |Year_c                       |  0.0041|   0.0049|  0.0057|
## |CountryUnited Kingdom:Year_c |  0.0006|   0.0018|  0.0030|

# comparison table (SEs)
tidy_gls <- function(model) {
  tt <- summary(model)$tTable
  as_tibble(tt, rownames = "term") %>%
    rename(
      estimate  = Value,
      std.error = `Std.Error`,
      statistic = `t-value`,
      p.value   = `p-value`
    )
}
gls_tidy <- tidy_gls(m_gls) %>% mutate(model = "GLS AR(1)")
ols_tidy <- broom::tidy(m_ols) %>% mutate(model = "OLS (classical)")
comp_tab <- bind_rows(ols_tidy, gls_tidy)

cat("\n=== Coefficient comparison (OLS vs GLS-AR1) ===\n")

## 
## === Coefficient comparison (OLS vs GLS-AR1) ===

print(knitr::kable(comp_tab, digits = 4,
                   caption = "Compare point estimates (SEs differ, OLS HAC shown above)"))

## 
## 
## Table: Compare point estimates (SEs differ, OLS HAC shown above)
## 
## |term                         | estimate| std.error|  statistic| p.value|model           |
## |:----------------------------|--------:|---------:|----------:|-------:|:---------------|
## |(Intercept)                  |  17.9938|    0.0010| 18487.4361|  0.0000|OLS (classical) |
## |CountryUnited Kingdom        |  -0.0264|    0.0014|   -19.1489|  0.0000|OLS (classical) |
## |Year_c                       |   0.0048|    0.0001|    35.3288|  0.0000|OLS (classical) |
## |CountryUnited Kingdom:Year_c |   0.0020|    0.0002|    10.3385|  0.0000|OLS (classical) |
## |(Intercept)                  |  17.9838|    0.4916|    36.5811|  0.0000|GLS AR(1)       |
## |CountryUnited Kingdom        |  -0.0118|    0.6952|    -0.0169|  0.9866|GLS AR(1)       |
## |Year_c                       |   0.0049|    0.0004|    11.6486|  0.0000|GLS AR(1)       |
## |CountryUnited Kingdom:Year_c |   0.0018|    0.0006|     3.0922|  0.0034|GLS AR(1)       |

cat("\nAIC(OLS) = ", AIC(m_ols), "  |  AIC(GLS-AR1) = ", AIC(m_gls), "\n", sep = "")

## 
## AIC(OLS) = -384.8134  |  AIC(GLS-AR1) = -420.1361

# Helpers ( ratio scale) 
beta1 <- unname(coefvec[nm_b])
beta3 <- unname(coefvec[nm_int])
ratio_level_B_over_A_at_center <- if (length(beta1)) exp(beta1) else NA_real_
ratio_trend_step_per_year      <- if (length(beta3)) exp(beta3) else NA_real_

cat("\n=== Interpretation of Ratios ===\n")

## 
## === Interpretation of Ratios ===

cat("Level ratio (B/A) at Year_c = 0: ", round(ratio_level_B_over_A_at_center, 3), "\n", sep = "")

## Level ratio (B/A) at Year_c = 0: 0.974

cat("Annual multiplicative gap (B vs A): ", round(ratio_trend_step_per_year, 6),
    " ( exp(interaction), values near 1 indicate similar trends)\n", sep = "")

## Annual multiplicative gap (B vs A): 1.001975 ( exp(interaction), values near 1 indicate similar trends)

# Fitted vs actual plot 
panel$fit_ols <- fitted(m_ols)
p_fit <- ggplot(panel, aes(x = Year, y = logPop, color = Country, group = Country)) +
  geom_line(linewidth = 1, alpha = 0.7) +
  geom_point(size = 2, alpha = 0.8) +
  geom_line(aes(y = fit_ols), linetype = 2, linewidth = 1) +
  labs(title = paste0("log(Pop) over time: ", COUNTRY_A, " vs ", COUNTRY_B),
       subtitle = "Solid is observed, Dashed is OLS fitted with country intercept and trend",
       x = "Year", y = "log(Population)") +
  theme(legend.position = "bottom")
print(p_fit)

# Diagnostics (OLS residuals) 
resid_df <- broom::augment(m_ols, data = panel)

# Residuals vs fitted
p_resid1 <- ggplot(resid_df, aes(.fitted, .resid, color = Country)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = 0.8) +
  labs(title = "Residuals vs Fitted (OLS)", x = "Fitted", y = "Residuals") +
  theme(legend.position = "bottom")
print(p_resid1)

# QQ plot of standardized residuals
p_resid2 <- ggplot(resid_df, aes(sample = .std.resid, color = Country)) +
  stat_qq() + stat_qq_line() +
  labs(title = "QQ plot of standardized residuals (OLS)") +
  theme(legend.position = "bottom")
print(p_resid2)

# residual vs leverage with Cook's distance
diag_df <- resid_df %>%
  dplyr::transmute(
    Year      = Year,
    Country   = Country,
    leverage  = .hat,
    stdres    = .std.resid,
    cooks     = .cooksd
  )

n <- stats::nobs(m_ols)
p <- length(coef(m_ols))
h1 <- 2 * p / n         # leverage 
h2 <- 3 * p / n
cook_flag <- 4 / n      # Cook's D

p_lev <- ggplot(diag_df, aes(x = leverage, y = stdres, color = Country)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_vline(xintercept = h1, linetype = 3) +
  geom_vline(xintercept = h2, linetype = 3) +
  geom_point(aes(size = cooks), alpha = 0.85) +
  scale_size_continuous(name = "Cook's D", range = c(1.5, 6)) +
  labs(
    title = "Residuals vs Leverage (OLS)",
    subtitle = paste0("Vertical guides at 2p/n and 3p/n, labels for Cook's D > ", round(cook_flag, 3)),
    x = "Leverage (hat values)",
    y = "Standardized residuals"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

if (requireNamespace("ggrepel", quietly = TRUE)) {
  lab_df <- dplyr::filter(diag_df, cooks > cook_flag)
  p_lev <- p_lev +
    ggrepel::geom_text_repel(
      data = lab_df,
      aes(label = paste0(Country, " ", Year)),
      size = 3, show.legend = FALSE, max.overlaps = 20
    )
}
print(p_lev)

# ACF of residuals (ggplot) 
make_acf_df <- function(x, lag_max = 20) {
  x <- x[is.finite(x)]
  if (length(x) < 3) return(tibble(lag = integer(), acf = numeric()))
  ac <- stats::acf(x, plot = FALSE, lag.max = lag_max, na.action = na.pass)
  tibble(lag = as.integer(ac$lag)[-1], acf = as.numeric(ac$acf)[-1])
}

resA <- resid_df$.resid[resid_df$Country == COUNTRY_A]
resB <- resid_df$.resid[resid_df$Country == COUNTRY_B]

lag_max <- max(1, min(20,
                      sum(is.finite(resA)) - 2,
                      sum(is.finite(resB)) - 2))

dfA <- make_acf_df(resA, lag_max) %>% mutate(Country = COUNTRY_A)
dfB <- make_acf_df(resB, lag_max) %>% mutate(Country = COUNTRY_B)
acf_df <- bind_rows(dfA, dfB)

ci_df <- tibble(
  Country = c(COUNTRY_A, COUNTRY_B),
  ci_pos  = qnorm(0.975) / sqrt(c(sum(is.finite(resA)), sum(is.finite(resB)))),
  ci_neg  = -qnorm(0.975) / sqrt(c(sum(is.finite(resA)), sum(is.finite(resB))))
)

p_acf <- ggplot(acf_df, aes(x = lag, y = acf)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(data = ci_df, aes(yintercept = ci_pos), linetype = 3) +
  geom_hline(data = ci_df, aes(yintercept = ci_neg), linetype = 3) +
  geom_col(width = 0.8) +
  facet_wrap(~ Country, nrow = 1) +
  labs(title = "ACF of OLS residuals",
       x = "Lag", y = "Autocorrelation") +
  theme_minimal(base_size = 12)

if (requireNamespace("plotly", quietly = TRUE)) {
  print(plotly::ggplotly(p_acf))
} else {
  print(p_acf)
}

# hypothesis decisions (two-sided test)
alpha <- 0.05
p_level <- wald_level$`Pr(>Chisq)`[2]
p_trend <- wald_trend$`Pr(>Chisq)`[2]
p_joint <- wald_joint$`Pr(>Chisq)`[2]

cat("\n================ Decision summary (α = 0.05) ================\n")

## 
## ================ Decision summary (α = 0.05) ================

cat("H0 (levels equal at center year): ", if (p_level < alpha) "REJECT" else "FAIL TO REJECT",
    " | p = ", signif(p_level, 4), "\n", sep = "")

## H0 (levels equal at center year): REJECT | p = 4.87e-44

cat("H0 (trends equal):               ", if (p_trend < alpha) "REJECT" else "FAIL TO REJECT",
    " | p = ", signif(p_trend, 4), "\n", sep = "")

## H0 (trends equal):               REJECT | p = 1.003e-09

cat("H0 (jointly):  ", if (p_joint < alpha) "REJECT" else "FAIL TO REJECT",
    " | p = ", signif(p_joint, 4), "\n", sep = "")

## H0 (jointly):  REJECT | p = 5.067e-43

cat("Interpretation: levels compare via exp(beta1), annual trend gap through exp(beta3).\n")

## Interpretation: levels compare via exp(beta1), annual trend gap through exp(beta3).

cat("Use HAC-robust results for inference, GLS AR(1) is a robustness check.\n")

## Use HAC-robust results for inference, GLS AR(1) is a robustness check.

cat("=============================================================\n")

## =============================================================

Regression analysis (time-series)

Questions?

Do France and the United Kingdom differ in,
1. average level and/or of log population over 2000 to 2024?
2. time trend of log(Population) over 2000–2024?

Model

We fit a two-country panel on the log scale:
\[ log(Popit)=β0+β11{i=UK}+β2Yearc+β3(Yearc×1{i=UK})+εit\] where Year(c) is year centered at the sample mean (so the intercepts compare the countries at the middle year).

β1 = level gap (UK vs France) at the center year β2 = France’s linear trend β3 = extra trend for the UK (vs France)

We report it by using OLS with HAC (Newey–West) standard errors for inference and also a GLS AR(1) model as a robustness check

Hypothesis Levels ( at middle of year) H0: β1 = 0 and H1: β1 != 0 Trends H0: β3 = 0 and H1: β3 != 0 Joins H0: (β1, β3) = (0, 0) We also check whether a quadratic time trend is needed.
Assumptions & checks

Linearity: France vs UK scatter of log levels is very close to a line (Pearson r ≈ 0.986), fitted-vs-actual lines track.
Normality of residuals: QQ plot is roughly straight with mild tail departures.
Serial correlation: The ACF of OLS residuals shows clear positive autocorrelation at short lags for both countries, we use HAC SEs and fit GLS(AR1).
Functional form: A nested comparison strongly prefers adding a quadratic trend with ANOVA F≈ 51.98, p ≈ 2.6×10^ -12 and HAC-robust Wald F≈ 136.8, p< 2 × 10^ −16 we keep the linear model for comparability with earlier sections and then note the curvature as a sensitivity.
```
                                          What is HAC wild test?
```

it is the way to test linear hypotheses about regression coefficients, e.g. like “are the level and trend gaps zero?” formally like H0: Rβ = r, using the estimation of β and its co- variance.

HAC or (Heteroskedasticity- and Autocorrelation-Consistent) covariance (we use Newey–West). It keeps the OLS coefficients the same, but replaces the usual standard errors with ones that are valid even if the errors are:

uneven in variance (heteroskedastic), and/or serially correlated (autocorrelated).

A HAC Wald test computes the usual Wald statistic but plugs in a HAC covariance matrix. The test statistic is reported as a Chi-square or F, with a p-value.

                                           Why we use HAC Wald here?

These are time-series datassets for each country. Our residual ACFs clearly show serial correlation.Year-to-year variability can differ by country and over time as heteroskedasticity is plausible.If we used ordinary OLS standard errors, the p-values could be too optimistic. HAC fixes the standard errors so our tests on:

the level gap (Country effect), the trend gap (Country × Year), and the joint test (both zero) are reliable even with those time-series quirks.

Results in Linear model

Classical F- test: F(3, 46) ≈ 1369, p < 1e^10 -44, It explained 98.9% of the variation in log population (R2 = 0.9889, adjusted R2 = 0.9882) We test the null that all three added terms (country difference, time trend, and their interaction) are zero, as the data are just a flat mean (no structure). The very large F and p say that null is wildly incompatible with the data. At least one (and in practice, several) of those terms matters a lot. (“3” = three restrictions tested; “46” = residual degrees of freedom.)
HAC Wald F ≈ 784.1, p < 2x 1e^10 -16 computed with Newey–West (HAC) standard errors, which allow for heteroskedasticity and autocorrelation in the residuals. We still reject by a mile, so the conclusion is robust to time-series issues.

HAC-robust coefficients
1. the level gap is β^1 ≈ -0.0264(SE≈ 0.0019), p ≈ 0 Ratio UK/FR at center year: exp(-0.0264) ≈ 0.974 UK about 2.6% lower than France at the center year.

France trend is β^2 ≈ 0.00477(SE≈0.00029), p ≈ 0
UK trend is β^3 ≈ 0.00197(SE≈ 0.00032), p ≈ 0, and Annual multiplicative gap:exp(0.00197) ≈ 1.0020 ,the UK’s trend is about 0.20 percentage points faster per year than France on the log scale.

check HAC Wald tests

is the level equal? we have enough Reject (p ≈ 4.9x 10^-44 )
is the trend equal? we we have enough Reject (p ≈ 1.0x 10^-9 )
is it jointly equal? we have enough Reject (p ≈ 5.1x 10^-43 )

GLS AR(1) robustness

AR(1) 𝜙 ≈ 0.99999 (very persistent).
Signs match OLS: UK trend is steeper ( p ≈ 0.003), the level gap is imprecise in GLS because the near-unit-root raises standard errors.
Model comparison: AIC(GLS-AR1) = −420.1 vs AIC(OLS) = −384.8, the AR(1) specification fits better.
```
                                  What are OLS and GLS?
```

OLS (Ordinary Least Squares) The classic straight-line (or linear) fit. It picks coefficients that minimise the sum of squared residuals.

Why we use it? It’s the baseline model and gives clear interpretation of intercepts and trends. With HAC/Newey–West standard errors, we can keep the OLS coefficients but fix the standard errors when there’s autocorrelation/heteroskedasticity.

GLS (Generalized Least Squares)

it is like OLS, but it models the error covariance and reweights the data accordingly (e.g AR(1) errors for time series).

what it helps? If residuals are autocorrelated or heteroskedastic, GLS can be more efficient (smaller SEs, better inference) because it uses the error structure instead of ignoring it.

in our case, GLS with AR(1) errors per country. That says “each country’s residuals are correlated with last year’s residuals.”

The reason to use both due to We’re working with time series (2000–2024). Residuals can be serially correlated.

So we, Fit OLS for a clean, interpretable trend + country + interaction model, and report HAC-robust SEs to protect inference. Fit GLS(AR1) as a robustness check that explicitly models serial correlation. If OLS(HAC) and GLS(AR1) tell the same story about levels/trends, our conclusions are stable. If they diverge, we dig into misspecification (e.g curvature, different AR structure).

                                  What is AR ?

AR is autoregressive that says today’s value depends on its own past values. AR(1) model: Yt = 𝜙Yt-1 + et, where et is noise error, if |𝜙| < 1, the process is stationary. and the meaning of 𝜙 > 0: persistance, 𝜙 < 0 alternating up/down pattern.the AR(p), use the last p lags Yt = 𝜙1 Yt-1 + … + 𝜙p Yt-p + et

We fitted a GLS model with AR(1) errors, meaning the regression residuals follow, ut = p ut-1 + nt, this to correct for serial correlation in time-series residuals (a common violation of OLS assumptions), so maybe giving better standard errors and better forecasts.

Overall Summary Through 2000–2024, both countries grow smoothly on a log scale. At the mid-sample year, France’s average level is a bit higher (≈2–3%). The UK’s slope is steeper by about 0.2% (log points) per year, which is consistent with the descriptive CAGR gap we saw earlier. If we put together: the UK starts slightly below France in level at the center year but catches up via a faster trend, which is visible in the fitted lines.
What the plots show?

Fitted vs residuals: gentle arcs, linear trend is a simplification. QQ of standardized residuals: close to line (normality not a major issue). ACF bars above dotted bounds at early lags: serial correlation, so HAC/GLS justified. Signs match OLS, UK trend is steeper.

Overall Discussion

Descriptive Statistics & Visualisation (UK vs France)

We pick samples and analyze the total population data for France and the United Kingdom from 2000 to 2024, giving us 25 years of aligned data for each country. After cleaning the data, we create three variables: the natural log of population, the yearly change in the log of population, and the annual percent growth rate. The data covers the entire time period, so our summaries and charts use all 25 years of data for each country.

Looking at the level plots, both countries show steady increases. France has a larger population for most of the time, but the United Kingdom catches up towards the end. On the log scale, the two lines are almost parallel for most of the time, which suggests that both countries had similar growth rates for much of the period.

The averages back this up.The average log population is 17.9938 for France and 17.9674 for the United Kingdom. The average difference is 0.0264 log points, which translates to France being about 2.7 percent larger on average when converted back to the original population scale. This matches the findings from the Welch comparison in Part 5. There, we found a 95 percent confidence interval for the mean log gap between 0.0017 and 0.0510. On the original scale, this corresponds to a population ratio between 1.002 and 1.052, with the center at around 1.027.

Over time, growth trends show a clear pattern on the plots. In France, growth starts in the mid-0.6 percent range in the 2000s, slowly goes down to about 0.3 percent by the late 2010s, and remains close to that level after 2020. The United Kingdom sees growth rise during the mid-2000s, then drops between 2016 and 2020. However, in 2021 and 2022, there is a quick rise that brings yearly growth above one percent, before it settles again. This quick rise explains why the United Kingdom has a larger spread in the growth box plot.

The QQ plots for the log of population stay close to the reference lines for both countries. This supports our decision to use the log scale for summaries and future hypothesis tests. The optional QQ plots for annual log growth also behave well in the middle, with only slight deviations in the tails.

Hypothesis Testing

We start by making sure all the main items from Part 4 are in the session. If any are missing, a clear message stops the process right away. This simple check helps avoid problems later and makes sure the analysis can be repeated exactly the same way. Next, we create a panel that only includes the years where both countries have data. Focusing on the same years for both sets of data stops any bias from having different time coverage and gives us matching samples for comparing differences, confidence intervals, and the t-test. In our data, this gives us 25 paired years from 2000 to 2024.

Our test compares the average log population between the United Kingdom and France over these shared years. On the log scale, the null hypothesis says the two averages are the same, and the alternative says they are different. We use Welch’s two-sample t-test because it doesn’t assume the two groups have the same amount of variation, which is a safer choice when groups might have slightly different spread.

Before we run the test, we check if the data is normally distributed using Shapiro–Wilk tests and QQ plots. The QQ plots show the points mostly follow the reference line for both countries, with only slight deviations in the tails. This pattern suggests the data is approximately normal. Even if the tails aren’t perfect, Welch’s test still works well here, especially with 25 data points in each group. The box plots and violin plots show the same thing but in a different way. France’s data is a bit higher overall, and the central boxes overlap but are shifted, which already shows a small difference in the average. The white dots in the box plots and the white diamonds in the violin plots show the group averages and make that shift clear.

The histogram shows both countries on the same horizontal scale, with two vertical lines in each subplot. The blue line shows each country’s own average, and the red line shows the other country’s average. This side-by-side view makes it easy to see that the French data is centered slightly to the right of the UK data on the log scale, which again hints at the result we’ll find with the formal t-test.

Welch’s test on log(Pop) gives a t-value of 2.155, with about 43.45 degrees of freedom and a p-value of 0.0367. Since our alpha level is 0.05, we reject the idea that the average populations are the same. On the log scale, the 95% confidence interval for the difference in means ranges from 0.0017 to 0.0510. When we convert this difference back to the original scale, the ratio of the means is about 1.027, with a 95% confidence interval from roughly 1.002 to 1.052. In simple terms, the average French population is about 2.7% higher than the UK average, and the range of this uncertainty is from about 0.2% to 5.2%. Because the entire confidence interval is above 1, the result is not only statistically significant, but the direction of the difference is consistent.

The sampling distribution graphic shows why our statistical conclusions are trustworthy. For each country, we repeatedly sampled with replacement from its log population data and plotted the distribution of the sample means for different sample sizes. As the sample size increases from 10 to 50, then 100, and finally 250 the histograms of the sample means become narrower and more symmetrical. The yellow vertical lines, which represent the mean of the simulated sample means, align closely with the red vertical line, which shows the true mean. The dashed black lines show the 95% percentile band of the sampling distribution. This visual is a clear demonstration of the central limit theorem, which supports the t-intervals we reported earlier.

We also report group mean confidence intervals on the log scale. These intervals were calculated using standard t-critical values along with each sample’s standard error. The tables match the numbers in the test summary and offer a clear view of each group’s mean, spread, and level of uncertainty. These tables also align with the plots: France has a slightly higher mean but a similar spread, and the two confidence intervals only slightly overlap, which is consistent with a small but real difference between the groups.

Last, the assumptions and limitations deserve. We treated each year as an independent observation for the group comparison. In time series, residual correlation can make standard errors too optimistic. That is why our main between-country inference in Part 7 uses HAC-robust methods and GLS with AR(1) errors. For the simple mean comparison in Part 5, the effect is small because the outcome is very smooth and the two series are compared over identical years, but we should still read the p-value as approximate. The good news is that the direction and magnitude we see here agree with the regression-based results that account for autocorrelation, which gives us extra confidence.

Regression Analysis (Time- series)

our model and hypotheses.

We model log population as a function of time and country using a pooled time-series regression with a country intercept, a linear time trend, and a country-by-trend interaction: logPop = β0 + β1·1{UK} + β2·Year_c + β3·Year_c·1{UK} + error. The main questions are whether the countries differ in level at the center of the sample (β1 = 0), whether their trends differ (β3 = 0), and whether a simple straight line in time is adequate. Because annual population series are autocorrelated, we report ordinary least squares estimates along with Newey–West (HAC) standard errors for inference, and we re-estimate the model with generalized least squares that imposes an AR(1) error within each country as a robustness check.

What the data say before modeling? The France-versus-UK scatter plots on the log scale line up almost perfectly along a straight line, and the Pearson correlation is about 0.986. The fitted-vs-actual time plot shows that a country-specific line in time tracks the two series very closely, which justifies starting from a linear trend with an intercept shift.

overall fit and curvature.

The model fits extremely well. The classical overall F test gives F(3, 46) = 1369.9 with p < 10^−44, and the HAC-robust Wald version also rejects the model with no regressors (F ≈ 784, p < 2.2×10^−16). The residuals vs fitted plot, however, curves gently, and the formal nested test confirms that adding quadratic time terms improves the fit. The classical comparison of linear versus quadratic trends yields F = 51.98 with p = 2.58×10^−12, and the HAC-robust curvature test that jointly sets the quadratic terms to zero gives F = 136.82 with p < 2.2×10^−16. We therefore conclude that there is clear curvature from 2000 to 2024. The linear model is still useful for summarizing level and trend differences, but the quadratic extension captures the small deceleration-acceleration pattern that we see by eye.

For country differences: chi-square (Wald) tests interpretation. Using HAC-robust covariance, the Wald chi-square test for the country level effect rejects the null decisively (p = 4.87×10^−44). The test for equal trends across countries also rejects (p = 1.00×10^−9). Testing both together gives p = 5.07×10^−43. Interpreting coefficients on the original scale, exp(β1) ≈ 0.974, which means that at the sample mid-year the UK level is about 2.6 percent lower than the France level. The interaction gives exp(β3) ≈ 1.001975, which reads as a slightly steeper annual growth in the UK relative to France of about 0.20 percent per year on the log scale. These results line up with the fitted lines in the time plot, where the UK starts lower but edges up a little faster in recent years.

diagnostics and the use of HAC and GLS.

The residuals vs fitted cloud is centered and narrow but displays a shallow U shape. The residual QQ plot is close to the straight reference lines with mild tail departures, which is acceptable for large-sample HAC inference. The autocorrelation function of the OLS residuals shows strong positive lag-1 correlation in both countries that decays slowly, with several bars well outside the approximate confidence bands. That dependence violates the independence assumption behind classical OLS standard errors, so HAC standard errors are the right default for testing and interval estimation. For robustness, we also fit a GLS model with AR(1) errors inside each country. The estimated AR parameter is essentially one, which confirms strong persistence. Point estimates are very similar to OLS, and model selection favors GLS with AR(1) on information criteria (AIC around −420 for GLS versus −385 for OLS), which reassures us that the main conclusions are not an artifact of the OLS error assumptions.

Influence and leverage. The residuals-versus-leverage plot shows most years have modest leverage below roughly 0.16. Bubble sizes encode Cook’s D, and nearly all points lie below the 4/n guide, so there are no single years that dominate the fit. That pattern explains why the OLS and GLS point estimates agree closely and why the HAC and classical confidence intervals are narrow.Then, the coefficients mean is that the OLS estimates (with HAC uncertainty) indicate a clear country level gap and a small trend gap. France has the higher level for most of the window, while the UK trend is a touch steeper, which is consistent with the two lines converging near the end of the period in the time plot. When we exponentiate the coefficients, the level ratio is about 0.974 and the annual multiplicative trend gap is about 1.002.

strengths and limitations

A strength of this analysis is that we combine clear specification with robust inference. Centering time makes coefficients easy to read. HAC Newey West addresses autocorrelation without heavy modeling assumptions. GLS AR(1) then validates that the story does not depend on a single inference method. Diagnostics and plots are used throughout so we can see why each modeling choice is being made.

Limitations follow from the design. We have only two countries and a single outcome. Country effects collect everything that differs between the two series that is not time itself, so we cannot separate migration policy, fertility change, or measurement artifacts. The residual plots show curvature and autocorrelation that a richer dynamic model could soak up. Finally, the window ends in 2024. Any structural change after that date is outside our sample.

final Conclusion

France and the United Kingdom both show very similar, almost perfectly linear growth in log population over the period 2000 to 2024, but their paths are not exactly the same. On average, the UK population level is about 2.6 percent lower than France’s, and the UK’s trend is slightly faster, by roughly two tenths of a percent per year. These differences are statistically significant, as confirmed by chi squared HAC Wald tests, and they remain even when we re-estimate the model using GLS under AR(1) errors. It is important to note that France commences higher, and the United Kingdom grows at a little faster; both results are robust when autocorrelation is taken into account along with mild deviation from Normality.

Assignment 2 Statistical Data Analyses

Population Analytics: France vs United Kingdom

RPubs link information

Problem Statement

Data

Data Cont.

Descriptive Statistics and Visualisation

Hypothesis Testing

Regression Analysis

Overall Discussion

the use of AI Generation

References