1. Overview

Further to Topic Two, this report extends the analysis of youth employment, focusing on young people aged 15 to 24 in New Zealand, who are vulnerable to labour market disruptions because of their limited work experience and reliance on entry-level industries such as retail and hospitality (Huang, 2021). The analysis uses the Household Labour Force Survey (HLFS) dataset from Statistics New Zealand (Stats NZ, 2025), which covers quarterly data from September 2018 to March 2025.

The EDA in Topic Two revealed that the 15-19 age group consistently recorded a higher mean unemployment rate (12.9%) than the 20-24 age group (5.6%). Both age groups experienced a sharp employment decline during the COVID-19 pandemic in 2020, followed by a strong recovery from 2021 onwards. A moderate negative correlation (r = -0.609) was identified between employment and unemployment counts. These findings refined the analytical questions and provide the foundation for the inferential analysis in this report.

# Load libraries
library(tidyverse)
library(ggplot2)
library(scales)
library(corrplot)

# Load the saved workspace (absolute path; adjust for your machine)
load("/Users/paul/Documents/PGDAV8/youth_employment_analysis.RData")
cat("Data loaded successfully. Rows:", nrow(df_youth_lf), "\n")

## Data loaded successfully. Rows: 3886

2. Correlation Analysis

2a. Correlation Matrix

A correlation matrix was computed to examine the relationships between the key variables: employment count, unemployment count, unemployment rate and year. Pearson correlation coefficients were selected because all four variables are continuous and numerical, which makes Pearson a suitable method for this analysis.

# Prepare wide format dataset
df_corr_raw <- df_youth_lf[
  df_youth_lf$sex == "Total Both Sexes" &
    df_youth_lf$emp_status %in% c("Persons Employed in Labour Force",
                                  "Persons Unemployed in Labour Force",
                                  "Unemployment Rate"), ]

df_corr_wide <- reshape(df_corr_raw[, c("date", "age_group", 
                                         "emp_status", "data_value")],
                        idvar   = c("date", "age_group"),
                        timevar = "emp_status",
                        direction = "wide")

names(df_corr_wide) <- c("date", "age_group",
                         "employed", "unemployed", "unemp_rate")

df_corr_wide$year <- as.integer(format(df_corr_wide$date, "%Y"))

# Compute correlation matrix
corr_data   <- df_corr_wide[, c("employed", "unemployed", "unemp_rate", "year")]
corr_data   <- corr_data[complete.cases(corr_data), ]
corr_matrix <- cor(corr_data, use = "complete.obs")

# Print results
print(round(corr_matrix, 3))

##            employed unemployed unemp_rate  year
## employed      1.000     -0.609     -0.914 0.069
## unemployed   -0.609      1.000      0.869 0.257
## unemp_rate   -0.914      0.869      1.000 0.062
## year          0.069      0.257      0.062 1.000

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         tl.col      = "black",
         tl.srt      = 45,
         col         = colorRampPalette(c("#D7191C", "white", "#2C7BB6"))(200),
         title       = "Correlation Matrix: Youth Labour Force Variables",
         mar         = c(0, 0, 2, 0))

Figure 1: Correlation matrix

The correlation matrix reveals a very strong negative correlation between employment count and unemployment rate (r = -0.91), which indicates that as more youth enter employment, the unemployment rate tends to fall. The correlation between employment count and unemployment count is moderate and negative (r = -0.61), which suggests that employment growth does not always translate directly into a reduction in unemployment numbers, possibly because of fluctuations in labour force size. The weak correlations involving year (r = 0.06 to 0.26) indicate no strong long-term linear trend, which is consistent with the COVID-19 shock being a cyclical disruption rather than a structural trend.
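Whether the weak correlations with year differ from zero can be checked with the usual t-transformation of a Pearson coefficient. The sketch below is a minimal illustration and not part of the original analysis: it uses the coefficients printed above and assumes n = 58 complete observations, a figure inferred from the regression output in Section 4 (55 residual degrees of freedom plus 3 coefficients). The helper name `r_to_p` is hypothetical.

```r
# Two-sided p-value for a Pearson correlation r from n paired observations,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
r_to_p <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
}

# ASSUMPTION: n = 58 is inferred from the regression output, not reported directly
n_obs <- 58

# Coefficients copied from the Pearson matrix above
r_values <- c(employed_vs_unemp_rate = -0.914,
              employed_vs_unemployed = -0.609,
              year_vs_employed       =  0.069,
              year_vs_unemployed     =  0.257,
              year_vs_unemp_rate     =  0.062)

p_values <- sapply(r_values, r_to_p, n = n_obs)
print(signif(p_values, 3))
```

With the data loaded, `cor.test(corr_data$year, corr_data$unemployed)` should give the equivalent test without re-entering the coefficients by hand.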

2b. Spearman Rank Order Correlation

Before relying on the Pearson results, the data were checked for normality using the Shapiro-Wilk test. The results showed that the data for both age groups were not normally distributed (p < 0.05), so a Spearman rank-order correlation was run as a robustness check. The Spearman results were broadly similar to the Pearson results in sign and strength, which supports the reliability of the Pearson findings.

spearman_matrix <- cor(corr_data, method = "spearman", use = "complete.obs")
cat("=== SPEARMAN CORRELATION MATRIX ===\n")

## === SPEARMAN CORRELATION MATRIX ===

print(round(spearman_matrix, 3))

##            employed unemployed unemp_rate  year
## employed      1.000     -0.567     -0.814 0.112
## unemployed   -0.567      1.000      0.916 0.179
## unemp_rate   -0.814      0.916      1.000 0.041
## year          0.112      0.179      0.041 1.000
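The agreement between the two methods can be quantified directly from the printed matrices. The sketch below re-enters both matrices by hand (values copied from the outputs above; the object names are illustrative) and reports the largest absolute gap and whether every coefficient keeps the same sign.

```r
vars <- c("employed", "unemployed", "unemp_rate", "year")

# Pearson coefficients copied from the first printed matrix
pearson <- matrix(c( 1.000, -0.609, -0.914, 0.069,
                    -0.609,  1.000,  0.869, 0.257,
                    -0.914,  0.869,  1.000, 0.062,
                     0.069,  0.257,  0.062, 1.000),
                  nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Spearman coefficients copied from the second printed matrix
spearman <- matrix(c( 1.000, -0.567, -0.814, 0.112,
                     -0.567,  1.000,  0.916, 0.179,
                     -0.814,  0.916,  1.000, 0.041,
                      0.112,  0.179,  0.041, 1.000),
                   nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

abs_gap   <- abs(pearson - spearman)               # element-wise difference
max_gap   <- max(abs_gap)                          # largest disagreement
same_sign <- all(sign(pearson) == sign(spearman))  # direction preserved?

print(round(abs_gap, 3))
cat("Largest absolute gap:", round(max_gap, 3), "\n")
cat("All coefficients share the same sign:", same_sign, "\n")
```

The largest gap is 0.100, between employed and unemp_rate (-0.914 against -0.814), and no coefficient changes sign.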

3. Hypothesis Testing

Based on the EDA findings, a formal hypothesis test was developed to determine whether the difference in mean unemployment rates between the two age groups is statistically significant. The hypotheses tested are:

H0: There is no significant difference in the mean unemployment rate between the 15-19 and 20-24 age groups.
H1: The 15-19 age group has a significantly higher mean unemployment rate than the 20-24 age group.

3a. Normality Assessment

Before selecting the test, a Shapiro-Wilk normality test was conducted to assess whether the data follows a normal distribution.

# Prepare data
df_rate_test <- df_youth_lf[
  df_youth_lf$emp_status == "Unemployment Rate" &
    df_youth_lf$sex == "Total Both Sexes", ]

rate_1519 <- df_rate_test$data_value[
  df_rate_test$age_group == "Aged 15-19 Years"]
rate_2024 <- df_rate_test$data_value[
  df_rate_test$age_group == "Aged 20-24 Years"]

# Shapiro-Wilk test
cat("=== NORMALITY TEST ===\n")

## === NORMALITY TEST ===

print(shapiro.test(rate_1519))

## 
##  Shapiro-Wilk normality test
## 
## data:  rate_1519
## W = 0.80113, p-value = 1.752e-09

print(shapiro.test(rate_2024))

## 
##  Shapiro-Wilk normality test
## 
## data:  rate_2024
## W = 0.85065, p-value = 6.615e-08

The test shows that both groups significantly violate the normality assumption (p < 0.05). Despite this, the Welch two-sample t-test was still used because it is reasonably robust to non-normality when sample sizes are moderately large, and it does not assume equal variances.
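Because normality is violated, a rank-based test offers a useful cross-check on the t-test. The sketch below is an illustration under stated assumptions rather than the analysis used in this report: it applies the Wilcoxon rank-sum test (`wilcox.test` in base R) to simulated right-skewed data standing in for `rate_1519` and `rate_2024`. The simulated values, the sample size, the seed and the `sim_` names are placeholders, so the printed p-value says nothing about the real HLFS data.

```r
set.seed(42)

# PLACEHOLDER DATA: simulated right-skewed rates, not the HLFS series
sim_1519 <- 10 + rexp(50, rate = 1 / 3)
sim_2024 <-  4 + rexp(50, rate = 1 / 1.6)

# One-sided test: is the 15-19 distribution shifted above the 20-24 one?
w_result <- wilcox.test(sim_1519, sim_2024, alternative = "greater")
print(w_result)
```

On the real data the equivalent call would be `wilcox.test(rate_1519, rate_2024, alternative = "greater")`; if it agrees with the Welch result, the conclusion does not hinge on the normality assumption.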

3b. T-Test Results

# Welch two-sample t-test (one-tailed)
cat("=== WELCH TWO-SAMPLE T-TEST ===\n")

## === WELCH TWO-SAMPLE T-TEST ===

t_result <- t.test(rate_1519, rate_2024,
                   alternative = "greater",
                   var.equal   = FALSE)
print(t_result)

## 
##  Welch Two Sample t-test
## 
## data:  rate_1519 and rate_2024
## t = 8.2035, df = 114.77, p-value = 1.905e-13
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  5.800548      Inf
## sample estimates:
## mean of x mean of y 
## 12.874713  5.604598

cat("\nMean difference:", 
    round(mean(rate_1519) - mean(rate_2024), 2), 
    "percentage points\n")

## 
## Mean difference: 7.27 percentage points

ggplot(df_rate_test, aes(x = age_group, y = data_value, fill = age_group)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.shape = 16) +
  scale_fill_manual(values = c("Aged 15-19 Years" = "#F4A460",
                                "Aged 20-24 Years" = "#4682B4")) +
  annotate("text", x = 1.5, y = max(df_rate_test$data_value) * 0.95,
           label = "Welch t-test: p < 0.001",
           size = 4, fontface = "italic") +
  labs(title    = "Unemployment Rate Distribution by Age Group",
       subtitle = "New Zealand HLFS (2018-2025)",
       x = "Age Group",
       y = "Unemployment Rate (%)") +
  theme_minimal() +
  theme(legend.position = "none")

Figure 2: Unemployment rate by age group

The t-test result was highly significant (t = 8.2035, df = 114.77, p = 1.905e-13), well below the 0.05 significance level, so H0 is rejected. This confirms that the 15-19 age group faces a statistically significantly higher unemployment rate than the 20-24 age group, with a mean difference of 7.27 percentage points. This is consistent with prior research indicating that younger teenagers are structurally more vulnerable in the labour market because of their concentration in casual and part-time roles.
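The reported figures are internally consistent, which can be verified with a few lines of arithmetic. The sketch below uses only the values printed in the Welch output above (the variable names are illustrative): it recovers the implied standard error from the t statistic, then rebuilds the one-sided 95% lower confidence bound and the p-value.

```r
# Values copied from the Welch t-test output above
mean_1519 <- 12.874713
mean_2024 <-  5.604598
t_stat    <-  8.2035
df_welch  <-  114.77

mean_diff <- mean_1519 - mean_2024   # difference in percentage points
std_error <- mean_diff / t_stat      # implied standard error of the difference

# One-sided 95% lower confidence bound: diff - t_(0.95, df) * SE
lower_bound <- mean_diff - qt(0.95, df = df_welch) * std_error

# One-sided p-value: P(T > t) under H0
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

cat("Mean difference:", round(mean_diff, 2), "\n")
cat("Implied standard error:", round(std_error, 3), "\n")
cat("Lower 95% bound:", round(lower_bound, 2), "\n")
cat("One-sided p-value:", signif(p_value, 3), "\n")
```

The rebuilt lower bound should agree with the 5.80 reported by `t.test`, up to rounding of the inputs.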

4. Linear Regression Model

To further explore the relationship between employment and unemployment, a multiple linear regression was fitted to predict unemployment count from employment count and age group. Age group was included as a categorical variable to account for the structural differences between the two cohorts that were identified in the hypothesis test.

4a. Model Summary

# Prepare regression data
df_reg <- df_corr_wide[complete.cases(
  df_corr_wide[, c("employed", "unemployed", "age_group")]), ]
df_reg$age_group <- as.factor(df_reg$age_group)

# Fit model
cat("=== LINEAR REGRESSION MODEL ===\n")

## === LINEAR REGRESSION MODEL ===

model <- lm(unemployed ~ employed + age_group, data = df_reg)
print(summary(model))

## 
## Call:
## lm(formula = unemployed ~ employed + age_group, data = df_reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2155 -3.9760 -0.1675  3.4414 12.8317 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                19.03128    9.44250   2.015   0.0488 *
## employed                    0.07725    0.07137   1.083   0.2837  
## age_groupAged 20-24 Years -17.44789    8.22038  -2.123   0.0383 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.333 on 55 degrees of freedom
## Multiple R-squared:  0.4185, Adjusted R-squared:  0.3973 
## F-statistic: 19.79 on 2 and 55 DF,  p-value: 3.357e-07

The model is statistically significant overall (F = 19.79, p = 3.357e-07), and the adjusted R-squared of 0.3973 indicates that the model explains approximately 40% of the variance in unemployment counts. The model equation is:

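Unemployed = 19.031 + (0.077 x Employed) - (17.448 x Age Group 20-24)

where Age Group 20-24 equals 1 for the 20-24 cohort and 0 for the 15-19 cohort (the reference level).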

The age group coefficient (-17.448, p = 0.038) is statistically significant, confirming that the 20-24 age group has systematically lower unemployment counts than the 15-19 age group when employment levels are held constant. The employment count coefficient (0.077, p = 0.284) is not statistically significant, which indicates that employment count alone is not a strong direct predictor once age group is accounted for.
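To make the coefficients concrete, the fitted equation can be evaluated for both cohorts at the same employment level. The sketch below is a worked example only: the coefficients are copied from the `summary(model)` output above, while the function name `predict_unemployed` and the employment level of 100 are hypothetical choices made for illustration.

```r
# Coefficients copied from the summary(model) output above
b_intercept <-  19.03128
b_employed  <-   0.07725
b_age_2024  <- -17.44789

# Predicted unemployment count; is_2024 is 1 for the 20-24 cohort, 0 for 15-19
predict_unemployed <- function(employed, is_2024) {
  b_intercept + b_employed * employed + b_age_2024 * is_2024
}

# HYPOTHETICAL employment level, in the same units as the `employed` column
employed_example <- 100

pred_1519 <- predict_unemployed(employed_example, is_2024 = 0)
pred_2024 <- predict_unemployed(employed_example, is_2024 = 1)

cat("Predicted unemployed, 15-19:", round(pred_1519, 3), "\n")
cat("Predicted unemployed, 20-24:", round(pred_2024, 3), "\n")
cat("Gap at equal employment:", round(pred_2024 - pred_1519, 3), "\n")
```

Whatever employment level is chosen, the gap between the two predictions equals the age group coefficient (-17.448), which is what "held constant" means in this model.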

4b. Regression Fit Plot

df_reg$predicted <- predict(model)

ggplot(df_reg, aes(x = employed, y = unemployed, colour = age_group)) +
  geom_point(size = 2.5, alpha = 0.7) +
  geom_line(aes(y = predicted), linewidth = 1, linetype = "dashed") +
  scale_colour_manual(values = c("Aged 15-19 Years" = "#F4A460",
                                  "Aged 20-24 Years" = "#4682B4")) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title    = "Linear Regression: Employment vs Unemployment Count",
       subtitle = "Predicted vs Actual, New Zealand HLFS (2018-2025)",
       x        = "Persons Employed (thousands)",
       y        = "Persons Unemployed (thousands)",
       colour   = "Age Group") +
  theme_minimal() +
  theme(legend.position = "bottom")

Figure 3: Regression fit - predicted vs actual

4c. Regression Diagnostics

par(mfrow = c(2, 2))
plot(model)

Figure 4: Regression diagnostic plots

par(mfrow = c(1, 1))

The diagnostic plots were checked to verify the model assumptions. The residuals versus fitted plot shows no clear pattern, which suggests that the linear relationship is reasonable. The Q-Q plot shows some deviation from normality in the residuals, which is expected since the raw data were also not normally distributed. The scale-location plot shows a relatively stable spread across the fitted values. No major outliers were noticed that could significantly affect the model. One limitation of this model is that it does not account for time-based patterns in the quarterly data, which could be explored in future analysis.
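One way the time-based limitation could be explored is by checking the residuals for serial correlation. The sketch below is a hedged illustration on simulated data, not a result from this report: it fits a regression whose errors follow an AR(1) process, then computes the Durbin-Watson statistic by hand and the lag-1 residual autocorrelation with `acf` from base R. All data, the seed and the object names are placeholders.

```r
set.seed(1)

# PLACEHOLDER DATA: a regression whose errors follow an AR(1) process
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 1 + 0.5 * x + e

fit <- lm(y ~ x)
res <- residuals(fit)

# Durbin-Watson statistic: values near 2 suggest no lag-1 autocorrelation,
# values well below 2 suggest positive autocorrelation
dw_stat <- sum(diff(res)^2) / sum(res^2)

# Lag-1 autocorrelation of the residuals (element 1 of $acf is lag 0)
lag1_acf <- acf(res, plot = FALSE)$acf[2]

cat("Durbin-Watson statistic:", round(dw_stat, 2), "\n")
cat("Lag-1 residual autocorrelation:", round(lag1_acf, 2), "\n")
```

Applying the same check to `residuals(model)` would only be meaningful with the rows sorted by date and each age group examined separately, since the fitted data pool two quarterly series.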

Summary

In summary, this analysis has produced several key findings relevant to the youth unemployment business problem in New Zealand. The correlation analysis confirmed a very strong negative relationship between employment count and unemployment rate (r = -0.914) and a moderate negative relationship between employment and unemployment counts (r = -0.609). The hypothesis test confirmed that the 15-19 age group has a significantly higher unemployment rate than the 20-24 age group (t = 8.2035, p < 0.001), with a mean difference of 7.27 percentage points. The regression model further confirmed that age group is a significant predictor, with the model as a whole explaining 39.7% of the variance in unemployment counts (adjusted R-squared).

These findings confirm that younger teenagers in New Zealand face greater disadvantages in the labour market. It is recommended that interventions such as youth employment programmes and vocational training be considered for the 15-19 age group.

References

Huang, T. (2021). Youth not in employment, education or training (NEET) in Auckland: Trends June 2011 to June 2021 (Technical Report 2021/20). Auckland Council Research and Evaluation Unit. https://knowledgeauckland.org.nz/media/2247/tr2021-20-youth-not-in-employment-education-or-training-neet-in-auckland-trends-2011-2021.pdf

Organisation for Economic Co-operation and Development. (2023). Youth employment and labour market trends. OECD. https://www.oecd.org/employment/youth/

Stats NZ. (2025). Household Labour Force Survey - Population rebase: September 2018-March 2025 quarters. https://www.stats.govt.nz/large-datasets/csv-files-for-download/

PGDAV8.100 - Topic 3 & 4A: Statistical Inference and Machine Learning

Paul Richard Vinsitti | ID: 2025004328

2026