Introduction

Understanding the factors affecting wages is vital for policymakers, employers, and economists. Wage determination is influenced by numerous demographic, socioeconomic, and employment-related variables. This study applies multiple linear regression, with the natural logarithm of wage as the response, to analyze how variables such as education, age, gender, insurance status, union membership, metropolitan residence, marital status, and region are associated with wages for younger and older workers separately. For each group, a full model containing all available explanatory variables was fitted by ordinary least squares and then reduced by backward stepwise selection, which drops terms for as long as doing so lowers the Akaike Information Criterion (AIC). The two groups are defined by age: “young” (under 35) and “old” (35 and over).

library import

library(psych)
## Warning: package 'psych' was built under R version 4.3.3
library(MASS)

import dataset

data <- read.csv("D:/MA334-SP-7_2412507 (1).csv")

Description of Models

Model for the Young Group

The reduced model for the young cohort includes five predictor variables:
Age (age): Continuous variable giving the worker’s age in years.
Education (educ): Education level, stored as an integer from 0 to 5; the values are too small to be years of schooling.
Gender (gender): A binary indicator coded 0/1.
Insurance (insure): A 0/1 indicator of whether the worker has insurance.
Union membership (union): A 0/1 indicator of whether the worker belongs to a labor union.

Model for the Old Group

The reduced model for the older cohort contains six predictors, which yield eight slope coefficients because region enters as three dummy variables:
Education (educ)
Gender (gender)
Insurance (insure)
Metropolitan residence (metro)
Marital status (marital)
Region (region): Categorical variable with four levels; Northeast, South, and West each receive a coefficient, and Midwest is the omitted baseline category.
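Written out, the two reduced specifications look roughly as follows. This is only a notational sketch: the symbols (beta and gamma for the coefficients, epsilon and u for the error terms) are introduced here for illustration, and the variable names follow the dataset columns.

```latex
\begin{align}
\log(\text{wage}_i) &= \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{educ}_i
  + \beta_3\,\text{gender}_i + \beta_4\,\text{insure}_i
  + \beta_5\,\text{union}_i + \varepsilon_i
  && \text{(young)} \\
\log(\text{wage}_i) &= \gamma_0 + \gamma_1\,\text{educ}_i + \gamma_2\,\text{gender}_i
  + \gamma_3\,\text{insure}_i + \gamma_4\,\text{metro}_i
  + \gamma_5\,\text{marital}_i \notag \\
  &\quad + \gamma_6\,\text{northeast}_i + \gamma_7\,\text{south}_i
  + \gamma_8\,\text{west}_i + u_i
  && \text{(old)}
\end{align}
```

Here northeast, south, and west are 0/1 dummies for region, so each of their coefficients is measured relative to the omitted Midwest category.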

Data Exploration

Overview

str(data)
## 'data.frame':    1181 obs. of  12 variables:
##  $ age    : int  29 45 39 30 42 47 62 57 21 69 ...
##  $ educ   : int  4 3 2 3 3 3 2 2 1 0 ...
##  $ gender : int  1 1 1 0 0 1 1 0 0 1 ...
##  $ hrswork: int  40 45 40 45 60 45 40 48 40 40 ...
##  $ insure : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ metro  : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ nchild : int  2 3 1 0 3 0 1 0 0 0 ...
##  $ union  : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ wage   : num  25.9 14.4 17.2 17.1 18.3 ...
##  $ race   : chr  "White" "White" "White" "White" ...
##  $ marital: int  1 2 1 0 1 1 1 1 0 2 ...
##  $ region : chr  "south" "south" "midwest" "northeast" ...
summary(data)
##       age             educ           gender         hrswork     
##  Min.   :17.00   Min.   :0.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:32.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:40.00  
##  Median :43.00   Median :2.000   Median :0.000   Median :40.00  
##  Mean   :42.61   Mean   :1.751   Mean   :0.442   Mean   :41.61  
##  3rd Qu.:52.00   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:42.00  
##  Max.   :77.00   Max.   :5.000   Max.   :1.000   Max.   :80.00  
##      insure           metro            nchild           union       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.8256   Mean   :0.8239   Mean   :0.8061   Mean   :0.1372  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :9.0000   Max.   :1.0000  
##       wage           race              marital          region         
##  Min.   : 2.50   Length:1181        Min.   :0.0000   Length:1181       
##  1st Qu.:13.00   Class :character   1st Qu.:0.0000   Class :character  
##  Median :18.75   Mode  :character   Median :1.0000   Mode  :character  
##  Mean   :22.77                      Mean   :0.8476                     
##  3rd Qu.:28.84                      3rd Qu.:1.0000                     
##  Max.   :99.00                      Max.   :2.0000
nrow(data)  
## [1] 1181
ncol(data) 
## [1] 12

Descriptive Statistics

describe(data)

Visualizations

Histograms

hist(data$wage, main="Wage Distribution", xlab="Wage")

hist(data$age, main="Age Distribution", xlab="Age")

Boxplots

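# Labels assume gender is coded 0 = Female, 1 = Male; the coding is not documented here, so verify it before relying on them.
boxplot(wage ~ gender, data=data, main="Wage by Gender", names=c("Female", "Male"))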

Correlation heatmap

num_vars <- data[sapply(data, is.numeric)]
cor_matrix <- cor(num_vars)
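# symm = TRUE treats the matrix as symmetric; scale = "none" stops heatmap() from
# rescaling each row, which would otherwise distort the plotted correlations.
heatmap(cor_matrix, symm=TRUE, scale="none", main="Correlation Matrix")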

Correlations

cor(data$age, data$wage)
## [1] 0.2119489
cor(data$hrswork, data$wage)
## [1] 0.09091083

Statistical Significance and Coefficient Interpretation

Young Group Model

The reduced model has an adjusted R-squared of 0.313 (multiple R-squared 0.322), so it explains roughly 31% of the variance in log wages, a moderate level of explanatory power. Because the response is log(wage), a coefficient b corresponds to a change of 100 × (exp(b) - 1) percent in wage per one-unit increase in the predictor; for small b this is close to 100 × b.
Age: The coefficient is 0.028 (p < 0.001), so each additional year of age is associated with approximately a 2.9% higher wage, holding other variables constant.
Education: The strongest predictor, with a coefficient of 0.129 (p < 0.001). Each one-level increase in educ is associated with about a 13.7% higher wage (Dayioglu, Küçükbayrak and Tumen, 2022).
Gender: The coefficient is -0.192 (p < 0.001), indicating that, controlling for other factors, workers coded gender = 1 earn about 17.5% less than those coded 0. The report does not document the coding, so which group this is should be confirmed against the data source.
Insurance: Workers with insurance earn approximately 24.5% more (coefficient 0.219, p < 0.001).
Union Membership: Union members earn around 17.0% more (coefficient 0.157, p = 0.029).
All predictors are statistically significant at the 5% level, underscoring their relevance in explaining wage variation among younger workers.

Old Group Model

This model has an adjusted R-squared of 0.263 (multiple R-squared 0.270), so it accounts for about 26% of the variation in log wages, slightly lower than the young group model but still substantial (Kasilingam and Krishna, 2022). Percentage effects below use 100 × (exp(b) - 1) for a coefficient b.
Education: Again the strongest predictor, with a coefficient of 0.155 (p < 0.001), or about a 16.8% higher wage per one-level increase in educ.
Gender: Negative coefficient of -0.188 (p < 0.001), about 17.1% lower wages for workers coded gender = 1, consistent with the young group.
Insurance: Positive effect (0.253, p < 0.001), suggesting insured workers earn about 28.8% more.
Metropolitan Residence: Positive and significant (0.140, p = 0.003), indicating that living in a metro area is associated with wages about 15.0% higher.
Marital Status: Marginally significant positive effect (0.058, p = 0.067). Because marital enters the model as a numeric code (0, 1, 2) rather than as a set of categories, the coefficient is the change per one-step increase in that code, so it only loosely supports the reading that married workers earn slightly more.
Region: Relative to the Midwest baseline, the West shows a positive significant effect (0.133, p = 0.008), about 14.2% higher wages. The Northeast (p = 0.261) and South (p = 0.447) show no significant difference from the baseline.
Most predictors are statistically significant, except marital status (p slightly above 0.05) and the Northeast and South dummies, indicating that regional wage differences are less pronounced except for the West.
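The step from a log-wage coefficient to a percentage effect can be checked directly. The sketch below copies the reduced-model estimates from the summaries in this report and applies 100 × (exp(b) - 1); the helper name `pct_effect` is introduced here purely for illustration.

```r
# Coefficients copied from the reduced-model summaries in this report.
young_coefs <- c(age = 0.028131, educ = 0.128586, gender = -0.191847,
                 insure = 0.218750, union = 0.156707)
old_coefs <- c(educ = 0.15494, gender = -0.18811, insure = 0.25329,
               metro = 0.14017, marital = 0.05844, regionwest = 0.13263)

# Exact percentage change in wage for a one-unit increase in a predictor
# when the response is log(wage): 100 * (exp(beta) - 1).
pct_effect <- function(beta) 100 * (exp(beta) - 1)

young_pct <- round(pct_effect(young_coefs), 1)
old_pct <- round(pct_effect(old_coefs), 1)
print(young_pct)
print(old_pct)
```

The gap between 100 × b and the exact figure grows with the size of the coefficient, which is why it matters most for insurance and least for age.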

Probability & Distributions

Probability that at least one of five randomly selected workers is uninsured

p_no_insure <- sum(data$insure == 0) / nrow(data)
prob_at_least_one_no_insure <- 1 - (1 - p_no_insure)^5
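As a rough check on the complement rule used above, the same probability can be obtained from the binomial distribution. This sketch does not read the dataset; it takes the uninsured share from the reported mean of insure (0.8256) and assumes the five workers are drawn independently.

```r
# Share of uninsured workers, taken from the reported mean of insure (0.8256).
p_no_insure <- 1 - 0.8256
group_size <- 5

# P(at least one uninsured) = 1 - P(all five insured), assuming independent draws.
p_complement <- 1 - (1 - p_no_insure)^group_size

# Same quantity from the binomial distribution: 1 - P(X = 0).
p_binomial <- 1 - dbinom(0, size = group_size, prob = p_no_insure)

print(round(c(complement = p_complement, binomial = p_binomial), 4))
```

Both routes give a probability of about 0.616, so a group of five is more likely than not to contain at least one uninsured worker.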

P(nchild ≥ 1 | married)

# Treats marital codes 1 and 2 as married; the coding is not documented in this report.
married_data <- data[data$marital %in% c(1, 2), ]
p_nchild_given_married <- sum(married_data$nchild >= 1) / nrow(married_data)

Probability distribution for nchild

nchild_table <- table(data$nchild)
nchild_probs <- prop.table(nchild_table)
nchild_df <- data.frame(nchild = as.numeric(names(nchild_probs)), prob = as.vector(nchild_probs))

mean_nchild <- sum(nchild_df$nchild * nchild_df$prob)
var_nchild <- sum((nchild_df$nchild - mean_nchild)^2 * nchild_df$prob)
p_nchild_3_or_more <- sum(nchild_df$prob[nchild_df$nchild >= 3])
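The mean and variance computed from the probability table relate to the usual sample statistics in a simple way, illustrated below on a small hypothetical vector of child counts (not the survey data). The mean matches `mean()` exactly, while the variance is the population variance, which differs from `var()` by a factor of (n - 1) / n.

```r
# Hypothetical child counts, used only to illustrate the calculation.
nchild <- c(0, 0, 0, 1, 1, 2, 2, 2, 3, 5)

# Empirical probability mass function.
probs <- prop.table(table(nchild))
values <- as.numeric(names(probs))
p <- as.vector(probs)

# Expected value and variance computed from the pmf.
mean_pmf <- sum(values * p)
var_pmf <- sum((values - mean_pmf)^2 * p)

# The pmf mean equals the sample mean; the pmf variance equals the
# population variance, i.e. var() rescaled by (n - 1) / n.
n <- length(nchild)
print(c(mean_pmf = mean_pmf, sample_mean = mean(nchild)))
print(c(var_pmf = var_pmf, rescaled_var = var(nchild) * (n - 1) / n))
```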

Point Estimates & Hypothesis Testing

Mean & 95% CI for wage (nchild == 2)

wage_2child <- data$wage[data$nchild == 2]
mean(wage_2child)
## [1] 23.43355
t.test(wage_2child, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  wage_2child
## t = 24.938, df = 216, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.58146 25.28563
## sample estimates:
## mean of x 
##  23.43355
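The interval reported by `t.test()` can be reconstructed from the printed summary figures alone, which makes the formula explicit. The sketch below uses only the mean, t statistic, and degrees of freedom shown in the output; the variable names are chosen here for illustration.

```r
# Summary figures reported by t.test() above.
x_bar <- 23.43355   # sample mean of wage for nchild == 2
t_stat <- 24.938    # t statistic for H0: mean = 0
dof <- 216          # degrees of freedom, so n = dof + 1 = 217

# Under H0: mean = 0 the statistic is t = x_bar / se, so se can be recovered.
se <- x_bar / t_stat

# 95% interval: x_bar +/- t_{0.975, dof} * se.
t_crit <- qt(0.975, dof)
ci <- x_bar + c(-1, 1) * t_crit * se
print(round(ci, 2))
```

Up to rounding in the printed t statistic, this reproduces the reported interval of 21.58 to 25.29.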

For nchild >= 5 (check feasibility)

subset_5plus <- data[data$nchild >= 5, ]
nrow(subset_5plus)  
## [1] 9

Contingency table and hypothesis test

table_gender_insure <- table(data$gender, data$insure)
chisq.test(table_gender_insure)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_gender_insure
## X-squared = 0.0574, df = 1, p-value = 0.8107
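To make the mechanics of the test concrete, the sketch below runs `chisq.test()` on a hypothetical 2 × 2 table (the cell counts are invented for illustration, since the report does not print the observed table) and compares the expected counts it returns with the manual formula, row total × column total / grand total.

```r
# Hypothetical 2 x 2 counts (rows: gender 0/1, columns: insure 0/1).
tab <- matrix(c(115, 91, 544, 431), nrow = 2,
              dimnames = list(gender = c("0", "1"), insure = c("0", "1")))

# Yates' continuity correction is applied by default for 2 x 2 tables.
test <- chisq.test(tab)

# Expected counts under independence: row total * column total / grand total.
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

print(test$expected)
print(test$p.value)
```

A large p-value, as in the report's own result (p = 0.8107), means the data give no evidence against independence of gender and insurance status.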

Model Diagnostics and Fit

Both models have residual standard errors below 0.5 on the log-wage scale: 0.423 for the young group and 0.490 for the old group. The F-statistics for both models are highly significant (p < 0.001), confirming that the predictors jointly explain a significant share of the variance in log wages in their respective samples. The young group model explains slightly more variance than the old group model (adjusted R-squared 0.313 versus 0.263) (Autor, Dube and McGrew, 2023), potentially because age and union membership, which are retained only in the young model, add predictive power.

Simple Linear Regression

Split data into ‘young’ and ‘old’

young <- data[data$age < 35, ]
old <- data[data$age >= 35, ]

Fit model: log(wage) ~ age

model_young <- lm(log(wage) ~ age, data=young)
model_old <- lm(log(wage) ~ age, data=old)

summary(model_young)
## 
## Call:
## lm(formula = log(wage) ~ age, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63005 -0.32110 -0.01201  0.31821  1.49042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.594555   0.173214   9.206  < 2e-16 ***
## age         0.041382   0.006074   6.813 3.85e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.108 
## F-statistic: 46.41 on 1 and 374 DF,  p-value: 3.846e-11
summary(model_old)
## 
## Call:
## lm(formula = log(wage) ~ age, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91172 -0.39124 -0.04711  0.39679  1.54456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0795566  0.1157775  26.599   <2e-16 ***
## age         -0.0005273  0.0023115  -0.228     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared:  6.479e-05,  Adjusted R-squared:  -0.00118 
## F-statistic: 0.05203 on 1 and 803 DF,  p-value: 0.8196

Scatter plots

plot(young$age, log(young$wage), main="Young: log(wage) ~ age")
abline(model_young, col="blue")

plot(old$age, log(old$wage), main="Old: log(wage) ~ age")
abline(model_old, col="red")

Multiple Linear Regression

Encode categorical variables & build full model

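# NOTE: young and old were created from data before this conversion, so they
# still hold gender and marital as integers; this is why the summaries below
# report a single "gender" and a single "marital" coefficient.
data$gender <- as.factor(data$gender)
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)
data$marital <- as.factor(data$marital)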
full_model_young <- lm(log(wage) ~ . -wage, data=young)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
full_model_old <- lm(log(wage) ~ . -wage, data=old)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
summary(full_model_young)
## 
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36303 -0.26382 -0.01698  0.25524  1.30213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.834191   0.209063   8.773  < 2e-16 ***
## age              0.028458   0.006332   4.495 9.40e-06 ***
## educ             0.121518   0.017415   6.978 1.44e-11 ***
## gender          -0.194123   0.047602  -4.078 5.59e-05 ***
## hrswork         -0.003245   0.002368  -1.370   0.1715    
## insure           0.224896   0.053054   4.239 2.85e-05 ***
## metro            0.011774   0.058192   0.202   0.8398    
## nchild          -0.025153   0.025517  -0.986   0.3249    
## union            0.159936   0.073275   2.183   0.0297 *  
## raceBlack       -0.172978   0.118896  -1.455   0.1466    
## raceWhite       -0.102353   0.089136  -1.148   0.2516    
## marital          0.051933   0.043641   1.190   0.2348    
## regionnortheast  0.116789   0.067034   1.742   0.0823 .  
## regionsouth      0.010973   0.058890   0.186   0.8523    
## regionwest       0.048742   0.065094   0.749   0.4545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4226 on 361 degrees of freedom
## Multiple R-squared:  0.3388, Adjusted R-squared:  0.3132 
## F-statistic: 13.21 on 14 and 361 DF,  p-value: < 2.2e-16
summary(full_model_old)
## 
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85888 -0.30451  0.02666  0.32575  1.31774 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.2694898  0.1799543  12.611  < 2e-16 ***
## age              0.0003294  0.0021291   0.155  0.87711    
## educ             0.1551089  0.0119593  12.970  < 2e-16 ***
## gender          -0.1811629  0.0355925  -5.090 4.48e-07 ***
## hrswork          0.0015615  0.0021518   0.726  0.46824    
## insure           0.2475608  0.0528619   4.683 3.32e-06 ***
## metro            0.1417880  0.0471982   3.004  0.00275 ** 
## nchild          -0.0177843  0.0164374  -1.082  0.27961    
## union            0.0452883  0.0489084   0.926  0.35474    
## raceBlack       -0.0106162  0.1013315  -0.105  0.91659    
## raceWhite        0.0849661  0.0832381   1.021  0.30768    
## marital          0.0548061  0.0320963   1.708  0.08811 .  
## regionnortheast  0.0536894  0.0533367   1.007  0.31443    
## regionsouth      0.0456868  0.0466322   0.980  0.32752    
## regionwest       0.1326383  0.0506384   2.619  0.00898 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 790 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.2632 
## F-statistic: 21.51 on 14 and 790 DF,  p-value: < 2.2e-16
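One detail worth checking in the code above is the order of operations: `young` and `old` are created before the `as.factor()` calls, and R subsets are copies, so later changes to `data` do not reach them. This is consistent with the single `gender` and `marital` rows in the summaries. The toy example below (hypothetical values, not the survey data) shows the behaviour.

```r
# Toy data frame standing in for the survey data (hypothetical values).
toy <- data.frame(age = c(25, 40, 31, 58),
                  gender = c(1L, 0L, 0L, 1L),
                  marital = c(0L, 2L, 1L, 1L))

# Subsets taken BEFORE the conversion keep the original integer columns.
young_before <- toy[toy$age < 35, ]

toy$gender <- as.factor(toy$gender)
toy$marital <- as.factor(toy$marital)

# Subsets taken AFTER the conversion inherit the factor columns.
young_after <- toy[toy$age < 35, ]

print(class(young_before$marital))  # "integer"
print(class(young_after$marital))   # "factor"
```

If marital were meant to be categorical, splitting after the conversion would give it two dummy coefficients instead of one slope, and the fitted models would differ from those reported here.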

Compare Full vs Simple Models

summary(model_young)
summary(model_old)
summary(full_model_young)
summary(full_model_old)
# Output is identical to the four summaries printed in the two preceding sections.
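A formal way to compare the simple and full models for one group is a partial F test, since the simple model is nested in the full one. With the fitted objects available this would normally be `anova(model_young, full_model_young)`; the sketch below instead approximates the statistic from the rounded residual standard errors and degrees of freedom printed in the young-group summaries, so the result is only accurate to a couple of digits.

```r
# Figures reported in the summaries for the young group.
rse_simple <- 0.4816; df_simple <- 374   # log(wage) ~ age
rse_full   <- 0.4226; df_full   <- 361   # log(wage) ~ all predictors

# Residual sum of squares = (residual standard error)^2 * residual df.
rss_simple <- rse_simple^2 * df_simple
rss_full   <- rse_full^2 * df_full

# Partial F statistic for the extra (df_simple - df_full) coefficients.
q <- df_simple - df_full
f_stat <- ((rss_simple - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```

The statistic comes out at roughly 9.6 on 13 and 361 degrees of freedom, so the additional predictors jointly improve on the age-only model by far more than chance would explain.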

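Why Use a Reduced Model?

The full models contain several predictors with large p-values (for example hrswork, nchild, and race in both groups), which add estimation noise without improving the fit much. Backward selection with stepAIC() removes one term at a time, keeping a removal only when it lowers the AIC, so the final model trades goodness of fit against the number of estimated coefficients.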

reduced_model_young <- stepAIC(full_model_young, direction = "backward")
## Start:  AIC=-633.1
## log(wage) ~ (age + educ + gender + hrswork + insure + metro + 
##     nchild + union + race + marital + region) - wage
## 
##           Df Sum of Sq    RSS     AIC
## - region   3    0.6450 65.104 -635.36
## - metro    1    0.0073 64.466 -635.06
## - race     2    0.3852 64.844 -634.86
## - nchild   1    0.1735 64.632 -634.09
## - marital  1    0.2529 64.712 -633.63
## - hrswork  1    0.3352 64.794 -633.15
## <none>                 64.459 -633.10
## - union    1    0.8507 65.310 -630.17
## - gender   1    2.9695 67.428 -618.16
## - insure   1    3.2085 67.667 -616.83
## - age      1    3.6070 68.066 -614.63
## - educ     1    8.6936 73.153 -587.53
## 
## Step:  AIC=-635.36
## log(wage) ~ age + educ + gender + hrswork + insure + metro + 
##     nchild + union + race + marital
## 
##           Df Sum of Sq    RSS     AIC
## - metro    1    0.0047 65.109 -637.33
## - race     2    0.3898 65.494 -637.11
## - marital  1    0.2128 65.317 -636.13
## - nchild   1    0.2301 65.334 -636.03
## - hrswork  1    0.2595 65.363 -635.86
## <none>                 65.104 -635.36
## - union    1    0.9538 66.058 -631.89
## - gender   1    2.9645 68.068 -620.61
## - insure   1    3.1964 68.300 -619.33
## - age      1    3.8440 68.948 -615.79
## - educ     1    8.9125 74.017 -589.11
## 
## Step:  AIC=-637.33
## log(wage) ~ age + educ + gender + hrswork + insure + nchild + 
##     union + race + marital
## 
##           Df Sum of Sq    RSS     AIC
## - race     2    0.3888 65.498 -639.09
## - marital  1    0.2092 65.318 -638.12
## - nchild   1    0.2470 65.356 -637.90
## - hrswork  1    0.2701 65.379 -637.77
## <none>                 65.109 -637.33
## - union    1    0.9591 66.068 -633.83
## - gender   1    2.9667 68.075 -622.57
## - insure   1    3.1916 68.300 -621.33
## - age      1    3.8903 68.999 -617.51
## - educ     1    9.0659 74.175 -590.31
## 
## Step:  AIC=-639.09
## log(wage) ~ age + educ + gender + hrswork + insure + nchild + 
##     union + marital
## 
##           Df Sum of Sq    RSS     AIC
## - marital  1    0.2076 65.705 -639.90
## - nchild   1    0.2704 65.768 -639.54
## - hrswork  1    0.2831 65.781 -639.47
## <none>                 65.498 -639.09
## - union    1    0.8531 66.351 -636.22
## - gender   1    3.2283 68.726 -623.00
## - insure   1    3.3402 68.838 -622.39
## - age      1    4.0068 69.504 -618.76
## - educ     1    9.7441 75.242 -588.94
## 
## Step:  AIC=-639.9
## log(wage) ~ age + educ + gender + hrswork + insure + nchild + 
##     union
## 
##           Df Sum of Sq    RSS     AIC
## - nchild   1    0.1604 65.866 -640.98
## - hrswork  1    0.2711 65.976 -640.35
## <none>                 65.705 -639.90
## - union    1    0.8237 66.529 -637.21
## - gender   1    3.3342 69.039 -623.29
## - insure   1    3.4490 69.154 -622.66
## - age      1    4.9450 70.650 -614.62
## - educ     1    9.8609 75.566 -589.32
## 
## Step:  AIC=-640.98
## log(wage) ~ age + educ + gender + hrswork + insure + union
## 
##           Df Sum of Sq    RSS     AIC
## - hrswork  1    0.2705 66.136 -641.44
## <none>                 65.866 -640.98
## - union    1    0.8129 66.678 -638.37
## - gender   1    3.3877 69.253 -624.12
## - insure   1    3.4395 69.305 -623.84
## - age      1    4.9657 70.831 -615.65
## - educ     1   10.9429 76.808 -585.19
## 
## Step:  AIC=-641.44
## log(wage) ~ age + educ + gender + insure + union
## 
##          Df Sum of Sq    RSS     AIC
## <none>                66.136 -641.44
## - union   1    0.8544 66.990 -638.62
## - gender  1    3.1282 69.264 -626.06
## - insure  1    3.1971 69.333 -625.69
## - age     1    4.7057 70.842 -617.60
## - educ    1   10.7339 76.870 -586.89
reduced_model_old <- stepAIC(full_model_old, direction = "backward")
## Start:  AIC=-1133.57
## log(wage) ~ (age + educ + gender + hrswork + insure + metro + 
##     nchild + union + race + marital + region) - wage
## 
##           Df Sum of Sq    RSS      AIC
## - age      1     0.006 189.70 -1135.54
## - hrswork  1     0.126 189.82 -1135.03
## - union    1     0.206 189.90 -1134.70
## - nchild   1     0.281 189.98 -1134.38
## - race     2     0.784 190.48 -1134.25
## <none>                 189.69 -1133.57
## - marital  1     0.700 190.40 -1132.60
## - region   3     1.686 191.38 -1132.45
## - metro    1     2.167 191.86 -1126.43
## - insure   1     5.266 194.96 -1113.53
## - gender   1     6.221 195.91 -1109.59
## - educ     1    40.392 230.09  -980.17
## 
## Step:  AIC=-1135.54
## log(wage) ~ educ + gender + hrswork + insure + metro + nchild + 
##     union + race + marital + region
## 
##           Df Sum of Sq    RSS      AIC
## - hrswork  1     0.127 189.83 -1137.01
## - union    1     0.206 189.91 -1136.67
## - race     2     0.779 190.48 -1136.25
## - nchild   1     0.351 190.05 -1136.06
## <none>                 189.70 -1135.54
## - marital  1     0.715 190.41 -1134.52
## - region   3     1.683 191.38 -1134.44
## - metro    1     2.168 191.87 -1128.40
## - insure   1     5.291 194.99 -1115.40
## - gender   1     6.216 195.92 -1111.59
## - educ     1    40.460 230.16  -981.91
## 
## Step:  AIC=-1137.01
## log(wage) ~ educ + gender + insure + metro + nchild + union + 
##     race + marital + region
## 
##           Df Sum of Sq    RSS      AIC
## - union    1     0.201 190.03 -1138.15
## - race     2     0.790 190.62 -1137.66
## - nchild   1     0.327 190.15 -1137.62
## <none>                 189.83 -1137.01
## - marital  1     0.702 190.53 -1136.03
## - region   3     1.676 191.50 -1135.93
## - metro    1     2.217 192.04 -1129.66
## - insure   1     5.494 195.32 -1116.04
## - gender   1     6.745 196.57 -1110.90
## - educ     1    41.435 231.26  -980.07
## 
## Step:  AIC=-1138.15
## log(wage) ~ educ + gender + insure + metro + nchild + race + 
##     marital + region
## 
##           Df Sum of Sq    RSS      AIC
## - nchild   1     0.313 190.34 -1138.83
## - race     2     0.802 190.83 -1138.77
## <none>                 190.03 -1138.15
## - marital  1     0.721 190.75 -1137.11
## - region   3     1.788 191.82 -1136.61
## - metro    1     2.300 192.33 -1130.47
## - insure   1     5.667 195.69 -1116.50
## - gender   1     6.711 196.74 -1112.22
## - educ     1    41.245 231.27  -982.03
## 
## Step:  AIC=-1138.83
## log(wage) ~ educ + gender + insure + metro + race + marital + 
##     region
## 
##           Df Sum of Sq    RSS      AIC
## - race     2     0.944 191.29 -1138.84
## <none>                 190.34 -1138.83
## - marital  1     0.676 191.02 -1137.97
## - region   3     1.812 192.15 -1137.20
## - metro    1     2.219 192.56 -1131.50
## - insure   1     5.501 195.84 -1117.89
## - gender   1     6.642 196.98 -1113.22
## - educ     1    41.223 231.56  -983.02
## 
## Step:  AIC=-1138.84
## log(wage) ~ educ + gender + insure + metro + marital + region
## 
##           Df Sum of Sq    RSS      AIC
## <none>                 191.29 -1138.84
## - marital  1     0.809 192.09 -1137.45
## - region   3     1.852 193.14 -1137.09
## - metro    1     2.141 193.43 -1131.88
## - insure   1     5.640 196.93 -1117.45
## - gender   1     6.981 198.27 -1111.99
## - educ     1    41.607 232.89  -982.41
summary(reduced_model_young)
## 
## Call:
## lm(formula = log(wage) ~ age + educ + gender + insure + union, 
##     data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38577 -0.26592 -0.02603  0.24556  1.27938 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.655058   0.155906  10.616  < 2e-16 ***
## age          0.028131   0.005483   5.131 4.67e-07 ***
## educ         0.128586   0.016593   7.749 9.00e-14 ***
## gender      -0.191847   0.045859  -4.183 3.59e-05 ***
## insure       0.218750   0.051724   4.229 2.96e-05 ***
## union        0.156707   0.071678   2.186   0.0294 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4228 on 370 degrees of freedom
## Multiple R-squared:  0.3216, Adjusted R-squared:  0.3125 
## F-statistic: 35.08 on 5 and 370 DF,  p-value: < 2.2e-16
summary(reduced_model_old)
## 
## Call:
## lm(formula = log(wage) ~ educ + gender + insure + metro + marital + 
##     region, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87787 -0.29972  0.02165  0.33262  1.29289 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.41277    0.07787  30.986  < 2e-16 ***
## educ             0.15494    0.01178  13.158  < 2e-16 ***
## gender          -0.18811    0.03490  -5.390 9.29e-08 ***
## insure           0.25329    0.05229   4.844 1.53e-06 ***
## metro            0.14017    0.04696   2.985  0.00292 ** 
## marital          0.05844    0.03185   1.835  0.06691 .  
## regionnortheast  0.05972    0.05313   1.124  0.26135    
## regionsouth      0.03501    0.04599   0.761  0.44672    
## regionwest       0.13263    0.04972   2.667  0.00780 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4902 on 796 degrees of freedom
## Multiple R-squared:  0.2699, Adjusted R-squared:  0.2626 
## F-statistic: 36.78 on 8 and 796 DF,  p-value: < 2.2e-16
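The AIC values in the stepwise traces can be reproduced by hand, which helps show what is being minimized. For linear models, `stepAIC()` relies on `extractAIC()`, which computes n × log(RSS / n) + 2 × edf, where edf is the number of estimated coefficients. The sketch below plugs in the RSS values printed in the young-group trace; the helper name `aic_lm` is introduced here for illustration.

```r
# AIC as reported in the stepAIC() trace for lm fits:
#   n * log(RSS / n) + 2 * edf, where edf counts estimated coefficients.
aic_lm <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n_young <- 376  # 361 residual df + 15 coefficients in the full model

# Full model: 15 coefficients, RSS from the "<none>" row of the first step.
aic_full <- aic_lm(rss = 64.459, n = n_young, edf = 15)

# Final model (age + educ + gender + insure + union): 6 coefficients.
aic_reduced <- aic_lm(rss = 66.136, n = n_young, edf = 6)

print(round(c(full = aic_full, reduced = aic_reduced), 2))
```

These match the trace values of -633.10 and -641.44: the reduced model has a slightly larger RSS, but the saving of nine coefficients more than compensates.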

Discussion

Role of Education and Gender

Education is the most influential determinant of wages in both groups, reinforcing the well-established link between human capital and earning potential (Eriksson and Stenius, 2022). The positive, statistically significant coefficients confirm that higher education levels correspond to higher wages. The gender coefficient is negative and of similar size in both models (-0.192 and -0.188), indicating a gender wage gap that persists after controlling for education, insurance, and the other retained predictors. This highlights ongoing disparities that may be related to occupational segregation, discrimination, or other structural issues.

Insurance and Union Membership

Insurance coverage is positively associated with wages in both age groups, possibly indicating that higher-paying jobs offer insurance benefits or that insurance status proxies for job quality (Lin et al., 2021). Union membership is retained, and statistically significant, only in the young model; in the full model for the old group its coefficient is small and not significant (0.045, p = 0.355). This suggests that unions may have a more pronounced impact on younger workers’ wages or that union presence varies by age group.

Metropolitan Residence, Marital Status, and Region

The older group model retains metropolitan residence, marital status, and geographic region, none of which survive selection in the young model, suggesting that location and social factors matter more for wages in this group. Metropolitan areas carry a wage premium, likely due to cost of living and economic opportunities (Dayioglu, Küçükbayrak and Tumen, 2022). Marital status shows a weak positive association (p = 0.067), consistent with literature suggesting that married individuals may have higher earnings, possibly due to stability or employer perceptions. Among the regions, only the West differs significantly from the Midwest baseline, suggesting that localized economic conditions affect older workers’ wages.

Model Limitations

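While these models provide valuable insights, they explain only about 26-31% of the variation in log wages, indicating that unmeasured factors (e.g., work experience, occupation, discrimination) contribute to wage determination (Kasilingam and Krishna, 2022). Hours worked, by contrast, is measured in the data but was dropped by the selection procedure in both groups. The models assume linear relationships and may miss nonlinearities or interactions between predictors. Stepwise selection also has a known drawback: the p-values reported for the retained predictors do not account for the selection step, so they are likely to be optimistic.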
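One way to probe the linearity assumption is to add a quadratic age term and an interaction, then compare the richer model with the linear one using a nested-model F test. The sketch below does this on simulated data only; the data-generating values, the column name `log_wage`, and the choice of an education-by-gender interaction are all assumptions made for illustration, not results from this study.

```r
set.seed(1)

# Simulated data for illustration only; not the survey data used in this report.
n <- 500
sim <- data.frame(age = sample(18:65, n, replace = TRUE),
                  educ = sample(0:5, n, replace = TRUE),
                  gender = rbinom(n, 1, 0.45))
sim$log_wage <- 1.5 + 0.06 * sim$age - 0.0006 * sim$age^2 +
  0.13 * sim$educ - 0.19 * sim$gender + rnorm(n, sd = 0.4)

# Baseline linear specification.
fit_linear <- lm(log_wage ~ age + educ + gender, data = sim)

# Adds a quadratic age term and an education-by-gender interaction.
fit_flexible <- lm(log_wage ~ age + I(age^2) + educ * gender, data = sim)

# Nested-model F test: do the extra terms improve the fit?
print(anova(fit_linear, fit_flexible))
```

A quadratic in age is a common way to let the age effect flatten out, which would be in line with the strong age slope for the young group and the near-zero slope for the old group reported above.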

Conclusion

The reduced linear regression models identify wage determinants that differ by age group. Education, gender, and insurance coverage emerge as key factors in both groups; age and union membership matter additionally for the young group, while metropolitan residence, marital status, and region play larger roles among older workers. These findings underscore the importance of education and equitable labor practices in addressing wage disparities. Policymakers should consider targeted interventions focusing on gender wage gaps and the role of insurance and unionization, especially for younger workers. Regional economic development and urban planning may also influence wage outcomes for older populations. Because the models are observational, these relationships are associations rather than causal effects. Further research incorporating additional variables and interaction effects could enhance model accuracy and deepen understanding of wage determinants across life stages.

References

Dayioglu, M., Küçükbayrak, M. and Tumen, S., 2022. The impact of age-specific minimum wages on youth employment and education: a regression discontinuity analysis. International Journal of Manpower, 43(6), pp.1352-1377.

Kasilingam, D. and Krishna, R., 2022. Understanding the adoption and willingness to pay for internet of things services. International Journal of Consumer Studies, 46(1), pp.102-131.

Autor, D., Dube, A. and McGrew, A., 2023. The unexpected compression: Competition at work in the low wage labor market (No. w31010). National Bureau of Economic Research.

Eriksson, N. and Stenius, M., 2022. Online grocery shoppers due to the Covid-19 pandemic-An analysis of demographic and household characteristics. Procedia Computer Science, 196, pp.93-100.

Lin, Y., Zheng, Y., Wang, H.L. and Wu, J., 2021. Global patterns and trends in gastric cancer incidence rates (1988–2012) and predictions to 2030. Gastroenterology, 161(1), pp.116-127.