Understanding the factors affecting wages is vital for policymakers, employers, and economists. Wage determination is influenced by numerous demographic, socioeconomic, and employment-related variables. This study applies multiple linear regression to analyze how variables such as education, age, gender, insurance status, union membership, metropolitan residence, marital status, and region affect wages for younger and older workers separately. Two distinct models were estimated after applying backward stepwise selection to a full set of explanatory variables, resulting in reduced models optimized by minimizing the Akaike Information Criterion (AIC). The models were estimated using ordinary least squares regression on data subsets segmented by age groups: “young” and “old.”
library(psych)
## Warning: package 'psych' was built under R version 4.3.3
library(MASS)
data <- read.csv("D:/MA334-SP-7_2412507 (1).csv")
The reduced model for the young cohort includes five predictor variables: Age: Continuous variable representing the worker’s age. Education (educ): Years of education completed. Gender: A binary indicator variable (likely coded 0/1). Insurance (insure): Indicates whether the worker has insurance. Union membership (union): Indicates whether the worker belongs to a labor union.
The reduced model for the older cohort contains eight predictors: Education (educ) Gender Insurance (insure) Metropolitan residence (metro) Marital status (marital) Region (categorical variable with four levels: Northeast, South, West, and presumably the omitted baseline category)
str(data)
## 'data.frame': 1181 obs. of 12 variables:
## $ age : int 29 45 39 30 42 47 62 57 21 69 ...
## $ educ : int 4 3 2 3 3 3 2 2 1 0 ...
## $ gender : int 1 1 1 0 0 1 1 0 0 1 ...
## $ hrswork: int 40 45 40 45 60 45 40 48 40 40 ...
## $ insure : int 1 1 1 1 1 1 1 1 1 0 ...
## $ metro : int 1 1 1 1 0 1 1 1 1 1 ...
## $ nchild : int 2 3 1 0 3 0 1 0 0 0 ...
## $ union : int 0 0 0 0 1 0 0 1 0 0 ...
## $ wage : num 25.9 14.4 17.2 17.1 18.3 ...
## $ race : chr "White" "White" "White" "White" ...
## $ marital: int 1 2 1 0 1 1 1 1 0 2 ...
## $ region : chr "south" "south" "midwest" "northeast" ...
summary(data)
## age educ gender hrswork
## Min. :17.00 Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:32.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:40.00
## Median :43.00 Median :2.000 Median :0.000 Median :40.00
## Mean :42.61 Mean :1.751 Mean :0.442 Mean :41.61
## 3rd Qu.:52.00 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:42.00
## Max. :77.00 Max. :5.000 Max. :1.000 Max. :80.00
## insure metro nchild union
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.8256 Mean :0.8239 Mean :0.8061 Mean :0.1372
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :9.0000 Max. :1.0000
## wage race marital region
## Min. : 2.50 Length:1181 Min. :0.0000 Length:1181
## 1st Qu.:13.00 Class :character 1st Qu.:0.0000 Class :character
## Median :18.75 Mode :character Median :1.0000 Mode :character
## Mean :22.77 Mean :0.8476
## 3rd Qu.:28.84 3rd Qu.:1.0000
## Max. :99.00 Max. :2.0000
nrow(data)
## [1] 1181
ncol(data)
## [1] 12
describe(data)
hist(data$wage, main="Wage Distribution", xlab="Wage")
hist(data$age, main="Age Distribution", xlab="Age")
boxplot(wage ~ gender, data=data, main="Wage by Gender", names=c("Female", "Male"))
num_vars <- data[sapply(data, is.numeric)]
cor_matrix <- cor(num_vars)
heatmap(cor_matrix, main="Correlation Matrix")
cor(data$age, data$wage)
## [1] 0.2119489
cor(data$hrswork, data$wage)
## [1] 0.09091083
The model explains approximately 31.3% of the variance in log wages , indicating a moderate explanatory power. Age: The coefficient estimate is 0.028 (p < 0.001), meaning each additional year of age is associated with approximately a 2.8% increase in wage, holding other variables constant. Education: The strongest predictor with a coefficient of 0.129 (p < 0.001). Each additional year of education is associated with a 12.9% increase in wage (Dayioglu, Küçükbayrak and Tumen, 2022). Gender: The coefficient is -0.192 (p < 0.001), indicating that, controlling for other factors, one gender group (likely female if coded 1) earns about 19.2% less than the other. Insurance: Workers with insurance earn approximately 21.9% more (coefficient 0.219, p < 0.001). Union Membership: Union members earn around 15.7% more (coefficient 0.157, p = 0.029). All predictors are statistically significant at the 5% level, underscoring their relevance in explaining wage variation among younger workers.
This model accounts for about 26.3% of wage variation, slightly lower than the young group model but still substantial (Kasilingam and Krishna, 2022). Education: The strongest predictor again with a coefficient of 0.155 (p < 0.001), showing a 15.5% wage increase per additional year of schooling. Gender: Negative coefficient -0.188 (p < 0.001), consistent with the young group, confirming wage disparities by gender. Insurance: Positive effect (0.253, p < 0.001), suggesting insured workers earn 25.3% more. Metropolitan Residence: Positive and significant (0.140, p = 0.003), indicating living in metro areas is associated with higher wages, about 14%. Marital Status: Marginally significant positive effect (0.058, p = 0.067), implying married workers might earn slightly more. Region: The West region shows a positive significant effect (0.133, p = 0.008). Other regions show no significant effect relative to the baseline. Most predictors are statistically significant, except marital status (p slightly above 0.05) and some regions, indicating regional wage differences are less pronounced except for the West.
p_no_insure <- sum(data$insure == 0) / nrow(data)
prob_at_least_one_no_insure <- 1 - (1 - p_no_insure)^5
married_data <- data[data$marital == 1 | data$marital == 2, ]
p_nchild_given_married <- sum(married_data$nchild >= 1) / nrow(married_data)
nchild_table <- table(data$nchild)
nchild_probs <- prop.table(nchild_table)
nchild_df <- data.frame(nchild = as.numeric(names(nchild_probs)), prob = as.vector(nchild_probs))
mean_nchild <- sum(nchild_df$nchild * nchild_df$prob)
var_nchild <- sum((nchild_df$nchild - mean_nchild)^2 * nchild_df$prob)
p_nchild_3_or_more <- sum(nchild_df$prob[nchild_df$nchild >= 3])
wage_2child <- data$wage[data$nchild == 2]
mean(wage_2child)
## [1] 23.43355
t.test(wage_2child, conf.level = 0.95)
##
## One Sample t-test
##
## data: wage_2child
## t = 24.938, df = 216, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.58146 25.28563
## sample estimates:
## mean of x
## 23.43355
subset_5plus <- data[data$nchild >= 5, ]
nrow(subset_5plus)
## [1] 9
table_gender_insure <- table(data$gender, data$insure)
chisq.test(table_gender_insure)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_gender_insure
## X-squared = 0.0574, df = 1, p-value = 0.8107
Both models have residual standard errors below 0.5, indicating reasonable model fit. The young group model’s residual standard error is 0.423. The old group model’s residual standard error is slightly higher at 0.490. The F-statistics for both models are highly significant (p < 0.001), confirming that the sets of predictors jointly explain a significant portion of wage variance in their respective samples. The young group model explains slightly more variance than the old group model (Autor, Dube and McGrew, 2023), potentially reflecting that age and union membership variables, included only for the young group, add predictive power.
young <- data[data$age < 35, ]
old <- data[data$age >= 35, ]
model_young <- lm(log(wage) ~ age, data=young)
model_old <- lm(log(wage) ~ age, data=old)
summary(model_young)
##
## Call:
## lm(formula = log(wage) ~ age, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63005 -0.32110 -0.01201 0.31821 1.49042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594555 0.173214 9.206 < 2e-16 ***
## age 0.041382 0.006074 6.813 3.85e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.108
## F-statistic: 46.41 on 1 and 374 DF, p-value: 3.846e-11
summary(model_old)
##
## Call:
## lm(formula = log(wage) ~ age, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91172 -0.39124 -0.04711 0.39679 1.54456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0795566 0.1157775 26.599 <2e-16 ***
## age -0.0005273 0.0023115 -0.228 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared: 6.479e-05, Adjusted R-squared: -0.00118
## F-statistic: 0.05203 on 1 and 803 DF, p-value: 0.8196
plot(young$age, log(young$wage), main="Young: log(wage) ~ age")
abline(model_young, col="blue")
plot(old$age, log(old$wage), main="Old: log(wage) ~ age")
abline(model_old, col="red")
data$gender <- as.factor(data$gender)
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)
data$marital <- as.factor(data$marital)
full_model_young <- lm(log(wage) ~ . -wage, data=young)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
full_model_old <- lm(log(wage) ~ . -wage, data=old)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
summary(full_model_young)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36303 -0.26382 -0.01698 0.25524 1.30213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.834191 0.209063 8.773 < 2e-16 ***
## age 0.028458 0.006332 4.495 9.40e-06 ***
## educ 0.121518 0.017415 6.978 1.44e-11 ***
## gender -0.194123 0.047602 -4.078 5.59e-05 ***
## hrswork -0.003245 0.002368 -1.370 0.1715
## insure 0.224896 0.053054 4.239 2.85e-05 ***
## metro 0.011774 0.058192 0.202 0.8398
## nchild -0.025153 0.025517 -0.986 0.3249
## union 0.159936 0.073275 2.183 0.0297 *
## raceBlack -0.172978 0.118896 -1.455 0.1466
## raceWhite -0.102353 0.089136 -1.148 0.2516
## marital 0.051933 0.043641 1.190 0.2348
## regionnortheast 0.116789 0.067034 1.742 0.0823 .
## regionsouth 0.010973 0.058890 0.186 0.8523
## regionwest 0.048742 0.065094 0.749 0.4545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4226 on 361 degrees of freedom
## Multiple R-squared: 0.3388, Adjusted R-squared: 0.3132
## F-statistic: 13.21 on 14 and 361 DF, p-value: < 2.2e-16
summary(full_model_old)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85888 -0.30451 0.02666 0.32575 1.31774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2694898 0.1799543 12.611 < 2e-16 ***
## age 0.0003294 0.0021291 0.155 0.87711
## educ 0.1551089 0.0119593 12.970 < 2e-16 ***
## gender -0.1811629 0.0355925 -5.090 4.48e-07 ***
## hrswork 0.0015615 0.0021518 0.726 0.46824
## insure 0.2475608 0.0528619 4.683 3.32e-06 ***
## metro 0.1417880 0.0471982 3.004 0.00275 **
## nchild -0.0177843 0.0164374 -1.082 0.27961
## union 0.0452883 0.0489084 0.926 0.35474
## raceBlack -0.0106162 0.1013315 -0.105 0.91659
## raceWhite 0.0849661 0.0832381 1.021 0.30768
## marital 0.0548061 0.0320963 1.708 0.08811 .
## regionnortheast 0.0536894 0.0533367 1.007 0.31443
## regionsouth 0.0456868 0.0466322 0.980 0.32752
## regionwest 0.1326383 0.0506384 2.619 0.00898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 790 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2632
## F-statistic: 21.51 on 14 and 790 DF, p-value: < 2.2e-16
summary(model_young)
##
## Call:
## lm(formula = log(wage) ~ age, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63005 -0.32110 -0.01201 0.31821 1.49042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594555 0.173214 9.206 < 2e-16 ***
## age 0.041382 0.006074 6.813 3.85e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.108
## F-statistic: 46.41 on 1 and 374 DF, p-value: 3.846e-11
summary(model_old)
##
## Call:
## lm(formula = log(wage) ~ age, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91172 -0.39124 -0.04711 0.39679 1.54456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0795566 0.1157775 26.599 <2e-16 ***
## age -0.0005273 0.0023115 -0.228 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared: 6.479e-05, Adjusted R-squared: -0.00118
## F-statistic: 0.05203 on 1 and 803 DF, p-value: 0.8196
summary(full_model_young)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36303 -0.26382 -0.01698 0.25524 1.30213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.834191 0.209063 8.773 < 2e-16 ***
## age 0.028458 0.006332 4.495 9.40e-06 ***
## educ 0.121518 0.017415 6.978 1.44e-11 ***
## gender -0.194123 0.047602 -4.078 5.59e-05 ***
## hrswork -0.003245 0.002368 -1.370 0.1715
## insure 0.224896 0.053054 4.239 2.85e-05 ***
## metro 0.011774 0.058192 0.202 0.8398
## nchild -0.025153 0.025517 -0.986 0.3249
## union 0.159936 0.073275 2.183 0.0297 *
## raceBlack -0.172978 0.118896 -1.455 0.1466
## raceWhite -0.102353 0.089136 -1.148 0.2516
## marital 0.051933 0.043641 1.190 0.2348
## regionnortheast 0.116789 0.067034 1.742 0.0823 .
## regionsouth 0.010973 0.058890 0.186 0.8523
## regionwest 0.048742 0.065094 0.749 0.4545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4226 on 361 degrees of freedom
## Multiple R-squared: 0.3388, Adjusted R-squared: 0.3132
## F-statistic: 13.21 on 14 and 361 DF, p-value: < 2.2e-16
summary(full_model_old)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85888 -0.30451 0.02666 0.32575 1.31774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2694898 0.1799543 12.611 < 2e-16 ***
## age 0.0003294 0.0021291 0.155 0.87711
## educ 0.1551089 0.0119593 12.970 < 2e-16 ***
## gender -0.1811629 0.0355925 -5.090 4.48e-07 ***
## hrswork 0.0015615 0.0021518 0.726 0.46824
## insure 0.2475608 0.0528619 4.683 3.32e-06 ***
## metro 0.1417880 0.0471982 3.004 0.00275 **
## nchild -0.0177843 0.0164374 -1.082 0.27961
## union 0.0452883 0.0489084 0.926 0.35474
## raceBlack -0.0106162 0.1013315 -0.105 0.91659
## raceWhite 0.0849661 0.0832381 1.021 0.30768
## marital 0.0548061 0.0320963 1.708 0.08811 .
## regionnortheast 0.0536894 0.0533367 1.007 0.31443
## regionsouth 0.0456868 0.0466322 0.980 0.32752
## regionwest 0.1326383 0.0506384 2.619 0.00898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 790 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2632
## F-statistic: 21.51 on 14 and 790 DF, p-value: < 2.2e-16
reduced_model_young <- stepAIC(full_model_young, direction = "backward")
## Start: AIC=-633.1
## log(wage) ~ (age + educ + gender + hrswork + insure + metro +
## nchild + union + race + marital + region) - wage
##
## Df Sum of Sq RSS AIC
## - region 3 0.6450 65.104 -635.36
## - metro 1 0.0073 64.466 -635.06
## - race 2 0.3852 64.844 -634.86
## - nchild 1 0.1735 64.632 -634.09
## - marital 1 0.2529 64.712 -633.63
## - hrswork 1 0.3352 64.794 -633.15
## <none> 64.459 -633.10
## - union 1 0.8507 65.310 -630.17
## - gender 1 2.9695 67.428 -618.16
## - insure 1 3.2085 67.667 -616.83
## - age 1 3.6070 68.066 -614.63
## - educ 1 8.6936 73.153 -587.53
##
## Step: AIC=-635.36
## log(wage) ~ age + educ + gender + hrswork + insure + metro +
## nchild + union + race + marital
##
## Df Sum of Sq RSS AIC
## - metro 1 0.0047 65.109 -637.33
## - race 2 0.3898 65.494 -637.11
## - marital 1 0.2128 65.317 -636.13
## - nchild 1 0.2301 65.334 -636.03
## - hrswork 1 0.2595 65.363 -635.86
## <none> 65.104 -635.36
## - union 1 0.9538 66.058 -631.89
## - gender 1 2.9645 68.068 -620.61
## - insure 1 3.1964 68.300 -619.33
## - age 1 3.8440 68.948 -615.79
## - educ 1 8.9125 74.017 -589.11
##
## Step: AIC=-637.33
## log(wage) ~ age + educ + gender + hrswork + insure + nchild +
## union + race + marital
##
## Df Sum of Sq RSS AIC
## - race 2 0.3888 65.498 -639.09
## - marital 1 0.2092 65.318 -638.12
## - nchild 1 0.2470 65.356 -637.90
## - hrswork 1 0.2701 65.379 -637.77
## <none> 65.109 -637.33
## - union 1 0.9591 66.068 -633.83
## - gender 1 2.9667 68.075 -622.57
## - insure 1 3.1916 68.300 -621.33
## - age 1 3.8903 68.999 -617.51
## - educ 1 9.0659 74.175 -590.31
##
## Step: AIC=-639.09
## log(wage) ~ age + educ + gender + hrswork + insure + nchild +
## union + marital
##
## Df Sum of Sq RSS AIC
## - marital 1 0.2076 65.705 -639.90
## - nchild 1 0.2704 65.768 -639.54
## - hrswork 1 0.2831 65.781 -639.47
## <none> 65.498 -639.09
## - union 1 0.8531 66.351 -636.22
## - gender 1 3.2283 68.726 -623.00
## - insure 1 3.3402 68.838 -622.39
## - age 1 4.0068 69.504 -618.76
## - educ 1 9.7441 75.242 -588.94
##
## Step: AIC=-639.9
## log(wage) ~ age + educ + gender + hrswork + insure + nchild +
## union
##
## Df Sum of Sq RSS AIC
## - nchild 1 0.1604 65.866 -640.98
## - hrswork 1 0.2711 65.976 -640.35
## <none> 65.705 -639.90
## - union 1 0.8237 66.529 -637.21
## - gender 1 3.3342 69.039 -623.29
## - insure 1 3.4490 69.154 -622.66
## - age 1 4.9450 70.650 -614.62
## - educ 1 9.8609 75.566 -589.32
##
## Step: AIC=-640.98
## log(wage) ~ age + educ + gender + hrswork + insure + union
##
## Df Sum of Sq RSS AIC
## - hrswork 1 0.2705 66.136 -641.44
## <none> 65.866 -640.98
## - union 1 0.8129 66.678 -638.37
## - gender 1 3.3877 69.253 -624.12
## - insure 1 3.4395 69.305 -623.84
## - age 1 4.9657 70.831 -615.65
## - educ 1 10.9429 76.808 -585.19
##
## Step: AIC=-641.44
## log(wage) ~ age + educ + gender + insure + union
##
## Df Sum of Sq RSS AIC
## <none> 66.136 -641.44
## - union 1 0.8544 66.990 -638.62
## - gender 1 3.1282 69.264 -626.06
## - insure 1 3.1971 69.333 -625.69
## - age 1 4.7057 70.842 -617.60
## - educ 1 10.7339 76.870 -586.89
reduced_model_old <- stepAIC(full_model_old, direction = "backward")
## Start: AIC=-1133.57
## log(wage) ~ (age + educ + gender + hrswork + insure + metro +
## nchild + union + race + marital + region) - wage
##
## Df Sum of Sq RSS AIC
## - age 1 0.006 189.70 -1135.54
## - hrswork 1 0.126 189.82 -1135.03
## - union 1 0.206 189.90 -1134.70
## - nchild 1 0.281 189.98 -1134.38
## - race 2 0.784 190.48 -1134.25
## <none> 189.69 -1133.57
## - marital 1 0.700 190.40 -1132.60
## - region 3 1.686 191.38 -1132.45
## - metro 1 2.167 191.86 -1126.43
## - insure 1 5.266 194.96 -1113.53
## - gender 1 6.221 195.91 -1109.59
## - educ 1 40.392 230.09 -980.17
##
## Step: AIC=-1135.54
## log(wage) ~ educ + gender + hrswork + insure + metro + nchild +
## union + race + marital + region
##
## Df Sum of Sq RSS AIC
## - hrswork 1 0.127 189.83 -1137.01
## - union 1 0.206 189.91 -1136.67
## - race 2 0.779 190.48 -1136.25
## - nchild 1 0.351 190.05 -1136.06
## <none> 189.70 -1135.54
## - marital 1 0.715 190.41 -1134.52
## - region 3 1.683 191.38 -1134.44
## - metro 1 2.168 191.87 -1128.40
## - insure 1 5.291 194.99 -1115.40
## - gender 1 6.216 195.92 -1111.59
## - educ 1 40.460 230.16 -981.91
##
## Step: AIC=-1137.01
## log(wage) ~ educ + gender + insure + metro + nchild + union +
## race + marital + region
##
## Df Sum of Sq RSS AIC
## - union 1 0.201 190.03 -1138.15
## - race 2 0.790 190.62 -1137.66
## - nchild 1 0.327 190.15 -1137.62
## <none> 189.83 -1137.01
## - marital 1 0.702 190.53 -1136.03
## - region 3 1.676 191.50 -1135.93
## - metro 1 2.217 192.04 -1129.66
## - insure 1 5.494 195.32 -1116.04
## - gender 1 6.745 196.57 -1110.90
## - educ 1 41.435 231.26 -980.07
##
## Step: AIC=-1138.15
## log(wage) ~ educ + gender + insure + metro + nchild + race +
## marital + region
##
## Df Sum of Sq RSS AIC
## - nchild 1 0.313 190.34 -1138.83
## - race 2 0.802 190.83 -1138.77
## <none> 190.03 -1138.15
## - marital 1 0.721 190.75 -1137.11
## - region 3 1.788 191.82 -1136.61
## - metro 1 2.300 192.33 -1130.47
## - insure 1 5.667 195.69 -1116.50
## - gender 1 6.711 196.74 -1112.22
## - educ 1 41.245 231.27 -982.03
##
## Step: AIC=-1138.83
## log(wage) ~ educ + gender + insure + metro + race + marital +
## region
##
## Df Sum of Sq RSS AIC
## - race 2 0.944 191.29 -1138.84
## <none> 190.34 -1138.83
## - marital 1 0.676 191.02 -1137.97
## - region 3 1.812 192.15 -1137.20
## - metro 1 2.219 192.56 -1131.50
## - insure 1 5.501 195.84 -1117.89
## - gender 1 6.642 196.98 -1113.22
## - educ 1 41.223 231.56 -983.02
##
## Step: AIC=-1138.84
## log(wage) ~ educ + gender + insure + metro + marital + region
##
## Df Sum of Sq RSS AIC
## <none> 191.29 -1138.84
## - marital 1 0.809 192.09 -1137.45
## - region 3 1.852 193.14 -1137.09
## - metro 1 2.141 193.43 -1131.88
## - insure 1 5.640 196.93 -1117.45
## - gender 1 6.981 198.27 -1111.99
## - educ 1 41.607 232.89 -982.41
summary(reduced_model_young)
##
## Call:
## lm(formula = log(wage) ~ age + educ + gender + insure + union,
## data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38577 -0.26592 -0.02603 0.24556 1.27938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.655058 0.155906 10.616 < 2e-16 ***
## age 0.028131 0.005483 5.131 4.67e-07 ***
## educ 0.128586 0.016593 7.749 9.00e-14 ***
## gender -0.191847 0.045859 -4.183 3.59e-05 ***
## insure 0.218750 0.051724 4.229 2.96e-05 ***
## union 0.156707 0.071678 2.186 0.0294 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4228 on 370 degrees of freedom
## Multiple R-squared: 0.3216, Adjusted R-squared: 0.3125
## F-statistic: 35.08 on 5 and 370 DF, p-value: < 2.2e-16
summary(reduced_model_old)
##
## Call:
## lm(formula = log(wage) ~ educ + gender + insure + metro + marital +
## region, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87787 -0.29972 0.02165 0.33262 1.29289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.41277 0.07787 30.986 < 2e-16 ***
## educ 0.15494 0.01178 13.158 < 2e-16 ***
## gender -0.18811 0.03490 -5.390 9.29e-08 ***
## insure 0.25329 0.05229 4.844 1.53e-06 ***
## metro 0.14017 0.04696 2.985 0.00292 **
## marital 0.05844 0.03185 1.835 0.06691 .
## regionnortheast 0.05972 0.05313 1.124 0.26135
## regionsouth 0.03501 0.04599 0.761 0.44672
## regionwest 0.13263 0.04972 2.667 0.00780 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4902 on 796 degrees of freedom
## Multiple R-squared: 0.2699, Adjusted R-squared: 0.2626
## F-statistic: 36.78 on 8 and 796 DF, p-value: < 2.2e-16
Education is the most influential determinant of wages in both groups, reinforcing the well-established link between human capital and earning potential(Eriksson and Stenius, 2022). The positive, statistically significant coefficients confirm that additional education corresponds to higher wages. Gender consistently shows a negative coefficient in both models, reflecting gender wage gaps that persist after controlling for education, age, insurance, and other factors. This highlights ongoing disparities that may be related to occupational segregation, discrimination, or other structural issues.
Insurance coverage positively correlates with wages for both age groups, possibly indicating that higher-paying jobs offer insurance benefits or that insurance status proxies for job quality (Lin et al., 2021). Union membership significantly affects wages only in the young group, suggesting that unions may have a more pronounced impact on younger workers’ wages or that union presence varies by age group.
The older group model incorporates metropolitan residence, marital status, and geographic region, reflecting that location and social factors affect wages more in this group. Metropolitan areas offer wage premiums, likely due to cost of living and economic opportunities (Dayioglu, Küçükbayrak and Tumen, 2022). Marital status shows a weak positive association, consistent with literature suggesting that married individuals may have higher earnings, possibly due to stability or employer perceptions. Regional wage differences are significant for the West but less so for other regions, suggesting localized economic conditions impact older workers’ wages.
While these models provide valuable insights, they explain only about 26-31% of wage variation, indicating other unmeasured factors (e.g., work experience, occupation, hours worked, discrimination) contribute to wage determination (Kasilingam and Krishna, 2022). The models assume linear relationships and may miss nonlinearities or interactions between predictors.
The reduced linear regression models identify critical wage determinants differentiated by age group. Education and gender consistently emerge as key factors, with insurance and union membership also important in the young group, and metropolitan residence, marital status, and region playing larger roles among older workers. These findings underscore the importance of education and equitable labor practices to address wage disparities. Policymakers should consider targeted interventions focusing on gender wage gaps and the role of insurance and unionization, especially for younger workers. Regional economic development and urban planning may also influence wage outcomes for older populations. Further research incorporating additional variables and interaction effects could enhance model accuracy and deepen understanding of wage determinants across life stages.
Dayioglu, M., Küçükbayrak, M. and Tumen, S., 2022. The impact of age-specific minimum wages on youth employment and education: a regression discontinuity analysis. International Journal of Manpower, 43(6), pp.1352-1377.
Kasilingam, D. and Krishna, R., 2022. Understanding the adoption and willingness to pay for internet of things services. International Journal of Consumer Studies, 46(1), pp.102-131.
Autor, D., Dube, A. and McGrew, A., 2023. The unexpected compression: Competition at work in the low wage labor market (No. w31010). National Bureau of Economic Research.
Eriksson, N. and Stenius, M., 2022. Online grocery shoppers due to the Covid-19 pandemic-An analysis of demographic and household characteristics. Procedia Computer Science, 196, pp.93-100.
Lin, Y., Zheng, Y., Wang, H.L. and Wu, J., 2021. Global patterns and trends in gastric cancer incidence rates (1988–2012) and predictions to 2030. Gastroenterology, 161(1), pp.116-127.