library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
df <- read.csv('D:/MA334-SP-7_2412507.csv')
str(df)
## 'data.frame': 1181 obs. of 12 variables:
## $ age : int 29 45 39 30 42 47 62 57 21 69 ...
## $ educ : int 4 3 2 3 3 3 2 2 1 0 ...
## $ gender : int 1 1 1 0 0 1 1 0 0 1 ...
## $ hrswork: int 40 45 40 45 60 45 40 48 40 40 ...
## $ insure : int 1 1 1 1 1 1 1 1 1 0 ...
## $ metro : int 1 1 1 1 0 1 1 1 1 1 ...
## $ nchild : int 2 3 1 0 3 0 1 0 0 0 ...
## $ union : int 0 0 0 0 1 0 0 1 0 0 ...
## $ wage : num 25.9 14.4 17.2 17.1 18.3 ...
## $ race : chr "White" "White" "White" "White" ...
## $ marital: int 1 2 1 0 1 1 1 1 0 2 ...
## $ region : chr "south" "south" "midwest" "northeast" ...
summary(df)
## age educ gender hrswork
## Min. :17.00 Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:32.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:40.00
## Median :43.00 Median :2.000 Median :0.000 Median :40.00
## Mean :42.61 Mean :1.751 Mean :0.442 Mean :41.61
## 3rd Qu.:52.00 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:42.00
## Max. :77.00 Max. :5.000 Max. :1.000 Max. :80.00
## insure metro nchild union
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.8256 Mean :0.8239 Mean :0.8061 Mean :0.1372
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :9.0000 Max. :1.0000
## wage race marital region
## Min. : 2.50 Length:1181 Min. :0.0000 Length:1181
## 1st Qu.:13.00 Class :character 1st Qu.:0.0000 Class :character
## Median :18.75 Mode :character Median :1.0000 Mode :character
## Mean :22.77 Mean :0.8476
## 3rd Qu.:28.84 3rd Qu.:1.0000
## Max. :99.00 Max. :2.0000
nrow(df)
## [1] 1181
ncol(df)
## [1] 12
The dataset contains demographic, socioeconomic, and wage-related variables for 1,181 individuals. Numerical variables include age, wage, hours worked, and number of children; categorical variables include gender, insurance status, union membership, race, marital status, and geographic region. An initial inspection showed that wages are right-skewed, so wages are log-transformed (log(wage)) in the regression analysis to reduce skewness and stabilize the variance.
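The effect of the log transformation on skewness can be illustrated with a minimal sketch. It uses simulated log-normal wages rather than the report's data, and the `skewness()` helper is a hypothetical moment-based function defined here because base R does not provide one.

```r
# Illustrative only: simulated right-skewed wages, not the report's data.
set.seed(1)
sim_wage <- rlnorm(1000, meanlog = 3, sdlog = 0.6)

# Moment-based sample skewness (hypothetical helper; base R has none).
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(sim_wage)       # clearly positive for skewed data
skew_log <- skewness(log(sim_wage))  # close to zero after the transform

c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transform is consistent with a roughly symmetric distribution, although it does not by itself establish normality.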
summary(select(df, age, hrswork, nchild, wage))
## age hrswork nchild wage
## Min. :17.00 Min. : 0.00 Min. :0.0000 Min. : 2.50
## 1st Qu.:32.00 1st Qu.:40.00 1st Qu.:0.0000 1st Qu.:13.00
## Median :43.00 Median :40.00 Median :0.0000 Median :18.75
## Mean :42.61 Mean :41.61 Mean :0.8061 Mean :22.77
## 3rd Qu.:52.00 3rd Qu.:42.00 3rd Qu.:2.0000 3rd Qu.:28.84
## Max. :77.00 Max. :80.00 Max. :9.0000 Max. :99.00
The distribution of the number of children per individual showed that most respondents have zero to two children, with a mean of approximately 0.81 and a variance of 1.21, indicating moderate dispersion. Larger families (three or more children) were less common, representing about 7.54% of the sample. The gender split was reasonably balanced, with 659 respondents coded 0 and 522 coded 1 (Williams and Gashi, 2022).
table(df$gender)
##
## 0 1
## 659 522
table(df$race)
##
## Asian Black White
## 65 104 1012
table(df$marital)
##
## 0 1 2
## 324 713 144
table(df$region)
##
## midwest northeast south west
## 309 215 385 272
Initial cross-tabulations between gender and insurance status suggested that the share of insured individuals was similar in the two gender groups. This preliminary finding informed the choice of variables for the subsequent statistical tests and regression models.
ggplot(df, aes(x = wage)) + geom_histogram(binwidth = 2, fill = "lightblue", color = "black") + labs(title = "Histogram of Wages")
ggplot(df, aes(x = factor(gender), y = wage)) + geom_boxplot() + labs(title = "Wage by Gender", x = "Gender", y = "Wage")
numeric_df <- select(df, age, hrswork, nchild, wage)
cor(numeric_df)
## age hrswork nchild wage
## age 1.00000000 0.05585503 -0.05046348 0.21194887
## hrswork 0.05585503 1.00000000 0.06866293 0.09091083
## nchild -0.05046348 0.06866293 1.00000000 0.01655582
## wage 0.21194887 0.09091083 0.01655582 1.00000000
p_insured <- mean(df$insure == 1)
p_at_least_one_uninsured <- 1 - (p_insured)^5
p_at_least_one_uninsured
## [1] 0.6164927
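The calculation above applies the complement rule to five individuals. As a sketch, and assuming the five are drawn independently, the same value can be reproduced from the counts in this report's gender-by-insurance cross-tabulation (975 insured out of 1181) and checked against the binomial distribution.

```r
# Counts taken from the gender-by-insurance table in this report:
# 542 + 433 = 975 insured out of 1181 respondents.
p_insured <- 975 / 1181

# Complement rule, assuming five independent draws:
# P(at least one uninsured) = 1 - P(all five insured)
p_complement <- 1 - p_insured^5

# Same quantity via the binomial distribution:
# 1 - P(X = 5) with X ~ Binomial(5, p_insured)
p_binomial <- 1 - dbinom(5, size = 5, prob = p_insured)

c(complement = p_complement, binomial = p_binomial)
```

Both routes give about 0.616, matching the output above. The independence assumption is only an approximation if the five individuals come from the same household or employer.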
The probability of having three or more children was calculated to be 7.54%. This low proportion highlights that large family sizes are uncommon in this sample and may warrant specific attention when examining income effects related to family burden.
## P(nchild >= 1 | married)
married <- df[df$marital == 1, ]
p_child_given_married <- mean(married$nchild >= 1)
p_child_given_married
## [1] 0.6002805
nchild_table <- table(df$nchild)
nchild_probs <- prop.table(nchild_table)
nchild_probs
##
## 0 1 2 3 4 5
## 0.5605419136 0.1803556308 0.1837425910 0.0550381033 0.0127011008 0.0059271804
## 6 9
## 0.0008467401 0.0008467401
Visual and statistical examination suggested that wages follow a right-skewed distribution. Applying the natural logarithm produced a more symmetric distribution, which is better suited to the approximately normal errors assumed in the regression analyses.
### Mean and variance
nchild_vals <- as.numeric(names(nchild_probs))
nchild_mean <- sum(nchild_vals * nchild_probs)
nchild_var <- sum((nchild_vals - nchild_mean)^2 * nchild_probs)
nchild_mean
## [1] 0.8060965
nchild_var
## [1] 1.211343
p_nchild_3_or_more <- sum(nchild_probs[nchild_vals >= 3])
p_nchild_3_or_more
## [1] 0.07535986
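The variance computed above from the empirical probabilities divides by n, whereas R's `var()` divides by n - 1. The sketch below uses a small made-up vector, not the report's data, to show that the two differ only by the factor (n - 1) / n.

```r
# Toy counts of children per person (illustrative, not the report's data).
x <- c(0, 0, 0, 1, 1, 2, 2, 3)
n <- length(x)

# Empirical probability mass function, as in the report's prop.table() step.
probs <- prop.table(table(x))
vals <- as.numeric(names(probs))

pmf_mean <- sum(vals * probs)
pmf_var <- sum((vals - pmf_mean)^2 * probs)  # divides by n

# var() divides by n - 1, so the two differ by the factor (n - 1) / n.
sample_var <- var(x)
all.equal(as.numeric(pmf_var), sample_var * (n - 1) / n)
```

With n = 1181 the factor is 1180/1181, so the choice of divisor barely affects the reported variance of 1.21.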
Mean wages were estimated for subgroups defined by family size. For example, individuals with exactly two children had a mean wage of $23.43, with a 95% confidence interval ranging from $21.58 to $25.29. Such intervals provide insight into the precision of estimated means and the reliability of observed differences across groups (Ciminelli, Schwellnus and Stadler, 2021).
two_children <- df[df$nchild == 2, ]
mean(two_children$wage)
## [1] 23.43355
t.test(two_children$wage, conf.level = 0.95)
##
## One Sample t-test
##
## data: two_children$wage
## t = 24.938, df = 216, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.58146 25.28563
## sample estimates:
## mean of x
## 23.43355
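The interval printed by `t.test()` can be reproduced by hand as mean ± critical t value × standard error. The sketch below uses only the values printed in the output above; the standard error is recovered from the t statistic, which is possible because the test is against zero.

```r
# Values copied from the t.test() output above.
xbar <- 23.43355   # sample mean
t_stat <- 24.938   # t statistic for H0: mean = 0
dof <- 216         # degrees of freedom (n - 1)

# For a test against zero, t = xbar / se, so se can be recovered.
se <- xbar / t_stat

# 95% interval: xbar +/- critical t value * standard error
t_crit <- qt(0.975, dof)
ci <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
round(ci, 2)
```

The result agrees with the reported interval of roughly 21.58 to 25.29, up to rounding of the printed t statistic.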
five_or_more <- df[df$nchild >= 5, ]
nrow(five_or_more) # Check if enough data
## [1] 9
if(nrow(five_or_more) >= 2){
t.test(five_or_more$wage)
} else {
"Too few observations to compute CI"
}
##
## One Sample t-test
##
## data: five_or_more$wage
## t = 5.8681, df = 8, p-value = 0.000375
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.740921 17.763523
## sample estimates:
## mean of x
## 12.75222
The mean number of children (0.81) and mean wages for different family sizes were key summary statistics. Those with larger families (five or more children) earned substantially less on average ($12.75) than those with exactly two children ($23.43).
The two tests above are one-sample t-tests of each group's mean wage against zero, so the p-value reported for the five-or-more group (p = 0.000375) shows only that its mean wage differs from zero; it is not a test of the difference between the groups. The comparison rests instead on the confidence intervals: the interval for individuals with two children ($21.58 to $25.29) does not overlap the interval for those with five or more ($7.74 to $17.76), which suggests that larger families are associated with lower earnings, possibly due to increased financial responsibilities or reduced labor market participation. A two-sample test would be needed to confirm the difference formally, and the very small sample in the large-family group (n = 9) calls for caution in generalizing this finding (Bluedorn et al., 2023).
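A direct comparison of two group means is usually done with a two-sample test. The following is a minimal sketch of Welch's t-test on simulated wages; the group sizes mirror the report (217 and 9), but the values are generated for illustration and the variable names are assumptions, so the printed numbers say nothing about the actual data.

```r
# Simulated wages for illustration only; group sizes mirror the report
# (217 with two children, 9 with five or more), the values do not.
set.seed(42)
wage_two <- rlnorm(217, meanlog = 3.0, sdlog = 0.5)
wage_five_plus <- rlnorm(9, meanlog = 2.5, sdlog = 0.5)

# Welch two-sample t-test: compares the two group means directly and
# does not assume equal variances (var.equal = FALSE is the default).
welch <- t.test(wage_two, wage_five_plus)

welch$method      # name of the test that was run
welch$conf.int    # 95% CI for the difference in means
welch$p.value
```

On the report's data the equivalent call would be `t.test(two_children$wage, five_or_more$wage)`. With only nine observations in one group, the resulting interval for the difference is likely to be wide.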
cont_table <- table(df$gender, df$insure)
cont_table
##
## 0 1
## 0 117 542
## 1 89 433
chisq.test(cont_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 0.0574, df = 1, p-value = 0.8107
A chi-squared test examined the relationship between gender and insurance status. The test returned a non-significant p-value of 0.8107, so there is no evidence of an association between the two variables. The insured shares are nearly identical in the two gender groups (542 of 659, about 82.2%, versus 433 of 522, about 83.0%), although a non-significant result does not prove that the underlying rates are exactly equal.
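The test can be reproduced from the printed counts alone, which also makes the row proportions behind the result explicit. This sketch rebuilds the contingency table by hand; the `dimnames` labels are added here for readability.

```r
# Counts copied from the contingency table above
# (rows: gender 0/1, columns: insure 0/1).
cont_table <- matrix(c(117, 89, 542, 433), nrow = 2,
                     dimnames = list(gender = c("0", "1"),
                                     insure = c("0", "1")))

# Share insured within each gender group (row proportions).
insured_share <- prop.table(cont_table, margin = 1)[, "1"]
round(insured_share, 3)

# Yates-corrected chi-squared test, the default for 2 x 2 tables.
res <- chisq.test(cont_table)
res$p.value
```

The p-value agrees with the 0.8107 reported above.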
young <- df[df$age < 35, ]
lm_young <- lm(log(wage) ~ age, data = young)
summary(lm_young)
##
## Call:
## lm(formula = log(wage) ~ age, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63005 -0.32110 -0.01201 0.31821 1.49042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594555 0.173214 9.206 < 2e-16 ***
## age 0.041382 0.006074 6.813 3.85e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.108
## F-statistic: 46.41 on 1 and 374 DF, p-value: 3.846e-11
A simple linear regression model was fitted to assess the relationship between age and log-transformed wages among younger individuals (under 35). The resulting equation, log(wage) = 1.595 + 0.0414 × age, indicates a positive and significant association. The coefficient for age (0.0414) implies that each additional year of age is associated with an increase in wages of roughly 4.1% to 4.2%. This effect was highly significant (p < 0.001), and the model explained about 11% of the variability in log wages (adjusted R² = 0.108) (Broz, Frieden and Weymouth, 2021).
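The percentage interpretation of a coefficient in a log-linear model follows from exponentiating it. The short sketch below uses the age coefficient reported above to compare the exact conversion with the common shortcut of reading the coefficient directly as a percentage.

```r
# Coefficient on age from the young-worker model above.
b_age <- 0.041382

# In a log-linear model, a one-unit increase in x multiplies the
# outcome by exp(b), so the exact percentage change is 100 * (exp(b) - 1).
pct_exact <- 100 * (exp(b_age) - 1)
pct_approx <- 100 * b_age   # the usual small-coefficient shortcut

c(exact = pct_exact, approx = pct_approx)
```

For a coefficient this small the two values differ by less than a tenth of a percentage point (about 4.2% versus 4.1%); the gap grows for larger coefficients.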
old <- df[df$age >= 35, ]
lm_old <- lm(log(wage) ~ age, data = old)
summary(lm_old)
##
## Call:
## lm(formula = log(wage) ~ age, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91172 -0.39124 -0.04711 0.39679 1.54456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0795566 0.1157775 26.599 <2e-16 ***
## age -0.0005273 0.0023115 -0.228 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared: 6.479e-05, Adjusted R-squared: -0.00118
## F-statistic: 0.05203 on 1 and 803 DF, p-value: 0.8196
For older individuals (35 and over), the model log(wage) = 3.08 - 0.00053 × age showed no significant relationship between age and wages (p = 0.82), with an adjusted R² near zero. This suggests that wages plateau or become unrelated to age in later career stages, possibly reflecting career ceiling effects or retirement planning.
ggplot(young, aes(x = age, y = log(wage))) +
geom_point() + geom_smooth(method = "lm") +
labs(title = "Young: log(Wage) vs Age")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(old, aes(x = age, y = log(wage))) +
geom_point() + geom_smooth(method = "lm") +
labs(title = "Old: log(Wage) vs Age")
## `geom_smooth()` using formula = 'y ~ x'
Graphical plots using ggplot2 further supported these findings. The
younger group displayed a noticeable upward trend in wages with
increasing age, whereas the older group’s wage trend remained
essentially flat.
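# Note: young and old were subset before these conversions, so the models
# below still see gender, marital, union, insure and metro as integer codes.
df$gender <- factor(df$gender)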
df$race <- factor(df$race)
df$marital <- factor(df$marital)
df$region <- factor(df$region)
df$union <- factor(df$union)
df$insure <- factor(df$insure)
df$metro <- factor(df$metro)
lm_young_full <- lm(log(wage) ~ . - wage, data = young)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
summary(lm_young_full)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36303 -0.26382 -0.01698 0.25524 1.30213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.834191 0.209063 8.773 < 2e-16 ***
## age 0.028458 0.006332 4.495 9.40e-06 ***
## educ 0.121518 0.017415 6.978 1.44e-11 ***
## gender -0.194123 0.047602 -4.078 5.59e-05 ***
## hrswork -0.003245 0.002368 -1.370 0.1715
## insure 0.224896 0.053054 4.239 2.85e-05 ***
## metro 0.011774 0.058192 0.202 0.8398
## nchild -0.025153 0.025517 -0.986 0.3249
## union 0.159936 0.073275 2.183 0.0297 *
## raceBlack -0.172978 0.118896 -1.455 0.1466
## raceWhite -0.102353 0.089136 -1.148 0.2516
## marital 0.051933 0.043641 1.190 0.2348
## regionnortheast 0.116789 0.067034 1.742 0.0823 .
## regionsouth 0.010973 0.058890 0.186 0.8523
## regionwest 0.048742 0.065094 0.749 0.4545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4226 on 361 degrees of freedom
## Multiple R-squared: 0.3388, Adjusted R-squared: 0.3132
## F-statistic: 13.21 on 14 and 361 DF, p-value: < 2.2e-16
A multiple linear regression model was fitted for younger workers to evaluate the combined effects of the predictors on log wages. Significant variables included:
- Age (0.0285, p < 0.001): still positive, though smaller than in the simple regression.
- Education (0.1215, p < 0.001): the strongest predictor; each one-level increase in the education code is associated with about 13% higher wages.
- Gender (-0.1941, p < 0.001): the group coded 1, interpreted here as female, earned about 18% less than the group coded 0, indicating a gender wage gap.
- Insurance (0.2249, p < 0.001): being insured was associated with higher wages.
- Union membership (0.1599, p = 0.0297): union members earned more on average.

This model explained approximately 31% of the variability in log wages (adjusted R² = 0.313), showing that multiple social and economic factors are associated with earnings among younger workers (Borsboom et al., 2021).
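One detail of the fitting code affects how these coefficients should be read. The `factor()` conversions were applied to `df` after `young` and `old` had been created, which is why the table shows a single numeric slope for `gender`, `insure`, `union`, `metro`, and `marital` rather than one coefficient per level. The sketch below uses a toy data frame (illustrative values only) to show how the order of subsetting and conversion changes the column type.

```r
# Toy data for illustration only (column names follow the report).
toy <- data.frame(age = c(25L, 30L, 40L, 50L),
                  gender = c(0L, 1L, 0L, 1L))

# Subsetting first, converting later: the subset keeps integer codes.
young_before <- toy[toy$age < 35, ]
toy$gender <- factor(toy$gender)
class(young_before$gender)

# Subsetting after the conversion: the subset inherits the factor.
young_after <- toy[toy$age < 35, ]
class(young_after$gender)
```

For binary 0/1 variables the fitted slope is the same either way. The order matters for `marital`, whose three codes (0, 1, 2) are otherwise treated as a single linear scale.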
lm_old_full <- lm(log(wage) ~ . - wage, data = old)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
summary(lm_old_full)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85888 -0.30451 0.02666 0.32575 1.31774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2694898 0.1799543 12.611 < 2e-16 ***
## age 0.0003294 0.0021291 0.155 0.87711
## educ 0.1551089 0.0119593 12.970 < 2e-16 ***
## gender -0.1811629 0.0355925 -5.090 4.48e-07 ***
## hrswork 0.0015615 0.0021518 0.726 0.46824
## insure 0.2475608 0.0528619 4.683 3.32e-06 ***
## metro 0.1417880 0.0471982 3.004 0.00275 **
## nchild -0.0177843 0.0164374 -1.082 0.27961
## union 0.0452883 0.0489084 0.926 0.35474
## raceBlack -0.0106162 0.1013315 -0.105 0.91659
## raceWhite 0.0849661 0.0832381 1.021 0.30768
## marital 0.0548061 0.0320963 1.708 0.08811 .
## regionnortheast 0.0536894 0.0533367 1.007 0.31443
## regionsouth 0.0456868 0.0466322 0.980 0.32752
## regionwest 0.1326383 0.0506384 2.619 0.00898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 790 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2632
## F-statistic: 21.51 on 14 and 790 DF, p-value: < 2.2e-16
For the older group, a similar multiple regression model identified a different set of key predictors. Education (0.1551, p < 0.001) and gender (-0.1812, p < 0.001) remained significant, and insurance (0.2476, p < 0.001) continued to show a positive association. Metro residence (0.1418, p = 0.003) and the Western region (0.1326, p = 0.009, relative to the Midwest baseline) emerged as significant geographic factors. Age, union membership, and number of children were not significant. This model explained 26% of the variability in log wages (adjusted R² = 0.263), indicating a shift in wage determinants with age. Geographic factors played a more prominent role in the older group, possibly reflecting local labor market conditions or cost of living differentials.
The divergence in predictors between age groups underscores how wage determinants vary over a career. Education and insurance status are consistently associated with wages, while union membership and age matter mainly early in careers. Geographic factors gain relevance for older workers. The persistent gender gap in both groups points to ongoing inequalities that warrant policy attention.
Overall, the analysis indicates that wages reflect a combination of demographic, socioeconomic, and geographic factors whose relative importance shifts over the life course. Age is positively associated with wages mainly among younger workers, while education is associated with higher earnings in both age groups. Insurance coverage is associated with higher wages in both groups, whereas the union wage premium is significant only for younger workers. Geographic location becomes more important for wage outcomes in later career stages. The negative gender coefficient in both models calls for continued focus on closing the wage gap. These findings are associations from cross-sectional data rather than causal estimates, but they offer useful guidance for policymakers, employers, and educators aiming to foster equitable wage growth (Blundell et al., 2025).
Bluedorn, J., Hansen, N.J., Noureldin, D., Shibata, I. and Tavares, M.M., 2023. Transitioning to a greener labor market: Cross-country evidence from microdata. Energy Economics, 126, p.106836.
Blundell, R., Bollinger, C.R., Hokayem, C. and Ziliak, J.P., 2025. Interpreting cohort profiles of life cycle earnings volatility. Journal of Labor Economics, 43(S1), pp.S55-S82.
Borsboom, D., Deserno, M.K., Rhemtulla, M., Epskamp, S., Fried, E.I., McNally, R.J., Robinaugh, D.J., Perugini, M., Dalege, J., Costantini, G. and Isvoranu, A.M., 2021. Network analysis of multivariate data in psychological science. Nature Reviews Methods Primers, 1(1), p.58.
Broz, J.L., Frieden, J. and Weymouth, S., 2021. Populism in place: the economic geography of the globalization backlash. International Organization, 75(2), pp.464-494.
Ciminelli, G., Schwellnus, C. and Stadler, B., 2021. Sticky floors or glass ceilings? The role of human capital, working time flexibility and discrimination in the gender wage gap. OECD Economics Department Working Papers, (1668), pp.0_1-43.
Williams, C. and Gashi, A., 2022. Evaluating the wage differential between the formal and informal economy: a gender perspective. Journal of Economic Studies, 49(4), pp.735-750.