Library Import

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

1. Data Exploration

Load dataset

df <- read.csv('D:/MA334-SP-7_2412507.csv')

View structure and summary

str(df)

## 'data.frame':    1181 obs. of  12 variables:
##  $ age    : int  29 45 39 30 42 47 62 57 21 69 ...
##  $ educ   : int  4 3 2 3 3 3 2 2 1 0 ...
##  $ gender : int  1 1 1 0 0 1 1 0 0 1 ...
##  $ hrswork: int  40 45 40 45 60 45 40 48 40 40 ...
##  $ insure : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ metro  : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ nchild : int  2 3 1 0 3 0 1 0 0 0 ...
##  $ union  : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ wage   : num  25.9 14.4 17.2 17.1 18.3 ...
##  $ race   : chr  "White" "White" "White" "White" ...
##  $ marital: int  1 2 1 0 1 1 1 1 0 2 ...
##  $ region : chr  "south" "south" "midwest" "northeast" ...

summary(df)

##       age             educ           gender         hrswork     
##  Min.   :17.00   Min.   :0.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:32.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:40.00  
##  Median :43.00   Median :2.000   Median :0.000   Median :40.00  
##  Mean   :42.61   Mean   :1.751   Mean   :0.442   Mean   :41.61  
##  3rd Qu.:52.00   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:42.00  
##  Max.   :77.00   Max.   :5.000   Max.   :1.000   Max.   :80.00  
##      insure           metro            nchild           union       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.8256   Mean   :0.8239   Mean   :0.8061   Mean   :0.1372  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :9.0000   Max.   :1.0000  
##       wage           race              marital          region         
##  Min.   : 2.50   Length:1181        Min.   :0.0000   Length:1181       
##  1st Qu.:13.00   Class :character   1st Qu.:0.0000   Class :character  
##  Median :18.75   Mode  :character   Median :1.0000   Mode  :character  
##  Mean   :22.77                      Mean   :0.8476                     
##  3rd Qu.:28.84                      3rd Qu.:1.0000                     
##  Max.   :99.00                      Max.   :2.0000

nrow(df)

## [1] 1181

ncol(df)

## [1] 12

The dataset contains 1,181 observations on 12 demographic, socioeconomic, and wage-related variables. Numerical variables include age, hours worked, number of children, and wage; categorical variables include gender, insurance status, union membership, race, marital status, and region. Wages are right-skewed (mean 22.77 against a median of 18.75), so the regressions that follow use log(wage), which reduces the skew and helps stabilize the variance.
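The effect of the log transformation on skewness can be illustrated with a small sketch. The data here are simulated from a log-normal distribution (an assumption made only for illustration, not the report's wage data), and `skewness()` is a hypothetical helper defined in the block because base R does not ship one.

```r
# Simulated right-skewed "wages" (hypothetical data, not the report's CSV)
set.seed(42)
wage_sim <- rlnorm(1000, meanlog = 3, sdlog = 0.6)

# Moment-based sample skewness: third central moment over sd^3
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(wage_sim)       # clearly positive for right-skewed data
skew_log <- skewness(log(wage_sim))  # close to zero after the transform

c(raw = skew_raw, log = skew_log)
```

With the real data loaded, the same helper could be applied to `df$wage` and `log(df$wage)`.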

Descriptive statistics

summary(select(df, age, hrswork, nchild, wage))

##       age           hrswork          nchild            wage      
##  Min.   :17.00   Min.   : 0.00   Min.   :0.0000   Min.   : 2.50  
##  1st Qu.:32.00   1st Qu.:40.00   1st Qu.:0.0000   1st Qu.:13.00  
##  Median :43.00   Median :40.00   Median :0.0000   Median :18.75  
##  Mean   :42.61   Mean   :41.61   Mean   :0.8061   Mean   :22.77  
##  3rd Qu.:52.00   3rd Qu.:42.00   3rd Qu.:2.0000   3rd Qu.:28.84  
##  Max.   :77.00   Max.   :80.00   Max.   :9.0000   Max.   :99.00

The distribution of the number of children showed that most individuals (about 92%) have zero to two children, with a mean of approximately 0.81 and a variance of 1.21, indicating moderate dispersion. Larger families (three or more children) were less common, making up about 7.54% of the sample. The two gender categories were of broadly similar size (659 and 522 individuals) (Williams and Gashi, 2022).

Frequency tables for categorical variables

table(df$gender)

## 
##   0   1 
## 659 522

table(df$race)

## 
## Asian Black White 
##    65   104  1012

table(df$marital)

## 
##   0   1   2 
## 324 713 144

table(df$region)

## 
##   midwest northeast     south      west 
##       309       215       385       272

Initial cross-tabulations between gender and insurance status suggested similar insurance rates in the two gender groups: about 82% of individuals coded gender = 0 and 83% of those coded gender = 1 were insured. This preliminary finding informed the choice of variables for the statistical tests and regression models that follow.
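The insurance rates within each gender group can be read off as row proportions. The sketch below rebuilds the gender-by-insurance table from the counts printed in this report, so it runs without the CSV; with the data loaded, `prop.table(table(df$gender, df$insure), margin = 1)` would be the equivalent call.

```r
# Counts copied from the gender x insure cross-tabulation in this report
cont_table <- matrix(
  c(117, 89, 542, 433),          # filled column by column
  nrow = 2,
  dimnames = list(gender = c("0", "1"), insure = c("0", "1"))
)

# margin = 1 normalises each row, giving P(insure | gender)
row_props <- prop.table(cont_table, margin = 1)
round(row_props, 3)
```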

Histograms

ggplot(df, aes(x = wage)) + geom_histogram(binwidth = 2, fill = "lightblue", color = "black") + labs(title = "Histogram of Wages")

Boxplot of wage by gender

ggplot(df, aes(x = factor(gender), y = wage)) + geom_boxplot() + labs(title = "Wage by Gender", x = "Gender", y = "Wage")

Correlation matrix (numeric variables only)

numeric_df <- select(df, age, hrswork, nchild, wage)
cor(numeric_df)

##                 age    hrswork      nchild       wage
## age      1.00000000 0.05585503 -0.05046348 0.21194887
## hrswork  0.05585503 1.00000000  0.06866293 0.09091083
## nchild  -0.05046348 0.06866293  1.00000000 0.01655582
## wage     0.21194887 0.09091083  0.01655582 1.00000000

2. Probability, Distributions & Confidence Intervals

Probability that at least 1 of 5 randomly selected individuals is uninsured

p_insured <- mean(df$insure == 1)
p_at_least_one_uninsured <- 1 - (p_insured)^5
p_at_least_one_uninsured

## [1] 0.6164927
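The calculation above treats the five individuals as independent draws with the same probability of being insured. Under that assumption the result can be cross-checked against the binomial distribution. The sketch below uses the insured count implied by the tables in this report (542 + 433 out of 1181) rather than reading the CSV.

```r
# Insured count taken from the gender x insure table in this report
n_insured <- 542 + 433
n_total <- 1181
p_uninsured <- 1 - n_insured / n_total

# Direct complement rule, as in the report
p_direct <- 1 - (1 - p_uninsured)^5

# Same quantity via the binomial pmf: 1 - P(0 uninsured out of 5)
p_binom <- 1 - dbinom(0, size = 5, prob = p_uninsured)

c(direct = p_direct, binomial = p_binom)
```

Sampling five people without replacement from the 1,181 would strictly call for a hypergeometric model, but with a sample this large the difference is negligible.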

The probability of having three or more children was calculated to be 7.54%. This low proportion shows that large families are uncommon in this sample and may warrant specific attention when examining income effects related to family burden.

P(nchild >= 1 | married)

married <- df[df$marital == 1, ]
p_child_given_married <- mean(married$nchild >= 1)
p_child_given_married

## [1] 0.6002805

Distribution of nchild

nchild_table <- table(df$nchild)
nchild_probs <- prop.table(nchild_table)
nchild_probs

## 
##            0            1            2            3            4            5 
## 0.5605419136 0.1803556308 0.1837425910 0.0550381033 0.0127011008 0.0059271804 
##            6            9 
## 0.0008467401 0.0008467401

Visual and statistical examination suggested that wages follow a right-skewed distribution. Applying the natural logarithm produced a more symmetric distribution, which is better suited to the linear regression analyses that follow.

Mean and variance

nchild_vals <- as.numeric(names(nchild_probs))
nchild_mean <- sum(nchild_vals * nchild_probs)
nchild_var <- sum((nchild_vals - nchild_mean)^2 * nchild_probs)
nchild_mean

## [1] 0.8060965

nchild_var

## [1] 1.211343
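The variance above is computed from the empirical probabilities, so it divides by n rather than n - 1 and will differ slightly from what `var(df$nchild)` returns. A minimal sketch of the relationship follows, using the reported value and a toy vector (the vector `x` is hypothetical):

```r
# Reported probability-weighted variance and sample size from this report
n <- 1181
pop_var <- 1.211343

# var() in R uses the n - 1 denominator, so rescale by n / (n - 1)
sample_var <- pop_var * n / (n - 1)
sample_var

# Check the identity on a toy vector (hypothetical values)
x <- c(0, 0, 1, 2, 2, 3)
pop_var_x <- mean((x - mean(x))^2)
all.equal(var(x), pop_var_x * length(x) / (length(x) - 1))
```

With n = 1181 the two versions agree to about three decimal places, so the choice does not affect the interpretation.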

P(nchild >= 3)

p_nchild_3_or_more <- sum(nchild_probs[nchild_vals >= 3])
p_nchild_3_or_more

## [1] 0.07535986

Mean wages were estimated for subgroups defined by family size. For example, individuals with exactly two children had a mean wage of $23.43, with a 95% confidence interval ranging from $21.58 to $25.29. Such intervals provide insight into the precision of estimated means and the reliability of observed differences across groups (Ciminelli, Schwellnus and Stadle, 2021).
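The interval quoted above is a standard t-based confidence interval, mean ± t(0.975, n - 1) × standard error. As a sketch, it can be rebuilt from the summary figures that `t.test()` prints for the two-children group (mean, t statistic and degrees of freedom), recovering the standard error as mean / t because the test is against zero:

```r
# Summary figures copied from the t.test() output for nchild == 2
m <- 23.43355      # sample mean wage
t_stat <- 24.938   # t statistic against mu = 0
dof <- 216         # degrees of freedom (n - 1)

# For a test against zero, t = mean / se, so se = mean / t
se <- m / t_stat

# 95% interval: mean +/- critical t value times the standard error
half_width <- qt(0.975, df = dof) * se
ci <- c(lower = m - half_width, upper = m + half_width)
round(ci, 2)
```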

3. Estimates, CI & Hypothesis Test

For nchild == 2

two_children <- df[df$nchild == 2, ]
mean(two_children$wage)

## [1] 23.43355

t.test(two_children$wage, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  two_children$wage
## t = 24.938, df = 216, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.58146 25.28563
## sample estimates:
## mean of x 
##  23.43355

For nchild >= 5

five_or_more <- df[df$nchild >= 5, ]
nrow(five_or_more)  # Check if enough data

## [1] 9

if(nrow(five_or_more) >= 2){
  t.test(five_or_more$wage)
} else {
  "Too few observations to compute CI"
}

## 
##  One Sample t-test
## 
## data:  five_or_more$wage
## t = 5.8681, df = 8, p-value = 0.000375
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   7.740921 17.763523
## sample estimates:
## mean of x 
##  12.75222

The mean number of children (0.81) and mean wages for different family sizes were key summary statistics. Those with larger families (five or more children) earned substantially less on average ($12.75) compared to those with fewer children.

Hypothesis Testing on Family Size and Wages

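The two t-tests above are one-sample tests of each group's mean wage against zero, so the reported p-values (p < 2.2e-16 and p = 0.000375) show only that mean wages are positive; they do not compare the two groups directly. The comparison rests instead on the confidence intervals: the interval for individuals with two children ($21.58 to $25.29) does not overlap the interval for those with five or more ($7.74 to $17.76), which suggests that larger families are associated with lower earnings, possibly due to increased financial responsibilities or reduced labor market participation. A two-sample test would be needed to confirm the difference formally, and the very small size of the large-family group (n = 9) calls for caution in generalizing this finding (Bluedorn et al., 2023).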
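One way to compare the two family-size groups directly is Welch's two-sample t-test. With the data loaded this would be `t.test(two_children$wage, five_or_more$wage)`. The sketch below instead rebuilds the Welch statistic from the summary figures printed above, so it runs without the CSV. The standard errors are recovered as mean / t, which is valid only because both printed tests are against zero.

```r
# Summary figures copied from the two one-sample t.test() outputs
m1 <- 23.43355; t1 <- 24.938; n1 <- 217   # nchild == 2  (df = 216)
m2 <- 12.75222; t2 <- 5.8681; n2 <- 9     # nchild >= 5  (df = 8)

# Standard errors of the two means
se1 <- m1 / t1
se2 <- m2 / t2

# Welch t statistic and Welch-Satterthwaite degrees of freedom
welch_t <- (m1 - m2) / sqrt(se1^2 + se2^2)
welch_df <- (se1^2 + se2^2)^2 /
  (se1^4 / (n1 - 1) + se2^4 / (n2 - 1))

# Two-sided p-value
welch_p <- 2 * pt(-abs(welch_t), df = welch_df)
c(t = welch_t, df = welch_df, p = welch_p)
```

Because the larger-family group has only nine observations, the result is sensitive to individual values and should be read as indicative rather than conclusive.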

Insurance by gender

cont_table <- table(df$gender, df$insure)
cont_table

##    
##       0   1
##   0 117 542
##   1  89 433

chisq.test(cont_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 0.0574, df = 1, p-value = 0.8107

A chi-squared test of independence (with Yates' continuity correction) examined the relationship between gender and insurance status. The test returned a p-value of 0.8107, so there is no evidence of an association between the two variables. Failing to reject independence does not prove that insurance rates are identical for males and females, only that any difference is too small to detect in this sample.
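A complementary view is the difference in insured proportions with a confidence interval, which shows how large a gap the data could still be consistent with. The sketch below uses `prop.test()` on the counts from the contingency table above. For a 2 x 2 table it applies the same continuity-corrected chi-squared statistic as `chisq.test()`.

```r
# Insured counts and group sizes from the gender x insure table
insured <- c(542, 433)   # gender 0, gender 1
totals <- c(659, 522)

res <- prop.test(x = insured, n = totals)

res$estimate    # insured proportion in each group
res$conf.int    # 95% CI for the difference in proportions
res$statistic   # continuity-corrected chi-squared statistic
```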

4. Simple Linear Regression

Young data

young <- df[df$age < 35, ]
lm_young <- lm(log(wage) ~ age, data = young)
summary(lm_young)

## 
## Call:
## lm(formula = log(wage) ~ age, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63005 -0.32110 -0.01201  0.31821  1.49042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.594555   0.173214   9.206  < 2e-16 ***
## age         0.041382   0.006074   6.813 3.85e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.108 
## F-statistic: 46.41 on 1 and 374 DF,  p-value: 3.846e-11

A simple linear regression was fitted to assess the relationship between age and log-transformed wages among younger individuals (age under 35). The fitted equation, log(wage) = 1.595 + 0.0414 × age, indicates a positive and significant association: each additional year of age is associated with approximately 4.1% higher wages (more precisely, exp(0.0414) - 1 ≈ 4.2%). The effect was highly significant (p < 0.001), and the model explained about 11% of the variability in log wages (R² = 0.110, adjusted R² = 0.108) (Broz, Frieden and Weymouth, 2021).
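Because the response is log(wage), a coefficient b translates into a percentage change of (exp(b) - 1) × 100 per unit of the predictor. Reading b × 100 directly is an approximation that works well only for small b. A short sketch using the age coefficient reported above:

```r
# Age coefficient from the simple regression for the younger group
b_age <- 0.041382

# Exact percentage change in wage per additional year of age
pct_per_year <- (exp(b_age) - 1) * 100

# The approximation degrades over larger steps, e.g. ten years
pct_per_decade <- (exp(10 * b_age) - 1) * 100

round(c(per_year = pct_per_year, per_decade = pct_per_decade), 2)
```

Over ten years the compounded effect is noticeably larger than ten times the yearly figure, which is why the exact form is preferable when comparing across wide age gaps.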

Old data

old <- df[df$age >= 35, ]
lm_old <- lm(log(wage) ~ age, data = old)
summary(lm_old)

## 
## Call:
## lm(formula = log(wage) ~ age, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91172 -0.39124 -0.04711  0.39679  1.54456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0795566  0.1157775  26.599   <2e-16 ***
## age         -0.0005273  0.0023115  -0.228     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared:  6.479e-05,  Adjusted R-squared:  -0.00118 
## F-statistic: 0.05203 on 1 and 803 DF,  p-value: 0.8196

For older individuals, the model: log(wage)=3.08-0.00053×age revealed no significant relationship between age and wages (p = 0.82), with an adjusted R² near zero. This suggests that wages tend to plateau or become unrelated to age in later career stages, possibly reflecting career ceiling effects or retirement planning.
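Fitting separate models for the two age groups shows that the slopes look different but does not test that difference. One possible alternative, sketched here on simulated data, is a single model with an age-by-group interaction. The data frame `sim`, the indicator `older` and the data-generating process are all hypothetical and chosen only to mimic a wage profile that rises until 35 and then flattens.

```r
# Simulated data (hypothetical): log wage rises with age up to 35, then is flat
set.seed(1)
n <- 400
sim <- data.frame(age = sample(18:70, n, replace = TRUE))
sim$older <- as.integer(sim$age >= 35)
sim$log_wage <- 2 + 0.04 * pmin(sim$age, 35) + rnorm(n, sd = 0.5)

# age * older expands to age + older + age:older
fit <- lm(log_wage ~ age * older, data = sim)

# The age:older row estimates how much the age slope changes for the older group
coef(summary(fit))["age:older", ]
```

Applied to the report's data, the analogous call would be `lm(log(wage) ~ age * I(age >= 35), data = df)`, and the interaction term's p-value would test whether the two slopes differ.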

Plots

ggplot(young, aes(x = age, y = log(wage))) +
  geom_point() + geom_smooth(method = "lm") +
  labs(title = "Young: log(Wage) vs Age")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(old, aes(x = age, y = log(wage))) +
  geom_point() + geom_smooth(method = "lm") +
  labs(title = "Old: log(Wage) vs Age")

## `geom_smooth()` using formula = 'y ~ x'

Graphical plots using ggplot2 further supported these findings. The younger group displayed a noticeable upward trend in wages with increasing age, whereas the older group’s wage trend remained essentially flat.

5. Multiple Linear Regression

Convert to factors

df$gender <- factor(df$gender)
df$race <- factor(df$race)
df$marital <- factor(df$marital)
df$region <- factor(df$region)
df$union <- factor(df$union)
df$insure <- factor(df$insure)
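df$metro <- factor(df$metro)

Note that young and old were created from df before these conversions, so the models below still treat gender, marital, union, insure, and metro as numeric codes; this is why each appears as a single coefficient in the output. Only race and region, which were read in as character columns, are expanded into dummy variables.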
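A minimal sketch on a toy data frame (hypothetical values) of why the order of subsetting and factor conversion matters: a subset taken before the conversion keeps the original integer column, while a subset taken afterwards inherits the factor.

```r
# Toy data frame (hypothetical values, not the report's data)
toy <- data.frame(age = c(25, 40, 30, 50), gender = c(0L, 1L, 1L, 0L))

# Subset taken BEFORE converting to factor keeps the integer column
young_before <- toy[toy$age < 35, ]

toy$gender <- factor(toy$gender)

# Subset taken AFTER the conversion carries the factor
young_after <- toy[toy$age < 35, ]

class(young_before$gender)
class(young_after$gender)
```

For the report's models to use the factor versions, `young` and `old` would need to be recreated from `df` after the conversions. For a multi-level variable such as marital, this changes the number of coefficients.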

Full model for young

lm_young_full <- lm(log(wage) ~ . - wage, data = young)

## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!

summary(lm_young_full)

## 
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36303 -0.26382 -0.01698  0.25524  1.30213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.834191   0.209063   8.773  < 2e-16 ***
## age              0.028458   0.006332   4.495 9.40e-06 ***
## educ             0.121518   0.017415   6.978 1.44e-11 ***
## gender          -0.194123   0.047602  -4.078 5.59e-05 ***
## hrswork         -0.003245   0.002368  -1.370   0.1715    
## insure           0.224896   0.053054   4.239 2.85e-05 ***
## metro            0.011774   0.058192   0.202   0.8398    
## nchild          -0.025153   0.025517  -0.986   0.3249    
## union            0.159936   0.073275   2.183   0.0297 *  
## raceBlack       -0.172978   0.118896  -1.455   0.1466    
## raceWhite       -0.102353   0.089136  -1.148   0.2516    
## marital          0.051933   0.043641   1.190   0.2348    
## regionnortheast  0.116789   0.067034   1.742   0.0823 .  
## regionsouth      0.010973   0.058890   0.186   0.8523    
## regionwest       0.048742   0.065094   0.749   0.4545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4226 on 361 degrees of freedom
## Multiple R-squared:  0.3388, Adjusted R-squared:  0.3132 
## F-statistic: 13.21 on 14 and 361 DF,  p-value: < 2.2e-16

A multiple linear regression model was fitted for younger workers (under 35) to evaluate the combined effects of the predictors on log wages. Five predictors were significant at the 5% level. Age (estimate = 0.0285, p < 0.001) retained a positive effect, though smaller than in the simple regression. Education (0.1215, p < 0.001) had the largest t-statistic; because educ is coded in levels from 0 to 5 rather than in years, the coefficient is the change in log wage per one-level increase, roughly 13% higher wages. Gender (-0.1941, p < 0.001) was associated with lower wages for the group coded 1, taken here to be female, pointing to a gender wage gap. Insurance (0.2249, p < 0.001) and union membership (0.1599, p = 0.0297) were both associated with higher wages. The model explained approximately 31% of the variability in log wages (adjusted R² = 0.313), showing that several social and economic factors jointly shape earnings among younger workers (Borsboom et al., 2021).
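The adjusted R² quoted above penalizes R² for the number of predictors: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). As a sketch, it can be reproduced from the printed model summary, where n = 376 (361 residual degrees of freedom plus 15 estimated coefficients) and p = 14. The helper name `adj_r2` is an illustrative choice.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Figures from the full model for the younger group
adj_young <- adj_r2(r2 = 0.3388, n = 376, p = 14)
round(adj_young, 4)
```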

Full model for old

lm_old_full <- lm(log(wage) ~ . - wage, data = old)

## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!

summary(lm_old_full)

## 
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85888 -0.30451  0.02666  0.32575  1.31774 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.2694898  0.1799543  12.611  < 2e-16 ***
## age              0.0003294  0.0021291   0.155  0.87711    
## educ             0.1551089  0.0119593  12.970  < 2e-16 ***
## gender          -0.1811629  0.0355925  -5.090 4.48e-07 ***
## hrswork          0.0015615  0.0021518   0.726  0.46824    
## insure           0.2475608  0.0528619   4.683 3.32e-06 ***
## metro            0.1417880  0.0471982   3.004  0.00275 ** 
## nchild          -0.0177843  0.0164374  -1.082  0.27961    
## union            0.0452883  0.0489084   0.926  0.35474    
## raceBlack       -0.0106162  0.1013315  -0.105  0.91659    
## raceWhite        0.0849661  0.0832381   1.021  0.30768    
## marital          0.0548061  0.0320963   1.708  0.08811 .  
## regionnortheast  0.0536894  0.0533367   1.007  0.31443    
## regionsouth      0.0456868  0.0466322   0.980  0.32752    
## regionwest       0.1326383  0.0506384   2.619  0.00898 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 790 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.2632 
## F-statistic: 21.51 on 14 and 790 DF,  p-value: < 2.2e-16

For the older group (35 and over), the same specification identified a different set of significant predictors. Education (0.1551, p < 0.001), gender (-0.1812, p < 0.001), and insurance (0.2476, p < 0.001) remained significant with the same signs as in the younger group. Metro residence (0.1418, p = 0.003) and the western region (0.1326, p = 0.009, relative to the midwest baseline) emerged as significant geographic factors. Age, union membership, and number of children were not significant. The model explained about 26% of the variability in log wages (adjusted R² = 0.263). Geographic factors played a more prominent role in the older group, possibly reflecting local labor market conditions or cost-of-living differentials.

Interpretation and Implications

The divergence in predictors between age groups suggests that wage determinants vary over the course of a career. Education, gender, and insurance status are significant in both groups, while age and union membership are significant only for younger workers and geographic factors only for older workers. The persistent gender gap highlights an ongoing inequality requiring policy attention.
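A predictor being significant in one group and not in the other does not by itself show that the two coefficients differ. A rough check, sketched below, is a z-test on the difference between the two estimates, z = (b_young - b_old) / sqrt(se_young² + se_old²). It uses the coefficients and standard errors printed in the two full-model summaries and treats the groups as independent samples, which is reasonable here because they do not overlap.

```r
# Estimates and standard errors copied from the two full-model summaries
coefs <- data.frame(
  term     = c("age", "union", "metro"),
  b_young  = c(0.028458, 0.159936, 0.011774),
  se_young = c(0.006332, 0.073275, 0.058192),
  b_old    = c(0.0003294, 0.0452883, 0.1417880),
  se_old   = c(0.0021291, 0.0489084, 0.0471982)
)

# z statistic for the difference between group coefficients
coefs$z <- (coefs$b_young - coefs$b_old) /
  sqrt(coefs$se_young^2 + coefs$se_old^2)

# Two-sided p-value under a normal approximation
coefs$p <- 2 * pnorm(-abs(coefs$z))

coefs[, c("term", "z", "p")]
```

On these figures the age difference stands out clearly, while the union and metro differences fall short of the conventional 1.96 threshold, so the apparent shift in those two predictors should be read with some caution.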

Conclusion

This analysis indicates that wages are shaped by a combination of demographic, socioeconomic, and geographic factors whose relative importance shifts over the life course. Age is positively associated with wages only among workers under 35, while education is associated with higher earnings in both age groups. Insurance coverage is associated with higher wages in both groups, whereas the union wage premium is significant only for younger workers. Geographic location (metro residence and the western region) matters mainly for older workers. The persistent negative gender coefficient in both models points to a wage gap that warrants continued attention. These findings are associations in cross-sectional data rather than causal estimates, but they offer useful guidance for policymakers, employers, and educators aiming to foster equitable wage growth (Blundell et al., 2025).

Reference List

Bluedorn, J., Hansen, N.J., Noureldin, D., Shibata, I. and Tavares, M.M., 2023. Transitioning to a greener labor market: Cross-country evidence from microdata. Energy Economics, 126, p.106836.

Blundell, R., Bollinger, C.R., Hokayem, C. and Ziliak, J.P., 2025. Interpreting cohort profiles of life cycle earnings volatility. Journal of Labor Economics, 43(S1), pp.S55-S82.

Borsboom, D., Deserno, M.K., Rhemtulla, M., Epskamp, S., Fried, E.I., McNally, R.J., Robinaugh, D.J., Perugini, M., Dalege, J., Costantini, G. and Isvoranu, A.M., 2021. Network analysis of multivariate data in psychological science. Nature Reviews Methods Primers, 1(1), p.58.

Broz, J.L., Frieden, J. and Weymouth, S., 2021. Populism in place: the economic geography of the globalization backlash. International Organization, 75(2), pp.464-494.

Ciminelli, G., Schwellnus, C. and Stadle, B., 2021. Sticky floors or glass ceilings? The role of human capital, working time flexibility and discrimination in the gender wage gap. OECD Economic Department Working Papers, (1668), pp.0_1-43.

Williams, C. and Gashi, A., 2022. Evaluating the wage differential between the formal and informal economy: a gender perspective. Journal of Economic Studies, 49(4), pp.735-750.

R STUDIO Training - 05.06.2025 - SKUD-1(Sandipan)

2025-06-06