Introduction

This report presents a statistical investigation into the factors that influence individual wages. Using a unique, student-specific dataset allocated for the MA334 module, the objective of this analysis is to determine which personal and socio-economic characteristics are associated with variations in wage levels. The analysis employs a structured methodology consisting of exploratory data analysis (EDA), descriptive statistics, and the application of both simple and multiple linear regression models. The dataset contains a range of demographic and employment-related variables, and this report focuses on interpreting the results of statistical modeling in the context of wage determination.

Import Libraries

Loading Necessary Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(knitr)

Import dataset

data <- read_csv("D:/software/training.csv")
## Rows: 1181 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): race, region
## dbl (10): age, educ, gender, hrswork, insure, metro, nchild, union, wage, ma...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View structure

glimpse(data)
## Rows: 1,181
## Columns: 12
## $ age     <dbl> 29, 45, 39, 30, 42, 47, 62, 57, 21, 69, 32, 46, 23, 55, 40, 62…
## $ educ    <dbl> 4, 3, 2, 3, 3, 3, 2, 2, 1, 0, 0, 4, 2, 4, 4, 3, 0, 2, 4, 0, 0,…
## $ gender  <dbl> 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,…
## $ hrswork <dbl> 40, 45, 40, 45, 60, 45, 40, 48, 40, 40, 50, 60, 37, 40, 45, 40…
## $ insure  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,…
## $ metro   <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ nchild  <dbl> 2, 3, 1, 0, 3, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 2, 0, 0, 2,…
## $ union   <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wage    <dbl> 25.95, 14.44, 17.25, 17.09, 18.33, 22.64, 19.73, 17.96, 11.50,…
## $ race    <chr> "White", "White", "White", "White", "White", "White", "Asian",…
## $ marital <dbl> 1, 2, 1, 0, 1, 1, 1, 1, 0, 2, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,…
## $ region  <chr> "south", "south", "midwest", "northeast", "west", "west", "nor…

Data Cleaning and Summary

Summary statistics

summary(data)
##       age             educ           gender         hrswork     
##  Min.   :17.00   Min.   :0.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:32.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:40.00  
##  Median :43.00   Median :2.000   Median :0.000   Median :40.00  
##  Mean   :42.61   Mean   :1.751   Mean   :0.442   Mean   :41.61  
##  3rd Qu.:52.00   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:42.00  
##  Max.   :77.00   Max.   :5.000   Max.   :1.000   Max.   :80.00  
##      insure           metro            nchild           union       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.8256   Mean   :0.8239   Mean   :0.8061   Mean   :0.1372  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :9.0000   Max.   :1.0000  
##       wage           race              marital          region         
##  Min.   : 2.50   Length:1181        Min.   :0.0000   Length:1181       
##  1st Qu.:13.00   Class :character   1st Qu.:0.0000   Class :character  
##  Median :18.75   Mode  :character   Median :1.0000   Mode  :character  
##  Mean   :22.77                      Mean   :0.8476                     
##  3rd Qu.:28.84                      3rd Qu.:1.0000                     
##  Max.   :99.00                      Max.   :2.0000

Check for missing values

colSums(is.na(data))
##     age    educ  gender hrswork  insure   metro  nchild   union    wage    race 
##       0       0       0       0       0       0       0       0       0       0 
## marital  region 
##       0       0

Convert categorical variables to factors

data <- data %>%
  mutate(
    gender = factor(gender, levels = c(0, 1), labels = c("Female", "Male")),
    insure = factor(insure),
    metro = factor(metro),
    union = factor(union),
    marital = factor(marital),
    region = factor(region),
    race = factor(race)
  )

The dataset contains 1,181 observations and 12 variables for each individual: age, education level, gender, hours worked per week, access to insurance, metropolitan residency status, number of children, union membership, wage, race, marital status, and region of residence (Adda and Dustmann, 2023). Each row represents one individual, and the data are treated as a sample from a larger population. The variable of interest, wage, is continuous and measured as hourly earnings. The data were first examined for structural consistency and missing values. No missing data were found. Gender, insurance status, union membership, and metro residence were stored as numeric 0/1 indicators, marital status as a numeric code with three levels (0, 1, 2), and race and region as character strings. All of these were converted to categorical (factor) variables to allow for clearer interpretation in modeling. Education and number of children were kept numeric.

Exploratory Data Analysis

Histogram of wages

ggplot(data, aes(x = wage)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Wages", x = "Wage", y = "Frequency")

Boxplot of wage by gender

ggplot(data, aes(x = gender, y = wage)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Wage by Gender")

Boxplot of wage by region

ggplot(data, aes(x = region, y = wage)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Wage by Region")

The exploratory data analysis began with the summary statistics of the dataset. Hourly wages ranged from $2.50 to $99.00, with a mean of approximately $22.77 and a median of $18.75. The middle half of the sample earned between $13.00 and $28.84 per hour. Because the mean exceeds the median by about $4, the distribution is clearly right-skewed: most observations cluster below the mean, while a long tail of higher-wage earners extends toward the maximum. The histogram of the wage variable reflects the same shape. A boxplot of wages by gender illustrated a noticeable difference in median and interquartile wage ranges between the two groups. Under the coding used here (0 = Female, 1 = Male), the Female group had the higher median wage ($19.97 versus $17.55), prompting further analysis to determine whether this difference remained significant when controlling for other factors.
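The degree of right skew can be illustrated directly from the quartiles printed by `summary(data)`. The sketch below is a minimal worked calculation, assuming only those rounded summary values; the variable names (`q1`, `med`, `q3`, `bowley`, `upper_fence`) are illustrative and do not appear in the original analysis.

```r
# Quartiles and median copied from the summary(data) output for wage
q1  <- 13.00
med <- 18.75
q3  <- 28.84
iqr <- q3 - q1

# Bowley (quartile) skewness: 0 for a symmetric distribution, > 0 for right skew
bowley <- (q3 + q1 - 2 * med) / iqr

# Upper fence under the usual 1.5 x IQR rule used to flag boxplot outliers
upper_fence <- q3 + 1.5 * iqr

round(bowley, 3)   # about 0.274
upper_fence        # 52.6
```

A positive quartile skewness and a maximum wage ($99.00) well above the upper fence are both consistent with a long right tail.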

Descriptive Insights

Average wage by gender

data %>%
  group_by(gender) %>%
  summarise(
    mean_wage = mean(wage),
    median_wage = median(wage),
    count = n()
  ) %>%
  kable()
| gender | mean_wage | median_wage | count |
|:-------|----------:|------------:|------:|
| Female |  24.47053 |       19.97 |   659 |
| Male   |  20.61954 |       17.55 |   522 |

Average wage by union status

data %>%
  group_by(union) %>%
  summarise(mean_wage = mean(wage)) %>%
  kable()
| union | mean_wage |
|:------|----------:|
| 0     |  22.48421 |
| 1     |  24.55599 |

Grouped summary statistics provided deeper insight into the relationships between key variables and wages. On average, individuals coded as Female earned approximately $24.47 per hour (median $19.97, n = 659), while those coded as Male earned about $20.62 per hour (median $17.55, n = 522) (James et al., 2023). This difference of roughly $3.85 in mean wage, visible in the gender-based boxplot, points to a notable gender wage gap in favor of the Female group in this sample. The mean wage for union members was $24.56, about $2.07 higher than the $22.48 earned by non-members. This raw difference is consistent with labor economics literature suggesting union membership is associated with higher pay, although it does not account for other characteristics. Education also showed a positive relationship with wages, with average wages rising across education levels, which supports the view that education contributes to human capital and wage potential.
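Whether a raw difference in group means is larger than sampling noise can be checked with a two-sample test before moving on to regression. The following is a minimal sketch using Welch's t-test on simulated, hypothetical wages (not the MA334 data); the data frame `sim`, the lognormal parameters, and the seed are all assumptions made for illustration, with only the group sizes taken from the table above.

```r
set.seed(334)  # arbitrary seed so the sketch is reproducible

# Simulated (hypothetical) right-skewed wages; NOT the MA334 data
sim <- data.frame(
  gender = factor(rep(c("Female", "Male"), times = c(659, 522))),
  wage   = c(rlnorm(659, meanlog = 3.0, sdlog = 0.6),
             rlnorm(522, meanlog = 2.9, sdlog = 0.6))
)

# Welch two-sample t-test: does not assume equal variances in the two groups
welch <- t.test(wage ~ gender, data = sim)

welch$estimate   # group means
welch$conf.int   # 95% CI for the difference in means (Female minus Male)
welch$p.value
```

On the real data the same call would be `t.test(wage ~ gender, data = data)`. Because wages are right-skewed, a test on `log(wage)` may be a reasonable alternative.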

Simple Linear Regression: Wage ~ Hours Worked

Simple Linear Regression

model_simple <- lm(wage ~ hrswork, data = data)
summary(model_simple)
## 
## Call:
## lm(formula = wage ~ hrswork, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.518  -9.556  -3.786   5.464  77.185 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.76754    1.95795   8.564  < 2e-16 ***
## hrswork      0.14421    0.04601   3.135  0.00176 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.11 on 1179 degrees of freedom
## Multiple R-squared:  0.008265,   Adjusted R-squared:  0.007424 
## F-statistic: 9.825 on 1 and 1179 DF,  p-value: 0.001764

Visualizing regression

ggplot(data, aes(x = hrswork, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Wage vs Hours Worked")
## `geom_smooth()` using formula = 'y ~ x'

To further investigate the relationship between work hours and wages, a simple linear regression model was fitted with wage as the dependent variable and hours worked as the independent variable. The model revealed a statistically significant but very weak positive relationship between the two variables (Lavetti, 2023). The intercept was approximately 16.77, and the coefficient for hours worked was 0.144 (standard error 0.046, p = 0.0018). This indicates that, on average, each additional hour worked per week is associated with a $0.14 increase in hourly wage. The model's R-squared value was 0.0083, implying that less than 1 percent of the variance in wages is explained by hours worked alone, and the residual standard error of $14.11 is large relative to the mean wage of $22.77. A scatter plot with a fitted regression line shows the slight upward slope, with wide scatter around the line.
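The fitted line can be read as a prediction rule, and the reported standard error gives a confidence interval for the slope. The sketch below reproduces both from the rounded values printed by `summary(model_simple)`; the variable names are illustrative, and the 40-hour week is simply the sample median of `hrswork`.

```r
# Coefficients and standard error copied from summary(model_simple)
b0     <- 16.76754   # intercept
b1     <- 0.14421    # slope for hrswork
se_b1  <- 0.04601    # standard error of the slope
df_res <- 1179       # residual degrees of freedom

# Fitted hourly wage for a 40-hour week
pred_40 <- b0 + b1 * 40

# 95% confidence interval for the slope: estimate +/- t quantile * SE
t_crit   <- qt(0.975, df = df_res)
slope_ci <- b1 + c(-1, 1) * t_crit * se_b1

pred_40             # 22.53594
round(slope_ci, 3)
```

The interval runs from roughly 0.05 to 0.23, so the slope is distinguishable from zero but small. With the fitted object available, `confint(model_simple)` and `predict(model_simple, newdata = data.frame(hrswork = 40))` give the same quantities without rounding.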

Multiple Linear Regression Model

Multiple Linear Regression

model_multi <- lm(wage ~ age + educ + gender + hrswork + union + insure + metro + nchild, data = data)
summary(model_multi)
## 
## Call:
## lm(formula = wage ~ age + educ + gender + hrswork + union + insure + 
##     metro + nchild, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.127  -7.363  -1.831   4.765  63.829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.42670    2.24140   1.975  0.04851 *  
## age          0.22133    0.02862   7.734 2.24e-14 ***
## educ         4.00097    0.24557  16.293  < 2e-16 ***
## genderMale  -5.53637    0.72382  -7.649 4.21e-14 ***
## hrswork     -0.04246    0.04088  -1.039  0.29917    
## union1       1.17507    1.02385   1.148  0.25133    
## insure1      3.97210    0.97050   4.093 4.55e-05 ***
## metro1       2.90744    0.92836   3.132  0.00178 ** 
## nchild       0.35225    0.31998   1.101  0.27120    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.02 on 1172 degrees of freedom
## Multiple R-squared:  0.2844, Adjusted R-squared:  0.2796 
## F-statistic: 58.23 on 8 and 1172 DF,  p-value: < 2.2e-16

VIF to check multicollinearity

vif(model_multi)
##      age     educ   gender  hrswork    union   insure    metro   nchild 
## 1.028983 1.099036 1.056401 1.087706 1.014294 1.108832 1.022378 1.013985
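All variance inflation factors reported above are close to 1, which suggests little multicollinearity among the predictors. To show what the statistic measures, the sketch below computes VIFs by hand on simulated, hypothetical predictors (`x1`, `x2`, `x3` are not variables from the wage data): each predictor is regressed on the others, and the VIF is 1 / (1 - R²) from that auxiliary regression.

```r
set.seed(334)
n  <- 300
# Hypothetical predictors: x2 is built to be correlated with x1, x3 is independent
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others
vif_manual <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 2)
```

In this simulation the correlated pair `x1` and `x2` should show noticeably larger VIFs than the independent `x3`, whereas in the wage model no predictor exceeds about 1.11.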

To build a more robust model, a multiple linear regression was conducted, incorporating several explanatory variables simultaneously (Lederer and Lederer, 2022). The final model included age, education, gender, hours worked, union membership, insurance access, metropolitan residence, and number of children as predictors of wage. This model aimed to identify the most influential factors on wage while controlling for potential confounding variables. The regression results showed that education had a strong, positive, and statistically significant effect on wages: each additional level of education was associated with an estimated $4.00 increase in hourly wage. Age (about $0.22 per year), insurance coverage (about $3.97), and metropolitan residence (about $2.91) were also positive and significant. Gender remained a significant predictor, with individuals coded as Male earning on average $5.54 less per hour than those coded as Female after controlling for the other variables. Union membership had a positive coefficient of about $1.18 per hour, but it was not statistically significant (p = 0.25). Hours worked, which was significant in the simple model, changed sign (-0.042) and was no longer significant (p = 0.30), and number of children was not significant either (p = 0.27). Overall, the model explained about 28 percent of the variance in wages (R-squared 0.284, adjusted R-squared 0.280), and the variance inflation factors, all between 1.01 and 1.11, gave no indication of multicollinearity.
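Significance in the coefficient table comes from the ratio of each estimate to its standard error. As a worked check, the sketch below recomputes the t statistics and two-sided p-values for three coefficients using the rounded values printed by `summary(model_multi)`; the vector names `est`, `se`, `t_val`, and `p_val` are illustrative.

```r
# Estimates and standard errors copied from summary(model_multi)
est <- c(union1 = 1.17507, hrswork = -0.04246, educ = 4.00097)
se  <- c(union1 = 1.02385, hrswork = 0.04088, educ = 0.24557)
df_res <- 1172   # residual degrees of freedom

# t statistic = estimate / standard error
t_val <- est / se

# Two-sided p-value from the t distribution with df_res degrees of freedom
p_val <- 2 * pt(-abs(t_val), df = df_res)

round(t_val, 3)
signif(p_val, 3)
```

The union and hours-worked estimates are only about one standard error from zero, which is why their p-values are far above 0.05, while the education estimate is more than 16 standard errors from zero.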

Model Diagnostics

par(mfrow = c(2, 2))
plot(model_multi)

par(mfrow = c(1, 1))

Standard regression diagnostic plots were used to assess the validity of the multiple linear regression model. The residuals versus fitted plot showed a roughly random pattern, suggesting approximately constant variance (Roustaei, 2024). The normal Q-Q plot indicated that residuals followed a near-normal distribution, although the residual summary (minimum -31.13, median -1.83, maximum 63.83) points to a longer right tail than left, in line with the right skew of the wage variable. The scale-location plot suggested no major heteroscedasticity, and the residuals versus leverage plot showed no data points with excessive influence on the model. Taken together, these diagnostic plots suggest that the assumptions of linear regression were reasonably met, so the model can be used for inference, with some caution regarding the upper tail of the residuals.
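Visual diagnostics can be complemented by formal tests. The sketch below is a minimal base-R illustration on simulated, hypothetical data (not the wage model): it applies the Shapiro-Wilk test to the residuals and computes a studentized Breusch-Pagan statistic by hand. The data-generating process, seed, and object names are assumptions chosen so that the noise spread grows with the predictor.

```r
set.seed(334)
n <- 500
x <- runif(n, 20, 60)
# Hypothetical data with noise whose spread grows with x (heteroscedastic)
y <- 5 + 0.4 * x + rnorm(n, sd = 0.1 * x)
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals (H0: residuals are normally distributed)
sw <- shapiro.test(residuals(fit))

# Studentized Breusch-Pagan test by hand: regress squared residuals on the
# predictor; under H0 (constant variance) n * R^2 is approximately chi-squared
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

sw$p.value
c(statistic = bp_stat, p.value = bp_p)
```

For the wage model, the same steps would use `residuals(model_multi)` with 8 degrees of freedom in the chi-squared reference (one per predictor term). The `car` package already loaded in this report also provides `ncvTest()` as a packaged score test for non-constant variance.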

Conclusion

This analysis found that education, age, gender, insurance coverage, and metropolitan residence are statistically significant predictors of wage, whereas union membership, hours worked, and number of children are not significant once the other variables are controlled for. Education had the most precisely estimated effect, at about $4.00 per hour for each additional level. The gender wage gap persisted after adjusting for other variables, at about $5.54 per hour in favor of the group coded as Female under the coding used here, which merits further investigation. The simple regression model showed that hours worked alone explained less than 1 percent of wage variation, while the multiple regression model explained about 28 percent and therefore gave a far more comprehensive picture of wage variability. The findings suggest that investment in education is associated with improved wage outcomes, while the evidence for a union wage premium in this sample is weak. However, this study also has limitations. The sample of 1,181 observations is modest, roughly 72 percent of the variance in wages remains unexplained, the estimates describe associations rather than causal effects, and important variables such as years of experience, industry type, or job role were not available.

References

Adda, J. and Dustmann, C., 2023. Sources of wage growth. Journal of Political Economy, 131(2), pp.456-503.

James, G., Witten, D., Hastie, T., Tibshirani, R. and Taylor, J., 2023. Linear regression. In An introduction to statistical learning: With applications in Python (pp. 69-134). Cham: Springer International Publishing.

Lavetti, K., 2023. Compensating wage differentials in labor markets: Empirical challenges and applications. Journal of Economic Perspectives, 37(3), pp.189-212.

Lederer, J. and Lederer, J., 2022. Linear regression. Fundamentals of high-dimensional statistics: With exercises and R labs, pp.37-79.

Roustaei, N., 2024. Application and interpretation of linear-regression analysis. Medical Hypothesis, Discovery and Innovation in Ophthalmology, 13(3), p.151.