This report presents a statistical investigation into the factors that influence individual wages. Using a unique, student-specific dataset allocated for the MA334 module, the objective of this analysis is to determine which personal and socio-economic characteristics are associated with variations in wage levels. The analysis employs a structured methodology consisting of exploratory data analysis (EDA), descriptive statistics, and the application of both simple and multiple linear regression models. The dataset contains a range of demographic and employment-related variables, and this report focuses on interpreting the results of statistical modeling in the context of wage determination.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(knitr)
data <- read_csv("D:/software/training.csv")
## Rows: 1181 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): race, region
## dbl (10): age, educ, gender, hrswork, insure, metro, nchild, union, wage, ma...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(data)
## Rows: 1,181
## Columns: 12
## $ age <dbl> 29, 45, 39, 30, 42, 47, 62, 57, 21, 69, 32, 46, 23, 55, 40, 62…
## $ educ <dbl> 4, 3, 2, 3, 3, 3, 2, 2, 1, 0, 0, 4, 2, 4, 4, 3, 0, 2, 4, 0, 0,…
## $ gender <dbl> 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,…
## $ hrswork <dbl> 40, 45, 40, 45, 60, 45, 40, 48, 40, 40, 50, 60, 37, 40, 45, 40…
## $ insure <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,…
## $ metro <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ nchild <dbl> 2, 3, 1, 0, 3, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 2, 0, 0, 2,…
## $ union <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wage <dbl> 25.95, 14.44, 17.25, 17.09, 18.33, 22.64, 19.73, 17.96, 11.50,…
## $ race <chr> "White", "White", "White", "White", "White", "White", "Asian",…
## $ marital <dbl> 1, 2, 1, 0, 1, 1, 1, 1, 0, 2, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,…
## $ region <chr> "south", "south", "midwest", "northeast", "west", "west", "nor…
summary(data)
## age educ gender hrswork
## Min. :17.00 Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:32.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:40.00
## Median :43.00 Median :2.000 Median :0.000 Median :40.00
## Mean :42.61 Mean :1.751 Mean :0.442 Mean :41.61
## 3rd Qu.:52.00 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:42.00
## Max. :77.00 Max. :5.000 Max. :1.000 Max. :80.00
## insure metro nchild union
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.8256 Mean :0.8239 Mean :0.8061 Mean :0.1372
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :9.0000 Max. :1.0000
## wage race marital region
## Min. : 2.50 Length:1181 Min. :0.0000 Length:1181
## 1st Qu.:13.00 Class :character 1st Qu.:0.0000 Class :character
## Median :18.75 Mode :character Median :1.0000 Mode :character
## Mean :22.77 Mean :0.8476
## 3rd Qu.:28.84 3rd Qu.:1.0000
## Max. :99.00 Max. :2.0000
colSums(is.na(data))
## age educ gender hrswork insure metro nchild union wage race
## 0 0 0 0 0 0 0 0 0 0
## marital region
## 0 0
data <- data %>%
  mutate(
    gender = factor(gender, levels = c(0, 1), labels = c("Female", "Male")),
    insure = factor(insure),
    metro = factor(metro),
    union = factor(union),
    marital = factor(marital),
    region = factor(region),
    race = factor(race)
  )
The dataset contains 1,181 observations on 12 variables: age, education level, gender, hours worked per week, access to insurance, metropolitan residency status, number of children, union membership, wage, race, marital status, and region of residence (Adda and Dustmann, 2023). All observations in the dataset are unique and represent a sample from a larger population. The variable of interest, wage, is continuous and measures hourly earnings. The data were first examined for structural consistency and missing values. The missing-value count was zero for every column, so no imputation was needed. The binary indicators for gender, insurance status, union membership, and metro residence, together with the three-level marital status code, were converted from numeric values to categorical (factor) variables, and the character variables race and region were also converted to factors, so that the models treat them as categories rather than quantities. Age, education level (coded 0 to 5), hours worked, and number of children were left as numeric variables.
ggplot(data, aes(x = wage)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Wages", x = "Wage", y = "Frequency")
# Boxplot of wage by gender
ggplot(data, aes(x = gender, y = wage)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Wage by Gender")
# Boxplot of wage by region
ggplot(data, aes(x = region, y = wage)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Wage by Region")
The exploratory data analysis began by examining the summary statistics
of the dataset. Wages ranged from $2.50 to $99.00 per hour, with a mean
hourly wage of $22.77 and a median of $18.75. The mean exceeds the
median, so the distribution of wages is right-skewed: half of all
individuals earned between $13.00 and $28.84 per hour (the interquartile
range), while far fewer earned wages at the upper end of the
distribution. A histogram of the wage variable showed most observations
clustered below the mean, with a long, thin tail of higher-wage earners.
A boxplot of wages by gender illustrated a noticeable difference in
median and interquartile wage ranges between the two gender groups, with
the group labelled Female showing the higher median. This prompted
further analysis to determine whether the difference remained
significant when controlling for other factors.
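The skew described above can also be quantified rather than read off a histogram. The following is a minimal sketch on a small hypothetical wage vector (`wage_toy` is invented for illustration and is not the report's data); it computes a moment-based skewness and shows how a log transform typically reduces it.

```r
# Hypothetical toy wages (NOT the report's data): mostly modest values plus a few high earners
wage_toy <- c(9, 11, 12, 13, 15, 16, 18, 19, 22, 25, 31, 48, 95)

# Moment-based skewness: mean cubed deviation divided by the 1.5 power of the mean squared deviation
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(wage_toy)       # positive for a right-skewed variable
skew_log <- skewness(log(wage_toy))  # a log transform pulls in the right tail

c(mean = mean(wage_toy), median = median(wage_toy),
  skew_raw = skew_raw, skew_log = skew_log)
```

A mean above the median together with positive skewness is the same pattern seen in the wage summary, where the mean is 22.77 and the median is 18.75.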
data %>%
  group_by(gender) %>%
  summarise(
    mean_wage = mean(wage),
    median_wage = median(wage),
    count = n()
  ) %>%
  kable()
| gender | mean_wage | median_wage | count |
|---|---|---|---|
| Female | 24.47053 | 19.97 | 659 |
| Male | 20.61954 | 17.55 | 522 |
data %>%
  group_by(union) %>%
  summarise(mean_wage = mean(wage)) %>%
  kable()
| union | mean_wage |
|---|---|
| 0 | 22.48421 |
| 1 | 24.55599 |
Grouped summary statistics provided deeper insight into the relationships between key variables and wages. On average, individuals labelled Female earned approximately $24.47 per hour (median $19.97, n = 659), while those labelled Male earned about $20.62 per hour (median $17.55, n = 522) (James et al., 2023). This difference, visible in the gender-based boxplot, amounts to a gap of about $3.85 in mean hourly wage in favor of the Female group. Because the labels were assigned on the assumption that 0 denotes Female and 1 denotes Male, the direction of the gap depends on that coding being correct. The mean wage for union members was $24.56, higher than that of non-members, who earned approximately $22.48. This difference of about $2.07 is modest but consistent with existing labor economics literature suggesting union membership is associated with higher pay. Education also showed a positive relationship with wages: individuals with higher education levels earned more on average than those with lower levels. This upward trend in wages with increasing education level supports the assumption that education contributes to human capital and wage potential.
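The education comparison described above follows the same grouped-summary pattern as the gender and union tables. A minimal sketch, assuming a hypothetical data frame `toy` with invented values (not the report's dataset) and column names that mirror the report's `educ` and `wage`:

```r
library(dplyr)

# Hypothetical toy data (NOT the report's dataset): educ is an ordinal level code, wage is hourly
toy <- data.frame(
  educ = c(0, 0, 1, 1, 2, 2, 3, 3),
  wage = c(10, 12, 13, 15, 17, 19, 24, 28)
)

# Mean, median and group size of wage at each education level
wage_by_educ <- toy %>%
  group_by(educ) %>%
  summarise(
    mean_wage = mean(wage),
    median_wage = median(wage),
    count = n()
  )

wage_by_educ
```

Reporting the group size alongside each mean helps show whether an apparent trend rests on only a handful of observations at the extreme education levels.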
model_simple <- lm(wage ~ hrswork, data = data)
summary(model_simple)
##
## Call:
## lm(formula = wage ~ hrswork, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.518 -9.556 -3.786 5.464 77.185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.76754 1.95795 8.564 < 2e-16 ***
## hrswork 0.14421 0.04601 3.135 0.00176 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.11 on 1179 degrees of freedom
## Multiple R-squared: 0.008265, Adjusted R-squared: 0.007424
## F-statistic: 9.825 on 1 and 1179 DF, p-value: 0.001764
ggplot(data, aes(x = hrswork, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Wage vs Hours Worked")
## `geom_smooth()` using formula = 'y ~ x'
To further investigate the relationship between work hours and wages, a
simple linear regression model was fitted with wage as the dependent
variable and hours worked as the independent variable. The model
revealed a statistically significant positive relationship between the
two variables (p = 0.0018) (Lavetti, 2023). The intercept was
approximately 16.77, and the coefficient for hours worked was 0.144.
This indicates that, on average, each additional hour worked per week is
associated with a $0.14 increase in hourly wage. The model's R-squared
value was 0.0083, implying that less than 1 percent of the variance in
wages could be explained by the hours worked variable alone. A scatter
plot with a fitted regression line showed a positive but weak linear
relationship between hours worked and wage, consistent with the large
residual standard error of 14.11.
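As a worked illustration, the coefficients printed in the summary output of the simple model (rounded to 16.77 and 0.144) give the fitted line below, followed by the predicted hourly wage, in dollars, for a 40-hour week. The 40-hour value is chosen only because it is the median of hours worked.

$$
\widehat{\text{wage}} = 16.77 + 0.144 \times \text{hrswork}
$$

$$
\widehat{\text{wage}}\,\Big|_{\text{hrswork} = 40} = 16.77 + 0.144 \times 40 \approx 22.5
$$

Moving from 40 to 50 hours per week raises the prediction by only 0.144 × 10 = 1.44, which is small relative to the residual standard error of 14.11.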
model_multi <- lm(wage ~ age + educ + gender + hrswork + union + insure + metro + nchild, data = data)
summary(model_multi)
##
## Call:
## lm(formula = wage ~ age + educ + gender + hrswork + union + insure +
## metro + nchild, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.127 -7.363 -1.831 4.765 63.829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.42670 2.24140 1.975 0.04851 *
## age 0.22133 0.02862 7.734 2.24e-14 ***
## educ 4.00097 0.24557 16.293 < 2e-16 ***
## genderMale -5.53637 0.72382 -7.649 4.21e-14 ***
## hrswork -0.04246 0.04088 -1.039 0.29917
## union1 1.17507 1.02385 1.148 0.25133
## insure1 3.97210 0.97050 4.093 4.55e-05 ***
## metro1 2.90744 0.92836 3.132 0.00178 **
## nchild 0.35225 0.31998 1.101 0.27120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.02 on 1172 degrees of freedom
## Multiple R-squared: 0.2844, Adjusted R-squared: 0.2796
## F-statistic: 58.23 on 8 and 1172 DF, p-value: < 2.2e-16
vif(model_multi)
## age educ gender hrswork union insure metro nchild
## 1.028983 1.099036 1.056401 1.087706 1.014294 1.108832 1.022378 1.013985
To build a more robust model, a multiple linear regression was conducted, incorporating several explanatory variables simultaneously (Lederer and Lederer, 2022). The final model included age, education, gender, hours worked, union membership, insurance access, metropolitan residence, and number of children as predictors of wage. This model aimed to identify the most influential factors on wage while controlling for potential confounding variables. The regression results showed that education had a strong, positive, and statistically significant effect on wages: each additional level of education was associated with an estimated $4.00 increase in hourly wage (p < 0.001). Age was also significant, with each additional year associated with a $0.22 increase. Gender remained a significant predictor, with individuals labelled Male earning on average $5.54 less per hour than those labelled Female, even after controlling for other variables. Insurance access (+$3.97) and metropolitan residence (+$2.91) were both positively and significantly associated with wage. Union membership had a positive coefficient of about $1.18 per hour, but it was not statistically significant (p = 0.25), and neither was number of children (p = 0.27). Hours worked, which was significant in the simple model, was no longer significant once the other variables were included (coefficient -0.04, p = 0.30). Overall, the model explained about 28 percent of the variance in wages (adjusted R-squared 0.28), and all variance inflation factors were below 1.11, indicating no multicollinearity problem.
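Whether a term counts as significant can be read off a tidy coefficient table more easily than from the printed summary. The sketch below uses simulated data (the data frame `toy`, its effect sizes, and the seed are assumptions for illustration, not the report's dataset) and `broom::tidy()` with confidence intervals.

```r
library(broom)

# Hypothetical simulated data (NOT the report's dataset); names mirror the report's variables
set.seed(334)
n <- 200
toy <- data.frame(
  educ = sample(0:5, n, replace = TRUE),
  hrswork = round(rnorm(n, mean = 40, sd = 6))
)
# By construction, wage depends on educ but not on hrswork
toy$wage <- 10 + 4 * toy$educ + rnorm(n, sd = 5)

fit <- lm(wage ~ educ + hrswork, data = toy)

# One row per term: estimate, standard error, t statistic, p-value and 95% confidence interval
coef_table <- tidy(fit, conf.int = TRUE)
coef_table
```

A term is significant at the 5 percent level exactly when its 95 percent confidence interval excludes zero, so the `conf.low` and `conf.high` columns give the same verdict as the p-value while also showing the plausible size of the effect.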
par(mfrow = c(2, 2))
plot(model_multi)
par(mfrow = c(1, 1))
Standard regression diagnostics were used to assess the validity of the multiple linear regression model. The residuals versus fitted plot showed a roughly random pattern, suggesting approximately constant variance (Roustaei, 2024). The normal Q-Q plot suggested that the residuals were close to normal, although the residual summary (minimum -31.13, median -1.83, maximum 63.83) points to a heavier right tail, as would be expected for a right-skewed wage variable. The scale-location plot suggested no major heteroscedasticity, and the residuals versus leverage plot showed no data points with excessive influence on the model. Taken together, these diagnostic plots suggest that the assumptions of linear regression were adequately met for inference with a sample of this size, though a log transformation of wage could be considered to reduce the residual skew.
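Visual diagnostics can be complemented with formal tests. The sketch below is not part of the original analysis: it simulates data that is heteroscedastic by design (the variables `x` and `y` and the seed are assumptions for illustration) and applies `car::ncvTest()`, a score test for non-constant error variance, and the Shapiro-Wilk test on the residuals.

```r
library(car)

# Hypothetical simulated data (NOT the report's dataset) with error variance that grows with x
set.seed(334)
x <- runif(300, min = 1, max = 10)
y <- 5 + 2 * x + rnorm(300, sd = x)  # error sd proportional to x: heteroscedastic by design
fit <- lm(y ~ x)

# Score test for non-constant error variance; a small p-value suggests heteroscedasticity
bp <- ncvTest(fit)

# Shapiro-Wilk test on the residuals; a small p-value suggests departure from normality
sw <- shapiro.test(residuals(fit))

c(ncv_p = bp$p, shapiro_p = sw$p.value)
```

With large samples these tests can flag departures that are too small to matter in practice, so they are best read alongside the diagnostic plots rather than in place of them.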
This analysis found that education, age, gender, insurance access, and metropolitan residence are significant predictors of wage. Among these, education had the strongest association (about $4.00 per level, t = 16.3), followed by age and gender. The wage difference between the gender groups persisted even after adjusting for other variables, indicating systematic differences that merit further investigation. Union membership and hours worked were not significant in the multiple regression model. Hours worked was significant on its own in the simple model but explained less than 1 percent of the variance in wages, whereas the multiple regression model explained about 28 percent and provided a far more comprehensive understanding of wage variability. The findings suggest that investing in education may lead to improved wage outcomes, while the evidence here for a union wage premium is weak. However, this study also has limitations. The sample of 1,181 observations is moderate in size, the final model leaves roughly 72 percent of wage variance unexplained, and important variables such as years of experience, industry type, or job role were not available.
Adda, J. and Dustmann, C., 2023. Sources of wage growth. Journal of Political Economy, 131(2), pp.456-503.
James, G., Witten, D., Hastie, T., Tibshirani, R. and Taylor, J., 2023. Linear regression. In An introduction to statistical learning: With applications in Python (pp. 69-134). Cham: Springer International Publishing.
Lavetti, K., 2023. Compensating wage differentials in labor markets: Empirical challenges and applications. Journal of Economic Perspectives, 37(3), pp.189-212.
Lederer, J. and Lederer, J., 2022. Linear regression. Fundamentals of high-dimensional statistics: With exercises and R labs, pp.37-79.
Roustaei, N., 2024. Application and interpretation of linear-regression analysis. Medical Hypothesis, Discovery and Innovation in Ophthalmology, 13(3), p.151.