This report investigates factors influencing individual wages using a real-world dataset provided by the Division of Economic Development and Forecasting, Louisiana State University. The data, collected in March 2013, includes demographic, educational, and employment-related variables such as age, gender, education level, hours worked, marital status, and union membership. The objective is to perform a comprehensive analysis through data exploration, probability assessments, and both simple and multiple linear regression models. By examining patterns within different age groups, this report aims to uncover meaningful relationships between variables and hourly wage outcomes, offering insights into the key drivers of income variation.
library(tidyverse)
library(dplyr)
library(ggplot2)
data <- read.csv("MA334-SP-7_2412507.csv")
data$gender <- factor(data$gender, levels = c(0, 1), labels = c("Male", "Female"))
data$educ <- factor(data$educ, levels = 0:5,
labels = c("High School", "College No Degree", "College Degree", "BA", "MA", "PhD"))
data$marital <- factor(data$marital, levels = c(0, 1, 2), labels = c("Single", "Married", "Divorced"))
data$insure <- factor(data$insure, levels = c(0, 1), labels = c("No", "Yes"))
data$union <- factor(data$union, levels = c(0, 1), labels = c("No", "Yes"))
data$metro <- factor(data$metro, levels = c(0, 1), labels = c("Non-Metro", "Metro"))
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)
summary(data)
## age educ gender hrswork insure
## Min. :17.00 High School :365 Male :659 Min. : 0.00 No :206
## 1st Qu.:32.00 College No Degree:208 Female:522 1st Qu.:40.00 Yes:975
## Median :43.00 College Degree :143 Median :40.00
## Mean :42.61 BA :304 Mean :41.61
## 3rd Qu.:52.00 MA :143 3rd Qu.:42.00
## Max. :77.00 PhD : 18 Max. :80.00
## metro nchild union wage race
## Non-Metro:208 Min. :0.0000 No :1019 Min. : 2.50 Asian: 65
## Metro :973 1st Qu.:0.0000 Yes: 162 1st Qu.:13.00 Black: 104
## Median :0.0000 Median :18.75 White:1012
## Mean :0.8061 Mean :22.77
## 3rd Qu.:2.0000 3rd Qu.:28.84
## Max. :9.0000 Max. :99.00
## marital region
## Single :324 midwest :309
## Married :713 northeast:215
## Divorced:144 south :385
## west :272
##
##
str(data)
## 'data.frame': 1181 obs. of 12 variables:
## $ age : int 29 45 39 30 42 47 62 57 21 69 ...
## $ educ : Factor w/ 6 levels "High School",..: 5 4 3 4 4 4 3 3 2 1 ...
## $ gender : Factor w/ 2 levels "Male","Female": 2 2 2 1 1 2 2 1 1 2 ...
## $ hrswork: int 40 45 40 45 60 45 40 48 40 40 ...
## $ insure : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ metro : Factor w/ 2 levels "Non-Metro","Metro": 2 2 2 2 1 2 2 2 2 2 ...
## $ nchild : int 2 3 1 0 3 0 1 0 0 0 ...
## $ union : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ wage : num 25.9 14.4 17.2 17.1 18.3 ...
## $ race : Factor w/ 3 levels "Asian","Black",..: 3 3 3 3 3 3 1 3 3 3 ...
## $ marital: Factor w/ 3 levels "Single","Married",..: 2 3 2 1 2 2 2 2 1 3 ...
## $ region : Factor w/ 4 levels "midwest","northeast",..: 3 3 1 2 4 4 2 4 4 4 ...
data %>% select(where(is.numeric)) %>% summary()
## age hrswork nchild wage
## Min. :17.00 Min. : 0.00 Min. :0.0000 Min. : 2.50
## 1st Qu.:32.00 1st Qu.:40.00 1st Qu.:0.0000 1st Qu.:13.00
## Median :43.00 Median :40.00 Median :0.0000 Median :18.75
## Mean :42.61 Mean :41.61 Mean :0.8061 Mean :22.77
## 3rd Qu.:52.00 3rd Qu.:42.00 3rd Qu.:2.0000 3rd Qu.:28.84
## Max. :77.00 Max. :80.00 Max. :9.0000 Max. :99.00
ggplot(data, aes(x = wage)) + geom_histogram(bins = 30, fill="lightblue") + ggtitle("Histogram of Wage")
ggplot(data, aes(x = gender, y = wage)) + geom_boxplot() + ggtitle("Wage by Gender")
ggplot(data, aes(x = educ, fill = educ)) + geom_bar() + ggtitle("Distribution of Education Levels")
## Correlation matrix
numeric_data <- data %>% select(age, hrswork, nchild, wage)
cor(numeric_data, use = "complete.obs")
## age hrswork nchild wage
## age 1.00000000 0.05585503 -0.05046348 0.21194887
## hrswork 0.05585503 1.00000000 0.06866293 0.09091083
## nchild -0.05046348 0.06866293 1.00000000 0.01655582
## wage 0.21194887 0.09091083 0.01655582 1.00000000
The dataset comprises 1,181 observations collected in March 2013, sourced from the Division of Economic Development and Forecasting, Louisiana State University. It includes 12 variables encompassing both numerical and categorical attributes such as age, wage, hours worked, education, gender, marital status, and insurance coverage. Relevant categorical variables were recoded as factors for interpretability in analysis (Zwysen, 2023).
Descriptive statistics indicate that the mean age of respondents is approximately 42.6 years, with wages averaging $22.77 per hour. The first quartile and median of weekly hours are both 40, and 56% of respondents had no children. High school was the most common education level (365 respondents, 31%), followed by a BA (304, 26%). The sample was 56% male and 44% female, and 82% of respondents lived in metropolitan areas.
Visualizations supported these insights: a histogram showed wages were right-skewed (the mean of $22.77 exceeds the median of $18.75), motivating a log transformation of wage in the regression models. Boxplots revealed gender-based wage differences, while a bar chart illustrated the distribution of education levels. A correlation matrix of the numerical variables indicated a modest positive relationship between age and wage (r ≈ 0.21), a weaker one between hours worked and wage (r ≈ 0.09), and essentially none between number of children and wage (r ≈ 0.02). These findings guided further probabilistic assessments and model-based investigations (Hassan et al., 2022).
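The effect of a log transformation on a right-skewed variable can be illustrated with a minimal sketch. The data below are simulated from a log-normal distribution rather than taken from the wage dataset, and `skewness` is a hypothetical helper defined here, not a base R function.

```r
# Simulated right-skewed "wages" (an assumption; not the report's data)
set.seed(1)
sim_wage <- rlnorm(1000, meanlog = 3, sdlog = 0.6)

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(sim_wage)       # clearly positive for right-skewed data
skew_log <- skewness(log(sim_wage))  # close to zero after the transformation
c(raw = skew_raw, log = skew_log)
```

A skewness near zero on the log scale is one reason a linear model for log(wage) tends to have better-behaved residuals than one for raw wage.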
p_not_insured <- mean(data$insure == "No")
1 - (1 - p_not_insured)^5
## [1] 0.6164927
married <- data[data$marital == "Married", ]
mean(married$nchild >= 1)
## [1] 0.6002805
nchild_table <- table(data$nchild)
nchild_probs <- prop.table(nchild_table)
mean_nchild <- mean(data$nchild)
var_nchild <- var(data$nchild)
p_nchild_ge3 <- mean(data$nchild >= 3)
# Build the table from plain vectors so the nchild value is not repeated in a second column
data.frame(nchild = names(nchild_table),
           Freq = as.vector(nchild_table),
           Proportion = round(as.vector(nchild_probs), 3))
##   nchild Freq Proportion
## 1      0  662      0.561
## 2      1  213      0.180
## 3      2  217      0.184
## 4      3   65      0.055
## 5      4   15      0.013
## 6      5    7      0.006
## 7      6    1      0.001
## 8      9    1      0.001
mean_nchild
## [1] 0.8060965
var_nchild
## [1] 1.21237
p_nchild_ge3
## [1] 0.07535986
Probability-based insights were extracted to better understand the dataset’s demographic structure. The sample proportion without private insurance is 206/1181 ≈ 17.4%. Treating five randomly selected individuals as independent draws with this probability, the probability that at least one of them is not privately insured is approximately 61.6%. Using conditional probability, it was found that among married individuals, 60.0% had at least one child, providing insight into typical family compositions.
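Written out, the two calculations take the following form, where $p$ denotes the sample proportion without private insurance and the five individuals are treated as independent draws (an assumption of this sketch, matching the calculation in the code above):

```latex
\begin{aligned}
p &= \frac{206}{1181} \approx 0.1744, \\
P(\text{at least one of five not insured}) &= 1 - (1 - p)^5 \approx 1 - 0.8256^5 \approx 0.6165, \\
P(\text{nchild} \ge 1 \mid \text{married}) &= \frac{\#\{\text{married and nchild} \ge 1\}}{\#\{\text{married}\}} \approx 0.6003 .
\end{aligned}
```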
The distribution of the number of own children (nchild) was examined. The mean number of children was 0.81, with a variance of approximately 1.21. A small but non-negligible portion (7.5%) of individuals had three or more children. These descriptive statistics highlight the right-skewed nature of the distribution: 56% of individuals had no children and 74% had at most one. Such probabilities and estimates form the foundation for deeper inferential analyses, allowing researchers to identify population patterns and model socioeconomic behavior effectively (Ramadan et al., 2024).
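As a cross-check, the mean, variance, and tail probability can be recovered from the frequency table alone. The sketch below copies the counts from the nchild table reported above; the variable names are illustrative.

```r
# Frequencies copied from the nchild table reported above
nchild_values <- c(0, 1, 2, 3, 4, 5, 6, 9)
nchild_counts <- c(662, 213, 217, 65, 15, 7, 1, 1)
n <- sum(nchild_counts)

# Mean as a frequency-weighted average
mean_from_table <- sum(nchild_values * nchild_counts) / n

# Sample variance (n - 1 denominator, matching var())
var_from_table <- sum(nchild_counts * (nchild_values - mean_from_table)^2) / (n - 1)

# Share of individuals with three or more children
p_ge3_from_table <- sum(nchild_counts[nchild_values >= 3]) / n

round(c(mean = mean_from_table, var = var_from_table, p_ge3 = p_ge3_from_table), 4)
```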
wage_2 <- data[data$nchild == 2, "wage"]
m2 <- mean(wage_2)
se2 <- sd(wage_2)/sqrt(length(wage_2))
c(m2, m2 - 1.96 * se2, m2 + 1.96 * se2)
## [1] 23.43355 21.59181 25.27529
wage_5plus <- data[data$nchild >= 5, "wage"]
if(length(wage_5plus) > 1){
m5 <- mean(wage_5plus)
se5 <- sd(wage_5plus)/sqrt(length(wage_5plus))
c(m5, m5 - 1.96 * se5, m5 + 1.96 * se5)
} else {
"Not enough data for confidence interval."
}
## [1] 12.752222 8.492841 17.011603
table_ig <- table(data$gender, data$insure)
table_ig
##
## No Yes
## Male 117 542
## Female 89 433
chisq.test(table_ig)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_ig
## X-squared = 0.0574, df = 1, p-value = 0.8107
To estimate the average hourly wage among individuals with two children, a sample mean and 95% confidence interval were computed using the normal approximation (mean ± 1.96 standard errors). The mean wage was approximately $23.43, with a confidence interval ranging from $21.59 to $25.28. This interval provides a plausible range for the population mean wage for this subgroup. For individuals with five or more children, the sample mean wage was notably lower at $12.75, and the same calculation gave an interval from $8.49 to $17.01. This group contains only nine individuals, however, so the normal critical value of 1.96 understates the uncertainty: a t critical value on 8 degrees of freedom (about 2.31) would give a wider interval, and inference for this group is imprecise in any case.
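For a group as small as the one with five or more children, a t-based interval is the more usual choice. The sketch below works only from the summary values printed above (the mean and the upper limit of the 1.96-based interval) and the group size taken from the nchild frequency table; it is an illustration, not a replacement for the reported results.

```r
# Values reported above for individuals with five or more children
m5      <- 12.752222   # sample mean
upper_z <- 17.011603   # upper limit of the 1.96-based interval
n5      <- 7 + 1 + 1   # counts for nchild = 5, 6, 9 in the frequency table

# Recover the standard error from the reported interval
se5 <- (upper_z - m5) / 1.96

# t critical value on n - 1 degrees of freedom
t_crit <- qt(0.975, df = n5 - 1)

ci_t <- c(lower = m5 - t_crit * se5, upper = m5 + t_crit * se5)
ci_t
```

With the original data available, `t.test(wage_5plus)$conf.int` returns this kind of interval directly.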
In examining the relationship between gender and private health insurance coverage, a chi-squared test of independence was conducted. The test produced a p-value of 0.81 (X² = 0.057, df = 1), indicating no statistically significant association between gender and insurance status. The insured share is similar for men (542 of 659, 82%) and women (433 of 522, 83%), so the data provide no evidence against the null hypothesis of independence; this is a failure to reject the null rather than proof that it holds (Neumark and Shirley, 2022).
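The test statistic can be reproduced by hand from the contingency table, which makes the role of the expected counts and the continuity correction explicit. This sketch uses only the four counts printed above; the object names are illustrative.

```r
# Counts copied from the gender-by-insurance table reported above
table_ig <- matrix(c(117, 89, 542, 433), nrow = 2,
                   dimnames = list(gender = c("Male", "Female"),
                                   insure = c("No", "Yes")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(table_ig), colSums(table_ig)) / sum(table_ig)

# Pearson statistic with Yates' continuity correction (the default for 2x2 tables)
x2 <- sum((abs(table_ig - expected) - 0.5)^2 / expected)
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

round(c(X_squared = x2, p_value = p_value), 4)
```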
young <- data[data$age < 35, ]
old <- data[data$age >= 35, ]
lm_young <- lm(log(wage) ~ age, data = young)
lm_old <- lm(log(wage) ~ age, data = old)
summary(lm_young)
##
## Call:
## lm(formula = log(wage) ~ age, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63005 -0.32110 -0.01201 0.31821 1.49042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594555 0.173214 9.206 < 2e-16 ***
## age 0.041382 0.006074 6.813 3.85e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.108
## F-statistic: 46.41 on 1 and 374 DF, p-value: 3.846e-11
summary(lm_old)
##
## Call:
## lm(formula = log(wage) ~ age, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91172 -0.39124 -0.04711 0.39679 1.54456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0795566 0.1157775 26.599 <2e-16 ***
## age -0.0005273 0.0023115 -0.228 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared: 6.479e-05, Adjusted R-squared: -0.00118
## F-statistic: 0.05203 on 1 and 803 DF, p-value: 0.8196
ggplot(young, aes(x = age, y = log(wage))) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
ggtitle("Young: log(wage) vs age")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(old, aes(x = age, y = log(wage))) +
geom_point() +
geom_smooth(method = "lm", col = "red") +
ggtitle("Old: log(wage) vs age")
## `geom_smooth()` using formula = 'y ~ x'
To assess the individual effect of age on wages, the dataset was divided into two subgroups: individuals under 35 years old (young, n = 376) and those aged 35 or older (old, n = 805). For each group, a simple linear regression model was fitted using log-transformed wages as the response variable and age as the sole predictor. In the younger cohort, results indicated a statistically significant positive relationship between age and log(wage) (β = 0.041, p < 0.001), meaning each additional year of age is associated with roughly 4% higher hourly wages. The R-squared value of 0.1104 shows that age explains only around 11% of the variance in log wages for this group, implying that while age matters, many other factors likely influence earnings (Kraft and Lyon, 2024).
For the older group, age was not a significant predictor (β = -0.0005, p = 0.82), and the R-squared value was near zero. These findings imply that age alone provides minimal explanatory power for wage variation among older workers. The limited predictive value of age alone supports the need for multiple regression analysis incorporating additional socioeconomic and demographic variables.
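Because the response is log(wage), a slope β corresponds to a multiplicative change of exp(β) in wage per additional year of age. The short sketch below converts the two reported slopes to percentage changes; `pct_change` is a hypothetical helper defined for this illustration.

```r
# Slope estimates reported in the two simple regressions above
beta_young <- 0.041382
beta_old   <- -0.0005273

# Convert a log-scale slope to a percentage change in wage per year of age
pct_change <- function(beta) 100 * (exp(beta) - 1)

round(c(young = pct_change(beta_young), old = pct_change(beta_old)), 2)
```

For small coefficients the result is close to 100 × β, which is why log-scale slopes are often read directly as approximate percentage effects.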
lm_young_full <- lm(log(wage) ~ . -wage, data = young)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
lm_old_full <- lm(log(wage) ~ . -wage, data = old)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
summary(lm_young_full)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36147 -0.25532 -0.00523 0.23962 1.25290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.857776 0.209955 8.848 < 2e-16 ***
## age 0.027433 0.006396 4.289 2.31e-05 ***
## educCollege No Degree 0.203154 0.067747 2.999 0.00290 **
## educCollege Degree 0.143600 0.075724 1.896 0.05872 .
## educBA 0.363079 0.063166 5.748 1.94e-08 ***
## educMA 0.559912 0.092469 6.055 3.56e-09 ***
## educPhD 0.651371 0.201380 3.235 0.00133 **
## genderFemale -0.198814 0.048244 -4.121 4.70e-05 ***
## hrswork -0.003007 0.002392 -1.257 0.20950
## insureYes 0.213487 0.053593 3.983 8.24e-05 ***
## metroMetro -0.002178 0.058798 -0.037 0.97047
## nchild -0.024028 0.027501 -0.874 0.38288
## unionYes 0.156249 0.074146 2.107 0.03579 *
## raceBlack -0.173138 0.120494 -1.437 0.15162
## raceWhite -0.099359 0.089509 -1.110 0.26773
## maritalMarried 0.071798 0.056804 1.264 0.20707
## maritalDivorced 0.090768 0.109676 0.828 0.40845
## regionnortheast 0.119318 0.067477 1.768 0.07787 .
## regionsouth 0.010853 0.059064 0.184 0.85431
## regionwest 0.038820 0.065535 0.592 0.55399
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4221 on 356 degrees of freedom
## Multiple R-squared: 0.3495, Adjusted R-squared: 0.3148
## F-statistic: 10.07 on 19 and 356 DF, p-value: < 2.2e-16
summary(lm_old_full)
##
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81215 -0.30916 0.01239 0.31634 1.36426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3336710 0.1802125 12.950 < 2e-16 ***
## age -0.0003227 0.0021561 -0.150 0.881081
## educCollege No Degree 0.0497378 0.0521099 0.954 0.340135
## educCollege Degree 0.1995416 0.0595573 3.350 0.000845 ***
## educBA 0.3560193 0.0478719 7.437 2.70e-13 ***
## educMA 0.6814279 0.0580195 11.745 < 2e-16 ***
## educPhD 0.8384954 0.1428711 5.869 6.47e-09 ***
## genderFemale -0.1775413 0.0356511 -4.980 7.82e-07 ***
## hrswork 0.0008388 0.0021508 0.390 0.696647
## insureYes 0.2481036 0.0531311 4.670 3.55e-06 ***
## metroMetro 0.1486570 0.0477692 3.112 0.001926 **
## nchild -0.0239424 0.0172757 -1.386 0.166170
## unionYes 0.0439623 0.0486716 0.903 0.366673
## raceBlack 0.0146457 0.1014394 0.144 0.885239
## raceWhite 0.0945473 0.0834354 1.133 0.257485
## maritalMarried 0.1045251 0.0536904 1.947 0.051914 .
## maritalDivorced 0.1291999 0.0641291 2.015 0.044278 *
## regionnortheast 0.0488733 0.0531199 0.920 0.357826
## regionsouth 0.0417798 0.0464567 0.899 0.368754
## regionwest 0.1292354 0.0504974 2.559 0.010676 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4866 on 785 degrees of freedom
## Multiple R-squared: 0.2905, Adjusted R-squared: 0.2733
## F-statistic: 16.92 on 19 and 785 DF, p-value: < 2.2e-16
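Multiple linear regression models were constructed separately for the ‘young’ (age < 35) and ‘old’ (age ≥ 35) groups, using log(wage) as the response variable and all other available variables (excluding raw wage) as predictors. Coefficients for categorical predictors are measured relative to the first level of each factor: high school, male, not insured, non-metro, non-union, Asian, single, and midwest. This approach allowed for analysis of how multiple factors jointly relate to wage outcomes.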
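In the young group, age, education level, gender, insurance coverage, and union membership were significant at the 5% level. Relative to high school, individuals with a BA, MA, or PhD earned significantly more (log-scale coefficients of 0.36, 0.56, and 0.65). Holding the other variables fixed, female workers earned about 18% less on average (β = -0.199), and union membership was associated with about 17% higher earnings (β = 0.156). The model yielded an adjusted R² of 0.3148, indicating a moderate fit (Frymer and Grumbach, 2021).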
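In the older group, education again played a strong role, and gender, insurance coverage, and metropolitan residence were also significant. Those living in metro areas earned more (β = 0.149, p = 0.002). For marital status, the divorced coefficient was significant at the 5% level (p = 0.044), while the married coefficient fell just short (p = 0.052); living in the west was also associated with higher wages (p = 0.011). Unlike in the young group, neither age nor union membership was significant. The adjusted R² for this model was 0.2733, slightly lower but still suggesting meaningful explanatory power. Both models outperformed their simple linear regression counterparts (adjusted R² of 0.108 and -0.001, respectively), confirming that wages are related to a combination of demographic, social, and economic variables.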
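The adjusted R² values penalize the 19 slope terms in each full model, which is what makes them comparable with the one-predictor models. As a check, they can be recomputed from the reported R², sample size, and number of predictors; `adj_r2` is a hypothetical helper written for this sketch.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values read from the summaries above: residual df plus the 20 estimated
# coefficients give n = 376 (young) and n = 805 (old); p = 19 slope terms
adj_young <- adj_r2(r2 = 0.3495, n = 376, p = 19)
adj_old   <- adj_r2(r2 = 0.2905, n = 805, p = 19)

round(c(young = adj_young, old = adj_old), 4)
```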
This report has systematically investigated the determinants of wages using a real-world dataset. The analysis included exploratory data techniques, probability-based inference, and both simple and multiple linear regression. Key findings suggest that education level, gender, and private insurance coverage are significantly associated with earnings in both age groups, while union membership is significant only among workers under 35 and metropolitan residence only among those aged 35 or older. Age itself is positively related to wages only in the younger group. The comparison of simple and full models illustrated the need for multivariate approaches in economic data analysis. Future work could incorporate interaction terms and model selection techniques for enhanced predictive accuracy and interpretation.
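One way the suggested extensions might look is a single pooled model in which the age slope is allowed to differ by age group, followed by AIC-based selection. The sketch below runs on simulated data, not the report's dataset, and every variable name in it (`sim`, `group`, `log_wage`) is an assumption made for illustration.

```r
# Simulated data (an assumption of this sketch; not the report's dataset)
set.seed(42)
n <- 500
sim <- data.frame(age = sample(18:65, n, replace = TRUE))
sim$group <- factor(ifelse(sim$age < 35, "young", "old"), levels = c("old", "young"))
# Log wage rises with age in the young group only
sim$log_wage <- 3 + ifelse(sim$group == "young", 0.04 * (sim$age - 35), 0) +
  rnorm(n, sd = 0.5)

# Pooled model with an age-by-group interaction: the interaction term
# estimates how much the age slope differs between the two groups
fit_int  <- lm(log_wage ~ age * group, data = sim)
fit_main <- lm(log_wage ~ age + group, data = sim)

# F-test for whether the interaction improves on the main-effects model
anova(fit_main, fit_int)

# AIC-based backward selection starting from the interaction model
fit_step <- step(fit_int, trace = 0)
formula(fit_step)
```

Compared with fitting two separate models, the pooled form gives a direct test of whether the age effect differs between groups.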
Frymer, P. and Grumbach, J.M., 2021. Labor unions and white racial politics. American Journal of Political Science, 65(1), pp.225-240.
Hassan, S.T., Batool, B., Zhu, B. and Khan, I., 2022. Environmental complexity of globalization, education, and income inequalities: New insights of energy poverty. Journal of Cleaner Production, 340, p.130735.
Kraft, M.A. and Lyon, M.A., 2024. The rise and fall of the teaching profession: Prestige, interest, preparation, and satisfaction over the last half century. American Educational Research Journal, 61(6), pp.1192-1236.
Neumark, D. and Shirley, P., 2022. Myth or measurement: What does the new minimum wage research say about minimum wages and job loss in the United States?. Industrial Relations: A Journal of Economy and Society, 61(4), pp.384-417.
Ramadan, A., Teguh, A., Roselina, A., Andriastuti, L. and Antriyandarti, E., 2024. The influence of regional minimum wages on unemployment rates in Indonesia: Multiple linear regression analysis. Economic Military and Geographically Business Review, 2(1), pp.1-15.
Zwysen, W., 2023. Global and institutional drivers of wage inequality between and within firms. Socio-Economic Review, 21(4), pp.2043-2068.