Introduction

This report investigates factors influencing individual wages using a real-world dataset provided by the Division of Economic Development and Forecasting, Louisiana State University. The data, collected in March 2013, includes demographic, educational, and employment-related variables such as age, gender, education level, hours worked, marital status, and union membership. The objective is to perform a comprehensive analysis through data exploration, probability assessments, and both simple and multiple linear regression models. By examining patterns within different age groups, this report aims to uncover meaningful relationships between variables and hourly wage outcomes, offering insights into the key drivers of income variation.

Library import

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

1. Data Exploration

data <- read.csv("MA334-SP-7_2412507.csv")

Factor conversion

data$gender <- factor(data$gender, levels = c(0, 1), labels = c("Male", "Female"))
data$educ <- factor(data$educ, levels = 0:5,
                    labels = c("High School", "College No Degree", "College Degree", "BA", "MA", "PhD"))
data$marital <- factor(data$marital, levels = c(0, 1, 2), labels = c("Single", "Married", "Divorced"))
data$insure <- factor(data$insure, levels = c(0, 1), labels = c("No", "Yes"))
data$union <- factor(data$union, levels = c(0, 1), labels = c("No", "Yes"))
data$metro <- factor(data$metro, levels = c(0, 1), labels = c("Non-Metro", "Metro"))
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)

Summary statistics

summary(data)
##       age                       educ        gender       hrswork      insure   
##  Min.   :17.00   High School      :365   Male  :659   Min.   : 0.00   No :206  
##  1st Qu.:32.00   College No Degree:208   Female:522   1st Qu.:40.00   Yes:975  
##  Median :43.00   College Degree   :143                Median :40.00            
##  Mean   :42.61   BA               :304                Mean   :41.61            
##  3rd Qu.:52.00   MA               :143                3rd Qu.:42.00            
##  Max.   :77.00   PhD              : 18                Max.   :80.00            
##        metro         nchild       union           wage          race     
##  Non-Metro:208   Min.   :0.0000   No :1019   Min.   : 2.50   Asian:  65  
##  Metro    :973   1st Qu.:0.0000   Yes: 162   1st Qu.:13.00   Black: 104  
##                  Median :0.0000              Median :18.75   White:1012  
##                  Mean   :0.8061              Mean   :22.77               
##                  3rd Qu.:2.0000              3rd Qu.:28.84               
##                  Max.   :9.0000              Max.   :99.00               
##      marital          region   
##  Single  :324   midwest  :309  
##  Married :713   northeast:215  
##  Divorced:144   south    :385  
##                 west     :272  
##                                
## 
str(data)
## 'data.frame':    1181 obs. of  12 variables:
##  $ age    : int  29 45 39 30 42 47 62 57 21 69 ...
##  $ educ   : Factor w/ 6 levels "High School",..: 5 4 3 4 4 4 3 3 2 1 ...
##  $ gender : Factor w/ 2 levels "Male","Female": 2 2 2 1 1 2 2 1 1 2 ...
##  $ hrswork: int  40 45 40 45 60 45 40 48 40 40 ...
##  $ insure : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ metro  : Factor w/ 2 levels "Non-Metro","Metro": 2 2 2 2 1 2 2 2 2 2 ...
##  $ nchild : int  2 3 1 0 3 0 1 0 0 0 ...
##  $ union  : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ wage   : num  25.9 14.4 17.2 17.1 18.3 ...
##  $ race   : Factor w/ 3 levels "Asian","Black",..: 3 3 3 3 3 3 1 3 3 3 ...
##  $ marital: Factor w/ 3 levels "Single","Married",..: 2 3 2 1 2 2 2 2 1 3 ...
##  $ region : Factor w/ 4 levels "midwest","northeast",..: 3 3 1 2 4 4 2 4 4 4 ...

Numeric summary

data %>% select(where(is.numeric)) %>% summary()  # select_if() is superseded in dplyr >= 1.0
##       age           hrswork          nchild            wage      
##  Min.   :17.00   Min.   : 0.00   Min.   :0.0000   Min.   : 2.50  
##  1st Qu.:32.00   1st Qu.:40.00   1st Qu.:0.0000   1st Qu.:13.00  
##  Median :43.00   Median :40.00   Median :0.0000   Median :18.75  
##  Mean   :42.61   Mean   :41.61   Mean   :0.8061   Mean   :22.77  
##  3rd Qu.:52.00   3rd Qu.:42.00   3rd Qu.:2.0000   3rd Qu.:28.84  
##  Max.   :77.00   Max.   :80.00   Max.   :9.0000   Max.   :99.00

Visualizations

ggplot(data, aes(x = wage)) + geom_histogram(bins = 30, fill="lightblue") + ggtitle("Histogram of Wage")

ggplot(data, aes(x = gender, y = wage)) + geom_boxplot() + ggtitle("Wage by Gender")

ggplot(data, aes(x = educ, fill = educ)) + geom_bar() + ggtitle("Distribution of Education Levels")

Correlation matrix

numeric_data <- data %>% select(age, hrswork, nchild, wage)
cor(numeric_data, use = "complete.obs")
##                 age    hrswork      nchild       wage
## age      1.00000000 0.05585503 -0.05046348 0.21194887
## hrswork  0.05585503 1.00000000  0.06866293 0.09091083
## nchild  -0.05046348 0.06866293  1.00000000 0.01655582
## wage     0.21194887 0.09091083  0.01655582 1.00000000

The dataset comprises 1,181 observations collected in March 2013, sourced from the Division of Economic Development and Forecasting, Louisiana State University. It includes 12 variables encompassing both numerical and categorical attributes such as age, wage, hours worked, education, gender, marital status, and insurance coverage. Relevant categorical variables were recoded as factors for interpretability in analysis (Zwysen, 2023).

Descriptive statistics indicate that the mean age of respondents is approximately 42.6 years, with wages averaging $22.77 per hour (median $18.75). The median respondent worked 40 hours per week, and 56% had no children. A high school diploma was the most common education level (365 of 1,181 respondents, about 31%), followed by a BA (304). Men made up 56% of the sample (659 versus 522 women), and 82% of participants lived in metropolitan areas.

Visualizations supported these insights: a histogram showed wages were right-skewed (the mean of $22.77 exceeds the median of $18.75), prompting a log transformation for regression modelling. The boxplot revealed gender-based wage differences, while the bar chart illustrated the education distribution. A correlation matrix of numerical variables indicated a modest positive relationship between age and wage (r ≈ 0.21) and weaker correlations of wage with hours worked (r ≈ 0.09) and number of children (r ≈ 0.02). These findings guided further probabilistic assessments and model-based investigations (Hassan et al., 2022).
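As a rough check on which of these correlations differ from zero, the reported coefficients can be converted to t statistics. The sketch below is not part of the original analysis: it uses only the sample size and the correlations printed above, and assumes the usual test of H0: rho = 0 with n - 2 degrees of freedom. On the original data frame, cor.test() would be the more direct route.

```r
# Sample size and correlations with wage, copied from the output above
n <- 1181
r <- c(age = 0.21194887, hrswork = 0.09091083, nchild = 0.01655582)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-values
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(cbind(r, t_stat, p_value), 4)
```

Under these assumptions the age and hours-worked correlations are distinguishable from zero at the 5% level, while the correlation between wage and number of children is not.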

2. Probability, Distributions & Confidence Intervals

Q1: P(at least 1 of 5 not insured)

p_not_insured <- mean(data$insure == "No")
1 - (1 - p_not_insured)^5
## [1] 0.6164927

Q2: P(nchild ≥ 1 | married)

married <- data[data$marital == "Married", ]
mean(married$nchild >= 1)
## [1] 0.6002805

Q3: Distribution of nchild

nchild_table <- table(data$nchild)
nchild_probs <- prop.table(nchild_table)
mean_nchild <- mean(data$nchild)
var_nchild <- var(data$nchild)
p_nchild_ge3 <- mean(data$nchild >= 3)

data.frame(nchild = names(nchild_table),
           Freq = as.vector(nchild_table),
           Proportion = round(as.vector(nchild_probs), 3))
##   nchild Freq Proportion
## 1      0  662      0.561
## 2      1  213      0.180
## 3      2  217      0.184
## 4      3   65      0.055
## 5      4   15      0.013
## 6      5    7      0.006
## 7      6    1      0.001
## 8      9    1      0.001
mean_nchild
## [1] 0.8060965
var_nchild
## [1] 1.21237
p_nchild_ge3
## [1] 0.07535986

Probability-based insights were extracted to better understand the dataset’s demographic structure. First, 17.4% of respondents (206 of 1,181) reported no private insurance; treating five selections as independent draws, the probability that at least one of the five is not privately insured is 1 − (1 − 0.174)^5 ≈ 61.6%. Second, using conditional probability, 60.0% of married individuals had at least one child, providing insight into typical family compositions.
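The complement rule used for the first probability can be reproduced from the counts in summary(data) alone. The sketch below is illustrative rather than part of the original analysis; the hypergeometric line is an added comparison that assumes the five individuals are drawn without replacement from the 1,181 observed respondents.

```r
n_total <- 1181      # observations in the dataset
n_uninsured <- 206   # insure == "No" in summary(data)
p <- n_uninsured / n_total

# Independent draws (sampling with replacement): complement rule
p_indep <- 1 - (1 - p)^5

# The same quantity expressed through the binomial distribution
p_binom <- 1 - dbinom(0, size = 5, prob = p)

# Sampling without replacement from the observed respondents
p_hyper <- 1 - dhyper(0, m = n_uninsured, n = n_total - n_uninsured, k = 5)

round(c(independent = p_indep, binomial = p_binom, hypergeometric = p_hyper), 4)
```

With a sample this large relative to the five draws, the with-replacement and without-replacement answers differ only slightly.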

The distribution of the number of own children (nchild) was examined. The mean number of children was 0.81, with a variance of approximately 1.21. A small but non-negligible portion (7.5%) of individuals had three or more children. These descriptive statistics highlight the right-skewed nature of the distribution, with about 74% of individuals having no children or one child. Such probabilities and estimates form the foundation for deeper inferential analyses, allowing researchers to identify population patterns and model socioeconomic behavior effectively (Ramadan et al., 2024).
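The mean, variance and tail probability quoted above follow directly from the frequency table. As a minimal sketch, assuming only the counts printed in that table, they can be recomputed without the raw data:

```r
# Values of nchild and their frequencies, copied from the table above
children <- c(0, 1, 2, 3, 4, 5, 6, 9)
freq <- c(662, 213, 217, 65, 15, 7, 1, 1)

n <- sum(freq)
probs <- freq / n

# Mean as a probability-weighted sum
mean_nchild <- sum(children * probs)

# Sample variance with an n - 1 denominator, to match var()
var_nchild <- sum(freq * (children - mean_nchild)^2) / (n - 1)

# P(nchild >= 3)
p_ge3 <- sum(probs[children >= 3])

c(n = n, mean = mean_nchild, variance = var_nchild, p_ge3 = p_ge3)
```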

3. Point Estimates, CIs & Hypothesis Test

CI for 2 children

wage_2 <- data[data$nchild == 2, "wage"]
m2 <- mean(wage_2)
se2 <- sd(wage_2)/sqrt(length(wage_2))
c(m2, m2 - 1.96 * se2, m2 + 1.96 * se2)
## [1] 23.43355 21.59181 25.27529

CI for 5+ children

wage_5plus <- data[data$nchild >= 5, "wage"]
if(length(wage_5plus) > 1){
  m5 <- mean(wage_5plus)
  se5 <- sd(wage_5plus)/sqrt(length(wage_5plus))
  c(m5, m5 - 1.96 * se5, m5 + 1.96 * se5)
} else {
  "Not enough data for confidence interval."
}
## [1] 12.752222  8.492841 17.011603

Insurance by gender

table_ig <- table(data$gender, data$insure)
table_ig
##         
##           No Yes
##   Male   117 542
##   Female  89 433
chisq.test(table_ig)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_ig
## X-squared = 0.0574, df = 1, p-value = 0.8107

To estimate the average hourly wage among individuals with two children, a sample mean and 95% confidence interval were computed. The mean wage was approximately $23.43, with a confidence interval ranging from $21.59 to $25.28. This interval provides a plausible range for the population mean wage for this subgroup. For individuals with five or more children, the sample mean wage was notably lower at $12.75, with a 95% confidence interval from $8.49 to $17.01. This group contains only nine observations, however, so the normal critical value of 1.96 understates the uncertainty (a t-based interval would be wider), and the estimate should be treated with caution.
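For the group with five or more children, a t critical value is arguably more appropriate than 1.96 because the group is so small. The sketch below is an illustration, not the original method: it recovers the standard error from the interval reported above and takes the group size (7 + 1 + 1 = 9) from the nchild frequency table.

```r
# Reported mean and normal-approximation interval for nchild >= 5
m5 <- 12.752222
lower_z <- 8.492841
upper_z <- 17.011603

# Standard error implied by the reported interval (half-width / 1.96)
se5 <- (upper_z - lower_z) / 2 / 1.96

# Group size from the frequency table: 7 + 1 + 1 individuals
n5 <- 7 + 1 + 1

# t critical value with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n5 - 1)

ci_t <- c(lower = m5 - t_crit * se5, upper = m5 + t_crit * se5)
round(ci_t, 2)
```

The t-based interval is wider than the reported one, roughly $7.7 to $17.8, which reinforces the caution about inference from nine observations.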

In examining the relationship between gender and private health insurance coverage, a chi-squared test of independence was conducted. About 17.8% of men (117 of 659) and 17.0% of women (89 of 522) were uninsured. The test produced a p-value of 0.81, indicating no statistically significant association between gender and insurance status. The data therefore provide no evidence against the null hypothesis of independence, although failing to reject it does not prove that gender and insurance status are unrelated (Neumark and Shirley, 2022).
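The test can be reproduced from the printed contingency table alone. The following sketch rebuilds the 2 x 2 table by hand (the dimension names are chosen here for readability) and shows the expected counts under independence next to the observed uninsured shares.

```r
# Observed counts copied from table_ig above (filled column by column)
table_ig <- matrix(c(117, 89, 542, 433), nrow = 2,
                   dimnames = list(gender = c("Male", "Female"),
                                   insure = c("No", "Yes")))

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default
test <- chisq.test(table_ig)

# Expected counts under independence
round(test$expected, 1)

# Share of each gender without private insurance
round(prop.table(table_ig, margin = 1)[, "No"], 3)

test$p.value
```

The observed counts sit within about two individuals of the expected counts, which is why the test statistic is so small.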

4. Simple Linear Regression

young <- data[data$age < 35, ]
old <- data[data$age >= 35, ]

lm_young <- lm(log(wage) ~ age, data = young)
lm_old <- lm(log(wage) ~ age, data = old)

summary(lm_young)
## 
## Call:
## lm(formula = log(wage) ~ age, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63005 -0.32110 -0.01201  0.31821  1.49042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.594555   0.173214   9.206  < 2e-16 ***
## age         0.041382   0.006074   6.813 3.85e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.108 
## F-statistic: 46.41 on 1 and 374 DF,  p-value: 3.846e-11
summary(lm_old)
## 
## Call:
## lm(formula = log(wage) ~ age, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91172 -0.39124 -0.04711  0.39679  1.54456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0795566  0.1157775  26.599   <2e-16 ***
## age         -0.0005273  0.0023115  -0.228     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared:  6.479e-05,  Adjusted R-squared:  -0.00118 
## F-statistic: 0.05203 on 1 and 803 DF,  p-value: 0.8196

Plot

ggplot(young, aes(x = age, y = log(wage))) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  ggtitle("Young: log(wage) vs age")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(old, aes(x = age, y = log(wage))) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("Old: log(wage) vs age")
## `geom_smooth()` using formula = 'y ~ x'

To assess the individual effect of age on wages, the dataset was divided into two subgroups: individuals under 35 years old (young) and those aged 35 or older (old). For each group, a simple linear regression model was fitted using log-transformed wages as the response variable and age as the sole predictor. In the younger cohort, results indicated a statistically significant positive relationship between age and log(wage) (β = 0.041, p < 0.001): each additional year of age is associated with roughly 4.2% higher hourly wages. The R-squared value of 0.1104 suggests that age explains only around 11% of the variance in log wages for this group, implying that while age matters, many other factors likely influence earnings (Kraft and Lyon, 2024).

For the older group, age was not a significant predictor (β = -0.0005, p = 0.82), and the R-squared value was near zero, so age provides essentially no explanatory power for wage variation among workers aged 35 or older. The contrast between the two groups, together with the low R-squared values, supports the need for multiple regression analysis incorporating additional socioeconomic and demographic variables.
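Because the response is log(wage), a slope is easier to read as a percentage change in wage per additional year of age, 100 * (exp(β) - 1). The sketch below is an added illustration that uses only the estimates, standard errors and residual degrees of freedom printed in the two summaries above; on the fitted objects, confint(lm_young) and confint(lm_old) would give the intervals directly.

```r
# Age coefficients from summary(lm_young) and summary(lm_old)
coefs <- data.frame(
  group = c("young", "old"),
  estimate = c(0.041382, -0.0005273),
  se = c(0.006074, 0.0023115),
  df = c(374, 803)
)

# 95% confidence limits for each slope
t_crit <- qt(0.975, coefs$df)
coefs$lower <- coefs$estimate - t_crit * coefs$se
coefs$upper <- coefs$estimate + t_crit * coefs$se

# Percentage change in wage per additional year of age
coefs$pct_per_year <- 100 * (exp(coefs$estimate) - 1)
coefs$pct_lower <- 100 * (exp(coefs$lower) - 1)
coefs$pct_upper <- 100 * (exp(coefs$upper) - 1)

coefs
```

The interval for the young group lies entirely above zero, while the interval for the old group straddles zero, which matches the p-values reported in the summaries.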

5. Multiple Linear Regression

lm_young_full <- lm(log(wage) ~ . -wage, data = young)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
lm_old_full <- lm(log(wage) ~ . -wage, data = old)
## Warning in terms.formula(formula, data = data): 'varlist' has changed (from
## nvar=12) to new 13 after EncodeVars() -- should no longer happen!
summary(lm_young_full)
## 
## Call:
## lm(formula = log(wage) ~ . - wage, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36147 -0.25532 -0.00523  0.23962  1.25290 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.857776   0.209955   8.848  < 2e-16 ***
## age                    0.027433   0.006396   4.289 2.31e-05 ***
## educCollege No Degree  0.203154   0.067747   2.999  0.00290 ** 
## educCollege Degree     0.143600   0.075724   1.896  0.05872 .  
## educBA                 0.363079   0.063166   5.748 1.94e-08 ***
## educMA                 0.559912   0.092469   6.055 3.56e-09 ***
## educPhD                0.651371   0.201380   3.235  0.00133 ** 
## genderFemale          -0.198814   0.048244  -4.121 4.70e-05 ***
## hrswork               -0.003007   0.002392  -1.257  0.20950    
## insureYes              0.213487   0.053593   3.983 8.24e-05 ***
## metroMetro            -0.002178   0.058798  -0.037  0.97047    
## nchild                -0.024028   0.027501  -0.874  0.38288    
## unionYes               0.156249   0.074146   2.107  0.03579 *  
## raceBlack             -0.173138   0.120494  -1.437  0.15162    
## raceWhite             -0.099359   0.089509  -1.110  0.26773    
## maritalMarried         0.071798   0.056804   1.264  0.20707    
## maritalDivorced        0.090768   0.109676   0.828  0.40845    
## regionnortheast        0.119318   0.067477   1.768  0.07787 .  
## regionsouth            0.010853   0.059064   0.184  0.85431    
## regionwest             0.038820   0.065535   0.592  0.55399    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4221 on 356 degrees of freedom
## Multiple R-squared:  0.3495, Adjusted R-squared:  0.3148 
## F-statistic: 10.07 on 19 and 356 DF,  p-value: < 2.2e-16
summary(lm_old_full)
## 
## Call:
## lm(formula = log(wage) ~ . - wage, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81215 -0.30916  0.01239  0.31634  1.36426 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.3336710  0.1802125  12.950  < 2e-16 ***
## age                   -0.0003227  0.0021561  -0.150 0.881081    
## educCollege No Degree  0.0497378  0.0521099   0.954 0.340135    
## educCollege Degree     0.1995416  0.0595573   3.350 0.000845 ***
## educBA                 0.3560193  0.0478719   7.437 2.70e-13 ***
## educMA                 0.6814279  0.0580195  11.745  < 2e-16 ***
## educPhD                0.8384954  0.1428711   5.869 6.47e-09 ***
## genderFemale          -0.1775413  0.0356511  -4.980 7.82e-07 ***
## hrswork                0.0008388  0.0021508   0.390 0.696647    
## insureYes              0.2481036  0.0531311   4.670 3.55e-06 ***
## metroMetro             0.1486570  0.0477692   3.112 0.001926 ** 
## nchild                -0.0239424  0.0172757  -1.386 0.166170    
## unionYes               0.0439623  0.0486716   0.903 0.366673    
## raceBlack              0.0146457  0.1014394   0.144 0.885239    
## raceWhite              0.0945473  0.0834354   1.133 0.257485    
## maritalMarried         0.1045251  0.0536904   1.947 0.051914 .  
## maritalDivorced        0.1291999  0.0641291   2.015 0.044278 *  
## regionnortheast        0.0488733  0.0531199   0.920 0.357826    
## regionsouth            0.0417798  0.0464567   0.899 0.368754    
## regionwest             0.1292354  0.0504974   2.559 0.010676 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4866 on 785 degrees of freedom
## Multiple R-squared:  0.2905, Adjusted R-squared:  0.2733 
## F-statistic: 16.92 on 19 and 785 DF,  p-value: < 2.2e-16

Multiple linear regression models were constructed separately for the ‘young’ (age < 35) and ‘old’ (age ≥ 35) groups, using log(wage) as the response variable and all other available variables—excluding raw wage—as predictors. This approach allowed for analysis of how multiple factors jointly influence wage outcomes.

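In the young group, age, education level, gender, insurance coverage, and union membership were significant at the 5% level. Relative to the high school baseline, individuals with some college but no degree, a BA, an MA, or a PhD earned significantly more. Female workers had lower wages on average (β = -0.199), while private insurance (β = 0.213) and union membership (β = 0.156) were associated with higher earnings. The model yielded an adjusted R² of 0.3148, indicating a moderate fit (Frymer and Grumbach, 2021).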
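Coefficients on indicator variables in a log(wage) model can be converted to approximate percentage wage differences relative to the baseline category with 100 * (exp(β) - 1). The sketch below applies this to a handful of coefficients copied from summary(lm_young); the selection of coefficients is illustrative only.

```r
# Selected coefficients from summary(lm_young)
beta_young <- c(genderFemale = -0.198814,
                insureYes = 0.213487,
                unionYes = 0.156249,
                educBA = 0.363079,
                educMA = 0.559912)

# Percentage difference in wage relative to the baseline category
pct_effect <- 100 * (exp(beta_young) - 1)

round(data.frame(beta = beta_young, pct_effect), 1)
```

For small coefficients the percentage is close to 100 * β, but the gap grows with the size of the coefficient: the MA coefficient of 0.56 corresponds to wages about 75% higher than the high school baseline, and the female coefficient to wages about 18% lower, holding the other predictors fixed.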

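In the older group, education again played a strong role: every level from a college degree upward carried a significant premium over high school. Gender and insurance coverage remained significant, while metropolitan residence (β = 0.149, p = 0.002) and living in the West (β = 0.129, p = 0.011) emerged as influential. Divorced individuals earned significantly more than single individuals (p = 0.044), whereas the married coefficient was only marginally significant (p = 0.052); age and union membership were not significant. The adjusted R² for this model was 0.2733, slightly lower but still suggesting meaningful explanatory power. Both models outperformed their simple linear regression counterparts (adjusted R² of 0.108 and approximately zero), confirming that wages are associated with a complex combination of demographic, social, and economic variables.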
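The adjusted R² values penalise the 19 estimated slope coefficients in each model. As a sketch using only figures printed in the two summaries (multiple R², number of predictors, residual degrees of freedom), both the adjusted R² and the overall F statistic can be recovered with the standard formulas:

```r
# Figures copied from summary(lm_young_full) and summary(lm_old_full)
fit <- data.frame(
  model = c("young", "old"),
  r2 = c(0.3495, 0.2905),
  p = c(19, 19),            # slope coefficients, excluding the intercept
  resid_df = c(356, 785)    # residual degrees of freedom, n - p - 1
)

# Sample size implied by the residual degrees of freedom
fit$n <- fit$resid_df + fit$p + 1

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
fit$adj_r2 <- 1 - (1 - fit$r2) * (fit$n - 1) / fit$resid_df

# Overall F statistic: (R^2 / p) / ((1 - R^2) / (n - p - 1))
fit$f_stat <- (fit$r2 / fit$p) / ((1 - fit$r2) / fit$resid_df)

fit
```

The recomputed values agree with the printed summaries up to rounding of the reported R².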

Conclusion

This report has systematically investigated the determinants of wages using a real-world dataset. The analysis included exploratory data techniques, probability-based inference, and both simple and multiple linear regression. Key findings suggest that education level, gender, and insurance coverage are significantly associated with earnings in both age groups, while union membership matters mainly for younger workers and metropolitan residence for older workers. The use of both simple and full models illustrated the need for multivariate approaches in economic data analysis. Future work could incorporate interaction terms and model selection techniques for enhanced predictive accuracy and interpretation.

References

Frymer, P. and Grumbach, J.M., 2021. Labor unions and white racial politics. American Journal of Political Science, 65(1), pp.225-240.

Hassan, S.T., Batool, B., Zhu, B. and Khan, I., 2022. Environmental complexity of globalization, education, and income inequalities: New insights of energy poverty. Journal of Cleaner Production, 340, p.130735.

Kraft, M.A. and Lyon, M.A., 2024. The rise and fall of the teaching profession: Prestige, interest, preparation, and satisfaction over the last half century. American Educational Research Journal, 61(6), pp.1192-1236.

Neumark, D. and Shirley, P., 2022. Myth or measurement: What does the new minimum wage research say about minimum wages and job loss in the United States? Industrial Relations: A Journal of Economy and Society, 61(4), pp.384-417.

Ramadan, A., Teguh, A., Roselina, A., Andriastuti, L. and Antriyandarti, E., 2024. The influence of regional minimum wages on unemployment rates in Indonesia: Multiple linear regression analysis. Economic Military and Geographically Business Review, 2(1), pp.1-15.

Zwysen, W., 2023. Global and institutional drivers of wage inequality between and within firms. Socio-Economic Review, 21(4), pp.2043-2068.