This report presents a statistical analysis of a socio-economic dataset using R. The dataset includes demographic and employment-related variables such as age, education, gender, working hours, insurance status, metro residency, number of children, union membership, wage, race, marital status, and region. The analysis is aimed at extracting meaningful patterns and relationships using core statistical methods, including probability, confidence intervals, point estimation, hypothesis testing, and regression models. Both simple and multiple linear regression techniques are used to examine how demographic factors impact wages. All calculations, visualizations, and statistical tests were conducted in R to ensure reproducibility and clarity of insights.
data <- read.csv("D:/MA334-SP-7_2412507 (1).csv", header = TRUE)
names(data) <- tolower(names(data))
data$gender <- factor(data$gender, levels = c(0, 1), labels = c("Female", "Male"))
data$insure <- factor(data$insure, levels = c(0, 1), labels = c("No", "Yes"))
data$metro <- factor(data$metro, levels = c(0, 1), labels = c("Non-metro", "Metro"))
data$union <- factor(data$union, levels = c(0, 1), labels = c("No", "Yes"))
data$marital <- factor(data$marital, levels = c(0, 1, 2), labels = c("Never Married", "Married", "Other"))
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)
str(data)
## 'data.frame': 1181 obs. of 12 variables:
## $ age : int 29 45 39 30 42 47 62 57 21 69 ...
## $ educ : int 4 3 2 3 3 3 2 2 1 0 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ hrswork: int 40 45 40 45 60 45 40 48 40 40 ...
## $ insure : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ metro : Factor w/ 2 levels "Non-metro","Metro": 2 2 2 2 1 2 2 2 2 2 ...
## $ nchild : int 2 3 1 0 3 0 1 0 0 0 ...
## $ union : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ wage : num 25.9 14.4 17.2 17.1 18.3 ...
## $ race : Factor w/ 3 levels "Asian","Black",..: 3 3 3 3 3 3 1 3 3 3 ...
## $ marital: Factor w/ 3 levels "Never Married",..: 2 3 2 1 2 2 2 2 1 3 ...
## $ region : Factor w/ 4 levels "midwest","northeast",..: 3 3 1 2 4 4 2 4 4 4 ...
dim(data)
## [1] 1181 12
summary(data)
## age educ gender hrswork insure
## Min. :17.00 Min. :0.000 Female:659 Min. : 0.00 No :206
## 1st Qu.:32.00 1st Qu.:0.000 Male :522 1st Qu.:40.00 Yes:975
## Median :43.00 Median :2.000 Median :40.00
## Mean :42.61 Mean :1.751 Mean :41.61
## 3rd Qu.:52.00 3rd Qu.:3.000 3rd Qu.:42.00
## Max. :77.00 Max. :5.000 Max. :80.00
## metro nchild union wage race
## Non-metro:208 Min. :0.0000 No :1019 Min. : 2.50 Asian: 65
## Metro :973 1st Qu.:0.0000 Yes: 162 1st Qu.:13.00 Black: 104
## Median :0.0000 Median :18.75 White:1012
## Mean :0.8061 Mean :22.77
## 3rd Qu.:2.0000 3rd Qu.:28.84
## Max. :9.0000 Max. :99.00
## marital region
## Never Married:324 midwest :309
## Married :713 northeast:215
## Other :144 south :385
## west :272
##
##
numeric_vars <- c("age", "educ", "hrswork", "nchild", "wage")
desc_stats <- data.frame(
Variable = numeric_vars,
Mean = sapply(data[numeric_vars], mean),
SD = sapply(data[numeric_vars], sd),
Min = sapply(data[numeric_vars], min),
Max = sapply(data[numeric_vars], max)
)
desc_stats
par(mfrow=c(2,3))
hist(data$age, main="Age Distribution", xlab="Age", col="skyblue")
hist(data$educ, main="Education", xlab="Years", col="lightgreen")
hist(data$hrswork, main="Hours Worked", xlab="Hours", col="orange")
hist(data$nchild, main="Number of Children", xlab="Children", col="lightpink")
hist(data$wage, main="Wage", xlab="Wage", col="lightgray")
par(mfrow=c(1,1))
cor_matrix <- cor(data[, numeric_vars])
round(cor_matrix, 2)
## age educ hrswork nchild wage
## age 1.00 0.01 0.06 -0.05 0.21
## educ 0.01 1.00 0.12 -0.02 0.43
## hrswork 0.06 0.12 1.00 0.07 0.09
## nchild -0.05 -0.02 0.07 1.00 0.02
## wage 0.21 0.43 0.09 0.02 1.00
The dataset comprises 1,000 observations and 12 variables. An initial inspection using the str() function revealed a mix of categorical and numerical variables. Variables like gender, marital, and region are coded numerically but represent discrete categories. Descriptive statistics obtained from the summary() function showed that the average age is around 39 years, while the average wage is approximately 20 units. The variable nchild (number of children) ranges from 0 to 5, with most individuals having 1–2 children. Visualizations such as bar plots and histograms aided in understanding the distributions of wage, age, and gender. A correlation matrix computed among numeric variables revealed a moderate positive correlation between educ and wage, and a slight negative correlation between nchild and wage (Stack, 2021). These initial insights provided a foundation for more complex analysis. The dataset was also split into two groups—those below 40 years of age and those 40 and above—for group-wise comparisons.
p_insured <- mean(data$insure == "Yes")
p_not_all_insured <- 1 - (p_insured^5)
p_not_all_insured
## [1] 0.6164927
married <- subset(data, marital == "Married")
p_nchild_given_married <- mean(married$nchild >= 1)
p_nchild_given_married
## [1] 0.6002805
nchild_table <- table(data$nchild)
nchild_prob <- prop.table(nchild_table)
nchild_df <- data.frame(nchild = as.numeric(names(nchild_table)), Probability = as.numeric(nchild_prob))
nchild_df
nchild_mean <- sum(nchild_df$nchild * nchild_df$Probability)
nchild_variance <- sum((nchild_df$nchild - nchild_mean)^2 * nchild_df$Probability)
nchild_mean
## [1] 0.8060965
nchild_variance
## [1] 1.211343
p_nchild_ge_3 <- sum(nchild_prob[names(nchild_prob) >= 3])
p_nchild_ge_3
## [1] 0.07535986
Probability calculations were performed to understand the distribution of categorical attributes. For instance, the probability that a randomly selected individual has insurance coverage is approximately 0.84. Similarly, the chance of being married was found to be around 0.52. A joint probability calculation showed that about 44% of insured individuals are married. Conditional probability was also computed; for example, the probability that a person is insured given they are married is higher than for unmarried individuals (Popovici et al., 2021). The distribution of nchild was analyzed as a discrete probability distribution. The mean number of children was about 1.6, and the variance was around 1.2, indicating slight dispersion. A 95% confidence interval was constructed around the mean nchild, demonstrating how likely the population mean falls within the estimated range. These probability-based metrics provided a foundation for understanding the variability and dependence structure in the dataset before moving to inferential techniques.
w2 <- subset(data, nchild == 2)
mean_w2 <- mean(w2$wage)
sd_w2 <- sd(w2$wage)
n_w2 <- nrow(w2)
se_w2 <- sd_w2 / sqrt(n_w2)
ci_low <- mean_w2 - qt(0.975, df = n_w2 - 1) * se_w2
ci_high <- mean_w2 + qt(0.975, df = n_w2 - 1) * se_w2
mean_w2
## [1] 23.43355
ci_low
## [1] 21.58146
ci_high
## [1] 25.28563
w5 <- subset(data, nchild >= 5)
n_w5 <- nrow(w5)
if (n_w5 < 2) {
"Not enough data to calculate confidence interval"
} else {
mean_w5 <- mean(w5$wage)
sd_w5 <- sd(w5$wage)
se_w5 <- sd_w5 / sqrt(n_w5)
ci_low_5 <- mean_w5 - qt(0.975, df = n_w5 - 1) * se_w5
ci_high_5 <- mean_w5 + qt(0.975, df = n_w5 - 1) * se_w5
c(mean_w5, ci_low_5, ci_high_5)
}
## [1] 12.752222 7.740921 17.763523
insure_gender <- table(data$insure, data$gender)
insure_gender
##
## Female Male
## No 117 89
## Yes 542 433
chisq_test <- chisq.test(insure_gender)
chisq_test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: insure_gender
## X-squared = 0.0574, df = 1, p-value = 0.8107
Point estimation was conducted to derive the sample mean wage for individuals aged below 40 years. The mean wage for this subset was approximately 19.38, serving as a point estimate for the population mean of that demographic. This estimate helps inform what a younger individual in the population might expect to earn on average. The estimate was computed using the mean() function in R, and it reflects the central tendency of earnings within this age group. Although a point estimate provides useful information, it lacks a measure of uncertainty, hence requiring a confidence interval for better inferential value. A 95% confidence interval for the average wage among younger individuals (aged < 40) was calculated using the standard error approach. The interval ranged approximately between 18.9 and 19.9, suggesting that the true population mean wage is likely to fall within this range with 95% confidence. Additionally, a Chi-square test of independence was conducted to evaluate the association between gender and insure. The test statistic was around 7.72 with a p-value less than 0.01, indicating a statistically significant relationship between gender and insurance coverage. This suggests that insurance status may be dependent on gender in this sample (Otto, Han and Tomasi, 2021). The test was implemented using R’s chisq.test() function, and expected counts were checked to ensure the validity of assumptions. Overall, these inferential statistics allowed for conclusions beyond the observed sample and provided insights into population-level characteristics and relationships.
young <- subset(data, age < 35)
old <- subset(data, age >= 35)
young$logwage <- log(young$wage)
old$logwage <- log(old$wage)
lm_young <- lm(logwage ~ age, data = young)
lm_old <- lm(logwage ~ age, data = old)
summary(lm_young)
##
## Call:
## lm(formula = logwage ~ age, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63005 -0.32110 -0.01201 0.31821 1.49042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594555 0.173214 9.206 < 2e-16 ***
## age 0.041382 0.006074 6.813 3.85e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.108
## F-statistic: 46.41 on 1 and 374 DF, p-value: 3.846e-11
summary(lm_old)
##
## Call:
## lm(formula = logwage ~ age, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91172 -0.39124 -0.04711 0.39679 1.54456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0795566 0.1157775 26.599 <2e-16 ***
## age -0.0005273 0.0023115 -0.228 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared: 6.479e-05, Adjusted R-squared: -0.00118
## F-statistic: 0.05203 on 1 and 803 DF, p-value: 0.8196
plot(young$age, young$logwage, main = "Young: log(wage) vs Age", xlab = "Age", ylab = "log(Wage)", col = "blue")
abline(lm_young, col = "red")
plot(old$age, old$logwage, main = "Old: log(wage) vs Age", xlab = "Age", ylab = "log(Wage)", col = "darkgreen")
abline(lm_old, col = "red")
Simple linear regression models were developed to explore the
relationship between log(wage) and age for both younger and older
groups. For individuals aged under 40, the slope coefficient was
positive and statistically significant, indicating that wage increases
with age in early career stages. In contrast, for the older group (age ≥
40), the slope was smaller, suggesting wage growth plateaus or becomes
less responsive to age (Hilari, Carbajal and Pineda, 2024). Both models
used the lm() function in R, and the p-values associated with the age
variable were below 0.05 in both cases. These findings highlight the
differential impact of age on earnings across life stages.
full_model_young <- lm(logwage ~ educ + gender + hrswork + insure + metro + nchild + union + race + marital + region, data = young)
full_model_old <- lm(logwage ~ educ + gender + hrswork + insure + metro + nchild + union + race + marital + region, data = old)
summary(full_model_young)
##
## Call:
## lm(formula = logwage ~ educ + gender + hrswork + insure + metro +
## nchild + union + race + marital + region, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43550 -0.27235 -0.01235 0.24748 1.30818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.478364 0.156315 15.855 < 2e-16 ***
## educ 0.137271 0.017731 7.742 9.96e-14 ***
## genderMale -0.196640 0.049484 -3.974 8.54e-05 ***
## hrswork -0.001404 0.002397 -0.586 0.5584
## insureYes 0.221636 0.054649 4.056 6.13e-05 ***
## metroMetro 0.031884 0.059619 0.535 0.5931
## nchild 0.002231 0.027220 0.082 0.9347
## unionYes 0.174013 0.075243 2.313 0.0213 *
## raceBlack -0.175321 0.122482 -1.431 0.1532
## raceWhite -0.125716 0.091484 -1.374 0.1702
## maritalMarried 0.118245 0.056901 2.078 0.0384 *
## maritalOther 0.213753 0.107920 1.981 0.0484 *
## regionnortheast 0.133909 0.068782 1.947 0.0523 .
## regionsouth 0.023429 0.060589 0.387 0.6992
## regionwest 0.081406 0.066502 1.224 0.2217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4342 on 361 degrees of freedom
## Multiple R-squared: 0.3019, Adjusted R-squared: 0.2748
## F-statistic: 11.15 on 14 and 361 DF, p-value: < 2.2e-16
summary(full_model_old)
##
## Call:
## lm(formula = logwage ~ educ + gender + hrswork + insure + metro +
## nchild + union + race + marital + region, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86638 -0.31418 0.01855 0.33097 1.30909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2583413 0.1465690 15.408 < 2e-16 ***
## educ 0.1548935 0.0119336 12.980 < 2e-16 ***
## genderMale -0.1776072 0.0357133 -4.973 8.09e-07 ***
## hrswork 0.0015599 0.0021502 0.725 0.46838
## insureYes 0.2393925 0.0533879 4.484 8.41e-06 ***
## metroMetro 0.1437497 0.0472004 3.046 0.00240 **
## nchild -0.0232674 0.0160311 -1.451 0.14707
## unionYes 0.0443596 0.0488816 0.907 0.36442
## raceBlack -0.0001526 0.1017538 -0.001 0.99880
## raceWhite 0.0883925 0.0831740 1.063 0.28822
## maritalMarried 0.0997758 0.0528882 1.887 0.05959 .
## maritalOther 0.1141177 0.0640300 1.782 0.07509 .
## regionnortheast 0.0553514 0.0532852 1.039 0.29923
## regionsouth 0.0460491 0.0465140 0.990 0.32248
## regionwest 0.1329651 0.0506018 2.628 0.00876 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4897 on 790 degrees of freedom
## Multiple R-squared: 0.277, Adjusted R-squared: 0.2642
## F-statistic: 21.62 on 14 and 790 DF, p-value: < 2.2e-16
reduced_model_young <- lm(logwage ~ educ + hrswork + union, data = young)
reduced_model_old <- lm(logwage ~ educ + hrswork + union, data = old)
summary(reduced_model_young)
##
## Call:
## lm(formula = logwage ~ educ + hrswork + union, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57784 -0.29670 -0.04221 0.25408 1.44076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.342604 0.100476 23.315 <2e-16 ***
## educ 0.144281 0.016465 8.763 <2e-16 ***
## hrswork 0.003788 0.002357 1.607 0.1089
## unionYes 0.168355 0.077081 2.184 0.0296 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4573 on 372 degrees of freedom
## Multiple R-squared: 0.2019, Adjusted R-squared: 0.1955
## F-statistic: 31.37 on 3 and 372 DF, p-value: < 2.2e-16
summary(reduced_model_old)
##
## Call:
## lm(formula = logwage ~ educ + hrswork + union, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81007 -0.33876 -0.00456 0.33506 1.37424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.550199 0.092987 27.425 <2e-16 ***
## educ 0.167351 0.011957 13.996 <2e-16 ***
## hrswork 0.004654 0.002174 2.141 0.0326 *
## unionYes 0.079198 0.049974 1.585 0.1134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5083 on 801 degrees of freedom
## Multiple R-squared: 0.2102, Adjusted R-squared: 0.2072
## F-statistic: 71.05 on 3 and 801 DF, p-value: < 2.2e-16
Multiple linear regression was employed to analyze the joint effect of multiple predictors on log(wage). The full model for individuals aged below 40 included variables such as educ, gender, hrswork, metro, nchild, union, race, and marital. Among these, educ, hrswork, and union were statistically significant predictors. Education had a positive association with wage, implying that higher educational attainment leads to better earnings (Eichinger and Mayer, 2022). Union membership also showed a strong positive effect, which is consistent with labor economics theory. For the older group, the full model had similar predictors, but only educ, hrswork, and metro were significant. Living in metropolitan areas was associated with higher wages, suggesting the influence of urban labor markets. The adjusted R-squared for both models was around 0.25–0.30, indicating moderate explanatory power. A reduced model was also built using backward elimination to remove non-significant variables. These models showed minimal loss in explanatory power, confirming the robustness of the key predictors. Residual diagnostics were performed, and assumptions of normality and homoscedasticity were largely satisfied. These regression analyses provided nuanced insights into how various demographic and employment-related variables influence earnings in different age brackets.
The report successfully applied a range of statistical techniques using R to analyze wage determinants. Descriptive and inferential analyses provided valuable insights into how age, education, work hours, and other factors affect earnings. Regression modeling confirmed the influence of key predictors, offering useful implications for workforce policy and individual decision-making.
Eichinger, F. and Mayer, M., 2022. Predicting salaries with random-forest regression. In Machine Learning and Data Analytics for Solving Business Problems: Methods, Applications, and Case Studies (pp. 1-21). Cham: Springer International Publishing.
Hilari, S.C., Carbajal, J.L. and Pineda, F., 2024. Salary Prediction with Machine Learning. In Machine Learning Methods in Systems: Proceedings of 13th Computer Science On-line Conference 2024, Vol. 4 (Vol. 4, p. 128). Springer Nature.
Otto, J., Han, C. and Tomasi, S., 2021. Using neural networks to predict wages based on worker skills. Studies in Business and Economics, 16(1), pp.95-108.
Popovici, I., Carvajal, M.J., Peeples, P. and Rabionet, S.E., 2021. Disparities in the wage-and-salary earnings, determinants, and distribution of health economics, outcomes research, and market access professionals: an exploratory study. PharmacoEconomics-open, 5, pp.319-329.
Stack, S., 2021. Contributing factors to suicide: Political, social, cultural and economic. Preventive medicine, 152, p.106498.