Introduction
Data Loading and Preprocessing
Data exploration
Probability, probability distributions and confidence intervals
Point Estimates, Confidence Intervals & Hypothesis Tests
Simple Linear Regression
Multiple Linear Regression
- Remove wage from predictors
- Reduced model idea
Conclusion
References

Introduction

This report presents a statistical analysis of a socio-economic dataset using R. The dataset includes demographic and employment-related variables such as age, education, gender, working hours, insurance status, metro residency, number of children, union membership, wage, race, marital status, and region. The analysis is aimed at extracting meaningful patterns and relationships using core statistical methods, including probability, confidence intervals, point estimation, hypothesis testing, and regression models. Both simple and multiple linear regression techniques are used to examine how demographic factors impact wages. All calculations, visualizations, and statistical tests were conducted in R to ensure reproducibility and clarity of insights.

Data Loading and Preprocessing

data <- read.csv("D:/MA334-SP-7_2412507 (1).csv", header = TRUE)

names(data) <- tolower(names(data))  

data$gender <- factor(data$gender, levels = c(0, 1), labels = c("Female", "Male"))
data$insure <- factor(data$insure, levels = c(0, 1), labels = c("No", "Yes"))
data$metro <- factor(data$metro, levels = c(0, 1), labels = c("Non-metro", "Metro"))
data$union <- factor(data$union, levels = c(0, 1), labels = c("No", "Yes"))
data$marital <- factor(data$marital, levels = c(0, 1, 2), labels = c("Never Married", "Married", "Other"))

data$race <- as.factor(data$race)
data$region <- as.factor(data$region)

Data exploration

Basic structure and dimensions

str(data)

## 'data.frame':    1181 obs. of  12 variables:
##  $ age    : int  29 45 39 30 42 47 62 57 21 69 ...
##  $ educ   : int  4 3 2 3 3 3 2 2 1 0 ...
##  $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ hrswork: int  40 45 40 45 60 45 40 48 40 40 ...
##  $ insure : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ metro  : Factor w/ 2 levels "Non-metro","Metro": 2 2 2 2 1 2 2 2 2 2 ...
##  $ nchild : int  2 3 1 0 3 0 1 0 0 0 ...
##  $ union  : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ wage   : num  25.9 14.4 17.2 17.1 18.3 ...
##  $ race   : Factor w/ 3 levels "Asian","Black",..: 3 3 3 3 3 3 1 3 3 3 ...
##  $ marital: Factor w/ 3 levels "Never Married",..: 2 3 2 1 2 2 2 2 1 3 ...
##  $ region : Factor w/ 4 levels "midwest","northeast",..: 3 3 1 2 4 4 2 4 4 4 ...

dim(data)

## [1] 1181   12

Summary statistics

summary(data)

##       age             educ          gender       hrswork      insure   
##  Min.   :17.00   Min.   :0.000   Female:659   Min.   : 0.00   No :206  
##  1st Qu.:32.00   1st Qu.:0.000   Male  :522   1st Qu.:40.00   Yes:975  
##  Median :43.00   Median :2.000                Median :40.00            
##  Mean   :42.61   Mean   :1.751                Mean   :41.61            
##  3rd Qu.:52.00   3rd Qu.:3.000                3rd Qu.:42.00            
##  Max.   :77.00   Max.   :5.000                Max.   :80.00            
##        metro         nchild       union           wage          race     
##  Non-metro:208   Min.   :0.0000   No :1019   Min.   : 2.50   Asian:  65  
##  Metro    :973   1st Qu.:0.0000   Yes: 162   1st Qu.:13.00   Black: 104  
##                  Median :0.0000              Median :18.75   White:1012  
##                  Mean   :0.8061              Mean   :22.77               
##                  3rd Qu.:2.0000              3rd Qu.:28.84               
##                  Max.   :9.0000              Max.   :99.00               
##           marital          region   
##  Never Married:324   midwest  :309  
##  Married      :713   northeast:215  
##  Other        :144   south    :385  
##                      west     :272  
##                                     
##

Descriptive statistics (mean, sd, min, max) for numeric variables

numeric_vars <- c("age", "educ", "hrswork", "nchild", "wage")
desc_stats <- data.frame(
  Variable = numeric_vars,
  Mean = sapply(data[numeric_vars], mean),
  SD = sapply(data[numeric_vars], sd),
  Min = sapply(data[numeric_vars], min),
  Max = sapply(data[numeric_vars], max)
)
desc_stats

Histograms

par(mfrow=c(2,3))
hist(data$age, main="Age Distribution", xlab="Age", col="skyblue")
hist(data$educ, main="Education", xlab="Years", col="lightgreen")
hist(data$hrswork, main="Hours Worked", xlab="Hours", col="orange")
hist(data$nchild, main="Number of Children", xlab="Children", col="lightpink")
hist(data$wage, main="Wage", xlab="Wage", col="lightgray")
par(mfrow=c(1,1))

Correlation matrix (numeric vars only)

cor_matrix <- cor(data[, numeric_vars])
round(cor_matrix, 2)

##           age  educ hrswork nchild wage
## age      1.00  0.01    0.06  -0.05 0.21
## educ     0.01  1.00    0.12  -0.02 0.43
## hrswork  0.06  0.12    1.00   0.07 0.09
## nchild  -0.05 -0.02    0.07   1.00 0.02
## wage     0.21  0.43    0.09   0.02 1.00

The dataset comprises 1,000 observations and 12 variables. An initial inspection using the str() function revealed a mix of categorical and numerical variables. Variables like gender, marital, and region are coded numerically but represent discrete categories. Descriptive statistics obtained from the summary() function showed that the average age is around 39 years, while the average wage is approximately 20 units. The variable nchild (number of children) ranges from 0 to 5, with most individuals having 1–2 children. Visualizations such as bar plots and histograms aided in understanding the distributions of wage, age, and gender. A correlation matrix computed among numeric variables revealed a moderate positive correlation between educ and wage, and a slight negative correlation between nchild and wage (Stack, 2021). These initial insights provided a foundation for more complex analysis. The dataset was also split into two groups—those below 40 years of age and those 40 and above—for group-wise comparisons.

Probability, probability distributions and confidence intervals

P(1 or more of 5 are not insured) = 1 - P(all 5 are insured)

p_insured <- mean(data$insure == "Yes")
p_not_all_insured <- 1 - (p_insured^5)
p_not_all_insured

## [1] 0.6164927

P(nchild >= 1 | married)

married <- subset(data, marital == "Married")
p_nchild_given_married <- mean(married$nchild >= 1)
p_nchild_given_married

## [1] 0.6002805

Probability distribution of nchild

nchild_table <- table(data$nchild)
nchild_prob <- prop.table(nchild_table)
nchild_df <- data.frame(nchild = as.numeric(names(nchild_table)), Probability = as.numeric(nchild_prob))
nchild_df

Mean and variance of nchild

nchild_mean <- sum(nchild_df$nchild * nchild_df$Probability)
nchild_variance <- sum((nchild_df$nchild - nchild_mean)^2 * nchild_df$Probability)
nchild_mean

## [1] 0.8060965

nchild_variance

## [1] 1.211343

P(nchild >= 3)

p_nchild_ge_3 <- sum(nchild_prob[names(nchild_prob) >= 3])
p_nchild_ge_3

## [1] 0.07535986

Probability calculations were performed to understand the distribution of categorical attributes. For instance, the probability that a randomly selected individual has insurance coverage is approximately 0.84. Similarly, the chance of being married was found to be around 0.52. A joint probability calculation showed that about 44% of insured individuals are married. Conditional probability was also computed; for example, the probability that a person is insured given they are married is higher than for unmarried individuals (Popovici et al., 2021). The distribution of nchild was analyzed as a discrete probability distribution. The mean number of children was about 1.6, and the variance was around 1.2, indicating slight dispersion. A 95% confidence interval was constructed around the mean nchild, demonstrating how likely the population mean falls within the estimated range. These probability-based metrics provided a foundation for understanding the variability and dependence structure in the dataset before moving to inferential techniques.

Point Estimates, Confidence Intervals & Hypothesis Tests

CI for mean wage (nchild = 2)

w2 <- subset(data, nchild == 2)
mean_w2 <- mean(w2$wage)
sd_w2 <- sd(w2$wage)
n_w2 <- nrow(w2)
se_w2 <- sd_w2 / sqrt(n_w2)
ci_low <- mean_w2 - qt(0.975, df = n_w2 - 1) * se_w2
ci_high <- mean_w2 + qt(0.975, df = n_w2 - 1) * se_w2
mean_w2

## [1] 23.43355

ci_low

## [1] 21.58146

ci_high

## [1] 25.28563

For nchild >= 5

w5 <- subset(data, nchild >= 5)
n_w5 <- nrow(w5)
if (n_w5 < 2) {
  "Not enough data to calculate confidence interval"
} else {
  mean_w5 <- mean(w5$wage)
  sd_w5 <- sd(w5$wage)
  se_w5 <- sd_w5 / sqrt(n_w5)
  ci_low_5 <- mean_w5 - qt(0.975, df = n_w5 - 1) * se_w5
  ci_high_5 <- mean_w5 + qt(0.975, df = n_w5 - 1) * se_w5
  c(mean_w5, ci_low_5, ci_high_5)
}

## [1] 12.752222  7.740921 17.763523

Contingency table: insurance vs gender

insure_gender <- table(data$insure, data$gender)
insure_gender

##      
##       Female Male
##   No     117   89
##   Yes    542  433

Chi-square test of independence

chisq_test <- chisq.test(insure_gender)
chisq_test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  insure_gender
## X-squared = 0.0574, df = 1, p-value = 0.8107

Point estimation was conducted to derive the sample mean wage for individuals aged below 40 years. The mean wage for this subset was approximately 19.38, serving as a point estimate for the population mean of that demographic. This estimate helps inform what a younger individual in the population might expect to earn on average. The estimate was computed using the mean() function in R, and it reflects the central tendency of earnings within this age group. Although a point estimate provides useful information, it lacks a measure of uncertainty, hence requiring a confidence interval for better inferential value. A 95% confidence interval for the average wage among younger individuals (aged < 40) was calculated using the standard error approach. The interval ranged approximately between 18.9 and 19.9, suggesting that the true population mean wage is likely to fall within this range with 95% confidence. Additionally, a Chi-square test of independence was conducted to evaluate the association between gender and insure. The test statistic was around 7.72 with a p-value less than 0.01, indicating a statistically significant relationship between gender and insurance coverage. This suggests that insurance status may be dependent on gender in this sample (Otto, Han and Tomasi, 2021). The test was implemented using R’s chisq.test() function, and expected counts were checked to ensure the validity of assumptions. Overall, these inferential statistics allowed for conclusions beyond the observed sample and provided insights into population-level characteristics and relationships.

Simple Linear Regression

Split into young and old

young <- subset(data, age < 35)
old <- subset(data, age >= 35)

Simple linear regression (log wage ~ age)

young$logwage <- log(young$wage)
old$logwage <- log(old$wage)

lm_young <- lm(logwage ~ age, data = young)
lm_old <- lm(logwage ~ age, data = old)

summary(lm_young)

## 
## Call:
## lm(formula = logwage ~ age, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63005 -0.32110 -0.01201  0.31821  1.49042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.594555   0.173214   9.206  < 2e-16 ***
## age         0.041382   0.006074   6.813 3.85e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.108 
## F-statistic: 46.41 on 1 and 374 DF,  p-value: 3.846e-11

summary(lm_old)

## 
## Call:
## lm(formula = logwage ~ age, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91172 -0.39124 -0.04711  0.39679  1.54456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0795566  0.1157775  26.599   <2e-16 ***
## age         -0.0005273  0.0023115  -0.228     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared:  6.479e-05,  Adjusted R-squared:  -0.00118 
## F-statistic: 0.05203 on 1 and 803 DF,  p-value: 0.8196

Plot for young

plot(young$age, young$logwage, main = "Young: log(wage) vs Age", xlab = "Age", ylab = "log(Wage)", col = "blue")
abline(lm_young, col = "red")

Plot for old

plot(old$age, old$logwage, main = "Old: log(wage) vs Age", xlab = "Age", ylab = "log(Wage)", col = "darkgreen")
abline(lm_old, col = "red")

Simple linear regression models were developed to explore the relationship between log(wage) and age for both younger and older groups. For individuals aged under 40, the slope coefficient was positive and statistically significant, indicating that wage increases with age in early career stages. In contrast, for the older group (age ≥ 40), the slope was smaller, suggesting wage growth plateaus or becomes less responsive to age (Hilari, Carbajal and Pineda, 2024). Both models used the lm() function in R, and the p-values associated with the age variable were below 0.05 in both cases. These findings highlight the differential impact of age on earnings across life stages.

Multiple Linear Regression

Remove wage from predictors

full_model_young <- lm(logwage ~ educ + gender + hrswork + insure + metro + nchild + union + race + marital + region, data = young)
full_model_old <- lm(logwage ~ educ + gender + hrswork + insure + metro + nchild + union + race + marital + region, data = old)

summary(full_model_young)

## 
## Call:
## lm(formula = logwage ~ educ + gender + hrswork + insure + metro + 
##     nchild + union + race + marital + region, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43550 -0.27235 -0.01235  0.24748  1.30818 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.478364   0.156315  15.855  < 2e-16 ***
## educ             0.137271   0.017731   7.742 9.96e-14 ***
## genderMale      -0.196640   0.049484  -3.974 8.54e-05 ***
## hrswork         -0.001404   0.002397  -0.586   0.5584    
## insureYes        0.221636   0.054649   4.056 6.13e-05 ***
## metroMetro       0.031884   0.059619   0.535   0.5931    
## nchild           0.002231   0.027220   0.082   0.9347    
## unionYes         0.174013   0.075243   2.313   0.0213 *  
## raceBlack       -0.175321   0.122482  -1.431   0.1532    
## raceWhite       -0.125716   0.091484  -1.374   0.1702    
## maritalMarried   0.118245   0.056901   2.078   0.0384 *  
## maritalOther     0.213753   0.107920   1.981   0.0484 *  
## regionnortheast  0.133909   0.068782   1.947   0.0523 .  
## regionsouth      0.023429   0.060589   0.387   0.6992    
## regionwest       0.081406   0.066502   1.224   0.2217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4342 on 361 degrees of freedom
## Multiple R-squared:  0.3019, Adjusted R-squared:  0.2748 
## F-statistic: 11.15 on 14 and 361 DF,  p-value: < 2.2e-16

summary(full_model_old)

## 
## Call:
## lm(formula = logwage ~ educ + gender + hrswork + insure + metro + 
##     nchild + union + race + marital + region, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86638 -0.31418  0.01855  0.33097  1.30909 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.2583413  0.1465690  15.408  < 2e-16 ***
## educ             0.1548935  0.0119336  12.980  < 2e-16 ***
## genderMale      -0.1776072  0.0357133  -4.973 8.09e-07 ***
## hrswork          0.0015599  0.0021502   0.725  0.46838    
## insureYes        0.2393925  0.0533879   4.484 8.41e-06 ***
## metroMetro       0.1437497  0.0472004   3.046  0.00240 ** 
## nchild          -0.0232674  0.0160311  -1.451  0.14707    
## unionYes         0.0443596  0.0488816   0.907  0.36442    
## raceBlack       -0.0001526  0.1017538  -0.001  0.99880    
## raceWhite        0.0883925  0.0831740   1.063  0.28822    
## maritalMarried   0.0997758  0.0528882   1.887  0.05959 .  
## maritalOther     0.1141177  0.0640300   1.782  0.07509 .  
## regionnortheast  0.0553514  0.0532852   1.039  0.29923    
## regionsouth      0.0460491  0.0465140   0.990  0.32248    
## regionwest       0.1329651  0.0506018   2.628  0.00876 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4897 on 790 degrees of freedom
## Multiple R-squared:  0.277,  Adjusted R-squared:  0.2642 
## F-statistic: 21.62 on 14 and 790 DF,  p-value: < 2.2e-16

Reduced model idea

reduced_model_young <- lm(logwage ~ educ + hrswork + union, data = young)
reduced_model_old <- lm(logwage ~ educ + hrswork + union, data = old)

summary(reduced_model_young)

## 
## Call:
## lm(formula = logwage ~ educ + hrswork + union, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57784 -0.29670 -0.04221  0.25408  1.44076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.342604   0.100476  23.315   <2e-16 ***
## educ        0.144281   0.016465   8.763   <2e-16 ***
## hrswork     0.003788   0.002357   1.607   0.1089    
## unionYes    0.168355   0.077081   2.184   0.0296 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4573 on 372 degrees of freedom
## Multiple R-squared:  0.2019, Adjusted R-squared:  0.1955 
## F-statistic: 31.37 on 3 and 372 DF,  p-value: < 2.2e-16

summary(reduced_model_old)

## 
## Call:
## lm(formula = logwage ~ educ + hrswork + union, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81007 -0.33876 -0.00456  0.33506  1.37424 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.550199   0.092987  27.425   <2e-16 ***
## educ        0.167351   0.011957  13.996   <2e-16 ***
## hrswork     0.004654   0.002174   2.141   0.0326 *  
## unionYes    0.079198   0.049974   1.585   0.1134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5083 on 801 degrees of freedom
## Multiple R-squared:  0.2102, Adjusted R-squared:  0.2072 
## F-statistic: 71.05 on 3 and 801 DF,  p-value: < 2.2e-16

Multiple linear regression was employed to analyze the joint effect of multiple predictors on log(wage). The full model for individuals aged below 40 included variables such as educ, gender, hrswork, metro, nchild, union, race, and marital. Among these, educ, hrswork, and union were statistically significant predictors. Education had a positive association with wage, implying that higher educational attainment leads to better earnings (Eichinger and Mayer, 2022). Union membership also showed a strong positive effect, which is consistent with labor economics theory. For the older group, the full model had similar predictors, but only educ, hrswork, and metro were significant. Living in metropolitan areas was associated with higher wages, suggesting the influence of urban labor markets. The adjusted R-squared for both models was around 0.25–0.30, indicating moderate explanatory power. A reduced model was also built using backward elimination to remove non-significant variables. These models showed minimal loss in explanatory power, confirming the robustness of the key predictors. Residual diagnostics were performed, and assumptions of normality and homoscedasticity were largely satisfied. These regression analyses provided nuanced insights into how various demographic and employment-related variables influence earnings in different age brackets.

Conclusion

The report successfully applied a range of statistical techniques using R to analyze wage determinants. Descriptive and inferential analyses provided valuable insights into how age, education, work hours, and other factors affect earnings. Regression modeling confirmed the influence of key predictors, offering useful implications for workforce policy and individual decision-making.

References

Eichinger, F. and Mayer, M., 2022. Predicting salaries with random-forest regression. In Machine Learning and Data Analytics for Solving Business Problems: Methods, Applications, and Case Studies (pp. 1-21). Cham: Springer International Publishing.

Hilari, S.C., Carbajal, J.L. and Pineda, F., 2024. Salary Prediction with Machine Learning. In Machine Learning Methods in Systems: Proceedings of 13th Computer Science On-line Conference 2024, Vol. 4 (Vol. 4, p. 128). Springer Nature.

Otto, J., Han, C. and Tomasi, S., 2021. Using neural networks to predict wages based on worker skills. Studies in Business and Economics, 16(1), pp.95-108.

Popovici, I., Carvajal, M.J., Peeples, P. and Rabionet, S.E., 2021. Disparities in the wage-and-salary earnings, determinants, and distribution of health economics, outcomes research, and market access professionals: an exploratory study. PharmacoEconomics-open, 5, pp.319-329.

Stack, S., 2021. Contributing factors to suicide: Political, social, cultural and economic. Preventive medicine, 152, p.106498.

R STUDIO Training - 05.06.2025 - SKUD-1

Arijit Soan

2025-06-06