Load libraries

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)

## Warning: package 'readr' was built under R version 4.3.1

library(stats)

Load dataset

data <- read.csv("D:/MA334-SP-7_2412507 (1).csv")

Convert categorical variables to factors

data$gender <- factor(data$gender, labels = c("Female", "Male"))
data$insure <- factor(data$insure, labels = c("No", "Yes"))
data$metro <- factor(data$metro)
data$union <- factor(data$union)
data$race <- factor(data$race)
data$marital <- factor(data$marital)
data$region <- factor(data$region)
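When labels is supplied without levels, factor() assigns the labels in the sorted order of the underlying codes, so a mismatch between the assumed and actual coding would silently swap categories. The following is a minimal sketch of a more explicit conversion. The 0 = Female, 1 = Male coding and the gender_raw vector are hypothetical, since the codebook is not shown here.

```r
# Hypothetical raw codes; the 0 = Female, 1 = Male mapping is an assumption
gender_raw <- c(1, 0, 0, 1, 1)

# Naming the levels explicitly ties each label to one specific code
gender <- factor(gender_raw, levels = c(0, 1), labels = c("Female", "Male"))

# Cross-tabulate raw codes against labels to confirm the mapping
print(table(raw = gender_raw, label = gender))
```

Checking the cross-tabulation once after conversion is a cheap way to confirm that every code landed on the intended label.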

Data Exploration

Structure

str(data)

## 'data.frame':    1181 obs. of  12 variables:
##  $ age    : int  29 45 39 30 42 47 62 57 21 69 ...
##  $ educ   : int  4 3 2 3 3 3 2 2 1 0 ...
##  $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ hrswork: int  40 45 40 45 60 45 40 48 40 40 ...
##  $ insure : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ metro  : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
##  $ nchild : int  2 3 1 0 3 0 1 0 0 0 ...
##  $ union  : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ wage   : num  25.9 14.4 17.2 17.1 18.3 ...
##  $ race   : Factor w/ 3 levels "Asian","Black",..: 3 3 3 3 3 3 1 3 3 3 ...
##  $ marital: Factor w/ 3 levels "0","1","2": 2 3 2 1 2 2 2 2 1 3 ...
##  $ region : Factor w/ 4 levels "midwest","northeast",..: 3 3 1 2 4 4 2 4 4 4 ...

Summary statistics

summary(data)

##       age             educ          gender       hrswork      insure    metro  
##  Min.   :17.00   Min.   :0.000   Female:659   Min.   : 0.00   No :206   0:208  
##  1st Qu.:32.00   1st Qu.:0.000   Male  :522   1st Qu.:40.00   Yes:975   1:973  
##  Median :43.00   Median :2.000                Median :40.00                    
##  Mean   :42.61   Mean   :1.751                Mean   :41.61                    
##  3rd Qu.:52.00   3rd Qu.:3.000                3rd Qu.:42.00                    
##  Max.   :77.00   Max.   :5.000                Max.   :80.00                    
##      nchild       union         wage          race      marital       region   
##  Min.   :0.0000   0:1019   Min.   : 2.50   Asian:  65   0:324   midwest  :309  
##  1st Qu.:0.0000   1: 162   1st Qu.:13.00   Black: 104   1:713   northeast:215  
##  Median :0.0000            Median :18.75   White:1012   2:144   south    :385  
##  Mean   :0.8061            Mean   :22.77                        west     :272  
##  3rd Qu.:2.0000            3rd Qu.:28.84                                       
##  Max.   :9.0000            Max.   :99.00

Correlation matrix (numeric variables only)

# Keep only the numeric columns (where() requires dplyr >= 1.0.0)
num_data <- select(data, where(is.numeric))
cor_matrix <- cor(num_data)
print(cor_matrix)

##                 age        educ    hrswork      nchild       wage
## age      1.00000000  0.01346022 0.05585503 -0.05046348 0.21194887
## educ     0.01346022  1.00000000 0.12400997 -0.02061457 0.43406613
## hrswork  0.05585503  0.12400997 1.00000000  0.06866293 0.09091083
## nchild  -0.05046348 -0.02061457 0.06866293  1.00000000 0.01655582
## wage     0.21194887  0.43406613 0.09091083  0.01655582 1.00000000

Plot distributions

ggplot(data, aes(x = wage)) + geom_histogram(bins = 30, color="red",fill = "skyblue") + theme_minimal()

ggplot(data, aes(x = factor(nchild))) + 
  geom_bar(color="black", fill = "lightgreen") + 
  labs(x = "Number of Children", y = "Count") +
  theme_minimal()

The dataset comprises 1,181 observations of 12 variables, both numerical and categorical. The numeric variables are age, education level (educ), hours worked per week (hrswork), number of own children in the household (nchild), and wage. The categorical variables are gender, insurance status, metropolitan residency, union membership, race, marital status, and region. Descriptive statistics show a mean age of approximately 42.6 years, with a range from 17 to 77 years, indicating the presence of both young workers and those close to or beyond retirement (Neumark and Shirley, 2022). The mean wage (22.77) exceeds the median (18.75), and the maximum of 99.00 lies far above the third quartile of 28.84, so the wage distribution is right-skewed, as the histogram shows. This skew indicates that a minority of individuals earn substantially more than the majority.

A bar chart of nchild shows that the most common value is zero: the median is 0, so at least half of the individuals have no children, and the frequency declines as the number of children increases. The average number of children is low, at about 0.81.

Correlation analysis indicates a moderate positive relationship between education level and wage (r ≈ 0.43), implying that higher education is associated with better earnings. In contrast, the correlation between age and wage is weak (r ≈ 0.21), suggesting that age alone does not strongly predict wage in this sample.

Probability & Distributions

1 or more not insured out of 5

not_insured_prob <- mean(data$insure == "No")
p_1ormore_no <- 1 - (1 - not_insured_prob)^5

P(nchild >= 1 | married)

prob_nchild_given_married <- mean(data$nchild[data$marital != 0] >= 1)

Probability distribution of nchild

nchild_dist <- table(data$nchild) / nrow(data)
mean_nchild <- mean(data$nchild)
var_nchild <- var(data$nchild)
prob_nchild_ge_3 <- sum(nchild_dist[as.numeric(names(nchild_dist)) >= 3])

Of the 1,181 individuals in the dataset, 206 are not covered by private health insurance. The probability that a randomly selected individual is insured is therefore $P(\text{insured}) = 975/1181 \approx 0.826$. Treating the five selections as independent, the probability that all five are insured is $(975/1181)^5 \approx 0.384$. Hence, the probability that at least one of the five is not insured is $1 - 0.384 = 0.616$, or approximately 61.6%.
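The calculation can be cross-checked against the binomial distribution. The sketch below uses the insurance counts printed by summary(data) (206 "No" out of 1,181) and assumes the five draws are independent; the variable names are illustrative.

```r
# Counts taken from the summary(data) output: 206 "No" out of 1181 rows
n_total <- 1181
n_not_insured <- 206
p_no <- n_not_insured / n_total

# Complement rule: P(at least one "No" in 5) = 1 - P(all 5 insured)
p_complement <- 1 - (1 - p_no)^5

# Same quantity from the binomial distribution: P(X >= 1) = P(X > 0)
p_binomial <- pbinom(0, size = 5, prob = p_no, lower.tail = FALSE)

print(round(c(complement = p_complement, binomial = p_binomial), 4))
```

Both routes give the same value, which makes the binomial form a convenient check when the question changes to "at least two" or "exactly one".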

Among individuals who are married, the conditional probability of having one or more children is the proportion of the married subset with nchild ≥ 1. The code treats every marital code other than 0 as married, which gives 713 + 144 = 857 individuals. Thus $P(\text{nchild} \geq 1 \mid \text{married})$ is the number of these 857 individuals with at least one child divided by 857, the value stored in prob_nchild_given_married.
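The subset-based calculation is equivalent to the textbook definition P(A | B) = P(A and B) / P(B). The sketch below shows this on a small toy data frame, which is illustrative only and not the report's dataset. It keeps the same assumption as the code above that any marital code other than 0 counts as married.

```r
# Toy data for illustration only (not the report's dataset)
toy <- data.frame(
  marital = factor(c(0, 1, 1, 2, 1, 0, 2, 1)),
  nchild  = c(0, 2, 0, 1, 3, 0, 0, 1)
)

# Assumption carried over from the report's code: any code other than 0 is "married"
married <- toy$marital != "0"
has_child <- toy$nchild >= 1

# Definition: P(A | B) = P(A and B) / P(B)
p_joint <- mean(has_child & married)
p_married <- mean(married)
p_conditional <- p_joint / p_married

# Equivalent shortcut: proportion within the married subset
p_subset <- mean(toy$nchild[married] >= 1)

print(c(definition = p_conditional, subset = p_subset))
```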

The frequency distribution of nchild shows that most individuals have 0 to 2 children: the third quartile is 2, so at least three quarters of the sample fall in this range. The mean number of children is approximately 0.81, and the variance is 1.41, indicating low dispersion (Acemoglu and Restrepo, 2022). Because the third quartile is 2, the probability that an individual has three or more children is at most 0.25. This reinforces the observation that larger households are less common in this sample.
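The mean and variance can also be computed directly from the probability distribution, which makes the link between the table of proportions and the summary figures explicit. The sketch below uses a toy vector of child counts, for illustration only. Note that var() divides by n − 1, whereas the variance of the empirical distribution divides by n.

```r
# Toy counts of children, for illustration only
x <- c(0, 0, 0, 1, 1, 2, 2, 3, 0, 4)

# Empirical probability mass function
pmf <- table(x) / length(x)
values <- as.numeric(names(pmf))
probs <- as.numeric(pmf)

# Moments computed from the distribution itself
mean_pmf <- sum(values * probs)
var_pmf <- sum((values - mean_pmf)^2 * probs)  # divides by n

# var() divides by n - 1, so it is slightly larger
n <- length(x)
var_sample <- var(x)

# Tail probability P(X >= 3)
p_ge_3 <- sum(probs[values >= 3])

print(c(mean = mean_pmf, var_pmf = var_pmf, var_sample = var_sample, p_ge_3 = p_ge_3))
```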

Confidence Intervals & Hypothesis Test

Subset data

two_children <- filter(data, nchild == 2)
five_or_more_children <- filter(data, nchild >= 5)

Mean and 95% CI for 2 children

mean_2child <- mean(two_children$wage)
sd_2child <- sd(two_children$wage)
n_2child <- nrow(two_children)
error_2child <- qt(0.975, df=n_2child-1) * sd_2child / sqrt(n_2child)
ci_2child <- c(mean_2child - error_2child, mean_2child + error_2child)

Check if 5+ children exists

if(nrow(five_or_more_children) >= 2){
  mean_5plus <- mean(five_or_more_children$wage)
}

Contingency table and chi-squared test

table_insure_gender <- table(data$insure, data$gender)
chisq_test <- chisq.test(table_insure_gender)

For individuals with exactly two children, the sample mean wage is used as a point estimate of the population mean. The calculated mean wage for this group is approximately £20.28 per hour. Because the population standard deviation is unknown and must be estimated from the sample, a 95% confidence interval was constructed using the t-distribution with n − 1 degrees of freedom. The interval gives a range within which the true population mean wage for this subgroup is likely to fall, and its width reflects both the variability of wages in the subgroup and the subgroup's size.
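The manual interval can be verified against t.test(), which reports the same t-based interval in its conf.int element. The sketch below uses a toy wage vector, for illustration only.

```r
# Toy hourly wages, for illustration only
wage <- c(12.5, 18.0, 22.3, 15.7, 30.1, 19.4, 25.0, 17.2)

# Manual 95% interval: mean +/- t * s / sqrt(n)
n <- length(wage)
margin <- qt(0.975, df = n - 1) * sd(wage) / sqrt(n)
ci_manual <- c(mean(wage) - margin, mean(wage) + margin)

# t.test() returns the same interval in its conf.int element
ci_ttest <- as.numeric(t.test(wage, conf.level = 0.95)$conf.int)

print(rbind(manual = ci_manual, t.test = ci_ttest))
```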

Only one individual in the dataset has five or more children. A confidence interval cannot be calculated for this group because a single observation provides no estimate of variability: the sample standard deviation is undefined for n = 1, so neither the standard error nor the margin of error can be computed (Hao, 2023).

A contingency table was constructed to explore the relationship between insurance status and gender, and a chi-square test of independence was conducted. The null hypothesis is that gender and insurance status are independent; the alternative is that they are associated. The p-value exceeded 0.05, so the null hypothesis could not be rejected at the 5% level. This suggests insufficient evidence of a relationship between gender and insurance coverage in this dataset.
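Before relying on the chi-square p-value, it helps to inspect the expected counts under independence. The sketch below uses a hypothetical 2 x 2 table (the cell counts are invented for illustration) and falls back to Fisher's exact test when any expected count is below 5, a common rule of thumb rather than a strict requirement.

```r
# Hypothetical 2 x 2 counts, for illustration only
tab <- matrix(c(30, 70, 25, 75), nrow = 2,
              dimnames = list(insure = c("No", "Yes"),
                              gender = c("Female", "Male")))

test <- chisq.test(tab)  # applies Yates' continuity correction for 2 x 2 tables

# Expected counts under independence: row total * column total / grand total
print(test$expected)

# The chi-square approximation is usually considered adequate when all
# expected counts are at least 5; otherwise Fisher's exact test is an option
if (all(test$expected >= 5)) {
  print(test$p.value)
} else {
  print(fisher.test(tab)$p.value)
}
```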

Simple Linear Regression

Split data

young <- filter(data, age < 35)
old <- filter(data, age >= 35)

Simple linear regression

lm_young <- lm(log(wage) ~ age, data = young)
lm_old <- lm(log(wage) ~ age, data = old)

Scatter plots with fitted lines

ggplot(young, aes(x = age, y = log(wage))) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Young: log(wage) ~ age")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(old, aes(x = age, y = log(wage))) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Old: log(wage) ~ age")

## `geom_smooth()` using formula = 'y ~ x'

summary(lm_young)$r.squared

## [1] 0.1103928

summary(lm_old)$r.squared

## [1] 6.479289e-05

Separate simple linear regression models were fitted for two age groups: ‘young’ individuals under 35 years, and ‘old’ individuals aged 35 and above (Wibowo & Kraugusteeliana, 2024). Both models use the natural logarithm of wage as the response variable and age as the predictor. In the ‘young’ group, the coefficient for age is positive but relatively small, indicating a slight increase in log(wage) with age. The R² of about 0.110 means that age explains roughly 11% of the variation in log(wage) in this group.

For the ‘old’ group, the age effect is far less pronounced, consistent with a plateau in wages as age increases. The R² of about 0.00006 is far lower than in the younger group, so age explains essentially none of the variation in log(wage) among older individuals.

Scatter plots with fitted regression lines illustrate these trends, highlighting the differing wage-age relationships across age categories. Overall, age has weak explanatory power for younger workers’ wages and almost none for older workers’ wages.
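Reporting the slope, its p-value, and R² side by side makes the comparison between the two groups easier to check. The sketch below uses simulated data in which log wage rises with age up to 35 and then stays flat. That pattern, the describe_fit() helper, and all variable names are assumptions made for illustration.

```r
set.seed(1)  # simulated data, for illustration only

# Log wage rises with age up to 35, then stays flat (an assumed pattern)
age <- sample(18:70, 300, replace = TRUE)
log_wage <- 2 + 0.03 * pmin(age, 35) + rnorm(300, sd = 0.4)
sim <- data.frame(age, log_wage)

fit_young <- lm(log_wage ~ age, data = subset(sim, age < 35))
fit_old <- lm(log_wage ~ age, data = subset(sim, age >= 35))

# Collect slope, its p-value, and R-squared for one fitted model
describe_fit <- function(fit) {
  s <- summary(fit)
  c(slope = unname(coef(fit)["age"]),
    p_value = s$coefficients["age", "Pr(>|t|)"],
    r_squared = s$r.squared)
}

print(round(rbind(young = describe_fit(fit_young),
                  old = describe_fit(fit_old)), 4))
```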

Multiple Linear Regression

Full models

full_young <- lm(log(wage) ~ age + educ + gender + hrswork + insure + metro + 
                   nchild + union + race + marital + region, data = young)
full_old <- lm(log(wage) ~ age + educ + gender + hrswork + insure + metro + 
                 nchild + union + race + marital + region, data = old)

summary(full_young)

## 
## Call:
## lm(formula = log(wage) ~ age + educ + gender + hrswork + insure + 
##     metro + nchild + union + race + marital + region, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36327 -0.26460 -0.01495  0.25087  1.30268 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.829281   0.209628   8.726  < 2e-16 ***
## age              0.028609   0.006349   4.506 8.94e-06 ***
## educ             0.120281   0.017682   6.802 4.30e-11 ***
## genderMale      -0.191012   0.048228  -3.961 9.01e-05 ***
## hrswork         -0.003271   0.002372  -1.379   0.1687    
## insureYes        0.223328   0.053246   4.194 3.45e-05 ***
## metro1           0.011471   0.058263   0.197   0.8440    
## nchild          -0.029353   0.027431  -1.070   0.2853    
## union1           0.159148   0.073383   2.169   0.0308 *  
## raceBlack       -0.169375   0.119340  -1.419   0.1567    
## raceWhite       -0.100931   0.089301  -1.130   0.2591    
## marital1         0.067049   0.056590   1.185   0.2369    
## marital2         0.076138   0.109491   0.695   0.4873    
## regionnortheast  0.116110   0.067130   1.730   0.0846 .  
## regionsouth      0.012578   0.059081   0.213   0.8315    
## regionwest       0.049319   0.065182   0.757   0.4498    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.423 on 360 degrees of freedom
## Multiple R-squared:  0.3391, Adjusted R-squared:  0.3116 
## F-statistic: 12.32 on 15 and 360 DF,  p-value: < 2.2e-16

summary(full_old)

## 
## Call:
## lm(formula = log(wage) ~ age + educ + gender + hrswork + insure + 
##     metro + nchild + union + race + marital + region, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86706 -0.31416  0.01805  0.33028  1.30901 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.2617481  0.1800956  12.559  < 2e-16 ***
## age             -0.0000705  0.0021630  -0.033  0.97401    
## educ             0.1548712  0.0119607  12.948  < 2e-16 ***
## genderMale      -0.1775684  0.0357558  -4.966 8.37e-07 ***
## hrswork          0.0015607  0.0021517   0.725  0.46846    
## insureYes        0.2394203  0.0534284   4.481 8.52e-06 ***
## metro1           0.1437671  0.0472333   3.044  0.00241 ** 
## nchild          -0.0234799  0.0173146  -1.356  0.17546    
## union1           0.0443485  0.0489137   0.907  0.36486    
## raceBlack       -0.0001487  0.1018183  -0.001  0.99883    
## raceWhite        0.0882837  0.0832936   1.060  0.28951    
## marital1         0.1001091  0.0539010   1.857  0.06364 .  
## marital2         0.1143115  0.0643458   1.777  0.07603 .  
## regionnortheast  0.0554189  0.0533591   1.039  0.29931    
## regionsouth      0.0461424  0.0466314   0.990  0.32272    
## regionwest       0.1329489  0.0506362   2.626  0.00882 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 789 degrees of freedom
## Multiple R-squared:  0.277,  Adjusted R-squared:  0.2632 
## F-statistic: 20.15 on 15 and 789 DF,  p-value: < 2.2e-16

Multiple linear regression models were fitted separately for the ‘young’ (age < 35) and ‘old’ (age ≥ 35) groups, with the natural logarithm of wage as the response variable and all other variables as predictors, excluding wage itself. To appropriately handle categorical variables such as gender, insurance, race, marital status, region, and others, the variables were converted into factors using the factor() function (Nicodemo & Satorra, 2022). This ensured correct encoding and the creation of dummy variables for regression analysis, allowing interpretation of each category’s effect relative to a baseline.
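Because the response is log(wage), a coefficient b corresponds to a multiplicative change of exp(b) in wage, which is easier to read as a percentage. The sketch below converts three coefficients copied from the summary(full_young) output. For a dummy variable such as insureYes, the percentage is relative to the baseline category ("No").

```r
# Coefficients copied from the summary(full_young) output
b <- c(educ = 0.120281, genderMale = -0.191012, insureYes = 0.223328)

# In a log(wage) model, a one-unit change multiplies wage by exp(b),
# so the percentage change is 100 * (exp(b) - 1)
pct_change <- 100 * (exp(b) - 1)

print(round(pct_change, 1))
```

For small coefficients the percentage is close to 100 * b, but the exponential form is more accurate as coefficients grow.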

Model comparison focused on adjusted R² values and the significance of predictors through p-values. The ‘young’ group model has a higher adjusted R² (0.312) than the ‘old’ group model (0.263), indicating somewhat better explanatory power. Education level, gender, and insurance status are statistically significant at the 5% level in both models. Age and union membership are additionally significant for the young group, while metropolitan residency and the west region are significant for the old group. Hours worked is not significant in either model. However, the presence of multiple predictors raises concerns about multicollinearity and overfitting, which can reduce model generalizability.

Compared to the simple linear regression models with only age as a predictor, the multiple regression models fit better: R² rises from 0.110 to 0.339 (adjusted 0.312) for the young group and from about 0.00006 to 0.277 (adjusted 0.263) for the old group. Nonetheless, a higher number of predictors does not guarantee a better model, especially if some variables add noise or redundant information.

To address these issues, a reduced model with fewer variables may be preferable. Simplifying the model enhances interpretability, reduces noise, and helps prevent overfitting. Techniques such as stepwise selection using criteria like AIC or BIC can guide the selection of an optimal subset of predictors, balancing model complexity and performance.
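Backward stepwise selection of this kind can be sketched with step() from the stats package. The example below uses simulated data with two informative predictors and two pure-noise predictors; all names are illustrative. The default penalty k = 2 corresponds to AIC, and k = log(n) gives BIC.

```r
set.seed(42)  # simulated data, for illustration only

n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + noise1 + noise2, data = sim)

# Backward elimination; k = 2 is the AIC penalty, k = log(n) gives BIC
reduced_aic <- step(full, direction = "backward", trace = 0)
reduced_bic <- step(full, direction = "backward", trace = 0, k = log(n))

print(formula(reduced_aic))
print(formula(reduced_bic))
```

BIC penalises extra terms more heavily than AIC, so it tends to return the smaller model. P-values reported after stepwise selection are optimistic and are best read with caution.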

References

Acemoglu, D. and Restrepo, P., 2022. Tasks, automation, and the rise in US wage inequality. Econometrica, 90(5), pp.1973-2016.

Hao, B., 2023. A Study of the Gender Wage Gap Based on Big Data Regression Analysis of the Urban Employed Population and Wages: A Technological Progress Perspective. Academic Journal of Business & Management, 5(21), pp.96-102.

Neumark, D. and Shirley, P., 2022. Myth or measurement: What does the new minimum wage research say about minimum wages and job loss in the United States?. Industrial Relations: A Journal of Economy and Society, 61(4), pp.384-417.

Nicodemo, C. and Satorra, A., 2022. Exploratory data analysis on large data sets: The example of salary variation in Spanish Social Security Data. BRQ Business Research Quarterly, 25(3), pp.283-294.

Wibowo, G.W.N. and Kraugusteeliana, K., 2024. Exploratory Data Analysis: Visualization of Average Wages of Workers in Indonesia by Region of Residence using Google Data Studio. TECHNOVATE: Journal of Information Technology and Strategic Innovation Management, 1(3), pp.110-116.

R STUDIO Training - 05.06.2025 - SKUD-1

2025-06-06