The dataset consists of 534 individuals with no missing values. Education averages 13 years, with most people having around 12 years of schooling. Work experience averages 17.8 years but varies widely, with some individuals having significantly more. Wages have a mean of $9.02 per hour, but the distribution is positively skewed (as we see from the histogram), so most people earn less than the average, with a few high earners pulling the mean up (for example there is huge outlier in female category as we see from the box plot). The sample has a mean age of 36.8 years.
Around 29% of the sample lives in the South, and about 46% are female, showing a relatively balanced gender split. Only 18% are union members. Most individuals are categorized as White in the race variable, with a slight skew toward service and professional occupations. The majority do not work in manufacturing or construction, as indicated by the low average for the sector variable. About 66% of the sample is married.
Descriptive statistics by category
aggregate(Wage ~ Gender, data = wage, FUN = mean)
Gender Wage
1 0 9.994913
2 1 7.878857
aggregate(Wage ~ Gender, data = wage, FUN = median)
Gender Wage
1 0 8.93
2 1 6.80
aggregate(Wage ~ Gender, data = wage, FUN = sd)
Gender Wage
1 0 5.285854
2 1 4.720113
boxplot(Wage ~ Gender, data = wage, main ="Wages by Gender", xlab ="Gender", ylab ="Wage", col =c("lightblue", "pink"))
boxplot(Wage ~ Union, data = wage, main ="Wages by Union Membership", xlab ="Union Membership", ylab ="Wage", col =c("lightgreen", "lightcoral"))
boxplot(Wage ~ Race, data = wage, main ="Wages by Race", xlab ="Race", ylab ="Wage", col =rainbow(3))
The data shows that males, on average, earn higher wages than females, as well as workers with union membership, white race and professional occupation.
boxplot(Wage ~ Occupation, data = wage, main ="Wages by Occupation", xlab ="Occupation", ylab ="Wage", col =rainbow(length(unique(wage$Occupation))))
We can check correlations for wages and different factors
The analysis shows that education is a bit linked to higher wages (0.38) but slightly reduces experience (-0.35), likely because educated people start working later. Experience strongly grows with age (0.98). Wages weakly increase with both age (0.18) and experience (0.09), but education has the strongest impact on wages.
Simple linear regression
model_simple <-lm(Wage ~ Education, data = wage)summary(model_simple)
Call:
lm(formula = Wage ~ Education, data = wage)
Residuals:
Min 1Q Median 3Q Max
-7.911 -3.260 -0.760 2.240 34.740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.74598 1.04545 -0.714 0.476
Education 0.75046 0.07873 9.532 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.754 on 532 degrees of freedom
Multiple R-squared: 0.1459, Adjusted R-squared: 0.1443
F-statistic: 90.85 on 1 and 532 DF, p-value: < 2.2e-16
text
model_simple <-lm(Wage ~ Gender, data = wage)summary(model_simple)
Call:
lm(formula = Wage ~ Gender, data = wage)
Residuals:
Min 1Q Median 3Q Max
-8.995 -3.529 -1.072 2.394 36.621
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.9949 0.2961 33.75 < 2e-16 ***
Gender -2.1161 0.4372 -4.84 1.7e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.034 on 532 degrees of freedom
Multiple R-squared: 0.04218, Adjusted R-squared: 0.04038
F-statistic: 23.43 on 1 and 532 DF, p-value: 1.703e-06
model_simple <-lm(Wage ~ Experience, data = wage)summary(model_simple)
Call:
lm(formula = Wage ~ Experience, data = wage)
Residuals:
Min 1Q Median 3Q Max
-8.247 -3.601 -1.111 2.332 36.084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.37997 0.38895 21.545 <2e-16 ***
Experience 0.03614 0.01793 2.016 0.0443 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.124 on 532 degrees of freedom
Multiple R-squared: 0.007579, Adjusted R-squared: 0.005714
F-statistic: 4.063 on 1 and 532 DF, p-value: 0.04433
The analysis shows that Education has a significant positive effect on Wage, with each additional year of education increasing wages by $0.75, and the model explains 14.6% of wage variability. Gender is also significant, with females earning $2.12 less than males on average, but this model explains only 4.2% of the variability in wages. Experience has a small positive effect, increasing wages by $0.036 per year, but its impact is weak and explains less than 1% of the wage variability.
Multiple linear regression model development
wage$ln_wage <-log(wage$Wage)rgrmodel <-lm(ln_wage ~ Occupation + Sector + Union + Education + Experience + Age + Gender + Marital_status + Race + South, data = wage)summary(rgrmodel)
Call:
lm(formula = ln_wage ~ Occupation + Sector + Union + Education +
Experience + Age + Gender + Marital_status + Race + South,
data = wage)
Residuals:
Min 1Q Median 3Q Max
-2.16246 -0.29163 -0.00469 0.29981 1.98248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.078596 0.687514 1.569 0.117291
Occupation -0.007417 0.013109 -0.566 0.571761
Sector 0.091458 0.038736 2.361 0.018589 *
Union 0.200483 0.052475 3.821 0.000149 ***
Education 0.179366 0.110756 1.619 0.105949
Experience 0.095822 0.110799 0.865 0.387531
Age -0.085444 0.110730 -0.772 0.440671
Gender -0.221997 0.039907 -5.563 4.24e-08 ***
Marital_status 0.076611 0.041931 1.827 0.068259 .
Race 0.050406 0.028531 1.767 0.077865 .
South -0.102360 0.042823 -2.390 0.017187 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4398 on 523 degrees of freedom
Multiple R-squared: 0.3185, Adjusted R-squared: 0.3054
F-statistic: 24.44 on 10 and 523 DF, p-value: < 2.2e-16
The regression model estimates the natural logarithm of wages based on several predictors as occupation, sector, union membership, education, and demographic variables.
Among the predictors, Union membership has a significant positive effect on wages, increasing the log of wages by 0.20, while working in the Sector and living in the South also significantly affect wages, with coefficients of 0.09 and -0.10, respectively. Gender is a key factor, where being female reduces the log of wages by 0.22 on average. Although Education shows a positive trend (0.18), it is not statistically significant at the 5% level in this model. Other variables, such as Age, Experience, and Occupation, do not have significant impacts in this analysis. The model explains about 31.85% of the variability in log wages, indicating a moderate fit.
The boxplot of the residuals provides an overview of how well the regression model fits the data. The outliers above and below the plot are data points where the actual wage values deviate significantly from what the model predicts, so it may not perform well.
The two main problems with this model are multicollinearity and interpretation issues with some variables like race, occupation, and sector.
Multicollinearity happens when two variables are highly related. For example, experience and age are likely correlated, making it hard to separate their effects on wages. This can make the model’s results unreliable.
The variables race, occupation, and sector should be treated as categorical variables. Treating them as continuous can make the model’s interpretation incorrect.
In short, to fix these problems, we should check for multicollinearity and change the way we treat occupation, sector, and race in the model.
Addressing issues
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:psych':
logit
vif(rgrmodel)
Occupation Sector Union Education Experience
1.298232 1.198670 1.120861 231.195580 5184.093895
Age Gender Marital_status Race South
4645.664977 1.091634 1.096130 1.037138 1.046828
The Variance Inflation Factor (VIF) results show that most of the variables have VIF values around 1, indicating no significant multicollinearity issues. However, Education, Experience, and Age have extremely high VIFs, especially:
Education: VIF = 231.2
Experience: VIF = 5184.1
Age: VIF = 4645.7
These very high VIFs suggest that these variables are highly collinear. Also, from previous analysis we know that experience and age are highly correlated so there is multicollinearity, so we can remove one of them
Occupation Sector Union Education Gender
1.273183 1.190236 1.102779 1.097074 1.081223
Marital_status Race South
1.019463 1.036985 1.043706
The new VIF results show that all variables now have VIF values close to 1, indicating that multicollinearity is no longer an issue in the model.
summary(rgrmodel_updated)
Call:
lm(formula = ln_wage ~ Occupation + Sector + Union + Education +
Gender + Marital_status + Race + South, data = wage)
Residuals:
Min 1Q Median 3Q Max
-2.1519 -0.3309 0.0034 0.3028 1.8315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.95654 0.15028 6.365 4.26e-10 ***
Occupation -0.01719 0.01339 -1.283 0.199910
Sector 0.10996 0.03982 2.762 0.005953 **
Union 0.23924 0.05369 4.456 1.02e-05 ***
Education 0.07650 0.00787 9.721 < 2e-16 ***
Gender -0.20108 0.04097 -4.908 1.23e-06 ***
Marital_status 0.13980 0.04171 3.352 0.000861 ***
Race 0.05157 0.02943 1.752 0.080287 .
South -0.11579 0.04411 -2.625 0.008912 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4537 on 525 degrees of freedom
Multiple R-squared: 0.2721, Adjusted R-squared: 0.261
F-statistic: 24.53 on 8 and 525 DF, p-value: < 2.2e-16
The residuals show that the differences between predicted and actual wages are small, which indicates that the model makes good predictions. The coefficients reveal that some variables, such as Union and Education, have a significant positive impact on wages, while Gender and South have a negative effect. The R-squared value of 0.2721 means that the model explains about 27% of the variation in wages, which is relatively low. The F-statistic is significant (p-value < 2e-16), suggesting that the predictors, as a group, are related to wages. Overall, the model shows that factors like education, union membership, and sector influence wages, while gender and location (South) tend to lower them, although the model could be improved.
This model is now better because it provides a more accurate result as it properly includes categorical variables. The model now shows that specific occupations and sectors significantly impact wages, which was not captured in the previous version. Additionally, the Adjusted R-squared increased from 0.261 to 0.3068, so there is a better fit and more explained variance in the data.