Econometrics HW2

Author

Vladyslava Bondarenko

wage <- read.csv('C:/Users/User/Desktop/wage_data.csv')
head(wage)
  ID Education South Gender Experience Union  Wage Age Race Occupation Sector
1  1         8     0      1         21     0  5.10  35    2          6      1
2  2         9     0      1         42     0  4.95  57    3          6      1
3  3        12     0      0          1     0  6.67  19    3          6      1
4  4        12     0      0          4     0  4.00  22    3          6      0
5  5        12     0      0         17     0  7.50  35    3          6      0
6  6        13     0      0          9     1 13.07  28    3          6      0
  Marital_status
1              1
2              1
3              0
4              0
5              1
6              0
library(psych)
describe(wage) 
               vars   n   mean     sd median trimmed    mad min   max range
ID                1 534 267.50 154.30 267.50  267.50 197.93   1 534.0 533.0
Education         2 534  13.02   2.62  12.00   13.05   1.48   2  18.0  16.0
South             3 534   0.29   0.46   0.00    0.24   0.00   0   1.0   1.0
Gender            4 534   0.46   0.50   0.00    0.45   0.00   0   1.0   1.0
Experience        5 534  17.82  12.38  15.00   16.75  11.86   0  55.0  55.0
Union             6 534   0.18   0.38   0.00    0.10   0.00   0   1.0   1.0
Wage              7 534   9.02   5.14   7.78    8.28   4.12   1  44.5  43.5
Age               8 534  36.83  11.73  35.00   36.07  11.86  18  64.0  46.0
Race              9 534   2.70   0.68   3.00    2.87   0.00   1   3.0   2.0
Occupation       10 534   4.15   1.66   4.00    4.31   1.48   1   6.0   5.0
Sector           11 534   0.28   0.54   0.00    0.16   0.00   0   2.0   2.0
Marital_status   12 534   0.66   0.48   1.00    0.69   0.00   0   1.0   1.0
                skew kurtosis   se
ID              0.00    -1.21 6.68
Education      -0.20     0.81 0.11
South           0.91    -1.17 0.02
Gender          0.16    -1.98 0.02
Experience      0.68    -0.40 0.54
Union           1.66     0.77 0.02
Wage            1.69     4.90 0.22
Age             0.55    -0.60 0.51
Race           -1.94     1.97 0.03
Occupation     -0.49    -0.94 0.07
Sector          1.83     2.39 0.02
Marital_status -0.65    -1.58 0.02
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
hist(wage$Wage, breaks = 10, col = "blue", main = "Histogram of Wages", xlab = "Wage")

First, we take a look at the data using descriptive statistics (the describe() output above) and a histogram of the wage distribution.

The dataset consists of 534 individuals with no missing values. Education averages 13 years, with most people having around 12 years of schooling. Work experience averages 17.8 years but varies widely, with some individuals having considerably more. Wages have a mean of $9.02 per hour, but the distribution is positively skewed (as the histogram shows), so most people earn less than the average, with a few high earners pulling the mean up (for example, there is a large outlier in the female category, visible in the boxplot below). The mean age in the sample is 36.8 years.

Around 29% of the sample lives in the South, and about 46% are female, a relatively balanced gender split. Only 18% are union members. Most individuals are categorized as White in the race variable, and occupations skew slightly toward the service and professional categories. The majority do not work in manufacturing or construction, as indicated by the low average of the sector variable. About 66% of the sample is married.
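Both the "no missing values" claim and the category shares quoted above can be verified directly; a minimal sketch using base R:

# Count missing values per column; all counts should be zero
colSums(is.na(wage))

# Shares of the binary indicators and the race distribution
mean(wage$South)              # share living in the South (~0.29)
mean(wage$Gender)             # share of females (~0.46)
mean(wage$Union)              # share of union members (~0.18)
prop.table(table(wage$Race))  # distribution across race codes 1-3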

Descriptive statistics by category

aggregate(Wage ~ Gender, data = wage, FUN = mean)
  Gender     Wage
1      0 9.994913
2      1 7.878857
aggregate(Wage ~ Gender, data = wage, FUN = median)
  Gender Wage
1      0 8.93
2      1 6.80
aggregate(Wage ~ Gender, data = wage, FUN = sd)
  Gender     Wage
1      0 5.285854
2      1 4.720113
boxplot(Wage ~ Gender, data = wage, main = "Wages by Gender", 
        xlab = "Gender", ylab = "Wage", col = c("lightblue", "pink"))

boxplot(Wage ~ Union, data = wage, main = "Wages by Union Membership", 
        xlab = "Union Membership", ylab = "Wage", col = c("lightgreen", "lightcoral"))

boxplot(Wage ~ Race, data = wage, main = "Wages by Race", 
        xlab = "Race", ylab = "Wage", col = rainbow(3))

The boxplots show that males earn higher wages than females on average; union members, White workers, and those in professional occupations also tend to earn more.
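The same group summaries computed for Gender can be repeated for the other categories to back up this claim; a short sketch:

aggregate(Wage ~ Union, data = wage, FUN = mean)         # union vs. non-union mean wage
aggregate(Wage ~ Race, data = wage, FUN = mean)          # mean wage by race code
aggregate(Wage ~ Occupation, data = wage, FUN = median)  # median wage by occupation code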

boxplot(Wage ~ Occupation, data = wage, 
        main = "Wages by Occupation", 
        xlab = "Occupation", ylab = "Wage", 
        col = rainbow(length(unique(wage$Occupation))))

We can check the correlations between wages and the other numeric variables.

# Select the numeric variables and compute their correlation matrix
numeric_data <- wage[, c("Education", "Experience", "Wage", "Age")]
cor_matrix <- cor(numeric_data)
print(cor_matrix)
            Education  Experience       Wage        Age
Education   1.0000000 -0.35267645 0.38192207 -0.1500195
Experience -0.3526764  1.00000000 0.08705953  0.9779612
Wage        0.3819221  0.08705953 1.00000000  0.1769669
Age        -0.1500195  0.97796125 0.17696688  1.0000000
library(corrplot)
corrplot 0.95 loaded
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.col = "black", tl.srt = 45)

The analysis shows that education is moderately positively correlated with wages (0.38) and negatively correlated with experience (-0.35), likely because more educated people enter the workforce later. Experience is almost perfectly correlated with age (0.98). Wages are only weakly correlated with age (0.18) and experience (0.09); among these variables, education has the strongest association with wages.
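To check whether these correlations are statistically distinguishable from zero, cor.test() can be applied to each pair; a sketch for the two wage correlations discussed above:

# Pearson correlation test: Wage vs. Education
cor.test(wage$Wage, wage$Education)

# Wage vs. Experience: the weak 0.09 correlation may not be significant
cor.test(wage$Wage, wage$Experience)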

Simple linear regression

model_simple <- lm(Wage ~ Education, data = wage)
summary(model_simple)

Call:
lm(formula = Wage ~ Education, data = wage)

Residuals:
   Min     1Q Median     3Q    Max 
-7.911 -3.260 -0.760  2.240 34.740 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.74598    1.04545  -0.714    0.476    
Education    0.75046    0.07873   9.532   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.754 on 532 degrees of freedom
Multiple R-squared:  0.1459,    Adjusted R-squared:  0.1443 
F-statistic: 90.85 on 1 and 532 DF,  p-value: < 2.2e-16

model_simple <- lm(Wage ~ Gender, data = wage)
summary(model_simple)

Call:
lm(formula = Wage ~ Gender, data = wage)

Residuals:
   Min     1Q Median     3Q    Max 
-8.995 -3.529 -1.072  2.394 36.621 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.9949     0.2961   33.75  < 2e-16 ***
Gender       -2.1161     0.4372   -4.84  1.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.034 on 532 degrees of freedom
Multiple R-squared:  0.04218,   Adjusted R-squared:  0.04038 
F-statistic: 23.43 on 1 and 532 DF,  p-value: 1.703e-06
model_simple <- lm(Wage ~ Experience, data = wage)
summary(model_simple)

Call:
lm(formula = Wage ~ Experience, data = wage)

Residuals:
   Min     1Q Median     3Q    Max 
-8.247 -3.601 -1.111  2.332 36.084 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.37997    0.38895  21.545   <2e-16 ***
Experience   0.03614    0.01793   2.016   0.0443 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.124 on 532 degrees of freedom
Multiple R-squared:  0.007579,  Adjusted R-squared:  0.005714 
F-statistic: 4.063 on 1 and 532 DF,  p-value: 0.04433

The analysis shows that Education has a significant positive effect on Wage: each additional year of education is associated with a $0.75 increase in hourly wages, and the model explains about 14.6% of the wage variability. Gender is also significant, with females earning $2.12 less than males on average, but this model explains only 4.2% of the variability in wages. Experience has a small positive effect, about $0.036 per year, but it is only marginally significant (p = 0.044) and explains less than 1% of the wage variability.
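Because model_simple is overwritten at each step, refitting the Education regression under its own name makes it easy to report confidence intervals and visualize the fit; a sketch (model_edu is a name introduced here):

model_edu <- lm(Wage ~ Education, data = wage)  # refit under a distinct name
confint(model_edu)                              # 95% CIs for intercept and slope

plot(Wage ~ Education, data = wage,
     main = "Wage vs. Education", xlab = "Years of education", ylab = "Wage")
abline(model_edu, col = "red", lwd = 2)         # add the fitted regression line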

Multiple linear regression model development

wage$ln_wage <- log(wage$Wage)
rgrmodel <- lm(ln_wage ~ Occupation + Sector + Union + Education + Experience + Age + Gender + Marital_status + Race + South, data = wage)
summary(rgrmodel)

Call:
lm(formula = ln_wage ~ Occupation + Sector + Union + Education + 
    Experience + Age + Gender + Marital_status + Race + South, 
    data = wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.16246 -0.29163 -0.00469  0.29981  1.98248 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.078596   0.687514   1.569 0.117291    
Occupation     -0.007417   0.013109  -0.566 0.571761    
Sector          0.091458   0.038736   2.361 0.018589 *  
Union           0.200483   0.052475   3.821 0.000149 ***
Education       0.179366   0.110756   1.619 0.105949    
Experience      0.095822   0.110799   0.865 0.387531    
Age            -0.085444   0.110730  -0.772 0.440671    
Gender         -0.221997   0.039907  -5.563 4.24e-08 ***
Marital_status  0.076611   0.041931   1.827 0.068259 .  
Race            0.050406   0.028531   1.767 0.077865 .  
South          -0.102360   0.042823  -2.390 0.017187 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4398 on 523 degrees of freedom
Multiple R-squared:  0.3185,    Adjusted R-squared:  0.3054 
F-statistic: 24.44 on 10 and 523 DF,  p-value: < 2.2e-16

The regression model estimates the natural logarithm of wages from several predictors: occupation, sector, union membership, education, experience, and demographic variables.

boxplot(residuals(rgrmodel), main = "Boxplot: Residuals", ylab = "Residual value")

Among the predictors, Union membership has a significant positive effect, increasing log wages by about 0.20, and Sector and South are also significant, with coefficients of 0.09 and -0.10, respectively. Gender is a key factor: being female reduces log wages by 0.22 on average. Although Education shows a positive coefficient (0.18), it is not statistically significant at the 5% level in this model. Other variables, such as Age, Experience, and Occupation, do not have significant effects in this specification. The model explains about 31.85% of the variability in log wages, indicating a moderate fit.

The boxplot of the residuals gives an overview of how well the regression model fits the data. The points above and below the whiskers are observations whose actual wages deviate substantially from the model's predictions, suggesting the model may fit poorly for those individuals.
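Beyond the residual boxplot, the standard lm diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage) give a fuller picture; a minimal sketch:

par(mfrow = c(2, 2))  # 2x2 panel layout
plot(rgrmodel)        # the four built-in lm diagnostic plots
par(mfrow = c(1, 1))  # restore the default layout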

The two main problems with this model are multicollinearity and the treatment of variables like race, occupation, and sector, which are coded as numbers but are really categorical.

  1. Multicollinearity arises when two or more predictors are highly correlated. For example, experience and age are strongly correlated, making it hard to separate their effects on wages; this inflates standard errors and makes the estimates unreliable.

  2. The variables race, occupation, and sector should be treated as categorical. Coding them as continuous imposes an arbitrary ordering on the categories and makes the coefficients uninterpretable.

In short, to fix these problems, we check the VIFs for multicollinearity and convert occupation, sector, and race to factors.

Addressing issues

library(car)  
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:psych':

    logit
vif(rgrmodel)
    Occupation         Sector          Union      Education     Experience 
      1.298232       1.198670       1.120861     231.195580    5184.093895 
           Age         Gender Marital_status           Race          South 
   4645.664977       1.091634       1.096130       1.037138       1.046828 

The Variance Inflation Factor (VIF) results show that most of the variables have VIF values around 1, indicating no significant multicollinearity. However, three variables have extremely high VIFs:

  • Education: VIF = 231.2

  • Experience: VIF = 5184.1

  • Age: VIF = 4645.7

These very high VIFs show that the three variables are almost perfectly collinear. From the correlation analysis we already know that experience and age are very strongly correlated (0.98), so we drop Experience and Age from the model and keep Education; a quick check of the suspected linear dependence is sketched below.
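A likely explanation, assuming this is the usual CPS-style wage data, is that age is approximately education plus experience plus six preschool years, an almost exact linear identity. That can be checked directly; an R-squared near 1 confirms the dependence:

# Regress Age on Education and Experience; an R-squared near 1
# indicates a near-exact linear relationship among the three.
summary(lm(Age ~ Education + Experience, data = wage))$r.squared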

rgrmodel_updated <- lm(ln_wage ~ Occupation + Sector + Union + Education + Gender + 
                       Marital_status + Race + South, data = wage)
vif(rgrmodel_updated)
    Occupation         Sector          Union      Education         Gender 
      1.273183       1.190236       1.102779       1.097074       1.081223 
Marital_status           Race          South 
      1.019463       1.036985       1.043706 

The new VIF results show that all variables now have VIF values close to 1, indicating that multicollinearity is no longer an issue in the model.

summary(rgrmodel_updated)

Call:
lm(formula = ln_wage ~ Occupation + Sector + Union + Education + 
    Gender + Marital_status + Race + South, data = wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1519 -0.3309  0.0034  0.3028  1.8315 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.95654    0.15028   6.365 4.26e-10 ***
Occupation     -0.01719    0.01339  -1.283 0.199910    
Sector          0.10996    0.03982   2.762 0.005953 ** 
Union           0.23924    0.05369   4.456 1.02e-05 ***
Education       0.07650    0.00787   9.721  < 2e-16 ***
Gender         -0.20108    0.04097  -4.908 1.23e-06 ***
Marital_status  0.13980    0.04171   3.352 0.000861 ***
Race            0.05157    0.02943   1.752 0.080287 .  
South          -0.11579    0.04411  -2.625 0.008912 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4537 on 525 degrees of freedom
Multiple R-squared:  0.2721,    Adjusted R-squared:  0.261 
F-statistic: 24.53 on 8 and 525 DF,  p-value: < 2.2e-16

The residual quartiles are roughly symmetric around zero, so the model does not systematically over- or under-predict. The coefficients show that Union and Education have significant positive effects on log wages, while Gender and South have significant negative effects. The R-squared of 0.2721 means the model explains about 27% of the variation in log wages, which is relatively low (and lower than the 0.3185 of the previous model, which still contained the collinear Experience and Age terms). The F-statistic is highly significant (p < 2e-16), so the predictors are jointly related to wages. Overall, education, union membership, and sector raise wages, while being female and living in the South are associated with lower wages, although the model could still be improved.
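One direction for improvement is to test the homoskedasticity assumption behind the OLS standard errors; assuming the lmtest package is installed, a Breusch-Pagan test is a one-line check:

library(lmtest)           # assumed to be installed
bptest(rgrmodel_updated)  # H0: constant residual variance; a small p-value indicates heteroskedasticity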

# Treat the coded variables as factors so each level gets its own coefficient
wage$Occupation <- as.factor(wage$Occupation)
wage$Sector <- as.factor(wage$Sector)
wage$Race <- as.factor(wage$Race)

rgrmodel_updated <- lm(ln_wage ~ Occupation + Sector + Union + Education + Gender + Marital_status + Race + South, data = wage)

summary(rgrmodel_updated)

Call:
lm(formula = ln_wage ~ Occupation + Sector + Union + Education + 
    Gender + Marital_status + Race + South, data = wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.36964 -0.28094 -0.00487  0.30712  1.65522 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.625035   0.168071   9.669  < 2e-16 ***
Occupation2    -0.392754   0.093825  -4.186 3.33e-05 ***
Occupation3    -0.249680   0.077883  -3.206  0.00143 ** 
Occupation4    -0.424285   0.082832  -5.122 4.26e-07 ***
Occupation5    -0.062567   0.074653  -0.838  0.40236    
Occupation6    -0.347314   0.080659  -4.306 1.99e-05 ***
Sector1         0.141428   0.056123   2.520  0.01204 *  
Sector2         0.136720   0.098785   1.384  0.16695    
Union           0.247276   0.052169   4.740 2.77e-06 ***
Education       0.045896   0.009562   4.800 2.08e-06 ***
Gender         -0.203950   0.042907  -4.753 2.60e-06 ***
Marital_status  0.117211   0.040832   2.871  0.00426 ** 
Race2          -0.070847   0.101494  -0.698  0.48546    
Race3           0.076640   0.058943   1.300  0.19409    
South          -0.102609   0.043031  -2.385  0.01746 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4394 on 519 degrees of freedom
Multiple R-squared:  0.325, Adjusted R-squared:  0.3068 
F-statistic: 17.85 on 14 and 519 DF,  p-value: < 2.2e-16

This model is better because it treats Occupation, Sector, and Race as proper categorical variables, estimating a separate effect for each level. It now shows that specific occupations and sectors significantly affect wages, which the previous version could not capture. The Adjusted R-squared also increased from 0.261 to 0.3068, indicating a better fit and more explained variance. A formal comparison of the two specifications is sketched below.
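Since the numeric coding of Occupation, Sector, and Race is nested within the factor coding, the improvement can also be tested formally. A sketch that refits both versions under distinct names (the code above reused rgrmodel_updated for both):

# Recover the numeric coding in a copy of the data
wage_num <- wage
wage_num$Occupation <- as.numeric(as.character(wage_num$Occupation))
wage_num$Sector     <- as.numeric(as.character(wage_num$Sector))
wage_num$Race       <- as.numeric(as.character(wage_num$Race))

model_numeric <- lm(ln_wage ~ Occupation + Sector + Union + Education + Gender +
                      Marital_status + Race + South, data = wage_num)
model_factor  <- lm(ln_wage ~ Occupation + Sector + Union + Education + Gender +
                      Marital_status + Race + South, data = wage)

anova(model_numeric, model_factor)  # F-test of the extra dummy-level terms
AIC(model_numeric, model_factor)    # lower AIC favors the factor model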