Estimating the composite hourly wage of an individual in the province of Ontario, Canada using education, age, sex, language as dependent variables
library(carData)
mydataA <- force(SLID)
mydataA$ID <- 1:nrow(mydataA)
head(mydataA)
## wages education age sex language ID
## 1 10.56 15.0 40 Male English 1
## 2 11.00 13.2 19 Male English 2
## 3 NA 16.0 49 Male Other 3
## 4 17.76 14.0 46 Male Other 4
## 5 NA 8.0 71 Male English 5
## 6 14.00 16.0 50 Female English 6
Sample size: 7425
Unit of observation: 1 worker
Description of variables:
wages: Composite hourly wage rate from all jobs.
education: Number of years of schooling.
age: in years.
sex: A factor with levels: Female, Male.
language: A factor with levels: English, French, Other
Source: The data are taken from the public-use dataset made available by Statistics Canada, and prepared by the Institute for Social Research, York University
Cleaning the data:
library(tidyr)
mydataA <- drop_na(mydataA)
head(mydataA)
## wages education age sex language ID
## 1 10.56 15.0 40 Male English 1
## 2 11.00 13.2 19 Male English 2
## 4 17.76 14.0 46 Male Other 4
## 6 14.00 16.0 50 Female English 6
## 9 8.20 15.0 31 Male English 9
## 12 16.97 13.5 30 Female English 12
mydataA$genderF <- factor(mydataA$sex,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
mydataA$languageF <- factor(mydataA$language,
levels = c("English", "French", "Other"),
labels = c("English", "French", "Other"))
summary(mydataA[ ,-6])
## wages education age sex language
## Min. : 2.30 Min. : 0.00 Min. :16.0 Female:2001 English:3244
## 1st Qu.: 9.25 1st Qu.:12.00 1st Qu.:28.0 Male :1986 French : 259
## Median :14.13 Median :13.00 Median :36.0 Other : 484
## Mean :15.54 Mean :13.34 Mean :37.1
## 3rd Qu.:19.72 3rd Qu.:15.10 3rd Qu.:46.0
## Max. :49.92 Max. :20.00 Max. :69.0
## genderF languageF
## Male :1986 English:3244
## Female:2001 French : 259
## Other : 484
##
##
##
Mean Wages: The average composite hourly wage rate from all jobs for an individual in this data set is $15.54
Median Age: When all the observations in this data set are arranged in an ascending order on the basis of age the age for the observation lying in the middle is 36 years
Interquartile range: The difference between the 25th and 75th percentile of the number of schooling years for this data set is 15.1-12=3.1 years
y(estimated) = a + bx1 + cx2 + dx3 + ex4
Where:
y = Dependent variable = Composite hourly wage rate from all jobs
a = Intercept term
x1 = Number of years of schooling
x2 = age in years
x3 = A factor with levels: Female, Male
x4 = A factor with levels: English, French, Other
b = Beta for x1
c = Beta for x2
d = Beta for x3
e = Beta for x4
Explanation:
Age and number of schooling years usually have a positive relationship with wages
Gender: Usually there is a difference between the wages of males and females, ceterus paribus
Language: Language known could have an impact on wages but this depends on the country
As first model, we don’t use language in our fit
Since the sample size is quite large, we pick 75 units randomly for our analysis
set.seed(1)
mydataB <- mydataA[sample(nrow(mydataA), 75), ]
library(car)
scatterplotMatrix(~ wages + education + age, data = mydataB,
smooth = FALSE)
The scatterplot shows a positive relationship between wages and education, and wages and age
fit <- lm(wages ~ age + education + genderF,
data = mydataB)
vif(fit)
## age education genderF
## 1.018230 1.032389 1.027290
mean(vif(fit))
## [1] 1.025969
VIF: All the values are almost equal to one, which means the fit does not have a problem with multi collinearity
mydataB$StdResid <- round(rstandard(fit), 3)
mydataB$CooksD <- round(cooks.distance(fit), 3)
hist(mydataB$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
We assume there are no outliers as all the standard residuals are within +/-3 range
hist(mydataB$CooksD,
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distance")
There are some gaps and it implies units with significant impact should be removed from the dataset
head(mydataB[order(-mydataB$StdResid),], 3)
## wages education age sex language ID genderF languageF StdResid CooksD
## 4027 44.95 17 46 Male English 4027 Male English 2.988 0.149
## 2984 40.00 19 42 Female English 2984 Female English 2.503 0.140
## 7336 32.64 12 33 Male English 7336 Male English 2.460 0.043
head(mydataB[order(-mydataB$CooksD),], 5)
## wages education age sex language ID genderF languageF StdResid CooksD
## 4027 44.95 17 46 Male English 4027 Male English 2.988 0.149
## 2984 40.00 19 42 Female English 2984 Female English 2.503 0.140
## 2402 20.16 20 56 Male English 2402 Male English -1.491 0.092
## 3124 16.96 6 54 Female English 3124 Female English 1.240 0.074
## 1039 8.00 12 61 Male French 1039 Male French -1.881 0.072
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
removal <- c("4027", "3124", "2984", "2402", "1039")
mydataB <- mydataB %>%
filter(!(ID %in% removal))
fitA <- lm(wages ~ age + education + genderF,
data = mydataB)
mydataB$StdResid <- round(rstandard(fitA), 3)
mydataB$CooksD <- round(cooks.distance(fitA), 3)
hist(mydataB$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataB$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataB$StdResid
## W = 0.9782, p-value = 0.2612
H0: standard residuals normally distributed
H1: standard residuals not normally distributed
We fail to reject H0 as p value > 0.05 and assume the errors are normally distributed
All the outliers and the units with significant impact are removed
mydataB$StdFitted <- scale(fitA$fitted.values)
library(car)
scatterplot(y = mydataB$StdResid, x = mydataB$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fitA)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : wages
## Variables: fitted values of wages
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2.32531
## Prob > Chi2 = 0.1272849
We fail to reject the null hypothesis as p> 0.05 and as a result we assume the variance is constant
summary(fitA)
##
## Call:
## lm(formula = wages ~ age + education + genderF, data = mydataB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1640 -3.8686 -0.6608 4.0237 17.2767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.16750 4.47773 -1.154 0.252644
## age 0.21805 0.05934 3.675 0.000479 ***
## education 1.11125 0.32098 3.462 0.000945 ***
## genderFFemale -3.79208 1.54123 -2.460 0.016502 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.224 on 66 degrees of freedom
## Multiple R-squared: 0.3092, Adjusted R-squared: 0.2778
## F-statistic: 9.848 on 3 and 66 DF, p-value: 1.877e-05
Age coefficient: With an increase of one year in age, the Composite hourly wage from all jobs increases by $0.218 on an average with the education level and gender being constant(p<0.001)
Education coefficient: With an increase of one year in terms of years of schooling, the Composite hourly wage from all jobs increases by $1.11 on an average with the age and gender being constant(p<0.001)
Gender coefficient: Given the age in years and the number of years of schooling, the Composite hourly wage from all jobs is $3.79 less on average for a female compared with a male(p<0.02)
Multiple R-squared: 30.92% of the variability in Composite hourly wage from all jobs can be explained by a variability in age, education, and gender
F statistic:
H0: rho2 = 0
H1: rho2 > 0
We reject H0(p<0.001) and can say this model explains some of the variability in Composite hourly wage from all jobs
Estimate the Composite hourly wage from all jobs for someone who is 35 years old, has 14 years of schooling, and is a male
Composite hourly wage from all jobs(estimated) = -5.167 + 0.218(35) + 1.11(14) - 3.79(0)
= $18
sqrt(summary(fitA)$r.squared)
## [1] 0.5560808
The above value shows the relationship between the variables is moderate
library(lm.beta)
lm.beta(fitA)
##
## Call:
## lm(formula = wages ~ age + education + genderF, data = mydataB)
##
## Standardized Coefficients::
## (Intercept) age education genderFFemale
## NA 0.3805994 0.3664231 -0.2606534
The above coefficients show the age variable is the most effective in explaining the wages in this fit, and the gender variable is the least effective
fitLanguage <- lm(wages ~ age + education + genderF + languageF,
data = mydataB)
summary(fitLanguage)
##
## Call:
## lm(formula = wages ~ age + education + genderF + languageF, data = mydataB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0676 -4.1341 -0.8374 3.9729 16.4694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.73033 4.37708 -1.081 0.283884
## age 0.20008 0.05906 3.388 0.001210 **
## education 1.19152 0.31290 3.808 0.000317 ***
## genderFFemale -3.89257 1.49554 -2.603 0.011478 *
## languageFFrench -6.74627 3.18385 -2.119 0.037985 *
## languageFOther -3.83419 2.43817 -1.573 0.120749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.019 on 64 degrees of freedom
## Multiple R-squared: 0.3735, Adjusted R-squared: 0.3246
## F-statistic: 7.632 on 5 and 64 DF, p-value: 1.124e-05
Age coefficient: With an increase of one year in age, the Composite hourly wage from all jobs increases by $0.20 on an average with the education level, gender, and the language being constant(p<0.002)
Education coefficient: With an increase of one year in terms of years of schooling, the Composite hourly wage from all jobs increases by $1.19 on an average with the age, gender, and the language being constant(p<0.001)
Gender coefficient: Given the age in years, the number of years of schooling, and the language, the Composite hourly wage from all jobs is $3.89 less on average for a female compared with a male(p<0.02)
languageFFrench: Given the age in years, the number of years of schooling, and the gender, the Composite hourly wage from all jobs is $6.74 less on average for a French user compared with an English user(p<0.04)
Multiple R-squared: 37.35% of the variability in Composite hourly wage from all jobs can be explained by a variability in age, education, gender, and language
F statistic:
H0: rho2 = 0
H1: rho2 > 0
We reject H0(p<0.001) and can say this model explains some of the variability in Composite hourly wage from all jobs
Estimate the Composite hourly wage from all jobs for someone who is 35 years old, has 14 years of schooling, is a female, and uses French
Composite hourly wage from all jobs(estimated) = -4.73 + 0.20(35) + 1.19(14) - 3.89(1) - 6.74(1) - 3.83(0)
= $8.3
anova(fitA, fitLanguage)
## Analysis of Variance Table
##
## Model 1: wages ~ age + education + genderF
## Model 2: wages ~ age + education + genderF + languageF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 66 2556.5
## 2 64 2318.5 2 238.01 3.285 0.04384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Simple model(with less variables) should be used
H1: Complex model(with more variables) should be used
We reject H0(p<0.05) and can say the complex model with language(fitLanguage) provides a higher degree of explanation
The model with language should be used for estimating the composite hourly wage rate