Regression Assignment

Research Question

Estimating the composite hourly wage of an individual in the province of Ontario, Canada using education, age, sex, language as dependent variables

library(carData)
mydataA <- force(SLID)
mydataA$ID <- 1:nrow(mydataA)
head(mydataA)

##   wages education age    sex language ID
## 1 10.56      15.0  40   Male  English  1
## 2 11.00      13.2  19   Male  English  2
## 3    NA      16.0  49   Male    Other  3
## 4 17.76      14.0  46   Male    Other  4
## 5    NA       8.0  71   Male  English  5
## 6 14.00      16.0  50 Female  English  6

Description:

Sample size: 7425

Unit of observation: 1 worker

Description of variables:

wages: Composite hourly wage rate from all jobs.
education: Number of years of schooling.
age: in years.
sex: A factor with levels: Female, Male.
language: A factor with levels: English, French, Other

Source: The data are taken from the public-use dataset made available by Statistics Canada, and prepared by the Institute for Social Research, York University

Cleaning the data:

library(tidyr)
mydataA <- drop_na(mydataA)
head(mydataA)

##    wages education age    sex language ID
## 1  10.56      15.0  40   Male  English  1
## 2  11.00      13.2  19   Male  English  2
## 4  17.76      14.0  46   Male    Other  4
## 6  14.00      16.0  50 Female  English  6
## 9   8.20      15.0  31   Male  English  9
## 12 16.97      13.5  30 Female  English 12

mydataA$genderF <- factor(mydataA$sex, 
                            levels = c("Male", "Female"), 
                            labels = c("Male", "Female"))

mydataA$languageF <- factor(mydataA$language, 
                            levels = c("English", "French", "Other"), 
                            labels = c("English", "French", "Other"))

Showing descriptive statistics

summary(mydataA[ ,-6])

##      wages         education          age           sex          language   
##  Min.   : 2.30   Min.   : 0.00   Min.   :16.0   Female:2001   English:3244  
##  1st Qu.: 9.25   1st Qu.:12.00   1st Qu.:28.0   Male  :1986   French : 259  
##  Median :14.13   Median :13.00   Median :36.0                 Other  : 484  
##  Mean   :15.54   Mean   :13.34   Mean   :37.1                               
##  3rd Qu.:19.72   3rd Qu.:15.10   3rd Qu.:46.0                               
##  Max.   :49.92   Max.   :20.00   Max.   :69.0                               
##    genderF       languageF   
##  Male  :1986   English:3244  
##  Female:2001   French : 259  
##                Other  : 484  
##                              
##                              
##

Explaining a few estimates of parameters

Mean Wages: The average composite hourly wage rate from all jobs for an individual in this data set is $15.54
Median Age: When all the observations in this data set are arranged in an ascending order on the basis of age the age for the observation lying in the middle is 36 years
Interquartile range: The difference between the 25th and 75th percentile of the number of schooling years for this data set is 15.1-12=3.1 years

Initial regression equation

y(estimated) = a + bx1 + cx2 + dx3 + ex4

Where:

y = Dependent variable = Composite hourly wage rate from all jobs

a = Intercept term

x1 = Number of years of schooling

x2 = age in years

x3 = A factor with levels: Female, Male

x4 = A factor with levels: English, French, Other

b = Beta for x1

c = Beta for x2

d = Beta for x3

e = Beta for x4

Explanation:

Age and number of schooling years usually have a positive relationship with wages
Gender: Usually there is a difference between the wages of males and females, ceterus paribus
Language: Language known could have an impact on wages but this depends on the country

As first model, we don’t use language in our fit

Since the sample size is quite large, we pick 75 units randomly for our analysis

set.seed(1)
mydataB <- mydataA[sample(nrow(mydataA), 75), ]

Showing the scatterplot

library(car)
scatterplotMatrix(~ wages + education + age, data = mydataB,
smooth = FALSE)

The scatterplot shows a positive relationship between wages and education, and wages and age

Forming the first equation without the language variable

fit <- lm(wages ~ age + education + genderF, 
           data = mydataB)

Checking for multi collinearity

vif(fit)

##       age education   genderF 
##  1.018230  1.032389  1.027290

mean(vif(fit))

## [1] 1.025969

VIF: All the values are almost equal to one, which means the fit does not have a problem with multi collinearity

Checking for outliers and units with significant impact

mydataB$StdResid <- round(rstandard(fit), 3) 
mydataB$CooksD <- round(cooks.distance(fit), 3)

hist(mydataB$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

We assume there are no outliers as all the standard residuals are within +/-3 range

hist(mydataB$CooksD, 
     xlab = "Cooks Distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks Distance")

There are some gaps and it implies units with significant impact should be removed from the dataset

Removing all the values which are outside the +/-3 range for Standard Residuals and with gaps for Cooks Distance

head(mydataB[order(-mydataB$StdResid),], 3)

##      wages education age    sex language   ID genderF languageF StdResid CooksD
## 4027 44.95        17  46   Male  English 4027    Male   English    2.988  0.149
## 2984 40.00        19  42 Female  English 2984  Female   English    2.503  0.140
## 7336 32.64        12  33   Male  English 7336    Male   English    2.460  0.043

head(mydataB[order(-mydataB$CooksD),], 5)

##      wages education age    sex language   ID genderF languageF StdResid CooksD
## 4027 44.95        17  46   Male  English 4027    Male   English    2.988  0.149
## 2984 40.00        19  42 Female  English 2984  Female   English    2.503  0.140
## 2402 20.16        20  56   Male  English 2402    Male   English   -1.491  0.092
## 3124 16.96         6  54 Female  English 3124  Female   English    1.240  0.074
## 1039  8.00        12  61   Male   French 1039    Male    French   -1.881  0.072

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

removal <- c("4027", "3124", "2984", "2402", "1039")

mydataB <- mydataB %>%
  filter(!(ID %in% removal))

fitA <- lm(wages ~ age + education + genderF, 
           data = mydataB)

Checking the data again:

mydataB$StdResid <- round(rstandard(fitA), 3) 
mydataB$CooksD <- round(cooks.distance(fitA), 3)

hist(mydataB$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Assumption about normality of the standard residuals

shapiro.test(mydataB$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataB$StdResid
## W = 0.9782, p-value = 0.2612

H0: standard residuals normally distributed

H1: standard residuals not normally distributed

We fail to reject H0 as p value > 0.05 and assume the errors are normally distributed

Outliers and units with significant impact

All the outliers and the units with significant impact are removed

Homoskedascity

mydataB$StdFitted <- scale(fitA$fitted.values)

library(car)
scatterplot(y = mydataB$StdResid, x = mydataB$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fitA)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : wages 
##  Variables: fitted values of wages 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    2.32531 
##  Prob > Chi2   =    0.1272849

We fail to reject the null hypothesis as p> 0.05 and as a result we assume the variance is constant

Showing summary of the fit

summary(fitA)

## 
## Call:
## lm(formula = wages ~ age + education + genderF, data = mydataB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1640  -3.8686  -0.6608   4.0237  17.2767 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.16750    4.47773  -1.154 0.252644    
## age            0.21805    0.05934   3.675 0.000479 ***
## education      1.11125    0.32098   3.462 0.000945 ***
## genderFFemale -3.79208    1.54123  -2.460 0.016502 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.224 on 66 degrees of freedom
## Multiple R-squared:  0.3092, Adjusted R-squared:  0.2778 
## F-statistic: 9.848 on 3 and 66 DF,  p-value: 1.877e-05

Explanation of summary:

Age coefficient: With an increase of one year in age, the Composite hourly wage from all jobs increases by $0.218 on an average with the education level and gender being constant(p<0.001)
Education coefficient: With an increase of one year in terms of years of schooling, the Composite hourly wage from all jobs increases by $1.11 on an average with the age and gender being constant(p<0.001)
Gender coefficient: Given the age in years and the number of years of schooling, the Composite hourly wage from all jobs is $3.79 less on average for a female compared with a male(p<0.02)
Multiple R-squared: 30.92% of the variability in Composite hourly wage from all jobs can be explained by a variability in age, education, and gender
F statistic:

H0: rho2 = 0

H1: rho2 > 0

We reject H0(p<0.001) and can say this model explains some of the variability in Composite hourly wage from all jobs

Example:

Estimate the Composite hourly wage from all jobs for someone who is 35 years old, has 14 years of schooling, and is a male

Composite hourly wage from all jobs(estimated) = -5.167 + 0.218(35) + 1.11(14) - 3.79(0)

= $18

sqrt(summary(fitA)$r.squared)

## [1] 0.5560808

The above value shows the relationship between the variables is moderate

library(lm.beta)
lm.beta(fitA)

## 
## Call:
## lm(formula = wages ~ age + education + genderF, data = mydataB)
## 
## Standardized Coefficients::
##   (Intercept)           age     education genderFFemale 
##            NA     0.3805994     0.3664231    -0.2606534

The above coefficients show the age variable is the most effective in explaining the wages in this fit, and the gender variable is the least effective

Another fit having the language variable:

fitLanguage <- lm(wages ~ age + education + genderF + languageF,
           data = mydataB)

summary(fitLanguage)

## 
## Call:
## lm(formula = wages ~ age + education + genderF + languageF, data = mydataB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0676  -4.1341  -0.8374   3.9729  16.4694 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.73033    4.37708  -1.081 0.283884    
## age              0.20008    0.05906   3.388 0.001210 ** 
## education        1.19152    0.31290   3.808 0.000317 ***
## genderFFemale   -3.89257    1.49554  -2.603 0.011478 *  
## languageFFrench -6.74627    3.18385  -2.119 0.037985 *  
## languageFOther  -3.83419    2.43817  -1.573 0.120749    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.019 on 64 degrees of freedom
## Multiple R-squared:  0.3735, Adjusted R-squared:  0.3246 
## F-statistic: 7.632 on 5 and 64 DF,  p-value: 1.124e-05

Explanation of summary:

Age coefficient: With an increase of one year in age, the Composite hourly wage from all jobs increases by $0.20 on an average with the education level, gender, and the language being constant(p<0.002)
Education coefficient: With an increase of one year in terms of years of schooling, the Composite hourly wage from all jobs increases by $1.19 on an average with the age, gender, and the language being constant(p<0.001)
Gender coefficient: Given the age in years, the number of years of schooling, and the language, the Composite hourly wage from all jobs is $3.89 less on average for a female compared with a male(p<0.02)
languageFFrench: Given the age in years, the number of years of schooling, and the gender, the Composite hourly wage from all jobs is $6.74 less on average for a French user compared with an English user(p<0.04)
Multiple R-squared: 37.35% of the variability in Composite hourly wage from all jobs can be explained by a variability in age, education, gender, and language
F statistic:

H0: rho2 = 0

H1: rho2 > 0

We reject H0(p<0.001) and can say this model explains some of the variability in Composite hourly wage from all jobs

Example:

Estimate the Composite hourly wage from all jobs for someone who is 35 years old, has 14 years of schooling, is a female, and uses French

Composite hourly wage from all jobs(estimated) = -4.73 + 0.20(35) + 1.19(14) - 3.89(1) - 6.74(1) - 3.83(0)

= $8.3

Comparing the equation which doesn’t have the language variable with the equation having the language variable

anova(fitA, fitLanguage)

## Analysis of Variance Table
## 
## Model 1: wages ~ age + education + genderF
## Model 2: wages ~ age + education + genderF + languageF
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
## 1     66 2556.5                             
## 2     64 2318.5  2    238.01 3.285 0.04384 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Simple model(with less variables) should be used

H1: Complex model(with more variables) should be used

We reject H0(p<0.05) and can say the complex model with language(fitLanguage) provides a higher degree of explanation

Regression Assignment

Ashish Bhambhaney

2024-01-27

Research Question

Description:

Showing descriptive statistics

Explaining a few estimates of parameters

Initial regression equation

Showing the scatterplot

Forming the first equation without the language variable

Checking for multi collinearity

Checking for outliers and units with significant impact

Removing all the values which are outside the +/-3 range for Standard Residuals and with gaps for Cooks Distance

Checking the data again:

Assumption about normality of the standard residuals

Outliers and units with significant impact

Homoskedascity

Showing summary of the fit

Explanation of summary:

Example:

Another fit having the language variable:

Explanation of summary:

Example:

Comparing the equation which doesn’t have the language variable with the equation having the language variable

Conclusion(Answer to the question)