We have a dataset that provides comprehensive information about professors. With regression analysis, we will check if we can use a set of explanatory variables (rank of professors degree and working years) to explain the differences in the values of the dependent variable (height of nine month salary of a professor).
Data source: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
#data(package = .packages(all.available = TRUE))
#install.packages("carData")
library(carData)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Importing dataset in Rstudio and adding the ID number column
data("Salaries", package="carData")
mydata <- force(Salaries)
mydata$ID <- seq(1, nrow(mydata))
# Selecting the relevant columns
mydata <- select(mydata, ID, salary, rank, yrs.service)
head(mydata)
## ID salary rank yrs.service
## 1 1 139750 Prof 18
## 2 2 173200 Prof 16
## 3 3 79750 AsstProf 3
## 4 4 115000 Prof 39
## 5 5 141500 Prof 41
## 6 6 97000 AssocProf 6
#Showing descriptive statistics
library(psych)
summary(mydata[-1])
## salary rank yrs.service
## Min. : 57800 AsstProf : 67 Min. : 0.00
## 1st Qu.: 91000 AssocProf: 64 1st Qu.: 7.00
## Median :107300 Prof :266 Median :16.00
## Mean :113706 Mean :17.61
## 3rd Qu.:134185 3rd Qu.:27.00
## Max. :231545 Max. :60.00
result <- describeBy(mydata[-1])
## Warning in describeBy(mydata[-1]): no grouping variable requested
print(result)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## salary 1 397 113706.46 30289.04 107300 111401.61 29355.48 57800 231545 173745 0.71 0.18 1520.16
## rank* 2 397 2.50 0.77 3 2.62 0.00 1 3 2 -1.12 -0.38 0.04
## yrs.service 3 397 17.61 13.01 16 16.51 14.83 0 60 60 0.65 -0.34 0.65
#Factorizing
mydata$FRank <- factor(mydata$rank,
levels = c ("Prof","AsstProf","AssocProf"),
labels = c ("P", "AssistP","AssocP"))
Unit of observation: Individual professor
Sample size: 397 employees
Variables:
Salary: Average yearly salary of an individual employee in the sample (money unit).
Rank: Rank of their of their job position - AssocProf, AsstProf and Prof.
yrs.service: Number of working years (in years)
Explanation of few parameters:
Min: Professor who has the lowest salary is paid 57,800 money units, the minimum working years is 0.
Max: Professor who has the highest salary is paid 231,545 money units, the maximum working years are 60 years.
1st quartile: 25% of professors receive a salary of 91,000 money units or less, and 25% of professors have been working 7 years or less.
Mean: The average salary of professors is 113,706 money units.
In the sample, there are 64 Assoc. professors, 67 Assistant professors and 266 professors.
Research question: Can difference in the nine month salary of professors be explained by the rank of their job position and number of their working years?
Regression model: Nine month salary = beta0 + beta1 x rank + beta2 x working years + error
Regression function: Nine month salary = b0 + b1 x rank + b2 x working years
Expectation: We expect that the average nine month salary is higher if professors are higher ranked and have higher number of working years.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplot(mydata$salary ~ mydata$yrs.service,
smooth = FALSE,
boxplots = FALSE,
ylab = "Nine month salary (money units)",
xlab = "Working years (years)")
Based on the scatterplot we created, we can conclude that assumption of linearity is met. We can see a positive linear correlation between salary and working years of employees.
fit <- lm(salary ~ yrs.service + FRank,
data = mydata)
summary(fit)
##
## Call:
## lm(formula = salary ~ yrs.service + FRank, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64515 -16180 -1234 12181 107174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130380.1 2995.6 43.524 <2e-16 ***
## yrs.service -158.1 115.0 -1.376 0.17
## FRankAssistP -49228.8 3991.9 -12.332 <2e-16 ***
## FRankAssocP -34613.4 3515.9 -9.845 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23610 on 393 degrees of freedom
## Multiple R-squared: 0.3972, Adjusted R-squared: 0.3926
## F-statistic: 86.3 on 3 and 393 DF, p-value: < 2.2e-16
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## yrs.service 1.588306 1 1.260280
## FRank 1.588306 2 1.122622
Based on the VIF we can conclude that there is not too strong multicolinearity, because both values are less than 2.24 (koren iz 5).
mydata$StdResid <- round(rstandard(fit), 3)
mydata$CooksD <- round(cooks.distance(fit), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.97293, p-value = 9.596e-07
On the histogram we can see that normality is violated, which we also proved with Shapiro-Wilk normality test. We can reject the null hypothesis and conclude that errors are not normally distributed. However, because our sample is big enough, we can assume normality.
As some of the units are above 3, we will also have to remove them.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Based on the histogram of Cooks distances, we can see that there are two bigger gaps between units. Those units have a high impact and should be removed.
head(mydata[order(- mydata$StdResid),], 10)
## ID salary rank yrs.service FRank StdResid CooksD
## 44 44 231545 Prof 38 P 4.561 0.048
## 365 365 205500 Prof 43 P 3.494 0.042
## 250 250 204000 Prof 7 P 3.181 0.025
## 331 331 192253 Prof 60 P 3.080 0.090
## 272 272 194800 Prof 18 P 2.856 0.009
## 78 78 193000 Prof 19 P 2.786 0.008
## 351 351 186960 Prof 49 P 2.753 0.039
## 199 199 189409 Prof 33 P 2.730 0.012
## 390 390 186023 Prof 18 P 2.483 0.007
## 293 293 183800 Prof 9 P 2.333 0.011
We can see that units with ID 44, 365, 250 and 331 have Standardised residuals higher than 3, therefore we will remove them.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(44, 365, 250, 331))
head(mydata[order(-mydata$CooksD),], 10)
## ID salary rank yrs.service FRank StdResid CooksD
## 281 283 57800 Prof 51 P -2.764 0.044
## 348 351 186960 Prof 49 P 2.753 0.039
## 131 132 76840 Prof 57 P -1.917 0.030
## 316 318 67559 Prof 45 P -2.378 0.022
## 125 126 78162 Prof 49 P -1.903 0.018
## 297 299 72300 Prof 43 P -2.187 0.016
## 198 199 189409 Prof 33 P 2.730 0.012
## 238 239 77202 Prof 40 P -1.995 0.011
## 291 293 183800 Prof 9 P 2.333 0.011
## 190 191 180000 Prof 9 P 2.171 0.010
Based on Cooks distances we cann see that there are bigger gaps between units from the unit ID 132 further, therefore we will remove units with IDs: 283 and 351.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(283, 351))
fit <- lm(salary ~ yrs.service + FRank,
data = mydata)
mydata$StdFitted <- scale(fit$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------
## Response : salary
## Variables: fitted values of salary
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 51.49351
## Prob > Chi2 = 7.18343e-13
We can see from the graph and also from the Breusch Pagan test that homoscedasticity is violated. We reject the null hypothesis at p<0.001 and conclude that the data is heteroscedastic. Therefore, we have to use robust standard errors.
library(estimatr)
fit_robust <- lm_robust(salary ~ yrs.service + FRank,
se_type = "HC1",
data = mydata)
summary(fit_robust)
##
## Call:
## lm_robust(formula = salary ~ yrs.service + FRank, data = mydata,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 131869.6 3135.7 42.054 2.131e-146 125704.5 138034.74 387
## yrs.service -282.2 125.6 -2.246 2.524e-02 -529.2 -35.21 387
## FRankAssistP -50423.9 3057.2 -16.494 1.151e-46 -56434.6 -44413.14 387
## FRankAssocP -34619.8 2583.1 -13.403 6.507e-34 -39698.4 -29541.23 387
##
## Multiple R-squared: 0.431 , Adjusted R-squared: 0.4266
## F-statistic: 196 on 3 and 387 DF, p-value: < 2.2e-16
With every additional working year of an employee, his salary decreases on average by 282.2 money units, assuming all other variables are unchanged (p=0.02).
Given the values of other explanatory variables, Assistant professor in nine month period earns on average 50423.9 less money units compared to Professor (p<0.001).
Given the values of other explanatory variables, Associate professor in nine month period earns on average 34619.8 less money units compared to Professor (p<0.001).
Variability of nine month salary of employees is affected by rank and working years - F statistic (p<0.001).
43.1% of variability of nine month salary of employees is explained by variability of rank and working years.