MVA-HW3

MVA Homework 3

Data

We have a dataset that provides comprehensive information about professors. With regression analysis, we will check if we can use a set of explanatory variables (rank of professors degree and working years) to explain the differences in the values of the dependent variable (height of nine month salary of a professor).

Data source: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

#data(package = .packages(all.available = TRUE))
#install.packages("carData")
library(carData)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Importing dataset in Rstudio and adding the ID number column
data("Salaries", package="carData")
mydata <- force(Salaries) 
mydata$ID <- seq(1, nrow(mydata))


# Selecting the relevant columns
mydata <- select(mydata, ID, salary, rank, yrs.service)
head(mydata)

##   ID salary      rank yrs.service
## 1  1 139750      Prof          18
## 2  2 173200      Prof          16
## 3  3  79750  AsstProf           3
## 4  4 115000      Prof          39
## 5  5 141500      Prof          41
## 6  6  97000 AssocProf           6

Descriptive statistics

#Showing descriptive statistics
library(psych)
summary(mydata[-1])

##      salary              rank      yrs.service   
##  Min.   : 57800   AsstProf : 67   Min.   : 0.00  
##  1st Qu.: 91000   AssocProf: 64   1st Qu.: 7.00  
##  Median :107300   Prof     :266   Median :16.00  
##  Mean   :113706                   Mean   :17.61  
##  3rd Qu.:134185                   3rd Qu.:27.00  
##  Max.   :231545                   Max.   :60.00

result <- describeBy(mydata[-1])

## Warning in describeBy(mydata[-1]): no grouping variable requested

print(result)

##             vars   n      mean       sd median   trimmed      mad   min    max  range  skew kurtosis      se
## salary         1 397 113706.46 30289.04 107300 111401.61 29355.48 57800 231545 173745  0.71     0.18 1520.16
## rank*          2 397      2.50     0.77      3      2.62     0.00     1      3      2 -1.12    -0.38    0.04
## yrs.service    3 397     17.61    13.01     16     16.51    14.83     0     60     60  0.65    -0.34    0.65

#Factorizing
mydata$FRank <- factor(mydata$rank,
                         levels = c ("Prof","AsstProf","AssocProf"),
                         labels = c ("P", "AssistP","AssocP"))

Unit of observation: Individual professor
Sample size: 397 employees
Variables:
- Salary: Average yearly salary of an individual employee in the sample (money unit).
- Rank: Rank of their of their job position - AssocProf, AsstProf and Prof.
- yrs.service: Number of working years (in years)

Explanation of few parameters:

Min: Professor who has the lowest salary is paid 57,800 money units, the minimum working years is 0.
Max: Professor who has the highest salary is paid 231,545 money units, the maximum working years are 60 years.
1st quartile: 25% of professors receive a salary of 91,000 money units or less, and 25% of professors have been working 7 years or less.
Mean: The average salary of professors is 113,706 money units.
In the sample, there are 64 Assoc. professors, 67 Assistant professors and 266 professors.

Regression Analysis

Research question: Can difference in the nine month salary of professors be explained by the rank of their job position and number of their working years?

Regression model: Nine month salary = beta0 + beta1 x rank + beta2 x working years + error

Regression function: Nine month salary = b0 + b1 x rank + b2 x working years

Expectation: We expect that the average nine month salary is higher if professors are higher ranked and have higher number of working years.

Linearity assumption

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplot(mydata$salary ~ mydata$yrs.service,
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Nine month salary (money units)",
            xlab = "Working years (years)")

Based on the scatterplot we created, we can conclude that assumption of linearity is met. We can see a positive linear correlation between salary and working years of employees.

Absence of too strong multicolinearity assumption

fit <- lm(salary ~ yrs.service + FRank,
          data = mydata)

summary(fit)

## 
## Call:
## lm(formula = salary ~ yrs.service + FRank, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64515 -16180  -1234  12181 107174 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  130380.1     2995.6  43.524   <2e-16 ***
## yrs.service    -158.1      115.0  -1.376     0.17    
## FRankAssistP -49228.8     3991.9 -12.332   <2e-16 ***
## FRankAssocP  -34613.4     3515.9  -9.845   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23610 on 393 degrees of freedom
## Multiple R-squared:  0.3972, Adjusted R-squared:  0.3926 
## F-statistic:  86.3 on 3 and 393 DF,  p-value: < 2.2e-16

vif(fit)

##                 GVIF Df GVIF^(1/(2*Df))
## yrs.service 1.588306  1        1.260280
## FRank       1.588306  2        1.122622

Based on the VIF we can conclude that there is not too strong multicolinearity, because both values are less than 2.24 (koren iz 5).

No outliers assumption

mydata$StdResid <- round(rstandard(fit), 3)
mydata$CooksD <- round(cooks.distance(fit), 3)

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.97293, p-value = 9.596e-07

On the histogram we can see that normality is violated, which we also proved with Shapiro-Wilk normality test. We can reject the null hypothesis and conclude that errors are not normally distributed. However, because our sample is big enough, we can assume normality.

As some of the units are above 3, we will also have to remove them.

No high impact units assumption

hist(mydata$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

Based on the histogram of Cooks distances, we can see that there are two bigger gaps between units. Those units have a high impact and should be removed.

Removing outliers and units with high impact

head(mydata[order(- mydata$StdResid),], 10)

##      ID salary rank yrs.service FRank StdResid CooksD
## 44   44 231545 Prof          38     P    4.561  0.048
## 365 365 205500 Prof          43     P    3.494  0.042
## 250 250 204000 Prof           7     P    3.181  0.025
## 331 331 192253 Prof          60     P    3.080  0.090
## 272 272 194800 Prof          18     P    2.856  0.009
## 78   78 193000 Prof          19     P    2.786  0.008
## 351 351 186960 Prof          49     P    2.753  0.039
## 199 199 189409 Prof          33     P    2.730  0.012
## 390 390 186023 Prof          18     P    2.483  0.007
## 293 293 183800 Prof           9     P    2.333  0.011

We can see that units with ID 44, 365, 250 and 331 have Standardised residuals higher than 3, therefore we will remove them.

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(44, 365, 250, 331))

head(mydata[order(-mydata$CooksD),], 10)

##      ID salary rank yrs.service FRank StdResid CooksD
## 281 283  57800 Prof          51     P   -2.764  0.044
## 348 351 186960 Prof          49     P    2.753  0.039
## 131 132  76840 Prof          57     P   -1.917  0.030
## 316 318  67559 Prof          45     P   -2.378  0.022
## 125 126  78162 Prof          49     P   -1.903  0.018
## 297 299  72300 Prof          43     P   -2.187  0.016
## 198 199 189409 Prof          33     P    2.730  0.012
## 238 239  77202 Prof          40     P   -1.995  0.011
## 291 293 183800 Prof           9     P    2.333  0.011
## 190 191 180000 Prof           9     P    2.171  0.010

Based on Cooks distances we cann see that there are bigger gaps between units from the unit ID 132 further, therefore we will remove units with IDs: 283 and 351.

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(283, 351))

fit <- lm(salary ~ yrs.service + FRank, 
           data = mydata)

Homoscedasticity assumption

mydata$StdFitted <- scale(fit$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : salary 
##  Variables: fitted values of salary 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    51.49351 
##  Prob > Chi2   =    7.18343e-13

We can see from the graph and also from the Breusch Pagan test that homoscedasticity is violated. We reject the null hypothesis at p<0.001 and conclude that the data is heteroscedastic. Therefore, we have to use robust standard errors.

library(estimatr)
fit_robust <- lm_robust(salary ~ yrs.service + FRank,  
                        se_type = "HC1",
                        data = mydata)
summary(fit_robust)

## 
## Call:
## lm_robust(formula = salary ~ yrs.service + FRank, data = mydata, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|) CI Lower  CI Upper  DF
## (Intercept)  131869.6     3135.7  42.054 2.131e-146 125704.5 138034.74 387
## yrs.service    -282.2      125.6  -2.246  2.524e-02   -529.2    -35.21 387
## FRankAssistP -50423.9     3057.2 -16.494  1.151e-46 -56434.6 -44413.14 387
## FRankAssocP  -34619.8     2583.1 -13.403  6.507e-34 -39698.4 -29541.23 387
## 
## Multiple R-squared:  0.431 , Adjusted R-squared:  0.4266 
## F-statistic:   196 on 3 and 387 DF,  p-value: < 2.2e-16

Conclusion

With every additional working year of an employee, his salary decreases on average by 282.2 money units, assuming all other variables are unchanged (p=0.02).
Given the values of other explanatory variables, Assistant professor in nine month period earns on average 50423.9 less money units compared to Professor (p<0.001).
Given the values of other explanatory variables, Associate professor in nine month period earns on average 34619.8 less money units compared to Professor (p<0.001).
Variability of nine month salary of employees is affected by rank and working years - F statistic (p<0.001).
43.1% of variability of nine month salary of employees is explained by variability of rank and working years.