RQ: How is the nine month salary of the professors affected by the years since their PhD study, years of service, position, and gender?

#install.packages("carData")
library(carData)
## Warning: package 'carData' was built under R version 4.3.2

References: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

mydata <- data.frame(Salaries)
nrow(mydata)
## [1] 397
head(mydata)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Description of the data set:

We have a data set of salaries for professors and we have 397 observations (units) with 6 variables.

Unit of observation: A professor in the U.S. Sample size = 397

any(is.na(mydata))
## [1] FALSE

There are no missing values.

colnames(mydata) <- c("Position", "Discipline", "YearsSincePhD", "YearsOfService", "Gender", "Salary")   
                #Change of names of the variables
summary(mydata)
##       Position   Discipline YearsSincePhD   YearsOfService     Gender   
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      Salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545

*We see that there are more males in the sample and this is why I will put it in the first place when factoring the variable, so males are our reference group.

mydata$PositionF <- factor(mydata$Position,        #Factoring the variables
                         levels = c ("Prof","AsstProf","AssocProf"),
                         labels = c ("Prof", "AsstP","AssocP"))
mydata$DisciplineF <- factor(mydata$Discipline,
                         levels = c ("B","A"),
                         labels = c ("Applied", "Theoretical"))
mydata$GenderF <- factor(mydata$Gender,
                         levels = c ("Male","Female"),
                         labels = c ("Male", "Female"))   

mydata$ID <- seq(1, nrow(mydata))                 #We add ID number for each unit
head(mydata)                                
##    Position Discipline YearsSincePhD YearsOfService Gender Salary PositionF
## 1      Prof          B            19             18   Male 139750      Prof
## 2      Prof          B            20             16   Male 173200      Prof
## 3  AsstProf          B             4              3   Male  79750     AsstP
## 4      Prof          B            45             39   Male 115000      Prof
## 5      Prof          B            40             41   Male 141500      Prof
## 6 AssocProf          B             6              6   Male  97000    AssocP
##   DisciplineF GenderF ID
## 1     Applied    Male  1
## 2     Applied    Male  2
## 3     Applied    Male  3
## 4     Applied    Male  4
## 5     Applied    Male  5
## 6     Applied    Male  6

Regression analysis

How the Years Since PhD, Years of Service, Position, and Gender affect the nine month Salary of the Professors?

Dependent variable: (a nine month) Salary for professors

Explanatory variables:

Assumptions:

  1. Linearity in parameters.

  2. The expected value of errors is equal to 0.

  3. Homoskedasticity (we have equal variances).

  4. Normal distribution of errors.

  5. Errors are independent.

This assumption is already met, since we do not deal with time-series data or hierarchical type of data.

  1. No perfect multicolinearity.

  2. The number of units has to be greater than the number of estimated parameters: 𝑛 > π‘˜.

This assumption is already met (397 > 6).

Requirements:

  1. The dependent variable is numeric and the explanatory variables are numeric or a dummy.

This requirement is fulfilled.

  1. Each explanatory variable must vary.

Non of the variables’ variance is equal to 0, therefore this requirement is fulfilled.

  1. The outliers and units with high impact need to be removed.

  2. No too strong multicolinearity.

Linearity

library(car)
scatterplotMatrix(mydata[ , c(-1, -2, -5,-7, -8, -9, -10)],      #Scatter plot matrix
                  smooth = FALSE)

From the scatter plot matrix, we can see that the linearity is present - we do not see any round shapes in the graphs. But we should check if some of the variables are too strongly correlated.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2, -5,-7, -8, -9, -10)]))
##                YearsSincePhD YearsOfService Salary
## YearsSincePhD           1.00           0.91   0.42
## YearsOfService          0.91           1.00   0.33
## Salary                  0.42           0.33   1.00
## 
## n= 397 
## 
## 
## P
##                YearsSincePhD YearsOfService Salary
## YearsSincePhD                 0              0    
## YearsOfService  0                            0    
## Salary          0             0
  • We can see that the variables YearsOfService and YearsSincePhD are too strongly correlated. Pearson correlation coefficient above the value of 0.9 identifies a β€œvery strong” relationship.
fit <- lm(Salary ~ YearsSincePhD + YearsOfService + PositionF + GenderF,
          data = mydata)        #Setting up the regression model
summary(fit)
## 
## Call:
## lm(formula = Salary ~ YearsSincePhD + YearsOfService + PositionF + 
##     GenderF, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64844 -14939  -1401  12137 107498 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     128148.9     4072.2  31.469   <2e-16 ***
## YearsSincePhD      265.8      247.9   1.072   0.2842    
## YearsOfService    -373.8      220.8  -1.693   0.0913 .  
## PositionFAsstP  -46940.0     4421.4 -10.617   <2e-16 ***
## PositionFAssocP -33053.4     3700.7  -8.932   <2e-16 ***
## GenderFFemale    -5499.1     4034.7  -1.363   0.1737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23580 on 391 degrees of freedom
## Multiple R-squared:  0.4017, Adjusted R-squared:  0.3941 
## F-statistic: 52.51 on 5 and 391 DF,  p-value: < 2.2e-16
  • We notice the same two variables are statistically insignificant (p < 0.001).

We need to check for multicolinearity with VIF.

Multicolinearity

vif(fit)   #Check for the multicolinearity
##                    GVIF Df GVIF^(1/(2*Df))
## YearsSincePhD  7.271171  1        2.696511
## YearsOfService 5.876405  1        2.424130
## PositionF      2.002583  2        1.189591
## GenderF        1.029868  1        1.014824
  • VIF (Variance Inflation Factor) should be as close to 1, but not more than 5. However two values are higher than 5, which means that there is a too strong correlation between the variables - Years Since PhD and Years of Services are too strongly correlated. It means that we should remove one of there variables in the model because of the violation of the assumption.
mean(vif(fit))
## [1] 2.375424
  • When we look at the last column at VIF analysis (GVIF ^(1/2*Df)), the values should be below the square root of 5, which is approximately around 2.24. Again, we see the violation, which means that we have to exclude one of these variables that are too strongly correlated.
fit1 <- lm(Salary ~ YearsSincePhD + PositionF + GenderF,
         data = mydata)          #New regression model, we excluded Years Of Service
summary(fit1)
## 
## Call:
## lm(formula = Salary ~ YearsSincePhD + PositionF + GenderF, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67230 -15338  -1530  12163 105318 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     129727.0     3973.4  32.649   <2e-16 ***
## YearsSincePhD      -92.1      129.7  -0.710    0.478    
## PositionFAsstP  -47635.9     4412.7 -10.795   <2e-16 ***
## PositionFAssocP -33623.1     3694.1  -9.102   <2e-16 ***
## GenderFFemale    -5146.6     4038.9  -1.274    0.203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23630 on 392 degrees of freedom
## Multiple R-squared:  0.3973, Adjusted R-squared:  0.3912 
## F-statistic: 64.61 on 4 and 392 DF,  p-value: < 2.2e-16
vif(fit1)   #Check for the multicolinearity
##                   GVIF Df GVIF^(1/(2*Df))
## YearsSincePhD 1.980427  1        1.407276
## PositionF     1.979087  2        1.186086
## GenderF       1.027125  1        1.013472
  • Now, the VIF values are below 5 and close to 1, which is okay. Also the last column’s values (GVIF ^(1/2*Df)) are below 2.24, which is great.
mean(vif(fit1))
## [1] 1.399275
  • It is below 2.24, which is okay.

Outliers and High impact units

mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
head(mydata[c(-1,-2,-4,-5,-8)])
##   YearsSincePhD Salary PositionF GenderF ID StdResid CooksD
## 1            19 139750      Prof    Male  1    0.500  0.000
## 2            20 173200      Prof    Male  2    1.923  0.004
## 3             4  79750     AsstP    Male  3   -0.084  0.000
## 4            45 115000      Prof    Male  4   -0.451  0.000
## 5            40 141500      Prof    Male  5    0.657  0.001
## 6             6  97000    AssocP    Male  6    0.062  0.000
sum(mydata$StdResid)
## [1] -0.115
  • The sum of standardized residuals should by the theory be equal to 1. Here, however it is not so let it further analyze the potential outliers, that should be removed.
hist(mydata$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = " Histogram of standardized residuals")

  • From the histogram we can identify the outliers by looking at units that fall out of the interval from -3 to +3.

Normal distribution of errors

shapiro.test(mydata$StdResid)     #Normal distribution of errors
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.97497, p-value = 2.378e-06

*Note: This is not a necessary since the sample size is large enough.

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

Based on the sample data, we reject the null hypothesis (p < 0.001) and conclude that errors are not normally distributed, however, this does not matter since the sample size is sufficiently large.

Additionally, from the histogram of standardized residuals, it can be seen that there are some outliers since some values are smaller than -3 and bigger than 3.

hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distance")

  • A histogram of Cook’s distances show us the units with high impact that should be removed. The distances should be lower than 1, but also we should not identify any gaps.
std_resid <- rstandard(fit1)

# Identify units with standardized residuals exceeding a certain threshold
threshold <- 3  # Adjust the threshold based on your criteria
outlier_units <- which(abs(std_resid) > threshold)

# Print or inspect the units with large standardized residuals
print(outlier_units)
##  44 250 365 
##  44 250 365

The function tells us to remove the units 44, 250, and 365 as the outliers.

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 10)  #Display 10 units with the largest Cook's distance
##      ID CooksD
## 351 351  0.039
## 283 283  0.032
## 331 331  0.028
## 44   44  0.027
## 365 365  0.024
## 132 132  0.023
## 126 126  0.019
## 299 299  0.017
## 324 324  0.017
## 272 272  0.016

When looking at the values of Cooks distances, we can see some bigger gaps in between the IDs. We will have to remove units with ID of 331, 44, 283, 365, 351, and 132.

mydata1 <- data.frame(mydata)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata1 <- mydata %>%
  filter(!ID %in% c(44, 132,250, 283, 331, 351, 365))            #Removing units

Homoskedasticity?

fit2 <- lm(Salary ~ YearsSincePhD + PositionF + GenderF,   #Fit the model again without the deleted units
           data = mydata1) 

mydata1$stdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(StdResid ~ stdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE,
            data = mydata1)

We can see from the scatter plot heteroskedasticity however let’s check it with Breusch Pagan test.

H0: The variance is constant./ We are dealing with homoskedasticity.

H1: The variance is not constant. / We are dealing with heteroskedasticity.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : Salary 
##  Variables: fitted values of Salary 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    51.90417 
##  Prob > Chi2   =    5.82765e-13

Based on the sample data, we reject H0 at p < 0.001, and conclude that we are dealing with heteroskedasticity. It means we violate the assumption and have to use the Robust Standard Errors.

#install.packages("estimatr")
library(estimatr)
## Warning: package 'estimatr' was built under R version 4.3.2
fit_robust <- lm_robust(Salary ~ YearsSincePhD + PositionF + GenderF,  
                        se_type = "HC1",               #White's robust standard errors
                        data = mydata1)
summary(fit_robust)
## 
## Call:
## lm_robust(formula = Salary ~ YearsSincePhD + PositionF + GenderF, 
##     data = mydata1, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|) CI Lower  CI Upper  DF
## (Intercept)       131021     3762.6  34.822 5.046e-121 123623.3 138418.87 385
## YearsSincePhD       -178      134.7  -1.321  1.872e-01   -442.8     86.85 385
## PositionFAsstP    -48556     3268.4 -14.856  8.756e-40 -54982.0 -42129.68 385
## PositionFAssocP   -33651     2604.1 -12.922  5.569e-32 -38771.6 -28531.39 385
## GenderFFemale      -4756     2864.5  -1.660  9.764e-02 -10388.2    875.69 385
## 
## Multiple R-squared:  0.4317 ,    Adjusted R-squared:  0.4258 
## F-statistic: 155.1 on 4 and 385 DF,  p-value: < 2.2e-16

We can interpret only coefficients that are statistically significant. We look at p-values.

H0: 𝛽1 = 0

H1: 𝛽1 =/= 0

We cannot reject the null hypothesis (p = 0.188) and conclude that we could not find the statistical effect of Years Since PhD on a nine month Salary of professors (the partial regression coefficient is statistically insignificant).

H0: 𝛽2 = 0

H1: 𝛽2 =/= 0

We reject null hypothesis and conclude that based on the sample data, given the values of other explanatory variables, Assistant Professors have on average their nine month salary lower than Professors by 48556 USD, p < 0.001.The partial regression coefficient is statistically significant.

H0: 𝛽3 = 0

H1: 𝛽3 =/= 0

We reject null hypothesis and based on the sample data, given other explanatory variables, Associate Professors have on average their nine month salary lower than Professors by 33651 USD, p < 0.001. The partial regression coefficient is statistically significant.

H0: 𝛽4 = 0

H1: 𝛽4 =/= 0

We cannot reject the null hypothesis. Given the gender of professors, there is a difference in a nine month salary between males and females. However, the observed difference in the Salary is not statistically significant at p = 0.098 (because p > 0.05). Therefore, we do not have enough evidence to conclude that Female Professors have a significantly different Salary, compared to Male Professors.

When looking at the coefficient of determination (R squared): 43.17% of variability of nine-month salary of professors, is explained by the variability of Years of service, Position, and Gender.

Nine-month Salary of professors is significantly affected by the Position of the Professors, with both Assistants and Associate Professors earning less than regular Professors. Years since PhD and Gender do not have a significant effect on the nine-month Salary of the Professors in this model.