RQ: How is the nine month salary of the professors affected by the years since their PhD study, years of service, position, and gender?

#install.packages("carData")
library(carData)

## Warning: package 'carData' was built under R version 4.3.2

References: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

mydata <- data.frame(Salaries)
nrow(mydata)

## [1] 397

head(mydata)

##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Description of the data set:

We have a data set of salaries for professors and we have 397 observations (units) with 6 variables.

Unit of observation: A professor in the U.S. Sample size = 397

rank: A factor with levels AssocProf, AsstProf, Prof.It is a type of categorical variable with 3 values.
discipline: A factor with levels A (“theoretical” departments) or B (“applied” departments). It is a categorical (dummy) variable.
yrs.since.phd: Years since PhD. It is a numeric type of variable.
yrs.service: Years of service. It is a numeric type of variable.
sex: A factor with levels Female and Male. It is a categorical variable (dummy) with two variables.
salary: A nine-month salary, measured in dollars.

any(is.na(mydata))

## [1] FALSE

There are no missing values.

colnames(mydata) <- c("Position", "Discipline", "YearsSincePhD", "YearsOfService", "Gender", "Salary")   
                #Change of names of the variables
summary(mydata)

##       Position   Discipline YearsSincePhD   YearsOfService     Gender   
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      Salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545

In our sample of 397 professors, there are 67 professors, who hold the position of an Assistant Professor, 64 of an Associate Professor, and 266 of them hold the position of a Professor (a regular one).
Minimum number of years since professors’ PhD study is 1 year and the maximum is 56 years.
75% of professors in the sample have up to 27 years of service, and the rest 25% have been in service for a longer period of time.
The average nine month salary of professors in the sample is 113.704 USD

*We see that there are more males in the sample and this is why I will put it in the first place when factoring the variable, so males are our reference group.

mydata$PositionF <- factor(mydata$Position,        #Factoring the variables
                         levels = c ("Prof","AsstProf","AssocProf"),
                         labels = c ("Prof", "AsstP","AssocP"))
mydata$DisciplineF <- factor(mydata$Discipline,
                         levels = c ("B","A"),
                         labels = c ("Applied", "Theoretical"))
mydata$GenderF <- factor(mydata$Gender,
                         levels = c ("Male","Female"),
                         labels = c ("Male", "Female"))   

mydata$ID <- seq(1, nrow(mydata))                 #We add ID number for each unit
head(mydata)

##    Position Discipline YearsSincePhD YearsOfService Gender Salary PositionF
## 1      Prof          B            19             18   Male 139750      Prof
## 2      Prof          B            20             16   Male 173200      Prof
## 3  AsstProf          B             4              3   Male  79750     AsstP
## 4      Prof          B            45             39   Male 115000      Prof
## 5      Prof          B            40             41   Male 141500      Prof
## 6 AssocProf          B             6              6   Male  97000    AssocP
##   DisciplineF GenderF ID
## 1     Applied    Male  1
## 2     Applied    Male  2
## 3     Applied    Male  3
## 4     Applied    Male  4
## 5     Applied    Male  5
## 6     Applied    Male  6

Regression analysis

How the Years Since PhD, Years of Service, Position, and Gender affect the nine month Salary of the Professors?

Dependent variable: (a nine month) Salary for professors

Explanatory variables:

Years of Service: I expect a positive regression coefficient, meaning that with each additional year of service, the nine month salary would increase.
Years Since PhD: I expect a positive regression coefficient, meaning that with each additional year since PhD, the nine month salary would increase (indirectly, “because the more experienced one becomes each year”).
Position: With a higher position, we expect a higher nine month salary (a positive regression coefficient).
Gender: By the statistics, males on average have higher salaries and that is why we expect higher nine month salaries for Males (a positive regression coefficient).

Assumptions:

Linearity in parameters.
The expected value of errors is equal to 0.
Homoskedasticity (we have equal variances).
Normal distribution of errors.
Errors are independent.

This assumption is already met, since we do not deal with time-series data or hierarchical type of data.

No perfect multicolinearity.
The number of units has to be greater than the number of estimated parameters: 𝑛 > 𝑘.

This assumption is already met (397 > 6).

Requirements:

The dependent variable is numeric and the explanatory variables are numeric or a dummy.

This requirement is fulfilled.

Each explanatory variable must vary.

Non of the variables’ variance is equal to 0, therefore this requirement is fulfilled.

The outliers and units with high impact need to be removed.
No too strong multicolinearity.

Linearity

library(car)

scatterplotMatrix(mydata[ , c(-1, -2, -5,-7, -8, -9, -10)],      #Scatter plot matrix
                  smooth = FALSE)

From the scatter plot matrix, we can see that the linearity is present - we do not see any round shapes in the graphs. But we should check if some of the variables are too strongly correlated.

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ , c(-1, -2, -5,-7, -8, -9, -10)]))

##                YearsSincePhD YearsOfService Salary
## YearsSincePhD           1.00           0.91   0.42
## YearsOfService          0.91           1.00   0.33
## Salary                  0.42           0.33   1.00
## 
## n= 397 
## 
## 
## P
##                YearsSincePhD YearsOfService Salary
## YearsSincePhD                 0              0    
## YearsOfService  0                            0    
## Salary          0             0

We can see that the variables YearsOfService and YearsSincePhD are too strongly correlated. Pearson correlation coefficient above the value of 0.9 identifies a “very strong” relationship.

fit <- lm(Salary ~ YearsSincePhD + YearsOfService + PositionF + GenderF,
          data = mydata)        #Setting up the regression model

summary(fit)

## 
## Call:
## lm(formula = Salary ~ YearsSincePhD + YearsOfService + PositionF + 
##     GenderF, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64844 -14939  -1401  12137 107498 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     128148.9     4072.2  31.469   <2e-16 ***
## YearsSincePhD      265.8      247.9   1.072   0.2842    
## YearsOfService    -373.8      220.8  -1.693   0.0913 .  
## PositionFAsstP  -46940.0     4421.4 -10.617   <2e-16 ***
## PositionFAssocP -33053.4     3700.7  -8.932   <2e-16 ***
## GenderFFemale    -5499.1     4034.7  -1.363   0.1737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23580 on 391 degrees of freedom
## Multiple R-squared:  0.4017, Adjusted R-squared:  0.3941 
## F-statistic: 52.51 on 5 and 391 DF,  p-value: < 2.2e-16

We notice the same two variables are statistically insignificant (p < 0.001).

We need to check for multicolinearity with VIF.

Multicolinearity

vif(fit)   #Check for the multicolinearity

##                    GVIF Df GVIF^(1/(2*Df))
## YearsSincePhD  7.271171  1        2.696511
## YearsOfService 5.876405  1        2.424130
## PositionF      2.002583  2        1.189591
## GenderF        1.029868  1        1.014824

VIF (Variance Inflation Factor) should be as close to 1, but not more than 5. However two values are higher than 5, which means that there is a too strong correlation between the variables - Years Since PhD and Years of Services are too strongly correlated. It means that we should remove one of there variables in the model because of the violation of the assumption.

mean(vif(fit))

## [1] 2.375424

When we look at the last column at VIF analysis (GVIF ^(1/2*Df)), the values should be below the square root of 5, which is approximately around 2.24. Again, we see the violation, which means that we have to exclude one of these variables that are too strongly correlated.

fit1 <- lm(Salary ~ YearsSincePhD + PositionF + GenderF,
         data = mydata)          #New regression model, we excluded Years Of Service

summary(fit1)

## 
## Call:
## lm(formula = Salary ~ YearsSincePhD + PositionF + GenderF, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67230 -15338  -1530  12163 105318 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     129727.0     3973.4  32.649   <2e-16 ***
## YearsSincePhD      -92.1      129.7  -0.710    0.478    
## PositionFAsstP  -47635.9     4412.7 -10.795   <2e-16 ***
## PositionFAssocP -33623.1     3694.1  -9.102   <2e-16 ***
## GenderFFemale    -5146.6     4038.9  -1.274    0.203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23630 on 392 degrees of freedom
## Multiple R-squared:  0.3973, Adjusted R-squared:  0.3912 
## F-statistic: 64.61 on 4 and 392 DF,  p-value: < 2.2e-16

vif(fit1)   #Check for the multicolinearity

##                   GVIF Df GVIF^(1/(2*Df))
## YearsSincePhD 1.980427  1        1.407276
## PositionF     1.979087  2        1.186086
## GenderF       1.027125  1        1.013472

Now, the VIF values are below 5 and close to 1, which is okay. Also the last column’s values (GVIF ^(1/2*Df)) are below 2.24, which is great.

mean(vif(fit1))

## [1] 1.399275

It is below 2.24, which is okay.

Outliers and High impact units

mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
head(mydata[c(-1,-2,-4,-5,-8)])

##   YearsSincePhD Salary PositionF GenderF ID StdResid CooksD
## 1            19 139750      Prof    Male  1    0.500  0.000
## 2            20 173200      Prof    Male  2    1.923  0.004
## 3             4  79750     AsstP    Male  3   -0.084  0.000
## 4            45 115000      Prof    Male  4   -0.451  0.000
## 5            40 141500      Prof    Male  5    0.657  0.001
## 6             6  97000    AssocP    Male  6    0.062  0.000

sum(mydata$StdResid)

## [1] -0.115

The sum of standardized residuals should by the theory be equal to 1. Here, however it is not so let it further analyze the potential outliers, that should be removed.

hist(mydata$StdResid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = " Histogram of standardized residuals")

From the histogram we can identify the outliers by looking at units that fall out of the interval from -3 to +3.

Normal distribution of errors

shapiro.test(mydata$StdResid)     #Normal distribution of errors

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.97497, p-value = 2.378e-06

*Note: This is not a necessary since the sample size is large enough.

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

Based on the sample data, we reject the null hypothesis (p < 0.001) and conclude that errors are not normally distributed, however, this does not matter since the sample size is sufficiently large.

Additionally, from the histogram of standardized residuals, it can be seen that there are some outliers since some values are smaller than -3 and bigger than 3.

hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distance")

A histogram of Cook’s distances show us the units with high impact that should be removed. The distances should be lower than 1, but also we should not identify any gaps.

std_resid <- rstandard(fit1)

# Identify units with standardized residuals exceeding a certain threshold
threshold <- 3  # Adjust the threshold based on your criteria
outlier_units <- which(abs(std_resid) > threshold)

# Print or inspect the units with large standardized residuals
print(outlier_units)

##  44 250 365 
##  44 250 365

The function tells us to remove the units 44, 250, and 365 as the outliers.

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 10)  #Display 10 units with the largest Cook's distance

##      ID CooksD
## 351 351  0.039
## 283 283  0.032
## 331 331  0.028
## 44   44  0.027
## 365 365  0.024
## 132 132  0.023
## 126 126  0.019
## 299 299  0.017
## 324 324  0.017
## 272 272  0.016

When looking at the values of Cooks distances, we can see some bigger gaps in between the IDs. We will have to remove units with ID of 331, 44, 283, 365, 351, and 132.

mydata1 <- data.frame(mydata)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata1 <- mydata %>%
  filter(!ID %in% c(44, 132,250, 283, 331, 351, 365))            #Removing units

Homoskedasticity?

fit2 <- lm(Salary ~ YearsSincePhD + PositionF + GenderF,   #Fit the model again without the deleted units
           data = mydata1) 

mydata1$stdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(StdResid ~ stdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE,
            data = mydata1)

We can see from the scatter plot heteroskedasticity however let’s check it with Breusch Pagan test.

H0: The variance is constant./ We are dealing with homoskedasticity.

H1: The variance is not constant. / We are dealing with heteroskedasticity.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : Salary 
##  Variables: fitted values of Salary 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    51.90417 
##  Prob > Chi2   =    5.82765e-13

Based on the sample data, we reject H0 at p < 0.001, and conclude that we are dealing with heteroskedasticity. It means we violate the assumption and have to use the Robust Standard Errors.

#install.packages("estimatr")
library(estimatr)

## Warning: package 'estimatr' was built under R version 4.3.2

fit_robust <- lm_robust(Salary ~ YearsSincePhD + PositionF + GenderF,  
                        se_type = "HC1",               #White's robust standard errors
                        data = mydata1)
summary(fit_robust)

## 
## Call:
## lm_robust(formula = Salary ~ YearsSincePhD + PositionF + GenderF, 
##     data = mydata1, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|) CI Lower  CI Upper  DF
## (Intercept)       131021     3762.6  34.822 5.046e-121 123623.3 138418.87 385
## YearsSincePhD       -178      134.7  -1.321  1.872e-01   -442.8     86.85 385
## PositionFAsstP    -48556     3268.4 -14.856  8.756e-40 -54982.0 -42129.68 385
## PositionFAssocP   -33651     2604.1 -12.922  5.569e-32 -38771.6 -28531.39 385
## GenderFFemale      -4756     2864.5  -1.660  9.764e-02 -10388.2    875.69 385
## 
## Multiple R-squared:  0.4317 ,    Adjusted R-squared:  0.4258 
## F-statistic: 155.1 on 4 and 385 DF,  p-value: < 2.2e-16

We can interpret only coefficients that are statistically significant. We look at p-values.

H0: 𝛽1 = 0

H1: 𝛽1 =/= 0

We cannot reject the null hypothesis (p = 0.188) and conclude that we could not find the statistical effect of Years Since PhD on a nine month Salary of professors (the partial regression coefficient is statistically insignificant).

H0: 𝛽2 = 0

H1: 𝛽2 =/= 0

We reject null hypothesis and conclude that based on the sample data, given the values of other explanatory variables, Assistant Professors have on average their nine month salary lower than Professors by 48556 USD, p < 0.001.The partial regression coefficient is statistically significant.

H0: 𝛽3 = 0

H1: 𝛽3 =/= 0

We reject null hypothesis and based on the sample data, given other explanatory variables, Associate Professors have on average their nine month salary lower than Professors by 33651 USD, p < 0.001. The partial regression coefficient is statistically significant.

H0: 𝛽4 = 0

H1: 𝛽4 =/= 0

We cannot reject the null hypothesis. Given the gender of professors, there is a difference in a nine month salary between males and females. However, the observed difference in the Salary is not statistically significant at p = 0.098 (because p > 0.05). Therefore, we do not have enough evidence to conclude that Female Professors have a significantly different Salary, compared to Male Professors.

When looking at the coefficient of determination (R squared): 43.17% of variability of nine-month salary of professors, is explained by the variability of Years of service, Position, and Gender.

Nine-month Salary of professors is significantly affected by the Position of the Professors, with both Assistants and Associate Professors earning less than regular Professors. Years since PhD and Gender do not have a significant effect on the nine-month Salary of the Professors in this model.

Nika_Boh_MVA_HW3

Nika Boh

2024-01-22