#install.packages("carData")
library(carData)
## Warning: package 'carData' was built under R version 4.3.2
References: Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
mydata <- data.frame(Salaries)
nrow(mydata)
## [1] 397
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Description of the data set:
We have a data set of salaries for professors and we have 397 observations (units) with 6 variables.
Unit of observation: A professor in the U.S. Sample size = 397
rank: A factor with levels AssocProf, AsstProf, Prof.It is a type of categorical variable with 3 values.
discipline: A factor with levels A (βtheoreticalβ departments) or B (βappliedβ departments). It is a categorical (dummy) variable.
yrs.since.phd: Years since PhD. It is a numeric type of variable.
yrs.service: Years of service. It is a numeric type of variable.
sex: A factor with levels Female and Male. It is a categorical variable (dummy) with two variables.
salary: A nine-month salary, measured in dollars.
any(is.na(mydata))
## [1] FALSE
There are no missing values.
colnames(mydata) <- c("Position", "Discipline", "YearsSincePhD", "YearsOfService", "Gender", "Salary")
#Change of names of the variables
summary(mydata)
## Position Discipline YearsSincePhD YearsOfService Gender
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## Salary
## Min. : 57800
## 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
In our sample of 397 professors, there are 67 professors, who hold the position of an Assistant Professor, 64 of an Associate Professor, and 266 of them hold the position of a Professor (a regular one).
Minimum number of years since professorsβ PhD study is 1 year and the maximum is 56 years.
75% of professors in the sample have up to 27 years of service, and the rest 25% have been in service for a longer period of time.
The average nine month salary of professors in the sample is 113.704 USD
*We see that there are more males in the sample and this is why I will put it in the first place when factoring the variable, so males are our reference group.
mydata$PositionF <- factor(mydata$Position, #Factoring the variables
levels = c ("Prof","AsstProf","AssocProf"),
labels = c ("Prof", "AsstP","AssocP"))
mydata$DisciplineF <- factor(mydata$Discipline,
levels = c ("B","A"),
labels = c ("Applied", "Theoretical"))
mydata$GenderF <- factor(mydata$Gender,
levels = c ("Male","Female"),
labels = c ("Male", "Female"))
mydata$ID <- seq(1, nrow(mydata)) #We add ID number for each unit
head(mydata)
## Position Discipline YearsSincePhD YearsOfService Gender Salary PositionF
## 1 Prof B 19 18 Male 139750 Prof
## 2 Prof B 20 16 Male 173200 Prof
## 3 AsstProf B 4 3 Male 79750 AsstP
## 4 Prof B 45 39 Male 115000 Prof
## 5 Prof B 40 41 Male 141500 Prof
## 6 AssocProf B 6 6 Male 97000 AssocP
## DisciplineF GenderF ID
## 1 Applied Male 1
## 2 Applied Male 2
## 3 Applied Male 3
## 4 Applied Male 4
## 5 Applied Male 5
## 6 Applied Male 6
How the Years Since PhD, Years of Service, Position, and Gender affect the nine month Salary of the Professors?
Dependent variable: (a nine month) Salary for professors
Explanatory variables:
Years of Service: I expect a positive regression coefficient, meaning that with each additional year of service, the nine month salary would increase.
Years Since PhD: I expect a positive regression coefficient, meaning that with each additional year since PhD, the nine month salary would increase (indirectly, βbecause the more experienced one becomes each yearβ).
Position: With a higher position, we expect a higher nine month salary (a positive regression coefficient).
Gender: By the statistics, males on average have higher salaries and that is why we expect higher nine month salaries for Males (a positive regression coefficient).
Linearity in parameters.
The expected value of errors is equal to 0.
Homoskedasticity (we have equal variances).
Normal distribution of errors.
Errors are independent.
This assumption is already met, since we do not deal with time-series data or hierarchical type of data.
No perfect multicolinearity.
The number of units has to be greater than the number of estimated parameters: π > π.
This assumption is already met (397 > 6).
This requirement is fulfilled.
Non of the variablesβ variance is equal to 0, therefore this requirement is fulfilled.
The outliers and units with high impact need to be removed.
No too strong multicolinearity.
library(car)
scatterplotMatrix(mydata[ , c(-1, -2, -5,-7, -8, -9, -10)], #Scatter plot matrix
smooth = FALSE)
From the scatter plot matrix, we can see that the linearity is present - we do not see any round shapes in the graphs. But we should check if some of the variables are too strongly correlated.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -2, -5,-7, -8, -9, -10)]))
## YearsSincePhD YearsOfService Salary
## YearsSincePhD 1.00 0.91 0.42
## YearsOfService 0.91 1.00 0.33
## Salary 0.42 0.33 1.00
##
## n= 397
##
##
## P
## YearsSincePhD YearsOfService Salary
## YearsSincePhD 0 0
## YearsOfService 0 0
## Salary 0 0
fit <- lm(Salary ~ YearsSincePhD + YearsOfService + PositionF + GenderF,
data = mydata) #Setting up the regression model
summary(fit)
##
## Call:
## lm(formula = Salary ~ YearsSincePhD + YearsOfService + PositionF +
## GenderF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64844 -14939 -1401 12137 107498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128148.9 4072.2 31.469 <2e-16 ***
## YearsSincePhD 265.8 247.9 1.072 0.2842
## YearsOfService -373.8 220.8 -1.693 0.0913 .
## PositionFAsstP -46940.0 4421.4 -10.617 <2e-16 ***
## PositionFAssocP -33053.4 3700.7 -8.932 <2e-16 ***
## GenderFFemale -5499.1 4034.7 -1.363 0.1737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23580 on 391 degrees of freedom
## Multiple R-squared: 0.4017, Adjusted R-squared: 0.3941
## F-statistic: 52.51 on 5 and 391 DF, p-value: < 2.2e-16
We need to check for multicolinearity with VIF.
vif(fit) #Check for the multicolinearity
## GVIF Df GVIF^(1/(2*Df))
## YearsSincePhD 7.271171 1 2.696511
## YearsOfService 5.876405 1 2.424130
## PositionF 2.002583 2 1.189591
## GenderF 1.029868 1 1.014824
mean(vif(fit))
## [1] 2.375424
fit1 <- lm(Salary ~ YearsSincePhD + PositionF + GenderF,
data = mydata) #New regression model, we excluded Years Of Service
summary(fit1)
##
## Call:
## lm(formula = Salary ~ YearsSincePhD + PositionF + GenderF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67230 -15338 -1530 12163 105318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129727.0 3973.4 32.649 <2e-16 ***
## YearsSincePhD -92.1 129.7 -0.710 0.478
## PositionFAsstP -47635.9 4412.7 -10.795 <2e-16 ***
## PositionFAssocP -33623.1 3694.1 -9.102 <2e-16 ***
## GenderFFemale -5146.6 4038.9 -1.274 0.203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23630 on 392 degrees of freedom
## Multiple R-squared: 0.3973, Adjusted R-squared: 0.3912
## F-statistic: 64.61 on 4 and 392 DF, p-value: < 2.2e-16
vif(fit1) #Check for the multicolinearity
## GVIF Df GVIF^(1/(2*Df))
## YearsSincePhD 1.980427 1 1.407276
## PositionF 1.979087 2 1.186086
## GenderF 1.027125 1 1.013472
mean(vif(fit1))
## [1] 1.399275
mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
head(mydata[c(-1,-2,-4,-5,-8)])
## YearsSincePhD Salary PositionF GenderF ID StdResid CooksD
## 1 19 139750 Prof Male 1 0.500 0.000
## 2 20 173200 Prof Male 2 1.923 0.004
## 3 4 79750 AsstP Male 3 -0.084 0.000
## 4 45 115000 Prof Male 4 -0.451 0.000
## 5 40 141500 Prof Male 5 0.657 0.001
## 6 6 97000 AssocP Male 6 0.062 0.000
sum(mydata$StdResid)
## [1] -0.115
hist(mydata$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = " Histogram of standardized residuals")
shapiro.test(mydata$StdResid) #Normal distribution of errors
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.97497, p-value = 2.378e-06
*Note: This is not a necessary since the sample size is large enough.
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
Based on the sample data, we reject the null hypothesis (p < 0.001) and conclude that errors are not normally distributed, however, this does not matter since the sample size is sufficiently large.
Additionally, from the histogram of standardized residuals, it can be seen that there are some outliers since some values are smaller than -3 and bigger than 3.
hist(mydata$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distance")
std_resid <- rstandard(fit1)
# Identify units with standardized residuals exceeding a certain threshold
threshold <- 3 # Adjust the threshold based on your criteria
outlier_units <- which(abs(std_resid) > threshold)
# Print or inspect the units with large standardized residuals
print(outlier_units)
## 44 250 365
## 44 250 365
The function tells us to remove the units 44, 250, and 365 as the outliers.
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 10) #Display 10 units with the largest Cook's distance
## ID CooksD
## 351 351 0.039
## 283 283 0.032
## 331 331 0.028
## 44 44 0.027
## 365 365 0.024
## 132 132 0.023
## 126 126 0.019
## 299 299 0.017
## 324 324 0.017
## 272 272 0.016
When looking at the values of Cooks distances, we can see some bigger gaps in between the IDs. We will have to remove units with ID of 331, 44, 283, 365, 351, and 132.
mydata1 <- data.frame(mydata)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata1 <- mydata %>%
filter(!ID %in% c(44, 132,250, 283, 331, 351, 365)) #Removing units
fit2 <- lm(Salary ~ YearsSincePhD + PositionF + GenderF, #Fit the model again without the deleted units
data = mydata1)
mydata1$stdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(StdResid ~ stdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
smooth = FALSE,
regLine = FALSE,
boxplots = FALSE,
data = mydata1)
We can see from the scatter plot heteroskedasticity however letβs check it with Breusch Pagan test.
H0: The variance is constant./ We are dealing with homoskedasticity.
H1: The variance is not constant. / We are dealing with heteroskedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------
## Response : Salary
## Variables: fitted values of Salary
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 51.90417
## Prob > Chi2 = 5.82765e-13
Based on the sample data, we reject H0 at p < 0.001, and conclude that we are dealing with heteroskedasticity. It means we violate the assumption and have to use the Robust Standard Errors.
#install.packages("estimatr")
library(estimatr)
## Warning: package 'estimatr' was built under R version 4.3.2
fit_robust <- lm_robust(Salary ~ YearsSincePhD + PositionF + GenderF,
se_type = "HC1", #White's robust standard errors
data = mydata1)
summary(fit_robust)
##
## Call:
## lm_robust(formula = Salary ~ YearsSincePhD + PositionF + GenderF,
## data = mydata1, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 131021 3762.6 34.822 5.046e-121 123623.3 138418.87 385
## YearsSincePhD -178 134.7 -1.321 1.872e-01 -442.8 86.85 385
## PositionFAsstP -48556 3268.4 -14.856 8.756e-40 -54982.0 -42129.68 385
## PositionFAssocP -33651 2604.1 -12.922 5.569e-32 -38771.6 -28531.39 385
## GenderFFemale -4756 2864.5 -1.660 9.764e-02 -10388.2 875.69 385
##
## Multiple R-squared: 0.4317 , Adjusted R-squared: 0.4258
## F-statistic: 155.1 on 4 and 385 DF, p-value: < 2.2e-16
We can interpret only coefficients that are statistically significant. We look at p-values.
H0: π½1 = 0
H1: π½1 =/= 0
We cannot reject the null hypothesis (p = 0.188) and conclude that we could not find the statistical effect of Years Since PhD on a nine month Salary of professors (the partial regression coefficient is statistically insignificant).
H0: π½2 = 0
H1: π½2 =/= 0
We reject null hypothesis and conclude that based on the sample data, given the values of other explanatory variables, Assistant Professors have on average their nine month salary lower than Professors by 48556 USD, p < 0.001.The partial regression coefficient is statistically significant.
H0: π½3 = 0
H1: π½3 =/= 0
We reject null hypothesis and based on the sample data, given other explanatory variables, Associate Professors have on average their nine month salary lower than Professors by 33651 USD, p < 0.001. The partial regression coefficient is statistically significant.
H0: π½4 = 0
H1: π½4 =/= 0
We cannot reject the null hypothesis. Given the gender of professors, there is a difference in a nine month salary between males and females. However, the observed difference in the Salary is not statistically significant at p = 0.098 (because p > 0.05). Therefore, we do not have enough evidence to conclude that Female Professors have a significantly different Salary, compared to Male Professors.
When looking at the coefficient of determination (R squared): 43.17% of variability of nine-month salary of professors, is explained by the variability of Years of service, Position, and Gender.
Nine-month Salary of professors is significantly affected by the Position of the Professors, with both Assistants and Associate Professors earning less than regular Professors. Years since PhD and Gender do not have a significant effect on the nine-month Salary of the Professors in this model.