library(carData)
mydata <- Salaries
mydata <- mydata %>% tidyr::drop_na()
mydata$rank <- factor(mydata$rank,
levels = c("Prof","AsstProf","AssocProf"),
labels = c("Prof", "Non-Prof", "Non-Prof"))
# levels(mydata$rank)
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 Non-Prof B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 Non-Prof B 6 6 Male 97000
A data frame with 397 observations on the following 6 variables:
Units of observation are Assistant Professors, Associate Professors and Professors in a college in the U.S.
Main goal of the data analysis: Main goal of the data analysis is to figure out how big are salary differences between Non-Prof and Prof.
Source: Library called carData
summary(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## Prof :266 A:181 Min. : 1.00 Min. : 0.00 Female: 39 Min. : 57800
## Non-Prof:131 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358 1st Qu.: 91000
## Median :21.00 Median :16.00 Median :107300
## Mean :22.31 Mean :17.61 Mean :113706
## 3rd Qu.:32.00 3rd Qu.:27.00 3rd Qu.:134185
## Max. :56.00 Max. :60.00 Max. :231545
I started only with regression constant
fit0 <- glm(rank ~ 1,
family = binomial,
data = mydata)
summary(fit0)
##
## Call:
## glm(formula = rank ~ 1, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8949 -0.8949 -0.8949 1.4891 1.4891
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7083 0.1067 -6.636 3.23e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 503.52 on 396 degrees of freedom
## Residual deviance: 503.52 on 396 degrees of freedom
## AIC: 505.52
##
## Number of Fisher Scoring iterations: 4
Odds for Y=1
exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
## odds 2.5 % 97.5 %
## (Intercept) 0.4924812 0.399516 0.6070789
Estimated probability for Y=1
head(fitted(fit0))
## 1 2 3 4 5 6
## 0.3299748 0.3299748 0.3299748 0.3299748 0.3299748 0.3299748
The probability of employee being Non-Prof is 0.3299 percent (probability that y = 1).
Prop. of correctly classified units is 67%.
Pseudo_R2_fit1 <- 266/(266+131)
Pseudo_R2_fit1
## [1] 0.6700252
We add yrs.since.phd as first explanatory variable.
fit1 <- glm(rank ~ yrs.since.phd,
family = binomial,
data = mydata)
summary(fit1)
##
## Call:
## glm(formula = rank ~ yrs.since.phd, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5337 -0.5149 -0.1191 0.4132 3.9841
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.33829 0.39233 8.509 <2e-16 ***
## yrs.since.phd -0.23009 0.02365 -9.728 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 503.52 on 396 degrees of freedom
## Residual deviance: 260.73 on 395 degrees of freedom
## AIC: 264.73
##
## Number of Fisher Scoring iterations: 6
Deviance of more complex model is lower (good).
\(H_0:\) Simple model is better \(H_1:\) Complex model is better
The degrees of freedom (m) equals to 1 and we can reject the hypothesis that simple model is better (p < 0.05).
anova(fit0, fit1, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: rank ~ 1
## Model 2: rank ~ yrs.since.phd
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 396 503.52
## 2 395 260.73 1 242.79 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The odds ratio for Y=1 (with 85% CI)
exp(cbind(OR = fit1$coefficients, confint.default(fit1)))
## OR 2.5 % 97.5 %
## (Intercept) 28.170780 13.0571895 60.7782294
## yrs.since.phd 0.794461 0.7584709 0.8321588
fit2 <- glm(rank ~ discipline + yrs.since.phd + sex + salary,
family = binomial,
data = mydata)
library(car)
vif(fit2)
## discipline yrs.since.phd sex salary
## 1.320761 1.319590 1.013129 1.666808
mydata$StdResid <- rstandard(fit2)
mydata$CookDist <- cooks.distance(fit2)
hist(mydata$StdResid,
main = "histogram of standardized residuals",
ylab = "Frequency",
xlab = "Standardized residuals")
Standard residuals are [-3,3] which is good (no outliers).
head(mydata$CookDist)
## [1] 1.467505e-06 1.040715e-10 4.018801e-07 2.771403e-07 8.222381e-10 6.927315e-05
head(mydata[order(-mydata$CookDist), ],10)
## rank discipline yrs.since.phd yrs.service sex salary StdResid CookDist
## 286 Non-Prof A 49 49 Male 81800 2.916515 0.12874859
## 195 Non-Prof B 48 53 Male 90000 2.613924 0.12121754
## 318 Prof B 46 45 Male 67559 -1.569012 0.09066124
## 300 Non-Prof A 45 39 Male 70700 2.011896 0.08600096
## 261 Non-Prof A 41 33 Male 88600 2.770189 0.08317503
## 246 Prof A 17 11 Female 90450 -1.472302 0.04738430
## 323 Non-Prof B 13 11 Male 126431 2.298757 0.04639665
## 368 Non-Prof A 10 1 Male 108413 1.720893 0.03126847
## 218 Non-Prof B 29 22 Male 105350 2.159430 0.02988403
## 335 Non-Prof B 19 6 Female 104542 1.293919 0.02918575
We remove units with highest Cooks distances.
mydata <- mydata[-c(286,195,318,300), ]
Because we removed units, we will estimate fit1 and fit2 again
fit1 <- glm(rank ~ yrs.since.phd,
family = binomial,
data = mydata)
fit2 <- glm(rank ~ discipline + yrs.since.phd + sex + salary,
family = binomial,
data = mydata)
anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: rank ~ yrs.since.phd
## Model 2: rank ~ discipline + yrs.since.phd + sex + salary
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 391 208.75
## 2 388 105.05 3 103.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(H_0:\) Simple model is better \(H_1:\) Complex model is better
We see that complex model is better (we reject null hypothesis).
summary(fit2)
##
## Call:
## glm(formula = rank ~ discipline + yrs.since.phd + sex + salary,
## family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7623 -0.1172 -0.0068 0.0548 3.3592
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 19.7040220 2.9221875 6.743 1.55e-11 ***
## disciplineB 2.2411096 0.6180351 3.626 0.000288 ***
## yrs.since.phd -0.2576486 0.0436864 -5.898 3.69e-09 ***
## sexMale -0.1391630 0.7082558 -0.196 0.844229
## salary -0.0001652 0.0000263 -6.282 3.35e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 496.04 on 392 degrees of freedom
## Residual deviance: 105.05 on 388 degrees of freedom
## AIC: 115.05
##
## Number of Fisher Scoring iterations: 8
From the table above, we see that sex is not significant. Other coefficients are explained with odds ratio.
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 3.608677e+08 1.174837e+06 1.108456e+11
## disciplineB 9.403760e+00 2.800458e+00 3.157723e+01
## yrs.since.phd 7.728668e-01 7.094449e-01 8.419583e-01
## sexMale 8.700862e-01 2.171170e-01 3.486829e+00
## salary 9.998348e-01 9.997832e-01 9.998863e-01
Given the value of all other variables, the odds for Non-Prof working in discipline B are 7.3 times the odds of Prof working in discipline B (p<0.001).
If years since phd increases by 1, the odds for Non-Prof is equal to 0.8 times the initial odds, under the assumption that the remaining explanatory variables do not change (p<0.001).
If salary increases by 1, the odds for Non-Prof is equal to 0.9 times the initial odds, under the assumption that the remaining explanatory variables do not change (p<0.001).
mydata$EstimatedProbab <- fitted(fit2)
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary StdResid CookDist EstimatedProbab
## 1 Prof B 19 18 Male 139750 -0.081569080 1.467505e-06 2.064087e-03
## 2 Prof B 20 16 Male 173200 -0.005462838 1.040715e-10 6.358106e-06
## 3 Non-Prof B 4 3 Male 79750 0.056644034 4.018801e-07 9.994987e-01
## 4 Prof B 45 39 Male 115000 -0.050759004 2.771403e-07 1.521808e-04
## 5 Prof B 40 41 Male 141500 -0.010189226 8.222381e-10 6.922103e-06
## 6 Non-Prof B 6 6 Male 97000 0.260988823 6.927315e-05 9.856880e-01
mydata$Classification <- ifelse(test = mydata$EstimatedProbab < 0.5,
yes = "Prof",
no = "Non-Prof")
ClassificationTable <- table(mydata$rank, mydata$Classification)
ClassificationTable
##
## Non-Prof Prof
## Prof 11 254
## Non-Prof 114 14
11 were Non-Prof, but model predicted they were Prof.
Pseudo_R2_fit2 <- (ClassificationTable[1,2] + ClassificationTable[2,1])/nrow(mydata)
Pseudo_R2_fit2
## [1] 0.9363868