library(carData)
mydata <- Salaries
mydata <- mydata %>% tidyr::drop_na()
mydata$rank <- factor(mydata$rank,
                           levels = c("Prof","AsstProf","AssocProf"),
                           labels = c("Prof", "Non-Prof", "Non-Prof"))
# levels(mydata$rank)
head(mydata)
##       rank discipline yrs.since.phd yrs.service  sex salary
## 1     Prof          B            19          18 Male 139750
## 2     Prof          B            20          16 Male 173200
## 3 Non-Prof          B             4           3 Male  79750
## 4     Prof          B            45          39 Male 115000
## 5     Prof          B            40          41 Male 141500
## 6 Non-Prof          B             6           6 Male  97000

A data frame with 397 observations on the following 6 variables:

Units of observation are Assistant Professors, Associate Professors and Professors in a college in the U.S.

Main goal of the data analysis: Main goal of the data analysis is to figure out how big are salary differences between Non-Prof and Prof.

Source: Library called carData

summary(mydata)
##        rank     discipline yrs.since.phd    yrs.service        sex          salary      
##  Prof    :266   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39   Min.   : 57800  
##  Non-Prof:131   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358   1st Qu.: 91000  
##                            Median :21.00   Median :16.00                Median :107300  
##                            Mean   :22.31   Mean   :17.61                Mean   :113706  
##                            3rd Qu.:32.00   3rd Qu.:27.00                3rd Qu.:134185  
##                            Max.   :56.00   Max.   :60.00                Max.   :231545

I started only with regression constant

fit0 <- glm(rank ~ 1, 
            family = binomial,
            data = mydata)

summary(fit0)
## 
## Call:
## glm(formula = rank ~ 1, family = binomial, data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8949  -0.8949  -0.8949   1.4891   1.4891  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7083     0.1067  -6.636 3.23e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 503.52  on 396  degrees of freedom
## Residual deviance: 503.52  on 396  degrees of freedom
## AIC: 505.52
## 
## Number of Fisher Scoring iterations: 4

Odds for Y=1

exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
##                  odds    2.5 %    97.5 %
## (Intercept) 0.4924812 0.399516 0.6070789

Estimated probability for Y=1

head(fitted(fit0))
##         1         2         3         4         5         6 
## 0.3299748 0.3299748 0.3299748 0.3299748 0.3299748 0.3299748

The probability of employee being Non-Prof is 0.3299 percent (probability that y = 1).

Prop. of correctly classified units is 67%.

Pseudo_R2_fit1 <-  266/(266+131)
Pseudo_R2_fit1
## [1] 0.6700252

We add yrs.since.phd as first explanatory variable.

fit1 <- glm(rank ~ yrs.since.phd,
            family = binomial, 
            data = mydata)
summary(fit1)
## 
## Call:
## glm(formula = rank ~ yrs.since.phd, family = binomial, data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5337  -0.5149  -0.1191   0.4132   3.9841  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.33829    0.39233   8.509   <2e-16 ***
## yrs.since.phd -0.23009    0.02365  -9.728   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 503.52  on 396  degrees of freedom
## Residual deviance: 260.73  on 395  degrees of freedom
## AIC: 264.73
## 
## Number of Fisher Scoring iterations: 6

Deviance of more complex model is lower (good).

\(H_0:\) Simple model is better \(H_1:\) Complex model is better

The degrees of freedom (m) equals to 1 and we can reject the hypothesis that simple model is better (p < 0.05).

anova(fit0, fit1, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: rank ~ 1
## Model 2: rank ~ yrs.since.phd
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       396     503.52                          
## 2       395     260.73  1   242.79 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The odds ratio for Y=1 (with 85% CI)

exp(cbind(OR = fit1$coefficients, confint.default(fit1)))
##                      OR      2.5 %     97.5 %
## (Intercept)   28.170780 13.0571895 60.7782294
## yrs.since.phd  0.794461  0.7584709  0.8321588
fit2 <- glm(rank ~ discipline + yrs.since.phd + sex + salary,
            family = binomial,
            data = mydata)
library(car)
vif(fit2)
##    discipline yrs.since.phd           sex        salary 
##      1.320761      1.319590      1.013129      1.666808
mydata$StdResid <- rstandard(fit2)
mydata$CookDist <- cooks.distance(fit2)

hist(mydata$StdResid, 
     main = "histogram of standardized residuals",
     ylab = "Frequency", 
     xlab = "Standardized residuals")

Standard residuals are [-3,3] which is good (no outliers).

head(mydata$CookDist)
## [1] 1.467505e-06 1.040715e-10 4.018801e-07 2.771403e-07 8.222381e-10 6.927315e-05
head(mydata[order(-mydata$CookDist), ],10)
##         rank discipline yrs.since.phd yrs.service    sex salary  StdResid   CookDist
## 286 Non-Prof          A            49          49   Male  81800  2.916515 0.12874859
## 195 Non-Prof          B            48          53   Male  90000  2.613924 0.12121754
## 318     Prof          B            46          45   Male  67559 -1.569012 0.09066124
## 300 Non-Prof          A            45          39   Male  70700  2.011896 0.08600096
## 261 Non-Prof          A            41          33   Male  88600  2.770189 0.08317503
## 246     Prof          A            17          11 Female  90450 -1.472302 0.04738430
## 323 Non-Prof          B            13          11   Male 126431  2.298757 0.04639665
## 368 Non-Prof          A            10           1   Male 108413  1.720893 0.03126847
## 218 Non-Prof          B            29          22   Male 105350  2.159430 0.02988403
## 335 Non-Prof          B            19           6 Female 104542  1.293919 0.02918575

We remove units with highest Cooks distances.

mydata <- mydata[-c(286,195,318,300), ]

Because we removed units, we will estimate fit1 and fit2 again

fit1 <- glm(rank ~ yrs.since.phd,
            family = binomial, 
            data = mydata)


fit2 <- glm(rank ~ discipline + yrs.since.phd + sex + salary,
            family = binomial,
            data = mydata)
anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: rank ~ yrs.since.phd
## Model 2: rank ~ discipline + yrs.since.phd + sex + salary
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       391     208.75                          
## 2       388     105.05  3    103.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(H_0:\) Simple model is better \(H_1:\) Complex model is better

We see that complex model is better (we reject null hypothesis).

summary(fit2)
## 
## Call:
## glm(formula = rank ~ discipline + yrs.since.phd + sex + salary, 
##     family = binomial, data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7623  -0.1172  -0.0068   0.0548   3.3592  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   19.7040220  2.9221875   6.743 1.55e-11 ***
## disciplineB    2.2411096  0.6180351   3.626 0.000288 ***
## yrs.since.phd -0.2576486  0.0436864  -5.898 3.69e-09 ***
## sexMale       -0.1391630  0.7082558  -0.196 0.844229    
## salary        -0.0001652  0.0000263  -6.282 3.35e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 496.04  on 392  degrees of freedom
## Residual deviance: 105.05  on 388  degrees of freedom
## AIC: 115.05
## 
## Number of Fisher Scoring iterations: 8

From the table above, we see that sex is not significant. Other coefficients are explained with odds ratio.

exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
##                         OR        2.5 %       97.5 %
## (Intercept)   3.608677e+08 1.174837e+06 1.108456e+11
## disciplineB   9.403760e+00 2.800458e+00 3.157723e+01
## yrs.since.phd 7.728668e-01 7.094449e-01 8.419583e-01
## sexMale       8.700862e-01 2.171170e-01 3.486829e+00
## salary        9.998348e-01 9.997832e-01 9.998863e-01

Given the value of all other variables, the odds for Non-Prof working in discipline B are 7.3 times the odds of Prof working in discipline B (p<0.001).

If years since phd increases by 1, the odds for Non-Prof is equal to 0.8 times the initial odds, under the assumption that the remaining explanatory variables do not change (p<0.001).

If salary increases by 1, the odds for Non-Prof is equal to 0.9 times the initial odds, under the assumption that the remaining explanatory variables do not change (p<0.001).

Estimates of probabilities for Y=1: P(Y=1)

mydata$EstimatedProbab <- fitted(fit2)
head(mydata)
##       rank discipline yrs.since.phd yrs.service  sex salary     StdResid     CookDist EstimatedProbab
## 1     Prof          B            19          18 Male 139750 -0.081569080 1.467505e-06    2.064087e-03
## 2     Prof          B            20          16 Male 173200 -0.005462838 1.040715e-10    6.358106e-06
## 3 Non-Prof          B             4           3 Male  79750  0.056644034 4.018801e-07    9.994987e-01
## 4     Prof          B            45          39 Male 115000 -0.050759004 2.771403e-07    1.521808e-04
## 5     Prof          B            40          41 Male 141500 -0.010189226 8.222381e-10    6.922103e-06
## 6 Non-Prof          B             6           6 Male  97000  0.260988823 6.927315e-05    9.856880e-01
mydata$Classification <- ifelse(test = mydata$EstimatedProbab < 0.5, 
                                 yes = "Prof",
                                 no = "Non-Prof")

ClassificationTable <- table(mydata$rank, mydata$Classification)

ClassificationTable
##           
##            Non-Prof Prof
##   Prof           11  254
##   Non-Prof      114   14

11 were Non-Prof, but model predicted they were Prof.

Pseudo_R2_fit2 <- (ClassificationTable[1,2] + ClassificationTable[2,1])/nrow(mydata)
Pseudo_R2_fit2
## [1] 0.9363868