library(dplyr)
library(GGally)
library(ggplot2)
library(modelr)
library(car)
library(randomForest)

Exploratory Analysis

data <- read.csv("Salaries.csv", header = TRUE)
glimpse(data)
## Observations: 397
## Variables: 6
## $ rank          <fct> Prof, Prof, AsstProf, Prof, Prof, AssocProf, Prof, Prof…
## $ discipline    <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, A, A…
## $ yrs.since.phd <int> 19, 20, 4, 45, 40, 6, 30, 45, 21, 18, 12, 7, 1, 2, 20, …
## $ yrs.service   <int> 18, 16, 3, 39, 41, 6, 23, 45, 20, 18, 8, 2, 1, 0, 18, 3…
## $ sex           <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, F…
## $ salary        <int> 139750, 173200, 79750, 115000, 141500, 97000, 175000, 1…
lapply(data, FUN=summary)
## $rank
## AssocProf  AsstProf      Prof 
##        64        67       266 
## 
## $discipline
##   A   B 
## 181 216 
## 
## $yrs.since.phd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   21.00   22.31   32.00   56.00 
## 
## $yrs.service
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.00   16.00   17.61   27.00   60.00 
## 
## $sex
## Female   Male 
##     39    358 
## 
## $salary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57800   91000  107300  113706  134185  231545

There is no outliers here or missing values. We can check for outliers separately as well.

Plot pairwise correlation of attributes.

It is seen that years of service and years since PhD is highly correlated. Both these variables have less effect on the salary but that could be due to multicollinearity issue.

(1) What percentage of records are Assistant Professors with less than 5 years of experience?

ast_prof <- data %>% filter(rank == "AsstProf", yrs.service <= 5)
ast_prof_pct <- (nrow(ast_prof)/nrow(data))*100
ast_prof_pct
## [1] 16.62469

(2) Is there a statistically significant difference between female and male salaries?

This question can be answered in multiple ways. With or without running the actual model. Yes there is a statistically significant difference between female and male salaries. The male salary is $14088 more than the female salary

We can now check the whisker plot of the same.

boxplot(salary~sex,data=data, main="Salary across gender",
        xlab="Sex", ylab="Salary")

Boxplot of salary by sex - shows difference in salary across genders

Now we can model the salary difference between genders by computing a simple linear regression model

model_sex <- lm(salary ~ sex, data = data)
summary(model_sex)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 101002.41   4809.386 21.001103 2.683482e-66
## sexMale      14088.01   5064.579  2.781674 5.667107e-03

The output shows that the average salary for female is estimated at $101002, whereas males are estimated a total of $101002 + $14088 = $115090. Also the p-value of the variable sexMale is very significant (< 0.05), suggesting that there is a statistical evidence of a difference in average salary between the genders.

(3) What is the distribution of salary by rank and discipline?

ggplot(data) + geom_boxplot(aes(x=rank, y=salary))

ggplot(data) + geom_boxplot(aes(x=discipline, y=salary))

salary_discipline_rank <-  data %>% ggplot() +
  geom_jitter(aes(rank,salary, colour=discipline)) + 
  geom_smooth(aes(rank,salary, colour=discipline), method=lm, se=FALSE) +
  labs(x = "Rank", y = "Nine-Month Salary (in Dollars)",
       title = "Nine-Month Salary vs. Discipline")
salary_discipline_rank

(4) How would you recode discipline as a 0/1 binary indicator?

Some statistical tools can automatically recode categorical variables into dummy variables for regression analysis. In R, the command contrasts(data$discipline) can be used to check for automatc recoding of categorical variables. For manual recoding of discipline to binary variables “A” = 0 and “B” = 1 we can use the following

contrasts(data$discipline)
##   B
## A 0
## B 1
data$discipline <- recode_factor(data$discipline, "B" = 1, "A"= 0)

Model Building

Multiple linear regression model

This predictive analysis problem will be done using multiple linear regression model. As a prerequisite of this model building we need to check multicollinearity of the variables.

We will deal with the categorical and continuous variables separately. For the continuous variables we can check correlation as following:

library(corrgram)
## Warning: package 'corrgram' was built under R version 3.6.2
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
cor(data[,c("yrs.since.phd", "yrs.service", "salary")])
##               yrs.since.phd yrs.service    salary
## yrs.since.phd     1.0000000   0.9096491 0.4192311
## yrs.service       0.9096491   1.0000000 0.3347447
## salary            0.4192311   0.3347447 1.0000000

Years since PhD and Years of service are highly correlated and we will keep either variable in our final model development. Since coefficient of years of service and salary is poor than that of years since PhD, so we will remove years of service from final model. The same can be seen from the ggpairs plot above.

Correlation for categorical variables

sr_table <- table(data[,c("sex", "rank")])
rd_table <- table(data[,c("rank", "discipline")])
sd_table <- table(data[,c("sex", "discipline")])

fisher.test(x=as.matrix(sr_table))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  as.matrix(sr_table)
## p-value = 0.0114
## alternative hypothesis: two.sided
fisher.test(x=as.matrix(rd_table))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  as.matrix(rd_table)
## p-value = 0.09836
## alternative hypothesis: two.sided
fisher.test(x=as.matrix(sd_table))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  as.matrix(sd_table)
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.476565 2.014236
## sample estimates:
## odds ratio 
##  0.9752679

Here we can see, sex and rank can have some positive correlation but other categorical variables does not show that trend.

Multiple linear regression model with sex,

mod1 <- lm(salary~sex+ yrs.since.phd + rank + discipline, data = data)
summary(mod1)
## 
## Call:
## lm(formula = salary ~ sex + yrs.since.phd + rank + discipline, 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67451 -13860  -1549  10716  97023 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    94925.94    4685.36  20.260  < 2e-16 ***
## sexMale         4349.37    3875.39   1.122  0.26242    
## yrs.since.phd     61.01     127.01   0.480  0.63124    
## rankAsstProf  -13104.15    4167.31  -3.145  0.00179 ** 
## rankProf       32928.40    3544.40   9.290  < 2e-16 ***
## discipline0   -13937.47    2346.53  -5.940 6.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22660 on 391 degrees of freedom
## Multiple R-squared:  0.4472, Adjusted R-squared:  0.4401 
## F-statistic: 63.27 on 5 and 391 DF,  p-value: < 2.2e-16

ANOVA is a special case of linear model where the predictors are categorical variables and we can see the classic ANOVA table from the regression model.

We can see Sex is no longer significant in explaining the variability. Variable like yrs.since.phd has also high P-value indicating that it needs to be handled differently. Other variables like rank and discipline are more significant and explain variability in salary. It explains 44% of the variance in nine-month salary.

Now we model salary by removing sex.

mod2 <- lm(salary ~ yrs.since.phd + rank + discipline, data = data)
summary(mod2)
## 
## Call:
## lm(formula = salary ~ yrs.since.phd + rank + discipline, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67395 -13480  -1536  10416  97166 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    98464.24    3467.13  28.399  < 2e-16 ***
## yrs.since.phd     71.92     126.68   0.568   0.5706    
## rankAsstProf  -13030.16    4168.17  -3.126   0.0019 ** 
## rankProf       33181.41    3538.40   9.378  < 2e-16 ***
## discipline0   -14028.68    2345.90  -5.980 5.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22670 on 392 degrees of freedom
## Multiple R-squared:  0.4454, Adjusted R-squared:  0.4398 
## F-statistic: 78.72 on 4 and 392 DF,  p-value: < 2.2e-16

The F-statistics get better. I also tried modeling other different combinations of predictors. salary ~ rank + yrs.since.phd does not improve performance any better. Let us try modeling by removing the yrs.since.phd.

mod3 <- lm(salary ~ rank + discipline, data = data)
summary(mod3)
## 
## Call:
## lm(formula = salary ~ rank + discipline, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65990 -14049  -1288  10760  97996 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     99467       2981  33.366  < 2e-16 ***
## rankAsstProf   -13762       3961  -3.475 0.000569 ***
## rankProf        34082       3160  10.786  < 2e-16 ***
## discipline0    -13761       2296  -5.993 4.65e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22650 on 393 degrees of freedom
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.4407 
## F-statistic:   105 on 3 and 393 DF,  p-value: < 2.2e-16

The F statistic value has increased drastically. The predictors in this case explain ~44% of the variance in nine-month salary. Since there could be other factors that contribute in determining nine-month salary for faculties, but not present in this dataset, this model performance is justified. Some features are not well understood such as years since Ph.D. and years of service and how they contribute towards.

Train and Test

Split into training and test dataset with 75% data in training and rest in test.

tr_index <- sample(1:nrow(data), round(nrow(data) * 0.75))
train_set <- data[tr_index,]
test_set <- data[-tr_index,] 
fit <- lm(salary ~ rank + discipline, data=train_set)

Predict

Predict on the test dataset. We also calculate the residual sum of squares and root mean squared error for training and test set.

prediction_test <- predict(fit, test_set)  
sum((prediction_test - test_set$salary)^2)
## [1] 46844334746
prediction_train <- predict(fit, train_set) 
sum((prediction_train - train_set$salary)^2) 
## [1] 1.55745e+11
sqrt(sum((prediction_test - test_set$salary)^2) / length(prediction_test))
## [1] 21752.59
sqrt(sum((prediction_train - train_set$salary)^2) / length(prediction_train))
## [1] 22861.19

It is observed that the root mean square of errors is less in training set but the value does not fall off the cliff so the performance in the test set is as expected.

mean <- mean(test_set$salary)
SST_test <- sum((test_set$salary - mean)^2)
SSE_test <- sum((prediction_test - test_set$salary)^2)
r_square_pred <- ((SST_test - SSE_test)/SST_test)*100
r_square_pred
## [1] 46.70714

The predicted R^2 value is 36.6% and the adjusted R^2 is about 44%. Therefore the model performance is consistent. It is to be noted that the data set is quite small and the training set does not capture variability in actual scenarios. The model performance might have been better with better sample size and more data points.

Binary classification model

Recoding salary to binary variable based on median salary. If salary > median then salary is 1 else salary is 0.

data_rf <- data
data_rf$sal_bin <- as.factor(ifelse(data$salary >= 107300, 1, 0))
data_rf <- data_rf %>% select(-salary)

Prediction of the 0/1 binary indicator of salary leads to a classification problem. Since the Random Forest classifier is widely known for its high accuracy, we shall build a Random Forest classification model to predict the 0/1 salary indicator.

Random Forests do not require us to manually split the data into “training” and “test” sets. This is because every decision tree in the Random Forest is built using bootstrapped data. Thus, not every sample is used to build every tree. The “training” dataset is the bootstrapped data and the “test” dataset is the remaining samples. The remaining samples are called the “Out-Of-Bag” (OOB) data. Thus, the out-of-bag error rate is a measure of the prediction accuracy of the Random Forest classifier. The out-of-bag error rate provides the proportion of out-of-bag (i.e. testing) samples that were incorrectly classified.

Since we are building a classification model to predict the discrete 0/1 indicator, the Random Forest model in R will set “mtry”, the number of variables to consider at each step, to floor( sqrt(no. of predictor variables) ) = floor( sqrt(5) ) = 2 by default. The default number of trees in the forest is 500. We shall first build the Random Forest with these default parameter values, and then fine tune them going forward.

set.seed(42)
rf_model <- randomForest(sal_bin ~ ., 
                  data = data_rf, 
                  metric = "Accuracy")
rf_model
## 
## Call:
##  randomForest(formula = sal_bin ~ ., data = data_rf, metric = "Accuracy") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.65%
## Confusion matrix:
##     0   1 class.error
## 0 130  68  0.34343434
## 1  14 185  0.07035176

With the default parameters, the OOB error is 20.65% that represents an accuracy around 80% which is pretty good. We will plot the OOB error rate and error rates of both classes (0 and 1) against the number of trees.

oob.error.data <- data.frame(
  Trees=rep(1:nrow(rf_model$err.rate), times=3),
  Type=rep(c("OOB", '1', '0'), each=nrow(rf_model$err.rate)),
  Error=c(rf_model$err.rate[,"OOB"], 
    rf_model$err.rate[,'1'], 
    rf_model$err.rate[,'0']))

ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
  geom_line(aes(color=Type))

Here we see convergence in the graph. So 500 trees is sufficient for our case.

If we want to compare this random forest to others with different values for mtry (to control how many variables are considered at each step)

oob.values <- vector(length=5)
for(i in 1:5) {
  temp.model <- randomForest(sal_bin ~ ., 
                  data = data_rf, mtry=i, ntree=500)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
oob.values
## [1] 0.2015113 0.2065491 0.2292191 0.2594458 0.2644836

The OOB value is lowest for mtry = 1.

Data Set Enhancement

Interesting research areas could be developed from this problem. Here are some possible directions.

(1) Probability of a professor changing job/university:

This research problem is important because of several factors. If a professor leaves a university, the university may incur a significant cost and time to fill the vacant position. Some of the factors that may influence this probability are:

  1. Present salary and discipline,

  2. Years of experience (years since PhD and years in the present position),

  3. Rank of the professor, tenured / tenure-track / non-tenured, whether the professor has post-doc,

  4. Performance metrics of the professor (e.g. no. of publications, no. of citations, no. and amount external grants currently available, grant history)

  5. Location of the university, professor’s hometown

The probability of a professor changing job/university can be formulated as a classification problem, and hence can be solved using a logistic regression or random forest classifier.

(2) Whether a tenure-track professor will be granted tenure:

This question is important to both the individual professor and the university. Getting tenure is a significant milestone rewarded to a professor for excellence in research/teaching. Whether or not a tenure-track professor will be granted tenure can depend on several factors such as:

  1. Performance metrics of the professor in research (e.g. no. of publications, no. of citations, no. and amount external grants currently available, grant history),

  2. Performance metrics of the professor in teaching (e.g. rating provided by students, average class performance), collaboration with industry and academia,

  3. Whether or not professor has post-doc,

  4. Demand and supply of professors in the particular area the individual professor specializes on.

(3) Influence of demographics in the salary:

The given data set does not include demographic information of professors. It may be important to discover whether or how demographics influence the salary of professors, given all other parameters remain constant.

In order to do that, we can import several demographic information, such as race, ethnicity, age, marital status, citizenship, place of birth, and combine them with gender to see if these attributes alone have impact on the salary.