library(dplyr)
library(GGally)
library(ggplot2)
library(modelr)
library(car)
library(randomForest)
data <- read.csv("Salaries.csv", header = TRUE)
glimpse(data)
## Observations: 397
## Variables: 6
## $ rank <fct> Prof, Prof, AsstProf, Prof, Prof, AssocProf, Prof, Prof…
## $ discipline <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, A, A…
## $ yrs.since.phd <int> 19, 20, 4, 45, 40, 6, 30, 45, 21, 18, 12, 7, 1, 2, 20, …
## $ yrs.service <int> 18, 16, 3, 39, 41, 6, 23, 45, 20, 18, 8, 2, 1, 0, 18, 3…
## $ sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, F…
## $ salary <int> 139750, 173200, 79750, 115000, 141500, 97000, 175000, 1…
lapply(data, FUN=summary)
## $rank
## AssocProf AsstProf Prof
## 64 67 266
##
## $discipline
## A B
## 181 216
##
## $yrs.since.phd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 21.00 22.31 32.00 56.00
##
## $yrs.service
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.00 16.00 17.61 27.00 60.00
##
## $sex
## Female Male
## 39 358
##
## $salary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57800 91000 107300 113706 134185 231545
There is no outliers here or missing values. We can check for outliers separately as well.
Plot pairwise correlation of attributes.
It is seen that years of service and years since PhD is highly correlated. Both these variables have less effect on the salary but that could be due to multicollinearity issue.
ast_prof <- data %>% filter(rank == "AsstProf", yrs.service <= 5)
ast_prof_pct <- (nrow(ast_prof)/nrow(data))*100
ast_prof_pct
## [1] 16.62469
This question can be answered in multiple ways. With or without running the actual model. Yes there is a statistically significant difference between female and male salaries. The male salary is $14088 more than the female salary
We can now check the whisker plot of the same.
boxplot(salary~sex,data=data, main="Salary across gender",
xlab="Sex", ylab="Salary")
Boxplot of salary by sex - shows difference in salary across genders
Now we can model the salary difference between genders by computing a simple linear regression model
model_sex <- lm(salary ~ sex, data = data)
summary(model_sex)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002.41 4809.386 21.001103 2.683482e-66
## sexMale 14088.01 5064.579 2.781674 5.667107e-03
The output shows that the average salary for female is estimated at $101002, whereas males are estimated a total of $101002 + $14088 = $115090. Also the p-value of the variable sexMale is very significant (< 0.05), suggesting that there is a statistical evidence of a difference in average salary between the genders.
ggplot(data) + geom_boxplot(aes(x=rank, y=salary))
ggplot(data) + geom_boxplot(aes(x=discipline, y=salary))
salary_discipline_rank <- data %>% ggplot() +
geom_jitter(aes(rank,salary, colour=discipline)) +
geom_smooth(aes(rank,salary, colour=discipline), method=lm, se=FALSE) +
labs(x = "Rank", y = "Nine-Month Salary (in Dollars)",
title = "Nine-Month Salary vs. Discipline")
salary_discipline_rank
Some statistical tools can automatically recode categorical variables into dummy variables for regression analysis. In R, the command contrasts(data$discipline) can be used to check for automatc recoding of categorical variables. For manual recoding of discipline to binary variables “A” = 0 and “B” = 1 we can use the following
contrasts(data$discipline)
## B
## A 0
## B 1
data$discipline <- recode_factor(data$discipline, "B" = 1, "A"= 0)
This predictive analysis problem will be done using multiple linear regression model. As a prerequisite of this model building we need to check multicollinearity of the variables.
We will deal with the categorical and continuous variables separately. For the continuous variables we can check correlation as following:
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.6.2
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
cor(data[,c("yrs.since.phd", "yrs.service", "salary")])
## yrs.since.phd yrs.service salary
## yrs.since.phd 1.0000000 0.9096491 0.4192311
## yrs.service 0.9096491 1.0000000 0.3347447
## salary 0.4192311 0.3347447 1.0000000
Years since PhD and Years of service are highly correlated and we will keep either variable in our final model development. Since coefficient of years of service and salary is poor than that of years since PhD, so we will remove years of service from final model. The same can be seen from the ggpairs plot above.
Correlation for categorical variables
sr_table <- table(data[,c("sex", "rank")])
rd_table <- table(data[,c("rank", "discipline")])
sd_table <- table(data[,c("sex", "discipline")])
fisher.test(x=as.matrix(sr_table))
##
## Fisher's Exact Test for Count Data
##
## data: as.matrix(sr_table)
## p-value = 0.0114
## alternative hypothesis: two.sided
fisher.test(x=as.matrix(rd_table))
##
## Fisher's Exact Test for Count Data
##
## data: as.matrix(rd_table)
## p-value = 0.09836
## alternative hypothesis: two.sided
fisher.test(x=as.matrix(sd_table))
##
## Fisher's Exact Test for Count Data
##
## data: as.matrix(sd_table)
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.476565 2.014236
## sample estimates:
## odds ratio
## 0.9752679
Here we can see, sex and rank can have some positive correlation but other categorical variables does not show that trend.
Multiple linear regression model with sex,
mod1 <- lm(salary~sex+ yrs.since.phd + rank + discipline, data = data)
summary(mod1)
##
## Call:
## lm(formula = salary ~ sex + yrs.since.phd + rank + discipline,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67451 -13860 -1549 10716 97023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94925.94 4685.36 20.260 < 2e-16 ***
## sexMale 4349.37 3875.39 1.122 0.26242
## yrs.since.phd 61.01 127.01 0.480 0.63124
## rankAsstProf -13104.15 4167.31 -3.145 0.00179 **
## rankProf 32928.40 3544.40 9.290 < 2e-16 ***
## discipline0 -13937.47 2346.53 -5.940 6.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22660 on 391 degrees of freedom
## Multiple R-squared: 0.4472, Adjusted R-squared: 0.4401
## F-statistic: 63.27 on 5 and 391 DF, p-value: < 2.2e-16
ANOVA is a special case of linear model where the predictors are categorical variables and we can see the classic ANOVA table from the regression model.
We can see Sex is no longer significant in explaining the variability. Variable like yrs.since.phd has also high P-value indicating that it needs to be handled differently. Other variables like rank and discipline are more significant and explain variability in salary. It explains 44% of the variance in nine-month salary.
Now we model salary by removing sex.
mod2 <- lm(salary ~ yrs.since.phd + rank + discipline, data = data)
summary(mod2)
##
## Call:
## lm(formula = salary ~ yrs.since.phd + rank + discipline, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67395 -13480 -1536 10416 97166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98464.24 3467.13 28.399 < 2e-16 ***
## yrs.since.phd 71.92 126.68 0.568 0.5706
## rankAsstProf -13030.16 4168.17 -3.126 0.0019 **
## rankProf 33181.41 3538.40 9.378 < 2e-16 ***
## discipline0 -14028.68 2345.90 -5.980 5.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22670 on 392 degrees of freedom
## Multiple R-squared: 0.4454, Adjusted R-squared: 0.4398
## F-statistic: 78.72 on 4 and 392 DF, p-value: < 2.2e-16
The F-statistics get better. I also tried modeling other different combinations of predictors. salary ~ rank + yrs.since.phd does not improve performance any better. Let us try modeling by removing the yrs.since.phd.
mod3 <- lm(salary ~ rank + discipline, data = data)
summary(mod3)
##
## Call:
## lm(formula = salary ~ rank + discipline, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65990 -14049 -1288 10760 97996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99467 2981 33.366 < 2e-16 ***
## rankAsstProf -13762 3961 -3.475 0.000569 ***
## rankProf 34082 3160 10.786 < 2e-16 ***
## discipline0 -13761 2296 -5.993 4.65e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22650 on 393 degrees of freedom
## Multiple R-squared: 0.445, Adjusted R-squared: 0.4407
## F-statistic: 105 on 3 and 393 DF, p-value: < 2.2e-16
The F statistic value has increased drastically. The predictors in this case explain ~44% of the variance in nine-month salary. Since there could be other factors that contribute in determining nine-month salary for faculties, but not present in this dataset, this model performance is justified. Some features are not well understood such as years since Ph.D. and years of service and how they contribute towards.
Split into training and test dataset with 75% data in training and rest in test.
tr_index <- sample(1:nrow(data), round(nrow(data) * 0.75))
train_set <- data[tr_index,]
test_set <- data[-tr_index,]
fit <- lm(salary ~ rank + discipline, data=train_set)
Predict on the test dataset. We also calculate the residual sum of squares and root mean squared error for training and test set.
prediction_test <- predict(fit, test_set)
sum((prediction_test - test_set$salary)^2)
## [1] 46844334746
prediction_train <- predict(fit, train_set)
sum((prediction_train - train_set$salary)^2)
## [1] 1.55745e+11
sqrt(sum((prediction_test - test_set$salary)^2) / length(prediction_test))
## [1] 21752.59
sqrt(sum((prediction_train - train_set$salary)^2) / length(prediction_train))
## [1] 22861.19
It is observed that the root mean square of errors is less in training set but the value does not fall off the cliff so the performance in the test set is as expected.
mean <- mean(test_set$salary)
SST_test <- sum((test_set$salary - mean)^2)
SSE_test <- sum((prediction_test - test_set$salary)^2)
r_square_pred <- ((SST_test - SSE_test)/SST_test)*100
r_square_pred
## [1] 46.70714
The predicted R^2 value is 36.6% and the adjusted R^2 is about 44%. Therefore the model performance is consistent. It is to be noted that the data set is quite small and the training set does not capture variability in actual scenarios. The model performance might have been better with better sample size and more data points.
Recoding salary to binary variable based on median salary. If salary > median then salary is 1 else salary is 0.
data_rf <- data
data_rf$sal_bin <- as.factor(ifelse(data$salary >= 107300, 1, 0))
data_rf <- data_rf %>% select(-salary)
Prediction of the 0/1 binary indicator of salary leads to a classification problem. Since the Random Forest classifier is widely known for its high accuracy, we shall build a Random Forest classification model to predict the 0/1 salary indicator.
Random Forests do not require us to manually split the data into “training” and “test” sets. This is because every decision tree in the Random Forest is built using bootstrapped data. Thus, not every sample is used to build every tree. The “training” dataset is the bootstrapped data and the “test” dataset is the remaining samples. The remaining samples are called the “Out-Of-Bag” (OOB) data. Thus, the out-of-bag error rate is a measure of the prediction accuracy of the Random Forest classifier. The out-of-bag error rate provides the proportion of out-of-bag (i.e. testing) samples that were incorrectly classified.
Since we are building a classification model to predict the discrete 0/1 indicator, the Random Forest model in R will set “mtry”, the number of variables to consider at each step, to floor( sqrt(no. of predictor variables) ) = floor( sqrt(5) ) = 2 by default. The default number of trees in the forest is 500. We shall first build the Random Forest with these default parameter values, and then fine tune them going forward.
set.seed(42)
rf_model <- randomForest(sal_bin ~ .,
data = data_rf,
metric = "Accuracy")
rf_model
##
## Call:
## randomForest(formula = sal_bin ~ ., data = data_rf, metric = "Accuracy")
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.65%
## Confusion matrix:
## 0 1 class.error
## 0 130 68 0.34343434
## 1 14 185 0.07035176
With the default parameters, the OOB error is 20.65% that represents an accuracy around 80% which is pretty good. We will plot the OOB error rate and error rates of both classes (0 and 1) against the number of trees.
oob.error.data <- data.frame(
Trees=rep(1:nrow(rf_model$err.rate), times=3),
Type=rep(c("OOB", '1', '0'), each=nrow(rf_model$err.rate)),
Error=c(rf_model$err.rate[,"OOB"],
rf_model$err.rate[,'1'],
rf_model$err.rate[,'0']))
ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
geom_line(aes(color=Type))
Here we see convergence in the graph. So 500 trees is sufficient for our case.
If we want to compare this random forest to others with different values for mtry (to control how many variables are considered at each step)
oob.values <- vector(length=5)
for(i in 1:5) {
temp.model <- randomForest(sal_bin ~ .,
data = data_rf, mtry=i, ntree=500)
oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
oob.values
## [1] 0.2015113 0.2065491 0.2292191 0.2594458 0.2644836
The OOB value is lowest for mtry = 1.
Interesting research areas could be developed from this problem. Here are some possible directions.
This research problem is important because of several factors. If a professor leaves a university, the university may incur a significant cost and time to fill the vacant position. Some of the factors that may influence this probability are:
Present salary and discipline,
Years of experience (years since PhD and years in the present position),
Rank of the professor, tenured / tenure-track / non-tenured, whether the professor has post-doc,
Performance metrics of the professor (e.g. no. of publications, no. of citations, no. and amount external grants currently available, grant history)
Location of the university, professor’s hometown
The probability of a professor changing job/university can be formulated as a classification problem, and hence can be solved using a logistic regression or random forest classifier.
This question is important to both the individual professor and the university. Getting tenure is a significant milestone rewarded to a professor for excellence in research/teaching. Whether or not a tenure-track professor will be granted tenure can depend on several factors such as:
Performance metrics of the professor in research (e.g. no. of publications, no. of citations, no. and amount external grants currently available, grant history),
Performance metrics of the professor in teaching (e.g. rating provided by students, average class performance), collaboration with industry and academia,
Whether or not professor has post-doc,
Demand and supply of professors in the particular area the individual professor specializes on.
The given data set does not include demographic information of professors. It may be important to discover whether or how demographics influence the salary of professors, given all other parameters remain constant.
In order to do that, we can import several demographic information, such as race, ethnicity, age, marital status, citizenship, place of birth, and combine them with gender to see if these attributes alone have impact on the salary.