library(carData)
library(MASS)
library(ISLR)
library(car)
library(leaps)
For this data analysis project, we wanted to look into something that was interesting for us. We decided to explore Salaries for Professors , which is a data collection in the {carData} package in r. The data was collected from 2008-2009 and is a nine-month academic salary for Assistant Professors, Associate Professors, and Professors in a college in the U.S. The data variables and attributes are:
-rank:Factor variable composed by te following (AssocProf,AsstProf,Prof).
-discipline:Factor variable with levels A (“theoretical” departments) or B (“applied” departments).
-yrs.since.pdh:Number of years since the professor has obtained their PhD.
-yrs.service:Number of years the professor has served university.
-sex:Factor variable with levels (Female, Male)
-salary:Nine-month salary, in dollars.
Our project’s goal is to identify if a professors’ salary is influenced by other variables. In order to do this, we are going to analyze which of the variables in the dataset are important and should be considered through regression analysis.
Salaries1<-Salaries
str(Salaries1)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
## $ discipline : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
## $ salary : int 139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
attach(Salaries1)
Since we wanted to start making inferences from the data. We first start examining the distribution of the variables.The first variable we start looking at was the dependent variable(Salary)
#Summary variables
summary(Salaries1)
## rank discipline yrs.since.phd yrs.service sex
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## salary
## Min. : 57800
## 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
#1. Histogram Salaries
summary(Salaries1$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57800 91000 107300 113706 134185 231545
par(mfrow=c(1,2))
hist(salary,col="pink",main="Professors Salary")
boxplot(salary)
The histogram for the response variable “Salary” shows that this variable is skewed to the right.Also, The summary statistics show that the median for a nine-month salary is $107,300.The minimum salary is $57,800 and the maximum salary is $231,545. We can also see from the boxplot that there are no significant outliers. Based on both,the histogram and the boxplot, we can say that overall the compensation among professors does not significantly vary. In general , professors salaries situate areound the mean average and they are not significant differences in the professors salary compensation.
#2. Histogram & Boxplots independent variables
summary(Salaries1$yrs.since.phd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 21.00 22.31 32.00 56.00
summary(Salaries1$yrs.service)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.00 16.00 17.61 27.00 60.00
par(mfrow=c(2,2))
hist(yrs.since.phd,col="orange", main="Professors years since Phd")
hist(yrs.service, col="green", main="Professors years of Service")
boxplot(yrs.since.phd,main="Professors years since Phd")
boxplot(yrs.service,main="Professors years of service")
The histograms for the predictor variables “Years since Phd” and “Years of service” show a similar pattern, both variables are skewed to the right and have no outliers exept for “Years of service” which just have one outlier. The summary statistics show that the mean numbers of years since a professor has obtained a PhD is 22.31 years and the range of the overall distribution is 55 years. Likewise, the summary statistics for years of service show that mean number of years of service is 17.61 years and the range is 60 years in service .
#3. Boxplots Salary vs Categorical variables
par(mfrow=c(1,3))
plot(salary~rank)
plot(salary~discipline)
plot(salary~sex)
The boxplots for the categorical variables “rank”,“discipline” and “sex” suggest that associate professors earn lower salaries compare to assistant professors, and professors.
Also, we can see salaries differences by discipline where applied deparments professors seem to receive significant more salaries either when they have lower or higher salaries.
The last boxplot with the categorical variable sex shows that the median average salary for a male professor is higher compared with female proffesors.
To further continue the regression analysis of salary we are going to assess how the variables interact with each other, and so we are goin to start by doing a scatter plot matrix.
#Correlation variables
pairs(Salaries1)
The scatter plot matrix shows that overall the majority of variables are categorical and only “Years of service” and “Years since Phd” are numerical. It also apprears that years since PhD and years service have a strong linear relationship. This suggest that multicolinearity will probabily be an issue with these two columns so we will need to investigate that in future models.
For now we are going to start exploring the linear regression between salary and years since PhD.
The first fitted model we decided to fit was salary Vs. years since PhD.
#Salary years since phd
salary.phd<-lm(salary~yrs.since.phd)
summary(salary.phd)
##
## Call:
## lm(formula = salary ~ yrs.since.phd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84171 -19432 -2858 16086 102383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91718.7 2765.8 33.162 <2e-16 ***
## yrs.since.phd 985.3 107.4 9.177 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27530 on 395 degrees of freedom
## Multiple R-squared: 0.1758, Adjusted R-squared: 0.1737
## F-statistic: 84.23 on 1 and 395 DF, p-value: < 2.2e-16
plot(yrs.since.phd,salary,main="Salary vs professors years PHD")
abline(salary.phd,col=2,lwd=2)
The linear regression model for Salary vs Years since Phd is as follows:
\[Salary= 91718.7+ 985.3*yrs.since.phd \]
We can see that the summary statistics show that 17.58% of the variability in salaries can be explained by the fitted linear regression model and the model overall seems to be valid. The plot suggests that there is a positive relationship between these variables. Although, this relationship is not considered to be strong we are going to keep exploring this model to see if it is a good fit for salaries analysis.
In order to keep assessing the validity of the model. We are going to check the model’s diagnostics:
par(mfrow=c(2,2))
plot(salary.phd)
As we can see the diagnostic plots show that the fitted linear model (Salary vs. Years since Phd) presents some issues. Especifically, the data presents problems with the linearity and the variance of the standarized residuals.Because of these issues we are going to transformn the fitted model with multiple transformations and see what transfromation is able to improve the linear model assumptions.
#Applying boxcox
b<-boxcox(salary~yrs.since.phd)
lambda<-round(b$x[which.max(b$y)],2)
#New Model
new_salaryphd<-lm(salary^lambda~yrs.since.phd)
plot(salary^lambda~yrs.since.phd,main="Transformed Salary Box-Cox")
abline(new_salaryphd,col=2,lwd=2)
The box-cox transformation was the first transformation we tried since the scale-location plot seemed to have a funnel shape suggesting variance issues. As we can see, even when we transformed the data we can observe that the fitted transformed model does not fit the data very well. Overall,the fitted model seems to be quadratic rather than linear. However, in order to verify if this apparent behavior would be an issue, we also plotted the diagnostic plots.The output is given below:
#Box-cox Transformation
par(mfrow=c(2,2))
plot(new_salaryphd)
As we can see the diagnostic plots show that the variance of the model does not improve when we transformed the data using box-cox. We can also see that the relationship between salary and Years since Ph.D. does not follow a linear relation, and it rather follows a quadratic shape, just as we previously mentioned. Therefore, we decided to keep trying more transformations in order to fix linearity. Below are some of the transformations we tried:
#Diagnostics Log(Y) transformation
logy_salaryphd<-lm(log(salary)~yrs.since.phd)
par(mfrow=c(2,2))
plot(logy_salaryphd)
#Diagnostics Log(X) transformation
logx_salaryphd<-lm(salary~log(yrs.since.phd))
par(mfrow=c(2,2))
plot(logx_salaryphd)
#Diagnostics Log(Y,X) transformation
log_salaryphd<-lm(log(salary)~log(yrs.since.phd))
par(mfrow=c(2,2))
plot(log_salaryphd)
#Diagnostics Polynomial transformation
poly_salaryphd<-lm(salary~yrs.since.phd+I(yrs.since.phd^2))
par(mfrow=c(2,2))
plot(poly_salaryphd)
#Diagnostics Sqrt(Y) and Polynomial (X) transformation
sqrty_salaryphd<-lm(sqrt(salary)~yrs.since.phd+I(yrs.since.phd^2))
par(mfrow=c(2,2))
plot(sqrty_salaryphd)
Based on the diagnostic plots we consider that from the previous transformations. The transformation that possibly better describes the relation between salary and years since phd is as follows:
\[\sqrt{Salary}=yrs.since.phd+yrs.since.phd^2\]
As we can see in this transformation is far from being perfect but we were able address linearity and homoscedasticity.
In this section we decided to further our analysis examining if salary is affected for more than one variable. For this part we start considering all variables. The full model is as follows:
# Model 1 (Y=salary,X=yrs.service)
model.full<-lm(salary~rank+discipline+yrs.since.phd+yrs.service+sex,data=Salaries1)
summary(model.full)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service +
## sex, data = Salaries1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65248 -13211 -1775 10384 99592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65955.2 4588.6 14.374 < 2e-16 ***
## rankAssocProf 12907.6 4145.3 3.114 0.00198 **
## rankProf 45066.0 4237.5 10.635 < 2e-16 ***
## disciplineB 14417.6 2342.9 6.154 1.88e-09 ***
## yrs.since.phd 535.1 241.0 2.220 0.02698 *
## yrs.service -489.5 211.9 -2.310 0.02143 *
## sexMale 4783.5 3858.7 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
f<-model.full$fitted.values
plot(f,salary,main="Multiple Linear Regression Professors Salaries")
abline(lm(salary~f),col=2,lwd=2)
#Assessing Multicollinearity
#Pearson correlation yrs.service and years since phd
cor(yrs.service,yrs.since.phd,method="pearson")
## [1] 0.9096491
# VIF
vif(model.full)
## GVIF Df GVIF^(1/(2*Df))
## rank 2.013193 2 1.191163
## discipline 1.064105 1 1.031555
## yrs.since.phd 7.518936 1 2.742068
## yrs.service 5.923038 1 2.433729
## sex 1.030805 1 1.015285
\[Salary=rank + discipline + yrs.since.phd + yrs.service + sex\]
As we can see the multiple linear regression model shows all variables are statistically significant except for the categorical variable sex. The fitted multiple linear regression model also shows that 45.47% of the variability in salary can be explained by the fitted linear regression model.
Fitting a multiple linear regression for this model and considering all predictor variables, we were also able to verify the initial assumption we had about multicollinearity. The Pearson correlation between years since Ph.D. and years service is 0.9096491 which is almost 1. This means, that years of service and years since PhD hold a strong positive relationship. Therefore, we should not consider both variables in our fitted regression model since they are not independent and having both variable would affect our prediction results.The VIF test for this model shows that 2 out of the 5 variables are indeed pretty high, these two variables are years since Ph.D. with a VIF of 7.518936 a years of service with a VIF of 5.923038.Therefore, we can say that the coefficients in our sample were poorly estimated and the variables years since PhD and years service should be further analyzed.
In order to address these issues we are going to compare the full model with a reduce model. Esentially, we will consider all predictor variables except sex and years PhD because it has a higher VIF.The output for the reduced model is below:
model_reduced<-lm(salary~rank+discipline+yrs.service,data=Salaries1)
summary(model_reduced)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.service, data = Salaries1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64198 -14040 -1299 10724 99253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72253.53 3169.48 22.797 < 2e-16 ***
## rankAssocProf 14483.23 4100.53 3.532 0.000461 ***
## rankProf 49377.50 3832.90 12.883 < 2e-16 ***
## disciplineB 13561.43 2315.91 5.856 1.01e-08 ***
## yrs.service -76.33 111.25 -0.686 0.493039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22670 on 392 degrees of freedom
## Multiple R-squared: 0.4456, Adjusted R-squared: 0.44
## F-statistic: 78.78 on 4 and 392 DF, p-value: < 2.2e-16
#VIF test
vif(model_reduced)
## GVIF Df GVIF^(1/(2*Df))
## rank 1.588631 2 1.122679
## discipline 1.028057 1 1.013932
## yrs.service 1.613750 1 1.270335
Reducing the model seems to partially adress the issues with collinearity. The VIF test shows that all values are below 5 which is a good indicator for no collinearity problems. However, the summary of the model shows that excluding sex and years from the model does not completely fix collinearity. As we can see years service still has a negative coefficient.
To further this analysis we decided to thoroughly examine which of these variables should be included in our model.
Because we encounter multicollinearity problems in our data between (years since PhD and years service). In this section of the analysis, we are going to focus on estimating which variables should be included in the fitted multiple linear model. In this case, we applied subset wise selection. The output for this section is as follows:
salary.bwd<-regsubsets(salary~.,data=Salaries1)
salary.sum<-summary(salary.bwd)
par(mfrow=c(2,2))
plot(salary.sum$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(salary.sum$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(salary.sum$adjr2)
## [1] 6
points(6,salary.sum$adjr2[6],col="red",pch=20)
plot(salary.sum$cp,xlab="Number of Variables",ylab="Cp",type="l")
which.min(salary.sum$cp)
## [1] 5
points(5,salary.sum$cp[5],col="red",pch=20)
plot(salary.sum$bic,xlab="Number of Variables",ylab="BIC",type="l")
which.min(salary.sum$bic)
## [1] 3
points(3,salary.sum$bic[3],col="red",pch=20)
par(mfrow=c(2,2))
plot(salary.bwd,scale="r2")
plot(salary.bwd,scale="adjr2")
plot(salary.bwd,scale="Cp")
plot(salary.bwd,scale="bic")
As we can see subset wise variable selection suggest that we should use the third model according to BIC criterion.While, the adjusted Rssq and CP are pretty close to each other.Suggesting using all 6 variables and 5 variables excluding sex, respectively. However, as we explore in the previous session including all predictor variables could be problematic for our analysis.Therefore, we are going to fit a model using BIC suggestion (rankAssocProf,rankProf and disciplineB). The fitted model is the following:
Final_model<-lm(salary~rank+discipline,data=Salaries1)
summary(Final_model)
##
## Call:
## lm(formula = salary ~ rank + discipline, data = Salaries1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65990 -14049 -1288 10760 97996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71944 3135 22.948 < 2e-16 ***
## rankAssocProf 13762 3961 3.475 0.000569 ***
## rankProf 47844 3112 15.376 < 2e-16 ***
## disciplineB 13761 2296 5.993 4.65e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22650 on 393 degrees of freedom
## Multiple R-squared: 0.445, Adjusted R-squared: 0.4407
## F-statistic: 105 on 3 and 393 DF, p-value: < 2.2e-16
44.07% of the variability in Salary can be explained by the fitted multiple linear regression model. Also we can see all variables are significant and the p-value of the overall test is significant as well. To explore how good is the model we are going to explore the effect of each predictor having adjusted the effect of the other predictor variables. The output is shown below:
avPlots(Final_model)
The added variable plots tells us that discipline and rank are significant in the analysis of salary. Therefore,we should consider having them in the model since they have a steady positive relation with salary.