DATA: The 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members.
A data frame with 397 observations on the following 6 variables:
rank: a factor with levels AssocProf, AsstProf, Prof.
discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments).
yrs.since.phd: years since PhD.
yrs.service: years of service.
sex: a factor with levels Female, Male.
salary: nine-month salary, in dollars.
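# read.csv() already returns a data frame; stringsAsFactors = TRUE keeps
# rank, discipline and sex as factors (the default changed in R 4.0)
df <- read.csv("Salaries.csv", stringsAsFactors = TRUE)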
#df
anyNA(df)
## [1] FALSE
dim(df)
## [1] 397 7
summary(df)
##        X              rank     discipline yrs.since.phd    yrs.service
##  Min.   :  1   AssocProf: 64   A:181      Min.   : 1.00   Min.   : 0.00
##  1st Qu.:100   AsstProf : 67   B:216      1st Qu.:12.00   1st Qu.: 7.00
##  Median :199   Prof     :266              Median :21.00   Median :16.00
##  Mean   :199                              Mean   :22.31   Mean   :17.61
##  3rd Qu.:298                              3rd Qu.:32.00   3rd Qu.:27.00
##  Max.   :397                              Max.   :56.00   Max.   :60.00
##      sex          salary
##  Female: 39   Min.   : 57800
##  Male  :358   1st Qu.: 91000
##               Median :107300
##               Mean   :113706
##               3rd Qu.:134185
##               Max.   :231545
# Keep the six analysis variables and drop the row-index column X
keep_cols <- c("rank","discipline","yrs.since.phd","yrs.service","sex","salary")
sal <- df[keep_cols]
head(sal)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000
par(mfrow=c(3,2))
plot(sal$salary ~ sal$rank)
plot(sal$salary ~ sal$discipline)
plot(sal$salary ~ sal$yrs.since.phd)
plot(sal$salary ~ sal$yrs.service)
plot(sal$salary ~ sal$sex)
From the plots, sex does not appear to be a major factor in salary determination, although males have slightly higher salaries than females. Salary shows a roughly linear, positive relationship with yrs.since.phd and yrs.service. Rank and discipline also appear to be related to salary: salary tends to be higher at higher ranks and in discipline B than in discipline A.
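The visual impressions above can be checked numerically with group means and a correlation. The following is a minimal sketch on a made-up data frame `toy` (hypothetical values, not the Salaries data) that only reuses the column names; with the real data, `toy` would be replaced by `sal` before the recoding step.

```r
# Hypothetical toy data with the same column names as `sal`;
# all values are invented for illustration only.
toy <- data.frame(
  rank        = factor(c("AsstProf", "AsstProf", "AssocProf",
                         "AssocProf", "Prof", "Prof")),
  sex         = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  yrs.service = c(2, 3, 8, 9, 20, 25),
  salary      = c(78000, 80000, 92000, 94000, 120000, 126000)
)

# Mean salary within each level of a categorical predictor
mean_by_rank <- tapply(toy$salary, toy$rank, mean)
mean_by_sex  <- tapply(toy$salary, toy$sex, mean)
print(mean_by_rank)
print(mean_by_sex)

# Linear association between salary and a numeric predictor
r_service <- cor(toy$yrs.service, toy$salary)
print(r_service)
```

Group means suit the categorical variables (rank, discipline, sex), while a correlation coefficient is only meaningful for the numeric ones (yrs.since.phd, yrs.service).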
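# Recode sex as a 0/1 indicator: Male = 0, Female = 1
sal$sex <- as.character(sal$sex)
sal$sex[sal$sex == "Male"] <- 0
sal$sex[sal$sex == "Female"] <- 1
sal$sex <- as.integer(sal$sex)
# Recode rank as an integer score. Note that the codes follow the
# alphabetical level order (AssocProf = 1, AsstProf = 2, Prof = 3),
# not the career order AsstProf < AssocProf < Prof
sal$rank <- as.character(sal$rank)
sal$rank[sal$rank == "AssocProf"] <- 1
sal$rank[sal$rank == "AsstProf"] <- 2
sal$rank[sal$rank == "Prof"] <- 3
sal$rank <- as.integer(sal$rank)
# Recode discipline: A (theoretical) = 1, B (applied) = 2
sal$discipline <- as.character(sal$discipline)
sal$discipline[sal$discipline == "A"] <- 1
sal$discipline[sal$discipline == "B"] <- 2
sal$discipline <- as.integer(sal$discipline)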
head(sal)
##   rank discipline yrs.since.phd yrs.service sex salary
## 1    3          2            19          18   0 139750
## 2    3          2            20          16   0 173200
## 3    2          2             4           3   0  79750
## 4    3          2            45          39   0 115000
## 5    3          2            40          41   0 141500
## 6    1          2             6           6   0  97000
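The integer recoding above makes rank a numeric score in alphabetical level order. One alternative, sketched here under the assumption that the career order AsstProf < AssocProf < Prof is the intended one, is to keep rank as a factor with explicitly declared levels and let the model formula expand it into indicator columns. The vector `rank_chr` is a hypothetical example; only the level names come from the data description.

```r
# Hypothetical example values; only the level names come from the data set.
rank_chr <- c("Prof", "AsstProf", "AssocProf", "Prof")

# Declare the career order explicitly instead of relying on alphabetical order
rank_f <- factor(rank_chr, levels = c("AsstProf", "AssocProf", "Prof"))

# as.integer() now follows the declared order:
# AsstProf = 1, AssocProf = 2, Prof = 3
rank_int <- as.integer(rank_f)
print(rank_int)

# In a model formula a factor is expanded into 0/1 indicator columns,
# with the first level (AsstProf) as the reference category
X <- model.matrix(~ rank_f)
print(colnames(X))
```

With indicator columns, each rank gets its own salary offset relative to the reference level, so no spacing between ranks has to be assumed and a squared rank term is not needed to capture unequal steps.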
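# Squared rank score, to allow a non-linear effect of rank
rank1 <- sal$rank^2
# Sex by years-since-PhD product term. Note: this uses df$sex (factor level
# codes, Female = 1 and Male = 2), not the 0/1 recoding stored in sal$sex
S_y <- as.numeric(df$sex) * as.numeric(df$yrs.since.phd)
# rank1 and S_y are not columns of sal; lm() finds them in the workspace
df_lm <- lm(salary ~ sex + rank + rank1 + discipline + yrs.since.phd + yrs.service + S_y, data=sal)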
summary(df_lm)
##
## Call:
## lm(formula = salary ~ sex + rank + rank1 + discipline + yrs.since.phd +
## yrs.service + S_y, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65259 -13216 -1773 10225 99584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140161.18 14922.01 9.393 < 2e-16 ***
## sex -4538.60 7663.20 -0.592 0.5540
## rank -99954.57 15447.87 -6.470 2.94e-10 ***
## rank1 29009.03 3851.50 7.532 3.53e-13 ***
## discipline 14419.41 2346.38 6.145 1.98e-09 ***
## yrs.since.phd 506.96 796.66 0.636 0.5249
## yrs.service -490.08 212.76 -2.303 0.0218 *
## S_y 14.49 391.71 0.037 0.9705
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22570 on 389 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4449
## F-statistic: 46.33 on 7 and 389 DF, p-value: < 2.2e-16
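The product term S_y is built from df$sex, and as.numeric() on a factor returns its level codes rather than a 0/1 indicator. The sketch below uses hypothetical values (`sex_f`, `yrs`) to show how the two codings give different product columns; it is an illustration, not a re-analysis of the model above.

```r
# Hypothetical values for illustration
sex_f <- factor(c("Male", "Female", "Male", "Female"))
yrs   <- c(10, 20, 30, 40)

# as.numeric() on a factor returns level codes
# (alphabetical levels: Female = 1, Male = 2)
codes <- as.numeric(sex_f)
print(codes)

# Product with the level codes, as in the construction of S_y
s_y_codes <- codes * yrs

# Product with a 0/1 indicator (Female = 1), matching the recoding of sal$sex
female <- as.integer(sex_f == "Female")
s_y_01 <- female * yrs

print(s_y_codes)
print(s_y_01)
```

With the level codes, the product is nonzero for every row, so it behaves partly like a second copy of the years variable. An interaction can also be requested directly in the formula, for example `salary ~ sex * yrs.since.phd`, which lets R build the product column from whatever coding the variables have.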
S_y has the highest p-value (0.9705), followed by sex (0.5540). Sex is removed first, then the squared rank term (rank1).
df_lm <- update(df_lm, .~. -sex)
summary(df_lm)
##
## Call:
## lm(formula = salary ~ rank + rank1 + discipline + yrs.since.phd +
## yrs.service + S_y, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65407 -13563 -1740 9838 99501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140638.0 14887.9 9.446 < 2e-16 ***
## rank -101186.0 15294.6 -6.616 1.22e-10 ***
## rank1 29333.9 3809.1 7.701 1.12e-13 ***
## discipline 14465.2 2343.1 6.173 1.68e-09 ***
## yrs.since.phd 118.5 451.9 0.262 0.7932
## yrs.service -494.8 212.4 -2.329 0.0203 *
## S_y 214.8 197.3 1.089 0.2769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22550 on 390 degrees of freedom
## Multiple R-squared: 0.4542, Adjusted R-squared: 0.4458
## F-statistic: 54.09 on 6 and 390 DF, p-value: < 2.2e-16
df_lm <- update(df_lm, .~. -rank1)
summary(df_lm)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service +
## S_y, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72745 -16524 -3253 14601 96473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34428.3 6011.0 5.728 2.03e-08 ***
## rank 15809.5 1895.2 8.342 1.27e-15 ***
## discipline 15591.3 2506.9 6.219 1.28e-09 ***
## yrs.since.phd 944.3 470.6 2.007 0.04547 *
## yrs.service -593.1 227.3 -2.609 0.00942 **
## S_y 114.3 211.1 0.542 0.58836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24170 on 391 degrees of freedom
## Multiple R-squared: 0.3712, Adjusted R-squared: 0.3631
## F-statistic: 46.16 on 5 and 391 DF, p-value: < 2.2e-16
df_lm <- update(df_lm, .~. -yrs.since.phd)
summary(df_lm)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.service + S_y,
## data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72970 -16494 -3401 14071 95725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36340.4 5957.8 6.100 2.55e-09 ***
## rank 16301.2 1886.5 8.641 < 2e-16 ***
## discipline 15096.7 2504.4 6.028 3.83e-09 ***
## yrs.service -417.7 210.6 -1.983 0.0481 *
## S_y 477.0 109.4 4.360 1.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24260 on 392 degrees of freedom
## Multiple R-squared: 0.3647, Adjusted R-squared: 0.3582
## F-statistic: 56.26 on 4 and 392 DF, p-value: < 2.2e-16
df_lm <- update(df_lm, .~. -S_y)
summary(df_lm)
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.service, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71843 -17913 -3922 14642 94776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39086.2 6058.7 6.451 3.26e-10 ***
## rank 18760.7 1841.0 10.191 < 2e-16 ***
## discipline 13553.4 2535.4 5.346 1.53e-07 ***
## yrs.service 376.1 108.3 3.473 0.000571 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24810 on 393 degrees of freedom
## Multiple R-squared: 0.3339, Adjusted R-squared: 0.3288
## F-statistic: 65.67 on 3 and 393 DF, p-value: < 2.2e-16
df_lm <- update(df_lm, .~. -discipline)
summary(df_lm)
##
## Call:
## lm(formula = salary ~ rank + yrs.service, data = sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74877 -19554 -3799 17058 102694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61808.7 4465.9 13.840 < 2e-16 ***
## rank 18620.0 1904.1 9.779 < 2e-16 ***
## yrs.service 294.3 110.9 2.654 0.00829 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25670 on 394 degrees of freedom
## Multiple R-squared: 0.2855, Adjusted R-squared: 0.2818
## F-statistic: 78.71 on 2 and 394 DF, p-value: < 2.2e-16
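Each deletion step above can also be checked by comparing the reduced model against the larger one. The following is a minimal sketch on simulated data (the variables `x1`, `x2` and `y` are hypothetical), using `anova()` for the nested-model F test and the adjusted R-squared from `summary()`.

```r
set.seed(1)
# Simulated data (hypothetical): y depends on x1 but not on x2
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- update(full, . ~ . - x2)

# F test for the dropped term; a large p-value supports dropping it
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared before and after the deletion
adj_full    <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared
print(c(full = adj_full, reduced = adj_reduced))
```

For a single dropped coefficient the F test gives the same p-value as the t test in `summary()`; the comparison is more informative when several terms are dropped at once. A sharp fall in adjusted R-squared after a deletion suggests the removed term was carrying real explanatory weight.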
plot(df_lm$fitted.values, df_lm$residuals, xlab="Fitted Values", ylab="Residuals")
abline(h=0)
qqnorm(df_lm$residuals)
qqline(df_lm$residuals)
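The positive coefficient on rank indicates that salary increases with rank as coded. In the residuals versus fitted plot, the residuals at smaller fitted values appear to lie closer to the regression line than those at larger fitted values. Based on the R-squared value (0.2855), the final model explains about 29% of the variability in salary.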
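The R-squared figure quoted above is one minus the ratio of the residual sum of squares to the total sum of squares. As a sketch of that arithmetic, the code below recomputes R-squared and adjusted R-squared by hand on simulated data (`x`, `y` and `fit` are hypothetical) and compares them with the values reported by `summary()`.

```r
set.seed(42)
# Simulated data (hypothetical), used only to show the computation
x   <- rnorm(50)
y   <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
r2_manual <- 1 - rss / tss

n <- length(y)
p <- length(coef(fit)) - 1        # number of predictors, excluding the intercept
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(manual = r2_manual, summary = summary(fit)$r.squared))
print(c(manual = adj_r2_manual, summary = summary(fit)$adj.r.squared))
```

Applying the same adjustment to the final model's reported values (R-squared 0.2855, n = 397, p = 2) gives roughly 0.282, in line with the adjusted R-squared of 0.2818 in the output once rounding of the inputs is taken into account.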