DATA605 - Multiple linear regression model

DATA: The 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members. The data frame has 397 observations on the following 6 variables:

rank: a factor with levels AssocProf, AsstProf, Prof.
discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments).
yrs.since.phd: years since PhD.
yrs.service: years of service.
sex: a factor with levels Female, Male.
salary: nine-month salary, in dollars.

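# read.csv() already returns a data frame; stringsAsFactors = TRUE keeps
# rank, discipline and sex as factors (no longer the default since R 4.0)
df <- read.csv("Salaries.csv", stringsAsFactors = TRUE)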
#df
anyNA(df)

## [1] FALSE

dim(df)

## [1] 397   7

summary(df)

##        X              rank     discipline yrs.since.phd    yrs.service   
##  Min.   :  1   AssocProf: 64   A:181      Min.   : 1.00   Min.   : 0.00  
##  1st Qu.:100   AsstProf : 67   B:216      1st Qu.:12.00   1st Qu.: 7.00  
##  Median :199   Prof     :266              Median :21.00   Median :16.00  
##  Mean   :199                              Mean   :22.31   Mean   :17.61  
##  3rd Qu.:298                              3rd Qu.:32.00   3rd Qu.:27.00  
##  Max.   :397                              Max.   :56.00   Max.   :60.00  
##      sex          salary      
##  Female: 39   Min.   : 57800  
##  Male  :358   1st Qu.: 91000  
##               Median :107300  
##               Mean   :113706  
##               3rd Qu.:134185  
##               Max.   :231545

# Keep the six analysis variables and drop the row-index column X
keep_cols <- c("rank","discipline","yrs.since.phd","yrs.service","sex","salary")
sal <- df[keep_cols]
head(sal)

##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

par(mfrow=c(3,2)) 
plot(sal$salary ~ sal$rank)
plot(sal$salary ~ sal$discipline)
plot(sal$salary ~ sal$yrs.since.phd)
plot(sal$salary ~ sal$yrs.service)
plot(sal$salary ~ sal$sex)

From the plots, sex does not appear to be a major factor in salary determination, although males have slightly higher salaries than females. Salary shows a roughly linear, positive relationship with yrs.since.phd and yrs.service. Rank and discipline also appear to be positively related to salary: salary is higher for professors than for the other ranks, and higher in discipline B than in discipline A.
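The visual impressions above can be checked numerically with group medians. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not the Salaries data); it only shows the `tapply()` pattern.

```r
# Toy data frame; the values are made up for illustration only.
toy <- data.frame(
  rank   = factor(c("AsstProf", "AsstProf", "AssocProf", "AssocProf", "Prof", "Prof")),
  sex    = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  salary = c(78000, 80000, 93000, 95000, 120000, 130000)
)

# Median salary within each level of a factor
med_by_rank <- tapply(toy$salary, toy$rank, median)
med_by_sex  <- tapply(toy$salary, toy$sex, median)

print(med_by_rank)
print(med_by_sex)
```

On the real data the same calls would be, for example, `tapply(sal$salary, sal$rank, median)`, run before the factors are recoded to integers.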

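# Recode sex as a 0/1 indicator (Male = 0, Female = 1)
sal$sex <- as.character(sal$sex)
sal$sex[sal$sex == "Male"] <- 0
sal$sex[sal$sex == "Female"] <- 1
sal$sex <- as.integer(sal$sex)

# Recode rank as integer codes. The codes follow the alphabetical factor
# order (AssocProf = 1, AsstProf = 2, Prof = 3), not seniority.
sal$rank <- as.character(sal$rank)
sal$rank[sal$rank == "AssocProf"] <- 1
sal$rank[sal$rank == "AsstProf"] <- 2
sal$rank[sal$rank == "Prof"] <- 3
sal$rank <- as.integer(sal$rank)
# Recode discipline as integer codes (A = 1, B = 2)
sal$discipline <- as.character(sal$discipline)
sal$discipline[sal$discipline == "A"] <- 1
sal$discipline[sal$discipline == "B"] <- 2
sal$discipline <- as.integer(sal$discipline)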
head(sal)

##   rank discipline yrs.since.phd yrs.service sex salary
## 1    3          2            19          18   0 139750
## 2    3          2            20          16   0 173200
## 3    2          2             4           3   0  79750
## 4    3          2            45          39   0 115000
## 5    3          2            40          41   0 141500
## 6    1          2             6           6   0  97000
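As an alternative to hand-coding integers, `lm()` can work with factors directly and builds 0/1 indicator columns itself. The sketch below is not the approach used in this analysis; `toy` and its values are made up, and the level order (AsstProf, AssocProf, Prof) is an assumption about seniority.

```r
# Toy data; values are made up for illustration only.
toy <- data.frame(
  rank   = c("Prof", "AsstProf", "AssocProf", "Prof", "AsstProf", "AssocProf"),
  salary = c(130000, 80000, 95000, 120000, 78000, 93000)
)

# Declare rank as a factor with levels in order of seniority;
# the first level (AsstProf) becomes the reference category.
toy$rank <- factor(toy$rank, levels = c("AsstProf", "AssocProf", "Prof"))

# The design matrix has one 0/1 indicator column per non-reference level
X <- model.matrix(salary ~ rank, data = toy)
print(colnames(X))

# Each rank coefficient is the difference in mean salary from AsstProf
fit <- lm(salary ~ rank, data = toy)
print(coef(fit))
```

With this coding no numeric spacing between ranks is assumed, so each rank gets its own salary offset relative to the reference level.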

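# Squared rank term and sex-by-experience interaction, built as free-standing vectors.
# Note: S_y uses the factor codes of df$sex (Female = 1, Male = 2), not the 0/1 sal$sex.
rank1 <- sal$rank^2
S_y <- as.numeric(df$sex) * as.numeric(df$yrs.since.phd)
# rank1 and S_y are not columns of sal; lm() finds them in the workspace
df_lm <- lm(salary ~ sex + rank + rank1 + discipline + yrs.since.phd + yrs.service + S_y, data=sal)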
summary(df_lm)

## 
## Call:
## lm(formula = salary ~ sex + rank + rank1 + discipline + yrs.since.phd + 
##     yrs.service + S_y, data = sal)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65259 -13216  -1773  10225  99584 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   140161.18   14922.01   9.393  < 2e-16 ***
## sex            -4538.60    7663.20  -0.592   0.5540    
## rank          -99954.57   15447.87  -6.470 2.94e-10 ***
## rank1          29009.03    3851.50   7.532 3.53e-13 ***
## discipline     14419.41    2346.38   6.145 1.98e-09 ***
## yrs.since.phd    506.96     796.66   0.636   0.5249    
## yrs.service     -490.08     212.76  -2.303   0.0218 *  
## S_y               14.49     391.71   0.037   0.9705    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22570 on 389 degrees of freedom
## Multiple R-squared:  0.4547, Adjusted R-squared:  0.4449 
## F-statistic: 46.33 on 7 and 389 DF,  p-value: < 2.2e-16
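Squared and interaction terms such as those in the model above can also be written inside the formula, so that every term is computed from the same data frame. This is a minimal sketch on simulated data; `toy`, its coefficients and the noise level are made-up assumptions, and the fit is not comparable to the output above.

```r
set.seed(1)
n <- 40

# Simulated data; all numbers are made up for illustration only.
toy <- data.frame(
  sex           = rep(c(0, 1), each = n / 2),
  yrs.since.phd = rep(1:20, times = 2)
)
toy$salary <- 60000 + 1000 * toy$yrs.since.phd - 20 * toy$yrs.since.phd^2 +
  2000 * toy$sex + 100 * toy$sex * toy$yrs.since.phd + rnorm(n, sd = 3000)

# I(x^2) adds a squared term; a:b adds the product (interaction) of a and b
fit <- lm(salary ~ sex + yrs.since.phd + I(yrs.since.phd^2) + sex:yrs.since.phd,
          data = toy)
print(names(coef(fit)))
```

Writing the terms in the formula keeps them tied to the `data` argument, which avoids mixing columns from two different data frames.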

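The coefficient for sex is not significant (p = 0.554), so sex is removed first, followed by the squared rank term rank1. Note that S_y has the highest p-value in this fit (0.9705), and that rank1 is highly significant (p = 3.53e-13), so removing it lowers the adjusted R-squared from 0.4458 to 0.3631.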

df_lm <- update(df_lm, .~. -sex)
summary(df_lm)

## 
## Call:
## lm(formula = salary ~ rank + rank1 + discipline + yrs.since.phd + 
##     yrs.service + S_y, data = sal)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65407 -13563  -1740   9838  99501 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    140638.0    14887.9   9.446  < 2e-16 ***
## rank          -101186.0    15294.6  -6.616 1.22e-10 ***
## rank1           29333.9     3809.1   7.701 1.12e-13 ***
## discipline      14465.2     2343.1   6.173 1.68e-09 ***
## yrs.since.phd     118.5      451.9   0.262   0.7932    
## yrs.service      -494.8      212.4  -2.329   0.0203 *  
## S_y               214.8      197.3   1.089   0.2769    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22550 on 390 degrees of freedom
## Multiple R-squared:  0.4542, Adjusted R-squared:  0.4458 
## F-statistic: 54.09 on 6 and 390 DF,  p-value: < 2.2e-16

df_lm <- update(df_lm, .~. -rank1)
summary(df_lm)

## 
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service + 
##     S_y, data = sal)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72745 -16524  -3253  14601  96473 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34428.3     6011.0   5.728 2.03e-08 ***
## rank           15809.5     1895.2   8.342 1.27e-15 ***
## discipline     15591.3     2506.9   6.219 1.28e-09 ***
## yrs.since.phd    944.3      470.6   2.007  0.04547 *  
## yrs.service     -593.1      227.3  -2.609  0.00942 ** 
## S_y              114.3      211.1   0.542  0.58836    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24170 on 391 degrees of freedom
## Multiple R-squared:  0.3712, Adjusted R-squared:  0.3631 
## F-statistic: 46.16 on 5 and 391 DF,  p-value: < 2.2e-16

df_lm <- update(df_lm, .~. -yrs.since.phd)
summary(df_lm)

## 
## Call:
## lm(formula = salary ~ rank + discipline + yrs.service + S_y, 
##     data = sal)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72970 -16494  -3401  14071  95725 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36340.4     5957.8   6.100 2.55e-09 ***
## rank         16301.2     1886.5   8.641  < 2e-16 ***
## discipline   15096.7     2504.4   6.028 3.83e-09 ***
## yrs.service   -417.7      210.6  -1.983   0.0481 *  
## S_y            477.0      109.4   4.360 1.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24260 on 392 degrees of freedom
## Multiple R-squared:  0.3647, Adjusted R-squared:  0.3582 
## F-statistic: 56.26 on 4 and 392 DF,  p-value: < 2.2e-16

df_lm <- update(df_lm, .~. -S_y)
summary(df_lm)

## 
## Call:
## lm(formula = salary ~ rank + discipline + yrs.service, data = sal)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71843 -17913  -3922  14642  94776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39086.2     6058.7   6.451 3.26e-10 ***
## rank         18760.7     1841.0  10.191  < 2e-16 ***
## discipline   13553.4     2535.4   5.346 1.53e-07 ***
## yrs.service    376.1      108.3   3.473 0.000571 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24810 on 393 degrees of freedom
## Multiple R-squared:  0.3339, Adjusted R-squared:  0.3288 
## F-statistic: 65.67 on 3 and 393 DF,  p-value: < 2.2e-16

df_lm <- update(df_lm, .~. -discipline)
summary(df_lm)

## 
## Call:
## lm(formula = salary ~ rank + yrs.service, data = sal)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74877 -19554  -3799  17058 102694 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  61808.7     4465.9  13.840  < 2e-16 ***
## rank         18620.0     1904.1   9.779  < 2e-16 ***
## yrs.service    294.3      110.9   2.654  0.00829 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25670 on 394 degrees of freedom
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2818 
## F-statistic: 78.71 on 2 and 394 DF,  p-value: < 2.2e-16
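The repeated `update()` calls above can be sketched as a loop that applies a strict largest-p-value rule. `backward_eliminate` is a hypothetical helper, not part of the original analysis, and it assumes numeric predictors so that coefficient names match term names; the data are simulated.

```r
# Hypothetical helper: repeatedly drop the term with the largest p-value
# until every remaining term has a p-value below alpha.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    coefs <- summary(fit)$coefficients
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    if (nrow(coefs) == 0) break                 # nothing left to drop
    pvals <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
    worst <- names(which.max(pvals))
    if (pvals[[worst]] < alpha) break           # all terms significant
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Simulated data: y depends on x1 and x2 only; noise is unrelated to y.
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 5 + 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise, data = toy)
reduced <- backward_eliminate(full)
print(formula(reduced))
```

Base R also provides `step()`, which selects terms by AIC rather than by p-value, so it may keep or drop different terms than this loop.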

plot(df_lm$fitted.values, df_lm$residuals, xlab="Fitted Values", ylab="Residuals")
abline(h=0)

qqnorm(df_lm$residuals)
qqline(df_lm$residuals)

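The positive rank coefficient (18620) indicates that salary increases with the rank code (AssocProf = 1, AsstProf = 2, Prof = 3). In the residuals versus fitted plot, the residuals at smaller fitted values lie closer to the regression line than those at larger fitted values. Based on the multiple R-squared of 0.2855, the final model explains about 29% of the variability in salary.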
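For reference, the R-squared and adjusted R-squared reported by `summary()` can be reproduced from the residuals. This is a minimal sketch on simulated data (`toy` and its values are made up), where n is the number of observations and p the number of predictors.

```r
# Simulated data; values are made up for illustration only.
set.seed(7)
toy <- data.frame(x = 1:30)
toy$y <- 2 + 0.5 * toy$x + rnorm(30)
fit <- lm(y ~ x, data = toy)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((toy$y - mean(toy$y))^2)
r2  <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p
n <- nrow(toy)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

Applying the adjustment to the final model above (R-squared 0.2855, n = 397, p = 2) gives about 0.282, which agrees with the reported adjusted R-squared of 0.2818 up to rounding.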