Consider the following data set Salaries
library(car); names(Salaries); dim(Salaries)
The data set Salaries reports the salaries of a sample of university professors, collected during the 2008-2009 academic year in the USA. In addition to the posted salaries in US dollars, the data includes the following 5 additional variables: sex, years since the Ph.D., years of service, discipline (theoretical (1) or applied (2)), and academic rank.
Imagine that you have been consulted on whether the data gives evidence of gender discrimination in salary. Write a brief report on this issue, confirming or rejecting the hypothesis that there is no discrimination by gender. Use the statistical tools you find most convenient to help build your report. Give two summary reports: one addressed to a non-expert in statistics; the other addressed to a statistician. Do all your analysis, and if possible, your report, in R. Feel free to design your story in the way you find most accurate and convincing.
The question is whether there is gender discrimination or not. We analyse the relationship between SALARY and the variables SEX, YEARSPHD, YEARSofSERVICE, RANK and DISCIPLINE.
library(car)
names(Salaries)
[1] "rank" "discipline" "yrs.since.phd" "yrs.service"
[5] "sex" "salary"
dim(Salaries)
[1] 397 6
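# Factors are stored as integer codes, which is how they enter the models below
SEX <- as.integer(Salaries$sex)               # 1 = Female, 2 = Male
YEARSPHD <- Salaries$yrs.since.phd
YEARSofSERVICE <- Salaries$yrs.service
RANK <- as.integer(Salaries$rank)             # 1 = AsstProf, 2 = AssocProf, 3 = Prof
DISCIPLINE <- as.integer(Salaries$discipline) # 1 = A (theoretical), 2 = B (applied)
SALARY <- Salaries$salary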
To make the work clearer and easier to follow, we copied each column of the data set into its own object. The factors sex, rank and discipline enter the analysis as integer codes.
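Since the interpretation of every coefficient depends on how the factors are coded, it may help to check the coding explicitly. The following is a minimal sketch, assuming the carData package (installed together with car) is available; the object names sex_codes and coding_check are ours and purely illustrative.

```r
# Load the data set from carData, the package that car takes it from
data("Salaries", package = "carData")

# Level order of each factor; the integer code of a level is its position here
print(levels(Salaries$sex))
print(levels(Salaries$rank))
print(levels(Salaries$discipline))

# Cross-tabulate the integer codes against the labels to confirm the mapping
sex_codes <- as.integer(Salaries$sex)
coding_check <- table(code = sex_codes, label = Salaries$sex)
print(coding_check)
```

Each label should appear in exactly one row of the cross-tabulation, which shows directly which number stands for which sex.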
boxplot(SALARY ~ SEX)
As a first piece of analysis we plotted salary against sex. Because SEX is a dummy variable (coded 1 for female and 2 for male), a scatter plot does not make much sense; running one anyway (not reported here because it is not relevant) immediately shows that the numbers of males and females in the sample differ, which already gives a hint about the reliability of the boxplot.
In the boxplot we notice that the median salary is lower for females (1) and that their salaries are spread over lower values. Still, we cannot assert that this difference is significant; to do so we have to run simple and multiple regressions and test these variables statistically.
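The group sizes and the raw difference seen in the boxplot can also be quantified directly. The sketch below is one possible way to do it, not part of the original analysis: it counts the observations per sex and runs a Welch two-sample t-test on the unadjusted salaries. It assumes the carData package is installed; the names group_sizes, median_by_sex and raw_test are illustrative.

```r
# Load the data set from carData
data("Salaries", package = "carData")

# Number of observations per sex, to see how unbalanced the sample is
group_sizes <- table(Salaries$sex)
print(group_sizes)

# Median salary per sex, the quantity displayed by the boxplot
median_by_sex <- tapply(Salaries$salary, Salaries$sex, median)
print(median_by_sex)

# Welch t-test of the raw difference in mean salary (no other variables controlled)
raw_test <- t.test(salary ~ sex, data = Salaries)
print(raw_test)
```

This test only describes the unadjusted gap; it says nothing about whether the gap remains once rank, discipline and experience are taken into account.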
plot(SALARY ~ YEARSofSERVICE)
# Overlay the least-squares line; col belongs to abline(), and no data argument is needed
abline(lm(SALARY ~ YEARSofSERVICE), col = "red")
Here we plotted SALARY against YEARSofSERVICE to check whether other factors have a more significant influence on an individual's salary.
We run a regression to test its significance.
ModelSALARYvsYEARSofSERVICE <- lm(formula = SALARY ~ YEARSofSERVICE)
summary(ModelSALARYvsYEARSofSERVICE)
Call:
lm(formula = SALARY ~ YEARSofSERVICE)
Residuals:
   Min     1Q Median     3Q    Max
-81933 -20511  -3776  16417 101947
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     99974.7     2416.6   41.37  < 2e-16 ***
YEARSofSERVICE    779.6      110.4    7.06 7.53e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28580 on 395 degrees of freedom
Multiple R-squared: 0.1121, Adjusted R-squared: 0.1098
F-statistic: 49.85 on 1 and 395 DF, p-value: 7.529e-12
This is a good reason to question whether salary is really driven by sex: years of service is a statistically significant predictor of salary (as indicated by the low p-value, 7.53e-12), although on its own it explains only about 11% of the variance (R-squared 0.1121).
To investigate this doubt further, we run a multiple regression with all our independent variables and look at the results to see how much influence SEX has in the whole equation.
ModelTOTAL <- lm(formula = SALARY ~ SEX + YEARSofSERVICE + YEARSPHD + RANK + DISCIPLINE)
summary(ModelTOTAL)
Call:
lm(formula = SALARY ~ SEX + YEARSofSERVICE + YEARSPHD + RANK +
DISCIPLINE)
Residuals:
   Min     1Q Median     3Q    Max
-64552 -13795  -2426  10809 100435
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     18512.7     8841.0   2.094   0.0369 *
SEX              5323.4     3893.3   1.367   0.1723
YEARSofSERVICE   -513.9      213.9  -2.402   0.0167 *
YEARSPHD          574.8      243.1   2.365   0.0185 *
RANK            23611.3     2108.9  11.196  < 2e-16 ***
DISCIPLINE      14402.8     2366.5   6.086 2.76e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22770 on 391 degrees of freedom
Multiple R-squared: 0.4422, Adjusted R-squared: 0.4351
F-statistic: 61.99 on 5 and 391 DF, p-value: < 2.2e-16
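These results confirm the doubts we had about the reliability of a gender gap in our sample. The estimated SEX coefficient is positive (about 5,323 US dollars in favour of males, with the other variables held fixed), but from a purely statistical point of view it is not significant (p = 0.1723), so we cannot reject the hypothesis that sex has no effect on salary.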
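As a robustness check, the same regression can be fitted with sex, rank and discipline kept as factors, so that rank is not forced onto an equally spaced numeric scale. This is a sketch of an alternative specification, not the model reported above, and its estimates will differ from that table. It assumes the carData package is installed; model_factors and model_no_sex are illustrative names.

```r
# Load the data set from carData
data("Salaries", package = "carData")

# Same predictors as the full model, but with the categorical variables as factors
model_factors <- lm(salary ~ sex + yrs.service + yrs.since.phd + rank + discipline,
                    data = Salaries)
print(summary(model_factors))

# 95% confidence interval for the Male vs Female difference, other variables fixed
print(confint(model_factors, "sexMale"))

# Partial F-test: does adding sex improve on the model without it?
model_no_sex <- update(model_factors, . ~ . - sex)
print(anova(model_no_sex, model_factors))
```

If the confidence interval for sexMale contains zero, this specification leads to the same reading as the numeric coding: the data do not allow us to reject the hypothesis of no gender effect.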
On the other hand, we noticed that two independent variables, YEARSPHD and YEARSofSERVICE, are significant but less strongly so than the others. Reasoning on that, we came to the conclusion that they could overlap, since most individuals begin to work soon after the Ph.D. This might cause a problem of multicollinearity in the regression; the sign of the YEARSofSERVICE coefficient, which is positive (779.6) in the simple regression and negative (-513.9) here, points in the same direction. We nevertheless keep both variables, since dropping one of them would not necessarily solve the issue. Either way, we can still conclude that SEX is not a statistically reliable predictor of the level of salaries.
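The suspected overlap between the two experience variables can be checked numerically. The sketch below computes their correlation and a variance inflation factor for years of service by hand, as 1 / (1 - R-squared) from regressing that variable on the other predictors. It was not part of the original analysis, it uses the factor coding of the categorical variables, and it assumes the carData package is installed; r_phd_service, aux_model and vif_service are illustrative names.

```r
# Load the data set from carData
data("Salaries", package = "carData")

# Correlation between years since the Ph.D. and years of service
r_phd_service <- cor(Salaries$yrs.since.phd, Salaries$yrs.service)
print(r_phd_service)

# Auxiliary regression of years of service on all the other predictors
aux_model <- lm(yrs.service ~ yrs.since.phd + sex + rank + discipline,
                data = Salaries)

# Variance inflation factor: how much the variance of the yrs.service
# coefficient is inflated by its linear dependence on the other predictors
vif_service <- 1 / (1 - summary(aux_model)$r.squared)
print(vif_service)
```

A correlation close to 1 or a large inflation factor would support the multicollinearity explanation. The vif() function in the car package reports comparable diagnostics for all predictors at once.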