I swear that I worked solely alone on this take-home exam and within the time frame that was specified.
Name of student: Andrea Valtorta
In this part of the exam you will work with data on the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members. Note that in American colleges, Assistant Professor is the lowest rank and Professor is the highest rank for university faculty. The data are in the file Salaries.txt which includes the following variables (note that the variable names are included in the data file):
| Col. | Variable name | Description |
|---|---|---|
| 1 | ID | Identification number 1–397 |
| 2 | rank | Assistant Professor (AsstProf),Associate Professor (AssocProf) or Professor (Prof) |
| 3 | discipline | A = “theoretical” department or B = “applied” department. |
| 4 | yrs.since.phd | Number of years since PhD degree was earned. |
| 5 | yrs.service | Number of years of service to the University. |
| 6 | sex | A factor with levels Female and Male. |
| 7 | salary | Academic year (nine-month) salary, in dollars. |
Create side-by-side box plots of salary for each category of the qualitative predictors Sex, Rank and Discipline. Describe what these plots reveal about salary distributions for the different sexes, ranks and disciplines.
library(tidyverse)
## -- Attaching packages ---------
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.3
## -- Conflicts ------------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some() masks car::some()
sal_data<-read.table("Salaries.txt",header = TRUE) ## The first row already contains the col names.
levels(sal_data$discipline)<-c("Theoretical", "Applied")
sal_data$rank<-factor(sal_data$rank, levels(sal_data$rank)[c(2,1,3)]) ## reorder the levels for 'rank'
sal_data %>% ggplot()+geom_boxplot(mapping=aes(x=sex,y=salary,fill=sex)) + xlab("Sex") + ylab("Salary") + ggtitle("Salary level by sex")
Box plot - Salary vs Sex: Female and Male have similar median salary (approx 100.000 dollars) and minimum salary (approx 80.000 dollars). Further to that it seems that the boxes have equal height, indicating that the variability of salary for female and male is comparable. The similarities end here, Males tend to have higher 1st and 3rd quartile. The biggest difference can be noted when comparing the max values of the salary for males and females. The maximum salary value for the male is much higher than the corresponding one for the females.Last but not least, we can see that there are few outliers for males, where the associate salary is double compared to the median.
sal_data %>% ggplot()+geom_boxplot(mapping=aes(x=rank,y=salary,fill=rank)) + xlab("Rank") + ylab("Salary") + ggtitle("Salary level by rank")
Box plot - Salary vs Rank: As expected 1st quartile,2nd quartile, median and maximum value of salary, grow as the rank increases - these values seem to increase exponentially. We can note that salary variability increases as the rank increases (the box plot for professor is bigger than the box plot for assistant professor and associate professor). It is worth noting that the minimum values for the 3 categories is very similar, actually the rank ‘Professor’ has the lowest minimum value. As for the previous boxplot we can see that there are few outliers for rank Professor. The salary distribution for the rank professor has much longer tails compared to the other two ranks.
sal_data %>% ggplot()+geom_boxplot(mapping=aes(x=discipline,y=salary,fill=discipline))+ xlab("Discipline type") + ylab("Salary") + ggtitle("Salary level by discipline")
Box plot - Salary vs Discipline: The two box plot have similar median and have comparable variability as the size of the box plot are similar. However the box plot for applied disciplines is shifted upwards: minimum, 1st quartile, 3rd quartile and maximum values are higher compared to theoretical disciplines. Both the departments have outliers. ### (b)
SvsR_tab<-table(sal_data$sex,sal_data$rank) %>% addmargins()
knitr::kable(SvsR_tab)
| AsstProf | AssocProf | Prof | Sum | |
|---|---|---|---|---|
| Female | 11 | 10 | 18 | 39 |
| Male | 56 | 54 | 248 | 358 |
| Sum | 67 | 64 | 266 | 397 |
Sex vs Rank Table: We can note that the number of female is much lower than the number of male. Female are just 10% of the total. The rank professor is by far the most frequent category. From the table we can see that most of the data refers to male professors. Another thing we can note is the fact that if we consider the total number of the associate professor, 1/6 is female. The same can be said for the total number of assistant professor. Almost 50% of the females is a professor.
SvsD_tab<-table(sal_data$sex,sal_data$discipline) %>% addmargins()
knitr::kable(SvsD_tab)
| Theoretical | Applied | Sum | |
|---|---|---|---|
| Female | 18 | 21 | 39 |
| Male | 163 | 195 | 358 |
| Sum | 181 | 216 | 397 |
Sex vs Discipline Table: Males and females are equally distributed between the two disciplines type: approximately 50% of the total number of males and 50% of the total number females are in the theoretical department. If we consider the column of the table, we can say that 10% of theoretical departments and 10% of the applied departments are composed by females.
RvsD_tab<-table(sal_data$rank,sal_data$discipline) %>% addmargins()
knitr::kable(RvsD_tab)
| Theoretical | Applied | Sum | |
|---|---|---|---|
| AsstProf | 24 | 43 | 67 |
| AssocProf | 26 | 38 | 64 |
| Prof | 131 | 135 | 266 |
| Sum | 181 | 216 | 397 |
Rank vs Discipline Table: Here we can note that professors are evenly distributed between theoretical and applied departments. This cannot be said for the other two ranks. 2/3 of the assistant professors are in the applied departments. Associate professors are distributed in a similar way: 60% of the associate professor is in the applied department. So it seems that as the rank increase the distribution/split between departments gets more and more even.
Plot the scatter plot matrix for the quantitative variables; years since PhD, years in service and salary. Describe the relationships between these three variables.
pairs(sal_data[,c(7,4,5)])
yrs.service VS yrs.since.phd: There is quite a clear linear relationship between ‘years since the phd’ and ‘years of service’. the variability is quite constant and eventually it decreases as ‘years of service’ increases. We have no cases where yrs.service is much higher than the ‘years since the phd’. Further to that with few exception, it seems that ‘yrs.since.phd’ is always higher than ‘yrs.service’.
yrs.since.phd VS salary: It seems that there is a linear relationship between these 2 variables. The variability is not constant and it considerably increases as the ‘yrs.since.phd’ increases
yrs.service VS salary: The relationship between salary and years of service is not clear. the variability is high across the range of values for yrs.service.
Create three scatter plots of salary versus years since PhD. In one plot color the points by sex, the next by rank and the last by discipline. Is the relationship between salary and years since PhD different between categories of these qualitative factors?
ggplot(sal_data,mapping = aes(x=yrs.since.phd, y=salary,color=sex)) + geom_point() + xlab("Years since PhD") + ylab("Salary")
Sex as color points: The relationship between years since PhD and salary is similar between males and females:linear relationship with increasing variance as years since PhD increases. Further to this, we can note that for the same level of X (year since PhD) the corresponding salary for females seems lower than males.
ggplot(sal_data,mapping = aes(x=yrs.since.phd, y=salary,color=rank)) + geom_point() + xlab("Years since PhD") + ylab("Salary")
Rank as color points: As we noticed from the box plot graph, the salary and the related variability increases as the rank increases (from assistant to professor). The dots are like divided into compartments/sections.
ggplot(sal_data,mapping = aes(x=yrs.since.phd, y=salary,color=discipline)) + geom_point() + xlab("Years since PhD") + ylab("Salary")
Discipline as color points: The relationship between years since PhD and salary is similar between theoretical and applied departments,i.e. linear relationship with increasing variance as years since PhD increases. For the same level of X (years since PhD) it seems that applied departments have higher salary. Further to this, we can say that the salary variability for theoretical departments increase slightly more as Years since PhD increase compared to applied departments.
Here you will consider only the qualitative variables: sex,rank, and discipline. For each of these three qualitative variables, fit a one-way ANOVA model (salary is the response) and give the one-way ANOVA table. Is salary significantly different between categories? In each case, use alpha= 0.01 and give the p-values.
Factor = sex
alpha<-0.01
sal_sex_fit<-lm(salary~sex, data=sal_data, x=T)
c(sal_sex_fit$coefficients[1],sal_sex_fit$coefficients[2]+sal_sex_fit$coefficients[1])
## (Intercept) sexMale
## 101002.4 115090.4
anova(sal_sex_fit) ##one-way ANOVA table
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 6.9800e+09 6980014930 7.7377 0.005667 **
## Residuals 395 3.5632e+11 902077538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(sal_sex_fit)["sex","Pr(>F)"] ## p-value
## [1] 0.005667107
anova(sal_sex_fit)["sex","Pr(>F)"]<alpha
## [1] TRUE
Salary mean for males (115.090 dollars ) is almost 15% higher than salary means for females (101.002 dollars). The p-values is lower than the given alpha value (0.01). We reject null hypothesis and conclude that the treatment effect for female and male are not equal - the salary means for female and male are different.
Factor = rank
sal_rank_fit<-lm(salary~rank, data = sal_data, x=T)
c(sal_rank_fit$coefficients[1],sal_rank_fit$coefficients[2:3]+sal_rank_fit$coefficients[1])
## (Intercept) rankAssocProf rankProf
## 80775.99 93876.44 126772.11
anova(sal_rank_fit) ##one-way ANOVA table
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 2 1.4323e+11 7.1616e+10 128.22 < 2.2e-16 ***
## Residuals 394 2.2007e+11 5.5855e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(sal_rank_fit)["rank","Pr(>F)"] ## p-value
## [1] 1.293048e-43
anova(sal_rank_fit)["rank","Pr(>F)"]<alpha
## [1] TRUE
The salary mean increases as the rank increases (rank-wise: Assistant Prof < Associate Prof < Prof) The p-value is close to 0, therefore we reject the null hypothesis and conclude that the treatment effects are not equal and the salary means for assistant prof, associate prof and prof are different.
Factor = discipline
sal_disc_fit<-lm(salary~ discipline, data = sal_data, x=T)
c(sal_disc_fit$coefficients[1],sal_disc_fit$coefficients[2]+sal_disc_fit$coefficients[1])
## (Intercept) disciplineApplied
## 108548.4 118028.7
anova(sal_disc_fit) ##one-way ANOVA table
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## discipline 1 8.8508e+09 8850802234 9.8634 0.001813 **
## Residuals 395 3.5445e+11 897341368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(sal_disc_fit)["discipline","Pr(>F)"] ## p-value
## [1] 0.001812916
anova(sal_disc_fit)["discipline","Pr(>F)"]<alpha
## [1] TRUE
Salary mean for applied departments is higher than the theoretical one. The p-value is lower than the given alpha, therefore we conclude that the treatment effects are not equal and the means associated to applied departments and theoretical department are different.
Create an interaction plot for each pair of the factors (three plots). Does there appear to be an interaction effect between any of the factor pairs? Explain.
Sex vs Rank
n_Sex<-levels(sal_data$sex) %>% length()
n_Rank<-levels(sal_data$rank) %>% length()
n_Disc<-levels(sal_data$discipline) %>% length()
tr_means_SR <- matrix(NA, nrow=n_Sex, ncol=n_Rank)
for(i in 1:n_Sex){
for(j in 1:n_Rank){
tr_means_SR[i,j] <- mean(sal_data[sal_data$sex == levels(sal_data$sex)[i] & sal_data$rank == levels(sal_data$rank)[j], ]$salary)
}
}
rownames(tr_means_SR)<-c("Female","Male")
colnames(tr_means_SR)<-c("AsstProf", "AssocProf", "Prof")
tr_means_SR
## AsstProf AssocProf Prof
## Female 78049.91 88512.8 121967.6
## Male 81311.46 94869.7 127120.8
par(mfrow=c(1,2), mar=c(3,3,2,1), mgp=c(2,0.8,0))
matplot(tr_means_SR, type = 'l', xlab="Sex", ylab="Salary",col=1:3, lty=1, xaxt="n")
legend("right", legend = levels(sal_data$rank), col=1:3, lty=1,cex=0.75)
axis(1,at=c(1,2),labels = levels(sal_data$sex))
matplot(t(tr_means_SR), type = 'l', xlab = "Rank", ylab="Salary", col=1:2, lty=1, xaxt="n")
legend("right", legend = levels(sal_data$sex), col=1:2, lty=1,cex=0.75)
axis(1,at=c(1,2,3),labels = levels(sal_data$rank))
The lines are approximately parallel, therefore there is no interaction effect between factors Sex and Ranks. The “effect” of rank is the same males and females. Male category has always the highest salary mean across the rank levels.
Sex vs Discipline
tr_means_SD <- matrix(NA, nrow=n_Sex, ncol=n_Disc)
for(i in 1:n_Sex){
for(j in 1:n_Disc){
tr_means_SD[i,j] <- mean(sal_data[sal_data$sex == levels(sal_data$sex)[i] & sal_data$discipline == levels(sal_data$discipline)[j], ]$salary)
}
}
rownames(tr_means_SD)<-c("Female","Male")
colnames(tr_means_SD)<-c("Theoretical", "Applied")
tr_means_SD
## Theoretical Applied
## Female 89064.94 111234.5
## Male 110699.98 118760.4
matplot(t(tr_means_SD), type = 'l', xlab = "Discipline", ylab="Salary", col=1:2, lty=1, xaxt="n")
legend("right", legend = levels(sal_data$sex), col=1:2, lty=1,cex=0.75)
axis(1,at=c(1,2),labels = levels(sal_data$discipline))
The lines are not parallel, there is an interaction effect between the factors Sex and Discipline type. The difference in salary (mean) between Theoretical and Applied department is clearer for Female than for Male. The effect of Discipline type is different for female and male. That being said males have always higher salary mean regardless of the level for discipline type. The interaction effect does not seem to be strong.
Discipline vs Rank
tr_means_DR <- matrix(NA, nrow=n_Disc, ncol=n_Rank)
for(i in 1:n_Disc){
for(j in 1:n_Rank){
tr_means_DR[i,j] <- mean(sal_data[sal_data$discipline == levels(sal_data$discipline)[i] & sal_data$rank == levels(sal_data$rank)[j], ]$salary)
}
}
rownames(tr_means_DR)<-c("Theoretical", "Applied")
colnames(tr_means_DR)<-c("AsstProf", "AssocProf", "Prof")
tr_means_DR
## AsstProf AssocProf Prof
## Theoretical 73935.54 83061.12 119948.3
## Applied 84593.91 101276.39 133393.8
par(mfrow=c(1,2), mar=c(3,3,2,1), mgp=c(2,0.8,0))
matplot(tr_means_DR, type = 'l', xlab="Discipline", ylab="Salary",col=1:3, lty=1, xaxt="n")
legend("topleft", legend = levels(sal_data$rank), col=1:3, lty=1,cex=0.75)
axis(1,at=c(1,2),labels = levels(sal_data$discipline))
matplot(t(tr_means_DR), type = 'l', xlab = "Rank", ylab="Salary", col=1:2, lty=1, xaxt="n")
legend("topleft", legend = levels(sal_data$discipline), col=1:2, lty=1,cex=0.75)
axis(1,at=c(1:3),labels = levels(sal_data$rank))
The lines are almost parallel, therefore we can say that there is no interaction effect between the factors Discipline and Rank. The “effect” of rank is the same for theoretical and applied department.
Fit a two-way ANOVA model for salary, with main effects rank and sex as well as the interaction between them, and give the two-way ANOVA table. Is the interaction effect significant? Use alpha = 0.01 and give the p-value.
sal_sexrank_fit<-lm(salary ~ sex + rank, data = sal_data, x=T)
sal_sexrank_fit_Int<-lm(salary ~ sex + rank + sex:rank, data = sal_data, x=T)
anova(sal_sexrank_fit_Int) ##two-way ANOVA table
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 6.9800e+09 6.9800e+09 12.4515 0.0004673 ***
## rank 2 1.3709e+11 6.8546e+10 122.2787 < 2.2e-16 ***
## sex:rank 2 4.3603e+07 2.1802e+07 0.0389 0.9618588
## Residuals 391 2.1918e+11 5.6057e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(sal_sexrank_fit_Int)["sex:rank","Pr(>F)"] ## p-value
## [1] 0.9618588
anova(sal_sexrank_fit_Int)["sex:rank","Pr(>F)"]<alpha
## [1] FALSE
The p-value related to the interaction term between sex and rank is very high, we conclude that there is no significant interaction effect between the factors sex and rank. This is in line with the analysis done in point (a).
Fit a linear regression model for salary using all the predictors, both quantitative and qualitative, and show the summary table for the fit. Are all predictors significant? Again, use alpha= 0.01.
sal_all_fit<-lm(salary ~.,data= sal_data[,-1], x=T)
summary(sal_all_fit)
##
## Call:
## lm(formula = salary ~ ., data = sal_data[, -1], x = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65248 -13211 -1775 10384 99592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65955.2 4588.6 14.374 < 2e-16 ***
## rankAssocProf 12907.6 4145.3 3.114 0.00198 **
## rankProf 45066.0 4237.5 10.635 < 2e-16 ***
## disciplineApplied 14417.6 2342.9 6.154 1.88e-09 ***
## yrs.since.phd 535.1 241.0 2.220 0.02698 *
## yrs.service -489.5 211.9 -2.310 0.02143 *
## sexMale 4783.5 3858.7 1.240 0.21584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared: 0.4547, Adjusted R-squared: 0.4463
## F-statistic: 54.2 on 6 and 390 DF, p-value: < 2.2e-16
Not all the predictors are significant, according to the summary table we can say that sex is not a significant predictor given that all the other predictors are included in the model. The p-value for sex is way higher than alpha=0.01. This is confirmed by the Anova table. From the summary table we can see that also yrs.since.phd and yrs.service have p-values higher than alpha.However we would need a generate new models in order to exclude them.
We get mixed results from the analysis performed. According to the summary table (t-test) it seems that sex is not significant (given that the other predictors have been taken into account). While according to the one-way Anova analysis there is a difference in salary between male and female faculty members. The number of female faculty members is much lower compared to the number of male faculty members. For this reason, the sampling variance of salary for female and male can considerably differ and the power of the test is not “maximized”. I would suggest to split the data in such a way that we have a comparable number of male and female faculty members. Further to this, we would need to check the presence of influential outliers and that the assumption underlying the model are verified: Residuals are normally distributed. Residuals have the same variance for all factor levels.