1.1 Provide a brief description of the project

- What is the study design?

A cross-sectional investigation of 397 professors was conducted to determine whether professors’ salaries differed among their ranks.

- What is the null hypothesis?

Salaries did not differ among professors’ ranks.

- What is the alternative hypothesis?

Salaries differed among professors’ ranks.

1.2 Import the data set “Professorial Salaries.csv” into R

f = file.choose()
salary = read.csv(f)

1.3 Describe characteristics of the study sample by professors’ ranks

Fill in the following table and write a sentence to describe the study sample.

Table 1.3 Characteristics of the study sample

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ Sex + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary | Rank, data = salary)
| Characteristic | AssocProf (N=64) | AsstProf (N=67) | Prof (N=266) | Overall (N=397) |
|---|---|---|---|---|
| Sex: Female | 10 (15.6%) | 11 (16.4%) | 18 (6.8%) | 39 (9.8%) |
| Sex: Male | 54 (84.4%) | 56 (83.6%) | 248 (93.2%) | 358 (90.2%) |
| Discipline: A | 26 (40.6%) | 24 (35.8%) | 131 (49.2%) | 181 (45.6%) |
| Discipline: B | 38 (59.4%) | 43 (64.2%) | 135 (50.8%) | 216 (54.4%) |
| Yrs.since.phd: Mean (SD) | 15.5 (9.65) | 5.10 (2.54) | 28.3 (10.1) | 22.3 (12.9) |
| Yrs.since.phd: Median [Min, Max] | 12.0 [6.00, 49.0] | 4.00 [1.00, 11.0] | 28.0 [11.0, 56.0] | 21.0 [1.00, 56.0] |
| Yrs.service: Mean (SD) | 12.0 (10.1) | 2.37 (1.50) | 22.8 (11.6) | 17.6 (13.0) |
| Yrs.service: Median [Min, Max] | 8.00 [1.00, 53.0] | 3.00 [0, 6.00] | 21.0 [0, 60.0] | 16.0 [0, 60.0] |
| NPubs: Mean (SD) | 18.7 (13.5) | 15.9 (11.2) | 18.6 (14.7) | 18.2 (14.0) |
| NPubs: Median [Min, Max] | 16.0 [1.00, 50.0] | 12.0 [2.00, 50.0] | 13.0 [1.00, 69.0] | 13.0 [1.00, 69.0] |
| Ncits: Mean (SD) | 42.0 (15.6) | 36.3 (16.7) | 40.8 (17.2) | 40.2 (16.9) |
| Ncits: Median [Min, Max] | 36.5 [14.0, 83.0] | 34.0 [1.00, 83.0] | 35.0 [1.00, 90.0] | 35.0 [1.00, 90.0] |
| Salary: Mean (SD) | 93900 (13800) | 80800 (8170) | 127000 (27700) | 114000 (30300) |
| Salary: Median [Min, Max] | 95600 [62900, 126000] | 79800 [63100, 97000] | 123000 [57800, 232000] | 107000 [57800, 232000] |

1.4 Develop a graph to check the distribution of professors’ salaries. Write a sentence to describe the graph (i.e., are professors’ salaries normally distributed?)

library(ggplot2)
# Histogram on the density scale with a kernel density curve overlaid
p = ggplot(data = salary, aes(x = Salary))
# after_stat(density) replaces the deprecated ..density.. notation; bins is set explicitly
p1 = p + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue")
p2 = p1 + geom_density(col = "red")
p2 + ggtitle("Distribution of professors' salaries") + theme_bw()

1.5 Develop a box plot to describe the differences in salaries among professors’ ranks. Write a sentence to describe the graph.

p = ggplot(data = salary, aes(x = Rank,  y = Salary, fill = Rank, col = Rank))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05) 
p1 + labs(x = "Rank", y = "Salaries (USD)") + ggtitle("Professors' salaries by rank") + theme_bw()

1.6 Determine whether salaries significantly differed among professors’ ranks. Interpret the finding:

- Is there evidence that salaries significantly differed among professors’ ranks?

- What is the next step?

sal.rank = aov(Salary ~ Rank, data = salary)
summary(sal.rank)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Rank          2 1.432e+11 7.162e+10   128.2 <2e-16 ***
## Residuals   394 2.201e+11 5.586e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
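
The F value in the ANOVA table can be reproduced by hand from the printed sums of squares and degrees of freedom, which is a quick way to see what the table reports. The sketch below uses only the rounded figures from the output above, so small rounding differences are expected. The variable names are illustrative, and eta squared is an extra descriptive effect-size measure that the original analysis does not report.

```r
# Figures copied (rounded) from summary(sal.rank)
ss_rank  <- 1.432e11   # between-rank sum of squares
ss_resid <- 2.201e11   # residual sum of squares
df_rank  <- 2
df_resid <- 394

# Mean squares are sums of squares divided by their degrees of freedom
ms_rank  <- ss_rank / df_rank
ms_resid <- ss_resid / df_resid

# F is the ratio of the two mean squares; the p-value is the upper tail of F(2, 394)
f_value <- ms_rank / ms_resid
p_value <- pf(f_value, df_rank, df_resid, lower.tail = FALSE)

# Eta squared: share of total salary variation associated with rank (not in the original)
eta_sq <- ss_rank / (ss_rank + ss_resid)

round(f_value, 1)   # 128.2, matching the table
p_value < 2e-16     # TRUE
round(eta_sq, 2)    # 0.39
```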

1.7 Check the assumptions (i.e., are the assumptions met?).

par(mfrow=c(2,2))
plot(sal.rank)
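
The diagnostic plots can be supplemented with formal checks. The following is a minimal sketch and not part of the original analysis: it runs on simulated data whose group sizes, means, and standard deviations are taken from Table 1.3. The `sim` data frame is a hypothetical stand-in, so substitute `salary` to check the real data. The standard deviations in Table 1.3 differ considerably across ranks (8170 for AsstProf versus 27700 for Prof), so Welch's one-way test, which does not assume equal variances, is shown as one possible alternative.

```r
set.seed(1)

# Simulated stand-in for the salary data (assumption: normal within each rank)
sim <- data.frame(
  Rank = factor(rep(c("AssocProf", "AsstProf", "Prof"), times = c(64, 67, 266))),
  Salary = c(rnorm(64, mean = 93900, sd = 13800),
             rnorm(67, mean = 80800, sd = 8170),
             rnorm(266, mean = 127000, sd = 27700))
)

fit <- aov(Salary ~ Rank, data = sim)

# Normality of residuals (Shapiro-Wilk)
shap <- shapiro.test(residuals(fit))

# Homogeneity of variances across ranks (Bartlett's test)
bart <- bartlett.test(Salary ~ Rank, data = sim)

# Welch's one-way test, which does not assume equal variances
welch <- oneway.test(Salary ~ Rank, data = sim, var.equal = FALSE)

shap$p.value
bart$p.value
welch$p.value
```

Bartlett's test is sensitive to non-normality, so on the real data its result is best read alongside the residual plots rather than on its own.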

1.8 Conduct post-hoc analyses to determine which particular groups significantly differ from each other. Interpret the findings.

tukey.sal.rank = TukeyHSD(sal.rank)
tukey.sal.rank
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Salary ~ Rank, data = salary)
## 
## $Rank
##                         diff       lwr       upr     p adj
## AsstProf-AssocProf -13100.45 -22818.71 -3382.195 0.0046514
## Prof-AssocProf      32895.67  25154.51 40636.836 0.0000000
## Prof-AsstProf       45996.12  38395.94 53596.307 0.0000000
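
To see where the Tukey intervals come from, the AsstProf-AssocProf interval can be approximated by hand with the Tukey-Kramer formula for unequal group sizes. This is a sketch built only from figures printed above (group sizes from Table 1.3, the residual sum of squares from the ANOVA table), so the result should agree with the `TukeyHSD` output only up to rounding. The variable names are illustrative.

```r
# Figures taken from the outputs above
ms_resid <- 2.201e11 / 394   # residual mean square (rounded sum of squares / df)
df_resid <- 394
k        <- 3                # number of ranks being compared
n_assoc  <- 64
n_asst   <- 67
diff_est <- -13100.45        # AsstProf - AssocProf, as reported by TukeyHSD

# Standard error of the difference between two group means
se_diff <- sqrt(ms_resid * (1 / n_assoc + 1 / n_asst))

# Critical value from the studentized range distribution
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# Tukey-Kramer half-width of the 95% family-wise interval
half_width <- q_crit / sqrt(2) * se_diff

c(lwr = diff_est - half_width, upr = diff_est + half_width)
```

The interval excludes zero, which is consistent with the adjusted p-value of 0.0047 for this comparison.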

2.1 Provide a brief description of the project

- What is the study design?

A cross-sectional investigation of 21 female professors in the Applied discipline was conducted to determine whether the number of publications differed among their ranks.

- What is the null hypothesis?

- What is the alternative hypothesis?

2.2 Select a subgroup of female professors in the Applied discipline

female.B = subset(salary, Sex == "Female" & Discipline == "B")
dim(female.B)
## [1] 21  9

2.3 Describe characteristics of the study sample by professors’ ranks

library(table1)
table1(~ Sex + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary | Rank, data = female.B)
| Characteristic | AssocProf (N=6) | AsstProf (N=5) | Prof (N=10) | Overall (N=21) |
|---|---|---|---|---|
| Sex: Female | 6 (100%) | 5 (100%) | 10 (100%) | 21 (100%) |
| Discipline: B | 6 (100%) | 5 (100%) | 10 (100%) | 21 (100%) |
| Yrs.since.phd: Mean (SD) | 13.5 (2.88) | 6.60 (3.65) | 21.5 (5.95) | 15.7 (7.72) |
| Yrs.since.phd: Median [Min, Max] | 12.5 [11.0, 19.0] | 5.00 [3.00, 11.0] | 19.0 [17.0, 36.0] | 17.0 [3.00, 36.0] |
| Yrs.service: Mean (SD) | 8.83 (1.94) | 2.60 (1.82) | 17.9 (4.77) | 11.7 (7.36) |
| Yrs.service: Median [Min, Max] | 9.50 [6.00, 11.0] | 3.00 [0, 5.00] | 17.5 [10.0, 26.0] | 10.0 [0, 26.0] |
| NPubs: Mean (SD) | 16.3 (14.3) | 18.8 (18.3) | 21.3 (14.2) | 19.3 (14.6) |
| NPubs: Median [Min, Max] | 14.5 [1.00, 38.0] | 11.0 [3.00, 50.0] | 20.0 [6.00, 50.0] | 18.0 [1.00, 50.0] |
| Ncits: Mean (SD) | 45.8 (12.9) | 46.4 (24.1) | 38.0 (15.7) | 42.2 (16.9) |
| Ncits: Median [Min, Max] | 49.0 [26.0, 60.0] | 49.0 [14.0, 69.0] | 36.5 [18.0, 60.0] | 48.0 [14.0, 69.0] |
| Salary: Mean (SD) | 99400 (14100) | 84200 (9790) | 132000 (17500) | 111000 (25400) |
| Salary: Median [Min, Max] | 104000 [71100, 110000] | 80200 [74700, 97000] | 128000 [105000, 161000] | 105000 [71100, 161000] |

2.4 Develop a graph to check the distribution of the number of publications. Write a sentence to describe the graph (i.e., is the number of publications normally distributed?)

library(ggplot2)
# Histogram of publication counts on the density scale with a density curve overlaid
p = ggplot(data = female.B, aes(x = NPubs))
# after_stat(density) replaces the deprecated ..density.. notation; bins is set explicitly
p1 = p + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue")
p2 = p1 + geom_density(col = "red")
p2 + ggtitle("Distribution of the number of publications") + theme_bw()

2.5 Conduct a non-parametric test to determine whether the number of publications differed among professors’ ranks. Interpret the findings (i.e., is there evidence that the number of publications significantly differed among professors’ ranks?)

kruskal.test(NPubs ~ Rank, data = female.B)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  NPubs by Rank
## Kruskal-Wallis chi-squared = 0.81015, df = 2, p-value = 0.6669
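
The reported p-value is the upper-tail probability of a chi-squared distribution with 2 degrees of freedom, evaluated at the Kruskal-Wallis statistic. The short sketch below reproduces it from the printed statistic alone; the variable names are illustrative. With groups as small as 5, 6, and 10, the chi-squared approximation is only rough, which is one reason to cross-check the result with a resampling method.

```r
# Statistic and degrees of freedom copied from the kruskal.test output
h_stat <- 0.81015
df     <- 2

# Upper-tail chi-squared probability
p_value <- pchisq(h_stat, df = df, lower.tail = FALSE)

round(p_value, 4)   # 0.6669, matching the output
```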

2.6 Carry out a bootstrap to determine whether the mean number of publications differed among professors’ ranks. Interpret the findings (i.e., is there evidence that the number of publications differed among professors’ ranks?)

library(lmboot)
boot = ANOVA.boot(NPubs ~ Rank, B = 1000, seed = 1234, data = female.B)
boot$'p-values'
## [1] 0.821
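
To illustrate the general idea behind a bootstrap ANOVA, the following base-R sketch resamples residuals under the null hypothesis of no rank effect and compares the observed F statistic with its bootstrap distribution. This is an assumption-laden illustration, not a description of what `ANOVA.boot` does internally, and it runs on simulated counts: the `toy` data frame and the helper `f_stat` are hypothetical, with only the group sizes (6, 5, 10) taken from the table in 2.3. Its p-value will therefore not match the 0.821 reported above.

```r
set.seed(1234)

# Simulated stand-in for female.B (assumption: overdispersed counts, no true rank effect)
toy <- data.frame(
  Rank  = factor(rep(c("AssocProf", "AsstProf", "Prof"), times = c(6, 5, 10))),
  NPubs = rnbinom(21, mu = 19, size = 2)
)

# Hypothetical helper: one-way ANOVA F statistic for response y and grouping g
f_stat <- function(y, g) anova(lm(y ~ g))[1, "F value"]

f_obs <- f_stat(toy$NPubs, toy$Rank)

# Null model: a single common mean for all ranks
null_fit <- lm(NPubs ~ 1, data = toy)

# Residual bootstrap under the null: rebuild responses from resampled residuals
B <- 1000
f_boot <- replicate(B, {
  y_star <- fitted(null_fit) + sample(residuals(null_fit), replace = TRUE)
  f_stat(y_star, toy$Rank)
})

# Bootstrap p-value: share of bootstrap F statistics at least as large as the observed one
p_boot <- mean(f_boot >= f_obs)
p_boot
```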