1.1 Provide a brief description of the study in Task 1

- What is the study design?

A cross-sectional study of 397 professors in the US to determine whether salaries differed between male and female professors.

- What is the null hypothesis?

There is no difference in mean salary between male and female professors.

- What is the alternative hypothesis?

Mean salary differs between male and female professors.

1.2 Import the data set “Professorial Salaries.csv” into R

f = file.choose()
salary = read.csv(f)

1.3 Describe characteristics of the study sample by sex.

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary | Sex, data = salary)
| | Female (N=39) | Male (N=358) | Overall (N=397) |
|---|---|---|---|
| Rank | | | |
| AssocProf | 10 (25.6%) | 54 (15.1%) | 64 (16.1%) |
| AsstProf | 11 (28.2%) | 56 (15.6%) | 67 (16.9%) |
| Prof | 18 (46.2%) | 248 (69.3%) | 266 (67.0%) |
| Discipline | | | |
| A | 18 (46.2%) | 163 (45.5%) | 181 (45.6%) |
| B | 21 (53.8%) | 195 (54.5%) | 216 (54.4%) |
| Yrs.since.phd | | | |
| Mean (SD) | 16.5 (9.78) | 22.9 (13.0) | 22.3 (12.9) |
| Median [Min, Max] | 17.0 [2.00, 39.0] | 22.0 [1.00, 56.0] | 21.0 [1.00, 56.0] |
| Yrs.service | | | |
| Mean (SD) | 11.6 (8.81) | 18.3 (13.2) | 17.6 (13.0) |
| Median [Min, Max] | 10.0 [0, 36.0] | 18.0 [0, 60.0] | 16.0 [0, 60.0] |
| NPubs | | | |
| Mean (SD) | 20.2 (14.4) | 17.9 (13.9) | 18.2 (14.0) |
| Median [Min, Max] | 18.0 [1.00, 50.0] | 13.0 [1.00, 69.0] | 13.0 [1.00, 69.0] |
| Ncits | | | |
| Mean (SD) | 40.7 (16.2) | 40.2 (17.0) | 40.2 (16.9) |
| Median [Min, Max] | 36.0 [14.0, 70.0] | 35.0 [1.00, 90.0] | 35.0 [1.00, 90.0] |
| Salary | | | |
| Mean (SD) | 101000 (26000) | 115000 (30400) | 114000 (30300) |
| Median [Min, Max] | 104000 [62900, 161000] | 108000 [57800, 232000] | 107000 [57800, 232000] |

1.4 Develop a graph to check the distribution of professors’ salaries.

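The salary distribution is slightly skewed to the right (the overall mean of 114000 exceeds the median of 107000, and the upper tail extends to 232000), but it is reasonably close to a normal distribution.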

library(ggplot2)
# Histogram on the density scale with a kernel density curve overlaid
p = ggplot(data = salary, aes(x = Salary))
# after_stat(density) replaces the deprecated ..density.. notation; bins = 30 is the ggplot2 default made explicit
p1 = p + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue")
p2 = p1 + geom_density(col = "red")
p2 + ggtitle("Distribution of professors' salaries") + theme_bw()
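The visual impression of skew can also be checked numerically. The sketch below defines a small moment-based skewness helper in base R; the function name `skewness` and the toy vectors are illustrative assumptions, not part of the original analysis. A positive value indicates a longer right tail, a negative value a longer left tail.

```r
# Moment-based sample skewness (hypothetical helper, base R only).
# Positive => right-skewed, negative => left-skewed, 0 => symmetric.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy vectors (made up for illustration, not the salary data)
symmetric_toy <- c(1, 2, 3)
right_skewed_toy <- c(1, 1, 1, 10)

skewness(symmetric_toy)     # symmetric sample
skewness(right_skewed_toy)  # long upper tail
```

With the real data loaded, the same helper could be applied as `skewness(salary$Salary)` and compared with the sign suggested by the histogram.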

1.5 Determine whether salaries significantly differed between male and female professors:

# Boxplot of salary by sex (the comparison of interest in 1.5), with jittered raw points
p = ggplot(data = salary, aes(x = Sex, y = Salary, fill = Sex, col = Sex))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05)
p1 + labs(x = "Sex", y = "Salaries (USD)") + ggtitle("Professors' salaries by sex") + theme_bw()

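T-test

Two results matter here. The first is the p-value: at 0.0027 it is below 0.05, so we reject the null hypothesis and conclude that the data provide evidence of a difference in mean salary between male and female professors. A small p-value does not prove the difference is true; it means a difference this large would be unlikely if the null hypothesis held. The second is the confidence interval: male professors earned on average $14,088 more than female professors, with a 95% confidence interval of $5,138 to $23,038. The interval excludes 0, which agrees with the p-value. The 95% refers to the procedure: intervals constructed this way would contain the true mean difference in 95% of repeated samples.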

t.test(Salary ~ Sex, data = salary)
## 
##  Welch Two Sample t-test
## 
## data:  Salary by Sex
## t = -3.1615, df = 50.122, p-value = 0.002664
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -23037.916  -5138.102
## sample estimates:
## mean in group Female   mean in group Male 
##             101002.4             115090.4
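The Welch statistic reported above can be approximately reproduced by hand from the summary statistics, which makes clear where the t value, the fractional degrees of freedom, and the interval come from. This is a sketch under one stated assumption: the standard deviations are the rounded values shown in the table in 1.3 (26000 and 30400), so the results are close to, but not identical with, the `t.test()` output. All variable names are illustrative.

```r
# Group sizes and means from the output above; SDs are the ROUNDED
# values from the descriptive table, so results are approximate.
n_f <- 39;  mean_f <- 101002.4; sd_f <- 26000
n_m <- 358; mean_m <- 115090.4; sd_m <- 30400

# Per-group variance of the mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch standard error and t statistic (Female minus Male)
se     <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se

# Welch-Satterthwaite degrees of freedom
df <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value), 4)
round(ci)
```

The small discrepancies from the reported t = -3.1615 and df = 50.122 come from the rounding of the standard deviations in the table.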

Task 2

A cross-sectional investigation of 26 associate professors in the Theoretical discipline was conducted. An exploratory analysis was carried out to determine whether the average number of publications differed between male and female professors.

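2.1 Provide a brief description of the study in Task 2

- What is the study design?

A cross-sectional study of 26 associate professors in the Theoretical discipline.

- What is the null hypothesis?

There is no difference in the average number of publications between male and female associate professors.

- What is the alternative hypothesis?

The average number of publications differs between male and female associate professors.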

2.2 Select a subgroup of associate professors in the Theoretical discipline

Assoc.A = subset(salary, Rank == "AssocProf" & Discipline == "A")
dim(Assoc.A)
## [1] 26  9

2.3 Describe characteristics of the study sample by sex

library(table1)
table1(~ Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary | Sex, data = Assoc.A)
| | Female (N=4) | Male (N=22) | Overall (N=26) |
|---|---|---|---|
| Rank | | | |
| AssocProf | 4 (100%) | 22 (100%) | 26 (100%) |
| Discipline | | | |
| A | 4 (100%) | 22 (100%) | 26 (100%) |
| Yrs.since.phd | | | |
| Mean (SD) | 18.5 (8.19) | 17.7 (12.2) | 17.8 (11.5) |
| Median [Min, Max] | 19.0 [10.0, 26.0] | 12.5 [8.00, 49.0] | 13.0 [8.00, 49.0] |
| Yrs.service | | | |
| Mean (SD) | 15.5 (8.70) | 13.1 (12.3) | 13.5 (11.7) |
| Median [Min, Max] | 15.0 [8.00, 24.0] | 8.00 [1.00, 49.0] | 8.00 [1.00, 49.0] |
| NPubs | | | |
| Mean (SD) | 10.0 (4.97) | 21.6 (14.2) | 19.8 (13.8) |
| Median [Min, Max] | 10.0 [4.00, 16.0] | 16.0 [3.00, 48.0] | 16.0 [3.00, 48.0] |
| Ncits | | | |
| Mean (SD) | 38.5 (18.5) | 44.3 (15.2) | 43.4 (15.5) |
| Median [Min, Max] | 37.5 [19.0, 60.0] | 47.0 [24.0, 69.0] | 47.0 [19.0, 69.0] |
| Salary | | | |
| Mean (SD) | 72100 (6400) | 85000 (10600) | 83100 (11100) |
| Median [Min, Max] | 74100 [62900, 77500] | 82400 [70000, 108000] | 81900 [62900, 108000] |

2.4 Develop a graph to check the distribution of the number of publications.

library(ggplot2)
# Histogram on the density scale with a kernel density curve overlaid
p = ggplot(data = Assoc.A, aes(x = NPubs))
# after_stat(density) replaces the deprecated ..density.. notation; bins = 30 is the ggplot2 default made explicit
p1 = p + geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white", fill = "blue")
p2 = p1 + geom_density(col = "red")
p2 + ggtitle("Distribution of professors' number of publications") + theme_bw()

2.5 Conduct a non-parametric test to determine whether the number of publications differed between male and female professors:

kruskal.test(NPubs ~ Sex, data = Assoc.A)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  NPubs by Sex
## Kruskal-Wallis chi-squared = 2.5726, df = 1, p-value = 0.1087
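With only two groups, the Kruskal-Wallis test is equivalent to the Wilcoxon rank-sum test using the normal approximation without continuity correction. The sketch below illustrates that equivalence on a small made-up data frame (`toy` and its values are hypothetical, chosen only to mirror the 4 versus 22 imbalance in spirit; they are not the study data).

```r
# Made-up data for illustration only (not the Assoc.A sample)
toy <- data.frame(
  NPubs = c(4, 7, 13, 16, 3, 9, 11, 15, 18, 22, 30, 41, 48),
  Sex   = factor(rep(c("Female", "Male"), times = c(4, 9)))
)

# Kruskal-Wallis rank sum test on two groups
kw <- kruskal.test(NPubs ~ Sex, data = toy)

# Wilcoxon rank-sum test with the normal approximation and
# no continuity correction, which matches Kruskal-Wallis for 2 groups
wx <- wilcox.test(NPubs ~ Sex, data = toy, exact = FALSE, correct = FALSE)

c(kruskal = kw$p.value, wilcoxon = wx$p.value)
```

On the real sample, `wilcox.test(NPubs ~ Sex, data = Assoc.A)` with its defaults may give a slightly different p-value, because the defaults use an exact test or a continuity correction depending on sample size and ties.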

2.6 Carry out a bootstrap to determine whether the mean number of publications differed between male and female professors:

library(lmboot)
boot = ANOVA.boot(NPubs ~ Sex, B = 1000, seed = 1234, data = Assoc.A)
boot$'p-values'
## [1] 0.12
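To make the idea behind a bootstrap p-value concrete, here is a minimal base-R sketch of one common approach: resampling residuals under the null model and comparing the observed F statistic with its bootstrap distribution. It is an illustration under stated assumptions, not a description of the internals of `ANOVA.boot()`; the data frame `d`, the helper `f_stat`, and all values are hypothetical.

```r
set.seed(1234)

# Made-up data for illustration only (not the Assoc.A sample)
d <- data.frame(
  y = c(4, 7, 13, 16, 3, 9, 11, 15, 18, 22, 30, 41, 48),
  g = factor(rep(c("F", "M"), times = c(4, 9)))
)

# F statistic for the group effect in a one-way layout
f_stat <- function(dat) anova(lm(y ~ g, data = dat))[["F value"]][1]
f_obs  <- f_stat(d)

# Under H0 (no group effect) every fitted value is the grand mean,
# so the null residuals are deviations from the grand mean.
null_fit <- mean(d$y)
null_res <- d$y - null_fit

# Resample residuals with replacement, rebuild responses under H0,
# and recompute the F statistic each time.
B <- 1000
f_boot <- replicate(B, {
  d_star   <- d
  d_star$y <- null_fit + sample(null_res, replace = TRUE)
  f_stat(d_star)
})

# Bootstrap p-value: share of null F statistics at least as large as observed
p_boot <- mean(f_boot >= f_obs, na.rm = TRUE)
p_boot
```

In the analysis above, a p-value of 0.12 from the bootstrap is in line with the Kruskal-Wallis result (0.1087): with only 4 female professors in the subgroup, neither test finds a statistically significant difference in the number of publications at the 0.05 level.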