ABSTRACT

Inferences about two population means were conducted using both the frequentist approach and the Bayesian approach. From the frequentist perspective, several distinct types of inference were performed, including point estimation, interval estimation, and hypothesis testing. Point estimation and interval estimation were also performed in a Bayesian manner.

1. INTRODUCTION

The Salaries for Professors dataset contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors at a college in the U.S. (Data source: Fox, J. and Weisberg, S. (2011), An R Companion to Applied Regression). The data were collected by the college’s administration, and this project uses the copy from the R package “carData”. This project uses the salary data to analyze the difference between the salaries of male and female faculty members.

Point and interval estimates were obtained with the frequentist approach, and a Wald test was used to test the null hypothesis that the two population means are equal. Using the Bayesian method, we also obtained a point estimate and a credible interval. The two approaches provide similar answers.

The analysis finds a statistically significant difference between male and female professors’ salaries: on average, female professors earn less.

2. DATA ANALYSIS AND STATISTICAL INFERENCE

2.1 Data Exploration

The salaries dataset is separated into two parts, male salary and female salary, by gender of faculty members.

library(carData)
library(dplyr)
data("Salaries")
# Split the data by sex
Fdata <- Salaries %>%
    filter(sex == "Female")
Mdata <- Salaries %>%
    filter(sex == "Male")
# Build a two-column data frame (Salary, Gender) for each group
FSalary <- data.frame(Salary = Fdata$salary, Gender = "Female", stringsAsFactors = FALSE)
MSalary <- data.frame(Salary = Mdata$salary, Gender = "Male", stringsAsFactors = FALSE)
# Stack the two groups: female rows first, then male rows
S <- rbind(FSalary, MSalary)

Table 1 summarizes the salary variable for female (row 1) and male (row 2) faculty members.

rbind(summary(filter(S,Gender=="Female")$Salary), summary(filter(S,Gender=="Male")$Salary))
##       Min. 1st Qu. Median     Mean  3rd Qu.   Max.
## [1,] 62884   77250 103750 101002.4 117002.5 161101
## [2,] 57800   92000 108043 115090.4 134863.8 231545

Table 1 Summary of Salary Variable

According to the boxplots in Figure 1, there are 3 outliers among the male salaries.

library(ggplot2)
ggplot(S, aes(x=Gender, y= Salary))+
    geom_boxplot()+
    ggtitle("Figure 1: Female and Male Professors' Salaries")

2.2 Frequentist Approach

2.2.1 Nonparametric Approach

Plug-In Estimator and Confidence Interval

The mean salaries of male and female professors are denoted $\mu_1$ and $\mu_2$, respectively, and the parameter of interest is the difference $\mu_2 - \mu_1$ (female minus male). The plug-in estimator of the mean male salary is Xbar = 115090.4, and the plug-in estimator of the mean female salary is Ybar = 101002.4. The plug-in estimator of the difference is therefore diff = Ybar - Xbar = -14088.01.

Ybar <- mean(filter(S, Gender=="Female")$Salary)
Xbar <- mean(filter(S, Gender=="Male")$Salary)
diff <- Ybar - Xbar

A nonparametric bootstrap is used to estimate the standard error of the difference of the means. The bootstrap standard error is 4416.451, which gives the approximate 95% confidence interval diff ± 2 se = (-22920.912, -5255.106).

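library(bootstrap)
psigma <- diff
# 3200 bootstrap replicates of each group mean
Fmean <- bootstrap(x=filter(S, Gender=="Female")$Salary,3200,mean)
Mmean <- bootstrap(x=filter(S, Gender=="Male")$Salary,3200,mean)
# Standard error of the difference (female minus male) of the bootstrap means
se.boot <- sd(Fmean$thetastar-Mmean$thetastar)
# Approximate 95% interval: estimate plus or minus 2 standard errors
CI.boot <- c(psigma-2*se.boot, psigma+2*se.boot)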
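
The resampling idea behind the bootstrap standard error can be sketched in base R. This is a minimal illustration, not the code used for the reported results: the helper name `boot_se_diff` is hypothetical, and the data are simulated stand-ins rather than the Salaries data.

```r
# Hypothetical helper: bootstrap standard error of mean(y) - mean(x),
# where x and y are numeric vectors from two independent samples.
boot_se_diff <- function(x, y, B = 2000) {
  diffs <- replicate(B, {
    # Resample each group independently, with replacement
    mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE))
  })
  sd(diffs)
}

# Simulated stand-in data (assumed values, not the Salaries data)
set.seed(1)
x <- rnorm(358, mean = 115000, sd = 30000)
y <- rnorm(39, mean = 101000, sd = 26000)

est <- mean(y) - mean(x)             # plug-in estimate of the difference
se <- boot_se_diff(x, y)             # bootstrap standard error
ci <- c(est - 2 * se, est + 2 * se)  # approximate 95% interval
```

Resampling the two groups separately reflects the assumption that the samples are independent, which is why the variance of the difference is taken over independently drawn bootstrap means.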

2.2.2 Parametric Analysis

Two assumptions are made for the following analysis: 1. The two samples (male salaries and female salaries) are drawn from normal distributions; 2. The sampling distribution of each sample mean is normal.

MLE and Confidence Interval

The MLE of the difference is Ybar - Xbar = -14088.01, with estimated standard error 4405.366. From the asymptotic normality of the MLE, we can construct an approximate 95 percent confidence interval, which is (-22898.741, -5277.276).

sigmahat <- Ybar-Xbar
Fe <- filter(S, Gender=="Female")$Salary
M <- filter(S, Gender=="Male")$Salary
# MLE of each group variance (divides by n, not n-1)
S.F <- sum((Fe-Ybar)^2)/length(Fe)
S.M <- sum((M-Xbar)^2)/length(M)
# Estimated standard error of the difference in means
sehat <- sqrt(S.F/length(Fe)+S.M/length(M))
CI.normal <- c(sigmahat-2*sehat, sigmahat+2*sehat)
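
For reference, the quantities computed in the code above can be written out as follows. The subscripts $F$ and $M$ (female and male) and the symbols $n_F$, $n_M$ for the group sizes are notation assumed here for illustration; the numbers are those reported in the text.

```latex
\begin{align*}
\hat{\sigma}_F^2 &= \frac{1}{n_F}\sum_{i=1}^{n_F}\left(Y_i-\bar{Y}\right)^2, \qquad
\hat{\sigma}_M^2 = \frac{1}{n_M}\sum_{j=1}^{n_M}\left(X_j-\bar{X}\right)^2, \\
\widehat{\mathrm{se}} &= \sqrt{\frac{\hat{\sigma}_F^2}{n_F}+\frac{\hat{\sigma}_M^2}{n_M}} = 4405.366, \\
\left(\bar{Y}-\bar{X}\right) \pm 2\,\widehat{\mathrm{se}} &= -14088.01 \pm 2(4405.366) \approx (-22898.74,\ -5277.28).
\end{align*}
```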

Wald Test

We are interested in whether the mean salaries of male and female professors are the same. The null and alternative hypotheses are set up as follows.

H0: $\mu_1 = \mu_2$, i.e. the mean salaries of the two genders are the same
Ha: $\mu_1 \neq \mu_2$, i.e. the mean salaries of the two genders are not the same

We specify the size of the Wald test as $\alpha = 0.05$. The Wald test statistic and the corresponding p-value are

W = -3.198
p-value = 2 P(Z < -3.198) = 0.00138

The p-value is less than 0.05. Therefore, we reject the null hypothesis and conclude that the mean salaries of the two genders are not the same. Furthermore, because the estimated difference is negative, female professors have lower salaries on average than male professors.

sigmahat <- Ybar-Xbar
W <- sigmahat/sehat
pvalue <- 2*pnorm(-abs(W))
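
The arithmetic behind the reported statistic can be checked directly from the estimate and standard error given in Section 2.2.2. Here $\Phi$ denotes the standard normal distribution function, a symbol introduced for this illustration.

```latex
\begin{align*}
W &= \frac{\bar{Y}-\bar{X}}{\widehat{\mathrm{se}}} = \frac{-14088.01}{4405.366} \approx -3.198, \\
p &= 2\,\Phi\left(-|W|\right) = 2\,\Phi(-3.198) \approx 0.00138 < 0.05 .
\end{align*}
```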

2.3 Bayesian Approach

The population variances are not known, so we use the variances estimated from the samples. This introduces extra uncertainty, because we do not know the actual values of the population variances. We allow for this extra uncertainty by using Student’s t distribution instead of the standard normal distribution to find critical values for our credible intervals and to do probability calculations.

m.est.var <- 1/(nrow(filter(S, Gender=="Male"))-1)*sum((M-mean(M))^2)
f.est.var <- 1/(nrow(filter(S, Gender=="Female"))-1)*sum((Fe-mean(Fe))^2)

Posterior Mean and Posterior Variance

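Because the two samples are independent of each other, we use independent priors for the two means. Each mean is given a normal prior: mean 110000 and variance 9*10^8 for male salaries, and mean 100000 and variance 6*10^8 for female salaries. Because the priors are independent and the samples are independent, the posteriors are also independent.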

# posterior mean and variance under independent normal priors
n.m <- length(M)
n.f <- length(Fe)
# male professors' salaries: prior mean a1, prior variance b1
a1 <- 110000
b1 <- 9*10^8
pm.m <- (n.m/m.est.var*mean(M)+1/b1*a1)/(n.m/m.est.var+1/b1)
pvar.m <- 1/(n.m/m.est.var+1/b1)
# female professors' salaries: prior mean a2, prior variance b2
a2 <- 100000
b2 <- 6*10^8
pm.f <- (n.f/f.est.var*mean(Fe)+1/b2*a2)/(n.f/f.est.var+1/b2)
pvar.f <- 1/(n.f/f.est.var+1/b2)
# posterior mean and variance of the difference (female minus male)
sigma.bay <- pm.f-pm.m
var.bay <- pvar.m+pvar.f
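
The update applied to each group above is the standard precision-weighted combination of a normal prior with a sample mean. The sketch below isolates that step in a small function; the name `normal_posterior` is hypothetical, the sample variance is treated as if it were the known variance (as in the code above), and the numbers are illustrative rather than taken from the Salaries data.

```r
# Hypothetical helper: posterior of a normal mean under a N(a, b) prior,
# where a is the prior mean and b the prior variance, and the sample
# variance s2 is plugged in for the unknown population variance.
normal_posterior <- function(xbar, s2, n, a, b) {
  prec <- n / s2 + 1 / b   # posterior precision = data precision + prior precision
  list(mean = (n / s2 * xbar + a / b) / prec,
       var  = 1 / prec)
}

# Illustrative numbers only
post <- normal_posterior(xbar = 10, s2 = 4, n = 16, a = 0, b = 1)
post$mean  # 8
post$var   # 0.2
```

With a large prior variance the prior precision 1/b is small, so the posterior mean stays close to the sample mean; this is consistent with the Bayesian and frequentist estimates in this report being similar.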

Credible Interval for Difference between Means

# credible interval
n1 <- nrow(filter(S, Gender=="Male"))
n2 <- nrow(filter(S, Gender=="Female"))
# approximate degrees of freedom for the t distribution
df.numerator <- (m.est.var/n1+f.est.var/n2)^2
df.denominator <- (m.est.var/n1)^2/(n1+1)+(f.est.var/n2)^2/(n2+1)
df <- floor(df.numerator/df.denominator)
# qt(0.05, df) leaves 5% in each tail, so this is a 90% interval;
# qt(0.025, df) would give a 95% interval
t <- abs(qt(0.05,df))
Cre.Int <- c(sigma.bay-t*sqrt(var.bay),sigma.bay+t*sqrt(var.bay))
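
Written out, the code above computes an approximate degrees of freedom and a t-based interval for the difference of the posterior means. The symbols $s_1^2, s_2^2$ (sample variances for male and female salaries), $m_1', m_2'$ (posterior means), $v_1', v_2'$ (posterior variances) and $\nu$ (degrees of freedom) are notation assumed here for illustration.

```latex
\begin{align*}
\nu &= \left\lfloor
\frac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}
     {\dfrac{\left(s_1^2/n_1\right)^2}{n_1+1}+\dfrac{\left(s_2^2/n_2\right)^2}{n_2+1}}
\right\rfloor, \\
\text{interval} &= \left(m_2'-m_1'\right) \pm t_{\nu}\,\sqrt{v_1'+v_2'} .
\end{align*}
```

Here $t_{\nu}$ is the critical value returned by `qt`; for a central interval with posterior probability $1-\alpha$, it is the $1-\alpha/2$ quantile of the t distribution with $\nu$ degrees of freedom.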

3. CONCLUSIONS

From the frequentist perspective, a 95% confidence interval is produced by a procedure that, over repeated samples of this sort, covers the true difference in mean salaries 95% of the time; here the intervals are (-22920.912, -5255.106) from the nonparametric analysis and (-22898.741, -5277.276) from the parametric analysis. From the Bayesian perspective, the result is interpreted conditionally on the observed data: the credible interval (-21471.369, -6731.549) was computed with the 0.05 quantile of the t distribution, so there is a 90% posterior probability that the true difference lies within it. Female professors’ salaries are significantly lower than male professors’ salaries.

BIBLIOGRAPHY

Aspects of the frequentist approach can be found in Wasserman (2004). References on Bayesian inference include Carlin and Louis (1996) and Bolstad (2016).