An online version of the Notebook is available at http://rpubs.com/jvervaart/assignment4.
library(rmarkdown)
library(knitr)
library(lsr)
library(stats)
library(car)
## Loading required package: carData
library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following objects are masked from 'package:carData':
##
## Vocab, Wool
## The following object is masked from 'package:datasets':
##
## Orange
options(scipen=999) # restricting the use of the scientific notation
Loading the salary dataset:
salary_data <- carData::Salaries
str(salary_data)
## 'data.frame': 397 obs. of 6 variables:
## $ rank : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
## $ discipline : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
## $ salary : int 139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
summary(salary_data)
## rank discipline yrs.since.phd yrs.service sex
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## salary
## Min. : 57800
## 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
Calculating z-scores of the yrs.since.phd variable:
ZScoreFunction <- function(x, mu, sig) {(x-mu)/sig}
salary_data$zscores <- ZScoreFunction(salary_data$yrs.since.phd, mu = mean(salary_data$yrs.since.phd), sig = sd(salary_data$yrs.since.phd))
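As a quick sanity check of the z-score function, the sketch below compares it with base R's scale() on a toy vector. The vector holds only the first six yrs.since.phd values shown in the str() output above, so the resulting z-scores differ from those of the full data set; the names z_manual and z_scale are introduced here for illustration.

```r
# First six yrs.since.phd values from the str() output (toy subset, not the full data)
x <- c(19, 20, 4, 45, 40, 6)

# Same definition as in the notebook
ZScoreFunction <- function(x, mu, sig) {(x - mu) / sig}

# Manual z-scores using the sample mean and sample standard deviation
z_manual <- ZScoreFunction(x, mu = mean(x), sig = sd(x))

# scale() centers by the mean and divides by the sample standard deviation
z_scale <- as.numeric(scale(x))

# TRUE if both approaches agree up to numerical tolerance
print(all.equal(z_manual, z_scale))
```

By construction, z-scores computed this way have mean 0 and standard deviation 1 within the vector they were computed on.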
The resulting z-scores:
paged_table(salary_data[c('yrs.since.phd', 'zscores')])
Range between the highest and lowest z-scores (note that the code compares the z-scores themselves, not the salaries of the corresponding professors):
max(salary_data$zscores)
## [1] 2.613885
min(salary_data$zscores)
## [1] -1.653981
# calculating the range:
max(salary_data$zscores) - min(salary_data$zscores)
## [1] 4.267866
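To compare the salaries of the professors with the highest and lowest z-scores, one possible approach is to look up the corresponding rows with which.max() and which.min(). The sketch below uses a toy data frame built from the first six rows shown in the str() output, so its numbers are not those of the full data set; the names toy, highest, lowest and salary_diff are assumptions made for this example.

```r
# Toy subset: first six rows of yrs.since.phd and salary from the str() output
toy <- data.frame(
  yrs.since.phd = c(19, 20, 4, 45, 40, 6),
  salary        = c(139750, 173200, 79750, 115000, 141500, 97000)
)

# z-scores of yrs.since.phd within the toy subset
toy$zscores <- (toy$yrs.since.phd - mean(toy$yrs.since.phd)) / sd(toy$yrs.since.phd)

# Rows with the highest and the lowest z-score
highest <- toy[which.max(toy$zscores), ]
lowest  <- toy[which.min(toy$zscores), ]

# Salary difference between those two professors
salary_diff <- highest$salary - lowest$salary
print(salary_diff)
```

which.max() and which.min() return only the first matching row, so with ties in yrs.since.phd (likely in the full data set) the result depends on row order.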
Testing the following hypotheses (H0: rank and sex are independent; H1: rank and sex are not independent):
First, looking at the distribution across the two variables (sex and rank):
table(salary_data$rank, salary_data$sex)
##
## Female Male
## AsstProf 11 56
## AssocProf 10 54
## Prof 18 248
Next, performing the Chi-squared test:
chisq.test(x = salary_data$rank, y = salary_data$sex)
##
## Pearson's Chi-squared test
##
## data: salary_data$rank and salary_data$sex
## X-squared = 8.5259, df = 2, p-value = 0.01408
The p-value of the Chi-squared test (0.014) is below 0.05, so we reject H0 and conclude that rank and sex are associated in this data set.
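The test statistic can also be reproduced by hand from the contingency table above, which may help clarify what chisq.test() computes. This is a minimal sketch using only base R; the object names (observed, expected, chi_sq, p_value) are chosen for illustration.

```r
# Observed counts copied from the rank-by-sex table above
observed <- matrix(
  c(11, 10, 18,    # Female
    56, 54, 248),  # Male
  nrow = 3,
  dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                  sex  = c("Female", "Male"))
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

All expected counts are above 5, which is the usual rule of thumb for the chi-squared approximation to be reasonable.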
shapiro.test(salary_data$salary)
##
## Shapiro-Wilk normality test
##
## data: salary_data$salary
## W = 0.95988, p-value = 0.000000006076
qqPlot(salary_data$salary)
## [1] 44 365
The p-value of the Shapiro-Wilk normality test is < 0.05, so the test rejects the hypothesis that salary is normally distributed. The visual representation using qqPlot() shows a fairly straight line. Because one of the two checks (the Shapiro-Wilk test) indicates non-normality, we treat the assumption of normality as not met.
Testing the following hypotheses:
Performing a t-test:
t <- t.test(x = subset(salary_data$salary, salary_data$sex == 'Male'),
y = subset(salary_data$salary, salary_data$sex == 'Female'),
alternative = 'g')
t
##
## Welch Two Sample t-test
##
## data: subset(salary_data$salary, salary_data$sex == "Male") and subset(salary_data$salary, salary_data$sex == "Female")
## t = 3.1615, df = 50.122, p-value = 0.001332
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 6620.263 Inf
## sample estimates:
## mean of x mean of y
## 115090.4 101002.4
I have used a Welch Two Sample t-test because the male and female salaries form two independent groups, and the Welch version (the default of t.test(), var.equal = FALSE) does not assume equal variances in the two groups.
The ‘alternative’ argument in the t.test() call was set to ‘greater’ (or ‘g’), which makes the test one-sided: it tests whether the mean salary of Males is greater than the mean salary of Females. An alternative hypothesis stating only that the salaries are not equal would instead correspond to the default, ‘two.sided’.
Calculating relevant reporting statistics:
# group sizes, means and standard deviations per sex
n.male <- nrow(subset(salary_data, salary_data$sex == 'Male'))
n.female <- nrow(subset(salary_data, salary_data$sex == 'Female'))
salary.male <- round(mean(subset(salary_data$salary, salary_data$sex == 'Male')), digits = 2)
salary.sd.male <- round(sd(subset(salary_data$salary, salary_data$sex == 'Male')), digits = 2)
salary.female <- round(mean(subset(salary_data$salary, salary_data$sex == 'Female')), digits = 2)
salary.sd.female <- round(sd(subset(salary_data$salary, salary_data$sex == 'Female')), digits = 2)
t.score <- round(t[["statistic"]][["t"]], digits = 2)
The data set contains 358 male subjects and 39 female subjects. A one-sided Welch Two Sample t-test was conducted to evaluate the null hypothesis that the mean salaries of male and female subjects are equal. The results suggest that salaries of male subjects (M = 115090.42, SD = 30436.93) are significantly higher than salaries of female subjects (M = 101002.41, SD = 25952.13), t(50.12) = 3.16, p = 0.001.
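The Welch statistic and its degrees of freedom can be recomputed from the reported group sizes, means and standard deviations, which shows where t = 3.16 and df of about 50 come from. This is a rough sketch based on the rounded summary values, so small deviations from the t.test() output are expected; the variable names are chosen for this example.

```r
# Summary statistics as reported above (rounded to two decimals)
n_m <- 358; mean_m <- 115090.42; sd_m <- 30436.93  # male
n_f <- 39;  mean_f <- 101002.41; sd_f <- 25952.13  # female

# Squared standard error contribution of each group
se2_m <- sd_m^2 / n_m
se2_f <- sd_f^2 / n_f

# Welch t statistic: mean difference divided by the unpooled standard error
t_stat <- (mean_m - mean_f) / sqrt(se2_m + se2_f)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_m + se2_f)^2 / (se2_m^2 / (n_m - 1) + se2_f^2 / (n_f - 1))

# One-sided p-value for the alternative "male mean is greater"
p_one_sided <- pt(t_stat, df = df_welch, lower.tail = FALSE)

print(c(t = t_stat, df = df_welch, p = p_one_sided))
```

The small female group dominates the standard error, which is why the degrees of freedom end up close to the size of that group rather than the total sample size.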
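Calculating a one-sample z-test of the male salaries, using the female standard deviation as the population sigma. Because the mu argument is left at its default of 0, the statistic in the output tests the male mean against 0, not against the female mean: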
z.test(x = subset(salary_data$salary, salary_data$sex == 'Male'), sigma.x = salary.sd.female)
##
## One-sample z-Test
##
## data: subset(salary_data$salary, salary_data$sex == "Male")
## z = 83.909, p-value < 0.00000000000000022
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 112402.1 117778.7
## sample estimates:
## mean of x
## 115090.4
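The z statistic in the output can be reproduced by hand, which makes clear what it is measured against. The sketch below uses the rounded summary values reported earlier; the second statistic, with the female mean as the reference value, is an assumption about what the comparison was meant to test and is not part of the original output.

```r
# Summary statistics as reported above (rounded to two decimals)
n_m    <- 358
mean_m <- 115090.42  # male mean salary
mean_f <- 101002.41  # female mean salary
sd_f   <- 25952.13   # female standard deviation, used as sigma

# Standard error of the male mean when sigma is taken from the female group
se <- sd_f / sqrt(n_m)

# z against mu = 0 (the default of the z.test() call above)
z_vs_zero <- (mean_m - 0) / se

# z against the female mean (assumed intended reference point)
z_vs_female <- (mean_m - mean_f) / se

print(c(z_vs_zero = z_vs_zero, z_vs_female = z_vs_female))
```

With BSDA, the second comparison would correspond to passing mu = salary.female to z.test() in addition to sigma.x.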
One-sample t-tests assume normality of the population distribution and independence.
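Independent samples t-tests assume normality of the population distribution in each group and independence of observations within and between groups; Student's version additionally assumes equal variances in the two groups, whereas Welch's version does not.
Paired samples t-tests assume normality of the differences between matched pairs.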
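The paired-samples assumption concerns the differences because a paired t-test is equivalent to a one-sample t-test on those differences. The sketch below illustrates this with simulated data; the vectors before and after are hypothetical and unrelated to the salary data set.

```r
# Simulated matched pairs (hypothetical data, for illustration only)
set.seed(1)
before <- rnorm(20, mean = 100, sd = 10)
after  <- before + rnorm(20, mean = 2, sd = 5)

# Paired t-test on the two measurements
paired <- t.test(after, before, paired = TRUE)

# One-sample t-test on the pairwise differences
one_sample <- t.test(after - before, mu = 0)

# Both tests give the same statistic and p-value
print(all.equal(unname(paired$statistic), unname(one_sample$statistic)))
print(all.equal(paired$p.value, one_sample$p.value))

# The normality check therefore applies to the differences, not the raw values
print(shapiro.test(after - before))
```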