An online version of the Notebook is available at http://rpubs.com/jvervaart/assignment4.

library(rmarkdown)
library(knitr)
library(lsr)
library(stats)
library(car)

## Loading required package: carData

library(BSDA)

## Loading required package: lattice

## 
## Attaching package: 'BSDA'

## The following objects are masked from 'package:carData':
## 
##     Vocab, Wool

## The following object is masked from 'package:datasets':
## 
##     Orange

options(scipen=999) # restricting the use of the scientific notation

Task 1

Loading the salary dataset:

salary_data <- carData::Salaries

str(salary_data)

## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
##  $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
##  $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...

summary(salary_data)

##         rank     discipline yrs.since.phd    yrs.service        sex     
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545

Task 2

Calculating z-scores of the yrs.since.phd variable:

ZScoreFunction <- function(x, mu, sig) {(x-mu)/sig}

salary_data$zscores <- ZScoreFunction(salary_data$yrs.since.phd, mu = mean(salary_data$yrs.since.phd), sig = sd(salary_data$yrs.since.phd))

The resulting z-scores:

paged_table(salary_data[c('yrs.since.phd', 'zscores')])
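As a cross-check on the manual calculation, the same z-scores can be obtained with base R's scale(). This is only a sketch: the names z_manual and z_scale are introduced here for illustration and are not part of the original notebook.

```r
# Sketch: cross-check the manual z-scores against base R's scale().
salary_data <- carData::Salaries

ZScoreFunction <- function(x, mu, sig) {(x - mu) / sig}

# z_manual and z_scale are illustrative names (not in the original notebook)
z_manual <- ZScoreFunction(salary_data$yrs.since.phd,
                           mu = mean(salary_data$yrs.since.phd),
                           sig = sd(salary_data$yrs.since.phd))

# scale() centers on the mean and divides by the sample standard deviation;
# it returns a one-column matrix, so convert it to a plain numeric vector.
z_scale <- as.numeric(scale(salary_data$yrs.since.phd))

# Compare both approaches up to floating-point tolerance
all.equal(z_manual, z_scale)
```

By construction, z-scores computed this way have a mean of 0 and a standard deviation of 1, which is a quick sanity check for either approach.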

Tasks 2a & 2b

The highest and lowest z-scores of yrs.since.phd and the range between them (note that this is the spread of the z-scores themselves, not a difference in salary):

max(salary_data$zscores)

## [1] 2.613885

min(salary_data$zscores)

## [1] -1.653981

# calculating the range:
max(salary_data$zscores) - min(salary_data$zscores)

## [1] 4.267866
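To relate the extreme z-scores back to salary, one possible approach is to select the rows at the maximum and minimum z-score and compare their salaries. The sketch below assumes that ties (several professors sharing the same extreme z-score) are summarised by their mean salary; the names highest, lowest and salary_gap are hypothetical.

```r
# Sketch: salary difference between the professors with the highest
# and the lowest z-score of yrs.since.phd.
salary_data <- carData::Salaries
salary_data$zscores <- as.numeric(scale(salary_data$yrs.since.phd))

# Rows at the extremes; several professors can share the same z-score
highest <- salary_data[salary_data$zscores == max(salary_data$zscores), ]
lowest  <- salary_data[salary_data$zscores == min(salary_data$zscores), ]

# Assumption: ties are summarised by their mean salary
salary_gap <- mean(highest$salary) - mean(lowest$salary)
salary_gap
```

A positive salary_gap would mean that the professor(s) with the most years since their PhD earn more on average than those with the fewest.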

Task 3

Testing the following hypotheses:

H0: Rank is independent of sex (the distribution of rank is the same for males and females).
H1: Rank is not independent of sex (the distribution of rank differs between males and females).

First, looking at the distribution across the two variables (sex and rank):

table(salary_data$rank, salary_data$sex)

##            
##             Female Male
##   AsstProf      11   56
##   AssocProf     10   54
##   Prof          18  248

Next, performing the Chi-squared test:

chisq.test(x = salary_data$rank, y = salary_data$sex)

## 
##  Pearson's Chi-squared test
## 
## data:  salary_data$rank and salary_data$sex
## X-squared = 8.5259, df = 2, p-value = 0.01408

Task 3a

The p-value of the Chi-squared test (p = 0.014) is below 0.05, so we reject H0: the distribution of rank differs between males and females.
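The Chi-squared approximation is usually considered reliable when the expected cell counts are not too small (a common rule of thumb is at least 5). A minimal sketch of how this could be checked from the test object; the names rank_by_sex and chi are introduced here for illustration.

```r
# Sketch: inspect expected counts and residuals of the Chi-squared test.
salary_data <- carData::Salaries

rank_by_sex <- table(salary_data$rank, salary_data$sex)
chi <- chisq.test(rank_by_sex)

# Expected counts under H0 (independence of rank and sex)
chi$expected

# Rule-of-thumb check: are all expected counts at least 5?
all(chi$expected >= 5)

# Standardized residuals show which cells deviate most from independence
chi$stdres
```

Passing the contingency table to chisq.test() is equivalent to passing the two factors as x and y, as done above.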

Task 4

shapiro.test(salary_data$salary)

## 
##  Shapiro-Wilk normality test
## 
## data:  salary_data$salary
## W = 0.95988, p-value = 0.000000006076

qqPlot(salary_data$salary)

## [1]  44 365

Task 4a

The p-value of the Shapiro-Wilk normality test is < 0.05, so the test rejects the hypothesis that salary is normally distributed. The visual representation using qqPlot() shows a fairly straight line. Because one of the two checks (the Shapiro-Wilk test) indicates non-normality, we conclude that the assumption of normality is not met.
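The decision rule used here can be written out explicitly. The sketch below uses base R only (qqnorm() and qqline() instead of car::qqPlot()); alpha, sw and rejects_normality are illustrative names, and the 0.05 threshold is the one used throughout this assignment.

```r
# Sketch: combine the Shapiro-Wilk test with a base R Q-Q plot.
salary_data <- carData::Salaries

alpha <- 0.05
sw <- shapiro.test(salary_data$salary)

# TRUE means the test rejects the hypothesis of normality at level alpha
rejects_normality <- sw$p.value < alpha
rejects_normality

# Base R alternative to car::qqPlot(): points close to the line
# indicate an approximately normal distribution
qqnorm(salary_data$salary, main = "Normal Q-Q plot of salary")
qqline(salary_data$salary)
```

With 397 observations the Shapiro-Wilk test is sensitive to small deviations, which is one reason to look at the plot alongside the p-value.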

Task 5

Testing the following hypotheses:

H0: The mean salary of males and females is equal.
H1: The mean salary of males and females is not equal.

Performing a t-test:

t <- t.test(x = subset(salary_data$salary, salary_data$sex == 'Male'), 
       y = subset(salary_data$salary, salary_data$sex == 'Female'), 
       alternative = 'g')

t

## 
##  Welch Two Sample t-test
## 
## data:  subset(salary_data$salary, salary_data$sex == "Male") and subset(salary_data$salary, salary_data$sex == "Female")
## t = 3.1615, df = 50.122, p-value = 0.001332
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6620.263      Inf
## sample estimates:
## mean of x mean of y 
##  115090.4  101002.4

Task 5a

I used a Welch two-sample t-test because the male and female groups are independent samples and the Welch variant does not assume equal variances in the two groups.

The 'alternative' argument in the t.test() call was set to 'greater' (or 'g'), which makes the test one-sided: it checks whether the mean salary of males is greater than that of females. Note that this is narrower than the H1 stated above, which is two-sided and would correspond to alternative = 'two.sided' (the default).
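The difference between the one-sided and the two-sided test can be made concrete by running both on the same data. This is a sketch; male, female, one_sided and two_sided are names introduced for illustration.

```r
# Sketch: one-sided versus two-sided Welch t-test on the same groups.
salary_data <- carData::Salaries

male   <- salary_data$salary[salary_data$sex == "Male"]
female <- salary_data$salary[salary_data$sex == "Female"]

# One-sided: H1 is "mean(male) > mean(female)" (as in the analysis above)
one_sided <- t.test(male, female, alternative = "greater")

# Two-sided: H1 is "mean(male) != mean(female)" (the default)
two_sided <- t.test(male, female, alternative = "two.sided")

# The t statistic and degrees of freedom are identical; only the p-value
# and the confidence interval differ between the two versions
c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
```

Because the observed t statistic is positive, the two-sided p-value is twice the one-sided p-value, so the conclusion at the 0.05 level is the same either way.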

Task 5b

Calculating relevant reporting statistics:

# group sizes
n.male <- nrow(subset(salary_data, salary_data$sex == 'Male'))
n.female <- nrow(subset(salary_data, salary_data$sex == 'Female'))

# mean and standard deviation of salary per group, rounded to two decimals
salary.male <- round(mean(subset(salary_data$salary, salary_data$sex == 'Male')), digits = 2)
salary.sd.male <- round(sd(subset(salary_data$salary, salary_data$sex == 'Male')), digits = 2)
salary.female <- round(mean(subset(salary_data$salary, salary_data$sex == 'Female')), digits = 2)
salary.sd.female <- round(sd(subset(salary_data$salary, salary_data$sex == 'Female')), digits = 2)

t.score <- round(t[["statistic"]][["t"]], digits = 2)

The data set contains 358 male and 39 female subjects. A Welch two-sample t-test (one-sided, alternative = 'greater') was conducted to evaluate whether the mean salary of male subjects exceeds that of female subjects. The results suggest that salaries of male subjects (M = 115090.42, SD = 30436.93) are significantly higher than salaries of female subjects (M = 101002.41, SD = 25952.13), t(50.12) = 3.16, p = 0.001.
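The group sizes, means and standard deviations reported above can also be computed in one pass by splitting salary by sex. A minimal sketch; by_sex and group_stats are hypothetical names, not part of the original notebook.

```r
# Sketch: descriptive statistics per group in a single table.
salary_data <- carData::Salaries

# List with one salary vector per level of sex ("Female", "Male")
by_sex <- split(salary_data$salary, salary_data$sex)

group_stats <- data.frame(
  n    = sapply(by_sex, length),
  mean = round(sapply(by_sex, mean), 2),
  sd   = round(sapply(by_sex, sd), 2)
)
group_stats
```

Individual values can then be looked up by row and column name, for example group_stats["Male", "mean"].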

Task 5c

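Running a one-sample z-test on the male salaries, using the standard deviation of the female salaries as sigma.x. Because mu is not specified, z.test() uses its default of mu = 0, so the z value below compares the male mean with 0 rather than with the female mean: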

z.test(x = subset(salary_data$salary, salary_data$sex == 'Male'), sigma.x = salary.sd.female)

## 
##  One-sample z-Test
## 
## data:  subset(salary_data$salary, salary_data$sex == "Male")
## z = 83.909, p-value < 0.00000000000000022
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  112402.1 117778.7
## sample estimates:
## mean of x 
##  115090.4
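To actually use females as the point of reference, the male mean would have to be compared with the female mean instead of with 0. The sketch below computes that z-score by hand in base R, treating the female mean and standard deviation as if they were known population values (an assumption made for this illustration); mu_ref, sigma_ref, z and p_value are hypothetical names. Passing mu = mu_ref to BSDA's z.test() should be the equivalent call.

```r
# Sketch: z-score of the male mean salary relative to the female group.
salary_data <- carData::Salaries

male   <- salary_data$salary[salary_data$sex == "Male"]
female <- salary_data$salary[salary_data$sex == "Female"]

# Assumption: female mean and sd are treated as the reference "population"
mu_ref    <- mean(female)
sigma_ref <- sd(female)

# One-sample z statistic: (sample mean - reference mean) / standard error
z <- (mean(male) - mu_ref) / (sigma_ref / sqrt(length(male)))

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

c(z = z, p_value = p_value)
```

This z value is much smaller than the one obtained against mu = 0, because it measures the distance between the two group means rather than the distance of the male mean from zero.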

Task 5d

One-sample t-tests assume normality of the population distribution and independence.

Independent samples t-tests:

Student's t-tests assume normality and independence, but also homogeneity of variance.
Welch's t-tests assume normality and independence, but do not assume homogeneity of variance.

Paired samples t-tests assume normality of differences of matched-pairs.
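The difference between the Student and Welch variants can be illustrated on the salary data. The sketch below uses base R's var.test() (an F test, which is itself sensitive to non-normality) as one possible check of homogeneity of variance, and then runs both t-test variants; variance_check, student and welch are names introduced for illustration.

```r
# Sketch: checking homogeneity of variance and comparing Student vs Welch.
salary_data <- carData::Salaries

# F test for equal variances of salary in the two groups
variance_check <- var.test(salary ~ sex, data = salary_data)
variance_check$p.value

# Student's t-test: pooled variance, assumes equal variances
student <- t.test(salary ~ sex, data = salary_data, var.equal = TRUE)

# Welch's t-test: separate variances (the default in t.test())
welch <- t.test(salary ~ sex, data = salary_data, var.equal = FALSE)

# Student uses n1 + n2 - 2 degrees of freedom; Welch adjusts them downwards
c(student_df = unname(student$parameter), welch_df = unname(welch$parameter))
```

With the formula interface the groups are ordered by factor level (Female, then Male), so the sign of the t statistic is reversed compared with the x = Male, y = Female call used earlier.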

Statistics for Pre-masters DSS - Assignment 4

Jesse Vervaart

18-11-2019