For this project, we will use the “jobsatisfaction” dataset in the package “Datarium”. We will try to understand whether there are differences in the job satisfaction between males and females, and wether the education level impact the job satisfaction score.
Let’s start by loading the first libraries required!
library(datarium)
library(ggplot2)
library(dplyr)
##
## Caricamento pacchetto: 'dplyr'
## I seguenti oggetti sono mascherati da 'package:stats':
##
## filter, lag
## I seguenti oggetti sono mascherati da 'package:base':
##
## intersect, setdiff, setequal, union
library(nortest)
And now we can pick our dataset.
data("jobsatisfaction")
head(jobsatisfaction)
## # A tibble: 6 x 4
## id gender education_level score
## <fct> <fct> <fct> <dbl>
## 1 1 male school 5.51
## 2 2 male school 5.65
## 3 3 male school 5.07
## 4 4 male school 5.51
## 5 5 male school 5.94
## 6 6 male school 5.8
The dataset records the job satisfaction score of 58 people. Our aim in this project is to evaluate the variables that statistically impact this score.
First things first, let’s try to visualize the situation.
ggplot(jobsatisfaction, aes(y=score, x=education_level, fill= gender)) +
geom_boxplot()+ ylab("Satisfaction score")+ ggtitle("Satisfaction score by gender and education level") +
scale_fill_manual(values = c("blue", "red"))
Seems that for school and college level of education, female are a little bit more satisfied, while for university males are more satisfied. Let’s see if this differences are significant.
First things first, we need to check for the sample sizes.
table(select(jobsatisfaction, -c(id, score)))
## education_level
## gender school college university
## male 9 9 10
## female 10 10 10
As we can see, the size of females and males (at least for school and college) samples are different.
We start by performing an independent t-test on whether or not there is a difference in satisfaction between males and females if they have just school as education level. Let’s prepare our data to this test.
male_school_score <- jobsatisfaction %>%
filter(gender == "male" & education_level == "school")%>%
select(score)
female_school_score <- jobsatisfaction %>%
filter(gender == "female" & education_level == "school")%>%
select(score)
both_school_score <- jobsatisfaction%>%
filter(education_level == "school")%>%
select(score, gender)
Let’s check for the t-test assumptions!
shapiro.test(male_school_score$score)
##
## Shapiro-Wilk normality test
##
## data: male_school_score$score
## W = 0.98043, p-value = 0.9664
shapiro.test(female_school_score$score)
##
## Shapiro-Wilk normality test
##
## data: female_school_score$score
## W = 0.96292, p-value = 0.8185
The p-value is really high in both cases, so we can go on with assumption number 2: the samples have equal variances.
Rough check: the ratio between the SDs must not be greater than 2.
sd(male_school_score$score)/sd(female_school_score$score)
## [1] 0.7669707
sd(female_school_score$score)/sd(male_school_score$score)
## [1] 1.303831
It seems that we are good, but we need to be sure so let’s perform a test.
bartlett.test(both_school_score$score,both_school_score$gender)
##
## Bartlett test of homogeneity of variances
##
## data: both_school_score$score and both_school_score$gender
## Bartlett's K-squared = 0.55084, df = 1, p-value = 0.458
The last assumption is that the dependent variable should not contain any outlier. In order to remove outliers, we create a function.
remove_outliers_iqr=function(x){
IQR=summary(x)[5]-summary(x)[2]
up_out=summary(x)[5]+1.5*IQR
down_out=summary(x)[2]-1.5*IQR
out_data=c(which(x<down_out), which(x>up_out))
return (out_data)
}
male_school_score <- male_school_score %>%
filter(!score %in% remove_outliers_iqr(male_school_score$score))
female_school_score <- female_school_score %>%
filter(!score %in% remove_outliers_iqr(female_school_score$score))
Let’s check for the dimensions of the samples.
dim(male_school_score)
## [1] 9 1
dim(female_school_score)
## [1] 10 1
Seems that we hadn’t any outlier.
Now that assumptions are checked, we can perform our t-test. Hyphotesis: we’re going to test the hyphotesis assumption that females have different satisfaction with respect to males. Our null Hyphotesis will be that the satisfaction is equal for the 2 genders.
H0: mean(females) = mean(males)
Ha: mean(females) \(\neq\) mean(males)
t.test(female_school_score$score, male_school_score$score, alternative = "two.sided" )
##
## Welch Two Sample t-test
##
## data: female_school_score$score and male_school_score$score
## t = 1.6293, df = 16.621, p-value = 0.122
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09340319 0.72206985
## sample estimates:
## mean of x mean of y
## 5.741000 5.426667
Our p-value= 0.12 is greater than our significance level alpha = 0.05, so we can accept the null Hyphotesis: Females are equal satisfied by their jobs if they just have primary school as education.
Let’s now conduct the same experiment, but with college as level of education.
male_college_score <- jobsatisfaction %>%
filter(gender == "male" & education_level == "college")%>%
select(score)
female_college_score <- jobsatisfaction %>%
filter(gender == "female" & education_level == "college")%>%
select(score)
both_college_score <- jobsatisfaction %>%
filter(education_level =="college")%>%
select(score, gender)
Let’s check the assumptions now! 1st: Normality. We check with the Anderson-Darling normality test.
ad.test(male_college_score$score)
##
## Anderson-Darling normality test
##
## data: male_college_score$score
## A = 0.265, p-value = 0.5989
ad.test(female_college_score$score)
##
## Anderson-Darling normality test
##
## data: female_college_score$score
## A = 0.16973, p-value = 0.9055
At significance level 0.05 the normality is checked.
2nd: Homoschedasticity
sd(male_college_score$score)/sd(female_college_score$score)
## [1] 0.7154759
sd(female_college_score$score)/sd(male_college_score$score)
## [1] 1.397671
Checked: the ratio between the SDs is less than 2. Let’s perform also a test:
fligner.test(both_college_score$score, both_college_score$gender)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: both_college_score$score and both_college_score$gender
## Fligner-Killeen:med chi-squared = 1.0117, df = 1, p-value = 0.3145
Checked with rough roule and with test.
We need to remove outliers.
male_college_score <- male_college_score%>%
filter(!score %in% remove_outliers_iqr(male_college_score$score))
female_college_score <- female_college_score %>%
filter(!score %in% remove_outliers_iqr(female_college_score$score))
dim(male_college_score)
## [1] 9 1
dim(female_college_score)
## [1] 10 1
Again, seems that we didn’t have any outlier.
Now we can perform the t-test.
Hyphotesis: Female at college education level have different satisfaction with respect to males on the job.
H0: mean(females) = mean(males)
Ha: mean(females) \(\neq\) mean(males)
t.test(female_college_score$score, male_college_score$score, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: female_college_score$score and male_college_score$score
## t = 1.2747, df = 16.239, p-value = 0.2203
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1584331 0.6377664
## sample estimates:
## mean of x mean of y
## 6.463000 6.223333
Again, we accept the null Hyphotesis since our p-value = 0.22 is greater than the level of significance alpha = 0.05. So the level of satisfaction seems quite the same.
Now we perform the last t-test!
male_university_score <- jobsatisfaction %>%
filter(gender == "male" & education_level == "university") %>%
select(score)
female_university_score <- jobsatisfaction %>%
filter(gender == "female" & education_level == "university") %>%
select(score)
both_university_score <- jobsatisfaction %>%
filter(education_level == "university") %>%
select(score, gender)
Let’s check again the assumptions for the t-test.
1st: Normality. We use a Lilliefors(Kolmogorov–Smirnov) test.
lillie.test(female_university_score$score)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: female_university_score$score
## D = 0.13815, p-value = 0.8501
lillie.test(male_university_score$score)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: male_university_score$score
## D = 0.21077, p-value = 0.2364
Normality checked for both samples.
Homoschedasticity with rough rule:
sd(male_university_score$score)/sd(female_university_score$score)
## [1] 0.4739722
sd(female_university_score$score)/sd(male_university_score$score)
## [1] 2.109829
Seems that we have a problem by comparing the two variances with the rough rule! Let’s try using a test.
fligner.test(both_university_score$score, both_university_score$gender)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: both_university_score$score and both_university_score$gender
## Fligner-Killeen:med chi-squared = 2.2418, df = 1, p-value = 0.1343
At significance level 0.05, homoschedasticity assumption is checked since our p-value is 0.13.
We can now remove outliers!
male_university_score <- male_university_score %>%
filter(!score %in% remove_outliers_iqr(male_university_score$score))
female_university_score <- female_university_score %>%
filter(!score %in% remove_outliers_iqr(female_university_score$score))
dim(male_university_score)
## [1] 10 1
dim(female_university_score)
## [1] 10 1
Let’s perform the t-test!
H0:mean(females) = mean(males)
Ha: mean(females) \(\neq\) mean(males)
t.test(male_university_score$score, female_university_score$score, alternative = "two.sided" )
##
## Welch Two Sample t-test
##
## data: male_university_score$score and female_university_score$score
## t = 2.6994, df = 12.849, p-value = 0.01837
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1760765 1.5959235
## sample estimates:
## mean of x mean of y
## 9.292 8.406
Look! The p-value is smaller than the level of significance alpha = 0.05, so we reject the null Hyphotesis.
We want to know if there is any significant difference between the average satisfaction score in 3 different education levels! In this case, we are not interested in differences between genders!
H0: There is not statistically significant difference between the means in different education levels
Ha: At least one mean is significantly different from the others.
Let’s do some primary investigations!
ggplot(jobsatisfaction) +
geom_density(aes(x=score, fill=education_level), alpha = 0.4)+
xlab("Job satisfaction score") + ylab("Density")+
ggtitle("Score distributions according to the education level")
Each group distribution seems pretty normal, but we perform some additional tests to be sure.
shapiro.test(both_school_score$score)
##
## Shapiro-Wilk normality test
##
## data: both_school_score$score
## W = 0.97513, p-value = 0.8722
shapiro.test(both_college_score$score)
##
## Shapiro-Wilk normality test
##
## data: both_college_score$score
## W = 0.97435, p-value = 0.8588
shapiro.test(both_university_score$score)
##
## Shapiro-Wilk normality test
##
## data: both_university_score$score
## W = 0.91518, p-value = 0.08005
The normality is checked for each of the education level groups.
Now we need to check for the homogeneity of variances.
fligner.test(jobsatisfaction$score, jobsatisfaction$education_level)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: jobsatisfaction$score and jobsatisfaction$education_level
## Fligner-Killeen:med chi-squared = 4.1433, df = 2, p-value = 0.126
Checked! The p-value= 0.126 is greater than our significance level alpha = 0.05.
Let’s perform Anova.
mod <- aov(score ~ education_level , data =jobsatisfaction)
summary(mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## education_level 2 114.0 57.00 153.7 <2e-16 ***
## Residuals 55 20.4 0.37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is really close to 0, so we reject the null-Hyphotesis. This means that there is a significant difference in job satisfaction between different education levels.
Problem: Which couples of means are different from each other?
We perform a Post-hoc test in order to find out this.
TUKEY = TukeyHSD(mod)
TUKEY
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = score ~ education_level, data = jobsatisfaction)
##
## $education_level
## diff lwr upr p adj
## college-school 0.7573684 0.2814578 1.233279 0.0009414
## university-school 3.2568947 2.7869706 3.726819 0.0000000
## university-college 2.4995263 2.0296022 2.969450 0.0000000
All the p-values are smaller than our level of significance. With the university education level, however, it seems we have a stronger evidence against the null hyphotesis.
library(multcompView)
plot(TUKEY , las=1 , col="red")
We use Two-way ANOVA to evaluate simultaneously the effect of two categorical variables on a response numerical variable. So we will evaluate the effect of the 2 categorical variables(gender and education level) simultaneously on the job satisfaction score.
First things first, let’s check(again) sample sizes.
table(select(jobsatisfaction,-c(id, score)))
## education_level
## gender school college university
## male 9 9 10
## female 10 10 10
As we see, not all of the cells have equal size, so this is an UNBALANCED design.
Let’s start with preliminary investigations by doing a Two-way interaction plot.
interaction.plot(x.factor = jobsatisfaction$education_level, trace.factor = jobsatisfaction$gender,
response = jobsatisfaction$score, fun = mean, type = "b", legend = TRUE,
xlab = "Education level",
ylab = "Job Satisfaction Score",pch = c(1,19), col = c("blue", "red"))
Let’s check the assumptions for two-way anova! First assumption: normality. We need to check it for each cell.
shapiro.test(jobsatisfaction$score[jobsatisfaction$gender=="male" & jobsatisfaction$education_level == "school"]) #School
##
## Shapiro-Wilk normality test
##
## data: jobsatisfaction$score[jobsatisfaction$gender == "male" & jobsatisfaction$education_level == "school"]
## W = 0.98043, p-value = 0.9664
shapiro.test(jobsatisfaction$score[jobsatisfaction$gender == "female" & jobsatisfaction$education_level == "school"])
##
## Shapiro-Wilk normality test
##
## data: jobsatisfaction$score[jobsatisfaction$gender == "female" & jobsatisfaction$education_level == "school"]
## W = 0.96292, p-value = 0.8185
ad.test(jobsatisfaction$score[jobsatisfaction$gender=="male" & jobsatisfaction$education_level == "college"])#College
##
## Anderson-Darling normality test
##
## data: jobsatisfaction$score[jobsatisfaction$gender == "male" & jobsatisfaction$education_level == "college"]
## A = 0.265, p-value = 0.5989
ad.test(jobsatisfaction$score[jobsatisfaction$gender == "female" & jobsatisfaction$education_level == "college"])
##
## Anderson-Darling normality test
##
## data: jobsatisfaction$score[jobsatisfaction$gender == "female" & jobsatisfaction$education_level == "college"]
## A = 0.16973, p-value = 0.9055
shapiro.test(jobsatisfaction$score[jobsatisfaction$gender == "male" & jobsatisfaction$education_level=="university"])#University
##
## Shapiro-Wilk normality test
##
## data: jobsatisfaction$score[jobsatisfaction$gender == "male" & jobsatisfaction$education_level == "university"]
## W = 0.91572, p-value = 0.3226
shapiro.test(jobsatisfaction$score[jobsatisfaction$gender=="female" & jobsatisfaction$education_level =="university"])
##
## Shapiro-Wilk normality test
##
## data: jobsatisfaction$score[jobsatisfaction$gender == "female" & jobsatisfaction$education_level == "university"]
## W = 0.95044, p-value = 0.6737
The normality assumption is checked for each cell.
We need to check for similarity of variances.
library(car)
## Caricamento del pacchetto richiesto: carData
##
## Caricamento pacchetto: 'car'
## Il seguente oggetto è mascherato da 'package:dplyr':
##
## recode
leveneTest(score ~ education_level*gender, data = jobsatisfaction)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 2.197 0.06856 .
## 52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value=0.06 is greater than our significance level alpha =0.05, so our assumption is checked.
Now we can perform our Anova.
H0:The effect of education level on the satisfaction score does not depend on the gender.
Ha: For different levels of education level, the gender affects differently the job satisfaction score.
two_way_anova <- aov(score ~ education_level*gender, data = jobsatisfaction)
Anova(two_way_anova, type = "III")
## Anova Table (Type III tests)
##
## Response: score
## Sum Sq Df F value Pr(>F)
## (Intercept) 265.038 1 876.0876 < 2.2e-16 ***
## education_level 80.128 2 132.4321 < 2.2e-16 ***
## gender 0.468 1 1.5471 0.219147
## education_level:gender 4.440 2 7.3379 0.001559 **
## Residuals 15.731 52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What we can see from the table above is that there is no statistically significant difference in mean job satisfaction score between females and males (p = 0.21) but there are significant differences between different education levels (p close to 0). Also the interaction, as we see, is significant.
Now, wait: Which pair of education levels are different between them? Let’s do a post-hoc test!
Tukey = TukeyHSD(two_way_anova, which = "education_level")
Tukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = score ~ education_level * gender, data = jobsatisfaction)
##
## $education_level
## diff lwr upr p adj
## college-school 0.7573684 0.3268386 1.187898 0.0002637
## university-school 3.2568947 2.8317806 3.682009 0.0000000
## university-college 2.4995263 2.0744121 2.924641 0.0000000
All the couples, as we see, lead to reject the null Hyphotesis.