Author: Tamara Milinković
library(readxl)
mydata <- read_xlsx("DatasetEmployees.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## Education JYear City Payment Age Gender EverB Experience
## 1 Bachelors 2017 Bangalore 3 34 Male No 0
## 2 Bachelors 2013 Pune 1 28 Female No 3
## 3 Bachelors 2014 New Delhi 3 38 Female No 2
## 4 Masters 2016 Bangalore 3 27 Male No 5
## 5 Masters 2017 Pune 3 24 Male Yes 2
## 6 Bachelors 2016 Bangalore 3 22 Male No 0
This dataset contains information about employees in an Indian company, including their educational backgrounds, work history, demographics, and employment-related factors. It has been anonymized to protect privacy while still providing valuable insights into the workforce.
Source: Kaggle, Tawfik Elmetwally
Unit of observation: an employee
Sample size: 4653
Variables:
-Education: education level of an employee (Bachelors, Masters, PHD)
-JYear: the year each employee joined the company
-City: the location or city where each employee is based or works
-Payment: categorization of employees into different salary tiers (1-Low, 2-Middle, 3-High)
-Age: age of each employee
-Gender: gender of each employee
-EverB: indicates whether an employee has ever been temporarily without assigned work (Yes, No)
-Experience: years of experience employees have in their current domain
mydata$PaymentFactor <- factor(mydata$Payment,
levels = c(1, 2, 3),
labels = c("Low", "Middle", "High"))
library (psych)
describe(mydata$Experience)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4653 2.91 1.56 3 2.97 1.48 0 7 7 -0.16 -0.97 0.02
The average relevant work experience in this company is 2.91 years.
Half of the employees have up to 3 years of relevant work experience, the other half have 3 years or more.
The distribution of years of work experience is slightly left-skewed (skew = -0.16): the tail extends towards lower values, which pulls the mean (2.91) slightly below the median (3).
The range of experience is 7 (each employee has between 0 and 7 years of experience).
PARAMETRIC (independent samples t-test)
H0: μM = μF or μM - μF = 0
H1: μM ≠ μF or μM - μF ≠ 0
NON-PARAMETRIC (Wilcoxon Rank Sum Test)
H0: Location distribution of years of relevant work experience is the same for males and females.
H1: Location distribution of years of relevant work experience is different for males and females.
Assumptions:
Experience measured in years is a numerical variable - FIRST ASSUMPTION MET.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes (x=Experience)) +
geom_histogram(binwidth = 1 , colour = "pink", fill="purple")+
facet_wrap(~Gender, ncol=1)+
ylab("Frequency")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggqqplot(mydata,
"Experience",
facet.by = "Gender")
The distribution of experience does not seem normal for either males or females. Normality of the distribution of experience will be formally tested below with the Shapiro-Wilk test.
library (rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(Gender) %>%
shapiro_test(Experience)
## # A tibble: 2 × 4
## Gender variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Female Experience 0.925 1.11e-29
## 2 Male Experience 0.923 1.76e-35
H0: Experience is normally distributed for females (row 1)
H1: Experience isn’t normally distributed for females (row 1)
H0: Experience is normally distributed for males (row 2)
H1: Experience isn’t normally distributed for males (row 2)
We can reject H0 in both groups (p<0.001), and conclude that the distribution of experience is not normal for either males or females.
SECOND ASSUMPTION NOT MET.
The years of relevant work experience of one employee do not affect the work experience of any other employee, so the observations are independent - THIRD ASSUMPTION MET.
library (psych)
describeBy(mydata$Experience, mydata$Gender)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1875 2.89 1.55 3 2.95 1.48 0 7 7 -0.14 -0.96 0.04
## ------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2778 2.92 1.56 3 2.98 1.48 0 7 7 -0.18 -0.98 0.03
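The sample standard deviations of years of experience are almost identical: 1.55 for females and 1.56 for males (variances of 2.40 and 2.43). A small difference in sample variances does not by itself show that the population variances differ; this would have to be tested formally (for example with Levene's test). Since it was not tested, equality of variances is not assumed - FOURTH ASSUMPTION TREATED AS NOT MET.
Therefore, the Welch correction, which does not require equal variances, will be applied inside the independent samples t-test.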
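The equality of variances discussed above could be checked formally rather than by comparing the two standard deviations. The following is a minimal sketch of the median-centred (Brown-Forsythe) variant of Levene's test in base R. The data frame `sim` is simulated, hypothetical stand-in data, not the employee dataset; with the real data, `car::leveneTest(Experience ~ factor(Gender), data = mydata)` should give the equivalent test.

```r
# Hypothetical stand-in data (NOT the employee dataset), used only so the sketch runs on its own
set.seed(1)
sim <- data.frame(
  Experience = c(sample(0:7, 200, replace = TRUE),
                 sample(0:7, 300, replace = TRUE)),
  Gender     = rep(c("Female", "Male"), times = c(200, 300))
)

# Absolute deviation of each observation from its own group median
sim$AbsDev <- abs(sim$Experience - ave(sim$Experience, sim$Gender, FUN = median))

# One-way ANOVA on the absolute deviations: H0 = equal spread in both groups
levene_fit <- anova(lm(AbsDev ~ Gender, data = sim))
levene_p   <- levene_fit[["Pr(>F)"]][1]
levene_p
```

A large p-value would mean that equal variances cannot be rejected; the Welch test remains valid in either case.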
#independent samples t-test - assuming normality (parametric)
t.test(mydata$Experience~mydata$Gender,
var.equal=FALSE,
alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$Experience by mydata$Gender
## t = -0.59712, df = 4037.1, p-value = 0.5505
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -0.11899069 0.06343072
## sample estimates:
## mean in group Female mean in group Male
## 2.889067 2.916847
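As a rough check, the Welch statistic can be recomputed by hand from the group summaries printed above. This is only a sketch: the standard deviations are the rounded values from describeBy(), so the result agrees with the t.test() output only approximately.

```r
# Summary statistics copied from the describeBy() and t.test() output (SDs are rounded)
n_f <- 1875; mean_f <- 2.889067; sd_f <- 1.55   # Female
n_m <- 2778; mean_m <- 2.916847; sd_m <- 1.56   # Male

# Squared standard error contributed by each group
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (mean_f - mean_m) / sqrt(v_f + v_m)
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

round(c(t = t_welch, df = df_welch, p = p_welch), 3)
```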
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata$Experience~mydata$Gender,
pooled_sd=FALSE)
## Cohen's d | 95% CI
## -------------------------
## -0.02 | [-0.08, 0.04]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.02, rules="sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
Based on the sample data, we cannot reject the null hypothesis (p = 0.551). We cannot say that males and females differ in the average years of relevant work experience. The effect size is tiny (d = -0.02).
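The size of Cohen's d can be reproduced approximately from the group summaries. The sketch below assumes that the un-pooled standard deviation is the square root of the average of the two group variances, which is my understanding of what `pooled_sd = FALSE` does in effectsize; the standard deviations are the rounded values from describeBy().

```r
# Group means from t.test(), rounded SDs from describeBy()
mean_f <- 2.889067; sd_f <- 1.55   # Female
mean_m <- 2.916847; sd_m <- 1.56   # Male

# Un-pooled SD: root of the mean of the two variances (assumption, see above)
sd_unpooled <- sqrt((sd_f^2 + sd_m^2) / 2)

# Standardised mean difference (Female minus Male)
d <- (mean_f - mean_m) / sd_unpooled
round(d, 2)
```

The difference of about 0.03 years is only about 2% of one standard deviation, which is why the effect is classified as tiny.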
#Wilcoxon Rank Sum Test - normality not met (non-parametric)
wilcox.test(mydata$Experience~mydata$Gender,
correct=FALSE,
exact=FALSE,
alternative="two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Experience by mydata$Gender
## W = 2576116, p-value = 0.5221
## alternative hypothesis: true location shift is not equal to 0
library(effectsize)
effectsize(wilcox.test(mydata$Experience~mydata$Gender,
correct=FALSE,
exact=FALSE,
alternative="two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.01 | [-0.04, 0.02]
interpret_rank_biserial(0.01)
## [1] "tiny"
## (Rules: funder2019)
We cannot reject the null hypothesis (p = 0.522). We cannot say that males and females differ in the distribution location of years of relevant work experience. The effect size is tiny (r = -0.01).
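The rank-biserial correlation can be recovered from the W statistic and the two group sizes. This sketch uses the common relation r = 2W / (n1 * n2) - 1, where W is the statistic R reports for the first group (Female); it is a hand check, not the internal code of effectsize.

```r
# Values copied from the wilcox.test() and describeBy() output
W   <- 2576116   # rank sum statistic for the first group (Female)
n_f <- 1875      # number of females
n_m <- 2778      # number of males

# W / (n_f * n_m) is the share of female-male pairs in which the female value is higher
# (ties count as one half); rescaling it to [-1, 1] gives the rank-biserial correlation
r_rb <- 2 * W / (n_f * n_m) - 1
round(r_rb, 2)
```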
In this case, the non-parametric test is more suitable, because the Shapiro-Wilk test showed that experience is not normally distributed in either of the two groups.
Based on the Wilcoxon Rank Sum Test, we cannot reject the null hypothesis, so we cannot say that males and females differ in the distribution location of years of relevant work experience.
H0: ρAGE,EXP = 0
H1: ρAGE,EXP ≠ 0
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata[, c(5,8) ], smooth=FALSE)
cor(mydata$Age, mydata$Experience,
method="pearson")
## [1] -0.1346429
The linear relationship between Age and Experience is negative and weak (r = -0.13).
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c(5,8) ]),
type="pearson")
## Age Experience
## Age 1.00 -0.13
## Experience -0.13 1.00
##
## n= 4653
##
##
## P
## Age Experience
## Age 0
## Experience 0
We can reject H0 at p<0.001 (the p-value is printed as 0 because of rounding).
There is a linear relationship between Age and Experience, and it is negative and weak (r = -0.13, p<0.001).
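The p-value behind this conclusion can be recomputed from the correlation coefficient and the sample size with the usual t statistic for a Pearson correlation. A minimal sketch, using only the values printed above:

```r
# Values copied from the cor() and rcorr() output
r <- -0.1346429
n <- 4653

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_r <- 2 * pt(-abs(t_r), df = n - 2)

c(t = t_r, p = p_r)
```

With a sample of this size even a weak correlation is clearly different from zero, so the small p-value says little about the strength of the relationship.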
H0: There is no association between Education and City.
H1: There is an association between Education and City.
Assumptions:
Observations are independent of each other: each employee falls into exactly one combination of education level and city, since one employee can’t have two education levels or be based in two cities at the same time - ASSUMPTION 1 MET.
All expected frequencies are greater than 5; in larger tables, up to 20% of expected frequencies may lie between 1 and 5.
results <- chisq.test(mydata$Education, mydata$City)
results
##
## Pearson's Chi-squared test
##
## data: mydata$Education and mydata$City
## X-squared = 930.76, df = 4, p-value < 2.2e-16
We can reject the null hypothesis at p<0.001. There is an association between Education and City.
addmargins(results$observed)
## mydata$City
## mydata$Education Bangalore New Delhi Pune Sum
## Bachelors 2052 537 1012 3601
## Masters 124 517 232 873
## PHD 52 103 24 179
## Sum 2228 1157 1268 4653
Out of 4653 employees, 2052 are based in Bangalore and have a Bachelors degree, 517 are based in New Delhi and have a Masters degree, and 24 are based in Pune and have a PHD.
Out of 4653 employees, 2228 are based in Bangalore, and 179 have a PHD.
round(results$expected, 2)
## mydata$City
## mydata$Education Bangalore New Delhi Pune
## Bachelors 1724.27 895.41 981.32
## Masters 418.02 217.08 237.90
## PHD 85.71 44.51 48.78
All expected frequencies are greater than 5 (the smallest is 44.51) - ASSUMPTION 2 MET.
If there were no association between Education and City, we would expect to see 1724.27 employees who are based in Bangalore and have a Bachelors degree. In reality there are 2052 of them.
If there were no association between Education and City, we would expect to see 48.78 employees who are based in Pune and have a PHD. In reality there are 24 of them.
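The expected frequencies and the chi-squared statistic can be reproduced directly from the observed table, without the raw data. The sketch below re-enters the observed counts printed above and applies the textbook formulas (expected = row total × column total / n); the object names are my own.

```r
# Observed counts copied from addmargins(results$observed)
observed <- matrix(
  c(2052, 537, 1012,
     124, 517,  232,
      52, 103,   24),
  nrow = 3, byrow = TRUE,
  dimnames = list(Education = c("Bachelors", "Masters", "PHD"),
                  City      = c("Bangalore", "New Delhi", "Pune"))
)

n <- sum(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom and p-value
chi_sq  <- sum((observed - expected)^2 / expected)
dof     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

round(expected, 2)
round(chi_sq, 2)
```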
round (results$stdres, 2)
## mydata$City
## mydata$Education Bangalore New Delhi Pune
## Bachelors 22.99 -29.06 2.42
## Masters -22.10 26.06 -0.50
## PHD -5.14 10.31 -4.24
-BANGALORE
In the combination Bangalore and Bachelors, there are more employees than expected at α=0.1%.
In the combination Bangalore and Masters, there are fewer employees than expected at α=0.1%.
In the combination Bangalore and PHD, there are fewer employees than expected at α=0.1% (the standardized residual is negative, -5.14).
-NEW DELHI
In the combination New Delhi and Bachelors, there are fewer employees than expected at α=0.1%.
In the combination New Delhi and Masters, there are more employees than expected at α=0.1%.
In the combination New Delhi and PHD, there are more employees than expected at α=0.1%.
-PUNE
In the combination Pune and Bachelors, there are more employees than expected at α=5%.
We can’t conclude that there is a difference between expected and empirical frequencies in the combination Pune and Masters.
In the combination Pune and PHD, there are fewer employees than expected at α=0.1%.
Therefore, Bangalore has more employees with a Bachelors degree and fewer with a Masters degree or PHD than expected, while New Delhi has more employees with a Masters degree or PHD and fewer with a Bachelors degree than expected.
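The cell-by-cell reading above compares each standardized residual with the two-sided normal critical values 1.96 (α=5%), 2.58 (α=1%) and 3.29 (α=0.1%). The sketch below recomputes the standardized residuals from the observed counts and labels each cell with its direction and the strictest level it reaches; the helper objects (`crit`, `level`, `direction`) are my own names.

```r
# Observed counts copied from addmargins(results$observed)
observed <- matrix(
  c(2052, 537, 1012,
     124, 517,  232,
      52, 103,   24),
  nrow = 3, byrow = TRUE,
  dimnames = list(Education = c("Bachelors", "Masters", "PHD"),
                  City      = c("Bangalore", "New Delhi", "Pune"))
)

n     <- sum(observed)
row_p <- rowSums(observed) / n   # share of each education level
col_p <- colSums(observed) / n   # share of each city

expected <- outer(row_p, col_p) * n

# Standardized residual: (O - E) / sqrt(E * (1 - row share) * (1 - column share))
std_res <- (observed - expected) / sqrt(expected * outer(1 - row_p, 1 - col_p))

# Strictest significance level reached by |residual| ("n.s." = not significant at 5%)
crit  <- c("5%" = 1.96, "1%" = 2.58, "0.1%" = 3.29)
level <- matrix(c("n.s.", names(crit))[findInterval(abs(std_res), crit) + 1],
                nrow = nrow(observed), dimnames = dimnames(observed))

# Direction of the deviation from the expected count
direction <- ifelse(std_res > 0, "more", "fewer")

round(std_res, 2)
level
direction
```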
library (effectsize)
effectsize::cramers_v(mydata$City, mydata$Education)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.32 | [0.30, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.32)
## [1] "large"
## (Rules: funder2019)
The education level of employees varies with the city where they are based (p<0.001). The association between Education and City is large (Cramér's V = 0.32).
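Cramér's V can be checked by hand from the chi-squared statistic. The sketch below computes the plain version and a bias-corrected version; the correction shown is the Bergsma (2013) adjustment, which I assume is what the "(adj.)" label in the effectsize output refers to. With a sample this large the two agree to two decimals.

```r
# Values copied from the chisq.test() output
chi_sq <- 930.76
n      <- 4653
n_row  <- 3   # education levels
n_col  <- 3   # cities

# Plain Cramer's V
v_plain <- sqrt(chi_sq / (n * (min(n_row, n_col) - 1)))

# Bias-corrected version (assumed to match the "adj." output, see above)
phi2_adj  <- max(0, chi_sq / n - (n_row - 1) * (n_col - 1) / (n - 1))
n_row_adj <- n_row - (n_row - 1)^2 / (n - 1)
n_col_adj <- n_col - (n_col - 1)^2 / (n - 1)
v_adj     <- sqrt(phi2_adj / (min(n_row_adj, n_col_adj) - 1))

round(c(plain = v_plain, adjusted = v_adj), 2)
```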