Tamara Milinković HW1

Author: Tamara Milinković

1 Data import

library(readxl)
mydata <- read_xlsx("DatasetEmployees.xlsx")

mydata <- as.data.frame(mydata)

head(mydata)

##   Education JYear      City Payment Age Gender EverB Experience
## 1 Bachelors  2017 Bangalore       3  34   Male    No          0
## 2 Bachelors  2013      Pune       1  28 Female    No          3
## 3 Bachelors  2014 New Delhi       3  38 Female    No          2
## 4   Masters  2016 Bangalore       3  27   Male    No          5
## 5   Masters  2017      Pune       3  24   Male   Yes          2
## 6 Bachelors  2016 Bangalore       3  22   Male    No          0

2 Data description

This dataset contains information about employees in an Indian company, including their educational backgrounds, work history, demographics, and employment-related factors. It has been anonymized to protect privacy while still providing valuable insights into the workforce.

Source: Kaggle, Tawfik Elmetwally

Unit of observation: an employee

Sample size: 4653

Variables:

-Education: education level of an employee (Bachelors, Masters, PHD)

-JYear: the year each employee joined the company

-City: the location or city where each employee is based or works

-Payment: categorization of employees into different salary tiers (1-Low, 2-Middle, 3-High)

-Age: age of each employee

-Gender: gender of each employee

-EverB: indicates whether an employee has ever been temporarily without assigned work (Yes, No)

-Experience: years of experience employees have in their current domain

mydata$PaymentFactor <- factor(mydata$Payment, 
                             levels = c(1, 2, 3), 
                             labels = c("Low", "Middle", "High"))

library (psych)
describe(mydata$Experience)

##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 4653 2.91 1.56      3    2.97 1.48   0   7     7 -0.16    -0.97 0.02

The average relevant work experience in this company is 2.91 years.

Half of the employees have 3 or fewer years of relevant work experience, and the other half have 3 or more.

The distribution of years of work experience is slightly left-skewed (skew = -0.16), with the mean (2.91) just below the median (3).

The range of experience is 7 (each employee has between 0 and 7 years of experience).

3.1 Research question 1: Do males and females in this company differ in their relevant work experience?

PARAMETRIC (independent samples t-test)

H0: μM = μF or μM - μF = 0

H1: μM ≠ μF or μM - μF ≠ 0

NON-PARAMETRIC (Wilcoxon Rank Sum Test)

H0: The distribution location of years of relevant work experience is the same for males and females.

H1: The distribution location of years of relevant work experience is different for males and females.

3.1.1 Analysis

Assumptions:

Variable is numeric.

Experience measured in years is a numerical variable - FIRST ASSUMPTION MET.

Distribution of variable is normal in both populations.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes (x=Experience)) +
  geom_histogram(binwidth = 1 , colour = "pink", fill="purple")+
  facet_wrap(~Gender, ncol=1)+
  ylab("Frequency")

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.2

ggqqplot(mydata,
         "Experience",
         facet.by = "Gender")

The distribution of experience does not seem normal for either males or females. Normality of the distribution of experience will be formally tested below with the Shapiro-Wilk test.

library (rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(Gender) %>%
  shapiro_test(Experience)

## # A tibble: 2 × 4
##   Gender variable   statistic        p
##   <chr>  <chr>          <dbl>    <dbl>
## 1 Female Experience     0.925 1.11e-29
## 2 Male   Experience     0.923 1.76e-35

H0: Experience is normally distributed for females (row 1)

H1: Experience isn’t normally distributed for females (row 1)

H0: Experience is normally distributed for males (row 2)

H1: Experience isn’t normally distributed for males (row 2)

We can reject H0 in both groups (p<0.001) and conclude that the distribution of experience is not normal for either males or females.

SECOND ASSUMPTION NOT MET.

Data must come from two independent populations.

The sample consists of different employees, and the work experience of one employee does not affect the work experience of any other employee, so the male and female groups are independent - THIRD ASSUMPTION MET.

Variable has the same variance in both populations.

library (psych)
describeBy(mydata$Experience, mydata$Gender)

## 
##  Descriptive statistics by group 
## group: Female
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 1875 2.89 1.55      3    2.95 1.48   0   7     7 -0.14    -0.96 0.04
## ------------------------------------------------------------ 
## group: Male
##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 2778 2.92 1.56      3    2.98 1.48   0   7     7 -0.18    -0.98 0.03

The standard deviation of years of experience is 1.55 for females and 1.56 for males (variances of 2.40 and 2.43), so the sample variances are almost identical. Descriptive statistics alone cannot establish whether the population variances differ; a formal test (for example Levene's test) would be needed - FOURTH ASSUMPTION NOT FORMALLY TESTED.

To avoid relying on equal variances, Welch correction will be applied inside the independent samples t-test.
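As a rough check of the equal-variance assumption, the ratio of the two sample variances can be compared with an F distribution. The sketch below uses only the rounded standard deviations and group sizes printed by describeBy() above, so the result is approximate. The F test is also sensitive to non-normal data, which is an issue here, so it should be read as an illustration rather than as the formal test for this dataset.

```r
# Group sizes and standard deviations as printed by describeBy() above
# (rounded to two decimals, so the result is approximate).
n_female <- 1875; sd_female <- 1.55
n_male   <- 2778; sd_male   <- 1.56

# Ratio of the larger to the smaller sample variance
f_ratio <- sd_male^2 / sd_female^2

# Two-sided p-value from the F distribution
p_upper <- pf(f_ratio, df1 = n_male - 1, df2 = n_female - 1, lower.tail = FALSE)
p_value <- 2 * min(p_upper, 1 - p_upper)

round(c(f_ratio = f_ratio, p_value = p_value), 3)
```

A variance ratio this close to 1 gives no indication that the two variances differ. Applying the Welch correction anyway is a safe choice, because it does not require equal variances.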

#independent samples t-test - assuming normality (parametric)

t.test(mydata$Experience~mydata$Gender,
       var.equal=FALSE,
       alternative="two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$Experience by mydata$Gender
## t = -0.59712, df = 4037.1, p-value = 0.5505
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.11899069  0.06343072
## sample estimates:
## mean in group Female   mean in group Male 
##             2.889067             2.916847

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(mydata$Experience~mydata$Gender, 
                     pooled_sd=FALSE)

## Cohen's d |        95% CI
## -------------------------
## -0.02     | [-0.08, 0.04]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.02, rules="sawilowsky2009")

## [1] "tiny"
## (Rules: sawilowsky2009)

Based on the sample data, we can’t reject the null hypothesis (p=0.551). We can’t say that males and females differ in the average years of relevant work experience. The effect size is tiny (d=-0.02).
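The Welch statistic, its degrees of freedom and the un-pooled Cohen's d can be reconstructed by hand from the group summaries. The sketch below takes the means from the t.test() output and the rounded standard deviations from describeBy(), so the values agree with the output above only approximately. The un-pooled d is assumed here to divide the mean difference by the square root of the average of the two variances.

```r
# Group summaries taken from the outputs above (SDs are rounded).
n_f <- 1875; mean_f <- 2.889067; sd_f <- 1.55
n_m <- 2778; mean_m <- 2.916847; sd_m <- 1.56

# Squared standard error contributed by each group
var_term_f <- sd_f^2 / n_f
var_term_m <- sd_m^2 / n_m

# Welch t statistic: mean difference over the un-pooled standard error
t_stat <- (mean_f - mean_m) / sqrt(var_term_f + var_term_m)

# Welch-Satterthwaite approximation of the degrees of freedom
welch_df <- (var_term_f + var_term_m)^2 /
  (var_term_f^2 / (n_f - 1) + var_term_m^2 / (n_m - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

# Cohen's d with un-pooled SD (assumed definition: average of the variances)
d_unpooled <- (mean_f - mean_m) / sqrt((sd_f^2 + sd_m^2) / 2)

round(c(t = t_stat, df = welch_df, p = p_value, d = d_unpooled), 3)
```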

#Wilcoxon Rank Sum Test - normality not met (non-parametric)

wilcox.test(mydata$Experience~mydata$Gender,
            correct=FALSE,
            exact=FALSE,
            alternative="two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Experience by mydata$Gender
## W = 2576116, p-value = 0.5221
## alternative hypothesis: true location shift is not equal to 0

library(effectsize)
effectsize(wilcox.test(mydata$Experience~mydata$Gender,
                       correct=FALSE,
                       exact=FALSE,
                       alternative="two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.01             | [-0.04, 0.02]

interpret_rank_biserial(0.01)

## [1] "tiny"
## (Rules: funder2019)

We can’t reject the null hypothesis (p=0.522). We can’t say that males and females differ in the distribution location of years of relevant work experience. The effect size is tiny (r=-0.01).
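The rank-biserial correlation can be recovered directly from the W statistic and the two group sizes. The sketch below assumes that W is the statistic for the first group (Female), as wilcox.test() reports it, and uses the group sizes from describeBy().

```r
# W statistic from wilcox.test() and group sizes from describeBy()
w_stat   <- 2576116
n_female <- 1875
n_male   <- 2778

# Share of female-male pairs in which the female has more experience
# (ties count as one half)
prop_female_higher <- w_stat / (n_female * n_male)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rank_biserial <- 2 * prop_female_higher - 1

round(c(prop = prop_female_higher, r = r_rank_biserial), 3)
```

A share very close to 0.5 means that a randomly chosen female employee is about as likely to have more experience than a randomly chosen male employee as the reverse, which is why the effect is tiny.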

3.1.2 Conclusion

In this case, the non-parametric test is more suitable because the Shapiro-Wilk test showed that the variable Experience is not normally distributed in either of the two groups.

Based on the Wilcoxon Rank Sum Test, we cannot reject the null hypothesis (p=0.522), so we cannot say that males and females differ in the distribution location of years of relevant work experience.

3.2 Research question 2: Is there a relationship between employee’s age and years of relevant experience in this company?

H0: ρAGE,EXP = 0

H1: ρAGE,EXP ≠ 0

3.2.1 Analysis

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydata[, c(5,8) ], smooth=FALSE)

cor(mydata$Age, mydata$Experience,
    method="pearson")

## [1] -0.1346429

The linear relationship between Age and Experience is negative and weak (r = -0.13).

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[, c(5,8)]),
      type="pearson")

##              Age Experience
## Age         1.00      -0.13
## Experience -0.13       1.00
## 
## n= 4653 
## 
## 
## P
##            Age Experience
## Age             0        
## Experience  0

We can reject H0 (p<0.001). We found a linear relationship between Age and Experience, and it is negative and weak.
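The rcorr() output rounds the p-value to 0, so it can help to see how small it is. A minimal sketch, using only the correlation coefficient and the sample size reported above, computes the t statistic behind the test of H0: ρ = 0 and the share of variance the two variables have in common.

```r
# Pearson correlation and sample size from the outputs above
r <- -0.1346429
n <- 4653

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance that Age and Experience have in common
r_squared <- r^2

signif(c(t = t_stat, p = p_value, r_squared = r_squared), 3)
```

With a sample this large, even a weak correlation is statistically significant; Age and Experience share less than 2% of their variance.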

3.2.2 Conclusion

We found a negative linear relationship between Age and Experience (p<0.001).

3.3 Research question 3: Does education level of an employee vary with the city they are based in?

H0: There is no association between Education and City.

H1: There is an association between Education and City.

Assumptions:

Observations are independent of each other. Each employee has exactly one education level and is based in exactly one city, so every employee is counted in only one cell - ASSUMPTION 1 MET.

All expected frequencies are bigger than 5, or in larger tables, up to 20% of expected frequencies can lie between 1 and 5.

3.3.1 Analysis

results <- chisq.test(mydata$Education, mydata$City)
results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Education and mydata$City
## X-squared = 930.76, df = 4, p-value < 2.2e-16

We can reject the null hypothesis (p<0.001). We found an association between Education and City.

addmargins(results$observed)

##                 mydata$City
## mydata$Education Bangalore New Delhi Pune  Sum
##        Bachelors      2052       537 1012 3601
##        Masters         124       517  232  873
##        PHD              52       103   24  179
##        Sum            2228      1157 1268 4653

Out of 4653 employees, 2052 are based in Bangalore and have a Bachelors degree, 517 are based in New Delhi and have a Masters degree, and 24 are based in Pune and have a PHD.

Out of 4653 employees, 2228 are based in Bangalore, and 179 have a PHD.

round(results$expected, 2)

##                 mydata$City
## mydata$Education Bangalore New Delhi   Pune
##        Bachelors   1724.27    895.41 981.32
##        Masters      418.02    217.08 237.90
##        PHD           85.71     44.51  48.78

All expected frequencies are bigger than 5 - ASSUMPTION 2 MET.

If there were no association between Education and City, we would expect to see 1724.27 employees who are based in Bangalore and have a Bachelors degree. In reality there are 2052 of them.

If there were no association between Education and City, we would expect to see 48.78 employees who are based in Pune and have a PHD. In reality there are 24 of them.
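The expected frequencies follow from the margins of the observed table: each one is the row total times the column total, divided by the sample size. The sketch below re-enters the observed counts printed above by hand (the object names are illustrative) and rebuilds both the expected table and the chi-squared statistic.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(2052, 537, 1012,
     124, 517,  232,
      52, 103,   24),
  nrow = 3, byrow = TRUE,
  dimnames = list(Education = c("Bachelors", "Masters", "PHD"),
                  City = c("Bangalore", "New Delhi", "Pune"))
)

n_total <- sum(observed)

# Expected count = row total * column total / sample size
expected <- outer(rowSums(observed), colSums(observed)) / n_total

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells
chi_sq <- sum((observed - expected)^2 / expected)

round(expected, 2)
round(chi_sq, 2)
```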

round (results$stdres, 2)

##                 mydata$City
## mydata$Education Bangalore New Delhi   Pune
##        Bachelors     22.99    -29.06   2.42
##        Masters      -22.10     26.06  -0.50
##        PHD           -5.14     10.31  -4.24

-BANGALORE

In the combination Bangalore and Bachelors, there are more employees than expected at α=0.1%.

In the combination Bangalore and Masters, there are fewer employees than expected at α=0.1%.

In the combination Bangalore and PHD, there are fewer employees than expected at α=0.1%.

-NEW DELHI

In the combination New Delhi and Bachelors, there are fewer employees than expected at α=0.1%.

In the combination New Delhi and Masters, there are more employees than expected at α=0.1%.

In the combination New Delhi and PHD, there are more employees than expected at α=0.1%.

-PUNE

In the combination Pune and Bachelors, there are more employees than expected at α=5%.

We can’t conclude that there is a difference between expected and empirical frequencies in the combination Pune and Masters.

In the combination Pune and PHD, there are fewer employees than expected at α=0.1%.

Therefore, Bangalore has more employees with a Bachelors degree than expected, while New Delhi has more employees with a Masters degree or a PHD than expected.
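The cell-by-cell reading of the standardized residuals can be made mechanical by comparing each residual with the two-sided critical values of the standard normal distribution (about 1.96 for α=5% and 3.29 for α=0.1%). The sketch below re-enters the observed counts by hand; the labels and object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(2052, 537, 1012,
     124, 517,  232,
      52, 103,   24),
  nrow = 3, byrow = TRUE,
  dimnames = list(Education = c("Bachelors", "Masters", "PHD"),
                  City = c("Bangalore", "New Delhi", "Pune"))
)

std_res <- chisq.test(observed)$stdres

# Two-sided critical values of the standard normal distribution
crit_5  <- qnorm(1 - 0.05 / 2)    # about 1.96
crit_01 <- qnorm(1 - 0.001 / 2)   # about 3.29

# Sign gives the direction, absolute size gives the significance level
direction <- ifelse(std_res > 0, "more than expected", "fewer than expected")
level <- ifelse(abs(std_res) > crit_01, "alpha = 0.1%",
         ifelse(abs(std_res) > crit_5, "alpha = 5%", "not significant"))

direction
level
```

A negative residual means fewer employees than expected in that cell, so the sign should always be checked before the size.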

library (effectsize)
effectsize::cramers_v(mydata$City, mydata$Education)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.32              | [0.30, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.32)

## [1] "large"
## (Rules: funder2019)
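Cramer's V rescales the chi-squared statistic by the sample size and the smaller table dimension. The sketch below computes the plain (unadjusted) version from the values reported above; cramers_v() reports a bias-adjusted estimate, so a small difference in the third decimal is to be expected.

```r
# Values taken from the chi-squared test output above
chi_sq  <- 930.76
n_total <- 4653
n_rows  <- 3   # education levels
n_cols  <- 3   # cities

# Unadjusted Cramer's V
v_unadjusted <- sqrt(chi_sq / (n_total * min(n_rows - 1, n_cols - 1)))

round(v_unadjusted, 3)
```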

3.3.2 Conclusion

Education level of employees varies with the city in which they are based (p<0.001). The association between Education and City is large (Cramer's V = 0.32).