Is there a linear correlation between Age and Monthly Income?
Is there an association between Gender and Working Overtime?
#Importing data set
mydata <- read.csv("./HR_Analytics.csv")
#Keeping only the four variables I am interested in
mydata1 <- mydata[, c("Age", "Gender", "MonthlyIncome", "OverTime")]
head(mydata1)
## Age Gender MonthlyIncome OverTime
## 1 41 Female 5993 Yes
## 2 49 Male 5130 No
## 3 37 Male 2090 Yes
## 4 33 Female 2909 Yes
## 5 27 Male 3468 No
## 6 32 Male 3068 No
Explanation of the data set:
unit of observation: an employee
number of observations: 1470
number of variables: originally 35; I kept only the 4 variables I will use
source: Kaggle.com (Employee Attrition and Factors)
Description of the variables:
Age: the age of the employee (numerical)
Gender: the gender of the employee (categorical)
MonthlyIncome: the monthly income of the employee in US dollars (numerical)
OverTime: whether or not the employee works overtime (categorical)
Data manipulation
#Creating factors
mydata1$GenderF <- factor(mydata1$Gender,
levels = c("Female", "Male"),
labels = c("Female", "Male"))
mydata1$OverTimeF <- factor(mydata1$OverTime,
levels = c("Yes", "No"),
labels = c("Yes", "No"))
head(mydata1, 3)
## Age Gender MonthlyIncome OverTime GenderF OverTimeF
## 1 41 Female 5993 Yes Female Yes
## 2 49 Male 5130 No Male No
## 3 37 Male 2090 Yes Male Yes
#Showing descriptive statistics
summary(mydata1)
## Age Gender MonthlyIncome OverTime
## Min. :18.00 Length:1470 Min. : 1009 Length:1470
## 1st Qu.:30.00 Class :character 1st Qu.: 2911 Class :character
## Median :36.00 Mode :character Median : 4919 Mode :character
## Mean :36.92 Mean : 6503
## 3rd Qu.:43.00 3rd Qu.: 8379
## Max. :60.00 Max. :19999
## GenderF OverTimeF
## Female:588 Yes: 416
## Male :882 No :1054
##
##
##
##
Median of 4919 for variable “MonthlyIncome”: half of the employees have a monthly income of up to 4 919 dollars, while the other half have a monthly income higher than 4 919 dollars.
1st quartile of 30 for variable “Age”: 25% of the employees are up to 30 years old, while the other 75% are older than that.
Correlation analysis assumptions:
Variables must be numeric. (This assumption is met.)
Errors are normally distributed. (Since we have a big enough sample, we do not check this.)
Linear relationship between variables.
#Reducing the size of a sample to 300 units to check linearity assumption
mydata2 <- mydata1[sample(nrow(mydata1), 300), ]
head(mydata2)
## Age Gender MonthlyIncome OverTime GenderF OverTimeF
## 1111 35 Female 2074 Yes Female Yes
## 136 36 Male 4941 No Male No
## 423 19 Male 2564 No Male No
## 279 26 Female 6397 No Female No
## 463 34 Male 5337 No Male No
## 1353 44 Male 5033 No Male No
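Because sample() draws rows at random, the 300 selected employees (and every statistic computed from them) change on each run unless a seed is fixed first. A minimal sketch of that idea follows; the built-in mtcars data only stands in for mydata1, and the seed value 123 is an arbitrary choice of mine, not something used in the analysis above.

```r
# Illustration only: mtcars stands in for mydata1, and 123 is an arbitrary seed.
set.seed(123)
idx_first <- sample(nrow(mtcars), 10)

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
idx_second <- sample(nrow(mtcars), 10)

identical(idx_first, idx_second)
```

Calling set.seed() once before the sample() line would make the subsample and the results based on it repeatable.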
#Descriptive statistics
library(psych)
psych::describe(mydata2[ ,c("Age", "MonthlyIncome")])
## vars n mean sd median trimmed mad min max range
## Age 1 300 37.50 9.18 36.0 37.14 8.9 18 60 42
## MonthlyIncome 2 300 6858.18 5063.35 5125.5 6000.25 3579.0 1081 19999 18918
## skew kurtosis se
## Age 0.34 -0.50 0.53
## MonthlyIncome 1.28 0.58 292.33
Mean of 37.50 for variable “Age”: the average age of an employee in the sample of 300 employees is 37.50 years.
Range of 18 918 for variable “MonthlyIncome”: the difference between the maximum and minimum monthly income in the sample of 300 employees is 18 918 dollars.
#Showing scatterplot to check linearity
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata2[, c(1, 3)], smooth = FALSE)
From the scatter plot it seems like there is a linear relationship between age and monthly income, so the third assumption is met.
cor(mydata2$Age, mydata2$MonthlyIncome,
method = "pearson",
use = "complete.obs")
## [1] 0.5162647
The linear correlation between age and monthly income is positive and semi-strong (r = 0.52).
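For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this against cor(); the built-in cars data set is only a stand-in for the HR sample, so the resulting value is not the 0.52 reported above.

```r
# Illustration only: the built-in cars data stands in for Age and MonthlyIncome.
x <- cars$speed
y <- cars$dist

# Pearson r from its definition: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same coefficient from the built-in function
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```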
cor.test(mydata2$Age, mydata2$MonthlyIncome,
method = "pearson") #cor.test() has no "use" argument; incomplete pairs are dropped automatically
##
## Pearson's product-moment correlation
##
## data: mydata2$Age and mydata2$MonthlyIncome
## t = 10.406, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4280486 0.5947360
## sample estimates:
## cor
## 0.5162647
H0: Rho = 0 (the correlation is equal to 0)
H1: Rho ≠ 0 (the correlation is not equal to 0)
We reject H0 (p-value < 0.001) and conclude that there is a statistically significant linear correlation between age and monthly income.
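The test statistic and the confidence interval can be reproduced from just r and n. This is a sketch using the values printed in the output (r = 0.5162647, n = 300); the variable names are my own, and the interval uses the usual Fisher z-transformation.

```r
# Values taken from the cor.test() output
r <- 0.5162647
n <- 300

# t statistic for H0: Rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(t_stat, 3)
round(ci, 4)
```

The results should agree with the printed t = 10.406 and the interval from about 0.428 to 0.595.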
The Pearson Chi2 test (parametric test) is used because we are analyzing the association between 2 categorical variables.
Assumptions:
Observations must be independent.
All expected frequencies are greater than 1.
At most 20% of the expected frequencies can be between 1 and 5; however, this reduces the power of the test.
#Descriptive statistics
summary(mydata1[, c("GenderF","OverTimeF")])
## GenderF OverTimeF
## Female:588 Yes: 416
## Male :882 No :1054
The summary shows that among the observed employees, 588 were female and 882 were male.
The summary shows that among the observed employees, 416 worked overtime and 1054 did not.
#Pearson Chi2 test
results <- chisq.test(mydata1$GenderF, mydata1$OverTimeF,
correct = TRUE) #Yates correction for 2x2 contingency table
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata1$GenderF and mydata1$OverTimeF
## X-squared = 2.3973, df = 1, p-value = 0.1215
H0: There is no association between the two categorical variables “Gender” and “OverTime”.
H1: There is an association between the two categorical variables “Gender” and “OverTime”.
Since the p-value (0.1215) is greater than 0.05, we cannot reject H0.
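To show where X-squared = 2.3973 comes from, the statistic can be recomputed by hand from the four cell counts of the Gender by OverTime table (180, 408, 236, 646). This is a sketch with my own variable names; it applies the Yates correction by subtracting 0.5 from each absolute difference before squaring.

```r
# Observed counts of the Gender by OverTime table
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' continuity correction (2x2 table, df = 1)
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

round(x2_yates, 4)
round(p_value, 4)
```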
#Observed (empirical) frequencies
addmargins(results$observed)
## mydata1$OverTimeF
## mydata1$GenderF Yes No Sum
## Female 180 408 588
## Male 236 646 882
## Sum 416 1054 1470
#Expected (theoretical) frequencies
round(results$expected, 2)
## mydata1$OverTimeF
## mydata1$GenderF Yes No
## Female 166.4 421.6
## Male 249.6 632.4
The second assumption is met because all expected frequencies are greater than 1.
The third assumption is met because all expected frequencies are also greater than 5.
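Both frequency assumptions can also be checked programmatically instead of by reading the table. A short sketch, rebuilding the expected counts from the observed table; the object names are hypothetical.

```r
# Observed counts of the Gender by OverTime table
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))

# Expected counts as returned by chisq.test()
expected <- chisq.test(observed)$expected

# Assumption 2: every expected frequency is greater than 1
all_above_1 <- all(expected > 1)

# Assumption 3: at most 20% of the expected frequencies are below 5
share_below_5 <- mean(expected < 5)

all_above_1
share_below_5 <= 0.20
```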
#Standardized (Pearson) residuals - checking if the differences between empirical and expected frequencies are statistically significant
round(results$residuals, 2) #full element name instead of partial matching via $res
## mydata1$OverTimeF
## mydata1$GenderF Yes No
## Female 1.05 -0.66
## Male -0.86 0.54
From the table we can see that the discrepancies between the actual and expected frequencies are not statistically significant because all residuals are below 1.96 in absolute value.
However, for educational purposes, I will explain one number as if it were significant. Suppose, for example, that the standardized residual for “Female” and “Yes” was 1.98.
Explanation: the actual number of females who worked overtime is higher than expected under H0 (Alpha = 0.05).
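One detail worth noting: the object returned by chisq.test() stores two kinds of residuals. The element residuals holds Pearson residuals, (observed - expected) / sqrt(expected), which are the values shown in the table, while stdres holds the adjusted standardized residuals that also account for the row and column proportions. The sketch below prints both for the same 2x2 table; the object names are my own.

```r
# Observed counts of the Gender by OverTime table
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))

test <- chisq.test(observed, correct = TRUE)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- test$residuals

# Adjusted standardized residuals: approximately N(0, 1) under H0
adjusted_res <- test$stdres

round(pearson_res, 2)
round(adjusted_res, 2)

# Neither version exceeds the 1.96 cut-off in absolute value
all(abs(adjusted_res) < qnorm(0.975))
```

In a 2x2 table all four adjusted residuals have the same absolute value, so either version leads to the same conclusion here.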
#Proportion table 1
addmargins(round(prop.table(results$observed), 3))
## mydata1$OverTimeF
## mydata1$GenderF Yes No Sum
## Female 0.122 0.278 0.400
## Male 0.161 0.439 0.600
## Sum 0.283 0.717 1.000
Explanation of the number 0.122 - Out of 1470 employees, 12.2% (180/1470) are females who worked overtime.
Explanation of the number 0.439 - Out of 1470 employees, 43.9% (646/1470) are males who did not work overtime.
#Proportion table 2
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata1$OverTimeF
## mydata1$GenderF Yes No Sum
## Female 0.306 0.694 1.000
## Male 0.268 0.732 1.000
Explanation of the number 0.306 - Out of 588 female employees, 30.6% (180/588) worked overtime.
Explanation of the number 0.732 - Out of 882 male employees, 73.2% (646/882) did not work overtime.
#Proportion table 3
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata1$OverTimeF
## mydata1$GenderF Yes No
## Female 0.433 0.387
## Male 0.567 0.613
## Sum 1.000 1.000
Explanation of the number 0.433 - Out of 416 employees who worked overtime, 43.3% (180/416) were female.
Explanation of the number 0.613 - Out of 1054 employees who did not work overtime, 61.3% (646/1054) were male.
#Effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata1$GenderF, mydata1$OverTimeF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.03 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.03)
## [1] "tiny"
## (Rules: funder2019)
Based on the sample data, we cannot reject H0 (p = 0.1215), so we cannot conclude that there is an association between Gender and working overtime. Additionally, the effect size is tiny (Cramer's V = 0.03), which is consistent with there being at most a very weak association between the variables.
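As a rough check of the effect size, Cramer's V can be computed by hand from the uncorrected chi-square statistic. The sketch below shows the plain version and a bias-corrected version; I am assuming that the "(adj.)" label in the effectsize output refers to this small-sample bias correction, which I have not verified against the package source.

```r
# Observed counts of the Gender by OverTime table
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))

n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Chi-square statistic without the continuity correction
x2 <- unname(chisq.test(observed, correct = FALSE)$statistic)
phi2 <- x2 / n

# Plain Cramer's V
v_plain <- sqrt(phi2 / min(r - 1, k - 1))

# Bias-corrected version (assumed to correspond to the "(adj.)" output)
phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
k_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(plain = v_plain, adjusted = v_adj), 3)
```

Either way, the value stays well below the thresholds usually read as a small effect.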
Even though the assumptions for the parametric test were met, for educational purposes I will also show Fisher’s exact probability test (non-parametric test).
fisher.test(mydata1$GenderF, mydata1$OverTimeF)
##
## Fisher's Exact Test for Count Data
##
## data: mydata1$GenderF and mydata1$OverTimeF
## p-value = 0.1109
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9524004 1.5297092
## sample estimates:
## odds ratio
## 1.207461
H0: Odds ratio is equal to 1.
H1: Odds ratio is not equal to 1.
We cannot reject H0 (p-value = 0.111), so we cannot conclude that the odds of working overtime differ between genders.
interpret_oddsratio(1.21)
## [1] "very small"
## (Rules: chen2010)
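The odds ratio (OR) of 1.21 means that, in this sample, the odds of working overtime for females are about 1.21 times the odds for males. This effect is very small and the 95% confidence interval (0.95 to 1.53) includes 1, so the difference may be negligible.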
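To make the direction of the odds ratio explicit, it can be computed directly from the observed counts. This is a sketch with my own variable names; note that fisher.test() reports a conditional maximum likelihood estimate, so its value differs slightly from the simple cross-product ratio.

```r
# Observed counts of the Gender by OverTime table
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))

# Odds of working overtime within each gender
odds_female <- observed["Female", "Yes"] / observed["Female", "No"]
odds_male <- observed["Male", "Yes"] / observed["Male", "No"]

# Sample (cross-product) odds ratio: females relative to males
or_sample <- odds_female / odds_male

# Conditional estimate reported by Fisher's exact test
or_fisher <- unname(fisher.test(observed)$estimate)

round(c(sample = or_sample, fisher = or_fisher), 4)
```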