Homework 2

Research questions:

Is there a linear correlation between Age and Monthly Income?
Is there an association between Gender and Working Overtime?

#Importing data set
mydata <- read.csv("./HR_Analytics.csv")

#Keeping only the four variables I am interested in

mydata1 <- mydata[, c("Age", "Gender", "MonthlyIncome", "OverTime")]

head(mydata1)

##   Age Gender MonthlyIncome OverTime
## 1  41 Female          5993      Yes
## 2  49   Male          5130       No
## 3  37   Male          2090      Yes
## 4  33 Female          2909      Yes
## 5  27   Male          3468       No
## 6  32   Male          3068       No

Explanation of the data set:

unit of observation: an employee
number of observations: 1470
number of variables: originally 35; only the 4 variables used in the analysis are kept
source: Kaggle.com (Employee Attrition and Factors)

Description of the variables:

Age: the age of the employee (numerical)
Gender: the gender of the employee (categorical)
MonthlyIncome: the monthly income of the employee in US dollars (numerical)
OverTime: whether or not the employee works overtime (categorical)

Data manipulation

#Creating factors
mydata1$GenderF <- factor(mydata1$Gender,
                          levels = c("Female", "Male"),
                          labels = c("Female", "Male"))

mydata1$OverTimeF <- factor(mydata1$OverTime,
                          levels = c("Yes", "No"),
                          labels = c("Yes", "No"))

head(mydata1, 3)

##   Age Gender MonthlyIncome OverTime GenderF OverTimeF
## 1  41 Female          5993      Yes  Female       Yes
## 2  49   Male          5130       No    Male        No
## 3  37   Male          2090      Yes    Male       Yes

#Showing descriptive statistics
summary(mydata1)

##       Age           Gender          MonthlyIncome     OverTime        
##  Min.   :18.00   Length:1470        Min.   : 1009   Length:1470       
##  1st Qu.:30.00   Class :character   1st Qu.: 2911   Class :character  
##  Median :36.00   Mode  :character   Median : 4919   Mode  :character  
##  Mean   :36.92                      Mean   : 6503                     
##  3rd Qu.:43.00                      3rd Qu.: 8379                     
##  Max.   :60.00                      Max.   :19999                     
##    GenderF    OverTimeF 
##  Female:588   Yes: 416  
##  Male  :882   No :1054  
##                         
##                         
##                         
##

Median of 4919 for variable “MonthlyIncome”: half of the employees have a monthly income of up to 4 919 dollars, while the other half have a monthly income higher than 4 919 dollars.
1st quartile of 30 for variable “Age”: 25% of the employees are up to 30 years old, while the other 75% of the employees are older than that.
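
To make the reading of the median and the 1st quartile concrete, here is a minimal sketch on a small vector. It uses only the six MonthlyIncome values printed by head(mydata1) above, so the numbers are for illustration and are not the full-sample statistics; the object names (income, med, q1) are my own.

```r
# First six MonthlyIncome values shown by head(mydata1) (illustration only)
income <- c(5993, 5130, 2090, 2909, 3468, 3068)

# Median: the value that splits the sorted data into two halves
med <- median(income)

# Share of values at or below the median (should be one half)
share_at_or_below <- mean(income <= med)

# 1st quartile: about 25% of values lie at or below it (R's default, type 7)
q1 <- quantile(income, probs = 0.25, names = FALSE)

med
share_at_or_below
q1
```

With an even number of values the median is the average of the two middle values, which is why it does not have to be one of the observed incomes.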

1st RQ: Is there a linear correlation between Age and Monthly Income? (2 numerical variables)

Correlation analysis assumptions:

Variables must be numeric. (this assumption is met)
Errors are normally distributed. (since we have a big enough sample, we don’t check this)
Linear relationship between variables.

#Reducing the size of a sample to 300 units to check linearity assumption

mydata2 <- mydata1[sample(nrow(mydata1), 300), ]

head(mydata2)

##      Age Gender MonthlyIncome OverTime GenderF OverTimeF
## 432   54 Female          3780       No  Female        No
## 1354  34   Male          2307      Yes    Male       Yes
## 1197  41   Male          7082      Yes    Male       Yes
## 211   32   Male         10400       No    Male        No
## 1080  39 Female          8376       No  Female        No
## 865   41   Male          2107       No    Male        No
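
Because sample() draws rows at random, rerunning the chunk above gives a different subsample each time. A minimal sketch of how a seed makes the draw repeatable is shown here; the data frame toy and the seed 123 are hypothetical and were not used in this homework.

```r
# Toy data frame standing in for mydata1 (hypothetical values)
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

set.seed(123)                          # arbitrary seed, chosen only for illustration
draw1 <- toy[sample(nrow(toy), 5), ]   # 5 rows drawn without replacement

set.seed(123)                          # resetting the same seed
draw2 <- toy[sample(nrow(toy), 5), ]   # repeats exactly the same draw

identical(draw1, draw2)
```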

#Descriptive statistics
library(psych)
psych::describe(mydata2[ ,c("Age", "MonthlyIncome")])

##               vars   n    mean      sd median trimmed     mad  min   max range
## Age              1 300   36.84    9.60   35.0   36.30   10.38   19    60    41
## MonthlyIncome    2 300 6335.84 4617.01 4692.5 5463.04 3105.31 1091 19926 18835
##               skew kurtosis     se
## Age           0.47    -0.58   0.55
## MonthlyIncome 1.44     1.27 266.56

Mean of 36.84 for variable “Age”: the average age of the employee in a sample of 300 employees is 36.84 years.
Range of 18 835 for variable “MonthlyIncome”: the difference between the minimum and the maximum monthly income is 18 835$ in a sample of 300 employees.

#Showing scatterplot to check linearity

library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydata2[, c(1, 3)], smooth = FALSE)

From the scatter plot it seems like there is a linear relationship between age and monthly income, so the third assumption is met.

cor(mydata2$Age, mydata2$MonthlyIncome,
    method = "pearson",
    use = "complete.obs")

## [1] 0.4861412

The linear correlation between age and monthly income is positive and semi-strong (r = 0.49).

cor.test(mydata2$Age, mydata2$MonthlyIncome,
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  mydata2$Age and mydata2$MonthlyIncome
## t = 9.6033, df = 298, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3946249 0.5681072
## sample estimates:
##       cor 
## 0.4861412

H0: Rho = 0 (the correlation is equal to 0)

H1: Rho ≠ 0 (the correlation is not equal to 0)

We reject H0 (at p-value<0.001) and conclude that there is a statistically significant correlation between age and monthly income.
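
The test statistic and the confidence interval reported above can be reproduced by hand from r and n. The following is a sketch under the assumption that the usual formulas apply (t statistic with n - 2 degrees of freedom, interval via the Fisher z-transformation); the object names are my own.

```r
r <- 0.4861412   # sample correlation reported by cor.test() above
n <- 300         # size of the subsample

# t statistic for H0: Rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat
p_value
ci
```

The interval does not contain 0, which agrees with rejecting H0.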

2nd RQ: Is there an association between Gender and working Over Time? (2 categorical variables)

The Pearson Chi2 test should be used because we are analyzing the association between 2 categorical variables.

Assumptions:

Observations must be independent
All expected frequencies are greater than 1
At most 20% of the expected frequencies can be between 1 and 5; however, this will reduce the power of the test

#Descriptive statistics

summary(mydata1[, c("GenderF","OverTimeF")])

##    GenderF    OverTimeF 
##  Female:588   Yes: 416  
##  Male  :882   No :1054

The summary shows that among observed employees, 588 were females, while 882 were males.
The summary shows that among observed employees, 416 were working over time, while 1054 were not working over time.

#Pearson Chi2 test

results <- chisq.test(mydata1$GenderF, mydata1$OverTimeF,
                      correct = TRUE) #Yates correction for 2x2 contingency table

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata1$GenderF and mydata1$OverTimeF
## X-squared = 2.3973, df = 1, p-value = 0.1215

H0: There is no association between the two categorical variables “Gender” and “Over Time”

H1: There is an association between the two categorical variables “Gender” and “Over Time”

Since p-value is greater than 0.05, we cannot reject H0.

#Observed (empirical) frequencies

addmargins(results$observed)

##                mydata1$OverTimeF
## mydata1$GenderF  Yes   No  Sum
##          Female  180  408  588
##          Male    236  646  882
##          Sum     416 1054 1470

#Expected (theoretical) frequencies

round(results$expected, 2)

##                mydata1$OverTimeF
## mydata1$GenderF   Yes    No
##          Female 166.4 421.6
##          Male   249.6 632.4

Second assumption is met because all expected frequencies are greater than 1.

Third assumption is met because all expected frequencies are greater than 5 as well.
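
The expected frequencies and the test statistic can also be computed by hand, which shows where the numbers come from. This is a sketch that rebuilds the observed table from the output above; the object names (observed, expected, chi2_yates) are my own.

```r
# Observed frequencies copied from the table above
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))
n <- sum(observed)

# Expected frequency under H0 = row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Chi-squared statistic with Yates' continuity correction (2x2 table)
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# p-value with (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

round(expected, 2)
chi2_yates
p_value
```

For example, the expected frequency for females working over time is 588 * 416 / 1470 = 166.4.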

#Pearson residuals - checking if the differences between empirical and expected frequencies are statistically significant

round(results$residuals, 2)

##                mydata1$OverTimeF
## mydata1$GenderF   Yes    No
##          Female  1.05 -0.66
##          Male   -0.86  0.54

From the table we can see that the discrepancies between the actual and expected values are not statistically significant because the absolute values of all residuals are below 1.96.

However, for educational purposes, I will explain one number as if it were significant. Let’s say, for example, that the residual for “Female” and “Yes” was 1.98.

Explanation: The actual number of females who worked over time is statistically significantly higher than expected (alpha = 0.05).
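
The object returned by chisq.test() stores two kinds of residuals: Pearson residuals (residuals) and standardized residuals (stdres). The values printed above, such as 1.05, are the Pearson residuals. The sketch below computes both by hand from the observed table so the difference is visible; the object names are my own.

```r
# Observed frequencies copied from the table above
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))
n <- sum(observed)
row_p <- rowSums(observed) / n   # row proportions
col_p <- colSums(observed) / n   # column proportions
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)

# Standardized residuals: additionally adjusted for the row and column proportions
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_p, 1 - col_p))

round(pearson_res, 2)
round(std_res, 2)
```

Standardized residuals are the ones that are approximately standard normal under H0, so they are the more natural values to compare with 1.96.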

#Proportion table 1

addmargins(round(prop.table(results$observed), 3))

##                mydata1$OverTimeF
## mydata1$GenderF   Yes    No   Sum
##          Female 0.122 0.278 0.400
##          Male   0.161 0.439 0.600
##          Sum    0.283 0.717 1.000

Explanation of the number 0.122 - Out of 1470 employees, there are 12.2% (180/1470) of employees that are females and worked over time.

Explanation of the number 0.439 - Out of 1470 employees, there are 43.9% (646/1470) of employees that are males and did not work over time.

#Proportion table 2

addmargins(round(prop.table(results$observed, 1), 3), 2)

##                mydata1$OverTimeF
## mydata1$GenderF   Yes    No   Sum
##          Female 0.306 0.694 1.000
##          Male   0.268 0.732 1.000

Explanation of the number 0.306 - Out of 588 employees that were females, 30.6% (180/588) of them did work over time.

Explanation of the number 0.732 - Out of 882 employees that were males, 73.2% (646/882) of them did not work over time.

#Proportion table 3

addmargins(round(prop.table(results$observed, 2), 3), 1)

##                mydata1$OverTimeF
## mydata1$GenderF   Yes    No
##          Female 0.433 0.387
##          Male   0.567 0.613
##          Sum    1.000 1.000

Explanation of the number 0.433 - Out of 416 employees that did work over time, 43.3% (180/416) of them were females.

Explanation of the number 0.613 - Out of 1054 employees that did not work over time, 61.3% (646/1054) of them were males.

#Effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata1$GenderF, mydata1$OverTimeF)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.03              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.03)

## [1] "tiny"
## (Rules: funder2019)
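
For reference, Cramer's V can be computed directly from the Chi2 statistic. The sketch below shows the plain formula and a bias-corrected version. I am assuming that the "(adj.)" value reported by effectsize corresponds to this bias-corrected formula applied to the Chi2 statistic without Yates' correction; the object names are my own.

```r
# Observed frequencies copied from the table above
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE)
n     <- sum(observed)
n_row <- nrow(observed)
n_col <- ncol(observed)

# Chi-squared statistic without continuity correction
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Plain Cramer's V
v <- sqrt(chi2 / (n * (min(n_row, n_col) - 1)))

# Bias-corrected Cramer's V (assumed to match the "adj." output above)
phi2_adj  <- max(0, chi2 / n - (n_row - 1) * (n_col - 1) / (n - 1))
n_row_adj <- n_row - (n_row - 1)^2 / (n - 1)
n_col_adj <- n_col - (n_col - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / (min(n_row_adj, n_col_adj) - 1))

v
v_adj
```

Both values are far below 0.1, so the effect size is tiny with either version.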

Conclusion:

Based on the sample data we cannot reject H0 (p = 0.122), so we cannot conclude that there is an association between Gender and working Over Time. Additionally, the effect size is tiny (Cramer's V = 0.03), which is consistent with this conclusion.

Even though the assumptions for the Pearson Chi2 test were met, for educational purposes I will also show Fisher’s exact probability test (non-parametric test).

fisher.test(mydata1$GenderF, mydata1$OverTimeF)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata1$GenderF and mydata1$OverTimeF
## p-value = 0.1109
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9524004 1.5297092
## sample estimates:
## odds ratio 
##   1.207461

H0: Odds ratio is equal to 1.

H1: Odds ratio is not equal to 1.

We cannot reject H0 (p-value = 0.111), so we cannot conclude that the odds of working over time differ between genders.

interpret_oddsratio(1.21)

## [1] "very small"
## (Rules: chen2010)

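The odds ratio (OR) of 1.21 means that the odds of working over time for females are 1.21 times the odds for males. This effect is very small, and the 95% confidence interval (0.95 to 1.53) includes 1, so the difference may be negligible.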
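
The odds ratio can be checked by hand from the observed table. This is a minimal sketch with my own object names. Note that fisher.test() reports a conditional maximum likelihood estimate, so its value can differ slightly from the simple sample odds ratio computed here.

```r
# Observed frequencies copied from the table above
observed <- matrix(c(180, 408,
                     236, 646),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Female", "Male"),
                                   OverTime = c("Yes", "No")))

# Odds of working over time within each gender
odds_female <- observed["Female", "Yes"] / observed["Female", "No"]
odds_male   <- observed["Male", "Yes"] / observed["Male", "No"]

# Sample odds ratio: odds for females relative to odds for males
or_sample <- odds_female / odds_male

odds_female
odds_male
or_sample
```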

Anja Ilic

2024-01-15
