We have a data set that provides insights into the similarities and differences between female and male employees across several factors. Using correlation analysis, we examine the relationship between the age of employees and their average monthly salary.
Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153
setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)
## Date Gender Driving.test.result Bmi Childeren Salary region smoker age
## 1 1/11/2022 female 5 27.900 0 16884.924 southwest yes 19
## 2 1/11/2022 female 4 33.770 1 1725.552 southeast no 18
## 3 1/11/2022 male 8 33.000 3 4449.462 southeast no 28
## 4 1/11/2022 male 9 22.705 0 21984.471 northwest no 33
## 5 1/11/2022 female 4 28.880 0 3866.855 northwest no 32
## 6 2/11/2022 female 2 25.740 0 3756.622 southeast no 31
#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Selecting only the relevant columns
mydata <- select(mydata, Gender, smoker, Salary, age)
head(mydata)
## Gender smoker Salary age
## 1 female yes 16884.924 19
## 2 female no 1725.552 18
## 3 male no 4449.462 28
## 4 male no 21984.471 33
## 5 female no 3866.855 32
## 6 female no 3756.622 31
#Converting Gender and smoker to labelled factors
mydata$FGender <- factor(mydata$Gender,
                         levels = c("male", "female"),
                         labels = c("Male", "Female"))
# Labels are matched to levels by position: "no" -> "Nonsmoker", "yes" -> "Smoker"
mydata$FSmoker <- factor(mydata$smoker,
                         levels = c("no", "yes"),
                         labels = c("Nonsmoker", "Smoker"))
head(mydata)
## Gender smoker Salary age FGender FSmoker
## 1 female yes 16884.924 19 Female Smoker
## 2 female no 1725.552 18 Female Nonsmoker
## 3 male no 4449.462 28 Male Nonsmoker
## 4 male no 21984.471 33 Male Nonsmoker
## 5 female no 3866.855 32 Female Nonsmoker
## 6 female no 3756.622 31 Female Nonsmoker
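Because factor() pairs each entry of levels with the entry of labels in the same position, a quick cross-tabulation of raw and recoded values is a cheap way to confirm the mapping. The sketch below uses a small hypothetical vector (the first six smoker values shown by head(mydata)), not the full data set; the names raw_smoker, recoded and mapping_check are illustrative.

```r
# Hypothetical toy vector: the first six `smoker` values shown by head(mydata)
raw_smoker <- c("yes", "no", "no", "no", "no", "no")

# Labels are matched to levels by position: "no" -> "Nonsmoker", "yes" -> "Smoker"
recoded <- factor(raw_smoker,
                  levels = c("no", "yes"),
                  labels = c("Nonsmoker", "Smoker"))

# Every raw value should fall into exactly one label column
mapping_check <- table(raw = raw_smoker, recoded = recoded)
print(mapping_check)
```

On the real data, the equivalent check would be table(mydata$smoker, mydata$FSmoker).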
Unit of observation: Individual employee
Sample size: 354 employees
Variables:
Salary: Average monthly salary of an individual employee in the sample.
Age: Age of an individual employee in the sample (years).
Gender: Gender of an individual employee in the sample (Male, Female).
Smoker: Smoking status of an individual employee in the sample (Smoker, Nonsmoker).
#Showing descriptive statistics
library(psych)
summary(mydata)
## Gender smoker Salary age FGender FSmoker
## Length:354 Length:354 Min. : 1137 Min. :18.00 Male :174 Nonsmoker:258
## Class :character Class :character 1st Qu.: 3580 1st Qu.:23.00 Female:180 Smoker : 96
## Mode :character Mode :character Median :10602 Median :34.00
## Mean :15390 Mean :37.19
## 3rd Qu.:23568 3rd Qu.:55.00
## Max. :51195 Max. :63.00
# describe() summarises every column; describeBy() is only needed with a grouping variable
result <- describe(mydata)
print(result)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Gender* 1 354 1.49 0.50 1.00 1.49 0.00 1.00 2.00 1.00 0.03 -2.00 0.03
## smoker* 2 354 1.27 0.45 1.00 1.21 0.00 1.00 2.00 1.00 1.03 -0.95 0.02
## Salary 3 354 15389.76 14498.79 10602.39 13510.62 11203.00 1137.01 51194.56 50057.55 0.97 -0.40 770.60
## age 4 354 37.19 15.28 34.00 36.53 17.79 18.00 63.00 45.00 0.39 -1.28 0.81
## FGender* 5 354 1.51 0.50 2.00 1.51 0.00 1.00 2.00 1.00 -0.03 -2.00 0.03
## FSmoker* 6 354 1.27 0.45 1.00 1.21 0.00 1.00 2.00 1.00 1.03 -0.95 0.02
Explanation of a few parameters:
Min: The youngest employee in the sample is 18 years old. The lowest average monthly salary in the sample is 1137 monetary units.
1st quartile: 25% of employees in the sample are 23 years old or younger. 25% of employees in the sample earn 3580 monetary units per month or less on average.
Mean: The average age of employees in the sample is 37.19 years. The average monthly salary of employees in the sample is 15390 monetary units.
In the sample, there are 96 smokers (smoker = yes) and 258 non-smokers (smoker = no).
In the sample, there are 174 men and 180 women.
Pearson’s correlation coefficient is appropriate only if all of the following assumptions are met:
Both variables are numeric.
Normality: both variables, Age and Salary, should be normally distributed.
Linearity: The relationship between Salary and Age should be linear.
#Checking the normality with histogram
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
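ggplot(mydata, aes(x = Salary)) +
  geom_histogram(binwidth = 5000, colour = "grey71", fill = "hotpink") +
  scale_x_continuous(breaks = seq(0, 100000, 10000)) +
  ylab("Frequency")
ggplot(mydata, aes(x = age)) +
  geom_histogram(binwidth = 5, colour = "grey71", fill = "hotpink") +
  xlab("Age") +
  ylab("Frequency")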
#Checking the linearity with scatterplot
ggplot(mydata, aes(x = age, y = Salary)) +
  geom_point()
We can see that normality is violated, so we use a non-parametric alternative: Spearman’s rank correlation.
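A formal test can complement the visual check. The following is a minimal sketch using shapiro.test() from base R on simulated, right-skewed data; the variable sim_salary and its log-normal distribution are assumptions made purely for illustration and do not come from the data set.

```r
set.seed(123)

# Hypothetical right-skewed variable with the same sample size as the data set
sim_salary <- rlnorm(354, meanlog = 9, sdlog = 1)

# Shapiro-Wilk test: H0 states that the variable is normally distributed
sw <- shapiro.test(sim_salary)
print(sw)

# A small p-value is evidence against normality
normality_rejected <- sw$p.value < 0.05
```

On the real data, the same check would be shapiro.test(mydata$Salary) and shapiro.test(mydata$age).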
#Calculating Spearman's rank correlation coefficient
cor(mydata$Salary, mydata$age,
    method = "spearman")
## [1] 0.5280009
Based on the sample data, we can conclude that there is a positive, moderate monotonic relationship between age and the average monthly salary of employees.
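Spearman's coefficient is Pearson's coefficient computed on ranks, which is why it does not require normality. The sketch below illustrates this on simulated data; sim_age and sim_salary are hypothetical variables, not columns of the data set.

```r
set.seed(42)

# Hypothetical data: salary grows with age but is strongly right-skewed
sim_age    <- sample(18:63, 200, replace = TRUE)
sim_salary <- exp(0.03 * sim_age + rnorm(200))

# Built-in Spearman coefficient
rho_builtin <- cor(sim_salary, sim_age, method = "spearman")

# Pearson coefficient on the ranks (ties receive average ranks)
rho_manual <- cor(rank(sim_salary), rank(sim_age))

print(c(builtin = rho_builtin, manual = rho_manual))
```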
#Checking the significance of correlation coefficient
cor.test(mydata$Salary, mydata$age,
         method = "spearman",
         exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: mydata$Salary and mydata$age
## S = 3489766, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5280009
H0: rho = 0
H1: rho ≠ 0
Based on the sample data, we can reject the null hypothesis (p < 0.001) and conclude that there is a monotonic relationship between the age of employees and their average monthly salary.
#Creating the contingency table
contingency_table <- addmargins(table(mydata$FGender, mydata$FSmoker))
print(contingency_table)
##
## Nonsmoker Smoker Sum
## Male 120 54 174
## Female 138 42 180
## Sum 258 96 354
Checking the assumptions for Pearson’s Chi-squared test:
Independence: each employee is counted in exactly one cell of the contingency table.
Expected frequencies: all expected frequencies should be greater than 5.
++ Expected frequency for Male Non-smokers: (174 * 258) / 354 = 126.81
++ Expected frequency for Male Smokers: (174 * 96) / 354 = 47.19
++ Expected frequency for Female Non-smokers: (180 * 258) / 354 = 131.19
++ Expected frequency for Female Smokers: (180 * 96) / 354 = 48.81
All assumptions are met, so we can perform Pearson’s Chi-squared test.
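The expected frequencies can also be computed directly from the table margins instead of by hand. The sketch below rebuilds the 2 x 2 table from the counts printed above; the columns are named after the raw smoker coding in the data (no, yes), and the object names counts and expected are illustrative.

```r
# Observed counts from the contingency table; columns follow the raw `smoker` coding
counts <- matrix(c(120, 138, 54, 42), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 smoker = c("no", "yes")))

# Expected count = (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(round(expected, 2))

# Rule of thumb: every expected frequency should exceed 5
all_above_five <- all(expected > 5)
```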
#Performing the Pearson Chi squared test
results <- chisq.test(mydata$FGender, mydata$FSmoker,
                      correct = TRUE)
print(results)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$FGender and mydata$FSmoker
## X-squared = 2.2795, df = 1, p-value = 0.1311
H0: There is no association between the gender of an employee and his/her smoking status.
H1: There is an association between the gender of an employee and his/her smoking status.
Based on the sample data, we cannot reject the null hypothesis (p = 0.1311): there is insufficient evidence of an association between the gender of an employee and his/her smoking status.
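To make the reported statistic less of a black box, it can be reproduced by hand. The sketch below applies Yates' continuity correction (subtracting 0.5 from each absolute deviation) to the observed counts from the contingency table; the object names are illustrative, and the columns use the raw smoker coding (no, yes).

```r
# Observed counts from the contingency table; columns follow the raw `smoker` coding
counts <- matrix(c(120, 138, 54, 42), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 smoker = c("no", "yes")))
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-squared statistic with Yates' continuity correction
chi2_yates <- sum((abs(counts - expected) - 0.5)^2 / expected)

# p-value from the chi-squared distribution with (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

# Comparison with the built-in test
builtin <- chisq.test(counts, correct = TRUE)
print(c(manual = chi2_yates, builtin = unname(builtin$statistic), p = p_value))
```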
#Calculating standardized residuals
std_residuals <- results$stdres
print(std_residuals)
## mydata$FSmoker
## mydata$FGender Nonsmoker Smoker
## Male -1.629381 1.629381
## Female 1.629381 -1.629381
Based on the standardized residuals, no combination of gender and smoking status occurs significantly more or less often than expected under independence, because all four standardized residuals are smaller than 1.96 in absolute value.
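The standardized residuals can be reproduced from the observed and expected counts. The sketch below divides each deviation by its estimated standard error, which is how the stdres component of chisq.test() is documented; the object names are illustrative, and the columns use the raw smoker coding (no, yes).

```r
# Observed counts from the contingency table; columns follow the raw `smoker` coding
counts <- matrix(c(120, 138, 54, 42), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 smoker = c("no", "yes")))
n        <- sum(counts)
expected <- outer(rowSums(counts), colSums(counts)) / n

# Standardized residual = (O - E) / sqrt(E * (1 - row share) * (1 - column share))
row_share <- rowSums(counts) / n
col_share <- colSums(counts) / n
std_res   <- (counts - expected) /
  sqrt(expected * outer(1 - row_share, 1 - col_share))
print(round(std_res, 6))

# Cells with |residual| > 1.96 would deviate significantly at the 5% level
any_significant <- any(abs(std_res) > 1.96)
```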
#Proportion table 1
addmargins(round(prop.table(results$observed), 2))
## mydata$FSmoker
## mydata$FGender Nonsmoker Smoker Sum
## Male 0.34 0.15 0.49
## Female 0.39 0.12 0.51
## Sum 0.73 0.27 1.00
Interpretation:
Male Non-Smokers: 34% of all the employees are male non-smokers.
Female Smokers: 12% of all the employees are female smokers.
Sum of Non-Smokers: 73% of all the employees are non-smokers.
Sum of Males: 49% of all the employees are males.
#Proportion table 2
addmargins(round(prop.table(results$observed, 1), 2), 2)
## mydata$FSmoker
## mydata$FGender Nonsmoker Smoker Sum
## Male 0.69 0.31 1.00
## Female 0.77 0.23 1.00
Interpretation:
Male Non-Smokers: 69% of male employees are non-smokers.
Male Smokers: 31% of male employees are smokers.
Female Non-Smokers: 77% of female employees are non-smokers.
Female Smokers: 23% of female employees are smokers.
#Proportion table 3
addmargins(round(prop.table(results$observed, 2), 2), 1)
## mydata$FSmoker
## mydata$FGender Nonsmoker Smoker
## Male 0.47 0.56
## Female 0.53 0.44
## Sum 1.00 1.00
Interpretation:
Male Non-Smokers: 47% of non-smoking employees are males.
Female Non-Smokers: 53% of non-smoking employees are females.
Male Smokers: 56% of smoking employees are males.
Female Smokers: 44% of smoking employees are females.
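The three proportion tables differ only in the denominator, which is set by the margin argument of prop.table(). The sketch below shows all three on the observed counts; the object names are illustrative, and the columns use the raw smoker coding (no, yes).

```r
# Observed counts from the contingency table; columns follow the raw `smoker` coding
counts <- matrix(c(120, 138, 54, 42), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 smoker = c("no", "yes")))

# No margin: each cell divided by the grand total (all cells sum to 1)
share_of_total <- prop.table(counts)

# margin = 1: each cell divided by its row total (each row sums to 1)
share_within_gender <- prop.table(counts, 1)

# margin = 2: each cell divided by its column total (each column sums to 1)
share_within_smoking <- prop.table(counts, 2)

print(round(share_of_total, 2))
print(round(share_within_gender, 2))
print(round(share_within_smoking, 2))
```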
#Calculating effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$FGender, mydata$FSmoker)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.07 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.07)
## [1] "very small"
## (Rules: funder2019)
The association between gender and smoking status of employees is very small (V = 0.07).
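For reference, the adjusted value can be approximated by hand. The sketch below computes the plain Cramér's V from the uncorrected chi-squared statistic and then applies a small-sample bias correction; that the "(adj.)" value printed by cramers_v() uses exactly this correction (Bergsma, 2013) is an assumption here, although the result rounds to the same 0.07. The object names are illustrative, and the columns use the raw smoker coding (no, yes).

```r
# Observed counts from the contingency table; columns follow the raw `smoker` coding
counts <- matrix(c(120, 138, 54, 42), nrow = 2,
                 dimnames = list(Gender = c("Male", "Female"),
                                 smoker = c("no", "yes")))
n <- sum(counts)
r <- nrow(counts)
k <- ncol(counts)

# Uncorrected chi-squared statistic (no Yates' correction)
chi2 <- unname(chisq.test(counts, correct = FALSE)$statistic)

# Plain Cramer's V
v_raw <- sqrt(chi2 / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to correspond to the "(adj.)" output)
phi2_adj <- max(0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(round(c(raw = v_raw, adjusted = v_adj), 3))
```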