We have a data set that offers insight into the similarities and differences between females and males across several factors. Because males and females form two independent samples, we plan to use an independent samples t-test to answer our research question.
Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153
setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)
## Date Gender Driving.test.result Bmi Childeren Salary region smoker age
## 1 1/11/2022 female 5 27.900 0 16884.924 southwest yes 19
## 2 1/11/2022 female 4 33.770 1 1725.552 southeast no 18
## 3 1/11/2022 male 8 33.000 3 4449.462 southeast no 28
## 4 1/11/2022 male 9 22.705 0 21984.471 northwest no 33
## 5 1/11/2022 female 4 28.880 0 3866.855 northwest no 32
## 6 2/11/2022 female 2 25.740 0 3756.622 southeast no 31
We will compare the salaries of males and females. The data set contains the monthly salary of 354 employees.
#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Selecting only the Gender and Salary columns
mydata <- select(mydata, Gender, Salary)
head(mydata)
## Gender Salary
## 1 female 16884.924
## 2 female 1725.552
## 3 male 4449.462
## 4 male 21984.471
## 5 female 3866.855
## 6 female 3756.622
# Converting Gender to a factor (Male becomes the first level)
mydata$FGender <- factor(mydata$Gender,
                         levels = c("male", "female"),
                         labels = c("Male", "Female"))
head(mydata)
## Gender Salary FGender
## 1 female 16884.924 Female
## 2 female 1725.552 Female
## 3 male 4449.462 Male
## 4 male 21984.471 Male
## 5 female 3866.855 Female
## 6 female 3756.622 Female
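The order of the factor levels matters later on, because two-sample functions such as `wilcox.test()` treat the first level as the first group. The following is a minimal sketch on a made-up toy vector (not the Kaggle data) that contrasts R's default alphabetical ordering with the explicit ordering used above.

```r
# Toy vector standing in for the Gender column (values are illustrative only)
gender <- c("female", "male", "male", "female")

# Default: levels are sorted alphabetically, so "female" comes first
default_factor <- factor(gender)
print(levels(default_factor))

# Explicit levels and labels, as in the report: "Male" becomes the first level
ordered_factor <- factor(gender,
                         levels = c("male", "female"),
                         labels = c("Male", "Female"))
print(levels(ordered_factor))

# The underlying integer codes show which group counts as "first"
print(as.integer(default_factor))
print(as.integer(ordered_factor))
```

With the default ordering the groups are compared as female vs. male, while the explicit ordering compares Male vs. Female, which flips the sign of any directional statistic.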
Description:
Gender: Female or Male employee
Salary: The monthly salary of an employee
#Showing descriptive statistics
library(psych)
summary(mydata[c(-1)])
## Salary FGender
## Min. : 1137 Male :174
## 1st Qu.: 3580 Female:180
## Median :10602
## Mean :15390
## 3rd Qu.:23568
## Max. :51195
result <- describeBy(mydata$Salary, mydata$FGender)
print(result)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 174 17643 14126.77 13228.85 16430.72 14849.64 1137.01 51194.56 50057.55 0.71 -0.68 1070.95
## ------------------------------------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 180 13211.64 14559.38 6258.83 10715.65 5797.11 1725.55 48173.36 46447.81 1.29 0.17 1085.19
Explanation of some of the sample statistics:
Mean: The average monthly salary of the employees is 15390.
Median: 50% of employees have a monthly salary of 10602 or less, and 50% have a higher salary.
Min: The lowest monthly salary in the sample is 1137.
Max: The highest monthly salary in the sample is 51195.
There are 174 males and 180 females among the employees.
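The pooled statistics above can also be computed per group with base R alone. The sketch below uses a small made-up data frame (the salary values are hypothetical) and shows how a mean well above the median points to right skew within a group.

```r
# Toy data; the salary values are made up for illustration
toy <- data.frame(
  Gender = c("male", "male", "male", "female", "female", "female"),
  Salary = c(1000, 2000, 9000, 1500, 2500, 3500)
)

# Mean, median, and group size per gender with base R
group_mean   <- tapply(toy$Salary, toy$Gender, mean)
group_median <- tapply(toy$Salary, toy$Gender, median)
group_n      <- table(toy$Gender)

print(group_mean)
print(group_median)
print(group_n)
```

In this toy example the male mean (4000) is twice the male median (2000) because of one large value, which is the same pattern that the real data show on a larger scale.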
Research question: Is there a difference in the monthly salary of males and females?
Unit of observation: Individual employee.
H0: average monthly salary of males = average monthly salary of females
or H0: average monthly salary of males - average monthly salary of females = 0
H1: average monthly salary of males ≠ average monthly salary of females
or H1: average monthly salary of males - average monthly salary of females ≠ 0
Assumptions that need to be met:
The variable is numeric.
The variable is normally distributed in both populations.
The variable has the same variance in both populations.
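If these assumptions held, the planned test could be run roughly as follows. This is a sketch on simulated data (the means, standard deviations, and group sizes are assumptions chosen for illustration, not the Kaggle data). It shows an equal-variance check with `var.test()`, the classical t-test that assumes equal variances, and Welch's t-test, which is R's default and does not.

```r
set.seed(1)  # reproducible simulated data

# Simulated salaries (hypothetical values, not the Kaggle data)
sim <- data.frame(
  FGender = factor(rep(c("Male", "Female"), each = 50),
                   levels = c("Male", "Female")),
  Salary  = c(rnorm(50, mean = 3000, sd = 500),
              rnorm(50, mean = 2800, sd = 500))
)

# F test of equal variances (itself sensitive to non-normality)
var_check <- var.test(Salary ~ FGender, data = sim)

# Student's t-test, which assumes equal variances
t_student <- t.test(Salary ~ FGender, data = sim, var.equal = TRUE)

# Welch's t-test (R's default), which does not assume equal variances
t_welch <- t.test(Salary ~ FGender, data = sim)

print(var_check$p.value)
print(t_student)
print(t_welch)
```

The two t-tests differ only in how the standard error and the degrees of freedom are computed, so Welch's version is a reasonable fallback when only the equal-variance assumption is in doubt.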
#Checking the normality of distribution with histogram
# Loading the necessary libraries
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(ggpubr)
# Creating the histograms
Gender_0 <- ggplot(mydata[mydata$FGender == "Male", ], aes(x = Salary)) +
  theme_linedraw() +
  geom_histogram() +
  ggtitle("Male")
Gender_1 <- ggplot(mydata[mydata$FGender == "Female", ], aes(x = Salary)) +
  theme_linedraw() +
  geom_histogram() +
  ggtitle("Female")
# Arranging the plots
ggarrange(Gender_0, Gender_1, ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mydata, aes(x = Salary, fill = FGender)) +
  geom_histogram(position = position_dodge(width = 800)) +
  scale_x_continuous(breaks = seq(0, 100000, 10000)) +
  ylab("Frequency") +
  labs(fill = "Gender")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
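The `stat_bin()` message suggests choosing a bin width explicitly. One common option is the Freedman-Diaconis rule. The sketch below applies it to the pooled quartiles and sample size reported in the summary output earlier. The choice of this particular rule is an assumption for illustration, and the quartiles are rounded as printed by `summary()`.

```r
# Quartiles and sample size as reported in the summary output of this report
q1 <- 3580
q3 <- 23568
n  <- 354

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * (q3 - q1) / n^(1/3)
print(round(fd_binwidth))
```

The result could then be passed to the plots as `geom_histogram(binwidth = fd_binwidth)`.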
Based on the histograms, the distribution of salary does not appear to be normal in either group. It is right-skewed (asymmetric to the right).
Therefore, we will also check normality with the Shapiro-Wilk test.
# Performing the Shapiro-Wilk test for normality (shapiro.test() is part of base R's stats package)
shapiro_result_male <- shapiro.test(mydata$Salary[mydata$FGender == "Male"])
shapiro_result_female <- shapiro.test(mydata$Salary[mydata$FGender == "Female"])
# Print the results
print(shapiro_result_male)
##
## Shapiro-Wilk normality test
##
## data: mydata$Salary[mydata$FGender == "Male"]
## W = 0.89199, p-value = 6.227e-10
print(shapiro_result_female)
##
## Shapiro-Wilk normality test
##
## data: mydata$Salary[mydata$FGender == "Female"]
## W = 0.73435, p-value < 2.2e-16
H0: The variable is normally distributed.
H1: The variable is not normally distributed.
We reject H0 at p < 0.001 in both groups: Salary is not normally distributed among either males or females.
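To see how the test behaves, the following sketch applies `shapiro.test()` to two simulated samples (hypothetical data, not the Kaggle salaries): one drawn from a strongly right-skewed log-normal distribution and one drawn from a normal distribution.

```r
set.seed(42)  # reproducible simulated data

# Simulated samples (hypothetical, not the Kaggle data)
skewed    <- rlnorm(200, meanlog = 9, sdlog = 1)    # strongly right-skewed
symmetric <- rnorm(200, mean = 15000, sd = 3000)    # drawn from a normal distribution

sw_skewed    <- shapiro.test(skewed)
sw_symmetric <- shapiro.test(symmetric)

# A W statistic well below 1 and a very small p-value indicate non-normality
print(sw_skewed)
print(sw_symmetric)
```

The skewed sample should give a W statistic clearly below that of the normal sample, much like the values of 0.89 and 0.73 reported above. A normal Q-Q plot with `qqnorm()` and `qqline()` is a useful visual complement to the test.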
Because the normality assumption of the parametric test is not met, we will perform a non-parametric alternative, the Wilcoxon rank sum test.
# Wilcoxon rank sum test (wilcox.test() is part of base R's stats package)
wilcox.test(mydata$Salary ~ mydata$FGender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Salary by mydata$FGender
## W = 19152, p-value = 0.0002853
## alternative hypothesis: true location shift is not equal to 0
H0: The distribution location is the same for males and females.
H1: The distribution location is not the same for males and females.
Based on the sample data, we reject H0 (p < 0.001): the distribution location is not the same for male and female employees.
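The W statistic reported by `wilcox.test()` is the rank sum of the first group minus its smallest possible value, n1 * (n1 + 1) / 2. The sketch below checks this by hand on two small made-up samples without ties (the values are illustrative only).

```r
# Toy samples without ties (values are made up for illustration)
x <- c(12, 15, 18, 22)       # first group
y <- c(10, 11, 14, 16, 20)   # second group

# Rank all observations together, then sum the ranks of the first group
all_ranks  <- rank(c(x, y))
rank_sum_x <- sum(all_ranks[seq_along(x)])

# W = rank sum of the first group - n1 * (n1 + 1) / 2
n1 <- length(x)
w_manual <- rank_sum_x - n1 * (n1 + 1) / 2

# The same statistic as computed by wilcox.test()
w_r <- unname(wilcox.test(x, y)$statistic)

print(w_manual)  # 14
print(w_r)       # 14
```

Equivalently, W counts the pairs in which an observation from the first group exceeds one from the second group, here 14 of the 20 possible pairs.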
library(psych)
describeBy(mydata$Salary, g = mydata$FGender)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 174 17643 14126.77 13228.85 16430.72 14849.64 1137.01 51194.56 50057.55 0.71 -0.68 1070.95
## ------------------------------------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 180 13211.64 14559.38 6258.83 10715.65 5797.11 1725.55 48173.36 46447.81 1.29 0.17 1085.19
#Calculating the effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
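# Gender is a character column, so the groups are ordered alphabetically
# (female, male) and the resulting r is negative
effectsize(wilcox.test(mydata$Salary ~ mydata$Gender,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))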
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.22 | [-0.33, -0.11]
interpret_rank_biserial(0.22)
## [1] "medium"
## (Rules: funder2019)
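The reported effect size can be reproduced from the numbers already shown in this report, assuming the standard definition of the rank-biserial correlation, r = 2W / (n1 * n2) - 1. The sketch below uses W from the Wilcoxon output and the group sizes from the descriptive statistics, and also shows why the sign flips when the group order is reversed.

```r
# Values taken from the outputs in this report
w  <- 19152   # W from wilcox.test() with groups ordered Male, Female
n1 <- 174     # number of males
n2 <- 180     # number of females

# Rank-biserial correlation with Male as the first group
r_male_first <- 2 * w / (n1 * n2) - 1

# Reversing the group order (female first) turns W into n1 * n2 - W
w_reversed     <- n1 * n2 - w
r_female_first <- 2 * w_reversed / (n1 * n2) - 1

print(round(r_male_first, 2))    # 0.22
print(round(r_female_first, 2))  # -0.22
```

The negative value in the `effectsize()` output therefore reflects the female-first group order, not a contradiction with the Wilcoxon result.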
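Based on the sample data, male and female employees differ in the location of the monthly salary distribution (p < 0.001). Men tend to have higher salaries (median 13228.85 vs. 6258.83), and the effect size is medium (rank-biserial |r| = 0.22; the sign is negative in the output only because the effect size call groups by the character column Gender, which orders the groups as female, male).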