We have a data set that provides insights into the similarities and differences between females and males across several factors. Because males and females form two independent samples, we start from the independent-samples t-test to answer our research question and check whether its assumptions hold.
Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153
setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)
##        Date Gender Driving.test.result    Bmi Childeren    Salary    region smoker age
## 1 1/11/2022 female                   5 27.900         0 16884.924 southwest    yes  19
## 2 1/11/2022 female                   4 33.770         1  1725.552 southeast     no  18
## 3 1/11/2022   male                   8 33.000         3  4449.462 southeast     no  28
## 4 1/11/2022   male                   9 22.705         0 21984.471 northwest     no  33
## 5 1/11/2022 female                   4 28.880         0  3866.855 northwest     no  32
## 6 2/11/2022 female                   2 25.740         0  3756.622 southeast     no  31
We will compare the salaries of males and females. We have the monthly salary of 354 employees.
#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Selecting only the Gender and Salary columns
mydata <- select(mydata, Gender, Salary)
head(mydata)
##   Gender    Salary
## 1 female 16884.924
## 2 female  1725.552
## 3   male  4449.462
## 4   male 21984.471
## 5 female  3866.855
## 6 female  3756.622
# Converting Gender into a numeric indicator (0 = male, 1 = female)
mydata$Gender <- as.numeric(factor(mydata$Gender, levels = c("male", "female"))) - 1
head(mydata)
##   Gender    Salary
## 1      1 16884.924
## 2      1  1725.552
## 3      0  4449.462
## 4      0 21984.471
## 5      1  3866.855
## 6      1  3756.622
Description:
Gender: Female or Male employee (0 = Male, 1 = Female)
Salary: The monthly salary of an employee
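The 0/1 coding follows from the order given in levels: factor() numbers the levels 1, 2, ... in that order, and subtracting 1 shifts them to 0 and 1. A minimal sketch on a made-up vector (g and gender_code are hypothetical names, not part of the data set):

```r
# Hypothetical example vector, not taken from the data set
g <- c("female", "male", "male", "female")

# levels = c("male", "female") makes "male" level 1 and "female" level 2
f <- factor(g, levels = c("male", "female"))

# as.numeric() returns the level index (1 or 2); subtracting 1 gives 0/1
gender_code <- as.numeric(f) - 1
gender_code
## [1] 1 0 0 1
```

Listing "female" first in levels would flip the coding, so the order should match the description of the variable.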
#Showing descriptive statistics
library(psych)
describe(mydata)
##        vars   n     mean       sd   median  trimmed   mad     min      max    range  skew kurtosis     se
## Gender    1 354     0.51     0.50     1.00     0.51     0    0.00     1.00     1.00 -0.03     -2.0   0.03
## Salary    2 354 15389.76 14498.79 10602.39 13510.62 11203 1137.01 51194.56 50057.55  0.97     -0.4 770.60
Explanation of a few of the sample statistics:
Mean: The average monthly salary of the employees is 15389.76.
Median: 50% of employees have a monthly salary of 10602.39 or lower, and 50% have a higher salary than that.
Min: The employee with the lowest monthly salary has a salary of 1137.01.
Max: The employee with the highest salary has a salary of 51194.56.
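The gap between the mean and the median already hints at a right-skewed distribution. As a small illustration, the sketch below uses only the six salaries printed by head(mydata) (the vector name salary6 is made up here); it is not a substitute for the full-sample statistics.

```r
# The six salaries shown by head(mydata) above
salary6 <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)

mean(salary6)    # pulled upwards by the two large salaries
median(salary6)  # midpoint of the two middle values after sorting

# A mean well above the median is typical of right-skewed data
mean(salary6) > median(salary6)
```

Here the mean is about 8778 while the median is about 4158, the same pattern as in the full sample.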
Research question: Is there a difference in monthly salary between males and females?
Unit of observation: Individual employee.
H0: average salary of males = average salary of females
or H0: average salary of males - average salary of females = 0
H1: average salary of males ≠ average salary of females
or H1: average salary of males - average salary of females ≠ 0
Assumptions that need to be met:
The variable is numeric.
The distribution of the variable is normal in both populations.
Variable has the same variance in both populations.
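For reference, this is roughly what the independent-samples t-test looks like when these assumptions hold. The sketch uses simulated data (the data frame sim and its columns are made up for illustration) and checks the pooled-variance t statistic by hand against t.test() with var.equal = TRUE.

```r
set.seed(1)  # simulated data, for illustration only

# Two hypothetical groups drawn from normal populations with equal variance
sim <- data.frame(
  group = factor(rep(c("a", "b"), each = 30)),
  value = c(rnorm(30, mean = 10, sd = 2), rnorm(30, mean = 11, sd = 2))
)

# Student's t-test (equal variances assumed)
res <- t.test(value ~ group, data = sim, var.equal = TRUE)

# The same statistic computed by hand from the pooled variance
x <- sim$value[sim$group == "a"]
y <- sim$value[sim$group == "b"]
n1 <- length(x)
n2 <- length(y)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

c(t_test = unname(res$statistic), manual = t_manual)
```

The pooled variance is only meaningful if both groups share the same variance, and the t reference distribution relies on normality, which is why both assumptions are checked before the test is used.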
#Checking the normality of distribution with histogram
# Loading the necessary libraries
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(ggpubr)
# Creating the histograms
Gender_0 <- ggplot(mydata[mydata$Gender == 0, ], aes(x = Salary)) +
  theme_linedraw() +
  geom_histogram() +
  ggtitle("Male")
Gender_1 <- ggplot(mydata[mydata$Gender == 1, ], aes(x = Salary)) +
  theme_linedraw() +
  geom_histogram() +
  ggtitle("Female")
# Arranging the plots
ggarrange(Gender_0, Gender_1, ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on the histograms, the distribution does not look normal in either group; it is skewed to the right (asymmetric, with a long right tail).
Therefore, we will also check normality with the Shapiro-Wilk test.
# shapiro.test() is part of base R's stats package, so no extra library is needed
# Performing the Shapiro-Wilk test for normality
shapiro_result_male <- shapiro.test(mydata$Salary[mydata$Gender == 0])
shapiro_result_female <- shapiro.test(mydata$Salary[mydata$Gender == 1])
# Print the results
print(shapiro_result_male)
##
## Shapiro-Wilk normality test
##
## data: mydata$Salary[mydata$Gender == 0]
## W = 0.89199, p-value = 6.227e-10
print(shapiro_result_female)
##
## Shapiro-Wilk normality test
##
## data: mydata$Salary[mydata$Gender == 1]
## W = 0.73435, p-value < 2.2e-16
H0: Variable is normally distributed.
H1: Variable is not normally distributed.
We reject H0 in both groups (p < 0.001).
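The decision rule can be sketched on simulated data. The example below draws a clearly right-skewed sample from an exponential distribution (the object names skewed, sw, alpha and reject_h0 are made up for illustration) and applies the same rule: reject normality when the p-value falls below the chosen significance level.

```r
set.seed(42)  # simulated data, for illustration only

# A clearly right-skewed hypothetical sample (exponential distribution)
skewed <- rexp(200, rate = 1)

sw <- shapiro.test(skewed)
sw$statistic  # W; values well below 1 indicate departure from normality
sw$p.value    # compared with the chosen significance level

# Decision rule used in the text: reject H0 of normality when p < alpha
alpha <- 0.001
reject_h0 <- sw$p.value < alpha
reject_h0
```

With samples as large as the ones here (174 and 180), the test has enough power to detect even moderate skewness, so it is sensible to read it together with the histograms.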
Because the normality assumption of the parametric test is not met, we will perform a non-parametric alternative, the Wilcoxon rank-sum test.
# Wilcoxon rank-sum test (wilcox.test() is part of base R's stats package)
wilcox.test(mydata$Salary ~ mydata$Gender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Salary by mydata$Gender
## W = 19152, p-value = 0.0002853
## alternative hypothesis: true location shift is not equal to 0
H0: Distribution location is the same for males and females.
H1: Distribution location is not the same for males and females.
We reject H0 (p < 0.001): there is a statistically significant difference in the location of the salary distributions of males and females.
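To see where the statistic W comes from, the sketch below computes it by hand for the six rows printed by head(mydata): all salaries are ranked together, the ranks of the first group are summed, and the smallest possible rank sum for that group is subtracted. The vector names are made up for illustration, and the tiny subset is used only to show the mechanics.

```r
# Salaries from head(mydata): rows 3-4 are male, rows 1, 2, 5, 6 are female
male   <- c(4449.462, 21984.471)
female <- c(16884.924, 1725.552, 3866.855, 3756.622)

# Rank all salaries together, then sum the ranks of the first group
r <- rank(c(male, female))
n1 <- length(male)
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2
W_manual
## [1] 7

# wilcox.test() reports the same statistic for the first group
W_test <- unname(wilcox.test(male, female)$statistic)
W_test
## [1] 7
```

Equivalently, W counts the (male, female) pairs in which the male salary is the higher one: 7 of the 2 x 4 = 8 pairs in this subset.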
library(psych)
describeBy(mydata$Salary, g = mydata$Gender)
##
## Descriptive statistics by group
## group: 0
##    vars   n  mean       sd   median  trimmed      mad     min      max    range skew kurtosis      se
## X1    1 174 17643 14126.77 13228.85 16430.72 14849.64 1137.01 51194.56 50057.55 0.71    -0.68 1070.95
## ------------------------------------------------------------------------------------------
## group: 1
##    vars   n     mean       sd  median  trimmed     mad     min      max    range skew kurtosis      se
## X1    1 180 13211.64 14559.38 6258.83 10715.65 5797.11 1725.55 48173.36 46447.81 1.29     0.17 1085.19
#Calculating the effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydata$Salary ~ mydata$Gender,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.22 | [0.11, 0.33]
interpret_rank_biserial(0.22)
## [1] "medium"
## (Rules: funder2019)
Based on the sample data, we find that men and women differ in monthly salary (p < 0.001): men tend to have higher salaries, and the size of the difference is medium (r = 0.22).
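The reported effect size can be cross-checked from the numbers already shown, assuming the usual relation between the Wilcoxon statistic and the rank-biserial correlation, r = 2W / (n1 * n2) - 1. This is a sketch of that relation and not necessarily how the effectsize package computes the value internally; the variable names are made up for illustration.

```r
# Values taken from the output above
W  <- 19152  # Wilcoxon statistic for group 0 (male)
n1 <- 174    # number of males
n2 <- 180    # number of females

# W counts the (male, female) pairs in which the male salary is higher
# (ties count one half); n1 * n2 is the total number of such pairs
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
## [1] 0.22
```

Read this way, r = 0.22 means that the male salary is the higher one in about 61% of all male-female pairs (W / (n1 * n2) is roughly 0.61).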