Hypothesis testing

Data

We have a data set that provides insight into the similarities and differences between females and males across several factors. Because males and females form two independent samples, we plan to use the independent-samples t-test to answer our research question.

Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153

setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)
##        Date Gender Driving.test.result    Bmi Childeren    Salary    region smoker age
## 1 1/11/2022 female                   5 27.900         0 16884.924 southwest    yes  19
## 2 1/11/2022 female                   4 33.770         1  1725.552 southeast     no  18
## 3 1/11/2022   male                   8 33.000         3  4449.462 southeast     no  28
## 4 1/11/2022   male                   9 22.705         0 21984.471 northwest     no  33
## 5 1/11/2022 female                   4 28.880         0  3866.855 northwest     no  32
## 6 2/11/2022 female                   2 25.740         0  3756.622 southeast     no  31

We will compare the salaries of males and females. The data contain the monthly salary of 354 employees.

#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Selecting only the Gender and Salary columns
mydata <- select(mydata, Gender, Salary)
head(mydata)
##   Gender    Salary
## 1 female 16884.924
## 2 female  1725.552
## 3   male  4449.462
## 4   male 21984.471
## 5 female  3866.855
## 6 female  3756.622
# Converting the Gender variable into a numerical variable (0 = male, 1 = female)

mydata$Gender <- as.numeric(factor(mydata$Gender, levels = c("male", "female"))) - 1

head(mydata)
##   Gender    Salary
## 1      1 16884.924
## 2      1  1725.552
## 3      0  4449.462
## 4      0 21984.471
## 5      1  3866.855
## 6      1  3756.622
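The 0/1 coding above depends on the order in which the factor levels are listed. A minimal sketch on a small made-up vector shows how the mapping arises; the vector `gender_raw` is hypothetical and not taken from the data set.

```r
# Hypothetical toy vector, used only to illustrate the coding step
gender_raw <- c("female", "female", "male", "male")

# factor() numbers the levels in the given order: male = 1, female = 2
gender_factor <- factor(gender_raw, levels = c("male", "female"))

# Subtracting 1 shifts the codes to male = 0, female = 1
gender_code <- as.numeric(gender_factor) - 1

print(gender_code)
```

Listing the levels in the opposite order would reverse the coding, so it is worth checking the result with head() as done above.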

Description:

Gender: gender of the employee (0 = male, 1 = female).
Salary: monthly salary of the employee.

Descriptive statistics

#Showing descriptive statistics
library(psych)
describe(mydata)
##        vars   n     mean       sd   median  trimmed   mad     min      max    range  skew kurtosis     se
## Gender    1 354     0.51     0.50     1.00     0.51     0    0.00     1.00     1.00 -0.03     -2.0   0.03
## Salary    2 354 15389.76 14498.79 10602.39 13510.62 11203 1137.01 51194.56 50057.55  0.97     -0.4 770.60

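Explanation of a few parameter estimates:

Mean of Gender (0.51): because Gender is coded 0/1, 51% of the employees in the sample are female.
Median of Salary (10602.39): half of the employees have a monthly salary of up to 10602.39, and half earn more than that.
Skew of Salary (0.97): the salary distribution is asymmetrical to the right.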
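As an illustration of how one of the reported estimates arises, the following sketch recomputes the standard error of the mean salary (column se) from the values printed in the describe() output. The variable names are chosen for illustration only.

```r
# Values copied from the describe() output above
n         <- 354       # number of employees
sd_salary <- 14498.79  # standard deviation of Salary

# Standard error of the mean = standard deviation / sqrt(sample size)
se_salary <- sd_salary / sqrt(n)

print(round(se_salary, 2))
```

The result should agree with the se value of about 770.60 reported for Salary.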

Hypothesis testing

H0: average salary of males = average salary of females

or H0: average salary of males - average salary of females = 0

H1: average salary of males ≠ average salary of females

or H1: average salary of males - average salary of females ≠ 0

Assumptions we need to meet:

Checking normal distribution of the variable with histogram

#Checking the normality of distribution with histogram

# Loading the necessary libraries
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(ggpubr)

# Creating the histograms
Gender_0 <- ggplot(mydata[mydata$Gender == 0, ], aes(x = Salary)) + 
   theme_linedraw() + 
   geom_histogram() +
   ggtitle("Male")

Gender_1 <- ggplot(mydata[mydata$Gender == 1, ], aes(x = Salary)) + 
   theme_linedraw() + 
   geom_histogram() +
   ggtitle("Female")

# Arranging the plots
ggarrange(Gender_0, Gender_1, ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the histograms, the salary distribution does not appear to be normal in either group; it is asymmetrical to the right.
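A Q-Q plot and the sample skewness are common complements to a histogram when judging asymmetry. The sketch below uses simulated right-skewed values (not the salary data) and base R only; the seed and the parameters of the simulated distribution are arbitrary assumptions.

```r
# Simulated right-skewed values; these are NOT the salary data
set.seed(123)
x <- rlnorm(200, meanlog = 9, sdlog = 0.8)

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- mean((x - mean(x))^3) / sd(x)^3

# Q-Q plot: points bending away from the line suggest non-normality
qqnorm(x, main = "Simulated right-skewed data")
qqline(x)

print(skewness > 0)
```

The same two calls, qqnorm() and qqline(), could be applied to the salaries of each group separately.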

Therefore, we will also check the normality of the distribution with the Shapiro-Wilk test.

Checking normal distribution of the variable with Shapiro-Wilk test

# shapiro.test() is part of base R (stats package), so no additional library is needed

# Performing the Shapiro-Wilk test for normality
shapiro_result_male <- shapiro.test(mydata$Salary[mydata$Gender == 0])
shapiro_result_female <- shapiro.test(mydata$Salary[mydata$Gender == 1])

# Print the results
print(shapiro_result_male)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Salary[mydata$Gender == 0]
## W = 0.89199, p-value = 6.227e-10
print(shapiro_result_female)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Salary[mydata$Gender == 1]
## W = 0.73435, p-value < 2.2e-16

H0: Variable is normally distributed.

H1: Variable is not normally distributed.

We reject H0 at p<0.001 in both populations.
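The decision rule used here can be written out explicitly. This is a minimal sketch on simulated right-skewed values (not the salary data); the significance level `alpha` and the seed are assumptions made for the illustration.

```r
# Simulated right-skewed values; these are NOT the salary data
set.seed(42)
x <- rlnorm(200)

alpha  <- 0.05            # significance level (assumption)
result <- shapiro.test(x)

# Decision rule: reject H0 (normality) when the p-value is below alpha
reject_normality <- result$p.value < alpha

print(reject_normality)
```

The p-value is stored in the `p.value` element of the object returned by shapiro.test(), so the same comparison works for the two results computed above.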

As the assumption of normal distribution required for the parametric test is not met, we will perform a non-parametric test: the Wilcoxon Rank Sum test.

Using non-parametric test Wilcoxon Rank Sum test

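# Wilcoxon Rank Sum test (wilcox.test() is part of base R, so no library call is needed)
# 'paired' is omitted: two independent samples are the default, and recent
# R versions may reject this argument in the formula interface

wilcox.test(mydata$Salary ~ mydata$Gender,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")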
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Salary by mydata$Gender
## W = 19152, p-value = 0.0002853
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution location is the same for male and female.

H1: Distribution location is not the same for male and female.

We reject H0 at p < 0.001: there is a statistically significant difference in the location of the salary distributions of males and females.
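To make the test statistic W less abstract, the sketch below computes it by hand for two small made-up samples and compares it with the value returned by wilcox.test(). The samples `group_a` and `group_b` are hypothetical and contain no ties.

```r
# Hypothetical toy samples without ties (not taken from the data set)
group_a <- c(12.1, 15.3, 9.8, 20.4, 17.6)
group_b <- c(8.2, 11.5, 7.9, 10.1)

# Rank all observations together
ranks <- rank(c(group_a, group_b))
n_a   <- length(group_a)

# W = rank sum of the first group minus its smallest possible rank sum
w_manual <- sum(ranks[seq_len(n_a)]) - n_a * (n_a + 1) / 2

# The same statistic as reported by wilcox.test()
w_builtin <- unname(wilcox.test(group_a, group_b,
                                exact = FALSE, correct = FALSE)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```

In the analysis above the first group is Gender == 0 (male), so W = 19152 refers to the ranks of the male salaries.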

library(psych)
describeBy(mydata$Salary, g = mydata$Gender)
## 
##  Descriptive statistics by group 
## group: 0
##    vars   n  mean       sd   median  trimmed      mad     min      max    range skew kurtosis      se
## X1    1 174 17643 14126.77 13228.85 16430.72 14849.64 1137.01 51194.56 50057.55 0.71    -0.68 1070.95
## ------------------------------------------------------------------------------------------ 
## group: 1
##    vars   n     mean       sd  median  trimmed     mad     min      max    range skew kurtosis      se
## X1    1 180 13211.64 14559.38 6258.83 10715.65 5797.11 1725.55 48173.36 46447.81 1.29     0.17 1085.19

Effect size

#Calculating the effect size

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
# 'paired' is omitted because two independent samples are the default
effectsize(wilcox.test(mydata$Salary ~ mydata$Gender,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |       95% CI
## --------------------------------
## 0.22              | [0.11, 0.33]
interpret_rank_biserial(0.22)
## [1] "medium"
## (Rules: funder2019)
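The reported effect size can be cross-checked by hand. The sketch below assumes the usual relation between the rank-biserial correlation and the Wilcoxon statistic, r = 2W / (n1 * n2) - 1, and uses the values printed in the outputs above; whether effectsize computes it internally in exactly this way is an assumption.

```r
# Values copied from the outputs above
w        <- 19152  # Wilcoxon W for the first group (Gender == 0, male)
n_male   <- 174    # group sizes from describeBy()
n_female <- 180

# Rank-biserial correlation: r = 2W / (n1 * n2) - 1
r_rb <- 2 * w / (n_male * n_female) - 1

print(round(r_rb, 2))
```

The value agrees with the reported r = 0.22 to two decimals. A positive sign means that salaries in the first group (male) tend to be higher than in the second group (female).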

Conclusion

Based on the sample data, we find that men and women differ in their monthly salaries (p < 0.001): men have higher salaries (median 13228.85 vs. 6258.83), and the size of the difference is medium (r = 0.22).