MVA Homework 1

Data

We have a data set that provides insight into the similarities and differences between females and males across several factors. Because males and females form two independent samples, we plan to use an independent-samples t-test to answer our research question, provided its assumptions are met.

Data source: soleman sarkovish. (2023). Male vs Female [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5665153

setwd("C:/Users/Neimenovan uporabnik/Documents/R data/R data/Basics/Reading csv data")
mydata <- read.csv("dataset csv.csv")
head(mydata)
##        Date Gender Driving.test.result    Bmi Childeren    Salary    region smoker age
## 1 1/11/2022 female                   5 27.900         0 16884.924 southwest    yes  19
## 2 1/11/2022 female                   4 33.770         1  1725.552 southeast     no  18
## 3 1/11/2022   male                   8 33.000         3  4449.462 southeast     no  28
## 4 1/11/2022   male                   9 22.705         0 21984.471 northwest     no  33
## 5 1/11/2022 female                   4 28.880         0  3866.855 northwest     no  32
## 6 2/11/2022 female                   2 25.740         0  3756.622 southeast     no  31

We will compare the salaries of males and females. The data contain the monthly salary of 354 employees.

#install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Selecting only the Gender and Salary columns
mydata <- select(mydata, Gender, Salary)
head(mydata)
##   Gender    Salary
## 1 female 16884.924
## 2 female  1725.552
## 3   male  4449.462
## 4   male 21984.471
## 5 female  3866.855
## 6 female  3756.622
#Factorizing

mydata$FGender <- factor(mydata$Gender, 
                         levels = c("male", "female"),
                         labels = c("Male","Female"))

head(mydata)
##   Gender    Salary FGender
## 1 female 16884.924  Female
## 2 female  1725.552  Female
## 3   male  4449.462    Male
## 4   male 21984.471    Male
## 5 female  3866.855  Female
## 6 female  3756.622  Female
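
To illustrate what the factor() call above does, here is a minimal sketch on a made-up vector (the values in `gender` are invented for illustration and are not taken from the data set). `levels` fixes the order of the categories, and `labels` renames them in that same order.

```r
# Toy vector standing in for the Gender column (values are made up)
gender <- c("female", "male", "male", "female")

# levels sets the category order; labels renames the categories in that order
fgender <- factor(gender,
                  levels = c("male", "female"),
                  labels = c("Male", "Female"))

levels(fgender)      # category order: Male first, then Female
as.integer(fgender)  # underlying integer codes (1 = Male, 2 = Female)
table(fgender)       # counts per category, shown in level order
```

Because Male is listed first, it is treated as the first group in later two-group comparisons.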

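Description:

Gender: gender of the employee as recorded in the data set (male or female).
Salary: monthly salary of the employee.
FGender: Gender converted to a factor with the labels Male and Female.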

Descriptive statistics

#Showing descriptive statistics
library(psych)
summary(mydata[c(-1)])
##      Salary        FGender   
##  Min.   : 1137   Male  :174  
##  1st Qu.: 3580   Female:180  
##  Median :10602               
##  Mean   :15390               
##  3rd Qu.:23568               
##  Max.   :51195
result <- describeBy(mydata$Salary, mydata$FGender)
print(result)
## 
##  Descriptive statistics by group 
## group: Male
##    vars   n  mean       sd   median  trimmed      mad     min      max    range skew kurtosis      se
## X1    1 174 17643 14126.77 13228.85 16430.72 14849.64 1137.01 51194.56 50057.55 0.71    -0.68 1070.95
## ------------------------------------------------------------------------------------------ 
## group: Female
##    vars   n     mean       sd  median  trimmed     mad     min      max    range skew kurtosis      se
## X1    1 180 13211.64 14559.38 6258.83 10715.65 5797.11 1725.55 48173.36 46447.81 1.29     0.17 1085.19

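Explanation of a few parameter estimates:

Mean: the average monthly salary is 17643 for males and 13211.64 for females.
Median: half of the males have a monthly salary of 13228.85 or less; for females the median is 6258.83.
Skew: both values are positive (0.71 for males, 1.29 for females), so both distributions have a longer right tail.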

Hypothesis testing

H0: average monthly salary of males = average monthly salary of females

or H0: average monthly salary of males - average monthly salary of females = 0

H1: average monthly salary of males ≠ average monthly salary of females

or H1: average monthly salary of males - average monthly salary of females ≠ 0

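Assumptions we need to meet:

The variable is numeric.
The variable is normally distributed in both populations.
The two samples are independent of each other.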

Checking normal distribution of the variable with histogram

#Checking the normality of distribution with histogram

# Loading the necessary libraries
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(ggpubr)

# Creating the histograms
Gender_0 <- ggplot(mydata[mydata$FGender == "Male", ], aes(x = Salary)) + 
   theme_linedraw() + 
   geom_histogram() +
   ggtitle("Male")

Gender_1 <- ggplot(mydata[mydata$FGender == "Female", ], aes(x = Salary)) + 
   theme_linedraw() + 
   geom_histogram() +
   ggtitle("Female")

# Arranging the plots
ggarrange(Gender_0, Gender_1, ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
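
The message above suggests choosing the bin width deliberately. One common option is the Freedman-Diaconis rule, sketched here on simulated right-skewed values. The names `salary_demo` and `fd_binwidth()` are hypothetical and introduced only for this example; they are not part of the original analysis.

```r
# Simulated right-skewed values standing in for salaries (not the real data)
set.seed(1)
salary_demo <- rlnorm(354, meanlog = 9, sdlog = 1)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1 / 3)
}

bw <- fd_binwidth(salary_demo)

# Number of bins this width implies over the observed range
n_bins <- ceiling(diff(range(salary_demo)) / bw)

bw
n_bins
```

The resulting value could then be passed as `binwidth = bw` to geom_histogram().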

ggplot(mydata, aes(x = Salary, fill = FGender)) +
  # 30 bins (the ggplot2 default), set explicitly to avoid the stat_bin() message
  geom_histogram(position = position_dodge(width = 800), bins = 30) +
  scale_x_continuous(breaks = seq(0, 100000, 10000)) +
  ylab("Frequency") +
  labs(fill = "Gender")

Based on the histograms, the distribution of Salary does not look normal in either group: it is skewed to the right.
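
The visual impression of right skew can be backed by a number. The sketch below computes one common version of the sample skewness by hand on made-up values. `sample_skew()` is a hypothetical helper, and its result may differ slightly from the skew reported by other packages, which offer several variants of the formula.

```r
# One common definition of sample skewness:
# mean of cubed deviations divided by the cubed standard deviation
sample_skew <- function(x) {
  n <- length(x)
  centered <- x - mean(x)
  sum(centered^3) / (n * sd(x)^3)
}

symmetric_demo    <- c(-2, -1, 0, 1, 2)  # made-up symmetric values
right_skewed_demo <- c(1, 1, 1, 2, 10)   # made-up values with a long right tail

sample_skew(symmetric_demo)     # zero for perfectly symmetric data
sample_skew(right_skewed_demo)  # positive for right-skewed data
```

A positive value indicates a longer right tail, which matches the positive skew values in the descriptive statistics.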

Therefore, we will also check the normality of the distribution with the Shapiro-Wilk test.

Checking normal distribution of the variable with Shapiro-Wilk test

# shapiro.test() comes with base R (stats package), so no extra library is needed

# Performing the Shapiro-Wilk test for normality
shapiro_result_male <- shapiro.test(mydata$Salary[mydata$FGender == "Male"])
shapiro_result_female <- shapiro.test(mydata$Salary[mydata$FGender == "Female"])

# Print the results
print(shapiro_result_male)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Salary[mydata$FGender == "Male"]
## W = 0.89199, p-value = 6.227e-10
print(shapiro_result_female)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Salary[mydata$FGender == "Female"]
## W = 0.73435, p-value < 2.2e-16

H0: Variable is normally distributed.

H1: Variable is not normally distributed.

We reject H0 at p < 0.001 for both groups. The variable Salary is not normally distributed in either group (Male or Female).
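
As a rough illustration of how the Shapiro-Wilk test reacts to skewed data, the following sketch applies it to two simulated samples. All values are generated for this example (`normal_demo` and `skewed_demo` are hypothetical names), and the 0.05 threshold is only an assumed significance level.

```r
set.seed(42)

# Simulated samples (not the real data): one normal, one right-skewed
normal_demo <- rnorm(200, mean = 15000, sd = 3000)
skewed_demo <- rexp(200, rate = 1 / 15000)

# shapiro.test() returns a list; the p-value is stored in $p.value
p_normal <- shapiro.test(normal_demo)$p.value
p_skewed <- shapiro.test(skewed_demo)$p.value

alpha <- 0.05  # assumed significance level

# TRUE means H0 (normality) is rejected for that sample
reject_normal <- p_normal < alpha
reject_skewed <- p_skewed < alpha

c(normal = reject_normal, skewed = reject_skewed)
```

The same decision rule, comparing the p-value with the chosen significance level, is what leads to rejecting normality for both groups here.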

Because the normality assumption of the parametric test is not met, we will perform a non-parametric alternative, the Wilcoxon Rank Sum test.

Using non-parametric test Wilcoxon Rank Sum test

#Wilcoxon Rank Sum test

# wilcox.test() comes with base R (stats package), so no extra library is needed
wilcox.test(mydata$Salary ~ mydata$FGender,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Salary by mydata$FGender
## W = 19152, p-value = 0.0002853
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution location is the same for males and females.

H1: Distribution location is not the same for males and females.

Based on the sample data, we reject H0 at p < 0.001: the distribution location of salary is not the same for male and female employees.
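
To make the W statistic less of a black box, the sketch below reproduces it by hand on two small made-up groups: all values are ranked together, the ranks of the first group are summed, and the smallest possible rank sum for that group is subtracted. The vectors `x` and `y` are invented for illustration.

```r
# Made-up values for two independent groups
x <- c(12, 15, 9, 20, 18)  # first group
y <- c(7, 11, 10, 8)       # second group

# Rank all observations together
ranks <- rank(c(x, y))
n1 <- length(x)

# W = rank sum of the first group minus its minimum possible value n1 * (n1 + 1) / 2
W_manual <- sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# The same statistic as reported by wilcox.test()
W_builtin <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)

W_manual   # 18
W_builtin  # 18
```

W therefore counts the pairs in which a value from the first group exceeds a value from the second group, so a large W means the first group tends to have higher values.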

library(psych)
describeBy(mydata$Salary, g = mydata$FGender)
## 
##  Descriptive statistics by group 
## group: Male
##    vars   n  mean       sd   median  trimmed      mad     min      max    range skew kurtosis      se
## X1    1 174 17643 14126.77 13228.85 16430.72 14849.64 1137.01 51194.56 50057.55 0.71    -0.68 1070.95
## ------------------------------------------------------------------------------------------ 
## group: Female
##    vars   n     mean       sd  median  trimmed     mad     min      max    range skew kurtosis      se
## X1    1 180 13211.64 14559.38 6258.83 10715.65 5797.11 1725.55 48173.36 46447.81 1.29     0.17 1085.19

Effect size

#Calculating the effect size

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
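# Gender (character) is ordered alphabetically, female first, so a negative r means females tend to have lower salaries than males
effectsize(wilcox.test(mydata$Salary ~ mydata$Gender,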
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.22             | [-0.33, -0.11]
interpret_rank_biserial(0.22)
## [1] "medium"
## (Rules: funder2019)
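
As a cross-check, the rank-biserial correlation can be recovered from the reported test statistic with the usual relation r = 2W / (n1 * n2) - 1. The sketch below plugs in W from the Wilcoxon output and the group sizes from the descriptive statistics; it is a hand calculation under that formula, not a description of the internals of the effectsize package.

```r
# Values taken from the output shown earlier in this report
W  <- 19152  # Wilcoxon rank sum statistic (Male as first group)
n1 <- 174    # number of males
n2 <- 180    # number of females

# Rank-biserial correlation from W
r_rb <- 2 * W / (n1 * n2) - 1

round(r_rb, 2)  # 0.22
```

The sign depends only on which group is listed first: with Male first the value is positive, and with female first it is negative, as in the effectsize() output above.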

Conclusion

Based on the sample data, we find that male and female employees differ in the distribution location of monthly salary (p < 0.001): men tend to have higher salaries, and the effect size is medium (r = 0.22).