data()
data(package= .packages(all.available = TRUE))
#install.packages("carData")
library(carData)
mydata <- force(SLID)
head(mydata)
## wages education age sex language
## 1 10.56 15.0 40 Male English
## 2 11.00 13.2 19 Male English
## 3 NA 16.0 49 Male Other
## 4 17.76 14.0 46 Male Other
## 5 NA 8.0 71 Male English
## 6 14.00 16.0 50 Female English
##Description:
Unit of observation: a person
These are independent samples
Sample size: 7425
Definition of all variables: wages (numeric - continuous); education (numeric - continuous); age (numeric - continuous); sex (categorical - nominal); language (categorical - nominal)
Wages: Composite hourly wage rate from all jobs, measured in Canadian dollars
Education: Number of years of schooling
Age: The individual’s age, in years
Sex: The sex of the individual, categorized as “Male” or “Female”
Language: The primary language of the individual, which can be English, French, or Other
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 100), ]
library(tidyr)
mydata <- drop_na(mydata)
After drawing a random sample of 100 rows and dropping the rows with missing values, my data have 47 rows, compared with 7425 at the beginning
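As a rough illustration of why the sample shrinks, the sketch below counts missing values per column and compares row counts before and after removing incomplete rows. It uses a small made-up data frame (the values are hypothetical, not taken from SLID) and base R's complete.cases() instead of tidyr::drop_na().

```r
# Toy data frame (hypothetical values, not taken from SLID)
toy <- data.frame(
  wages     = c(10.5, NA, 14.0, NA, 17.8, 12.3),
  education = c(15, 13, NA, 14, 16, 12),
  sex       = c("Male", "Male", "Female", "Female", "Male", "Female")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

# Compare the number of rows before and after
rows_before <- nrow(toy)
rows_after  <- nrow(toy_complete)
cat("Rows before:", rows_before, "| rows after:", rows_after, "\n")
```

Running the same colSums(is.na(...)) check on the sampled data would show which variable is responsible for most of the dropped rows.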
Converting categorical variables to factors
mydata$sexF <- factor(mydata$sex,
                      levels = c("Male", "Female"),
                      labels = c("Male", "Female"))
mydata$languageF <- factor(mydata$language,
                           levels = c("English", "French", "Other"),
                           labels = c("English", "French", "Other"))
library(psych)
describeBy(x=mydata$wages, group=mydata$sexF)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 29 18.15 8.5 18.29 17.46 8.47 6.35 48 41.65 1.22 2.87 1.58
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 18 16.61 10.66 12.97 15.8 7.37 6.08 40.05 33.97 0.88 -0.74 2.51
##Male
Mean: The mean in the male group is 18.15, which is the average hourly wage of males in this sample.
Standard Deviation: The standard deviation in the male group is 8.5, so a typical male wage lies roughly 8.5 dollars away from the group mean.
Range: The range of 41.65 in the male group is the difference between the highest (48) and lowest (6.35) wages.
Skewness: With a skewness of 1.22, the distribution among males is positively (right) skewed: most wages sit at the lower end and a few high wages form a long right tail.
Kurtosis: The kurtosis of 2.87 in the male group is above 0, the value for a normal distribution, so the distribution has heavier tails than the normal distribution. It suggests a higher probability of extreme values.
##Female
Mean: The mean in the female group is 16.61, which is the average hourly wage of females in this sample.
Standard Deviation: The standard deviation in the female group is 10.66, so a typical female wage lies roughly 10.66 dollars away from the group mean.
Range: The range of 33.97 in the female group is the difference between the highest (40.05) and lowest (6.08) wages.
Skewness: With a skewness of 0.88, the distribution among females is positively (right) skewed: most wages sit at the lower end and a few high wages form a right tail.
Kurtosis: The kurtosis of -0.74 in the female group is below 0, the value for a normal distribution, so the distribution has lighter tails than the normal distribution. It suggests a lower probability of extreme values.
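To make the skewness and kurtosis readings more concrete, here is a minimal sketch of the plain moment-based definitions, applied to made-up right-skewed values. This is one common definition; packages such as psych apply small-sample adjustments, so their numbers can differ slightly from this version.

```r
# Toy right-skewed wages (hypothetical values, not from SLID)
x <- c(6, 7, 8, 9, 10, 11, 12, 14, 20, 48)

m  <- mean(x)
m2 <- mean((x - m)^2)  # second central moment (variance with divisor n)
m3 <- mean((x - m)^3)  # third central moment
m4 <- mean((x - m)^4)  # fourth central moment

# Moment-based skewness: positive values indicate a longer right tail
skewness <- m3 / m2^(3 / 2)

# Excess kurtosis: 0 for a normal distribution, positive for heavier tails
excess_kurtosis <- m4 / m2^2 - 3

cat("Skewness:", round(skewness, 2), "\n")
cat("Excess kurtosis:", round(excess_kurtosis, 2), "\n")
```

The single large value (48) pulls both measures above zero, which mirrors what happens in the male group.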
###Assumptions that need to be met:
The first assumption is fulfilled: wage is expressed in Canadian dollars, so it is a numerical variable
The second assumption, normality of wages within each group, is checked with the Shapiro-Wilk test below
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.3.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
  group_by(sexF) %>%
  shapiro_test(wages)
## # A tibble: 2 × 4
## sexF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male wages 0.887 0.00478
## 2 Female wages 0.834 0.00475
H0: Wages in the group are normally distributed (tested separately for males and females)
H1: Wages in the group are not normally distributed
We reject H0 (the normality assumption) for both groups: p = 0.00478 for males and p = 0.00475 for females, both below 0.005
The wage data for both males and females do not appear to come from a normally distributed population
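The per-group test can also be sketched with base R only. The example below runs shapiro.test() separately for each group of a simulated data frame; the group names, seed, and distributions are assumptions chosen purely for illustration.

```r
set.seed(123)  # arbitrary seed so the toy data are reproducible

# Simulated data: group A is drawn from a normal distribution,
# group B from a right-skewed exponential distribution
toy <- data.frame(
  group = rep(c("A", "B"), each = 30),
  value = c(rnorm(30, mean = 15, sd = 3),
            rexp(30, rate = 1 / 15))
)

# Run the Shapiro-Wilk test once per group
results <- lapply(split(toy$value, toy$group), shapiro.test)

# Extract the p-values; a small p-value speaks against normality
p_values <- sapply(results, function(res) res$p.value)
print(p_values)

# Decision at the 5% significance level
reject_h0 <- p_values < 0.05
print(reject_h0)
```

Each group gets its own test and its own decision, which is why the hypotheses are stated per group rather than for both groups jointly.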
We plot histograms of wages by sex as visual support for these results
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = wages)) +
  geom_histogram(binwidth = 5, colour = "grey", fill = "lightblue") +
  facet_wrap(~sexF, ncol = 1) +
  ylab("Frequency")
# Helper: keep only the rows whose wages are at or below wage_threshold
remove_high_wages <- function(df, wage_threshold = 43) {
  df[df$wages <= wage_threshold, ]
}
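The report does not show where this helper is applied, so the following is only a usage sketch on made-up rows (the toy data frame and its values are hypothetical). It assumes wages has no missing values, because a comparison with NA would produce NA rows in the result.

```r
# Same helper as above, repeated so the sketch is self-contained
remove_high_wages <- function(df, wage_threshold = 43) {
  df[df$wages <= wage_threshold, ]
}

# Toy data with one wage above the default threshold of 43
toy <- data.frame(
  wages = c(12.5, 48.0, 30.1, 43.0),
  sex   = c("Male", "Male", "Female", "Female")
)

# Default threshold: drops the row with wages = 48, keeps wages = 43
trimmed_default <- remove_high_wages(toy)
print(trimmed_default)

# Custom threshold: keeps only wages at or below 20
trimmed_custom <- remove_high_wages(toy, wage_threshold = 20)
print(trimmed_custom)
```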
t.test(mydata$wages ~ mydata$sexF,
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$wages by mydata$sexF
## t = 0.52075, df = 30.212, p-value = 0.6063
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -4.512310 7.602271
## sample estimates:
## mean in group Male mean in group Female
## 18.15276 16.60778
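As a check on how the Welch statistic arises, the sketch below recomputes t, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value from the group sizes, means, and standard deviations reported above. The standard deviations are the rounded values from describeBy, so the results agree with the t.test output only up to rounding.

```r
# Summary statistics as reported (rounded) in the output above
n_m <- 29; mean_m <- 18.15276; sd_m <- 8.5    # males
n_f <- 18; mean_f <- 16.60778; sd_f <- 10.66  # females

# Squared standard error contributed by each group
v_m <- sd_m^2 / n_m
v_f <- sd_f^2 / n_f

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean_m - mean_f) / sqrt(v_m + v_f)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_m + v_f)^2 / (v_m^2 / (n_m - 1) + v_f^2 / (n_f - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

cat("t =", round(t_stat, 3),
    "| df =", round(df_welch, 2),
    "| p =", round(p_value, 3), "\n")
```

Because the variances are not pooled, the degrees of freedom (about 30.2) are smaller than the 45 a pooled-variance t-test would use.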
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata$wages ~ mydata$sexF,
                     pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## 0.16 | [-0.43, 0.75]
##
## - Estimated using pooled SD.
interpret_cohens_d (0.16, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
Based on the sample data, we cannot reject the null hypothesis that average wages are equal for males and females (t = 0.52, df = 30.21, p = 0.606). The effect size is very small, with a d value of 0.16.
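For reference, Cohen's d can be approximated by hand from the reported summary statistics. The sketch below shows two common standardizers, the pooled standard deviation and the square root of the average of the two variances (one usual un-pooled choice); whether effectsize uses exactly these formulas internally is an assumption here. With these rounded inputs both versions round to 0.16.

```r
# Summary statistics as reported (rounded) in the describeBy output
n_m <- 29; mean_m <- 18.15276; sd_m <- 8.5    # males
n_f <- 18; mean_f <- 16.60778; sd_f <- 10.66  # females

mean_diff <- mean_m - mean_f

# Pooled SD: variances weighted by their degrees of freedom
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) /
                    (n_m + n_f - 2))
d_pooled <- mean_diff / sd_pooled

# Un-pooled SD: square root of the simple average of the two variances
sd_unpooled <- sqrt((sd_m^2 + sd_f^2) / 2)
d_unpooled <- mean_diff / sd_unpooled

cat("d (pooled SD):", round(d_pooled, 2), "\n")
cat("d (un-pooled SD):", round(d_unpooled, 2), "\n")
```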
Because the normality assumption is violated, we now perform the Wilcoxon Rank Sum Test, a non-parametric alternative that does not require normally distributed wages.
library(psych)
describeBy(mydata$wages, mydata$sexF) #Descriptive statistics for wages based on gender
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 29 18.15 8.5 18.29 17.46 8.47 6.35 48 41.65 1.22 2.87 1.58
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 18 16.61 10.66 12.97 15.8 7.37 6.08 40.05 33.97 0.88 -0.74 2.51
wilcox.test(mydata$wages ~ mydata$sexF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$wages by mydata$sexF
## W = 305, p-value = 0.3355
## alternative hypothesis: true location shift is not equal to 0
H0: The wage distributions of males and females have the same location
H1: The wage distributions of males and females differ in location
With W = 305 and p = 0.3355, there is not enough statistical evidence to reject the null hypothesis.
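To show where the W statistic comes from, the sketch below computes it by hand from ranks on two small made-up groups without ties and compares it with the value returned by wilcox.test(). The vectors x and y are hypothetical.

```r
# Two small hypothetical groups without ties
x <- c(12.1, 18.4, 25.0, 9.7, 30.2)  # group 1
y <- c(8.3, 11.5, 14.9, 10.2)        # group 2

# Rank all observations together
ranks <- rank(c(x, y))
n_x <- length(x)

# W = rank sum of group 1 minus its smallest possible value
w_manual <- sum(ranks[seq_len(n_x)]) - n_x * (n_x + 1) / 2

# The same statistic as reported by wilcox.test()
w_builtin <- unname(wilcox.test(x, y)$statistic)

cat("W by hand:", w_manual, "| W from wilcox.test:", w_builtin, "\n")
```

Equivalently, W counts the pairs in which an observation from the first group exceeds one from the second, so it depends only on the ordering of the wages and not on their exact values.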
library(effectsize)
effectsize(wilcox.test(mydata$wages ~ mydata$sexF,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.17 | [-0.17, 0.47]
interpret_rank_biserial(0.17)
## [1] "small"
## (Rules: funder2019)
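The rank-biserial correlation can be related directly to the W statistic. A minimal sketch, assuming the common definition r = 2W / (n1 * n2) - 1, reproduces the reported value from W = 305 and the group sizes 29 and 18; that effectsize computes it in exactly this way is an assumption.

```r
# Values taken from the output above
w   <- 305  # Wilcoxon rank sum statistic
n_m <- 29   # number of males
n_f <- 18   # number of females

# Share of male-female pairs in which the male wage is higher
share_male_higher <- w / (n_m * n_f)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rank_biserial <- 2 * share_male_higher - 1

cat("Share of pairs with higher male wage:",
    round(share_male_higher, 3), "\n")
cat("Rank-biserial r:", round(r_rank_biserial, 2), "\n")
```

Read this way, males have the higher wage in roughly 58% of all male-female pairs, which corresponds to the small effect reported above.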