Student: Katja Volk Štefić

Research question: Is there a difference in sleep duration between males and females?

mydata <- read.table("./Sleep_health_and_lifestyle_dataset.csv", header=TRUE, sep = ",", dec = ".")

head(mydata)
##   Person.ID Gender Age           Occupation Sleep.Duration
## 1         1   Male  27    Software Engineer            6.1
## 2         2   Male  28               Doctor            6.2
## 3         3   Male  28               Doctor            6.2
## 4         4   Male  28 Sales Representative            5.9
## 5         5   Male  28 Sales Representative            5.9
## 6         6   Male  28    Software Engineer            5.9
##   Quality.of.Sleep Physical.Activity.Level Stress.Level BMI.Category
## 1                6                      42            6   Overweight
## 2                6                      60            8       Normal
## 3                6                      60            8       Normal
## 4                4                      30            8        Obese
## 5                4                      30            8        Obese
## 6                4                      30            8        Obese
##   Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 1         126/83         77        4200           None
## 2         125/80         75       10000           None
## 3         125/80         75       10000           None
## 4         140/90         85        3000    Sleep Apnea
## 5         140/90         85        3000    Sleep Apnea
## 6         140/90         85        3000       Insomnia

The unit of observation is one participant in this study.

The sample size is 374.

Explanation of data:

The source of the data is Kaggle.

mydata$GenderF <- factor(mydata$Gender, 
                         levels = c("Male", "Female"), 
                         labels = c("Male", "Female"))
mydata$OccupationF <- factor(mydata$Occupation, 
                             levels = c("Accountant","Doctor", "Engineer", "Lawyer", "Manager", "Nurse", "Sales Representative", "Salesperson", "Scientist", "Software Engineer", "Teacher"),
                             labels = c("Accountant","Doctor", "Engineer", "Lawyer", "Manager", "Nurse", "Sales Representative", "Salesperson", "Scientist", "Software Engineer", "Teacher"))
mydata$BMI.CategoryF <- factor(mydata$BMI.Category,
                              levels = c("Normal", "Normal Weight", "Obese", "Overweight"),
                              labels = c("Normal", "Normal Weight", "Obese", "Overweight"))
mydata$Sleep.DisorderF <- factor(mydata$Sleep.Disorder,
                                 levels = c("Insomnia", "None", "Sleep Apnea"),
                                 labels = c("Insomnia", "None", "Sleep Apnea"))
#Changing variables into factors
summary(mydata[ , c(-1, -2, -4, -9, -10, -13)]) 
##       Age        Sleep.Duration  Quality.of.Sleep
##  Min.   :27.00   Min.   :5.800   Min.   :4.000   
##  1st Qu.:35.25   1st Qu.:6.400   1st Qu.:6.000   
##  Median :43.00   Median :7.200   Median :7.000   
##  Mean   :42.18   Mean   :7.132   Mean   :7.313   
##  3rd Qu.:50.00   3rd Qu.:7.800   3rd Qu.:8.000   
##  Max.   :59.00   Max.   :8.500   Max.   :9.000   
##                                                  
##  Physical.Activity.Level  Stress.Level     Heart.Rate   
##  Min.   :30.00           Min.   :3.000   Min.   :65.00  
##  1st Qu.:45.00           1st Qu.:4.000   1st Qu.:68.00  
##  Median :60.00           Median :5.000   Median :70.00  
##  Mean   :59.17           Mean   :5.385   Mean   :70.17  
##  3rd Qu.:75.00           3rd Qu.:7.000   3rd Qu.:72.00  
##  Max.   :90.00           Max.   :8.000   Max.   :86.00  
##                                                         
##   Daily.Steps      GenderF        OccupationF       BMI.CategoryF
##  Min.   : 3000   Male  :189   Nurse     :73   Normal       :195  
##  1st Qu.: 5600   Female:185   Doctor    :71   Normal Weight: 21  
##  Median : 7000                Engineer  :63   Obese        : 10  
##  Mean   : 6817                Lawyer    :47   Overweight   :148  
##  3rd Qu.: 8000                Teacher   :40                      
##  Max.   :10000                Accountant:37                      
##                               (Other)   :43                      
##     Sleep.DisorderF
##  Insomnia   : 77   
##  None       :219   
##  Sleep Apnea: 78   
##                    
##                    
##                    
## 
#Descriptive statistics

Explanation of estimates of parameters:

Age: Mean:42.18
The average age of participants in this study was 42.18 years.

Sleep duration: Min:5.800
The shortest sleep duration observed in the sample was 5.8 hours per day.

Daily steps: 1st Qu.:5600
25% of participants took 5600 steps per day or fewer, and 75% of participants took more than 5600 steps per day.
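
To illustrate how these estimates are obtained, here is a minimal sketch on a small made-up vector of step counts. The vector `steps` and all object names are hypothetical and are not taken from the dataset; `quantile()` is used with R's default method (type 7).

```r
# Hypothetical daily step counts for nine people (not taken from the dataset)
steps <- c(3000, 4200, 5600, 6000, 7000, 8000, 8000, 10000, 10000)

mean_steps <- mean(steps)            # arithmetic mean
min_steps  <- min(steps)             # smallest observed value
q1_steps   <- quantile(steps, 0.25)  # first quartile (default type 7)

# Share of observations at or below the first quartile
share_at_or_below_q1 <- mean(steps <= q1_steps)

print(c(mean = mean_steps, min = min_steps, q1 = unname(q1_steps)))
print(share_at_or_below_q1)
```

In a small sample with ties the share at or below the first quartile can be somewhat larger than 25%, which is why the interpretation says "at least roughly 25%" in practice.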

Hypothesis testing

My research question asks whether sleep duration differs between males and females, so I only need the variables Gender and Sleep.Duration.

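To perform a parametric test (the independent samples t-test), these assumptions have to be met: the main variable is numeric, it is normally distributed in both populations, and the observations are independent.

Main variable: Sleep duration. Factorial variable: Gender.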

library(psych)
describeBy(mydata$Sleep.Duration, mydata$GenderF)
## 
##  Descriptive statistics by group 
## group: Male
##    vars   n mean   sd median trimmed  mad min max range  skew
## X1    1 189 7.04 0.69    7.2    7.06 0.89 5.9 8.1   2.2 -0.27
##    kurtosis   se
## X1    -1.53 0.05
## ---------------------------------------------------- 
## group: Female
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 185 7.23 0.88    7.2    7.23 1.33 5.8 8.5   2.7 0.07    -1.48
##      se
## X1 0.06
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
Sleep.Duration_male <- ggplot(mydata[mydata$GenderF == "Male",  ], aes(x = Sleep.Duration)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 0.2, colour= "lightskyblue3", fill="lightskyblue1") +
  ggtitle("Sleep_duration_male")

Sleep.Duration_female <- ggplot(mydata[mydata$GenderF == "Female",  ], aes(x = Sleep.Duration)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 0.2, colour= "plum3", fill="plum2") +
  ggtitle("Sleep_duration_female")

library(ggpubr)
ggarrange(Sleep.Duration_male, Sleep.Duration_female,
          ncol = 2, nrow = 1)

Based on the histograms, the distribution of sleep duration does not look normal for males or for females. We performed the Shapiro-Wilk test to check this formally.

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(GenderF) %>%
  shapiro_test(Sleep.Duration)
## # A tibble: 2 × 4
##   GenderF variable       statistic        p
##   <fct>   <chr>              <dbl>    <dbl>
## 1 Male    Sleep.Duration     0.872 1.51e-11
## 2 Female  Sleep.Duration     0.899 6.36e-10

H0: Sleep duration is normally distributed.

H1: Sleep duration is not normally distributed.

We reject H0 for males at p < 0.001. We reject H0 for females at p < 0.001.
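
The same per-group check can be sketched in base R without rstatix. The example below runs on simulated data, so the data frame `toy` and its values are assumptions for illustration only and will not reproduce the results above.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the real dataset)
toy <- data.frame(
  sleep  = c(rnorm(30, mean = 7.0, sd = 0.7), rnorm(30, mean = 7.2, sd = 0.9)),
  gender = factor(rep(c("Male", "Female"), each = 30),
                  levels = c("Male", "Female"))
)

# Split the numeric variable by group and run shapiro.test() on each part
shapiro_by_group <- lapply(split(toy$sleep, toy$gender), shapiro.test)

# Collect the W statistics and p-values into one small table
shapiro_table <- data.frame(
  group     = names(shapiro_by_group),
  statistic = sapply(shapiro_by_group, function(res) unname(res$statistic)),
  p_value   = sapply(shapiro_by_group, function(res) res$p.value),
  row.names = NULL
)
print(shapiro_table)
```

A p-value below the chosen significance level in a group leads to rejecting normality for that group.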

Because the normality assumption is not met, we have to perform a non-parametric test, the Wilcoxon Rank Sum Test.

library(ggplot2)
ggplot(mydata, aes(x = Sleep.Duration, fill = GenderF)) +
  # Dodge bars within each bin so both genders stay at their true x position
  geom_histogram(position = "dodge", binwidth = 0.2, colour = "black") +
  # Sleep duration ranges from 5.8 to 8.5 hours, so use half-hour breaks
  scale_x_continuous(breaks = seq(5.5, 9, 0.5)) +
  ylab("Frequency") +
  labs(fill = "Gender") +
  scale_fill_manual(values = c("Male" = "cadetblue1", "Female" = "orange"))

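# Two independent samples. 'paired' is left out: unpaired is the default,
# and the formula method stops with an error in R 4.4.0 and later if it is supplied.
wilcox.test(mydata$Sleep.Duration ~ mydata$GenderF,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")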
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Sleep.Duration by mydata$GenderF
## W = 14930, p-value = 0.01442
## alternative hypothesis: true location shift is not equal to 0

H0: The location of the distribution of sleep duration is the same for males and females.

H1: The location of the distribution of sleep duration is different for males and females.

Based on the sample data, we reject H0 at p = 0.014 and conclude that sleep duration differs between males and females.
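
To show where the W statistic comes from, here is a minimal sketch that computes it by hand from ranks and compares it with `wilcox.test()`. The two small groups are made-up values, not data from the study.

```r
# Hypothetical sleep durations (hours) for two small independent groups
group_a <- c(6.1, 6.4, 7.0, 7.3)
group_b <- c(6.8, 7.5, 7.9, 8.2, 8.4)

# Rank all observations together, then sum the ranks of the first group
all_ranks  <- rank(c(group_a, group_b))
rank_sum_a <- sum(all_ranks[seq_along(group_a)])

# W as reported by wilcox.test(): rank sum minus its smallest possible value
n_a      <- length(group_a)
w_manual <- rank_sum_a - n_a * (n_a + 1) / 2

# The same statistic from the built-in test
w_builtin <- unname(
  wilcox.test(group_a, group_b, exact = FALSE, correct = FALSE)$statistic
)

print(c(manual = w_manual, builtin = w_builtin))
```

W counts the pairs in which an observation from the first group is larger than one from the second group, so a small W means the first group tends to have lower values.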

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
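# 'paired' is left out: unpaired is the default, and the formula method
# stops with an error in R 4.4.0 and later if it is supplied.
effectsize(wilcox.test(mydata$Sleep.Duration ~ mydata$GenderF,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))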
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.15             | [-0.26, -0.03]
interpret_rank_biserial(0.15)
## [1] "small"
## (Rules: funder2019)
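
As a check, the rank-biserial correlation can be recomputed from the W statistic and the group sizes reported above. This is a sketch of the usual formula, r = 2W / (n1 * n2) - 1; the object names are my own.

```r
# Values copied from the output above
w  <- 14930  # Wilcoxon W for the first group (Male)
n1 <- 189    # number of males
n2 <- 185    # number of females

# Rank-biserial correlation:
# (pairs favouring group 1 - pairs favouring group 2) / all pairs
r_rank_biserial <- 2 * w / (n1 * n2) - 1
print(round(r_rank_biserial, 2))
```

The sign is negative because males, the first factor level, tend to have the lower ranks; the size of the effect is judged from the absolute value.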

Conclusion: Based on the data, we find that men and women differ in sleep duration (p = 0.014). Women tend to sleep longer, but the difference between the distributions is small (r = -0.15, with males as the first group).

If sleep duration were normally distributed, we would perform a parametric test, the independent samples t-test.

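# 'paired' is left out: unpaired is the default, and the formula method
# stops with an error in R 4.4.0 and later if it is supplied.
t.test(mydata$Sleep.Duration ~ mydata$GenderF,
       var.equal = FALSE,
       alternative = "two.sided")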
## 
##  Welch Two Sample t-test
## 
## data:  mydata$Sleep.Duration by mydata$GenderF
## t = -2.3565, df = 349.38, p-value = 0.019
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -0.35448564 -0.03195795
## sample estimates:
##   mean in group Male mean in group Female 
##             7.036508             7.229730

H0: The difference in means between males and females is equal to zero.

H1: The difference in means between males and females is not equal to zero.

We reject H0 at p = 0.019.
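
The Welch statistic and its degrees of freedom can be approximately recomputed from the group summaries reported earlier. This is only a sketch: the standard deviations are taken from the describeBy() output, where they are rounded to two decimals, so the results are close to the values above but not identical.

```r
# Summary statistics copied from the outputs above (SDs rounded to 2 decimals)
m1 <- 7.036508; s1 <- 0.69; n1 <- 189  # Male
m2 <- 7.229730; s2 <- 0.88; n2 <- 185  # Female

# Squared standard error of each group mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic (no equal-variance assumption)
t_welch <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

print(c(t = t_welch, df = df_welch, p = p_welch))
```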

library(effectsize)
effectsize::cohens_d(mydata$Sleep.Duration ~ mydata$GenderF,
                     pooled_sd = FALSE)
## Cohen's d |         95% CI
## --------------------------
## -0.24     | [-0.45, -0.04]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.24, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
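
Cohen's d with un-pooled SD can be checked by hand in the same way. The sketch below assumes the un-pooled standardiser is the square root of the average of the two group variances, and it uses the rounded standard deviations from describeBy(), so it only approximates the value above.

```r
# Summary statistics copied from the outputs above (SDs rounded to 2 decimals)
m1 <- 7.036508; s1 <- 0.69  # Male
m2 <- 7.229730; s2 <- 0.88  # Female

# Un-pooled standardiser: square root of the mean of the two variances
sd_unpooled <- sqrt((s1^2 + s2^2) / 2)

# Standardised mean difference (Male minus Female)
d_unpooled <- (m1 - m2) / sd_unpooled
print(round(d_unpooled, 2))
```

The negative sign again reflects that males, the first group, have the lower mean sleep duration.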