Executive summary

This report uses the Sleep Efficiency Dataset as its source data. We chose sleep pattern data because sleep is an essential part of human life, yet sleep problems are now widespread, especially among young people. People differ in their sleeping habits, lifestyles, and exercise choices, all of which can affect sleep health. Although the dataset is not large, it allows sleep health and its correlates to be analysed from several perspectives by dividing the variables into three dimensions: comprehensive sleep health indicators (sleep duration and sleep efficiency), sleep depth, and lifestyle factors. We therefore consider the dataset to be of sufficient research interest. The aim of this study is to investigate the correlation between lifestyle and sleep health, and to offer practical suggestions for improving sleep health based on our findings.

This report asks the following three research questions and addresses them through data analysis:

  1. How much and how efficiently do men and women sleep at different ages?
  2. Does lifestyle affect sleep efficiency? Do caffeine, alcohol and cigarettes affect sleep efficiency?
  3. Does exercise promote sleep health?

To answer these three questions, this study first looks at sleep health and its patterns across gender and age groups. We then explore the effects of three lifestyle factors (caffeine, alcohol, and smoking) on sleep efficiency. Finally, we analyse the correlation between exercise habits and sleep quality to see whether exercise can promote sleep health, and we provide some recommendations on exercise and sleep health.

Data background

The data was obtained from kaggle.com and comes from a study conducted in Morocco by a group of artificial intelligence engineering students from ENSIAS. As the loading output below shows, the dataset contains information about 452 test subjects and their sleep patterns, recorded in 15 variables. For this report, the following variables are selected according to the three research questions:

  1. For ‘Sleep health overview by age and gender’, four variables are used: ‘Sleep efficiency’, ‘Sleep duration’, ‘Gender’ and ‘Age’. These represent ‘the proportion of time in bed spent asleep’, ‘the total amount of time the test subject slept (in hours)’, ‘male or female’ and ‘age of the test subject’ respectively.

  2. For ‘Does lifestyle affect sleep efficiency’, four variables are used: ‘Sleep efficiency’, ‘Caffeine consumption’, ‘Alcohol consumption’ and ‘Smoking status’. These represent ‘the proportion of time in bed spent asleep’, ‘the amount of caffeine consumed in the 24 hours prior to bedtime (in mg)’, ‘the amount of alcohol consumed in the 24 hours prior to bedtime (in oz)’ and ‘whether or not the test subject smokes’ respectively.

  3. For ‘Does exercise promote sleep health’, four variables are used: ‘Sleep efficiency’, ‘Deep sleep percentage’, ‘Awakenings’ and ‘Exercise frequency’. These represent ‘the proportion of time in bed spent asleep’, ‘the percentage of total sleep time spent in deep sleep’, ‘the number of times the test subject wakes up during the night’ and ‘the number of times the test subject exercises each week’ respectively.

Data loading and cleaning

First, we load the raw data into a new object, ‘sleep_efficiency’. We then use the select function to extract the required variables and rename them.

## Rows: 452 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): Gender, Smoking status
## dbl  (11): ID, Age, Sleep duration, Sleep efficiency, REM sleep percentage, ...
## dttm  (2): Bedtime, Wakeup time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 452 × 12
##      Age Gender sleep_duration sleep_efficiency REM_pct deepsleep_pct
##    <dbl> <chr>           <dbl>            <dbl>   <dbl>         <dbl>
##  1    65 Female            6               0.88      18            70
##  2    69 Male              7               0.66      19            28
##  3    40 Female            8               0.89      20            70
##  4    40 Female            6               0.51      23            25
##  5    57 Male              8               0.76      27            55
##  6    36 Female            7.5             0.9       23            60
##  7    27 Female            6               0.54      28            25
##  8    53 Male             10               0.9       28            52
##  9    41 Female            6               0.79      28            55
## 10    11 Female            9               0.55      18            37
## # ℹ 442 more rows
## # ℹ 6 more variables: lightsleep_pct <dbl>, Awakenings <dbl>,
## #   caffeine_consumption <dbl>, alcohol_consumption <dbl>,
## #   smoking_status <chr>, exercise_freq <dbl>
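The select-and-rename step described above could be sketched as follows. This is only an illustration under stated assumptions: the three rows are copied from the printed output, the `ID` values are placeholders, only a few of the raw columns are included, and in the real workflow `raw` would come from `readr::read_csv()` on the downloaded CSV (the file name is not given in this report).

```r
library(dplyr)

# First three rows of the printed output, stored under the raw column names.
# ID values are placeholders; the remaining raw columns are omitted for brevity.
raw <- tibble(
  ID = 1:3,
  Age = c(65, 69, 40),
  Gender = c("Female", "Male", "Female"),
  `Sleep duration` = c(6, 7, 8),
  `Sleep efficiency` = c(0.88, 0.66, 0.89),
  `Deep sleep percentage` = c(70, 28, 70),
  `Smoking status` = c("Yes", "Yes", "No")
)

# select() keeps and renames columns in one step: new_name = old_name.
# ID is dropped simply by not listing it.
sleep_efficiency <- raw %>%
  select(
    Age, Gender,
    sleep_duration   = `Sleep duration`,
    sleep_efficiency = `Sleep efficiency`,
    deepsleep_pct    = `Deep sleep percentage`,
    smoking_status   = `Smoking status`
  )

names(sleep_efficiency)
```

Renaming inside `select()` avoids a separate `rename()` call and removes the spaces from the raw column names, so later code does not need backticks.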

To prevent missing values (NA) from affecting the analysis, we used the filter function in the data cleaning phase to remove rows with a missing ‘sleep_duration’ or ‘sleep_efficiency’. The other variables are filtered in the relevant sections below.
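A minimal sketch of this cleaning step, using the ten rows printed in this report (the data frame name `sleep_df` and the names `cleaned` and `caffeine_data` are our own, chosen for illustration):

```r
library(dplyr)

# Ten rows taken from the printed output in this report.
sleep_df <- tibble(
  sleep_duration = c(6, 7, 8, 6, 8, 7.5, 6, 10, 6, 9),
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55),
  caffeine_consumption = c(0, 0, 0, 50, 0, NA, 50, 50, 50, 0)
)

# Global cleaning: keep rows where both core variables are present.
cleaned <- sleep_df %>%
  filter(!is.na(sleep_duration), !is.na(sleep_efficiency))

# Section-level cleaning: drop missing caffeine values only where caffeine is analysed.
caffeine_data <- cleaned %>%
  filter(!is.na(caffeine_consumption))

nrow(cleaned)        # all ten sample rows are complete for the core variables
nrow(caffeine_data)  # one sample row has a missing caffeine value
```

Filtering each lifestyle variable only in its own section keeps as many rows as possible for the other analyses.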

Individual figures

1. Sleep health overview by age and gender

(How much and how efficiently do men and women get sleep at different ages?)

1-1 The Relationship between sleep duration and sleep efficiency

First, we were interested in the relationship between sleep duration and sleep efficiency. Intuitively, a short sleep duration might be ascribed to poor sleep efficiency. But is sleep efficiency the main reason, or the only one? To find out, we created a smoothed scatterplot, which gives a general overview of the relationship between two variables and can capture non-linear relationships.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The smoothing line is close to a flat straight line, which suggests that there is no clear relationship between sleep duration and sleep efficiency.

We can also see that the points are concentrated between 7 and 8 hours of sleep, whereas sleep efficiency is relatively dispersed. This suggests that sleep duration varies too little to be an informative variable when looking for relationships with other variables, while sleep efficiency, being more varied, is more useful for the following analysis.
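A smoothed scatterplot of this kind might be built as in the sketch below. The data are the ten rows printed earlier in this report; the choice of which variable goes on which axis and the point transparency are our assumptions, while the loess method matches the `geom_smooth()` message above.

```r
library(ggplot2)

# Ten rows taken from the printed output in this report.
sleep_df <- data.frame(
  sleep_duration = c(6, 7, 8, 6, 8, 7.5, 6, 10, 6, 9),
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
)

# Points show the raw observations; geom_smooth() overlays a loess curve
# with its confidence band. Axis assignment is an assumption.
p <- ggplot(sleep_df, aes(x = sleep_duration, y = sleep_efficiency)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x)

# print(p) renders the figure; with only ten points loess may emit warnings.
```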

1-2 The Difference in sleep efficiency by gender

Next, we wanted to know whether sleep efficiency differs by gender, for example whether males sleep more efficiently than females. To examine the difference we created a density plot. The peak of a density plot indicates the most common values, and overlaying the groups lets us compare their distributions visually.

We coloured the plot by gender and set the transparency to 0.3 so that the result is clearer and easier to compare.
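The density plot described above could be sketched like this, again using the ten rows printed earlier (far too few for a meaningful density estimate, so this only illustrates the plot construction; `sleep_df` is a name chosen for illustration):

```r
library(ggplot2)

# Ten rows taken from the printed output in this report.
sleep_df <- data.frame(
  Gender = c("Female", "Male", "Female", "Female", "Male",
             "Female", "Female", "Male", "Female", "Female"),
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
)

# fill = Gender draws one density curve per gender;
# alpha = 0.3 makes the overlapping areas see-through.
p <- ggplot(sleep_df, aes(x = sleep_efficiency, fill = Gender)) +
  geom_density(alpha = 0.3)
```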

In the sleep efficiency plot there is a clear difference by gender. The distribution for males is more concentrated: it has a single peak, and its density between 0.6 and 0.9 is higher than that for females. In other words, females are more likely to have very low or very high sleep efficiency, since their distribution has two peaks, at approximately 0.55 and 0.9.

1-3 The Difference in sleep efficiency by age

Third, we looked at the difference in sleep efficiency by age. To do so we added a new variable called ‘ages’ by classifying the original variable ‘Age’ into six age groups, because we wanted to treat age as a categorical variable when comparing groups.

As with gender, we wanted to see whether younger people sleep more efficiently than others. This time we used a boxplot, because it shows not only the differences between age groups but also the medians, outliers and dispersion of the data, which helps us understand the differences better.

We also coloured the plot by age group to make the comparison clearer.
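The grouping and boxplot steps above might look like the following sketch. The exact cut points are an assumption: we take decade boundaries with everyone under 20 in the ‘10s’ group and everyone aged 60 or over in the ‘60s’ group, which gives six groups. The data are the ten rows printed earlier in this report.

```r
library(dplyr)
library(ggplot2)

# Ten rows taken from the printed output in this report.
sleep_df <- tibble(
  Age = c(65, 69, 40, 40, 57, 36, 27, 53, 41, 11),
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
) %>%
  # ASSUMPTION: decade-wide bins, left-closed, so age 40 falls in "40s".
  mutate(ages = cut(Age,
                    breaks = c(0, 20, 30, 40, 50, 60, Inf),
                    labels = c("10s", "20s", "30s", "40s", "50s", "60s"),
                    right = FALSE))

# One box per age group, coloured by group.
p <- ggplot(sleep_df, aes(x = ages, y = sleep_efficiency, fill = ages)) +
  geom_boxplot()
```

Because `cut()` returns an ordered set of factor levels, the boxes appear in age order on the x-axis without any extra sorting.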

In the sleep efficiency plot there are clear differences among the groups. Comparing medians, the 10s have the lowest sleep efficiency and the 30s the highest, although the 30s have more outliers below 0.55 than the other groups. We can also see that the groups aged 20 to 59 have similar medians, all above 0.8, whereas the 10s and 60s have much lower medians, both below 0.75.

2. Does lifestyle affect sleep efficiency?

(Do caffeine, alcohol and cigarettes affect sleep efficiency?)

Next, we looked at whether other factors affect sleep, namely caffeine intake, alcohol intake, and smoking. We analysed sleep efficiency, which measures the proportion of time in bed that is actually spent asleep, and classified it into low, medium, and high levels to facilitate the analysis.

## # A tibble: 452 × 5
##    sleep_efficiency cfe_c alh_c smk_s sleep_level
##               <dbl> <dbl> <dbl> <chr> <fct>      
##  1             0.88     0     0 Yes   high       
##  2             0.66     0     3 Yes   medium     
##  3             0.89     0     0 No    high       
##  4             0.51    50     5 Yes   low        
##  5             0.76     0     3 No    medium     
##  6             0.9     NA     0 No    high       
##  7             0.54    50     0 Yes   low        
##  8             0.9     50     0 Yes   high       
##  9             0.79    50     0 No    medium     
## 10             0.55     0     0 No    low        
## # ℹ 442 more rows
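The three-level classification could be produced as in the sketch below. The cut-offs of 0.6 and 0.8 are an assumption on our part: they are not stated in this report, but they reproduce the ‘sleep_level’ values of the ten rows printed above.

```r
library(dplyr)

# sleep_efficiency values of the ten rows printed above.
sleep_df <- tibble(
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
)

# ASSUMPTION: low = below 0.6, medium = 0.6 up to (not including) 0.8,
# high = 0.8 and above. cut() returns an ordered set of factor levels.
sleep_df <- sleep_df %>%
  mutate(sleep_level = cut(sleep_efficiency,
                           breaks = c(0, 0.6, 0.8, 1),
                           labels = c("low", "medium", "high"),
                           right = FALSE,
                           include.lowest = TRUE))

sleep_df
```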

2-1 Sleep Efficiency by Caffeine Consumption

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 25 rows containing missing values (`geom_point()`).

As the chart shows, caffeine consumption does not have a large effect on sleep efficiency. The upward trend in the second half of the line may be because there are few observations with high caffeine intake, so the standard error there is large.

2-2 Sleep Efficiency by Alcohol Consumption

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 14 rows containing missing values (`geom_point()`).

As the chart shows, the relationship between sleep efficiency and alcohol consumption is similar across the sleep efficiency levels: the higher the sleep efficiency, the lower the alcohol consumption.

2-3 Sleep Efficiency by Smoking Status

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

As can be seen from the chart, non-smokers sleep more efficiently than smokers.
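A numeric companion to such a chart is a grouped summary. The sketch below is not the chart used in this report; it only shows how medians per smoking status could be computed, using the ten rows printed earlier. Ten rows are far too few to draw conclusions from, so the output should be read as an illustration of the technique only.

```r
library(dplyr)

# Ten rows taken from the printed output in this report.
sleep_df <- tibble(
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55),
  smk_s = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No")
)

# One row per smoking status with the group size and median sleep efficiency.
by_smoking <- sleep_df %>%
  group_by(smk_s) %>%
  summarise(n = n(), median_efficiency = median(sleep_efficiency))

by_smoking
```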

Overall, we hypothesise that alcohol consumption and smoking status affect sleep efficiency, while caffeine consumption has no notable effect. People with higher sleep efficiency have lower alcohol consumption, and non-smokers have higher sleep efficiency than smokers.

3. Does exercise promote sleep health?

The third research question asks whether exercise promotes sleep health. The dataset has no variable that directly represents sleep health, so we use ‘sleep efficiency’, ‘deep sleep percentage’, and ‘awakenings’ together as a composite indicator. ‘Sleep duration’ was left out, because the first research question showed that it is not an informative variable.

3-1 Relationship between exercising and sleep efficiency

In this dataset, sleep efficiency is the proportion of time in bed spent asleep. To see whether it correlates with exercise, we first generated a boxplot of the two variables (‘exercise_freq’ and ‘sleep_efficiency’). We chose a boxplot because it visualises the median, interquartile range, and extremes of sleep efficiency for each number of exercise sessions per week, which makes it easier to see whether the two variables are related.

We found that sleep efficiency does correlate with exercise. Generally speaking, the more often one exercises per week, the more efficient one’s sleep is, that is, the higher the proportion of time in bed that is actually spent asleep. We therefore hypothesised that exercise might promote sleep efficiency.

3-2 Relationship between sleep efficiency and deep sleep

However, sleep efficiency alone is still an ambiguous measure of sleep health, so we think it is necessary to explore the relationship between sleep efficiency and sleep quality. In the dataset, the authors divided a period of sleep into three stages: REM sleep, deep sleep and light sleep. From our background reading, we learnt that deep sleep is a stage of the sleep cycle that is considered important for the recovery of the body and brain. During deep sleep, muscles relax and the body repairs and grows, while cognitive functions and memory also benefit. We therefore treat sleep with a high percentage of deep sleep as sleep of higher quality. On this basis, we compared sleep efficiency (‘sleep_efficiency’) with the percentage of deep sleep (‘deepsleep_pct’), generating a smoothed scatterplot to explore the relationship between the two variables.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can see two clusters of points in the graph. The subjects with sleep efficiency below 0.7 all had a percentage of deep sleep below 40%, while the subjects with high sleep efficiency had a higher quality of sleep, i.e. a percentage of deep sleep above 50%. From this we hypothesise that sleep efficiency is positively correlated with the percentage of deep sleep (sleep quality). However, as the smoothing line shows, higher sleep efficiency does not always go with a higher percentage of deep sleep, so we still need to pay attention to the standard deviation and the occurrence of extreme values.

Based on the above analysis, and to reduce complexity in the next step, we compared the percentage of deep sleep with the sum of REM sleep and light sleep, and used the mutate function to add a variable called ‘sleep_condition’. It classifies the sleep status of the study participants into two categories, ‘DEEP’ and ‘LIGHT’, to represent the level of the percentage of deep sleep.
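A minimal sketch of this classification is shown below. The REM and deep sleep percentages are the ten rows printed earlier in this report. The light sleep values are not printed, so here they are derived under the assumption that the three percentages sum to 100; treating a tie as ‘LIGHT’ is also our assumption.

```r
library(dplyr)

# REM_pct and deepsleep_pct of the ten rows printed earlier in this report.
sleep_df <- tibble(
  REM_pct = c(18, 19, 20, 23, 27, 23, 28, 28, 28, 18),
  deepsleep_pct = c(70, 28, 70, 25, 55, 60, 25, 52, 55, 37)
) %>%
  # ASSUMPTION: the three stage percentages sum to 100.
  mutate(lightsleep_pct = 100 - REM_pct - deepsleep_pct)

# DEEP when deep sleep exceeds REM and light sleep combined, LIGHT otherwise.
sleep_df <- sleep_df %>%
  mutate(sleep_condition = if_else(deepsleep_pct > REM_pct + lightsleep_pct,
                                   "DEEP", "LIGHT"))

count(sleep_df, sleep_condition)
```

Under the sum-to-100 assumption this rule is equivalent to asking whether deep sleep makes up more than half of total sleep time.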

3-3 Awakenings by exercise frequency and sleep condition

The last dimension of sleep health in this study is the number of awakenings, which is also an important measure of sleep health. A high number of awakenings during one night is generally believed to affect overall sleep quality and health. To explore whether exercise correlates with awakenings, we used ‘awakenings’ as the x-axis, faceted by ‘exercise_freq’, and grouped by ‘sleep_condition’ to generate the following six density plots, which show the number of awakenings under different exercise habits for subjects who already differ in sleep quality.

## Warning: Removed 20 rows containing non-finite values (`stat_density()`).
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

The graphs show the distribution of the number of awakenings in the DEEP and LIGHT sleep conditions for each exercise frequency. In general, subjects with a higher percentage of deep sleep wake up less often, while subjects with a higher percentage of light sleep are more likely to have multiple interruptions of sleep. In addition, exercise frequency had an effect on the number of awakenings. Under the DEEP classification, the distribution of awakenings became increasingly concentrated at 0 to 1 as the number of exercise sessions per week increased. However, we did not see such a pattern under the LIGHT classification, so more data would be needed for further research. From this, we hypothesise that exercise does influence the number of awakenings, but that sleep depth may act as a third variable influencing the relationship.

Conclusion

Question 1. How much and how efficiently do men and women sleep at different ages?

Based on the results in Figure 1, we found that:

  1. Most people sleep 7-8 hours a day.
  2. People have very diverse sleep efficiency.
  3. Sleep efficiency shows no clear relationship with sleep duration.
  4. Men sleep more efficiently than women, but women tend to have more extreme sleep efficiency.
  5. People under 20 have lower sleep efficiency than others. In contrast, people in their 30s have higher sleep efficiency than others, although this age group also has more cases of extremely low sleep efficiency.

Question 2. Does lifestyle affect sleep efficiency? Do caffeine, alcohol and cigarettes affect sleep efficiency?

Based on the results in Figure 2, we found that:

  1. Caffeine does not have a notable effect on sleep efficiency, but alcohol and cigarettes do.
  2. The more alcohol people drink, the lower their sleep efficiency.
  3. People who smoke cigarettes are more likely to have low sleep efficiency than those who don’t. Among smokers, however, sleep efficiency clusters at the two extremes.

Question 3. Does exercise promote sleep health?

Based on the results in Figure 3, we found that:

  1. Exercise has a positive association with sleep efficiency: the more people exercise, the more efficiently they sleep.
  2. Sleep efficiency is positively associated with deep sleep, which can serve as an index of sleep quality, so higher sleep efficiency tends to go with higher sleep quality. Combined with the previous conclusion, this suggests that exercise can promote sleep quality.
  3. People with mostly deep sleep wake up fewer times than people with mostly light sleep, and exercise appears to go with a higher percentage of deep sleep. This suggests that exercise can reduce the number of awakenings during sleep.

To sum up, sleep efficiency differs by both gender and age. Lifestyle can also affect sleep efficiency, with alcohol and cigarettes being the main factors. Finally, exercise has a positive effect on sleep health. So for people with low sleep efficiency or poor sleep health, giving up alcohol and cigarettes and starting to exercise may be a good way to improve their sleep.
