This report uses the Sleep Efficiency Dataset as its source data. We chose to analyse sleep patterns because sleep is essential to human health, yet sleep problems are now widespread, especially among young people. People differ in their sleeping habits, lifestyles, and exercise choices, all of which can affect sleep health. Although the dataset is not large, it allows sleep health and its correlates to be analysed from several perspectives by dividing the variables into three dimensions: comprehensive sleep health indicators (sleep duration and sleep efficiency), sleep depth, and lifestyle factors. We therefore consider the dataset to be of sufficient research interest. The aim of this study is to investigate the correlation between lifestyle and sleep health and, based on our findings, to offer practical suggestions for improving sleep health.
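This report asks the following three research questions and addresses them through data analysis: ‘Sleep health overview by age and gender’, ‘Whether lifestyle affects sleep efficiency’, and ‘Whether exercise promotes sleep health’.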
By analysing the data for these three research questions, this study first looks at sleep health and its patterns across gender and age groups. We then explore the effects of three lifestyle factors (caffeine, alcohol, and smoking) on sleep efficiency. Finally, we analyse the correlation between exercise habits and sleep quality to see whether exercise can promote sleep health, and we offer some recommendations on exercise and sleep health.
The data were obtained from kaggle.com and originate from a study conducted in Morocco by a group of artificial intelligence engineering students from ENSIAS. The dataset contains information about 452 test subjects and their sleep patterns, recorded in 15 variables. In this report, the following variables are selected for each of the three research questions:
In the research on ‘Sleep health overview by age and gender’, four variables, ‘Sleep efficiency’, ‘Sleep duration’, ‘Gender’ and ‘Age’ will be used. These variables represent ‘the proportion of time in bed spent asleep’, ‘the total amount of time the test subject slept (in hours)’, ‘male or female’ and ‘age of the test subject’ respectively.
In the research on ‘Whether lifestyle affects sleep efficiency’, four variables, ‘Sleep efficiency’, ‘Caffeine consumption’, ‘Alcohol consumption’ and ‘Smoking status’ will be used. They represent ‘the proportion of time in bed spent asleep’, ‘the amount of caffeine consumed in the 24 hours prior to bedtime (in mg)’, ‘the amount of alcohol consumed in the 24 hours prior to bedtime (in oz)’ and ‘whether or not the test subject smokes’, respectively.
In the research on ‘Whether exercise promotes sleep health’, four variables, ‘Sleep efficiency’, ‘Deep sleep percentage’, ‘Awakenings’ and ‘Exercise frequency’ will be used. These variables represent ‘the proportion of time in bed spent asleep’, ‘the percentage of total sleep time spent in deep sleep’, ‘the number of times the test subject wakes up during the night’ and ‘the number of times the test subject exercises each week’ respectively.
First, we load the raw data and store it in a new object, ‘sleep_efficiency’. We then use the select function to extract the required variables and rename them.
## Rows: 452 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Smoking status
## dbl (11): ID, Age, Sleep duration, Sleep efficiency, REM sleep percentage, ...
## dttm (2): Bedtime, Wakeup time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 452 × 12
## Age Gender sleep_duration sleep_efficiency REM_pct deepsleep_pct
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 65 Female 6 0.88 18 70
## 2 69 Male 7 0.66 19 28
## 3 40 Female 8 0.89 20 70
## 4 40 Female 6 0.51 23 25
## 5 57 Male 8 0.76 27 55
## 6 36 Female 7.5 0.9 23 60
## 7 27 Female 6 0.54 28 25
## 8 53 Male 10 0.9 28 52
## 9 41 Female 6 0.79 28 55
## 10 11 Female 9 0.55 18 37
## # ℹ 442 more rows
## # ℹ 6 more variables: lightsleep_pct <dbl>, Awakenings <dbl>,
## # caffeine_consumption <dbl>, alcohol_consumption <dbl>,
## # smoking_status <chr>, exercise_freq <dbl>
To prevent missing values (NA) from affecting the analysis, we used the filter function in the data cleaning phase to remove rows with a missing ‘sleep_duration’ or ‘sleep_efficiency’. Other variables are filtered in the relevant sections below.
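A minimal sketch of this cleaning step is shown below, assuming the filter simply drops rows where either variable is NA. The toy values, including where the NAs sit, are invented for illustration.

```r
library(dplyr)
library(tibble)

# Invented toy data with one missing value in each column.
toy <- tibble(
  sleep_duration = c(6, NA, 8, 7.5),
  sleep_efficiency = c(0.88, 0.66, NA, 0.90)
)

# Keep only rows where both variables are present.
cleaned <- toy %>%
  filter(!is.na(sleep_duration), !is.na(sleep_efficiency))
```

Passing two conditions to filter combines them with a logical AND, so a row survives only if both values are recorded.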
First, we were interested in the relationship between sleep duration and sleep efficiency. Generally speaking, when sleep duration is short, we might attribute it to poor sleep efficiency. But is sleep efficiency the main reason, or the only reason? To find out, we created a smoothed scatterplot, since it gives a general overview of the relationship between two variables and can capture non-linear relationships.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
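A smoothed scatterplot of this kind can be sketched with ggplot2 as follows. The ten data points are the rows visible in the printed tibble above; the point transparency and axis labels are assumptions, not the original settings.

```r
library(ggplot2)

# The ten rows shown in the printed output above.
toy <- data.frame(
  sleep_duration = c(6, 7, 8, 6, 8, 7.5, 6, 10, 6, 9),
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
)

# Points plus a smoother; with no method given, geom_smooth() picks
# loess for small samples, matching the message in the output.
p <- ggplot(toy, aes(x = sleep_duration, y = sleep_efficiency)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  labs(x = "Sleep duration (hours)", y = "Sleep efficiency")
```

Printing `p` draws the figure. With only ten points the loess fit may emit warnings, so the sketch is meant to show the layer structure rather than reproduce the report's plot.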
The resulting smoothing line is close to a flat straight line, which suggests that there is little or no relationship between sleep duration and sleep efficiency.
We can also see that the points are concentrated between 7 and 8 hours of sleep duration, whereas sleep efficiency is relatively dispersed. This suggests that sleep duration is not an effective variable for revealing relationships with other variables, while sleep efficiency, being more varied, is more valuable for the research that follows.
Next, we wanted to know whether sleep efficiency differs by gender, for example whether males sleep more efficiently than females. To examine the difference, we created a density plot. The peak of a density plot indicates the most common values, and plotting the groups together lets us compare their distributions visually.
We colored the plot by gender and set the transparency to 0.3 so that the result would be clearer and easier to compare.
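The density plot described above might be built along these lines. This is a sketch under assumptions: the data are the ten rows from the printed output, and mapping gender to `fill` (rather than `colour`) is a guess at the original aesthetic.

```r
library(ggplot2)

# The ten rows shown in the printed output above.
toy <- data.frame(
  Gender = c("Female", "Male", "Female", "Female", "Male",
             "Female", "Female", "Male", "Female", "Female"),
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
)

# One density curve per gender; alpha = 0.3 keeps overlapping areas visible.
p <- ggplot(toy, aes(x = sleep_efficiency, fill = Gender)) +
  geom_density(alpha = 0.3) +
  labs(x = "Sleep efficiency", y = "Density")
```

Because the two curves overlap, a low alpha lets both distributions remain readable in the shared region.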
The sleep efficiency plot shows a noticeable difference by gender. The distribution for males is more concentrated: the density of males’ sleep efficiency between 0.6 and 0.9 is higher than that of females, and it has only one peak. In other words, females are more likely to have very low or very high sleep efficiency, since the distribution for females has two peaks, at approximately 0.55 and 0.9.
Thirdly, we tried to find the difference in sleep efficiency by age. To do so, we added a new variable called ‘ages’ by classifying the existing variable ‘Age’ into six age groups, because we wanted to treat age as a categorical variable when analysing differences between age groups.
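One way to derive such a grouping is with cut inside mutate, as sketched below. The decade-wide breaks and the labels ‘10s’ to ‘60s’ are assumptions inferred from the group names used later in the text; the original code may have defined the groups differently.

```r
library(dplyr)
library(tibble)

# Ages taken from the rows shown in the printed output above.
toy <- tibble(Age = c(65, 69, 40, 57, 36, 27, 53, 41, 11))

# Assumed decade bins: [10,20), [20,30), ..., [60,70).
toy <- toy %>%
  mutate(
    ages = cut(
      Age,
      breaks = c(10, 20, 30, 40, 50, 60, 70),
      labels = c("10s", "20s", "30s", "40s", "50s", "60s"),
      right = FALSE
    )
  )
```

With `right = FALSE`, an age of exactly 40 falls into ‘40s’ rather than ‘30s’. Any age outside 10 to 69 would become NA under these assumed breaks.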
As in the analysis by gender, we wanted to see whether younger people sleep more efficiently than others. This time, however, we used a boxplot, because it shows not only the differences between age groups but also the medians, outliers, and dispersion of the data. This information helps us to understand the differences better.
We also colored the plot by age group to make the comparison clearer.
The sleep efficiency plot shows noticeable differences among the groups. Comparing median sleep efficiency, the 10s have the lowest and the 30s the highest, although the 30s have more outliers below 0.55 than the other groups. We can also see that the groups between 20 and 59 have similar medians, all over 0.8, whereas the 10s and 60s have much lower medians, both below 0.75.
Next, we wanted to look at whether other factors have an impact on sleep, including caffeine intake, alcohol intake, and smoking. We chose sleep efficiency for this analysis because it measures the proportion of time in bed that is actually spent asleep, and we classified it into low, medium, and high levels to facilitate the analysis.
## # A tibble: 452 × 5
## sleep_efficiency cfe_c alh_c smk_s sleep_level
## <dbl> <dbl> <dbl> <chr> <fct>
## 1 0.88 0 0 Yes high
## 2 0.66 0 3 Yes medium
## 3 0.89 0 0 No high
## 4 0.51 50 5 Yes low
## 5 0.76 0 3 No medium
## 6 0.9 NA 0 No high
## 7 0.54 50 0 Yes low
## 8 0.9 50 0 Yes high
## 9 0.79 50 0 No medium
## 10 0.55 0 0 No low
## # ℹ 442 more rows
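The three-level classification shown in the ‘sleep_level’ column could be produced along the following lines. The cut points 0.6 and 0.8 are assumptions: they reproduce the ten labels visible in the printed output, but the thresholds actually used in the report are not stated.

```r
library(dplyr)
library(tibble)

# Sleep efficiency values from the ten rows printed above.
toy <- tibble(
  sleep_efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.90, 0.79, 0.55)
)

# Assumed thresholds: up to 0.6 is low, up to 0.8 is medium, above is high.
toy <- toy %>%
  mutate(
    sleep_level = cut(
      sleep_efficiency,
      breaks = c(0, 0.6, 0.8, 1),
      labels = c("low", "medium", "high"),
      include.lowest = TRUE
    )
  )
```

Using cut returns a factor with ordered levels, which matches the `<fct>` column type in the output and keeps the legend in low, medium, high order when plotting.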
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 25 rows containing missing values (`geom_point()`).
As can be seen from the chart, caffeine consumption did not have a large effect on sleep efficiency. The second half of the line trends upward, possibly because there are few observations with high caffeine intake, so the standard error there is large.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 14 rows containing missing values (`geom_point()`).
As can be seen from the chart, the relationship between sleep efficiency and alcohol consumption is similar across the sleep efficiency levels: the higher the sleep efficiency, the lower the alcohol consumption.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
As can be seen from the chart, non-smokers sleep more efficiently than smokers.
In general, we hypothesise that alcohol consumption and smoking status affect people’s sleep, whereas caffeine consumption does not have a noticeable effect. People with higher sleep efficiency have lower alcohol consumption, and non-smokers have higher sleep efficiency than smokers.
The third research question asks whether exercise promotes sleep health. The dataset has no variable that directly represents sleep health, so we use ‘sleep efficiency’, ‘deep sleep percentage’, and ‘awakenings’ together as a composite indicator of sleep health. As for ‘sleep duration’, we found in the first research question that it could not serve as a valid influencing factor, so it was not included in the composite indicator.
In this dataset, sleep efficiency means the proportion of time in bed that is spent asleep. To see whether this variable correlates with exercise, we first generated a boxplot of the two variables (‘exercise_freq’ and ‘sleep_efficiency’). We chose a boxplot because it shows the median, interquartile range, and extremes of sleep efficiency for each number of exercise sessions per week, which makes it easier to see whether the two variables are correlated.
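A boxplot of this kind can be sketched as below. The toy values are invented purely so that the code runs and do not reflect the dataset; converting ‘exercise_freq’ to a factor is an assumption about how one box per frequency was obtained.

```r
library(ggplot2)

# Invented toy data, for illustration only.
toy <- data.frame(
  exercise_freq = c(0, 0, 1, 1, 3, 3, 5, 5),
  sleep_efficiency = c(0.55, 0.66, 0.70, 0.76, 0.79, 0.88, 0.89, 0.90)
)

# factor() makes ggplot draw one box per exercise frequency
# instead of treating the x-axis as continuous.
p <- ggplot(toy, aes(x = factor(exercise_freq), y = sleep_efficiency)) +
  geom_boxplot() +
  labs(x = "Exercise sessions per week", y = "Sleep efficiency")
```

Each box spans the interquartile range with a line at the median, so medians can be compared across exercise frequencies at a glance.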
As a result, we found that sleep efficiency does correlate with exercise. Generally speaking, the more often one exercises per week, the more efficient one’s sleep is, that is, the higher the proportion of time in bed that is actually spent asleep. We therefore hypothesised that exercise might promote sleep efficiency.
However, sleep efficiency alone is still somewhat ambiguous as a representation of sleep health, so we think it is necessary to explore the relationship between sleep efficiency and sleep quality. In the dataset, the authors divided a period of sleep into three stages: REM sleep, deep sleep, and light sleep. From our background reading, we learnt that deep sleep is a stage of the sleep cycle that is considered important for the recovery of the body and brain. During deep sleep, muscles relax and the body repairs and grows, while cognitive functions and memory also benefit. We therefore treat sleep with a high percentage of deep sleep as sleep of higher quality. On this basis, we compared sleep efficiency (‘sleep_efficiency’) with the percentage of deep sleep (‘deepsleep_pct’), generating a smoothed scatterplot to explore the relationship between the two variables.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The graph shows two clusters of points. Subjects with sleep efficiency below 0.7 all had a deep sleep percentage below 40%, while subjects with high sleep efficiency had higher-quality sleep, i.e. a deep sleep percentage above 50%. From this we hypothesise that sleep efficiency is positively correlated with the percentage of deep sleep (sleep quality). However, as the smoothed line shows, higher sleep efficiency does not always correspond to a higher percentage of deep sleep, so we still need to pay attention to the standard deviation and the occurrence of extreme values.
Based on the above analysis, and to reduce the complexity of the next analysis, we compared the percentage of deep sleep with the sum of the REM and light sleep percentages and used the mutate function to add a variable called ‘sleep_condition’, which classifies each participant’s sleep as either ‘DEEP’ or ‘LIGHT’ to represent the level of the deep sleep percentage.
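The classification could be written roughly as follows. Several things here are assumptions: the REM and deep sleep percentages are the first five rows of the printed output, but the light sleep values are filled in on the assumption that the three stages sum to 100, and the rule that ties count as ‘LIGHT’ is a guess.

```r
library(dplyr)
library(tibble)

# REM and deep sleep from the printed output; light sleep assumed
# to be 100 minus the other two stages.
toy <- tibble(
  REM_pct = c(18, 19, 20, 23, 27),
  deepsleep_pct = c(70, 28, 70, 25, 55),
  lightsleep_pct = c(12, 53, 10, 52, 18)
)

# DEEP when deep sleep exceeds REM and light sleep combined, else LIGHT.
toy <- toy %>%
  mutate(
    sleep_condition = if_else(
      deepsleep_pct > REM_pct + lightsleep_pct, "DEEP", "LIGHT"
    )
  )
```

If the three percentages always sum to 100, this rule is equivalent to asking whether deep sleep makes up more than half of total sleep time.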
The last dimension of sleep health in this study is the number of awakenings, which is also an important measure of sleep health. A high number of awakenings during one sleep is usually believed to affect overall sleep quality and health. To explore whether exercise correlates with awakenings, we put ‘awakenings’ on the x-axis, faceted by ‘exercise_freq’, and grouped by ‘sleep_condition’ to generate the following six density plots, which show the number of awakenings under different exercise habits for each sleep quality category.
## Warning: Removed 20 rows containing non-finite values (`stat_density()`).
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
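The faceted density plots described above might be assembled as in this sketch. The toy values are invented so that the code runs, and the use of `fill`, the 0.3 transparency, and facet_wrap (rather than facet_grid) are assumptions about the original chunk.

```r
library(ggplot2)

# Invented toy data, for illustration only.
toy <- data.frame(
  exercise_freq = c(0, 0, 0, 0, 3, 3, 3, 3),
  sleep_condition = c("DEEP", "DEEP", "LIGHT", "LIGHT",
                      "DEEP", "DEEP", "LIGHT", "LIGHT"),
  Awakenings = c(1, 2, 3, 4, 0, 1, 2, 4)
)

# One panel per exercise frequency, one density curve per sleep condition.
p <- ggplot(toy, aes(x = Awakenings, fill = sleep_condition)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ exercise_freq) +
  labs(x = "Number of awakenings", y = "Density")
```

A density estimate needs at least two observations per group, which is why panels with a single DEEP or LIGHT subject trigger the "Groups with fewer than two data points" warning seen in the output.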
The graphs show the distribution of the number of awakenings in the DEEP and LIGHT sleep conditions for each exercise frequency. In general, subjects with a higher percentage of deep sleep wake up less often, while subjects with a higher percentage of light sleep are more likely to have multiple interruptions of sleep. In addition, exercise frequency had an effect on the number of awakenings. Under the DEEP classification, the distribution of awakenings became increasingly concentrated at 0-1 times as the number of exercise sessions per week increased. However, we did not see such a pattern under the LIGHT classification, so more data would be needed to support further research. From this, we hypothesise that exercise does influence the number of awakenings, but that sleep depth may act as a third influencing variable.
Based on Figure 1’s results, we found out that
Based on Figure 2’s results, we found out that
Based on Figure 3’s results, we found out that
To sum up, we found some differences in people’s sleep efficiency by both gender and age. People’s lifestyles can also affect their sleep efficiency, with alcohol and cigarettes being the main factors. Finally, exercise does appear to have a positive effect on sleep health. So for people with low sleep efficiency or poor sleep health, giving up alcohol and cigarettes and starting to exercise may be a good way to improve those sleep problems.