Introduction

The topic of the dataset I chose is sleep efficiency. The dataset includes both quantitative and categorical variables. The variables “age”, “exercise frequency”, and “ID” are stored as numbers but could be considered categorical variables; “gender” and “smoking status” are categorical variables; “bedtime” and “wakeup time” are date-times; and “sleep duration”, “sleep efficiency”, “REM sleep percentage”, “deep sleep percentage”, “light sleep percentage”, “awakenings”, “caffeine consumption”, and “alcohol consumption” are all numerical variables. With these variables, I plan to explore correlations between sleep efficiency and different lifestyle factors. To do this, I will keep only the columns that relate to lifestyle, along with the sleep efficiency score, and then remove all rows with “NA” values. The conclusions drawn from this dataset may be unreliable: its source is Kaggle user ‘Equilibriumm’, who cited their source as “a research team from The University of Oxfordshire.” This university does not exist, so the visualizations presented may be misleading.

library(tidyverse)
library(ggplot2)
library(ggfortify)
library(GGally)
library(plotly)
library(readr)
setwd("C:/Users/takla/OneDrive/Documents/Data110/3.8.23")
Sleep_Efficiency <- read_csv("~/Data110/3.8.23/Sleep_Efficiency.csv")
head(Sleep_Efficiency)
## # A tibble: 6 × 15
##      ID   Age Gender Bedtime             `Wakeup time`       Sleep dur…¹ Sleep…²
##   <dbl> <dbl> <chr>  <dttm>              <dttm>                    <dbl>   <dbl>
## 1     1    65 Female 2021-03-06 01:00:00 2021-03-06 07:00:00         6      0.88
## 2     2    69 Male   2021-12-05 02:00:00 2021-12-05 09:00:00         7      0.66
## 3     3    40 Female 2021-05-25 21:30:00 2021-05-25 05:30:00         8      0.89
## 4     4    40 Female 2021-11-03 02:30:00 2021-11-03 08:30:00         6      0.51
## 5     5    57 Male   2021-03-13 01:00:00 2021-03-13 09:00:00         8      0.76
## 6     6    36 Female 2021-07-01 21:00:00 2021-07-01 04:30:00         7.5    0.9 
## # … with 8 more variables: `REM sleep percentage` <dbl>,
## #   `Deep sleep percentage` <dbl>, `Light sleep percentage` <dbl>,
## #   Awakenings <dbl>, `Caffeine consumption` <dbl>,
## #   `Alcohol consumption` <dbl>, `Smoking status` <chr>,
## #   `Exercise frequency` <dbl>, and abbreviated variable names
## #   ¹​`Sleep duration`, ²​`Sleep efficiency`

Cleaning the dataset

The first thing I did was select the specific columns I knew I needed and drop everything else.

sleep <- Sleep_Efficiency %>%
  select(`Sleep efficiency`, `Caffeine consumption`, `Alcohol consumption`, `Smoking status`, `Exercise frequency`)

Then I removed all rows containing “NA” values so that they would not skew the results.

# drop_na() with no arguments removes every row that has an NA in any column
cor_sleep <- sleep %>%
  drop_na()
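
Before dropping rows, it can help to check how many values are missing in each column and how many rows will survive. The following is a minimal sketch in base R; `toy_sleep` and its values are invented for illustration and are not taken from the real dataset.

```r
# Toy stand-in for the `sleep` tibble; the values are invented for illustration.
toy_sleep <- data.frame(
  `Sleep efficiency`     = c(0.88, 0.66, NA, 0.51),
  `Caffeine consumption` = c(0, NA, 50, 25),
  `Alcohol consumption`  = c(0, 3, NA, 1),
  check.names = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_sleep))

# Number of rows with no missing value in any column
rows_kept <- sum(complete.cases(toy_sleep))

print(na_per_column)
print(rows_kept)
```

Running the same two calls on `sleep` would show which columns account for the removed rows.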

Exploring the dataset

I do not consider gender to be a lifestyle factor, but I did want to see whether sleep efficiency differed noticeably between men and women. So, I created a simple side-by-side boxplot and saw no notable difference between the sleep efficiencies of the two genders.

# Refer to columns by name inside aes(); ggplot looks them up in Sleep_Efficiency
gender <- ggplot(Sleep_Efficiency, aes(x = Gender, y = `Sleep efficiency`)) +
  labs(title = "Comparing sleep efficiency between men and women", caption = "Source: Kaggle user 'Equilibriumm'") +
  xlab("Gender") +
  ylab("Sleep Efficiency") +
  theme_dark(base_size = 12)
gender + geom_boxplot()
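
A boxplot shows the spread of each group but does not by itself test whether the difference is significant. One possible check, which was not part of my original analysis, is a Welch two-sample t-test. The sketch below uses an invented data frame `toy` with hypothetical column names; on the real data the formula would use `Sleep efficiency` and `Gender` instead.

```r
# Invented example values; not taken from the Sleep_Efficiency dataset.
toy <- data.frame(
  Gender     = rep(c("Female", "Male"), each = 5),
  efficiency = c(0.88, 0.89, 0.51, 0.90, 0.79, 0.66, 0.76, 0.54, 0.91, 0.80)
)

# Welch two-sample t-test (t.test() does not assume equal variances by default)
welch <- t.test(efficiency ~ Gender, data = toy)

# A large p-value means the data give no evidence of a difference in means
print(welch$p.value)
```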

Exploring significance and correlation using statistics

I then wanted to see if there were any lifestyle factors I could remove from my final visualization, so I turned to correlation values and p-values to see whether any variable was unrelated to the subjects' sleep efficiency. The first tool I used was a correlation heatmap. From this I found that smoking has an inverse relationship with sleep efficiency, and that caffeine consumption correlates very little with sleep efficiency. I must also note that none of these correlation values is high, so no single variable correlates strongly with sleep efficiency on its own.

library(DataExplorer)
plot_correlation(cor_sleep, title = "The correlation between each category in the cor_sleep dataset")
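
The numbers behind a heatmap like this can also be inspected directly with `cor()`. The sketch below is only an illustration on invented data: the data frame `toy`, its column names, and the 0/1 encoding of smoking status are all assumptions made for the example.

```r
# Invented example values; not taken from the cor_sleep dataset.
toy <- data.frame(
  efficiency = c(0.88, 0.66, 0.89, 0.51, 0.76, 0.90),
  alcohol    = c(0, 3, 0, 5, 3, 0),
  smoker     = c("Yes", "Yes", "No", "Yes", "No", "No")
)

# cor() needs numeric input, so encode smoking status as 1 (Yes) or 0 (No)
toy$smoker_yes <- as.integer(toy$smoker == "Yes")

# Pearson correlation matrix of the numeric columns
cor_matrix <- cor(toy[, c("efficiency", "alcohol", "smoker_yes")])
print(round(cor_matrix, 2))
```

A negative entry in the `efficiency` row means that higher values of that variable go with lower sleep efficiency.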

Below I fit three multiple regression models, each with its diagnostic plots, to see whether any one variable had more statistical significance than the others. Caffeine consumption had a high p-value (0.269), indicating that it was not statistically significant, so I removed it. After that, the remaining variables all had p-values below 0.05 (in fact below 0.001), but I noticed that the R² value went down slightly (from 0.2839 to 0.2817), when I had hoped it would increase. R² cannot increase when a predictor is removed, though, and the adjusted R² barely changed (0.2768 to 0.2764). The third model, which also drops smoking status, has a noticeably lower R² (0.2179). I eventually chose to continue with all the significant variables and without caffeine consumption.

diagnostic <- lm(cor_sleep$`Sleep efficiency` ~ cor_sleep$`Caffeine consumption` + cor_sleep$`Alcohol consumption` + cor_sleep$`Smoking status` + cor_sleep$`Exercise frequency`, data = cor_sleep)
summary(diagnostic)
## 
## Call:
## lm(formula = cor_sleep$`Sleep efficiency` ~ cor_sleep$`Caffeine consumption` + 
##     cor_sleep$`Alcohol consumption` + cor_sleep$`Smoking status` + 
##     cor_sleep$`Exercise frequency`, data = cor_sleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28144 -0.08131  0.01494  0.08395  0.27533 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       0.8014427  0.0119896  66.845  < 2e-16 ***
## cor_sleep$`Caffeine consumption`  0.0002200  0.0001985   1.108    0.269    
## cor_sleep$`Alcohol consumption`  -0.0305759  0.0035753  -8.552 2.57e-16 ***
## cor_sleep$`Smoking status`Yes    -0.0726928  0.0120704  -6.022 3.88e-09 ***
## cor_sleep$`Exercise frequency`    0.0243987  0.0039537   6.171 1.66e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1149 on 402 degrees of freedom
## Multiple R-squared:  0.2839, Adjusted R-squared:  0.2768 
## F-statistic: 39.85 on 4 and 402 DF,  p-value: < 2.2e-16
autoplot(diagnostic, 1:4, nrow=2, ncol=2)

diagnostic2 <- lm(cor_sleep$`Sleep efficiency` ~ cor_sleep$`Alcohol consumption` + cor_sleep$`Smoking status` + cor_sleep$`Exercise frequency`, data = cor_sleep)
summary(diagnostic2)
## 
## Call:
## lm(formula = cor_sleep$`Sleep efficiency` ~ cor_sleep$`Alcohol consumption` + 
##     cor_sleep$`Smoking status` + cor_sleep$`Exercise frequency`, 
##     data = cor_sleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28741 -0.08031  0.01624  0.08442  0.27169 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      0.807409   0.010715  75.351  < 2e-16 ***
## cor_sleep$`Alcohol consumption` -0.031010   0.003555  -8.723  < 2e-16 ***
## cor_sleep$`Smoking status`Yes   -0.072226   0.012066  -5.986 4.77e-09 ***
## cor_sleep$`Exercise frequency`   0.024088   0.003945   6.106 2.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1149 on 403 degrees of freedom
## Multiple R-squared:  0.2817, Adjusted R-squared:  0.2764 
## F-statistic: 52.69 on 3 and 403 DF,  p-value: < 2.2e-16
autoplot(diagnostic2, 1:4, nrow=2, ncol=2)

diagnostic3 <- lm(cor_sleep$`Sleep efficiency` ~ cor_sleep$`Alcohol consumption` + cor_sleep$`Exercise frequency`, data = cor_sleep)
summary(diagnostic3)
## 
## Call:
## lm(formula = cor_sleep$`Sleep efficiency` ~ cor_sleep$`Alcohol consumption` + 
##     cor_sleep$`Exercise frequency`, data = cor_sleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32798 -0.07798  0.02185  0.08202  0.23388 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      0.782731   0.010308  75.935  < 2e-16 ***
## cor_sleep$`Alcohol consumption` -0.032372   0.003697  -8.756  < 2e-16 ***
## cor_sleep$`Exercise frequency`   0.025083   0.004108   6.106 2.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1198 on 404 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.214 
## F-statistic: 56.27 on 2 and 404 DF,  p-value: < 2.2e-16
autoplot(diagnostic3, 1:4, nrow=2, ncol=2)
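
The comparison between the models above can be made explicit by putting R², adjusted R², and a nested-model F-test side by side. The following is a minimal sketch on simulated data: the data frame `toy`, its column names, and the coefficients used to generate it are invented for illustration and are not estimates from the real dataset.

```r
# Simulated data; all values and coefficients here are invented.
set.seed(1)
n <- 100
toy <- data.frame(
  caffeine = runif(n, 0, 200),
  alcohol  = sample(0:5, n, replace = TRUE),
  exercise = sample(0:5, n, replace = TRUE)
)
toy$efficiency <- 0.8 - 0.03 * toy$alcohol + 0.02 * toy$exercise + rnorm(n, sd = 0.1)

# Full model and a reduced model without caffeine
full    <- lm(efficiency ~ caffeine + alcohol + exercise, data = toy)
reduced <- lm(efficiency ~ alcohol + exercise, data = toy)

# Plain R-squared never increases when a predictor is dropped;
# adjusted R-squared penalises extra predictors and can move either way
r2     <- c(full = summary(full)$r.squared,     reduced = summary(reduced)$r.squared)
adj_r2 <- c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)

# F-test of whether the dropped predictor improves the fit
f_test <- anova(reduced, full)

print(r2)
print(adj_r2)
print(f_test)
```

Writing the formula with bare column names and `data =` also keeps the coefficient labels in `summary()` shorter than repeating the data frame name in every term.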

The final visual

I wanted to include all the variables that were deemed statistically significant by their p-values in relation to sleep efficiency, and I landed on a scatter plot to display the data. Sleep efficiency is on the x-axis and alcohol consumption is on the y-axis, because alcohol consumption seemed to hold the most significance in relation to sleep efficiency. The two colors indicate whether the subject is a smoker, and the size of each point indicates how often the subject exercises (the larger the circle, the more frequently the subject exercises).

My final visual was supposed to be the plotly visual, because I thought it would be useful to isolate the smokers from the non-smokers and the different frequencies of exercise. Unfortunately, plotly has issues with displaying two different legends, and as far as I know it does not allow for a caption either. Due to these constraints, I created a non-interactive scatterplot along with an interactive one that makes it a bit easier to sift through the points.

# Static scatterplot; named final_plot rather than c to avoid masking base::c()
final_plot <- ggplot(cor_sleep) +
  geom_point(alpha = 0.5, aes(x = `Sleep efficiency`, y = `Alcohol consumption`, color = `Smoking status`, size = `Exercise frequency`)) +
  ggtitle("Lifestyle Factors and Sleep Efficiency") +
  # "\n" starts the second caption line without carrying code indentation into the text
  labs(caption = "Sizes of circles are proportional to how often each subject exercises\nSource: Kaggle user 'Equilibriumm'") +
  xlab("Sleep Efficiency") +
  ylab("Alcohol Consumption") +
  theme_light(base_size = 10)
final_plot

# Same scatterplot, converted to an interactive plotly widget
final_plot <- ggplot(cor_sleep) +
  geom_point(alpha = 0.5, aes(x = `Sleep efficiency`, y = `Alcohol consumption`, color = `Smoking status`, size = `Exercise frequency`)) +
  ggtitle("Lifestyle Factors and Sleep Efficiency") +
  labs(caption = "Sizes of circles are proportional to how often each subject exercises") +
  xlab("Sleep Efficiency") +
  ylab("Alcohol Consumption") +
  theme_light(base_size = 10)
interactive_plot <- ggplotly(final_plot)
interactive_plot
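
One way to separate smokers from non-smokers without relying on plotly legends might be to facet the static plot, so that each group gets its own panel. This is only a sketch of an alternative I did not use in the project; the data frame `toy` and its column names are invented for illustration.

```r
library(ggplot2)

# Invented example values; not taken from the cor_sleep dataset.
toy <- data.frame(
  efficiency = c(0.90, 0.85, 0.70, 0.60, 0.88, 0.55),
  alcohol    = c(0, 1, 3, 5, 0, 4),
  smoker     = c("No", "No", "Yes", "Yes", "No", "Yes"),
  exercise   = c(3, 1, 0, 2, 4, 0)
)

# facet_wrap() draws one panel per smoking status,
# which leaves point size as the only legend
faceted <- ggplot(toy, aes(x = efficiency, y = alcohol, size = exercise)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ smoker) +
  labs(x = "Sleep Efficiency", y = "Alcohol Consumption", size = "Exercise frequency")

faceted
```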

Essay

I wanted to include all the variables that were deemed statistically significant by their p-values in relation to sleep efficiency, and I landed on a scatter plot to explore the relationships between the factors. I would normally put sleep efficiency on the y-axis rather than the x-axis, because I am treating it as the value that depends on alcohol consumption, but in my opinion the plot was easier to read horizontally. While exploring the visualization, I noticed that alcohol consumption was clearly associated with lower sleep efficiency: with each step up the consumption axis, the points with a sleep efficiency above 0.8 thinned out, among smokers and non-smokers alike. When separating the smokers from the non-smokers, I found that the lowest sleep efficiencies among smokers are much lower than those among non-smokers. I was not able to find a definite trend for exercise frequency in the visualization, possibly because I could not find a way to isolate the different frequencies with plotly. Overall, I do not see exercise frequency influencing sleep efficiency on its own, but when it is considered together with smoking or alcohol consumption, more frequent exercise seems to go along with better efficiency.
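
A trend that is hard to see in a scatter plot can sometimes be checked with a table of group means. The sketch below shows the idea in base R on invented data; the data frame `toy` and its column names are hypothetical, and the numbers say nothing about the real dataset.

```r
# Invented example values; not taken from the cor_sleep dataset.
toy <- data.frame(
  efficiency = c(0.90, 0.85, 0.70, 0.60, 0.88, 0.75, 0.65, 0.55),
  smoker     = c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"),
  exercise   = c(3, 3, 3, 3, 0, 0, 0, 0)
)

# Mean sleep efficiency for every combination of smoking status and exercise frequency
group_means <- aggregate(efficiency ~ smoker + exercise, data = toy, FUN = mean)
print(group_means)
```

Comparing rows with the same smoking status but different exercise frequencies would show whether exercise goes along with higher efficiency within each group.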

I wish I could have made my final visualization fully interactive in the way I had imagined. I also wish I had found a better way of plotting the data overall. I’m not sure a scatter plot was the best option, but it was better than the other options I had initially considered.