Question of the Week 12

Assignment Overview

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Import Data

I have trouble sleeping myself, so any data pertaining to sleep statistics is very interesting for me. This dataset was provided by a kaggle user, and includes the user’s sleep data across a 4 month period. Data fields include:

day of the week
date
sleep_score (as computed by fitbit)
hours_sleep (total hours slept)
rem_sleep (percentage of REM sleep)
deep_sleep (percentage of deep sleep)
heart_rate_below_resting (percent of time heartbeat below resting heart rate)
sleep_time (time of sleep)

I don’t have a fitbit, but I am interested in understanding how the sleep score is computed. Creating my own multiple regression model will help me to understand better what the most important features are in determining a sleep score.

Our data is saved in 4 separate files, so we will need to load each individually, standardize each’s format, and concatenate them into a single dataframe.

data1 <- read_csv("November sleep data - Sheet1.csv")
data2 <- read_csv("December sleep data - Sheet1.csv")
data3 <- read_csv("January sleep data - Sheet1.csv")
data4 <- read_csv("February sleep data - Sheet1.csv")

colnames(data1) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")

colnames(data2) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")

colnames(data3) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")

colnames(data4) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")

all_data <- rbind(data1, data2, data3, data4)

head(all_data)

Cleaning and Feature Engineering

summary(all_data)

##      day                date            sleep_score    hours_sleep      
##  Length:123         Length:123         Min.   :63.00   Length:123       
##  Class :character   Class :character   1st Qu.:82.00   Class1:hms       
##  Mode  :character   Mode  :character   Median :85.00   Class2:difftime  
##                                        Mean   :84.33   Mode  :numeric   
##                                        3rd Qu.:87.00                    
##                                        Max.   :91.00                    
##                                        NA's   :3                        
##   rem_sleep          deep_sleep        heart_rate_below_resting
##  Length:123         Length:123         Length:123              
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##                                                                
##   sleep_time       
##  Length:123        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

We will first want to ensure there are no missing values.

colSums(is.na(all_data))

##                      day                     date              sleep_score 
##                        3                        3                        3 
##              hours_sleep                rem_sleep               deep_sleep 
##                        3                        3                        3 
## heart_rate_below_resting               sleep_time 
##                        3                        3

all_data <- all_data %>%
  na.omit

Next, we will want to convert the hours_sleep field to minutes sleep for more granularity.

convert_to_minutes <- function(hms){
  hours <- hour(hms) * 60
  minutes <- minute(hms)
  
  return(hours+minutes)
}

all_data <- all_data %>%
  mutate(
    minutes_sleep = convert_to_minutes(hours_sleep)
  )

Creating new feature for bedtime. We will create a boolean field indicating if the individual went to bed after midnight.

after_midnight <- function(bed_string) {
  value <- stringr::str_extract(bed_string, ".*\\-")
  value <- stringr::str_detect(value, "am")
  return(value)
}

all_data <- all_data %>%
  mutate(
    after_midnight = after_midnight(all_data$sleep_time)
  )

Finally, I’d like to convert all of the percentages to doubles, ranging from 0-100

convert_percentage <- function(perc_string){
  perc_string <- str_replace(perc_string, "%","")
  return(as.double(perc_string))
}

all_data <- all_data %>%
  mutate(
    rem_sleep = convert_percentage(rem_sleep),
    deep_sleep = convert_percentage(deep_sleep),
    heart_rate_below_resting = convert_percentage(heart_rate_below_resting)
  )

head(all_data)

all_data <- all_data %>%
  select(c("rem_sleep","deep_sleep","heart_rate_below_resting","minutes_sleep","after_midnight","sleep_score"))

Visualizing Data

pairs(all_data)

Based on the above, we can see that sleep_score is positively correlated with all of our quantitative variables. It appears that minutes_sleep and heart_rate_below_resting are the most highly correlated.

cor(all_data$sleep_score, all_data$minutes_sleep)

## [1] 0.5389509

cor(all_data$sleep_score, all_data$rem_sleep)

## [1] 0.4991039

cor(all_data$sleep_score, all_data$deep_sleep)

## [1] 0.3242035

cor(all_data$sleep_score, all_data$heart_rate_below_resting)

## [1] 0.6673503

Now to check the relationship between our dichotomous variable and our response variable.

all_data %>%
  ggplot(aes(x=after_midnight, y=sleep_score)) +
  geom_boxplot()

Itt appears that going to bed earlier will typically generate a higher sleep score.

Building a model

model <- lm(sleep_score ~ minutes_sleep + rem_sleep + deep_sleep + heart_rate_below_resting + after_midnight + minutes_sleep*after_midnight + minutes_sleep^2, data=all_data)

summary(model)

## 
## Call:
## lm(formula = sleep_score ~ minutes_sleep + rem_sleep + deep_sleep + 
##     heart_rate_below_resting + after_midnight + minutes_sleep * 
##     after_midnight + minutes_sleep^2, data = all_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4728 -1.0749 -0.0906  1.3698  4.2209 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      42.882804   2.488328  17.234  < 2e-16 ***
## minutes_sleep                     0.044618   0.004702   9.489 4.65e-16 ***
## rem_sleep                         0.443398   0.046943   9.445 5.86e-16 ***
## deep_sleep                        0.216383   0.053276   4.062 9.02e-05 ***
## heart_rate_below_resting          0.115787   0.008574  13.505  < 2e-16 ***
## after_midnightTRUE               -4.527464   4.249280  -1.065    0.289    
## minutes_sleep:after_midnightTRUE  0.009560   0.009818   0.974    0.332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.889 on 113 degrees of freedom
## Multiple R-squared:  0.8473, Adjusted R-squared:  0.8392 
## F-statistic: 104.5 on 6 and 113 DF,  p-value: < 2.2e-16

Based on the above summary, we can say that this linear model was appropriate. With an Ajdusted R-squared of 83.92%, we know that the features chosen account for a large proportion of the total variability of the response variable. Additionally, we can see that all of our quantitative variables were highly significant, with p-values close to zero. Our dichotomous variable was not as significant. Similarly, our interaction term which consisted of an interaction between our dichotomous variable and minutes_sleep was not significant.

In future iterations, I think it would be interesting to create a feature that measures the consistency between bedtimes over time, rather than simply flagging if an individual went to bed after midnight.

Residual Analysis

plot(model)

From the above plots, we can see that our residuals are normally distributed. There are a few outliers that could be considered for removal in future reviews.