Assignment Overview
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Import Data
I have trouble sleeping myself, so any data pertaining to sleep statistics is very interesting for me. This dataset was provided by a kaggle user, and includes the user’s sleep data across a 4 month period. Data fields include:
- day of the week
- date
- sleep_score (as computed by fitbit)
- hours_sleep (total hours slept)
- rem_sleep (percentage of REM sleep)
- deep_sleep (percentage of deep sleep)
- heart_rate_below_resting (percent of time heartbeat below resting heart rate)
- sleep_time (time of sleep)
I don’t have a fitbit, but I am interested in understanding how the sleep score is computed. Creating my own multiple regression model will help me to understand better what the most important features are in determining a sleep score.
Our data is saved in 4 separate files, so we will need to load each individually, standardize each’s format, and concatenate them into a single dataframe.
data1 <- read_csv("November sleep data - Sheet1.csv")
data2 <- read_csv("December sleep data - Sheet1.csv")
data3 <- read_csv("January sleep data - Sheet1.csv")
data4 <- read_csv("February sleep data - Sheet1.csv")colnames(data1) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")
colnames(data2) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")
colnames(data3) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")
colnames(data4) <- c("day","date","sleep_score","hours_sleep","rem_sleep","deep_sleep","heart_rate_below_resting","sleep_time")all_data <- rbind(data1, data2, data3, data4)head(all_data)Cleaning and Feature Engineering
summary(all_data)## day date sleep_score hours_sleep
## Length:123 Length:123 Min. :63.00 Length:123
## Class :character Class :character 1st Qu.:82.00 Class1:hms
## Mode :character Mode :character Median :85.00 Class2:difftime
## Mean :84.33 Mode :numeric
## 3rd Qu.:87.00
## Max. :91.00
## NA's :3
## rem_sleep deep_sleep heart_rate_below_resting
## Length:123 Length:123 Length:123
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## sleep_time
## Length:123
## Class :character
## Mode :character
##
##
##
##
We will first want to ensure there are no missing values.
colSums(is.na(all_data))## day date sleep_score
## 3 3 3
## hours_sleep rem_sleep deep_sleep
## 3 3 3
## heart_rate_below_resting sleep_time
## 3 3
all_data <- all_data %>%
na.omitNext, we will want to convert the hours_sleep field to minutes sleep for more granularity.
convert_to_minutes <- function(hms){
hours <- hour(hms) * 60
minutes <- minute(hms)
return(hours+minutes)
}all_data <- all_data %>%
mutate(
minutes_sleep = convert_to_minutes(hours_sleep)
)Creating new feature for bedtime. We will create a boolean field indicating if the individual went to bed after midnight.
after_midnight <- function(bed_string) {
value <- stringr::str_extract(bed_string, ".*\\-")
value <- stringr::str_detect(value, "am")
return(value)
}all_data <- all_data %>%
mutate(
after_midnight = after_midnight(all_data$sleep_time)
)Finally, I’d like to convert all of the percentages to doubles, ranging from 0-100
convert_percentage <- function(perc_string){
perc_string <- str_replace(perc_string, "%","")
return(as.double(perc_string))
}all_data <- all_data %>%
mutate(
rem_sleep = convert_percentage(rem_sleep),
deep_sleep = convert_percentage(deep_sleep),
heart_rate_below_resting = convert_percentage(heart_rate_below_resting)
)head(all_data)all_data <- all_data %>%
select(c("rem_sleep","deep_sleep","heart_rate_below_resting","minutes_sleep","after_midnight","sleep_score"))Visualizing Data
pairs(all_data)Based on the above, we can see that sleep_score is positively correlated with all of our quantitative variables. It appears that minutes_sleep and heart_rate_below_resting are the most highly correlated.
cor(all_data$sleep_score, all_data$minutes_sleep)## [1] 0.5389509
cor(all_data$sleep_score, all_data$rem_sleep)## [1] 0.4991039
cor(all_data$sleep_score, all_data$deep_sleep)## [1] 0.3242035
cor(all_data$sleep_score, all_data$heart_rate_below_resting)## [1] 0.6673503
Now to check the relationship between our dichotomous variable and our response variable.
all_data %>%
ggplot(aes(x=after_midnight, y=sleep_score)) +
geom_boxplot()Itt appears that going to bed earlier will typically generate a higher sleep score.
Building a model
model <- lm(sleep_score ~ minutes_sleep + rem_sleep + deep_sleep + heart_rate_below_resting + after_midnight + minutes_sleep*after_midnight + minutes_sleep^2, data=all_data)summary(model)##
## Call:
## lm(formula = sleep_score ~ minutes_sleep + rem_sleep + deep_sleep +
## heart_rate_below_resting + after_midnight + minutes_sleep *
## after_midnight + minutes_sleep^2, data = all_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4728 -1.0749 -0.0906 1.3698 4.2209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.882804 2.488328 17.234 < 2e-16 ***
## minutes_sleep 0.044618 0.004702 9.489 4.65e-16 ***
## rem_sleep 0.443398 0.046943 9.445 5.86e-16 ***
## deep_sleep 0.216383 0.053276 4.062 9.02e-05 ***
## heart_rate_below_resting 0.115787 0.008574 13.505 < 2e-16 ***
## after_midnightTRUE -4.527464 4.249280 -1.065 0.289
## minutes_sleep:after_midnightTRUE 0.009560 0.009818 0.974 0.332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.889 on 113 degrees of freedom
## Multiple R-squared: 0.8473, Adjusted R-squared: 0.8392
## F-statistic: 104.5 on 6 and 113 DF, p-value: < 2.2e-16
Based on the above summary, we can say that this linear model was appropriate. With an Ajdusted R-squared of 83.92%, we know that the features chosen account for a large proportion of the total variability of the response variable. Additionally, we can see that all of our quantitative variables were highly significant, with p-values close to zero. Our dichotomous variable was not as significant. Similarly, our interaction term which consisted of an interaction between our dichotomous variable and minutes_sleep was not significant.
In future iterations, I think it would be interesting to create a feature that measures the consistency between bedtimes over time, rather than simply flagging if an individual went to bed after midnight.
Residual Analysis
plot(model)From the above plots, we can see that our residuals are normally distributed. There are a few outliers that could be considered for removal in future reviews.