Exploring Sleep Quality with Linear Regression

04/13/2025

Introduction to Linear Regression

Simple linear regression is a statistical technique that models the relationship between two variables by fitting a straight line to the data.

\[ y = \beta_1 x + \beta_0 + \varepsilon \]

\(x\): independent (predictor) variable
\(y\): dependent (response) variable
\(\beta_1\): slope - the expected change in \(y\) for each unit increase in \(x\)
\(\beta_0\): y-intercept - the predicted value of \(y\) when \(x = 0\)
\(\varepsilon\): error term - accounts for unexplained variability in the dependent variable
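
As a minimal sketch of how \(\beta_1\) and \(\beta_0\) are estimated, the code below simulates data from a known line and compares the closed-form least-squares estimates with those returned by R's lm(). The true slope (2), intercept (1), noise level, and all variable names are made up for illustration and have nothing to do with the dataset used later.

```r
# Simulated data: the true slope, intercept, and noise level are assumptions
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 2 * x + 1 + rnorm(100, sd = 1)   # y = beta_1 * x + beta_0 + error

# Closed-form ordinary least squares estimates
slope_hat <- cov(x, y) / var(x)
intercept_hat <- mean(y) - slope_hat * mean(x)

# The same estimates obtained from lm()
fit <- lm(y ~ x)
coef(fit)
```

The two approaches should agree up to floating-point error, and both estimates should land near the true values used in the simulation.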

Introducing the Data Set

This project uses the Sleep Health and Lifestyle dataset, available on Kaggle.

The dataset contains 374 entries and 13 variables, capturing information such as age, sleep duration, sleep quality, stress level, heart rate, and daily steps.

Using these variables, we will explore relationships between various lifestyle factors and sleep health with linear regression.
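
The model code in this document refers to columns named Quality.of.Sleep and Stress.Level. Assuming the CSV headers contain spaces (for example "Quality of Sleep"), read.csv() produces the dotted names by default. The sketch below shows this on two made-up rows supplied inline; the values are placeholders, not records from the dataset.

```r
# Two toy rows with space-separated header names (values are placeholders)
csv_text <- "Quality of Sleep,Stress Level\n6,6\n8,4\n"

# check.names = TRUE (the default) turns headers into syntactically valid R names
toy <- read.csv(text = csv_text)
names(toy)
```

With the real file, the same call would take the path to the downloaded CSV in place of the text argument.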

Sleep Quality vs. Stress Level

Interpreting Sleep Quality vs. Stress Level

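# Fit a simple linear regression of sleep quality on stress level
model <- lm(Quality.of.Sleep ~ Stress.Level, data = sleep_data)

# Extract the fitted coefficients and the coefficient of determination
intercept <- coef(model)[1]
slope <- coef(model)[2]
r_squared <- summary(model)$r.squared

# Print the results as markdown-formatted text
cat(paste0("**Intercept ($\\beta_0$)**: ", round(intercept, 3), "\n\n",
           "**Slope ($\\beta_1$)**: ", round(slope, 3), "\n\n",
           "**R-squared ($R^2$)**: ", round(r_squared, 4), "\n\n"))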

Intercept (\(\beta_0\)): 10.577

Slope (\(\beta_1\)): -0.606

R-squared (\(R^2\)): 0.8078

Linear regression: \(\hat{y} = -0.606x + 10.577\)
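
To make the fitted line concrete, the following sketch plugs a few stress levels into the equation using the coefficients reported above. The three stress values are chosen only for illustration.

```r
# Coefficients as reported above
b0 <- 10.577   # intercept
b1 <- -0.606   # slope

# Example stress levels chosen for illustration
stress <- c(3, 5, 8)
predicted_quality <- b0 + b1 * stress
round(predicted_quality, 3)   # 8.759 7.547 5.729
```

Each one-point increase in stress level lowers the predicted sleep quality by 0.606 points.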

The \(R^2\) value indicates that about 81% of the variation in sleep quality in this dataset is explained by stress level.
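
For a simple linear regression, \(R^2\) equals one minus the ratio of residual to total variation, and it also equals the squared correlation between the two variables (so an \(R^2\) of 0.8078 with a negative slope corresponds to a correlation of roughly -0.899). The sketch below checks both identities on simulated data; the data and coefficients in it are assumptions for illustration only.

```r
# Simulated integer-valued predictor with a noisy linear response (made up)
set.seed(1)
x <- sample(1:10, 200, replace = TRUE)
y <- 10 - 0.6 * x + rnorm(200, sd = 0.8)
fit <- lm(y ~ x)

# R-squared from its definition: 1 - SS_residual / SS_total
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# R-squared as the squared Pearson correlation
r2_cor <- cor(x, y)^2

c(manual = r2_manual, squared_cor = r2_cor, lm = summary(fit)$r.squared)
```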

Interpreting Sleep Quality vs. Stress Level

There is a negative correlation between sleep quality and stress level in this dataset, as indicated by the negative slope of the regression line. However, this graph has several limitations. Both variables are measured on subjective, integer-based scales and are likely self-reported, which introduces bias and reduces precision. Because each variable can take only a handful of integer values, many data points overlap: individuals whose actual stress and sleep quality likely differ end up reporting identical values.
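
One way to see how much overlap an integer scale produces is to count the observations at each combination of values. The sketch below does this on a small made-up data frame; the column names and values are hypothetical and are not records from the dataset.

```r
# Toy integer-scale data (made up; not the dataset's records)
toy <- data.frame(
  stress  = c(3, 3, 3, 5, 5, 8, 8, 8, 8),
  quality = c(9, 9, 8, 7, 7, 5, 5, 5, 6)
)

# Count observations at each (stress, quality) combination
counts <- as.data.frame(table(stress = toy$stress, quality = toy$quality))
counts <- counts[counts$Freq > 0, ]   # keep only combinations that occur
counts
```

In a scatter plot, wrapping both variables in jitter() is a common way to make such stacked points visible.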

Despite these issues, the trend matches the general expectation: higher stress is associated with poorer sleep. Interestingly, this may be a bidirectional relationship in which each variable plausibly influences the other. High stress may lower sleep quality, and poor sleep may in turn raise stress. Simple linear regression cannot distinguish between these directions, but the possible feedback loop is worth noting. Regardless, the relatively small residuals suggest that the model fits the data well.
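
A common way to quantify how small the residuals are is the residual standard error, the typical distance between observed and fitted values. The sketch below computes it by hand and with sigma() on simulated data; the data-generating values are assumptions and not estimates from the dataset.

```r
# Simulated data on an integer stress-like scale (made up)
set.seed(7)
x <- sample(3:8, 150, replace = TRUE)
y <- 10.5 - 0.6 * x + rnorm(150, sd = 0.5)
fit <- lm(y ~ x)

# Residual standard error: sqrt(SS_residual / residual degrees of freedom)
res <- residuals(fit)
rse_manual <- sqrt(sum(res^2) / df.residual(fit))

c(manual = rse_manual, sigma = sigma(fit))
summary(res)   # five-number summary of the residuals
```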

Sleep Duration vs. Age

Interpreting Sleep Duration vs. Age

Surprisingly, sleep duration appears to increase slightly with age in this dataset, even though it is generally understood that adults tend to sleep less as they age. However, the trend estimated by the linear regression model is likely unreliable: the data points are widely scattered, indicating large residuals and a poor fit. This may reflect inaccurate self-reporting or a lack of representation across age groups.
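
When points are widely scattered, the estimated slope carries a lot of uncertainty, which a confidence interval makes visible. The sketch below fits a weak trend on simulated data and reports the 95% interval for the slope along with \(R^2\). The ages, sleep hours, and coefficients are hypothetical and are not taken from the dataset.

```r
# Simulated weak relationship with a lot of scatter (all values made up)
set.seed(3)
age <- sample(25:60, 120, replace = TRUE)
sleep_hours <- 7 + 0.01 * age + rnorm(120, sd = 0.8)
fit <- lm(sleep_hours ~ age)

# 95% confidence interval for the slope on age
ci <- confint(fit, "age", level = 0.95)
ci

# Share of variation explained by the model
summary(fit)$r.squared
```

An interval that is wide relative to the estimate, together with a small \(R^2\), is a sign that the fitted trend should not be over-interpreted.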

Ultimately, linear regression does not describe this relationship well in this dataset, nor does the dataset appear to reflect the generally understood relationship between sleep duration and age established by more rigorous studies.

Heart Rate vs. Daily Steps

Interpreting Heart Rate vs. Daily Steps

The data points in the heart rate vs. daily steps plot show a clear downward trend: individuals who take more steps per day tend to have lower resting heart rates. However, the linear regression line has only a slightly negative slope, which doesn’t align with the visible pattern. This could be due to outliers in the dataset that are pulling the regression line away from the main cluster of points and flattening its slope.
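
Whether outliers are actually flattening a slope can be checked with an influence measure such as Cook's distance. The sketch below plants a single extreme point in simulated data and shows how it changes the fitted slope. All values and variable names are made up for illustration, and nothing here confirms that outliers are the cause in the real dataset.

```r
# Simulated downward trend (made up; not the dataset's records)
set.seed(11)
steps <- seq(3000, 10000, length.out = 30)
heart_rate <- 85 - 0.002 * steps + rnorm(30, sd = 1)

# Plant one extreme point: many steps but a very high heart rate
steps_out <- c(steps, 10000)
hr_out <- c(heart_rate, 95)

fit_clean <- lm(heart_rate ~ steps)
fit_out <- lm(hr_out ~ steps_out)

# The planted point pulls the slope toward zero
c(clean = unname(coef(fit_clean)[2]), with_outlier = unname(coef(fit_out)[2]))

# Cook's distance flags the observation with the most influence on the fit
which.max(cooks.distance(fit_out))
```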

This is an example of how simple linear regression can fail to capture a meaningful trend when the relationship is nonlinear or when outliers skew the model. It is likely that the relationship between daily steps and heart rate is not strictly linear, in which case a logarithmic or polynomial regression might better capture the trend in the data.
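
As a rough sketch of how such alternatives could be compared, the code below fits linear, logarithmic, and quadratic models to simulated data whose true relationship is logarithmic. The data-generating formula is an assumption made for illustration; it is not a claim about the real relationship between steps and heart rate.

```r
# Simulated data with a logarithmic relationship (made up)
set.seed(5)
steps <- runif(200, min = 1000, max = 10000)
heart_rate <- 110 - 4.5 * log(steps) + rnorm(200, sd = 0.5)

# Three candidate models, all fitted with lm()
fit_linear <- lm(heart_rate ~ steps)
fit_log <- lm(heart_rate ~ log(steps))
fit_poly <- lm(heart_rate ~ poly(steps, 2))

# Compare the share of variation each model explains
r2 <- c(
  linear = summary(fit_linear)$r.squared,
  log = summary(fit_log)$r.squared,
  quadratic = summary(fit_poly)$r.squared
)
round(r2, 3)
```

Because the quadratic model contains the linear one as a special case, its \(R^2\) can never be lower, so a comparison like this is best paired with residual plots or an adjusted measure rather than \(R^2\) alone.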

This graph highlights the importance of inspecting data visually and questioning whether a linear model is appropriate before drawing conclusions.

Conclusion

The linear regression model is a simple yet highly useful statistical technique. It allows us to generalize the relationship between variables and make predictions about one variable based on the value of another. However, its effectiveness depends heavily on the quality of the data and the true nature of the relationship.
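
Predictions from a fitted model come with uncertainty, which predict() can report as an interval. The sketch below is a minimal illustration on simulated data; the coefficients, sample size, and new x values are all made up.

```r
# Simulated data from a known line (values are assumptions)
set.seed(9)
x <- runif(80, min = 0, max = 10)
y <- 3 + 1.5 * x + rnorm(80, sd = 1)
fit <- lm(y ~ x)

# Predict the response at new x values, with 95% prediction intervals
new_points <- data.frame(x = c(2, 5, 9))
pred <- predict(fit, newdata = new_points, interval = "prediction", level = 0.95)
pred
```

The fit column holds the point predictions, and lwr and upr bound the range in which a new observation would be expected to fall.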

The examples above highlight the challenges of working with unfamiliar or randomly sourced datasets. Sometimes the data contradicts well-established findings, and without knowing how the data was collected, it can be difficult to interpret or trust the results. This is why understanding data sources and collection methods is essential for building sound models and conveying meaningful insights.