In a linear model of a continuous and categorical variable, the coefficients represent:
\[Y = β0 + β1X1 + β2X2 +ϵ\]
Y-Intercept (β₀): The predicted value of Y for the reference group of the categorical variable when the continuous variable equals 0.
Continuous variable coefficient (β₁): How much Y changes for each 1-unit increase in the continuous variable, keeping the categorical variable fixed.
Categorical variable coefficient (β₂): How much the mean of Y differs between the comparison group and the reference group, holding the continuous variable fixed.
Example: If we model sleep_duration ~ age + gender, the intercept is the predicted sleep duration for the reference gender at age 0, the age slope shows how sleep duration changes per year of age, and genderMale shows the difference between males and females.
When the model includes an interaction term (e.g., age * gender), the coefficients represent:
Y-Intercept (β₀): Predicted value of Y for the reference group when the continuous variable is 0.
Main effect (continuous): The slope of the continuous variable for the reference group.
Main effect (categorical): The difference in intercepts between groups when the continuous variable is 0.
Interaction term: Shows how the slope of the continuous variable changes between groups.
So, in sleep_duration ~ age * gender, the interaction term tests whether the relationship between age and sleep duration differs between males and females.
Independence: Each observation must be independent. One person’s data shouldn’t affect another’s.
Normality: The residuals (errors) should be normally distributed around zero.
Linearity: The relationship between the predictor(s) (X’s) and outcome (Y) should be roughly a straight line.
Homoscedasticity: The residuals should have equal variance across all fitted values. The “spread” of errors should stay about the same.
No major outliers or influential points (optional but important): There shouldn’t be any extreme data points that strongly distort the fit.
Remember to always write a discussion after every graph of table produced to explain or interpret a finding and set up your next steps.
We will explore a new data set that evaluated sleep in 400 individuals. Dataset Columns:
Person ID: An identifier for each individual.
Gender: The gender of the person (Male/Female).
Age: The age of the person in years.
Occupation: The occupation or profession of the person.
Sleep Duration (hours): The number of hours the person sleeps per day.
Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.
Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.
Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
Heart Rate (bpm): The resting heart rate of the person in beats per minute. Daily Steps: The number of steps the person takes per day.
Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).
Load in the data and check for NAs or abnormal 0 values in the sleep data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggpubr)
library(GGally)
library(broom)
library(car) # vif()
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(cowplot) # plot_grid similar to lesson
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:ggpubr':
##
## get_legend
##
## The following object is masked from 'package:lubridate':
##
## stamp
sleep <- read_csv("Sleep_health_and_lifestyle_dataset.csv")
## Rows: 374 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
## dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sleep)
## Rows: 374
## Columns: 13
## $ `Person ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Age <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ Occupation <chr> "Software Engineer", "Doctor", "Doctor", "Sa…
## $ `Sleep Duration` <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep` <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level` <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category` <chr> "Overweight", "Normal", "Normal", "Obese", "…
## $ `Blood Pressure` <chr> "126/83", "125/80", "125/80", "140/90", "140…
## $ `Heart Rate` <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps` <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder` <chr> "None", "None", "None", "Sleep Apnea", "Slee…
summary(sleep)
## Person ID Gender Age Occupation
## Min. : 1.00 Length:374 Min. :27.00 Length:374
## 1st Qu.: 94.25 Class :character 1st Qu.:35.25 Class :character
## Median :187.50 Mode :character Median :43.00 Mode :character
## Mean :187.50 Mean :42.18
## 3rd Qu.:280.75 3rd Qu.:50.00
## Max. :374.00 Max. :59.00
## Sleep Duration Quality of Sleep Physical Activity Level Stress Level
## Min. :5.800 Min. :4.000 Min. :30.00 Min. :3.000
## 1st Qu.:6.400 1st Qu.:6.000 1st Qu.:45.00 1st Qu.:4.000
## Median :7.200 Median :7.000 Median :60.00 Median :5.000
## Mean :7.132 Mean :7.313 Mean :59.17 Mean :5.385
## 3rd Qu.:7.800 3rd Qu.:8.000 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :8.500 Max. :9.000 Max. :90.00 Max. :8.000
## BMI Category Blood Pressure Heart Rate Daily Steps
## Length:374 Length:374 Min. :65.00 Min. : 3000
## Class :character Class :character 1st Qu.:68.00 1st Qu.: 5600
## Mode :character Mode :character Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
## Sleep Disorder
## Length:374
## Class :character
## Mode :character
##
##
##
head(sleep)
## # A tibble: 6 × 13
## `Person ID` Gender Age Occupation `Sleep Duration` `Quality of Sleep`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 Male 27 Software Engineer 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Representa… 5.9 4
## 5 5 Male 28 Sales Representa… 5.9 4
## 6 6 Male 28 Software Engineer 5.9 4
## # ℹ 7 more variables: `Physical Activity Level` <dbl>, `Stress Level` <dbl>,
## # `BMI Category` <chr>, `Blood Pressure` <chr>, `Heart Rate` <dbl>,
## # `Daily Steps` <dbl>, `Sleep Disorder` <chr>
#Test to check if the data is clean and doesn not contain odd values, NAs, or rare categories
#apply(X = sleep, MARGIN = 2, FUN = function(x) sum(x==0))
#apply(sleep, 2, function(x) sum(is.na(x))) # same as x==0
No NAs or 0s are found in the sleep data so I commented the apply functions out. The sleep data is now ready to be viewed and used.
In this step, I will make sure all column names in the sleep dataset follow R’s naming rules. Some names might have spaces or special characters that could cause errors later. To fix this, I’ll use the make.names() function, which automatically replaces invalid characters with periods and ensures that names don’t start with numbers. This will make the dataset easier to work with in the rest of the analysis.
colnames(sleep) <- make.names(colnames(sleep))
names(sleep)
## [1] "Person.ID" "Gender"
## [3] "Age" "Occupation"
## [5] "Sleep.Duration" "Quality.of.Sleep"
## [7] "Physical.Activity.Level" "Stress.Level"
## [9] "BMI.Category" "Blood.Pressure"
## [11] "Heart.Rate" "Daily.Steps"
## [13] "Sleep.Disorder"
After running the code, all column names in the sleep dataset became R-friendly. Names that had spaces or special characters were replaced with periods—for example, “Sleep Duration” became “Sleep.Duration.” The dataset now has consistent, easy-to-use column names that will prevent syntax errors in later steps.
In this step, I will check whether the dataset is tidy, meaning each variable has its own column and each observation has its own row. The Blood.Pressure column currently contains two values, systolic and diastolic pressure, separated by a slash (/). To make the data tidy, I will use the separate() function from the tidyr package to split this single column into two new columns: Systolic and Diastolic. I will also use mutate() and str_trim() to clean up extra spaces and convert the new columns to numeric values. This will make the dataset easier to analyze and compatible with future statistical models.
#Make the Blood.Pressure column Tidy by splitting Systolic and Diastolic Blood Pressures
#View Blood.Pressure Column
sleep %>% select(Blood.Pressure)
## # A tibble: 374 × 1
## Blood.Pressure
## <chr>
## 1 126/83
## 2 125/80
## 3 125/80
## 4 140/90
## 5 140/90
## 6 140/90
## 7 140/90
## 8 120/80
## 9 120/80
## 10 120/80
## # ℹ 364 more rows
#Split Blood.Pressure into Systolic and Diastolic
sleep1 <- sleep %>%
mutate(Blood.Pressure = str_trim(Blood.Pressure)) %>%
separate(Blood.Pressure, into = c("Systolic", "Diastolic"), sep = "/") %>%
mutate(across(c(Systolic, Diastolic), ~ as.numeric(.)))
#Observe sleep1
glimpse(sleep1)
## Rows: 374
## Columns: 14
## $ Person.ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male"…
## $ Age <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29…
## $ Occupation <chr> "Software Engineer", "Doctor", "Doctor", "Sale…
## $ Sleep.Duration <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8, 7…
## $ Quality.of.Sleep <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6…
## $ Physical.Activity.Level <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, 75…
## $ Stress.Level <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8, 8…
## $ BMI.Category <chr> "Overweight", "Normal", "Normal", "Obese", "Ob…
## $ Systolic <dbl> 126, 125, 125, 140, 140, 140, 140, 120, 120, 1…
## $ Diastolic <dbl> 83, 80, 80, 90, 90, 90, 90, 80, 80, 80, 80, 80…
## $ Heart.Rate <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, 70…
## $ Daily.Steps <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, 80…
## $ Sleep.Disorder <chr> "None", "None", "None", "Sleep Apnea", "Sleep …
#summary(sleep1$Systolic)
#summary(sleep1$Diastolic)
After running the code, the Blood.Pressure column was successfully
split into two new numeric columns: Systolic and Diastolic. Each now
contains a single, clearly defined value for each observation, which
makes the dataset tidy. The glimpse() output confirmed that both new
variables are of type
In this step, I will make sure that each variable in the dataset has the correct data type. Some variables, such as Gender, Occupation, BMI.Category, and Sleep.Disorder, should be stored as factors since they represent categories. Others, like Age, Sleep.Duration, and Heart.Rate, should be numeric to allow for statistical calculations. Using the mutate() function, I will convert each column to its appropriate type. This will help ensure that future analyses, summaries, and visualizations work properly and that R correctly interprets the nature of each variable.
#Mutate all remaining variables to ensure they are the correct data types
sleep2 <- sleep1 %>%
mutate(
Gender = as.factor(Gender),
Occupation = as.factor(Occupation),
BMI.Category = as.factor(BMI.Category),
Sleep.Disorder = as.factor(Sleep.Disorder),
Age = as.numeric(Age),
Sleep.Duration = as.numeric(Sleep.Duration),
Quality.of.Sleep = as.numeric(Quality.of.Sleep),
Physical.Activity.Level = as.numeric(Physical.Activity.Level),
Stress.Level = as.numeric(Stress.Level),
Heart.Rate = as.numeric(Heart.Rate),
Daily.Steps = as.numeric(Daily.Steps)
)
#Confirm
str(sleep2)
## tibble [374 × 14] (S3: tbl_df/tbl/data.frame)
## $ Person.ID : num [1:374] 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : num [1:374] 27 28 28 28 28 28 29 29 29 29 ...
## $ Occupation : Factor w/ 11 levels "Accountant","Doctor",..: 10 2 2 7 7 10 11 2 2 2 ...
## $ Sleep.Duration : num [1:374] 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
## $ Quality.of.Sleep : num [1:374] 6 6 6 4 4 4 6 7 7 7 ...
## $ Physical.Activity.Level: num [1:374] 42 60 60 30 30 30 40 75 75 75 ...
## $ Stress.Level : num [1:374] 6 8 8 8 8 8 7 6 6 6 ...
## $ BMI.Category : Factor w/ 4 levels "Normal","Normal Weight",..: 4 1 1 3 3 3 3 1 1 1 ...
## $ Systolic : num [1:374] 126 125 125 140 140 140 140 120 120 120 ...
## $ Diastolic : num [1:374] 83 80 80 90 90 90 90 80 80 80 ...
## $ Heart.Rate : num [1:374] 77 75 75 85 85 85 82 70 70 70 ...
## $ Daily.Steps : num [1:374] 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
## $ Sleep.Disorder : Factor w/ 3 levels "Insomnia","None",..: 2 2 2 3 3 1 1 2 2 2 ...
After running the code, the str(sleep2) output confirmed that all variables were successfully converted to their correct data types. The categorical variables now appear as factors with a defined number of levels, while all measurement variables are numeric. For example, Gender became a factor with two levels (“Female” and “Male”), and Sleep.Duration, Age, and Daily.Steps are now numeric. This correction ensures that R can properly handle categorical and continuous variables during statistical modeling and visualization, improving both accuracy and efficiency in the analysis.
In this step, I will create a custom function called checknorm to assess the normality of continuous variables in the dataset. The normality assumption is important for many statistical methods, including linear regression, so this function will help identify whether each variable follows a roughly normal distribution. The function will perform a Shapiro-Wilk test to test normality statistically and generate two diagnostic plots for each variable: a density plot to visualize the distribution and a QQ plot to compare the observed data against a theoretical normal distribution. After defining the function, I will apply it to several numeric variables, such as Age, Sleep.Duration, and Daily.Steps, and then combine all plots into one grid for easier comparison.
#1 test normalcy of all the variables
data<-sleep2
variable<-c(
"Age", "Sleep.Duration", "Quality.of.Sleep",
"Physical.Activity.Level", "Stress.Level",
"Daily.Steps", "Heart.Rate", "Systolic", "Diastolic"
)
checknorm<- function(data,variable){
#A density plot of the value distributions
p1<-ggplot(data, aes(x=.data[[variable]])) +
geom_density() +
ggtitle(paste("Density of ", variable))
#Shapiro test object can have values extracted
sp<-shapiro.test(data[[variable]])
#A qqplot as a ggplot object using stat_qq() stat_qq_line()
p2<-ggplot(data, aes(sample=.data[[variable]]))+
stat_qq() +
stat_qq_line() +
labs(x ="theoretical") +
ggtitle(paste(variable , " Shapiro pval: ", round(sp$p.value, 3)))
plot_grid(p1,p2)
}
#Create list of plots
plot_list <- lapply(variable, function(v) checknorm(data, v))
#Combine everything into one big grid
all_plots <- cowplot::plot_grid(plotlist = plot_list, ncol = 2)
#Display
all_plots
After running the code, the Shapiro-Wilk tests for all variables returned p-values of 0, indicating that none of the variables followed a normal distribution. The density plots supported this finding—most variables showed irregular, multimodal, or skewed distributions rather than the smooth, bell-shaped curve expected under normality. Among the variables, Age and Daily.Steps appeared closer to normal compared to Physical.Activity.Level and Sleep.Duration, which displayed more pronounced deviations. The QQ plots reinforced these observations, with data points for all variables straying noticeably from the diagonal reference line. Overall, both the statistical tests and visual diagnostics suggest that the continuous variables in this dataset are not normally distributed. This means that linear models based on these variables may not fully meet the normality assumption; however, they can still offer valuable insight and serve as simple predictive tools, especially when interpreted with caution.
In this analysis, I aim to investigate the relationship between age and sleep duration by building a linear regression model. The goal is to quantify how sleep duration changes with age and evaluate whether age is a significant predictor of sleep duration. The code first visualizes the data to examine the potential linearity of the relationship and then fits a linear model using the lm() function. Additionally, I will assess the assumptions of linear regression, including homoscedasticity of residuals, and evaluate the quality of the model using standard metrics such as R-squared, residual standard error, and the significance of the regression coefficients.
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Age,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#3 make the linear model
m1 <- lm(Sleep.Duration ~ Age, data = sleep2)
# present the model calculations and statistics
summary(m1)
##
## Call:
## lm(formula = Sleep.Duration ~ Age, data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4160 -0.6867 0.1686 0.6246 1.1532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.798086 0.192278 30.155 < 2e-16 ***
## Age 0.031623 0.004465 7.083 7.12e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7479 on 372 degrees of freedom
## Multiple R-squared: 0.1188, Adjusted R-squared: 0.1165
## F-statistic: 50.16 on 1 and 372 DF, p-value: 7.117e-12
#what the model object contains
names(m1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
#Coefficient contains the intercept and slopes of the model
m1$coefficients
## (Intercept) Age
## 5.79808624 0.03162298
#tidy(m1)
#glance(m1)
#4 test the homoscedasticity
#extract values for testing homoscedasticity
sleepH1<-data.frame(resid = m1$residuals, fitVal = m1$fitted.values)
ggplot(sleepH1, aes(x =fitVal, y= resid)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(method = "loess") +
geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#graphing linear model, m1
ggplot(sleep2, aes(x = Age, y = Sleep.Duration)) +
geom_point() +
geom_abline(intercept = m1$coefficients[1] ,
slope = m1$coefficients[2],
color = "purple") +
annotate("text", x = 43, y = 7, label = "Sleep Duration = 0.032(Age) + 5.80")+
ggtitle("Sleep Duration by Age")
#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
summary(m1)
##
## Call:
## lm(formula = Sleep.Duration ~ Age, data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4160 -0.6867 0.1686 0.6246 1.1532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.798086 0.192278 30.155 < 2e-16 ***
## Age 0.031623 0.004465 7.083 7.12e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7479 on 372 degrees of freedom
## Multiple R-squared: 0.1188, Adjusted R-squared: 0.1165
## F-statistic: 50.16 on 1 and 372 DF, p-value: 7.117e-12
sigma(m1)
## [1] 0.7478937
# calculate the percent error
sigma(m1)/mean(sleep2$Sleep.Duration)
## [1] 0.1048633
The linear regression results indicate that age is a statistically significant predictor of sleep duration. Visualizing the relationship with both a linear and loess fit shows considerable overlap between the two lines, suggesting that while a linear model may not fully capture the nuances of the data, it provides a useful approximation of the overall trend. The estimated model is:
\[Sleep Duration = 5.80 + 0.032 × Age\]
This implies that, on average, sleep duration increases slightly by 0.032 hours for each additional year of age. The intercept of 5.80 hours represents the predicted sleep duration when age is zero. The residuals are roughly centered around zero, though there is some deviation, indicating that the assumption of homoscedasticity is not perfectly met. The residual standard error (sigma) is 0.748 hours, corresponding to a percentage error of approximately 10.5% relative to the mean sleep duration, suggesting moderate variability around the predicted values. However, the model explains only about 12% of the variance in sleep duration (Adjusted R^2 = 0.1165), meaning that while the relationship is statistically significant (p < 0.001), age alone is a relatively weak predictor. Overall, the linear model identifies a small but meaningful trend, and incorporating additional predictors may improve model accuracy.
In this step, I aim to build a multivariate linear model to explore how sleep duration is influenced by both age and gender. By including gender as an additional predictor, I want to determine whether there are differences in sleep duration between males and females while accounting for age. The code first visualizes the relationship between age and sleep duration, then fits a linear model using lm(). Finally, I evaluate the model’s quality by examining regression coefficients, residuals, R-squared values, and the residual standard error to assess the strength and reliability of the model.
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Age,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#3 make the linear model
m2 <- lm(Sleep.Duration ~ Age+Gender, data = sleep2)
# present the model calculations and statistics
summary(m2)
##
## Call:
## lm(formula = Sleep.Duration ~ Age + Gender, data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3690 -0.7258 0.2348 0.5345 1.3060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.392570 0.268032 20.119 < 2e-16 ***
## Age 0.038754 0.005535 7.002 1.19e-11 ***
## GenderMale 0.207161 0.095888 2.160 0.0314 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7442 on 371 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1251
## F-statistic: 27.66 on 2 and 371 DF, p-value: 6.337e-12
#what the model object contains
names(m2)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
#Coefficient contains the intercept and slopes of the model
m2$coefficients
## (Intercept) Age GenderMale
## 5.39257020 0.03875422 0.20716090
#tidy(m1)
#glance(m1)
#4 test the homoscedasticity
#extract values for testing homoscedasticity
sleepH2<-data.frame(resid = m2$residuals, fitVal = m2$fitted.values)
ggplot(sleepH2, aes(x =fitVal, y= resid)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(method = "loess") +
geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
summary(m2)
##
## Call:
## lm(formula = Sleep.Duration ~ Age + Gender, data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3690 -0.7258 0.2348 0.5345 1.3060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.392570 0.268032 20.119 < 2e-16 ***
## Age 0.038754 0.005535 7.002 1.19e-11 ***
## GenderMale 0.207161 0.095888 2.160 0.0314 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7442 on 371 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1251
## F-statistic: 27.66 on 2 and 371 DF, p-value: 6.337e-12
sigma(m2)
## [1] 0.744234
# calculate the percent error
sigma(m2)/mean(sleep2$Sleep.Duration)
## [1] 0.1043501
The multivariate regression results indicate that both age and gender are significant predictors of sleep duration. A graph was constructed to assess linearity only for the continuous variables, Age and Sleep Duration, as linearity cannot be directly tested between a categorical variable like Gender and a continuous variable. This graph is the same as the one produced in question 4. The estimated model is:
\[Sleep Duration=5.39+0.039×Age+0.207×Gender(Male)\]
This suggests that, on average, sleep duration increases by 0.039 hours for each additional year of age, and males sleep approximately 0.21 hours longer than females, holding age constant. The intercept of 5.39 hours represents the predicted sleep duration when age is zero and it is the reference gender (female). The residuals are roughly centered around zero, with some deviation, indicating that the homoscedasticity assumption is not perfectly met, though the pattern appears slightly improved compared to the univariate model (m1). The residual standard error is 0.744 hours, corresponding to a percentage error of about 10.4% relative to the mean sleep duration, reflecting moderate variability around the predicted values. The model explains approximately 13% of the variance in sleep duration (Adjusted R^2 = 0.125), representing a modest improvement over the univariate model with age alone. Overall, including gender alongside age provides a clearer understanding of factors influencing sleep duration, though additional predictors may be needed to account for more of the variability in the data.
In this step, I aim to build a multivariate linear model to investigate how sleep duration is influenced by both age and exercise, represented here by daily steps and physical activity level. Before fitting the model, I examine the linearity of the relationships between each predictor and sleep duration using scatterplots with linear and loess fits. Since daily steps and physical activity level are highly correlated, I will also check for multicollinearity and adjust the model accordingly. After fitting the model, I evaluate its quality using statistical metrics such as coefficients, residual standard error, R-squared, and variance inflation factors (VIF), as well as testing the assumptions of linear regression.
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Age,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Daily.Steps,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Physical.Activity.Level,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#3 make the linear model
#Check correlations between variables
m3 <- lm(Sleep.Duration ~ Age + Physical.Activity.Level + Daily.Steps, data = sleep2)
cor(sleep2[, c("Sleep.Duration", "Age", "Physical.Activity.Level", "Daily.Steps")])
## Sleep.Duration Age Physical.Activity.Level
## Sleep.Duration 1.00000000 0.3447094 0.2123603
## Age 0.34470936 1.0000000 0.1789927
## Physical.Activity.Level 0.21236031 0.1789927 1.0000000
## Daily.Steps -0.03953254 0.0579734 0.7727231
## Daily.Steps
## Sleep.Duration -0.03953254
## Age 0.05797340
## Physical.Activity.Level 0.77272305
## Daily.Steps 1.00000000
#because the physcial activity level and daily steps correlate at nearly 78% between eachother, it is likely one of them should be removed from the model
names(m3)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(m3)
##
## Call:
## lm(formula = Sleep.Duration ~ Age + Physical.Activity.Level +
## Daily.Steps, data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2776 -0.5833 0.1596 0.4140 1.3081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.406e+00 2.391e-01 26.794 < 2e-16 ***
## Age 2.564e-02 4.310e-03 5.949 6.25e-09 ***
## Physical.Activity.Level 1.947e-02 2.823e-03 6.898 2.29e-11 ***
## Daily.Steps -2.211e-04 3.582e-05 -6.174 1.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7045 on 370 degrees of freedom
## Multiple R-squared: 0.2224, Adjusted R-squared: 0.2161
## F-statistic: 35.27 on 3 and 370 DF, p-value: < 2.2e-16
m3$coefficients
## (Intercept) Age Physical.Activity.Level
## 6.4055982732 0.0256433959 0.0194722764
## Daily.Steps
## -0.0002211379
#tidy(m3)
#glance(m3)
m3fixed <-lm(Sleep.Duration ~ Age + Physical.Activity.Level, data = sleep2)
#4 test the homoscedasticity
#extract values for testing homoscedasticity
sleepH3<-data.frame(resid = m3$residuals, fitVal = m3$fitted.values)
ggplot(sleepH3, aes(x =fitVal, y= resid)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(method = "loess") +
geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#VIF approach to check collinearity, looks all good
vif(m3)
## Age Physical.Activity.Level Daily.Steps
## 1.050484 2.598550 2.523779
vif(m3fixed)
## Age Physical.Activity.Level
## 1.033099 1.033099
#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
summary(m3fixed)
##
## Call:
## lm(formula = Sleep.Duration ~ Age + Physical.Activity.Level,
## data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5844 -0.5754 0.1555 0.4850 1.2851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.554123 0.204822 27.117 < 2e-16 ***
## Age 0.029067 0.004483 6.483 2.86e-10 ***
## Physical.Activity.Level 0.005945 0.001867 3.185 0.00157 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7389 on 371 degrees of freedom
## Multiple R-squared: 0.1423, Adjusted R-squared: 0.1377
## F-statistic: 30.77 on 2 and 371 DF, p-value: 4.327e-13
sigma(m3fixed)
## [1] 0.738869
# calculate the percent error
sigma(m3fixed)/mean(sleep2$Sleep.Duration)
## [1] 0.1035979
The initial model including age, physical activity level, and daily steps revealed high collinearity between physical activity level and daily steps (correlation ≈ 0.77), so daily steps was removed to avoid multicollinearity. Physical activity level was retained because the linear and loess fits for this variable aligned more closely than those for daily steps. The final model is:
\[Sleep Duration=5.55+0.029×Age+0.0059×Physical Activity Level\]
This suggests that, on average, sleep duration increases by 0.029 hours for each additional year of age and by 0.0059 hours for each unit increase in physical activity level, holding other variables constant. The intercept of 5.55 hours represents the predicted sleep duration when both age and physical activity level are zero. Residuals are roughly centered around zero, indicating reasonable adherence to linear regression assumptions, and the residual standard error is 0.739 hours (≈10.4% of the mean sleep duration). The model explains about 14% of the variance in sleep duration (Adjusted R^2 = 0.138), a modest improvement over the univariate model with age alone. VIF values were all below 5, confirming that multicollinearity is no longer a concern. Overall, including physical activity alongside age provides a better understanding of factors influencing sleep duration, though additional predictors may be needed to capture more of the variability in the data.
In this step, I aim to build a multivariate linear model to evaluate how sleep duration is influenced by age, gender, and exercise (daily steps), while also testing whether age and gender interact in predicting sleep duration. Visualizations are first used to assess linearity for the continuous predictors (age and daily steps). Since multicollinearity can affect model estimates, I will check for it and standardize variables as needed. Finally, the model’s quality is evaluated using regression coefficients, residual standard error, R-squared, and other diagnostic metrics to understand the strength and reliability of the predictors and their interaction.
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Age,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#2 test linearity
#ggplot of the two variables
sleep2%>%
ggplot(aes(x=Daily.Steps,y=Sleep.Duration))+
geom_point()+
# the linear fit line
geom_smooth(method="lm",aes(colour="linear"))+
#the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#3 make the linear model
m4 <- lm(Sleep.Duration ~ Age * Gender + Daily.Steps, data = sleep2)
summary(m4)
##
## Call:
## lm(formula = Sleep.Duration ~ Age * Gender + Daily.Steps, data = sleep2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2981 -0.7501 0.2297 0.6493 1.5129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.043e+00 3.473e-01 14.522 < 2e-16 ***
## Age 5.145e-02 6.726e-03 7.649 1.77e-13 ***
## GenderMale 1.690e+00 4.810e-01 3.514 0.000496 ***
## Daily.Steps -3.689e-05 2.365e-05 -1.560 0.119582
## Age:GenderMale -3.651e-02 1.163e-02 -3.138 0.001837 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7348 on 369 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.147
## F-statistic: 17.07 on 4 and 369 DF, p-value: 7.387e-13
#tidy(m4)
#glance(m4)
#4 test the homoscedasticity
#extract values for testing homoscedasticity
sleepH4<-data.frame(resid = m4$residuals, fitVal = m4$fitted.values)
ggplot(sleepH4, aes(x =fitVal, y= resid)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(method = "loess") +
geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#VIF approach to check collinearity, need to standardize
vif(m4)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## Age Gender Daily.Steps Age:Gender
## 2.350791 40.055828 1.011095 33.719034
#Standardize the data and correct the colinearity. Center the data by subtracting the mean of the variable.
standard_sleep <- sleep2 %>%
select(Sleep.Duration, Age, Gender, Daily.Steps) %>%
mutate(standardAge = Age - mean(Age), standDaily.Steps = Daily.Steps - mean(Daily.Steps))
#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
m4fixed<-lm(Sleep.Duration ~ Age * Gender + Daily.Steps, data = standard_sleep)
summary(m4fixed)
##
## Call:
## lm(formula = Sleep.Duration ~ Age * Gender + Daily.Steps, data = standard_sleep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2981 -0.7501 0.2297 0.6493 1.5129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.043e+00 3.473e-01 14.522 < 2e-16 ***
## Age 5.145e-02 6.726e-03 7.649 1.77e-13 ***
## GenderMale 1.690e+00 4.810e-01 3.514 0.000496 ***
## Daily.Steps -3.689e-05 2.365e-05 -1.560 0.119582
## Age:GenderMale -3.651e-02 1.163e-02 -3.138 0.001837 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7348 on 369 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.147
## F-statistic: 17.07 on 4 and 369 DF, p-value: 7.387e-13
sigma(m4fixed)
## [1] 0.734843
# calculate the percent error (it is the exact same if you )
sigma(m4fixed)/mean(standard_sleep$Sleep.Duration)
## [1] 0.1030334
The model including the interaction between age and gender, along with daily steps, indicates that both age and gender significantly affect sleep duration, and that the effect of age differs between males and females. The standardized model is:
\[Sleep Duration=5.04+0.051×Age+1.69×Gender(Male)−0.0365×(Age×Gender(Male))−0.0000369×Daily Steps\]
Here, age has a positive association with sleep duration for females, but the negative interaction term indicates that the effect of age is slightly reduced for males. Gender alone is associated with longer sleep duration for males, and daily steps did not reach statistical significance. The residuals are roughly centered around zero, with a residual standard error of 0.735 hours (≈10.3% of the mean sleep duration). The model explains about 15.6% of the variance in sleep duration (Adjusted R^2 = 0.147), showing a modest improvement over previous models. Standardization of predictors resolved multicollinearity issues, allowing reliable estimation of the interaction term. Overall, this model demonstrates that the effect of age on sleep duration differs by gender, while daily steps has limited predictive value in this dataset.