Weekly Assignment #6

Due: Oct 25, 2025

Concept questions

  1. In a linear model of a continuous and categorical variable what do the coefficients describe/represent?

In a linear model of a continuous and categorical variable, the coefficients represent:

\[Y = β0 + β1X1 + β2X2 +ϵ\]

Y-Intercept (β₀): The predicted value of Y for the reference group of the categorical variable when the continuous variable equals 0.

Continuous variable coefficient (β₁): How much Y changes for each 1-unit increase in the continuous variable, keeping the categorical variable fixed.

Categorical variable coefficient (β₂): How much the mean of Y differs between the comparison group and the reference group, holding the continuous variable fixed.

Example: If we model sleep_duration ~ age + gender, the intercept is the predicted sleep duration for the reference gender at age 0, the age slope shows how sleep duration changes per year of age, and genderMale shows the difference between males and females.

  1. In a linear model of a continuous and categorical variable interacting what do the coefficients describe/represent?

When the model includes an interaction term (e.g., age * gender), the coefficients represent:

Y-Intercept (β₀): Predicted value of Y for the reference group when the continuous variable is 0.

Main effect (continuous): The slope of the continuous variable for the reference group.

Main effect (categorical): The difference in intercepts between groups when the continuous variable is 0.

Interaction term: Shows how the slope of the continuous variable changes between groups.

So, in sleep_duration ~ age * gender, the interaction term tests whether the relationship between age and sleep duration differs between males and females.

  1. What are the key assumptions of a simple linear model?

Independence: Each observation must be independent. One person’s data shouldn’t affect another’s.

Normality: The residuals (errors) should be normally distributed around zero.

Linearity: The relationship between the predictor(s) (X’s) and outcome (Y) should be roughly a straight line.

Homoscedasticity: The residuals should have equal variance across all fitted values. The “spread” of errors should stay about the same.

No major outliers or influential points (optional but important): There shouldn’t be any extreme data points that strongly distort the fit.

Coding questions

Remember to always write a discussion after every graph of table produced to explain or interpret a finding and set up your next steps.

We will explore a new data set that evaluated sleep in 400 individuals. Dataset Columns:

Person ID: An identifier for each individual.

Gender: The gender of the person (Male/Female).

Age: The age of the person in years.

Occupation: The occupation or profession of the person.

Sleep Duration (hours): The number of hours the person sleeps per day.

Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.

Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.

Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.

BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).

Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.

Heart Rate (bpm): The resting heart rate of the person in beats per minute. Daily Steps: The number of steps the person takes per day.

Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

Load in the data and check for NAs or abnormal 0 values in the sleep data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggpubr)
library(GGally)
library(broom)
library(car) # vif()
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(cowplot) # plot_grid similar to lesson
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:ggpubr':
## 
##     get_legend
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
sleep <- read_csv("Sleep_health_and_lifestyle_dataset.csv")
## Rows: 374 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
## dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sleep)
## Rows: 374
## Columns: 13
## $ `Person ID`               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender                    <chr> "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Age                       <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ Occupation                <chr> "Software Engineer", "Doctor", "Doctor", "Sa…
## $ `Sleep Duration`          <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep`        <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level`            <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category`            <chr> "Overweight", "Normal", "Normal", "Obese", "…
## $ `Blood Pressure`          <chr> "126/83", "125/80", "125/80", "140/90", "140…
## $ `Heart Rate`              <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps`             <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder`          <chr> "None", "None", "None", "Sleep Apnea", "Slee…
summary(sleep)
##    Person ID         Gender               Age         Occupation       
##  Min.   :  1.00   Length:374         Min.   :27.00   Length:374        
##  1st Qu.: 94.25   Class :character   1st Qu.:35.25   Class :character  
##  Median :187.50   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :187.50                      Mean   :42.18                     
##  3rd Qu.:280.75                      3rd Qu.:50.00                     
##  Max.   :374.00                      Max.   :59.00                     
##  Sleep Duration  Quality of Sleep Physical Activity Level  Stress Level  
##  Min.   :5.800   Min.   :4.000    Min.   :30.00           Min.   :3.000  
##  1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00           1st Qu.:4.000  
##  Median :7.200   Median :7.000    Median :60.00           Median :5.000  
##  Mean   :7.132   Mean   :7.313    Mean   :59.17           Mean   :5.385  
##  3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00           3rd Qu.:7.000  
##  Max.   :8.500   Max.   :9.000    Max.   :90.00           Max.   :8.000  
##  BMI Category       Blood Pressure       Heart Rate     Daily Steps   
##  Length:374         Length:374         Min.   :65.00   Min.   : 3000  
##  Class :character   Class :character   1st Qu.:68.00   1st Qu.: 5600  
##  Mode  :character   Mode  :character   Median :70.00   Median : 7000  
##                                        Mean   :70.17   Mean   : 6817  
##                                        3rd Qu.:72.00   3rd Qu.: 8000  
##                                        Max.   :86.00   Max.   :10000  
##  Sleep Disorder    
##  Length:374        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
head(sleep)
## # A tibble: 6 × 13
##   `Person ID` Gender   Age Occupation        `Sleep Duration` `Quality of Sleep`
##         <dbl> <chr>  <dbl> <chr>                        <dbl>              <dbl>
## 1           1 Male      27 Software Engineer              6.1                  6
## 2           2 Male      28 Doctor                         6.2                  6
## 3           3 Male      28 Doctor                         6.2                  6
## 4           4 Male      28 Sales Representa…              5.9                  4
## 5           5 Male      28 Sales Representa…              5.9                  4
## 6           6 Male      28 Software Engineer              5.9                  4
## # ℹ 7 more variables: `Physical Activity Level` <dbl>, `Stress Level` <dbl>,
## #   `BMI Category` <chr>, `Blood Pressure` <chr>, `Heart Rate` <dbl>,
## #   `Daily Steps` <dbl>, `Sleep Disorder` <chr>
#Test to check if the data is clean and doesn not contain odd values, NAs, or rare categories
#apply(X = sleep, MARGIN = 2, FUN = function(x) sum(x==0))
#apply(sleep, 2, function(x) sum(is.na(x))) # same as x==0

No NAs or 0s are found in the sleep data so I commented the apply functions out. The sleep data is now ready to be viewed and used.

  1. Consider that the colnames are not R compliant, they should change into names with only letter, numbers and a . or _. Spaces and starting with a number should be avoided. you can investigate the function make.names()

In this step, I will make sure all column names in the sleep dataset follow R’s naming rules. Some names might have spaces or special characters that could cause errors later. To fix this, I’ll use the make.names() function, which automatically replaces invalid characters with periods and ensures that names don’t start with numbers. This will make the dataset easier to work with in the rest of the analysis.

colnames(sleep) <- make.names(colnames(sleep))
names(sleep)
##  [1] "Person.ID"               "Gender"                 
##  [3] "Age"                     "Occupation"             
##  [5] "Sleep.Duration"          "Quality.of.Sleep"       
##  [7] "Physical.Activity.Level" "Stress.Level"           
##  [9] "BMI.Category"            "Blood.Pressure"         
## [11] "Heart.Rate"              "Daily.Steps"            
## [13] "Sleep.Disorder"

After running the code, all column names in the sleep dataset became R-friendly. Names that had spaces or special characters were replaced with periods—for example, “Sleep Duration” became “Sleep.Duration.” The dataset now has consistent, easy-to-use column names that will prevent syntax errors in later steps.

  1. Consider if the data is tidy. Do any columns contain more than one value?

In this step, I will check whether the dataset is tidy, meaning each variable has its own column and each observation has its own row. The Blood.Pressure column currently contains two values, systolic and diastolic pressure, separated by a slash (/). To make the data tidy, I will use the separate() function from the tidyr package to split this single column into two new columns: Systolic and Diastolic. I will also use mutate() and str_trim() to clean up extra spaces and convert the new columns to numeric values. This will make the dataset easier to analyze and compatible with future statistical models.

#Make the Blood.Pressure column Tidy by splitting Systolic and Diastolic Blood Pressures
#View Blood.Pressure Column 

sleep %>% select(Blood.Pressure)
## # A tibble: 374 × 1
##    Blood.Pressure
##    <chr>         
##  1 126/83        
##  2 125/80        
##  3 125/80        
##  4 140/90        
##  5 140/90        
##  6 140/90        
##  7 140/90        
##  8 120/80        
##  9 120/80        
## 10 120/80        
## # ℹ 364 more rows
#Split Blood.Pressure into Systolic and Diastolic 
sleep1 <- sleep %>%
mutate(Blood.Pressure = str_trim(Blood.Pressure)) %>%
separate(Blood.Pressure, into = c("Systolic", "Diastolic"), sep = "/") %>%
mutate(across(c(Systolic, Diastolic), ~ as.numeric(.)))


#Observe sleep1
glimpse(sleep1)
## Rows: 374
## Columns: 14
## $ Person.ID               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Gender                  <chr> "Male", "Male", "Male", "Male", "Male", "Male"…
## $ Age                     <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29…
## $ Occupation              <chr> "Software Engineer", "Doctor", "Doctor", "Sale…
## $ Sleep.Duration          <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8, 7…
## $ Quality.of.Sleep        <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6…
## $ Physical.Activity.Level <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, 75…
## $ Stress.Level            <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8, 8…
## $ BMI.Category            <chr> "Overweight", "Normal", "Normal", "Obese", "Ob…
## $ Systolic                <dbl> 126, 125, 125, 140, 140, 140, 140, 120, 120, 1…
## $ Diastolic               <dbl> 83, 80, 80, 90, 90, 90, 90, 80, 80, 80, 80, 80…
## $ Heart.Rate              <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, 70…
## $ Daily.Steps             <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, 80…
## $ Sleep.Disorder          <chr> "None", "None", "None", "Sleep Apnea", "Sleep …
#summary(sleep1$Systolic)
#summary(sleep1$Diastolic)

After running the code, the Blood.Pressure column was successfully split into two new numeric columns: Systolic and Diastolic. Each now contains a single, clearly defined value for each observation, which makes the dataset tidy. The glimpse() output confirmed that both new variables are of type , and the dataset now has 14 columns instead of 13. This step improved the structure of the data, ensuring that each column represents one variable and each row represents one participant, which is essential for accurate data analysis.

  1. Consider if the variables are the correct data types, character, numeric and categorical.

In this step, I will make sure that each variable in the dataset has the correct data type. Some variables, such as Gender, Occupation, BMI.Category, and Sleep.Disorder, should be stored as factors since they represent categories. Others, like Age, Sleep.Duration, and Heart.Rate, should be numeric to allow for statistical calculations. Using the mutate() function, I will convert each column to its appropriate type. This will help ensure that future analyses, summaries, and visualizations work properly and that R correctly interprets the nature of each variable.

#Mutate all remaining variables to ensure they are the correct data types
sleep2 <- sleep1 %>%
mutate(
Gender = as.factor(Gender),
Occupation = as.factor(Occupation),
BMI.Category = as.factor(BMI.Category),
Sleep.Disorder = as.factor(Sleep.Disorder),
Age = as.numeric(Age),
Sleep.Duration = as.numeric(Sleep.Duration),
Quality.of.Sleep = as.numeric(Quality.of.Sleep),
Physical.Activity.Level = as.numeric(Physical.Activity.Level),
Stress.Level = as.numeric(Stress.Level),
Heart.Rate = as.numeric(Heart.Rate),
Daily.Steps = as.numeric(Daily.Steps)
)

#Confirm 

str(sleep2)
## tibble [374 × 14] (S3: tbl_df/tbl/data.frame)
##  $ Person.ID              : num [1:374] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age                    : num [1:374] 27 28 28 28 28 28 29 29 29 29 ...
##  $ Occupation             : Factor w/ 11 levels "Accountant","Doctor",..: 10 2 2 7 7 10 11 2 2 2 ...
##  $ Sleep.Duration         : num [1:374] 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
##  $ Quality.of.Sleep       : num [1:374] 6 6 6 4 4 4 6 7 7 7 ...
##  $ Physical.Activity.Level: num [1:374] 42 60 60 30 30 30 40 75 75 75 ...
##  $ Stress.Level           : num [1:374] 6 8 8 8 8 8 7 6 6 6 ...
##  $ BMI.Category           : Factor w/ 4 levels "Normal","Normal Weight",..: 4 1 1 3 3 3 3 1 1 1 ...
##  $ Systolic               : num [1:374] 126 125 125 140 140 140 140 120 120 120 ...
##  $ Diastolic              : num [1:374] 83 80 80 90 90 90 90 80 80 80 ...
##  $ Heart.Rate             : num [1:374] 77 75 75 85 85 85 82 70 70 70 ...
##  $ Daily.Steps            : num [1:374] 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
##  $ Sleep.Disorder         : Factor w/ 3 levels "Insomnia","None",..: 2 2 2 3 3 1 1 2 2 2 ...

After running the code, the str(sleep2) output confirmed that all variables were successfully converted to their correct data types. The categorical variables now appear as factors with a defined number of levels, while all measurement variables are numeric. For example, Gender became a factor with two levels (“Female” and “Male”), and Sleep.Duration, Age, and Daily.Steps are now numeric. This correction ensures that R can properly handle categorical and continuous variables during statistical modeling and visualization, improving both accuracy and efficiency in the analysis.

In this step, I will create a custom function called checknorm to assess the normality of continuous variables in the dataset. The normality assumption is important for many statistical methods, including linear regression, so this function will help identify whether each variable follows a roughly normal distribution. The function will perform a Shapiro-Wilk test to test normality statistically and generate two diagnostic plots for each variable: a density plot to visualize the distribution and a QQ plot to compare the observed data against a theoretical normal distribution. After defining the function, I will apply it to several numeric variables, such as Age, Sleep.Duration, and Daily.Steps, and then combine all plots into one grid for easier comparison.

#1 test normalcy of all the variables  

data<-sleep2
variable<-c(
  "Age", "Sleep.Duration", "Quality.of.Sleep",
  "Physical.Activity.Level", "Stress.Level",
  "Daily.Steps", "Heart.Rate", "Systolic", "Diastolic"
)

checknorm<- function(data,variable){

#A density plot of the value distributions
p1<-ggplot(data, aes(x=.data[[variable]])) +
    geom_density() +
  ggtitle(paste("Density of ", variable))

#Shapiro test object can have values extracted

sp<-shapiro.test(data[[variable]])

#A qqplot as a ggplot object using  stat_qq() stat_qq_line()
p2<-ggplot(data, aes(sample=.data[[variable]]))+
    stat_qq() +
    stat_qq_line() +
    labs(x ="theoretical") +
    ggtitle(paste(variable , " Shapiro pval: ", round(sp$p.value, 3)))

plot_grid(p1,p2)
}



#Create list of plots
plot_list <- lapply(variable, function(v) checknorm(data, v))

#Combine everything into one big grid
all_plots <- cowplot::plot_grid(plotlist = plot_list, ncol = 2)

#Display
all_plots

After running the code, the Shapiro-Wilk tests for all variables returned p-values of 0, indicating that none of the variables followed a normal distribution. The density plots supported this finding—most variables showed irregular, multimodal, or skewed distributions rather than the smooth, bell-shaped curve expected under normality. Among the variables, Age and Daily.Steps appeared closer to normal compared to Physical.Activity.Level and Sleep.Duration, which displayed more pronounced deviations. The QQ plots reinforced these observations, with data points for all variables straying noticeably from the diagonal reference line. Overall, both the statistical tests and visual diagnostics suggest that the continuous variables in this dataset are not normally distributed. This means that linear models based on these variables may not fully meet the normality assumption; however, they can still offer valuable insight and serve as simple predictive tools, especially when interpreted with caution.

  1. Build a linear model of sleep duration as a function of age. Evaluate the quality of the model discussing appropriate statistical metrics that the lm() function provides

In this analysis, I aim to investigate the relationship between age and sleep duration by building a linear regression model. The goal is to quantify how sleep duration changes with age and evaluate whether age is a significant predictor of sleep duration. The code first visualizes the data to examine the potential linearity of the relationship and then fits a linear model using the lm() function. Additionally, I will assess the assumptions of linear regression, including homoscedasticity of residuals, and evaluate the quality of the model using standard metrics such as R-squared, residual standard error, and the significance of the regression coefficients.

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Age,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#3 make the linear model
m1 <- lm(Sleep.Duration ~ Age, data = sleep2)

# present the model calculations and statistics
summary(m1)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age, data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4160 -0.6867  0.1686  0.6246  1.1532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.798086   0.192278  30.155  < 2e-16 ***
## Age         0.031623   0.004465   7.083 7.12e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7479 on 372 degrees of freedom
## Multiple R-squared:  0.1188, Adjusted R-squared:  0.1165 
## F-statistic: 50.16 on 1 and 372 DF,  p-value: 7.117e-12
#what the model object contains
names(m1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
#Coefficient contains the intercept and slopes of the model
m1$coefficients
## (Intercept)         Age 
##  5.79808624  0.03162298
#tidy(m1)
#glance(m1)

#4 test the homoscedasticity 
#extract values for testing homoscedasticity
sleepH1<-data.frame(resid = m1$residuals, fitVal = m1$fitted.values)

ggplot(sleepH1, aes(x =fitVal, y= resid)) +
    geom_point() +
    geom_smooth(method = "lm") +
        geom_smooth(method = "loess") +
    geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#graphing linear model, m1
ggplot(sleep2, aes(x = Age, y = Sleep.Duration)) + 
  geom_point() +
  geom_abline(intercept = m1$coefficients[1] , 
              slope = m1$coefficients[2], 
              color = "purple") +
  annotate("text", x = 43, y = 7, label = "Sleep Duration = 0.032(Age) + 5.80")+
   ggtitle("Sleep Duration by Age")

#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
summary(m1)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age, data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4160 -0.6867  0.1686  0.6246  1.1532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.798086   0.192278  30.155  < 2e-16 ***
## Age         0.031623   0.004465   7.083 7.12e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7479 on 372 degrees of freedom
## Multiple R-squared:  0.1188, Adjusted R-squared:  0.1165 
## F-statistic: 50.16 on 1 and 372 DF,  p-value: 7.117e-12
sigma(m1)
## [1] 0.7478937
# calculate the percent error
sigma(m1)/mean(sleep2$Sleep.Duration)
## [1] 0.1048633

The linear regression results indicate that age is a statistically significant predictor of sleep duration. Visualizing the relationship with both a linear and loess fit shows considerable overlap between the two lines, suggesting that while a linear model may not fully capture the nuances of the data, it provides a useful approximation of the overall trend. The estimated model is:

\[Sleep Duration = 5.80 + 0.032 × Age\]

This implies that, on average, sleep duration increases slightly by 0.032 hours for each additional year of age. The intercept of 5.80 hours represents the predicted sleep duration when age is zero. The residuals are roughly centered around zero, though there is some deviation, indicating that the assumption of homoscedasticity is not perfectly met. The residual standard error (sigma) is 0.748 hours, corresponding to a percentage error of approximately 10.5% relative to the mean sleep duration, suggesting moderate variability around the predicted values. However, the model explains only about 12% of the variance in sleep duration (Adjusted R^2 = 0.1165), meaning that while the relationship is statistically significant (p < 0.001), age alone is a relatively weak predictor. Overall, the linear model identifies a small but meaningful trend, and incorporating additional predictors may improve model accuracy.

  1. Build a multivariate model of sleep duration as a function of gender and age. Evaluate the quality of the model discussing appropriate statistical metrics that the lm() function provides.

In this step, I aim to build a multivariate linear model to explore how sleep duration is influenced by both age and gender. By including gender as an additional predictor, I want to determine whether there are differences in sleep duration between males and females while accounting for age. The code first visualizes the relationship between age and sleep duration, then fits a linear model using lm(). Finally, I evaluate the model’s quality by examining regression coefficients, residuals, R-squared values, and the residual standard error to assess the strength and reliability of the model.

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Age,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#3 make the linear model
m2 <- lm(Sleep.Duration ~ Age+Gender, data = sleep2)

# present the model calculations and statistics
summary(m2)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age + Gender, data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3690 -0.7258  0.2348  0.5345  1.3060 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.392570   0.268032  20.119  < 2e-16 ***
## Age         0.038754   0.005535   7.002 1.19e-11 ***
## GenderMale  0.207161   0.095888   2.160   0.0314 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7442 on 371 degrees of freedom
## Multiple R-squared:  0.1298, Adjusted R-squared:  0.1251 
## F-statistic: 27.66 on 2 and 371 DF,  p-value: 6.337e-12
#what the model object contains
names(m2)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
#Coefficient contains the intercept and slopes of the model
m2$coefficients
## (Intercept)         Age  GenderMale 
##  5.39257020  0.03875422  0.20716090
#tidy(m1)
#glance(m1)

#4 test the homoscedasticity 
#extract values for testing homoscedasticity
sleepH2<-data.frame(resid = m2$residuals, fitVal = m2$fitted.values)

ggplot(sleepH2, aes(x =fitVal, y= resid)) +
    geom_point() +
    geom_smooth(method = "lm") +
        geom_smooth(method = "loess") +
    geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
summary(m2)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age + Gender, data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3690 -0.7258  0.2348  0.5345  1.3060 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.392570   0.268032  20.119  < 2e-16 ***
## Age         0.038754   0.005535   7.002 1.19e-11 ***
## GenderMale  0.207161   0.095888   2.160   0.0314 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7442 on 371 degrees of freedom
## Multiple R-squared:  0.1298, Adjusted R-squared:  0.1251 
## F-statistic: 27.66 on 2 and 371 DF,  p-value: 6.337e-12
sigma(m2)
## [1] 0.744234
# calculate the percent error
sigma(m2)/mean(sleep2$Sleep.Duration)
## [1] 0.1043501

The multivariate regression results indicate that both age and gender are significant predictors of sleep duration. A graph was constructed to assess linearity only for the continuous variables, Age and Sleep Duration, as linearity cannot be directly tested between a categorical variable like Gender and a continuous variable. This graph is the same as the one produced in question 4. The estimated model is:

\[Sleep Duration=5.39+0.039×Age+0.207×Gender(Male)\]

This suggests that, on average, sleep duration increases by 0.039 hours for each additional year of age, and males sleep approximately 0.21 hours longer than females, holding age constant. The intercept of 5.39 hours represents the predicted sleep duration when age is zero and it is the reference gender (female). The residuals are roughly centered around zero, with some deviation, indicating that the homoscedasticity assumption is not perfectly met, though the pattern appears slightly improved compared to the univariate model (m1). The residual standard error is 0.744 hours, corresponding to a percentage error of about 10.4% relative to the mean sleep duration, reflecting moderate variability around the predicted values. The model explains approximately 13% of the variance in sleep duration (Adjusted R^2 = 0.125), representing a modest improvement over the univariate model with age alone. Overall, including gender alongside age provides a clearer understanding of factors influencing sleep duration, though additional predictors may be needed to account for more of the variability in the data.

  1. Build a multivariate linear model of sleep duration with age and daily steps (exercise). Recall that multicolinearity may be an issue and should be evaluated and corrected. Evaluate the quality of the model discussing appropriate statistical metrics that the lm() function provides.

In this step, I aim to build a multivariate linear model to investigate how sleep duration is influenced by both age and exercise, represented here by daily steps and physical activity level. Before fitting the model, I examine the linearity of the relationships between each predictor and sleep duration using scatterplots with linear and loess fits. Since daily steps and physical activity level are highly correlated, I will also check for multicollinearity and adjust the model accordingly. After fitting the model, I evaluate its quality using statistical metrics such as coefficients, residual standard error, R-squared, and variance inflation factors (VIF), as well as testing the assumptions of linear regression.

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Age,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Daily.Steps,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Physical.Activity.Level,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#3 make the linear model
#Check correlations between variables
m3 <- lm(Sleep.Duration ~ Age + Physical.Activity.Level + Daily.Steps, data = sleep2)
cor(sleep2[, c("Sleep.Duration", "Age", "Physical.Activity.Level", "Daily.Steps")])
##                         Sleep.Duration       Age Physical.Activity.Level
## Sleep.Duration              1.00000000 0.3447094               0.2123603
## Age                         0.34470936 1.0000000               0.1789927
## Physical.Activity.Level     0.21236031 0.1789927               1.0000000
## Daily.Steps                -0.03953254 0.0579734               0.7727231
##                         Daily.Steps
## Sleep.Duration          -0.03953254
## Age                      0.05797340
## Physical.Activity.Level  0.77272305
## Daily.Steps              1.00000000
#because the physcial activity level and daily steps correlate at nearly 78% between eachother, it is likely one of them should be removed from the model

names(m3)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
summary(m3)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age + Physical.Activity.Level + 
##     Daily.Steps, data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2776 -0.5833  0.1596  0.4140  1.3081 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.406e+00  2.391e-01  26.794  < 2e-16 ***
## Age                      2.564e-02  4.310e-03   5.949 6.25e-09 ***
## Physical.Activity.Level  1.947e-02  2.823e-03   6.898 2.29e-11 ***
## Daily.Steps             -2.211e-04  3.582e-05  -6.174 1.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7045 on 370 degrees of freedom
## Multiple R-squared:  0.2224, Adjusted R-squared:  0.2161 
## F-statistic: 35.27 on 3 and 370 DF,  p-value: < 2.2e-16
m3$coefficients
##             (Intercept)                     Age Physical.Activity.Level 
##            6.4055982732            0.0256433959            0.0194722764 
##             Daily.Steps 
##           -0.0002211379
#tidy(m3)
#glance(m3)


m3fixed <-lm(Sleep.Duration ~ Age + Physical.Activity.Level, data = sleep2)

#4 test the homoscedasticity 
#extract values for testing homoscedasticity
sleepH3<-data.frame(resid = m3$residuals, fitVal = m3$fitted.values)

ggplot(sleepH3, aes(x =fitVal, y= resid)) +
    geom_point() +
    geom_smooth(method = "lm") +
        geom_smooth(method = "loess") +
    geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#VIF approach to check collinearity, looks all good

vif(m3)
##                     Age Physical.Activity.Level             Daily.Steps 
##                1.050484                2.598550                2.523779
vif(m3fixed)
##                     Age Physical.Activity.Level 
##                1.033099                1.033099
#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
summary(m3fixed)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age + Physical.Activity.Level, 
##     data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5844 -0.5754  0.1555  0.4850  1.2851 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.554123   0.204822  27.117  < 2e-16 ***
## Age                     0.029067   0.004483   6.483 2.86e-10 ***
## Physical.Activity.Level 0.005945   0.001867   3.185  0.00157 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7389 on 371 degrees of freedom
## Multiple R-squared:  0.1423, Adjusted R-squared:  0.1377 
## F-statistic: 30.77 on 2 and 371 DF,  p-value: 4.327e-13
sigma(m3fixed)
## [1] 0.738869
# calculate the percent error
sigma(m3fixed)/mean(sleep2$Sleep.Duration)
## [1] 0.1035979

The initial model including age, physical activity level, and daily steps revealed high collinearity between physical activity level and daily steps (correlation ≈ 0.77), so daily steps was removed to avoid multicollinearity. Physical activity level was retained because the linear and loess fits for this variable aligned more closely than those for daily steps. The final model is:

\[Sleep Duration=5.55+0.029×Age+0.0059×Physical Activity Level\]

This suggests that, on average, sleep duration increases by 0.029 hours for each additional year of age and by 0.0059 hours for each unit increase in physical activity level, holding other variables constant. The intercept of 5.55 hours represents the predicted sleep duration when both age and physical activity level are zero. Residuals are roughly centered around zero, indicating reasonable adherence to linear regression assumptions, and the residual standard error is 0.739 hours (≈10.4% of the mean sleep duration). The model explains about 14% of the variance in sleep duration (Adjusted R^2 = 0.138), a modest improvement over the univariate model with age alone. VIF values were all below 5, confirming that multicollinearity is no longer a concern. Overall, including physical activity alongside age provides a better understanding of factors influencing sleep duration, though additional predictors may be needed to capture more of the variability in the data.

  1. Build another model that evaluates sleep duration with age, gender and daily steps (exercise). Evaluate if gender and age interact in the prediction of sleep duration.

In this step, I aim to build a multivariate linear model to evaluate how sleep duration is influenced by age, gender, and exercise (daily steps), while also testing whether age and gender interact in predicting sleep duration. Visualizations are first used to assess linearity for the continuous predictors (age and daily steps). Since multicollinearity can affect model estimates, I will check for it and standardize variables as needed. Finally, the model’s quality is evaluated using regression coefficients, residual standard error, R-squared, and other diagnostic metrics to understand the strength and reliability of the predictors and their interaction.

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Age,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#2 test linearity 
#ggplot of the two variables
sleep2%>%
  ggplot(aes(x=Daily.Steps,y=Sleep.Duration))+
  geom_point()+

    # the linear fit line
  geom_smooth(method="lm",aes(colour="linear"))+
    #the nonlinear fit line (loess)
geom_smooth(method="loess",aes(colour="loess"))+
  scale_colour_manual("Line fit",breaks=c("linear","loess"),values=c("blue","red"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#3 make the linear model
m4 <- lm(Sleep.Duration ~ Age * Gender + Daily.Steps, data = sleep2)
summary(m4)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age * Gender + Daily.Steps, data = sleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2981 -0.7501  0.2297  0.6493  1.5129 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.043e+00  3.473e-01  14.522  < 2e-16 ***
## Age             5.145e-02  6.726e-03   7.649 1.77e-13 ***
## GenderMale      1.690e+00  4.810e-01   3.514 0.000496 ***
## Daily.Steps    -3.689e-05  2.365e-05  -1.560 0.119582    
## Age:GenderMale -3.651e-02  1.163e-02  -3.138 0.001837 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7348 on 369 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.147 
## F-statistic: 17.07 on 4 and 369 DF,  p-value: 7.387e-13
#tidy(m4)
#glance(m4)

#4 test the homoscedasticity 
#extract values for testing homoscedasticity
sleepH4<-data.frame(resid = m4$residuals, fitVal = m4$fitted.values)

ggplot(sleepH4, aes(x =fitVal, y= resid)) +
    geom_point() +
    geom_smooth(method = "lm") +
        geom_smooth(method = "loess") +
    geom_hline(aes(yintercept = 0), colour = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#VIF approach to check collinearity, need to standardize

vif(m4)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##         Age      Gender Daily.Steps  Age:Gender 
##    2.350791   40.055828    1.011095   33.719034
#Standardize the data and correct the colinearity. Center the data by subtracting the mean of the variable. 
standard_sleep <- sleep2 %>% 
  select(Sleep.Duration, Age, Gender, Daily.Steps) %>% 
    mutate(standardAge = Age - mean(Age), standDaily.Steps = Daily.Steps - mean(Daily.Steps))

#5 evaluate the model statistics to determine if it is a good model
#checking quality of the model, sigma and R^2 statistics
m4fixed<-lm(Sleep.Duration ~ Age * Gender + Daily.Steps, data = standard_sleep)
summary(m4fixed)
## 
## Call:
## lm(formula = Sleep.Duration ~ Age * Gender + Daily.Steps, data = standard_sleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2981 -0.7501  0.2297  0.6493  1.5129 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.043e+00  3.473e-01  14.522  < 2e-16 ***
## Age             5.145e-02  6.726e-03   7.649 1.77e-13 ***
## GenderMale      1.690e+00  4.810e-01   3.514 0.000496 ***
## Daily.Steps    -3.689e-05  2.365e-05  -1.560 0.119582    
## Age:GenderMale -3.651e-02  1.163e-02  -3.138 0.001837 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7348 on 369 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.147 
## F-statistic: 17.07 on 4 and 369 DF,  p-value: 7.387e-13
sigma(m4fixed)
## [1] 0.734843
# calculate the percent error (it is the exact same if you )

sigma(m4fixed)/mean(standard_sleep$Sleep.Duration)
## [1] 0.1030334

The model including the interaction between age and gender, along with daily steps, indicates that both age and gender significantly affect sleep duration, and that the effect of age differs between males and females. The standardized model is:

\[Sleep Duration=5.04+0.051×Age+1.69×Gender(Male)−0.0365×(Age×Gender(Male))−0.0000369×Daily Steps\]

Here, age has a positive association with sleep duration for females, but the negative interaction term indicates that the effect of age is slightly reduced for males. Gender alone is associated with longer sleep duration for males, and daily steps did not reach statistical significance. The residuals are roughly centered around zero, with a residual standard error of 0.735 hours (≈10.3% of the mean sleep duration). The model explains about 15.6% of the variance in sleep duration (Adjusted R^2 = 0.147), showing a modest improvement over previous models. Standardization of predictors resolved multicollinearity issues, allowing reliable estimation of the interaction term. Overall, this model demonstrates that the effect of age on sleep duration differs by gender, while daily steps has limited predictive value in this dataset.