Let’s import relevant libraries and load the data:

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
load("sleep75.RData")

Fitting Model to our data:

model <- lm(sleep ~ totwrk, data = data)

Total number of data points: n = 706

data %>%
  summarise(n())
##   n()
## 1 706
summary(model)
## 
## Call:
## lm(formula = sleep ~ totwrk, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2429.94  -240.25     4.91   250.53  1339.72 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3586.37695   38.91243  92.165   <2e-16 ***
## totwrk        -0.15075    0.01674  -9.005   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 421.1 on 704 degrees of freedom
## Multiple R-squared:  0.1033, Adjusted R-squared:  0.102 
## F-statistic: 81.09 on 1 and 704 DF,  p-value: < 2.2e-16

i) Report your results in equation form along with the number of observations and R2. What does the intercept in this equation mean?

Results: sleep = 3586.37695 -0.15075* totalwrk

In total data consists of 706 observation for 34 features.

Intercept (3586.37695 minutes per week/ 8.54 hours per day) from our model indicates the sleep in minutes when total work is zero / total sleep in minutes for unemployed people.

Residual standard error: 421.1 on 704 degrees of freedom Multiple R-squared: 0.1033, Adjusted R-squared: 0.102

ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?

sleep = 3586.37695 -0.15075* totalwrk

If totwrk increases by 2 hours in a week then sleep will fall by 0.15075 * 2 * 60= 18.09 minutes. This is only a few minutes per night.

0.15075 * 2*60
## [1] 18.09

Plots:

plot(data$totwrk, data$sleep)
abline(model,col = 'red')

iii) Show the calculation of R2 and RMSE, using SSR, SSE, and SST

data$predicted_sleep <- predict(model)

SST calculation: SST = \(\sum (y - \bar y)^2\)

SST <- sum((data$sleep - mean(data$sleep))^2)

SST
## [1] 139239836

SSR calculation: SSR = \(\sum (y - \hat y)^2\)

SSR <- sum((data$sleep - data$predicted_sleep)^2)

SSR
## [1] 124858119

SSE calculation: SSE = \(\sum (\bar y - \hat y)^2\)

SSE <- sum((mean(data$sleep) - data$predicted_sleep)^2)

SSE
## [1] 14381717

\(R^2 \space Calculation:\)

R_square <- (1- (SSR/SST))
R_square
## [1] 0.1032874
Adjusted_R_square <- 1-((1- R_square)*705/704)
Adjusted_R_square
## [1] 0.1020136

RMSE Calculation (on 704 degrees of freedom):

RMSE <- (SSR/704)^0.5
RMSE
## [1] 421.1357

iv) Discuss the likelihood that SLR.4 holds for this equation. Show exactly what the implications of the failure of this assumption will be.

SLR 4: The error u has an expected value of zero given any value of the explanatory variable. In other words,E(u|x)= 0.

plot(data$totwrk, data$sleep)
abline(3586.37695,-0.15075, col = 'red')

E(u) = 0

round(sum(data$predicted_sleep - data$sleep),2)
## [1] 0

Also,from above plot it is clear that E(u) randomly distributed accross different values of X. So It can be said that, E(u|x) = 0

If SLR.4 fails, the OLS estimators generally will be biased and \(E(\beta) \space will \space Not \space Equal \space to \space\beta\)


Let’s include age in our model to see effect of omitted variable bias:

summary(lm(sleep ~ totwrk + age, data = data))
## 
## Call:
## lm(formula = sleep ~ totwrk + age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2384.63  -242.13     7.81   262.45  1302.19 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3469.20059   68.11787  50.929   <2e-16 ***
## totwrk        -0.14901    0.01672  -8.912   <2e-16 ***
## age            2.92388    1.39671   2.093   0.0367 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 420.1 on 703 degrees of freedom
## Multiple R-squared:  0.1088, Adjusted R-squared:  0.1063 
## F-statistic: 42.93 on 2 and 703 DF,  p-value: < 2.2e-16
cor(data$totwrk, data$age)
## [1] -0.04956989

Including age in our model increased adjusted R_square value from 0.102 to 0.1063. The coefficient for age is positive and Correlation between total work and age is negative.

Also, the coefficient of total work is decreased. Thus we can see that, there is negative bias from excluding age in our model.

plot(data$totwrk, data$sleep)
abline(3586.37695, -0.15075, col = "blue")
abline(3469.20059, -0.14901, col = "red")

Blue line : Excluding age Red line : Including age