Let’s import relevant libraries and load the data:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
load("sleep75.RData")
Fitting Model to our data:
model <- lm(sleep ~ totwrk, data = data)
Total number of data points: n = 706
data %>%
summarise(n())
## n()
## 1 706
summary(model)
##
## Call:
## lm(formula = sleep ~ totwrk, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2429.94 -240.25 4.91 250.53 1339.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3586.37695 38.91243 92.165 <2e-16 ***
## totwrk -0.15075 0.01674 -9.005 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 421.1 on 704 degrees of freedom
## Multiple R-squared: 0.1033, Adjusted R-squared: 0.102
## F-statistic: 81.09 on 1 and 704 DF, p-value: < 2.2e-16
Results: sleep = 3586.37695 -0.15075* totalwrk
In total data consists of 706 observation for 34 features.
Intercept (3586.37695 minutes per week/ 8.54 hours per day) from our model indicates the sleep in minutes when total work is zero / total sleep in minutes for unemployed people.
Residual standard error: 421.1 on 704 degrees of freedom Multiple R-squared: 0.1033, Adjusted R-squared: 0.102
sleep = 3586.37695 -0.15075* totalwrk
If totwrk increases by 2 hours in a week then sleep will fall by 0.15075 * 2 * 60= 18.09 minutes. This is only a few minutes per night.
0.15075 * 2*60
## [1] 18.09
Plots:
plot(data$totwrk, data$sleep)
abline(model,col = 'red')
data$predicted_sleep <- predict(model)
SST calculation: SST = \(\sum (y - \bar y)^2\)
SST <- sum((data$sleep - mean(data$sleep))^2)
SST
## [1] 139239836
SSR calculation: SSR = \(\sum (y - \hat y)^2\)
SSR <- sum((data$sleep - data$predicted_sleep)^2)
SSR
## [1] 124858119
SSE calculation: SSE = \(\sum (\bar y - \hat y)^2\)
SSE <- sum((mean(data$sleep) - data$predicted_sleep)^2)
SSE
## [1] 14381717
\(R^2 \space Calculation:\)
R_square <- (1- (SSR/SST))
R_square
## [1] 0.1032874
Adjusted_R_square <- 1-((1- R_square)*705/704)
Adjusted_R_square
## [1] 0.1020136
RMSE Calculation (on 704 degrees of freedom):
RMSE <- (SSR/704)^0.5
RMSE
## [1] 421.1357
SLR 4: The error u has an expected value of zero given any value of the explanatory variable. In other words,E(u|x)= 0.
plot(data$totwrk, data$sleep)
abline(3586.37695,-0.15075, col = 'red')
E(u) = 0
round(sum(data$predicted_sleep - data$sleep),2)
## [1] 0
Also,from above plot it is clear that E(u) randomly distributed accross different values of X. So It can be said that, E(u|x) = 0
If SLR.4 fails, the OLS estimators generally will be biased and \(E(\beta) \space will \space Not \space Equal \space to \space\beta\)
Let’s include age in our model to see effect of omitted variable bias:
summary(lm(sleep ~ totwrk + age, data = data))
##
## Call:
## lm(formula = sleep ~ totwrk + age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2384.63 -242.13 7.81 262.45 1302.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3469.20059 68.11787 50.929 <2e-16 ***
## totwrk -0.14901 0.01672 -8.912 <2e-16 ***
## age 2.92388 1.39671 2.093 0.0367 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 420.1 on 703 degrees of freedom
## Multiple R-squared: 0.1088, Adjusted R-squared: 0.1063
## F-statistic: 42.93 on 2 and 703 DF, p-value: < 2.2e-16
cor(data$totwrk, data$age)
## [1] -0.04956989
Including age in our model increased adjusted R_square value from 0.102 to 0.1063. The coefficient for age is positive and Correlation between total work and age is negative.
Also, the coefficient of total work is decreased. Thus we can see that, there is negative bias from excluding age in our model.
plot(data$totwrk, data$sleep)
abline(3586.37695, -0.15075, col = "blue")
abline(3469.20059, -0.14901, col = "red")
Blue line : Excluding age Red line : Including age