Based on the article “Sleep and the Allocation of Time”, I created an analysis of the length of sleep and nap time (from the sleep75 data set) in terms of other variables, such as total work time and age.

Correlation matrix of the sleep75 dataset

I created this matrix to have a general view of correlations between different variables in the sleep75 data set.

Warning in cor(data, use = method[1], method = method[2]): the standard
deviation is zero

Creating a basic regression model

From the matrix, the strongest correlation I found for slpnaps was with totwrk. The coefficient of correlation is:

The following object is masked from package:datasets:

    sleep
[1] -0.3425592

Then I created a linear model for sleep&nap time and total work time.

model1 <- lm(slpnaps~totwrk, data = sleep75)
summary(model1)

Call:
lm(formula = slpnaps ~ totwrk, data = sleep75)

Residuals:
     Min       1Q   Median       3Q      Max 
-1950.64  -270.86   -11.23   261.34  2343.88 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3766.12461   43.35218  86.873   <2e-16 ***
totwrk        -0.18043    0.01865  -9.674   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 469.2 on 704 degrees of freedom
Multiple R-squared:  0.1173,    Adjusted R-squared:  0.1161 
F-statistic:  93.6 on 1 and 704 DF,  p-value: < 2.2e-16

The intercept is the predicted sleep and nap time for a person who does not work at all: 3766.12 minutes (about 62 hours and 46 minutes) per week.

The equation of sleep and nap time looks like this: slpnaps = 3766.12461 - 0.1804*totwrk

Since totwrk is measured in minutes per week, increasing the total work time by one hour per week is associated with a decrease in sleep and nap time of 0.1804*60 = 10.824 minutes per week, which is only about 1.5 minutes per night.

Residual standard error: 469.2 on 704 degrees of freedom Multiple R-squared: 0.1173

Scatterplot of sleep time and total work

ggplot(data=sleep75, aes(totwrk,slpnaps)) +
  geom_point() +
  geom_smooth(method = lm)+
  labs(x="total work time", y="sleep and nap time")
`geom_smooth()` using formula = 'y ~ x'

Total work time analysis

ggplot(data=sleep75, aes(factor(male),totwrk)) +
  geom_boxplot() +
  ggtitle("Total work time grouped by sex") +
  labs(x="0 - female, 1 - male", y="total work time")

The boxplot shows how total work time was distributed for men and for women in the 70s. Since work time is related to sleep, this raises a question: is sleep time also related to gender?

biserial.cor(slpnaps, male)
[1] 0.05193646

The correlation coefficient between sleep time and gender is very low (about 0.05), which indicates almost no linear association between gender and the length of sleep and naps.

Considering wage in the model

As Biddle and Hamermesh mentioned in their article, sleep and nap time is also affected by the earnings of a working person.

model2<-lm(slpnaps ~ totwrk + earns74, data = sleep75,  na.action = na.omit)
summary(model2)

Call:
lm(formula = slpnaps ~ totwrk + earns74, data = sleep75, na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max 
-1966.15  -267.98    -3.06   253.17  2372.15 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.789e+03  4.551e+01  83.274   <2e-16 ***
totwrk      -1.768e-01  1.875e-02  -9.428   <2e-16 ***
earns74     -3.177e-03  1.906e-03  -1.667    0.096 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 468.6 on 703 degrees of freedom
Multiple R-squared:  0.1208,    Adjusted R-squared:  0.1183 
F-statistic: 48.31 on 2 and 703 DF,  p-value: < 2.2e-16

This model fits slightly better than the previous one: the R-squared rises from 0.1173 to 0.1208 and the residual standard error falls from 469.2 to 468.6. However, the earns74 coefficient is not significant at the 5% level (p = 0.096), and the problem with income is that it depends on other variables.

The analysis of income

Warning in cor(data, use = method[1], method = method[2]): the standard
deviation is zero

From the correlation matrix, we can observe a correlation between income (earns74) and spsepay as well as lothinc (the log of other income). Let’s analyze them.

ggplot(data=sleep75,aes(lothinc, earns74, color = factor(male))) +
  geom_point() +
  geom_smooth(method = lm)
`geom_smooth()` using formula = 'y ~ x'

There is some negative correlation between these variables, but it is not very strong:

cor(lothinc,earns74)
[1] -0.3209126
ggplot(data=sleep75,aes(spsepay, earns74)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle("Spouse pay vs income")
`geom_smooth()` using formula = 'y ~ x'

The correlation between earnings and spouse’s earnings is just 0.2433.

cor(spsepay,earns74)
[1] 0.2433129

To sum it up, it is quite difficult to state what impacts the income as it is influenced by many variables. It is also hard to create an accurate linear model for it.

Creating a final version of the model

In this case I want to consider a model defined by the equation: slpnaps = β0 + β1 × totwrk + β2 × educ + β3 × age + u

model3<-lm(slpnaps ~ totwrk + educ + age, data = sleep75)
summary(model3)

Call:
lm(formula = slpnaps ~ totwrk + educ + age, data = sleep75)

Residuals:
     Min       1Q   Median       3Q      Max 
-1914.12  -277.00    -7.56   277.30  2377.92 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3847.54448  124.50279  30.903  < 2e-16 ***
totwrk        -0.17688    0.01851  -9.555  < 2e-16 ***
educ         -16.87035    6.52545  -2.585  0.00993 ** 
age            3.26280    1.60317   2.035  0.04220 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 465 on 702 degrees of freedom
Multiple R-squared:  0.1354,    Adjusted R-squared:  0.1317 
F-statistic: 36.64 on 3 and 702 DF,  p-value: < 2.2e-16

As you can see, more education is associated with less sleep and nap time. If we assume that the difference between high school and college is four years, then the sleep time of a college graduate would be 4*16.87 = 67.48 minutes per week lower than that of a high school graduate, holding work time and age fixed.

The coefficient for age is positive, so an increase in age is associated with an increase in sleep time, although the effect is significant only at the 5% level (p = 0.042).

`geom_smooth()` using formula = 'y ~ x'

Including age and education in the model increased the R-squared from 0.1173 to 0.1354. However, the fit is still weak, since the three variables explain only around 13.5% of the variation in sleep and nap time.

The RSE of this model is 465, which corresponds to an error rate (RSE divided by the mean of slpnaps) of about 13.7%. This is slightly better than the simple model, where the error rate was about 13.9%.

[1] 0.1374576

Diagnostic plots of the regression model

When it comes to the diagnostic plots, the Residuals vs Fitted plot looks OK and so does the Normal Q-Q plot, which means the residuals show no clear non-linear pattern and are roughly normally distributed. The Scale-Location and Residuals vs Leverage plots also show no serious problems.

Overall, even though it is far from perfect and explains only a small share of the variation, this model seems to satisfy the regression assumptions reasonably well.

Conclusion

Total work time is the only factor with a clear and strong effect on sleeping time; education and age are significant at the 5% level, but their effects are small. Because of the low correlation coefficients, it is hard to state what else might influence the sleep time. However, these models show that sleep is at least partially a consumer’s choice and is influenced by the same variables as our other uses of time.

---
title: "Sleep analysis"
author: "Agata Stukow"
date: "2023-06-12"
output:
  html_document:
    theme: cerulean
    highlight: textmate
    fontsize: 8pt
    toc: yes
    code_download: yes
    toc_float:
      collapsed: no
    df_print: default
    toc_depth: 5
  pdf_document:
    toc: yes
    toc_depth: '5'
editor_options:
  markdown:
    wrap: 72
---
Based on the article "Sleep and the Allocation of Time", I created an analysis of the length of sleep and nap time (from the sleep75 data set) in terms of other variables, such as total work time and age.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, comment = "")
library(dplyr)
library(corrplot)
library(tidyverse)
library(haven)
library(ggplot2)
library(gridExtra)
library(ppcor) # this package computes partial and semipartial correlations.
library(ltm) # this package computes point-biserial correlations.
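library(devtools)
# Install "ryouready" from GitHub only when it is missing, so that knitting does not reinstall it on every run.
if (!requireNamespace("ryouready", quietly = TRUE)) {
  install_github("markheckmann/ryouready")
}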
library(ryouready) # this package computes nonlinear "eta" correlations.
library(GGally) # this package computes correlation matrix.
library(psych) # this package computes qualitative correlations.
library(DescTools) # this package computes qualitative correlations.
library(wooldridge)
library(plot3D)
```


## Correlation matrix of the sleep75 dataset

I created this matrix to have a general view of correlations between different variables in the sleep75 data set.

```{r matrixE}
cor_matrix <- cor(sleep75)                      
cor_matrix_rm <- cor_matrix                 
cor_matrix_rm[upper.tri(cor_matrix_rm)] <- 0
diag(cor_matrix_rm) <- 0
data_new <- sleep75[ , !apply(cor_matrix_rm,2, function(x) any(abs(x) > 0.99, na.rm = TRUE))]
#corrplot(cor(data_new,use="pairwise.complete.obs"), method = 'circle', type="upper", order = "alphabet")
ggcorr(data_new, method = c("pairwise", "pearson"), digits = 2, label_alpha = TRUE)
```

## Creating a basic regression model

From the matrix, the strongest correlation I found for slpnaps was with totwrk. The coefficient of correlation is:

```{r cor1}
attach(sleep75)
cor(totwrk,slpnaps)
```


Then I created a linear model for sleep&nap time and total work time.

```{r linear model,echo=TRUE}
model1 <- lm(slpnaps~totwrk, data = sleep75)
summary(model1)
```

The intercept is the predicted sleep and nap time for a person who does not work at all: 3766.12 minutes (about 62 hours and 46 minutes) per week.

The equation of sleep and nap time looks like this:
slpnaps = 3766.12461 - 0.1804*totwrk

Since totwrk is measured in minutes per week, increasing the total work time by one hour per week is associated with a decrease in sleep and nap time of 0.1804*60 = 10.824 minutes per week, which is only about 1.5 minutes per night.
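
The conversion from the coefficients to hours and minutes can be checked with a short sketch. The coefficients are copied by hand from the `summary(model1)` output, and the helper name `predict_slpnaps` is my own, not part of any package.

```r
# Coefficients copied from the summary(model1) output (assumed, not re-estimated here)
b0 <- 3766.12461   # intercept: minutes of sleep and naps per week at totwrk = 0
b1 <- -0.18043     # change in slpnaps (minutes) per extra minute of work

# Hypothetical helper: predicted weekly sleep and nap time for a given work time
predict_slpnaps <- function(totwrk) b0 + b1 * totwrk

hours_no_work    <- predict_slpnaps(0) / 60  # intercept expressed in hours per week
effect_per_hour  <- b1 * 60                  # minutes per week for one extra hour of work
effect_per_night <- effect_per_hour / 7      # the same effect spread over seven nights

round(c(hours_no_work, effect_per_hour, effect_per_night), 2)
```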

Residual standard error: 469.2 on 704 degrees of freedom
Multiple R-squared: 0.1173

## Scatterplot of sleep time and total work

```{r scatter1, echo=TRUE}
ggplot(data=sleep75, aes(totwrk,slpnaps)) +
  geom_point() +
  geom_smooth(method = lm)+
  labs(x="total work time", y="sleep and nap time")
```

## Total work time analysis 

```{r work, echo=TRUE}
ggplot(data=sleep75, aes(factor(male),totwrk)) +
  geom_boxplot() +
  ggtitle("Total work time grouped by sex") +
  labs(x="0 - female, 1 - male", y="total work time")
```

The boxplot shows how total work time was distributed for men and for women in the 70s. Since work time is related to sleep, this raises a question: is sleep time also related to gender?

```{r men cor, echo = TRUE}
biserial.cor(slpnaps, male)
```
The correlation coefficient between sleep time and gender is very low (about 0.05), which indicates almost no linear association between gender and the length of sleep and naps.
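
As a side note, a point-biserial correlation is the ordinary Pearson correlation between a continuous variable and a 0/1 variable. The sketch below illustrates this on simulated data (the numbers are made up and are not the sleep75 values; `x` and `g` are hypothetical stand-ins for slpnaps and male). The sign reported by `biserial.cor` depends on which level of the binary variable is taken as the reference (its `level` argument), so only the magnitude is interpreted here.

```r
set.seed(1)

# Simulated stand-ins: g is a 0/1 group indicator, x a continuous outcome
g <- rep(c(0, 1), times = c(40, 60))
x <- rnorm(100, mean = 3400 + 50 * g, sd = 450)

n  <- length(x)
n1 <- sum(g == 1)
n0 <- sum(g == 0)

# Point-biserial formula: difference in group means, scaled by the
# population standard deviation of x and by the group proportions
s_pop <- sqrt(mean((x - mean(x))^2))
r_pb  <- (mean(x[g == 1]) - mean(x[g == 0])) / s_pop * sqrt(n1 * n0 / n^2)

# Pearson correlation with the 0/1 coding gives the same number
r_pearson <- cor(x, g)

all.equal(r_pb, r_pearson)
```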

## Considering wage in the model

As Biddle and Hamermesh mentioned in their article, sleep and nap time is also affected by the earnings of a working person.

```{r m2, echo=TRUE}
model2<-lm(slpnaps ~ totwrk + earns74, data = sleep75,  na.action = na.omit)
summary(model2)
```
This model fits slightly better than the previous one: the R-squared rises from 0.1173 to 0.1208 and the residual standard error falls from 469.2 to 468.6. However, the earns74 coefficient is not significant at the 5% level (p = 0.096), and the problem with income is that it depends on other variables.
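
To see how much weight the earnings coefficient can carry, its p-value can be recomputed from the t value and the residual degrees of freedom. This is a minimal sketch with the two numbers copied by hand from the `summary(model2)` output; the variable names are my own.

```r
# Values copied from the summary(model2) output (assumed, not re-estimated here)
t_earns74 <- -1.667   # t value of the earns74 coefficient
df_resid  <- 703      # residual degrees of freedom

# Two-sided p-value from the t distribution
p_earns74 <- 2 * pt(-abs(t_earns74), df = df_resid)

round(p_earns74, 3)
p_earns74 < 0.05   # TRUE would mean significant at the 5% level
```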

## The analysis of income 

```{r cormat}
cor_matrix <- cor(sleep75)                      
cor_matrix_rm <- cor_matrix                 
cor_matrix_rm[upper.tri(cor_matrix_rm)] <- 0
diag(cor_matrix_rm) <- 0
data_new <- sleep75[ , !apply(cor_matrix_rm,2, function(x) any(abs(x) > 0.99, na.rm = TRUE))]
#corrplot(cor(data_new,use="pairwise.complete.obs"), method = 'circle', type="upper", order = "alphabet")
ggcorr(data_new, method = c("pairwise", "pearson"), digits = 2, label_alpha = TRUE)
```

From the correlation matrix, we can observe a correlation between income (earns74) and spsepay as well as lothinc (the log of other income). Let's analyze them.


```{r inc1, echo=TRUE}
ggplot(data=sleep75,aes(lothinc, earns74, color = factor(male))) +
  geom_point() +
  geom_smooth(method = lm)
```

There is some negative correlation between these variables, but it is not very strong:

```{r othcor, echo=TRUE}
cor(lothinc,earns74)
```


```{r inc2, echo=TRUE}
ggplot(data=sleep75,aes(spsepay, earns74)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle("Spouse pay vs income")
```

The correlation between earnings and spouse's earnings is just 0.2433.
```{r spscor, echo =TRUE}
cor(spsepay,earns74)
```

To sum it up, it is quite difficult to state what impacts the income as it is influenced by many variables. It is also hard to create an accurate linear model for it.

## Creating a final version of the model

In this case I want to consider a model defined by the equation:

$$\text{slpnaps} = \beta_0 + \beta_1 \cdot \text{totwrk} + \beta_2 \cdot \text{educ} + \beta_3 \cdot \text{age} + u$$

```{r comp model, echo=TRUE}
model3<-lm(slpnaps ~ totwrk + educ + age, data = sleep75)
summary(model3)
```
As you can see, more education is associated with less sleep and nap time. If we assume that the difference between high school and college is four years, then the sleep time of a college graduate would be 4*16.87 = 67.48 minutes per week lower than that of a high school graduate, holding work time and age fixed.
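
The four-year comparison is a point estimate, so it may help to attach a rough interval to it. The sketch below is an addition to the analysis, not part of the original model output: it takes the educ estimate and standard error by hand from `summary(model3)` and builds a 95% confidence interval under the usual linear-model assumptions. The four-year gap is the same assumption as in the text.

```r
# Estimate and standard error copied from the summary(model3) output
b_educ   <- -16.87035
se_educ  <- 6.52545
df_resid <- 702

years <- 4  # assumed gap between high school and college

# Point estimate and 95% confidence interval for `years` more years of education
t_crit <- qt(0.975, df = df_resid)
effect <- years * b_educ
ci     <- years * (b_educ + c(-1, 1) * t_crit * se_educ)

round(c(effect = effect, lower = ci[1], upper = ci[2]), 1)
```

If the interval is wide, the 67-minute figure should be read as a rough indication rather than a precise effect.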

The coefficient for age is positive, so an increase in age is associated with an increase in sleep time, although the effect is significant only at the 5% level (p = 0.042).

```{r age}
ggplot(data=sleep75,aes(age,slpnaps,color = factor(male)) ) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle("Age vs sleep and nap time for both sexes")
  
```

Including age and education in the model increased the R-squared from 0.1173 to 0.1354. However, the fit is still weak, since the three variables explain only around 13.5% of the variation in sleep and nap time.
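
Because R-squared never decreases when predictors are added, the adjusted R-squared is the fairer number for comparing the three models. The sketch below recomputes it from the reported R-squared values; the helper `adj_r2` is my own, and the inputs are the rounded figures from the summaries, so the results can differ from the printed ones in the last digit.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 706  # 704 residual degrees of freedom + 2 estimated coefficients in model1

models <- data.frame(
  model = c("model1", "model2", "model3"),
  r2    = c(0.1173, 0.1208, 0.1354),  # Multiple R-squared copied from the summaries
  p     = c(1, 2, 3)                  # number of predictors in each model
)
models$adj_r2 <- round(adj_r2(models$r2, n, models$p), 4)

models
```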

The RSE of this model is 465, which corresponds to an error rate (RSE divided by the mean of slpnaps) of about 13.7%.
This is slightly better than the simple model, where the error rate was about 13.9%.

```{r rse2, echo=TRUE}
sigma(model3)/mean(sleep75$slpnaps)
```

## Diagnostic plots of the regression model

```{r model3}
plot(model3)
```


When it comes to the diagnostic plots, the Residuals vs Fitted plot looks OK and so does the Normal Q-Q plot, which means the residuals show no clear non-linear pattern and are roughly normally distributed. The Scale-Location and Residuals vs Leverage plots also show no serious problems.
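
Reading diagnostic plots by eye is subjective, so a couple of numeric checks can complement them. The sketch below is only an illustration on simulated data (the data frame `toy` and its coefficients are made up and merely resemble the scale of the model above); the same calls could be applied to `model3`.

```r
set.seed(42)

# Simulated data on a similar scale: x stands in for work time, y for sleep time
n   <- 200
toy <- data.frame(x = runif(n, 0, 4000))
toy$y <- 3800 - 0.18 * toy$x + rnorm(n, sd = 450)

fit <- lm(y ~ x, data = toy)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
sw <- shapiro.test(residuals(fit))

# Rough heteroskedasticity check: do absolute standardized residuals
# grow or shrink with the fitted values?
het <- cor(abs(rstandard(fit)), fitted(fit))

c(shapiro_p = sw$p.value, abs_resid_vs_fitted = het)
```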

Overall, even though it is far from perfect and explains only a small share of the variation, this model seems to satisfy the regression assumptions reasonably well.

## Conclusion 

Total work time is the only factor with a clear and strong effect on sleeping time; education and age are significant at the 5% level, but their effects are small. Because of the low correlation coefficients, it is hard to state what else might influence the sleep time. However, these models show that sleep is at least partially a consumer's choice and is influenced by the same variables as our other uses of time.