R Markdown

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(naniar)
library(readr)
library(ggplot2)
library(broom)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)

Let’s load the dataset:

data(sleep)
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

The summary for each variable: Extra, Group, and ID

sum_var <- sapply(sleep, summary)
sum_var
## $extra
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.600  -0.025   0.950   1.540   3.400   5.500 
## 
## $group
##  1  2 
## 10 10 
## 
## $ID
##  1  2  3  4  5  6  7  8  9 10 
##  2  2  2  2  2  2  2  2  2  2
  1. Relationship between ID and extra sleep
plot_id_extra <- ggplot(data=sleep, mapping=aes(ID,extra)) +
geom_point(color = "blue", size = 3) +  # Adds points
  labs(title = "Scatter Plot of ID vs Extra Sleep",
       x = "ID",
       y = "Extra Sleep") +
  theme_minimal()

plot_id_extra

The initial observations show that the patients with higher ID had higher scores for extra sleep compared to those with lower ID.

Using simple linear regression to predict extra sleep by patient ID:

my_reg <- lm(extra~ID, data=sleep)
summary(my_reg)
## 
## Call:
## lm(formula = extra ~ ID, data = sleep)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2.30  -0.65   0.00   0.65   2.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.3000     0.9821   1.324   0.2151  
## ID2          -1.7000     1.3889  -1.224   0.2490  
## ID3          -0.8500     1.3889  -0.612   0.5542  
## ID4          -1.8500     1.3889  -1.332   0.2124  
## ID5          -1.4000     1.3889  -1.008   0.3372  
## ID6           2.6000     1.3889   1.872   0.0907 .
## ID7           3.3000     1.3889   2.376   0.0389 *
## ID8          -0.1000     1.3889  -0.072   0.9440  
## ID9           1.0000     1.3889   0.720   0.4880  
## ID10          1.4000     1.3889   1.008   0.3372  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.389 on 10 degrees of freedom
## Multiple R-squared:  0.7507, Adjusted R-squared:  0.5263 
## F-statistic: 3.345 on 9 and 10 DF,  p-value: 0.03676
tidy(my_reg)
## # A tibble: 10 × 5
##    term        estimate std.error statistic p.value
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)    1.30      0.982    1.32    0.215 
##  2 ID2           -1.7       1.39    -1.22    0.249 
##  3 ID3           -0.850     1.39    -0.612   0.554 
##  4 ID4           -1.85      1.39    -1.33    0.212 
##  5 ID5           -1.4       1.39    -1.01    0.337 
##  6 ID6            2.6       1.39     1.87    0.0907
##  7 ID7            3.30      1.39     2.38    0.0389
##  8 ID8           -0.100     1.39    -0.0720  0.944 
##  9 ID9            1.00      1.39     0.720   0.488 
## 10 ID10           1.4       1.39     1.01    0.337

Based on these results the intercept equals to 1.30, which is the value of extra sleep for the patient with ID=1. We can also see the slope values for each ID. For instance, for the patient with ID=7, the slope was 3.3, which means that inidividuals with this ID had 3.3 units more extra sleep compared to the baseline (id=1). From all of these results only the id=7 had a p-value less than 0.05, which means that patients only in this group had a statistically significant difference from the baseline group. Adjusted R-squared, which takes into account the number of predictiors added to the model equals to 0.526. This means that approximately 52% percent of the variability in the sample can be explained by the model, which is relatively low.

Plotting residuals:

plot(my_reg, which=1)

plot(my_reg, which=2)

#more advanced
residuals <- residuals(my_reg)
fitted_values <- fitted(my_reg)

ggplot(data = sleep, aes(x = fitted_values, y = residuals)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

Based on these plots residuals seem to be normally distributed.

ggplot(sleep, aes(x = as.numeric(ID), y = extra)) +  
  geom_point(color = "blue", size = 3) +           
  geom_smooth(method = "lm", color = "red", lwd = 1) +  
  labs(title = "Scatter Plot with Regression Line",
       x = "ID",
       y = "Extra Sleep") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The regression analysis showed a positive relationship between ID and extra. The relationship was not statistically significant, except the case where id=7, suggesting that there is insufficient evidence to conclude a meaningful association between ID and extra in this dataset. The adjusted R2=0.52 indicates that our model is not a good fit to the dataset.

We conclude that patient ID was not statistically significantly associated with extra sleep. We do not know what was the principle of giving IDs to patients. If it was random assignments, our results are not surprising, as clinically it does not make sense to expect an association between patient ID and extra sleep. Alternatively, if it was based on patient characteristics, our results can be explained by the influence of other confounders, which are not taken into account (sex, lifestyle factors, health status).

Predicting extra sleep by group:

my_reg2 <- lm(extra~group, data=sleep)
summary(my_reg2)
## 
## Call:
## lm(formula = extra ~ group, data = sleep)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.430 -1.305 -0.580  1.455  3.170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.7500     0.6004   1.249   0.2276  
## group2        1.5800     0.8491   1.861   0.0792 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.899 on 18 degrees of freedom
## Multiple R-squared:  0.1613, Adjusted R-squared:  0.1147 
## F-statistic: 3.463 on 1 and 18 DF,  p-value: 0.07919
ggplot(sleep, aes(x = as.numeric(group), y = extra)) +  
  geom_point(color = "blue", size = 3) +           
  geom_smooth(method = "lm", color = "red", lwd = 1) +  
  labs(title = "Scatter Plot with Regression Line",
       x = "Group",
       y = "Extra Sleep") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

We can see that adjusted R-squared vaue was much lower, indicating that linear model is not a good fit for the relationship of group and extra sleep. P-value was greater than 0.05, so we can not reject null hypothesis, which says that there is no statistically significant relationship between the 2 variables group and extra sleep.