Daniel has modified the original dataset and has added the variables “sleep_sum”, “pain_sum” and “covariates_sum”. I’ll undertake some exploratory data analysis.
First order of business is to read in the data:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
url <- "https://contattafiles.s3.us-west-1.amazonaws.com/tnt45405/Gb0VJAO8wb3OccK/modified_ELSI.csv"
data <- read.csv(url)
data <- as_tibble(data)
Let’s plot the distribution of each of the three sum variables:
library(ggplot2)  # already attached by tidyverse, so this call is optional
data %>% ggplot(aes(x = sleep_sum)) + geom_density()
data %>% ggplot(aes(x = pain_sum)) + geom_density()
data %>% ggplot(aes(x = covariates_sum)) + geom_density()
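As an alternative to three separate plots, the same densities can be put in one faceted figure, which makes the shapes easier to compare. This is only a sketch: `toy` is a hypothetical, simulated stand-in so the snippet runs on its own, and `toy_long` and `p` are names made up for illustration. Swap `toy` for `data` to use the real file.

```r
library(tidyverse)

# Hypothetical stand-in for the real dataset (simulated, NOT the ELSI data).
set.seed(1)
toy <- tibble(
  sleep_sum      = rpois(200, 3),
  pain_sum       = rpois(200, 2),
  covariates_sum = rpois(200, 5)
)

# Reshape to long format: one row per (variable, value) pair.
toy_long <- toy %>%
  pivot_longer(
    cols      = c(sleep_sum, pain_sum, covariates_sum),
    names_to  = "variable",
    values_to = "value"
  )

# One density panel per variable, each with its own axis scales.
p <- ggplot(toy_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~ variable, scales = "free")
p
```

`scales = "free"` is used because the three sums need not share a range; drop it if a common axis is preferred.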
The density plots suggest that none of the variables is normally distributed. Let’s check with Q-Q plots.
data %>% ggplot(aes(sample = pain_sum)) +
  geom_qq() +
  geom_qq_line()
data %>% ggplot(aes(sample = sleep_sum)) +
  geom_qq() +
  geom_qq_line()
data %>% ggplot(aes(sample = covariates_sum)) +
  geom_qq() +
  geom_qq_line()
I believe the Q-Q plots above confirm that these variables aren’t normally distributed.
Let’s check whether there is a linear relationship between sleep_sum and pain_sum:
data %>% ggplot(aes(x = pain_sum, y = sleep_sum)) + geom_point()
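If the two sums are integer scores (an assumption on my part), many points will sit on top of each other in a plain scatter plot, which makes the trend hard to judge. A possible refinement is to jitter the points and overlay both a fitted straight line and the mean of sleep_sum at each pain_sum value; if the means follow the line, linearity is plausible. The sketch below uses a simulated, hypothetical `toy` data frame so it runs on its own; replace it with `data` for the real analysis.

```r
library(ggplot2)

# Hypothetical stand-in for the real dataset (simulated, NOT the ELSI data).
set.seed(1)
toy <- data.frame(pain_sum = rpois(200, 2))
toy$sleep_sum <- rpois(200, 2 + 0.5 * toy$pain_sum)

p <- ggplot(toy, aes(x = pain_sum, y = sleep_sum)) +
  # Jitter and transparency reduce overplotting of identical integer pairs.
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.3) +
  # Straight line from a simple linear regression.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Mean of sleep_sum at each observed pain_sum value, joined by a red line.
  stat_summary(fun = mean, geom = "line", colour = "red")
p
```

The red line of group means only makes sense when pain_sum takes a limited number of distinct values; with a continuous predictor a smoother would be the more natural choice.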
The scatter plot doesn’t suggest a linear relationship. Just to be sure, let’s run a hypothesis test.
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
model <- lm(sleep_sum ~ pain_sum, data = data)
raintest(model)
##
## Rainbow test
##
## data: model
## Rain = 0.76396, df1 = 4975, df2 = 4972, p-value = 1
Under the rainbow test, the null hypothesis of linearity could not be rejected (p-value = 1). This doesn’t quite match the scatter plot, and failing to reject the null is not proof of linearity, but let’s assume linearity for now.
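One caveat about the rainbow test: with the default `order.by = NULL`, `raintest()` assumes the observations are already in a meaningful order (as in a time series). For cross-sectional data it may be more informative to order the observations by the predictor, so that the "central" subset really is the middle of the pain_sum range. The sketch below shows the call on a simulated, hypothetical `toy` data frame (not the real data); on the actual dataset the equivalent would be `raintest(sleep_sum ~ pain_sum, order.by = ~ pain_sum, data = data)`, which I have not run.

```r
library(lmtest)

# Hypothetical stand-in for the real dataset (simulated, NOT the ELSI data).
set.seed(1)
toy <- data.frame(pain_sum = rpois(300, 2))
toy$sleep_sum <- 1 + 0.5 * toy$pain_sum + rnorm(300)

# Rainbow test with observations ordered by the predictor before the
# central subset is compared against the full sample.
rt <- raintest(sleep_sum ~ pain_sum, order.by = ~ pain_sum, data = toy)
rt
```

As before, a small p-value would count as evidence against linearity; a large one only means the test found no such evidence.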
Unfortunately, I wasn’t able to test for homoscedasticity. Let me know if there are any mistakes.
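For anyone picking this up: one common way to check homoscedasticity is the Breusch-Pagan test, available as `bptest()` in lmtest, which is already loaded above. The sketch below has not been run on the real data; `toy` and `toy_model` are hypothetical, simulated stand-ins so the snippet runs on its own. On the actual model the call would simply be `bptest(model)`.

```r
library(lmtest)

# Hypothetical stand-in for the real dataset (simulated, NOT the ELSI data).
set.seed(1)
toy <- data.frame(pain_sum = rpois(300, 2))
toy$sleep_sum <- 1 + 0.5 * toy$pain_sum + rnorm(300)

toy_model <- lm(sleep_sum ~ pain_sum, data = toy)

# Breusch-Pagan test (studentized by default).
# Null hypothesis: the error variance is constant (homoscedasticity).
bp <- bptest(toy_model)
bp
```

A small p-value would suggest heteroscedasticity. A plot of residuals against fitted values, for example `plot(model, which = 1)`, is a useful visual companion to the test.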