Using Raw .csv

Daniel has modified the original dataset and has added the variables “sleep_sum”, “pain_sum” and “covariates_sum”. I’ll undertake some exploratory data analysis.

First order of business is to read in the data:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

url <- "https://contattafiles.s3.us-west-1.amazonaws.com/tnt45405/Gb0VJAO8wb3OccK/modified_ELSI.csv"
data <- read.csv(url)
data <- as_tibble(data)

Tests of Normality

Let’s plot the distribution of each of the three variables:

# ggplot2 is already attached as part of the tidyverse
data %>% ggplot(aes(x = sleep_sum)) + geom_density()

data %>% ggplot(aes(x = pain_sum)) + geom_density()

data %>% ggplot(aes(x = covariates_sum)) + geom_density()
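The three density plots can also be drawn as panels of a single figure, which makes the shapes easier to compare. The following is a minimal sketch: the helper name `plot_densities` and the simulated `toy` tibble are illustrative assumptions, not part of the original analysis.

```r
library(tidyverse)

# Hypothetical helper: one density panel per column, all in a single figure.
plot_densities <- function(df, cols) {
  df %>%
    select(all_of(cols)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
    ggplot(aes(x = value)) +
    geom_density() +
    facet_wrap(~ variable, scales = "free")
}

# Simulated stand-in data so the block runs on its own; NOT the real dataset.
set.seed(1)
toy <- tibble(
  sleep_sum      = rpois(200, 4),
  pain_sum       = rpois(200, 2),
  covariates_sum = rpois(200, 6)
)

p <- plot_densities(toy, c("sleep_sum", "pain_sum", "covariates_sum"))
p
```

With the real data, the call would be `plot_densities(data, c("sleep_sum", "pain_sum", "covariates_sum"))`.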

The density plots suggest that none of the variables is normally distributed. Let’s check with Q-Q plots.

data %>% ggplot(aes(sample = pain_sum)) +
  geom_qq() +
  geom_qq_line()

data %>% ggplot(aes(sample = sleep_sum)) +
  geom_qq() +
  geom_qq_line()

data %>% ggplot(aes(sample = covariates_sum)) +
  geom_qq() +
  geom_qq_line()

I believe the Q-Q plots above confirm that these variables aren’t normally distributed.
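A formal test could complement the visual check. The sketch below uses the Shapiro-Wilk test from base R. Since `shapiro.test()` only accepts between 3 and 5000 values, the hypothetical helper `shapiro_subsample` draws a random subsample when the vector is longer. The simulated `toy_scores` vector is a stand-in, not the real data. With samples this large the test tends to flag even small departures from normality, so the Q-Q plots probably remain the more informative check.

```r
# Hypothetical helper: Shapiro-Wilk test on at most `max_n` observations,
# because stats::shapiro.test() only accepts 3 to 5000 values.
shapiro_subsample <- function(x, max_n = 5000, seed = 123) {
  x <- x[!is.na(x)]            # drop missing values first
  if (length(x) > max_n) {
    set.seed(seed)             # make the subsample reproducible
    x <- sample(x, max_n)
  }
  shapiro.test(x)
}

# Simulated, right-skewed stand-in data; NOT the real dataset.
set.seed(1)
toy_scores <- rexp(8000)

result <- shapiro_subsample(toy_scores)
result$p.value                 # a small p-value rejects normality
```

With the real data, the call would be, for example, `shapiro_subsample(data$sleep_sum)`.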

Tests of Linearity

Let’s check whether there is a linear relationship between sleep_sum and pain_sum:

data %>% ggplot(aes(x = pain_sum, y = sleep_sum)) + geom_point()

The relationship doesn’t look linear here. Just to be sure, let’s conduct a hypothesis test.

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.2.3

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

model <- lm(sleep_sum ~ pain_sum, data = data)
raintest(model)

## 
##  Rainbow test
## 
## data:  model
## Rain = 0.76396, df1 = 4975, df2 = 4972, p-value = 1

The rainbow test does not reject the null hypothesis of linearity (p-value = 1). That is not quite what the scatter plot suggests, and failing to reject the null is not proof of linearity, but let’s proceed on the assumption that the relationship is linear.
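One possible reason a scatter plot and a linearity test disagree is overplotting. Assuming both variables are integer sum scores, many rows share the same coordinates and the trend is hard to see. The sketch below sizes each point by the number of overlapping rows, marks the mean of sleep_sum at each pain_sum value, and overlays the fitted line. The `toy` data frame is simulated and only illustrates the technique.

```r
library(ggplot2)

# Simulated stand-in data with integer scores; NOT the real dataset.
set.seed(1)
toy <- data.frame(pain_sum = rpois(500, 3))
toy$sleep_sum <- rpois(500, 2 + 0.5 * toy$pain_sum)

p <- ggplot(toy, aes(x = pain_sum, y = sleep_sum)) +
  geom_count(alpha = 0.3) +                                  # point size = number of overlapping rows
  stat_summary(fun = mean, geom = "point", colour = "red") + # mean of y at each x value
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)    # fitted straight line
p
```

If the red conditional means track the straight line closely, that is at least consistent with a linear relationship. With the real data, `toy` would be replaced by `data`.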

Unfortunately, I wasn’t able to test for homoscedasticity. Let me know if there are any mistakes.
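A possible way to run the homoscedasticity check is the Breusch-Pagan test, available as `bptest()` in the lmtest package that is already loaded. The sketch below fits the same kind of model on simulated stand-in data. `toy` and `toy_model` are illustrative names, and the result says nothing about the real dataset.

```r
library(lmtest)

# Simulated stand-in data; NOT the real dataset.
set.seed(1)
toy <- data.frame(pain_sum = rpois(500, 3))
toy$sleep_sum <- rpois(500, 2 + 0.5 * toy$pain_sum)

toy_model <- lm(sleep_sum ~ pain_sum, data = toy)

# Studentized Breusch-Pagan test; H0: the error variance is constant.
bp <- bptest(toy_model)
bp

# A residuals-vs-fitted plot is a useful visual companion:
# plot(toy_model, which = 1)
```

With the real model, the call would simply be `bptest(model)`. A small p-value would point to heteroscedasticity.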

Pedro Henrique Brant

2023-10-16
