This section repeats what was seen in the previous assignment: it pertains to reading in the data and cleaning it initially. The outputs from these chunks will be suppressed; refer to the previous assignment corrections if needed.
## this library is used by almost everyone who uses R and it introduces the notion of pipes (`%>%`)
library(tidyverse)
## this is used to read the excel sheet that was provided
library(readxl)
## this is used to summarize the continuous data more neatly than base R does
library(skimr)
Then, let’s read in the data. This data contains multiple entries of `"."` to represent missing values. In order to code these correctly, we pass the argument `na = "."` to the function. We then use `->` to assign the data that was read to the variable `df`.
read_excel("PPCR_2005_2006.xlsx", na = ".") -> df
Now, as you can see below, there are 19 observations in which all variables equal `NA`.
## this command takes the data that is stored in `df` and provides it as the
## first argument for the `filter()` function, which, in this situation,
## keeps an observation only when all (`if_all()` function) of the
## variables (`everything()` function) are equal to `NA` (`is.na()` function)
df %>% filter(if_all(everything(), ~ is.na(.)))
Let’s drop those observations.
## this command keeps only the observations in which at least one value, across all columns, isn't NA
df %>% filter(if_any(everything(), ~!is.na(.))) -> df
## if we now repeat the same command as above, we can see that there are no more fully NA observations
df %>% filter(if_all(everything(), ~ is.na(.)))
We can see that, while `How much sleep do you get (hours)?` is initially read as a continuous variable, we need to remove the observations that have values equal to 77 and 99, as those are placeholder codes rather than actual sleep durations. Also, the fact that the number 12 doesn’t actually represent 12 hours of sleep, but rather 12 or more hours of sleep, transforms this into an ordinal variable.
## this checks if `How much sleep do you get (hours)?` is 77 or 99; if it is
## either of those values, it replaces it with NA, otherwise it keeps the value
df %>% mutate(`How much sleep do you get (hours)?` = case_when(
  (`How much sleep do you get (hours)?` == 77) | (`How much sleep do you get (hours)?` == 99) ~ NA,
  .default = `How much sleep do you get (hours)?`)) -> df
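As a side note, the same replacement can be written more tersely with base R’s `replace()`; this is just an alternative sketch, not a change to what the chunk above already does:
## `%in%` flags the rows whose value is one of the placeholder codes and
## `replace()` swaps those values for NA
df %>% mutate(`How much sleep do you get (hours)?` =
                replace(`How much sleep do you get (hours)?`,
                        `How much sleep do you get (hours)?` %in% c(77, 99),
                        NA))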
We also need to convert variables that are coded as characters to numeric.
df %>% mutate(`Protein (gm)` = as.numeric(`Protein (gm)`),
              `Room Temperature (F)` = as.numeric(`Room Temperature (F)`),
              `Energy (kcal)` = as.numeric(`Energy (kcal)`),
              `Vitamin C (mg)` = as.numeric(`Vitamin C (mg)`),
              `How much sleep do you get (hours)?` = as.numeric(`How much sleep do you get (hours)?`),
              `Annual Family Income` = as.numeric(`Annual Family Income`),
              `Room Humidity (%)` = as.numeric(`Room Humidity (%)`),
              `Total dust weight (mg)` = as.numeric(`Total dust weight (mg)`)) -> df
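If every column that was read as character is in fact numeric, the same conversion can be written more compactly with `across()`. This is a sketch under that assumption; any genuinely textual column (an identifier, say) would be destroyed by it:
## `where(is.character)` selects all character columns and `as.numeric()` is
## applied to each of them; values that cannot be parsed become NA with a
## warning, so the warnings are worth checking if you use this shortcut
df %>% mutate(across(where(is.character), as.numeric))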
While correcting the assignments, I noticed that students were transforming a variable that wasn’t normally distributed to its logarithm in order to obtain a normally distributed variable. Over the course of this correction we will go through the assumptions of linear regression and try to understand whether that transformation is warranted.
Let’s consider the definition of the assumptions of linear regression, taken from *Correlation and Regression with R* from Boston University (a compact mathematical restatement follows the list). The assumptions for linear regression are:
- Linearity: The relationship between X and the mean of Y is linear.
- Homoscedasticity: The variance of residual is the same for any value of X.
- Independence: Observations are independent of each other.
- Normality: For any fixed value of X, Y is normally distributed.
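One compact way to state all four assumptions at once is to write the simple linear regression model itself:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

The mean of $Y$ is a linear function of $X$ (linearity), the error variance $\sigma^2$ does not depend on $X$ (homoscedasticity), and the errors are independent across observations (independence) and normally distributed (normality). Note that nothing here requires $X$ itself to be normally distributed.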
Let’s create a model without transforming the `Total dust weight (mg)` variable to its logarithm.
## we will store our model in the variable lm1 (for linear model 1)
lm1 <- lm(formula = `How much sleep do you get (hours)?` ~ `Total dust weight (mg)`,
          data = df)
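Before examining the assumptions, we could take a quick look at the fitted coefficients (output suppressed, as with the chunks above):
## `summary()` reports the estimated intercept and slope, their standard errors
## and the associated t-tests
summary(lm1)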
Now, let’s create some plots to examine the assumptions visually.
plot(lm1, which=1)
The plot of the residuals versus the fitted values shows that the data follow a linear pattern (demonstrated by the red line), but it shows some evidence of heteroscedasticity (the spread of the residuals is larger at higher fitted values). We can check whether this is significant by using the studentized Breusch-Pagan test.
The null hypothesis for this test is homoscedasticity, and the alternative hypothesis is heteroscedasticity.
library(lmtest)
bptest(lm1)
##
## studentized Breusch-Pagan test
##
## data: lm1
## BP = 2.6195, df = 1, p-value = 0.1056
As the p-value is higher than 0.05, we do not reject the null hypothesis of homoscedasticity.
You can then ask: “But I can clearly see heteroscedasticity in the ‘Residuals vs Fitted’ plot! How can that be?”
The reason is that the plot suffers from overplotting: the vast majority of the data points are drawn on top of each other, so when we look at the plot we overestimate the importance of the few outliers that catch our eye.
Let’s look at that same plot, but now created with ggplot2, which allows us to solve the issue of overplotting.
# we set the opacity of each point to 0.1 in this plot, so that we can clearly
# see where the majority of the data is and decrease the importance of outliers.
# we also add some jitter to the vertical position of each point, as the
# residuals are essentially discrete in this situation; this helps us better
# visualize the points. geom_smooth() adds the loess line that allows us
# to check the linearity assumption
ggplot(lm1, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.1,
             position = position_jitter(width = NULL, height = 0.25)) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1,
              method.args = list(degree = 1),
              colour = "red")
We can also bin the data to show this clustering even more clearly.
ggplot(lm1, aes(x = .fitted, y = .resid)) +
  stat_bin2d(bins = 40) +
  scale_fill_gradient(low = "white", high = "black", limits = c(0, 350))
With the plots above, we can see that almost all of the data is concentrated in the 6.975 to 7.025 range on the fitted axis, where homoscedasticity is present.
By now, we have looked at the assumptions of Linearity and Homoscedasticity.
Let’s continue with further plots.
plot(lm1, which=2)
The Q-Q Residuals plot shows that the Normality assumption is respected, with the standardized residuals closely following the theoretical quantiles.
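If we wanted a formal complement to the Q-Q plot, a Shapiro-Wilk test on the residuals is one option. This is only a sketch: `shapiro.test()` accepts at most 5000 values, so we sample the residuals if the model has more observations than that, and with samples this large the test flags even tiny deviations from normality, so the Q-Q plot remains the more useful check.
## the null hypothesis of the Shapiro-Wilk test is normality, so a small
## p-value is evidence against normally distributed residuals
res <- residuals(lm1)
if (length(res) > 5000) res <- sample(res, 5000)
shapiro.test(res)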
The assumption of Independence is inherent to the study design, and we can say it was maintained here.
Having seen all of the above, I wouldn’t recommend transforming the dust weight variable into its logarithm in order to include it in the linear model, as it isn’t necessary and can hinder the interpretation of the findings.
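For completeness, this is roughly what the transformed model the students fitted would look like; it is a sketch for comparison only (and it assumes all dust weights are strictly positive, since `log()` of zero is `-Inf` and would make `lm()` fail):
## the slope of this model is the expected change in hours of sleep per
## one-unit increase in log(dust weight), i.e. per multiplicative change in
## dust weight, which is considerably harder to communicate than a change in mg
lm2 <- lm(formula = `How much sleep do you get (hours)?` ~ log(`Total dust weight (mg)`),
          data = df)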