library(tidyverse)
library(lmtest)
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The data I will be using come from R’s CO2 dataset,
which records the carbon dioxide (\(CO_2\)) “uptake of six plants from Quebec
and six plants from Mississippi measured at several levels of ambient
\(CO_2\) concentration. Half the plants
of each type were chilled overnight before the experiment was
conducted.” A summary of the data is provided below:
data(CO2)
summary(CO2)
## Plant Type Treatment conc uptake
## Qn1 : 7 Quebec :42 nonchilled:42 Min. : 95 Min. : 7.70
## Qn2 : 7 Mississippi:42 chilled :42 1st Qu.: 175 1st Qu.:17.90
## Qn3 : 7 Median : 350 Median :28.30
## Qc1 : 7 Mean : 435 Mean :27.21
## Qc3 : 7 3rd Qu.: 675 3rd Qu.:37.12
## Qc2 : 7 Max. :1000 Max. :45.50
## (Other):42
To limit our regression model’s independent variables to two, the
cell below filters the data to only include rows where
type=="Quebec":
CO2 <- CO2 %>%
filter(Type == 'Quebec')
We can now build and summarize a linear regression model using
uptake as the dependent variable, and
treatment and conc as the independent
variables:
lin_reg <- lm(uptake ~ conc + Treatment, data=CO2)
summary(lin_reg)
##
## Call:
## lm(formula = uptake ~ conc + Treatment, data = CO2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6052 -3.2628 0.7975 4.6594 10.6174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.29351 2.12484 11.904 1.47e-14 ***
## conc 0.02308 0.00353 6.538 9.29e-08 ***
## Treatmentchilled -3.58095 2.07689 -1.724 0.0926 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.73 on 39 degrees of freedom
## Multiple R-squared: 0.5396, Adjusted R-squared: 0.516
## F-statistic: 22.86 on 2 and 39 DF, p-value: 2.694e-07
Based on the summary, and given that \(\hat{y}\) = uptake, \(\beta_1\) = conc and \(\beta_2\) = Treatment:
In addition, both Treatment, and conc are
listed as being statistically significant predictors of
uptake. While this is promising, we cannot use this model
until we confirm that it passes all the assumptions of linear
regression.
The assumptions of linear regression are listed below:
The below code plots the data against the model’s fitted lines for both chilled and non-chilled plants:
ggplot(data = CO2, aes(x = conc, y = uptake, color = Treatment)) +
geom_jitter() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
The plot shows that there does indeed appear to be a linear
relationship between conc and uptake for both
chilled and non-chilled plants. Namely, as \(CO2\) concentration increases, so does
\(CO_2\) uptake.
The code below plots a histogram of the residual lengths.
ggplot(lin_reg, aes(x = .resid)) +
geom_histogram(binwidth = 3)
The residuals do appear to be somewhat normally distributed. We can confirm this statistically via a Shapiro-Wilk hypothesis test on the lengths of the residuals:
shapiro.test(lin_reg$residuals)
##
## Shapiro-Wilk normality test
##
## data: lin_reg$residuals
## W = 0.95857, p-value = 0.1313
The code creates a plot of the residuals as a function of the fitted values:
ggplot(lin_reg, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
This plot shows that the spread (variance) of the residual lengths is
do not appear equal for all fitted values of uptake, in
other words the data does not appear to be homoscedastic. We can confirm
this via a Breusch-Pagan test:
bptest(lin_reg)
##
## studentized Breusch-Pagan test
##
## data: lin_reg
## BP = 8.0554, df = 2, p-value = 0.01781
Indeed, the result of the test confirms that the data is not homoscedastic.
Based off the above residual plot above, we can however confirm that there is not a strong correlation between the residuals and fitted values. This is further evidenced by the fact that the correlation coefficient between the two is very close to 0:
cor(lin_reg$residuals, lin_reg$fitted.values)
## [1] 2.671187e-16
Based off the analysis above we have confirmed that results of the
lin_reg looked promising, residual analysis revealed that
it does not satisfy the homoscedasticity condition. As such, it is not a
valid model to be used in practice. Based off the appearance of the
scatter plot data, a logarithmic function might be better suited as a
model in this case.