library(tidyverse)
library(lmtest)

Assignment Description

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Import, Clean Data

The data I will be using come from R’s CO2 dataset, which records the carbon dioxide (\(CO_2\)) “uptake of six plants from Quebec and six plants from Mississippi measured at several levels of ambient \(CO_2\) concentration. Half the plants of each type were chilled overnight before the experiment was conducted.” A summary of the data is provided below:

data(CO2)
summary(CO2)
##      Plant             Type         Treatment       conc          uptake     
##  Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70  
##  Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90  
##  Qn3    : 7                                    Median : 350   Median :28.30  
##  Qc1    : 7                                    Mean   : 435   Mean   :27.21  
##  Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12  
##  Qc2    : 7                                    Max.   :1000   Max.   :45.50  
##  (Other):42

To limit our regression model’s independent variables to two, the cell below filters the data to only include rows where type=="Quebec":

CO2 <- CO2 %>% 
  filter(Type == 'Quebec')

Build, Summarize Model

We can now build and summarize a linear regression model using uptake as the dependent variable, and treatment and conc as the independent variables:

lin_reg <- lm(uptake ~ conc + Treatment, data=CO2)
summary(lin_reg)
## 
## Call:
## lm(formula = uptake ~ conc + Treatment, data = CO2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6052  -3.2628   0.7975   4.6594  10.6174 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      25.29351    2.12484  11.904 1.47e-14 ***
## conc              0.02308    0.00353   6.538 9.29e-08 ***
## Treatmentchilled -3.58095    2.07689  -1.724   0.0926 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.73 on 39 degrees of freedom
## Multiple R-squared:  0.5396, Adjusted R-squared:  0.516 
## F-statistic: 22.86 on 2 and 39 DF,  p-value: 2.694e-07

Based on the summary, and given that \(\hat{y}\) = uptake, \(\beta_1\) = conc and \(\beta_2\) = Treatment:

In addition, both Treatment, and conc are listed as being statistically significant predictors of uptake. While this is promising, we cannot use this model until we confirm that it passes all the assumptions of linear regression.

Check Assumptions

The assumptions of linear regression are listed below:

  1. Linear relationship: There exists a linear relationship between the independent variable, \(x\), and the dependent variable, \(y\).
  2. Normality: The residuals of the model are normally distributed.
  3. Homoscedasticity: The residuals have constant variance at every level of \(x\).
  4. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.

Linear Relationship

The below code plots the data against the model’s fitted lines for both chilled and non-chilled plants:

ggplot(data = CO2, aes(x = conc, y = uptake, color = Treatment)) +
  geom_jitter() + 
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

The plot shows that there does indeed appear to be a linear relationship between conc and uptake for both chilled and non-chilled plants. Namely, as \(CO2\) concentration increases, so does \(CO_2\) uptake.

Normality

The code below plots a histogram of the residual lengths.

ggplot(lin_reg, aes(x = .resid)) +
  geom_histogram(binwidth = 3) 

The residuals do appear to be somewhat normally distributed. We can confirm this statistically via a Shapiro-Wilk hypothesis test on the lengths of the residuals:

shapiro.test(lin_reg$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lin_reg$residuals
## W = 0.95857, p-value = 0.1313

Homoscedasticity

The code creates a plot of the residuals as a function of the fitted values:

ggplot(lin_reg, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

This plot shows that the spread (variance) of the residual lengths is do not appear equal for all fitted values of uptake, in other words the data does not appear to be homoscedastic. We can confirm this via a Breusch-Pagan test:

bptest(lin_reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  lin_reg
## BP = 8.0554, df = 2, p-value = 0.01781

Indeed, the result of the test confirms that the data is not homoscedastic.

Independence

Based off the above residual plot above, we can however confirm that there is not a strong correlation between the residuals and fitted values. This is further evidenced by the fact that the correlation coefficient between the two is very close to 0:

cor(lin_reg$residuals, lin_reg$fitted.values)
## [1] 2.671187e-16

Conclusion

Based off the analysis above we have confirmed that results of the lin_reg looked promising, residual analysis revealed that it does not satisfy the homoscedasticity condition. As such, it is not a valid model to be used in practice. Based off the appearance of the scatter plot data, a logarithmic function might be better suited as a model in this case.