Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
There are a number of interesting data-sets posted by Dr. John Burkardt of Florida State University. Those specifically for testing regression may be found here.
This discussion will look at the relationship between gasoline octane values and their composition with respect to three ingredients. Specifically:
In the investigation of the manufacturing process in a refinery, the octane rating of a particular petrol was measured as a function of the 3 raw materials, and a variable that characterized the manufacturing conditions. There are 82 rows of data. The data include:
- \(I\), the index
- \(A_0\), 1
- \(A_1\), amount of material 1
- \(A_2\), amount of material 2
- \(A_3\), amount of material 3
- \(A_4\), manufacturing condition rating
- \(B\), the octane rating
We seek a model of the form: \[ B = A_0X_0 + A_1X_1 + A_2X_2 + A_3X_3 + A_4X_4. \]
library(data.table)
setDTthreads(0L)
Octane <- fread("https://people.sc.fsu.edu/~jburkardt/datasets/regression/x18.txt",
skip = 43L)
setnames(Octane, c("Index", "Intercept", "Mat_1", "Mat_2", "Mat_3", "MCR",
"Octane"))
Below are the first few rows and summary statistics for the octane data:
head(Octane, n = 10L)
summary(Octane)
## Index Intercept Mat_1 Mat_2
## Min. : 1.00 Min. :1 Min. : 4.23 Min. : 0.000
## 1st Qu.:21.25 1st Qu.:1 1st Qu.:55.38 1st Qu.: 0.105
## Median :41.50 Median :1 Median :62.70 Median : 1.280
## Mean :41.50 Mean :1 Mean :60.17 Mean : 1.664
## 3rd Qu.:61.75 3rd Qu.:1 3rd Qu.:67.78 3rd Qu.: 2.277
## Max. :82.00 Max. :1 Max. :75.54 Max. :10.760
## Mat_3 MCR Octane
## Min. :40.00 Min. :1.200 Min. :89.66
## 1st Qu.:54.00 1st Qu.:1.518 1st Qu.:90.85
## Median :56.00 Median :1.604 Median :91.73
## Mean :55.46 Mean :1.627 Mean :91.85
## 3rd Qu.:59.75 3rd Qu.:1.723 3rd Qu.:92.47
## Max. :64.00 Max. :2.319 Max. :97.61
One thing we notice right away is that Intercept, aka \(A_0\) is consistently 1. Either leaving it out of the model and fitting with an intercept or leaving it in the model and fitting without an intercept will return the exact same result. Most of the time we just call that \(A_0\) and the 1 is assumed.
Below are plots of the interesting variables. Index is just that and the intercept is a constant 1.
plot(Octane[, -(1:2)])
Another thing we see is that there seems to be a distinct negative correlation between MCR and Mat_3. It may be prudent to investigate interaction terms for that pair.
We will fit three linear models. The first two will be to demonstrate the equivalence of using Intercept without an intercept or an intercept without Intercept.1
The third will be including an interaction term between MCR and Mat_3.
Fit1 <- lm(Octane ~ Intercept + Mat_1 + Mat_2 + Mat_3 + MCR - 1, data = Octane)
Fit2 <- lm(Octane ~ Mat_1 + Mat_2 + Mat_3 + MCR, data = Octane)
Fit3 <- lm(Octane ~ Mat_1 + Mat_2 + (Mat_3 * MCR), data = Octane)
summary(Fit1)
##
## Call:
## lm(formula = Octane ~ Intercept + Mat_1 + Mat_2 + Mat_3 + MCR -
## 1, data = Octane)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00612 -0.28588 -0.04679 0.32159 0.98069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Intercept 95.853150 1.224877 78.255 < 2e-16 ***
## Mat_1 -0.092821 0.005235 -17.729 < 2e-16 ***
## Mat_2 -0.126798 0.032157 -3.943 0.000176 ***
## Mat_3 -0.025381 0.013971 -1.817 0.073160 .
## MCR 1.967603 0.324573 6.062 4.65e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4415 on 77 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.101e+05 on 5 and 77 DF, p-value: < 2.2e-16
summary(Fit2)
##
## Call:
## lm(formula = Octane ~ Mat_1 + Mat_2 + Mat_3 + MCR, data = Octane)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00612 -0.28588 -0.04679 0.32159 0.98069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.853150 1.224877 78.255 < 2e-16 ***
## Mat_1 -0.092821 0.005235 -17.729 < 2e-16 ***
## Mat_2 -0.126798 0.032157 -3.943 0.000176 ***
## Mat_3 -0.025381 0.013971 -1.817 0.073160 .
## MCR 1.967603 0.324573 6.062 4.65e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4415 on 77 degrees of freedom
## Multiple R-squared: 0.9056, Adjusted R-squared: 0.9007
## F-statistic: 184.7 on 4 and 77 DF, p-value: < 2.2e-16
summary(Fit3)
##
## Call:
## lm(formula = Octane ~ Mat_1 + Mat_2 + (Mat_3 * MCR), data = Octane)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01197 -0.30994 -0.07207 0.32512 1.02360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.468289 3.435297 29.828 < 2e-16 ***
## Mat_1 -0.096512 0.005435 -17.759 < 2e-16 ***
## Mat_2 -0.122210 0.031584 -3.869 0.000229 ***
## Mat_3 -0.144671 0.059638 -2.426 0.017646 *
## MCR -1.709173 1.817135 -0.941 0.349897
## Mat_3:MCR 0.069329 0.033735 2.055 0.043305 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4325 on 76 degrees of freedom
## Multiple R-squared: 0.9106, Adjusted R-squared: 0.9047
## F-statistic: 154.8 on 5 and 76 DF, p-value: < 2.2e-16
It’s clear that the parameter values of Fits 1 and 2 are the same; however, their statistics are not. That is because Fit1 has one extra variable.
From Fit3 we see that the significance of MCR in the first fits was not due to itself per se, but as an offset to Mat_3.
For the remaining analyses, we will ignore Fit1 as it is a suboptimal version of Fit2.
par(mfrow = c(1, 2))
plot(Fit2$fitted.values, Fit2$residuals)
plot(Fit3$fitted.values, Fit3$residuals)
For Fit1, the residuals are basically evenly distributed around 0. You can also tell from their mean: -1.476630510^{-17}, which is effectively 0. It’s hard to say from the plot if there is any heteroskedacticity, as there are too few data points above 92.5 or so to matter.
The same description applies to Fit3, whose mean residuals are also effectively 0 at 1.068798610^{-17}.
par(mfrow = c(1, 2))
qqnorm(Fit2$residuals)
qqline(Fit2$residuals)
qqnorm(Fit3$residuals)
qqline(Fit3$residuals)
There is some scant visual evidence that the tails are not exactly normal, with Fit3 perhaps less so than Fit2, but that is as much due to the paucity of data.
Restricting ourselves to the linear model and the lm command in R, both fits are valid. The Fit3 model with the interaction has a slightly higher adjusted R-squared, which takes parsimony into account, so that would be my preferred model. Next steps would be to use a generalized linear model to compare AICc scores, at the least.2