Assignment

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

Data Overview

There are a number of interesting data-sets posted by Dr. John Burkardt of Florida State University. Those specifically for testing regression may be found here.

This discussion will look at the relationship between gasoline octane values and their composition with respect to three ingredients. Specifically:

In the investigation of the manufacturing process in a refinery, the octane rating of a particular petrol was measured as a function of the 3 raw materials, and a variable that characterized the manufacturing conditions. There are 82 rows of data. The data include:

  • \(I\), the index
  • \(A_0\), 1
  • \(A_1\), amount of material 1
  • \(A_2\), amount of material 2
  • \(A_3\), amount of material 3
  • \(A_4\), manufacturing condition rating
  • \(B\), the octane rating

We seek a model of the form: \[ B = A_0X_0 + A_1X_1 + A_2X_2 + A_3X_3 + A_4X_4. \]

ETL

library(data.table)
setDTthreads(0L)
Octane <- fread("https://people.sc.fsu.edu/~jburkardt/datasets/regression/x18.txt",
                skip = 43L)
setnames(Octane, c("Index", "Intercept", "Mat_1", "Mat_2", "Mat_3", "MCR",
                   "Octane"))

Investigation

Below are the first few rows and summary statistics for the octane data:

head(Octane, n = 10L)
summary(Octane)
##      Index         Intercept     Mat_1           Mat_2       
##  Min.   : 1.00   Min.   :1   Min.   : 4.23   Min.   : 0.000  
##  1st Qu.:21.25   1st Qu.:1   1st Qu.:55.38   1st Qu.: 0.105  
##  Median :41.50   Median :1   Median :62.70   Median : 1.280  
##  Mean   :41.50   Mean   :1   Mean   :60.17   Mean   : 1.664  
##  3rd Qu.:61.75   3rd Qu.:1   3rd Qu.:67.78   3rd Qu.: 2.277  
##  Max.   :82.00   Max.   :1   Max.   :75.54   Max.   :10.760  
##      Mat_3            MCR            Octane     
##  Min.   :40.00   Min.   :1.200   Min.   :89.66  
##  1st Qu.:54.00   1st Qu.:1.518   1st Qu.:90.85  
##  Median :56.00   Median :1.604   Median :91.73  
##  Mean   :55.46   Mean   :1.627   Mean   :91.85  
##  3rd Qu.:59.75   3rd Qu.:1.723   3rd Qu.:92.47  
##  Max.   :64.00   Max.   :2.319   Max.   :97.61

One thing we notice right away is that Intercept, aka \(A_0\) is consistently 1. Either leaving it out of the model and fitting with an intercept or leaving it in the model and fitting without an intercept will return the exact same result. Most of the time we just call that \(A_0\) and the 1 is assumed.

Below are plots of the interesting variables. Index is just that and the intercept is a constant 1.

plot(Octane[, -(1:2)])

Another thing we see is that there seems to be a distinct negative correlation between MCR and Mat_3. It may be prudent to investigate interaction terms for that pair.

Linear Model

We will fit three linear models. The first two will be to demonstrate the equivalence of using Intercept without an intercept or an intercept without Intercept.1

The third will be including an interaction term between MCR and Mat_3.

Fit1 <- lm(Octane ~ Intercept + Mat_1 + Mat_2 + Mat_3 + MCR - 1, data = Octane)
Fit2 <- lm(Octane ~ Mat_1 + Mat_2 + Mat_3 + MCR, data = Octane)
Fit3 <- lm(Octane ~ Mat_1 + Mat_2 + (Mat_3 * MCR), data = Octane)

summary(Fit1)
## 
## Call:
## lm(formula = Octane ~ Intercept + Mat_1 + Mat_2 + Mat_3 + MCR - 
##     1, data = Octane)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00612 -0.28588 -0.04679  0.32159  0.98069 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## Intercept 95.853150   1.224877  78.255  < 2e-16 ***
## Mat_1     -0.092821   0.005235 -17.729  < 2e-16 ***
## Mat_2     -0.126798   0.032157  -3.943 0.000176 ***
## Mat_3     -0.025381   0.013971  -1.817 0.073160 .  
## MCR        1.967603   0.324573   6.062 4.65e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4415 on 77 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.101e+05 on 5 and 77 DF,  p-value: < 2.2e-16
summary(Fit2)
## 
## Call:
## lm(formula = Octane ~ Mat_1 + Mat_2 + Mat_3 + MCR, data = Octane)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00612 -0.28588 -0.04679  0.32159  0.98069 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 95.853150   1.224877  78.255  < 2e-16 ***
## Mat_1       -0.092821   0.005235 -17.729  < 2e-16 ***
## Mat_2       -0.126798   0.032157  -3.943 0.000176 ***
## Mat_3       -0.025381   0.013971  -1.817 0.073160 .  
## MCR          1.967603   0.324573   6.062 4.65e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4415 on 77 degrees of freedom
## Multiple R-squared:  0.9056, Adjusted R-squared:  0.9007 
## F-statistic: 184.7 on 4 and 77 DF,  p-value: < 2.2e-16
summary(Fit3)
## 
## Call:
## lm(formula = Octane ~ Mat_1 + Mat_2 + (Mat_3 * MCR), data = Octane)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01197 -0.30994 -0.07207  0.32512  1.02360 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 102.468289   3.435297  29.828  < 2e-16 ***
## Mat_1        -0.096512   0.005435 -17.759  < 2e-16 ***
## Mat_2        -0.122210   0.031584  -3.869 0.000229 ***
## Mat_3        -0.144671   0.059638  -2.426 0.017646 *  
## MCR          -1.709173   1.817135  -0.941 0.349897    
## Mat_3:MCR     0.069329   0.033735   2.055 0.043305 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4325 on 76 degrees of freedom
## Multiple R-squared:  0.9106, Adjusted R-squared:  0.9047 
## F-statistic: 154.8 on 5 and 76 DF,  p-value: < 2.2e-16

It’s clear that the parameter values of Fits 1 and 2 are the same; however, their statistics are not. That is because Fit1 has one extra variable.

From Fit3 we see that the significance of MCR in the first fits was not due to itself per se, but as an offset to Mat_3.

Residual Analysis

For the remaining analyses, we will ignore Fit1 as it is a suboptimal version of Fit2.

Check for Heteroskedacticity

par(mfrow = c(1, 2))
plot(Fit2$fitted.values, Fit2$residuals)
plot(Fit3$fitted.values, Fit3$residuals)

For Fit1, the residuals are basically evenly distributed around 0. You can also tell from their mean: -1.476630510^{-17}, which is effectively 0. It’s hard to say from the plot if there is any heteroskedacticity, as there are too few data points above 92.5 or so to matter.

The same description applies to Fit3, whose mean residuals are also effectively 0 at 1.068798610^{-17}.

QQ plots

par(mfrow = c(1, 2))
qqnorm(Fit2$residuals)
qqline(Fit2$residuals)
qqnorm(Fit3$residuals)
qqline(Fit3$residuals)

There is some scant visual evidence that the tails are not exactly normal, with Fit3 perhaps less so than Fit2, but that is as much due to the paucity of data.

Conclusion

Restricting ourselves to the linear model and the lm command in R, both fits are valid. The Fit3 model with the interaction has a slightly higher adjusted R-squared, which takes parsimony into account, so that would be my preferred model. Next steps would be to use a generalized linear model to compare AICc scores, at the least.2


  1. Wordplay most definitely intended.↩︎

  2. For what it is worth, using both AIC and the preferable AICc, Fit3 is lower than Fit2 and would be preferable.↩︎