This regression model will use a dataset consisting of computer hardware features from nearly 30 years ago (look at the memory in kb). This data is a few years older than the one in the text.

Data Dictionary

  1. vendor name: 30 (adviser, amdahl,apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang)
  2. Model Name: many unique symbols
  3. MYCT: machine cycle time in nanoseconds (integer)
  4. MMIN: minimum main memory in kilobytes (integer)
  5. MMAX: maximum main memory in kilobytes (integer)
  6. CACH: cache memory in kilobytes (integer)
  7. CHMIN: minimum channels in units (integer)
  8. CHMAX: maximum channels in units (integer)
  9. PRP: published relative performance (integer)
  10. ERP: estimated relative performance from the original article (integer)

Prepare data

library(tidyverse)
library(corrplot)

hardware <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data',
                      stringsAsFactors = F, header = F)
colnames(hardware) <- c('vendor', 'Model', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP')
hardware <- mutate(hardware, is_ibm = ifelse(vendor == 'ibm', 1, 0))

Data exploration

We want to see what the strong predictors are, where we might see multicolinearity, and if the relationship with the response variable is linear.

hardware_cor <- cor(select(hardware, -vendor, -Model, -ERP, -is_ibm))
hardware_matrix <- as.matrix(select(hardware,-vendor, -Model, -ERP, -is_ibm))
corrplot(hardware_cor, type = 'upper')

pairs(hardware_matrix)

We do see some nonlinear relationships, so some transformations will be necesary. We’ll start with one.

ggplot(hardware) + geom_point(aes(x = MMIN, y = PRP))

hardware = mutate(hardware, mmin_2 = MMIN**2)
ggplot(hardware) + geom_point(aes(x = mmin_2, y = PRP))

There are so few different levels that the variable behaves as an ordinal variable. In cases like this, you want to be able to connect the medians with a straight line, and we might have accomplished that with the transformation, although it is difficult to tell.

First model

hardware <- select(hardware, -vendor, -Model, - ERP)

model <- lm (PRP ~ . + is_ibm*CACH , data = hardware)
summary(model)
## 
## Call:
## lm(formula = PRP ~ . + is_ibm * CACH, data = hardware)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -199.15  -24.56    5.39   21.59  392.84 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.574e+01  8.600e+00  -4.156 4.82e-05 ***
## MYCT         2.054e-02  1.941e-02   1.058    0.291    
## MMIN         1.837e-03  3.125e-03   0.588    0.557    
## MMAX         6.319e-03  6.247e-04  10.116  < 2e-16 ***
## CACH         7.489e-01  1.341e-01   5.586 7.58e-08 ***
## CHMIN       -6.157e-01  8.123e-01  -0.758    0.449    
## CHMAX        1.415e+00  2.098e-01   6.745 1.62e-10 ***
## is_ibm       1.322e+01  1.452e+01   0.911    0.364    
## mmin_2       5.576e-07  1.091e-07   5.114 7.41e-07 ***
## CACH:is_ibm  3.507e-01  5.809e-01   0.604    0.547    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.74 on 199 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8755 
## F-statistic: 163.6 on 9 and 199 DF,  p-value: < 2.2e-16
plot(model)

The pattern in the residuals show that we will need to remove the nonlinear trends. We also see some influential points and some skew on the QQ plot. This data looks like a good candidate for a log transform of the response. We’ll also remove the terms that are not signficant. The categorical probably isn’t colinear with anything, so we won’t remove it, the interaction term, and CHMIN.

Second model

hardware2 <- hardware %>% 
  mutate(
    log_score = log(PRP),
    log_cycles = log(MYCT),
  )

hardware2 <- select(hardware2, -MYCT, -CHMIN, -PRP, - is_ibm)
pairs(hardware2)

The relationships appear more linear. This model should improve.

library(car)



model2 <- lm (log_score ~ . , data = hardware2)
summary(model2)
## 
## Call:
## lm(formula = log_score ~ ., data = hardware2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39829 -0.26287  0.02076  0.26681  1.19893 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.158e+00  2.392e-01  17.386  < 2e-16 ***
## MMIN         9.098e-05  2.553e-05   3.564 0.000456 ***
## MMAX         3.635e-05  4.955e-06   7.336 5.22e-12 ***
## CACH         7.184e-03  1.019e-03   7.048 2.81e-11 ***
## CHMAX        2.302e-03  1.560e-03   1.476 0.141580    
## mmin_2      -2.946e-09  8.860e-10  -3.326 0.001048 ** 
## log_cycles  -2.033e-01  4.207e-02  -4.833 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4472 on 202 degrees of freedom
## Multiple R-squared:  0.8233, Adjusted R-squared:  0.8181 
## F-statistic: 156.9 on 6 and 202 DF,  p-value: < 2.2e-16
plot(model2)

vif(model2)
##       MMIN       MMAX       CACH      CHMAX     mmin_2 log_cycles 
##  10.199655   3.512112   1.784168   1.710143   5.832358   1.983327
ncvTest (model2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.7840838    Df = 1     p = 0.3758951

Most of the diagnostics improved. The only problems remaining are slight nonlinearity and influential points. After starting with an unfriendly dataset, the linear model is now acceptable. There is room for improvement, but I would not say that the model is misspecified.