This regression model will use a dataset of computer hardware features from nearly 30 years ago (note the memory is measured in KB). This data is a few years older than the dataset used in the text.
library(tidyverse)
library(corrplot)
# The UCI file ships without a header row, so supply the column names.
hardware <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data',
                     stringsAsFactors = FALSE, header = FALSE)
colnames(hardware) <- c('vendor', 'Model', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP')
# Dummy variable flagging IBM machines, to test as a predictor later.
hardware <- mutate(hardware, is_ibm = ifelse(vendor == 'ibm', 1, 0))
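Before modeling, a quick sanity check on the import can't hurt; a minimal sketch using glimpse() from the tidyverse loaded above:
# Confirm that all 209 rows came through and that the column names
# we just assigned landed on the right columns.
glimpse(hardware)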
We want to see which predictors are strong, where we might see multicollinearity, and whether the relationship with the response variable is linear.
# Correlations among the numeric features; ERP is an earlier model's
# estimate of PRP, so it is excluded along with the identifiers.
hardware_cor <- cor(select(hardware, -vendor, -Model, -ERP, -is_ibm))
hardware_matrix <- as.matrix(select(hardware, -vendor, -Model, -ERP, -is_ibm))
corrplot(hardware_cor, type = 'upper')
pairs(hardware_matrix)
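To make "strong predictors" concrete, here is a small sketch that ranks the features by their absolute correlation with the response (PRP itself naturally appears first, with a correlation of 1):
# Sort the PRP column of the correlation matrix computed above.
sort(abs(hardware_cor[, 'PRP']), decreasing = TRUE)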
We do see some nonlinear relationships, so some transformations will be necessary. We'll start with one.
ggplot(hardware) + geom_point(aes(x = MMIN, y = PRP))
hardware <- mutate(hardware, mmin_2 = MMIN^2)
ggplot(hardware) + geom_point(aes(x = mmin_2, y = PRP))
MMIN has so few distinct levels that it behaves like an ordinal variable. In cases like this, we want to be able to connect the level medians with a roughly straight line, and the transformation may have accomplished that, although it is difficult to tell.
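One way to check the connect-the-medians idea, sketched with the tools already loaded: compute the median PRP at each MMIN level and plot it on the squared scale.
# Median response per MMIN level; if the squared transformation
# helped, these medians should fall near a straight line.
hardware %>%
  group_by(MMIN) %>%
  summarise(median_prp = median(PRP)) %>%
  ggplot(aes(x = MMIN^2, y = median_prp)) +
  geom_point() +
  geom_line()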
hardware <- select(hardware, -vendor, -Model, -ERP)
# Full model, with an is_ibm:CACH interaction.
model <- lm(PRP ~ . + is_ibm * CACH, data = hardware)
summary(model)
##
## Call:
## lm(formula = PRP ~ . + is_ibm * CACH, data = hardware)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -199.15  -24.56    5.39   21.59  392.84
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.574e+01  8.600e+00  -4.156 4.82e-05 ***
## MYCT         2.054e-02  1.941e-02   1.058    0.291
## MMIN         1.837e-03  3.125e-03   0.588    0.557
## MMAX         6.319e-03  6.247e-04  10.116  < 2e-16 ***
## CACH         7.489e-01  1.341e-01   5.586 7.58e-08 ***
## CHMIN       -6.157e-01  8.123e-01  -0.758    0.449
## CHMAX        1.415e+00  2.098e-01   6.745 1.62e-10 ***
## is_ibm       1.322e+01  1.452e+01   0.911    0.364
## mmin_2       5.576e-07  1.091e-07   5.114 7.41e-07 ***
## CACH:is_ibm  3.507e-01  5.809e-01   0.604    0.547
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.74 on 199 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8755
## F-statistic: 163.6 on 9 and 199 DF,  p-value: < 2.2e-16
plot(model)
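Alongside the built-in diagnostic plots, a quick numeric check on influence can help; this is a minimal sketch using Cook's distance with the conventional 4/n cutoff (a rule of thumb, not a hard threshold):
# Flag observations whose Cook's distance exceeds 4/n; these are the
# points worth inspecting in the leverage plot above.
cooks <- cooks.distance(model)
which(cooks > 4 / nrow(hardware))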
The pattern in the residuals shows that we will need to remove the nonlinear trends. We also see some influential points and some skew in the QQ plot. This data looks like a good candidate for a log transform of the response. We'll also remove the terms that were not significant: CHMIN, the is_ibm indicator, and its interaction with CACH. Since the categorical probably isn't collinear with anything else, dropping it should not distort the remaining coefficients.
hardware2 <- hardware %>%
  mutate(
    log_score  = log(PRP),
    log_cycles = log(MYCT)
  )
hardware2 <- select(hardware2, -MYCT, -CHMIN, -PRP, -is_ibm)
pairs(hardware2)
The relationships appear more linear, so this model should improve.
library(car)
model2 <- lm(log_score ~ ., data = hardware2)
summary(model2)
##
## Call:
## lm(formula = log_score ~ ., data = hardware2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.39829 -0.26287  0.02076  0.26681  1.19893
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.158e+00  2.392e-01  17.386  < 2e-16 ***
## MMIN         9.098e-05  2.553e-05   3.564 0.000456 ***
## MMAX         3.635e-05  4.955e-06   7.336 5.22e-12 ***
## CACH         7.184e-03  1.019e-03   7.048 2.81e-11 ***
## CHMAX        2.302e-03  1.560e-03   1.476 0.141580
## mmin_2      -2.946e-09  8.860e-10  -3.326 0.001048 **
## log_cycles  -2.033e-01  4.207e-02  -4.833 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4472 on 202 degrees of freedom
## Multiple R-squared:  0.8233, Adjusted R-squared:  0.8181
## F-statistic: 156.9 on 6 and 202 DF,  p-value: < 2.2e-16
plot(model2)
vif(model2)
##      MMIN      MMAX      CACH     CHMAX    mmin_2 log_cycles
## 10.199655  3.512112  1.784168  1.710143  5.832358   1.983327
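The VIF above 10 for MMIN is structural rather than a data problem: mmin_2 is computed directly from MMIN, so the two are necessarily correlated. A common remedy is to center before squaring; here is a sketch under that assumption (hardware3, mmin_c2, and model3 are names introduced for illustration):
# Centering MMIN before squaring leaves the fitted values unchanged
# (same column space) but typically deflates the MMIN/mmin_2 VIFs.
hardware3 <- hardware2 %>%
  mutate(mmin_c2 = (MMIN - mean(MMIN))^2) %>%
  select(-mmin_2)
model3 <- lm(log_score ~ ., data = hardware3)
vif(model3)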
ncvTest(model2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.7840838 Df = 1 p = 0.3758951
Most of the diagnostics improved. The only problems remaining are slight nonlinearity and a few influential points. After starting with an unfriendly dataset, we now have an acceptable linear model. There is room for improvement, but I would not say that the model is misspecified.
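For those remaining issues, the already-loaded car package offers targeted views; a sketch of two follow-up diagnostics:
# Component+residual plots localize leftover nonlinearity to specific
# predictors; the influence plot combines leverage, studentized
# residuals, and Cook's distance (circle area) in one view.
crPlots(model2)
influencePlot(model2)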