Visualising linear models in h-dim: added variable plots for the bodyfat data

Data originally published by https://ww2.amstat.org/publications/jse/v4n1/datasets.johnson.html

AV plots show the adjusted association between the response and perdictors which can be quite different from what we observe in scatterplot matrices (marginal relationship). AV plots can be useful in detecting outliers and cases where the adjusted association between the response and perdictors is due to an influence point.

Bfat is response - we are trying to predict it from all the other measurements. Looking at the scatterplot matrix, most of the variables are positively correlated, but height is slightly negatively correlated with age and bodyfat. Two outliers with very wide ankles.

pairs(bodyfat)

Fit the full model

fit1 <- lm(bfat~., data = bodyfat) # all predictors
summary(fit1)
## 
## Call:
## lm(formula = bfat ~ ., data = bodyfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1016 -2.9148 -0.7392  3.3351  8.4197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.63842   23.98381  -0.235  0.81474    
## age          0.04397    0.03812   1.154  0.25212    
## weight      -0.07294    0.06756  -1.080  0.28356    
## height      -0.69440    0.26036  -2.667  0.00926 ** 
## neck        -0.56477    0.33051  -1.709  0.09137 .  
## abdomen      0.86684    0.12923   6.708 2.57e-09 ***
## knee         0.47829    0.36361   1.315  0.19214    
## ankle        0.31149    0.25267   1.233  0.22126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.188 on 80 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.7931 
## F-statistic: 48.64 on 7 and 80 DF,  p-value: < 2.2e-16

Fit the model with no weight and ankle.

fit2 <- lm(bfat~., data = bodyfat[,-c(3,8)]) # no weight and ankle
summary(fit2)
## 
## Call:
## lm(formula = bfat ~ ., data = bodyfat[, -c(3, 8)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7629 -3.2343 -0.8899  3.4161  8.2253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.88456   13.77162   1.081 0.282948    
## age          0.06437    0.03213   2.003 0.048444 *  
## height      -0.82698    0.20481  -4.038 0.000121 ***
## neck        -0.71606    0.28407  -2.521 0.013650 *  
## abdomen      0.75235    0.07481  10.057 5.77e-16 ***
## knee         0.43745    0.34144   1.281 0.203740    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.191 on 82 degrees of freedom
## Multiple R-squared:  0.8047, Adjusted R-squared:  0.7928 
## F-statistic: 67.57 on 5 and 82 DF,  p-value: < 2.2e-16

Neck is significant and has a negative coefficient! On its own it is slightly positively correlated with bfat. Let’s have a look at AV plots

###########################################################################
avPlots(fit2) 

Obs 5 looks like an outlier from the AVplot for knee. It seems to have a high influence on the fit. This person has a small neck circumference and high bodyfat

eneck <- residuals(lm(neck~., data = bodyfat[-c(1,3,8)]))  
outlier2 <- as.numeric(eneck==min(eneck)) + 1
pairs(bodyfat, col = outlier2)

Obs 5 looks like an outlier on the neck vs knee plot, smaller neck circumference than expected and wider knee circumference.

fit2.1 <- lm(bfat~., data = bodyfat[-5,-c(3,8)])
summary(fit2.1)
## 
## Call:
## lm(formula = bfat ~ ., data = bodyfat[-5, -c(3, 8)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7965 -3.0480 -0.8301  3.5107  8.2645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.70329   13.83233   1.063 0.290956    
## age          0.06736    0.03269   2.061 0.042543 *  
## height      -0.82910    0.20569  -4.031 0.000125 ***
## neck        -0.64746    0.30960  -2.091 0.039640 *  
## abdomen      0.74523    0.07615   9.786 2.24e-15 ***
## knee         0.39103    0.35240   1.110 0.270444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.208 on 81 degrees of freedom
## Multiple R-squared:  0.8025, Adjusted R-squared:  0.7903 
## F-statistic: 65.84 on 5 and 81 DF,  p-value: < 2.2e-16
avPlots(fit2.1)

Even with this observation removed, neck is still a significant predictor. We will leave Obs 5 in the model as we have no reason to beleive it is an error.

Knee does not appear to be significant once the other variables are in the model. We will remove it from analysis.

fit3 <- lm(bfat~., data = bodyfat[-c(3,7,8)]) # no weight and ankle OR knee
summary(fit3)
## 
## Call:
## lm(formula = bfat ~ ., data = bodyfat[-c(3, 7, 8)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3867 -3.0476 -0.4197  3.4402  8.6603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.13554   13.58804   1.335 0.185634    
## age          0.05660    0.03168   1.787 0.077611 .  
## height      -0.73525    0.19263  -3.817 0.000259 ***
## neck        -0.65323    0.28088  -2.326 0.022480 *  
## abdomen      0.80802    0.06114  13.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.207 on 83 degrees of freedom
## Multiple R-squared:  0.8008, Adjusted R-squared:  0.7912 
## F-statistic: 83.41 on 4 and 83 DF,  p-value: < 2.2e-16
avPlots(fit3) 

Neck still has a significant negative coefficient. Age is no longer significant - what is going on?

Remove the effects of height, neck and abdomen from bfat, age and knee.

e2 <- residuals(lm(bfat~., data = bodyfat[-c(2,3,7,8)])) #remove the effects of height, neck, abdomen
e1 <- residuals(lm(age~., data = bodyfat[-c(1,3,7,8)])) #remove the effects of height, neck, abdomen
e0 <- residuals(lm(knee~., data = bodyfat[-c(1,2,3,8)])) #remove the effects of height, neck, abdomen

plot3d(e0, e1, e2, xlab = "knee|others", ylab = "age|others", zlab = "bfat|others") 

threedreg(e0, e1, e2,xlab = "knee|others", ylab = "age|others", zlab = "bfat|others")
summary(lm(e2~ e1 +e0))  
## 
## Call:
## lm(formula = e2 ~ e1 + e0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7629 -3.2343 -0.8899  3.4161  8.2253 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 3.355e-16  4.388e-01   0.000   1.0000  
## e1          6.437e-02  3.156e-02   2.040   0.0445 *
## e0          4.375e-01  3.354e-01   1.304   0.1956  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.116 on 85 degrees of freedom
## Multiple R-squared:  0.05594,    Adjusted R-squared:  0.03373 
## F-statistic: 2.518 on 2 and 85 DF,  p-value: 0.08659
which(e2==min(e2)) 
## 40 
## 40

One large negative residual: observation 40. This person is younger, has smaller knee circumference and lower bodyfat than expected when taking into account their height, neck and abdomen. Visualise this observation elsewhere:

outlier <- as.numeric(e2==min(e2)) +1
#pairs(bodyfat, col = outlier)
parcoord(bodyfat, col = outlier)

pairs( bodyfat[-c(3,8)], col = outlier)

Actually this person is of average age and height, but high bfat, weight, neck and very wide abdomen. Their knee circumference is slighty higher than average.

plot3d(bodyfat$age, bodyfat$knee, bodyfat$bfat, col = outlier)

Looking at the person by their age, knee and bfat, they do not seem like an outlier, except for the higher than expected bfat. Correcting for abdomen (an important predictor of bfat), the picture changes considerably.

Consider removing the effects of knee, in addition to height, neck, abdomen:

avPlots(fit2, "age", col = outlier) #remove effects of knee

avPlots(fit3, "age", col = outlier) #don't remove effects of knee

Wit knee in the model, the observation becomes more influential. This is the reason age is significant in fit2, but not in fit3.

Influence measures - we will study this later. Cook’s distance picks up this observation straight away.

# influence.measures(fit2)
plot(hatvalues(fit2))

plot(cooks.distance(fit2))

#redo analysis without this person
fit2.0 <- lm(bfat~., data = bodyfat[-40,-c(3,8)])
summary(fit2.0)
## 
## Call:
## lm(formula = bfat ~ ., data = bodyfat[-40, -c(3, 8)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7371 -3.2638 -0.8226  3.1534  8.2833 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.98905   13.50265   1.258   0.2119    
## age          0.05077    0.03204   1.585   0.1169    
## height      -0.82527    0.20030  -4.120 9.08e-05 ***
## neck        -0.66750    0.27870  -2.395   0.0189 *  
## abdomen      0.82200    0.07985  10.294 2.25e-16 ***
## knee         0.18480    0.35351   0.523   0.6026    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.099 on 81 degrees of freedom
## Multiple R-squared:  0.8086, Adjusted R-squared:  0.7968 
## F-statistic: 68.43 on 5 and 81 DF,  p-value: < 2.2e-16
avPlots(fit2.0)

Age is no longer significant (once height, neck, abdomen and knee are in the model).

Visualise models

Association between bfat and neck is positive, i.e. people with wider neck circumference have higher bfat. But we have seen that the opposite is true when we take into account other predictors, e.g. abdomen, height etc. It turns out that neck is a useful predictor of bfat in addition to the oter variables, but that wider neck implies lower bfat. Maybe once accounting for the other perdictors of bfat, wider neck also has something to tell us about the person (more likely to be male or have higher muscle percentage and therefore lower bfat?).

If this is the case, what is the conditional ssociation between bfat and neck?

Let’s first condition on abdomen only:

bodyfat$abdomenBin = cut(bodyfat$abdomen, c(70, seq(80,100,5), 130))
p<- ggplot(bodyfat, aes(neck , bfat)) + geom_point()
p + facet_grid(. ~ abdomenBin)#better

At different rangest of abdomen circumference, the association between bfat and neck is no longer positive (possibly weakly negative).

Let’s look at other conditional displays:

library(condvisV2)

condvis(data = bodyfat[-c(3,8,9)], model = list(fit2 = fit2), sectionvars = "neck")

At maximu similarity threshold, the association is positive, but when we set this, to say \(=1\), we only observe observations close to the section. Here we see a negative relationship between bfat and neck when conditioning at different combinations of the remaining predictors.

An example of Simpson’s paradox, quite hard to detect in hdim data.