Observations which fall a long way from the remainder of the data are termed outliers and while they have the capacity to influence the model, this isn’t always the case. The leverage of the outlier will dictate its influence and therefore it is important to identify outliers, assess their influence and then manage them appropriately.

The ‘Animals’ dataset within the MASS library illustates the impact of outliers. The data details the brain weight (g) and body weight (kg) of 28 animal species. You can read about it here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Animals.html

First we load the packages we will use.

library(ggplot2)
library(MASS)
library(tidyverse)

Using ggplot2 we can build a scatterplot to visualise the data. This is often easier to intepret than summary outputs…

summary(Animals)
##       body              brain        
##  Min.   :    0.02   Min.   :   0.40  
##  1st Qu.:    3.10   1st Qu.:  22.23  
##  Median :   53.83   Median : 137.00  
##  Mean   : 4278.44   Mean   : 574.52  
##  3rd Qu.:  479.00   3rd Qu.: 420.00  
##  Max.   :87000.00   Max.   :5712.00

or looking directly at the data…

head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0

Ggplot is a handy plotting function as it will allow us to build upon this simple plot.

ggplot(Animals,aes(x=body,y=brain)) +
  geom_point() 

It’s hard to tell if there is a relationship here because large species are throwing out the scale. These outliers distort the scale so that we can’t even distiguish a box in the boxplot.

ggplot(Animals,aes(x=brain,y=body,group="continuous")) +
  geom_boxplot()

Boxplots are great at identifing outliers as points outside the inter-quartile range are highlighted as dots. We can see 5 dots, but which species are these?

I prefer using the piping function in tidyverse to filter my data as it is easier to intepret. I describe this code as: Take Animals then convert rownames to a column then filter by body>1000 then arrange in descending order by body.

Animals %>% 
  rownames_to_column %>%
  filter(body >1000) %>%
  arrange(desc(body))
##            rowname  body  brain
## 1    Brachiosaurus 87000  154.5
## 2      Dipliodocus 11700   50.0
## 3      Triceratops  9400   70.0
## 4 African elephant  6654 5712.0
## 5   Asian elephant  2547 4603.0

Much more obvious than…

arrange(filter(rownames_to_column(Animals),body>1000),desc(body))
##            rowname  body  brain
## 1    Brachiosaurus 87000  154.5
## 2      Dipliodocus 11700   50.0
## 3      Triceratops  9400   70.0
## 4 African elephant  6654 5712.0
## 5   Asian elephant  2547 4603.0

We can also just append the ggplot.

ggplot(Animals,aes(x=body,y=brain)) +
  geom_point() +
  geom_text(aes(label=ifelse(body>1000 & body<10000,rownames(Animals),'')),hjust=.45,vjust=-.6) + 
  geom_text(aes(label=ifelse(body>10000 & body<20000,rownames(Animals),'')),hjust=0,vjust=1.7) +
  geom_text(aes(label=ifelse(body>20000,rownames(Animals),'')),hjust=1,vjust=-1)

Three of the very large species are prehistoric and they all have a light brain relative to their bodyweight.

What does the model look like with these dinosaurs?

model <- lm(brain~body,Animals)
summary(model)
## 
## Call:
## lm(formula = brain ~ body, data = Animals)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -576.0 -554.1 -438.1 -156.3 5138.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.764e+02  2.659e+02   2.168   0.0395 *
## body        -4.326e-04  1.589e-02  -0.027   0.9785  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1360 on 26 degrees of freedom
## Multiple R-squared:  2.853e-05,  Adjusted R-squared:  -0.03843 
## F-statistic: 0.0007417 on 1 and 26 DF,  p-value: 0.9785

and without?

Present_Animals<- Animals %>%
  filter(body<8000)
model2 <- lm(brain~body,Present_Animals)
summary(model2)
## 
## Call:
## lm(formula = brain ~ body, data = Present_Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -755.05 -188.34 -128.54  -18.64 2009.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 191.22257  110.08777   1.737   0.0958 .  
## body          0.94317    0.07658  12.317 1.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 520.5 on 23 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8626 
## F-statistic: 151.7 on 1 and 23 DF,  p-value: 1.312e-11

By excluding the dinosaurs we can reject the null hypothesis that the slope coefficient is zero.

What does this look like on our plot?

coef.lm1 <- coef(model)
coef.lm2 <- coef(model2)
ggplot(Animals,aes(x=body,y=brain)) +
  geom_point() +
  geom_text(aes(label=ifelse(body>1000 & body<10000,rownames(Animals),'')),hjust=.45,vjust=-.6) + 
  geom_text(aes(label=ifelse(body>10000 & body<20000,rownames(Animals),'')),hjust=0,vjust=1.7) +
  geom_text(aes(label=ifelse(body>20000,rownames(Animals),'')),hjust=1,vjust=-1) +
  geom_abline(intercept=coef.lm1[1],
              slope=coef.lm1[2],color = "red") + #model with Dinosaurs
  geom_abline(intercept=coef.lm2[1],
              slope=coef.lm2[2],color = "blue") #model without Dinosaurs

We see that including the dinosaurs in the model distorts the model fit but including elephants is fine. The reason is that the dinosaurs have high leverage and therefore can influence the fit of the model. The elephants don’t.

This is easily visualised by plotting residuals vs leverage. This plot also contains a contour to illustrate points which fall outside the Cook’s Distance of 1 or 0.5. The Cook’s Distance is a value which summarises the impact of removing that value on the remaining fitted values. It is therefore a good proxy for influence. Generally points outside the contour should be investigated.

plot(model,which=5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Here the Brachiosaurus has so much leverage that it breaks the plot. Without it, the other two dinosaurs show high leverage.

Nobrach<- Animals %>%
  rownames_to_column() %>%
  filter(body<80000) %>%
  column_to_rownames()
model3 <- lm(brain~body,Nobrach)
plot(model3,which=5)

It’s also interesting to note that the elephants in this model exhibit larger standardised residuals than the remaining dinosaurs. This is because the model is still being pulled horizontal by the dinosaurs.

Nodino <- Animals %>%
  rownames_to_column() %>%
  filter(body<7000) %>%
  column_to_rownames()
model4 <- lm(brain~body,Nodino)
plot(model4,which=5)

Without the dinosaurs the elephants act as the influential points. By chance they counterweigh eachother and fit the remainder of the data very well (except for the heavy brained humans).

How influential points should be handled often depends on the intentions of the analysis. In this example body size can explain brain size in present day animals but has a completely different relationship in dinosaurs. This suggests we should remove the dinosaurs from the dataset.

ggplot(Nodino,aes(x=body,y=brain)) +
  geom_point() +
  geom_text(aes(label=ifelse(brain>1000,rownames(Nodino),'')),hjust=.6,vjust=1.3) +
  geom_abline(intercept=coef.lm2[1],
              slope=coef.lm2[2],color = "blue") #model without Dinosaurs