Observations which fall a long way from the remainder of the data are termed outliers and while they have the capacity to influence the model, this isn’t always the case. The leverage of the outlier will dictate its influence and therefore it is important to identify outliers, assess their influence and then manage them appropriately.
The ‘Animals’ dataset within the MASS library illustates the impact of outliers. The data details the brain weight (g) and body weight (kg) of 28 animal species. You can read about it here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Animals.html
First we load the packages we will use.
library(ggplot2)
library(MASS)
library(tidyverse)
Using ggplot2 we can build a scatterplot to visualise the data. This is often easier to intepret than summary outputs…
summary(Animals)
## body brain
## Min. : 0.02 Min. : 0.40
## 1st Qu.: 3.10 1st Qu.: 22.23
## Median : 53.83 Median : 137.00
## Mean : 4278.44 Mean : 574.52
## 3rd Qu.: 479.00 3rd Qu.: 420.00
## Max. :87000.00 Max. :5712.00
or looking directly at the data…
head(Animals)
## body brain
## Mountain beaver 1.35 8.1
## Cow 465.00 423.0
## Grey wolf 36.33 119.5
## Goat 27.66 115.0
## Guinea pig 1.04 5.5
## Dipliodocus 11700.00 50.0
Ggplot is a handy plotting function as it will allow us to build upon this simple plot.
ggplot(Animals,aes(x=body,y=brain)) +
geom_point()
It’s hard to tell if there is a relationship here because large species are throwing out the scale. These outliers distort the scale so that we can’t even distiguish a box in the boxplot.
ggplot(Animals,aes(x=brain,y=body,group="continuous")) +
geom_boxplot()
Boxplots are great at identifing outliers as points outside the inter-quartile range are highlighted as dots. We can see 5 dots, but which species are these?
I prefer using the piping function in tidyverse to filter my data as it is easier to intepret. I describe this code as: Take Animals then convert rownames to a column then filter by body>1000 then arrange in descending order by body.
Animals %>%
rownames_to_column %>%
filter(body >1000) %>%
arrange(desc(body))
## rowname body brain
## 1 Brachiosaurus 87000 154.5
## 2 Dipliodocus 11700 50.0
## 3 Triceratops 9400 70.0
## 4 African elephant 6654 5712.0
## 5 Asian elephant 2547 4603.0
Much more obvious than…
arrange(filter(rownames_to_column(Animals),body>1000),desc(body))
## rowname body brain
## 1 Brachiosaurus 87000 154.5
## 2 Dipliodocus 11700 50.0
## 3 Triceratops 9400 70.0
## 4 African elephant 6654 5712.0
## 5 Asian elephant 2547 4603.0
We can also just append the ggplot.
ggplot(Animals,aes(x=body,y=brain)) +
geom_point() +
geom_text(aes(label=ifelse(body>1000 & body<10000,rownames(Animals),'')),hjust=.45,vjust=-.6) +
geom_text(aes(label=ifelse(body>10000 & body<20000,rownames(Animals),'')),hjust=0,vjust=1.7) +
geom_text(aes(label=ifelse(body>20000,rownames(Animals),'')),hjust=1,vjust=-1)
Three of the very large species are prehistoric and they all have a light brain relative to their bodyweight.
What does the model look like with these dinosaurs?
model <- lm(brain~body,Animals)
summary(model)
##
## Call:
## lm(formula = brain ~ body, data = Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -576.0 -554.1 -438.1 -156.3 5138.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.764e+02 2.659e+02 2.168 0.0395 *
## body -4.326e-04 1.589e-02 -0.027 0.9785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1360 on 26 degrees of freedom
## Multiple R-squared: 2.853e-05, Adjusted R-squared: -0.03843
## F-statistic: 0.0007417 on 1 and 26 DF, p-value: 0.9785
and without?
Present_Animals<- Animals %>%
filter(body<8000)
model2 <- lm(brain~body,Present_Animals)
summary(model2)
##
## Call:
## lm(formula = brain ~ body, data = Present_Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -755.05 -188.34 -128.54 -18.64 2009.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 191.22257 110.08777 1.737 0.0958 .
## body 0.94317 0.07658 12.317 1.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 520.5 on 23 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8626
## F-statistic: 151.7 on 1 and 23 DF, p-value: 1.312e-11
By excluding the dinosaurs we can reject the null hypothesis that the slope coefficient is zero.
What does this look like on our plot?
coef.lm1 <- coef(model)
coef.lm2 <- coef(model2)
ggplot(Animals,aes(x=body,y=brain)) +
geom_point() +
geom_text(aes(label=ifelse(body>1000 & body<10000,rownames(Animals),'')),hjust=.45,vjust=-.6) +
geom_text(aes(label=ifelse(body>10000 & body<20000,rownames(Animals),'')),hjust=0,vjust=1.7) +
geom_text(aes(label=ifelse(body>20000,rownames(Animals),'')),hjust=1,vjust=-1) +
geom_abline(intercept=coef.lm1[1],
slope=coef.lm1[2],color = "red") + #model with Dinosaurs
geom_abline(intercept=coef.lm2[1],
slope=coef.lm2[2],color = "blue") #model without Dinosaurs
We see that including the dinosaurs in the model distorts the model fit but including elephants is fine. The reason is that the dinosaurs have high leverage and therefore can influence the fit of the model. The elephants don’t.
This is easily visualised by plotting residuals vs leverage. This plot also contains a contour to illustrate points which fall outside the Cook’s Distance of 1 or 0.5. The Cook’s Distance is a value which summarises the impact of removing that value on the remaining fitted values. It is therefore a good proxy for influence. Generally points outside the contour should be investigated.
plot(model,which=5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Here the Brachiosaurus has so much leverage that it breaks the plot. Without it, the other two dinosaurs show high leverage.
Nobrach<- Animals %>%
rownames_to_column() %>%
filter(body<80000) %>%
column_to_rownames()
model3 <- lm(brain~body,Nobrach)
plot(model3,which=5)
It’s also interesting to note that the elephants in this model exhibit larger standardised residuals than the remaining dinosaurs. This is because the model is still being pulled horizontal by the dinosaurs.
Nodino <- Animals %>%
rownames_to_column() %>%
filter(body<7000) %>%
column_to_rownames()
model4 <- lm(brain~body,Nodino)
plot(model4,which=5)
Without the dinosaurs the elephants act as the influential points. By chance they counterweigh eachother and fit the remainder of the data very well (except for the heavy brained humans).
How influential points should be handled often depends on the intentions of the analysis. In this example body size can explain brain size in present day animals but has a completely different relationship in dinosaurs. This suggests we should remove the dinosaurs from the dataset.
ggplot(Nodino,aes(x=body,y=brain)) +
geom_point() +
geom_text(aes(label=ifelse(brain>1000,rownames(Nodino),'')),hjust=.6,vjust=1.3) +
geom_abline(intercept=coef.lm2[1],
slope=coef.lm2[2],color = "blue") #model without Dinosaurs