Outlier detection in spatial data using R

Outliers in spatial data

If you work with spatial datasets, you should know that such data need a different treatment from conventional data. However, every property required of conventional data must also hold for spatial data. This short document discusses the treatment of outlying observations when the data are distributed in space.

Regardless of the type of data, outliers can cause inference problems in statistical models. If your data have outliers, the distribution is likely to be asymmetric and the maximum likelihood estimate may not be the best option. Many researchers use non-parametric approaches to handle this problem. Whatever the estimation technique, however, you must first check whether your data contain outliers.

If the individual observations are not distributed in space, many tests can be used to identify outliers. In particular, if you are using R, you can use the package OutlierDetection and choose from a set of methods to identify them. Its help file is available in .

You can also use one of the most widely used techniques for outlier detection, the boxplot. This tool uses the quartiles of the distribution to compute an interquartile range and shows how far the observations lie from the median. To demonstrate this, I will use the mtcars data. First, I will load the dataset.

data(mtcars)

Now, I will use the boxplot function to create a boxplot for the variable named hp.

boxplot(mtcars$hp, ylab = 'hp values', col = 'gray')

The gray rectangle represents the interquartile range, the bold line inside the rectangle marks the median, and the short horizontal lines at the ends of the dashed lines (the whiskers) mark the maximum and minimum values of the variable, disregarding outliers. If an observation lies very far from the interquartile range, it can be considered an upper outlier when it is well above the range, or a lower outlier when it is well below it.
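The numbers behind the plot can also be read directly. The short sketch below uses boxplot.stats from base R, which returns the five values drawn in the boxplot and any points lying beyond the whiskers. It relies on the default coefficient of 1.5; the object name bp is only illustrative.

```r
data(mtcars)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   observations drawn as individual points beyond the whiskers
bp <- boxplot.stats(mtcars$hp)
bp$stats
bp$out
```

Any value listed in bp$out corresponds to a circle in the boxplot.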

The interquartile range (IQR) is the distance between the third and first quartiles. Outliers can be detected with a simple rule. First, a multiplier for this distance must be chosen (usually called h). If an observation lies more than h times the IQR below the first quartile, it can be considered a lower outlier. If it lies more than h times the IQR above the third quartile, it can be considered an upper outlier. Researchers generally use h = 1.5 or h = 3. The higher the value of h, the further from the interquartile range an observation must be to be considered an outlier.
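This rule can be written in a few lines of R. The code below is a minimal sketch rather than a definitive implementation: the helper name flag_outliers is hypothetical, and it uses the quartiles returned by quantile, which can differ slightly from the hinges used by boxplot.

```r
# Hypothetical helper: label each value as "lower", "upper" or "none"
flag_outliers <- function(x, h = 1.5) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1                       # interquartile range
  lower_fence <- q1 - h * iqr          # values below this are lower outliers
  upper_fence <- q3 + h * iqr          # values above this are upper outliers
  ifelse(x < lower_fence, "lower",
         ifelse(x > upper_fence, "upper", "none"))
}

data(mtcars)
table(flag_outliers(mtcars$hp, h = 1.5))
table(flag_outliers(mtcars$hp, h = 3))
```

With h = 3 the fences move further away from the quartiles, so fewer observations are flagged than with h = 1.5.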

The circle in the boxplot represents an upper outlier with h = 1.5. As the bottom whisker is shorter than the top whisker, the boxplot suggests an asymmetric distribution. I will verify this with a histogram using the package ggplot2.

library(ggplot2)
ggplot(mtcars, aes(x = hp)) + 
  geom_histogram(fill = 'red', color = 'white', bins = 8) + 
  ylab('Frequency') + xlab('hp values') + 
  theme_classic()
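As a numeric complement to the histogram, the sketch below compares the mean with the median and computes a moment-based skewness coefficient by hand. The formula is one common definition, added here as an assumption for illustration and not taken from any package; a positive value points to a longer right tail.

```r
data(mtcars)
hp <- mtcars$hp

# A mean above the median is a simple sign of a right-skewed distribution
mean(hp) > median(hp)

# Moment-based skewness: third central moment divided by the
# second central moment raised to 3/2 (both moments divide by n)
m2 <- mean((hp - mean(hp))^2)
m3 <- mean((hp - mean(hp))^3)
skewness <- m3 / m2^(3 / 2)
skewness
```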

The same process can be done with spatial data. To show this, I will use the sf package and import a spatial dataset named nc.gpkg. This dataset contains some information about the state of North Carolina.

library(sf)

## Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3

nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
ggplot(nc) + geom_sf() + theme()

It is possible to check for outliers in this dataset using the basic boxplot, as shown with the mtcars dataset. I will use the variable called SID79 to show how you can perform this procedure.

boxplot(nc$SID79, ylab = 'SID79 values', col = 'gray')

You can see that this variable has upper outliers. However, you can also display this procedure on a map, where the information in the boxplot is presented in more detail. To do this, you can use the function below:

boxmap <- function(variable, data, h){
  # Attach the packages this function depends on
  invisible(lapply(c("sf", "ggplot2"), require, character.only = TRUE))
  # Evaluate the unquoted variable name inside the spatial dataset
  x <- eval(substitute(variable), data)
  # q[1] = minimum, q[2] = first quartile, q[3] = median, q[4] = third quartile, q[5] = maximum
  q <- quantile(x, na.rm = TRUE)
  iqr <- q[4] - q[2]
  # Quartile class of each observation (the maximum falls in the top class)
  cls <- ifelse(x < q[2], '< 25 %',
         ifelse(x < q[3], '25 % - 50 %',
         ifelse(x < q[4], '50 % - 75 %', '> 75 %')))
  # Observations beyond Q1 - h * IQR or Q3 + h * IQR are relabelled as outliers
  cls[which(x < q[2] - h * iqr)] <- 'Lower outlier'
  cls[which(x > q[4] + h * iqr)] <- 'Upper outlier'
  # Store the classes as a factor so the legend runs from low to high
  data$boxmap_class <- factor(cls, levels = c('Lower outlier', '< 25 %', '25 % - 50 %',
                                              '50 % - 75 %', '> 75 %', 'Upper outlier'))
  ggplot(data = data) +
    geom_sf(aes(fill = boxmap_class)) +
    guides(fill = guide_legend(title = sprintf("Outlier condition (h = %.2f)", h))) +
    theme(axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          panel.background = element_blank(),
          panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          plot.background = element_blank())
}

The function is called boxmap. You pass the unquoted variable name in the parameter variable, the spatial dataset in the parameter data (it must be an sf object, such as one read with read_sf), and the multiplier in the parameter h (h = 1.5 or h = 3). To show this procedure, I will use the variable SID79 of the nc dataset.

boxmap(variable = SID79, data = nc, h = 1.5)

As you can see, the results in the boxmap agree with the boxplot shown earlier: the variable SID79 has upper outliers. However, the boxmap is much more informative than the boxplot, because each observation is plotted according to its quartile class.

The results can change if you choose h = 3. In this case, an observation must lie further from the interquartile range to be considered a lower or an upper outlier. Some observations may therefore move to the '< 25 %' class (they are no longer considered lower outliers) or to the '> 75 %' class (they are no longer considered upper outliers). This procedure is shown below.

boxmap(variable = SID79, data = nc, h = 3)
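To see how the choice of h changes the classification without comparing the two maps by eye, the outliers can also be counted directly. The sketch below assumes the nc dataset read with read_sf as above; the helper name count_outliers is hypothetical, and its fences are built from the quartiles returned by quantile.

```r
library(sf)
nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))

# Hypothetical helper: number of lower and upper outliers for a given h
count_outliers <- function(x, h) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  c(lower = sum(x < q[1] - h * iqr, na.rm = TRUE),
    upper = sum(x > q[2] + h * iqr, na.rm = TRUE))
}

count_outliers(nc$SID79, h = 1.5)
count_outliers(nc$SID79, h = 3)
```

The counts for h = 3 can never exceed those for h = 1.5, because the fences only move further from the quartiles as h grows.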

Helson Gomes de Souza
