knitr::opts_chunk$set(echo = TRUE)

Outliers

1. Screening for Outliers

In the process of producing, collecting, processing and analyzing data, outliers can come from many sources and hide in many dimensions. An outlier is an observation that is numerically distant from the rest of the data. In a boxplot, an outlier is a data point located outside the fences (“whiskers”) of the plot; by default R draws the whiskers no further than 1.5 times the length of the box from either end of the box.

We start by creating a vector

A <- c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4)

Then print it out

A
##  [1]  3  2  5  6  4  8  1  2 30  2  4

We then plot a boxplot to help us visualise any existing outliers

boxplot(A)

The plot shows a single outlier, the value 30, drawn as a separate point above the upper whisker

We can also extract the outliers as a vector using the boxplot.stats function, as shown below

boxplot.stats(A)$out
## [1] 30
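To make the rule behind this result visible, the fences can be recomputed by hand. This is a minimal sketch that assumes the default setting of boxplot.stats (coef = 1.5) and uses fivenum to obtain the hinges; the names hinges, box_length, lower_fence, upper_fence and manual_outliers are introduced here for illustration only.

```r
# Same vector as in the text above
A <- c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
hinges <- fivenum(A)[c(2, 4)]
box_length <- diff(hinges)

# The fences sit 1.5 box lengths beyond each hinge (default coef = 1.5)
lower_fence <- hinges[1] - 1.5 * box_length
upper_fence <- hinges[2] + 1.5 * box_length

# Values outside the fences are the outliers
manual_outliers <- A[A < lower_fence | A > upper_fence]
manual_outliers
```

With this vector, the manual result should match the output of boxplot.stats(A)$out.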

Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points.
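One way to investigate before eliminating anything is to flag the outliers and compare summary statistics with and without them. The sketch below is only an illustration of that idea; the names is_outlier, flagged and comparison are assumptions introduced here, not part of the original analysis.

```r
# Same vector as in the text above
A <- c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4)

# Flag the outliers instead of deleting them
is_outlier <- A %in% boxplot.stats(A)$out
flagged <- data.frame(value = A, is_outlier = is_outlier)

# Compare summaries with and without the flagged values
comparison <- c(
  mean_all       = mean(A),
  mean_without   = mean(A[!is_outlier]),
  median_all     = median(A),
  median_without = median(A[!is_outlier])
)

flagged
comparison
```

Keeping the flag as a column leaves the original values intact, so the decision to drop or keep a point can be revisited once its cause is understood.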

2. Obvious Inconsistencies

An obvious inconsistency occurs when a record contains a value or combination of values that cannot correspond to a real-world situation. For example, a person’s age cannot be negative, a man cannot be pregnant and an under-aged person cannot possess a driver’s license.

Suppose that, in our vector A above, values greater than 20 are obvious inconsistencies. We can then use a logical index to check for them

greater_than_20 <- A > 20

We print this out

greater_than_20
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

Only one element of our vector, the value 30, is greater than 20, so it is the one obvious inconsistency in our vector.
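The real-world examples mentioned earlier (negative age, a pregnant man, an under-aged driver) can be checked with the same logical-index approach, one rule per condition. The data frame below is entirely made up for illustration, and the age limit of 18 for a license is an assumption.

```r
# Hypothetical records, invented for this illustration
people <- data.frame(
  age         = c(34, -2, 16, 45),
  sex         = c("female", "male", "male", "male"),
  pregnant    = c(TRUE, FALSE, FALSE, TRUE),
  has_license = c(TRUE, FALSE, TRUE, TRUE)
)

# One logical vector per rule
negative_age    <- people$age < 0
pregnant_man    <- people$sex == "male" & people$pregnant
underage_driver <- people$age < 18 & people$has_license  # assumes a limit of 18

# A record is inconsistent if it breaks at least one rule
inconsistent <- negative_age | pregnant_man | underage_driver
people[inconsistent, ]
```

Keeping each rule in its own vector makes it easy to report which rule a record violates, not only that it violates one.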

Let’s Challenge ourselves

Using the bus dataset below, determine whether there are any obvious inconsistencies. Dataset URL: http://bit.ly/BusNairobiWesternTransport

First we load our dataset

library(data.table) 
library(tidyverse) 
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
bus_dataset <- fread('http://bit.ly/BusNairobiWesternTransport')

Let’s preview this dataset

Viewing entire dataset

View(bus_dataset)

Viewing Dataset Info

str(bus_dataset)
## Classes 'data.table' and 'data.frame':   51645 obs. of  10 variables:
##  $ ride_id        : int  1442 5437 5710 5777 5778 5777 5777 5778 5778 5781 ...
##  $ seat_number    : chr  "15A" "14A" "8B" "19A" ...
##  $ payment_method : chr  "Mpesa" "Mpesa" "Mpesa" "Mpesa" ...
##  $ payment_receipt: chr  "UZUEHCBUSO" "TIHLBUSGTE" "EQX8Q5G19O" "SGP18CL0ME" ...
##  $ travel_date    : IDate, format: "0017-10-17" "0019-11-17" ...
##  $ travel_time    : chr  "7:15" "7:12" "7:05" "7:10" ...
##  $ travel_from    : chr  "Migori" "Migori" "Keroka" "Homa Bay" ...
##  $ travel_to      : chr  "Nairobi" "Nairobi" "Nairobi" "Nairobi" ...
##  $ car_type       : chr  "Bus" "Bus" "Bus" "Bus" ...
##  $ max_capacity   : int  49 49 49 49 49 49 49 49 49 49 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Viewing the number of records and columns in the dataset

dim(bus_dataset)
## [1] 51645    10

Viewing dataset class attribute

class(bus_dataset)
## [1] "data.table" "data.frame"

Previewing the first 6 rows

head(bus_dataset)
##    ride_id seat_number payment_method payment_receipt travel_date travel_time
## 1:    1442         15A          Mpesa      UZUEHCBUSO  0017-10-17        7:15
## 2:    5437         14A          Mpesa      TIHLBUSGTE  0019-11-17        7:12
## 3:    5710          8B          Mpesa      EQX8Q5G19O  0026-11-17        7:05
## 4:    5777         19A          Mpesa      SGP18CL0ME  0027-11-17        7:10
## 5:    5778         11A          Mpesa      BM97HFRGL9  0027-11-17        7:12
## 6:    5777         18B          Mpesa      B6PBDU30IZ  0027-11-17        7:10
##    travel_from travel_to car_type max_capacity
## 1:      Migori   Nairobi      Bus           49
## 2:      Migori   Nairobi      Bus           49
## 3:      Keroka   Nairobi      Bus           49
## 4:    Homa Bay   Nairobi      Bus           49
## 5:      Migori   Nairobi      Bus           49
## 6:    Homa Bay   Nairobi      Bus           49
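The preview itself hints at one obvious inconsistency: travel_date shows years such as 0017 and 0019, which cannot correspond to a real bus trip. The sketch below shows how such values could be flagged and, under an assumption, repaired. The cut-off year 2000 is hypothetical, and the raw text format day-month-two-digit-year ("%d-%m-%y") is a guess based on the preview, not something confirmed by the source file.

```r
# Dates as they appear in the preview, plus one plausible value for contrast
travel_date <- as.Date(c("0017-10-17", "0019-11-17", "2017-11-27"))

# Extract the year and flag anything before a hypothetical cut-off
travel_year <- as.integer(format(travel_date, "%Y"))
implausible <- travel_year < 2000
travel_date[implausible]

# ASSUMPTION: the raw file stores dates as day-month-year with a
# two-digit year; an explicit format then recovers sensible dates
raw_text <- c("17-10-17", "19-11-17")
reparsed <- as.Date(raw_text, format = "%d-%m-%y")
reparsed
```

If the assumption about the raw format holds, the column could be read as text and converted with the explicit format instead of relying on automatic date detection.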

max_capacity is the only numeric measurement in the data (ride_id is an identifier), so we check it for outliers with a boxplot

boxplot(bus_dataset$max_capacity, ylab = "max_capacity")

We can also draw the same boxplot using ggplot2

ggplot(bus_dataset) +
  aes(x = "", y = max_capacity) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

We can also check for outliers numerically, as follows. The empty result, integer(0), means that no value lies beyond the whiskers

boxplot.stats(bus_dataset$max_capacity)$out
## integer(0)
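An empty outlier vector does not rule out inconsistencies between columns. One further check, sketched here on a made-up stand-in for the bus data, is whether each car_type always reports the same max_capacity. The values in rides are invented for illustration and say nothing about the real dataset.

```r
# Hypothetical stand-in for the bus data; values are invented
rides <- data.frame(
  car_type     = c("Bus", "Bus", "shuttle", "shuttle"),
  max_capacity = c(49, 49, 11, 49)
)

# Count the distinct capacities reported for each vehicle type
capacities_per_type <- vapply(
  split(rides$max_capacity, rides$car_type),
  function(x) length(unique(x)),
  integer(1)
)

# A type with more than one capacity deserves a closer look
inconsistent_types <- names(capacities_per_type)[capacities_per_type > 1]
inconsistent_types
```

The same pattern could be applied to bus_dataset by replacing rides with it, since both columns exist there.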