knitr::opts_chunk$set(echo = TRUE)
Outliers can arise at any stage of producing, collecting, processing and analyzing data, and they can hide in many dimensions. An outlier is an observation that is numerically distant from the rest of the data. In a boxplot, an outlier is a data point that lies beyond the whiskers; by default, R extends each whisker to the most extreme observation within 1.5 times the interquartile range of the box.
We start by creating a vector:
A <- c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4)
Then we print it out:
A
## [1] 3 2 5 6 4 8 1 2 30 2 4
We then draw a boxplot to help us visualise any outliers:
boxplot(A)
The plot above shows one outlier, drawn as a separate point beyond the upper whisker.
We can also collect the outliers in a vector using the function boxplot.stats, as shown below:
boxplot.stats(A)$out
## [1] 30
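To see where that value comes from, the rule can be reproduced by hand. The sketch below assumes the default coefficient of 1.5 and uses the hinges returned by fivenum(), which is what boxplot.stats is built on; the variable names are illustrative only.

```r
A <- c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4)

# fivenum() returns min, lower hinge, median, upper hinge, max.
five <- fivenum(A)
lower_hinge <- five[2]
upper_hinge <- five[4]
hinge_spread <- upper_hinge - lower_hinge

# 1.5 is the default coefficient of boxplot() and boxplot.stats().
fence_coef <- 1.5
lower_fence <- lower_hinge - fence_coef * hinge_spread
upper_fence <- upper_hinge + fence_coef * hinge_spread

# Anything strictly outside the fences is reported as an outlier.
manual_outliers <- A[A < lower_fence | A > upper_fence]
manual_outliers
```

Here the hinges are 2 and 5.5, so the fences sit at -3.25 and 10.75, and 30 is the only value outside them.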
Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or about the data gathering and recording process. Before considering whether to eliminate these points from the data, one should try to understand why they appeared and whether similar values are likely to keep appearing. That said, outliers are often simply bad data points.
An obvious inconsistency occurs when a record contains a value or combination of values that cannot correspond to a real-world situation. For example, a person’s age cannot be negative, a man cannot be pregnant and an underage person cannot hold a driver’s license.
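As a rough illustration, rules like these can be written as logical conditions on a data frame. Everything in the sketch below is hypothetical: the people records are invented, and the minimum driving age of 18 is an assumption that varies by country.

```r
# Hypothetical records, invented purely for illustration.
people <- data.frame(
  age = c(34, -2, 15, 41),
  sex = c("F", "M", "M", "M"),
  pregnant = c(TRUE, FALSE, FALSE, TRUE),
  has_license = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

min_driving_age <- 18  # assumed threshold, not taken from any dataset

# One logical column per rule; TRUE marks a violation.
violations <- data.frame(
  negative_age = people$age < 0,
  pregnant_male = people$sex == "M" & people$pregnant,
  underage_license = people$age < min_driving_age & people$has_license
)

# Rows that break at least one rule.
inconsistent_rows <- which(rowSums(violations) > 0)
people[inconsistent_rows, ]
```

Keeping one column per rule makes it easy to see which rule each record breaks, not just that it is inconsistent.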
Suppose that, for our vector A above, values greater than 20 are obvious inconsistencies. We can then use logical indexing to check for them:
greater_than_20 <- A > 20
We print this out:
greater_than_20
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
Only one element of our vector is greater than 20 (the value 30), so it is the single inconsistency under this rule.
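The logical vector can also be used to pull out the offending positions and values directly. This is a small sketch; the name exceeds_limit is introduced here for illustration only.

```r
A <- c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4)

# TRUE where the value breaks the "no values above 20" rule.
exceeds_limit <- A > 20

which(exceeds_limit)   # positions of the inconsistent values
A[exceeds_limit]       # the inconsistent values themselves
sum(exceeds_limit)     # how many values break the rule
```

For this vector the single inconsistent value, 30, sits at position 9.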
Using the bus dataset below, determine whether there are any obvious inconsistencies. Dataset URL: http://bit.ly/BusNairobiWesternTransport
First we load our dataset:
library(data.table)
## Warning: package 'data.table' was built under R version 4.1.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.1
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
bus_dataset <- fread('http://bit.ly/BusNairobiWesternTransport')
Let’s preview this dataset:
View(bus_dataset)
str(bus_dataset)
## Classes 'data.table' and 'data.frame': 51645 obs. of 10 variables:
## $ ride_id : int 1442 5437 5710 5777 5778 5777 5777 5778 5778 5781 ...
## $ seat_number : chr "15A" "14A" "8B" "19A" ...
## $ payment_method : chr "Mpesa" "Mpesa" "Mpesa" "Mpesa" ...
## $ payment_receipt: chr "UZUEHCBUSO" "TIHLBUSGTE" "EQX8Q5G19O" "SGP18CL0ME" ...
## $ travel_date : IDate, format: "0017-10-17" "0019-11-17" ...
## $ travel_time : chr "7:15" "7:12" "7:05" "7:10" ...
## $ travel_from : chr "Migori" "Migori" "Keroka" "Homa Bay" ...
## $ travel_to : chr "Nairobi" "Nairobi" "Nairobi" "Nairobi" ...
## $ car_type : chr "Bus" "Bus" "Bus" "Bus" ...
## $ max_capacity : int 49 49 49 49 49 49 49 49 49 49 ...
## - attr(*, ".internal.selfref")=<externalptr>
dim(bus_dataset)
## [1] 51645 10
class(bus_dataset)
## [1] "data.table" "data.frame"
head(bus_dataset)
## ride_id seat_number payment_method payment_receipt travel_date travel_time
## 1: 1442 15A Mpesa UZUEHCBUSO 0017-10-17 7:15
## 2: 5437 14A Mpesa TIHLBUSGTE 0019-11-17 7:12
## 3: 5710 8B Mpesa EQX8Q5G19O 0026-11-17 7:05
## 4: 5777 19A Mpesa SGP18CL0ME 0027-11-17 7:10
## 5: 5778 11A Mpesa BM97HFRGL9 0027-11-17 7:12
## 6: 5777 18B Mpesa B6PBDU30IZ 0027-11-17 7:10
## travel_from travel_to car_type max_capacity
## 1: Migori Nairobi Bus 49
## 2: Migori Nairobi Bus 49
## 3: Keroka Nairobi Bus 49
## 4: Homa Bay Nairobi Bus 49
## 5: Migori Nairobi Bus 49
## 6: Homa Bay Nairobi Bus 49
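The preview itself hints at one candidate inconsistency: travel_date shows years such as 0017 and 0019, which are implausible for bus rides. The sketch below runs a simple range check on three rows copied from the head() output. The sample_rides data frame and the cut-off year of 2000 are illustrative assumptions, and the dates are handled as text here rather than as the IDate column that fread produced.

```r
# Three rows copied from the head() output above; not the full dataset.
sample_rides <- data.frame(
  ride_id = c(1442L, 5437L, 5710L),
  travel_date = c("0017-10-17", "0019-11-17", "0026-11-17"),
  stringsAsFactors = FALSE
)

# The year is the first four characters of the yyyy-mm-dd text.
travel_year <- as.integer(substr(sample_rides$travel_date, 1, 4))

earliest_plausible_year <- 2000L  # assumed cut-off, for illustration only

# TRUE where the year is too early to be a real travel date.
implausible_date <- travel_year < earliest_plausible_year
sample_rides[implausible_date, ]
```

All three sample rows are flagged. One possible explanation is that the dates were stored as day-month-year and read as year-month-day, but that would need to be checked against the source of the data.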
Next, we take the numeric column max_capacity and check it for outliers (ride_id is also an integer, but it is an identifier rather than a measurement):
boxplot(bus_dataset$max_capacity, ylab = "max_capacity")
We can also draw the same boxplot with ggplot2:
ggplot(bus_dataset) +
  aes(x = "", y = max_capacity) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
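We can also confirm this numerically. An empty result from boxplot.stats, printed as integer(0), means that no value of max_capacity lies beyond the whiskers: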
boxplot.stats(bus_dataset$max_capacity)$out
## integer(0)
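When several columns need the same check, the empty-vector test can be wrapped in a small helper. The function name has_boxplot_outliers is hypothetical, and the second call uses only the six max_capacity values visible in the head() output, not the full column.

```r
# Hypothetical helper: TRUE when boxplot.stats() flags at least one value.
has_boxplot_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x, coef = coef)$out) > 0
}

# The example vector from earlier contains the outlier 30.
has_boxplot_outliers(c(3, 2, 5, 6, 4, 8, 1, 2, 30, 2, 4))

# The six max_capacity values shown in head() are all 49.
has_boxplot_outliers(rep(49L, 6))
```

The first call returns TRUE and the second returns FALSE, since a constant column has no values outside its fences.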