#1. Introduction
The dataset comes from the Missing Migrant Project that tracks the deaths of migrants that have perished on the way to their destination (their country of asylum). For more information, look up this website: https://www.iom.int/.
I hope that monitoring these tragic events and studying the data relating to them can help to avoid a recurrence of these. We are all human beings. ##1.1 Loading packages and datasets
# Loading libraries
library(naivebayes)
library(plyr)
library(car)
library(qdapTools)
library(rgeos)
library(dismo)
library(geosphere)
library(rgdal)
library(ggspatial)
library(viridis)
library(rnaturalearth)
library(sf)
library(dplyr)
library(tidyverse)
library(Amelia)
library(VIM)
library(stringr)
library(sp)
library(magrittr)
library(ggpubr)
library(ggthemes)
library(philentropy)
library(reshape2)
library(caret)
library(MASS)
Loading dataset
rm(list=ls())
df <- read.csv('MissingMigrants-Global-2020-04-11T05-24-09.csv')
##1.2. Quick look at the dataset
Let’s look at the dataset
head(df)
## Web.ID Region Reported.Date Reported.Year Reported.Month
## 1 53711 US-Mexico Border April 08, 2020 2020 Apr
## 2 53710 Central America April 05, 2020 2020 Apr
## 3 53705 US-Mexico Border April 05, 2020 2020 Apr
## 4 53702 Central America April 05, 2020 2020 Apr
## 5 53701 US-Mexico Border April 05, 2020 2020 Apr
## 6 53700 Mediterranean April 01, 2020 2020 Apr
## Number.Dead Minimum.Estimated.Number.of.Missing Total.Dead.and.Missing
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## 6 1 NA 1
## Number.of.Survivors Number.of.Females Number.of.Males Number.of.Children
## 1 NA NA 1 NA
## 2 NA NA 1 NA
## 3 NA NA 1 NA
## 4 NA NA 1 NA
## 5 NA NA 1 NA
## 6 NA NA NA NA
## Cause.of.Death Location.Description
## 1 Presumed drowning Rio Bravo, Nuevo Laredo, Tamaulipas, Mexico
## 2 Accident (fell from train) San Lorenzo Soltepec, Tlaxcala, Mexico
## 3 Presumed drowning Rio Bravo, Eagle Pass, Maverick County, Texas, USA
## 4 Run over by train Ixtepec, Oaxaca, Mexico
## 5 Drowning Las Cruces, Piedras Negras, Coahuila, Mexico
## 6 Presumed drowning Remains recovered on Al Maya Shore, Zawiyah, Libya
## Information.Source Location.Coordinates
## 1 Notiviza Tamaulipas, Hoy Tamaulipas 27.499560110910, -99.529145296143
## 2 El Gritón 19.604444400000, -98.311944400000
## 3 Zócalo 28.711778048640, -100.507718230750
## 4 Instituto Nacional de Migración 16.561126700000, -95.096314100000
## 5 Zócalo 28.709460903906, -100.510735628340
## 6 IOM Libya 32.792615604333, 12.733749617969
## Migration.Route URL
## 1 https://archive.ph/MxbRb, https://archive.vn/aXBY7
## 2 https://archive.ph/93ZaE
## 3 https://archive.ph/Nx54L
## 4 https://archive.ph/zpj1a
## 5 https://archive.ph/3JNKz
## 6 Central Mediterranean
## UNSD.Geographical.Grouping Source.Quality
## 1 Central America 3
## 2 Central America 1
## 3 Northern America 1
## 4 Central America 5
## 5 Central America 1
## 6 Uncategorized 4
The dataset contains 5333 observations that have been collected from all over the world. The description of the variables, you can find below. Let’s see variables from our dataset: Variable name Description Web ID : An automatically generated ID of WEB
Region of accident : The region in which an incident took place
Reported date : Estimated date of death or finding a body
Reported year : The year in which the incident occurred
Reported month : The month in which the incident occurred
Number dead : The total number of people confirmed dead in one incident
Minimum Estimated Number of Missing : The total number of those who are missing and are thus assumed to be dead (minimum)
Total dead and missing : The sum of the ‘number dead’ and ‘Minimum Estimated Number of Missing’ variables
Number of Survivors : The number of survivors
Number of Females : The number of females
Number of Males : The number of males
Number of children : The number of children
Cause of Death : The cause of death
Location Description : Where the incident happened
Information Source : The source of information
Location Coordinates : The Coordinates of location
Migration Route : The route of migration
URL : The URL of a site where information about a given incident is written
UNSD Geographical Grouping : The grouping of incidents based on their geo-location made by the UNSD
Source Quality : How much a source of information is reliable
###Dimensions of the dataset
dim(df)
## [1] 7242 20
##2. Missing values Before we dive into details, we should see how many missing values there are in the dataset how they are distributed.
###Missing values
Changing the names of columns so that they are more readable
colnames(df) <- c('ID', 'Region', 'Date', 'Year', 'Month', 'Dead', 'Minimum_Missing', 'Total_Dead_Missing', 'Survivors', 'Females', 'Males', 'Children',
'Death_Cause', 'Location_Description', 'Source', 'Coordinates', 'Route', 'URL', 'Geo_Grouping', 'Source_Quality')
Creating a data frame of information about missing datas
missing.values <- df %>%
gather(key = "key", value = "val") %>%
dplyr::mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
dplyr::summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
dplyr::select(-is.missing) %>%
dplyr::arrange(desc(num.missing))
## Warning: attributes are not identical across measure variables;
## they will be dropped
colnames(missing.values) <- c('Columns_from_df', 'No_of_missings')
Further operations on missing values
missing.values <- data.frame(missing.values)
missing.values$Total <- dim(df)[1]
missing.values[, 'Percent_of_missing'] <- as.vector(paste0(round(missing.values$No_of_missing/missing.values$Total, 3)*100, " %"))
missing.values[, 'Fraction_of_missing'] <- round(missing.values$No_of_missing/missing.values$Total, 3)
missing.values <- missing.values[, -3]
missing.values$Columns_from_df <- factor(missing.values$Columns_from_df, level=missing.values$Columns_from_df)
x=missing.values[, -3]
x
## Columns_from_df No_of_missings Fraction_of_missing
## 1 Minimum_Missing 6623 0.915
## 2 Survivors 6279 0.867
## 3 Children 6204 0.857
## 4 Females 5672 0.783
## 5 Males 3090 0.427
## 6 Dead 247 0.034
options(repr.plot.width = 12 , repr.plot.height = 9)
ggplot(missing.values)+
geom_col(aes(x=Columns_from_df, y=Fraction_of_missing, fill=Columns_from_df), width=1)+
coord_polar(theta='x')+
geom_text(aes(x=Columns_from_df, y=0.6, label=Percent_of_missing), size=12)+
scale_fill_brewer(palette="Blues")+
labs(title="Percentage of missing data", subtitle="From 0% to 100%", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=10), plot.title = element_text(size = 25, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 20, face = "bold", hjust=0.48))+
scale_y_continuous(breaks=c(0, 0.99), labels=c('0 %', '100 %'))
There are 4 variables (Minimum missing, Children, Survivors, Females), whose ratio of missing data exceeds 80%.
Lack of data in those instances probabably stems from the fact that:
We know how many missing values there are.
However it will be most informative if we could see how they are distributed, and check whether they may be some pattern involved in it. To make it more transparent, we will sort the dataframe and clump missing values together for visualisation purposes. ###Distribution of missings values
d <- df %>%
arrange(desc(is.na(Dead)),
desc(is.na(Males)),
desc(is.na(Females)),
desc(is.na(Survivors)),
desc(is.na(Children)),
desc(is.na(Minimum_Missing)),
desc(is.na(Death_Cause)))
ggplot_missing <- function(x){
x %>%
is.na %>%
melt %>%
ggplot(data = .,
aes(x = Var2,
y = Var1)) +
geom_raster(aes(fill = value)) +
scale_fill_grey(name = "",
labels = c("Present","Missing")) +
theme_minimal() +
theme(axis.text.x = element_text(angle=45, vjust=0.5, size=15), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48),
axis.text.y=element_text(size=15), axis.title.x=element_text(size=20), axis.title.y=element_text(size=20)) +
labs(x = "Variables in Dataset",
y = "Rows / observations", title='Missing values in dataset', subtitle='Grouped to see overlap')
}
d %>%
is.na %>%
melt %>%
ggplot(data=., aes(x=Var2, y=Var1)) +
geom_raster(aes(fill=value)) +
scale_fill_grey(name="", labels= c('Present', 'M'))
ggplot_missing(d)
As can be seen from the above infographic, missings do not overlap and virtually every observation contains at least one missing. Consequently, we cannot deem some observations corrupted and delete them, in return, it will be best to exclude those 4 variables from our analysis.
#2. Exploratory analysis ##2.1 Analysing the source of information In this section, we will try to derive some information about the source of information on each accident (i.e., observation) from the raw feature “URL”.
At the beginning, it’d be best if we saw the unique beginnings of each URL.
We know that URL, in most cases, indicates of a source at its beginnings, and the further to the end, the more detailed pieces of information about the website, it provides with.
Let’s see then unique beginnings of URL (10 first characters)
unique(str_sub(df$URL, 1, 10))
## [1] "https://ar" "" "https://bi" "http://www" "http://bit"
## [6] "http://mvr" "https://re" "http://arc" "https://ww" "https://va"
## [11] "https://wa" "https://hc" "https://t1" "http://en." "http://hcg"
## [16] "https://tw" "https://cn" "https://yh" "http://wat" "https://bb"
## [21] "https://ny" "https://ed" "https://hr" "https://tl" "http://alw"
## [26] "https://go" "https://ab" "https://us" "http://hum" "http://reu"
## [31] "http://tms" "http://bbc" "http://wap" "http://nyt" "http://elu"
## [36] "http://fb." "https://dr" "http://dai" "http://elm" "http://ind"
## [41] "http://sum" "https://t." "http://on." "https://tg" "http://apn"
## [46] "http://lat" "http://yho" "http://f24" "http://shs" "http://hrl"
## [51] "http://abc" "http://fxn" "http://ab." "https://ca" "http://huf"
## [56] "http://1.u" "http://new" "http://cnn" "https://qu" "http://nbc"
## [61] "https://m." "http://act" "http://ow." "https://au" "http://rel"
There are four distinctive ways, in which URL can begin:
https://www.[source] or http://[source] or http://www[source] or https://[source]
Where [source] is the wanted source ?
By comparing the number of missing values generated from two different codings, we see that we those four options are the only ones
sum(is.na(str_match(df$URL, 'https://www.*|http://www.*|http://(?!ww).*|https://(?!ww).*')))
## [1] 2987
sum(df$URL=="")
## [1] 2987
We extract substrings that follows aforementioned “beginnings” to the point where there is a dot.
url1 <- str_extract(df$URL, '(?<=(http://www.|https://www.))[^\\.]*')
url2 <- str_extract(df$URL, '(?<=(https://))[^\\.(www)]*')
url3 <- str_extract(df$URL, '(?<=(http://))[^\\.(www)]*')
Converting NAs into blank spaces
url1 <- ifelse(is.na(url1), "", url1)
url2 <- ifelse(is.na(url2), "", url2)
url3 <- ifelse(is.na(url3), "", url3)
Merging 3 lists
url <- url1
url <- ifelse(url2!="", url2, url)
url <- ifelse(url3!="", url3, url)
url <- ifelse(url1=="", NA, url)
We would like to store only sources that appear more than 5 or 50 times
less_5_set <- rownames(data.frame('url'=table(url)[table(url)<6]))
less_50_set <- data.frame('url'=table(url)[table(url)<50])[, 'url.url']
url5 <- ifelse(url %in% less_5_set, NA, url)
url50 <- ifelse(url %in% less_50_set, NA, url)
Keeping it in forms of dataframes
url5_tab <- data.frame(table(url5))
url50_tab <- data.frame(table(url50))
Ordering decreasingly
url5_tab <- url5_tab %>%
arrange(desc(Freq))
url50_tab <- url50_tab %>%
arrange(desc(Freq))
url5_tab$url5 <- factor(url5_tab$url5, level=url5_tab$url5)
url50_tab$url50 <- factor(url50_tab$url50, level=url50_tab$url50)
df$url5 <- url5
df$url50 <- url50
df$url5 <- factor(df$url5, level=url5_tab$url5)
df$url50 <- factor(df$url50, level=url50_tab$url50)
options(repr.plot.width = 12 , repr.plot.height = 8)
df %>%
filter(!is.na(url5)) %>%
group_by(url5) %>%
summarize(count=n()) %>%
ggplot() +
geom_linerange(aes(x = url5, ymin = 0, ymax = count), color = "lightgray", size = 1.5)+
geom_point(aes(x = url5, y=count, color = url5), size = 1)+
ggpubr::color_palette("jco")+
theme_pubclean()+
labs(title="Sources of information on incidents", subtitle="From the most common to the least", x="Website source", y="Number of occurences")+
theme(legend.position = "none", axis.text.x=element_text(size=12, angle=90), plot.title = element_text(size = 20, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 15, face = "bold", hjust=0.48), axis.text.y=element_text(size=9), axis.title.x=element_text(size=5, vjust=-1.25), axis.title.y=element_text(size=10, vjust=2))
## Warning: This manual palette can handle a maximum of 10 values. You have
## supplied 40.
## Warning: Removed 30 rows containing missing values (geom_point).
There are a lot of isolated (single) sources of information. To make the graph more transparent, we will show only those sources which appeared at least 50 times.
df %>%
filter(!is.na(url50)) %>%
group_by(url50) %>%
summarize(count=n()) %>%
ggplot() +
geom_linerange(aes(x = url50, ymin = 0, ymax = count), color = "lightgray", size = 1.5)+
geom_point(aes(x = url50, y=count, color = url50), size = 1)+
ggpubr::color_palette("jco")+
theme_pubclean()+
labs(title="Sources of information on incidents", subtitle="Appearing more often than 50 times", x="Website source", y="Number of occurences")+
theme(legend.position = "none", axis.text.x=element_text(size=12), plot.title = element_text(size = 20, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 15, face = "bold", hjust=0.48), axis.text.y=element_text(size=9), axis.title.x=element_text(size=10, vjust=-1.25), axis.title.y=element_text(size=10, vjust=2))
With the aid of a map, we’ll try to localise the clusters of accidents involving migrants’ deaths, which, I hope, will give us some insight into those gruesome events.
Coordinates of accidents are stored in one column
To facilitate localising the points, we’ll break this column down into two columns storing information about: latitude and longitude
df$lan <- str_replace(str_extract(df$Coordinates, ".*,"), ",", "")
df$lon <- str_trim(str_replace(str_extract(df$Coordinates, ",.*"), ",", ""))
df$lon <- as.numeric(df$lon)
df$lan<- as.numeric(df$lan)
world <- ne_countries(scale='medium', returnclass='sf')
world_points <- st_centroid(world)
## Warning in st_centroid.sf(world): st_centroid assumes attributes are constant
## over geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data
Getting a set of points on the basis of which to draw a map with contoured countries
Creating data frame with coordinates
dsf <- df[, c('lan','lon')]
dsf[is.na(dsf$lan), 'lan'] <- 0
dsf[is.na(dsf$lon), 'lon'] <- 0
dsf <- sf::st_as_sf(dsf, coords=c("lon","lan"), crs=4326)
x=df$lan
y=df$lon
x[is.na(x)] <- 0
y[is.na(y)] <- 0
Here we get message that “although coordinates are longitude/latitude, st_intersects assumes that they are planar”
It reminds us that coordinates in a long/lat form are the mere approximation and a form of projection that is prone to distortions (they tend to be bigger,
the larger areas we consider, and the closer to the poles we get)
However, since both the world data and the points with accident data are projected with the same transformation, we don’t need to bother ourselves with tranforming those points into plansr coordinates
!!!! Bear in mind that projections of map are linear transformations (precisely affine transformation), which does not preserve angle between lines or distances between points,
but it does preserve ratios of distances between points lying on a straight line.
countries <- data.frame(countries=length(unique(world$name_en)), nums=1:255)
xy <- SpatialPointsDataFrame(matrix(c(y, x), ncol=2), data.frame(ID=seq(1:length(y))),
proj4string=CRS("+proj=longlat +datum=WGS84 +no_defs"))
We need to convert the world points into Spatial form, as points of accidents are also in this form
spd <- sf::as_Spatial(st_geometry(world$geometry), IDs = as.character(1:255))
points_in_polygons <- sp::over(y=spd, x=xy)
colnames(world)
## [1] "scalerank" "featurecla" "labelrank" "sovereignt" "sov_a3"
## [6] "adm0_dif" "level" "type" "admin" "adm0_a3"
## [11] "geou_dif" "geounit" "gu_a3" "su_dif" "subunit"
## [16] "su_a3" "brk_diff" "name" "name_long" "brk_a3"
## [21] "brk_name" "brk_group" "abbrev" "postal" "formal_en"
## [26] "formal_fr" "note_adm0" "note_brk" "name_sort" "name_alt"
## [31] "mapcolor7" "mapcolor8" "mapcolor9" "mapcolor13" "pop_est"
## [36] "gdp_md_est" "pop_year" "lastcensus" "gdp_year" "economy"
## [41] "income_grp" "wikipedia" "fips_10" "iso_a2" "iso_a3"
## [46] "iso_n3" "un_a3" "wb_a2" "wb_a3" "woe_id"
## [51] "adm0_a3_is" "adm0_a3_us" "adm0_a3_un" "adm0_a3_wb" "continent"
## [56] "region_un" "subregion" "region_wb" "name_len" "long_len"
## [61] "abbrev_len" "tiny" "homepart" "geometry"
countries_with_points <- unique(data.frame(world[points_in_polygons, 'formal_en'])['formal_en'])
countries_with_points <- as.vector(countries_with_points)[['formal_en']]
all_countries <- as.vector(world$formal_en)
countries_with_points[is.na(countries_with_points)] <- "NA"
world$countries_with_points <- ifelse(all_countries %in% countries_with_points, 1, 0)
Creating clusters
mdist <- distm(xy)
hc <- hclust(as.dist(mdist), method="complete")
d=1250000
xy$clust <- cutree(hc, h=d)
df$col <- xy$clust
df$lan[is.na(df$lan)] <- 0
df$lon[is.na(df$lon)] <- 0
find_hull <- function(df) df[chull(df$lon, df$lan), ]
hulls <- ddply(df, "col", find_hull)
define the distance threshold, in this case 1250000 (we set this distance manually by setting it to various values and checking, visually,
whether computed clusters group accidents in a proper way (e.g., on the sea or on the land),beacuse geo-distance similarity doesn’t often translate into alikeness of causes of migration
(e.g., accidents can happens close to each other but be separated by a geographical barriers, which clearly indicate that geo-distance is not a perfect one)
options(repr.plot.width = 18 , repr.plot.height = 10)
suppressMessages(ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan), data=df, size=0.01, color='red')+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
theme_classic()+
theme(panel.background = element_rect(fill = '#006994'))+
geom_rect(xmin = -120, xmax = -74.12, ymin = 7.65, ymax = 33.97,
fill = NA, colour = "black", size = 1.5)+
geom_rect(xmin = -9, xmax = 40, ymin = 30, ymax = 45,
fill = NA, colour = "black", size = 1.5)+
geom_rect(xmin = 15, xmax = 40, ymin = 5, ymax = 28,
fill = NA, colour = "black", size = 1.5)+
labs(title="Location of migrants' deaths", subtitle="Countries with no deaths on their territory coloured grey", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48)))
## Scale on map varies by more than 10%, scale bar may be inaccurate
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis("col", discrete=T)+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
theme_classic()+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="Location of migrants' deaths", subtitle="Countries with no deaths on their territory coloured grey", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
Comment to the map
As we can see on the maps above, the migrants’ deaths occur in the countries from which or to which they migrate, or in the countries that lie in between.
There can be a couple of routes seen:
(a) from Mexico to the US
(b) from North Africa to Europe
(c) from the Middle East to Europe
Besides, there are recorded deaths in: Central America, Continental Europe, in Asiaall over the Africa.
Clustering
To deal with such a large number of points and make any meaning of them, we should cluster them basing on their geolocation.
I am aware that in the dataset there is an already created variable that groups the locations of accidents, however, I would like to group it myself, showing one way how it can be done.
Geo-grouping consists of two stages:
In this kernel, I used “geo-distance” (instead of more commonly used Euclidean distance), which takes into account the curvature of the Earth. I do not suspect that using this distance metric will make much of a difference, since most points lie quite close to the Equator and generally by constructing clusters we’re more focused on distances between closer points, but as we operate on the Earth, it’s a good practice to use it.
After computing distances, we need to set a “cut-off” distance, which will be a treshold, determining the highest admissable distance between two points in a cluster.
It cannot be too large, because then clusters will have high variance failing to find distinct group of points (e.g., north part of the Mediterranean and the Southern Continental Europe).
It cannot also be too small, since it will be hard to group them later manually, which we’ll do in the second stage.
For instance, some places may be separated by mountain barrier or be of a different nature (e.g., one accident happening on the land and the other on the sea, no matter how close to each other from a geographical point of view, they are quite dissimilar).Therefore, we cannot completely rely on the algorithms.
#2.3. Visualising migrants’ deaths on each Zoom of a map at each region
centroids_lon=vector()
centroids_lan=vector()
for (i in 1:length(unique(df$col))){
lon = df[df$col==i, 'lon']
lan = df[df$col==i, 'lan']
centroids_lon[i]=mean(lon)
centroids_lan[i]=mean(lan)
}
centroids <- data.frame(centroids_lon, centroids_lan, table(df$col))
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis("col", discrete=T)+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
theme_classic()+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="Location of migrants' deaths", subtitle="Countries with no deaths on their territory coloured grey", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
North America
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis(discrete=T)+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
coord_sf(xlim=c(-140, -60), ylim=c(5, 50))+
geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="North America", subtitle="With a number of accidents in each cluster", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
df$New_geo_grouping <- 'Uncategorized'
df$New_geo_grouping <- ifelse(df$col %in% c(3, 14, 23, 13), 'Mexico/Central America', df$New_geo_grouping)
df$New_geo_grouping <- ifelse(df$col %in% c(4, 22, 8, 7), 'USA-Mexico border', df$New_geo_grouping)
Comment to the map
As being an European, I am constantly exposed to and bombarded with information about the tragedies in the Mediterranean, but surprisingly enough (at least for me), the biggest cluster can be found on the US-Mexico border.
Africa
options(repr.plot.width = 18 , repr.plot.height = 20)
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis(discrete=T)+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
coord_sf(ylim=c(-30, 50), xlim=c(-20, 68))+
geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="Africa", subtitle="With a number of accidents in each cluster", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
df$New_geo_grouping <- ifelse(df$col %in% c(9, 48, 30, 49, 36, 34, 54, 35, 40), 'Sub-saharan Africa', df$New_geo_grouping)
df$New_geo_grouping <- ifelse(df$col %in% c(43, 29, 44, 37, 26, 64, 47, 46, 63, 27, 21, 51, 53, 41, 24, 38, 33, 39, 45), 'Equatorial/South Africa', df$New_geo_grouping)
Comment to the map
Moreover, a great concentration of deaths can be found also in the eastern Sub-Saharan Africa.
Europe
options(repr.plot.width = 18 , repr.plot.height = 15)
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis(discrete=T)+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
coord_sf(ylim=c(30, 71), xlim=c(-9, 68))+
geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
theme_classic()+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="Europe", subtitle="With a number of accidents in each cluster", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
df$New_geo_grouping <- ifelse(df$col %in% c(2, 19, 6, 17, 32, 52, 1, 16), 'Mediterranean', df$New_geo_grouping)
df$New_geo_grouping <- ifelse(df$col %in% c(18, 11), 'Continental_Europe', df$New_geo_grouping)
Comment to the map
Lastly, we’ll discuss the clusters in Europe.
From the map we can gather that most people perished on the Mediterranean and there are two main routes that migrants choose:
(a) from Morocco to Spain
(b) from Libya to Italy
Middle East
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis(discrete=T)+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
coord_sf(xlim=c(25, 90), ylim=c(10, 50))+
geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="Middle East", subtitle="With a number of accidents in each cluster", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
df$New_geo_grouping <- ifelse(df$col %in% c(31, 60, 28), 'Middle East', df$New_geo_grouping)
Asia
ggplot(data=world)+
geom_sf(aes(fill=as.factor(countries_with_points)))+
scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
scale_color_viridis(discrete=T)+
annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
coord_sf(xlim=c(80, 150), ylim=c(-10, 30))+
geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
theme(panel.background = element_rect(fill = '#006994'))+
labs(title="Asia", subtitle="With a number of accidents in each cluster", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate
df$New_geo_grouping <- ifelse(df$col %in% c(73, 55, 50, 5, 42, 57, 68, 59, 57, 70), 'Asia', df$New_geo_grouping)
2.4. Examining causes of deaths
Having checked the distribution of accidents across the whole world, we will now proceed into finding out what the most common causes of deaths were and how they differed in relation to other factors.
At the beginning, I need to preprocess the data, which involves extracting causes of deaths from Cause_death variable, and subsequently, grouping the similar causes in categories.
df$Death_Cause_NA <- ifelse(str_detect(df$Death_Cause, 'Unknown'),as.character(df[which(str_detect(df$Death_Cause, 'Unknown')), 'Death_Cause']), 'NA' )
df[which(str_detect(df$Death_Cause, 'Unknown')), 'Death_Cause'] <- NA
Death_cause_split <- cbind(df[1], mtabulate(str_split(df$Death_Cause, ",")))
Now we group those causes
df$Suffocation_death <- ifelse(rowSums(Death_cause_split[,
c(" Asphyxiation"," Suffocation", "Asphyxiation", "Suffocation",
"Presumed asphyxiation")]==1)>0, 1, 0)
df$Land_transport_accident <- ifelse(rowSums(Death_cause_split[,
c("Vehicle Accident", "Train Accident", "Plane stowaway",
"Hit by car","Hit by vehicle", "Fall onto train tracks", "Fall from vehicle",
"Fall from truck", "Crushed in back of truck"," Crushed by bus", " Crushed in back of truck",
" Fall from train", " Hit by car", "Bus fire",
"Crushed by stones (train)", "Electrocution on railway",
"Fall from train" )]==1)>0, 1, 0)
df$Food_water_shortages_death <- ifelse(rowSums(Death_cause_split[,
c(" Dehydration","Dehydration", "Starvation", " Starvation",
" Malnutrition" )]==1)>0, 1, 0)
df$Boat_transport_accident <- ifelse(rowSums(Death_cause_split[,
c("Violence while attempting to board boat", "Presumed drowning",
" Drowning", " Presumed drowning","Drowning","Fall from boat",
"Injuries from boat motor")]==1)>0, 1, 0)
df$Exhaustion_death <- ifelse(rowSums(Death_cause_split[,
c(" Exhaustion", "Exhaustion")]==1)>0, 1, 0)
df$Weather_condition_death <- ifelse(rowSums(Death_cause_split[,
c(" Exposure", " Harsh conditions", "Exposure",
"Harsh conditions", "Harsh weather/lack of adequate shelter",
" Harsh weather/lack of adequate shelter")]==1)>0, 1, 0)
df$Heat_Hyperthermia_death <- ifelse(rowSums(Death_cause_split[,
c(" Hyperthermia","Heat stroke", "Hyperthermia",
"Probable hyperthermia", "Presumed hyperthermia",
"Presumed hyperthermia/dehydration")]==1)>0, 1, 0)
df$Hypothermia_death <- ifelse(rowSums(Death_cause_split[,
c(" Hypothermia", "Hypothermia",
"Presumed hypothermia")]==1)>0, 1, 0)
df$Murder<- ifelse(rowSums(Death_cause_split[,
c(" Murdered", " Shot","Stabbed" , " Stabbed",
" Shot or stabbed", "Murdered", "Shot", "Shot or stabbed",
"Shot or stabbed")]==1)>0, 1, 0)
df$War_related_death <- ifelse(rowSums(Death_cause_split[,
c("Attacked by Apache helicopter", "Killed by landmine blast",
"Killed by mortar shell", "Gassed")]==1)>0, 1, 0)
df$Electrocution_fire_death <- ifelse(rowSums(Death_cause_split[,
c("Burned", "Electrocution", "Fire", "Fuel burns",
"Struck by lightning bolt")]==1)>0, 1, 0)
df$Poor_health_death <- ifelse(rowSums(Death_cause_split[,
c("Post-partum complications", "Respiratory illness" ,
" Sickness and lack of access to medicines",
"Sickness and lack of access to medicines",
"Pulmonary edema and renal insufficiency", "Pulmonary edema",
"Pulmonary complications", "Pneumonia", "Cardiac arrest",
"Cervical cancer", "Coronary artery atherosclerosis",
"Digestive bleeding", "Hypoglycemia",
"Organ failure" )]==1)>0, 1, 0)
df$Fall_death <- ifelse(rowSums(Death_cause_split[,
c("Fall into a canyon", "Fall",
"Fall from steep slope", "Crushed",
" Crushed","Rockslide",
"Fall from border fence")]==1)>0, 1, 0)
df$Other_accidents <- ifelse(rowSums(Death_cause_split[,
c("Mixed", "Poison", "Envenomation",
"Accident (non-vehicle)")]==1)>0, 1, 0)
df$Hanging_suicide <- ifelse(rowSums(Death_cause_split[,
c("Hanging", "Suicide")]==1)>0, 1, 0)
df$Hippo_croco_death <- ifelse(rowSums(Death_cause_split[,
c("Killed by hippopotamus", "Killed by crocodile",
"Killed by hippoptamus")]==1)>0, 1, 0)
Creating ad hoc df_groups with causes of death
df_groups <- df[, 28:43]
df_groups <- cbind("New_geo_grouping"=df$New_geo_grouping, df_groups)
Calculating conditional distribution of causes of death given geo-location
geo_groups <- data.frame(New_geo_grouping=unique(df$New_geo_grouping))
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Suffocation_death=sum(Suffocation_death)), on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Land_transport_accident=sum(Land_transport_accident)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Food_water_shortages_death=sum(Food_water_shortages_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Boat_transport_accident=sum(Boat_transport_accident)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Exhaustion_death=sum(Exhaustion_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Weather_condition_death=sum(Weather_condition_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Hypothermia_death=sum(Hypothermia_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Murder=sum(Murder)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, War_related_death=sum(War_related_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Electrocution_fire_death=sum(Electrocution_fire_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Poor_health_death=sum(Poor_health_death)),
on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Fall_death=sum(Fall_death)), on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Other_accidents=sum(Other_accidents)), on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Hanging_suicide =sum(Hanging_suicide )), on='New_geo_grouping')
geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
summarise, Hippo_croco_death=sum(Hippo_croco_death)), on='New_geo_grouping')
Share of causes of death in total
Death_cause_percentage <- data.frame(New_geo_grouping=colnames(geo_groups[, -1]))
for (i in 1:length(colnames(geo_groups[, -1]))) {
Death_cause_percentage[i, 2] <- sum(geo_groups[, i+1])/sum(geo_groups[, -1])
}
Reordering it descreasingly
Death_cause_percentage <- Death_cause_percentage %>%
arrange(desc(V2))
Death_cause_percentage$New_geo_grouping <- factor(Death_cause_percentage$New_geo_grouping, level=Death_cause_percentage$New_geo_grouping)
options(repr.plot.width = 18 , repr.plot.height = 10)
ggplot(Death_cause_percentage)+
geom_col(aes(New_geo_grouping, V2, fill=V2))+
theme_pubclean()+
labs(title="Distribution of causes of death", subtitle="Summing up to 1.0", x="", y="")+
theme(legend.position = "none", axis.text.x=element_text(size=23, angle=45, hjust=1), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48),
axis.text.y=element_text(size=23))+
scale_x_discrete(labels=c("Boat transport", 'Poor health', 'Land transport', 'Food/water unsufficient', 'Murder', 'Weather conditions', 'Suffocation', 'Hypothermia', 'Other', 'Electrocution/fire',
'Fall', 'War', 'Exhaustion', 'Hanging/suicide', 'Hippo/crocodile'))
For our futher analysis, we would like to keep only those causes of death that make up more than 4% of total
Death_cause_percentage <- Death_cause_percentage[Death_cause_percentage$V2>0.04, ]
geo_groups <- geo_groups[, c(TRUE, colnames(geo_groups[, -1]) %in% Death_cause_percentage$New_geo_grouping)]
Transposing our data frame
geo_groups <- t(geo_groups)
colnames(geo_groups) <- as.character(unlist(geo_groups[1,]))
geo_groups = geo_groups[-1, ]
geo_groups <- as.data.frame(geo_groups)
geo_groups <- cbind(rownames(geo_groups), geo_groups)
rownames(geo_groups) <- NULL
Generating our dataset in a long form
geo_groups_long <- melt(geo_groups, id="rownames(geo_groups)", measure=colnames(geo_groups[, -1]))
## Warning: attributes are not identical across measure variables; they will be
## dropped
colnames(geo_groups_long)[1] <- 'Death_Cause'
geo_groups_long$value <- as.numeric(geo_groups_long$value)
Calculating total number of accidents per region
total <- geo_groups_long %>%
dplyr::group_by(variable) %>%
dplyr::summarize(TOTAL=sum(value))
geo_groups_long <- merge(geo_groups_long, total, on='variable')
geo_groups_long$ratio <- geo_groups_long$value/geo_groups_long$TOTAL
first_place <- geo_groups_long %>%
arrange(desc(ratio)) %>%
group_by(variable) %>%
top_n(1, ratio) %>%
arrange(variable) %>%
mutate(count_first = 1)
second_place <- geo_groups_long %>%
arrange(desc(ratio)) %>%
group_by(variable) %>%
top_n(2, ratio) %>%
arrange(variable) %>%
mutate(count_second = 1)
geo_groups_long <- merge(geo_groups_long, first_place, on=c('variable', 'Death_Cause'), all=T)
geo_groups_long <- merge(geo_groups_long, second_place, on=c('variable', 'Death_Cause'), all=T)
geo_groups_long$place <- 0
geo_groups_long$place <- ifelse(geo_groups_long$count_first==1, 1, geo_groups_long$place)
geo_groups_long$place <- ifelse((is.na(geo_groups_long$count_first) & geo_groups_long$count_second==1), 2, geo_groups_long$place)
geo_groups_long$place <- ifelse(is.na(geo_groups_long$place), 0, geo_groups_long$place)
geo_groups_long$text <- geo_groups_long$place
geo_groups_long$text <- ifelse(geo_groups_long$text==0, NA, geo_groups_long$text)
geo_groups_long$place <- factor(geo_groups_long$place, level=c(0, 1, 2), labels=c(0, '1st', '2nd'))
geo_groups_long$text <- factor(geo_groups_long$text, level=c(0, 1,2 ), labels=c(0, '1st', '2nd'))
options(repr.plot.width = 6 , repr.plot.height = 5)
ggplot(arrange(geo_groups_long, Death_Cause), aes(group=variable))+
theme_pubclean()+
geom_col(aes(x=Death_Cause, y=ratio, fill=place), position='identity')+
scale_fill_viridis(discrete=T, labels=c('rest', '1st', '2nd'), name='The most/The second most common')+
geom_text(aes(x=Death_Cause, y=ratio, label=text), size=4, vjust=1.0)+
facet_wrap(~variable)+
theme(strip.text = element_text(size=5), axis.text.x=element_text(size=5, angle=-45), plot.title = element_text(size = 12, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust=0.50), legend.title = element_text(size=5), legend.text=element_text(size=5), axis.text.y=element_text(size=5))+
scale_x_discrete(labels=c("Boat transport", 'Food/water insuff', 'Land transport', 'Murder', 'Poor health', 'Weather conditions'))+
labs(title="Causes of deaths per region", subtitle="Only 5 most common causes of deaths", x="", y="")
## Warning: Removed 36 rows containing missing values (geom_text).
Comments to the graphs The graphs provide us with some interesting informations.
The most common causes of death are: drowning, fall from a boat, unjuries from boat motor and violence on the boat, which together constitute the variable (created by me) Boat transport. This stems from the fact that most deaths are recorded on the Mediterranean see. The boats are often bursting at the seams with passengers, who need to be crammed for the lack of the sufficient number of boats. Worth noting is also that they often cross the Mediterranaen in small and canoe-like boats, which don’t stand a chance against even mild storms.
The second most common causes of death are included in an aggregated term poor health, which encompasses: Sickness and lack of access to medicines, Organ failure, Hypoglycemia, Digestive bleeding, Coronary artery atherosclerosis, Cervical cancer, Cardiac arrest, Pneumonia, Pulmonary edema, Respiratory illness, Post-partum complications. Those deaths are connected to health conditions and ilnesses.
What comes as a surprise to me is that in Africa most people die as a result of not being provided with satisfactory health care and not from starvation or dehydration.
#3. Naive Bayes model #3.1. Preparations To extract even more information from the dataset, we will use Naive Bayes model. We would like to find out how causes of death and dates of incidents affect the number of deaths within one accident.
What is exactly Naive Bayes classifier?
Naive Bayes classifier is a classification algorithm, which rests on the assumption of independence between features (that’s why it’s called naive, because this assumption rarely holds) and Bayes theorem, which can be seen below.
Caption for the picture.
Why does this model require such a strong assumption?
If the independence holds, then it’s much easier to calculate maximum-likelihood expression on which this model bases. To see how the joint probability simplifies, see for instance: https://en.wikipedia.org/wiki/Naive_Bayes_classifier or any other book on machine learning.
Why do we think this model is appriopriate?
Firstly, it’s reasonable to suggest that the variables do not depend on each other. The causes of death, in most cases, were assigned one per observation (with little exception), therefore as such, must be independent. Moreover, the same holds for the time of event, which is mutually exclusive (an event happens only in one month and year). On top of that, there shouldn’t been any correlation between year and month, and if there is, it should be regarded as a spurious correlation. Otherwise, it’d suggest that there is some time trend indicating that from year to year more migrants choose some set of months to migrate, or that in some years migrants decided to migrate in a different month, which due to weather condition might be plausible, yet the impact of that change shouldn’t bother us when running Naive Bayes model). Having said that, there might be some dependence between the time and causes of deaths (e.g., in the summer time, we expect deaths from dehydration or hyperthermia to surge); this point may raise a concern and, therefore, we will check the correlation of features nonetheless.
Secondly, it performs quite well while provided with not so many observations, which is the case in our example.
What kind of Naive Bayes classifier?
As all features are binary, we decide for Bernoulli Naive Bayes (in contrast to Gaussian, which, simplifying to a great extent, is used for continuous variables, where we assume they are normally distributed; or in contrast to Multinomial Naive Bayes, where features have multinomial normal distribution and are often used for frequencies, e.g., how many times word appeared in a text).
df$Month2 <- as.numeric(df$Month)
df$Month2 <- car::recode(df$Month2, "5=1; 4=2; 8=3; 1=4; 9=5; 7=6; 6=7; 2=8; 12=9; 11=10; 10=11; 3=12")
df$Total_Dead_Missing <- str_remove(df$Total_Dead_Missing, ",") %>% as.numeric()
unique(df$Total_Dead_Missing)
## [1] 1 3 22 64 2 8 18 4 10 14 60 91 12 9 5
## [16] 6 39 29 11 17 13 20 0 45 40 7 84 31 56 19
## [31] 70 26 23 73 94 65 33 57 28 24 53 15 100 150 16
## [46] 83 25 32 27 59 21 50 41 117 54 34 61 130 52 38
## [61] 44 114 104 101 43 76 62 112 47 36 30 48 37 49 68
## [76] 63 97 120 35 110 126 82 156 102 42 74 118 176 58 87
## [91] 103 99 135 128 111 288 339 245 550 255 459 90 51 123 95
## [106] 202 71 222 106 46 1022 400 243 307 750 149 88 500 164 251
## [121] 172 55 151 200 75 240 217 77
df$more_death_missing_5 <- ifelse(df$Total_Dead_Missing>5, 1, 0)
# Converting a discrete variable into many binary
df$m_123 <- ifelse(df$Month2 %in% c(1, 2, 3), 1, 0)
df$m_456 <- ifelse(df$Month2 %in% c(5, 4, 6), 1, 0)
df$m_789 <- ifelse(df$Month2 %in% c(7, 8, 9), 1, 0)
df$m_101112 <- ifelse(df$Month2 %in% c(10, 11, 12), 1, 0)
df$Year_2014 <- ifelse(df$Year==2014, 1, 0)
df$Year_2015 <- ifelse(df$Year==2015, 1, 0)
df$Year_2016 <- ifelse(df$Year==2016, 1, 0)
df$Year_2017 <- ifelse(df$Year==2017, 1, 0)
df$Year_2018 <- ifelse(df$Year==2018, 1, 0)
df$Year_2019 <- ifelse(df$Year==2019, 1, 0)
df$Year_2020 <- ifelse(df$Year==2020, 1, 0)
Calculating correlation (association) between variables
jaccard_dist <- as.matrix(distance(t(df[,29:56]), method = 'jaccard',test.na = F))
## Metric: 'jaccard'; comparing: 28 vectors.
colnames(jaccard_dist) <- colnames(df[,29:56])
rownames(jaccard_dist) <- colnames(df[,29:56])
# To calculate Jaccard similarity, we have to from one subtract Jaccard distance
jaccard_sim = 1-jaccard_dist
jaccard_sim
## Land_transport_accident Food_water_shortages_death
## Land_transport_accident 1.0000000000 0.0005512679
## Food_water_shortages_death 0.0005512679 1.0000000000
## Boat_transport_accident 0.0000000000 0.0023032630
## Exhaustion_death 0.0000000000 0.0021074816
## Weather_condition_death 0.0007800312 0.2438356164
## Heat_Hyperthermia_death 0.0000000000 0.0045620438
## Hypothermia_death 0.0000000000 0.0019175455
## Murder 0.0046911650 0.0111193477
## War_related_death 0.0000000000 0.0000000000
## Electrocution_fire_death 0.0000000000 0.0000000000
## Poor_health_death 0.0000000000 0.1875584659
## Fall_death 0.0011325028 0.0000000000
## Other_accidents 0.0000000000 0.0000000000
## Hanging_suicide 0.0000000000 0.0000000000
## Hippo_croco_death 0.0000000000 0.0000000000
## Month2 0.0145076227 0.0169662071
## more_death_missing_5 0.0439429929 0.0497142857
## m_123 0.0824388149 0.0647010647
## m_456 0.0818565401 0.0809815951
## m_789 0.0860339506 0.1113671275
## m_101112 0.1013698630 0.1161265432
## Year_2014 0.0153846154 0.0025041736
## Year_2015 0.0606443462 0.0558558559
## Year_2016 0.0452995616 0.0197066911
## Year_2017 0.1048186786 0.1003344482
## Year_2018 0.0939008337 0.0968017058
## Year_2019 0.0944259567 0.1453930685
## Year_2020 0.0098716683 0.0251162791
## Boat_transport_accident Exhaustion_death
## Land_transport_accident 0.0000000000 0.000000e+00
## Food_water_shortages_death 0.0023032630 2.107482e-03
## Boat_transport_accident 1.0000000000 0.000000e+00
## Exhaustion_death 0.0000000000 1.000000e+00
## Weather_condition_death 0.0000000000 0.000000e+00
## Heat_Hyperthermia_death 0.0005503577 0.000000e+00
## Hypothermia_death 0.0017055145 0.000000e+00
## Murder 0.0000000000 0.000000e+00
## War_related_death 0.0000000000 0.000000e+00
## Electrocution_fire_death 0.0005934718 0.000000e+00
## Poor_health_death 0.0000000000 0.000000e+00
## Fall_death 0.0005955926 0.000000e+00
## Other_accidents 0.0000000000 0.000000e+00
## Hanging_suicide 0.0000000000 0.000000e+00
## Hippo_croco_death 0.0000000000 0.000000e+00
## Month2 0.0268618852 5.308573e-05
## more_death_missing_5 0.2507345739 1.119821e-03
## m_123 0.1426110920 6.038647e-04
## m_456 0.1362867771 1.177856e-03
## m_789 0.1284375000 0.000000e+00
## m_101112 0.1373660996 5.130836e-04
## Year_2014 0.0469432314 3.891051e-03
## Year_2015 0.1133603239 1.228501e-03
## Year_2016 0.0900333457 0.000000e+00
## Year_2017 0.1222593831 0.000000e+00
## Year_2018 0.1229088426 1.229256e-03
## Year_2019 0.1087026852 0.000000e+00
## Year_2020 0.0329358319 0.000000e+00
## Weather_condition_death Heat_Hyperthermia_death
## Land_transport_accident 0.0007800312 0.0000000000
## Food_water_shortages_death 0.2438356164 0.0045620438
## Boat_transport_accident 0.0000000000 0.0005503577
## Exhaustion_death 0.0000000000 0.0000000000
## Weather_condition_death 1.0000000000 0.0142602496
## Heat_Hyperthermia_death 0.0142602496 1.0000000000
## Hypothermia_death 0.0019531250 0.0000000000
## Murder 0.0036188179 0.0000000000
## War_related_death 0.0000000000 0.0000000000
## Electrocution_fire_death 0.0000000000 0.0000000000
## Poor_health_death 0.1150000000 0.0000000000
## Fall_death 0.0000000000 0.0000000000
## Other_accidents 0.0000000000 0.0000000000
## Hanging_suicide 0.0000000000 0.0000000000
## Hippo_croco_death 0.0000000000 0.0000000000
## Month2 0.0072522883 0.0026141736
## more_death_missing_5 0.0219263900 0.0028818444
## m_123 0.0428643469 0.0016629712
## m_456 0.0363279332 0.0300668151
## m_789 0.0572963295 0.0447538538
## m_101112 0.0573219884 0.0033444816
## Year_2014 0.0045045045 0.0434782609
## Year_2015 0.0337268128 0.0211640212
## Year_2016 0.0029620853 0.0309575234
## Year_2017 0.0529131986 0.0113864702
## Year_2018 0.0510046368 0.0159908624
## Year_2019 0.0760869565 0.0153601695
## Year_2020 0.0458715596 0.0000000000
## Hypothermia_death Murder War_related_death
## Land_transport_accident 0.000000000 0.004691165 0.0000000000
## Food_water_shortages_death 0.001917546 0.011119348 0.0000000000
## Boat_transport_accident 0.001705514 0.000000000 0.0000000000
## Exhaustion_death 0.000000000 0.000000000 0.0000000000
## Weather_condition_death 0.001953125 0.003618818 0.0000000000
## Heat_Hyperthermia_death 0.000000000 0.000000000 0.0000000000
## Hypothermia_death 1.000000000 0.000000000 0.0000000000
## Murder 0.000000000 1.000000000 0.0000000000
## War_related_death 0.000000000 0.000000000 1.0000000000
## Electrocution_fire_death 0.000000000 0.000000000 0.0000000000
## Poor_health_death 0.000000000 0.002995507 0.0000000000
## Fall_death 0.000000000 0.000000000 0.0000000000
## Other_accidents 0.000000000 0.002227171 0.0000000000
## Hanging_suicide 0.000000000 0.000000000 0.0000000000
## Hippo_croco_death 0.000000000 0.000000000 0.0000000000
## Month2 0.001384168 0.006636352 0.0001930628
## more_death_missing_5 0.011258956 0.022691706 0.0011111111
## m_123 0.025776216 0.047040971 0.0018061409
## m_456 0.007865169 0.054391218 0.0000000000
## m_789 0.004420432 0.051601423 0.0041025641
## m_101112 0.015399901 0.043727915 0.0000000000
## Year_2014 0.002849003 0.012066365 0.0076045627
## Year_2015 0.013377926 0.051369863 0.0000000000
## Year_2016 0.018504811 0.024169184 0.0007763975
## Year_2017 0.018207283 0.066787004 0.0044085231
## Year_2018 0.008782201 0.068550497 0.0006116208
## Year_2019 0.008125677 0.028301887 0.0005640158
## Year_2020 0.016064257 0.012389381 0.0000000000
## Electrocution_fire_death Poor_health_death
## Land_transport_accident 0.0000000000 0.000000000
## Food_water_shortages_death 0.0000000000 0.187558466
## Boat_transport_accident 0.0005934718 0.000000000
## Exhaustion_death 0.0000000000 0.000000000
## Weather_condition_death 0.0000000000 0.115000000
## Heat_Hyperthermia_death 0.0000000000 0.000000000
## Hypothermia_death 0.0000000000 0.000000000
## Murder 0.0000000000 0.002995507
## War_related_death 0.0000000000 0.000000000
## Electrocution_fire_death 1.0000000000 0.000000000
## Poor_health_death 0.0000000000 1.000000000
## Fall_death 0.0000000000 0.000000000
## Other_accidents 0.0000000000 0.000000000
## Hanging_suicide 0.0000000000 0.000000000
## Hippo_croco_death 0.0000000000 0.000000000
## Month2 0.0003403142 0.026572736
## more_death_missing_5 0.0044052863 0.030303030
## m_123 0.0029940120 0.140597540
## m_456 0.0040911748 0.116847826
## m_789 0.0025458248 0.117814277
## m_101112 0.0025471218 0.155076722
## Year_2014 0.0072992701 0.000000000
## Year_2015 0.0060386473 0.019516334
## Year_2016 0.0015408320 0.199832776
## Year_2017 0.0065741417 0.030408948
## Year_2018 0.0006075334 0.149339050
## Year_2019 0.0016835017 0.186350778
## Year_2020 0.0000000000 0.021040327
## Fall_death Other_accidents Hanging_suicide
## Land_transport_accident 0.0011325028 0.0000000000 0.0000000000
## Food_water_shortages_death 0.0000000000 0.0000000000 0.0000000000
## Boat_transport_accident 0.0005955926 0.0000000000 0.0000000000
## Exhaustion_death 0.0000000000 0.0000000000 0.0000000000
## Weather_condition_death 0.0000000000 0.0000000000 0.0000000000
## Heat_Hyperthermia_death 0.0000000000 0.0000000000 0.0000000000
## Hypothermia_death 0.0000000000 0.0000000000 0.0000000000
## Murder 0.0000000000 0.0022271715 0.0000000000
## War_related_death 0.0000000000 0.0000000000 0.0000000000
## Electrocution_fire_death 0.0000000000 0.0000000000 0.0000000000
## Poor_health_death 0.0000000000 0.0000000000 0.0000000000
## Fall_death 1.0000000000 0.0000000000 0.0000000000
## Other_accidents 0.0000000000 1.0000000000 0.0000000000
## Hanging_suicide 0.0000000000 0.0000000000 1.0000000000
## Hippo_croco_death 0.0000000000 0.0000000000 0.0000000000
## Month2 0.0002340954 0.0007388288 0.0001278964
## more_death_missing_5 0.0022123894 0.0221483942 0.0000000000
## m_123 0.0024024024 0.0017825312 0.0018083183
## m_456 0.0035169988 0.0011580776 0.0017626322
## m_789 0.0005096840 0.0060975610 0.0005115090
## m_101112 0.0025549310 0.0081507896 0.0010240655
## Year_2014 0.0074626866 0.0177304965 0.0038167939
## Year_2015 0.0024242424 0.0047619048 0.0000000000
## Year_2016 0.0007733952 0.0107941403 0.0007776050
## Year_2017 0.0043923865 0.0057929037 0.0007331378
## Year_2018 0.0024434942 0.0006035003 0.0006123699
## Year_2019 0.0005624297 0.0005571031 0.0022624434
## Year_2020 0.0000000000 0.0000000000 0.0061349693
## Hippo_croco_death Month2 more_death_missing_5
## Land_transport_accident 0.000000e+00 1.450762e-02 0.043942993
## Food_water_shortages_death 0.000000e+00 1.696621e-02 0.049714286
## Boat_transport_accident 0.000000e+00 2.686189e-02 0.250734574
## Exhaustion_death 0.000000e+00 5.308573e-05 0.001119821
## Weather_condition_death 0.000000e+00 7.252288e-03 0.021926390
## Heat_Hyperthermia_death 0.000000e+00 2.614174e-03 0.002881844
## Hypothermia_death 0.000000e+00 1.384168e-03 0.011258956
## Murder 0.000000e+00 6.636352e-03 0.022691706
## War_related_death 0.000000e+00 1.930628e-04 0.001111111
## Electrocution_fire_death 0.000000e+00 3.403142e-04 0.004405286
## Poor_health_death 0.000000e+00 2.657274e-02 0.030303030
## Fall_death 0.000000e+00 2.340954e-04 0.002212389
## Other_accidents 0.000000e+00 7.388288e-04 0.022148394
## Hanging_suicide 0.000000e+00 1.278964e-04 0.000000000
## Hippo_croco_death 1.000000e+00 6.273844e-05 0.000000000
## Month2 6.273844e-05 1.000000e+00 0.014876211
## more_death_missing_5 0.000000e+00 1.487621e-02 1.000000000
## m_123 0.000000e+00 7.747714e-03 0.087681779
## m_456 0.000000e+00 2.129529e-02 0.085642317
## m_789 1.026694e-03 3.893619e-02 0.081173780
## m_101112 5.133470e-04 5.399474e-02 0.104361371
## Year_2014 0.000000e+00 3.965545e-03 0.067164179
## Year_2015 0.000000e+00 1.638690e-02 0.092485549
## Year_2016 0.000000e+00 2.048982e-02 0.107818089
## Year_2017 7.363770e-04 2.327859e-02 0.095609756
## Year_2018 1.230012e-03 2.623128e-02 0.052301255
## Year_2019 0.000000e+00 2.911251e-02 0.048616601
## Year_2020 0.000000e+00 6.371687e-04 0.018518519
## m_123 m_456 m_789 m_101112
## Land_transport_accident 0.0824388149 0.081856540 0.086033951 0.1013698630
## Food_water_shortages_death 0.0647010647 0.080981595 0.111367127 0.1161265432
## Boat_transport_accident 0.1426110920 0.136286777 0.128437500 0.1373660996
## Exhaustion_death 0.0006038647 0.001177856 0.000000000 0.0005130836
## Weather_condition_death 0.0428643469 0.036327933 0.057296329 0.0573219884
## Heat_Hyperthermia_death 0.0016629712 0.030066815 0.044753854 0.0033444816
## Hypothermia_death 0.0257762156 0.007865169 0.004420432 0.0153999006
## Murder 0.0470409712 0.054391218 0.051601423 0.0437279152
## War_related_death 0.0018061409 0.000000000 0.004102564 0.0000000000
## Electrocution_fire_death 0.0029940120 0.004091175 0.002545825 0.0025471218
## Poor_health_death 0.1405975395 0.116847826 0.117814277 0.1550767222
## Fall_death 0.0024024024 0.003516999 0.000509684 0.0025549310
## Other_accidents 0.0017825312 0.001158078 0.006097561 0.0081507896
## Hanging_suicide 0.0018083183 0.001762632 0.000511509 0.0010240655
## Hippo_croco_death 0.0000000000 0.000000000 0.001026694 0.0005133470
## Month2 0.0077477137 0.021295290 0.038936185 0.0539947400
## more_death_missing_5 0.0876817793 0.085642317 0.081173780 0.1043613707
## m_123 1.0000000000 0.000000000 0.000000000 0.0000000000
## m_456 0.0000000000 1.000000000 0.000000000 0.0000000000
## m_789 0.0000000000 0.000000000 1.000000000 0.0000000000
## m_101112 0.0000000000 0.000000000 0.000000000 1.0000000000
## Year_2014 0.0269251481 0.038891849 0.041646947 0.0199350950
## Year_2015 0.0361648444 0.053361345 0.099242726 0.1449335548
## Year_2016 0.1338491296 0.103525046 0.106346484 0.1190558834
## Year_2017 0.1034103410 0.115904936 0.123851650 0.1350979718
## Year_2018 0.1241426612 0.149930748 0.148183864 0.1152404747
## Year_2019 0.1247942048 0.153384461 0.146831530 0.1377300613
## Year_2020 0.0891566265 0.003796095 0.000000000 0.0000000000
## Year_2014 Year_2015 Year_2016 Year_2017
## Land_transport_accident 0.015384615 0.060644346 0.0452995616 0.1048186786
## Food_water_shortages_death 0.002504174 0.055855856 0.0197066911 0.1003344482
## Boat_transport_accident 0.046943231 0.113360324 0.0900333457 0.1222593831
## Exhaustion_death 0.003891051 0.001228501 0.0000000000 0.0000000000
## Weather_condition_death 0.004504505 0.033726813 0.0029620853 0.0529131986
## Heat_Hyperthermia_death 0.043478261 0.021164021 0.0309575234 0.0113864702
## Hypothermia_death 0.002849003 0.013377926 0.0185048113 0.0182072829
## Murder 0.012066365 0.051369863 0.0241691843 0.0667870036
## War_related_death 0.007604563 0.000000000 0.0007763975 0.0044085231
## Electrocution_fire_death 0.007299270 0.006038647 0.0015408320 0.0065741417
## Poor_health_death 0.000000000 0.019516334 0.1998327759 0.0304089479
## Fall_death 0.007462687 0.002424242 0.0007733952 0.0043923865
## Other_accidents 0.017730496 0.004761905 0.0107941403 0.0057929037
## Hanging_suicide 0.003816794 0.000000000 0.0007776050 0.0007331378
## Hippo_croco_death 0.000000000 0.000000000 0.0000000000 0.0007363770
## Month2 0.003965545 0.016386902 0.0204898214 0.0232785925
## more_death_missing_5 0.067164179 0.092485549 0.1078180889 0.0956097561
## m_123 0.026925148 0.036164844 0.1338491296 0.1034103410
## m_456 0.038891849 0.053361345 0.1035250464 0.1159049360
## m_789 0.041646947 0.099242726 0.1063464837 0.1238516502
## m_101112 0.019935095 0.144933555 0.1190558834 0.1350979718
## Year_2014 1.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2015 0.000000000 1.000000000 0.0000000000 0.0000000000
## Year_2016 0.000000000 0.000000000 1.0000000000 0.0000000000
## Year_2017 0.000000000 0.000000000 0.0000000000 1.0000000000
## Year_2018 0.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2019 0.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2020 0.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2018 Year_2019 Year_2020
## Land_transport_accident 0.0939008337 0.0944259567 0.0098716683
## Food_water_shortages_death 0.0968017058 0.1453930685 0.0251162791
## Boat_transport_accident 0.1229088426 0.1087026852 0.0329358319
## Exhaustion_death 0.0012292563 0.0000000000 0.0000000000
## Weather_condition_death 0.0510046368 0.0760869565 0.0458715596
## Heat_Hyperthermia_death 0.0159908624 0.0153601695 0.0000000000
## Hypothermia_death 0.0087822014 0.0081256771 0.0160642570
## Murder 0.0685504971 0.0283018868 0.0123893805
## War_related_death 0.0006116208 0.0005640158 0.0000000000
## Electrocution_fire_death 0.0006075334 0.0016835017 0.0000000000
## Poor_health_death 0.1493390497 0.1863507779 0.0210403273
## Fall_death 0.0024434942 0.0005624297 0.0000000000
## Other_accidents 0.0006035003 0.0005571031 0.0000000000
## Hanging_suicide 0.0006123699 0.0022624434 0.0061349693
## Hippo_croco_death 0.0012300123 0.0000000000 0.0000000000
## Month2 0.0262312792 0.0291125062 0.0006371687
## more_death_missing_5 0.0523012552 0.0486166008 0.0185185185
## m_123 0.1241426612 0.1247942048 0.0891566265
## m_456 0.1499307479 0.1533844615 0.0037960954
## m_789 0.1481838637 0.1468315301 0.0000000000
## m_101112 0.1152404747 0.1377300613 0.0000000000
## Year_2014 0.0000000000 0.0000000000 0.0000000000
## Year_2015 0.0000000000 0.0000000000 0.0000000000
## Year_2016 0.0000000000 0.0000000000 0.0000000000
## Year_2017 0.0000000000 0.0000000000 0.0000000000
## Year_2018 1.0000000000 0.0000000000 0.0000000000
## Year_2019 0.0000000000 1.0000000000 0.0000000000
## Year_2020 0.0000000000 0.0000000000 1.0000000000
Correlation (association) between variables with value exceeding 0.15
ifelse(((jaccard_sim<=-0.15 | jaccard_sim>=0.15) & (jaccard_sim!=1))==FALSE, NA, jaccard_sim)
## Land_transport_accident Food_water_shortages_death
## Land_transport_accident NA NA
## Food_water_shortages_death NA NA
## Boat_transport_accident NA NA
## Exhaustion_death NA NA
## Weather_condition_death NA 0.2438356
## Heat_Hyperthermia_death NA NA
## Hypothermia_death NA NA
## Murder NA NA
## War_related_death NA NA
## Electrocution_fire_death NA NA
## Poor_health_death NA 0.1875585
## Fall_death NA NA
## Other_accidents NA NA
## Hanging_suicide NA NA
## Hippo_croco_death NA NA
## Month2 NA NA
## more_death_missing_5 NA NA
## m_123 NA NA
## m_456 NA NA
## m_789 NA NA
## m_101112 NA NA
## Year_2014 NA NA
## Year_2015 NA NA
## Year_2016 NA NA
## Year_2017 NA NA
## Year_2018 NA NA
## Year_2019 NA NA
## Year_2020 NA NA
## Boat_transport_accident Exhaustion_death
## Land_transport_accident NA NA
## Food_water_shortages_death NA NA
## Boat_transport_accident NA NA
## Exhaustion_death NA NA
## Weather_condition_death NA NA
## Heat_Hyperthermia_death NA NA
## Hypothermia_death NA NA
## Murder NA NA
## War_related_death NA NA
## Electrocution_fire_death NA NA
## Poor_health_death NA NA
## Fall_death NA NA
## Other_accidents NA NA
## Hanging_suicide NA NA
## Hippo_croco_death NA NA
## Month2 NA NA
## more_death_missing_5 0.2507346 NA
## m_123 NA NA
## m_456 NA NA
## m_789 NA NA
## m_101112 NA NA
## Year_2014 NA NA
## Year_2015 NA NA
## Year_2016 NA NA
## Year_2017 NA NA
## Year_2018 NA NA
## Year_2019 NA NA
## Year_2020 NA NA
## Weather_condition_death Heat_Hyperthermia_death
## Land_transport_accident NA NA
## Food_water_shortages_death 0.2438356 NA
## Boat_transport_accident NA NA
## Exhaustion_death NA NA
## Weather_condition_death NA NA
## Heat_Hyperthermia_death NA NA
## Hypothermia_death NA NA
## Murder NA NA
## War_related_death NA NA
## Electrocution_fire_death NA NA
## Poor_health_death NA NA
## Fall_death NA NA
## Other_accidents NA NA
## Hanging_suicide NA NA
## Hippo_croco_death NA NA
## Month2 NA NA
## more_death_missing_5 NA NA
## m_123 NA NA
## m_456 NA NA
## m_789 NA NA
## m_101112 NA NA
## Year_2014 NA NA
## Year_2015 NA NA
## Year_2016 NA NA
## Year_2017 NA NA
## Year_2018 NA NA
## Year_2019 NA NA
## Year_2020 NA NA
## Hypothermia_death Murder War_related_death
## Land_transport_accident NA NA NA
## Food_water_shortages_death NA NA NA
## Boat_transport_accident NA NA NA
## Exhaustion_death NA NA NA
## Weather_condition_death NA NA NA
## Heat_Hyperthermia_death NA NA NA
## Hypothermia_death NA NA NA
## Murder NA NA NA
## War_related_death NA NA NA
## Electrocution_fire_death NA NA NA
## Poor_health_death NA NA NA
## Fall_death NA NA NA
## Other_accidents NA NA NA
## Hanging_suicide NA NA NA
## Hippo_croco_death NA NA NA
## Month2 NA NA NA
## more_death_missing_5 NA NA NA
## m_123 NA NA NA
## m_456 NA NA NA
## m_789 NA NA NA
## m_101112 NA NA NA
## Year_2014 NA NA NA
## Year_2015 NA NA NA
## Year_2016 NA NA NA
## Year_2017 NA NA NA
## Year_2018 NA NA NA
## Year_2019 NA NA NA
## Year_2020 NA NA NA
## Electrocution_fire_death Poor_health_death
## Land_transport_accident NA NA
## Food_water_shortages_death NA 0.1875585
## Boat_transport_accident NA NA
## Exhaustion_death NA NA
## Weather_condition_death NA NA
## Heat_Hyperthermia_death NA NA
## Hypothermia_death NA NA
## Murder NA NA
## War_related_death NA NA
## Electrocution_fire_death NA NA
## Poor_health_death NA NA
## Fall_death NA NA
## Other_accidents NA NA
## Hanging_suicide NA NA
## Hippo_croco_death NA NA
## Month2 NA NA
## more_death_missing_5 NA NA
## m_123 NA NA
## m_456 NA NA
## m_789 NA NA
## m_101112 NA 0.1550767
## Year_2014 NA NA
## Year_2015 NA NA
## Year_2016 NA 0.1998328
## Year_2017 NA NA
## Year_2018 NA NA
## Year_2019 NA 0.1863508
## Year_2020 NA NA
## Fall_death Other_accidents Hanging_suicide
## Land_transport_accident NA NA NA
## Food_water_shortages_death NA NA NA
## Boat_transport_accident NA NA NA
## Exhaustion_death NA NA NA
## Weather_condition_death NA NA NA
## Heat_Hyperthermia_death NA NA NA
## Hypothermia_death NA NA NA
## Murder NA NA NA
## War_related_death NA NA NA
## Electrocution_fire_death NA NA NA
## Poor_health_death NA NA NA
## Fall_death NA NA NA
## Other_accidents NA NA NA
## Hanging_suicide NA NA NA
## Hippo_croco_death NA NA NA
## Month2 NA NA NA
## more_death_missing_5 NA NA NA
## m_123 NA NA NA
## m_456 NA NA NA
## m_789 NA NA NA
## m_101112 NA NA NA
## Year_2014 NA NA NA
## Year_2015 NA NA NA
## Year_2016 NA NA NA
## Year_2017 NA NA NA
## Year_2018 NA NA NA
## Year_2019 NA NA NA
## Year_2020 NA NA NA
## Hippo_croco_death Month2 more_death_missing_5 m_123
## Land_transport_accident NA NA NA NA
## Food_water_shortages_death NA NA NA NA
## Boat_transport_accident NA NA 0.2507346 NA
## Exhaustion_death NA NA NA NA
## Weather_condition_death NA NA NA NA
## Heat_Hyperthermia_death NA NA NA NA
## Hypothermia_death NA NA NA NA
## Murder NA NA NA NA
## War_related_death NA NA NA NA
## Electrocution_fire_death NA NA NA NA
## Poor_health_death NA NA NA NA
## Fall_death NA NA NA NA
## Other_accidents NA NA NA NA
## Hanging_suicide NA NA NA NA
## Hippo_croco_death NA NA NA NA
## Month2 NA NA NA NA
## more_death_missing_5 NA NA NA NA
## m_123 NA NA NA NA
## m_456 NA NA NA NA
## m_789 NA NA NA NA
## m_101112 NA NA NA NA
## Year_2014 NA NA NA NA
## Year_2015 NA NA NA NA
## Year_2016 NA NA NA NA
## Year_2017 NA NA NA NA
## Year_2018 NA NA NA NA
## Year_2019 NA NA NA NA
## Year_2020 NA NA NA NA
## m_456 m_789 m_101112 Year_2014 Year_2015
## Land_transport_accident NA NA NA NA NA
## Food_water_shortages_death NA NA NA NA NA
## Boat_transport_accident NA NA NA NA NA
## Exhaustion_death NA NA NA NA NA
## Weather_condition_death NA NA NA NA NA
## Heat_Hyperthermia_death NA NA NA NA NA
## Hypothermia_death NA NA NA NA NA
## Murder NA NA NA NA NA
## War_related_death NA NA NA NA NA
## Electrocution_fire_death NA NA NA NA NA
## Poor_health_death NA NA 0.1550767 NA NA
## Fall_death NA NA NA NA NA
## Other_accidents NA NA NA NA NA
## Hanging_suicide NA NA NA NA NA
## Hippo_croco_death NA NA NA NA NA
## Month2 NA NA NA NA NA
## more_death_missing_5 NA NA NA NA NA
## m_123 NA NA NA NA NA
## m_456 NA NA NA NA NA
## m_789 NA NA NA NA NA
## m_101112 NA NA NA NA NA
## Year_2014 NA NA NA NA NA
## Year_2015 NA NA NA NA NA
## Year_2016 NA NA NA NA NA
## Year_2017 NA NA NA NA NA
## Year_2018 NA NA NA NA NA
## Year_2019 0.1533845 NA NA NA NA
## Year_2020 NA NA NA NA NA
## Year_2016 Year_2017 Year_2018 Year_2019 Year_2020
## Land_transport_accident NA NA NA NA NA
## Food_water_shortages_death NA NA NA NA NA
## Boat_transport_accident NA NA NA NA NA
## Exhaustion_death NA NA NA NA NA
## Weather_condition_death NA NA NA NA NA
## Heat_Hyperthermia_death NA NA NA NA NA
## Hypothermia_death NA NA NA NA NA
## Murder NA NA NA NA NA
## War_related_death NA NA NA NA NA
## Electrocution_fire_death NA NA NA NA NA
## Poor_health_death 0.1998328 NA NA 0.1863508 NA
## Fall_death NA NA NA NA NA
## Other_accidents NA NA NA NA NA
## Hanging_suicide NA NA NA NA NA
## Hippo_croco_death NA NA NA NA NA
## Month2 NA NA NA NA NA
## more_death_missing_5 NA NA NA NA NA
## m_123 NA NA NA NA NA
## m_456 NA NA NA 0.1533845 NA
## m_789 NA NA NA NA NA
## m_101112 NA NA NA NA NA
## Year_2014 NA NA NA NA NA
## Year_2015 NA NA NA NA NA
## Year_2016 NA NA NA NA NA
## Year_2017 NA NA NA NA NA
## Year_2018 NA NA NA NA NA
## Year_2019 NA NA NA NA NA
## Year_2020 NA NA NA NA NA
Comment to the correlation matrix
The above matrices are, looking from the top,: a correlation matrix from features that we use in the Naive Bayes model, and another matrix that shows correlation value only if it exceeds absolute value of 0.15.
To measure the association between dichotomous variables we can use: Phi coefficient, Cramer’s V or Pearson’s R correlation, which yield exactly the same value in the case of binary data.
However, those aforementioned statistics treat equality 0=0 and 1=1 as equally important. In our dataset, the absence of a given cause of death (0) in two observations doesn’t translate into the same similarity as presence (1) of two the same causes of death. For instance, we have causes of death v and w.
v = (0, 0, 0, 0, 0, 1)
w = (1, 0, 0, 0, 0, 0), which according to standard measure of correlation are pretty much the same - in 4 cases (0=0) and 2 cases (1=0) or (0=1).
Now the other example
a = (0, 1, 0, 1, 0, 1)
b = (1, 1, 1, 0, 0, 0), this in turn has lower correlation (which is counter-intuitive) - 2 cases (0=0) or (1=1) and 4 cases (1=0) or (0=1).
Therefore, we are using Jaccard similarity (which ignores identity 0=0), whose formula is show below.
Where
M11 represents the total number of attributes where A and B both have a value of 1
M10 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1
M01 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0
M00 represents the total number of attributes where A and B both have a value of 0.
As we suspected there is no strong relationship between the time and the causes of deaths.
Yet it seems that more migrants decide to cross the Meditarrenaen from October to March. It’s hard to say what might be the reason for that. I suspect (althought I am not certain) that from March to October the weather might be more severe, as the exposure to the Sun’s rays intensifies and temperatures are higher, and the migrants cannot take much water with them given how crammed those boats are. Also, they are not well protected against the sun.
There seems to be some association between weather conditions and Food/water shortages.
A possible links is again between high temperatures (severe weather conditions) and lack of water leading to deahydration and other medical ailments (Food/water shortages).
However (Pedro Domingos, Michael Pazzani, 1997, https://link.springer.com/content/pdf/10.1023%2FA%3A1007413511361.pdf) shows that when the assumption of independecy doesn’t hold the posteriori probability estimates may be biased, but the classifier itself (which classifies a observation to a class having the highest probability) proves to be quite effective.
Thus, we shouldn’t bother too much over the dependence of features (which, in fact, is negligible).
#3.2. Model
Although the Naive Bayes model is high-bias model (and not high-variance), it still can overfit, hence we will divide our dataset into training and test set. In our model, we will want to predict whether a given accident claims more than 5 lives. For the time being, we had been going through the quantitative analysis, checking clusters of accidents based on their location on the globe. Now, I would like to take a more qualitative approach trying to find out what factors has a significant impact on the gravity of each accident (whether more than 5 people died in it).
While creating traninig and test sets, it’s important to keep, in every set, observations that correspond to the labels that have the same ratio (of 0s and 1s) as in the whole dataset. It is especially crucial in NB classifier which assumes the prior probability (by default) basing on distribiution of the ratio of a target variable. The createDataPartition does stratified partitioning by default.
Moreover, as some features (causes of deaths) occur in a few observations, we set Laplace smooting, lest there be, in training set, a feature vector made of only 0s.
Choosing only those features we want in the model
df_model = as.data.frame(df[, c("Boat_transport_accident", "Land_transport_accident",
"Food_water_shortages_death", "Poor_health_death",
"m_123", "m_456", "m_789", "m_101112", "Year_2014",
"Year_2015", "Year_2016", "Year_2017", "Year_2018",
"Year_2019","Year_2020")])
label = as.factor(df$more_death_missing_5)
ind <- createDataPartition(df_model$Boat_transport_accident, p=0.3)[[1]]
train_x <- sapply(df_model[ind, ], as.numeric)
train_y <- as.factor(label[ind])
test_x <- sapply(df_model[-ind, ], as.numeric)
test_y <- as.factor(label[-ind])
Missing <- read.csv("MissingMigrants-Global-2020-04-11T05-24-09.csv")
We call again for new dataset for this EDA.
data <- Missing %>%
replace_na(list(Number.Dead = 0,
Minimum.Estimated.Number.of.Missing = 0,
Total.Dead.and.Missing = 0,
Number.of.Survivors = 0,
Number.of.Females=0,
Number.of.Males=0,
Number.of.Children = 0))
data <- data %>%
separate(Location.Coordinates, c("longitude","latitude"),",")
data[,c("longitude","latitude")] <- lapply(data[,c("longitude","latitude")],as.numeric)
We replace NA and separate “Location.Coordinates” to longitude and latitude and change to numeric.
Now call library leaflet for build the maps with infomation
library(leaflet)
leaflet() %>%
addTiles() %>%
addMarkers(data = data,
lat = ~latitude,
lng = ~longitude,
popup = paste("Area :",
data$Location.Description, "<br>",
"Time Accident:",
data$Reported.Date,"<br>",
"Number of Death:",
data$Number.Dead, "<br>",
"<a href =",
data$URL,">", "Link for the News", "</a>"),
clusterOptions = markerClusterOptions())%>% addProviderTiles(providers$Stamen.Toner)
as you can see, you can choose for the maps and link to news.
And i will make Build Model , fitting the model, and Interpretation the model. and you can see my another project like :
https://rpubs.com/edgarna70/TS_Scotty
https://rpubs.com/edgarna70/TS_FnB https://rpubs.com/edgarna70/CL-TC https://justedgar.shinyapps.io/CapstoneDV/