#1. Introduction

The dataset comes from the Missing Migrant Project that tracks the deaths of migrants that have perished on the way to their destination (their country of asylum). For more information, look up this website: https://www.iom.int/.

I hope that monitoring these tragic events and studying the data relating to them can help to avoid a recurrence of these. We are all human beings. ##1.1 Loading packages and datasets

# Loading libraries

library(naivebayes)          
library(plyr)                
library(car)                 
library(qdapTools)          
library(rgeos)               
library(dismo)               
library(geosphere)           
library(rgdal)               
library(ggspatial)           
library(viridis)             
library(rnaturalearth)       
library(sf)                 
library(dplyr)               
library(tidyverse)          
library(Amelia)              
library(VIM)                 
library(stringr)            
library(sp)                  
library(magrittr)           
library(ggpubr)            
library(ggthemes)          
library(philentropy) 
library(reshape2)       
library(caret)              
library(MASS)                

Loading dataset

rm(list=ls())

df <- read.csv('MissingMigrants-Global-2020-04-11T05-24-09.csv')

##1.2. Quick look at the dataset
Let’s look at the dataset

head(df)
##   Web.ID           Region  Reported.Date Reported.Year Reported.Month
## 1  53711 US-Mexico Border April 08, 2020          2020            Apr
## 2  53710  Central America April 05, 2020          2020            Apr
## 3  53705 US-Mexico Border April 05, 2020          2020            Apr
## 4  53702  Central America April 05, 2020          2020            Apr
## 5  53701 US-Mexico Border April 05, 2020          2020            Apr
## 6  53700    Mediterranean April 01, 2020          2020            Apr
##   Number.Dead Minimum.Estimated.Number.of.Missing Total.Dead.and.Missing
## 1           1                                  NA                      1
## 2           1                                  NA                      1
## 3           1                                  NA                      1
## 4           1                                  NA                      1
## 5           1                                  NA                      1
## 6           1                                  NA                      1
##   Number.of.Survivors Number.of.Females Number.of.Males Number.of.Children
## 1                  NA                NA               1                 NA
## 2                  NA                NA               1                 NA
## 3                  NA                NA               1                 NA
## 4                  NA                NA               1                 NA
## 5                  NA                NA               1                 NA
## 6                  NA                NA              NA                 NA
##               Cause.of.Death                               Location.Description
## 1          Presumed drowning        Rio Bravo, Nuevo Laredo, Tamaulipas, Mexico
## 2 Accident (fell from train)             San Lorenzo Soltepec, Tlaxcala, Mexico
## 3          Presumed drowning Rio Bravo, Eagle Pass, Maverick County, Texas, USA
## 4          Run over by train                            Ixtepec, Oaxaca, Mexico
## 5                   Drowning       Las Cruces, Piedras Negras, Coahuila, Mexico
## 6          Presumed drowning Remains recovered on Al Maya Shore, Zawiyah, Libya
##                    Information.Source               Location.Coordinates
## 1 Notiviza Tamaulipas, Hoy Tamaulipas  27.499560110910, -99.529145296143
## 2                          El Gritón  19.604444400000, -98.311944400000
## 3                             Zócalo 28.711778048640, -100.507718230750
## 4    Instituto Nacional de Migración  16.561126700000, -95.096314100000
## 5                             Zócalo 28.709460903906, -100.510735628340
## 6                           IOM Libya   32.792615604333, 12.733749617969
##         Migration.Route                                                URL
## 1                       https://archive.ph/MxbRb, https://archive.vn/aXBY7
## 2                                                 https://archive.ph/93ZaE
## 3                                                 https://archive.ph/Nx54L
## 4                                                 https://archive.ph/zpj1a
## 5                                                 https://archive.ph/3JNKz
## 6 Central Mediterranean                                                   
##   UNSD.Geographical.Grouping Source.Quality
## 1            Central America              3
## 2            Central America              1
## 3           Northern America              1
## 4            Central America              5
## 5            Central America              1
## 6              Uncategorized              4

The dataset contains 5333 observations that have been collected from all over the world. The description of the variables, you can find below. Let’s see variables from our dataset: Variable name Description Web ID : An automatically generated ID of WEB
Region of accident : The region in which an incident took place
Reported date : Estimated date of death or finding a body
Reported year : The year in which the incident occurred
Reported month : The month in which the incident occurred
Number dead : The total number of people confirmed dead in one incident
Minimum Estimated Number of Missing : The total number of those who are missing and are thus assumed to be dead (minimum)
Total dead and missing : The sum of the ‘number dead’ and ‘Minimum Estimated Number of Missing’ variables
Number of Survivors : The number of survivors
Number of Females : The number of females
Number of Males : The number of males
Number of children : The number of children
Cause of Death : The cause of death
Location Description : Where the incident happened
Information Source : The source of information
Location Coordinates : The Coordinates of location
Migration Route : The route of migration
URL : The URL of a site where information about a given incident is written
UNSD Geographical Grouping : The grouping of incidents based on their geo-location made by the UNSD
Source Quality : How much a source of information is reliable

###Dimensions of the dataset

dim(df)
## [1] 7242   20

##2. Missing values Before we dive into details, we should see how many missing values there are in the dataset how they are distributed.
###Missing values

Changing the names of columns so that they are more readable

colnames(df) <- c('ID', 'Region', 'Date', 'Year', 'Month', 'Dead', 'Minimum_Missing', 'Total_Dead_Missing', 'Survivors', 'Females', 'Males', 'Children',

                'Death_Cause', 'Location_Description', 'Source', 'Coordinates', 'Route', 'URL', 'Geo_Grouping', 'Source_Quality')

Creating a data frame of information about missing datas

missing.values <- df %>%

  gather(key = "key", value = "val") %>%

  dplyr::mutate(is.missing = is.na(val)) %>%

  group_by(key, is.missing) %>%

  dplyr::summarise(num.missing = n()) %>%

  filter(is.missing==T) %>%

  dplyr::select(-is.missing) %>%

  dplyr::arrange(desc(num.missing))
## Warning: attributes are not identical across measure variables;
## they will be dropped
colnames(missing.values) <- c('Columns_from_df', 'No_of_missings')

Further operations on missing values

missing.values <- data.frame(missing.values)

missing.values$Total <- dim(df)[1]

missing.values[, 'Percent_of_missing'] <- as.vector(paste0(round(missing.values$No_of_missing/missing.values$Total, 3)*100, " %"))

missing.values[, 'Fraction_of_missing'] <- round(missing.values$No_of_missing/missing.values$Total, 3)

missing.values <- missing.values[, -3]

missing.values$Columns_from_df <- factor(missing.values$Columns_from_df, level=missing.values$Columns_from_df)

x=missing.values[, -3]

x
##   Columns_from_df No_of_missings Fraction_of_missing
## 1 Minimum_Missing           6623               0.915
## 2       Survivors           6279               0.867
## 3        Children           6204               0.857
## 4         Females           5672               0.783
## 5           Males           3090               0.427
## 6            Dead            247               0.034
options(repr.plot.width = 12 , repr.plot.height = 9)

ggplot(missing.values)+

    geom_col(aes(x=Columns_from_df, y=Fraction_of_missing, fill=Columns_from_df), width=1)+

    coord_polar(theta='x')+

    geom_text(aes(x=Columns_from_df, y=0.6, label=Percent_of_missing), size=12)+

    scale_fill_brewer(palette="Blues")+

    labs(title="Percentage of missing data", subtitle="From 0% to 100%", x="", y="")+

    theme(legend.position = "none", axis.text.x=element_text(size=10), plot.title = element_text(size = 25, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 20, face = "bold", hjust=0.48))+

    scale_y_continuous(breaks=c(0, 0.99), labels=c('0 %', '100 %'))

There are 4 variables (Minimum missing, Children, Survivors, Females), whose ratio of missing data exceeds 80%.

Lack of data in those instances probabably stems from the fact that:

  1. as for Minimum missing, many sources of information may not record the number of missing, only those who has perished - that’s why the ratio of missing data in the Dead variable amounts only to 3,6%.
  2. as for: Children, Survivors, Females, that is probably, in most cases, extremely hard to establish whether a perished migrant was female, male or a child.

We know how many missing values there are.

However it will be most informative if we could see how they are distributed, and check whether they may be some pattern involved in it. To make it more transparent, we will sort the dataframe and clump missing values together for visualisation purposes. ###Distribution of missings values

d <- df %>%

  arrange(desc(is.na(Dead)), 

          desc(is.na(Males)),

          desc(is.na(Females)),

          desc(is.na(Survivors)),

          desc(is.na(Children)),

          desc(is.na(Minimum_Missing)),

          desc(is.na(Death_Cause)))

ggplot_missing <- function(x){

  x %>% 

    is.na %>%

    melt %>%

    ggplot(data = .,

           aes(x = Var2,

               y = Var1)) +

    geom_raster(aes(fill = value)) +

    scale_fill_grey(name = "",

                    labels = c("Present","Missing")) +

    theme_minimal() + 

    theme(axis.text.x  = element_text(angle=45, vjust=0.5, size=15),  plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48),

          axis.text.y=element_text(size=15), axis.title.x=element_text(size=20), axis.title.y=element_text(size=20)) + 

    labs(x = "Variables in Dataset",

         y = "Rows / observations", title='Missing values in dataset', subtitle='Grouped to see overlap')

}


d %>%

is.na %>%

melt %>%

ggplot(data=., aes(x=Var2, y=Var1)) +

geom_raster(aes(fill=value)) +

scale_fill_grey(name="", labels= c('Present', 'M'))

ggplot_missing(d)

As can be seen from the above infographic, missings do not overlap and virtually every observation contains at least one missing. Consequently, we cannot deem some observations corrupted and delete them, in return, it will be best to exclude those 4 variables from our analysis.

#2. Exploratory analysis ##2.1 Analysing the source of information In this section, we will try to derive some information about the source of information on each accident (i.e., observation) from the raw feature “URL”.

At the beginning, it’d be best if we saw the unique beginnings of each URL.

We know that URL, in most cases, indicates of a source at its beginnings, and the further to the end, the more detailed pieces of information about the website, it provides with.

Let’s see then unique beginnings of URL (10 first characters)

unique(str_sub(df$URL, 1, 10))
##  [1] "https://ar" ""           "https://bi" "http://www" "http://bit"
##  [6] "http://mvr" "https://re" "http://arc" "https://ww" "https://va"
## [11] "https://wa" "https://hc" "https://t1" "http://en." "http://hcg"
## [16] "https://tw" "https://cn" "https://yh" "http://wat" "https://bb"
## [21] "https://ny" "https://ed" "https://hr" "https://tl" "http://alw"
## [26] "https://go" "https://ab" "https://us" "http://hum" "http://reu"
## [31] "http://tms" "http://bbc" "http://wap" "http://nyt" "http://elu"
## [36] "http://fb." "https://dr" "http://dai" "http://elm" "http://ind"
## [41] "http://sum" "https://t." "http://on." "https://tg" "http://apn"
## [46] "http://lat" "http://yho" "http://f24" "http://shs" "http://hrl"
## [51] "http://abc" "http://fxn" "http://ab." "https://ca" "http://huf"
## [56] "http://1.u" "http://new" "http://cnn" "https://qu" "http://nbc"
## [61] "https://m." "http://act" "http://ow." "https://au" "http://rel"

There are four distinctive ways, in which URL can begin:

https://www.[source] or http://[source] or http://www[source] or https://[source]

Where [source] is the wanted source ?

By comparing the number of missing values generated from two different codings, we see that we those four options are the only ones

sum(is.na(str_match(df$URL, 'https://www.*|http://www.*|http://(?!ww).*|https://(?!ww).*')))
## [1] 2987
sum(df$URL=="")
## [1] 2987

We extract substrings that follows aforementioned “beginnings” to the point where there is a dot.

url1 <- str_extract(df$URL, '(?<=(http://www.|https://www.))[^\\.]*')

url2 <- str_extract(df$URL, '(?<=(https://))[^\\.(www)]*')

url3 <- str_extract(df$URL, '(?<=(http://))[^\\.(www)]*')

Converting NAs into blank spaces

url1 <- ifelse(is.na(url1), "", url1)

url2 <- ifelse(is.na(url2), "", url2)

url3 <- ifelse(is.na(url3), "", url3)

Merging 3 lists

url <- url1

url <- ifelse(url2!="", url2, url)

url <- ifelse(url3!="", url3, url)

url <- ifelse(url1=="", NA, url)

We would like to store only sources that appear more than 5 or 50 times

less_5_set <- rownames(data.frame('url'=table(url)[table(url)<6]))

less_50_set <- data.frame('url'=table(url)[table(url)<50])[, 'url.url']

url5 <- ifelse(url %in% less_5_set, NA, url)

url50 <- ifelse(url %in% less_50_set, NA, url)

Keeping it in forms of dataframes

url5_tab <- data.frame(table(url5))

url50_tab <- data.frame(table(url50))

Ordering decreasingly

url5_tab <- url5_tab %>%

  arrange(desc(Freq))

url50_tab <- url50_tab %>%

  arrange(desc(Freq))


url5_tab$url5 <- factor(url5_tab$url5, level=url5_tab$url5)

url50_tab$url50 <- factor(url50_tab$url50, level=url50_tab$url50)


df$url5 <- url5

df$url50 <- url50

df$url5 <- factor(df$url5, level=url5_tab$url5)

df$url50 <- factor(df$url50, level=url50_tab$url50)
options(repr.plot.width = 12 , repr.plot.height = 8)
df %>%

    filter(!is.na(url5)) %>%

    group_by(url5) %>%

    summarize(count=n()) %>%

    ggplot() +

        geom_linerange(aes(x = url5, ymin = 0, ymax = count), color = "lightgray", size = 1.5)+

        geom_point(aes(x = url5, y=count, color = url5), size = 1)+

        ggpubr::color_palette("jco")+

        theme_pubclean()+

        labs(title="Sources of information on incidents", subtitle="From the most common to the least", x="Website source", y="Number of occurences")+

        theme(legend.position = "none", axis.text.x=element_text(size=12,  angle=90), plot.title = element_text(size = 20, face = "bold", hjust=0.5),

              plot.subtitle = element_text(size = 15, face = "bold", hjust=0.48), axis.text.y=element_text(size=9), axis.title.x=element_text(size=5, vjust=-1.25), axis.title.y=element_text(size=10, vjust=2))
## Warning: This manual palette can handle a maximum of 10 values. You have
## supplied 40.
## Warning: Removed 30 rows containing missing values (geom_point).

There are a lot of isolated (single) sources of information. To make the graph more transparent, we will show only those sources which appeared at least 50 times.

df %>%

    filter(!is.na(url50)) %>%

    group_by(url50) %>%

    summarize(count=n()) %>%

    ggplot() +

        geom_linerange(aes(x = url50, ymin = 0, ymax = count), color = "lightgray", size = 1.5)+

        geom_point(aes(x = url50, y=count, color = url50), size = 1)+

        ggpubr::color_palette("jco")+

        theme_pubclean()+

        labs(title="Sources of information on incidents", subtitle="Appearing more often than 50 times", x="Website source", y="Number of occurences")+

        theme(legend.position = "none", axis.text.x=element_text(size=12), plot.title = element_text(size = 20, face = "bold", hjust=0.5),

              plot.subtitle = element_text(size = 15, face = "bold", hjust=0.48), axis.text.y=element_text(size=9), axis.title.x=element_text(size=10, vjust=-1.25), axis.title.y=element_text(size=10, vjust=2))

2.2. Visualising migrants’ deaths on a map of the world

With the aid of a map, we’ll try to localise the clusters of accidents involving migrants’ deaths, which, I hope, will give us some insight into those gruesome events.

Coordinates of accidents are stored in one column
To facilitate localising the points, we’ll break this column down into two columns storing information about: latitude and longitude

df$lan <- str_replace(str_extract(df$Coordinates, ".*,"), ",", "")
df$lon <- str_trim(str_replace(str_extract(df$Coordinates, ",.*"), ",", ""))
df$lon <- as.numeric(df$lon)
df$lan<- as.numeric(df$lan)

world <- ne_countries(scale='medium', returnclass='sf')

world_points <- st_centroid(world)
## Warning in st_centroid.sf(world): st_centroid assumes attributes are constant
## over geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data

Getting a set of points on the basis of which to draw a map with contoured countries

Creating data frame with coordinates

dsf <- df[, c('lan','lon')]

dsf[is.na(dsf$lan), 'lan'] <- 0
dsf[is.na(dsf$lon), 'lon'] <- 0

dsf <- sf::st_as_sf(dsf, coords=c("lon","lan"), crs=4326)

x=df$lan
y=df$lon

x[is.na(x)] <- 0
y[is.na(y)] <- 0

Here we get message that “although coordinates are longitude/latitude, st_intersects assumes that they are planar”
It reminds us that coordinates in a long/lat form are the mere approximation and a form of projection that is prone to distortions (they tend to be bigger,
the larger areas we consider, and the closer to the poles we get)
However, since both the world data and the points with accident data are projected with the same transformation, we don’t need to bother ourselves with tranforming those points into plansr coordinates
!!!! Bear in mind that projections of map are linear transformations (precisely affine transformation), which does not preserve angle between lines or distances between points,
but it does preserve ratios of distances between points lying on a straight line.

countries <- data.frame(countries=length(unique(world$name_en)), nums=1:255)

xy <- SpatialPointsDataFrame(matrix(c(y, x), ncol=2), data.frame(ID=seq(1:length(y))), 
                             proj4string=CRS("+proj=longlat +datum=WGS84 +no_defs"))

We need to convert the world points into Spatial form, as points of accidents are also in this form

spd <- sf::as_Spatial(st_geometry(world$geometry), IDs = as.character(1:255))
points_in_polygons <- sp::over(y=spd, x=xy)
colnames(world)
##  [1] "scalerank"  "featurecla" "labelrank"  "sovereignt" "sov_a3"    
##  [6] "adm0_dif"   "level"      "type"       "admin"      "adm0_a3"   
## [11] "geou_dif"   "geounit"    "gu_a3"      "su_dif"     "subunit"   
## [16] "su_a3"      "brk_diff"   "name"       "name_long"  "brk_a3"    
## [21] "brk_name"   "brk_group"  "abbrev"     "postal"     "formal_en" 
## [26] "formal_fr"  "note_adm0"  "note_brk"   "name_sort"  "name_alt"  
## [31] "mapcolor7"  "mapcolor8"  "mapcolor9"  "mapcolor13" "pop_est"   
## [36] "gdp_md_est" "pop_year"   "lastcensus" "gdp_year"   "economy"   
## [41] "income_grp" "wikipedia"  "fips_10"    "iso_a2"     "iso_a3"    
## [46] "iso_n3"     "un_a3"      "wb_a2"      "wb_a3"      "woe_id"    
## [51] "adm0_a3_is" "adm0_a3_us" "adm0_a3_un" "adm0_a3_wb" "continent" 
## [56] "region_un"  "subregion"  "region_wb"  "name_len"   "long_len"  
## [61] "abbrev_len" "tiny"       "homepart"   "geometry"
countries_with_points <- unique(data.frame(world[points_in_polygons, 'formal_en'])['formal_en'])
countries_with_points <- as.vector(countries_with_points)[['formal_en']]

all_countries <- as.vector(world$formal_en)

countries_with_points[is.na(countries_with_points)] <- "NA"
world$countries_with_points <- ifelse(all_countries %in% countries_with_points, 1, 0)

Creating clusters

mdist <- distm(xy)
hc <- hclust(as.dist(mdist), method="complete")

d=1250000
xy$clust <- cutree(hc, h=d)
df$col <- xy$clust
df$lan[is.na(df$lan)] <- 0
df$lon[is.na(df$lon)] <- 0
find_hull <- function(df) df[chull(df$lon, df$lan), ]
hulls <- ddply(df, "col", find_hull)

define the distance threshold, in this case 1250000 (we set this distance manually by setting it to various values and checking, visually,
whether computed clusters group accidents in a proper way (e.g., on the sea or on the land),beacuse geo-distance similarity doesn’t often translate into alikeness of causes of migration
(e.g., accidents can happens close to each other but be separated by a geographical barriers, which clearly indicate that geo-distance is not a perfect one)

options(repr.plot.width = 18 , repr.plot.height = 10)
suppressMessages(ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan), data=df, size=0.01, color='red')+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  theme_classic()+
  theme(panel.background = element_rect(fill = '#006994'))+
  geom_rect(xmin = -120, xmax = -74.12, ymin = 7.65, ymax = 33.97, 
     fill = NA, colour = "black", size = 1.5)+
  geom_rect(xmin = -9, xmax = 40, ymin = 30, ymax = 45, 
     fill = NA, colour = "black", size = 1.5)+
  geom_rect(xmin = 15, xmax = 40, ymin = 5, ymax = 28, 
     fill = NA, colour = "black", size = 1.5)+
  labs(title="Location of migrants' deaths", subtitle="Countries with no deaths on their territory coloured grey", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48)))
## Scale on map varies by more than 10%, scale bar may be inaccurate

ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis("col", discrete=T)+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  theme_classic()+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="Location of migrants' deaths", subtitle="Countries with no deaths on their territory coloured grey", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

Comment to the map

As we can see on the maps above, the migrants’ deaths occur in the countries from which or to which they migrate, or in the countries that lie in between.

There can be a couple of routes seen:
(a) from Mexico to the US
(b) from North Africa to Europe
(c) from the Middle East to Europe

Besides, there are recorded deaths in: Central America, Continental Europe, in Asiaall over the Africa.
Clustering

To deal with such a large number of points and make any meaning of them, we should cluster them basing on their geolocation.
I am aware that in the dataset there is an already created variable that groups the locations of accidents, however, I would like to group it myself, showing one way how it can be done.

Geo-grouping consists of two stages:

  1. Firstly, we need to group data into small clusters. To do so, we will used hierarchical clustering, which to find similarity between points, uses (as most clustering algorithms) distance.

In this kernel, I used “geo-distance” (instead of more commonly used Euclidean distance), which takes into account the curvature of the Earth. I do not suspect that using this distance metric will make much of a difference, since most points lie quite close to the Equator and generally by constructing clusters we’re more focused on distances between closer points, but as we operate on the Earth, it’s a good practice to use it.
After computing distances, we need to set a “cut-off” distance, which will be a treshold, determining the highest admissable distance between two points in a cluster.
It cannot be too large, because then clusters will have high variance failing to find distinct group of points (e.g., north part of the Mediterranean and the Southern Continental Europe).
It cannot also be too small, since it will be hard to group them later manually, which we’ll do in the second stage.

  1. Secondly, we group the clusters into larger geo-groups, by hand, because we cannot entirely rely on the distance as a clustering criterion. The geographical proximity doesn’t always translate into similarity of areas.

For instance, some places may be separated by mountain barrier or be of a different nature (e.g., one accident happening on the land and the other on the sea, no matter how close to each other from a geographical point of view, they are quite dissimilar).Therefore, we cannot completely rely on the algorithms.

#2.3. Visualising migrants’ deaths on each Zoom of a map at each region

centroids_lon=vector()
centroids_lan=vector()

for (i in 1:length(unique(df$col))){
  lon = df[df$col==i, 'lon']
  lan = df[df$col==i, 'lan']
  centroids_lon[i]=mean(lon)
  centroids_lan[i]=mean(lan)
}
centroids <- data.frame(centroids_lon, centroids_lan, table(df$col))


ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis("col", discrete=T)+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  theme_classic()+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="Location of migrants' deaths", subtitle="Countries with no deaths on their territory coloured grey", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

North America

ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis(discrete=T)+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  coord_sf(xlim=c(-140, -60), ylim=c(5, 50))+
  geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="North America", subtitle="With a number of accidents in each cluster", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

df$New_geo_grouping <- 'Uncategorized'
df$New_geo_grouping <- ifelse(df$col %in% c(3, 14, 23, 13), 'Mexico/Central America', df$New_geo_grouping)
df$New_geo_grouping <- ifelse(df$col %in% c(4, 22, 8, 7), 'USA-Mexico border', df$New_geo_grouping)

Comment to the map

As being an European, I am constantly exposed to and bombarded with information about the tragedies in the Mediterranean, but surprisingly enough (at least for me), the biggest cluster can be found on the US-Mexico border.

Africa

options(repr.plot.width = 18 , repr.plot.height = 20)
ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis(discrete=T)+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  coord_sf(ylim=c(-30, 50), xlim=c(-20, 68))+
  geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="Africa", subtitle="With a number of accidents in each cluster", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

df$New_geo_grouping <- ifelse(df$col %in% c(9, 48, 30, 49, 36, 34, 54, 35, 40), 'Sub-saharan Africa', df$New_geo_grouping)
df$New_geo_grouping <- ifelse(df$col %in% c(43, 29, 44, 37, 26, 64, 47, 46, 63, 27, 21, 51, 53, 41, 24, 38, 33, 39, 45), 'Equatorial/South Africa', df$New_geo_grouping)

Comment to the map

Moreover, a great concentration of deaths can be found also in the eastern Sub-Saharan Africa.

Europe

options(repr.plot.width = 18 , repr.plot.height = 15)
ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis(discrete=T)+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  coord_sf(ylim=c(30, 71), xlim=c(-9, 68))+
  geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
  theme_classic()+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="Europe", subtitle="With a number of accidents in each cluster", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

df$New_geo_grouping <- ifelse(df$col %in% c(2, 19, 6, 17, 32, 52, 1, 16), 'Mediterranean', df$New_geo_grouping)
df$New_geo_grouping <- ifelse(df$col %in% c(18, 11), 'Continental_Europe', df$New_geo_grouping)

Comment to the map

Lastly, we’ll discuss the clusters in Europe.

From the map we can gather that most people perished on the Mediterranean and there are two main routes that migrants choose:
(a) from Morocco to Spain
(b) from Libya to Italy

Middle East

ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis(discrete=T)+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  coord_sf(xlim=c(25, 90), ylim=c(10, 50))+
  geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="Middle East", subtitle="With a number of accidents in each cluster", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

df$New_geo_grouping <- ifelse(df$col %in% c(31, 60, 28), 'Middle East', df$New_geo_grouping)

Asia

ggplot(data=world)+
  geom_sf(aes(fill=as.factor(countries_with_points)))+
  scale_fill_manual("countries_with_points", values=c('grey', 'white'))+
  geom_point(aes(x=lon, y=lan, col=as.factor(col)), data=df, size=0.01)+
  scale_color_viridis(discrete=T)+
  annotation_scale(location = "bl", width_hint = 0.5, height=unit(0.75, 'cm'), text_cex=1.5)+
  annotation_north_arrow(which_north=T, pad_x = unit(0.1, "in"), pad_y = unit(0.5, "in"),
                         style = north_arrow_fancy_orienteering, height=unit(3.0, 'cm'), width=unit(3.0, 'cm'))+
  geom_polygon(data=hulls, alpha=0.5, aes(x=lon, y=lan, colour=as.factor(col)))+
  coord_sf(xlim=c(80, 150), ylim=c(-10, 30))+
  geom_text(data=centroids, colour='white', size=7, aes(x=centroids_lon, y=centroids_lan, label=Freq))+
  theme(panel.background = element_rect(fill = '#006994'))+
  labs(title="Asia", subtitle="With a number of accidents in each cluster", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=14), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48))
## Scale on map varies by more than 10%, scale bar may be inaccurate

df$New_geo_grouping <- ifelse(df$col %in% c(73, 55, 50, 5, 42, 57, 68, 59, 57, 70), 'Asia', df$New_geo_grouping)

2.4. Examining causes of deaths
Having checked the distribution of accidents across the whole world, we will now proceed into finding out what the most common causes of deaths were and how they differed in relation to other factors.

At the beginning, I need to preprocess the data, which involves extracting causes of deaths from Cause_death variable, and subsequently, grouping the similar causes in categories.

df$Death_Cause_NA <- ifelse(str_detect(df$Death_Cause, 'Unknown'),as.character(df[which(str_detect(df$Death_Cause, 'Unknown')), 'Death_Cause']), 'NA' )
df[which(str_detect(df$Death_Cause, 'Unknown')), 'Death_Cause'] <- NA

Death_cause_split <- cbind(df[1], mtabulate(str_split(df$Death_Cause, ",")))

Now we group those causes

df$Suffocation_death <- ifelse(rowSums(Death_cause_split[, 
        c(" Asphyxiation"," Suffocation", "Asphyxiation", "Suffocation",
          "Presumed asphyxiation")]==1)>0, 1, 0)

df$Land_transport_accident <- ifelse(rowSums(Death_cause_split[,
        c("Vehicle Accident", "Train Accident", "Plane stowaway", 
          "Hit by car","Hit by vehicle", "Fall onto train tracks", "Fall from vehicle",
          "Fall from truck", "Crushed in back of truck"," Crushed by bus", " Crushed in back of truck",
          " Fall from train", " Hit by car", "Bus fire",
          "Crushed by stones (train)", "Electrocution on railway",
          "Fall from train" )]==1)>0, 1, 0)

df$Food_water_shortages_death <- ifelse(rowSums(Death_cause_split[,
        c(" Dehydration","Dehydration", "Starvation", " Starvation",
          " Malnutrition" )]==1)>0, 1, 0)

df$Boat_transport_accident <- ifelse(rowSums(Death_cause_split[,
        c("Violence while attempting to board boat", "Presumed drowning",
          " Drowning", " Presumed drowning","Drowning","Fall from boat",
          "Injuries from boat motor")]==1)>0, 1, 0)

df$Exhaustion_death <- ifelse(rowSums(Death_cause_split[,
        c(" Exhaustion", "Exhaustion")]==1)>0, 1, 0)

df$Weather_condition_death <- ifelse(rowSums(Death_cause_split[,
        c(" Exposure", " Harsh conditions", "Exposure", 
          "Harsh conditions", "Harsh weather/lack of adequate shelter",
          " Harsh weather/lack of adequate shelter")]==1)>0, 1, 0)

df$Heat_Hyperthermia_death <- ifelse(rowSums(Death_cause_split[,
        c(" Hyperthermia","Heat stroke", "Hyperthermia", 
          "Probable hyperthermia", "Presumed hyperthermia", 
          "Presumed hyperthermia/dehydration")]==1)>0, 1, 0)

df$Hypothermia_death <- ifelse(rowSums(Death_cause_split[,
        c(" Hypothermia", "Hypothermia", 
          "Presumed hypothermia")]==1)>0, 1, 0)

df$Murder<- ifelse(rowSums(Death_cause_split[,
        c(" Murdered", " Shot","Stabbed" , " Stabbed", 
          " Shot or stabbed", "Murdered", "Shot", "Shot or stabbed",
          "Shot or stabbed")]==1)>0, 1, 0)

df$War_related_death <- ifelse(rowSums(Death_cause_split[,
        c("Attacked by Apache helicopter", "Killed by landmine blast",
          "Killed by mortar shell", "Gassed")]==1)>0, 1, 0)

df$Electrocution_fire_death <- ifelse(rowSums(Death_cause_split[, 
        c("Burned", "Electrocution", "Fire", "Fuel burns", 
          "Struck by lightning bolt")]==1)>0, 1, 0)

df$Poor_health_death <- ifelse(rowSums(Death_cause_split[, 
        c("Post-partum complications", "Respiratory illness" ,
          " Sickness and lack of access to medicines",
          "Sickness and lack of access to medicines", 
          "Pulmonary edema and renal insufficiency", "Pulmonary edema",
          "Pulmonary complications", "Pneumonia", "Cardiac arrest",
          "Cervical cancer", "Coronary artery atherosclerosis",
          "Digestive bleeding", "Hypoglycemia",
          "Organ failure" )]==1)>0, 1, 0)


df$Fall_death <- ifelse(rowSums(Death_cause_split[,
            c("Fall into a canyon", "Fall", 
              "Fall from steep slope", "Crushed",
              " Crushed","Rockslide", 
              "Fall from border fence")]==1)>0, 1, 0)

df$Other_accidents <- ifelse(rowSums(Death_cause_split[,
            c("Mixed", "Poison", "Envenomation",
              "Accident (non-vehicle)")]==1)>0, 1, 0)

df$Hanging_suicide <- ifelse(rowSums(Death_cause_split[,
            c("Hanging", "Suicide")]==1)>0, 1, 0)
df$Hippo_croco_death <- ifelse(rowSums(Death_cause_split[,
            c("Killed by hippopotamus", "Killed by crocodile",
              "Killed by hippoptamus")]==1)>0, 1, 0)

Creating ad hoc df_groups with causes of death

df_groups <- df[, 28:43]
df_groups <- cbind("New_geo_grouping"=df$New_geo_grouping, df_groups)

Calculating conditional distribution of causes of death given geo-location

geo_groups <- data.frame(New_geo_grouping=unique(df$New_geo_grouping))

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
    summarise, Suffocation_death=sum(Suffocation_death)), on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping), 
        summarise, Land_transport_accident=sum(Land_transport_accident)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping), 
        summarise, Food_water_shortages_death=sum(Food_water_shortages_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
        summarise, Boat_transport_accident=sum(Boat_transport_accident)), 
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
        summarise, Exhaustion_death=sum(Exhaustion_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
            summarise, Weather_condition_death=sum(Weather_condition_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping), 
            summarise, Hypothermia_death=sum(Hypothermia_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
        summarise, Murder=sum(Murder)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
            summarise, War_related_death=sum(War_related_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping), 
        summarise, Electrocution_fire_death=sum(Electrocution_fire_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
        summarise, Poor_health_death=sum(Poor_health_death)),
                    on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
    summarise, Fall_death=sum(Fall_death)), on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping),
    summarise, Other_accidents=sum(Other_accidents)), on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping), 
    summarise, Hanging_suicide =sum(Hanging_suicide )), on='New_geo_grouping')

geo_groups <- merge(geo_groups, ddply(df_groups, .(New_geo_grouping), 
    summarise, Hippo_croco_death=sum(Hippo_croco_death)), on='New_geo_grouping')

Share of causes of death in total

Death_cause_percentage <- data.frame(New_geo_grouping=colnames(geo_groups[, -1]))
for (i in 1:length(colnames(geo_groups[, -1]))) {
  Death_cause_percentage[i, 2] <- sum(geo_groups[, i+1])/sum(geo_groups[, -1])
}

Reordering it descreasingly

Death_cause_percentage <- Death_cause_percentage %>%
  arrange(desc(V2))

Death_cause_percentage$New_geo_grouping <- factor(Death_cause_percentage$New_geo_grouping, level=Death_cause_percentage$New_geo_grouping)
options(repr.plot.width = 18 , repr.plot.height = 10)
ggplot(Death_cause_percentage)+
  geom_col(aes(New_geo_grouping, V2, fill=V2))+
  theme_pubclean()+
  labs(title="Distribution of causes of death", subtitle="Summing up to 1.0", x="", y="")+
  theme(legend.position = "none", axis.text.x=element_text(size=23, angle=45, hjust=1), plot.title = element_text(size = 40, face = "bold", hjust=0.5), plot.subtitle = element_text(size = 30, face = "bold", hjust=0.48),
        axis.text.y=element_text(size=23))+
  scale_x_discrete(labels=c("Boat transport", 'Poor health', 'Land transport', 'Food/water unsufficient', 'Murder', 'Weather conditions', 'Suffocation', 'Hypothermia', 'Other', 'Electrocution/fire',
                           'Fall', 'War', 'Exhaustion', 'Hanging/suicide', 'Hippo/crocodile'))

For our futher analysis, we would like to keep only those causes of death that make up more than 4% of total

Death_cause_percentage <- Death_cause_percentage[Death_cause_percentage$V2>0.04, ]
geo_groups <- geo_groups[, c(TRUE, colnames(geo_groups[, -1]) %in% Death_cause_percentage$New_geo_grouping)]

Transposing our data frame

geo_groups <- t(geo_groups)
colnames(geo_groups) <- as.character(unlist(geo_groups[1,]))
geo_groups = geo_groups[-1, ]

geo_groups <- as.data.frame(geo_groups)

geo_groups <- cbind(rownames(geo_groups), geo_groups)
rownames(geo_groups) <- NULL

Generating our dataset in a long form

geo_groups_long <- melt(geo_groups, id="rownames(geo_groups)", measure=colnames(geo_groups[, -1]))
## Warning: attributes are not identical across measure variables; they will be
## dropped
colnames(geo_groups_long)[1] <- 'Death_Cause'
geo_groups_long$value <- as.numeric(geo_groups_long$value)

Calculating total number of accidents per region

total <- geo_groups_long %>%
  dplyr::group_by(variable) %>%
  dplyr::summarize(TOTAL=sum(value))
geo_groups_long <- merge(geo_groups_long, total, on='variable')
geo_groups_long$ratio <- geo_groups_long$value/geo_groups_long$TOTAL
first_place <- geo_groups_long %>%
  arrange(desc(ratio)) %>%
  group_by(variable) %>%
  top_n(1, ratio) %>%
  arrange(variable) %>%
  mutate(count_first = 1)


second_place <- geo_groups_long %>%
  arrange(desc(ratio)) %>%
  group_by(variable) %>%
  top_n(2, ratio) %>%
  arrange(variable) %>%
  mutate(count_second = 1)



geo_groups_long <- merge(geo_groups_long, first_place, on=c('variable', 'Death_Cause'), all=T)
geo_groups_long <- merge(geo_groups_long, second_place, on=c('variable', 'Death_Cause'), all=T)

geo_groups_long$place <- 0
geo_groups_long$place <- ifelse(geo_groups_long$count_first==1, 1, geo_groups_long$place)
geo_groups_long$place <- ifelse((is.na(geo_groups_long$count_first) & geo_groups_long$count_second==1), 2, geo_groups_long$place)

geo_groups_long$place <- ifelse(is.na(geo_groups_long$place), 0, geo_groups_long$place)

geo_groups_long$text <- geo_groups_long$place

geo_groups_long$text <- ifelse(geo_groups_long$text==0, NA, geo_groups_long$text)

geo_groups_long$place <- factor(geo_groups_long$place, level=c(0, 1, 2), labels=c(0, '1st', '2nd'))
geo_groups_long$text <- factor(geo_groups_long$text, level=c(0, 1,2 ), labels=c(0, '1st', '2nd'))
options(repr.plot.width = 6 , repr.plot.height = 5)
ggplot(arrange(geo_groups_long, Death_Cause), aes(group=variable))+
  theme_pubclean()+
  geom_col(aes(x=Death_Cause, y=ratio, fill=place), position='identity')+
  scale_fill_viridis(discrete=T, labels=c('rest', '1st', '2nd'), name='The most/The second most common')+
  geom_text(aes(x=Death_Cause, y=ratio, label=text), size=4, vjust=1.0)+
  facet_wrap(~variable)+
  theme(strip.text = element_text(size=5), axis.text.x=element_text(size=5, angle=-45),  plot.title = element_text(size = 12, face = "bold", hjust=0.5), 
        plot.subtitle = element_text(size = 10, face = "bold", hjust=0.50), legend.title = element_text(size=5), legend.text=element_text(size=5), axis.text.y=element_text(size=5))+
  scale_x_discrete(labels=c("Boat transport", 'Food/water insuff', 'Land transport', 'Murder', 'Poor health', 'Weather conditions'))+
  labs(title="Causes of deaths per region", subtitle="Only 5 most common causes of deaths", x="", y="")
## Warning: Removed 36 rows containing missing values (geom_text).

Comments to the graphs The graphs provide us with some interesting informations.

The most common causes of death are: drowning, fall from a boat, unjuries from boat motor and violence on the boat, which together constitute the variable (created by me) Boat transport. This stems from the fact that most deaths are recorded on the Mediterranean see. The boats are often bursting at the seams with passengers, who need to be crammed for the lack of the sufficient number of boats. Worth noting is also that they often cross the Mediterranaen in small and canoe-like boats, which don’t stand a chance against even mild storms.

The second most common causes of death are included in an aggregated term poor health, which encompasses: Sickness and lack of access to medicines, Organ failure, Hypoglycemia, Digestive bleeding, Coronary artery atherosclerosis, Cervical cancer, Cardiac arrest, Pneumonia, Pulmonary edema, Respiratory illness, Post-partum complications. Those deaths are connected to health conditions and ilnesses.

What comes as a surprise to me is that in Africa most people die as a result of not being provided with satisfactory health care and not from starvation or dehydration.

#3. Naive Bayes model #3.1. Preparations To extract even more information from the dataset, we will use Naive Bayes model. We would like to find out how causes of death and dates of incidents affect the number of deaths within one accident.

What is exactly Naive Bayes classifier?

Naive Bayes classifier is a classification algorithm, which rests on the assumption of independence between features (that’s why it’s called naive, because this assumption rarely holds) and Bayes theorem, which can be seen below.

Caption for the picture.

Caption for the picture.

Why does this model require such a strong assumption?

If the independence holds, then it’s much easier to calculate maximum-likelihood expression on which this model bases. To see how the joint probability simplifies, see for instance: https://en.wikipedia.org/wiki/Naive_Bayes_classifier or any other book on machine learning.

Why do we think this model is appriopriate?

Firstly, it’s reasonable to suggest that the variables do not depend on each other. The causes of death, in most cases, were assigned one per observation (with little exception), therefore as such, must be independent. Moreover, the same holds for the time of event, which is mutually exclusive (an event happens only in one month and year). On top of that, there shouldn’t been any correlation between year and month, and if there is, it should be regarded as a spurious correlation. Otherwise, it’d suggest that there is some time trend indicating that from year to year more migrants choose some set of months to migrate, or that in some years migrants decided to migrate in a different month, which due to weather condition might be plausible, yet the impact of that change shouldn’t bother us when running Naive Bayes model). Having said that, there might be some dependence between the time and causes of deaths (e.g., in the summer time, we expect deaths from dehydration or hyperthermia to surge); this point may raise a concern and, therefore, we will check the correlation of features nonetheless.

Secondly, it performs quite well while provided with not so many observations, which is the case in our example.

What kind of Naive Bayes classifier?

As all features are binary, we decide for Bernoulli Naive Bayes (in contrast to Gaussian, which, simplifying to a great extent, is used for continuous variables, where we assume they are normally distributed; or in contrast to Multinomial Naive Bayes, where features have multinomial normal distribution and are often used for frequencies, e.g., how many times word appeared in a text).

df$Month2 <- as.numeric(df$Month)
df$Month2 <- car::recode(df$Month2, "5=1; 4=2; 8=3; 1=4; 9=5; 7=6; 6=7; 2=8; 12=9; 11=10; 10=11; 3=12")


df$Total_Dead_Missing <- str_remove(df$Total_Dead_Missing, ",") %>% as.numeric()
unique(df$Total_Dead_Missing)
##   [1]    1    3   22   64    2    8   18    4   10   14   60   91   12    9    5
##  [16]    6   39   29   11   17   13   20    0   45   40    7   84   31   56   19
##  [31]   70   26   23   73   94   65   33   57   28   24   53   15  100  150   16
##  [46]   83   25   32   27   59   21   50   41  117   54   34   61  130   52   38
##  [61]   44  114  104  101   43   76   62  112   47   36   30   48   37   49   68
##  [76]   63   97  120   35  110  126   82  156  102   42   74  118  176   58   87
##  [91]  103   99  135  128  111  288  339  245  550  255  459   90   51  123   95
## [106]  202   71  222  106   46 1022  400  243  307  750  149   88  500  164  251
## [121]  172   55  151  200   75  240  217   77
df$more_death_missing_5 <- ifelse(df$Total_Dead_Missing>5, 1, 0)


# Converting a discrete variable into many binary
df$m_123 <- ifelse(df$Month2 %in% c(1, 2, 3), 1, 0)
df$m_456 <- ifelse(df$Month2 %in% c(5, 4, 6), 1, 0)
df$m_789 <- ifelse(df$Month2 %in% c(7, 8, 9), 1, 0)
df$m_101112 <- ifelse(df$Month2 %in% c(10, 11, 12), 1, 0)
df$Year_2014 <- ifelse(df$Year==2014, 1, 0)
df$Year_2015 <- ifelse(df$Year==2015, 1, 0)
df$Year_2016 <- ifelse(df$Year==2016, 1, 0)
df$Year_2017 <- ifelse(df$Year==2017, 1, 0)
df$Year_2018 <- ifelse(df$Year==2018, 1, 0)
df$Year_2019 <- ifelse(df$Year==2019, 1, 0)
df$Year_2020 <- ifelse(df$Year==2020, 1, 0)

Calculating correlation (association) between variables

jaccard_dist <- as.matrix(distance(t(df[,29:56]), method = 'jaccard',test.na = F))
## Metric: 'jaccard'; comparing: 28 vectors.
colnames(jaccard_dist) <- colnames(df[,29:56])
rownames(jaccard_dist) <- colnames(df[,29:56])
# To calculate Jaccard similarity, we have to from one subtract Jaccard distance
jaccard_sim = 1-jaccard_dist
jaccard_sim
##                            Land_transport_accident Food_water_shortages_death
## Land_transport_accident               1.0000000000               0.0005512679
## Food_water_shortages_death            0.0005512679               1.0000000000
## Boat_transport_accident               0.0000000000               0.0023032630
## Exhaustion_death                      0.0000000000               0.0021074816
## Weather_condition_death               0.0007800312               0.2438356164
## Heat_Hyperthermia_death               0.0000000000               0.0045620438
## Hypothermia_death                     0.0000000000               0.0019175455
## Murder                                0.0046911650               0.0111193477
## War_related_death                     0.0000000000               0.0000000000
## Electrocution_fire_death              0.0000000000               0.0000000000
## Poor_health_death                     0.0000000000               0.1875584659
## Fall_death                            0.0011325028               0.0000000000
## Other_accidents                       0.0000000000               0.0000000000
## Hanging_suicide                       0.0000000000               0.0000000000
## Hippo_croco_death                     0.0000000000               0.0000000000
## Month2                                0.0145076227               0.0169662071
## more_death_missing_5                  0.0439429929               0.0497142857
## m_123                                 0.0824388149               0.0647010647
## m_456                                 0.0818565401               0.0809815951
## m_789                                 0.0860339506               0.1113671275
## m_101112                              0.1013698630               0.1161265432
## Year_2014                             0.0153846154               0.0025041736
## Year_2015                             0.0606443462               0.0558558559
## Year_2016                             0.0452995616               0.0197066911
## Year_2017                             0.1048186786               0.1003344482
## Year_2018                             0.0939008337               0.0968017058
## Year_2019                             0.0944259567               0.1453930685
## Year_2020                             0.0098716683               0.0251162791
##                            Boat_transport_accident Exhaustion_death
## Land_transport_accident               0.0000000000     0.000000e+00
## Food_water_shortages_death            0.0023032630     2.107482e-03
## Boat_transport_accident               1.0000000000     0.000000e+00
## Exhaustion_death                      0.0000000000     1.000000e+00
## Weather_condition_death               0.0000000000     0.000000e+00
## Heat_Hyperthermia_death               0.0005503577     0.000000e+00
## Hypothermia_death                     0.0017055145     0.000000e+00
## Murder                                0.0000000000     0.000000e+00
## War_related_death                     0.0000000000     0.000000e+00
## Electrocution_fire_death              0.0005934718     0.000000e+00
## Poor_health_death                     0.0000000000     0.000000e+00
## Fall_death                            0.0005955926     0.000000e+00
## Other_accidents                       0.0000000000     0.000000e+00
## Hanging_suicide                       0.0000000000     0.000000e+00
## Hippo_croco_death                     0.0000000000     0.000000e+00
## Month2                                0.0268618852     5.308573e-05
## more_death_missing_5                  0.2507345739     1.119821e-03
## m_123                                 0.1426110920     6.038647e-04
## m_456                                 0.1362867771     1.177856e-03
## m_789                                 0.1284375000     0.000000e+00
## m_101112                              0.1373660996     5.130836e-04
## Year_2014                             0.0469432314     3.891051e-03
## Year_2015                             0.1133603239     1.228501e-03
## Year_2016                             0.0900333457     0.000000e+00
## Year_2017                             0.1222593831     0.000000e+00
## Year_2018                             0.1229088426     1.229256e-03
## Year_2019                             0.1087026852     0.000000e+00
## Year_2020                             0.0329358319     0.000000e+00
##                            Weather_condition_death Heat_Hyperthermia_death
## Land_transport_accident               0.0007800312            0.0000000000
## Food_water_shortages_death            0.2438356164            0.0045620438
## Boat_transport_accident               0.0000000000            0.0005503577
## Exhaustion_death                      0.0000000000            0.0000000000
## Weather_condition_death               1.0000000000            0.0142602496
## Heat_Hyperthermia_death               0.0142602496            1.0000000000
## Hypothermia_death                     0.0019531250            0.0000000000
## Murder                                0.0036188179            0.0000000000
## War_related_death                     0.0000000000            0.0000000000
## Electrocution_fire_death              0.0000000000            0.0000000000
## Poor_health_death                     0.1150000000            0.0000000000
## Fall_death                            0.0000000000            0.0000000000
## Other_accidents                       0.0000000000            0.0000000000
## Hanging_suicide                       0.0000000000            0.0000000000
## Hippo_croco_death                     0.0000000000            0.0000000000
## Month2                                0.0072522883            0.0026141736
## more_death_missing_5                  0.0219263900            0.0028818444
## m_123                                 0.0428643469            0.0016629712
## m_456                                 0.0363279332            0.0300668151
## m_789                                 0.0572963295            0.0447538538
## m_101112                              0.0573219884            0.0033444816
## Year_2014                             0.0045045045            0.0434782609
## Year_2015                             0.0337268128            0.0211640212
## Year_2016                             0.0029620853            0.0309575234
## Year_2017                             0.0529131986            0.0113864702
## Year_2018                             0.0510046368            0.0159908624
## Year_2019                             0.0760869565            0.0153601695
## Year_2020                             0.0458715596            0.0000000000
##                            Hypothermia_death      Murder War_related_death
## Land_transport_accident          0.000000000 0.004691165      0.0000000000
## Food_water_shortages_death       0.001917546 0.011119348      0.0000000000
## Boat_transport_accident          0.001705514 0.000000000      0.0000000000
## Exhaustion_death                 0.000000000 0.000000000      0.0000000000
## Weather_condition_death          0.001953125 0.003618818      0.0000000000
## Heat_Hyperthermia_death          0.000000000 0.000000000      0.0000000000
## Hypothermia_death                1.000000000 0.000000000      0.0000000000
## Murder                           0.000000000 1.000000000      0.0000000000
## War_related_death                0.000000000 0.000000000      1.0000000000
## Electrocution_fire_death         0.000000000 0.000000000      0.0000000000
## Poor_health_death                0.000000000 0.002995507      0.0000000000
## Fall_death                       0.000000000 0.000000000      0.0000000000
## Other_accidents                  0.000000000 0.002227171      0.0000000000
## Hanging_suicide                  0.000000000 0.000000000      0.0000000000
## Hippo_croco_death                0.000000000 0.000000000      0.0000000000
## Month2                           0.001384168 0.006636352      0.0001930628
## more_death_missing_5             0.011258956 0.022691706      0.0011111111
## m_123                            0.025776216 0.047040971      0.0018061409
## m_456                            0.007865169 0.054391218      0.0000000000
## m_789                            0.004420432 0.051601423      0.0041025641
## m_101112                         0.015399901 0.043727915      0.0000000000
## Year_2014                        0.002849003 0.012066365      0.0076045627
## Year_2015                        0.013377926 0.051369863      0.0000000000
## Year_2016                        0.018504811 0.024169184      0.0007763975
## Year_2017                        0.018207283 0.066787004      0.0044085231
## Year_2018                        0.008782201 0.068550497      0.0006116208
## Year_2019                        0.008125677 0.028301887      0.0005640158
## Year_2020                        0.016064257 0.012389381      0.0000000000
##                            Electrocution_fire_death Poor_health_death
## Land_transport_accident                0.0000000000       0.000000000
## Food_water_shortages_death             0.0000000000       0.187558466
## Boat_transport_accident                0.0005934718       0.000000000
## Exhaustion_death                       0.0000000000       0.000000000
## Weather_condition_death                0.0000000000       0.115000000
## Heat_Hyperthermia_death                0.0000000000       0.000000000
## Hypothermia_death                      0.0000000000       0.000000000
## Murder                                 0.0000000000       0.002995507
## War_related_death                      0.0000000000       0.000000000
## Electrocution_fire_death               1.0000000000       0.000000000
## Poor_health_death                      0.0000000000       1.000000000
## Fall_death                             0.0000000000       0.000000000
## Other_accidents                        0.0000000000       0.000000000
## Hanging_suicide                        0.0000000000       0.000000000
## Hippo_croco_death                      0.0000000000       0.000000000
## Month2                                 0.0003403142       0.026572736
## more_death_missing_5                   0.0044052863       0.030303030
## m_123                                  0.0029940120       0.140597540
## m_456                                  0.0040911748       0.116847826
## m_789                                  0.0025458248       0.117814277
## m_101112                               0.0025471218       0.155076722
## Year_2014                              0.0072992701       0.000000000
## Year_2015                              0.0060386473       0.019516334
## Year_2016                              0.0015408320       0.199832776
## Year_2017                              0.0065741417       0.030408948
## Year_2018                              0.0006075334       0.149339050
## Year_2019                              0.0016835017       0.186350778
## Year_2020                              0.0000000000       0.021040327
##                              Fall_death Other_accidents Hanging_suicide
## Land_transport_accident    0.0011325028    0.0000000000    0.0000000000
## Food_water_shortages_death 0.0000000000    0.0000000000    0.0000000000
## Boat_transport_accident    0.0005955926    0.0000000000    0.0000000000
## Exhaustion_death           0.0000000000    0.0000000000    0.0000000000
## Weather_condition_death    0.0000000000    0.0000000000    0.0000000000
## Heat_Hyperthermia_death    0.0000000000    0.0000000000    0.0000000000
## Hypothermia_death          0.0000000000    0.0000000000    0.0000000000
## Murder                     0.0000000000    0.0022271715    0.0000000000
## War_related_death          0.0000000000    0.0000000000    0.0000000000
## Electrocution_fire_death   0.0000000000    0.0000000000    0.0000000000
## Poor_health_death          0.0000000000    0.0000000000    0.0000000000
## Fall_death                 1.0000000000    0.0000000000    0.0000000000
## Other_accidents            0.0000000000    1.0000000000    0.0000000000
## Hanging_suicide            0.0000000000    0.0000000000    1.0000000000
## Hippo_croco_death          0.0000000000    0.0000000000    0.0000000000
## Month2                     0.0002340954    0.0007388288    0.0001278964
## more_death_missing_5       0.0022123894    0.0221483942    0.0000000000
## m_123                      0.0024024024    0.0017825312    0.0018083183
## m_456                      0.0035169988    0.0011580776    0.0017626322
## m_789                      0.0005096840    0.0060975610    0.0005115090
## m_101112                   0.0025549310    0.0081507896    0.0010240655
## Year_2014                  0.0074626866    0.0177304965    0.0038167939
## Year_2015                  0.0024242424    0.0047619048    0.0000000000
## Year_2016                  0.0007733952    0.0107941403    0.0007776050
## Year_2017                  0.0043923865    0.0057929037    0.0007331378
## Year_2018                  0.0024434942    0.0006035003    0.0006123699
## Year_2019                  0.0005624297    0.0005571031    0.0022624434
## Year_2020                  0.0000000000    0.0000000000    0.0061349693
##                            Hippo_croco_death       Month2 more_death_missing_5
## Land_transport_accident         0.000000e+00 1.450762e-02          0.043942993
## Food_water_shortages_death      0.000000e+00 1.696621e-02          0.049714286
## Boat_transport_accident         0.000000e+00 2.686189e-02          0.250734574
## Exhaustion_death                0.000000e+00 5.308573e-05          0.001119821
## Weather_condition_death         0.000000e+00 7.252288e-03          0.021926390
## Heat_Hyperthermia_death         0.000000e+00 2.614174e-03          0.002881844
## Hypothermia_death               0.000000e+00 1.384168e-03          0.011258956
## Murder                          0.000000e+00 6.636352e-03          0.022691706
## War_related_death               0.000000e+00 1.930628e-04          0.001111111
## Electrocution_fire_death        0.000000e+00 3.403142e-04          0.004405286
## Poor_health_death               0.000000e+00 2.657274e-02          0.030303030
## Fall_death                      0.000000e+00 2.340954e-04          0.002212389
## Other_accidents                 0.000000e+00 7.388288e-04          0.022148394
## Hanging_suicide                 0.000000e+00 1.278964e-04          0.000000000
## Hippo_croco_death               1.000000e+00 6.273844e-05          0.000000000
## Month2                          6.273844e-05 1.000000e+00          0.014876211
## more_death_missing_5            0.000000e+00 1.487621e-02          1.000000000
## m_123                           0.000000e+00 7.747714e-03          0.087681779
## m_456                           0.000000e+00 2.129529e-02          0.085642317
## m_789                           1.026694e-03 3.893619e-02          0.081173780
## m_101112                        5.133470e-04 5.399474e-02          0.104361371
## Year_2014                       0.000000e+00 3.965545e-03          0.067164179
## Year_2015                       0.000000e+00 1.638690e-02          0.092485549
## Year_2016                       0.000000e+00 2.048982e-02          0.107818089
## Year_2017                       7.363770e-04 2.327859e-02          0.095609756
## Year_2018                       1.230012e-03 2.623128e-02          0.052301255
## Year_2019                       0.000000e+00 2.911251e-02          0.048616601
## Year_2020                       0.000000e+00 6.371687e-04          0.018518519
##                                   m_123       m_456       m_789     m_101112
## Land_transport_accident    0.0824388149 0.081856540 0.086033951 0.1013698630
## Food_water_shortages_death 0.0647010647 0.080981595 0.111367127 0.1161265432
## Boat_transport_accident    0.1426110920 0.136286777 0.128437500 0.1373660996
## Exhaustion_death           0.0006038647 0.001177856 0.000000000 0.0005130836
## Weather_condition_death    0.0428643469 0.036327933 0.057296329 0.0573219884
## Heat_Hyperthermia_death    0.0016629712 0.030066815 0.044753854 0.0033444816
## Hypothermia_death          0.0257762156 0.007865169 0.004420432 0.0153999006
## Murder                     0.0470409712 0.054391218 0.051601423 0.0437279152
## War_related_death          0.0018061409 0.000000000 0.004102564 0.0000000000
## Electrocution_fire_death   0.0029940120 0.004091175 0.002545825 0.0025471218
## Poor_health_death          0.1405975395 0.116847826 0.117814277 0.1550767222
## Fall_death                 0.0024024024 0.003516999 0.000509684 0.0025549310
## Other_accidents            0.0017825312 0.001158078 0.006097561 0.0081507896
## Hanging_suicide            0.0018083183 0.001762632 0.000511509 0.0010240655
## Hippo_croco_death          0.0000000000 0.000000000 0.001026694 0.0005133470
## Month2                     0.0077477137 0.021295290 0.038936185 0.0539947400
## more_death_missing_5       0.0876817793 0.085642317 0.081173780 0.1043613707
## m_123                      1.0000000000 0.000000000 0.000000000 0.0000000000
## m_456                      0.0000000000 1.000000000 0.000000000 0.0000000000
## m_789                      0.0000000000 0.000000000 1.000000000 0.0000000000
## m_101112                   0.0000000000 0.000000000 0.000000000 1.0000000000
## Year_2014                  0.0269251481 0.038891849 0.041646947 0.0199350950
## Year_2015                  0.0361648444 0.053361345 0.099242726 0.1449335548
## Year_2016                  0.1338491296 0.103525046 0.106346484 0.1190558834
## Year_2017                  0.1034103410 0.115904936 0.123851650 0.1350979718
## Year_2018                  0.1241426612 0.149930748 0.148183864 0.1152404747
## Year_2019                  0.1247942048 0.153384461 0.146831530 0.1377300613
## Year_2020                  0.0891566265 0.003796095 0.000000000 0.0000000000
##                              Year_2014   Year_2015    Year_2016    Year_2017
## Land_transport_accident    0.015384615 0.060644346 0.0452995616 0.1048186786
## Food_water_shortages_death 0.002504174 0.055855856 0.0197066911 0.1003344482
## Boat_transport_accident    0.046943231 0.113360324 0.0900333457 0.1222593831
## Exhaustion_death           0.003891051 0.001228501 0.0000000000 0.0000000000
## Weather_condition_death    0.004504505 0.033726813 0.0029620853 0.0529131986
## Heat_Hyperthermia_death    0.043478261 0.021164021 0.0309575234 0.0113864702
## Hypothermia_death          0.002849003 0.013377926 0.0185048113 0.0182072829
## Murder                     0.012066365 0.051369863 0.0241691843 0.0667870036
## War_related_death          0.007604563 0.000000000 0.0007763975 0.0044085231
## Electrocution_fire_death   0.007299270 0.006038647 0.0015408320 0.0065741417
## Poor_health_death          0.000000000 0.019516334 0.1998327759 0.0304089479
## Fall_death                 0.007462687 0.002424242 0.0007733952 0.0043923865
## Other_accidents            0.017730496 0.004761905 0.0107941403 0.0057929037
## Hanging_suicide            0.003816794 0.000000000 0.0007776050 0.0007331378
## Hippo_croco_death          0.000000000 0.000000000 0.0000000000 0.0007363770
## Month2                     0.003965545 0.016386902 0.0204898214 0.0232785925
## more_death_missing_5       0.067164179 0.092485549 0.1078180889 0.0956097561
## m_123                      0.026925148 0.036164844 0.1338491296 0.1034103410
## m_456                      0.038891849 0.053361345 0.1035250464 0.1159049360
## m_789                      0.041646947 0.099242726 0.1063464837 0.1238516502
## m_101112                   0.019935095 0.144933555 0.1190558834 0.1350979718
## Year_2014                  1.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2015                  0.000000000 1.000000000 0.0000000000 0.0000000000
## Year_2016                  0.000000000 0.000000000 1.0000000000 0.0000000000
## Year_2017                  0.000000000 0.000000000 0.0000000000 1.0000000000
## Year_2018                  0.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2019                  0.000000000 0.000000000 0.0000000000 0.0000000000
## Year_2020                  0.000000000 0.000000000 0.0000000000 0.0000000000
##                               Year_2018    Year_2019    Year_2020
## Land_transport_accident    0.0939008337 0.0944259567 0.0098716683
## Food_water_shortages_death 0.0968017058 0.1453930685 0.0251162791
## Boat_transport_accident    0.1229088426 0.1087026852 0.0329358319
## Exhaustion_death           0.0012292563 0.0000000000 0.0000000000
## Weather_condition_death    0.0510046368 0.0760869565 0.0458715596
## Heat_Hyperthermia_death    0.0159908624 0.0153601695 0.0000000000
## Hypothermia_death          0.0087822014 0.0081256771 0.0160642570
## Murder                     0.0685504971 0.0283018868 0.0123893805
## War_related_death          0.0006116208 0.0005640158 0.0000000000
## Electrocution_fire_death   0.0006075334 0.0016835017 0.0000000000
## Poor_health_death          0.1493390497 0.1863507779 0.0210403273
## Fall_death                 0.0024434942 0.0005624297 0.0000000000
## Other_accidents            0.0006035003 0.0005571031 0.0000000000
## Hanging_suicide            0.0006123699 0.0022624434 0.0061349693
## Hippo_croco_death          0.0012300123 0.0000000000 0.0000000000
## Month2                     0.0262312792 0.0291125062 0.0006371687
## more_death_missing_5       0.0523012552 0.0486166008 0.0185185185
## m_123                      0.1241426612 0.1247942048 0.0891566265
## m_456                      0.1499307479 0.1533844615 0.0037960954
## m_789                      0.1481838637 0.1468315301 0.0000000000
## m_101112                   0.1152404747 0.1377300613 0.0000000000
## Year_2014                  0.0000000000 0.0000000000 0.0000000000
## Year_2015                  0.0000000000 0.0000000000 0.0000000000
## Year_2016                  0.0000000000 0.0000000000 0.0000000000
## Year_2017                  0.0000000000 0.0000000000 0.0000000000
## Year_2018                  1.0000000000 0.0000000000 0.0000000000
## Year_2019                  0.0000000000 1.0000000000 0.0000000000
## Year_2020                  0.0000000000 0.0000000000 1.0000000000

Correlation (association) between variables with value exceeding 0.15

ifelse(((jaccard_sim<=-0.15 | jaccard_sim>=0.15) & (jaccard_sim!=1))==FALSE, NA, jaccard_sim)
##                            Land_transport_accident Food_water_shortages_death
## Land_transport_accident                         NA                         NA
## Food_water_shortages_death                      NA                         NA
## Boat_transport_accident                         NA                         NA
## Exhaustion_death                                NA                         NA
## Weather_condition_death                         NA                  0.2438356
## Heat_Hyperthermia_death                         NA                         NA
## Hypothermia_death                               NA                         NA
## Murder                                          NA                         NA
## War_related_death                               NA                         NA
## Electrocution_fire_death                        NA                         NA
## Poor_health_death                               NA                  0.1875585
## Fall_death                                      NA                         NA
## Other_accidents                                 NA                         NA
## Hanging_suicide                                 NA                         NA
## Hippo_croco_death                               NA                         NA
## Month2                                          NA                         NA
## more_death_missing_5                            NA                         NA
## m_123                                           NA                         NA
## m_456                                           NA                         NA
## m_789                                           NA                         NA
## m_101112                                        NA                         NA
## Year_2014                                       NA                         NA
## Year_2015                                       NA                         NA
## Year_2016                                       NA                         NA
## Year_2017                                       NA                         NA
## Year_2018                                       NA                         NA
## Year_2019                                       NA                         NA
## Year_2020                                       NA                         NA
##                            Boat_transport_accident Exhaustion_death
## Land_transport_accident                         NA               NA
## Food_water_shortages_death                      NA               NA
## Boat_transport_accident                         NA               NA
## Exhaustion_death                                NA               NA
## Weather_condition_death                         NA               NA
## Heat_Hyperthermia_death                         NA               NA
## Hypothermia_death                               NA               NA
## Murder                                          NA               NA
## War_related_death                               NA               NA
## Electrocution_fire_death                        NA               NA
## Poor_health_death                               NA               NA
## Fall_death                                      NA               NA
## Other_accidents                                 NA               NA
## Hanging_suicide                                 NA               NA
## Hippo_croco_death                               NA               NA
## Month2                                          NA               NA
## more_death_missing_5                     0.2507346               NA
## m_123                                           NA               NA
## m_456                                           NA               NA
## m_789                                           NA               NA
## m_101112                                        NA               NA
## Year_2014                                       NA               NA
## Year_2015                                       NA               NA
## Year_2016                                       NA               NA
## Year_2017                                       NA               NA
## Year_2018                                       NA               NA
## Year_2019                                       NA               NA
## Year_2020                                       NA               NA
##                            Weather_condition_death Heat_Hyperthermia_death
## Land_transport_accident                         NA                      NA
## Food_water_shortages_death               0.2438356                      NA
## Boat_transport_accident                         NA                      NA
## Exhaustion_death                                NA                      NA
## Weather_condition_death                         NA                      NA
## Heat_Hyperthermia_death                         NA                      NA
## Hypothermia_death                               NA                      NA
## Murder                                          NA                      NA
## War_related_death                               NA                      NA
## Electrocution_fire_death                        NA                      NA
## Poor_health_death                               NA                      NA
## Fall_death                                      NA                      NA
## Other_accidents                                 NA                      NA
## Hanging_suicide                                 NA                      NA
## Hippo_croco_death                               NA                      NA
## Month2                                          NA                      NA
## more_death_missing_5                            NA                      NA
## m_123                                           NA                      NA
## m_456                                           NA                      NA
## m_789                                           NA                      NA
## m_101112                                        NA                      NA
## Year_2014                                       NA                      NA
## Year_2015                                       NA                      NA
## Year_2016                                       NA                      NA
## Year_2017                                       NA                      NA
## Year_2018                                       NA                      NA
## Year_2019                                       NA                      NA
## Year_2020                                       NA                      NA
##                            Hypothermia_death Murder War_related_death
## Land_transport_accident                   NA     NA                NA
## Food_water_shortages_death                NA     NA                NA
## Boat_transport_accident                   NA     NA                NA
## Exhaustion_death                          NA     NA                NA
## Weather_condition_death                   NA     NA                NA
## Heat_Hyperthermia_death                   NA     NA                NA
## Hypothermia_death                         NA     NA                NA
## Murder                                    NA     NA                NA
## War_related_death                         NA     NA                NA
## Electrocution_fire_death                  NA     NA                NA
## Poor_health_death                         NA     NA                NA
## Fall_death                                NA     NA                NA
## Other_accidents                           NA     NA                NA
## Hanging_suicide                           NA     NA                NA
## Hippo_croco_death                         NA     NA                NA
## Month2                                    NA     NA                NA
## more_death_missing_5                      NA     NA                NA
## m_123                                     NA     NA                NA
## m_456                                     NA     NA                NA
## m_789                                     NA     NA                NA
## m_101112                                  NA     NA                NA
## Year_2014                                 NA     NA                NA
## Year_2015                                 NA     NA                NA
## Year_2016                                 NA     NA                NA
## Year_2017                                 NA     NA                NA
## Year_2018                                 NA     NA                NA
## Year_2019                                 NA     NA                NA
## Year_2020                                 NA     NA                NA
##                            Electrocution_fire_death Poor_health_death
## Land_transport_accident                          NA                NA
## Food_water_shortages_death                       NA         0.1875585
## Boat_transport_accident                          NA                NA
## Exhaustion_death                                 NA                NA
## Weather_condition_death                          NA                NA
## Heat_Hyperthermia_death                          NA                NA
## Hypothermia_death                                NA                NA
## Murder                                           NA                NA
## War_related_death                                NA                NA
## Electrocution_fire_death                         NA                NA
## Poor_health_death                                NA                NA
## Fall_death                                       NA                NA
## Other_accidents                                  NA                NA
## Hanging_suicide                                  NA                NA
## Hippo_croco_death                                NA                NA
## Month2                                           NA                NA
## more_death_missing_5                             NA                NA
## m_123                                            NA                NA
## m_456                                            NA                NA
## m_789                                            NA                NA
## m_101112                                         NA         0.1550767
## Year_2014                                        NA                NA
## Year_2015                                        NA                NA
## Year_2016                                        NA         0.1998328
## Year_2017                                        NA                NA
## Year_2018                                        NA                NA
## Year_2019                                        NA         0.1863508
## Year_2020                                        NA                NA
##                            Fall_death Other_accidents Hanging_suicide
## Land_transport_accident            NA              NA              NA
## Food_water_shortages_death         NA              NA              NA
## Boat_transport_accident            NA              NA              NA
## Exhaustion_death                   NA              NA              NA
## Weather_condition_death            NA              NA              NA
## Heat_Hyperthermia_death            NA              NA              NA
## Hypothermia_death                  NA              NA              NA
## Murder                             NA              NA              NA
## War_related_death                  NA              NA              NA
## Electrocution_fire_death           NA              NA              NA
## Poor_health_death                  NA              NA              NA
## Fall_death                         NA              NA              NA
## Other_accidents                    NA              NA              NA
## Hanging_suicide                    NA              NA              NA
## Hippo_croco_death                  NA              NA              NA
## Month2                             NA              NA              NA
## more_death_missing_5               NA              NA              NA
## m_123                              NA              NA              NA
## m_456                              NA              NA              NA
## m_789                              NA              NA              NA
## m_101112                           NA              NA              NA
## Year_2014                          NA              NA              NA
## Year_2015                          NA              NA              NA
## Year_2016                          NA              NA              NA
## Year_2017                          NA              NA              NA
## Year_2018                          NA              NA              NA
## Year_2019                          NA              NA              NA
## Year_2020                          NA              NA              NA
##                            Hippo_croco_death Month2 more_death_missing_5 m_123
## Land_transport_accident                   NA     NA                   NA    NA
## Food_water_shortages_death                NA     NA                   NA    NA
## Boat_transport_accident                   NA     NA            0.2507346    NA
## Exhaustion_death                          NA     NA                   NA    NA
## Weather_condition_death                   NA     NA                   NA    NA
## Heat_Hyperthermia_death                   NA     NA                   NA    NA
## Hypothermia_death                         NA     NA                   NA    NA
## Murder                                    NA     NA                   NA    NA
## War_related_death                         NA     NA                   NA    NA
## Electrocution_fire_death                  NA     NA                   NA    NA
## Poor_health_death                         NA     NA                   NA    NA
## Fall_death                                NA     NA                   NA    NA
## Other_accidents                           NA     NA                   NA    NA
## Hanging_suicide                           NA     NA                   NA    NA
## Hippo_croco_death                         NA     NA                   NA    NA
## Month2                                    NA     NA                   NA    NA
## more_death_missing_5                      NA     NA                   NA    NA
## m_123                                     NA     NA                   NA    NA
## m_456                                     NA     NA                   NA    NA
## m_789                                     NA     NA                   NA    NA
## m_101112                                  NA     NA                   NA    NA
## Year_2014                                 NA     NA                   NA    NA
## Year_2015                                 NA     NA                   NA    NA
## Year_2016                                 NA     NA                   NA    NA
## Year_2017                                 NA     NA                   NA    NA
## Year_2018                                 NA     NA                   NA    NA
## Year_2019                                 NA     NA                   NA    NA
## Year_2020                                 NA     NA                   NA    NA
##                                m_456 m_789  m_101112 Year_2014 Year_2015
## Land_transport_accident           NA    NA        NA        NA        NA
## Food_water_shortages_death        NA    NA        NA        NA        NA
## Boat_transport_accident           NA    NA        NA        NA        NA
## Exhaustion_death                  NA    NA        NA        NA        NA
## Weather_condition_death           NA    NA        NA        NA        NA
## Heat_Hyperthermia_death           NA    NA        NA        NA        NA
## Hypothermia_death                 NA    NA        NA        NA        NA
## Murder                            NA    NA        NA        NA        NA
## War_related_death                 NA    NA        NA        NA        NA
## Electrocution_fire_death          NA    NA        NA        NA        NA
## Poor_health_death                 NA    NA 0.1550767        NA        NA
## Fall_death                        NA    NA        NA        NA        NA
## Other_accidents                   NA    NA        NA        NA        NA
## Hanging_suicide                   NA    NA        NA        NA        NA
## Hippo_croco_death                 NA    NA        NA        NA        NA
## Month2                            NA    NA        NA        NA        NA
## more_death_missing_5              NA    NA        NA        NA        NA
## m_123                             NA    NA        NA        NA        NA
## m_456                             NA    NA        NA        NA        NA
## m_789                             NA    NA        NA        NA        NA
## m_101112                          NA    NA        NA        NA        NA
## Year_2014                         NA    NA        NA        NA        NA
## Year_2015                         NA    NA        NA        NA        NA
## Year_2016                         NA    NA        NA        NA        NA
## Year_2017                         NA    NA        NA        NA        NA
## Year_2018                         NA    NA        NA        NA        NA
## Year_2019                  0.1533845    NA        NA        NA        NA
## Year_2020                         NA    NA        NA        NA        NA
##                            Year_2016 Year_2017 Year_2018 Year_2019 Year_2020
## Land_transport_accident           NA        NA        NA        NA        NA
## Food_water_shortages_death        NA        NA        NA        NA        NA
## Boat_transport_accident           NA        NA        NA        NA        NA
## Exhaustion_death                  NA        NA        NA        NA        NA
## Weather_condition_death           NA        NA        NA        NA        NA
## Heat_Hyperthermia_death           NA        NA        NA        NA        NA
## Hypothermia_death                 NA        NA        NA        NA        NA
## Murder                            NA        NA        NA        NA        NA
## War_related_death                 NA        NA        NA        NA        NA
## Electrocution_fire_death          NA        NA        NA        NA        NA
## Poor_health_death          0.1998328        NA        NA 0.1863508        NA
## Fall_death                        NA        NA        NA        NA        NA
## Other_accidents                   NA        NA        NA        NA        NA
## Hanging_suicide                   NA        NA        NA        NA        NA
## Hippo_croco_death                 NA        NA        NA        NA        NA
## Month2                            NA        NA        NA        NA        NA
## more_death_missing_5              NA        NA        NA        NA        NA
## m_123                             NA        NA        NA        NA        NA
## m_456                             NA        NA        NA 0.1533845        NA
## m_789                             NA        NA        NA        NA        NA
## m_101112                          NA        NA        NA        NA        NA
## Year_2014                         NA        NA        NA        NA        NA
## Year_2015                         NA        NA        NA        NA        NA
## Year_2016                         NA        NA        NA        NA        NA
## Year_2017                         NA        NA        NA        NA        NA
## Year_2018                         NA        NA        NA        NA        NA
## Year_2019                         NA        NA        NA        NA        NA
## Year_2020                         NA        NA        NA        NA        NA

Comment to the correlation matrix

The above matrices are, looking from the top,: a correlation matrix from features that we use in the Naive Bayes model, and another matrix that shows correlation value only if it exceeds absolute value of 0.15.

To measure the association between dichotomous variables we can use: Phi coefficient, Cramer’s V or Pearson’s R correlation, which yield exactly the same value in the case of binary data.

However, those aforementioned statistics treat equality 0=0 and 1=1 as equally important. In our dataset, the absence of a given cause of death (0) in two observations doesn’t translate into the same similarity as presence (1) of two the same causes of death. For instance, we have causes of death v and w.

v = (0, 0, 0, 0, 0, 1)
w = (1, 0, 0, 0, 0, 0), which according to standard measure of correlation are pretty much the same - in 4 cases (0=0) and 2 cases (1=0) or (0=1).

Now the other example

a = (0, 1, 0, 1, 0, 1)
b = (1, 1, 1, 0, 0, 0), this in turn has lower correlation (which is counter-intuitive) - 2 cases (0=0) or (1=1) and 4 cases (1=0) or (0=1).

Therefore, we are using Jaccard similarity (which ignores identity 0=0), whose formula is show below.

Caption for the picture.
Where

M11 represents the total number of attributes where A and B both have a value of 1
M10 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1
M01 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0
M00 represents the total number of attributes where A and B both have a value of 0.

As we suspected there is no strong relationship between the time and the causes of deaths.

Yet it seems that more migrants decide to cross the Meditarrenaen from October to March. It’s hard to say what might be the reason for that. I suspect (althought I am not certain) that from March to October the weather might be more severe, as the exposure to the Sun’s rays intensifies and temperatures are higher, and the migrants cannot take much water with them given how crammed those boats are. Also, they are not well protected against the sun.

There seems to be some association between weather conditions and Food/water shortages.

A possible links is again between high temperatures (severe weather conditions) and lack of water leading to deahydration and other medical ailments (Food/water shortages).

However (Pedro Domingos, Michael Pazzani, 1997, https://link.springer.com/content/pdf/10.1023%2FA%3A1007413511361.pdf) shows that when the assumption of independecy doesn’t hold the posteriori probability estimates may be biased, but the classifier itself (which classifies a observation to a class having the highest probability) proves to be quite effective.

Thus, we shouldn’t bother too much over the dependence of features (which, in fact, is negligible).

#3.2. Model

Although the Naive Bayes model is high-bias model (and not high-variance), it still can overfit, hence we will divide our dataset into training and test set. In our model, we will want to predict whether a given accident claims more than 5 lives. For the time being, we had been going through the quantitative analysis, checking clusters of accidents based on their location on the globe. Now, I would like to take a more qualitative approach trying to find out what factors has a significant impact on the gravity of each accident (whether more than 5 people died in it).

While creating traninig and test sets, it’s important to keep, in every set, observations that correspond to the labels that have the same ratio (of 0s and 1s) as in the whole dataset. It is especially crucial in NB classifier which assumes the prior probability (by default) basing on distribiution of the ratio of a target variable. The createDataPartition does stratified partitioning by default.

Moreover, as some features (causes of deaths) occur in a few observations, we set Laplace smooting, lest there be, in training set, a feature vector made of only 0s.

Choosing only those features we want in the model

df_model = as.data.frame(df[, c("Boat_transport_accident", "Land_transport_accident", 
                                "Food_water_shortages_death", "Poor_health_death", 
                                "m_123", "m_456", "m_789", "m_101112", "Year_2014", 
                                "Year_2015", "Year_2016", "Year_2017", "Year_2018", 
                                "Year_2019","Year_2020")])

label = as.factor(df$more_death_missing_5)
ind <- createDataPartition(df_model$Boat_transport_accident, p=0.3)[[1]]           

train_x <- sapply(df_model[ind, ], as.numeric)
train_y <- as.factor(label[ind])
test_x <- sapply(df_model[-ind, ], as.numeric)
test_y <- as.factor(label[-ind])

EDA for Maps Information

Missing <- read.csv("MissingMigrants-Global-2020-04-11T05-24-09.csv")

We call again for new dataset for this EDA.

data <- Missing %>% 
  replace_na(list(Number.Dead = 0, 
                  Minimum.Estimated.Number.of.Missing = 0,
                  Total.Dead.and.Missing = 0,
                  Number.of.Survivors = 0,
                  Number.of.Females=0,
                  Number.of.Males=0,
                  Number.of.Children = 0))


data <- data %>% 
  separate(Location.Coordinates, c("longitude","latitude"),",")

data[,c("longitude","latitude")] <- lapply(data[,c("longitude","latitude")],as.numeric)

We replace NA and separate “Location.Coordinates” to longitude and latitude and change to numeric.

Now call library leaflet for build the maps with infomation

library(leaflet)
leaflet() %>% 
  addTiles() %>% 
  addMarkers(data = data, 
             lat = ~latitude,
             lng = ~longitude,
             popup = paste("Area :",
                           data$Location.Description, "<br>",
                           "Time Accident:",
                           data$Reported.Date,"<br>",
                           "Number of Death:",
                           data$Number.Dead, "<br>",
                           "<a href =",
                           data$URL,">", "Link for the News", "</a>"),
             clusterOptions = markerClusterOptions())%>% addProviderTiles(providers$Stamen.Toner)

as you can see, you can choose for the maps and link to news.

And i will make Build Model , fitting the model, and Interpretation the model. and you can see my another project like :
https://rpubs.com/edgarna70/TS_Scotty
https://rpubs.com/edgarna70/TS_FnB https://rpubs.com/edgarna70/CL-TC https://justedgar.shinyapps.io/CapstoneDV/