Predicting Homicide Rates - A Machine Learning Approach to Analyzing Socio-Economic and Demographic Influences

Authors: Dardare, Zoé; Figueroa, Julian

Affiliation: UNIL

Published: May 19, 2024

This project was written by us and in our own words, except for quotations from published and unpublished sources, which are clearly indicated and acknowledged as such. We are conscious that the incorporation of material from other works or a paraphrase of such material without acknowledgement will be treated as plagiarism, subject to the custom and usage of the subject, according to the University Regulations. The source of any picture, map or other illustration is also indicated, as is the source, published or unpublished, of any material not resulting from our own research.

1 Abstract

This study leverages advanced machine learning techniques to analyze the impact of various socio-economic and demographic variables on homicide rates per 100,000 inhabitants across multiple countries.

Employing a comprehensive dataset sourced from reputable public databases, the project integrates variables such as political rights, corruption control, and economic indicators, focusing particularly on the homicide rate. After data cleaning and exploratory data analysis (EDA), we employed supervised learning models, including linear regression, gradient boosting machines (GBM), and random forests, as well as unsupervised learning techniques like K-means and hierarchical clustering. The results revealed that unemployment rates have no significant influence on homicide rates. Linear regression was inadequate for accurate predictions, while GBM improved accuracy but was outperformed by the random forest model, which provided the best predictive performance. The variable importance scores from the random forest model highlighted HDI and GDP per capita as the most influential factors on homicide rates, with unemployment rates deemed irrelevant.

Clustering analysis for the year 2022 using K-means identified three meaningful groups of countries based on socio-economic similarities, while hierarchical clustering captured more nuanced patterns. Consistent results across different years reinforced the robustness of these findings.

The study highlights the importance of improving human development and economic conditions to reduce homicide rates. It also suggests the potential for policy interventions targeted at enhancing public safety and promoting social stability. The main limitation is the potential overfitting of the random forest model, suggesting the need for more robust models and additional relevant variables. The recommendations emphasize the use of machine learning as a tool for policy-making and suggest areas for future research, including the integration of additional variables and the application of more robust models to prevent overfitting.

2 Introduction

Understanding the factors that influence homicide rates is crucial for developing effective policies to improve public safety and social stability. This machine learning project focuses on predicting and analyzing the impact of various economic and socio-demographic variables on homicide rates per 100,000 inhabitants across multiple countries. By leveraging both supervised and unsupervised learning techniques, we aim to uncover the key drivers of violent crime and provide insights into which kind of factors influence a country’s homicide rate.

Homicide rates are a significant indicator of societal well-being, reflecting the effectiveness of governance, economic health, and social cohesion within a country. High homicide rates often signal underlying issues such as poverty, inequality, and political instability, while lower rates generally indicate stronger governance and better social conditions. Analyzing these rates can help identify critical areas for improvement and guide efforts to enhance public safety.

For this study, we integrated multiple data sets to create a comprehensive data set encompassing a range of variables. These include political rights, control of corruption, unemployment rates and the primary variable of interest: the homicide rates. The data spans multiple countries and years, forming a time series that captures temporal dynamics and trends.

The data sets were sourced from reputable public databases and international organizations, ensuring high-quality and reliable data. After initial data cleaning and wrangling to merge these data sets, we conducted an Exploratory Data Analysis (EDA) to understand the underlying patterns and relationships within the data. This preliminary analysis provided valuable insights and guided the subsequent application of machine learning models.

We employed a variety of machine learning models to achieve our objectives. For supervised learning, models such as linear regression, gradient boosting machines and random forests were used to predict homicide rates based on the socio-economic and demographic variables. These models helped quantify the influence of each factor on homicide rates. For unsupervised learning, clustering techniques like K-means and hierarchical clustering were applied to identify natural groupings of countries with similar profiles, providing a deeper understanding of the data’s structure.
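As a preview of this pipeline, here is a minimal sketch of the kind of supervised fit discussed later, assuming the randomForest package, the combined data set all_data built in Section 3, and an illustrative choice of predictors; the models actually tuned in Section 5 may differ.

Code
# Minimal sketch of a supervised fit (assumptions: randomForest package,
# the combined data set all_data from Section 3, illustrative predictors)
library(randomForest)

set.seed(123)  # reproducibility
y <- all_data$`Homicides per 100,000 people`
x <- all_data[, c("GDP per capita (USD)", "Unemployment Rate", "HDI Value")]

# 80/20 train-test split
idx <- sample(seq_len(nrow(all_data)), size = floor(0.8 * nrow(all_data)))
rf_fit <- randomForest(x = x[idx, ], y = y[idx], ntree = 500, importance = TRUE)

# Out-of-sample error and variable importance
pred <- predict(rf_fit, newdata = x[-idx, ])
sqrt(mean((pred - y[-idx])^2))  # RMSE on the held-out 20%
importance(rf_fit)              # which predictors matter most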

The findings of this study are organized as follows: in Section 3, we describe the data, detailing its sources, structure, and the cleaning and merging process. Section 4 presents the Exploratory Data Analysis, highlighting key patterns and correlations. Sections 5 and 6 are dedicated to the methodology, explaining the machine learning models and techniques used: in Section 5, we implement and discuss the results of the supervised learning models, identifying the most significant predictors of homicide rates, and Section 6 covers the unsupervised learning analysis, exploring the clustering results and their implications. Finally, Section 7 concludes the study, summarizing the findings and suggesting potential policy recommendations and areas for future research.

By combining advanced machine learning techniques with comprehensive socio-economic data, this project aims to test and understand the functioning of several types of machine learning models and to provide actionable insights into the factors driving homicide rates. The results could help policymakers and international organizations design targeted interventions to enhance public safety and promote social stability globally.

3 Data

3.1 Data Source

The data set we'll be working with for this project is a combination of several data sets containing variables that could influence a country's homicide rate.

Therefore, we collected socio-economic and demographic data, data on security and law enforcement, data on governance and legislation, and health and social well-being indicators. Below are the sources of all our data sets.

  1. The homicide rate per country
  2. The GDP per country
  3. The unemployment rate per country
  4. The governance indicators per country
  5. The number of police officers by 1,000 people per country
  6. The freedom rankings per country

3.2 Data Description

The HDI (Human Development Index):

The HDI is a summary composite measure of a country's average achievements in three basic dimensions of human development:

  • a long and healthy life, as measured by life expectancy at birth;
  • knowledge, as measured by mean years of schooling and expected years of schooling; and
  • a decent standard of living, as measured by GNI per capita in PPP terms in USD.

The HDI sets a minimum and a maximum for each dimension, called “goalposts”, then shows where each country stands in relation to these goalposts. This is expressed as a value between 0 and 1. The higher a country’s human development, the higher its HDI value.
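To make the goalpost mechanics concrete, here is a schematic sketch with hypothetical country values; the goalposts below follow the UNDP technical notes (income is log-transformed), but this is a simplified illustration rather than the official computation.

Code
# Schematic sketch of the HDI goalpost normalisation (hypothetical values)
dim_index <- function(x, min_gp, max_gp) (x - min_gp) / (max_gp - min_gp)

life_exp <- 75; mean_sch <- 10; exp_sch <- 14; gni_pc <- 20000  # hypothetical country

health    <- dim_index(life_exp, 20, 85)
education <- mean(c(dim_index(mean_sch, 0, 15), dim_index(exp_sch, 0, 18)))
income    <- dim_index(log(gni_pc), log(100), log(75000))

hdi <- (health * education * income)^(1/3)  # geometric mean of the three dimensions
round(hdi, 3)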

The Unemployment Rate per Country

This data set contains the unemployment rate of the population aged 15 and over for each country.

The GDP per capita per Country

This variable contains the Gross Domestic Product (GDP) per capita for each country of the dataset, which is an economic metric that represents the average economic output (or income) per person. The values are in USD.

The Governance Indicators per Country

We chose this data set in order to quantify corruption per country using governance indicators from the World Bank's Worldwide Governance Indicators (WGI) project.

Below are the indicators we will use in our analysis:

  • Control of Corruption: Estimate - This measure can provide insight into the level of corruption within a country. Lower corruption might be associated with more effective law enforcement and potentially lower crime rates.

  • Government Effectiveness: Estimate - This could reflect how well the government is able to implement its policies, including those related to crime prevention and justice.

  • Political Stability and Absence of Violence/Terrorism: Estimate - A lack of political stability and the presence of violence or terrorism could be related to higher homicide rates, as they often indicate broader societal issues that could lead to crime.

Freedom Rankings

This data set tracks the quality of political rights and civil liberties in each country from 2013 to 2022. All data are official figures from Freedom House, compiled and structured by a Kaggle user.

Rating system explanation (source): A country or territory is awarded 0 to 4 points for each of 10 political rights indicators and 15 civil liberties indicators, which take the form of questions; a score of 0 represents the smallest degree of freedom and 4 the greatest degree of freedom. The political rights questions are grouped into three subcategories: Electoral Process (3 questions), Political Pluralism and Participation (4), and Functioning of Government (3). The civil liberties questions are grouped into four subcategories: Freedom of Expression and Belief (4 questions), Organizational Rights (3), Rule of Law (4), and Personal Autonomy and Individual Rights (4). The highest overall score that can be awarded for political rights is 40 (or a score of 4 for each of the 10 questions). The highest overall score that can be awarded for civil liberties is 60 (or a score of 4 for each of the 15 questions).
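To make the arithmetic concrete, a toy sketch of the aggregation with hypothetical question scores:

Code
# Toy sketch of the Freedom House aggregation (hypothetical question scores):
# 0-4 points per question, 10 political rights and 15 civil liberties questions
set.seed(1)
pr_questions <- sample(0:4, 10, replace = TRUE)
cl_questions <- sample(0:4, 15, replace = TRUE)

pr_score <- sum(pr_questions)  # aggregate political rights score, max 40
cl_score <- sum(cl_questions)  # aggregate civil liberties score, max 60
c(PR = pr_score, CL = cl_score, Total = pr_score + cl_score)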

Table of variables:

Code
# tibble, knitr and kableExtra provide tibble(), kable() and kable_styling()
library(tibble)
library(knitr)
library(kableExtra)

variables_tibble <- tibble(
  Variables = c("Country", "Region", "Year","Homicides per 100,000 people", "GDP per Capita (USD)", 
                "Civil Liberties", "Political Rights",
                "Control of Corruption: Estimate", "Government Effectiveness: Estimate", 
                "Political Stability and Absence of Violence/Terrorism: Estimate", "Unemployment Rate", "HDI Value"),
  Meaning = c("Name of the country where the data applies", 
              "Geographical or political region to which the country belongs", 
              "Specific year during which the data pertains to", 
              "Number of victims of homicides recorded per 100,000 inhabitants in a given year", 
              "Gross Domestic Product divided by the midyear population, expressed in U.S. dollars", 
              "Measure of the freedoms and rights enjoyed by individuals within the country", 
              "Measure reflecting the degree to which a country's political processes are free from undue influence or control, allowing for the full participation of its citizens", 
              "Qualitative evaluation of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as capture of the state by elites and private interests", 
              "Assessment of the quality of public services, the civil service and its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies", 
              "Measure that evaluates the likelihood of political instability and/or politically motivated violence, including terrorism", 
              "Percentage of the labor force that is unemployed and actively seeking employment during the year", 
              "Human Development Index value")
)

# Create a table using kable from the knitr package
kable(variables_tibble) %>%
  kable_styling()
| Variables | Meaning |
|---|---|
| Country | Name of the country where the data applies |
| Region | Geographical or political region to which the country belongs |
| Year | Specific year during which the data pertains to |
| Homicides per 100,000 people | Number of victims of homicides recorded per 100,000 inhabitants in a given year |
| GDP per Capita (USD) | Gross Domestic Product divided by the midyear population, expressed in U.S. dollars |
| Civil Liberties | Measure of the freedoms and rights enjoyed by individuals within the country |
| Political Rights | Measure reflecting the degree to which a country's political processes are free from undue influence or control, allowing for the full participation of its citizens |
| Control of Corruption: Estimate | Qualitative evaluation of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as capture of the state by elites and private interests |
| Government Effectiveness: Estimate | Assessment of the quality of public services, the civil service and its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies |
| Political Stability and Absence of Violence/Terrorism: Estimate | Measure that evaluates the likelihood of political instability and/or politically motivated violence, including terrorism |
| Unemployment Rate | Percentage of the labor force that is unemployed and actively seeking employment during the year |
| HDI Value | Human Development Index value |

3.3 Data Wrangling/Cleanup

3.3.1 Homicide Rate Data Set

First, before cleaning, we make sure that the values of the data set won't be displayed in scientific notation. We can see that the years are not sorted in order and that the countries are not in alphabetical order, so we sort them. We also remove a few variables that are not relevant for our project, to simplify the data set: X (which simply counts the observations), Iso3_code (the country code) and Source.

Code
#load dataset
homicide_per_country <- read.csv("data/intenional_homicide_data.csv")

options(scipen = 999)

sorted_dataset <- homicide_per_country[order(homicide_per_country$Year, homicide_per_country$Country, decreasing = c(TRUE, FALSE), method = "radix"), ]

#remove irrelevant columns
homicide_per_country <- subset(sorted_dataset, select = -c(X, Iso3_code, Source))

Selecting relevant observations

While analyzing our data set, and especially the metadata available on the website, we saw that two different units of measurement are used: counts or rate per 100,000 population. For the purpose of our project, we only keep observations per 100,000 people, because this avoids bias related to size differences between countries; the count value gives the total number of homicides in a country, which is less relevant for comparison. We can then drop this column and rename the variable VALUE to Homicides per 100,000 people. We also decided that the variables Dimension and Category are not relevant for our project: the different values they take complicate the joining of our data sets, so we only keep the value "Total" of those two variables for the upcoming analysis.
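A quick illustration, with hypothetical numbers, of why rates are preferred over raw counts:

Code
# Hypothetical illustration: the same homicide count implies very different
# risk levels once population size is taken into account
counts     <- c(small_country = 500, large_country = 500)
population <- c(small_country = 1e6, large_country = 5e7)

counts / population * 1e5  # 50 vs 1 homicide per 100,000 people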

Code
filtered_dataset <- subset(homicide_per_country, Unit.of.measurement == "Rate per 100,000 population")

main_dataset <- subset(filtered_dataset, select = -c(Unit.of.measurement))
colnames(main_dataset)[colnames(main_dataset) == "VALUE"] <- "Homicides per 100,000 people"

main_dataset <- main_dataset %>%
  filter(Dimension == "Total", Category == "Total")

The Indicator variable

Now, let's analyse the different values that the Indicator variable can take. There are quite a few: the number of deaths from homicides, but also the number of persons arrested for those homicides, and so on.

For this project, we only keep the number of victims of intentional homicide, to simplify our analysis: since we are going to join other data to this data set, we only need the number of victims of homicide per 100,000 people for the later machine learning work. We also remove the observations detailing sex and age group, because our analysis will not concentrate on these factors, and then drop the columns that no longer carry useful information. Finally, we round the values to three decimals.

Code
main_dataset <- main_dataset %>%
  filter(Indicator == "Victims of intentional homicide")

main_dataset <- main_dataset[main_dataset$Sex == "Total" & main_dataset$Age == "Total", ]

main_dataset <- main_dataset %>%
  select(-Subregion, -Dimension, -Category, -Sex, -Age, -Indicator)  #we also take away the Dimension and Category variables which are now irrelevant to display

main_dataset$`Homicides per 100,000 people` <- round(main_dataset$`Homicides per 100,000 people`, 3)

3.3.2 HDI per Country Data Set

Code
#load dataset
hdi_per_country <- read.csv("data/HDI by country.csv")

This data set contains information about the Human Development Index (HDI), but also other variables that we will not take into account. We therefore first remove every column that does not contain HDI information. The year columns carry the prefix hdi_, which we remove to simplify the later combination of data sets, and we then pivot the data to create a Year column matching the other data sets. Finally, we remove the columns that we will not use in our analysis (including the categorical ones, which give no further information) and rename the value column HDI Value for better understanding.

Code
#keep columns containing HDI information
index_hdi_2022 <- which(names(hdi_per_country) == "hdi_2022")
hdi_per_country <- select(hdi_per_country, 1:index_hdi_2022)

# Remove the "hdi_" prefix from column names containing a year
names(hdi_per_country) <- gsub("hdi_", "", names(hdi_per_country))

#pivoting the dataset to create a column Year
hdi_per_country <- pivot_longer(
  hdi_per_country,
  cols = matches("^(199[0-9]|200[0-9]|201[0-9]|202[0-2])$"),
  names_to = "Year",
  values_to = "Value"
)

#remove irrelevant columns
hdi_per_country$iso3 <- NULL
hdi_per_country$region <- NULL
hdi_per_country$rank_2022 <- NULL
hdi_per_country$hdicode <- NULL

#rename columns
hdi_per_country <- hdi_per_country %>% 
  rename("HDI Value" = "Value",
         "Country" = "country")

3.3.3 Unemployment Rate Data Set

Code
#load dataset
unemployment_rate <- read.csv("data/Unemployment (%).csv")

This data set contains two types of unemployment rate estimates: the national estimate and the modeled ILO estimate.

We decided to keep the modeled ILO estimate, as this methodology involves statistical modeling techniques such as econometric modeling, interpolation, and extrapolation. The aim is to produce estimates that are comparable across countries by adhering to standardized definitions and methodologies, which often involves adjusting or re-calculating data provided by national authorities to fit international standards.
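As a toy illustration of one ingredient of such modeling (not the ILO's actual methodology), linear interpolation can fill in a missing year:

Code
# Toy sketch: linearly interpolating a missing year with base R's approx()
years <- c(2015, 2016, 2018, 2019)  # 2017 is missing (hypothetical rates)
rates <- c(6.1, 5.8, 5.2, 5.0)
approx(years, rates, xout = 2017)$y  # -> 5.5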

The data set also details the gender of the unemployed population in each country. We remove these lines, as our analysis takes a more global scope, and keep only the lines concerning the total population aged 15 and over. As for all data sets, we clean it by removing irrelevant columns and renaming columns for better understanding.

Code
# keep only the total, modeled ILO estimate for the population aged 15 and over
unemployment_rate <- subset(unemployment_rate, Disaggregation == "total, 15+, Modeled")

#remove irrelevant columns
unemployment_rate$Country.Code <- NULL
unemployment_rate$Indicator.Code <- NULL
unemployment_rate$`Age Group` <- NULL
unemployment_rate$Indicator.Name <- NULL
unemployment_rate$Disaggregation <- NULL

#rename columns
unemployment_rate <- unemployment_rate %>% 
  rename("Unemployment Rate" = "Value",
         "Country" = "Country.Name")

3.3.4 Corruption Indicators Data Set

In order to have the Year column in the same format as in the other data sets, we remove the prefix from the columns' names and pivot the data set to obtain a Year column. Then, we remove the irrelevant columns. Finally, we pivot the Series.Name column, which contains the names of the three indicators: the estimate of control of corruption, the estimate of government effectiveness, and the estimate of political stability and absence of violence/terrorism.

Code
#load dataset
corruption_data <- read.csv("data/corruption.csv")

#remove the characters on the years columns
names(corruption_data) <- gsub(".*YR(\\d{4}).*", "\\1", names(corruption_data))

#pivoting the dataset to create a column Year
corruption_data <- pivot_longer(
  corruption_data,
  cols = matches("^(199[6-9]|200[0-9]|201[0-9]|202[0-2])$"),
  names_to = "Year",
  values_to = "Value"
)

#remove columns
corruption_data$Country.Code <- NULL
corruption_data$Series.Code <- NULL

corruption_data <- corruption_data %>%
  filter(Series.Name != "" & !is.na(Series.Name))

corruption_data <- corruption_data %>%
  rename("Country" = "Country.Name")    #renaming the "Country" variable

#pivoting column Series.Name
tidy_corruption <- corruption_data %>%
  pivot_wider(names_from = Series.Name, values_from = Value, id_cols = c(Country, Year))

3.3.5 Freedom Rankings Data Set

Each letter followed by a number designates a question that belongs to one of the subcategories explained in the Data Description. The aggregate for the Political Rights category does not exist in the raw data, so we create it by summing the aggregate scores of its subcategories.

In order to simplify our data set, we will only keep the aggregate scores of each subcategory (A, B, C, D, E, F and G), which together make up the aggregate scores for the Civil Liberties and Political Rights categories. Then, we rename columns to match the other data sets and remove the lines that contain information about territories, as our main data set only contains information about countries. We also remove the columns PR.rating and CL.rating, which were assigned to countries prior to 2020, and keep only the columns referring to the aggregate scores.

Code
#load dataset
freedom_ranking <- read.csv("data/Freedom in the World 2013-2022 Dataset (Ver 2.18.23).csv")

# Political Rights aggregate = sum of its subcategory aggregate scores A, B and C
freedom_ranking$PR <- freedom_ranking$A + freedom_ranking$B + freedom_ranking$C

cols_to_keep <- c("Country.Territory","Region","C.T","Edition","Status","PR.rating","CL.rating","PR","CL","Total")
freedom_ranking_subset <- freedom_ranking[, cols_to_keep]

#removing lines concerning territories
freedom_ranking_subset <- freedom_ranking_subset[freedom_ranking_subset$C.T != 't', ]

#removing columns
freedom_ranking_subset$C.T <- NULL
freedom_ranking_subset$PR.rating <- NULL
freedom_ranking_subset$CL.rating <- NULL
freedom_ranking_subset$Status <- NULL
freedom_ranking_subset$Total <- NULL #not kept, as it is the sum of two scores already in the dataset

#renaming columns
freedom_ranking_subset <- freedom_ranking_subset %>%
  rename("Country" = "Country.Territory",
         "Year" = "Edition",
         "Civil Liberties" = "CL",
         "Political Rights" = "PR")

3.3.6 GDP per Capita Data Set

We need to pivot all of the year columns of this data set under one variable, which we will call Year, and rename the country column; the GDP values themselves are converted to numeric later, when the data sets are combined.

Code
library(readxl)
dataGPD <- read_excel("data/clean_GDPperCapita.xlsx")

library(tidyr)
datasetGPD <- pivot_longer(
  dataGPD,
  cols = all_of(as.character(1980:2028)),  # all of the year columns
  names_to = "Year",
  values_to = "GDP per capita (USD)"
)


datasetGPD <- datasetGPD %>%
  rename("Country" = `GDP per capita, current prices
 (U.S. dollars per capita)`)

3.3.7 Combined Data Set

We will now join all of our data sets in order to conduct our analysis. Note that the final data set will not contain observations across all years for all countries, as some countries lack homicide rate measurements for certain years. We then tidy this final data set to simplify its structure and verify that it does not contain any missing values.

Code
main_dataset$Year <- as.character(main_dataset$Year)
datasetGPD$Year <- as.character(datasetGPD$Year)
freedom_ranking_subset$Year <- as.character(freedom_ranking_subset$Year)
tidy_corruption$Year <- as.character(tidy_corruption$Year)
unemployment_rate$Year <- as.character(unemployment_rate$Year)
hdi_per_country$Year <- as.character(hdi_per_country$Year)

# Perform inner joins : join all data sets by 'Country' and 'Year'
all_data <- main_dataset %>%
  inner_join(datasetGPD, by = c("Country", "Year")) %>%
  inner_join(freedom_ranking_subset, by = c("Country", "Year")) %>%
  inner_join(tidy_corruption, by = c("Country", "Year")) %>%
  inner_join(unemployment_rate, by = c("Country", "Year")) %>%
  inner_join(hdi_per_country, by = c("Country", "Year"))

# Remove 'Region.y' and rename 'Region.x' to 'Region'
all_data <- all_data %>%
  select(-Region.y) %>%
  rename(Region = Region.x)

# Rounding to three decimals the variables that need it
all_data <- all_data %>%
  mutate(
    `GDP per capita (USD)` = round(as.numeric(`GDP per capita (USD)`), 3),
    `Control of Corruption: Estimate` = round(as.numeric(`Control of Corruption: Estimate`), 3),
    `Government Effectiveness: Estimate` = round(as.numeric(`Government Effectiveness: Estimate`), 3),
    `Political Stability and Absence of Violence/Terrorism: Estimate` = round(as.numeric(`Political Stability and Absence of Violence/Terrorism: Estimate`), 3)
  )

#remove region column
all_data$Region <- NULL

# Count the number of NAs per column and per year
missing_data_summary <- all_data %>%
  group_by(Year) %>%
  summarise(across(everything(), ~sum(is.na(.)), .names = "NA_in_{.col}"))

4 Exploratory data analysis

4.1 Summary Table

Firstly, here is a summary of our data set.

Code
all_data_numeric <- all_data[, sapply(all_data, is.numeric) & names(all_data) != "Year"]

# Generate a summary table
summary(all_data_numeric) %>%
  kable() %>%
  kable_styling()
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Homicides per 100,000 people | 0.0000 | 0.8377 | 1.6855 | 7.8170 | 8.8825 | 106.8200 |
| GDP per capita (USD) | 255.3 | 5259.6 | 13215.2 | 24234.7 | 40430.3 | 134925.2 |
| Political Rights | 1.00 | 25.00 | 34.00 | 29.95 | 38.00 | 40.00 |
| Civil Liberties | 6.00 | 33.00 | 47.00 | 43.46 | 56.00 | 60.00 |
| Control of Corruption: Estimate | -1.5970 | -0.4298 | 0.1790 | 0.4038 | 1.2873 | 2.4030 |
| Government Effectiveness: Estimate | -2.2270 | -0.2203 | 0.2995 | 0.4685 | 1.2648 | 2.2850 |
| Political Stability and Absence of Violence/Terrorism: Estimate | -1.9640 | -0.2627 | 0.2980 | 0.2579 | 0.8985 | 1.6200 |
| Unemployment Rate | 0.100 | 4.003 | 6.145 | 7.354 | 9.273 | 27.690 |
| HDI Value | 0.416 | 0.721 | 0.810 | 0.796 | 0.904 | 0.967 |

Homicides per 100,000 people

Extreme variability: the minimum value is 0, indicating that some countries recorded no homicides, while the maximum value is extremely high (106.82), showing that other countries are much more violent.

Asymmetric distribution: the median (1.685) being much lower than the mean (7.817) suggests a distribution strongly skewed to the right, with a few countries with very high homicide rates pulling the mean upwards.

GDP per capita (USD)

Broad wealth scale: GDP per capita ranges from USD 255.3 to USD 134,925.2, showing a significant gap between the poorest and richest economies.

Distribution concentrated in the lower ranges: the median (USD 13,215.2) is well below the mean (USD 24,234.7), indicating a concentration of data in the lower ranges, with some very high values pulling the mean upwards.

Political Rights and Civil Liberties

These two measures vary on scales where the minimum and maximum values are fairly close together. This suggests a degree of uniformity in the assessment of political rights and civil liberties, although there is still considerable variability between the quartiles.

Governance indices (Control of Corruption, Government Effectiveness, Political Stability)

Wide governance range: the values for Control of Corruption and Government Effectiveness run from negative to positive, indicating that some countries are well governed while others struggle with corruption and inefficiency.

Political stability: similarly, political stability shows a wide range, from strongly negative to slightly positive, reflecting significant variability in political stability between countries.

Unemployment Rate

Major economic problems: with an average unemployment rate of 7.35% and a maximum value of 27.69%, some countries are clearly facing major economic challenges.

HDI Value (Human Development Index)

Human Development Scale: Values range from 0.416 to 0.967, showing a wide range of human development, from very underdeveloped to highly developed countries.

4.2 Correlation Matrix

A correlation matrix is useful because it provides a concise summary of the linear relationships between multiple variables, helping to identify patterns and dependencies that can guide further analysis. Additionally, it helps detect multi-collinearity, visualize relationships between variables and identify groups of correlated variables.

Code
library(ggcorrplot)
#ordering the correlation matrix using hierarchical clustering
corr <- round(cor(all_data_numeric), 1)
ggcorrplot(corr, lab = TRUE, lab_size = 2.5, hc.order = TRUE, outline.color = "white",
           ggtheme = ggplot2::theme_minimal(), tl.cex = 7)

Very Strong Correlations: the blocks of dark red squares indicate strong positive correlations. For example, Civil Liberties and Political Rights are strongly positively correlated with each other. This is expected, as they both measure aspects of freedom within a country.

Strong Correlation with GDP per Capita (USD): GDP per capita (USD) is highly correlated with Control of Corruption: Estimate, Government Effectiveness: Estimate, HDI Value, and Political Stability and Absence of Violence/Terrorism: Estimate. A higher GDP per capita is associated with better governance, higher human development, and more political stability.

Moderate to High Positive Correlations: Government Effectiveness: Estimate, Control of Corruption: Estimate, and HDI Value show moderate to high positive correlations with one another. This suggests that effective governance and control of corruption are associated with higher human development indices.

Weak Correlations with Unemployment Rate: The Unemployment Rate has weak correlations with most of the other variables, as indicated by lighter-colored squares (both red and blue). This could mean that the unemployment rate may not be as strongly affected by the above factors or does not influence them as much, at least not in a direct linear way.

Negative Correlations: There are some variables with weak negative correlations, as indicated by the light blue squares. For instance, Unemployment Rate has a slight negative correlation with GDP per capita (USD) and HDI Value. While the correlations are weak, they do suggest that higher unemployment may be associated with lower GDP per capita and human development index, which fits with economic theory.

No Correlation: a correlation coefficient close to 0, indicated by white squares, means there is no linear relationship between those variables. For instance, Unemployment Rate and Homicides per 100,000 people show no substantial correlation, implying that there is no linear relationship between them in this data set. This suggests that it may not be relevant to keep working with this variable for the purpose of our project.
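This reading of the matrix can be checked directly; a minimal sketch testing the pairwise correlation:

Code
# Direct check of the weak linear relationship noted above
cor.test(all_data_numeric$`Unemployment Rate`,
         all_data_numeric$`Homicides per 100,000 people`)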

4.3 Principal Component Analysis (PCA)

Graph of variables

Using PCA, we can capture the most influential and significant aspects of our data set, which can facilitate more targeted and informative analyses.

Code
 # Load necessary library
library(FactoMineR)
library(factoextra)  # This package enhances PCA plots from FactoMineR

# Assuming 'all_data_numeric' contains only numeric data
# Perform PCA on the numeric data
pca_data <- PCA(all_data_numeric, graph = FALSE)  # 'graph = FALSE' stops automatic plotting

# Plot the variable correlation circle (circle plot)
# This plot shows the relationships among the variables.
fviz_pca_var(pca_data, col.var = "contrib",  # Color by contribution to the PCA
             gradient.cols = c("blue4", "blue", "cadetblue3"),
             repel = TRUE)  # Avoid text overlapping

Code
summary(pca_data) 

Variance Explained by Principal Components:

Dim.1 explains 57% of the total variance, indicating that it captures much of the information present in the data. Dim.2 explains 16.5% of the variance, which is significant but much less than the first dimension. The other dimensions explain progressively less variance, with very low values after Dim.3. The first two dimensions combined explain around 73.5% of the total variance, which means that these two principal components provide a reasonably complete representation of the data.
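The variance decomposition can also be inspected directly; a short sketch using the factoextra helpers loaded above:

Code
# Eigenvalues and percentage of variance per dimension, plus a scree plot
get_eig(pca_data)                     # table of eigenvalues and cumulative variance
fviz_eig(pca_data, addlabels = TRUE)  # scree plot of the same information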

Variables Strongly Correlated with Dim.1:

Variables such as Government Effectiveness: Estimate and Control of Corruption: Estimate show strong positive correlations with the first dimension, indicating that they contribute significantly to what this dimension represents. Homicides per 100,000 people is negatively correlated with Dim.1, indicating an inverse relationship with the other variables strongly related to this dimension.

Variables in Dim.2 and Dim.3:

Some variables such as Unemployment Rate show significant contributions to the secondary dimensions, which may indicate that they capture different or more specific aspects of the data compared to the first dimension.

For Dim.1, since it seems to be heavily influenced by variables related to governance and development indices (like Political Rights, Civil Liberties, Control of Corruption: Estimate, and HDI Value), we will name it Governance and Development Factor.

Given that the second principal component (Dim.2) is heavily influenced by variables related to societal measures such as Homicides per 100,000 people, Political Rights, and Civil Liberties, it appears to capture elements related to societal stability and personal freedoms. We will name it Societal Stability and Freedom Factor.

The results suggest that to obtain a complete understanding of the data, it is beneficial to consider at least the first two main dimensions and to examine the observations and variables that contribute most to these dimensions.

4.4 Check for missing values

The data set should not contain any missing values, since we handled all of them in the Data Wrangling/Cleanup part of this report. Let's check that this is the case:

Code
# Summarize missing values by each column
unlist(lapply(all_data_numeric[, !names(all_data_numeric) %in% "Year"], function(x) sum(is.na(x))))
                                   Homicides per 100,000 people 
                                                              0 
                                           GDP per capita (USD) 
                                                              0 
                                               Political Rights 
                                                              0 
                                                Civil Liberties 
                                                              0 
                                Control of Corruption: Estimate 
                                                              0 
                             Government Effectiveness: Estimate 
                                                              0 
Political Stability and Absence of Violence/Terrorism: Estimate 
                                                              0 
                                              Unemployment Rate 
                                                              0 
                                                      HDI Value 
                                                              0 

4.5 Visualizing data distributions

4.5.1 Distribution of each variable

Code
# Histograms for all numeric variables
all_data_numeric %>%
  select_if(is.numeric) %>%
  gather(key = "variables", value = "values") %>%
  ggplot(aes(x = values)) +
    geom_histogram(bins = 30) +
    facet_wrap(~variables, scales = 'free_x')

Civil Liberties and Political Rights: Both of these distributions are left-skewed, indicating that most countries have high scores in these categories, with fewer countries having low scores. This could suggest a general global trend towards better civil liberties and political rights.

Control of Corruption and Government Effectiveness: These distributions are more centered, resembling a normal distribution. This suggests a balance, with many countries having moderate scores in these categories, but also a significant number with either high or low scores.

GDP per capita (USD): This distribution is heavily right-skewed, with most countries having low to moderate GDP per capita and a few outliers with very high GDP per capita. This indicates significant economic disparity between countries.

HDI Value: The distribution is left-skewed, with a majority of countries having high Human Development Index (HDI) values, suggesting better overall development in terms of health, education, and standard of living.

Homicides per 100,000 People: This distribution is also heavily right-skewed, with most countries having low homicide rates and a few outliers with very high rates. This indicates that high homicide rates are less common but can be extreme in certain regions.

Political Stability and Absence of Violence/Terrorism: The distribution is centered but slightly skewed, suggesting that most countries have moderate scores in this category, with some countries experiencing higher stability and others facing more violence or terrorism.

Unemployment Rate: This distribution is right-skewed, with most countries having low to moderate unemployment rates and fewer countries with very high unemployment rates.
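These visual impressions can be confirmed numerically; a base-R sketch computing the sample skewness of each variable (positive values indicate right skew, negative values left skew):

Code
# Sample skewness per variable (base R, no extra packages)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
sort(sapply(all_data_numeric, skewness))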

4.5.2 Examining potential outliers

4.5.2.1 Homicides per 100,000 people variable:

This variable contains some clear outliers, mostly countries of South America and Africa. From 2014 to 2016, El Salvador and Honduras had the highest numbers of homicides per 100,000 people, which is not very surprising given that these countries are known to have broken murder records in those years. Other outliers include Latin American countries such as Colombia and Brazil, African countries like South Africa, and Caribbean countries like Jamaica.

Code
library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Homicides per 100,000 people` <- as.numeric(all_data$`Homicides per 100,000 people`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
homicides_plotly <- plot_ly(data = all_data, y = ~`Homicides per 100,000 people`, type = 'box',
                            text = ~paste("Country:", Country, "<br>Year:", Year),
                            hoverinfo = "text", # Hover text information
                            boxpoints = "outliers", # Show only outliers as points
                            marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                          line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                            line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
homicides_plotly <- homicides_plotly %>% layout(
  title = "Interactive Boxplot for Homicides per 100,000 People",
  yaxis = list(title = "Homicides per 100,000 People"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
homicides_plotly

4.5.2.2 GDP per capita variable:

Here we see that the outliers, countries with especially high GDP per capita, correspond to particularly rich countries such as Luxembourg, Switzerland, Norway or Qatar, which is quite logical.

Code
# Load necessary libraries
library(tidyverse)
library(plotly)

all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`GDP per capita (USD)` <- as.numeric(all_data$`GDP per capita (USD)`)

# Remove any NA values that may interfere with plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
gdp_plotly <- plot_ly(data = all_data, y = ~`GDP per capita (USD)`, type = 'box',
                      text = ~paste("Country:", Country, "<br>Year:", Year),
                      hoverinfo = "text",
                      boxpoints = "outliers", # Show only outliers as points
                      marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)', 
                                    line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                      line = list(color = 'rgb(8,81,156)'))

# Layout adjustments
gdp_plotly <- gdp_plotly %>% layout(
  title = "Interactive Boxplot for GDP per Capita (USD)",
  yaxis = list(title = "GDP per capita (USD)", zeroline = FALSE),
  xaxis = list(showticklabels = FALSE) # Hide x-axis tick labels as there's only one boxplot
)

# Print the plot
gdp_plotly

4.5.2.3 Political Rights variable:

In this boxplot, we can see that Saudi Arabia had the worst score in terms of political rights in 2018. This is hardly surprising, given the country's controversial record on human rights and egalitarianism.

The two other outliers displayed, UAE and Bahrain, are in the same region.

Code
library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Political Rights` <- as.numeric(all_data$`Political Rights`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
pr_plotly <- plot_ly(data = all_data, y = ~`Political Rights`, type = 'box',
                     text = ~paste("Country:", Country, "<br>Year:", Year),
                     hoverinfo = "text", # Hover text information
                     boxpoints = "outliers", # Show only outliers as points
                     marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                   line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                     line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
pr_plotly <- pr_plotly %>% layout(
  title = "Interactive Boxplot for Political Rights",
  yaxis = list(title = "Political Rights"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
pr_plotly

4.5.2.4 Civil Liberties, Control of Corruption, Government Effectiveness and Political Stability variables:

We see no particular outliers for these four variables.

Code
library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Civil Liberties` <- as.numeric(all_data$`Civil Liberties`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
cl_plotly <- plot_ly(data = all_data, y = ~`Civil Liberties`, type = 'box',
                     text = ~paste("Country:", Country, "<br>Year:", Year),
                     hoverinfo = "text", # Hover text information
                     boxpoints = "outliers", # Show only outliers as points
                     marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                   line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                     line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
cl_plotly <- cl_plotly %>% layout(
  title = "Interactive Boxplot for Civil Liberties",
  yaxis = list(title = "Civil Liberties"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
cl_plotly
Code
library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Control of Corruption: Estimate` <- as.numeric(all_data$`Control of Corruption: Estimate`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
corruption_plotly <- plot_ly(data = all_data, y = ~`Control of Corruption: Estimate`, type = 'box',
                             text = ~paste("Country:", Country, "<br>Year:", Year),
                             hoverinfo = "text", # Hover text information
                             boxpoints = "outliers", # Show only outliers as points
                             marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                           line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                             line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
corruption_plotly <- corruption_plotly %>% layout(
  title = "Interactive Boxplot for Control of Corruption",
  yaxis = list(title = "Control of Corruption: Estimate"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
corruption_plotly
Code
#Gov. Effectiveness 

library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Government Effectiveness: Estimate` <- as.numeric(all_data$`Government Effectiveness: Estimate`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
gov_eff_plotly <- plot_ly(data = all_data, y = ~`Government Effectiveness: Estimate`, type = 'box',
                          text = ~paste("Country:", Country, "<br>Year:", Year),
                          hoverinfo = "text", # Hover text information
                          boxpoints = "outliers", # Show only outliers as points
                          marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                        line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                          line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
gov_eff_plotly <- gov_eff_plotly %>% layout(
  title = "Interactive Boxplot for Government Effectiveness",
  yaxis = list(title = "Government Effectiveness: Estimate"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
gov_eff_plotly
Code
#Pol. Stability

library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Political Stability and Absence of Violence/Terrorism: Estimate` <- as.numeric(all_data$`Political Stability and Absence of Violence/Terrorism: Estimate`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
pol_stab_plotly <- plot_ly(data = all_data, y = ~`Political Stability and Absence of Violence/Terrorism: Estimate`, type = 'box',
                           text = ~paste("Country:", Country, "<br>Year:", Year),
                           hoverinfo = "text", # Hover text information
                           boxpoints = "outliers", # Show only outliers as points
                           marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                         line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                           line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
pol_stab_plotly <- pol_stab_plotly %>% layout(
  title = "Interactive Boxplot for Political Stability",
  yaxis = list(title = "Political Stability: Estimate"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
pol_stab_plotly

4.5.2.5 Unemployment Rate variable:

There are quite a few outliers visible in this boxplot. We see Greece over several consecutive years, which is not surprising given the difficulties this country has had in terms of employment in the decade following the 2008 crisis. The same applies to Spain.

In addition, African countries such as South Africa and Botswana have had years with particularly high unemployment rates, as has Jordan in the Middle East.

Code
library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`Unemployment Rate` <- as.numeric(all_data$`Unemployment Rate`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
unemp_plotly <- plot_ly(data = all_data, y = ~`Unemployment Rate`, type = 'box',
                        text = ~paste("Country:", Country, "<br>Year:", Year),
                        hoverinfo = "text", # Hover text information
                        boxpoints = "outliers", # Show only outliers as points
                        marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                      line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                        line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
unemp_plotly <- unemp_plotly %>% layout(
  title = "Interactive Boxplot for Unemployment Rate",
  yaxis = list(title = "Unemployment Rate"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
unemp_plotly

4.5.2.6 Human Development Index variable:

Finally, we see here that Burundi stands out from the rest with a very low HDI over several consecutive years. Indeed, according to the United Nations Development Programme, the country is still categorized as low human development today.

Code
library(plotly)

# Convert necessary columns to correct data types, especially if they are not correctly formatted
all_data$Country <- as.character(all_data$Country)
all_data$Year <- as.numeric(all_data$Year)
all_data$`HDI Value` <- as.numeric(all_data$`HDI Value`)

# Remove any NA values to avoid issues during plotting
all_data <- na.omit(all_data)

# Create an interactive boxplot with Plotly
hdi_plotly <- plot_ly(data = all_data, y = ~`HDI Value`, type = 'box',
                      text = ~paste("Country:", Country, "<br>Year:", Year),
                      hoverinfo = "text", # Hover text information
                      boxpoints = "outliers", # Show only outliers as points
                      marker = list(color = 'rgb(8,81,156)', outliercolor = 'rgba(219, 64, 82, 0.6)',
                                    line = list(outliercolor = 'rgba(219, 64, 82, 1.0)', outlierwidth = 2)),
                      line = list(color = 'rgb(8,81,156)'))

# Adjust the layout of the plot
hdi_plotly <- hdi_plotly %>% layout(
  title = "Interactive Boxplot for Human Development Index",
  yaxis = list(title = "HDI Value"),
  xaxis = list(title = "Boxplot", showticklabels = FALSE)  # No tick labels necessary on x-axis
)

# Print the interactive plot
hdi_plotly

4.5.3 Categorical data analysis:

The data set we use for this project does not contain any categorical variables to analyse: apart from the Country identifier, every remaining column is numeric. There is therefore nothing to explore for this type of data.
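A quick check of this claim:

Code
# Column classes of the combined data set: apart from the Country identifier,
# all remaining columns are numeric (the Region column was removed during cleanup)
sapply(all_data, class)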

4.6 Uni-variate visualizations

4.6.0.1 Average homicides per 100,000 people per year

This graph shows the average number of homicides per 100,000 people across the world since 1990. Note that the scale on the y-axis starts at 5 per 100,000. In 2020, we notice a surprising decrease; the explanation likely lies in the COVID-19 pandemic, as lockdowns kept people at home and prevented many crimes.

Code
# Calculate the average value of homicide per year from the main_dataset
average_homicide_per_year <- main_dataset %>%
  group_by(Year) %>%
  summarise(`Average Homicide` = mean(`Homicides per 100,000 people`, na.rm = TRUE)) 


library(plotly)
library(ggplot2)

p <- ggplot(average_homicide_per_year, aes(x = Year, y = `Average Homicide`)) +
  geom_col(fill = "aliceblue", color = "black") + 
  scale_x_discrete(name = "Year") +
  labs(title = "Average Homicide per 100,000 People per Year",
       x = "Year",
       y = "Average Homicide Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_cartesian(ylim = c(5, NA))  # Adjust y-axis limits without removing data
 ggplotly(p)

The following plot shows the average value of homicide per 100,000 people among the countries observed in our data set from 2013 to 2022.

However, we have to be careful and remove the countries that do not contain observations for each year so that it does not bias the annual average.

Code
# Find all unique years in the dataset
unique_years <- unique(all_data$Year)

# Find countries that have observations for every year
countries_with_observations <- all_data %>%
  group_by(Country) %>%
  summarize(all_years_present = all(unique_years %in% Year)) %>%
  filter(all_years_present)

# List of countries with complete observations
countries_with_full_obs <- countries_with_observations$Country
a <- nrow(countries_with_observations)

#create a subset with countries_with_observations
subset_full_data <- all_data %>%
  filter(Country %in% countries_with_full_obs)

This is the data frame we are going to use for our further explorations.

Let's now fix our previous graph:

Code
 # Calculate the average value of homicide per year
average_homicide_per_year <- subset_full_data %>%
  group_by(Year) %>%
  summarise(`Average Homicide` = mean(`Homicides per 100,000 people`, na.rm = TRUE))

# Plot the data
p <- ggplot(average_homicide_per_year, aes(x = Year, y = `Average Homicide`)) +
  geom_col(fill = "aliceblue", color = "black") + 
  scale_x_continuous(name = "Year", breaks = seq(min(average_homicide_per_year$Year), max(average_homicide_per_year$Year), by = 2)) +
  labs(title = "Average Homicide per 100,000 People per Year",
       x = "Year",
       y = "Average Homicide Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_cartesian(ylim = c(7, NA))  # Adjust y-axis limits without removing data

# Convert to an interactive plotly plot
ggplotly(p)

Let’s now plot this as a line chart for visualization purposes:

Code
# Convert data types as necessary
subset_full_data$Year <- as.integer(subset_full_data$Year)
subset_full_data$Country <- as.character(subset_full_data$Country)
subset_full_data$`Homicides per 100,000 people` <- as.numeric(subset_full_data$`Homicides per 100,000 people`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Calculate the average Homicides per 100,000 people per year
average_homicides_per_year <- subset_full_data %>%
  group_by(Year) %>%
  summarise(Average_Homicides = mean(`Homicides per 100,000 people`, na.rm = TRUE))

# Plot the average Homicides per 100,000 people over the years
homicides_plot <- ggplot(average_homicides_per_year, aes(x = Year, y = Average_Homicides)) +
  geom_line(color = "black") +  # Connect points by year with a line
  geom_point(color = "black") +  # Add points to the line
  scale_x_continuous(breaks = average_homicides_per_year$Year) +  # Ensure every year is marked on the x-axis
  labs(title = "Average Homicides Per 100,000 People Over Years",
       x = "Year",
       y = "Average Homicides Per 100,000 People") +
  theme_minimal()  # Using a minimal theme for a cleaner look

# Print the plot
print(homicides_plot)

We can divide the trends in the average homicide rate per 100,000 people into four phases. Let's name them and provide some potential insights.

Stable Homicide Rates (2013-2014): Stability in economic conditions, law enforcement practices, and social policies may have contributed to maintaining a steady homicide rate during these years. Absence of significant social, political, or economic disruptions could also explain the stable rates.

Increase in 2015: Economic challenges or downturns could have led to increased crime rates, including homicides. Moreover, increased political instability or changes in governance could have contributed to higher violence levels during this year. Notably, a surge in drug trafficking and related violence might explain the spike in homicide rates.

Gradual Decline (2016-2019): This decline could be due to improved law enforcement; enhanced policing strategies, better crime prevention programs, and community policing efforts could have contributed to the decline in homicide rates. Improvements in economic conditions and job creation efforts might also have played a role in reducing crime rates.

Moreover, increased focus on social programs, education, and employment opportunities could have led to a reduction in crime rates.

Sharp Decline and Subsequent Increase (2020-2022): This phase of the plot is highly related to the impact of the COVID-19 pandemic. The pandemic led to significant social and economic disruptions, including increased unemployment, social isolation, and stress; lockdowns may have initially reduced homicides, which then rose again as restrictions eased. The pandemic also strained law enforcement resources and capabilities, potentially changing crime patterns. In addition to economic hardship, increased social unrest and protests during this period could also have contributed to higher homicide rates.

4.6.0.2 Average GDP per capita over years

In this graph, we can see a downward trend in average GDP per capita during the year 2020, which corresponds to the COVID-19 period and the shutdown of industries worldwide following the imposed lockdowns.

However, we also see very low average GDP per capita in 2015 and 2016, which is harder to interpret. One potential explanation is the structure of our data set: depending on the countries included, values in these two years may be relatively low. All in all, we found no substantive explanation beyond the composition of the data.

It is, however, interesting to see how average economic growth picks up in the post-COVID period, in 2021 and 2022.

Code
# Convert data types as necessary
subset_full_data$Year <- as.integer(subset_full_data$Year)
subset_full_data$Country <- as.character(subset_full_data$Country)
subset_full_data$`GDP per capita (USD)` <- as.numeric(subset_full_data$`GDP per capita (USD)`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Calculate the average GDP per capita per year
average_gdp_per_year <- subset_full_data %>%
  group_by(Year) %>%
  summarise(Average_GDP = mean(`GDP per capita (USD)`, na.rm = TRUE))

# Plot the average GDP per capita over the years
gdp_plot <- ggplot(average_gdp_per_year, aes(x = Year, y = Average_GDP)) +
  geom_line(color = "blue") +  # Connect points by year with a line
  geom_point(color = "red") +  # Add points to the line
  scale_x_continuous(breaks = average_gdp_per_year$Year) +  # Ensure every year is marked on the x-axis
  labs(title = "Average GDP Per Capita (USD) Over Years",
       x = "Year",
       y = "Average GDP Per Capita (USD)") +
  theme_minimal()  # Using a minimal theme for a cleaner look

# Print the plot
print(gdp_plot)

4.6.0.3 Average unemployment rate over years

Similar to the previous graph, we see a trend strongly marked by COVID-19, which caused an economic crisis, numerous bankruptcies, and a notable rise in the average unemployment rate.

However, we can see that average unemployment in 2013 and 2014 is comparable to that of the COVID-19 period. This is also probably due to the structure of our data set and the high unemployment rates in Europe and elsewhere (Greece, Spain, South Africa, etc.) in those years, as seen above.

Code
# Convert data types as necessary
subset_full_data$Year <- as.integer(subset_full_data$Year)
subset_full_data$Country <- as.character(subset_full_data$Country)
subset_full_data$`Unemployment Rate` <- as.numeric(subset_full_data$`Unemployment Rate`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Calculate the average Unemployment rate per year
average_unemployment_per_year <- subset_full_data %>%
  group_by(Year) %>%
  summarise(Average_Unemployment = mean(`Unemployment Rate`, na.rm = TRUE))

# Plot the average Unemployment rate over the years
unemployment_plot <- ggplot(average_unemployment_per_year, aes(x = Year, y = Average_Unemployment)) +
  geom_line(color = "orange") +  # Connect points by year with a line
  geom_point(color = "brown4") +  # Add points to the line
  scale_x_continuous(breaks = average_unemployment_per_year$Year) +  # Ensure every year is marked on the x-axis
  labs(title = "Average Unemployment Rate Over Years",
       x = "Year",
       y = "Average Unemployment Rate") +
  theme_minimal()  # Using a minimal theme for a cleaner look

# Print the plot
print(unemployment_plot)

4.6.0.4 Average Human Development Index over years

Finally, we see that for this variable, the index was increasing from year to year, only to fall again in 2020 as a result of the pandemic.

In fact, this decrease can be linked to several factors associated with the pandemic’s widespread impacts on health, education, and income, which are the three core dimensions measured by the HDI.

Indeed, the pandemic affected health by increasing mortality and restricting healthcare access and resources. Moreover, it provoked educational disruptions through school closures, with impacts on education quality. COVID-19 also had economic consequences, increasing poverty and widening inequalities. Restrictions and lockdowns clearly affected human rights and freedoms as well.

Code
# Convert data types as necessary
subset_full_data$Year <- as.integer(subset_full_data$Year)
subset_full_data$Country <- as.character(subset_full_data$Country)
subset_full_data$`HDI Value` <- as.numeric(subset_full_data$`HDI Value`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Calculate the average HDI Value per year
average_hdi_per_year <- subset_full_data %>%
  group_by(Year) %>%
  summarise(Average_HDI = mean(`HDI Value`, na.rm = TRUE))

# Plot the average HDI Value over the years
hdi_plot <- ggplot(average_hdi_per_year, aes(x = Year, y = Average_HDI)) +
  geom_line(color = "pink") +  # Connect points by year with a line
  geom_point(color = "darkorchid1") +  # Add points to the line
  scale_x_continuous(breaks = average_hdi_per_year$Year) +  # Ensure every year is marked on the x-axis
  labs(title = "Average HDI Value Over Years",
       x = "Year",
       y = "Average HDI Value") +
  theme_minimal()  # Using a minimal theme for a cleaner look

# Print the plot
print(hdi_plot)

4.7 Multivariate visualizations

4.7.0.1 Relationship between GDP per capita and unemployment

We can see from this scatter plot that, in general, the lower a country's unemployment rate, the more likely it is to have a high GDP per capita. This relationship is quite logical, since unemployment is negatively correlated with a country's economic health: a country with a low unemployment rate is more likely to be in good economic health and thus to have a high GDP per capita.

Code
library(ggplot2)
# Ensure data types are correct (optional, depends on your data reading settings)
subset_full_data$`GDP per capita (USD)` <- as.numeric(subset_full_data$`GDP per capita (USD)`)
subset_full_data$`Unemployment Rate` <- as.numeric(subset_full_data$`Unemployment Rate`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Create the scatter plot
ggplot(subset_full_data, aes(x = `GDP per capita (USD)`, y = `Unemployment Rate`)) +
  geom_point(aes(color = `Unemployment Rate`), alpha = 0.6, size = 2, shape = 19) +  # Color points by Unemployment Rate
  scale_color_gradient(low = "blue", high = "red") +  # Use a color gradient from blue to red
  labs(
    title = "GDP per capita vs unemployment",
    x = "GDP per capita (USD)",
    y = "Unemployment Rate",
    caption = " "
  ) +
  theme_minimal(base_size = 14) +  # Use a minimal theme with readable base font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Bold and center the title
    plot.caption = element_text(face = "italic"),  # Italicize the caption
    legend.position = "right"  # Adjust the legend position
  ) +
  guides(color = guide_colorbar(title = "Unemployment Rate"))

4.7.0.2 Relationship between homicides per 100,000 people and unemployment

This graph may not seem very intuitive; logic would dictate that the higher a country’s unemployment rate (and therefore its poverty), the higher its number of homicides per 100,000 inhabitants.

However, here, the countries with the most unemployment are concentrated towards the lowest homicide rates per 100,000 inhabitants. Therefore, as we saw previously with the correlation matrix, we cannot say that murders and unemployment are correlated.

Code
# Load necessary library
library(ggplot2)

subset_full_data$`Homicides per 100,000 people` <- as.numeric(subset_full_data$`Homicides per 100,000 people`)
subset_full_data$`Unemployment Rate` <- as.numeric(subset_full_data$`Unemployment Rate`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Create the scatter plot
ggplot(subset_full_data, aes(x = `Homicides per 100,000 people`, y = `Unemployment Rate`)) +
  geom_point(aes(color = `Unemployment Rate`), alpha = 0.6, size = 2, shape = 19) +  # Color points by Unemployment Rate
  scale_color_gradient(low = "lightblue", high = "red") +  # Use a color gradient from light blue to red
  labs(
    title = "Homicides per 100,000 People vs unemployment",
    x = "Homicides per 100,000 People",
    y = "Unemployment Rate",
    caption = " "
  ) +
  theme_minimal(base_size = 14) +  # Use a minimal theme with readable base font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Bold and center the title
    plot.caption = element_text(face = "italic"),  # Italicize the caption
    legend.position = "right"  # Adjust the legend position
  ) +
  guides(color = guide_colorbar(title = "Unemployment Rate"))

This graph gives us an additional clue that the unemployment rate variable is likely not relevant for our further analysis.

4.7.0.3 Relationship between homicides per 100,000 people and GDP per capita

On this scatter plot, we see that countries with high GDP per capita are less likely to have a high homicide rate per 100,000 inhabitants.

Indeed, economic prosperity affects various aspects of a society that are linked to crime rates, such as employment opportunities, access to education, community programs, resource allocation for public safety, judicial system efficiency, infrastructure, and more.

Code
# Ensure data types are correct (optional, depends on your data reading settings)
subset_full_data$`Homicides per 100,000 people` <- as.numeric(subset_full_data$`Homicides per 100,000 people`)
subset_full_data$`GDP per capita (USD)` <- as.numeric(subset_full_data$`GDP per capita (USD)`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Create the scatter plot
ggplot(subset_full_data, aes(x = `Homicides per 100,000 people`, y = `GDP per capita (USD)`)) +
  geom_point(aes(color = `GDP per capita (USD)`), alpha = 0.6, size = 2, shape = 19) +  # Color points by GDP per capita (USD)
  scale_color_gradient(low = "lightblue", high = "orange") +  # Use a color gradient from light blue to orange
  labs(
    title = "Homicides per 100,000 People vs GDP per Capita (USD)",
    x = "Homicides per 100,000 People",
    y = "GDP per Capita (USD)",
    caption = " "
  ) +
  theme_minimal(base_size = 14) +  # Use a minimal theme with readable base font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Bold and center the title
    plot.caption = element_text(face = "italic"),  # Italicize the caption
    legend.position = "right"  # Adjust the legend position
  ) +
  guides(color = guide_colorbar(title = "GDP per Capita (USD)"))

4.7.0.4 Relationship between homicides per 100,000 people and HDI value

Finally, on this last graph, we see that countries with high HDI value are less likely to have a high homicide rate per 100,000 inhabitants.

Indeed, it is quite logical that countries with high Human Development Index values typically experience lower homicide rates because their higher standards of living, education, and economic stability contribute to social cohesion and reduce motivations for crime. Additionally, effective governance and robust legal systems in these countries enhance public safety and ensure the rule of law is maintained, further deterring violent crimes.

Code
# Ensure data types are correct (optional, depends on your data reading settings)
subset_full_data$`Homicides per 100,000 people` <- as.numeric(subset_full_data$`Homicides per 100,000 people`)
subset_full_data$`HDI Value` <- as.numeric(subset_full_data$`HDI Value`)

# Remove any NA values to ensure clean plotting
subset_full_data <- na.omit(subset_full_data)

# Create the scatter plot
ggplot(subset_full_data, aes(x = `Homicides per 100,000 people`, y = `HDI Value`)) +
  geom_point(aes(color = `HDI Value`), alpha = 0.6, size = 2, shape = 19) +  # Color points by HDI Value
  scale_color_gradient(low = "purple", high = "orange") +  # Use a color gradient from purple to orange
  labs(
    title = "Homicides per 100,000 People vs HDI Value",
    x = "Homicides per 100,000 People",
    y = "HDI Value",
    caption = " "
  ) +
  theme_minimal(base_size = 14) +  # Use a minimal theme with readable base font size
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Bold and center the title
    plot.caption = element_text(face = "italic"),  # Italicize the caption
    legend.position = "right"  # Adjust the legend position
  ) +
  guides(color = guide_colorbar(title = "HDI Value"))

5 Supervised Learning

Code
#load packages
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(caret)

5.1 Data Splitting

We start by creating a subset of “all_data” by removing the column Country. We transform years into incremental integer codes for the analysis: 1 for 2013, 2 for 2014, 3 for 2015, and so on.

Code
subset_for_modelling <-all_data %>% select(-Country)


subset_for_modelling$YearFactor <- as.integer(factor(subset_for_modelling$Year, levels = sort(unique(subset_for_modelling$Year))))

subset_for_modelling <- subset_for_modelling%>%select(-Year)

Then, we split the data into a training and a test set (75/25).

Code
set.seed(234)
index <- sample(x=c(1,2), size=nrow(subset_for_modelling), replace=TRUE, prob=c(0.75,0.25)) # 1==training set, 2==test set
train_set <- subset_for_modelling[index==1,]
test_set <- subset_for_modelling[index==2,]

5.2 Linear Regression

We start by fitting a very simple model, a linear regression. The objective is to establish a baseline showing the progress we can obtain with more advanced models.

Code
model_lm <- lm(`Homicides per 100,000 people`~., data = train_set)
Code
summary(model_lm)

In fact, this very simple model explains only about 25.84% of the variability in homicide rates. This modest explanatory power indicates that additional variables or more complex models might be needed to better understand and predict homicide rates.

5.2.1 Variable Selection

We use a step-wise variable selection method based on the AIC, by backward elimination: we start with all candidate variables, remove at each step the variable whose removal decreases the AIC the most, and continue as long as the AIC keeps decreasing.
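A minimal sketch of this selection, assuming the full model model_lm fitted above (we suppress the step-by-step trace for brevity):

Code
# Backward elimination by AIC: at each step, drop the variable whose
# removal decreases the AIC the most; stop when no removal helps.
step_lm <- step(model_lm, direction = "backward", trace = FALSE)
formula(step_lm)  # inspect the retained predictors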

The variables that have been retained in our final model to predict Homicides per 100,000 people are Political Rights, Civil Liberties, Control of Corruption: Estimate and HDI Value. Now, we can fit the final model and evaluate its performance with the test set.

Code
library(broom)
final_step_lm <- lm(`Homicides per 100,000 people`~`Political Rights` + 
    `Civil Liberties` + `Control of Corruption: Estimate` + `HDI Value`, 
    data = subset_for_modelling)
final_summary <- summary(final_step_lm)

#clean table for results
tidy_summ <- tidy(final_summary)
tidy_summ %>%
  kable() %>%
  kable_styling()
term estimate std.error statistic p.value
(Intercept) 18.3676182 4.0645811 4.518945 0.0000073
`Political Rights` 0.9269874 0.1523652 6.083982 0.0000000
`Civil Liberties` -0.4803842 0.1258255 -3.817860 0.0001471
`Control of Corruption: Estimate` -4.1054731 0.6571822 -6.247085 0.0000000
`HDI Value` -19.8154469 5.6419960 -3.512134 0.0004744

We predict the homicide rate in the test set. We already know by looking at the adjusted R-squared (0.2529) in the summary that the prediction quality is not good.

Code
model_lm_pred <- predict(final_step_lm, newdata=test_set)
plot(test_set$`Homicides per 100,000 people` ~ model_lm_pred, xlab="Prediction", ylab="Observed Homicides")
abline(0,1, col = "red")

For predictions close to zero and slightly above, the model appears to be quite accurate as most observed points are clustered near the line.

However, the model’s predictions are less reliable and less consistent in the higher range. Above 5, the observed data points show a significant dispersion around the trend line. Indeed, for higher actual values, the predicted values are generally lower than the actual values, indicating a tendency to underestimate. There are also some instances where the predicted values are negative, which is not feasible for homicide rates, pointing to model limitations or issues with data pre-processing.

Linear regression assumes a linear relationship between the predictors and the response variable. The observed scatter and deviation from the red line suggest that the actual relationship may not be strictly linear, and the model fails to capture the complexity of the data.

5.3 Random Forest

To illustrate the approach, we first build a single regression tree with our training set.

Code
set.seed(123)
data_reg <- rpart(`Homicides per 100,000 people` ~ ., data=train_set)
rpart.plot(data_reg)

We optimize the predictive power of the random forest algorithm on the data set by pre-processing the data, tuning the model parameters (here we use a tuning grid for the “mtry” parameter), and evaluating the model’s performance with cross-validation.

The cross-validation method specifies 10-fold cross-validation, which means the training set is split into 10 smaller sets, the model is trained on 9 of these, and validated on the 10th, rotating until each subset has been used for validation.

Code
library(janitor)
train_rf <-clean_names(train_set)
test_rf <- clean_names(test_set)

# Define a tuning grid for mtry 
tuneGrid <- expand.grid(mtry = c(1, 3, 5, 7, 9))

# Create a trainControl object for cross-validation
trainControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

set.seed(234)

# Train the random forest model with hyperparameter tuning
rf_tuned <- train(homicides_per_100_000_people ~ ., data = train_rf, method = "rf", trControl = trainControl, tuneGrid = tuneGrid, ntree = 1000, importance = TRUE)

# Check the best tuning parameters
rf_tuned$bestTune %>%
  kable()
mtry
5

Plotting the model

We compare the predicted against the actual values to see how well the model performed after tuning:

Code
predictions <- predict(rf_tuned, newdata = test_rf)

# Create a plot comparing predicted to actual values
ggplot(data.frame(Actual = test_rf$homicides_per_100_000_people, Predicted = predictions), aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Predicted vs. Actual Values", x = "Actual Values", y = "Predicted Values") +
  theme_minimal()

The majority of points follow a roughly linear trend along the diagonal, which indicates that the model generally captures the relationship between predictors and target values. Some points deviate significantly from the line, particularly at higher values, where predictions appear to underestimate the actual values. This could again indicate a tendency of the model to be less accurate at predicting higher values.

A noticeable cluster of points near the lower values on both axes might indicate that the data set contains many lower-value samples, and the model generally predicts these accurately.

5.4 Gradient Boosting Machines

We fit a GBM model on the training set of our data to predict Homicides per 100,000 people, while also employing cross-validation to optimize the model’s hyper-parameters. This process helps ensure that the model is both robust and optimized for the predictive task by systematically testing different configurations and selecting the best one based on empirical performance.

Code
# Define the tuning grid for the gbm model
tune_grid <- expand.grid(
  n.trees = c(500, 1000, 1500),
  interaction.depth = c(3, 4, 5),
  shrinkage = c(0.01, 0.05, 0.1),
  n.minobsinnode = c(5, 10)
)

# Set up cross-validation controls
trControl <- trainControl(method = "cv", number = 10)

# Train the gbm model using caret
set.seed(123)
gbm_tuned <- train(`Homicides per 100,000 people` ~ ., data = train_set, method = "gbm", trControl = trControl, tuneGrid = tune_grid, verbose = FALSE)

# View the best parameters based on cross-validation
gbm_tuned$bestTune %>%
  kable()
n.trees interaction.depth shrinkage n.minobsinnode
1500 5 0.05 5

Plotting the model

Here, we use the optimized model to make predictions on the test data set.

Code
model_pred_gbm <- predict(gbm_tuned, newdata = test_set)
plot(test_set$`Homicides per 100,000 people`, model_pred_gbm, xlab = "Actual Values", ylab = "Predicted Values", main = "Predicted vs. Actual")
abline(0, 1, col = "red")  # Line for perfect prediction

While the model predictions generally follow the trend of actual values, there are significant deviations, particularly at higher values. This may indicate that the model is less accurate when predicting higher values or that those predictions have higher variance.

5.5 Metrics

5.5.1 R-squared

Essentially, R-squared quantifies how well the variations in the independent variables explain the variations in the dependent variable, where 0 indicates that the model explains none of the variability of the response data around its mean, and 1 indicates that the model explains all the variability of the response data around its mean.
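For reference, a minimal sketch of the classical definition, assuming hypothetical vectors obs (observed values) and pred (model predictions); note that caret's R2() computes, by default, the squared correlation between predictions and observations, which can differ slightly from this formula:

Code
# Classical coefficient of determination (a sketch; `obs` and `pred`
# are hypothetical placeholder vectors)
rss <- sum((obs - pred)^2)        # residual sum of squares
tss <- sum((obs - mean(obs))^2)   # total sum of squares
r2_manual <- 1 - rss / tss        # proportion of variance explained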

Code
r2_lm <- R2(predict(final_step_lm, newdata = test_set), test_set$`Homicides per 100,000 people`) #lin. reg
r2_gbm <- R2(predict(gbm_tuned, newdata = test_set), test_set$`Homicides per 100,000 people`) #gbm
r2_rf <- R2(predict(rf_tuned, newdata = test_rf ), test_rf$homicides_per_100_000_people) #random forest

print(paste("R2 for Linear Regression model:", r2_lm))
[1] "R2 for Linear Regression model: 0.291143184826871"
Code
print(paste("R2 for Random Forest model:", r2_rf))
[1] "R2 for Random Forest model: 0.914374167648047"
Code
print(paste("R2 for GBM model:", r2_gbm))
[1] "R2 for GBM model: 0.844370018322847"

Random Forest is here the best model.

5.5.2 RMSE

RMSE, or Root Mean Squared Error, is a widely used measure of the difference between values predicted by a model or an estimator and the values actually observed from the environment that is being modeled or estimated. It serves as a way to quantify the magnitude of prediction errors from a regression model. A lower RMSE is usually better because it indicates smaller differences between predicted and observed values, suggesting better model performance.
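Under the same assumptions as the sketch above (hypothetical obs and pred vectors), RMSE reduces to one line:

Code
rmse_manual <- sqrt(mean((obs - pred)^2))  # root of the average squared error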

Code
rmse_lm <- RMSE(predict(final_step_lm, newdata = test_set), test_set$`Homicides per 100,000 people`) #lin. model
rmse_gbm <- RMSE(predict(gbm_tuned, newdata = test_set), test_set$`Homicides per 100,000 people`) #gbm
rmse_rf <- RMSE(predict(rf_tuned, newdata = test_rf ), test_rf$homicides_per_100_000_people) #random forest

print(paste("RMSE for Linear Regression model:", rmse_lm))
[1] "RMSE for Linear Regression model: 10.9010377821501"
Code
print(paste("RMSE for Random Forest model:", rmse_rf))
[1] "RMSE for Random Forest model: 4.30147859218741"
Code
print(paste("RMSE for GBM model:", rmse_gbm))
[1] "RMSE for GBM model: 5.12144832352382"

Random Forest is here the best model.

5.5.3 MAE

MAE, or Mean Absolute Error, is a statistical measure used to quantify the accuracy of a forecast model. It calculates the average magnitude of the errors in a set of predictions, without considering their direction (i.e., whether they are underestimates or overestimates). RMSE tends to punish larger errors more than MAE does, due to the squaring of each term, which effectively weights larger errors higher than smaller ones.
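Again under the same hypothetical obs/pred assumptions:

Code
mae_manual <- mean(abs(obs - pred))  # average magnitude of the errors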

Code
mae_lm <- MAE(predict(final_step_lm, newdata = test_set), test_set$`Homicides per 100,000 people`) #lin. reg
mae_gbm <- MAE(predict(gbm_tuned, newdata = test_set), test_set$`Homicides per 100,000 people`) #gbm
mae_rf <- MAE(predict(rf_tuned, newdata = test_rf ), test_rf$homicides_per_100_000_people) #random forest

print(paste("MAE for Linear Regression model:", mae_lm))
[1] "MAE for Linear Regression model: 7.04395955349281"
Code
print(paste("MAE for Random Forest model:", mae_rf))
[1] "MAE for Random Forest model: 2.32405447937063"
Code
print(paste("MAE for GBM model:", mae_gbm))
[1] "MAE for GBM model: 2.84315329008671"

5.6 Best Model

The model with the best metrics is the Random Forest model. We start by checking whether this model is overfitting.

Code
#comparing metrics on training and test sets


 # Load necessary libraries
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(caret)
library(dplyr)
library(janitor)
library(ggplot2)
library(knitr)

# Prepare data
subset_for_modelling <- all_data %>% select(-Country)
subset_for_modelling$YearFactor <- as.integer(factor(subset_for_modelling$Year, levels = sort(unique(subset_for_modelling$Year))))
subset_for_modelling <- subset_for_modelling %>% select(-Year)

set.seed(234)
index <- sample(x=c(1, 2), size=nrow(subset_for_modelling), replace=TRUE, prob=c(0.75, 0.25)) # 1==training set, 2==test set
train_set <- subset_for_modelling[index==1, ]
test_set <- subset_for_modelling[index==2, ]

# Clean names
train_rf <- clean_names(train_set)
test_rf <- clean_names(test_set)

# Define a tuning grid for mtry 
tuneGrid <- expand.grid(mtry = c(1, 3, 5, 7, 9))

# Create a trainControl object for cross-validation
trainControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

set.seed(234)
# Train the random forest model with hyperparameter tuning
rf_tuned <- train(homicides_per_100_000_people ~ ., data = train_rf, method = "rf", trControl = trainControl, tuneGrid = tuneGrid, ntree = 1000, importance = TRUE)

# Predictions on the training set
train_predictions <- predict(rf_tuned, newdata = train_rf)
# Predictions on the test set
test_predictions <- predict(rf_tuned, newdata = test_rf)

# Calculate metrics on the training set
r2_train <- R2(train_predictions, train_rf$homicides_per_100_000_people)
rmse_train <- RMSE(train_predictions, train_rf$homicides_per_100_000_people)
mae_train <- MAE(train_predictions, train_rf$homicides_per_100_000_people)

# Calculate metrics on the test set
r2_test <- R2(test_predictions, test_rf$homicides_per_100_000_people)
rmse_test <- RMSE(test_predictions, test_rf$homicides_per_100_000_people)
mae_test <- MAE(test_predictions, test_rf$homicides_per_100_000_people)

# Compare metrics
metrics_comparison <- data.frame(
  Metric = c("R2", "RMSE", "MAE"),
  Training = c(r2_train, rmse_train, mae_train),
  Testing = c(r2_test, rmse_test, mae_test)
)

# Display the comparison
kable(metrics_comparison) %>%
  kable_styling()
Metric Training Testing
R2 0.9688419 0.9143742
RMSE 2.7216448 4.3014786
MAE 1.1851703 2.3240545

The model appears to perform well on the training set, with high R2 and relatively low RMSE and MAE. However, there is a noticeable decline in performance on the test set; the higher error metrics on the test set (compared to the training set) suggest that the model might be overfitting to some extent. This overfitting means that the model has learned patterns specific to the training data that do not generalize well to unseen data.

However, despite the decline, the model still shows reasonably good performance on the test set, indicating that while there is overfitting, it is not extreme.

Then, we plot the feature importance of that model.

Code
# Assuming rf_tuned is the trained model object
importance <- varImp(rf_tuned)

# Plot feature importance
plot(importance, main = "Feature Importance in Random Forest")

The results show that the most important feature is the HDI Value, followed by GDP per Capita (USD). The Control of Corruption Estimate, Political Rights, and the Political Stability and Absence of Violence/Terrorism Estimate seem moderately important to the model. The remaining features contribute little according to this graph.

The variable importance of the Gradient Boosting model is also available in the appendix of the project.

6 Unsupervised Learning

As we are working on a time series, we decided to do the clustering by year. Obviously, we will not go through all 10 years of our data, as that would not be coherent with our model selection in the supervised learning part.

Therefore, in this section, we only analyze the year 2022.

6.1 Data Preparation

First we prepare our subset for the year 2022.

2022:

Code
library(dplyr)

#subset for the year 2022
data_2022 <- all_data %>% filter(Year == 2022)

# Convert countries to row names
row.names(data_2022) <- data_2022$Country

# Remove the Year and Country columns to obtain a fully numerical dataset
data_2022$Country <- NULL
data_2022$Year <- NULL

print(paste("We have observations for", nrow(data_2022), "countries for the year 2022."))
[1] "We have observations for 45 countries for the year 2022."

Standardize the data:

The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables such that they have mean zero and standard deviation one.

Code
# Standardize only numeric variables
data_2022_standardized <- data_2022 %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))  # scale() returns a 1-column matrix; unwrap it
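As a quick sanity check (a sketch, assuming the standardization above), every numeric column should now have a mean of approximately zero and a standard deviation of one:

Code
# Column means should be ~0 and standard deviations ~1 after scaling
round(sapply(data_2022_standardized, mean), 10)
sapply(data_2022_standardized, sd)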

6.2 Clustering

6.2.1 Clustering Distance Measures

The classification of observations into groups requires some methods for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a distance matrix.

Code
distance <- get_dist(data_2022_standardized) # here we use the Euclidean distance (the default)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

6.2.2 K-Mean Clustering

K-means clustering classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
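Formally, for a fixed number of clusters $k$, K-means seeks the partition $C_1, \dots, C_k$ that minimizes the total within-cluster sum of squares,

$$
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
$$

where $\mu_j$ denotes the centroid of cluster $C_j$.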

6.2.2.1 Choosing the number of clusters

Elbow method

Code
set.seed(123)

fviz_nbclust(data_2022_standardized, kmeans, method = "wss")

In this graph, the “elbow” appears to be at the point where the rate of decrease sharply changes, which seems to be between 3 and 4 clusters. After this point, the sum of squared errors (SSE) starts to decrease at a slower rate.
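For reference, the quantity behind this elbow plot can be computed by hand; a minimal sketch:

Code
# Total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(data_2022_standardized, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")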

Average Silhouette Method

In short, the average silhouette approach measures the quality of a clustering: it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. However, the average silhouette should not be interpreted in isolation.
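For each observation $i$, the silhouette width compares its average distance $a(i)$ to the other members of its own cluster with its average distance $b(i)$ to the members of the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} ,
$$

so values close to 1 indicate a well-clustered observation, while negative values suggest a possible misassignment.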

Code
fviz_nbclust(data_2022_standardized, kmeans, method = "silhouette")

Silhouette Plot

A more detailed analysis of the clustering can be obtained using the silhouette plot, which shows the silhouettes of all the instances.

Code
library(cluster)  # provides pam() and silhouette()

pam_object <- pam(data_2022_standardized, k = 4)

# Compute silhouette widths
sil_widths <- silhouette(pam_object)
# Plot the silhouette information
plot(sil_widths, col = 1:4, border = NA)

Cluster 3 is the most well-defined cluster with the highest average silhouette score, suggesting that its members are quite similar to each other and distinct from members of other clusters.

Gap Statistic Method

The approach can be applied to any clustering method (e.g., K-means clustering, hierarchical clustering). The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data (i.e., a distribution with no obvious clustering).
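Formally, writing $W_k$ for the total within-cluster variation with $k$ clusters, the gap statistic is

$$
\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k ,
$$

where the expectation is estimated from B reference data sets drawn from the null distribution; a common rule then picks the smallest $k$ such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$, with $s_{k+1}$ the standard error of the simulated gaps.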

Code
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(data_2022_standardized, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

From the visualization of the clusGap() result above, we can see that the optimal number of clusters is 4.

6.2.2.2 Computing K-means Clustering

Code
k2 <- kmeans(data_2022_standardized, centers = 2, nstart = 25)
k3 <- kmeans(data_2022_standardized, centers = 3, nstart = 25)
k4 <- kmeans(data_2022_standardized, centers = 4, nstart = 25)

Visualization of the results

If there are more than two dimensions, fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance (refer to our PCA in the EDA).

Let’s compare the three sets:

Code
p0 <- fviz_cluster(k2, geom = "point",data = data_2022_standardized)  +
  ggtitle("Cluster Plot 2022 (k=2)")
p1 <-fviz_cluster(k3, geom = "point",data = data_2022_standardized)  +
  ggtitle("Cluster Plot 2022 (k=3)")
p2 <-fviz_cluster(k4, geom = "point", data = data_2022_standardized)  +
  ggtitle("Cluster Plot 2022 (k=4)")
grid.arrange(p0,p1,p2, nrow = 2)

We can observe that the teal cluster from k=2 splits into more detailed sub-clusters as k increases, suggesting internal complexity within that group.

Also, in terms of variance explained, Dim1 explains 62.1% of the variance and Dim2 explains 13.4%, totaling 75.5%. This means the plots capture a significant portion of the data’s variance, providing a meaningful visual representation of the clustering.

6.2.2.3 Extracting results

Based on our previous conclusions drawn from the silhouette analysis above, we choose to focus on the three-cluster solution.

We can now extract the clusters and add them to our initial data in order to perform some descriptive statistics at the cluster level:

Code
# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(data_2022_standardized, 3, nstart = 25)

Visualize the results:

Code
fviz_cluster(final, data = data_2022_standardized)

By plotting the names of the countries within each cluster, we can now inspect the results for insights.

It is interesting to see that the first cluster in red (which contains countries such as Algeria, Colombia, Haiti, Kenya, Mexico, etc.) appears to include countries that might share similar socio-economic and governance challenges. Indeed, these countries might have higher rates of instability, corruption, and lower economic development.

Then, the second cluster in green groups countries that might have moderate to high levels of economic development and governance; the countries in this cluster (Spain, Dominican Republic, Greece, Costa Rica, Hungary, Israel, etc.) likely have a mix of developed and developing economies with varying degrees of political and economic stability.

The third and notable cluster in blue contains highly developed countries with strong economic performance and governance. These countries likely have high levels of human development, political stability, and low corruption. Indeed, the cluster contains countries such as Singapore, Denmark, Finland, the Netherlands, Luxembourg, etc., which represent some of the most developed and well-governed nations.

Variable Distribution Across Clusters:

These three clusters can also be analyzed by the feature values within each cluster.

Code
# Assign cluster membership to the original data
data_2022_standardized$Cluster <- as.factor(final$cluster)

library(ggplot2)
library(tidyr)

# Reshape data from wide to long format
data_long <- pivot_longer(data_2022_standardized, 
                          cols = -Cluster, 
                          names_to = "Variable", 
                          values_to = "Value")

# Create boxplots for each variable by cluster
ggplot(data_long, aes(x = Cluster, y = Value, fill = Cluster)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free", nrow = 5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 1, hjust = 1),
        strip.text = element_text(size = 10))  # Adjust text size for facet labels if necessary

Civil Liberties, Political Rights and Political Stability: Clusters 2 (green) and especially Cluster 3 (blue) generally show higher values indicating better conditions or higher rankings in these metrics compared to Cluster 1 (red).

Control of Corruption and Government Effectiveness: Cluster 3 (blue) typically demonstrates the highest values, suggesting it might consist of countries with better governance metrics.

HDI Value: This shows a significant range in Cluster 1 (red), with relatively lower median values, indicating perhaps a wider disparity in development levels within this cluster. Cluster 3 (blue) again shows favorable values, with the highest HDI.

Homicides per 100,000 people: Here, there is a noticeable difference, with Cluster 1 showing higher values (more homicides), which might suggest higher violence or security issues. Cluster 2 (green) and especially Cluster 3 (blue) generally show lower values, indicating better conditions, except some high homicide outliers in Cluster 2.

In the same way, the Unemployment Rate variable shows a larger disparity of values in Cluster 1 (red) and higher rates than in the two other clusters, indicating worse employment situations in the countries of this cluster.

GDP per capita: Cluster 3 (blue) shows notably higher values, implying that, as already mentioned, this cluster includes more economically prosperous and highly developed countries.

6.2.3 Hierarchical Clustering

One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters.

Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of clusters. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.

We use the average linkage method, trying to find a balance between the most and least similar pairs within each cluster. This method can sometimes prevent extreme values from overly influencing the groupings.

Code
hclust_avg <- hclust(distance, method = 'average') #specify the linkage method via the method argument
plot(hclust_avg, 
     main = "Hierarchical Clustering Dendrogram for 2022",
     xlab = "Clusters",
     ylab = "Height",
     sub = "Method: Average Linkage",
     cex = 0.6,  # Label size for the dendrogram
     cex.main = 1.2,  # Title size
     cex.lab = 1.0,  # Axis label size
     cex.sub = 0.8)  # Subtitle size
rect.hclust(hclust_avg , k = 3, border = 2:6)

Countries are now divided into distinct groups, based on various similarities.
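To retrieve these group memberships programmatically, the tree can be cut at k = 3 (a sketch; the labels come from the country row names set earlier):

Code
hc_clusters <- cutree(hclust_avg, k = 3)  # cluster label for each country
table(hc_clusters)                        # size of each hierarchical cluster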

The height at which the branches meet in the dendrogram reflects the distance or dissimilarity between the groups. For example, Haiti and Jordan from the first cluster (red), which form a group at a lower height, are more similar to each other than, say, the countries in the group containing Singapore and Israel, which join at a much higher height.

We see in this type of clustering that the most developed countries (Finland, Austria, Denmark, etc.), previously grouped together by k-means clustering, are now grouped with other, rather varied countries.

A mix of countries previously clustered differently has taken place, making comparison with k-means clustering difficult. This is quite normal, because hierarchical clustering might capture more nuanced and complex patterns of similarity between countries, reflecting not just the major sources of variance but also subtler similarities. On the other hand, K-means on PCA tends to group countries based on the most significant patterns of variation and might therefore align more closely with dominant economic, social, or political factors that stand out in the data reduction process.

We applied the same steps over several years of our data set but found no significant differences between years. Therefore, we decided not to include them in this section of the project. You can, however, check our analysis in the appendix.

7 Conclusion

7.1 Results

Our main results are listed here. The aim of this subsection is to present in a structured way the results obtained by the machine learning models applied in this project.

Unemployment Rates:

The analysis revealed that unemployment rates do not significantly influence homicide rates. In fact, the unemployment rate has a weak correlation with every socio-demographic variable. Thus, including unemployment factors in machine learning models is irrelevant for our purpose of predicting homicide rates.

Linear Model Performance:

A simple linear model was inadequate for accurately predicting the impacts of socio-economic and demographic variables on homicide rates. Even with the assistance of the Akaike Information Criterion (AIC) for model selection, the linear model’s predictions were insufficiently reliable, particularly for higher homicide rates, indicating the model’s inability to capture the complexity of the data.

Gradient Boosting Machines:

Implementing a gradient boosting machine (GBM) model, with hyper-parameters optimized through cross-validation, improved the prediction quality compared to the linear model. However, while GBM enhanced the accuracy, it was still outperformed by another model in our study.

Random Forest Model:

The random forest model, also optimized through cross-validation, exhibited the best predictive performance among all the models tested. It provided the most reliable metrics for the purpose of our project, making it the most effective model for predicting homicide rates.

Variable Importance:

The random forest model’s variable importance scores identified the Human Development Index (HDI) as the most impactful variable on homicide rates, followed by GDP per capita. The scores also confirmed the irrelevance of unemployment rates in predicting homicides. Civil liberties and government effectiveness rates also received low importance scores, indicating a lesser impact on homicide rates.

Model Overfitting:

Comparing the training and test sets revealed a noticeable decline in performance on the test set. The higher error metrics on the test set suggest that the random forest model might be overfitting the training data to some extent.

K-means Clustering:

Focusing on the year 2022, K-means clustering with three clusters produced meaningful groupings of countries based on their socio-economic and demographic similarities. Indeed, the three clusters grouped the countries in our data set by similarity: one group of countries deemed highly developed socially and economically; another mixing developed and developing economies with varying degrees of political and economic stability; and a last one of countries facing poverty and social difficulties, where homicide rates are therefore relatively high.

Hierarchical Clustering:

Hierarchical clustering grouped most developed countries (e.g., Finland, Austria, Denmark) together, but with a more varied mix of other countries compared to K-means clustering. This method captured more nuanced and complex patterns of similarity between countries, reflecting subtler relationships beyond the primary variance sources.

Clustering Over Multiple Years:

Applying the same clustering methods to other years yielded consistent insights, reinforcing the robustness of the findings across different time periods.

7.2 Recommendations and Discussion

The study successfully identified key socio-economic and demographic variables influencing homicide rates, with HDI and GDP per capita being the most significant predictors, while unemployment rates were found to be irrelevant. The random forest model emerged as the most effective predictive tool, and clustering analyses provided clear groupings of countries based on development levels and homicide rates.

Policymakers should prioritize improving human development and economic conditions, as these factors have the most substantial impact on reducing homicide rates. Resources should be allocated to enhance HDI and GDP per capita, while less emphasis can be placed on unemployment rates concerning homicide reduction strategies.

The main limitation of our study is the potential overfitting of the random forest model. Future work should explore more robust regularization techniques or alternative models to mitigate this issue. Additionally, while our data set is comprehensive, there may be other relevant variables not included in this analysis; the models may not fully capture all potential factors influencing homicide rates, such as cultural differences, law enforcement effectiveness, and historical context, which were not included. Future research should explore integrating additional variables, such as cultural factors or historical data, to provide a more comprehensive analysis. Another limitation is the reliance on publicly available data, which may have varying degrees of accuracy and completeness.

Applying advanced machine learning techniques like deep learning could enhance predictive performance. Continuous updating and validation of the models with new data will also ensure their relevance and accuracy over time. Experimenting with hybrid models that combine the strengths of different algorithms, and expanding the scope of the study to regional analyses within countries, could further improve prediction accuracy and uncover more granular insights. Moreover, incorporating qualitative data, such as expert opinions and case studies, could also enhance the predictive models.

In conclusion, by leveraging advanced machine learning techniques and comprehensive socio-economic data, projects of this kind can provide actionable insights to guide policy-making and international efforts to reduce homicide rates and promote social stability. The findings emphasize the importance of human development and economic growth as critical factors in ensuring public safety.

8 Appendix

8.1 Supervised ML

8.1.1 Variable Importance for GBM

Here, we visualize the most important predictors contributing to the GBM model.

Code
# Extract the relative influence table from the tuned GBM model
# (summary.gbm returns a data frame with columns `var` and `rel.inf`)
importance <- summary(gbm_tuned$finalModel, plotit = FALSE)

ggplot(importance, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_bar(stat = "identity", fill = "aliceblue", color ="black") +
  coord_flip() +
  labs(x = "Variables", y = "Relative Importance", title = "Feature Importance Plot") +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 14),  # Change the size of variable names
    axis.text.x = element_text(size = 16),  # Change the size of x-axis text (if needed)
    plot.title = element_text(size = 20)    # Adjust the plot title size
  )

According to this plot, the most important feature is the HDI value, followed by GDP per capita (USD) and the Control of Corruption Estimate. This model captures approximately the same most important features as the random forest model.

8.2 Unsupervised ML

8.2.1 Comparing Clusters Across Years

We compare our data over years 2013, 2015, 2017 and 2019 to 2022.

8.2.1.1 Comparing PCAs

Code
# Create PCA plots
plot_pca_2021 <- fviz_pca_var(pca_data_2021)
plot_pca_2020 <- fviz_pca_var(pca_data_2020)
plot_pca_2019 <- fviz_pca_var(pca_data_2019)
plot_pca_2017 <- fviz_pca_var(pca_data_2017)
plot_pca_2015 <- fviz_pca_var(pca_data_2015)
plot_pca_2013 <- fviz_pca_var(pca_data_2013)

# Use grid.arrange to combine the plots
grid.arrange(plot_pca_2021, plot_pca_2020,plot_pca_2019, plot_pca_2017, plot_pca_2015, plot_pca_2013, nrow =3)

8.2.1.2 Comparing K-Mean Clustering

Code
k3_2021 <-kmeans(data_2021_standardized, centers = 3, nstart = 25)
k3_2020 <-kmeans(data_2020_standardized, centers = 3, nstart = 25)
k3_2019 <-kmeans(data_2019_standardized, centers = 3, nstart = 25)
k3_2017 <-kmeans(data_2017_standardized, centers = 3, nstart = 25)
k3_2015 <-kmeans(data_2015_standardized, centers = 3, nstart = 25)
k3_2013 <-kmeans(data_2013_standardized, centers = 3, nstart = 25)
Code
p1 <-fviz_cluster(k3_2021, geom = "point",data = data_2021_standardized)  +
  ggtitle("Cluster Plot 2021 (k=3)")
p2 <-fviz_cluster(k3_2020, geom = "point",data = data_2020_standardized)  +
  ggtitle("Cluster Plot 2020 (k=3)")
p3 <-fviz_cluster(k3_2019, geom = "point",data = data_2019_standardized)  +
  ggtitle("Cluster Plot 2019 (k=3)")
p4 <-fviz_cluster(k3_2017, geom = "point",data = data_2017_standardized)  +
  ggtitle("Cluster Plot 2017 (k=3)")
p5 <-fviz_cluster(k3_2015, geom = "point",data = data_2015_standardized)  +
  ggtitle("Cluster Plot 2015 (k=3)")
p6 <-fviz_cluster(k3_2013, geom = "point",data = data_2013_standardized)  +
  ggtitle("Cluster Plot 2013 (k=3)")

grid.arrange(p1,p2,p3,p4, p5, p6, nrow = 3)

8.3 Source

EDA: 3.2.2.1.1 Homicides per 100,000 People Variable

EDA: 3.2.2.1.3 Political Rights Variable

Unsupervised Learning: K-Mean Clustering

Unsupervised Learning: Hierarchical Clustering