The following machine learning project focuses on…
1 Introduction
A sense of personal security is crucial to individual well-being. It includes the possibility of physical assault or other forms of crime. Crime can cause not only loss of life and property, but also physical pain, post-traumatic stress and anxiety. Apparently, the feeling of vulnerability caused by a criminal act is one of the most significant influences on an individual’s well-being. Homicides represent only the most extreme form of violence against the person and is therefore not representative of general security conditions. It is, however, a more reliable indicator of a country’s level of security because, unlike other crimes, murders are in principle systematically reported to the police. According to the latest OECD data available, the average homicide rate in the OECD area is 2.6 per 100,000 inhabitants.
Our findings describe a world of contrasts, where some regions enjoy stability, economic prosperity and high levels of political rights and civil liberties, while others face major challenges such as corruption, government inefficiency, violence and severe economic problems.
Understanding the factors that influence homicide rates is crucial for developing effective policies to improve public safety and social stability. This machine learning project focuses on predicting and analyzing the impact of various economic and socio-demographic variables on homicide rates per 100,000 inhabitants across multiple countries. By leveraging both supervised and unsupervised learning techniques, we aim to uncover the key drivers of violent crime and provide insights into which kind of factors influence a country’s homicide rate.
Homicide rates are a significant indicator of societal well-being, reflecting the effectiveness of governance, economic health, and social cohesion within a country. High homicide rates often signal underlying issues such as poverty, inequality, and political instability, while lower rates generally indicate stronger governance and better social conditions. Analyzing these rates can help identify critical areas for improvement and guide efforts to enhance public safety.
For this study, we integrated multiple data sets to create a comprehensive data set encompassing a range of variables. These include political rights, control of corruption, unemployment rates and the primary variable of interest: the homicide rate. The data spans multiple countries and years, forming a time series that captures temporal dynamics and trends.
The data sets were sourced from reputable public databases and international organizations, ensuring high-quality and reliable data. After initial data cleaning and wrangling to merge these data sets, we conducted an Exploratory Data Analysis (EDA) to understand the underlying patterns and relationships within the data. This preliminary analysis provided valuable insights and guided the subsequent application of machine learning models.
We employed a variety of machine learning models to achieve our objectives. For supervised learning, models such as linear regression, gradient boosting machines and random forests were used to predict homicide rates based on the socio-economic and demographic variables. These models helped quantify the influence of each factor on homicide rates. For unsupervised learning, clustering techniques like K-means and hierarchical clustering were applied to identify natural groupings of countries with similar profiles, providing a deeper understanding of the data’s structure.
The findings of this study are organized as follows: In Section 2, we describe the data, detailing its sources, structure, and the cleaning and merging process. Section 3 presents the Exploratory Data Analysis, highlighting key patterns and correlations. Section 4 and t are dedicated to the methodology, explaining the machine learning models and techniques used; in Section 4, we implement and discuss the results of the supervised learning models, identifying the most significant predictors of homicide rates and Section 5 covers the unsupervised learning analysis, exploring the clustering results and their implications. Finally, Section 6 concludes the study, summarizing the findings and suggesting potential policy recommendations and areas for future research.
By combining advanced machine learning techniques with comprehensive socio-economic data, this project aims to test and understand the functioning of several types of machine learning models and to provide actionable insights into the factors driving homicide rates. The results could help policymakers and international organizations design targeted interventions to enhance public safety and promote social stability globally.
2 Data
2.1 Data Source
The data set we’ll be working on for this project is a combination of several data sets containing variables that could influence the rate of homicide per country.
Therefore, we collected socio-economic and demographic data, data on security and law enforcement, data on governance and legislation, and health and social well-being indicators. Below are the sources of all our data sets.
The HDI is a summary composite measure of a country’s average achievements in three basic aspects of human development: health, knowledge and standard of living. It is a measure of a country’s average achievements in three dimensions of human development:
a long and healthy life, as measured by life expectancy at birth;
knowledge, as measured by mean years of schooling and expected years of schooling; and
a decent standard of living, as measured by GNI per capita in PPP terms in USD.
The HDI sets a minimum and a maximum for each dimension, called “goalposts”, then shows where each country stands in relation to these goalposts. This is expressed as a value between 0 and 1. The higher a country’s human development, the higher its HDI value.
The Unemployment Rate per Country
It contains information about the unemployment rate from the age of 15 per country.
The GDP per capita per Country
This variable contains the Gross Domestic Product (GDP) per capita for each country of the dataset, which is an economic metric that represents the average economic output (or income) per person. The values are in USD.
The Governance Indicators per Country
We chose this data set in order to be able to quantify the corruption per country thanks to governance indicators provided by the World Bank’s Worldwide.
Below are the indicators we will use in our analysis:
Control of Corruption: Estimate - This measure can provide insight into the level of corruption within a country. Lower corruption might be associated with more effective law enforcement and potentially lower crime rates.
Government Effectiveness: Estimate - This could reflect how well the government is able to implement its policies, including those related to crime prevention and justice.
Political Stability and Absence of Violence/Terrorism: Estimate - A lack of political stability and the presence of violence or terrorism could be related to higher homicide rates, as they often indicate broader societal issues that could lead to crime.
Freedom Rankings
This is a data set that tracks the quality of political rights and civil liberties for each country from 2013 to 2022. All data are official figures from the Freedom House that have been compiled and structured by a user of Kaggle.
Rating system explanation (source): A country or territory is awarded 0 to 4 points for each of 10 political rights indicators and 15 civil liberties indicators, which take the form of questions; a score of 0 represents the smallest degree of freedom and 4 the greatest degree of freedom. The political rights questions are grouped into three subcategories: Electoral Process (3 questions), Political Pluralism and Participation (4), and Functioning of Government (3). The civil liberties questions are grouped into four subcategories: Freedom of Expression and Belief (4 questions), Organizational Rights (3), Rule of Law (4), and Personal Autonomy and Individual Rights (4). The highest overall score that can be awarded for political rights is 40 (or a score of 4 for each of the 10 questions). The highest overall score that can be awarded for civil liberties is 60 (or a score of 4 for each of the 15 questions).
Table of variables :
Code
variables_tibble <-tibble(Variables =c("Country", "Region", "Year","Homicides per 100,000 people", "GDP per Capita (USD)", "Civil Liberties", "Political Rights","Control of Corruption: Estimate", "Government Effectiveness: Estimate", "Political Stability and Absence of Violence/Terrorism: Estimate", "Unemployment Rate", "HDI Value"),Meaning =c("Name of the country where the data applies", "Geographical or political region to which the country belongs", "Specific year during which the data pertains to", "Number of victims of homicides recorded per 100,000 inhabitants in a given year", "Gross Domestic Product divided by the midyear population, expressed in U.S. dollars", "Measure of the freedoms and rights enjoyed by individuals within the country", "Measure reflecting the degree to which a country's political processes are free from undue influence or control, allowing for the full participation of its citizens", "Qualitative evaluation of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as capture of the state by elites and private interests", "Assessment of the quality of public services, the civil service and its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies", "Measure that evaluates the likelihood of political instability and/or politically motivated violence, including terrorism", "Percentage of the labor force that is unemployed and actively seeking employment during the year", "Human Development Index value"))# Create a table using kable from the knitr packagekable(variables_tibble) %>%kable_styling()
Variables
Meaning
Country
Name of the country where the data applies
Region
Geographical or political region to which the country belongs
Year
Specific year during which the data pertains to
Homicides per 100,000 people
Number of victims of homicides recorded per 100,000 inhabitants in a given year
GDP per Capita (USD)
Gross Domestic Product divided by the midyear population, expressed in U.S. dollars
Civil Liberties
Measure of the freedoms and rights enjoyed by individuals within the country
Political Rights
Measure reflecting the degree to which a country's political processes are free from undue influence or control, allowing for the full participation of its citizens
Control of Corruption: Estimate
Qualitative evaluation of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as capture of the state by elites and private interests
Government Effectiveness: Estimate
Assessment of the quality of public services, the civil service and its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies
Political Stability and Absence of Violence/Terrorism: Estimate
Measure that evaluates the likelihood of political instability and/or politically motivated violence, including terrorism
Unemployment Rate
Percentage of the labor force that is unemployed and actively seeking employment during the year
HDI Value
Human Development Index value
2.3 Data Wrangling/Cleanup
2.3.1 Homicide Rate Data Set
Here are the first 50 rows of this data set, which reports homicides per countries, years and a lot of different factors :
Also, we will take away few variables that are not relevant for our project to simplify the data set. The variables “X” (which counts the observations), “Iso3_code” (which is the code of the countries) and “Source” are thus not useful.
While analyzing our data set, and especially the meta-data available on the website, we saw that there are two different unit of measurement used : Counts or Rate per 100,000 population. For the purpose of our project, we will only keep observations per 100,000 people because it avoids having bias related to size differences between countries. The “count” value provides the total number of homicides in countries, which is less relevant for comparing.
Code
filtered_dataset <-subset(homicide_per_country, Unit.of.measurement =="Rate per 100,000 population")
Now, we can take this column away and rename the variable VALUE by Homicides per 100,000 people.
Code
main_dataset <-subset(filtered_dataset, select =-c(Unit.of.measurement))colnames(main_dataset)[colnames(main_dataset) =="VALUE"] <-"Homicides per 100,000 people"
Also, we decided that the variables “Dimension” and “Category” and their values are not relevant for our project ; the different values they take complicate our data sets’ joining, therefore, we decided to only conserve the values “Total” of those two variables for our upcoming analysis.
Now, lets analyse the different values that the Indicator variable can take :the number of deaths from homicides, but also the number of person arrested for those homicides etc… We only keep observations concerning the victims of intentional homicide as the rest will not be useful for the later analysis.
Code
unique(main_dataset$Indicator)
We see that there is quite a lot of different values ; the number of deaths from homicides, but also the number of person arrested for those homicides etc…
For this project, we only want to keep the number of victims of intentional homicide to simplify our analysis. Indeed, we are going to join other data to this data set, thus we only want to conserve the number of victims of homicides per 100,000 people to ensure a simpler machine learning.
Code
main_dataset <- main_dataset %>%filter(main_dataset$Indicator =="Victims of intentional homicide")
We also remove the observations detailing the sex and the age group because we will not concentrate our analysis on these factors.
Now, we can remove columns that do not give us important information anymore.
Code
main_dataset <- main_dataset %>%select(-Subregion, -Dimension, -Category, -Sex, -Age, -Indicator) #we also take away the Dimension and Category variables which are now irrelevant to display
Finally, let’s apply a code to round the decimals to three :
Code
main_dataset$`Homicides per 100,000 people`<-round(main_dataset$`Homicides per 100,000 people`, 3)
2.3.2 HDI per Country Data Set
Code
#load datasethdi_per_country <-read.csv("data/HDI by country.csv")
This data set contains information about the Human Development Indicator (HDI), but also other variables that we will not take into account. Therefore, we will first remove any columns that does not contain information about the HDI.
The names of the years’ columns contain the prefix hdi_. We remove it to simplify our later combination of data sets. Then we create a column Year to match with the other data sets.
Code
# Remove the "hdi_" prefix from column names containing a yearnames(hdi_per_country) <-gsub("hdi_", "", names(hdi_per_country))#pivoting the dataset to create a column Yearhdi_per_country <-pivot_longer( hdi_per_country,cols =matches("^(199[0-9]|200[0-9]|201[0-9]|202[0-2])$"),names_to ="Year",values_to ="Value")
Finally, we remove columns that we will not use in our analysis (embracing categorical variables as they do not give further information) and rename the columns HDI Value for better understanding.
This data set contains 2 types of unemployment rate estimation: the national estimate and the modeled ILO estimate.
We decided to keep the modeled ILO estimate as this methodology involves statistical modeling techniques such as econometric modeling, interpolation, and extrapolation. The aim is to create estimates that are comparable across different countries by adhering to standardized definitions and methodologies. This often involves adjusting or re-calculating data provided by national authorities to fit international standards.
The data set contains detailed information about the gender of the unemployed population of each country. We remove these lines as our analysis focuses on a more global scope. Therefore, we only keep the lines concerning the total population of 15 years old an more.
Code
# Assuming your data frame is named 'df'unemployment_rate <-subset(unemployment_rate, Disaggregation =="total, 15+, Modeled")
As for all data sets, we clean it by removing irrelevant column, and by renaming columns for better understanding.
In order to have the column year in the same format as in the other data sets, we remove the prefix in the columns’ names and pivot the data set so we can have a Year column.
Code
#remove the characters on the years columnsnames(corruption_data) <-gsub(".*YR(\\d{4}).*", "\\1", names(corruption_data))#pivoting the dataset to create a column Yearcorruption_data <-pivot_longer( corruption_data,cols =matches("^(199[6-9]|200[0-9]|201[0-9]|202[0-2])$"),names_to ="Year",values_to ="Value")
Finally, we pivot the column Series.Name that contains the names of the indicators. We have three of them: the estimate of control of corruption, the estimate of government effectiveness and the estimate of political stability and absence of violence/terrorism.
In order to simplify our data set we will only keep the aggregate scores of each subcategory (A, B, C, D, E, F and G) which represents the aggregate scores for the Civil Liberties and the Political Rights categories.
Now, we rename columns in order to match with the other data sets, and remove lines that contain information about territories as our main data set only contains information about countries. We also remove the columns “P.R Ratings” and “C.L Ratings” because these scores were assigned a country prior to 2020. Thus we only keep the columns that are referring to the aggregate scores.
Code
#removing lines concerning territoriesfreedom_ranking_subset <- freedom_ranking_subset[freedom_ranking_subset$C.T !='t', ]#removing columnsfreedom_ranking_subset$C.T <-NULLfreedom_ranking_subset$PR.rating <-NULLfreedom_ranking_subset$CL.rating <-NULLfreedom_ranking_subset$Status <-NULLfreedom_ranking_subset$Total <-NULL#we do not keep it as it the the sum of two scores already in the dataset#renaming columnsfreedom_ranking_subset <- freedom_ranking_subset %>%rename("Country"="Country.Territory")freedom_ranking_subset <- freedom_ranking_subset %>%rename("Year"="Edition")freedom_ranking_subset <- freedom_ranking_subset %>%rename("Civil Liberties"="CL")freedom_ranking_subset <- freedom_ranking_subset %>%rename("Political Rights"="PR")
We need to pivot all of the columns of this data set under one variable, that we will call Year. Then, we convert the GDP values column into numeric values.
We will now join all of our data sets in order to conduct our analysis. Note that our final data set will not contain observations across all years for all countries, as some countries do not have homicides’ rate measurements for every year.
Code
main_dataset$Year <-as.character(main_dataset$Year)datasetGPD$Year <-as.character(datasetGPD$Year)freedom_ranking_subset$Year <-as.character(freedom_ranking_subset$Year)tidy_corruption$Year <-as.character(tidy_corruption$Year)unemployment_rate$Year <-as.character(unemployment_rate$Year)hdi_per_country$Year <-as.character(hdi_per_country$Year)# Perform inner joins : join all data sets by 'Country' and 'Year'all_data <- main_dataset %>%inner_join(datasetGPD, by =c("Country", "Year")) %>%inner_join(freedom_ranking_subset, by =c("Country", "Year")) %>%inner_join(tidy_corruption, by =c("Country", "Year")) %>%inner_join(unemployment_rate, by =c("Country", "Year")) %>%inner_join(hdi_per_country, by =c("Country", "Year"))
We clean the details of this final data set to simplify its structure.
Code
# Remove 'Region.y' and rename 'Region.x' to 'Region'all_data <- all_data %>%select(-Region.y) %>%rename(Region = Region.x)# Rounding to three decimals the variables that need itall_data <- all_data %>%mutate(`GDP per capita (USD)`=round(as.numeric(`GDP per capita (USD)`), 3),`Control of Corruption: Estimate`=round(as.numeric(`Control of Corruption: Estimate`), 3),`Government Effectiveness: Estimate`=round(as.numeric(`Government Effectiveness: Estimate`), 3),`Political Stability and Absence of Violence/Terrorism: Estimate`=round(as.numeric(`Political Stability and Absence of Violence/Terrorism: Estimate`), 3) )#remove region columnall_data$Region <-NULL
Finally, we verify that our data set does not contain any missing values.
Code
# Calculer le nombre de NA par colonne et par annéemissing_data_summary <- all_data %>%group_by(Year) %>%summarise(across(everything(), ~sum(is.na(.)), .names ="NA_in_{.col}"))
Political Stability and Absence of Violence/Terrorism: Estimate
Unemployment Rate
HDI Value
Min. : 0.0000
Min. : 255.3
Min. : 1.00
Min. : 6.00
Min. :-1.5970
Min. :-2.2270
Min. :-1.9640
Min. : 0.100
Min. :0.416
1st Qu.: 0.8377
1st Qu.: 5259.6
1st Qu.:25.00
1st Qu.:33.00
1st Qu.:-0.4298
1st Qu.:-0.2203
1st Qu.:-0.2627
1st Qu.: 4.003
1st Qu.:0.721
Median : 1.6855
Median : 13215.2
Median :34.00
Median :47.00
Median : 0.1790
Median : 0.2995
Median : 0.2980
Median : 6.145
Median :0.810
Mean : 7.8170
Mean : 24234.7
Mean :29.95
Mean :43.46
Mean : 0.4038
Mean : 0.4685
Mean : 0.2579
Mean : 7.354
Mean :0.796
3rd Qu.: 8.8825
3rd Qu.: 40430.3
3rd Qu.:38.00
3rd Qu.:56.00
3rd Qu.: 1.2873
3rd Qu.: 1.2648
3rd Qu.: 0.8985
3rd Qu.: 9.273
3rd Qu.:0.904
Max. :106.8200
Max. :134925.2
Max. :40.00
Max. :60.00
Max. : 2.4030
Max. : 2.2850
Max. : 1.6200
Max. :27.690
Max. :0.967
Homicides per 100,000 people
Extreme variability: The minimum value is 0, indicating that some countries recorded no homicides, while the maximum value is extremely high (106.82), showing that other countries are much more violent. Asymmetric distribution: The median (1.685) being much lower than the mean (7.817) suggests a strongly asymmetric distribution to the right, with some countries with very high homicide rates increasing the mean.
GDP per capita (USD)
Broad wealth scale: GDP per capita ranges from USD 255.3 to USD 134925.2, showing a significant gap between the poorest and richest economies. Distribution concentrated in the lower ranges: The median (USD 13215.2) is well below the mean (USD 24234.7), indicating a concentration of data in the lower ranges and some very high values pulling the mean upwards.
Political Rights and Civil Liberties
These two measures vary on a scale where the minimum and maximum values are fairly close. This suggests a degree of uniformity in the assessment of political rights and civil liberties, but there is still considerable variability between the quartiles. Total Freedom Ranking Balanced distribution: Values range from 7 to 100, with well-distributed quartiles, indicating a relatively balanced distribution of total freedom rankings.
Governance indices (Control of Corruption, Government Effectiveness, Political Stability)
Possible improvements: The values for Control of Corruption and Government Effectiveness show a range from negative to positive, indicating that some countries are well managed while others struggle with corruption and inefficiency. Political stability: Similarly, political stability shows a wide range, from strongly negative to slightly positive, reflecting significant variability in political stability between different countries.
Unemployment Rate
Major economic problems : With an average unemployment rate of 12.438% and a maximum value of 58.501%, some locations are clearly facing major economic challenges.
HDI Value (Human Development Index)
Human Development Scale: Values range from 0.416 to 0.967, showing a wide range of human development, from very underdeveloped to highly developed countries.
3.2 Correlation Matrix
A correlation matrix is useful because it provides a concise summary of the linear relationships between multiple variables, helping to identify patterns and dependencies that can guide further analysis. Additionally, it helps detect multi-collinearity, visualize relationships between variables and identify groups of correlated variables.
Very Strong Correlation: The blocks of dark red squares, indicate strong positive correlations. For example, the variables Civil Liberties and Political Rights are strongly positively correlated with each other. This is expected as they both measure aspects of freedom within a country.
Strong Correlation with GDP per Capita (USD): GDP per capita (USD) is highly correlated with Control of Corruption: Estimate, Government Effectiveness: Estimate, HDI Value, and Political Stability and Absence of Violence/Terrorism: Estimate. A higher GDP per capita is associated with better governance, higher human development, and more political stability.
Moderate to High Positive Correlations: Government Effectiveness: Estimate, Control of Corruption: Estimate, and HDI Value show moderate to high positive correlations with one another. This suggests that effective governance and control of corruption are associated with higher human development indices.
Weak Correlations with Unemployment Rate: The Unemployment Rate has weak correlations with most of the other variables, as indicated by lighter-colored squares (both red and blue). This could mean that the unemployment rate may not be as strongly affected by the above factors or does not influence them as much, at least not in a direct linear way.
Negative Correlations: There are some variables with weak negative correlations, as indicated by the light blue squares. For instance, Unemployment Rate has a slight negative correlation with GDP per capita (USD) and HDI Value. While the correlations are weak, they do suggest that higher unemployment may be associated with lower GDP per capita and human development index, which fits with economic theory.
No Correlation: A correlation coefficient close to 0, indicated by white squares, means there is no linear relationship between those variables. For instance, Unemployment Rate and Homicides per 100,000 people show no substantial correlation, implying that there is no linear relationship between them in this data set. This might mean that is is not relevant to continue working with this variable for the purpose of our project.
3.3 Principal Component Analysis (PCA)
Graph of variables
Using PCA, we can capture the most influential and significant aspects of our data set, which can facilitate more targeted and informative analyses.
Code
# Load necessary librarylibrary(FactoMineR)library(factoextra) # This package enhances PCA plots from FactoMineR# Assuming 'all_data_numeric' contains only numeric data# Perform PCA on the numeric datapca_data <-PCA(all_data_numeric, graph =FALSE) # 'graph = FALSE' stops automatic plotting# Plot the variable correlation circle (circle plot)# This plot shows the relationships among the variables.fviz_pca_var(pca_data, col.var ="contrib", # Color by contribution to the PCAgradient.cols =c("blue4", "blue", "cadetblue3"),repel =TRUE) # Avoid text overlapping
Code
summary(pca_data)
Variance Explained by Principal Components:
Dim.1 explains 57% of the total variance, indicating that it captures much of the information present in the data. Dim.2 explains 16.5% of the variance, which is significant but much less than the first dimension. The other dimensions explain progressively less variance, with very low values after Dim.3. The first two dimensions combined explain around 73.5% of the total variance, which means that these two principal components provide a reasonably complete representation of the data.
Variables Strongly Correlated with Dim.1:
Variables such as Government Effectiveness: Estimate and Control of Corruption: Estimate show strong positive correlations with the first dimension, indicating that they contribute significantly to what this dimension represents. Homicides per 100,000 people is negatively correlated with Dim.1, indicating an inverse relationship with the other variables strongly related to this dimension.
Variables in Dim.2 and Dim.3:
Some variables such as Unemployment Rate show significant contributions to the secondary dimensions, which may indicate that they capture different or more specific aspects of the data compared to the first dimension.
For Dim.1, since it seems to be heavily influenced by variables related to governance and development indices (like Political Rights, Civil Liberties, Control of Corruption: Estimate, and HDI Value), we will name it Governance and Development Factor.
Given that the second principal component (Dim.2) is heavily influenced by variables related to societal measures such as Homicides per 100,000 people, Political Rights, Civil Liberties, and Total Freedom Ranking, it appears to capture elements related to societal stability and personal freedoms. We will name it Societal Stability and Freedom Factor.
The results suggest that to obtain a complete understanding of the data, it is beneficial to consider at least the first two main dimensions and to examine the observations and variables that contribute most to these dimensions.
3.4 Check for missing values
The data set should not contain any missing values since we cleaned all of them in the Data Wrangling/Cleanup part of this report. Let’s check if this is the case :
Code
# Summarize missing values by each columnunlist(lapply(all_data_numeric[, !names(all_data_numeric) %in%"Year"], function(x) sum(is.na(x))))
Homicides per 100,000 people
0
GDP per capita (USD)
0
Political Rights
0
Civil Liberties
0
Control of Corruption: Estimate
0
Government Effectiveness: Estimate
0
Political Stability and Absence of Violence/Terrorism: Estimate
0
Unemployment Rate
0
HDI Value
0
3.5 Visualizing data distributions
3.5.1 Distributions of each variables
Code
# Histograms for all numeric variablesall_data_numeric %>%select_if(is.numeric) %>%gather(key ="variables", value ="values") %>%ggplot(aes(x = values)) +geom_histogram(bins =30) +facet_wrap(~variables, scales ='free_x')
Civil Liberties and Political Rights: Both of these distributions are right-skewed, indicating that most countries have high scores in these categories, with fewer countries having low scores. This could suggest a general global trend towards better civil liberties and political rights.
Control of Corruption and Government Effectiveness: These distributions are more centered, resembling a normal distribution. This suggests a balance, with many countries having moderate scores in these categories, but also a significant number with either high or low scores.
GDP per capita (USD): This distribution is heavily right-skewed, with most countries having low to moderate GDP per capita and a few outliers with very high GDP per capita. This indicates significant economic disparity between countries.
HDI Value: The distribution is left-skewed, with a majority of countries having high Human Development Index (HDI) values, suggesting better overall development in terms of health, education, and standard of living.
Homicides per 100,000 People: This distribution is also heavily right-skewed, with most countries having low homicide rates and a few outliers with very high rates. This indicates that high homicide rates are less common but can be extreme in certain regions.
Stability and Absence of Violence/Terrorism: The distribution is centered but slightly skewed, suggesting that most countries have moderate scores in this category, with some countries experiencing higher stability and others facing more violence or terrorism.
Unemployment Rate: This distribution is right-skewed, with most countries having low to moderate unemployment rates and fewer countries with very high unemployment rates.
3.5.2 Examining potential outliers
3.5.2.1 Homicides per 100,000 people variable :
This variable contains some specific outliers. We can see that those are countries of south-america and africa. Indeed, from 2014 to 2016, El Salvador and Honduras had the highest number of homicides per 100,000 people. This data isn’t very surprising, given that these countries are known to have broken murder records in those years. Other outliers include Latin American countries such as Colombia, Brazil, etc., as well as African countries like South Africa and Caribbean countries like Jamaica.
Code
library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Homicides per 100,000 people`<-as.numeric(all_data$`Homicides per 100,000 people`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Homicides per 100,000 people`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly <- homicides_plotly %>%layout(title ="Interactive Boxplot for Homicides per 100,000 People",yaxis =list(title ="Homicides per 100,000 People"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plothomicides_plotly
3.5.2.2 GDP per capita variable :
Here, we see that outliers, which are countries with especially high GDP per capita, are in fact related to particularly rich countries such as Luxembourg, Switzerland, Norway or Qatar, which is quite logical.
Code
# Load necessary librarieslibrary(tidyverse)library(plotly)all_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`GDP per capita (USD)`<-as.numeric(all_data$`GDP per capita (USD)`)# Remove any NA values that may interfere with plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlygdp_plotly <-plot_ly(data = all_data, y =~`GDP per capita (USD)`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text",boxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)', line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Layout adjustmentsgdp_plotly <- gdp_plotly %>%layout(title ="Interactive Boxplot for GDP per Capita (USD)",yaxis =list(title ="GDP per capita (USD)", zeroline =FALSE),xaxis =list(showticklabels =FALSE) # Hide x-axis tick labels as there's only one boxplot)# Print the plotgdp_plotly
3.5.2.3 Political Rights variable :
In this boxplot, we can see that Saudi Arabia had in 2018 the worse score in terms of political rights. This is hardly surprising, given the country’s reputation for controversy when it comes to human rights and egalitarianism.
The two other outliers displayed, being in the same region, are UAE and Bahrain.
Code
library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Political Rights`<-as.numeric(all_data$`Political Rights`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Political Rights`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly <- homicides_plotly %>%layout(title ="Interactive Boxplot for Political Rights",yaxis =list(title ="Political Rights"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plothomicides_plotly
3.5.2.4 Civil Liberties variable, Control of Corruption, Government Effectiveness and Political Stability variables :
We see no particular outliers for these four variables.
Code
library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Civil Liberties`<-as.numeric(all_data$`Civil Liberties`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Civil Liberties`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly <- homicides_plotly %>%layout(title ="Interactive Boxplot for Civil Liberties",yaxis =list(title ="Civil Liberties"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plothomicides_plotly
Code
library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Control of Corruption: Estimate`<-as.numeric(all_data$`Control of Corruption: Estimate`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Control of Corruption: Estimate`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly1 <- homicides_plotly %>%layout(title ="Interactive Boxplot for Control of Corruption",yaxis =list(title ="Control of Corruption: Estimate"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plotp1 <- homicides_plotly1p1
Code
#Gov. Effectiveness library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Government Effectiveness: Estimate`<-as.numeric(all_data$`Government Effectiveness: Estimate`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Government Effectiveness: Estimate`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly2 <- homicides_plotly %>%layout(title ="Interactive Boxplot for Government Effectiveness",yaxis =list(title ="Government Effectiveness: Estimate"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plotp2 <- homicides_plotly2p2
Code
#Pol. Stabilitylibrary(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Political Stability and Absence of Violence/Terrorism: Estimate`<-as.numeric(all_data$`Political Stability and Absence of Violence/Terrorism: Estimate`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Political Stability and Absence of Violence/Terrorism: Estimate`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly3 <- homicides_plotly %>%layout(title ="Interactive Boxplot for Political Stability",yaxis =list(title ="Political Stability: Estimate"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plotp3 <- homicides_plotly3p3
3.5.2.5 Unemployment Rate variable :
There are quite a few outliers visible in this boxplot. We see Greece over several consecutive years, which is not surprising given the difficulties this country has had in terms of employment in the decade following the 2008 crisis. The same applies to Spain.
In addition, African countries such as South Africa and Botswana have had years with particularly high unemployment rates, as has Jordan in the Middle East.
Code
library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`Unemployment Rate`<-as.numeric(all_data$`Unemployment Rate`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`Unemployment Rate`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly <- homicides_plotly %>%layout(title ="Interactive Boxplot for Unemployment Rate",yaxis =list(title ="Unemployment Rate"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plothomicides_plotly
3.5.2.6 Human Development Index variable :
Finally, we see here that Burundi stands out from the rest with a very low HDI for several consecutive years and indeed, according to United Nations Development Program, this country is still today categorized as low human development.
Code
library(plotly)# Convert necessary columns to correct data types, especially if they are not correctly formattedall_data$Country <-as.character(all_data$Country)all_data$Year <-as.numeric(all_data$Year)all_data$`HDI Value`<-as.numeric(all_data$`HDI Value`)# Remove any NA values to avoid issues during plottingall_data <-na.omit(all_data)# Create an interactive boxplot with Plotlyhomicides_plotly <-plot_ly(data = all_data, y =~`HDI Value`, type ='box',text =~paste("Country:", Country, "<br>Year:", Year),hoverinfo ="text", # Hover text informationboxpoints ="outliers", # Show only outliers as pointsmarker =list(color ='rgb(8,81,156)', outliercolor ='rgba(219, 64, 82, 0.6)',line =list(outliercolor ='rgba(219, 64, 82, 1.0)', outlierwidth =2)),line =list(color ='rgb(8,81,156)'))# Adjust the layout of the plothomicides_plotly <- homicides_plotly %>%layout(title ="Interactive Boxplot for Human Development Index",yaxis =list(title ="HDI Value"),xaxis =list(title ="Boxplot", showticklabels =FALSE) # No tick labels necessary on x-axis)# Print the interactive plothomicides_plotly
3.5.3 Categorical data analysis :
In fact, the data set we use for this project does not contain any categorical variable and value. Thus, there is nothing to explore related to this type of data.
3.6 Uni-variate visualizations
3.6.0.1 Average homicides per 100,000 per years
This graph shows the average number of homicides per 100,000 people across the world since 1990. The scale on the y-axis starts at 5/100,000. In 2020, we notice a surprising decrease. The explanation certainly lies behind the COVID pandemic. People were locked down, preventing crimes.
Code
# Calculate the average value of homicide per year from the main_datasetaverage_homicide_per_year <- main_dataset %>%group_by(Year) %>%summarise(`Average Homicide`=mean(`Homicides per 100,000 people`, na.rm =TRUE)) library(plotly)library(ggplot2) p<-ggplot(average_homicide_per_year, aes(x = Year, y =`Average Homicide`)) +geom_col(fill ="aliceblue", color ="black") +scale_x_discrete(name ="Year") +labs(title ="Average Homicide per 100,000 People per Year",x ="Year",y ="Average Homicide Count") +theme_minimal() +theme(plot.title =element_text(hjust =0.5),axis.text.x =element_text(angle =45, hjust =1)) +coord_cartesian(ylim =c(5, NA)) # Adjust y-axis limits without removing dataggplotly(p)
The following plot shows the average value of homicide per 100,000 people among the countries observed in our data set from 2013 to 2022.
However, we have to be careful and remove the countries that do not contain observations for each year so that it does not bias the annual average.
Code
# Find all unique years in the datasetunique_years <-unique(all_data$Year)# Find countries that have observations for every yearcountries_with_observations <- all_data %>%group_by(Country) %>%summarize(all_years_present =all(unique_years %in% Year)) %>%filter(all_years_present)# List of countries with complete observationscountries_with_full_obs <- countries_with_observations$Countrya <-nrow(countries_with_observations)#create a subset with countries_with_observationssubset_full_data <- all_data %>%filter(Country %in% countries_with_full_obs)
This is the data frame we are going to use for our further explorations.
Lets now fix our previous graph :
Code
# Calculate the average value of homicide per yearaverage_homicide_per_year <- subset_full_data %>%group_by(Year) %>%summarise(`Average Homicide`=mean(`Homicides per 100,000 people`, na.rm =TRUE))# Plot the datap <-ggplot(average_homicide_per_year, aes(x = Year, y =`Average Homicide`)) +geom_col(fill ="aliceblue", color ="black") +scale_x_continuous(name ="Year", breaks =seq(min(average_homicide_per_year$Year), max(average_homicide_per_year$Year), by =2)) +labs(title ="Average Homicide per 100,000 People per Year",x ="Year",y ="Average Homicide Count") +theme_minimal() +theme(plot.title =element_text(hjust =0.5),axis.text.x =element_text(angle =45, hjust =1)) +coord_cartesian(ylim =c(7, NA)) # Adjust y-axis limits without removing data# Convert to an interactive plotly plotggplotly(p)
Let’s now plot this as a line chart for visualizations concerns :
Code
# Convert data types as necessarysubset_full_data$Year <-as.integer(subset_full_data$Year)subset_full_data$Country <-as.character(subset_full_data$Country)subset_full_data$`Homicides per 100,000 people`<-as.numeric(subset_full_data$`Homicides per 100,000 people`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Calculate the average Homicides per 100,000 people per yearaverage_homicides_per_year <- subset_full_data %>%group_by(Year) %>%summarise(Average_Homicides =mean(`Homicides per 100,000 people`, na.rm =TRUE))# Plot the average Homicides per 100,000 people over the yearshomicides_plot <-ggplot(average_homicides_per_year, aes(x = Year, y = Average_Homicides)) +geom_line(color ="black") +# Connect points by year with a linegeom_point(color ="black") +# Add points to the linescale_x_continuous(breaks = average_homicides_per_year$Year) +# Ensure every year is marked on the x-axislabs(title ="Average Homicides Per 100,000 People Over Years",x ="Year",y ="Average Homicides Per 100,000 People") +theme_minimal() # Using a minimal theme for a cleaner look# Print the plotprint(homicides_plot)
We can divide the tendencies of the average homicides rate per 100,000 people into four different phases. Lets name them and provide some potential insights.
Stable Homicide Rates (2013-2014):
Stability in economic conditions, law enforcement practices, and social policies may have contributed to maintaining a steady homicide rate during these years. Absence of significant social, political, or economic disruptions could also explain the stable rates.
Increase in 2015:
Economic challenges or downturns could have led to increased crime rates, including homicides. Moreover, increased political instability or changes in governance could have contributed to higher violence levels during this year.
Notably, a surge in drug trafficking and related violence might explain the spike in homicide rates.
Gradual Decline (2016-2019):
This decline could be due to improved law enforcement; enhanced policing strategies, better crime prevention programs, and community policing efforts could have contributed to the decline in homicide rates. Improvements in economic conditions and job creation efforts might also have played a role in reducing crime rates.
Moreover, increased focus on social programs, education, and employment opportunities could have led to a reduction in crime rates.
Sharp Decline and Subsequent Increase (2020-2022):
This phase of the plot is higly related to the impact of COVID-19 pandemic: the pandemic led to significant social and economic disruptions, including increased unemployment, social isolation, and stress, which may have initially reduced homicides due to lockdowns but later contributed to a rise as restrictions eased. The pandemic also strained law enforcement resources and capabilities, potentially leading to changes in crime patterns.
In addition to economic hardship, increased social unrest and protests during this period could have also contributed to higher homicide rates.
3.6.0.2 Average GDP per capita over years
In this graph, we can see a downward trend in average GDP per capita during the year 2020, which corresponds to the COVID-19 period and the shutdown of industries worldwide following the established confinements.
However, we are seeing very low average GDP per capita in 2015 and 2016, which is not very intuitive to understand. One potential reason for these two values is the structure of our data set, which, depending on the countries concerned, may have relatively low values in these two years. All in all, there’s no particular reason.
It is, however, interesting to see how average economic growth picks up in the post-covid period in 2021 and 2022.
Code
# Convert data types as necessarysubset_full_data$Year <-as.integer(subset_full_data$Year)subset_full_data$Country <-as.character(subset_full_data$Country)subset_full_data$`GDP per capita (USD)`<-as.numeric(subset_full_data$`GDP per capita (USD)`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Calculate the average GDP per capita per yearaverage_gdp_per_year <- subset_full_data %>%group_by(Year) %>%summarise(Average_GDP =mean(`GDP per capita (USD)`, na.rm =TRUE))# Plot the average GDP per capita over the yearsgdp_plot <-ggplot(average_gdp_per_year, aes(x = Year, y = Average_GDP)) +geom_line(color ="blue") +# Connect points by year with a linegeom_point(color ="red") +# Add points to the linescale_x_continuous(breaks = average_gdp_per_year$Year) +# Ensure every year is marked on the x-axislabs(title ="Average GDP Per Capita (USD) Over Years",x ="Year",y ="Average GDP Per Capita (USD)") +theme_minimal() # Using a minimal theme for a cleaner look# Print the plotprint(gdp_plot)
3.6.0.3 Average unemployment rate over years
Similar to the previous graph, we see a trend strongly marked by COVID-19 and the notorious rise in the average unemployment rate that it engendered, causing an economic crisis and numerous bankruptcies.
However, we can see that average unemployment in 2013 and 2014 is comparable to that of COVID-19, which is also probably due to the structure of our data set and the high unemployment rates in Europe and elsewhere (Greece, Spain, South Africa, etc.) in those years, as seen above.
Code
# Convert data types as necessarysubset_full_data$Year <-as.integer(subset_full_data$Year)subset_full_data$Country <-as.character(subset_full_data$Country)subset_full_data$`Unemployment Rate`<-as.numeric(subset_full_data$`Unemployment Rate`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Calculate the average Unemployment rate per yearaverage_unemployment_per_year <- subset_full_data %>%group_by(Year) %>%summarise(Average_Unemployment =mean(`Unemployment Rate`, na.rm =TRUE))# Plot the average Unemployment rate over the yearsunemployment_plot <-ggplot(average_unemployment_per_year, aes(x = Year, y = Average_Unemployment)) +geom_line(color ="orange") +# Connect points by year with a linegeom_point(color ="brown4") +# Add points to the linescale_x_continuous(breaks = average_unemployment_per_year$Year) +# Ensure every year is marked on the x-axislabs(title ="Average Unemployment Rate Over Years",x ="Year",y ="Average Unemployment Rate") +theme_minimal() # Using a minimal theme for a cleaner look# Print the plotprint(unemployment_plot)
3.6.0.4 Average Human Development Index rate over years
Finally, we see that for this variable, the index was increasing from year to year, only to fall again in 2020 as a result of the pandemic.
In fact, this decrease can be linked to several factors associated with the pandemic’s widespread impacts on health, education, and income, which are the three core dimensions measured by the HDI.
Indeed, the pandemic had impacts in health by increasing mortality and restricting healthcare access and resources. Moreover, it provoked educational disruptions with school closures and impacts on education quality. COVID-19 also has economic consequences, increasing poverty and widening inequalities. Restrictions and lock-downs clearly had impacts on human rights and freedoms.
Code
# Convert data types as necessarysubset_full_data$Year <-as.integer(subset_full_data$Year)subset_full_data$Country <-as.character(subset_full_data$Country)subset_full_data$`HDI Value`<-as.numeric(subset_full_data$`HDI Value`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Calculate the average HDI Value per yearaverage_hdi_per_year <- subset_full_data %>%group_by(Year) %>%summarise(Average_HDI =mean(`HDI Value`, na.rm =TRUE))# Plot the average HDI Value over the yearshdi_plot <-ggplot(average_hdi_per_year, aes(x = Year, y = Average_HDI)) +geom_line(color ="pink") +# Connect points by year with a linegeom_point(color ="darkorchid1") +# Add points to the linescale_x_continuous(breaks = average_hdi_per_year$Year) +# Ensure every year is marked on the x-axislabs(title ="Average HDI Value Over Years",x ="Year",y ="Average HDI Value") +theme_minimal() # Using a minimal theme for a cleaner look# Print the plotprint(hdi_plot)
3.7 Multivariate visualizations
3.7.0.1 Relationship between GDP per capita and unemployment
We can see from this scatter-plot that, in general, the lower a country’s unemployment rate, the more likely it is to have a high GDP per capita. This relationship is quite logical, since unemployment is a variable negatively correlated with a country’s economic health. Indeed, a country with a low unemployment rate is more likely to have a good economic health and thus high GDP per capita.
Code
library(ggplot2)# Ensure data types are correct (optional, depends on your data reading settings)subset_full_data$`GDP per capita (USD)`<-as.numeric(subset_full_data$`GDP per capita (USD)`)subset_full_data$`Unemployment Rate`<-as.numeric(subset_full_data$`Unemployment Rate`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Create the scatter plotggplot(subset_full_data, aes(x =`GDP per capita (USD)`, y =`Unemployment Rate`)) +geom_point(aes(color =`Unemployment Rate`), alpha =0.6, size =2, shape =19) +# Color points by Unemployment Ratescale_color_gradient(low ="blue", high ="red") +# Use a color gradient from blue to redlabs(title ="GDP per capita vs unemployment",x ="GDP per capita (USD)",y ="Unemployment Rate",caption =" " ) +theme_minimal(base_size =14) +# Use a minimal theme with readable base font sizetheme(plot.title =element_text(hjust =0.5, face ="bold"), # Bold and center the titleplot.caption =element_text(face ="italic"), # Italicize the captionlegend.position ="right"# Adjust the legend position ) +guides(color =guide_colorbar(title ="Unemployment Rate"))
3.7.0.2 Relationship between homicides per 100,000 people and unemployment
This graph may not seem very intuitive; logic would dictate that the higher a country’s unemployment rate (and therefore its poverty), the higher its number of homicides per 100,000 inhabitants.
However, here, the countries with the most unemployment are concentrated towards the lowest homicide rates per 100,000 inhabitants. Therefore, as we saw previously with the correlation matrix, we cannot say that murders and unemployment are correlated.
Code
# Load necessary librarylibrary(ggplot2)subset_full_data$`Homicides per 100,000 people`<-as.numeric(subset_full_data$`Homicides per 100,000 people`)subset_full_data$`Unemployment Rate`<-as.numeric(subset_full_data$`Unemployment Rate`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Create the scatter plotggplot(subset_full_data, aes(x =`Homicides per 100,000 people`, y =`Unemployment Rate`)) +geom_point(aes(color =`Unemployment Rate`), alpha =0.6, size =2, shape =19) +# Color points by Unemployment Ratescale_color_gradient(low ="lightblue", high ="red") +# Use a color gradient from blue to redlabs(title ="Homicides per 100,000 People vs unemployment",x ="Homicides per 100,000 People",y ="Unemployment Rate",caption =" " ) +theme_minimal(base_size =14) +# Use a minimal theme with readable base font sizetheme(plot.title =element_text(hjust =0.5, face ="bold"), # Bold and center the titleplot.caption =element_text(face ="italic"), # Italicize the captionlegend.position ="right"# Adjust the legend position ) +guides(color =guide_colorbar(title ="Unemployment Rate"))
This graph is for us an additional clue that the unemployment rate variable is likely to be not relevant for our further analysis.
3.7.0.3 Relationship between homicides per 100,000 people and GDP per capita
On this scatter-plot, we see that countries with high GDP per capita are less likely to have a high homicide rate per 100,000 inhabitants.
Indeed, economic prosperity impacts various aspects of a society that are linked to crime rates, such as employment opportunities, access to education, community programs, resource allocation for public safety, judicial system efficient, infrastructures and others more.
Code
# Ensure data types are correct (optional, depends on your data reading settings)subset_full_data$`Homicides per 100,000 people`<-as.numeric(subset_full_data$`Homicides per 100,000 people`)subset_full_data$`GDP per capita (USD)`<-as.numeric(subset_full_data$`GDP per capita (USD)`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Create the scatter plotggplot(subset_full_data, aes(x =`Homicides per 100,000 people`, y =`GDP per capita (USD)`)) +geom_point(aes(color =`GDP per capita (USD)`), alpha =0.6, size =2, shape =19) +# Color points by GDP per capita (USD)scale_color_gradient(low ="lightblue", high ="orange") +# Use a color gradient from blue to redlabs(title ="Homicides per 100,000 People vs GDP per Capita (USD)",x ="Homicides per 100,000 People",y ="GDP per Capita (USD)",caption =" " ) +theme_minimal(base_size =14) +# Use a minimal theme with readable base font sizetheme(plot.title =element_text(hjust =0.5, face ="bold"), # Bold and center the titleplot.caption =element_text(face ="italic"), # Italicize the captionlegend.position ="right"# Adjust the legend position ) +guides(color =guide_colorbar(title ="GDP per Capita (USD)"))
3.7.0.4 Relationship between homicides per 100,000 people and HDI value
Finally, on this last graph, we see that countries with high HDI value are less likely to have a high homicide rate per 100,000 inhabitants.
Indeed, it is quite logical that countries with high Human Development Index values typically experience lower homicide rates because their higher standards of living, education, and economic stability contribute to social cohesion and reduce motivations for crime. Additionally, effective governance and robust legal systems in these countries enhance public safety and ensure the rule of law is maintained, further deterring violent crimes.
Code
# Ensure data types are correct (optional, depends on your data reading settings)subset_full_data$`Homicides per 100,000 people`<-as.numeric(subset_full_data$`Homicides per 100,000 people`)subset_full_data$`HDI Value`<-as.numeric(subset_full_data$`HDI Value`)# Remove any NA values to ensure clean plottingsubset_full_data <-na.omit(subset_full_data)# Create the scatter plotggplot(subset_full_data, aes(x =`Homicides per 100,000 people`, y =`HDI Value`)) +geom_point(aes(color =`HDI Value`), alpha =0.6, size =2, shape =19) +# Color points by HDI Valuescale_color_gradient(low ="purple", high ="orange") +# Use a color gradient from blue to redlabs(title ="Homicides per 100,000 People vs HDI Value",x ="Homicides per 100,000 People",y ="HDI Value",caption =" " ) +theme_minimal(base_size =14) +# Use a minimal theme with readable base font sizetheme(plot.title =element_text(hjust =0.5, face ="bold"), # Bold and center the titleplot.caption =element_text(face ="italic"), # Italicize the captionlegend.position ="right"# Adjust the legend position ) +guides(color =guide_colorbar(title ="HDI Value"))
We start by creating a subset of “all_data” by removing the column Country. We transform years into incremental numerical variables for the analysis: 0 if 2013, 1 if 2014, 2 if 2015, etc…
We start by fitting one very simple model, a linear regression. The objective is to show the progress we can obtain with other more advanced models.
Code
model_lm <-lm(`Homicides per 100,000 people`~., data = train_set)
Code
summary(model_lm)
In fact, this very simple model explains about 25.84% of the variability in homicide rates, with a modest explanatory power, indicating that additional variables or more complex models might be needed to better understand and predict homicide rates.
4.2.1 Variable Selection
We use a step-wise variable selection method with the AIC by backward elimination. Therefore, we start with all potential variables, remove the variable that results in the smallest increase (or largest decrease) in AIC, and continue removing variables as long as the AIC is decreasing.
The variables that have been retained in our final model to predict Homicides per 100,000 people are Political Rights, Civil Liberties, Control of Corruption: Estimate and HDI Value. Now, we can fit the final model and evaluate its performance with the test set.
Code
final_step_lm <-lm(`Homicides per 100,000 people`~`Political Rights`+`Civil Liberties`+`Control of Corruption: Estimate`+`HDI Value`, data = subset_for_modelling)summary(final_step_lm)
Call:
lm(formula = `Homicides per 100,000 people` ~ `Political Rights` +
`Civil Liberties` + `Control of Corruption: Estimate` + `HDI Value`,
data = subset_for_modelling)
Residuals:
Min 1Q Median 3Q Max
-19.851 -6.422 -1.323 2.176 86.526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.3676 4.0646 4.519 0.000007348110
`Political Rights` 0.9270 0.1524 6.084 0.000000001973
`Civil Liberties` -0.4804 0.1258 -3.818 0.000147
`Control of Corruption: Estimate` -4.1055 0.6572 -6.247 0.000000000744
`HDI Value` -19.8154 5.6420 -3.512 0.000474
(Intercept) ***
`Political Rights` ***
`Civil Liberties` ***
`Control of Corruption: Estimate` ***
`HDI Value` ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.14 on 669 degrees of freedom
Multiple R-squared: 0.2574, Adjusted R-squared: 0.2529
F-statistic: 57.97 on 4 and 669 DF, p-value: < 0.00000000000000022
We predict the homicide rate in the test set. We already know by looking at the adjusted R-squared in the summary that the prediction quality is not good.
Code
model_lm_pred <-predict(model_lm_sel, newdata=test_set)plot(test_set$`Homicides per 100,000 people`~ model_lm_pred, xlab="Prediction", ylab="Observed Homicides")abline(0,1, col ="red")
For predictions close to zero and slightly above, the model appears to be quite accurate as most observed points are clustered near the line.
However the model’s predictions are less reliable or consistent in higher range. Above 5, the observed data points show a significant dispersion around the trend line. Indeed, for higher actual values, the predicted values are generally lower than the actual values, indicating a tendency to underestimate higher values. There are also some instances where the predicted values are negative, which is not feasible for homicide counts, indicating model limitations or issues with data pre-processing.
Linear regression assumes a linear relationship between the predictors and the response variable. The observed scatter and deviation from the red line suggest that the actual relationship may not be strictly linear, and the model fails to capture the complexity of the data.
4.3 Random Forest
To illustrate, we first build a tree with our training set.
Code
set.seed(123)data_reg <-rpart(`Homicides per 100,000 people`~ ., data=train_set)rpart.plot(data_reg)
We optimize the predictive power of the random forest algorithm on the data set, by pre-processing the data, tuning the model parameters (here we are using a tuning grid for the “mtry” parameter), and evaluating the model’s performance with cross-validation.
The cross-validation method specifies 10-fold cross-validation, which means the training set is split into 10 smaller sets, the model is trained on 9 of these, and validated on the 10th, rotating until each subset has been used for validation.
Code
library(janitor)train_rf <-clean_names(train_set)test_rf <-clean_names(test_set)# Define a tuning grid for mtry tuneGrid <-expand.grid(mtry =c(1, 3, 5, 7, 9))# Create a trainControl object for cross-validationtrainControl <-trainControl(method ="cv", number =10) # 10-fold cross-validationset.seed(234)# Train the random forest model with hyperparameter tuningrf_tuned <-train(homicides_per_100_000_people ~ ., data = train_rf, method ="rf", trControl = trainControl, tuneGrid = tuneGrid, ntree =1000, importance =TRUE)# Check the best tuning parametersrf_tuned$bestTune %>%kable()
mtry
3
5
Plotting the model
We compare the predicted against the actual values to see how well the model performed after tuning :
Code
predictions <-predict(rf_tuned, newdata = test_rf)# Create a plot comparing predicted to actual valuesggplot(data.frame(Actual = test_rf$homicides_per_100_000_people, Predicted = predictions), aes(x = Actual, y = Predicted)) +geom_point() +geom_abline(slope =1, intercept =0, color ="red", linetype ="dashed") +labs(title ="Predicted vs. Actual Values", x ="Actual Values", y ="Predicted Values") +theme_minimal()
The majority of points follow a roughly linear trend along the diagonal, which indicates that the model generally captures the relationship between predictors and target values. Some points deviate significantly from the line, particularly at higher values, where predictions appear to underestimate the actual values. This could again indicate a tendency of the model to be less accurate at predicting higher values.
A noticeable cluster of points near the lower values on both axes might indicate that the data set contains many lower-value samples, and the model generally predicts these accurately.
4.4 Gradient Boosting Machines
We fit a GBM model on the training set of our data to predict Homicides per 100,000 people, while also employing cross-validation to optimize the model’s hyper-parameters. This process helps ensure that the model is both robust and optimized for the predictive task by systematically testing different configurations and selecting the best one based on empirical performance.
Code
# Define the tuning grid for the gbm modeltune_grid <-expand.grid(n.trees =c(500, 1000, 1500),interaction.depth =c(3, 4, 5),shrinkage =c(0.01, 0.05, 0.1),n.minobsinnode =c(5, 10))# Set up cross-validation controlstrControl <-trainControl(method ="cv", number =10)# Train the gbm model using caretset.seed(123)gbm_tuned <-train(`Homicides per 100,000 people`~ ., data = train_set, method ="gbm", trControl = trControl, tuneGrid = tune_grid, verbose =FALSE)# View the best parameters based on cross-validationgbm_tuned$bestTune %>%kable()
n.trees
interaction.depth
shrinkage
n.minobsinnode
33
1500
5
0.05
5
Plotting the model
Here, we use the optimized model to make predictions on the test data set.
Code
model_pred_gbm <-predict(gbm_tuned, newdata = test_set)plot(test_set$`Homicides per 100,000 people`, model_pred_gbm, xlab ="Actual Values", ylab ="Predicted Values", main ="Predicted vs. Actual")abline(0, 1, col ="red") # Line for perfect prediction
While the model predictions generally follow the trend of actual values, there are significant deviations, particularly at higher values. This may indicate that the model is less accurate when predicting higher values or that those predictions have higher variance.
4.5 Metrics
4.5.1 R-squared
Essentially, R-squared quantifies how well the variations in the independent variables explain the variations in the dependent variable, where 0 indicates that the model explains none of the variability of the response data around its mean, and 1 indicates that the model explains all the variability of the response data around its mean.
Code
r2_lm <-R2(predict(final_step_lm, newdata = test_set), test_set$`Homicides per 100,000 people`) #lin. regr2_gbm <-R2(predict(gbm_tuned, newdata = test_set), test_set$`Homicides per 100,000 people`) #gbmr2_rf <-R2(predict(rf_tuned, newdata = test_rf ), test_rf$homicides_per_100_000_people) #random forestprint(paste("R2 for Linear Regression model:", r2_lm))
[1] "R2 for Linear Regression model: 0.291143184826871"
Code
print(paste("R2 for Random Forest model:", r2_rf))
[1] "R2 for Random Forest model: 0.914374167648047"
Code
print(paste("R2 for GBM model:", r2_gbm))
[1] "R2 for GBM model: 0.844370018322847"
Random Forest is here the best model.
4.5.2 RMSE
RMSE, or Root Mean Squared Error, is a widely used measure of the difference between values predicted by a model or an estimator and the values actually observed from the environment that is being modeled or estimated. It serves as a way to quantify the magnitude of prediction errors from a regression model. A lower RMSE is usually better because it indicates smaller differences between predicted and observed values, suggesting better model performance.
Code
rmse_lm <-RMSE(predict(final_step_lm, newdata = test_set), test_set$`Homicides per 100,000 people`) #lin. modelrmse_gbm <-RMSE(predict(gbm_tuned, newdata = test_set), test_set$`Homicides per 100,000 people`) #gbmrmse_rf <-RMSE(predict(rf_tuned, newdata = test_rf ), test_rf$homicides_per_100_000_people) #random forestprint(paste("RMSE for Linear Regression model:", rmse_lm))
[1] "RMSE for Linear Regression model: 10.9010377821501"
Code
print(paste("RMSE for Random Forest model:", rmse_rf))
[1] "RMSE for Random Forest model: 4.30147859218741"
Code
print(paste("RMSE for GBM model:", rmse_gbm))
[1] "RMSE for GBM model: 5.12144832352382"
Random Forest is here the best model.
4.5.3 MAE
MAE, or Mean Absolute Error, is a statistical measure used to quantify the accuracy of a forecast model. It calculates the average magnitude of the errors in a set of predictions, without considering their direction (i.e., whether they are underestimates or overestimates). RMSE tends to punish larger errors more than MAE does, due to the squaring of each term, which effectively weights larger errors higher than smaller ones.
Code
mae_lm <-MAE(predict(final_step_lm, newdata = test_set), test_set$`Homicides per 100,000 people`) #lin. regmae_gbm <-MAE(predict(gbm_tuned, newdata = test_set), test_set$`Homicides per 100,000 people`) #gbmmae_rf <-MAE(predict(rf_tuned, newdata = test_rf ), test_rf$homicides_per_100_000_people) #random forestprint(paste("MAE for Linear Regression model:", mae_lm))
[1] "MAE for Linear Regression model: 7.04395955349281"
Code
print(paste("MAE for Random Forest model:", mae_rf))
[1] "MAE for Random Forest model: 2.32405447937063"
Code
print(paste("MAE for GBM model:", mae_gbm))
[1] "MAE for GBM model: 2.84315329008671"
4.6 Best Model
The model with the best metrics is the Random Forest model. We start by checking if this model is over-fitting or not.
Code
#comparing metrics on training and test sets# Load necessary librarieslibrary(rpart)library(rpart.plot)library(randomForest)library(gbm)library(caret)library(dplyr)library(janitor)library(ggplot2)library(knitr)# Prepare datasubset_for_modelling <- all_data %>%select(-Country)subset_for_modelling$YearFactor <-as.integer(factor(subset_for_modelling$Year, levels =sort(unique(subset_for_modelling$Year))))subset_for_modelling <- subset_for_modelling %>%select(-Year)set.seed(234)index <-sample(x=c(1, 2), size=nrow(subset_for_modelling), replace=TRUE, prob=c(0.75, 0.25)) # 1==training set, 2==test settrain_set <- subset_for_modelling[index==1, ]test_set <- subset_for_modelling[index==2, ]# Clean namestrain_rf <-clean_names(train_set)test_rf <-clean_names(test_set)# Define a tuning grid for mtry tuneGrid <-expand.grid(mtry =c(1, 3, 5, 7, 9))# Create a trainControl object for cross-validationtrainControl <-trainControl(method ="cv", number =10) # 10-fold cross-validationset.seed(234)# Train the random forest model with hyperparameter tuningrf_tuned <-train(homicides_per_100_000_people ~ ., data = train_rf, method ="rf", trControl = trainControl, tuneGrid = tuneGrid, ntree =1000, importance =TRUE)# Predictions on the training settrain_predictions <-predict(rf_tuned, newdata = train_rf)# Predictions on the test settest_predictions <-predict(rf_tuned, newdata = test_rf)# Calculate metrics on the training setr2_train <-R2(train_predictions, train_rf$homicides_per_100_000_people)rmse_train <-RMSE(train_predictions, train_rf$homicides_per_100_000_people)mae_train <-MAE(train_predictions, train_rf$homicides_per_100_000_people)# Calculate metrics on the test setr2_test <-R2(test_predictions, test_rf$homicides_per_100_000_people)rmse_test <-RMSE(test_predictions, test_rf$homicides_per_100_000_people)mae_test <-MAE(test_predictions, test_rf$homicides_per_100_000_people)# Compare metricsmetrics_comparison <-data.frame(Metric =c("R2", "RMSE", "MAE"),Training =c(r2_train, rmse_train, mae_train),Testing =c(r2_test, rmse_test, mae_test))# Display the comparisonkable(metrics_comparison) %>%kable_styling()
Metric
Training
Testing
R2
0.9688419
0.9143742
RMSE
2.7216448
4.3014786
MAE
1.1851703
2.3240545
The model appears to perform well on the training set, with high R2 and relatively low RMSE and MAE. However, there is a noticeable decline in performance on the test set; the higher error metrics on the test set (compared to the training set) suggest that the model might be overfitting to some extent. This overfitting means that the model has learned patterns specific to the training data that do not generalize well to unseen data.
However, despite the decline, the model still shows reasonably good performance on the test set, indicating that while there is overfitting, it is not extreme.
Then, we plot the feature importance of that model.
Code
# Assuming rf_tuned is the trained model objectimportance <-varImp(rf_tuned)# Plot feature importanceplot(importance, main ="Feature Importance in Random Forest")
The results show that the most important feature is the HDI Value, followed by the GDP per Capita (USD). The Control of Corruption Estimate, the Political rights and the Political Stability and Absence of Violence Terrorism Estimate seem moderately significant to the model. Then, the remaining features are not important to the model according to that graph.
The variable importance of the Gradient Boosting model is also available in the appendix of the project.
5 Unsupervised Learning
As we are working on a time series, we decided to do the clustering by year. Obviously, we will not go through the 10 year of our data, because it would not be coherent with our model selection in the supervised learning.
Therefore, in this section, we only analyze the year 2022.
5.1 Data Preparation
First we prepare our subset for the year 2022.
2022:
Code
library(dplyr)#subset for the year 2022data_2022 <- all_data %>%filter(Year ==2022)# Convertir les pays en noms de lignerow.names(data_2022) <- data_2022$Country# remove year and coountry column to have a numerical datasetdata_2022$Country <-NULLdata_2022$Year <-NULLprint(paste("We have observations for", nrow(data_2022), "countries for the year 2022."))
[1] "We have observations for 45 countries for the year 2022."
Standardize the data:
The data must be standardized (i.e., scaled) to make variables comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.
Code
# Standardize only numeric variablesdata_2022_standardized <- data_2022 %>%mutate(across(where(is.numeric), scale))
5.2 Clustering
5.2.1 Clustering Distance Measures
The classification of observations into groups requires some methods for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a distance matrix.
Code
distance <-get_dist(data_2022_standardized) #here we use the euclidian distancefviz_dist(distance, gradient =list(low ="#00AFBB", mid ="white", high ="#FC4E07"))
5.2.2 K-Mean Clustering
K-Mean clustering classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e, centroïd) which corresponds to the mean of points assigned to the cluster.
In this graph, the “elbow” appears to be at the point where the rate of decrease sharply changes, which seems to be between 3 and 4 clusters. After this point, the sum of squared errors (SSE) starts to decrease at a slower rate.
Average Silhouette Method
In short, the average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. However, the average silhouette should not be interpreted for itself.
A detailed analysis of the clustering can be obtained using the silhouette plot shows: the silhouettes of all the instances.
Code
pam_object <-pam(data_2022_standardized, k =4)# Compute silhouette widthssil_widths <-silhouette(pam_object)# Plot the silhouette informationplot(sil_widths, col =1:4, border =NA)
Cluster 3 is the most well-defined cluster with the highest average silhouette score, suggesting that its members are quite similar to each other and distinct from members of other clusters.
Gap Statistic Method
The approach can be applied to any clustering method (i.e. K-means clustering, hierarchical clustering). The gap statistic compares the total intra-cluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
Code
# compute gap statisticset.seed(123)gap_stat <-clusGap(data_2022_standardized, FUN = kmeans, nstart =25,K.max =10, B =50)fviz_gap_stat(gap_stat)
From the visualization of the clusGap() result above, we can see that the optimal number of clusters is 4.
If there are more than two dimensions, fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance (refer to our PCA in the EDA).
We can observe that the teal cluster from k=2 splits into more detailed sub-clusters as k increases, suggesting internal complexity within that group.
Also, in terms of variance explained, Dim1 explains 62.1% of the variance and Dim2 explains 13.4%, totaling 75.5%. This means the plots capture a significant portion of the data’s variance, providing a meaningful visual representation of the clustering.
5.2.2.3 Extracting results
Because of our previous conclusions extracted by the silhouette plot above, we choose to focus on the tri-clustered data.
We can now extract the clusters and add them to our initial data in order to perform some descriptive statistics at the cluster level:
Code
# Compute k-means clustering with k = 3set.seed(123)final <-kmeans(data_2022_standardized, 3, nstart =25)
Visualize the results:
Code
fviz_cluster(final, data = data_2022_standardized)
By plotting the names of the country within each cluster, we can now discuss and inspect for insights:
It is interesting to see that the first cluster in red (which contains countries such as Algeria, Colombia, Haiti, Kenya, Mexico, etc…)appears to include countries that might share similar socio-economic and governance challenges. Indeed, these countries might have higher rates of instability, corruption, and lower economic development.
Then, it looks like the second cluster in green groups countries that might have moderate to high levels of economic development and governance; the countries in this cluster (Spain, Dominican Republic, Greece, Costa Rica, Hungary, Israel, Costa Rica etc…) likely have a mix of developed and developing economies with varying degrees of political and economic stability.
The third and notable cluster in blue contains highly developed countries with strong economic performance and governance. These countries likely have high levels of human development, political stability, and low corruption. Indeed, the cluster contains countries such as Singapore, Denmark, Finland, Netherlands, Luxembourg, etc… which represent some of the most developed and well-governed nations.
Variable Distribution Across Clusters:
These three clusters can also be analyzed by the feature values within each cluster.
Code
# Assign cluster membership to the original datadata_2022_standardized$Cluster <-as.factor(final$cluster)library(ggplot2)library(tidyr)# Reshape data from wide to long formatdata_long <-pivot_longer(data_2022_standardized, cols =-Cluster, names_to ="Variable", values_to ="Value")# Create boxplots for each variable by clusterggplot(data_long, aes(x = Cluster, y = Value, fill = Cluster)) +geom_boxplot() +facet_wrap(~ Variable, scales ="free", nrow =5) +theme_minimal() +theme(axis.text.x =element_text(angle =0, vjust =1, hjust =1),strip.text =element_text(size =10)) # Adjust text size for facet labels if necessary
Civil Liberties, Political Rights and Political Stability: Clusters 2 (green) and especially Cluster 3 (blue) generally show higher values indicating better conditions or higher rankings in these metrics compared to Cluster 1 (red).
Control of Corruption and Government Effectiveness: Cluster 3 (blue) typically demonstrates the highest values, suggesting it might consist of countries with better governance metrics.
HDI Value: This shows a significant range in Cluster 1 (red) with relatively lower median values, indicating perhaps a wider disparity in development levels within this cluster. Cluster 3 (blue) contains again favorable rates by having the highest values of HDI.
Homicides per 100,000 people: Here, there is a noticeable difference, with Cluster 1 showing higher values (more homicides), which might suggest higher violence or security issues. Cluster 2 (green) and especially Cluster 3 (blue) generally show lower values, indicating better conditions, except some high homicide outliers in Cluster 2.
In the same way, we see in the Unemployment rate variable a larger disparity of values in Cluster 1 (red) and higher rates than in the two other clusters, indicating worse employment situations in the countries contained in this cluster.
GDP per capita: Cluster 3 (blue) shows notably higher values, implying that, as already mentioned, this cluster includes more economically prosperous and highly developed countries.
5.2.3 Hierarchical Clustering
One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters.
Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of clusters. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.
We use the average linkage method, trying to find a balance between the most and least similar pairs within each cluster. This method can sometimes prevent extreme values from overly influencing the groupings.
Code
hclust_avg <-hclust(distance, method ='average') #specify the linkage method via the method argumentplot(hclust_avg, main ="Hierarchical Clustering Dendrogram for 2022",xlab ="Clusters",ylab ="Height",sub ="Method: Average Linkage",cex =0.6, # Label size for the dendrogramcex.main =1.2, # Title sizecex.lab =1.0, # Axis label sizecex.sub =0.8) # Subtitle sizerect.hclust(hclust_avg , k =3, border =2:6)
Countries are now divided into distinct groups, based on various similarities.
The height at which the branches meet in the dendrogram reflects the distance or dissimilarity between the groups. For example, Haiti and Jordan from the first cluster (red), which form a group at a lower height, are more similar to each other than, say, the countries in the group containing Singapore and Israel, which join at a much higher height.
We see in this type of clustering that the most developed countries (Finland, Austria, Denmark, etc.), previously grouped together by k-mean clustering, are now grouped with other rather varied countries.
A mix between countries previously clustered together differently has taken place, making comparison with k-mean clustering difficult. This is quite normal, because hierarchical clustering might capture more nuanced and complex patterns of similarity between countries, reflecting not just the major sources of variance but also subtler similarities. On the other hand, K-means on PCA tends to group countries based on the most significant patterns of variation and might therefore align more closely with dominant economic, social, or political factors that stand out in the data reduction process.
We applied the same steps over several years of our data set but found no significant differences between years. Therefore, we decided to not include it in this section of the project. You can however check our analysis in the appendix.
6 Conclusion
6.1 Results
Our main results are listed here. The aim of this subsection is to present in a structured way the results obtained by the machine learning models applied in this project.
Unemployment Rate:
The analysis revealed that unemployment rate do not significantly influence homicide rate. In fact, unemployment rate has a weak correlation with every socio-demographic variables. Thus, including unemployment factors in machine learning models is irrelevant for the purpose of our project aiming at predicting homicide rates.
Linear Model Performance:
A simple linear model was inadequate for accurately predicting the impacts of socio-economic and demographic variables on homicide rate. Even with the assistance of the Akaike Information Criterion (AIC) for model selection, the linear model’s predictions were insufficiently reliable, particularly for higher homicide rates, indicating the model’s inability to capture the complexity of the data.
Gradient Boosting Machines:
Implementing a gradient boosting machines (GBM) model, with optimized hyper-parameters through cross-validation, improved the prediction quality compared to the linear model. However, while GBM enhanced the accuracy, it was still outperformed by another model in our study.
Random Forest Model:
The random forest model, also optimized through cross-validation, exhibited the best predictive performance among all the models tested. It provided the most reliable metrics for the purpose of our project, making it the most effective model for predicting homicide rates.
Variable Importance:
The random forest model’s variable importance scores identified the Human Development Index (HDI) as the most impactful variable on homicide rates, followed by GDP per capita. The scores also confirmed the irrelevance of unemployment rates in predicting homicides. Civil liberties and government effectiveness rates alson received low importance scores, indicating a lesser impact on homicide rates.
K-means Clustering:
While focusing on the year 2022, K-means clustering with three clusters produced meaningful groupings of countries based on their socio-economic and demographic similarities. Indeed, the three clusters accurately grouped the different countries contained in our data set by their similarity; they displayed a group of countries deemed highly developed socially and economically, another one of a mix of developed and developing economies with varying degrees of political and economic stability, and a last one of countries deemed poor and with social difficulties, where homicide rates are therefore relatively high.
Hierarchical Clustering:
Hierarchical clustering grouped most developed countries (e.g., Finland, Austria, Denmark) together, but with a more varied mix of other countries compared to K-means clustering. This method captured more nuanced and complex patterns of similarity between countries, reflecting subtler relationships beyond the primary variance sources.
Clustering Over Multiple Years:
Applying the same clustering methods to other years yielded consistent insights, reinforcing the robustness of the findings across different time periods.
6.2 Recommendations and Discussion
The study successfully identified key socio-economic and demographic variables influencing homicide rates, with HDI and GDP per capita being the most significant predictors, while unemployment rates were found to be irrelevant. The random forest model emerged as the most effective predictive tool, and clustering analyses provided clear groupings of countries based on development levels and homicide rates.
Policymakers should prioritize improving human development and economic conditions, as these factors have the most substantial impact on reducing homicide rates. Resources should be allocated to enhance HDI and GDP per capita, while less emphasis can be placed on unemployment rates concerning homicide reduction strategies.
The study’s reliance on available data sets means the results are constrained by the quality and completeness of the data. Additionally, the models may not fully capture all potential factors influencing homicide rates, such as cultural differences, law enforcement effectiveness, and historical contexts aspects that were not included. Future research should explore integrating additional variables, such as cultural factors or historical data, to provide a more comprehensive analysis. Another limitation is the reliance on publicly available data, which may have varying degrees of accuracy and completeness.
Applying advanced machine learning techniques like deep learning could enhance predictive performance. Continuous updating and validation of the models with new data will also ensure their relevance and accuracy over time. Also, experimenting with hybrid models that combine the strengths of different algorithms and expanding the scope of the study to include regional analyses within countries could further improve prediction accuracy and might also uncover more granular insights. Moreover, incorporation of qualitative data, such as expert opinions and case studies, could also enhance the predictive models.
In conclusion, by leveraging advanced machine learning techniques and comprehensive socio-economic data, projects of this genre could provide actionable insights that can guide policy-making and international efforts to reduce homicide rates and promote social stability. The findings emphasize the importance of human development and economic growth as critical factors in ensuring public safety.
7 Appendix
7.1 Supervised ML
7.1.1 Variable Importance for GBM
Here, we visualize the most important predictors contributing to the GBM model.
Code
ggplot(importance, aes(x =reorder(var, rel.inf), y = rel.inf)) +geom_bar(stat ="identity", fill ="aliceblue", color ="black") +coord_flip() +labs(x ="Variables", y ="Relative Importance", title ="Feature Importance Plot") +theme_minimal() +theme(axis.text.y =element_text(size =14), # Change the size of variable namesaxis.text.x =element_text(size =16), # Change the size of x-axis text (if needed)plot.title =element_text(size =20) # Adjust the plot title size )
According to this plot, the most important feature is the HDI value, followed by the GDP per capita (USD) and the Control of Corruption Estimate. This model seems to capture approximately the same most important features than in the random forest one.
7.2 Unsupervised ML
7.2.1 Comparing Clusters Across Years
We compare our data over years 2013, 2015, 2017 and 2019 to 2022.
7.2.1.1 Comparing PCAs
Code
# Create PCA plotsplot_pca_2021 <-fviz_pca_var(pca_data_2021)plot_pca_2020 <-fviz_pca_var(pca_data_2020)plot_pca_2019 <-fviz_pca_var(pca_data_2019)plot_pca_2017 <-fviz_pca_var(pca_data_2017)plot_pca_2015 <-fviz_pca_var(pca_data_2015)plot_pca_2013 <-fviz_pca_var(pca_data_2013)# Use grid.arrange to combine the plotsgrid.arrange(plot_pca_2021, plot_pca_2020,plot_pca_2019, plot_pca_2017, plot_pca_2015, plot_pca_2013, nrow =3)