Arka Dasgupta, Rahul RajanBabu & Steven Santhosh Simon
2023-05-28
The objective of this analysis is to investigate intentional homicide rates in different regions and identify any significant differences between them. Specifically, we compare homicide rates between two regions, the Americas and Africa.
We use data exploration, outlier detection, normality assessment, and hypothesis testing to answer this question. The aim is to describe how homicide rates vary between regions and to contribute to a better understanding of these differences.
The Intentional Homicide Rate dataset provides information on the intentional homicide rate in countries around the world.
The dataset contains information on more than 150 countries and territories, including both developed and developing nations. It provides a comprehensive overview of the variation in homicide rates across different regions and countries around the world.
The data is openly available and was collected from Kaggle. Source: https://www.kaggle.com/datasets/bilalwaseer/countries-by-intentional-homicide-rate
library(readr) # Reading Rectangular text data
library(magrittr) # Forward Pipe Operator
library(dplyr) # Used for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "Andorra"
## [5] "Angola" "Anguilla"
## [7] "Antigua and Barbuda" "Argentina"
## [9] "Armenia" "Aruba"
## [11] "Australia" "Austria"
## [13] "Azerbaijan" "Bahamas"
## [15] "Bahrain" "Bangladesh"
## [17] "Barbados" "Belarus"
## [19] "Belgium" "Belize"
## [21] "Benin" "Bermuda"
## [23] "Bhutan" "Bolivia"
## [25] "Bosnia and Herzegovina" "Botswana"
## [27] "Brazil" "British Virgin Islands"
## [29] "Brunei" "Bulgaria"
## [31] "Burkina Faso" "Burundi"
## [33] "Cape Verde" "Cambodia"
## [35] "Cameroon" "Canada"
## [37] "Cayman Islands" "Central African Republic"
## [39] "Channel Islands" "Chile"
## [41] "China" "Colombia"
## [43] "Costa Rica" "Croatia"
## [45] "Cuba" "Curaçao"
## [47] "Cyprus" "Czech Republic"
## [49] "Denmark" "Dominica"
## [51] "Dominican Republic" "Ecuador"
## [53] "Egypt" "El Salvador"
## [55] "England and Wales" "Estonia"
## [57] "Eswatini" "Ethiopia"
## [59] "Finland" "France"
## [61] "French Guiana" "Georgia"
## [63] "Germany" "Ghana"
## [65] "Gibraltar" "Greece"
## [67] "Greenland" "Grenada"
## [69] "Guadeloupe" "Guatemala"
## [71] "Guinea-Bissau" "Guyana"
## [73] "Haiti" "Holy See"
## [75] "Honduras" "Hong Kong"
## [77] "Hungary" "Iceland"
## [79] "India" "Indonesia"
## [81] "Iran" "Iraq"
## [83] "Iraq (excluding Kurdistan)" "Ireland"
## [85] "Isle of Man" "Israel"
## [87] "Italy" "Jamaica"
## [89] "Japan" "Jordan"
## [91] "Kazakhstan" "Kenya"
## [93] "Kosovo" "Kurdistan Region (Iraq)"
## [95] "Kuwait" "Kyrgyzstan"
## [97] "Latvia" "Lebanon"
## [99] "Lesotho" "Liberia"
## [101] "Liechtenstein" "Lithuania"
## [103] "Luxembourg" "Macau"
## [105] "Malawi" "Malaysia"
## [107] "Maldives" "Malta"
## [109] "Martinique" "Mauritius"
## [111] "Mayotte" "Mexico"
## [113] "Monaco" "Mongolia"
## [115] "Montenegro" "Montserrat"
## [117] "Morocco" "Mozambique"
## [119] "Myanmar" "Namibia"
## [121] "Nepal" "Netherlands"
## [123] "New Zealand" "Nicaragua"
## [125] "Niger" "Nigeria"
## [127] "Northern Ireland" "Norway"
## [129] "Oman" "Pakistan"
## [131] "Panama" "Paraguay"
## [133] "Peru" "Philippines"
## [135] "Poland" "Portugal"
## [137] "Puerto Rico" "Qatar"
## [139] "South Korea" "Moldova"
## [141] "North Macedonia" "Réunion"
## [143] "Romania" "Russia"
## [145] "Rwanda" "Saint Helena"
## [147] "Saint Kitts and Nevis" "Saint Lucia"
## [149] "Saint Martin (French part)" "Saint Pierre and Miquelon"
## [151] "Saint Vincent and the Grenadines" "San Marino"
## [153] "São Tomé and Príncipe" "Saudi Arabia"
## [155] "Scotland" "Senegal"
## [157] "Serbia" "Seychelles"
## [159] "Sierra Leone" "Singapore"
## [161] "Slovakia" "Slovenia"
## [163] "South Africa" "South Sudan"
## [165] "Spain" "Sri Lanka"
## [167] "Palestine" "Sudan"
## [169] "Suriname" "Sweden"
## [171] "Switzerland" "Syria"
## [173] "Taiwan" "Tajikistan"
## [175] "Thailand" "East Timor"
## [177] "Trinidad and Tobago" "Tunisia"
## [179] "Turkey" "Turkmenistan"
## [181] "Turks and Caicos Islands" "Uganda"
## [183] "Ukraine" "United Arab Emirates"
## [185] "United Kingdom" "Tanzania"
## [187] "United States" "U.S. Virgin Islands"
## [189] "Uruguay" "Uzbekistan"
## [191] "Venezuela" "Vietnam"
## [193] "Yemen" "Zambia"
## [195] "Zimbabwe"
## [1] 6.7 2.1 1.3 2.6 4.8 28.3 9.2 5.3 1.8 1.9 0.9 0.7 2.3 18.6 0.1
## [16] 2.4 14.3 1.7 25.7 1.1 0.0 2.5 7.0 15.2 22.5 8.3 0.5 1.0 6.1 6.5
## [31] 1.4 2.0 8.2 20.1 22.6 11.2 5.0 19.0 1.2 20.8 8.9 7.8 37.2 3.2 11.6
## [46] 8.8 1.6 13.2 0.8 3.0 12.4 5.8 17.5 20.0 36.3 0.3 1.5 0.4 2.2 10.1
## [61] 44.7 4.0 43.6 3.3 3.7 0.2 0.6 2.8 5.9 28.4 6.0 2.9 20.3 3.5 11.9
## [76] 7.9 4.4 22.0 3.8 11.1 7.7 18.5 7.3 18.8 27.7 15.8 17.2 10.2 33.5 14.9
## [91] 5.1 9.4 4.1 38.6 4.2 5.7 9.7 6.2 49.3 36.7 6.8 5.4 7.5
##    Location            Region           Subregion              Rate
##  Length:195         Length:195         Length:195         Min.   : 0.000
##  Class :character   Class :character   Class :character   1st Qu.: 1.100
##  Mode  :character   Mode  :character   Mode  :character   Median : 2.600
##                                                           Mean   : 6.845
##                                                           3rd Qu.: 7.850
##                                                           Max.   :49.300
##      Count            Year
##  Min.   :    0   Min.   :2006
##  1st Qu.:   28   1st Qu.:2016
##  Median :  128   Median :2019
##  Mean   : 1943   Mean   :2017
##  3rd Qu.:  785   3rd Qu.:2020
##  Max.   :47722   Max.   :2021
Looking for missing values
# Check for missing values
has_missing_values <- apply(data, 2, function(x) any(is.na(x)))
# Print the result
print(has_missing_values)
## Location Region Subregion Rate Count Year
## FALSE FALSE FALSE FALSE FALSE FALSE
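The same check can also be written with `colSums(is.na(...))`, which reports how many values are missing in each column rather than only whether any are. The sketch below runs on a small made-up data frame (`toy` and its values are assumptions for illustration, not the homicide data).

```r
# Hypothetical toy data frame (not the homicide dataset) with two missing values
toy <- data.frame(
  Location = c("A", "B", "C"),
  Rate     = c(1.2, NA, 3.4),
  Count    = c(10, 20, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# TRUE for columns that contain at least one missing value
has_missing <- missing_per_column > 0
print(has_missing)
```

Applied to the real data frame, a result of all zeros would correspond to the all-`FALSE` output shown above.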
Checking the summary of rate and count
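# Descriptive statistics for the 'Rate' variable
rate_stats <- summary(data$Rate)
print(rate_stats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.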
## 0.000 1.100 2.600 6.845 7.850 49.300
# Descriptive statistics for the 'Count' variable
count_stats <- summary(data$Count)
print(count_stats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 28 128 1943 785 47722
# Sort the data frame by Rate in descending order
sorted_data1 <- data[order(data$Rate, decreasing = TRUE), ]
# Select the top 10 countries
top_10_countries_rank <- sorted_data1[1:10, ]
# Add a rank column to the data frame
top_10_countries_rank$Rank <- 1:10
# Print the top 10 countries with their ranks
print(top_10_countries_rank)
## Location Region Subregion Rate Count Year Rank
## 188 U.S. Virgin Islands Americas Caribbean 49.3 52 2012 1
## 88 Jamaica Americas Caribbean 44.7 1323 2020 2
## 99 Lesotho Africa Southern Africa 43.6 897 2015 3
## 177 Trinidad and Tobago Americas Caribbean 38.6 538 2019 4
## 54 El Salvador Americas Central America 37.2 2398 2019 5
## 191 Venezuela Americas South America 36.7 10598 2018 6
## 75 Honduras Americas Central America 36.3 3598 2020 7
## 163 South Africa Africa Southern Africa 33.5 19846 2020 8
## 112 Mexico Americas Central America 28.4 36579 2020 9
## 119 Myanmar Asia South-Eastern Asia 28.4 15299 2021 10
# Sort the data frame by count in descending order
sorted_data2 <- data[order(data$Count, decreasing = TRUE), ]
# Select the top 10 countries
top_10_countries_count <- sorted_data2[1:10, ]
# Add a rank column to the data frame
top_10_countries_count$Rank <- 1:10
# Print the top 10 countries with their ranks
print(top_10_countries_count)
## Location Region Subregion Rate Count Year Rank
## 27 Brazil Americas South America 22.5 47722 2020 1
## 126 Nigeria Africa Western Africa 22.0 44200 2019 2
## 79 India Asia Southern Asia 3.0 40651 2020 3
## 112 Mexico Americas Central America 28.4 36579 2020 4
## 187 United States Americas Northern America 6.5 21570 2020 5
## 163 South Africa Africa Southern Africa 33.5 19846 2020 6
## 119 Myanmar Asia South-Eastern Asia 28.4 15299 2021 7
## 42 Colombia Americas South America 22.6 11520 2020 8
## 144 Russia Europe Eastern Europe 7.3 10697 2020 9
## 191 Venezuela Americas South America 36.7 10598 2018 10
# Sort the data frame by Rate in ascending order
sorted_data3 <- data[order(data$Rate), ]
# Select the bottom 10 countries
bottom_10_countries_rate <- sorted_data3[1:10, ]
# Add a rank column to the data frame
bottom_10_countries_rate$Rank <- 1:10
# Print the bottom 10 countries with their ranks
print(bottom_10_countries_rate)
## Location Region Subregion Rate Count Year Rank
## 22 Bermuda Americas Northern America 0.0 0 2019 1
## 39 Channel Islands Europe Northern Europe 0.0 0 2010 2
## 74 Holy See Europe Southern Europe 0.0 0 2015 3
## 85 Isle of Man Europe Northern Europe 0.0 0 2016 4
## 113 Monaco Europe Western Europe 0.0 0 2015 5
## 146 Saint Helena Africa Western Africa 0.0 0 2009 6
## 152 San Marino Europe Southern Europe 0.0 0 2011 7
## 15 Bahrain Asia Western Asia 0.1 2 2019 8
## 103 Luxembourg Europe Western Europe 0.2 1 2020 9
## 160 Singapore Asia South-Eastern Asia 0.2 10 2020 10
Checking the distribution of the data and identifying any patterns that can be seen.
library(ggplot2)
ggplot(data, aes(x = Rate)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black") +
labs(x = "Homicide Rate", y = "Frequency", title = "Distribution of Homicide Rates")
We now calculate z-scores to identify outliers, defined here as values whose absolute z-score exceeds a predefined threshold of 3 (|z-score| > 3), and to understand the overall variability of the data.
rate_zscores <- scale(data$Rate)
count_zscores <- scale(data$Count)
# Find outliers based on z-score threshold (3, in our case)
rate_outliers <- data$Rate[abs(rate_zscores) > 3]
count_outliers <- data$Count[abs(count_zscores) > 3]
# Print the identified outliers
print(rate_outliers)
## [1] 37.2 36.3 44.7 43.6 38.6 49.3 36.7
print(count_outliers)
## [1] 47722 40651 36579 44200
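To make the threshold concrete: `scale()` centers each value on the sample mean and divides by the sample standard deviation, so an absolute z-score above 3 marks a value more than three standard deviations from the mean. The sketch below reproduces that calculation by hand on a made-up vector (`x` is an assumption for illustration, not the dataset).

```r
# Hypothetical vector: twenty typical values and one extreme value
x <- c(rep(1:5, times = 4), 60)

# z-score by hand: (value - sample mean) / sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops it to a plain vector
z_scale <- as.numeric(scale(x))

# Both approaches should agree
print(all.equal(z_manual, z_scale))

# Values flagged as outliers at the |z| > 3 threshold
outliers <- x[abs(z_manual) > 3]
print(outliers)
```

Because the extreme value itself inflates the mean and standard deviation, this rule can miss outliers in very small samples; that is one reason to read it alongside the histogram.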
The histogram helps in understanding the distribution of the data and identifying any patterns or outliers present.
The purpose of checking skewness and making the Q-Q plot is to help evaluate whether the transformed variable follows a normal distribution.
The purpose of a quantile transformation is to transform the distribution of a variable into a standard normal distribution.
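A minimal sketch of both ideas on simulated data, since the skewness calculation itself is not shown here. The helper `skewness_moment()` is a hypothetical function written for this illustration (the third central moment divided by the second central moment raised to the power 1.5), and the right-skewed sample `x` is simulated rather than taken from the dataset.

```r
set.seed(1)

# Simulated right-skewed sample (assumption: stands in for a skewed rate variable)
x <- rexp(200, rate = 0.2)

# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2)
skewness_moment <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Rank-based quantile (inverse normal) transformation
r <- rank(x)
x_transformed <- qnorm((r - 0.5) / length(r))

# Skewness is clearly positive before and close to zero after the transformation
skew_before <- skewness_moment(x)
skew_after  <- skewness_moment(x_transformed)
print(c(before = skew_before, after = skew_after))
```

The transformation uses only the ranks, so it keeps the ordering of the observations but discards the original spacing between them.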
# Performing quantile transformation on the variable
ranked_data <- rank(data2$Rate)
quantiles <- qnorm((ranked_data - 0.5) / length(ranked_data))
transformed_data <- quantiles
# Generate a Q-Q plot for the transformed data
qqnorm(transformed_data)
qqline(transformed_data)
Null Hypothesis (H0): There is no difference in mean homicide rates between the Americas and Africa.
Alternative Hypothesis (H1): There is a difference in mean homicide rates between the Americas and Africa.
# Subsetting the data for Americas and Africa
region_a <- data2$Rate[data2$Region == "Americas" & data2$Year == 2020]
region_b <- data2$Rate[data2$Region == "Africa" & data2$Year == 2020]
# Checking the number of observations in each region
length(region_a)
## [1] 25
length(region_b)
## [1] 9
# Perform a two-sample t-test
result <- t.test(region_a, region_b)
# Print the test result
print(result)
##
## Welch Two Sample t-test
##
## data: region_a and region_b
## t = 1.6302, df = 11.512, p-value = 0.1301
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.063202 14.098757
## sample estimates:
## mean of x mean of y
## 14.440000 8.422222
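The fractional degrees of freedom come from the Welch-Satterthwaite approximation, which `t.test()` uses by default because it does not assume equal variances. The sketch below recomputes the statistic, degrees of freedom, p-value, and confidence interval by hand and compares them with `t.test()`. The two groups are simulated (the names `group_a` and `group_b`, their means, and their spreads are assumptions, not the regional data).

```r
set.seed(42)

# Simulated groups with unequal sizes and spreads (assumption: not the real regional data)
group_a <- rnorm(25, mean = 14, sd = 12)
group_b <- rnorm(9,  mean = 8,  sd = 9)

# Squared standard error contribution of each group
se2_a <- var(group_a) / length(group_a)
se2_b <- var(group_b) / length(group_b)

# Welch t statistic: difference in means divided by its standard error
t_manual <- (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)

# Welch-Satterthwaite degrees of freedom
df_manual <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(group_a) - 1) + se2_b^2 / (length(group_b) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_manual  <- 2 * pt(-abs(t_manual), df = df_manual)
ci_manual <- (mean(group_a) - mean(group_b)) +
  c(-1, 1) * qt(0.975, df = df_manual) * sqrt(se2_a + se2_b)

# Compare with t.test(), which performs the Welch test by default
res <- t.test(group_a, group_b)
print(c(t = t_manual, df = df_manual, p = p_manual))
print(res)
```

With group sizes of 25 and 9, the smaller group dominates the standard error, which is why the degrees of freedom fall well below the pooled value of 32.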
We set out to understand how intentional homicide rates and counts differ around the world. There are a few limitations: for example, the dataset records each country in a different year (2006 to 2021), which limits comparability across countries. We do not expect this to change the general picture, and the hypothesis test was restricted to records from 2020.
We found the top 10 countries in terms of homicide count and the top and bottom 10 in terms of homicide rate. This gave us a general idea of the data and the regions.
The research question: Is there a significant difference in homicide rates between two regions, the Americas and Africa?
- The t-value was 1.6302. It measures the observed difference in means relative to its standard error, not the size of the difference itself.
- The degrees of freedom associated with the t-distribution were calculated as 11.512.
- The p-value obtained from the test was 0.1301. Since this value is greater than the conventional significance level of 0.05, there is insufficient evidence to reject the null hypothesis.
- The null hypothesis assumes that there is no difference between the means of the two groups.
- The alternative hypothesis states that the true difference in means is not equal to zero.
- The 95% confidence interval for the difference in means ranged from -2.06 to 14.10. This interval provides an estimate of plausible values for the true difference in means, and it includes zero.
- The sample mean of region_a (the Americas) was 14.44, while the sample mean of region_b (Africa) was 8.42.
In short, the test does not provide sufficient evidence of a difference in mean homicide rates between the Americas and Africa at the 0.05 level. However, the confidence interval is wide and the samples are small (25 and 9 observations), so a real difference cannot be ruled out. Further investigation with more data might therefore be needed.