Applied Analytics Group Project 2

Arka Dasgupta, Rahul RajanBabu & Steven Santhosh Simon

2023-05-28

Introduction

Problem Statement

The objective of this analysis is to investigate intentional homicide rates in different regions and identify any significant differences between them. Specifically, we compare homicide rates between two regions, the Americas and Africa.

Is there a significant difference in homicide rates between two regions? Which countries have the highest and lowest homicide rates and counts?

We use a variety of statistical methods, including data exploration, outlier detection, normality assessment, and hypothesis testing, to answer these questions. Through this study of the dataset we aim to describe how homicide rates vary between regions.

Information about the data

The Intentional Homicide Rate dataset provides information on the intentional homicide rate in countries around the world.
The dataset covers more than 150 countries and territories, including both developed and developing nations. It gives an overview of the variation in homicide rates across different regions and countries.
The data is open and was obtained from Kaggle. Source: https://www.kaggle.com/datasets/bilalwaseer/countries-by-intentional-homicide-rate

Variables

Location: A categorical variable representing the country or territory. (character)
Region: A categorical variable representing the continent or larger region to which the location belongs. (character)
Subregion: A categorical variable representing the subregion within the region to which the location belongs. (character)
Rate: A numerical variable representing the intentional homicide rate per 100,000 population. (numeric)
Count: A numerical variable representing the total number of intentional homicides. (integer)
Year: The year in which the data was recorded. (integer)
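As a small worked example of how Rate and Count relate, the sketch below uses the values reported for Brazil in this dataset (Rate 22.5, Count 47722) to back out the population implied by the "per 100,000" definition. It assumes the Rate column was computed from the Count and the population of the same year, which the dataset description does not state explicitly.

```r
# Values for Brazil as they appear in the dataset (Rate 22.5, Count 47722).
rate  <- 22.5    # homicides per 100,000 population
count <- 47722   # total homicides

# Rate = Count / Population * 100000, so Population = Count / Rate * 100000.
implied_population <- count / rate * 100000

# Implied population in millions, rounded to one decimal place.
print(round(implied_population / 1e6, 1))
```

This is why a country can rank high on Count but low on Rate (or the reverse): the two measures differ by the population size.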

Loading the packages needed for the analysis

library(readr) # Reading Rectangular text data
library(magrittr) # Forward Pipe Operator
library(dplyr) # Used for data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(e1071)

Importing the dataset, listing the unique locations and rates, and summarising each column.

data <- read.csv("/Users/rahulrb/Desktop/RMIT/Homicide_rates.csv")
unique(data$Location)

##   [1] "Afghanistan"                      "Albania"                         
##   [3] "Algeria"                          "Andorra"                         
##   [5] "Angola"                           "Anguilla"                        
##   [7] "Antigua and Barbuda"              "Argentina"                       
##   [9] "Armenia"                          "Aruba"                           
##  [11] "Australia"                        "Austria"                         
##  [13] "Azerbaijan"                       "Bahamas"                         
##  [15] "Bahrain"                          "Bangladesh"                      
##  [17] "Barbados"                         "Belarus"                         
##  [19] "Belgium"                          "Belize"                          
##  [21] "Benin"                            "Bermuda"                         
##  [23] "Bhutan"                           "Bolivia"                         
##  [25] "Bosnia and Herzegovina"           "Botswana"                        
##  [27] "Brazil"                           "British Virgin Islands"          
##  [29] "Brunei"                           "Bulgaria"                        
##  [31] "Burkina Faso"                     "Burundi"                         
##  [33] "Cape Verde"                       "Cambodia"                        
##  [35] "Cameroon"                         "Canada"                          
##  [37] "Cayman Islands"                   "Central African Republic"        
##  [39] "Channel Islands"                  "Chile"                           
##  [41] "China"                            "Colombia"                        
##  [43] "Costa Rica"                       "Croatia"                         
##  [45] "Cuba"                             "Curaçao"                         
##  [47] "Cyprus"                           "Czech Republic"                  
##  [49] "Denmark"                          "Dominica"                        
##  [51] "Dominican Republic"               "Ecuador"                         
##  [53] "Egypt"                            "El Salvador"                     
##  [55] "England and Wales"                "Estonia"                         
##  [57] "Eswatini"                         "Ethiopia"                        
##  [59] "Finland"                          "France"                          
##  [61] "French Guiana"                    "Georgia"                         
##  [63] "Germany"                          "Ghana"                           
##  [65] "Gibraltar"                        "Greece"                          
##  [67] "Greenland"                        "Grenada"                         
##  [69] "Guadeloupe"                       "Guatemala"                       
##  [71] "Guinea-Bissau"                    "Guyana"                          
##  [73] "Haiti"                            "Holy See"                        
##  [75] "Honduras"                         "Hong Kong"                       
##  [77] "Hungary"                          "Iceland"                         
##  [79] "India"                            "Indonesia"                       
##  [81] "Iran"                             "Iraq"                            
##  [83] "Iraq (excluding Kurdistan)"       "Ireland"                         
##  [85] "Isle of Man"                      "Israel"                          
##  [87] "Italy"                            "Jamaica"                         
##  [89] "Japan"                            "Jordan"                          
##  [91] "Kazakhstan"                       "Kenya"                           
##  [93] "Kosovo"                           "Kurdistan Region (Iraq)"         
##  [95] "Kuwait"                           "Kyrgyzstan"                      
##  [97] "Latvia"                           "Lebanon"                         
##  [99] "Lesotho"                          "Liberia"                         
## [101] "Liechtenstein"                    "Lithuania"                       
## [103] "Luxembourg"                       "Macau"                           
## [105] "Malawi"                           "Malaysia"                        
## [107] "Maldives"                         "Malta"                           
## [109] "Martinique"                       "Mauritius"                       
## [111] "Mayotte"                          "Mexico"                          
## [113] "Monaco"                           "Mongolia"                        
## [115] "Montenegro"                       "Montserrat"                      
## [117] "Morocco"                          "Mozambique"                      
## [119] "Myanmar"                          "Namibia"                         
## [121] "Nepal"                            "Netherlands"                     
## [123] "New Zealand"                      "Nicaragua"                       
## [125] "Niger"                            "Nigeria"                         
## [127] "Northern Ireland"                 "Norway"                          
## [129] "Oman"                             "Pakistan"                        
## [131] "Panama"                           "Paraguay"                        
## [133] "Peru"                             "Philippines"                     
## [135] "Poland"                           "Portugal"                        
## [137] "Puerto Rico"                      "Qatar"                           
## [139] "South Korea"                      "Moldova"                         
## [141] "North Macedonia"                  "Réunion"                         
## [143] "Romania"                          "Russia"                          
## [145] "Rwanda"                           "Saint Helena"                    
## [147] "Saint Kitts and Nevis"            "Saint Lucia"                     
## [149] "Saint Martin (French part)"       "Saint Pierre and Miquelon"       
## [151] "Saint Vincent and the Grenadines" "San Marino"                      
## [153] "São Tomé and Príncipe"            "Saudi Arabia"                    
## [155] "Scotland"                         "Senegal"                         
## [157] "Serbia"                           "Seychelles"                      
## [159] "Sierra Leone"                     "Singapore"                       
## [161] "Slovakia"                         "Slovenia"                        
## [163] "South Africa"                     "South Sudan"                     
## [165] "Spain"                            "Sri Lanka"                       
## [167] "Palestine"                        "Sudan"                           
## [169] "Suriname"                         "Sweden"                          
## [171] "Switzerland"                      "Syria"                           
## [173] "Taiwan"                           "Tajikistan"                      
## [175] "Thailand"                         "East Timor"                      
## [177] "Trinidad and Tobago"              "Tunisia"                         
## [179] "Turkey"                           "Turkmenistan"                    
## [181] "Turks and Caicos Islands"         "Uganda"                          
## [183] "Ukraine"                          "United Arab Emirates"            
## [185] "United Kingdom"                   "Tanzania"                        
## [187] "United States"                    "U.S. Virgin Islands"             
## [189] "Uruguay"                          "Uzbekistan"                      
## [191] "Venezuela"                        "Vietnam"                         
## [193] "Yemen"                            "Zambia"                          
## [195] "Zimbabwe"

unique(data$Rate)

##   [1]  6.7  2.1  1.3  2.6  4.8 28.3  9.2  5.3  1.8  1.9  0.9  0.7  2.3 18.6  0.1
##  [16]  2.4 14.3  1.7 25.7  1.1  0.0  2.5  7.0 15.2 22.5  8.3  0.5  1.0  6.1  6.5
##  [31]  1.4  2.0  8.2 20.1 22.6 11.2  5.0 19.0  1.2 20.8  8.9  7.8 37.2  3.2 11.6
##  [46]  8.8  1.6 13.2  0.8  3.0 12.4  5.8 17.5 20.0 36.3  0.3  1.5  0.4  2.2 10.1
##  [61] 44.7  4.0 43.6  3.3  3.7  0.2  0.6  2.8  5.9 28.4  6.0  2.9 20.3  3.5 11.9
##  [76]  7.9  4.4 22.0  3.8 11.1  7.7 18.5  7.3 18.8 27.7 15.8 17.2 10.2 33.5 14.9
##  [91]  5.1  9.4  4.1 38.6  4.2  5.7  9.7  6.2 49.3 36.7  6.8  5.4  7.5

summary(data)

##    Location            Region           Subregion              Rate       
##  Length:195         Length:195         Length:195         Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 1.100  
##  Mode  :character   Mode  :character   Mode  :character   Median : 2.600  
##                                                           Mean   : 6.845  
##                                                           3rd Qu.: 7.850  
##                                                           Max.   :49.300  
##      Count            Year     
##  Min.   :    0   Min.   :2006  
##  1st Qu.:   28   1st Qu.:2016  
##  Median :  128   Median :2019  
##  Mean   : 1943   Mean   :2017  
##  3rd Qu.:  785   3rd Qu.:2020  
##  Max.   :47722   Max.   :2021

Checking for missing values.

We check each column for missing (NA) values.

# Check for missing values
has_missing_values <- apply(data, 2, function(x) any(is.na(x)))

# Print the result
print(has_missing_values)

##  Location    Region Subregion      Rate     Count      Year 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE
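An alternative way to run the same check is to count the missing values per column and list the affected rows. The sketch below is a minimal illustration on a made-up data frame (`toy` is hypothetical and is not the homicide dataset).

```r
# Hypothetical toy data frame (not the homicide dataset) with one missing Rate.
toy <- data.frame(
  Location = c("A", "B", "C"),
  Rate     = c(1.2, NA, 3.4),
  Count    = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that contain at least one missing value.
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```

Counting per column shows how much data is missing, not only whether any is missing.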

Descriptive statistics for the variables

Checking the summary of rate and count

# Descriptive statistics for the 'Rate' variable
rate_stats <- summary(data$Rate)
print(rate_stats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.100   2.600   6.845   7.850  49.300

# Descriptive statistics for the 'Count' variable
count_stats <- summary(data$Count)
print(count_stats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      28     128    1943     785   47722

The top 10 countries with the highest homicide rate

# Sort the data frame by Rate in descending order
sorted_data1 <- data[order(data$Rate, decreasing = TRUE), ]
# Select the top 10 countries
top_10_countries_rank <- sorted_data1[1:10, ]

# Add a rank column to the data frame
top_10_countries_rank$Rank <- 1:10

# Print the top 10 countries with their ranks
print(top_10_countries_rank)

##                Location   Region          Subregion Rate Count Year Rank
## 188 U.S. Virgin Islands Americas          Caribbean 49.3    52 2012    1
## 88              Jamaica Americas          Caribbean 44.7  1323 2020    2
## 99              Lesotho   Africa    Southern Africa 43.6   897 2015    3
## 177 Trinidad and Tobago Americas          Caribbean 38.6   538 2019    4
## 54          El Salvador Americas    Central America 37.2  2398 2019    5
## 191           Venezuela Americas      South America 36.7 10598 2018    6
## 75             Honduras Americas    Central America 36.3  3598 2020    7
## 163        South Africa   Africa    Southern Africa 33.5 19846 2020    8
## 112              Mexico Americas    Central America 28.4 36579 2020    9
## 119             Myanmar     Asia South-Eastern Asia 28.4 15299 2021   10

The top 10 countries with the highest homicide count

# Sort the data frame by count in descending order
sorted_data2 <- data[order(data$Count, decreasing = TRUE), ]

# Select the top 10 countries
top_10_countries_count <- sorted_data2[1:10, ]

# Add a rank column to the data frame
top_10_countries_count$Rank <- 1:10

# Print the top 10 countries with their ranks
print(top_10_countries_count)

##          Location   Region          Subregion Rate Count Year Rank
## 27         Brazil Americas      South America 22.5 47722 2020    1
## 126       Nigeria   Africa     Western Africa 22.0 44200 2019    2
## 79          India     Asia      Southern Asia  3.0 40651 2020    3
## 112        Mexico Americas    Central America 28.4 36579 2020    4
## 187 United States Americas   Northern America  6.5 21570 2020    5
## 163  South Africa   Africa    Southern Africa 33.5 19846 2020    6
## 119       Myanmar     Asia South-Eastern Asia 28.4 15299 2021    7
## 42       Colombia Americas      South America 22.6 11520 2020    8
## 144        Russia   Europe     Eastern Europe  7.3 10697 2020    9
## 191     Venezuela Americas      South America 36.7 10598 2018   10

The bottom 10 countries with the lowest homicide rate

# Sort the data frame by Rate in ascending order
sorted_data3 <- data[order(data$Rate), ]

# Select the bottom 10 countries
bottom_10_countries_rate <- sorted_data3[1:10, ]

# Add a rank column to the data frame
bottom_10_countries_rate$Rank <- 1:10

# Print the bottom 10 countries with their ranks
print(bottom_10_countries_rate)

##            Location   Region          Subregion Rate Count Year Rank
## 22          Bermuda Americas   Northern America  0.0     0 2019    1
## 39  Channel Islands   Europe    Northern Europe  0.0     0 2010    2
## 74         Holy See   Europe    Southern Europe  0.0     0 2015    3
## 85      Isle of Man   Europe    Northern Europe  0.0     0 2016    4
## 113          Monaco   Europe     Western Europe  0.0     0 2015    5
## 146    Saint Helena   Africa     Western Africa  0.0     0 2009    6
## 152      San Marino   Europe    Southern Europe  0.0     0 2011    7
## 15          Bahrain     Asia       Western Asia  0.1     2 2019    8
## 103      Luxembourg   Europe     Western Europe  0.2     1 2020    9
## 160       Singapore     Asia South-Eastern Asia  0.2    10 2020   10

The bottom 10 countries with the lowest homicide count

# NOTE: this sorts by Rate rather than Count, so the table below repeats
# the lowest-rate ranking; ordering by data$Count would rank by count instead
sorted_data4 <- data[order(data$Rate), ]

# Select the bottom 10 countries
bottom_10_countries_count <- sorted_data4[1:10, ]

# Add a rank column to the data frame
bottom_10_countries_count$Rank <- 1:10

# Print the bottom 10 countries with their ranks
print(bottom_10_countries_count)

##            Location   Region          Subregion Rate Count Year Rank
## 22          Bermuda Americas   Northern America  0.0     0 2019    1
## 39  Channel Islands   Europe    Northern Europe  0.0     0 2010    2
## 74         Holy See   Europe    Southern Europe  0.0     0 2015    3
## 85      Isle of Man   Europe    Northern Europe  0.0     0 2016    4
## 113          Monaco   Europe     Western Europe  0.0     0 2015    5
## 146    Saint Helena   Africa     Western Africa  0.0     0 2009    6
## 152      San Marino   Europe    Southern Europe  0.0     0 2011    7
## 15          Bahrain     Asia       Western Asia  0.1     2 2019    8
## 103      Luxembourg   Europe     Western Europe  0.2     1 2020    9
## 160       Singapore     Asia South-Eastern Asia  0.2    10 2020   10
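The table above is identical to the lowest-rate table because the code orders the rows by Rate. A minimal sketch of the difference between the two orderings is shown below on a made-up data frame (`toy` and its values are hypothetical, not taken from the dataset).

```r
# Hypothetical toy data: a large country with a low rate (A) and a
# small country with a high rate (B).
toy <- data.frame(
  Location = c("A", "B", "C", "D"),
  Rate     = c(0.5, 9.0, 0.2, 3.0),
  Count    = c(400L, 3L, 50L, 12L),
  stringsAsFactors = FALSE
)

# Ascending order by Rate and by Count give different rankings.
lowest_rate  <- toy[order(toy$Rate), ]
lowest_count <- toy[order(toy$Count), ]

print(lowest_rate$Location)
print(lowest_count$Location)
```

Applied to the report's data frame, the same idea would be `data[order(data$Count), ]`; its output is not shown here because it was not part of the original run.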

Histogram of homicide rates

Checking the distribution of the data and identifying any patterns that can be seen.

library(ggplot2)
ggplot(data, aes(x = Rate)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(x = "Homicide Rate", y = "Frequency", title = "Distribution of Homicide Rates")

Looking for outliers and calculating z-scores for the Rate and Count variables

We now calculate z-scores to identify outliers, defined here as values whose absolute z-score exceeds a predefined threshold (|z-score| > 3), and to understand the overall variability of the data.

rate_zscores <- scale(data$Rate)
count_zscores <- scale(data$Count)

# Find outliers based on z-score threshold (3, in our case)
rate_outliers <- data$Rate[abs(rate_zscores) > 3]
count_outliers <- data$Count[abs(count_zscores) > 3]

# Print the identified outliers
print(rate_outliers)

## [1] 37.2 36.3 44.7 43.6 38.6 49.3 36.7

print(count_outliers)

## [1] 47722 40651 36579 44200

# Removing outliers from the dataset
data2 <- data[!(data$Rate %in% rate_outliers), ]
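To make the z-score rule concrete, the sketch below computes z-scores by hand as (value - mean) / standard deviation, confirms that this matches `scale()`, and flags values with |z| > 3. The vector `x` is made up for illustration and contains one planted extreme value.

```r
# Hypothetical rates (not from the dataset); 60 is a planted extreme value.
x <- c(rep(c(1, 2, 3, 4, 5), times = 4), 60)

# z-score by hand: distance from the mean in units of the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() centres and scales in the same way; as.vector() drops the matrix shape.
z_scale <- as.vector(scale(x))
print(all.equal(z_manual, z_scale))

# Flag values more than 3 standard deviations from the mean.
flagged <- x[abs(z_manual) > 3]
print(flagged)
```

Because the mean and standard deviation are themselves pulled towards extreme values, the rule tends to be conservative on strongly skewed data such as homicide rates.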

Checking for normality for the attribute ‘Rate’

The histogram helps in understanding the distribution of the data and identifying any patterns or outliers present.

# Histogram of the data
hist(data2$Rate, breaks = 20, col = "red", border = "black")

Creating a density plot of the data

# Density plot of the data
plot(density(data2$Rate), col = "blue", lwd = 2)

Checking the skewness and making the Q-Q plot

The skewness statistic and the Q-Q plot help evaluate whether the rates, after outlier removal, follow a normal distribution.

# Q-Q plot of the data
qqnorm(data2$Rate)
qqline(data2$Rate)

# Sample skewness (e1071); stored under a name that does not shadow the function
rate_skewness <- skewness(data2$Rate)
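For readers unfamiliar with the skewness statistic, the sketch below computes the plain moment-based skewness in base R: the third central moment divided by the second central moment raised to the power 3/2. The helper `moment_skewness` and the sample vector are hypothetical. The `skewness()` function from e1071 offers several estimator variants through its `type` argument, so its value may differ slightly from this plain ratio.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2).
moment_skewness <- function(x) {
  centred <- x - mean(x)
  m2 <- mean(centred^2)  # second central moment
  m3 <- mean(centred^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Hypothetical right-skewed sample: many small values, a few large ones.
right_skewed <- c(1, 1, 2, 2, 3, 4, 6, 9, 15, 30)
g1 <- moment_skewness(right_skewed)
print(g1)

# A symmetric sample has skewness zero.
g1_symmetric <- moment_skewness(c(1, 2, 3, 4, 5))
print(g1_symmetric)
```

A clearly positive value indicates a long right tail, which is the pattern the histogram of homicide rates suggests.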

Transforming the dataset

A rank-based quantile transformation replaces each value with the standard normal quantile that corresponds to its rank, so that the transformed variable is approximately standard normal.

# Performing quantile transformation on the variable
ranked_data <- rank(data2$Rate)
quantiles <- qnorm((ranked_data - 0.5) / length(ranked_data))
transformed_data <- quantiles

# Generate a Q-Q plot for the transformed data
qqnorm(transformed_data)
qqline(transformed_data)
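The rank-based transformation used above can be sketched on simulated data as follows. The exponential sample is an assumption chosen only because it is strongly right-skewed; it is not the homicide data.

```r
set.seed(1)      # fixed seed so the sketch is reproducible
x <- rexp(200)   # hypothetical right-skewed sample

# Step 1: replace each value with its rank (1 = smallest).
r <- rank(x)

# Step 2: map ranks to probabilities strictly between 0 and 1,
# then to standard normal quantiles.
z <- qnorm((r - 0.5) / length(r))

# The transformation keeps the ordering of the original values
# and centres the result at zero.
print(identical(order(x), order(z)))
print(abs(mean(z)) < 1e-8)
```

Because the result depends only on the ranks, the transformed values look normal by construction; the straight Q-Q plot after the transformation is therefore expected and says nothing new about the original rates.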

Conducting the hypothesis testing

Research Question: Is there a significant difference in homicide rates between two regions, the Americas and Africa?

Null Hypothesis (H0): There is no difference in mean homicide rates between the Americas and Africa.

Alternative Hypothesis (H1): There is a difference in mean homicide rates between the Americas and Africa.

# Subsetting the data for Americas and Africa
region_a <- data2$Rate[data2$Region == "Americas" & data2$Year == 2020]
region_b <- data2$Rate[data2$Region == "Africa" & data2$Year == 2020]

# Checking the number of observations in each region
length(region_a)

## [1] 25

length(region_b)

## [1] 9

# Perform a two-sample t-test
result <- t.test(region_a, region_b)

# Print the test result
print(result)

## 
##  Welch Two Sample t-test
## 
## data:  region_a and region_b
## t = 1.6302, df = 11.512, p-value = 0.1301
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.063202 14.098757
## sample estimates:
## mean of x mean of y 
## 14.440000  8.422222
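By default `t.test()` runs Welch's version of the two-sample test, which does not assume equal variances. The sketch below reproduces its t-statistic, degrees of freedom (Welch-Satterthwaite approximation) and p-value by hand. The vectors `a` and `b` are made up for illustration and are not the regional rates.

```r
# Hypothetical samples with unequal sizes and spreads.
a <- c(12.1, 3.4, 25.0, 8.8, 17.5, 30.2, 5.1)
b <- c(4.0, 9.5, 6.2, 11.3)

# Squared standard error contributed by each group.
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t-statistic: difference in means over the combined standard error.
t_manual <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (va + vb)^2 /
  (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test.
res <- t.test(a, b)
print(c(t = t_manual, df = df_manual, p = p_manual))
print(res)
```

The fractional degrees of freedom in the output above (11.512) come from this approximation.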

Discussion

We set out to understand how homicide rates and counts differ around the world. There are a few limitations. For example, the dataset records each location in a different year (2006 to 2021), so the rankings compare figures from different years. The hypothesis test was restricted to records from 2020.

We found the top and bottom 10 countries in terms of homicide count and homicide rate. This gave us a general idea of the data and the regions.

The research question: Is there a significant difference in homicide rates between two regions, the Americas and Africa?

- The t-statistic was 1.6302, meaning the difference in sample means is about 1.6 standard errors.
- The degrees of freedom associated with the t-distribution were calculated as 11.512.
- The p-value obtained from the test was 0.1301. Since this value is greater than the conventional significance level of 0.05, there is insufficient evidence to reject the null hypothesis.
- The null hypothesis assumes that there is no difference between the means of the two groups.
- The alternative hypothesis states that the true difference in means is not equal to zero.
- The 95% confidence interval for the difference in means ranged from -2.06 to 14.10. This interval provides an estimate of plausible values for the true difference in means.
- The sample mean of region_a (Americas) was 14.44, while the sample mean of region_b (Africa) was 8.42.

In short, the statistical analysis does not provide strong evidence of a difference between the mean homicide rates of the Americas and Africa. However, the confidence interval is wide and includes zero, and the African sample contains only 9 observations, so a real difference cannot be ruled out. Further investigation with more data is needed.
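As a check on the reported figures, the confidence interval can be approximately reconstructed from the printed sample means, t-statistic and degrees of freedom. The sketch below uses only the rounded values shown in the test output, so small rounding differences from the printed interval are expected.

```r
# Values copied from the printed Welch t-test output.
mean_americas <- 14.440000
mean_africa   <- 8.422222
t_stat        <- 1.6302
df_welch      <- 11.512

# Difference in means and the standard error implied by the t-statistic.
diff_means <- mean_americas - mean_africa
std_error  <- diff_means / t_stat

# 95% interval: difference plus or minus the critical t value times the standard error.
t_crit   <- qt(0.975, df = df_welch)
ci_lower <- diff_means - t_crit * std_error
ci_upper <- diff_means + t_crit * std_error

print(round(c(lower = ci_lower, upper = ci_upper), 2))
```

The interval spans zero because the difference in means (about 6.0) is smaller than the margin of error (about 8.1).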
