1 Introduction

Inspired by the professor’s apt reference, when we think about Italy, the mind effortlessly wanders to its breathtaking cities, each a testament to history, art, and culture. From the romantic canals of Venice to the ancient streets of Rome, these cities tell stories that have shaped civilizations. However, beneath their beauty lies a more complex reality, one that involves the safety and security of these beloved places.

I decided to explore their safety from a statistical perspective. By analyzing a dataset brimming with crime statistics, I aim to uncover whether these cities can be grouped based on their security parameters. Can we identify clusters of cities that are remarkably safe, somewhat balanced, or perhaps in need of heightened safety measures? What insights can we derive from the numbers behind Italy’s most beautiful locations?

Padova, my city of origin

library(readr)
crime_data <- read.csv("~/Desktop/Italian cities_crime.csv", sep = ";")

# For aesthetic purposes, show only the first four columns
head(crime_data[, 1:4])

##         City.Provence Year Massacre Completed.voluntary.homicides
## 1           Agrigento 2024        0                             3
## 2         Alessandria 2024        1                             5
## 3              Ancona 2024        0                             0
## 4              Arezzo 2024        0                             1
## 5       Ascoli Piceno 2024        0                             2
## 6                Asti 2024        0                             0

1.1 Dataset overview

Unit of observation: each row represents a single Italian province.
Sample size: the dataset includes data from 105 provinces in Italy.

The dataset was retrieved from https://github.com/1655653/VICrime-Visual-Analytics-Project?tab=readme-ov-file, which in turn obtained the data from the Italian Ministry of the Interior.

1.2 Variables description

Massacre: Multiple murders committed simultaneously.
Completed voluntary homicides: Intentional killings completed.
Homicides for theft or robbery: Killings committed during theft or robbery.
Mafia-type homicides: Killings linked to mafia activities.
Terrorist homicides: Killings for terrorist purposes.
Attempted homicides: Failed attempts to kill.
Infanticides: Killing of infants.
Involuntary manslaughter: Killing without intent, beyond control.
Negligent homicides: Killings caused by negligence.
Road accident homicides: Killings due to road accidents.
Beatings: Physical assault causing harm.
Intentional injuries: Injuries caused deliberately.
Threats: Expressions of intent to harm.
Kidnappings: Unlawful detaining of a person.
Insults: Offensive or derogatory remarks.
Sexual violence: Sexual acts involving force or threat.
Sexual acts with minors: Engaging in sexual acts with underage individuals.
Corruption of minors: Exposing minors to inappropriate activities.
Exploitation and facilitation of prostitution: Forcing or aiding others into prostitution.
Child pornography and possession of pedopornographic material: Creating or holding illegal content involving minors.
Thefts: Taking property without consent.
Snatch thefts: Theft using quick grabs, like purse snatching.
Pickpocketing: Stealing from someone’s pockets or bag.
Burglaries in homes: Breaking into homes to steal.
Thefts in commercial establishments: Stealing from shops or businesses.
Thefts from parked cars: Taking items from vehicles.
Thefts of artworks and archaeological material: Stealing valuable cultural or historical items.
Thefts of heavy vehicles transporting goods: Stealing trucks and their cargo.
Thefts of mopeds: Stealing small motorized bikes.
Thefts of motorcycles: Stealing motorcycles.
Thefts of cars: Stealing automobiles.
Robberies: Stealing with violence or threat.
Robberies in homes: Violent thefts in residences.
Bank robberies: Robbing banks with force or threat.
Robberies in post offices: Stealing from post office facilities.
Robberies in commercial establishments: Violent thefts from businesses.
Street robberies: Violent thefts in public areas.
Extortions: Forcing others to give money or items under threat.
Scams and computer fraud: Deceptive acts to steal, often online.
Cybercrimes: Crimes committed using technology or the internet.
Counterfeiting of trademarks and industrial products: Producing fake goods illegally.
Violation of intellectual property: Illegally using someone else’s creations or ideas.
Receiving stolen goods: Possessing items known to be stolen.
Money laundering: Cleaning money obtained through illegal activities.
Goods or benefits of illicit origin: Profiting from illegal sources.
Usury: Lending money at excessively high interest rates.
Damages: Causing harm or destruction to property.
Fires: Setting or causing uncontrolled fires.
Forest fires: Fires that destroy wooded areas.
Damage followed by fire: Vandalism resulting in fires.
Drug legislation violations: Breaking drug-related laws.
Attacks: Violent actions intended to harm or intimidate.
Criminal association: Joining groups for illegal activities.
Mafia-type association: Involvement in organized crime groups.
Smuggling: Illegal transport of goods or people.
Other crimes: Miscellaneous illegal activities.
Value added per inhabitant: The economic contribution per person in a region, calculated as the sum of value added from sectors like agriculture, industry, construction, commerce, financial services, and other services, divided by the population.

1.3 Data manipulation

First, I replace commas with dots in the Value added per inhabitant column and convert it to numeric format to ensure accurate calculations and analyses involving this variable.

# Replace commas with periods in the column
crime_data$Value.added.per.inhabitant <- gsub(",", ".", crime_data$Value.added.per.inhabitant)

# Convert the column to numeric
crime_data$Value.added.per.inhabitant <- as.numeric(crime_data$Value.added.per.inhabitant)
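As a quick check of the conversion above, the same two calls can be tried on a small vector. The values here are hypothetical and only illustrate the decimal-comma format; they are not read from the file.

```r
# Hypothetical values written with decimal commas, as in the raw CSV
raw_values <- c("13,9", "26,2", "27,6")

# Swap the decimal comma for a dot, then convert the strings to numbers
clean_values <- as.numeric(gsub(",", ".", raw_values, fixed = TRUE))

clean_values
```

If every numeric column in the file used decimal commas, passing `dec = ","` to `read.csv()` would probably be a shorter alternative, but I have not verified that for this dataset.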

In the following step, I will categorize the original variables in the dataset into six broader crime categories:

Violent Crimes
Property Crimes
White-Collar Crimes (organized and economic crimes)
Cyber Crimes
Sexual and Moral Crimes
Drug and Other Crimes

This is done to simplify the analysis by reducing the number of variables from 58 to 6, making the dataset easier to interpret and analyze. Each category represents a logical grouping of related crimes based on their nature and severity.

For example, Violent Crimes include offenses like homicides, assaults, and kidnappings, while Property Crimes encompass thefts, robberies, and damages. The variables within each category were summed to create aggregate measures, enabling a more focused analysis of crime patterns across Italian provinces. This approach helps to identify trends and patterns in broader crime categories while maintaining the integrity of the original data. Note that a few of the variables described above (for example Threats, Insults, Extortions, and Smuggling) are not assigned to any category in the code below and therefore do not enter the aggregates.

# Define categories and their subcategories
crime_data$Violent_Crimes <- rowSums(crime_data[, c('Massacre', 
                                                    'Completed.voluntary.homicides', 
                                                    'Completed.voluntary.homicides.for.theft.or.robbery', 
                                                    'Completed.voluntary.homicides.of.a.mafia.type', 
                                                    'Completed.voluntary.homicides.for.terrorist.purposes', 
                                                    'Attempted.homicides', 
                                                    'Infanticides', 
                                                    'Beatings', 
                                                    'Intentional.injuries', 
                                                    'Sexual.violence', 
                                                    'Kidnappings')])

crime_data$Property_Crimes <- rowSums(crime_data[, c('Thefts', 
                                                     'Snatch.thefts', 
                                                     'Pickpocketing', 
                                                     'Burglaries.in.homes', 
                                                     'Thefts.in.commercial.establishments', 
                                                     'Thefts.from.parked.cars', 
                                                     'Thefts.of.artworks.and.archaeological.material', 
                                                     'Thefts.of.heavy.vehicles.transporting.goods', 
                                                     'Thefts.of.mopeds', 
                                                     'Thefts.of.motorcycles', 
                                                     'Thefts.of.cars', 
                                                     'Robberies', 
                                                     'Robberies.in.homes', 
                                                     'Bank.robberies', 
                                                     'Robberies.in.post.offices', 
                                                     'Robberies.in.commercial.establishments', 
                                                     'Street.robberies', 
                                                     'Damages', 
                                                     'Fires', 
                                                     'Forest.fires', 
                                                     'Damage.followed.by.fire')])

crime_data$WhiteCollar_Crimes <- rowSums(crime_data[, c('Mafia.type.association', 
                                                      'Criminal.association', 
                                                      'Money.laundering.and.use.of.illicit.funds', 
                                                      'Usury', 
                                                      'Counterfeiting.of.trademarks.and.industrial.products', 
                                                      'Violation.of.intellectual.property', 
                                                      'Receiving.stolen.goods')])

crime_data$Cyber_Crimes <- rowSums(crime_data[, c('Scams.and.computer.fraud', 
                                                  'Cybercrimes')])

crime_data$Sexual_Moral_Crimes <- rowSums(crime_data[, c('Sexual.acts.with.minors', 
                                                         'Corruption.of.minors', 
                                                         'Exploitation.and.facilitation.of.prostitution', 
                                                         'Child.pornography.and.possession.of.pedopornographic.material')])

crime_data$Drug_Other_Crimes <- rowSums(crime_data[, c('Drug.legislation', 
                                                       'Attacks', 
                                                       'Other.crimes')])
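The same aggregation can be written more compactly by keeping the category definitions in a named list and looping over it. This is only a sketch on hypothetical toy data: the column names mimic the real dataset, but the values and the two-category list are invented for illustration.

```r
# Hypothetical toy data; column names mimic the real dataset
toy <- data.frame(
  City.Provence = c("A", "B"),
  Thefts = c(10, 20),
  Robberies = c(1, 2),
  Scams.and.computer.fraud = c(3, 4),
  Cybercrimes = c(5, 7)
)

# One named entry per category, listing the columns that feed it
categories <- list(
  Property_Crimes = c("Thefts", "Robberies"),
  Cyber_Crimes    = c("Scams.and.computer.fraud", "Cybercrimes")
)

# Sum the source columns row by row for every category
for (category in names(categories)) {
  toy[[category]] <- rowSums(toy[, categories[[category]], drop = FALSE])
}

toy[, c("City.Provence", names(categories))]
```

Keeping the mapping in one list makes it easier to see at a glance which original variables end up in which category, and which are left out.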

Let’s view the newly created dataset.

# Create a new dataset with City/Province, Value added per inhabitant and summed crime categories
Italian_cities_Crime <- crime_data[, c("City.Provence",
                                       "Violent_Crimes", 
                                       "Property_Crimes", 
                                       "WhiteCollar_Crimes", 
                                       "Cyber_Crimes", 
                                       "Sexual_Moral_Crimes", 
                                       "Drug_Other_Crimes",
                                       "Value.added.per.inhabitant")]

# View the new dataset
head(Italian_cities_Crime)

##         City.Provence Violent_Crimes Property_Crimes WhiteCollar_Crimes
## 1           Agrigento            661            5218               1464
## 2         Alessandria            597           10980               2129
## 3              Ancona            594            7965               1716
## 4              Arezzo            487            7023               1013
## 5       Ascoli Piceno            241            3773                758
## 6                Asti            259            5230               1297
##   Cyber_Crimes Sexual_Moral_Crimes Drug_Other_Crimes Value.added.per.inhabitant
## 1         1262                  16             11327                       13.9
## 2         1636                  19             14882                       26.2
## 3         1712                  26             13263                       27.6
## 4         1043                  17              9982                       26.4
## 5          494                   3              5571                       23.8
## 6          943                   9              7254                       23.6

This streamlined dataset simplifies analysis by grouping related crime types under broader categories, making it easier to identify patterns and trends. For example, in the province of Arezzo, the dataset reports the following crime counts (Violent Crimes: 487; Property Crimes: 7023; White-Collar Crimes: 1013). These figures indicate that property crimes are far more prevalent in Arezzo than violent or white-collar crimes. This insight could be instrumental in allocating resources and devising crime prevention strategies tailored to the region’s specific needs.

1.4 Descriptive statistics

# Set a wider console output width
options(width = 70)

library(psych)

# Generate descriptive statistics for the entire dataset
describe(Italian_cities_Crime)

##                            vars   n     mean       sd  median
## City.Provence*                1 105    53.00    30.45    53.0
## Violent_Crimes                2 105   821.21   940.86   560.0
## Property_Crimes               3 105 17394.86 31652.00  8574.0
## WhiteCollar_Crimes            4 105  2720.72  4468.42  1544.0
## Cyber_Crimes                  5 105  2166.34  2925.56  1406.0
## Sexual_Moral_Crimes           6 105    19.04    23.97    11.0
## Drug_Other_Crimes             7 105 21878.66 33601.41 12948.0
## Value.added.per.inhabitant    8 105    24.34     6.90    24.1
##                             trimmed     mad    min      max    range
## City.Provence*                53.00   38.55    1.0    105.0    104.0
## Violent_Crimes               630.51  358.79  112.0   5889.0   5777.0
## Property_Crimes            10765.76 7117.96  943.0 210400.0 209457.0
## WhiteCollar_Crimes          1801.36  999.27  307.0  31310.0  31003.0
## Cyber_Crimes                1572.12  876.22  283.0  19375.0  19092.0
## Sexual_Moral_Crimes           13.59    8.90    0.0    145.0    145.0
## Drug_Other_Crimes          14952.99 9104.65 2237.0 220066.0 217829.0
## Value.added.per.inhabitant    23.94    7.71   13.9     49.7     35.8
##                            skew kurtosis      se
## City.Provence*             0.00    -1.23    2.97
## Violent_Crimes             3.68    15.08   91.82
## Property_Crimes            4.63    23.58 3088.92
## WhiteCollar_Crimes         4.67    23.39  436.07
## Cyber_Crimes               4.16    18.79  285.51
## Sexual_Moral_Crimes        3.09    10.73    2.34
## Drug_Other_Crimes          4.43    21.64 3279.16
## Value.added.per.inhabitant 0.61     0.60    0.67

1.4.1 Explanation of a few parameter estimates

Violent Crimes

The mean number of violent crimes across all provinces is 821.21, with a median of 560, suggesting a right-skewed distribution (confirmed by the skewness of 3.68). This indicates that while most provinces record relatively few violent crimes, a few provinces record far more, pulling the average up;
The standard deviation (940.86) exceeds the mean, reflecting substantial variability in violent crime counts across provinces.

Property Crimes

Property crimes exhibit a very wide range of 209,457 (second only to Drug and Other Crimes, at 217,829), with a mean of 17,394.86 and a median of 8,574. This extreme range and the high skewness (4.63) suggest the presence of provinces with exceptionally high property crime counts, likely large urban centers;
The kurtosis (23.58) indicates a heavy-tailed distribution, meaning some provinces record extraordinarily many property crimes compared to the majority.

Sexual and Moral Crimes

Sexual and moral crimes have a low mean of 19.04 and a median of 11, with values ranging from 0 to 145. This relatively small scale of values, combined with a skewness of 3.09, suggests these crimes are not uniformly distributed but concentrated in a few provinces;
The standard error (2.34) is small in absolute terms, but relative to the mean (about 12%) it is comparable to that of the other categories, so the sample mean is estimated with similar relative precision.
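The `se` column reported by `describe()` can be reproduced by hand as the standard deviation divided by the square root of the sample size. The sketch below uses only the figures printed in the table above (n = 105 and the `sd` and `mean` columns); the variable names are my own.

```r
n <- 105  # number of provinces in the table

# Standard deviations and means copied from the describe() output
sd_violent   <- 940.86
sd_sexual    <- 23.97
mean_violent <- 821.21
mean_sexual  <- 19.04

# Standard error of the mean: sd / sqrt(n)
se_violent <- sd_violent / sqrt(n)
se_sexual  <- sd_sexual / sqrt(n)
round(c(se_violent, se_sexual), 2)

# Standard error relative to the mean, for comparing categories on one scale
round(c(se_violent / mean_violent, se_sexual / mean_sexual), 3)
```

The first line of output matches the 91.82 and 2.34 shown in the table. Dividing by the mean puts both categories on a comparable scale, which is more informative than the raw standard errors alone.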

2 RQ: Clustering

Question: can Italian provinces be clustered into distinct groups based on their crime profiles?

2.1 Step 1: standardizing variables, dissimilarity & removing outliers

For the clustering analysis, I selected six variables: Violent Crimes, Property Crimes, White-Collar Crimes, Cyber Crimes, Sexual and Moral Crimes, and Drug/Other Crimes. These categories capture distinct and critical dimensions of criminal activity, providing a comprehensive picture of crime patterns across Italian provinces.

I chose to exclude Value added per inhabitant from the clustering process and use it afterwards as a criterion validity check. This variable reflects the economic state of each province, making it a useful benchmark for assessing how meaningful the resulting clusters are.

# Standardize the clustering variables
mydata_clu_std <- as.data.frame(scale(Italian_cities_Crime[, c("Violent_Crimes", 
                                                               "Property_Crimes", 
                                                               "WhiteCollar_Crimes", 
                                                               "Cyber_Crimes", 
                                                               "Sexual_Moral_Crimes",
                                                               "Drug_Other_Crimes")]))

# Calculate dissimilarity measure to find outliers
Italian_cities_Crime$Dissimilarity <- sqrt(mydata_clu_std$Violent_Crimes^2 + 
                                           mydata_clu_std$Property_Crimes^2 + 
                                           mydata_clu_std$WhiteCollar_Crimes^2 + 
                                           mydata_clu_std$Cyber_Crimes^2 + 
                                           mydata_clu_std$Sexual_Moral_Crimes^2 +
                                           mydata_clu_std$Drug_Other_Crimes^2)

head(Italian_cities_Crime[order(-Italian_cities_Crime$Dissimilarity), c("City.Provence", "Dissimilarity")])

##    City.Provence Dissimilarity
## 55        Milano     13.805188
## 81          Roma     13.353195
## 58        Napoli      8.233065
## 92        Torino      7.841460
## 33       Firenze      3.097972
## 14       Bologna      3.021125

Explanation of results

The dissimilarity values measure how far each city lies from the average province in standardized units (the Euclidean distance from the centroid), with higher values indicating greater divergence. Milano, Roma, Napoli, and Torino have much higher dissimilarity scores (ranging from 7.8 to 13.8) than other cities like Firenze and Bologna (around 3). Being large metropolitan areas, these four naturally have higher crime counts due to their size and complexity, making them potential outliers. Removing them ensures they do not distort the clustering process.
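Because standardized variables have mean zero, the dissimilarity computed above is simply each row's Euclidean distance from the origin, so it can also be written with `rowSums()`. The sketch below shows this on a hypothetical four-row data frame in which the last row is deliberately extreme.

```r
# Hypothetical toy data: the fourth row is far larger than the others
toy <- data.frame(a = c(1, 2, 3, 10),
                  b = c(2, 4, 6, 30))

# Standardize each column to mean 0 and standard deviation 1
z <- scale(toy)

# Euclidean distance of every row from the centroid (the origin after scaling)
dissimilarity <- sqrt(rowSums(z^2))

dissimilarity
which.max(dissimilarity)  # index of the most atypical row
```

This form does not need one term per variable, so it stays the same if clustering variables are added or removed.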

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Filter out the specified outlier cities using row indices
Italian_cities_Crime <- Italian_cities_Crime %>%
  filter(!(row_number() %in% c(55, 81, 58, 92)))

# Standardize the clustering variables without the outliers
mydata_clu_std <- as.data.frame(scale(Italian_cities_Crime[, c("Violent_Crimes", 
                                                               "Property_Crimes", 
                                                               "WhiteCollar_Crimes", 
                                                               "Cyber_Crimes", 
                                                               "Sexual_Moral_Crimes",
                                                               "Drug_Other_Crimes")]))

# Note: the Dissimilarity column still holds the values computed before the outliers were removed
head(Italian_cities_Crime[order(-Italian_cities_Crime$Dissimilarity), c("City.Provence", "Dissimilarity")])

##    City.Provence Dissimilarity
## 33       Firenze      3.097972
## 14       Bologna      3.021125
## 37        Genova      2.872282
## 95       Venezia      2.182329
## 16       Brescia      2.088707
## 61       Palermo      1.894155
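The outlier removal above relies on row positions, which shift every time rows are dropped. A possibly more robust variant, sketched here in base R on hypothetical toy counts, filters by city name instead. The data frame and the `outlier_names` vector are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical toy data with invented counts
toy <- data.frame(
  City.Provence  = c("Ancona", "Milano", "Padova", "Roma"),
  Violent_Crimes = c(10, 500, 12, 480),
  stringsAsFactors = FALSE
)

# Cities to drop, identified by name rather than by row position
outlier_names <- c("Milano", "Roma")

# Keep only the rows whose city is not in the outlier list
toy_filtered <- toy[!(toy$City.Provence %in% outlier_names), ]

toy_filtered
```

Filtering by name keeps working even if the data are reordered or filtered in an earlier step.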

2.2 Step 2: Euclidean distances

library(factoextra)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# Calculate Euclidean distances
Distances <- get_dist(mydata_clu_std, 
                      method = "euclidean")

# Visualize the distance matrix
fviz_dist(Distances, gradient = list(low = "darkred", 
                                     mid = "grey95", 
                                     high = "white"))

library(factoextra)

# Hopkins statistic
get_clust_tendency(mydata_clu_std, 
                   n = nrow(mydata_clu_std) - 1, 
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8313608
## 
## $plot
## NULL

Explanation of results

The Hopkins statistic, which measures clustering tendency, is 0.831, indicating a strong clustering structure in the dataset. Being well above 0.5 and close to 1, this value supports the suitability of the data for cluster analysis. Additionally, visual inspection of the distance matrix reveals distinct blocks along the diagonal, further supporting the existence of well-defined clusters within the data.
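To make the idea behind the Hopkins statistic more concrete, here is a minimal base R sketch of one common formulation: it compares nearest-neighbour distances from uniformly drawn points to the data with nearest-neighbour distances among the data themselves. The function `hopkins_sketch()` and the simulated data are my own assumptions; `get_clust_tendency()` may differ in its details, so the numbers are not expected to match.

```r
set.seed(1)

# Hypothetical data: two tight, well-separated groups of 50 points each
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))

hopkins_sketch <- function(X, m) {
  n <- nrow(X)
  d <- ncol(X)

  # Distance from a point p to its nearest neighbour among the rows of Y
  nn_dist <- function(p, Y) min(sqrt(colSums((t(Y) - p)^2)))

  # m points drawn uniformly inside the bounding box of the data
  mins <- apply(X, 2, min)
  maxs <- apply(X, 2, max)
  U <- sapply(seq_len(d), function(j) runif(m, mins[j], maxs[j]))
  u <- sapply(seq_len(m), function(i) nn_dist(U[i, ], X))

  # m real observations, each compared with all the other observations
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  # Near 1 when the data are clustered, around 0.5 when they look uniform
  sum(u) / (sum(u) + sum(w))
}

H <- hopkins_sketch(X, m = 20)
H
```

In clustered data the real observations sit close to one another while the uniform points fall mostly in empty space, which pushes the ratio towards 1.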

2.3 Step 3: how many clusters?

library(factoextra)
library(NbClust)

# Perform the Elbow Method
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

Explanation of results

The elbow method is used to determine the optimal number of clusters by plotting the total within-cluster sum of squares (WCSS) against the number of clusters. In the resulting plot, breaks are observed at 2 and 4 clusters, beyond which adding more clusters no longer reduces the WCSS substantially. This suggests that dividing the data into either 2 or 4 clusters captures the structure of the dataset effectively, balancing simplicity and meaningful differentiation.
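The quantity plotted by the elbow method can be computed directly from `kmeans()`, whose `tot.withinss` component is the total within-cluster sum of squares. The sketch below does this for k = 1 to 6 on simulated data with two obvious groups. The data and variable names are hypothetical.

```r
set.seed(123)

# Hypothetical data: two well-separated groups of 30 points each
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2))

# Total within-cluster sum of squares for each candidate number of clusters
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

# With a single cluster, the WCSS equals the total sum of squares
total_ss <- sum(scale(X, scale = FALSE)^2)

data.frame(k = ks, wss = round(wss, 2))
```

On data like these the WCSS collapses between k = 1 and k = 2 and then flattens, which is the "elbow" one looks for in the plot.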

# Perform the Silhouette analysis
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette analysis")

Explanation of results

The silhouette method evaluates how well each data point fits within its assigned cluster compared to other clusters. In this case, the silhouette analysis indicates that 2 clusters provide the best separation and cohesion among the data, as this configuration achieves the highest average silhouette width. This suggests that dividing the data into 2 clusters most effectively captures the underlying structure. The same number is also proposed by the majority of the NbClust indices computed with the k-means method.
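For readers who want to see what the silhouette score measures, the following base R sketch computes the average silhouette width by hand: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The helper `avg_silhouette()` and the simulated data are assumptions for illustration only.

```r
set.seed(123)

# Hypothetical data: two well-separated groups of 30 points each
X <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2))

avg_silhouette <- function(X, cl) {
  D <- as.matrix(dist(X))  # pairwise Euclidean distances
  widths <- sapply(seq_len(nrow(X)), function(i) {
    same <- cl == cl[i]
    same[i] <- FALSE
    if (!any(same)) return(0)  # convention for single-member clusters
    a <- mean(D[i, same])      # cohesion: distance to own cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(k) mean(D[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(widths)
}

# Average silhouette width for 2 to 5 clusters
ks  <- 2:5
sil <- sapply(ks, function(k) {
  avg_silhouette(X, kmeans(X, centers = k, nstart = 25)$cluster)
})
names(sil) <- ks

round(sil, 3)
names(which.max(sil))  # number of clusters with the best average width
```

The number of clusters with the highest average width is the one the silhouette method recommends.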

library(NbClust)

# Determine the optimal number of clusters
nc <- NbClust(mydata_clu_std, distance = "euclidean", 
              min.nc = 2, max.nc = 10, 
              method = "kmeans", index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 11 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 8 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

# Perform k-means clustering with 2 clusters
Clustering <- kmeans(mydata_clu_std, 
                        centers = 2, 
                        nstart = 25)

Clustering

## K-means clustering with 2 clusters of sizes 86, 15
## 
## Cluster means:
##   Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1     -0.3435403      -0.3418753         -0.3162826   -0.3286798
## 2      1.9696312       1.9600853          1.8133536    1.8844311
##   Sexual_Moral_Crimes Drug_Other_Crimes
## 1          -0.3050237        -0.3510029
## 2           1.7488024         2.0124168
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
##  [33] 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 1
##  [65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1
##  [97] 1 2 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 130.96392  87.31167
##  (between_SS / total_SS =  63.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

library(factoextra)

# Visualize the clusters
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE, 
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Explanation of results

Looking at the plot, I see two options: remove the outliers that visibly disturb the clustering, or step back and restart the process with 4 groups. The elbow method showed a second break at 4 clusters and 8 of the NbClust indices proposed 4, even though the silhouette analysis favoured 2. I have therefore decided to step back and proceed with 4 clusters. This approach allows a more refined analysis and a deeper understanding of the distinct characteristics and relationships between cities within each cluster.

2.4 Step 4: clusters creation and analysis

# Perform k-means clustering with 4 clusters
Clustering <- kmeans(mydata_clu_std, 
                        centers = 4, 
                        nstart = 25)

Clustering

## K-means clustering with 4 clusters of sizes 13, 9, 37, 42
## 
## Cluster means:
##   Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1     1.07991184       0.8833743          0.9481110   0.95323298
## 2     2.22939005       2.4496302          2.2892278   2.32035110
## 3    -0.03810664      -0.1316975         -0.1159741  -0.05178042
## 4    -0.77841473      -0.6823269         -0.6818441  -0.74665031
##   Sexual_Moral_Crimes Drug_Other_Crimes
## 1          0.23462134        0.96908733
## 2          2.63237012        2.45209958
## 3         -0.04470523       -0.09576122
## 4         -0.59731702       -0.74104443
## 
## Clustering vector:
##   [1] 3 3 3 4 4 4 3 2 4 4 4 2 4 2 3 2 4 1 4 4 1 2 3 4 3 3 4 4 3 4 4 3
##  [33] 2 1 3 3 2 4 4 3 4 4 4 3 1 4 3 4 3 4 3 4 4 3 1 1 3 4 4 1 2 3 1 1
##  [65] 4 3 4 3 3 4 4 3 4 3 3 3 4 3 4 1 3 3 4 3 4 3 4 4 3 3 3 3 3 1 2 4
##  [97] 4 1 4 1 4
## 
## Within cluster sum of squares by cluster:
## [1] 19.55126 41.70576 25.11829 13.35958
##  (between_SS / total_SS =  83.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Explanation of results

The results of the k-means clustering reveal some interesting insights:

Cluster sizes: The largest group is Cluster 4, with 42 cities, while the smallest is Cluster 2, with only 9 cities;
Cluster means: for example, Cluster 2 exhibits the highest average values across all crime categories, indicating it includes the cities with the highest crime levels, while Cluster 4 has negative averages throughout, indicating below-average crime levels;
Cluster cohesion: Cluster 4 is the most compact (lowest within-cluster sum of squares: 13.36), indicating strong internal similarity, while Cluster 2 has the highest within-cluster sum of squares (41.71), reflecting greater variability among its cities;
Overall clustering efficiency: The model achieves a high proportion of variance explained, with 83.4% between-cluster variation, indicating the clusters are well-separated and meaningful.
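The figures discussed above come straight from components of the `kmeans()` result: `size` gives the cluster sizes, `withinss` the within-cluster sums of squares, and the percentage is `betweenss / totss`. The sketch below shows how they relate, using simulated data rather than the crime dataset.

```r
set.seed(123)

# Hypothetical standardized data with two groups of 30 points each
X <- scale(rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
                 matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2)))

km <- kmeans(X, centers = 2, nstart = 25)

km$size      # number of observations in each cluster
km$withinss  # within-cluster sum of squares, one value per cluster

# Share of the total variation explained by the cluster structure
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)
```

Since the total sum of squares splits into a between-cluster and a within-cluster part, this ratio can only go up as more clusters are added, so it is best read together with the elbow and silhouette results.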

library(factoextra)

# Visualize the clusters
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE, 
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Explanation of results

In the visual graph of the clusters, it can be clearly observed that some cities, such as Bari (ID 8) and Pescara (ID 66), are very distant from the cluster centers. These cities act as outliers, indicating they have unique crime characteristics that do not align closely with the general trends of their respective clusters. This distance highlights the distinctiveness of these cities in terms of crime rates compared to the rest of the data.
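Besides reading outliers off the plot, one could rank the observations by their distance to the center of their assigned cluster. The sketch below does this on simulated data with one deliberately unusual point; it is an assumption about how such a check might look, not the procedure used in this analysis.

```r
set.seed(123)

# Hypothetical data: two groups of 50 points plus one unusual point (row 101)
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
           c(2.5, 9))

km <- kmeans(X, centers = 2, nstart = 25)

# Center of the cluster each observation was assigned to
assigned_centers <- km$centers[km$cluster, , drop = FALSE]

# Euclidean distance of every observation to its own cluster center
dist_to_center <- unname(sqrt(rowSums((X - assigned_centers)^2)))

# Row indices of the three observations farthest from their center
order(dist_to_center, decreasing = TRUE)[1:3]
```

A ranking like this gives a reproducible list of candidates, which can then be compared with what the plot suggests.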

library(dplyr)

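# Remove the outlying cities identified in the cluster plot
# (e.g. Bari, ID 8, and Pescara, ID 66), using their current row indices
Italian_cities_Crime <- Italian_cities_Crime %>%
  filter(!(row_number() %in% c(66, 92, 83, 79, 80, 8, 37, 16)))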

# Standardize the clustering variables without the outliers
mydata_clu_std <- as.data.frame(scale(Italian_cities_Crime[, c("Violent_Crimes", 
                                                               "Property_Crimes", 
                                                               "WhiteCollar_Crimes", 
                                                               "Cyber_Crimes", 
                                                               "Sexual_Moral_Crimes",
                                                               "Drug_Other_Crimes")]))

# Perform k-means clustering with 4 clusters
Clustering <- kmeans(mydata_clu_std, 
                        centers = 4, 
                        nstart = 25)

Clustering

## K-means clustering with 4 clusters of sizes 40, 6, 33, 14
## 
## Cluster means:
##   Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1    -0.78515932     -0.65995537        -0.70011337 -0.748978559
## 2     2.58502542      2.87124885         2.35015943  2.512993381
## 3     0.03029793     -0.08576926        -0.04162889  0.005629694
## 4     1.06402776      0.85722194         1.09123797  1.049671586
##   Sexual_Moral_Crimes Drug_Other_Crimes
## 1         -0.57661797        -0.7243371
## 2          3.26065679         2.8233377
## 3         -0.05369907        -0.0398669
## 4          0.37663195         0.9535047
## 
## Clustering vector:
##  [1] 3 3 3 1 1 1 3 1 1 1 2 1 2 3 1 4 1 1 4 2 3 1 3 3 1 1 3 1 1 3 2 4 3
## [34] 3 1 1 3 1 1 1 4 4 1 3 1 3 1 3 1 1 3 4 4 3 1 1 4 2 3 4 4 1 1 3 3 1
## [67] 1 3 1 3 3 3 1 3 3 3 3 1 3 1 1 3 3 4 3 4 2 1 1 4 1 4 1
## 
## Within cluster sum of squares by cluster:
## [1] 15.13867 29.57155 21.40680 21.83599
##  (between_SS / total_SS =  84.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

library(factoextra)

# Visualization
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE, 
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Explanation of results

After deleting the outliers, the between_SS / total_SS ratio improved slightly, from 83.4% to 84.1%, which is a positive sign. Visually, the clusters now also better satisfy both within-cluster and between-cluster requirements, showing a clearer separation of groups with reduced outlier influence. This indicates more meaningful and consistent groupings in the data.

2.5 Step 5: cluster averages & variables differentiation

# Averages for each cluster
Averages <- Clustering$centers

Averages

##   Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1    -0.78515932     -0.65995537        -0.70011337 -0.748978559
## 2     2.58502542      2.87124885         2.35015943  2.512993381
## 3     0.03029793     -0.08576926        -0.04162889  0.005629694
## 4     1.06402776      0.85722194         1.09123797  1.049671586
##   Sexual_Moral_Crimes Drug_Other_Crimes
## 1         -0.57661797        -0.7243371
## 2          3.26065679         2.8233377
## 3         -0.05369907        -0.0398669
## 4          0.37663195         0.9535047

Explanation of results

The cluster averages reveal intriguing differences in crime patterns among the four clusters:

Cluster 1 displays consistently negative averages for all crime types, representing cities with safer profiles. With 40 members it is the largest group, and its cities sit at the safer end of the spectrum, with below-average levels of violent, property, and cyber crime;
Cluster 2 stands out with dramatically higher averages across all categories (every centroid is above 2.3 standard deviations), particularly in Sexual Moral Crimes (3.26) and Property Crimes (2.87). This small group of 6 cities may be struggling with severe social issues such as drug trafficking and crimes that exploit vulnerable populations;
Cluster 3 hovers around zero for all crime types, indicating cities with crime levels close to the overall average, with no category standing out;
Cluster 4 shows moderately high averages across all crime categories, particularly in White Collar Crimes (1.09), Violent Crimes (1.06), and Cyber Crimes (1.05), suggesting cities with elevated but not extreme criminal activity. Sexual Moral Crimes (0.38) is the category closest to the average.
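The reading above can be checked with a few lines of R. This is a small sketch that re-enters the printed cluster centers by hand (the object name `centers_printed` is made up here) and finds, for each cluster, the variable furthest from zero.

```r
# Cluster centers copied from the printed kmeans output above
centers_printed <- rbind(
  "1" = c(-0.78515932, -0.65995537, -0.70011337, -0.748978559, -0.57661797, -0.7243371),
  "2" = c( 2.58502542,  2.87124885,  2.35015943,  2.512993381,  3.26065679,  2.8233377),
  "3" = c( 0.03029793, -0.08576926, -0.04162889,  0.005629694, -0.05369907, -0.0398669),
  "4" = c( 1.06402776,  0.85722194,  1.09123797,  1.049671586,  0.37663195,  0.9535047)
)
colnames(centers_printed) <- c("Violent_Crimes", "Property_Crimes",
                               "WhiteCollar_Crimes", "Cyber_Crimes",
                               "Sexual_Moral_Crimes", "Drug_Other_Crimes")

# Variable with the largest absolute centre in each cluster
top_var <- colnames(centers_printed)[apply(abs(centers_printed), 1, which.max)]

# Full ranking of the variables for Cluster 4, highest first
rank_cluster4 <- names(sort(centers_printed["4", ], decreasing = TRUE))
```

Because the variables are standardized, the centers are expressed in standard deviations from the overall mean, which is what makes them comparable across crime types.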

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

# Transforming the data for visualization
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Violent_Crimes", "Property_Crimes", "WhiteCollar_Crimes", "Cyber_Crimes", "Sexual_Moral_Crimes", "Drug_Other_Crimes"))

Figure$Group <- factor(Figure$ID,
                       levels = c(1, 2, 3, 4),
                       labels = c("1", "2", "3", "4"))

Figure$NameF <- factor(Figure$name,
                       levels = c("Violent_Crimes", "Property_Crimes", "WhiteCollar_Crimes", "Cyber_Crimes", "Sexual_Moral_Crimes", "Drug_Other_Crimes"),
                       labels = c("Violent Crimes", "Property Crimes", "White Collar Crimes", "Cyber Crimes", "Sexual Moral Crimes", "Drug/Other Crimes"))

# Visualizing with ggplot
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, color = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster Variables") +
  ylim(-2, 3.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

Explanation of results

The profile plot makes these distinctions visible. The line for Cluster 2 runs far above the others, driven by extreme values in every category and especially in Sexual Moral Crimes. Cluster 1 stays below zero throughout, Cluster 3 stays close to the zero line, and Cluster 4 sits in between, above average but well below Cluster 2. These findings can help policymakers and law enforcement tailor interventions to city-specific challenges.

Let’s see how my city of origin, Padova, is performing.

# Show cluster averages for Padova's cluster (Cluster 4)
Cluster_averages <- Averages[4, ]  # Cluster 4 averages

print(Cluster_averages)

##      Violent_Crimes     Property_Crimes  WhiteCollar_Crimes 
##           1.0640278           0.8572219           1.0912380 
##        Cyber_Crimes Sexual_Moral_Crimes   Drug_Other_Crimes 
##           1.0496716           0.3766320           0.9535047

# Filter the dataset for the 57th row (Padova's row)
Padova_values <- mydata_clu_std[57, ]

# Display Padova's standardized values for all variables
print(Padova_values)

##    Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 57       1.565188        1.495802           2.033737      2.23468
##    Sexual_Moral_Crimes Drug_Other_Crimes
## 57           0.3059806          1.717484

Explanation of results

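Padova belongs to Cluster 4, the group with moderately high averages across all crime categories. Unfortunately, its own standardized values are above the cluster averages in five of the six categories, most clearly in Cyber Crimes (2.23 against a cluster average of 1.05) and White Collar Crimes (2.03 against 1.09). Only Sexual Moral Crimes (0.31) is slightly below the cluster average (0.38). Padova is therefore one of the more crime-exposed cities within its cluster, especially for cyber and white-collar offenses.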
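A short sketch of that comparison, using the values printed above and re-entered by hand (the names `cluster4`, `padova`, and `gap` are illustrative):

```r
# Cluster 4 averages, as printed above
cluster4 <- c(Violent_Crimes = 1.0640278, Property_Crimes = 0.8572219,
              WhiteCollar_Crimes = 1.0912380, Cyber_Crimes = 1.0496716,
              Sexual_Moral_Crimes = 0.3766320, Drug_Other_Crimes = 0.9535047)

# Padova's standardized values, as printed above
padova <- c(Violent_Crimes = 1.565188, Property_Crimes = 1.495802,
            WhiteCollar_Crimes = 2.033737, Cyber_Crimes = 2.23468,
            Sexual_Moral_Crimes = 0.3059806, Drug_Other_Crimes = 1.717484)

# Positive gap: Padova is above its cluster average for that variable
gap <- padova - cluster4

round(sort(gap, decreasing = TRUE), 2)
```

Sorting the gaps shows at a glance where a single city departs most from the profile of its own cluster.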

Italian_cities_Crime$Group <- Clustering$cluster

# Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Violent_Crimes, Property_Crimes, WhiteCollar_Crimes, Cyber_Crimes, Sexual_Moral_Crimes, Drug_Other_Crimes) ~ as.factor(Group),
           data = Italian_cities_Crime)

summary(fit)

##  Response Violent_Crimes :
##                  Df   Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)  3 11511370 3837123  210.46 < 2.2e-16 ***
## Residuals        89  1622682   18232                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Property_Crimes :
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)  3 8713150574 2904383525  157.48 < 2.2e-16 ***
## Residuals        89 1641367067   18442327                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response WhiteCollar_Crimes :
##                  Df    Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)  3 126812537 42270846  91.498 < 2.2e-16 ***
## Residuals        89  41116667   461985                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Cyber_Crimes :
##                  Df   Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)  3 70682862 23560954  138.35 < 2.2e-16 ***
## Residuals        89 15156235   170295                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Sexual_Moral_Crimes :
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)  3 13676.0  4558.7   183.1 < 2.2e-16 ***
## Residuals        89  2215.9    24.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Drug_Other_Crimes :
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)  3 1.0714e+10 3571477254  232.64 < 2.2e-16 ***
## Residuals        89 1.3663e+09   15352022                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation of results

The extremely low p-values (p < 2.2e-16 for every crime category) confirm that the cluster means differ on all six clustering variables. Because the clusters were built from these same variables, large differences are expected by construction, so this result confirms that every variable contributes to the separation rather than providing independent validation. Within that limit, each cluster represents a distinct profile, with meaningful differences across all analyzed variables.
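To complement the p-values with a measure of magnitude, the sketch below computes eta squared (the share of each variable's total sum of squares explained by cluster membership) from the Sum Sq columns printed above. Eta squared is an addition of this sketch and not part of the original output. The numbers are re-entered by hand, so the Drug_Other_Crimes entries are only as precise as the rounded printout.

```r
# Sum of squares copied from the printed ANOVA tables
ss_between <- c(Violent_Crimes = 11511370, Property_Crimes = 8713150574,
                WhiteCollar_Crimes = 126812537, Cyber_Crimes = 70682862,
                Sexual_Moral_Crimes = 13676.0, Drug_Other_Crimes = 1.0714e10)
ss_resid <- c(Violent_Crimes = 1622682, Property_Crimes = 1641367067,
              WhiteCollar_Crimes = 41116667, Cyber_Crimes = 15156235,
              Sexual_Moral_Crimes = 2215.9, Drug_Other_Crimes = 1.3663e9)

# Eta squared: between-group SS as a share of total SS
eta_sq <- ss_between / (ss_between + ss_resid)

# F statistic rebuilt from the same numbers (3 and 89 degrees of freedom)
f_value <- (ss_between / 3) / (ss_resid / 89)

round(eta_sq, 3)
```

The rebuilt F statistics should agree with the printed F values up to rounding, which is a quick consistency check on the hand-copied numbers.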

2.6 Step 6: criterion validity

To evaluate the criterion validity of my clustering solution, I selected Value added per inhabitant as a numerical variable to serve as the benchmark for the test. This variable, derived from the economic contributions of various sectors such as agriculture, industry, construction, commerce, financial services, and other services, offers a meaningful way to assess whether the identified clusters differ significantly in terms of their economic output. By comparing the clusters based on this measure, I aim to validate whether the grouping captures underlying patterns linked to the socio-economic characteristics of Italian cities.

# Aggregate the means of Value added per inhabitant by cluster group
aggregate(Italian_cities_Crime$Value.added.per.inhabitant, 
          by = list(Cluster = Italian_cities_Crime$Group), 
          FUN = mean)

##   Cluster        x
## 1       1 21.96000
## 2       2 28.38333
## 3       3 24.66061
## 4       4 25.25714

# Load the car library for Levene's test
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

# Perform Levene's test
leveneTest(Italian_cities_Crime$Value.added.per.inhabitant, as.factor(Italian_cities_Crime$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  1.9954 0.1204
##       89

Explanation of results

The Levene’s Test checks whether the variances across groups are homogeneous.

H0: The variances of the groups are equal (homogeneity of variance)
H1: The variances of the groups are not equal (heterogeneity of variance)

Since the p-value (0.1204) is greater than the significance level (0.05), we fail to reject the null hypothesis. This means there is no significant evidence to suggest that the variances across the groups are different.
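For intuition about what the test computes, here is a minimal base R sketch of Levene's test with `center = median`: a one-way ANOVA on the absolute deviations of each observation from its group median. The data are synthetic and the names `toy`, `dev`, and `lev` are made up for illustration. The last lines recompute the p-value from the F statistic and degrees of freedom printed above.

```r
# Synthetic data (NOT the crime dataset): three groups of 20
set.seed(3)
toy <- data.frame(
  y = c(rnorm(20, sd = 1), rnorm(20, sd = 1.2), rnorm(20, sd = 0.9)),
  g = factor(rep(c("a", "b", "c"), each = 20))
)

# Absolute deviation of each observation from its own group median
toy$dev <- abs(toy$y - ave(toy$y, toy$g, FUN = median))

# One-way ANOVA on the deviations is Levene's test (center = median)
lev <- anova(lm(dev ~ g, data = toy))
lev_F <- lev[["F value"]][1]
lev_p <- lev[["Pr(>F)"]][1]

# p-value implied by the printed output: F = 1.9954 on 3 and 89 df
p_printed <- pf(1.9954, df1 = 3, df2 = 89, lower.tail = FALSE)
round(p_printed, 4)
```

Using the median rather than the mean as the center makes the test more robust when the data are skewed.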

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

# Perform Shapiro-Wilk test for each cluster
Italian_cities_Crime %>%
  group_by(as.factor(Italian_cities_Crime$Group)) %>%
  shapiro_test(Value.added.per.inhabitant)

## # A tibble: 4 × 4
##   `as.factor(Italian_cities_Crime$Group)` variable     statistic     p
##   <fct>                                   <chr>            <dbl> <dbl>
## 1 1                                       Value.added…     0.961 0.186
## 2 2                                       Value.added…     0.851 0.160
## 3 3                                       Value.added…     0.951 0.138
## 4 4                                       Value.added…     0.917 0.198

Explanation of results

H0: The data follows a normal distribution
H1: The data does not follow a normal distribution

For all groups, the p-values are greater than 0.05 (0.186 for Group 1, 0.160 for Group 2, 0.138 for Group 3, and 0.198 for Group 4). Thus, we fail to reject the null hypothesis: there is no evidence that the “Value added per inhabitant” variable departs from a normal distribution in any group.

Since both assumptions for ANOVA (normality and homogeneity of variances) are satisfied, I can proceed with the ANOVA test.
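The same per-group normality check can be written in base R without `rstatix`. This is a sketch on synthetic data. The group sizes and the names `toy` and `res` are invented for illustration and do not reproduce the values above.

```r
# Synthetic data (NOT the crime dataset): four groups of unequal size
set.seed(4)
toy <- data.frame(
  value = rnorm(60, mean = 25, sd = 6),
  Group = factor(rep(1:4, times = c(25, 6, 18, 11)))
)

# Run shapiro.test() separately within each group and collect the results
res <- do.call(rbind, lapply(split(toy$value, toy$Group), function(v) {
  st <- shapiro.test(v)
  data.frame(n = length(v),
             statistic = unname(st$statistic),
             p = st$p.value)
}))

# TRUE where normality is not rejected at the 5% level
res$normal_ok <- res$p > 0.05
res
```

With very small groups the Shapiro-Wilk test has little power, so a non-significant result there is weak evidence of normality rather than proof.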

# Perform ANOVA
fit <- aov(Value.added.per.inhabitant ~ as.factor(Group), 
           data = Italian_cities_Crime)

# View the summary of the ANOVA
summary(fit)

##                  Df Sum Sq Mean Sq F value Pr(>F)  
## as.factor(Group)  3    316  105.18   2.651 0.0536 .
## Residuals        89   3531   39.67                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation of results

The ANOVA results examine whether there is a statistically significant difference in Value added per inhabitant between the four clusters.

H0: The means of “Value added per inhabitant” are equal across all groups
H1: At least one group mean is different

Since the p-value (0.0536) is slightly greater than the conventional significance level of 0.05, we fail to reject the null hypothesis. There is no statistically significant difference in Value added per inhabitant between the clusters at the 5% level, although the difference would be significant at the 10% level. Criterion validity is therefore not confirmed by this variable.
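The decision can be retraced from the printed table. The sketch below rebuilds the F statistic from the two Mean Sq values, converts it to a p-value, and compares it with the 5% critical value. The variable names are illustrative.

```r
# Values copied from the printed ANOVA table
df_between <- 3
df_resid   <- 89
ms_between <- 105.18
ms_resid   <- 39.67

# F statistic: between-group mean square over residual mean square
f_value <- ms_between / ms_resid

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

# Critical value at the 5% level, and the resulting decision
f_crit    <- qf(0.95, df_between, df_resid)
reject_h0 <- f_value > f_crit

round(c(F = f_value, p = p_value), 4)
```

Up to rounding, this should reproduce the printed F value and p-value. Because the F statistic falls just short of the critical value, the result is borderline.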

3 Conclusion

This analysis aimed to investigate whether Italian cities could be effectively grouped based on their crime patterns. The clustering results and the ANOVA tests on the clustering variables show that the cities divide into four distinct clusters, each with its own crime profile. The criterion validity check was less conclusive, since Value added per inhabitant did not differ significantly between the clusters at the 5% level (p = 0.0536). Even so, the clusters highlight the diversity in crime rates and types across Italy and provide valuable insights into safer and riskier regions.

Based on their characteristics, here are the clusters (creatively renamed):

The Quiet Retreats (Cluster 1): These cities, characterized by consistently below-average crime rates, are Italy’s safest havens. Violent crimes, property crimes, and drug-related offenses are all significantly lower here. Examples include tranquil locations like Venezia and Cremona. These are perfect destinations for a peaceful holiday with minimal concerns about safety;
The Hot Zones (Cluster 2): These cities face significantly higher crime rates in all categories, especially sexual and drug-related offenses. Urban metropolises like Palermo, Bergamo, and Bologna dominate this group. While rich in history and attractions, these cities might require extra caution for visitors due to their heightened crime levels;
The Balanced Hubs (Cluster 3): With crime rates hovering close to the national average, these cities represent well-balanced urban centers. Their moderate levels of crime suggest neither excessive danger nor unusually low activity. Cities like Rimini and Como fall into this cluster, offering a mix of urban life and safety. These could also make for a great holiday choice, depending on one’s interests;
The Urban Beacons (Cluster 4): Cities in this group exhibit moderately elevated crime levels across most categories but aren’t extreme outliers. They represent thriving urban centers with some challenges but still hold appeal. For instance, Modena and Padova belong here, offering culture, history, and a bit of vibrancy for travelers who enjoy bustling atmospheres.

3.1 Vacation Recommendations:

Thus, dear Professor, if you are seeking a serene and worry-free holiday, I’d suggest exploring Venezia or Cremona from the Quiet Retreats cluster. These cities promise a relaxing experience steeped in beauty and history, far removed from the hustle and bustle of urban hotspots. Alternatively, if a mix of urban excitement and reasonable safety appeals, Rimini or Como from the Balanced Hubs cluster would be excellent choices.

This clustering analysis not only answered the research question but also provided actionable insights. By identifying distinct crime patterns, it became possible to discern safer cities from those requiring closer attention for crime prevention strategies. Whether for policy or personal travel planning, the results offer both practical utility and intellectual intrigue.

Acknowledgments

I would like to extend my heartfelt thanks to our mutual friend, ChatGPT, for tirelessly providing advice and helping me elevate the aesthetics of this homework. From fonts to formatting, his wisdom and patience have been invaluable. Without him, this document might still look like it was crafted in the early 2000s. Cheers to modern technology!

Christian Lasalvia

MVA - R Homework 2

18th January 2025
