Inspired by the professor’s apt reference, when we think about Italy, the mind effortlessly wanders to its breathtaking cities, each a testament to history, art, and culture. From the romantic canals of Venice to the ancient streets of Rome, these cities tell stories that have shaped civilizations. However, beneath their beauty lies a more complex reality, one that involves the safety and security of these beloved places.
I decided to explore their safety from a statistical perspective. By analyzing a dataset brimming with crime statistics, I aim to uncover whether these cities can be grouped based on their security parameters. Can we identify clusters of cities that are remarkably safe, somewhat balanced, or perhaps in need of heightened safety measures? What insights can we derive from the numbers behind Italy’s most beautiful locations?
Padova, my city of origin
library(readr)
crime_data <- read.csv("~/Desktop/Italian cities_crime.csv", sep = ";")
# For aesthetic purposes, I show only the first four columns
head(crime_data[, 1:4])
## City.Provence Year Massacre Completed.voluntary.homicides
## 1 Agrigento 2024 0 3
## 2 Alessandria 2024 1 5
## 3 Ancona 2024 0 0
## 4 Arezzo 2024 0 1
## 5 Ascoli Piceno 2024 0 2
## 6 Asti 2024 0 0
The dataset was retrieved from https://github.com/1655653/VICrime-Visual-Analytics-Project?tab=readme-ov-file, which in turn obtained the data from the Ministry of the Interior.
First, I replace commas with dots in the Value added per inhabitant column and convert it to numeric format to ensure accurate calculations and analyses involving this variable.
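As a small aside, the conversion can be illustrated on a toy vector. This is only a sketch: the values and the `toy_csv` string are illustrative and not read from the real file. The second option uses the `dec` argument of `read.csv`, which parses decimal commas at read time.

```r
# Toy values with decimal commas (illustrative only)
raw <- c("13,9", "26,2", "27,6")

# Option 1: replace the comma, then convert to numeric
converted <- as.numeric(gsub(",", ".", raw, fixed = TRUE))
print(converted)

# Option 2: let read.csv parse decimal commas directly via `dec`
toy_csv <- "City;Value\nA;13,9\nB;26,2"
toy <- read.csv(text = toy_csv, sep = ";", dec = ",")
print(toy$Value)
```

Either route ends with a numeric column; the first works on data that is already loaded, while the second avoids the extra step when reading the file.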
# Replace commas with periods in the column
crime_data$Value.added.per.inhabitant <- gsub(",", ".", crime_data$Value.added.per.inhabitant)
# Convert the column to numeric
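crime_data$Value.added.per.inhabitant <- as.numeric(crime_data$Value.added.per.inhabitant)
In the following step, I will categorize the original variables in the dataset into six broader crime categories: Violent Crimes, Property Crimes, White-Collar Crimes, Cyber Crimes, Sexual and Moral Crimes, and Drug/Other Crimes.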
This is done to simplify the analysis by reducing the number of variables from 58 to 6, making the dataset easier to interpret and analyze. Each category represents a logical grouping of related crimes based on their nature and severity.
For example, Violent Crimes include offenses like homicides, assaults, and kidnappings, while Property Crimes encompass thefts, robberies, and damages. The variables within each category were summed to create aggregate measures, enabling a more focused analysis of crime patterns across Italian provinces. This approach helps to identify trends and patterns in broader crime categories while maintaining the integrity of the original data.
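The summation can be sketched on a toy data frame. The column names and counts below are hypothetical, chosen only to show how `rowSums` behaves when a count is missing; whether the real dataset contains missing values is not something this sketch assumes.

```r
# Toy data frame with hypothetical counts (not the real dataset)
toy <- data.frame(
  City = c("A", "B"),
  Thefts = c(100, 250),
  Robberies = c(10, NA),   # NA marks a missing count
  Damages = c(30, 40)
)

property_cols <- c("Thefts", "Robberies", "Damages")

# Default behaviour: a single NA makes the whole row sum NA
toy$Property_strict <- rowSums(toy[, property_cols])

# With na.rm = TRUE, missing counts are skipped (treated as zero)
toy$Property_lenient <- rowSums(toy[, property_cols], na.rm = TRUE)

print(toy)
```

The strict version makes missing data visible, while the lenient version silently undercounts, so the choice depends on how the gaps should be handled.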
# Define categories and their subcategories
crime_data$Violent_Crimes <- rowSums(crime_data[, c('Massacre',
'Completed.voluntary.homicides',
'Completed.voluntary.homicides.for.theft.or.robbery',
'Completed.voluntary.homicides.of.a.mafia.type',
'Completed.voluntary.homicides.for.terrorist.purposes',
'Attempted.homicides',
'Infanticides',
'Beatings',
'Intentional.injuries',
'Sexual.violence',
'Kidnappings')])
crime_data$Property_Crimes <- rowSums(crime_data[, c('Thefts',
'Snatch.thefts',
'Pickpocketing',
'Burglaries.in.homes',
'Thefts.in.commercial.establishments',
'Thefts.from.parked.cars',
'Thefts.of.artworks.and.archaeological.material',
'Thefts.of.heavy.vehicles.transporting.goods',
'Thefts.of.mopeds',
'Thefts.of.motorcycles',
'Thefts.of.cars',
'Robberies',
'Robberies.in.homes',
'Bank.robberies',
'Robberies.in.post.offices',
'Robberies.in.commercial.establishments',
'Street.robberies',
'Damages',
'Fires',
'Forest.fires',
'Damage.followed.by.fire')])
crime_data$WhiteCollar_Crimes <- rowSums(crime_data[, c('Mafia.type.association',
'Criminal.association',
'Money.laundering.and.use.of.illicit.funds',
'Usury',
'Counterfeiting.of.trademarks.and.industrial.products',
'Violation.of.intellectual.property',
'Receiving.stolen.goods')])
crime_data$Cyber_Crimes <- rowSums(crime_data[, c('Scams.and.computer.fraud',
'Cybercrimes')])
crime_data$Sexual_Moral_Crimes <- rowSums(crime_data[, c('Sexual.acts.with.minors',
'Corruption.of.minors',
'Exploitation.and.facilitation.of.prostitution',
'Child.pornography.and.possession.of.pedopornographic.material')])
crime_data$Drug_Other_Crimes <- rowSums(crime_data[, c('Drug.legislation',
'Attacks',
'Other.crimes')])
Let’s view the newly created dataset.
# Create a new dataset with City/Province, Value added per inhabitant and summed crime categories
Italian_cities_Crime <- crime_data[, c("City.Provence",
"Violent_Crimes",
"Property_Crimes",
"WhiteCollar_Crimes",
"Cyber_Crimes",
"Sexual_Moral_Crimes",
"Drug_Other_Crimes",
"Value.added.per.inhabitant")]
# View the new dataset
head(Italian_cities_Crime)
## City.Provence Violent_Crimes Property_Crimes WhiteCollar_Crimes
## 1 Agrigento 661 5218 1464
## 2 Alessandria 597 10980 2129
## 3 Ancona 594 7965 1716
## 4 Arezzo 487 7023 1013
## 5 Ascoli Piceno 241 3773 758
## 6 Asti 259 5230 1297
## Cyber_Crimes Sexual_Moral_Crimes Drug_Other_Crimes Value.added.per.inhabitant
## 1 1262 16 11327 13.9
## 2 1636 19 14882 26.2
## 3 1712 26 13263 27.6
## 4 1043 17 9982 26.4
## 5 494 3 5571 23.8
## 6 943 9 7254 23.6
This streamlined dataset simplifies analysis by grouping related crime types under broader categories, making it easier to identify patterns and trends. For example, for the province of Arezzo the dataset shows the following counts (Violent Crimes: 487; Property Crimes: 7023; White-Collar Crimes: 1013). Property crimes are therefore far more prevalent in Arezzo than violent or white-collar crimes. This kind of insight could help in allocating resources and devising crime prevention strategies tailored to the region’s specific needs.
# Set a wider console output width
options(width = 70)
library(psych)
# Generate descriptive statistics for the entire dataset
describe(Italian_cities_Crime)
## vars n mean sd median
## City.Provence* 1 105 53.00 30.45 53.0
## Violent_Crimes 2 105 821.21 940.86 560.0
## Property_Crimes 3 105 17394.86 31652.00 8574.0
## WhiteCollar_Crimes 4 105 2720.72 4468.42 1544.0
## Cyber_Crimes 5 105 2166.34 2925.56 1406.0
## Sexual_Moral_Crimes 6 105 19.04 23.97 11.0
## Drug_Other_Crimes 7 105 21878.66 33601.41 12948.0
## Value.added.per.inhabitant 8 105 24.34 6.90 24.1
## trimmed mad min max range
## City.Provence* 53.00 38.55 1.0 105.0 104.0
## Violent_Crimes 630.51 358.79 112.0 5889.0 5777.0
## Property_Crimes 10765.76 7117.96 943.0 210400.0 209457.0
## WhiteCollar_Crimes 1801.36 999.27 307.0 31310.0 31003.0
## Cyber_Crimes 1572.12 876.22 283.0 19375.0 19092.0
## Sexual_Moral_Crimes 13.59 8.90 0.0 145.0 145.0
## Drug_Other_Crimes 14952.99 9104.65 2237.0 220066.0 217829.0
## Value.added.per.inhabitant 23.94 7.71 13.9 49.7 35.8
## skew kurtosis se
## City.Provence* 0.00 -1.23 2.97
## Violent_Crimes 3.68 15.08 91.82
## Property_Crimes 4.63 23.58 3088.92
## WhiteCollar_Crimes 4.67 23.39 436.07
## Cyber_Crimes 4.16 18.79 285.51
## Sexual_Moral_Crimes 3.09 10.73 2.34
## Drug_Other_Crimes 4.43 21.64 3279.16
## Value.added.per.inhabitant 0.61 0.60 0.67
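Violent Crimes: the mean is 821.21 and the median 560, with values ranging from 112 to 5889.
Property Crimes: the mean is 17394.86 and the median 8574, with values ranging from 943 to 210400.
Sexual and Moral Crimes: the mean is 19.04 and the median 11, with values ranging from 0 to 145.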
Question: can Italian provinces be clustered into distinct groups based on their crime profiles?
For the clustering analysis, I selected six variables: Violent Crimes, Property Crimes, White-Collar Crimes, Cyber Crimes, Sexual and Moral Crimes, and Drug/Other Crimes. These categories capture distinct and critical dimensions of criminal activity, providing a comprehensive picture of crime patterns across Italian provinces.
I chose to exclude Value added per inhabitant from the clustering process and to use it afterwards to test criterion validity. This variable reflects the economic state of each province, which makes it a suitable benchmark for assessing how meaningful the resulting clusters are.
# Standardize the clustering variables
mydata_clu_std <- as.data.frame(scale(Italian_cities_Crime[, c("Violent_Crimes",
"Property_Crimes",
"WhiteCollar_Crimes",
"Cyber_Crimes",
"Sexual_Moral_Crimes",
"Drug_Other_Crimes")]))
# Calculate dissimilarity measure to find outliers
Italian_cities_Crime$Dissimilarity <- sqrt(mydata_clu_std$Violent_Crimes^2 +
mydata_clu_std$Property_Crimes^2 +
mydata_clu_std$WhiteCollar_Crimes^2 +
mydata_clu_std$Cyber_Crimes^2 +
mydata_clu_std$Sexual_Moral_Crimes^2 +
mydata_clu_std$Drug_Other_Crimes^2)
head(Italian_cities_Crime[order(-Italian_cities_Crime$Dissimilarity), c("City.Provence", "Dissimilarity")])
## City.Provence Dissimilarity
## 55 Milano 13.805188
## 81 Roma 13.353195
## 58 Napoli 8.233065
## 92 Torino 7.841460
## 33 Firenze 3.097972
## 14 Bologna 3.021125
The dissimilarity value is each city's Euclidean distance from the dataset average, computed on the standardized variables, so higher values indicate greater divergence from the typical city. Milano, Roma, Napoli, and Torino have much higher dissimilarity scores (ranging from 7.8 to 13.8) than the next cities, Firenze and Bologna (around 3). As large metropolitan areas, they naturally record higher values because of their size and complexity, which makes them outliers. I remove them so that they do not distort the clustering process.
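The dissimilarity computation can be written more compactly with `rowSums` on the squared standardized values. The sketch below uses a made-up five-city data frame in which city "E" is deliberately extreme; it also drops the outlier by name, which is an assumption of this sketch and a less fragile alternative to dropping by row position.

```r
# Hypothetical counts for five cities; the last one is deliberately extreme
toy <- data.frame(
  City = c("A", "B", "C", "D", "E"),
  Violent = c(200, 220, 210, 190, 900),
  Property = c(3000, 3200, 2900, 3100, 15000)
)

# Standardize each variable to mean 0 and standard deviation 1
z <- scale(toy[, c("Violent", "Property")])

# Euclidean distance of each city from the standardized mean (the origin)
toy$Dissimilarity <- sqrt(rowSums(z^2))

# Rank cities from most to least atypical
ranked <- toy[order(-toy$Dissimilarity), c("City", "Dissimilarity")]
print(ranked)

# Drop the outlier by name rather than by row index
toy_clean <- toy[!toy$City %in% c("E"), ]
```

Filtering by name keeps working if the row order changes, whereas hard-coded row numbers must be updated after every earlier removal.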
library(dplyr)
# Filter out the specified outlier cities using row indices
Italian_cities_Crime <- Italian_cities_Crime %>%
filter(!(row_number() %in% c(55, 81, 58, 92)))
# Standardize the clustering variables without the outliers
mydata_clu_std <- as.data.frame(scale(Italian_cities_Crime[, c("Violent_Crimes",
"Property_Crimes",
"WhiteCollar_Crimes",
"Cyber_Crimes",
"Sexual_Moral_Crimes",
"Drug_Other_Crimes")]))
head(Italian_cities_Crime[order(-Italian_cities_Crime$Dissimilarity), c("City.Provence", "Dissimilarity")])
## City.Provence Dissimilarity
## 33 Firenze 3.097972
## 14 Bologna 3.021125
## 37 Genova 2.872282
## 95 Venezia 2.182329
## 16 Brescia 2.088707
## 61 Palermo 1.894155
library(factoextra)
# Calculate Euclidean distances
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
# Visualize the distance matrix
fviz_dist(Distances, gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
library(factoextra)
# Hopkins statistics
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8313608
##
## $plot
## NULL
The Hopkins statistic, which measures clustering tendency, is 0.831. Being well above 0.5 and close to 1, this value indicates a strong clustering structure and supports the suitability of the data for cluster analysis. In addition, the distance matrix plot shows distinct squares, further supporting the existence of well-defined clusters within the data.
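To make the idea behind the statistic concrete, here is a minimal base R sketch of one common formulation, in which values near 0.5 suggest random data and values near 1 suggest clustered data. It is not necessarily the exact computation performed by `get_clust_tendency`; the function name `hopkins_sketch`, the simulated data, and the choice `m = 10` are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical data: two tight, well-separated groups in two dimensions
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2)
)

hopkins_sketch <- function(x, m = 10) {
  n <- nrow(x)
  d <- ncol(x)
  # m artificial points drawn uniformly from the bounding box of the data
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  unif <- sapply(seq_len(d), function(j) runif(m, lo[j], hi[j]))
  # m real points sampled without replacement
  idx <- sample(n, m)

  # u: distance from each artificial point to its nearest real point
  u <- apply(unif, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))
  # w: distance from each sampled real point to its nearest other real point
  w <- sapply(idx, function(i) {
    min(sqrt(colSums((t(x[-i, , drop = FALSE]) - x[i, ])^2)))
  })

  sum(u) / (sum(u) + sum(w))
}

h <- hopkins_sketch(x)
print(h)
```

When the data are clustered, real points sit close to their neighbours while uniformly placed points tend to fall in empty space, so the ratio moves towards 1.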
library(factoextra)
library(NbClust)
# Perform the Elbow Method
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
The elbow method helps determine the optimal number of clusters by plotting the total within-cluster sum of squares (WCSS) against the number of clusters. In the resulting plot, breaks are visible at 2 and 4 clusters, the points beyond which adding more clusters no longer reduces WCSS substantially. This suggests that either 2 or 4 clusters capture the structure of the dataset well, balancing simplicity and meaningful differentiation.
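The quantity behind the elbow plot can be computed directly with base R. The sketch below uses simulated data with three loose groups (an assumption for illustration, not the crime data) and collects `tot.withinss` from `kmeans` for each candidate number of clusters.

```r
set.seed(1)

# Hypothetical standardized data with three loose groups
x <- scale(rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2),
  matrix(rnorm(40, mean = 8), ncol = 2)
))

# Total within-cluster sum of squares for k = 1, ..., 8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is where the values stop dropping sharply;
# plot(ks, wss, type = "b") would draw the curve
print(round(wss, 1))
```

With a single cluster, the within-cluster sum of squares equals the total sum of squares, which gives the curve its starting point.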
# Perform the Silhouette analysis
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
The silhouette method evaluates how well each data point fits within its assigned cluster compared to other clusters. Here, the silhouette analysis indicates that 2 clusters provide the best separation and cohesion, as this configuration achieves the highest average silhouette width. The NbClust procedure, run with the k-means method, points to the same number.
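The average silhouette width can also be computed by hand with the `cluster` package, which ships with standard R installations. The data below are simulated with two separated groups, purely as an illustration of the calculation.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Hypothetical standardized data with two separated groups
x <- scale(rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
))
d <- dist(x)

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6
print(round(avg_sil, 3))

# The candidate k with the highest average width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(best_k)
```

Silhouette widths lie between -1 and 1, and a higher average means points sit closer to their own cluster than to the neighbouring one.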
library(NbClust)
# Determine the optimal number of clusters
nc <- NbClust(mydata_clu_std, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 11 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
# Perform k-means clustering with 2 clusters
Clustering <- kmeans(mydata_clu_std,
centers = 2,
nstart = 25)
Clustering
## K-means clustering with 2 clusters of sizes 86, 15
##
## Cluster means:
## Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1 -0.3435403 -0.3418753 -0.3162826 -0.3286798
## 2 1.9696312 1.9600853 1.8133536 1.8844311
## Sexual_Moral_Crimes Drug_Other_Crimes
## 1 -0.3050237 -0.3510029
## 2 1.7488024 2.0124168
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
## [33] 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 1
## [65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1
## [97] 1 2 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 130.96392 87.31167
## (between_SS / total_SS = 63.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
library(factoextra)
# Visualize the clusters
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
The plot suggests two options: remove the outliers that visibly distort the clustering, or step back and rerun the process with 4 groups. Considering this, the second break in the elbow plot at 4 clusters, and the eight NbClust indices that proposed 4 clusters, I have decided to step back and proceed with 4 clusters. This allows a more refined analysis of the distinct characteristics of the cities within each cluster.
# Perform k-means clustering with 4 clusters
Clustering <- kmeans(mydata_clu_std,
centers = 4,
nstart = 25)
Clustering
## K-means clustering with 4 clusters of sizes 13, 9, 37, 42
##
## Cluster means:
## Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1 1.07991184 0.8833743 0.9481110 0.95323298
## 2 2.22939005 2.4496302 2.2892278 2.32035110
## 3 -0.03810664 -0.1316975 -0.1159741 -0.05178042
## 4 -0.77841473 -0.6823269 -0.6818441 -0.74665031
## Sexual_Moral_Crimes Drug_Other_Crimes
## 1 0.23462134 0.96908733
## 2 2.63237012 2.45209958
## 3 -0.04470523 -0.09576122
## 4 -0.59731702 -0.74104443
##
## Clustering vector:
## [1] 3 3 3 4 4 4 3 2 4 4 4 2 4 2 3 2 4 1 4 4 1 2 3 4 3 3 4 4 3 4 4 3
## [33] 2 1 3 3 2 4 4 3 4 4 4 3 1 4 3 4 3 4 3 4 4 3 1 1 3 4 4 1 2 3 1 1
## [65] 4 3 4 3 3 4 4 3 4 3 3 3 4 3 4 1 3 3 4 3 4 3 4 4 3 3 3 3 3 1 2 4
## [97] 4 1 4 1 4
##
## Within cluster sum of squares by cluster:
## [1] 19.55126 41.70576 25.11829 13.35958
## (between_SS / total_SS = 83.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
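The results of the k-means clustering reveal some interesting insights: the four clusters contain 13, 9, 37, and 42 cities, and the ratio of between-cluster to total sum of squares rises from 63.6% with 2 clusters to 83.4% with 4 clusters.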
library(factoextra)
# Visualize the clusters
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
In the cluster plot, it can be clearly observed that some cities, such as Bari (ID 8) and Pescara (ID 66), lie far from their cluster centers. These cities act as outliers: their crime characteristics do not align closely with the general pattern of their respective clusters, which sets them apart from the rest of the data.
library(dplyr)
Italian_cities_Crime <- Italian_cities_Crime %>%
filter(!(row_number() %in% c(66, 92, 83, 79, 80, 8, 37, 16)))
# Standardize the clustering variables without the outliers
mydata_clu_std <- as.data.frame(scale(Italian_cities_Crime[, c("Violent_Crimes",
"Property_Crimes",
"WhiteCollar_Crimes",
"Cyber_Crimes",
"Sexual_Moral_Crimes",
"Drug_Other_Crimes")]))
# Perform k-means clustering with 4 clusters
Clustering <- kmeans(mydata_clu_std,
centers = 4,
nstart = 25)
Clustering
## K-means clustering with 4 clusters of sizes 40, 6, 33, 14
##
## Cluster means:
## Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 1 -0.78515932 -0.65995537 -0.70011337 -0.748978559
## 2 2.58502542 2.87124885 2.35015943 2.512993381
## 3 0.03029793 -0.08576926 -0.04162889 0.005629694
## 4 1.06402776 0.85722194 1.09123797 1.049671586
## Sexual_Moral_Crimes Drug_Other_Crimes
## 1 -0.57661797 -0.7243371
## 2 3.26065679 2.8233377
## 3 -0.05369907 -0.0398669
## 4 0.37663195 0.9535047
##
## Clustering vector:
## [1] 3 3 3 1 1 1 3 1 1 1 2 1 2 3 1 4 1 1 4 2 3 1 3 3 1 1 3 1 1 3 2 4 3
## [34] 3 1 1 3 1 1 1 4 4 1 3 1 3 1 3 1 1 3 4 4 3 1 1 4 2 3 4 4 1 1 3 3 1
## [67] 1 3 1 3 3 3 1 3 3 3 3 1 3 1 1 3 3 4 3 4 2 1 1 4 1 4 1
##
## Within cluster sum of squares by cluster:
## [1] 15.13867 29.57155 21.40680 21.83599
## (between_SS / total_SS = 84.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
library(factoextra)
# Visualization
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
After deleting the outliers, the ratio of between-cluster to total sum of squares improved slightly, from 83.4% to 84.1%, which is a positive sign. Visually, the clusters now also show a clearer separation with reduced outlier influence, which indicates more meaningful and consistent groupings in the data.
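The ratio reported in the k-means output can be extracted from the fitted object as `betweenss / totss`. The sketch below uses simulated data without any real cluster structure (an assumption for illustration) to show that the ratio also rises simply because more clusters are used.

```r
set.seed(1)

# Hypothetical standardized data without real cluster structure
x <- scale(matrix(rnorm(200), ncol = 2))

# Share of total variation explained by the cluster assignment
ratio <- function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  km$betweenss / km$totss
}

r2 <- ratio(2)
r4 <- ratio(4)
print(round(c(k2 = r2, k4 = r4), 3))
```

Because the ratio grows with the number of clusters even on structureless data, it is most informative when comparing solutions with the same number of clusters.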
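# Store the cluster centers (averages of the standardized variables)
Averages <- Clustering$centers
Averages
## Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes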
## 1 -0.78515932 -0.65995537 -0.70011337 -0.748978559
## 2 2.58502542 2.87124885 2.35015943 2.512993381
## 3 0.03029793 -0.08576926 -0.04162889 0.005629694
## 4 1.06402776 0.85722194 1.09123797 1.049671586
## Sexual_Moral_Crimes Drug_Other_Crimes
## 1 -0.57661797 -0.7243371
## 2 3.26065679 2.8233377
## 3 -0.05369907 -0.0398669
## 4 0.37663195 0.9535047
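The cluster averages reveal clear differences in crime patterns among the four clusters. Cluster 1 (40 cities) lies below average on every variable (between -0.58 and -0.79). Cluster 2 (6 cities) lies far above average on every variable (between 2.35 and 3.26). Cluster 3 (33 cities) stays close to the overall average (between -0.09 and 0.03). Cluster 4 (14 cities) is moderately above average (between 0.38 and 1.09).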
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
# Transforming the data for visualization
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Violent_Crimes", "Property_Crimes", "WhiteCollar_Crimes", "Cyber_Crimes", "Sexual_Moral_Crimes", "Drug_Other_Crimes"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3, 4),
labels = c("1", "2", "3", "4"))
Figure$NameF <- factor(Figure$name,
levels = c("Violent_Crimes", "Property_Crimes", "WhiteCollar_Crimes", "Cyber_Crimes", "Sexual_Moral_Crimes", "Drug_Other_Crimes"),
labels = c("Violent Crimes", "Property Crimes", "White Collar Crimes", "Cyber Crimes", "Sexual Moral Crimes", "Drug/Other Crimes"))
# Visualizing with ggplot
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, color = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster Variables") +
ylim(-2, 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
The profile plot makes these distinctions visible. Cluster 2 sits far above the others on every variable, peaking on Sexual and Moral Crimes and Property Crimes, while Cluster 1 runs uniformly below average. These findings can help policymakers and law enforcement tailor interventions to city-specific challenges.
Let’s see how my city of origin, Padova, is performing.
# Show cluster averages for Padova's cluster (Cluster 4)
Cluster_averages <- Averages[4, ] # Cluster 4 averages
print(Cluster_averages)
## Violent_Crimes Property_Crimes WhiteCollar_Crimes
## 1.0640278 0.8572219 1.0912380
## Cyber_Crimes Sexual_Moral_Crimes Drug_Other_Crimes
## 1.0496716 0.3766320 0.9535047
# Filter the dataset for the 57th row (Padova's row)
Padova_values <- mydata_clu_std[57, ]
# Display Padova's standardized values for all variables
print(Padova_values)
## Violent_Crimes Property_Crimes WhiteCollar_Crimes Cyber_Crimes
## 57 1.565188 1.495802 2.033737 2.23468
## Sexual_Moral_Crimes Drug_Other_Crimes
## 57 0.3059806 1.717484
Padova, which belongs to Cluster 4, unfortunately scores above its cluster's averages in every category except Sexual and Moral Crimes (0.31 vs. 0.38). The gap is largest in Cyber Crimes (2.23 vs. 1.05) and White-Collar Crimes (2.03 vs. 1.09), which points to elevated levels of computer-related and white-collar offenses.
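The comparison between a city and its cluster can be expressed as a simple difference of vectors. The numbers and variable names below are hypothetical and only illustrate the mechanics.

```r
# Hypothetical cluster centers and one city's standardized values
centers <- rbind(
  "1" = c(Violent = -0.8, Cyber = -0.7),
  "2" = c(Violent = 1.1, Cyber = 1.0)
)
city <- c(Violent = 1.6, Cyber = 2.2)
city_cluster <- "2"

# Positive differences mean the city is above its cluster's average
diff_from_center <- city - centers[city_cluster, ]
print(diff_from_center)
```

Sorting these differences would show at a glance which categories set a city apart from the other members of its cluster.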
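# Attach each city's cluster membership (Group is used below; assumed here to be the k-means assignment)
Italian_cities_Crime$Group <- Clustering$cluster
# Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Violent_Crimes, Property_Crimes, WhiteCollar_Crimes, Cyber_Crimes, Sexual_Moral_Crimes, Drug_Other_Crimes) ~ as.factor(Group),
data = Italian_cities_Crime)
summary(fit)
## Response Violent_Crimes :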
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 11511370 3837123 210.46 < 2.2e-16 ***
## Residuals 89 1622682 18232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Property_Crimes :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 8713150574 2904383525 157.48 < 2.2e-16 ***
## Residuals 89 1641367067 18442327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response WhiteCollar_Crimes :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 126812537 42270846 91.498 < 2.2e-16 ***
## Residuals 89 41116667 461985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Cyber_Crimes :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 70682862 23560954 138.35 < 2.2e-16 ***
## Residuals 89 15156235 170295
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sexual_Moral_Crimes :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 13676.0 4558.7 183.1 < 2.2e-16 ***
## Residuals 89 2215.9 24.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Drug_Other_Crimes :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 1.0714e+10 3571477254 232.64 < 2.2e-16 ***
## Residuals 89 1.3663e+09 15352022
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The extremely low p-values (< 0.001) across all crime categories confirm that the clustering successfully differentiates cities based on their crime patterns. Each cluster represents distinct profiles, with meaningful differences across all analyzed variables. This supports the validity of the clustering approach in identifying city groups with varying levels and types of criminal activity.
To evaluate the criterion validity of my clustering solution, I selected Value added per inhabitant as a numerical variable to serve as the benchmark for the test. This variable, derived from the economic contributions of various sectors such as agriculture, industry, construction, commerce, financial services, and other services, offers a meaningful way to assess whether the identified clusters differ significantly in terms of their economic output. By comparing the clusters based on this measure, I aim to validate whether the grouping captures underlying patterns linked to the socio-economic characteristics of Italian cities.
# Aggregate the means of Value added per inhabitant by cluster group
aggregate(Italian_cities_Crime$Value.added.per.inhabitant,
by = list(Cluster = Italian_cities_Crime$Group),
FUN = mean)
## Cluster x
## 1 1 21.96000
## 2 2 28.38333
## 3 3 24.66061
## 4 4 25.25714
library(car)
# Perform Levene's test
leveneTest(Italian_cities_Crime$Value.added.per.inhabitant, as.factor(Italian_cities_Crime$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.9954 0.1204
## 89
Levene’s test checks whether the variances across groups are homogeneous.
Since the p-value (0.1204) is greater than the significance level (0.05), we fail to reject the null hypothesis. This means there is no significant evidence to suggest that the variances across the groups are different.
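The median-centered version of Levene’s test amounts to a one-way ANOVA on the absolute deviations from each group's median. The base R sketch below shows this on simulated groups; the data, group labels, and seed are assumptions for illustration.

```r
set.seed(7)

# Hypothetical values for three groups with similar spread
y <- c(rnorm(20, mean = 22, sd = 6),
       rnorm(20, mean = 25, sd = 6),
       rnorm(20, mean = 28, sd = 6))
g <- factor(rep(c("1", "2", "3"), each = 20))

# Absolute deviation of each value from its group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations gives the median-centered Levene statistic
levene_tab <- anova(lm(abs_dev ~ g))
levene_p <- levene_tab[["Pr(>F)"]][1]
print(levene_tab)
```

A large p-value here means the typical distance from the group median is similar across groups, which is the homogeneity assumption ANOVA relies on.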
library(rstatix)
# Perform Shapiro-Wilk test for each cluster
Italian_cities_Crime %>%
group_by(as.factor(Italian_cities_Crime$Group)) %>%
shapiro_test(Value.added.per.inhabitant)
## # A tibble: 4 × 4
## `as.factor(Italian_cities_Crime$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Value.added… 0.961 0.186
## 2 2 Value.added… 0.851 0.160
## 3 3 Value.added… 0.951 0.138
## 4 4 Value.added… 0.917 0.198
For all groups, the p-values are greater than 0.05 (0.186 for Group 1, 0.160 for Group 2, 0.138 for Group 3, and 0.198 for Group 4). Thus, we fail to reject the null hypothesis of normality, and I treat the “Value added per inhabitant” variable as approximately normally distributed within each group.
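The same per-group check can be run in base R with `shapiro.test` and `tapply`. The values below are simulated and the two groups are hypothetical.

```r
set.seed(3)

# Hypothetical values for two groups
y <- c(rnorm(30, mean = 22, sd = 6), rnorm(30, mean = 25, sd = 6))
g <- factor(rep(c("1", "2"), each = 30))

# Shapiro-Wilk p-value within each group
shapiro_p <- tapply(y, g, function(v) shapiro.test(v)$p.value)
print(round(shapiro_p, 3))
```

A p-value above 0.05 means the test found no evidence against normality; it does not prove that the data are normal, especially in small groups.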
Since both assumptions for ANOVA (normality and homogeneity of variances) are satisfied, I can proceed with the ANOVA test.
# Perform ANOVA
fit <- aov(cbind(Value.added.per.inhabitant) ~ as.factor(Group),
data = Italian_cities_Crime)
# View the summary of the ANOVA
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 3 316 105.18 2.651 0.0536 .
## Residuals 89 3531 39.67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results examine whether there is a statistically significant difference in Value added per inhabitant between the four clusters.
Since the p-value (0.0536) is slightly greater than the conventional significance level of 0.05, we fail to reject the null hypothesis. There is thus no statistically significant difference in Value added per inhabitant between the clusters at the 5% level, although the difference would be significant at the 10% level.
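As a quick arithmetic check, the F value and p-value can be recomputed from the mean squares and degrees of freedom in the ANOVA table above. Because the inputs are the rounded figures from that table, small rounding differences are possible.

```r
# Values copied from the ANOVA table in this report
ms_group <- 105.18
ms_resid <- 39.67
df_group <- 3
df_resid <- 89

# F is the ratio of between-group to residual mean squares
f_value <- ms_group / ms_resid

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

# F value that would be needed for significance at the 5% level
crit <- qf(0.95, df_group, df_resid)

print(c(F = f_value, p = p_value, critical_5pct = crit))
```

The F value falls just short of the 5% critical value, which is why the result is borderline rather than clearly non-significant.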
This analysis aimed to investigate whether Italian cities could be grouped meaningfully by their crime patterns. The clustering and the ANOVA tests on the clustering variables show that the cities divide into four distinct clusters, each with its own crime profile, although the clusters do not differ significantly in Value added per inhabitant at the 5% level. These clusters highlight the diversity in crime levels and types across Italy and provide insights into safer and riskier areas.
Based on their characteristics, here are the clusters (creatively renamed):
Thus, dear Professor, if you are seeking a serene and worry-free holiday, I’d suggest exploring Venezia or Cremona from the Quiet Retreats cluster. These cities promise a relaxing experience steeped in beauty and history, far removed from the hustle and bustle of urban hotspots. Alternatively, if a mix of urban excitement and reasonable safety appeals, Rimini or Como from the Balanced Hubs cluster would be excellent choices.
This clustering analysis not only answered the research question but also provided actionable insights. By identifying distinct crime patterns, it became possible to discern safer cities from those requiring closer attention for crime prevention strategies. Whether for policy or personal travel planning, the results offer both practical utility and intellectual intrigue.
I would like to extend my heartfelt thanks to our mutual friend, ChatGPT, for tirelessly providing advice and helping me elevate the aesthetics of this homework. From fonts to formatting, his wisdom and patience have been invaluable. Without him, this document might still look like it was crafted in the early 2000s. Cheers to modern technology!
Christian Lasalvia