This report presents an end-to-end analysis of criminal conviction data for the years 2014 to 2016. It aims to uncover patterns, trends, and anomalies in criminal activity over time, using data wrangling, statistical summarization, and machine learning. By integrating data from multiple sources and years, the project gives a consolidated view of how different offense types have changed, how frequently they occur, and how they relate to one another.
The analytical process begins with importing and merging the yearly datasets into a unified structure, followed by data cleaning and preprocessing to ensure quality and consistency. Exploratory data analysis (EDA) is then applied to summarize and visualize key features of the data, showing the most common offense types, temporal trends, and correlations between variables. Building on this foundation, the report applies linear regression, clustering algorithms, and classification models to predict crime-related variables and to segment the dataset into meaningful groups. These models give further insight into the structure of the data and help assess how feasible it is to predict crime patterns from historical trends.
The goal is not only to describe the past but also to support proactive decision-making through data-driven forecasting and pattern recognition. For academic research, public policy development, or resource allocation, the study offers a framework for understanding conviction patterns through data.
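The project adopts a structured approach to data science, involving data import and merging, cleaning and preprocessing, exploratory data analysis, and predictive modeling.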
The following R libraries are loaded and used in this analysis:
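# Install once if needed (commented out so the script does not reinstall on every run)
# install.packages("markdown")

# Load necessary libraries
library(tidyverse)    # dplyr, tidyr, forcats, and related packages
library(janitor)      # clean_names()
library(lubridate)
library(readr)
library(ggplot2)
library(caret)        # createDataPartition(), confusionMatrix()
library(cluster)
library(factoextra)   # fviz_cluster(), fviz_nbclust()
library(e1071)        # naiveBayes()
library(randomForest)
library(markdown)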
To organize and access the dataset files efficiently, a base directory path is defined. This allows the script to dynamically locate CSV files stored in year-wise subfolders without hardcoding file paths repeatedly.
The base directory is set as follows:
{r set-base-dir}
# Set base directory
base_dir <- "/Users/kaustubhawaghade/dataset/"
To efficiently load datasets stored in different subfolders for each year, a custom function is defined to extract CSV file paths based on the year.
The extract_csv_files_by_year() function automates the process of navigating through subdirectories and collecting the paths to all .csv files. This makes the loading process dynamic and scalable, especially when working with structured folders (e.g., one folder per year).
The function works as follows:
• It takes a base directory and a vector of target years.
• For each year, it builds the path to the corresponding folder.
• If the folder exists, it lists all CSV files within that folder using list.files().
• The resulting file paths are stored in a named list, with each year as a key.
Here is the implementation:
{r extract-csv}
# Function to extract CSVs by year
extract_csv_files_by_year <- function(base_dir, years) {
  files_by_year <- list()
  for (year in years) {
    folder_path <- file.path(base_dir, year)
    if (dir.exists(folder_path)) {
      # Escape the dot so the pattern matches only names ending in ".csv"
      csv_files <- list.files(path = folder_path, pattern = "\\.csv$", full.names = TRUE)
      files_by_year[[year]] <- csv_files
    }
  }
  return(files_by_year)
}

# Years are character strings so they serve as both folder names and list keys
years <- c("2014", "2015", "2016")
files_by_year <- extract_csv_files_by_year(base_dir, years)
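The folder-per-year lookup can be tried without the real dataset. The sketch below is only an illustration: it builds a throwaway directory tree under tempdir(), and the names demo_dir, demo_years, demo_files, and the file names are made up for this example, not taken from the report's data.

```r
# Build a temporary folder-per-year layout (hypothetical file names).
demo_dir <- file.path(tempdir(), "conviction_demo")
for (yr in c("2014", "2015")) {
  dir.create(file.path(demo_dir, yr), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(demo_dir, yr, paste0("jan_", yr, ".csv")))
}
file.create(file.path(demo_dir, "2014", "notes.txt"))  # not a CSV, should be ignored

# Same lookup idea written with lapply(); there is no 2016 folder here.
demo_years <- c("2014", "2015", "2016")
demo_files <- lapply(
  setNames(demo_years, demo_years),
  function(yr) list.files(file.path(demo_dir, yr), pattern = "\\.csv$", full.names = TRUE)
)

# Drop years that produced no CSV paths (the 2016 folder does not exist).
demo_files <- Filter(length, demo_files)
print(lapply(demo_files, basename))
```

Only the 2014 and 2015 entries remain, and the text file is not picked up because the pattern anchors on the .csv extension.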
After extracting the CSV file paths for each year, the next step is to merge all the individual CSV files into a single unified dataset. This is done using the merge_csv_data() function, which reads each CSV, adds a year column, and combines them into one large data frame.
The function works as follows:
• It accepts a list of CSV file paths grouped by year.
• For each year and corresponding CSV file, it reads the data into a data frame.
• If the date column is present, it ensures the date is converted to Date format.
• The year for each record is added as a new column to track the source of the data.
• All the individual data frames are stored in a list, and then bind_rows() combines them into a single large data frame.
The resulting merged dataset has rows from multiple years and a year column for easy filtering and comparison.
Here is the implementation:
{r merge-csv}
# Function to merge CSVs
merge_csv_data <- function(files_by_year) {
  all_data <- list()
  for (year in names(files_by_year)) {
    for (csv_file in files_by_year[[year]]) {
      df <- read_csv(csv_file)
      # Convert the date column to Date when the file has one
      if ("date" %in% colnames(df)) df$date <- as.Date(df$date)
      df$year <- year  # record which yearly folder the rows came from
      all_data[[length(all_data) + 1]] <- df
    }
  }
  # Stack all files, then standardize column names to snake_case
  bind_rows(all_data) %>% clean_names()
}

final_data <- merge_csv_data(files_by_year)
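To see how row-binding behaves when the yearly files do not share exactly the same columns, here is a small sketch on in-memory tables. The tables toy_2014 and toy_2015 and their values are invented for illustration, and the use of the `.id` argument is an alternative way of tagging the year, not the approach taken in merge_csv_data().

```r
library(dplyr)

# Two toy tables standing in for yearly CSV files (made-up values).
toy_2014 <- data.frame(area = c("A", "B"), convictions = c(10, 12))
toy_2015 <- data.frame(area = c("A", "B"), convictions = c(11, 9), extra = c(1, 2))

# bind_rows() on a named list: `.id` turns the list names into a column,
# and columns missing from one table are filled with NA.
toy_merged <- bind_rows(list(`2014` = toy_2014, `2015` = toy_2015), .id = "year")
print(toy_merged)
```

The NA values introduced for columns that exist in only some files are one reason the missing-value summary in the cleaning step is worth inspecting before modeling.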
Data cleaning is a critical part of the data analysis process. This section explains the steps taken to ensure the data is consistent, complete, and ready for analysis.
First, we summarize the number of missing values in each column using the summarise() function. This gives an overview of how much data is missing for each variable.
# Cleaning steps
# Count missing values per column, most incomplete columns first
missing_summary <- final_data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "column", values_to = "missing_count") %>%
  arrange(desc(missing_count))
print(missing_summary)

final_data_cleaned <- final_data %>%
  # Drop columns that are at least 90% missing
  select(where(~ mean(is.na(.)) < 0.9)) %>%
  # Impute numeric gaps with the column median
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) %>%
  # Label missing text and factor values as "Unknown"
  mutate(across(where(is.character), ~ replace_na(., "Unknown"))) %>%
  mutate(across(where(is.factor), ~ fct_explicit_na(., na_level = "Unknown"))) %>%
  # Remove exact duplicate rows, then store text columns as factors
  distinct() %>%
  mutate(across(where(is.character), as.factor)) %>%
  clean_names()
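The effect of these cleaning rules is easier to see on a tiny example. The following base R sketch applies the same three ideas (missing counts, the 90% column filter, and median or "Unknown" imputation) to a made-up data frame; the column names and values are hypothetical.

```r
# Toy data frame with gaps (made-up values).
toy <- data.frame(
  convictions = c(1, NA, 3, 10),
  mostly_empty = c(NA, NA, NA, NA),
  area = c("A", NA, "B", "B"),
  stringsAsFactors = FALSE
)

# Missing values per column, most incomplete first.
missing_count <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(missing_count)

# Keep columns that are less than 90% missing.
keep <- colMeans(is.na(toy)) < 0.9
toy <- toy[, keep, drop = FALSE]

# Median imputation for the numeric column, "Unknown" for the text column.
toy$convictions <- ifelse(is.na(toy$convictions),
                          median(toy$convictions, na.rm = TRUE),
                          toy$convictions)
toy$area[is.na(toy$area)] <- "Unknown"
print(toy)
```

The median is less sensitive to extreme counts than the mean, but any single-value imputation reduces the variance of the column, which is worth keeping in mind when reading the later summary statistics.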
In this section, we perform descriptive analytics to summarize key characteristics of the criminal conviction data. The analysis involves exploring the number of records per year, calculating summary statistics, visualizing correlations between different variables, and analyzing the distribution of specific offenses over time.
The first analysis step involves counting the number of records available for each year. This provides an overview of how the dataset is distributed across the years.
We visualize this with a bar plot:
# 1. Records per Year
final_data_cleaned %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col(fill = "steelblue") +
  labs(title = "Number of Records by Year", x = "Year", y = "Count") +
  theme_minimal()
Next, we calculate basic summary statistics (mean, median, and standard deviation) for the numeric variables in the dataset. This helps to understand the central tendencies and spread of the data.
# 2. Summary Statistics for Numeric Columns
final_data_cleaned %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), list(
    mean = ~ mean(., na.rm = TRUE),
    median = ~ median(., na.rm = TRUE),
    sd = ~ sd(., na.rm = TRUE)
  ))) %>%
  print()
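As a rough guide to reading these statistics, the base R sketch below computes the same three measures on two invented columns. The column names and values are assumptions made for the example only.

```r
# Toy conviction counts (made-up values).
toy <- data.frame(burglary = c(2, 4, 9), robbery = c(1, 1, 4))

# One column of results per variable, one row per statistic.
stats <- sapply(toy, function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(round(stats, 2))
```

In both toy columns the mean sits above the median, which is the usual sign of a right-skewed count variable; the same comparison can be made on the real output.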
We then compute the pairwise correlations between all numeric variables in the dataset. A heatmap visualizes these correlations, with the fill color of each tile encoding the correlation value.
# 3. Correlation Heatmap
final_data_cleaned %>%
  select(where(is.numeric)) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") %>%
  ggplot(aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "", y = "", fill = "Correlation")
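The two steps behind the heatmap, computing the matrix with use = "complete.obs" and reshaping it to one row per variable pair, can be illustrated in base R. The data below are invented, and as.table() is used here as a base R stand-in for the pivot_longer() reshaping in the chunk above.

```r
# Toy data with one missing value (made-up numbers).
toy <- data.frame(
  theft = c(10, 20, 30, 40, NA),
  burglary = c(1, 2, 3, 4, 5),
  robbery = c(4, 3, 2, 1, 0)
)

# "complete.obs" drops every row containing an NA before computing correlations.
corr <- cor(toy, use = "complete.obs")

# Long format (one row per variable pair), the shape a tile plot needs.
corr_long <- as.data.frame(as.table(corr))
names(corr_long) <- c("var1", "var2", "correlation")
print(corr_long)
```

Because complete.obs removes whole rows, a single sparse column can shrink the sample used for every pair, so it helps to check how many complete rows remain.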
A boxplot is used to visualize the distribution of burglary convictions across different years. This allows us to see the spread, median, and outliers for each year.
# 4. Boxplot: Distribution of convictions across years
final_data_cleaned %>%
  ggplot(aes(x = year, y = number_of_burglary_convictions)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Burglary Convictions by Year", x = "Year", y = "Count") +
  theme_minimal()
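For readers who want the numbers behind a box, base R's boxplot.stats() returns the whisker ends, hinges, and median, along with the points flagged as outliers. The vector below is a made-up example, not data from the report.

```r
# Toy counts with one extreme value (made-up numbers).
toy_counts <- c(1, 2, 3, 4, 5, 100)

bs <- boxplot.stats(toy_counts)
print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points more than 1.5 * IQR beyond the hinges
```

The upper whisker stops at the largest value that is not flagged, so a single extreme count does not stretch the box itself.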
Lastly, a density plot is created to visualize the distribution of drug offense convictions for each year. This plot helps to understand how the density of convictions varies over time.
# 5. Density Plot: Drugs Offence Convictions
final_data_cleaned %>%
  ggplot(aes(x = number_of_drugs_offences_convictions, fill = year)) +
  geom_density(alpha = 0.4) +
  labs(title = "Density of Drugs Offence Convictions by Year", x = "Convictions", y = "Density") +
  theme_minimal()
In this section, predictive analytics techniques are applied to uncover deeper insights from the data and to develop models that can forecast or classify outcomes based on historical patterns. Three major predictive methods are used: linear regression, clustering, and classification models.
1. Linear Regression Models
The first regression model predicts the number of theft and handling convictions from drug, burglary, and robbery convictions. This helps identify which offense types are statistically associated with theft.
# ---- LINEAR REGRESSION 1 ----
lm_data1 <- final_data_cleaned %>%
  select(number_of_theft_and_handling_convictions,
         number_of_drugs_offences_convictions,
         number_of_burglary_convictions,
         number_of_robbery_convictions) %>%
  na.omit()
lm_model1 <- lm(number_of_theft_and_handling_convictions ~ ., data = lm_data1)
summary(lm_model1)
The second model predicts motoring-related offenses using public order, fraud, and criminal damage convictions. These regression models help evaluate how certain categories of crime relate to one another.
# ---- LINEAR REGRESSION 2 ----
lm_data2 <- final_data_cleaned %>%
  select(number_of_motoring_offences_convictions,
         number_of_public_order_offences_convictions,
         number_of_fraud_and_forgery_convictions,
         number_of_criminal_damage_convictions) %>%
  na.omit()
lm_model2 <- lm(number_of_motoring_offences_convictions ~ ., data = lm_data2)
summary(lm_model2)
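To make the regression output easier to interpret, the sketch below fits the same kind of model on simulated data where the true coefficients are known. Everything here is an assumption for illustration: the variable names echo the report's columns, but the values are generated with rnorm() and the coefficients 5, 2, and 3 are arbitrary.

```r
set.seed(42)
n <- 200
drugs <- rnorm(n, mean = 50, sd = 10)
burglary <- rnorm(n, mean = 20, sd = 5)

# Simulated response: known coefficients plus a small amount of noise.
theft <- 5 + 2 * drugs + 3 * burglary + rnorm(n, sd = 1)
toy <- data.frame(theft, drugs, burglary)

fit <- lm(theft ~ ., data = toy)
print(round(coef(fit), 2))      # estimates should land near 5, 2, and 3
print(summary(fit)$r.squared)   # share of variance explained
print(confint(fit))             # 95% confidence intervals for each coefficient
```

On real conviction counts the coefficients describe association only; offense categories that rise and fall together across areas can produce a strong fit without one driving the other.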
2. Clustering
Clustering techniques are used to uncover natural groupings or structures within the data based on numeric features.
K-means Clustering
# ---- CLUSTERING 1: K-means ----
cluster_data1 <- final_data_cleaned %>%
  select(where(is.numeric)) %>%
  na.omit()
scaled1 <- scale(cluster_data1)
set.seed(123)  # k-means starts from random centers; fix the seed for reproducible clusters
kmeans1 <- kmeans(scaled1, centers = 3)
fviz_cluster(list(data = scaled1, cluster = kmeans1$cluster)) +
  labs(title = "K-Means Clustering 1")
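The following sketch shows k-means on simulated data with two obvious groups, plus a simple elbow check for choosing the number of clusters. The data are generated, not taken from the report, and the nstart = 25 setting and the elbow check are additions for illustration rather than part of the analysis above.

```r
set.seed(123)

# Two well-separated blobs of 20 points each (simulated).
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
scaled <- scale(toy)

# nstart > 1 reruns k-means from several random starts and keeps the best fit.
km <- kmeans(scaled, centers = 2, nstart = 25)
print(table(km$cluster, rep(c("blob_1", "blob_2"), each = 20)))

# Elbow check: total within-cluster sum of squares for k = 1..5.
wss <- sapply(1:5, function(k) kmeans(scaled, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 2))
```

A sharp drop in the within-cluster sum of squares followed by a flat tail suggests where adding more clusters stops paying off.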
Hierarchical Clustering creates a dendrogram to visualize nested groupings of similar observations. We also use the silhouette method to estimate the optimal number of clusters.
# ---- CLUSTERING 2: Hierarchical Clustering with Visualization ----
cluster_data_hc <- final_data_cleaned %>%
  select(where(is.numeric)) %>%
  na.omit()
scaled_hc <- scale(cluster_data_hc)
dist_matrix <- dist(scaled_hc)
hc <- hclust(dist_matrix, method = "complete")
plot(hc, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.5)

# Silhouette method to estimate the number of clusters
fviz_nbclust(scaled_hc, FUN = hcut, method = "silhouette") +
  labs(title = "Optimal Number of Clusters - Silhouette")

# Cut the tree into k groups and store the labels
# (assumes na.omit() dropped no rows, so the labels align with final_data_cleaned)
k_hc <- 3
hc_clusters <- cutree(hc, k = k_hc)
final_data_cleaned$hc_cluster <- factor(hc_clusters)

fviz_cluster(list(data = scaled_hc, cluster = hc_clusters),
             geom = "point", ellipse.type = "convex", repel = TRUE,
             main = "Hierarchical Clustering (PCA View)")
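The silhouette criterion can also be computed directly, which makes it possible to pick the number of clusters in code rather than reading it off the plot. This is a minimal sketch on simulated data with three groups; the names hc_toy, avg_sil, and best_k are hypothetical, and it uses silhouette() from the cluster package.

```r
library(cluster)  # silhouette()

set.seed(123)
# Three well-separated blobs of 15 points each (simulated).
toy <- rbind(
  matrix(rnorm(30, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(30, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(30, mean = 10, sd = 0.5), ncol = 2)
)
d <- dist(scale(toy))
hc_toy <- hclust(d, method = "complete")

# Average silhouette width for k = 2..6; higher means tighter, better separated clusters.
avg_sil <- sapply(2:6, function(k) {
  sil <- silhouette(cutree(hc_toy, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

Feeding a value chosen this way into cutree() keeps the cluster count tied to the data instead of a fixed constant.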
3. Classification Models
Classification techniques are used to predict the year of a criminal record based on its numeric features. Two models are used: Naive Bayes and Random Forest.
Naive Bayes Classifier
# ---- CLASSIFICATION 1: Naive Bayes ----
class_data <- final_data_cleaned %>%
  select(where(is.numeric), year) %>%
  mutate(year = as.factor(year)) %>%
  na.omit()
set.seed(123)
# 80/20 train/test split; [, 1] turns the one-column index matrix into a plain
# vector, which row indexing on a tibble requires
split <- createDataPartition(class_data$year, p = 0.8, list = FALSE)[, 1]
train1 <- class_data[split, ]
test1 <- class_data[-split, ]
nb_model <- naiveBayes(year ~ ., data = train1)
nb_preds <- predict(nb_model, test1)
confusionMatrix(nb_preds, test1$year)
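When the outcome is a factor, createDataPartition() samples within each class so that the training set keeps roughly the same class proportions. The base R sketch below imitates that idea on a made-up year vector; it is an approximation for illustration, not caret's exact procedure.

```r
set.seed(123)

# Toy outcome with unequal class sizes (made-up counts).
toy_year <- factor(rep(c("2014", "2015", "2016"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each year, then pool them.
train_idx <- unlist(lapply(split(seq_along(toy_year), toy_year), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

print(table(toy_year[train_idx]))   # training rows per year
print(table(toy_year[-train_idx]))  # held-out rows per year
```

Stratifying matters most when one year has far fewer records, since a purely random split could leave that year almost absent from the test set.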
Random Forest Classifier
# ---- CLASSIFICATION 2: Random Forest ----
rf_model <- randomForest(year ~ ., data = train1, ntree = 100)
rf_preds <- predict(rf_model, test1)
confusionMatrix(rf_preds, test1$year)
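The main figures reported by a confusion matrix can be reproduced by hand, which helps when comparing the two classifiers. The sketch below uses invented predictions for eight records; the vectors actual and predicted are hypothetical and do not come from either model.

```r
# Made-up labels for eight test records.
actual <- factor(c("2014", "2014", "2014", "2015", "2015", "2016", "2016", "2016"))
predicted <- factor(c("2014", "2014", "2015", "2015", "2016", "2016", "2016", "2014"),
                    levels = levels(actual))

# Rows are predictions, columns are true years; the diagonal holds correct calls.
cm <- table(Predicted = predicted, Actual = actual)
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: accuracy from always guessing the most common year.
baseline <- max(table(actual)) / length(actual)

# Share of each actual year that was recovered.
recall <- diag(cm) / colSums(cm)

print(cm)
print(accuracy)
print(baseline)
print(round(recall, 2))
```

Comparing accuracy against the majority-class baseline shows whether the numeric features carry real information about the year or whether the model is mostly reflecting class sizes.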