Overview

This report presents an end-to-end analysis of criminal conviction data for the years 2014 to 2016. The analysis looks for patterns, trends, and anomalies in criminal activity over time, using data wrangling, statistical summaries, and machine learning techniques. Integrating data from multiple source files and years gives a combined view of how often different types of crime occur, how their counts change over time, and how they relate to one another.

The process begins by importing the yearly datasets and merging them into a single structure, followed by data cleaning and preprocessing to ensure quality and consistency. Exploratory data analysis (EDA) techniques are then applied to summarize and visualize key features of the data, including the most common offense types, temporal trends, and correlations between variables.

Building on this foundation, the report applies linear regression, clustering algorithms, and classification models to predict crime-related variables and to segment the dataset into meaningful groups. These models give further insight into the structure of the data and help assess how feasible it is to predict crime patterns from historical trends.

The goal is not only to describe the past but also to support proactive decision-making through data-driven forecasting and pattern recognition, whether for academic research, public policy development, or resource allocation.

The project adopts a structured approach to data science, involving:

Loading and merging raw datasets from multiple years,
Cleaning and preparing data to ensure consistency and completeness,
Exploring the data through visualizations and summaries,
Performing descriptive analysis to identify trends and distributions,
Applying predictive modeling techniques such as regression, clustering, and classification to generate deeper insights.

Libraries Used

The following R libraries are used in this analysis:

`tidyverse` – for data manipulation and visualization
`janitor` – for cleaning data and column names
`lubridate` – for working with date and time data
`readr` – for reading CSV files efficiently
`ggplot2` – for creating elegant data visualizations
`caret` – for machine learning and model training
`cluster` – for clustering algorithms
`factoextra` – for extracting and visualizing multivariate data (especially clustering results)
`e1071` – for Naive Bayes classification and other ML functions
`randomForest` – for Random Forest classification models
`markdown` – for rendering and formatting Markdown content

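These libraries are loaded at the start of the analysis. Any package that is not yet installed can be installed once with `install.packages()` before it is loaded.

```r
# Install missing packages once before loading, e.g.:
# install.packages("markdown")

# Load necessary libraries
library(tidyverse)
library(janitor)
library(lubridate)
library(readr)
library(ggplot2)
library(caret)
library(cluster)
library(factoextra)
library(e1071)
library(randomForest)
library(markdown)
```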

Setting the Base Directory

To organize and access the dataset files efficiently, a base directory path is defined. This allows the script to dynamically locate CSV files stored in year-wise subfolders without hardcoding file paths repeatedly.

The base directory is set as follows:

```{r set-base-dir}
# Set base directory
base_dir <- "/Users/kaustubhawaghade/dataset/"
```

Extracting CSV Files by Year

To efficiently load datasets stored in different subfolders for each year, a custom function is defined to extract CSV file paths based on the year.

The `extract_csv_files_by_year()` function automates the process of navigating through subdirectories and collecting the paths to all `.csv` files. This makes the loading process dynamic and scalable, especially when working with structured folders (e.g., one folder per year).

The function works as follows:

- It takes a base directory and a vector of target years.
- For each year, it builds the path to the corresponding folder.
- If the folder exists, it lists all CSV files within that folder using `list.files()`.
- The resulting file paths are stored in a named list, with each year as a key.

Here is the implementation:

```{r extract-csv}
# Function to extract CSVs by year
extract_csv_files_by_year <- function(base_dir, years) {
  files_by_year <- list()
  for (year in years) {
    folder_path <- file.path(base_dir, year)
    if (dir.exists(folder_path)) {
      # "\\.csv$" matches file names that end in ".csv"
      csv_files <- list.files(path = folder_path, pattern = "\\.csv$", full.names = TRUE)
      files_by_year[[year]] <- csv_files
    }
  }
  return(files_by_year)
}

# Define target years and extract CSV file paths
years <- c("2014", "2015", "2016")
files_by_year <- extract_csv_files_by_year(base_dir, years)
```
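
As a quick check of the folder logic described above, the following base-R sketch builds a throwaway directory tree and lists its CSV files by year. The temporary folders, the file names (`a.csv`, `b.csv`, `notes.txt`), and the helper name `list_csvs_by_year` are made up for illustration and are not part of the project.

```r
# Hypothetical demo: build a temporary tree with one folder per year
demo_dir <- file.path(tempdir(), "csv_demo")
unlink(demo_dir, recursive = TRUE)  # start clean if rerun
dir.create(file.path(demo_dir, "2014"), recursive = TRUE)
dir.create(file.path(demo_dir, "2015"), recursive = TRUE)
file.create(file.path(demo_dir, "2014", c("a.csv", "notes.txt")))
file.create(file.path(demo_dir, "2015", "b.csv"))

# Same idea as the loop in the report, written with lapply() (illustrative name)
list_csvs_by_year <- function(base_dir, years) {
  years <- years[dir.exists(file.path(base_dir, years))]  # skip missing folders
  found <- lapply(years, function(y) {
    list.files(file.path(base_dir, y), pattern = "\\.csv$", full.names = TRUE)
  })
  setNames(found, years)
}

demo_files <- list_csvs_by_year(demo_dir, c("2014", "2015", "2016"))
print(lengths(demo_files))  # number of CSV files found per year
```

In this demo the `2016` folder does not exist, so it is left out of the result, and `notes.txt` is ignored because it does not match the pattern.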

Merging CSV Files

After extracting the CSV file paths for each year, the next step is to merge all the individual CSV files into a single unified dataset. This is done using the `merge_csv_data()` function, which reads each CSV, adds a `year` column, and combines them into one large data frame.

The function works as follows:

- It accepts a list of CSV file paths grouped by year.
- For each year and corresponding CSV file, it reads the data into a data frame.
- If the `date` column is present, it ensures the date is converted to `Date` format.
- The `year` for each record is added as a new column to track the source of the data.
- All the individual data frames are stored in a list, and then `bind_rows()` combines them into a single large data frame.

The resulting merged dataset has rows from multiple years and a `year` column for easy filtering and comparison.

Here is the implementation:

```{r merge-csv}
# Function to merge CSVs
merge_csv_data <- function(files_by_year) {
  all_data <- list()
  for (year in names(files_by_year)) {
    for (csv_file in files_by_year[[year]]) {
      df <- read_csv(csv_file)
      if ("date" %in% colnames(df)) df$date <- as.Date(df$date)
      df$year <- year
      all_data[[length(all_data) + 1]] <- df
    }
  }
  bind_rows(all_data) %>% clean_names()
}

final_data <- merge_csv_data(files_by_year)
```
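
The merge step can be tried on a small scale with base R alone. The sketch below writes two tiny made-up CSV files, tags each with its year, and stacks them. `read.csv()` and `rbind()` stand in for `read_csv()` and `bind_rows()` so the example runs without extra packages; unlike `bind_rows()`, `rbind()` requires the same column names in every file. The column names and values are invented.

```r
# Hypothetical input: one small CSV per year, written to a temporary folder
merge_dir <- file.path(tempdir(), "merge_demo")
dir.create(merge_dir, showWarnings = FALSE)
write.csv(data.frame(area = c("North", "South"), convictions = c(10, 20)),
          file.path(merge_dir, "2014.csv"), row.names = FALSE)
write.csv(data.frame(area = "North", convictions = 15),
          file.path(merge_dir, "2015.csv"), row.names = FALSE)

demo_paths <- list("2014" = file.path(merge_dir, "2014.csv"),
                   "2015" = file.path(merge_dir, "2015.csv"))

# Read each file, add its year as a column, then stack the pieces
pieces <- list()
for (yr in names(demo_paths)) {
  for (path in demo_paths[[yr]]) {
    piece <- read.csv(path, stringsAsFactors = FALSE)
    piece$year <- yr
    pieces[[length(pieces) + 1]] <- piece
  }
}
merged_demo <- do.call(rbind, pieces)

print(table(merged_demo$year))  # rows contributed by each year
```

Counting rows per year right after the merge is a cheap way to confirm that every file was read and tagged.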

Data Cleaning Steps

Data cleaning is a critical part of the data analysis process. This section explains the steps taken to ensure the data is consistent, complete, and ready for analysis.

1. Handling Missing Values

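First, we count the missing values in each column using the `summarise()` function and reshape the result into one row per column, sorted by the number of missing values. This gives an overview of how much data is missing for each variable. The cleaning pipeline then drops columns in which 90% or more of the values are missing, replaces missing numeric values with the column median and missing text values with `"Unknown"`, removes duplicate rows, and converts character columns to factors.

```r
# Cleaning steps: count missing values per column
missing_summary <- final_data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "column", values_to = "missing_count") %>%
  arrange(desc(missing_count))
print(missing_summary)

# Drop sparse columns, impute, de-duplicate, and convert text to factors
# Note: fct_explicit_na() is deprecated in recent forcats releases;
# fct_na_value_to_level() is its replacement there.
final_data_cleaned <- final_data %>%
  select(where(~ mean(is.na(.)) < 0.9)) %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) %>%
  mutate(across(where(is.character), ~ replace_na(., "Unknown"))) %>%
  mutate(across(where(is.factor), ~ fct_explicit_na(., na_level = "Unknown"))) %>%
  distinct() %>%
  mutate(across(where(is.character), as.factor)) %>%
  clean_names()
```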
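
To make the imputation rules concrete, here is a small base-R sketch that applies the same steps to a made-up data frame. The column names (`thefts`, `region`, `mostly_empty`) and values are invented, and the base-R calls are stand-ins for the dplyr pipeline, not the code used in the report.

```r
# Made-up example data with missing values
toy <- data.frame(
  thefts       = c(4, NA, 10, 6),
  region       = c("North", NA, "South", "North"),
  mostly_empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Missing values per column
print(colSums(is.na(toy)))

# Keep only columns where less than 90% of the values are missing
toy_clean <- toy[, colMeans(is.na(toy)) < 0.9, drop = FALSE]

# Numeric columns: replace NA with the column median
num_cols <- names(toy_clean)[sapply(toy_clean, is.numeric)]
for (col in num_cols) {
  med <- median(toy_clean[[col]], na.rm = TRUE)
  toy_clean[[col]][is.na(toy_clean[[col]])] <- med
}

# Character columns: replace NA with "Unknown"
chr_cols <- names(toy_clean)[sapply(toy_clean, is.character)]
for (col in chr_cols) {
  toy_clean[[col]][is.na(toy_clean[[col]])] <- "Unknown"
}

# Drop exact duplicate rows
toy_clean <- unique(toy_clean)
print(toy_clean)
```

Median imputation keeps a column's centre but reduces its spread, which is worth keeping in mind when reading the correlation and regression results later in the report.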

Descriptive Analytics (LO3)

In this section, we perform descriptive analytics to summarize key characteristics of the criminal conviction data. The analysis involves exploring the number of records per year, calculating summary statistics, visualizing correlations between different variables, and analyzing the distribution of specific offenses over time.

1. Records per Year

The first analysis step involves counting the number of records available for each year. This provides an overview of how the dataset is distributed across the years.

We visualize this with a bar plot:

```r
# 1. Records per Year
final_data_cleaned %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col(fill = "steelblue") +
  labs(title = "Number of Records by Year", x = "Year", y = "Count") +
  theme_minimal()
```

2. Summary Statistics for Numeric Columns

Next, we calculate basic summary statistics (mean, median, and standard deviation) for the numeric variables in the dataset. This helps to understand the central tendencies and spread of the data.

```r
# 2. Summary Statistics for Numeric Columns
final_data_cleaned %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), list(
    mean = ~ mean(., na.rm = TRUE),
    median = ~ median(., na.rm = TRUE),
    sd = ~ sd(., na.rm = TRUE)
  ))) %>%
  print()
```

3. Correlation Heatmap

We then compute the correlations between all numeric variables in the dataset. A heatmap is generated to visualize these correlations, with stronger correlations highlighted in color.

```r
# 3. Correlation Heatmap
final_data_cleaned %>%
  select(where(is.numeric)) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") %>%
  ggplot(aes(x = var1, y = var2, fill = correlation)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "", y = "", fill = "Correlation")
```
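
The reshaping behind the heatmap can be seen on a small scale with R's built-in `mtcars` data, used here only as a stand-in for the conviction counts. This base-R sketch converts a correlation matrix to the long `var1`/`var2`/`correlation` layout and lists the strongest pairs; the choice of the three columns is arbitrary.

```r
# Stand-in data: three numeric columns from the built-in mtcars dataset
num_data <- mtcars[, c("mpg", "wt", "hp")]

# Correlation matrix using only rows with no missing values
cor_mat <- cor(num_data, use = "complete.obs")

# Long format: one row per ordered pair of variables
cor_long <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_long) <- c("var1", "var2", "correlation")

# Keep each unordered pair once and sort by absolute strength
pairs_once <- cor_long[cor_long$var1 < cor_long$var2, ]
pairs_once <- pairs_once[order(-abs(pairs_once$correlation)), ]
print(pairs_once)
```

A sorted table like this complements the heatmap when there are many variables and the tile colours become hard to compare by eye.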

4. Boxplot: Distribution of Burglary Convictions Across Years

A boxplot is used to visualize the distribution of burglary convictions across different years. This allows us to see the spread, median, and outliers for each year.

```r
# 4. Boxplot: Distribution of convictions across years
final_data_cleaned %>%
  ggplot(aes(x = year, y = number_of_burglary_convictions)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Burglary Convictions by Year", x = "Year", y = "Count") +
  theme_minimal()
```

5. Density Plot: Distribution of Drug Offense Convictions

Lastly, a density plot is created to visualize the distribution of drug offense convictions for each year. This plot helps to understand how the density of convictions varies over time.

```r
# 5. Density Plot: Drugs Offence Convictions
final_data_cleaned %>%
  ggplot(aes(x = number_of_drugs_offences_convictions, fill = year)) +
  geom_density(alpha = 0.4) +
  labs(title = "Density of Drugs Offence Convictions by Year", x = "Convictions", y = "Density") +
  theme_minimal()
```

Predictive Analytics (LO2/LO3)

In this section, predictive analytics techniques are applied to uncover deeper insights from the data and to develop models that can forecast or classify outcomes based on historical patterns. Three major predictive methods are used: linear regression, clustering, and classification models.

1. Linear Regression

Linear Regression Model 1

The first regression model predicts the number of theft and handling convictions from three other offense counts: drug, burglary, and robbery convictions. This helps identify which crimes may be statistically associated with theft.

```r
# ---- LINEAR REGRESSION 1 ----
lm_data1 <- final_data_cleaned %>%
  select(number_of_theft_and_handling_convictions,
         number_of_drugs_offences_convictions,
         number_of_burglary_convictions,
         number_of_robbery_convictions) %>%
  na.omit()
lm_model1 <- lm(number_of_theft_and_handling_convictions ~ ., data = lm_data1)
summary(lm_model1)
```

Linear Regression Model 2

The second model predicts motoring-related offenses using public order, fraud, and criminal damage convictions. These regression models help evaluate how certain categories of crime relate to one another.

```r
# ---- LINEAR REGRESSION 2 ----
lm_data2 <- final_data_cleaned %>%
  select(number_of_motoring_offences_convictions,
         number_of_public_order_offences_convictions,
         number_of_fraud_and_forgery_convictions,
         number_of_criminal_damage_convictions) %>%
  na.omit()
lm_model2 <- lm(number_of_motoring_offences_convictions ~ ., data = lm_data2)
summary(lm_model2)
```
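
Beyond reading `summary()`, a fitted model can be queried directly for its coefficients and fit statistics. The sketch below fits the same kind of model on simulated counts, since the real data is not reproduced here; the variable names `theft`, `drugs`, and `burglary` and the generating coefficients are invented for illustration.

```r
set.seed(42)  # reproducible simulated data

# Simulated stand-in for conviction counts (not the report's data)
n <- 200
drugs    <- rpois(n, lambda = 30)
burglary <- rpois(n, lambda = 20)
theft    <- 5 + 2 * drugs + 1.5 * burglary + rnorm(n, sd = 4)
sim <- data.frame(theft, drugs, burglary)

# Same formula style as the report: response ~ all other columns
fit <- lm(theft ~ ., data = sim)

coef_table <- summary(fit)$coefficients  # estimates, std. errors, t and p values
r_squared  <- summary(fit)$r.squared     # share of variance explained
print(round(coef_table, 3))
print(round(r_squared, 3))
```

When offense counts are regressed on one another, a high R-squared can simply reflect that larger reporting units record more of every offense, so the coefficients are best read as associations rather than causes.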

2. Clustering

Clustering techniques are used to uncover natural groupings or structures within the data based on numeric features.

K-means Clustering

```r
# ---- CLUSTERING 1: K-means ----
cluster_data1 <- final_data_cleaned %>% select(where(is.numeric)) %>% na.omit()
scaled1 <- scale(cluster_data1)
kmeans1 <- kmeans(scaled1, centers = 3)
fviz_cluster(list(data = scaled1, cluster = kmeans1$cluster)) +
  labs(title = "K-Means Clustering 1")
```
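
`kmeans()` starts from randomly chosen centres, so cluster assignments can change between runs unless a seed is set. The sketch below uses the built-in `iris` measurements as stand-in data and illustrates two optional refinements: a fixed seed with several random starts (`nstart`), and the total within-cluster sum of squares for a range of `k`. The seed value, `nstart = 25`, and the range of `k` are arbitrary choices for illustration.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
scaled_demo <- scale(iris[, 1:4])

set.seed(123)  # fixes the random starting centres
km_demo <- kmeans(scaled_demo, centers = 3, nstart = 25)
print(table(km_demo$cluster))  # cluster sizes

# Total within-cluster sum of squares for k = 1..6 (elbow heuristic)
wss <- sapply(1:6, function(k) {
  kmeans(scaled_demo, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))
```

The value of `k` after which `wss` stops dropping sharply is a common informal guide for choosing the number of clusters.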

Hierarchical Clustering

Hierarchical clustering builds a dendrogram that shows nested groupings of similar observations. We also use the silhouette method to estimate the optimal number of clusters.

```r
# ---- CLUSTERING 2: Hierarchical Clustering with Visualization ----
cluster_data_hc <- final_data_cleaned %>% select(where(is.numeric)) %>% na.omit()
scaled_hc <- scale(cluster_data_hc)
dist_matrix <- dist(scaled_hc)
hc <- hclust(dist_matrix, method = "complete")
plot(hc, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.5)

# Optimal k (optional)
fviz_nbclust(scaled_hc, FUN = hcut, method = "silhouette") +
  labs(title = "Optimal Number of Clusters - Silhouette")

# Cut dendrogram
k_hc <- 3
hc_clusters <- cutree(hc, k = k_hc)
final_data_cleaned$hc_cluster <- factor(hc_clusters)

# Visualise
fviz_cluster(list(data = scaled_hc, cluster = hc_clusters),
             geom = "point", ellipse.type = "convex", repel = TRUE,
             main = "Hierarchical Clustering (PCA View)")
```
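
The silhouette criterion can also be computed directly with `silhouette()` from the `cluster` package, which this report already loads. The sketch below uses `iris` as stand-in data and compares the average silhouette width for a few values of `k` on a complete-linkage tree; the range 2 to 6 is an arbitrary choice for illustration.

```r
library(cluster)  # provides silhouette()

# Stand-in data with the same distance and linkage choices as the report
scaled_demo_hc <- scale(iris[, 1:4])
dist_demo <- dist(scaled_demo_hc)
hc_demo <- hclust(dist_demo, method = "complete")

# Average silhouette width for each candidate number of clusters
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  groups <- cutree(hc_demo, k = k)
  mean(silhouette(groups, dist_demo)[, "sil_width"])
})
names(avg_sil) <- k_values
print(round(avg_sil, 3))

best_k <- k_values[which.max(avg_sil)]  # k with the highest average width
print(best_k)
```

Computing the widths on the same tree that is later cut keeps the choice of `k` consistent with the final clusters. `hcut()` applies its own default linkage, which appears to be Ward's method rather than complete linkage, so its suggested `k` may differ from one computed this way.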

3. Classification Models

Classification techniques are used to predict the year of a criminal record based on its numeric features. Two models are used: Naive Bayes and Random Forest.

Naive Bayes Classifier

```r
# ---- CLASSIFICATION 1: Naive Bayes ----
class_data <- final_data_cleaned %>%
  select(where(is.numeric), year) %>%
  mutate(year = as.factor(year)) %>%
  na.omit()
set.seed(123)
split <- createDataPartition(class_data$year, p = 0.8, list = FALSE)
train1 <- class_data[split, ]
test1 <- class_data[-split, ]
nb_model <- naiveBayes(year ~ ., data = train1)
nb_preds <- predict(nb_model, test1)
confusionMatrix(nb_preds, test1$year)
```

Random Forest Classifier

```r
# ---- CLASSIFICATION 2: Random Forest ----
rf_model <- randomForest(year ~ ., data = train1, ntree = 100)
rf_preds <- predict(rf_model, test1)
confusionMatrix(rf_preds, test1$year)
```
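
When reading the confusion matrices, it helps to compare accuracy against a trivial baseline that always predicts the most frequent class in the training data; caret's `confusionMatrix()` reports a related figure as the No Information Rate. The base-R sketch below shows a stratified 80/20 split and that baseline on made-up labels. The class counts and the manual per-class sampling are illustrative stand-ins for `createDataPartition()`, not the report's code.

```r
set.seed(123)

# Made-up labels standing in for the year column
labels <- factor(rep(c("2014", "2015", "2016"), times = c(50, 30, 20)))

# Stratified split: sample 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))
train_y <- labels[train_idx]
test_y  <- labels[-train_idx]

# Baseline: always predict the most frequent training class
majority <- names(which.max(table(train_y)))
baseline_pred <- factor(rep(majority, length(test_y)), levels = levels(labels))

# Confusion table and accuracy of the baseline on the held-out labels
conf_tab <- table(predicted = baseline_pred, actual = test_y)
baseline_acc <- sum(diag(conf_tab)) / sum(conf_tab)
print(conf_tab)
print(baseline_acc)
```

A classifier for `year` is only informative if its test accuracy is clearly above this baseline.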

kaustubh_4408579

2025-05-19