Crime remains a major societal concern across Indian cities, affecting economic stability, mental well-being, and public safety. This project analyzes crime patterns from 2020 to 2024 across 14 major metropolitan areas, exploring the most impacted locations, dominant crime categories, and how trends evolved — especially in the post-COVID landscape. The goal is to uncover high-risk zones and temporal crime patterns to aid in smarter urban planning and crime prevention.
The dataset used in this project is the “Indian Crimes Dataset” by Sudhanvahg on Kaggle, which captures criminal activity across multiple Indian cities between 2020 and 2024. The original dataset contains 40,160 records and 14 columns. For the purpose of this project, the data was subsetted to:
• Include only 14 major Indian cities, chosen for their population and status as crime hotspots (e.g., Delhi, Mumbai, Bangalore, Hyderabad)
• Maintain a temporal balance by randomly sampling 40% of the records from each reporting year
The resulting working dataset has around 12,700 records, sufficient for exploratory data analysis, visualizations, and even predictive modeling.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.4.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(tidyr)
crime_dataset_india <- read_csv("C:/Users/divak/OneDrive/Documents/R projects/CA3/crime_dataset_india.csv",
show_col_types = FALSE)
cat("Rows:", nrow(crime_dataset_india), "Columns:", ncol(crime_dataset_india), "\n")
## Rows: 40160 Columns: 14
# Assign dataset to a new variable
crime <- crime_dataset_india
# Selecting 14 cities based on population and crime hotspots
selected_cities <- c("Delhi", "Mumbai", "Bangalore", "Hyderabad", "Chennai",
"Kolkata", "Ahmedabad", "Pune", "Lucknow", "Jaipur",
"Patna", "Kanpur", "Surat", "Indore")
#Subset the dataset using the City column & sample 40% of the data from each year
crime_a <- subset(crime, City %in% selected_cities) %>%
mutate(
year = substr(`Date Reported`, 7, 10)
) %>%
group_by(year) %>%
slice_sample(prop = 0.40) %>%
ungroup()
cat("Rows:", nrow(crime_a), "Columns:", ncol(crime_a), "\n")
## Rows: 12748 Columns: 15
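Because slice_sample() draws rows at random, the counts reported in this analysis will shift slightly from one run to the next unless the random seed is fixed. The following is a minimal sketch of seeded, per-year sampling on a made-up data frame; the toy data, the seed value 2024, and the helper name sample_by_year are assumptions for illustration and are not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the crime data: 10 records per year (hypothetical values)
toy <- data.frame(
  id   = 1:50,
  year = rep(c("2020", "2021", "2022", "2023", "2024"), each = 10)
)

# Hypothetical helper: fix the seed, then draw 40% of the rows within each year
sample_by_year <- function(df, prop = 0.40, seed = 2024) {
  set.seed(seed)
  df %>%
    group_by(year) %>%
    slice_sample(prop = prop) %>%
    ungroup()
}

first_draw  <- sample_by_year(toy)
second_draw <- sample_by_year(toy)

identical(first_draw, second_draw)  # same seed, same rows
table(first_draw$year)              # rows drawn per year
```

Calling set.seed() immediately before the sampling step is usually enough to make every downstream count repeatable.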
crime_a$`Date of Occurrence` <- mdy_hm(crime_a$`Date of Occurrence`)
crime_a$`Date Reported` <- dmy_hm(crime_a$`Date Reported`)
crime_a$`Date Case Closed` <- dmy_hm(crime_a$`Date Case Closed`)
crime_a$`Time of Occurrence` <- dmy_hm(crime_a$`Time of Occurrence`)
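The date columns in the dataset were initially stored as character data types. Using the lubridate library, these columns were converted to date-time format (POSIXct, displayed as dttm in tibbles), ensuring consistency and enabling more accurate temporal analysis in the subsequent stages of the project. Note that Date of Occurrence is parsed as month-day-year (mdy_hm), while the other three columns are parsed as day-month-year (dmy_hm).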
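Choosing between dmy_hm() and mdy_hm() matters because many date strings are valid under both orders. The sketch below, using made-up strings rather than values from the dataset, shows how the same text yields two different dates and how an impossible month surfaces as NA.

```r
library(lubridate)

# The same string read under two different field orders
ambiguous <- "03-04-2020 10:30"
as_dmy <- dmy_hm(ambiguous)  # day-month-year: 3 April 2020
as_mdy <- mdy_hm(ambiguous)  # month-day-year: 4 March 2020
month(as_dmy)
month(as_mdy)

# A first field above 12 cannot be a month, so the wrong parser returns NA
unambiguous <- "25-04-2020 10:30"
wrong_order <- suppressWarnings(mdy_hm(unambiguous))
is.na(wrong_order)
```

Counting NA values after parsing, as is done with colSums(is.na(...)) later in this report, is a quick way to catch a column parsed with the wrong order, although it cannot detect rows where both orders happen to be valid.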
str(crime_a)
## tibble [12,748 × 15] (S3: tbl_df/tbl/data.frame)
## $ Report Number : num [1:12748] 2435 2666 702 3556 2438 ...
## $ Date Reported : POSIXct[1:12748], format: "2020-04-12 16:00:00" "2020-04-22 14:00:00" ...
## $ Date of Occurrence: POSIXct[1:12748], format: "2020-04-11 10:00:00" "2020-04-21 01:00:00" ...
## $ Time of Occurrence: POSIXct[1:12748], format: "2020-04-12 05:25:00" "2020-04-22 00:46:00" ...
## $ City : chr [1:12748] "Pune" "Delhi" "Delhi" "Delhi" ...
## $ Crime Code : num [1:12748] 274 114 386 403 428 436 385 548 276 296 ...
## $ Crime Description : chr [1:12748] "SEXUAL ASSAULT" "FRAUD" "CYBERCRIME" "PUBLIC INTOXICATION" ...
## $ Victim Age : num [1:12748] 10 64 27 40 60 32 59 28 65 55 ...
## $ Victim Gender : chr [1:12748] "F" "F" "M" "M" ...
## $ Weapon Used : chr [1:12748] "Blunt Object" "Other" "Explosives" "Explosives" ...
## $ Crime Domain : chr [1:12748] "Violent Crime" "Other Crime" "Other Crime" "Other Crime" ...
## $ Police Deployed : num [1:12748] 18 12 13 15 2 18 19 18 1 16 ...
## $ Case Closed : chr [1:12748] "Yes" "Yes" "No" "Yes" ...
## $ Date Case Closed : POSIXct[1:12748], format: "2021-07-07 16:00:00" "2020-04-30 14:00:00" ...
## $ year : chr [1:12748] "2020" "2020" "2020" "2020" ...
The dataset contains 15 columns and 12,748 rows, with numeric columns (e.g., Victim Age, Police Deployed), character columns (e.g., City, Crime Description), and POSIXct date-time columns.
colSums(is.na(crime_a))
## Report Number Date Reported Date of Occurrence Time of Occurrence
## 0 0 0 0
## City Crime Code Crime Description Victim Age
## 0 0 0 0
## Victim Gender Weapon Used Crime Domain Police Deployed
## 0 0 0 0
## Case Closed Date Case Closed year
## 0 6340 0
Only the Date Case Closed column contains missing values, which is expected since not all cases have been resolved yet. The rest of the dataset is complete and does not require handling of missing values for other fields.
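The expectation that a missing closure date simply means an open case can be checked directly. The following sketch uses a small made-up data frame (the column names case_closed and date_closed are placeholders, not the dataset's names) to cross-tabulate case status against missingness.

```r
# Toy stand-in for the two columns of interest (hypothetical values)
toy <- data.frame(
  case_closed = c("Yes", "No", "Yes", "No", "No"),
  date_closed = as.Date(c("2021-01-05", NA, "2021-03-01", NA, NA))
)

# Cross-tabulate case status against whether the closure date is missing
status_vs_missing <- table(
  status       = toy$case_closed,
  date_missing = is.na(toy$date_closed)
)
status_vs_missing

# TRUE only if every open case lacks a date and every closed case has one
consistent <- all(is.na(toy$date_closed) == (toy$case_closed == "No"))
consistent
```

If the off-diagonal cells of the table are zero, the missing values are structural rather than a data-quality problem.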
table(crime_a$City)
##
## Ahmedabad Bangalore Chennai Delhi Hyderabad Indore Jaipur Kanpur
## 730 1427 995 2197 1106 292 569 469
## Kolkata Lucknow Mumbai Patna Pune Surat
## 1035 598 1725 289 879 437
Delhi, Mumbai, and Bangalore have the highest number of records, indicating higher crime reporting or more incidents in these cities. Cities like Patna, Surat, and Indore have fewer entries, possibly due to lower crime volume or underreporting. This distribution will be useful when analyzing regional trends and comparing urban crime rates.
# Get minimum and maximum dates
min(crime_a$`Date of Occurrence`, na.rm = TRUE)
## [1] "2020-01-01 08:00:00 UTC"
max(crime_a$`Date of Occurrence`, na.rm = TRUE)
## [1] "2024-07-31 06:00:00 UTC"
The Date of Occurrence ranges from 1 January 2020 to 31 July 2024, so the final year of the intended five-year window is only partially covered. There are no NA values in this column.
crime_a %>%
group_by(`Crime Description`) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
slice_head(n = 10)
## # A tibble: 10 × 2
## `Crime Description` Count
## <chr> <int>
## 1 FRAUD 660
## 2 BURGLARY 649
## 3 CYBERCRIME 628
## 4 DRUG OFFENSE 626
## 5 SEXUAL ASSAULT 625
## 6 ROBBERY 622
## 7 FIREARM OFFENSE 618
## 8 VANDALISM 618
## 9 IDENTITY THEFT 617
## 10 ILLEGAL POSSESSION 616
The top three crimes across cities are Fraud, Burglary, and Cybercrime, each with over 600 reports. The top ten categories are closely spaced (616 to 660 reports) and span cyber, violent, and property-related crimes, suggesting the need for diverse law enforcement strategies. Understanding this distribution helps cities prioritize resources for both prevention and investigation.
mumbai_unresolved_weapons <- crime_a %>%
filter(
City == "Mumbai",
`Case Closed` == "No",
`Weapon Used` != "None"
)
nrow(mumbai_unresolved_weapons)
## [1] 733
Among the reported crimes in Mumbai between 2020 and 2024, a total of 733 cases involved weapons and remain unresolved. This indicates that a significant portion of violent or high-risk crimes have not yet led to closure. The presence of weapons in these open cases suggests either complexity in investigation or challenges in apprehending offenders. It also reflects the persistence of serious crimes within the city during this period.
night_crimes_females <- crime_a %>%
filter(`Victim Gender` == "F") %>%
mutate(hour = hour(`Date of Occurrence`)) %>%
filter(hour >= 20 | hour < 6)
nrow(night_crimes_females)
## [1] 2985
A total of 2,985 crimes involving female victims occurred during night-time hours (between 8 PM and 6 AM), based on the hour of Date of Occurrence. These incidents span multiple cities like Kolkata, Chennai, Delhi, Mumbai, and Hyderabad, indicating that such crimes are widespread across regions. The timing of these occurrences suggests that women may be at increased risk during late hours, highlighting the need for improved safety measures at night.
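The night-time window wraps around midnight, which is why the filter combines its two conditions with OR rather than AND. A minimal sketch of that logic follows; the helper name is_night is hypothetical and the hours are made up.

```r
# Hypothetical helper: TRUE for hours in the 20:00 to 05:59 window
is_night <- function(hour) {
  hour >= 20 | hour < 6
}

hours <- c(0, 5, 6, 12, 19, 20, 23)
night_flags <- is_night(hours)
data.frame(hour = hours, night = night_flags)

# Using & instead of | would never match: no hour is both >= 20 and < 6
any(hours >= 20 & hours < 6)
```

Hour 6 itself is excluded, so the window ends at 05:59.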
open_heavy_deployment <- crime_a %>%
filter(`Police Deployed` > 15, `Case Closed` == "No")
nrow(open_heavy_deployment)
## [1] 1359
head(open_heavy_deployment)
## # A tibble: 6 × 15
## `Report Number` `Date Reported` `Date of Occurrence` `Time of Occurrence`
## <dbl> <dttm> <dttm> <dttm>
## 1 14 2020-01-02 22:00:00 2020-01-01 13:00:00 2020-01-01 17:46:00
## 2 8374 2020-12-15 07:00:00 2020-12-14 21:00:00 2020-12-15 01:28:00
## 3 1274 2020-02-24 21:00:00 2020-02-23 01:00:00 2020-02-23 11:11:00
## 4 2579 2020-04-18 04:00:00 2020-04-17 10:00:00 2020-04-17 11:58:00
## 5 424 2020-01-19 06:00:00 2020-01-18 15:00:00 2020-01-19 03:08:00
## 6 2741 2020-04-26 19:00:00 2020-04-24 04:00:00 2020-04-24 15:30:00
## # ℹ 11 more variables: City <chr>, `Crime Code` <dbl>,
## # `Crime Description` <chr>, `Victim Age` <dbl>, `Victim Gender` <chr>,
## # `Weapon Used` <chr>, `Crime Domain` <chr>, `Police Deployed` <dbl>,
## # `Case Closed` <chr>, `Date Case Closed` <dttm>, year <chr>
A total of 1,359 criminal cases had more than 15 police personnel deployed, yet these cases remain open. This suggests that even significant deployment of law enforcement resources does not always guarantee swift resolution. These cases may involve serious, complex, or sensitive crimes, such as arson, extortion, or homicide, requiring prolonged investigations or facing challenges in evidence collection or prosecution.
avg_police_city <- crime_a %>%
group_by(City) %>%
summarise(avg_police_deployed = mean(`Police Deployed`, na.rm = TRUE)) %>%
arrange(desc(avg_police_deployed))
head(avg_police_city)
## # A tibble: 6 × 2
## City avg_police_deployed
## <chr> <dbl>
## 1 Jaipur 10.3
## 2 Surat 10.3
## 3 Chennai 10.2
## 4 Kolkata 10.2
## 5 Kanpur 10.2
## 6 Mumbai 10.0
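Jaipur and Surat have the highest average police deployment per crime (about 10.3 officers). The top six cities all fall between 10.0 and 10.3, however, so the differences between cities are slight and offer only weak evidence of more serious or resource-intensive incidents in any one location.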
crime_by_month <- crime_a %>%
mutate(report_month = month(`Date Reported`, label = TRUE)) %>%
group_by(report_month) %>%
summarise(total_crimes = n()) %>%
arrange(desc(total_crimes))
crime_by_month
## # A tibble: 12 × 2
## report_month total_crimes
## <ord> <int>
## 1 Jul 1199
## 2 Mar 1197
## 3 Apr 1171
## 4 Jan 1158
## 5 Jun 1147
## 6 May 1130
## 7 Feb 1076
## 8 Aug 957
## 9 Dec 949
## 10 Oct 944
## 11 Nov 919
## 12 Sep 901
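The data shows that July had the highest number of reported crimes (1,199), followed closely by March (1,197) and April (1,171). In contrast, September saw the fewest reports (901). This suggests a potential seasonal pattern, with more reports in the first seven months of the year. However, occurrences in the data end on 31 July 2024, so August through December are represented by roughly four years instead of five, which by itself lowers their totals.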
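When the observation window ends partway through a year, raw monthly totals favour the months that appear in more years. One way to compare months fairly is to divide each total by the number of years in which that month is observed. The sketch below illustrates this on made-up data with exactly one record per day from 1 January 2020 to 31 July 2024; the column names are placeholders.

```r
library(dplyr)

# Toy data: exactly one record per day over the chosen span
toy <- data.frame(
  reported = seq(as.Date("2020-01-01"), as.Date("2024-07-31"), by = "day")
)

monthly <- toy %>%
  mutate(
    yr  = as.integer(format(reported, "%Y")),
    mon = as.integer(format(reported, "%m"))
  ) %>%
  group_by(mon) %>%
  summarise(
    total          = n(),            # raw count, biased by coverage
    years_observed = n_distinct(yr), # years in which the month appears
    per_year       = total / years_observed
  )

monthly
```

In this toy case January and August have different raw totals but the same per-year rate, so the apparent gap comes entirely from coverage.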
avg_age_per_crime <- crime_a %>%
group_by(`Crime Description`) %>%
summarise(avg_victim_age = mean(`Victim Age`, na.rm = TRUE)) %>%
arrange(desc(avg_victim_age))
head(avg_age_per_crime)
## # A tibble: 6 × 2
## `Crime Description` avg_victim_age
## <chr> <dbl>
## 1 DOMESTIC VIOLENCE 45.6
## 2 VEHICLE - STOLEN 45.5
## 3 EXTORTION 45.5
## 4 ILLEGAL POSSESSION 45.2
## 5 VANDALISM 45.1
## 6 FRAUD 45.0
The average victim age varies little across the crime types shown. Domestic Violence, Vehicle Theft, and Extortion have the highest averages (about 45.5 years), and all six leading categories fall between 45.0 and 45.6 years. The categories with the youngest average victims are not displayed by head() and would need tail(avg_age_per_crime) to confirm.
open_cases_by_domain <- crime_a %>%
filter(`Case Closed` == "No") %>%
group_by(`Crime Domain`) %>%
summarise(open_case_count = n()) %>%
arrange(desc(open_case_count))
open_cases_by_domain
## # A tibble: 4 × 2
## `Crime Domain` open_case_count
## <chr> <int>
## 1 Other Crime 3690
## 2 Violent Crime 1795
## 3 Fire Accident 597
## 4 Traffic Fatality 258
This result shows the distribution of open cases across different crime domains. “Other Crime” has the highest number of unresolved cases (3,690), followed by “Violent Crime” (1,795). “Fire Accident” and “Traffic Fatality” have relatively fewer open cases, with 597 and 258 respectively. This indicates that miscellaneous or less clearly categorized crimes account for the most unresolved cases, although this may simply reflect how common that domain is overall.
crime_a %>%
filter(`Victim Gender` == "F") %>%
mutate(hour = hour(`Date of Occurrence`)) %>%
filter(hour >= 20 | hour < 6) %>%
count(City, name = "night_female_crimes") %>%
arrange(desc(night_female_crimes)) %>%
slice_head(n = 5)
## # A tibble: 5 × 2
## City night_female_crimes
## <chr> <int>
## 1 Delhi 518
## 2 Mumbai 392
## 3 Bangalore 354
## 4 Hyderabad 244
## 5 Kolkata 243
The data suggests that larger metropolitan cities like Delhi, Mumbai, and Bangalore experience a higher number of nighttime crimes involving female victims. This could be due to higher population density, increased female mobility, and greater reporting rates in these urban areas. It highlights the need for stronger nighttime safety infrastructure in major cities.
crime_a %>%
group_by(`Weapon Used`) %>%
summarise(total = n()) %>%
arrange(desc(total)) %>%
filter(`Weapon Used` != "None") %>%
# Optional filter if "None" dominates
slice_head(n = 5)
## # A tibble: 5 × 2
## `Weapon Used` total
## <chr> <int>
## 1 Knife 1884
## 2 Blunt Object 1849
## 3 Explosives 1826
## 4 Poison 1823
## 5 Firearm 1759
The top five most frequently used weapons in reported crimes are knife, blunt objects, explosives, poison, and firearms. The narrow margin between them suggests a fairly even distribution among these weapon types. The high usage of knives and blunt objects could indicate a prevalence of spontaneous or easily accessible weapons, especially in violent or street-level crimes.
crime_a %>%
group_by(City) %>%
summarise(
total_cases = n(),
open_cases = sum(`Case Closed` == "No", na.rm = TRUE)
) %>%
mutate(open_case_ratio = open_cases / total_cases) %>%
arrange(desc(open_case_ratio)) %>%
slice_head(n = 5)
## # A tibble: 5 × 4
## City total_cases open_cases open_case_ratio
## <chr> <int> <int> <dbl>
## 1 Pune 879 455 0.518
## 2 Surat 437 224 0.513
## 3 Hyderabad 1106 559 0.505
## 4 Patna 289 146 0.505
## 5 Lucknow 598 302 0.505
The data reveals that Pune, Surat, and Hyderabad have the highest open case ratios, with just over 50% of reported cases still unresolved. Patna and Lucknow show nearly identical ratios (about 0.505). This may suggest a strain on investigative resources or judicial delays in these cities, although the top five ratios differ by barely one percentage point, so the ranking itself should be read with caution.
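To judge whether a ratio just above 0.5 is meaningfully different from an even split, a one-sample proportion test can be applied to the counts in the table above. The sketch below uses the Pune row (455 open cases out of 879) with base R's prop.test(); treating 0.5 as the reference value is an assumption made for illustration.

```r
# Counts taken from the table above: Pune has 455 open cases out of 879
pune_test <- prop.test(x = 455, n = 879, p = 0.5)

pune_test$estimate   # observed open-case ratio
pune_test$conf.int   # approximate 95% confidence interval
pune_test$p.value    # test of the ratio against 0.5
```

The same call can be repeated for the other cities to see which intervals, if any, exclude 0.5.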
crime_victimage <- crime_a %>%
mutate(age_group = case_when(
`Victim Age` < 13 ~ "Child",
`Victim Age` >= 13 & `Victim Age` < 20 ~ "Teen",
`Victim Age` >= 20 & `Victim Age` < 60 ~ "Adult",
`Victim Age` >= 60 ~ "Senior",
TRUE ~ "Unknown"
)) %>%
group_by(age_group) %>%
summarise(total_crimes = n()) %>%
arrange(desc(total_crimes))
The majority of crimes involve adult victims (7,359 cases), followed by seniors (3,586 cases). Teenagers and children are less frequently affected, with 1,261 and 542 cases respectively. This distribution suggests that adults are the most targeted group, possibly due to their higher exposure in public and professional spaces, while the notable number of senior victims may point to their vulnerability.
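The age bands used above (under 13, 13 to 19, 20 to 59, 60 and over) can also be expressed with base R's cut(), which makes the boundaries explicit. This is an alternative sketch on made-up ages, not the method used in the analysis.

```r
ages <- c(10, 13, 19, 20, 59, 60)

# right = FALSE makes each bin include its lower bound: [13, 20), [20, 60), ...
age_group <- cut(
  ages,
  breaks = c(-Inf, 13, 20, 60, Inf),
  labels = c("Child", "Teen", "Adult", "Senior"),
  right  = FALSE
)

data.frame(age = ages, age_group = age_group)
table(age_group)
```

Unlike the case_when() version, cut() returns NA for a missing age rather than an "Unknown" label, so the two approaches differ only when ages are missing.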
crime_temp <- crime_a %>%
mutate(report_delay = as.numeric(`Date Reported` - `Date of Occurrence`, units = "days")) %>%
group_by(`Crime Description`) %>%
summarise(avg_delay = mean(report_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay)) %>%
slice_head(n = 5)
print(crime_temp)
## # A tibble: 5 × 2
## `Crime Description` avg_delay
## <chr> <dbl>
## 1 DRUG OFFENSE 1.56
## 2 CYBERCRIME 1.55
## 3 COUNTERFEITING 1.55
## 4 SEXUAL ASSAULT 1.53
## 5 ILLEGAL POSSESSION 1.53
Drug offenses, cybercrime, and counterfeiting show the highest average delays in reporting, all above 1.5 days, followed closely by sexual assault and illegal possession. This suggests victims may hesitate or face barriers in reporting some crimes promptly. Even short delays can impact response and investigation. The trend highlights a potential need for more accessible and supportive reporting mechanisms.
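The delay above is expressed in days by passing units = "days". Without an explicit unit, subtracting two date-times lets R pick the unit from the size of the gap, which can silently change what the number means. A minimal sketch with made-up timestamps:

```r
occurred <- as.POSIXct("2020-01-01 10:00:00", tz = "UTC")
reported <- as.POSIXct("2020-01-01 11:30:00", tz = "UTC")

# Automatic units: a 90-minute gap is stored in hours
auto_gap <- reported - occurred
units(auto_gap)
as.numeric(auto_gap)

# Explicit units: the same gap expressed in days
gap_days <- as.numeric(difftime(reported, occurred, units = "days"))
gap_days
```

Stating the unit explicitly keeps derived columns comparable across subsets of the data.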
crime_seasonality <- crime_a %>%
mutate(
month = month(`Date Reported`, label = TRUE),
season = case_when(
month %in% c("Dec", "Jan", "Feb") ~ "Winter",
month %in% c("Mar", "Apr", "May") ~ "Spring",
month %in% c("Jun", "Jul", "Aug") ~ "Summer",
month %in% c("Sep", "Oct", "Nov") ~ "Autumn"
)
) %>%
group_by(season) %>%
summarise(total_crimes = n()) %>%
arrange(desc(total_crimes))
crime_seasonality
## # A tibble: 4 × 2
## season total_crimes
## <chr> <int>
## 1 Spring 3498
## 2 Summer 3303
## 3 Winter 3183
## 4 Autumn 2764
The seasonal crime analysis reveals that Spring has the highest number of reported crimes (3,498), followed by Summer (3,303) and Winter (3,183). Autumn records the lowest (2,764). This pattern may suggest increased crime activity during warmer months, possibly due to higher public mobility and interactions during these times. Note, however, that occurrences in the data end on 31 July 2024, so Autumn, and part of Summer and Winter, is represented by fewer years than Spring.
# Compute Days to Close (explicit units so the result is always in days)
crime_a$D2Close <- as.numeric(difftime(crime_a$`Date Case Closed`, crime_a$`Date Reported`, units = "days"))
# Build regression model
slr_time <- lm(D2Close ~ `Police Deployed`, data = crime_a)
# View model summary
summary(slr_time)
##
## Call:
## lm(formula = D2Close ~ `Police Deployed`, data = crime_a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.32 -61.23 -35.23 -8.25 643.43
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.3443 3.2894 27.466 <2e-16 ***
## `Police Deployed` -0.3408 0.2901 -1.175 0.24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127.1 on 6406 degrees of freedom
## (6340 observations deleted due to missingness)
## Multiple R-squared: 0.0002153, Adjusted R-squared: 5.928e-05
## F-statistic: 1.38 on 1 and 6406 DF, p-value: 0.2402
# Create New 'Police Deployed' data for prediction
new_police_levels <- data.frame(`Police Deployed` = c(5, 10, 15, 20), check.names = FALSE)
# Predict days to close for new police levels
new_police_levels$predicted_days_to_close <- predict(slr_time, new_police_levels)
# Print predictions
print(new_police_levels)
## Police Deployed predicted_days_to_close
## 1 5 88.64018
## 2 10 86.93606
## 3 15 85.23194
## 4 20 83.52782
# Visualization - Actual data, regression line, and predicted points
ggplot(crime_a, aes(x = `Police Deployed`, y = D2Close)) +
geom_point(alpha = 0.4, color = "blue") + # Blue points for actual data
geom_smooth(method = "lm", se = FALSE, color = "red") + # Red regression line
geom_point(data = new_police_levels, aes(x = `Police Deployed`, y = predicted_days_to_close),
color = "darkgreen", size = 3) + # Dark green points for predicted data
labs(title = "Regression: Police Deployed vs Days to Close",
x = "Police Deployed",
y = "Days to Close") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6340 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 6340 rows containing missing values or values outside the scale range
## (`geom_point()`).
The simple linear regression shows a slight negative relationship between police deployed and days to case closure, but it is not statistically significant (p = 0.24). The R-squared value is also very low (about 0.02%), meaning police deployment explains almost none of the variation. The predicted closure time falls only from about 88.6 to 83.5 days as deployment rises from 5 to 20 officers, so other factors (crime severity, city protocols, caseload complexity etc.) are likely to matter far more.
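The figures quoted in this interpretation can be read programmatically from a fitted model instead of being copied from the printed summary, which helps keep prose and output in agreement. The sketch below does this on simulated data with no built-in relationship; the variable names and the seed are assumptions for illustration only.

```r
set.seed(1)  # arbitrary seed so the toy data is repeatable

# Simulated data with no built-in relationship (purely illustrative)
toy <- data.frame(
  police = sample(1:20, 200, replace = TRUE),
  days   = rexp(200, rate = 1 / 90)
)

fit <- lm(days ~ police, data = toy)
fit_summary <- summary(fit)

slope_p   <- coef(fit_summary)["police", "Pr(>|t|)"]  # p-value of the slope
r_squared <- fit_summary$r.squared                     # share of variance explained
slope_ci  <- confint(fit)["police", ]                  # 95% interval for the slope

slope_p
r_squared
slope_ci
```

A confidence interval for the slope that spans zero conveys the same conclusion as a p-value above 0.05.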
# Convert 'Crime Domain' and 'City' into factors
crime_n <- crime_a
crime_clean <- crime_n %>%
drop_na()
crime_clean$`Crime Domain` <- as.factor(crime_clean$`Crime Domain`)
crime_clean$City <- as.factor(crime_clean$City)
crime_clean$Year <- year(crime_clean$`Date of Occurrence`)
crime_clean$Month <- month(crime_clean$`Date of Occurrence`)
crime_clean$D2Close <- as.numeric(difftime(crime_clean$`Date Case Closed`, crime_clean$`Date Reported`, units = "days"))
# Fit a multi-linear regression model
model <- lm(D2Close ~ Year + Month + `Crime Domain` + City, data = crime_clean)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = D2Close ~ Year + Month + `Crime Domain` + City,
## data = crime_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -124.87 -59.54 -30.69 1.33 647.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2039.3204 2378.0206 0.858 0.391162
## Year -0.9931 1.1761 -0.844 0.398479
## Month 0.5163 0.4634 1.114 0.265261
## `Crime Domain`Other Crime 33.5261 5.4898 6.107 1.07e-09 ***
## `Crime Domain`Traffic Fatality -33.0888 8.9250 -3.707 0.000211 ***
## `Crime Domain`Violent Crime 75.0446 5.8570 12.813 < 2e-16 ***
## CityBangalore 8.3424 7.9217 1.053 0.292327
## CityChennai 18.6373 8.4973 2.193 0.028319 *
## CityDelhi 16.4443 7.4250 2.215 0.026815 *
## CityHyderabad 14.0639 8.3434 1.686 0.091916 .
## CityIndore 24.5500 12.0934 2.030 0.042394 *
## CityJaipur 7.5825 9.7703 0.776 0.437733
## CityKanpur -6.1859 10.3025 -0.600 0.548244
## CityKolkata 20.2944 8.3786 2.422 0.015455 *
## CityLucknow 10.6645 9.6639 1.104 0.269834
## CityMumbai 16.1463 7.6948 2.098 0.035915 *
## CityPatna 9.8524 12.2086 0.807 0.419693
## CityPune 13.8150 8.8202 1.566 0.117332
## CitySurat 5.4510 10.6633 0.511 0.609233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124 on 6389 degrees of freedom
## Multiple R-squared: 0.04986, Adjusted R-squared: 0.04718
## F-statistic: 18.63 on 18 and 6389 DF, p-value: < 2.2e-16
crime_clean$Year <- format(crime_clean$`Date of Occurrence`, "%Y")
crime_clean$Month <- format(crime_clean$`Date of Occurrence`, "%m")
# 1. Trend of Case Closure Days Over Year
ggplot(crime_clean, aes(x = as.numeric(Year), y = D2Close)) +
geom_point(aes(color = Year), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Trend of Case Closure Days Over Year", x = "Year", y = "Days to Case Closure") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot shows that case-closure times vary widely each year (from near 0 to past 600 days), and the red linear trend is essentially flat. In other words, there is no meaningful change in average closure duration from 2020 to 2024, which agrees with the non-significant Year coefficient in the model above (p ≈ 0.40).
# 2. Impact of Crime Domain on Case Closure Days
ggplot(crime_clean, aes(x = `Crime Domain`, y = D2Close)) +
geom_boxplot(aes(fill = `Crime Domain`), alpha = 0.6) +
labs(title = "Impact of Crime Domain on Case Closure Days", x = "Crime Domain", y = "Days to Case Closure") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Traffic Fatality cases close fastest (median ≈ 10 days) with the smallest spread. Fire Accidents and Other Crimes sit in the middle (medians around 45–60 days) with moderate variability. Violent Crimes take the longest (median ≈ 80 days) and show the greatest spread, including many extreme outliers exceeding 600 days. In short, case complexity aligns with domain: violent incidents drag on longest, traffic fatalities resolve quickest, and other categories fall in between.
# 3. Impact of City on Case Closure Days
ggplot(crime_clean, aes(x = City, y = D2Close)) +
geom_boxplot(aes(fill = City), alpha = 0.6) +
labs(title = "Impact of City on Case Closure Days", x = "City", y = "Days to Case Closure") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Case closure time varies across cities, but overall most cities show similar median days with wide spreads. Mumbai and Patna lean slightly higher, while cities like Kanpur and Jaipur appear relatively quicker.
# Step 1: Aggregate Delhi crime data by Year
delhi_data <- crime_clean %>%
filter(City == "Delhi") %>%
group_by(Year) %>%
summarise(Crime_Count = n()) %>%
mutate(Year = as.numeric(Year)) # Ensure Year is numeric
# Step 2: Build the 2nd-degree polynomial regression model
polynomial_model <- lm(Crime_Count ~ poly(Year, 2), data = delhi_data)
# Step 3: View model summary
summary(polynomial_model)
##
## Call:
## lm(formula = Crime_Count ~ poly(Year, 2), data = delhi_data)
##
## Residuals:
## 1 2 3 4 5
## 1.7143 0.5429 -11.9143 15.3429 -5.6857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 224.200 6.426 34.890 0.00082 ***
## poly(Year, 2)1 -69.254 14.369 -4.820 0.04045 *
## poly(Year, 2)2 -35.011 14.369 -2.437 0.13512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 2 degrees of freedom
## Multiple R-squared: 0.9358, Adjusted R-squared: 0.8717
## F-statistic: 14.58 on 2 and 2 DF, p-value: 0.06417
# Step 4: Create new data for prediction
new_years <- data.frame(Year = seq(min(delhi_data$Year), max(delhi_data$Year), by = 1))
# Step 5: Predict crime counts
new_years$Predicted_Crimes <- predict(polynomial_model, newdata = new_years)
# Step 6: Visualization
ggplot(delhi_data, aes(x = Year, y = Crime_Count)) +
geom_point(color = "darkgreen", size = 3, alpha = 0.7) + # Actual data points
geom_line(data = new_years, aes(x = Year, y = Predicted_Crimes),
color = "red", linewidth = 1.2) + # Polynomial curve
labs(
title = "Polynomial Regression (Degree 2): Delhi Crime Trend (2020–2024)",
x = "Year", y = "Crime Count"
) +
theme_minimal()
The fitted quadratic curve peaks around 2021 and then declines through 2024, with the drop steepening in the final year. Two caveats apply: delhi_data is built from crime_clean, so it counts only closed Delhi cases, and 2024 is covered only through July. With five yearly points and three fitted parameters, the overall fit is not significant at the 5% level (p = 0.064) despite the high R-squared (0.94).
Initial Rise (2020 → 2021): The model (red curve) and the actual data both reach their high point in 2021, with residuals under 2 cases in both years.
Plateau & Fall (2021 → 2023): The curve declines smoothly, but the actual data (green dots) drop more sharply in 2022 (residual -11.9) and hold up better in 2023 (residual +15.3).
Sharp Decline (2023 → 2024): The model predicts a steep drop in crime counts toward 2024. The actual count in 2024 is also low, matching the trend, which is expected for a partial year whose recent cases have had less time to close.
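With only five yearly observations, it is worth checking whether the quadratic term adds anything beyond a straight line. The sketch below compares the two nested models with an F-test; the counts are made up for illustration and are not the Delhi figures.

```r
# Made-up yearly counts, for illustration only (not the Delhi figures)
toy <- data.frame(
  Year        = 2020:2024,
  Crime_Count = c(250, 255, 230, 225, 155)
)

linear_fit    <- lm(Crime_Count ~ Year, data = toy)
quadratic_fit <- lm(Crime_Count ~ poly(Year, 2), data = toy)

# F-test: does the quadratic term reduce the residual error significantly?
comparison <- anova(linear_fit, quadratic_fit)
comparison

# Residual degrees of freedom left after fitting three parameters to five points
df.residual(quadratic_fit)
```

With so few residual degrees of freedom, even a visually good fit gives little protection against overfitting, so the curve is better read as a description of the five points than as a forecast.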
crime_filtered <- crime_a %>%
filter(format(as.Date(`Date Reported`), "%Y") %in% c("2020", "2021", "2022", "2023", "2024"))
# Run ANOVA
anova_model <- aov(`Victim Age` ~ City, data = crime_filtered)
# Summary of ANOVA
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## City 13 3218 247.5 0.609 0.849
## Residuals 12734 5177877 406.6
#Visualisation - Box Plot
ggplot(crime_a, aes(x = City, y = `Victim Age`)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Distribution of Victim Age Across Cities", x = "City", y = "Victim Age")
Based on the ANOVA results, we fail to reject the null hypothesis, as the p-value (0.849) is greater than the significance level (0.05). This suggests that the average victim age does not differ significantly across different cities in India from 2020 to 2024. In other words, geographic location (city) does not have a statistically significant impact on victim age for the given years.
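Beyond the p-value, the ANOVA table also shows how small the city effect is in absolute terms. The sketch below computes eta squared, a common effect-size measure that the original analysis does not report, from the sums of squares printed above, and rebuilds the F statistic as a sanity check.

```r
# Sums of squares copied from the one-way ANOVA table above
ss_city     <- 3218
ss_residual <- 5177877

# Eta squared: share of total variation in victim age attributable to city
eta_squared <- ss_city / (ss_city + ss_residual)
eta_squared

# F statistic rebuilt from the same table: City mean square over residual mean square
f_value <- (ss_city / 13) / (ss_residual / 12734)
f_value
```

An eta squared this close to zero indicates that city accounts for a negligible share of the variation in victim age, consistent with the non-significant test.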
# Two-way ANOVA
anova <- aov(`Victim Age` ~ City * `Crime Domain`, data = crime_filtered)
# Summary of Two-way ANOVA
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## City 13 3218 247.5 0.608 0.849
## `Crime Domain` 3 150 49.8 0.122 0.947
## City:`Crime Domain` 39 13440 344.6 0.847 0.738
## Residuals 12692 5164288 406.9
#Visualisation - Box Plot
ggplot(crime_a, aes(x = City, y = `Victim Age`, fill = `Crime Domain`)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Distribution of Victim Age Across Cities and Crime Domains", x = "City", y = "Victim Age")
Based on these results, we fail to reject the null hypothesis for all three factors (City, Crime Domain, and their interaction). There is no statistically significant difference in victim age across cities or crime domains, nor is there any significant interaction between the two factors. Thus, victim age seems to be independent of city and crime domain in this dataset from 2020 to 2024.
ggplot(crime_a, aes(x = City, fill = `Crime Domain`)) +
geom_bar(position = "dodge") +
labs(
title = "Frequency of Different Crime Domains Across Cities (2020–2024)",
x = "City",
y = "Number of Cases",
fill = "Crime Domain"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
The chart shows that “Other Crime” is consistently the most reported crime domain across nearly all cities, with Delhi and Mumbai showing the highest counts overall, in line with their larger record totals. Cities like Mumbai, Bangalore, and Kolkata also report high levels of Violent Crime, while Traffic Fatality and Fire Accidents appear relatively lower in frequency across all cities. This suggests that urban centers face a broader range of criminal activity, with non-violent or miscellaneous crimes dominating.
domain_counts <- crime_a %>%
count(`Crime Domain`, sort = TRUE)
# Create pie chart
ggplot(domain_counts, aes(x = "", y = n, fill = `Crime Domain`)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(
title = "Distribution of Crimes by Domain",
fill = "Crime Domain"
) +
theme_void() +
theme(
plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
)
The pie chart shows that “Other Crime” accounts for the largest share of crimes in India between 2020 and 2024, making up well over half of all reported cases. Violent Crimes follow as the second most frequent category. Fire Accidents and Traffic Fatalities represent smaller portions, indicating they are relatively less common. This suggests that preventive strategies should prioritize the “Other Crime” and “Violent Crime” categories.
ggplot(crime_a, aes(x = `Victim Age`)) +
geom_histogram(binwidth = 5, fill = "#69b3a2", color = "black", alpha = 0.8) +
labs(
title = "Distribution of Victim Ages",
x = "Victim Age (in years)",
y = "Number of Victims"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
)
The histogram displays a fairly uniform distribution of victim ages, ranging approximately from age 5 to 85. Each age group seems to have a similar number of victims, except for slightly fewer cases in the youngest (0–10) and oldest (80+) age brackets. This suggests that crime victimization in the dataset is not strongly age-dependent and affects individuals of nearly all ages relatively equally.
# Scatterplot: Victim Age vs Case Closure Days
ggplot(crime_a, aes(x = D2Close, y = `Victim Age`, color = `Victim Gender`)) +
geom_point(alpha = 0.8) + # Scatter plot points
labs(title = "Scatterplot of Victim Age vs Duration of Case (D2Close)",
x = "Duration (Days between Date Reported and Date Closed)",
y = "Victim Age",
color = "Victim Gender") +
theme_minimal() + # Minimal theme for better readability
theme(legend.position = "top") # Position the legend on top
## Warning: Removed 6340 rows containing missing values or values outside the scale range
## (`geom_point()`).
No strong relationship exists between victim age and the number of days to case closure; closed cases vary widely in duration across all age groups. The majority of closed cases are resolved within 100 days, suggesting efficient processing for most incidents that do reach closure. Long-duration cases (over 400 days) appear throughout all age ranges, indicating that delays are not age-specific. Gender distribution is broad, with female (F), male (M), and other (X) victims all showing similar patterns across age and case duration. The 6,340 open cases have no closure date and are therefore absent from the plot.
# Extract the year of occurrence for the yearly gender breakdown
crime_a <- crime_a %>%
mutate(Year = format(`Date of Occurrence`, "%Y"))
# Line Graph: Proportion of Male vs Female Crime Victims over the Years (2020-2024)
crime_a %>%
group_by(Year, `Victim Gender`) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(Year) %>%
mutate(proportion = count / sum(count) * 100) %>%
ggplot(aes(x = Year, y = proportion, color = `Victim Gender`, group = `Victim Gender`)) +
geom_line(linewidth = 1) + # Line for each gender (linewidth replaces size for lines in ggplot2 >= 3.4.0)
geom_point(size = 2) + # Points for each year
labs(title = "Proportion of Male vs Female Crime Victims over the Years (2020-2024)",
x = "Year of Occurrence",
y = "Proportion of Victims (%)") +
scale_color_manual(values = c("M" = "blue", "F" = "pink", "X" = "green")) + # Color for Male, Female, Other
theme_minimal() + # Clean theme
theme(legend.title = element_blank()) + # Remove legend title
theme(legend.position = "bottom") # Place the legend at the bottom
Females (F) consistently make up the largest proportion of crime victims, with a noticeable dip in 2022 followed by a sharp rebound by 2024. Males (M) remain relatively stable over the years, with only slight fluctuations around the 32–35% range. The third gender/unspecified category (X) spikes in 2022, temporarily matching male victim levels, and then declines sharply again, possibly because of a data reporting anomaly or specific events that year.
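The yearly proportions behind the line graph can be cross-checked with a base R contingency table. This is a small sketch on a hypothetical eight-row data frame `toy`; the rows are invented for illustration only.

```r
# Hypothetical toy data: four victims in each of two years.
toy <- data.frame(
  Year = c("2020", "2020", "2020", "2020", "2021", "2021", "2021", "2021"),
  `Victim Gender` = c("F", "F", "M", "X", "F", "M", "M", "X"),
  check.names = FALSE
)

# Count victims per year and gender, then convert each row (year) to percentages.
counts <- table(toy$Year, toy$`Victim Gender`)
proportions <- prop.table(counts, margin = 1) * 100
print(round(proportions, 1))

# Sanity check: the gender shares within each year should sum to 100%.
stopifnot(all(abs(rowSums(proportions) - 100) < 1e-8))
```

A table like this also makes it easy to read off the exact shares in a year such as 2022 rather than estimating them from the plot.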
# Drop rows with NA values in the relevant columns
crime_a_cleaned <- crime_a %>%
drop_na(`Date Case Closed`, `Date Reported`, `Date of Occurrence`)
# Create a new variable 'D2Close': days between 'Date Reported' and 'Date Case Closed'
# difftime() with explicit units gives days for both Date and date-time columns
crime_a_cleaned$D2Close <- as.numeric(difftime(crime_a_cleaned$`Date Case Closed`, crime_a_cleaned$`Date Reported`, units = "days"))
# Extract the year from 'Date of Occurrence'
crime_a_cleaned <- crime_a_cleaned %>%
mutate(Year = format(`Date of Occurrence`, "%Y"))
# Select the variables for the pair plot and convert 'Year' to numeric
crime_subset_pair <- crime_a_cleaned %>%
select(`Victim Age`, D2Close, Year, `Victim Gender`) %>%
mutate(Year = as.numeric(Year))
# Pair Plot using ggpairs to visualize relationships
ggpairs(
crime_subset_pair,
aes(color = `Victim Gender`, alpha = 0.6),
lower = list(continuous = "smooth"),
upper = list(continuous = "cor"),
diag = list(continuous = "densityDiag"),
progress = FALSE
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is a pair plot with density plots, scatter plots, correlation values, and boxplots showing relationships among Victim Age, Days to Case Closure (D2Close), and Year, differentiated by Victim Gender (F: Female, M: Male, X: Unknown/Other).
Victim Age: The distribution is fairly uniform across all genders, with a slight density peak in the middle age ranges. Boxplots show that females and males have similar age distributions, while the “X” category (possibly unknown/other) is slightly more spread out.
Days to Case Closure (D2Close): The distribution is highly right-skewed. Most cases are closed quickly (within 100 days), but a few outliers take a very long time.
Gender-wise, there are no major differences in closure time across F, M, and X.
Year: A very weak positive correlation (0.031) is observed between Year and D2Close, so any tendency for more recent cases to take longer is slight at best. For females, the correlation is somewhat larger (0.054**) but still small. Males and others show negligible correlation.
Correlations: Victim Age vs. D2Close: Correlation is very weak (0.009), suggesting age is not related to how quickly a case is closed. Days to Close vs. Year: Slight negative correlation overall (-0.011), but with gender variance as noted above. Victim Age vs. Year: Slight positive correlation (older victims in recent years) but very weak.
Gender Variation: Female victims show the strongest, though still weak, trend over time (age and case closure). The male and “X” categories show minimal patterns, which may reflect more variability or under-reporting.
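Correlations of this size are easier to judge with a p-value and sample size next to them; a coefficient of 0.054, for example, corresponds to an r² of about 0.003, or roughly 0.3% of shared variance. The sketch below computes Pearson correlations between Year and D2Close separately for each gender. It uses a hypothetical simulated data frame `toy`, and the helper logic is illustrative rather than the code behind the pair plot.

```r
# Hypothetical toy data; values are simulated with no built-in trend.
set.seed(3)
toy <- data.frame(
  Year = sample(2020:2024, 300, replace = TRUE),
  D2Close = round(rexp(300, rate = 1 / 60)),
  `Victim Gender` = sample(c("F", "M", "X"), 300, replace = TRUE),
  check.names = FALSE
)

# Split by gender and run a Pearson correlation test within each group.
by_gender <- split(toy, toy$`Victim Gender`)
cor_summary <- do.call(rbind, lapply(names(by_gender), function(g) {
  test <- cor.test(by_gender[[g]]$Year, by_gender[[g]]$D2Close)
  data.frame(gender = g,
             r = unname(test$estimate),
             p_value = test$p.value,
             n = nrow(by_gender[[g]]))
}))
print(cor_summary)
```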
• Crime Domain: Violent Crimes take ~75 days longer to close (p < 0.001). Traffic Fatalities close ~34 days faster (p < 0.001).
• Year & Month: Weak/non-significant effects; slight downward trend over years (≈ 2 days faster per year, p ≈ 0.10).
• City: No significant effect on closure times.
• Model Fit: R² ≈ 0.05, so only ~5% of variance is explained.
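The closure-time regression summarized above has the general shape of a linear model with crime domain as a categorical predictor. The sketch below shows how such coefficients and R² are read off in R. Everything in it is hypothetical: the data frame `toy` is simulated, the effect sizes are planted to mirror the estimates quoted above, and using "Other Crime" as the reference level is an assumption, since the reference category is not stated here.

```r
# Hypothetical toy data with planted domain effects; not the real dataset.
set.seed(4)
n <- 600
toy <- data.frame(
  domain = factor(sample(c("Other Crime", "Violent Crime", "Traffic Fatality"),
                         n, replace = TRUE)),
  Year = sample(2020:2024, n, replace = TRUE)
)
# Assumed reference level: coefficients are differences relative to "Other Crime".
toy$domain <- relevel(toy$domain, ref = "Other Crime")
toy$D2Close <- 100 +
  75 * (toy$domain == "Violent Crime") -
  34 * (toy$domain == "Traffic Fatality") +
  rnorm(n, sd = 40)

# Fit the model and extract the domain effects and the overall R-squared.
fit <- lm(D2Close ~ domain + Year, data = toy)
domain_effects <- coef(fit)[c("domainViolent Crime", "domainTraffic Fatality")]
r_squared <- summary(fit)$r.squared

print(round(domain_effects, 1))
print(round(r_squared, 3))
```

Each domain coefficient is the estimated difference in mean closure days from the reference level with Year held fixed, which is how statements such as "~75 days longer" should be read.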
Victim Age ANOVA
• Cities: No significant differences in mean victim age across cities (p = 0.288).
• Crime Domains: No significant differences by domain (p = 0.967).
• City × Domain Interaction: Not significant (p = 0.229).
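A two-way ANOVA with an interaction term, as summarized above, can be sketched as follows. The data frame `toy`, its column names, and the city and domain labels are hypothetical and simulated without any real group differences.

```r
# Hypothetical toy data; ages are drawn independently of city and domain.
set.seed(5)
toy <- data.frame(
  city = factor(sample(c("Delhi", "Mumbai", "Bangalore"), 300, replace = TRUE)),
  domain = factor(sample(c("Other Crime", "Violent Crime"), 300, replace = TRUE)),
  victim_age = sample(10:79, 300, replace = TRUE)
)

# city * domain expands to both main effects plus the city:domain interaction.
age_aov <- aov(victim_age ~ city * domain, data = toy)
aov_table <- summary(age_aov)[[1]]
print(aov_table)
```

Note that `aov()` uses sequential (Type I) sums of squares, so with unbalanced group sizes the main-effect p-values can depend on the order of the terms in the formula.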
Delhi Crime Trend
• A 4th-degree polynomial best captures daily crime count fluctuations.
• The trend shows complex, cyclical patterns rather than a simple upward or downward slope.
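One common way to choose a polynomial degree for a trend like this is to fit several degrees and compare an information criterion. The sketch below does that on a hypothetical simulated series of daily counts. Using AIC as the selection rule is an assumption made for illustration; the criterion behind the 4th-degree choice above is not stated.

```r
# Hypothetical toy series: 200 days of simulated counts with a slow cycle.
set.seed(6)
day <- 1:200
daily_count <- rpois(200, lambda = 5 + 2 * sin(day / 30))

# Fit orthogonal polynomial trends of degree 1 through 5.
fits <- lapply(1:5, function(d) lm(daily_count ~ poly(day, d)))

# Compare the fits by AIC (lower is better) and pick the minimum.
aic_by_degree <- sapply(fits, AIC)
names(aic_by_degree) <- paste0("degree_", 1:5)
print(round(aic_by_degree, 1))
best_degree <- which.min(aic_by_degree)

# Fitted trend values from the degree-4 model, e.g. for overlaying on a plot.
trend_deg4 <- fitted(fits[[4]])
```

High-degree polynomials can behave erratically near the ends of the series, so the fitted curve is best read as a description of the observed period rather than a forecast.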
Gender & Age Trends
• Average Victim Age: Shows minor year-to-year variation; no dramatic shifts.
• Gender Proportion: The share of male vs. female victims remains relatively stable from 2020–2024, with no major crossover points.
Distribution & Correlations
• Pie Chart: “Other Crime” and “Violent Crime” dominate, together accounting for ~60–70% of all cases.
• Histogram: Victim ages are spread fairly uniformly from roughly 5 to 85 years, with slightly fewer victims in the youngest and oldest brackets.
• Pair Plot & Scatter: No strong linear correlations among Victim Age, Days to Closure, and Year; the correlation between age and closure time is close to zero (0.009).
Crime Domain Drives Complexity: Violent and “Other” crimes consistently require longer investigation and closure times.
Limited Temporal Effects: Year-to-year improvements in closure speed are marginal; monthly seasonality is negligible.
Geography Matters Less: City of occurrence has little bearing on either victim age or closure duration.
Victim Demographics: Age and gender distributions are fairly uniform across cities and crime types; no significant demographic shifts over time.
Model Limitations & Next Steps: Low explanatory power (R² < 0.1) across models suggests key variables are missing (e.g., case complexity scores, resource levels, socio-economic indicators). Consider richer feature engineering (e.g., interaction terms, severity scores) to capture hidden patterns.
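As one illustration of the interaction-term idea, a model with a domain × city interaction can be compared against the main-effects model with a nested F-test. The sketch below uses a hypothetical simulated data frame `toy` in which an interaction effect has been planted on purpose, so it shows the mechanics only and says nothing about the real data.

```r
# Hypothetical toy data with a planted domain-by-city interaction.
set.seed(7)
n <- 500
toy <- data.frame(
  domain = factor(sample(c("Other Crime", "Violent Crime"), n, replace = TRUE)),
  city = factor(sample(c("Delhi", "Mumbai"), n, replace = TRUE))
)
toy$D2Close <- 60 +
  40 * (toy$domain == "Violent Crime") +
  30 * (toy$domain == "Violent Crime" & toy$city == "Delhi") +
  rnorm(n, sd = 40)

# Main-effects model versus a model that adds the domain:city interaction.
main_fit <- lm(D2Close ~ domain + city, data = toy)
inter_fit <- lm(D2Close ~ domain * city, data = toy)

# Nested-model F-test: does the interaction term improve the fit?
comparison <- anova(main_fit, inter_fit)
print(comparison)

# Adjusted R-squared penalizes the extra parameter, so it is a fairer comparison.
adj_r2 <- c(main = summary(main_fit)$adj.r.squared,
            interaction = summary(inter_fit)$adj.r.squared)
print(round(adj_r2, 3))
```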
Across multiple analyses—from regression and ANOVA to polynomial trend modeling and visual explorations—the type of crime emerges as the most consistent driver of case complexity and resolution time. Violent crimes demand substantially longer investigative efforts, while traffic fatalities are resolved more swiftly. In contrast, temporal factors (year, month) and geographic location (city) exert only marginal influence on how long cases take to close or on who the typical victim is. Victim demographics (age, gender) likewise show remarkably stable patterns, with no significant shifts across cities or crime domains between 2020 and 2024.
Overall, crime domain remains the strongest predictor of both closure times and case complexity, while temporal and geographic factors play a more modest role. Future work should focus on incorporating deeper operational and socio-economic data to build more robust, actionable models.