Project Overview

Crime remains a major societal concern across Indian cities, affecting economic stability, mental well-being, and public safety. This project analyzes crime patterns from 2020 to 2024 across 14 major metropolitan areas, exploring the most impacted locations, dominant crime categories, and how trends evolved — especially in the post-COVID landscape. The goal is to uncover high-risk zones and temporal crime patterns to aid in smarter urban planning and crime prevention.

Dataset Used

The dataset used in this project is the “Indian Crimes Dataset” by Sudhanvahg on Kaggle, which captures criminal activity across multiple Indian cities between 2020 and 2024. The original dataset contains 40,160 records. For this project, the data was subsetted to:
• Include only 14 major Indian cities, chosen for population size and crime-hotspot status (e.g., Delhi, Mumbai, Bangalore, Hyderabad)
• Maintain temporal balance by sampling 40% of the records from each year
The resulting working dataset has around 12,700 records, sufficient for exploratory data analysis, visualization, and basic predictive modeling.

Learning Objectives

Apply key dplyr functions for filtering, grouping, summarizing, and basic feature engineering in R.
Frame data-driven questions to explore crime patterns and trends across cities and years.
Extract actionable insights related to crime frequency, hotspots, and temporal variations.
Gain experience in handling time-based data to analyze seasonal, monthly, and hourly crime trends.
Implement various visualisations for analysis.

Load required libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(lubridate)

## Warning: package 'lubridate' was built under R version 4.4.3

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.3

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggplot2)
library(grid)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.4.3

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(tidyr)

Loading Dataset

crime_dataset_india <- read_csv("C:/Users/divak/OneDrive/Documents/R projects/CA3/crime_dataset_india.csv",
                                show_col_types = FALSE)
cat("Rows:", nrow(crime_dataset_india), "Columns:", ncol(crime_dataset_india), "\n")

## Rows: 40160 Columns: 14

# Assign dataset to a new variable
crime <- crime_dataset_india

Subset dataset by City and sample 40% from each year

# Selecting 14 cities based on population and crime hotspots
selected_cities <- c("Delhi", "Mumbai", "Bangalore", "Hyderabad", "Chennai",
                     "Kolkata", "Ahmedabad", "Pune", "Lucknow", "Jaipur", 
                     "Patna", "Kanpur", "Surat", "Indore")

#Subset the dataset using the City column & sample 40% of the data from each year

crime_a <- subset(crime, City %in% selected_cities) %>%
  mutate(
    year = substr(`Date Reported`, 7, 10)
  ) %>%
  group_by(year) %>%
  slice_sample(prop = 0.40) %>%
  ungroup()
cat("Rows:", nrow(crime_a), "Columns:", ncol(crime_a), "\n")

## Rows: 12748 Columns: 15
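
The sampling code above does not show a seed, so each run can draw a different 40% sample and the counts reported later may shift between runs. A minimal sketch of a repeatable stratified draw, using a small made-up table (`toy`) and an arbitrary seed value in place of the real data:

```r
library(dplyr)

# Toy stand-in for the crime data (hypothetical values, not the real dataset)
toy <- tibble(
  id   = 1:100,
  year = rep(c("2020", "2021"), times = c(60, 40))
)

# Helper (hypothetical name): draw 40% of the rows within each year
draw_sample <- function(df) {
  df %>%
    group_by(year) %>%
    slice_sample(prop = 0.40) %>%
    ungroup()
}

set.seed(42)  # arbitrary seed; any fixed value makes the draw repeatable
sample_a <- draw_sample(toy)

set.seed(42)  # resetting the same seed reproduces the same rows
sample_b <- draw_sample(toy)

identical(sample_a$id, sample_b$id)
count(sample_a, year)  # rows retained per year
```

Setting the seed once, immediately before the sampling step, is usually enough to keep the printed results and the written interpretations in agreement.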

Date Conversion and Formatting:

crime_a$`Date of Occurrence` <- mdy_hm(crime_a$`Date of Occurrence`)
crime_a$`Date Reported` <- dmy_hm(crime_a$`Date Reported`)
crime_a$`Date Case Closed` <- dmy_hm(crime_a$`Date Case Closed`)
crime_a$`Time of Occurrence` <- dmy_hm(crime_a$`Time of Occurrence`)

The date columns in the dataset were initially stored as character data. Using the lubridate library, these columns were converted to date-time (POSIXct, shown as dttm in tibbles) format, ensuring consistency and enabling more accurate temporal analysis in the subsequent stages of the project. Note that Date of Occurrence is parsed as month-day-year, while the other columns are parsed as day-month-year.
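
Because the columns use different day/month orders, it can help to verify the parse. The sketch below uses made-up strings (not rows from the dataset) to show how a wrong order surfaces as NA values and how a simple ordering check can catch silent mistakes:

```r
library(lubridate)

# Hypothetical sample strings; the real columns may differ in detail
reported   <- c("12-04-2020 16:00", "22-04-2020 14:00")  # day-month-year
occurrence <- c("04-11-2020 10:00", "04-21-2020 01:00")  # month-day-year

reported_dt   <- dmy_hm(reported)
occurrence_dt <- mdy_hm(occurrence)

# Parsing with the wrong order gives NA for impossible dates such as month 21
# (quiet = TRUE suppresses the parse warning)
wrong <- dmy_hm(occurrence, quiet = TRUE)
sum(is.na(wrong))  # number of strings that failed to parse

# Sanity check: an incident should not be reported before it occurred
all(reported_dt >= occurrence_dt)
```

Ambiguous strings such as "04-11-2020" parse under either order, so the NA count alone is not a complete check; the ordering test covers part of that gap.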

1. Understanding the Dataset (Basic Exploration)

1.1 Column Names & Data Types

str(crime_a)

## tibble [12,748 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Report Number     : num [1:12748] 2435 2666 702 3556 2438 ...
##  $ Date Reported     : POSIXct[1:12748], format: "2020-04-12 16:00:00" "2020-04-22 14:00:00" ...
##  $ Date of Occurrence: POSIXct[1:12748], format: "2020-04-11 10:00:00" "2020-04-21 01:00:00" ...
##  $ Time of Occurrence: POSIXct[1:12748], format: "2020-04-12 05:25:00" "2020-04-22 00:46:00" ...
##  $ City              : chr [1:12748] "Pune" "Delhi" "Delhi" "Delhi" ...
##  $ Crime Code        : num [1:12748] 274 114 386 403 428 436 385 548 276 296 ...
##  $ Crime Description : chr [1:12748] "SEXUAL ASSAULT" "FRAUD" "CYBERCRIME" "PUBLIC INTOXICATION" ...
##  $ Victim Age        : num [1:12748] 10 64 27 40 60 32 59 28 65 55 ...
##  $ Victim Gender     : chr [1:12748] "F" "F" "M" "M" ...
##  $ Weapon Used       : chr [1:12748] "Blunt Object" "Other" "Explosives" "Explosives" ...
##  $ Crime Domain      : chr [1:12748] "Violent Crime" "Other Crime" "Other Crime" "Other Crime" ...
##  $ Police Deployed   : num [1:12748] 18 12 13 15 2 18 19 18 1 16 ...
##  $ Case Closed       : chr [1:12748] "Yes" "Yes" "No" "Yes" ...
##  $ Date Case Closed  : POSIXct[1:12748], format: "2021-07-07 16:00:00" "2020-04-30 14:00:00" ...
##  $ year              : chr [1:12748] "2020" "2020" "2020" "2020" ...

The dataset contains 15 columns and 12,748 rows, with numeric columns (e.g., Victim Age, Police Deployed), character columns (e.g., City, Crime Description), and POSIXct date-time columns.

1.2 Missing Values and Affected Columns

colSums(is.na(crime_a))

##      Report Number      Date Reported Date of Occurrence Time of Occurrence 
##                  0                  0                  0                  0 
##               City         Crime Code  Crime Description         Victim Age 
##                  0                  0                  0                  0 
##      Victim Gender        Weapon Used       Crime Domain    Police Deployed 
##                  0                  0                  0                  0 
##        Case Closed   Date Case Closed               year 
##                  0               6340                  0

Only the Date Case Closed column contains missing values, which is expected since not all cases have been resolved yet. The rest of the dataset is complete and does not require handling of missing values for other fields.

1.3 Records Count for Each of the 14 Selected Cities

table(crime_a$City)

## 
## Ahmedabad Bangalore   Chennai     Delhi Hyderabad    Indore    Jaipur    Kanpur 
##       730      1427       995      2197      1106       292       569       469 
##   Kolkata   Lucknow    Mumbai     Patna      Pune     Surat 
##      1035       598      1725       289       879       437

Delhi, Mumbai, and Bangalore have the highest number of records, indicating higher crime reporting or more incidents in these cities. Cities like Patna, Surat, and Indore have fewer entries, possibly due to lower crime volume or underreporting. This distribution will be useful when analyzing regional trends and comparing urban crime rates.

1.4 Time Range Covered and Consistency of Dates

# Get minimum and maximum dates
min(crime_a$`Date of Occurrence`, na.rm = TRUE)

## [1] "2020-01-01 08:00:00 UTC"

max(crime_a$`Date of Occurrence`, na.rm = TRUE)

## [1] "2024-07-31 06:00:00 UTC"

The Date of Occurrence ranges from 1 January 2020 to 31 July 2024, so the final year of the analysis window is only partially covered (January through July). No NA values are present in this column.

2. Data Extraction & Filtering

2.1 Crimes reported most frequently between 2020-2024

crime_a %>%
group_by(`Crime Description`) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
slice_head(n = 10)

## # A tibble: 10 × 2
##    `Crime Description` Count
##    <chr>               <int>
##  1 FRAUD                 660
##  2 BURGLARY              649
##  3 CYBERCRIME            628
##  4 DRUG OFFENSE          626
##  5 SEXUAL ASSAULT        625
##  6 ROBBERY               622
##  7 FIREARM OFFENSE       618
##  8 VANDALISM             618
##  9 IDENTITY THEFT        617
## 10 ILLEGAL POSSESSION    616

Interpretation

The three most frequent crimes across cities are Fraud (660), Burglary (649), and Cybercrime (628). All ten categories shown have more than 600 reports, and the gap between first and tenth is only 44 cases, so no single crime type dominates. The mix of financial, cyber, violent, and property-related crimes suggests the need for diverse law enforcement strategies. Understanding this distribution helps cities prioritize resources for both prevention and investigation.

2.2 All unresolved crimes from Mumbai involving weapons

mumbai_unresolved_weapons <- crime_a %>%
filter(
City == "Mumbai",
`Case Closed` == "No",
`Weapon Used` != "None"
)
nrow(mumbai_unresolved_weapons)

## [1] 733

Interpretation

Among the sampled crimes in Mumbai between 2020 and 2024, 733 cases involved a weapon and remain unresolved, roughly 42% of the city's 1,725 sampled records. This indicates that a significant portion of violent or high-risk crimes have not yet led to closure. The presence of weapons in these open cases suggests either complexity in investigation or challenges in apprehending offenders. It also reflects the persistence of serious crimes within the city during this period.

2.3 No. of crimes against female victims that occurred at night

night_crimes_females <- crime_a %>%
filter(`Victim Gender` == "F") %>%
mutate(hour = hour(`Date of Occurrence`)) %>%
filter(hour >= 20 | hour < 6)
nrow(night_crimes_females)

## [1] 2985

Interpretation

A total of 2,985 crimes involving female victims occurred during night-time hours (between 8 PM and 6 AM). These incidents span multiple cities like Kolkata, Chennai, Delhi, Mumbai, and Hyderabad, indicating that such crimes are widespread across regions. This is a raw count with no daytime comparison, so it does not by itself show that risk is higher at night, but it does highlight the need for improved safety measures during late hours.

2.4 Crimes where more than 15 police personnel deployed but case is open

open_heavy_deployment <- crime_a %>%
filter(`Police Deployed` > 15, `Case Closed` == "No")
nrow(open_heavy_deployment)

## [1] 1359

head(open_heavy_deployment)

## # A tibble: 6 × 15
##   `Report Number` `Date Reported`     `Date of Occurrence` `Time of Occurrence`
##             <dbl> <dttm>              <dttm>               <dttm>              
## 1              14 2020-01-02 22:00:00 2020-01-01 13:00:00  2020-01-01 17:46:00 
## 2            8374 2020-12-15 07:00:00 2020-12-14 21:00:00  2020-12-15 01:28:00 
## 3            1274 2020-02-24 21:00:00 2020-02-23 01:00:00  2020-02-23 11:11:00 
## 4            2579 2020-04-18 04:00:00 2020-04-17 10:00:00  2020-04-17 11:58:00 
## 5             424 2020-01-19 06:00:00 2020-01-18 15:00:00  2020-01-19 03:08:00 
## 6            2741 2020-04-26 19:00:00 2020-04-24 04:00:00  2020-04-24 15:30:00 
## # ℹ 11 more variables: City <chr>, `Crime Code` <dbl>,
## #   `Crime Description` <chr>, `Victim Age` <dbl>, `Victim Gender` <chr>,
## #   `Weapon Used` <chr>, `Crime Domain` <chr>, `Police Deployed` <dbl>,
## #   `Case Closed` <chr>, `Date Case Closed` <dttm>, year <chr>

Interpretation

A total of 1,359 criminal cases had more than 15 police personnel deployed, yet these cases remain open. This suggests that even significant deployment of law enforcement resources does not always guarantee swift resolution. These cases may involve serious, complex, or sensitive crimes, such as arson, extortion, or homicide, requiring prolonged investigations or facing challenges in evidence collection or prosecution.

3. Grouping & Summarization

3.1 City with highest average no of police deployed per crime

avg_police_city <- crime_a %>%
group_by(City) %>%
summarise(avg_police_deployed = mean(`Police Deployed`, na.rm = TRUE)) %>%
arrange(desc(avg_police_deployed))
head(avg_police_city)

## # A tibble: 6 × 2
##   City    avg_police_deployed
##   <chr>                 <dbl>
## 1 Jaipur                 10.3
## 2 Surat                  10.3
## 3 Chennai                10.2
## 4 Kolkata                10.2
## 5 Kanpur                 10.2
## 6 Mumbai                 10.0

Interpretation

Jaipur and Surat have the highest average police deployment per crime (about 10.3 officers). The top six cities all fall between 10.0 and 10.3, so the differences are small and do not by themselves indicate more serious or resource-intensive incidents in any one city.

3.2 Month with highest no of crimes reported all years

crime_by_month <- crime_a %>%
mutate(report_month = month(`Date Reported`, label = TRUE)) %>%
group_by(report_month) %>%
summarise(total_crimes = n()) %>%
arrange(desc(total_crimes))
crime_by_month

## # A tibble: 12 × 2
##    report_month total_crimes
##    <ord>               <int>
##  1 Jul                  1199
##  2 Mar                  1197
##  3 Apr                  1171
##  4 Jan                  1158
##  5 Jun                  1147
##  6 May                  1130
##  7 Feb                  1076
##  8 Aug                   957
##  9 Dec                   949
## 10 Oct                   944
## 11 Nov                   919
## 12 Sep                   901

Interpretation

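The data shows that July had the highest number of reported crimes (1,199), followed closely by March (1,197) and April (1,171). September saw the fewest reports (901). January through July each exceed 1,070 reports, while August through December all fall below 960. Since the data ends in July 2024, the later months are covered by only four years instead of five, so much of this apparent seasonal pattern likely reflects coverage rather than a true peak in spring and early summer.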
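
One way to separate coverage from seasonality is to divide each month's total by the number of years in which that month appears. The sketch below uses a hypothetical table (`toy`) with exactly one report per month from January 2020 through July 2024, so any difference in the raw totals is purely a coverage effect:

```r
library(dplyr)
library(lubridate)

# Hypothetical report dates: one per month, Jan 2020 through Jul 2024
toy <- tibble(
  reported = seq(ymd("2020-01-15"), ymd("2024-07-15"), by = "month")
)

monthly <- toy %>%
  mutate(yr = year(reported), mon = month(reported, label = TRUE)) %>%
  group_by(mon) %>%
  summarise(
    total_crimes  = n(),              # raw count, inflated for Jan-Jul
    years_covered = n_distinct(yr),   # 5 for Jan-Jul, 4 for Aug-Dec
    avg_per_year  = total_crimes / years_covered
  )

monthly
```

In this toy table the raw totals differ between early and late months while the per-year average is identical, which is the pattern to rule out before reading the real counts as seasonal.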

3.3 Average victim age per crime type

avg_age_per_crime <- crime_a %>%
group_by(`Crime Description`) %>%
summarise(avg_victim_age = mean(`Victim Age`, na.rm = TRUE)) %>%
arrange(desc(avg_victim_age))
head(avg_age_per_crime)

## # A tibble: 6 × 2
##   `Crime Description` avg_victim_age
##   <chr>                        <dbl>
## 1 DOMESTIC VIOLENCE             45.6
## 2 VEHICLE - STOLEN              45.5
## 3 EXTORTION                     45.5
## 4 ILLEGAL POSSESSION            45.2
## 5 VANDALISM                     45.1
## 6 FRAUD                         45.0

Interpretation

The average victim age varies little across the crime types shown. The six highest averages, led by Domestic Violence (45.6), Vehicle - Stolen (45.5), and Extortion (45.5), all fall between 45.0 and 45.6 years. The output lists only the top six, so the crime types with the youngest victims are not visible here. On this evidence, differences in victim age between crime types appear small.

3.4 Crime domain with highest no of open cases

open_cases_by_domain <- crime_a %>%
filter(`Case Closed` == "No") %>%
group_by(`Crime Domain`) %>%
summarise(open_case_count = n()) %>%
arrange(desc(open_case_count))
open_cases_by_domain

## # A tibble: 4 × 2
##   `Crime Domain`   open_case_count
##   <chr>                      <int>
## 1 Other Crime                 3690
## 2 Violent Crime               1795
## 3 Fire Accident                597
## 4 Traffic Fatality             258

Interpretation

This result shows the distribution of open cases across different crime domains. “Other Crime” has the highest number of unresolved cases (3,690), followed by “Violent Crime” (1,795). “Fire Accident” and “Traffic Fatality” have relatively fewer open cases, with 597 and 258 respectively. These are raw counts, so the ranking partly reflects how many cases each domain contains; comparing open-case proportions would be needed to say which domain is resolved least often.

4. Sorting & Ranking Data

4.1 Top 5 cities with highest no. of female victim crimes at night

crime_a %>%
filter(`Victim Gender` == "F") %>%
mutate(hour = hour(`Date of Occurrence`)) %>%
filter(hour >= 20 | hour < 6) %>%
count(City, name = "night_female_crimes") %>%
arrange(desc(night_female_crimes)) %>%
slice_head(n = 5)

## # A tibble: 5 × 2
##   City      night_female_crimes
##   <chr>                   <int>
## 1 Delhi                     518
## 2 Mumbai                    392
## 3 Bangalore                 354
## 4 Hyderabad                 244
## 5 Kolkata                   243

Interpretation

The data suggests that larger metropolitan cities like Delhi, Mumbai, and Bangalore experience a higher number of nighttime crimes involving female victims. This could be due to higher population density, increased female mobility, and greater reporting rates in these urban areas. It highlights the need for stronger nighttime safety infrastructure in major cities.

4.2 Top 5 most frequently used weapons in crime

crime_a %>%
group_by(`Weapon Used`) %>%
summarise(total = n()) %>%
arrange(desc(total)) %>%
filter(`Weapon Used` != "None") %>%
# Optional filter if "None" dominates
slice_head(n = 5)

## # A tibble: 5 × 2
##   `Weapon Used` total
##   <chr>         <int>
## 1 Knife          1884
## 2 Blunt Object   1849
## 3 Explosives     1826
## 4 Poison         1823
## 5 Firearm        1759

Interpretation

The top five most frequently used weapons in reported crimes are knife, blunt objects, explosives, poison, and firearms. The narrow margin between them suggests a fairly even distribution among these weapon types. The high usage of knives and blunt objects could indicate a prevalence of spontaneous or easily accessible weapons, especially in violent or street-level crimes.

4.3 Cities reporting the highest proportion of open cases (Top 5)

crime_a %>%
group_by(City) %>%
summarise(
total_cases = n(),
open_cases = sum(`Case Closed` == "No", na.rm = TRUE)
) %>%
mutate(open_case_ratio = open_cases / total_cases) %>%
arrange(desc(open_case_ratio)) %>%
slice_head(n = 5)

## # A tibble: 5 × 4
##   City      total_cases open_cases open_case_ratio
##   <chr>           <int>      <int>           <dbl>
## 1 Pune              879        455           0.518
## 2 Surat             437        224           0.513
## 3 Hyderabad        1106        559           0.505
## 4 Patna             289        146           0.505
## 5 Lucknow           598        302           0.505

Interpretation

The data reveals that Pune (51.8%), Surat (51.3%), and Hyderabad (50.5%) have the highest open case ratios, followed by Patna and Lucknow at about 50.5%. In each of the top five cities, slightly more than half of the sampled cases remain unresolved. The ratios differ by barely one percentage point, so they point to a broadly similar closure rate across these cities rather than a clear strain on investigative resources in any specific one.
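
To judge whether a ratio such as Pune's 455 open cases out of 879 is distinguishable from an even split, a confidence interval for the proportion is a quick check. This is a rough sketch that assumes the sampled cases can be treated as independent draws; it takes the two counts from the table above:

```r
# Counts for Pune taken from the table above
open_cases  <- 455
total_cases <- 879

# prop.test() from base R's stats package returns a 95% interval by default
pune_test <- prop.test(open_cases, total_cases)
ci <- pune_test$conf.int

open_cases / total_cases  # observed open-case ratio
ci                        # 95% confidence interval for that ratio
```

If the interval contains 0.5, the data are consistent with half of the cases being open, and small differences between cities in the ranking should not be over-read.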

5. Feature Engineering

5.1 Create a new feature to categorize crimes based on victim age group (e.g., Child, Teen, Adult, Senior). How many crimes fall under each age group?

crime_victimage <- crime_a %>%
mutate(age_group = case_when(
`Victim Age` < 13 ~ "Child",
`Victim Age` >= 13 & `Victim Age` < 20 ~ "Teen",
`Victim Age` >= 20 & `Victim Age` < 60 ~ "Adult",
`Victim Age` >= 60 ~ "Senior",
TRUE ~ "Unknown"
)) %>%
group_by(age_group) %>%
summarise(total_crimes = n()) %>%
arrange(desc(total_crimes))

Interpretation

The majority of crimes involve adult victims (7,359 cases), followed by seniors (3,586 cases). Teenagers and children are less frequently affected, with 1,261 and 542 cases respectively. This distribution suggests that adults are the most targeted group, possibly due to their higher exposure in public and professional spaces, while the notable number of senior victims may point to their vulnerability.
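
The same age bands can also be expressed with base R's `cut()`, which keeps the boundaries in one place. A small sketch with made-up ages chosen to sit on each boundary:

```r
# Hypothetical ages chosen to hit each boundary
ages <- c(5, 12, 13, 19, 20, 59, 60, 85)

age_group <- cut(
  ages,
  breaks = c(-Inf, 13, 20, 60, Inf),
  labels = c("Child", "Teen", "Adult", "Senior"),
  right  = FALSE  # intervals are [lower, upper), matching the `<` comparisons above
)

table(age_group)
```

With `right = FALSE`, an age of exactly 13 falls in "Teen" and exactly 60 in "Senior", the same as the `case_when()` rules. Unlike `case_when()` with its "Unknown" fallback, `cut()` leaves missing ages as NA.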

5.2 Calculate “reporting delay” in days. Which crimes have the highest average reporting delay?

crime_temp <- crime_a %>%
  mutate(report_delay = as.numeric(`Date Reported` - `Date of Occurrence`, units = "days")) %>%
  group_by(`Crime Description`) %>%
  summarise(avg_delay = mean(report_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay)) %>%
  slice_head(n = 5)
print(crime_temp)

## # A tibble: 5 × 2
##   `Crime Description` avg_delay
##   <chr>                   <dbl>
## 1 DRUG OFFENSE             1.56
## 2 CYBERCRIME               1.55
## 3 COUNTERFEITING           1.55
## 4 SEXUAL ASSAULT           1.53
## 5 ILLEGAL POSSESSION       1.53

Interpretation

Drug Offense (1.56 days), Cybercrime (1.55), Counterfeiting (1.55), Sexual Assault (1.53), and Illegal Possession (1.53) show the highest average reporting delays, all just above 1.5 days. The spread among them is only about 0.03 days, so the ranking itself carries little information. Even short delays can impact response and investigation, and for sensitive crimes such as sexual assault they may reflect hesitation or barriers in reporting. The trend highlights a potential need for more accessible and supportive reporting mechanisms.
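
The delay calculation depends on the time difference being expressed in days. Plain subtraction of two date-times lets R choose the unit from the data, so stating the unit explicitly is safer. A small sketch with made-up timestamps:

```r
# Hypothetical timestamps: gaps of 6 hours and 37 hours
reported <- as.POSIXct(c("2020-04-12 16:00", "2020-04-22 14:00"), tz = "UTC")
occurred <- as.POSIXct(c("2020-04-12 10:00", "2020-04-21 01:00"), tz = "UTC")

# Subtraction picks the unit automatically from the smallest gap
auto <- reported - occurred
units(auto)

# difftime() with explicit units always returns days
delay_days <- as.numeric(difftime(reported, occurred, units = "days"))
delay_days
```

Here the automatic unit is hours because the smallest gap is under a day, so converting `auto` with a bare `as.numeric()` would give hours instead of days.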

5.3 Is there any seasonality in crime trends

crime_seasonality <- crime_a %>%
mutate(
month = month(`Date Reported`, label = TRUE),
season = case_when(
month %in% c("Dec", "Jan", "Feb") ~ "Winter",
month %in% c("Mar", "Apr", "May") ~ "Spring",
month %in% c("Jun", "Jul", "Aug") ~ "Summer",
month %in% c("Sep", "Oct", "Nov") ~ "Autumn"
)
) %>%
group_by(season) %>%
summarise(total_crimes = n()) %>%
arrange(desc(total_crimes))
crime_seasonality

## # A tibble: 4 × 2
##   season total_crimes
##   <chr>         <int>
## 1 Spring         3498
## 2 Summer         3303
## 3 Winter         3183
## 4 Autumn         2764

Interpretation

The seasonal crime analysis reveals that Spring has the highest number of reported crimes (3,498), followed by Summer (3,303) and Winter (3,183). Autumn records the lowest (2,764). This pattern may suggest increased crime activity during warmer months, possibly due to higher public mobility and interactions during these times. However, the data ends in July 2024, so Autumn and part of Summer and Winter are covered by only four years, which lowers their totals.

6. Regression

6.1 Simple Linear: How does the number of police officers deployed impact the number of days taken to close a case?

# Compute days to close (explicit units so the result is always in days)
crime_a$D2Close <- as.numeric(difftime(crime_a$`Date Case Closed`, crime_a$`Date Reported`, units = "days"))
# Build regression model
slr_time <- lm(D2Close ~ `Police Deployed`, data = crime_a)
# View model summary
summary(slr_time)

## 
## Call:
## lm(formula = D2Close ~ `Police Deployed`, data = crime_a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -88.32 -61.23 -35.23  -8.25 643.43 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        90.3443     3.2894  27.466   <2e-16 ***
## `Police Deployed`  -0.3408     0.2901  -1.175     0.24    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127.1 on 6406 degrees of freedom
##   (6340 observations deleted due to missingness)
## Multiple R-squared:  0.0002153,  Adjusted R-squared:  5.928e-05 
## F-statistic:  1.38 on 1 and 6406 DF,  p-value: 0.2402

# Create New 'Police Deployed' data for prediction
new_police_levels <- data.frame(`Police Deployed` = c(5, 10, 15, 20), check.names = FALSE)
# Predict days to close for new police levels
new_police_levels$predicted_days_to_close <- predict(slr_time, new_police_levels)
# Print predictions
print(new_police_levels)

##   Police Deployed predicted_days_to_close
## 1               5                88.64018
## 2              10                86.93606
## 3              15                85.23194
## 4              20                83.52782

# Visualization - Actual data, regression line, and predicted points
ggplot(crime_a, aes(x = `Police Deployed`, y = D2Close)) +
  geom_point(alpha = 0.4, color = "blue") +  # Blue points for actual data
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # Red regression line
  geom_point(data = new_police_levels, aes(x = `Police Deployed`, y = predicted_days_to_close),
             color = "darkgreen", size = 3) +  # Dark green points for predicted data
  labs(title = "Regression: Police Deployed vs Days to Close",
       x = "Police Deployed",
       y = "Days to Close") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 6340 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 6340 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation

The simple linear regression shows no statistically significant relationship between police deployed and days to case closure (slope = -0.34 days per additional officer, p = 0.24). The R-squared value is also very low (about 0.02%), meaning police deployment explains almost none of the variation. Other factors (crime severity, city protocols, caseload complexity, etc.) are likely to matter far more. Note that the model uses only the 6,408 closed cases, since open cases have no closing date.
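
The slope's uncertainty can be reconstructed from the summary table alone. The sketch below takes the estimate, standard error, and residual degrees of freedom printed above and computes the 95% confidence interval and two-sided p-value by hand:

```r
# Values copied from the model summary above
est <- -0.3408   # slope for Police Deployed
se  <- 0.2901    # its standard error
df  <- 6406      # residual degrees of freedom

t_crit <- qt(0.975, df)                # critical t value for a 95% interval
ci     <- est + c(-1, 1) * t_crit * se # lower and upper bounds
p_val  <- 2 * pt(-abs(est / se), df)   # two-sided p-value

ci
p_val
```

The interval spans zero, which is another way of seeing that the data do not pin down even the direction of the effect. On a fitted model, `confint(slr_time)` gives the same interval directly.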

6.2 Multiple Linear Regression: Number of days to case closure - based on the year, month, Crime Domain, and City. Are there temporal trends in how quickly cases are closed?

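# Keep only complete rows; this drops open cases, which have no closing date
crime_n <- crime_a
crime_clean <- crime_n %>%
  drop_na()

# Convert 'Crime Domain' and 'City' into factors
crime_clean$`Crime Domain` <- as.factor(crime_clean$`Crime Domain`)
crime_clean$City <- as.factor(crime_clean$City)

# Extract year and month of occurrence as numeric predictors
crime_clean$Year <- year(crime_clean$`Date of Occurrence`)
crime_clean$Month <- month(crime_clean$`Date of Occurrence`)

# Days from report to closure (explicit units so the result is always in days)
crime_clean$D2Close <- as.numeric(difftime(crime_clean$`Date Case Closed`, crime_clean$`Date Reported`, units = "days"))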
# Fit a multi-linear regression model
model <- lm(D2Close ~ Year + Month + `Crime Domain` + City, data = crime_clean)

# Summary of the model
summary(model)

## 
## Call:
## lm(formula = D2Close ~ Year + Month + `Crime Domain` + City, 
##     data = crime_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -124.87  -59.54  -30.69    1.33  647.41 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2039.3204  2378.0206   0.858 0.391162    
## Year                             -0.9931     1.1761  -0.844 0.398479    
## Month                             0.5163     0.4634   1.114 0.265261    
## `Crime Domain`Other Crime        33.5261     5.4898   6.107 1.07e-09 ***
## `Crime Domain`Traffic Fatality  -33.0888     8.9250  -3.707 0.000211 ***
## `Crime Domain`Violent Crime      75.0446     5.8570  12.813  < 2e-16 ***
## CityBangalore                     8.3424     7.9217   1.053 0.292327    
## CityChennai                      18.6373     8.4973   2.193 0.028319 *  
## CityDelhi                        16.4443     7.4250   2.215 0.026815 *  
## CityHyderabad                    14.0639     8.3434   1.686 0.091916 .  
## CityIndore                       24.5500    12.0934   2.030 0.042394 *  
## CityJaipur                        7.5825     9.7703   0.776 0.437733    
## CityKanpur                       -6.1859    10.3025  -0.600 0.548244    
## CityKolkata                      20.2944     8.3786   2.422 0.015455 *  
## CityLucknow                      10.6645     9.6639   1.104 0.269834    
## CityMumbai                       16.1463     7.6948   2.098 0.035915 *  
## CityPatna                         9.8524    12.2086   0.807 0.419693    
## CityPune                         13.8150     8.8202   1.566 0.117332    
## CitySurat                         5.4510    10.6633   0.511 0.609233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124 on 6389 degrees of freedom
## Multiple R-squared:  0.04986,    Adjusted R-squared:  0.04718 
## F-statistic: 18.63 on 18 and 6389 DF,  p-value: < 2.2e-16

crime_clean$Year <- format(crime_clean$`Date of Occurrence`, "%Y")
crime_clean$Month <- format(crime_clean$`Date of Occurrence`, "%m")

# 1. Trend of Case Closure Days Over Year
ggplot(crime_clean, aes(x = as.numeric(Year), y = D2Close)) +
  geom_point(aes(color = Year), alpha = 0.5) + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  labs(title = "Trend of Case Closure Days Over Year", x = "Year", y = "Days to Case Closure") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation

The plot shows that case-closure times vary widely within each year (from near 0 up past 600 days), and the red linear trend is essentially flat. In other words, there is no meaningful change in average closure duration from 2020 to 2024. The year of occurrence does not explain how long cases remain open.

# 2. Impact of Crime Domain on Case Closure Days
ggplot(crime_clean, aes(x = `Crime Domain`, y = D2Close)) + 
  geom_boxplot(aes(fill = `Crime Domain`), alpha = 0.6) +
  labs(title = "Impact of Crime Domain on Case Closure Days", x = "Crime Domain", y = "Days to Case Closure") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation

Traffic Fatality cases close fastest (median ≈ 10 days) with the smallest spread. Fire Accidents and Other Crimes sit in the middle (medians around 45–60 days) with moderate variability. Violent Crimes take the longest (median ≈ 80 days) and show the greatest spread, including many extreme outliers exceeding 600 days. In short, case complexity aligns with domain: violent incidents drag on longest, traffic fatalities resolve quickest, and other categories fall in between.

# 3. Impact of City on Case Closure Days
ggplot(crime_clean, aes(x = City, y = D2Close)) + 
  geom_boxplot(aes(fill = City), alpha = 0.6) + 
  labs(title = "Impact of City on Case Closure Days", x = "City", y = "Days to Case Closure") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Case closure time varies across cities, but overall, most cities show similar median days with wide spreads—Mumbai and Patna lean slightly higher, while cities like Kanpur and Jaipur appear relatively quicker.

Interpretation

Crime Domain is a strong predictor. Relative to the base category (Fire Accident, the first factor level alphabetically), Violent Crimes take about 75 days longer to close, Other Crimes about 34 days longer, and Traffic Fatalities about 33 days less. All three effects are highly significant.
Year shows a small negative coefficient (about 1 day quicker each year), but it is not statistically significant (p = 0.40), so there is no evidence of a trend in closure time.
Month has no significant effect (p = 0.27), suggesting that the time of year does not affect case closure duration.
City has a modest influence: relative to Ahmedabad (the base level), Chennai, Delhi, Indore, Kolkata, and Mumbai take roughly 16 to 25 days longer (p < 0.05), while the other cities do not differ significantly.
Overall, the model explains only about 5% of the variance in closure time (R-squared = 0.05).
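
Per-level p-values for a factor depend on which level is the baseline, so a joint test of all City coefficients is often more informative than reading them one by one. A minimal sketch on synthetic data (the `toy` data frame, its column names, and the effect sizes are invented for illustration):

```r
set.seed(7)  # arbitrary seed for the synthetic example

# Synthetic data: closure time depends on domain only, not on city
toy <- data.frame(
  city   = factor(sample(c("A", "B", "C"), 300, replace = TRUE)),
  domain = factor(sample(c("Fire", "Other", "Violent"), 300, replace = TRUE))
)
toy$days <- 50 + 30 * (toy$domain == "Violent") + rnorm(300, sd = 20)

full    <- lm(days ~ domain + city, data = toy)
reduced <- lm(days ~ domain, data = toy)

# F-test of all city coefficients at once (nested model comparison)
joint <- anova(reduced, full)
joint

# Changing the reference level changes what each coefficient is compared against
toy$domain <- relevel(toy$domain, ref = "Violent")
levels(toy$domain)[1]
```

Applied to the model above, the same comparison would mean fitting `D2Close ~ Year + Month + Crime Domain` with and without `City` and passing both fits to `anova()`.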

7. Polynomial Regression

7.1 Polynomial Regression: How does the number of crimes reported in Delhi vary over time (2020–2024)?

# Step 1: Aggregate Delhi crime data by Year
# Note: crime_clean holds only closed cases (rows with a missing closing date were dropped)
delhi_data <- crime_clean %>%
  filter(City == "Delhi") %>%
  group_by(Year) %>%
  summarise(Crime_Count = n()) %>%
  mutate(Year = as.numeric(Year))  # Ensure Year is numeric

# Step 2: Build the 2nd-degree polynomial regression model
polynomial_model <- lm(Crime_Count ~ poly(Year, 2), data = delhi_data)

# Step 3: View model summary
summary(polynomial_model)

## 
## Call:
## lm(formula = Crime_Count ~ poly(Year, 2), data = delhi_data)
## 
## Residuals:
##        1        2        3        4        5 
##   1.7143   0.5429 -11.9143  15.3429  -5.6857 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     224.200      6.426  34.890  0.00082 ***
## poly(Year, 2)1  -69.254     14.369  -4.820  0.04045 *  
## poly(Year, 2)2  -35.011     14.369  -2.437  0.13512    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.37 on 2 degrees of freedom
## Multiple R-squared:  0.9358, Adjusted R-squared:  0.8717 
## F-statistic: 14.58 on 2 and 2 DF,  p-value: 0.06417

# Step 4: Create new data for prediction
new_years <- data.frame(Year = seq(min(delhi_data$Year), max(delhi_data$Year), by = 1))

# Step 5: Predict crime counts
new_years$Predicted_Crimes <- predict(polynomial_model, newdata = new_years)

# Step 6: Visualization
ggplot(delhi_data, aes(x = Year, y = Crime_Count)) +
  geom_point(color = "darkgreen", size = 3, alpha = 0.7) +  # Actual data points
  geom_line(data = new_years, aes(x = Year, y = Predicted_Crimes), 
            color = "red", linewidth = 1.2) +  # Polynomial curve (linewidth replaces the deprecated size)
  labs(
    title = "Polynomial Regression (Degree 2): Delhi Crime Trend (2020–2024)",
    x = "Year", y = "Crime Count"
  ) +
  theme_minimal()

Interpretation

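The fitted quadratic curve suggests a slight rise in crime count from 2020 to 2021, followed by a decline that steepens toward 2024. The negative quadratic coefficient gives a downward-opening curve, which captures a peak followed by a dip and may point to external factors affecting crime rates over time. The evidence is weak, however: with only five yearly observations and 2 residual degrees of freedom, neither the quadratic term (p ≈ 0.135) nor the overall F-test (p ≈ 0.064) is significant at the 0.05 level, so the curve is descriptive rather than conclusive despite the high R² (≈ 0.94).

Initial Rise (2020 → 2021): The model (red curve) and the actual data both show crime activity at its high point.

Plateau & Fall (2021 → 2022): The trend levels off, then starts to decline. The red curve smooths the dip; assuming the rows are ordered by year, the actual 2022 count (green dot) sits about 12 below the fitted value (residual −11.9).

Decline (2022 → 2024): The model predicts a steep drop in crime counts toward 2024. The actual crime count in 2024 is also quite low, matching the trend.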
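
One way to check whether the squared term earns its place is to compare the degree-2 model against a plain linear fit. The sketch below does this with a nested F-test via `anova()` and with adjusted R². The `toy` data frame is invented purely for illustration and does not contain the Delhi counts.

```r
# Illustrative only: these yearly counts are invented, not taken from the dataset
toy <- data.frame(
  Year = 2020:2024,
  Crime_Count = c(240, 250, 235, 215, 170)
)

linear_fit    <- lm(Crime_Count ~ Year, data = toy)
quadratic_fit <- lm(Crime_Count ~ poly(Year, 2), data = toy)

# Nested F-test: does adding the squared term significantly reduce residual error?
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)

# Adjusted R-squared penalises the extra parameter, so it is a fairer
# yardstick than plain R-squared when comparing models of different size
adj_r2 <- c(
  linear    = summary(linear_fit)$adj.r.squared,
  quadratic = summary(quadratic_fit)$adj.r.squared
)
print(round(adj_r2, 3))
```

With so few observations, a large drop in residual error can still fail to reach significance, which is why both the F-test and the adjusted R² are worth reading together.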

8. Anova

8.1 ANOVA: Does the average victim age differ significantly across different cities in India from 2020 to 2024?

crime_filtered <- crime_a %>%
  filter(format(as.Date(`Date Reported`), "%Y") %in% c("2020", "2021", "2022", "2023", "2024"))

# Run ANOVA
anova_model <- aov(`Victim Age` ~ City, data = crime_filtered)

# Summary of ANOVA
summary(anova_model)

##                Df  Sum Sq Mean Sq F value Pr(>F)
## City           13    3218   247.5   0.609  0.849
## Residuals   12734 5177877   406.6

#Visualisation - Box Plot
ggplot(crime_a, aes(x = City, y = `Victim Age`)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Distribution of Victim Age Across Cities", x = "City", y = "Victim Age")

Interpretation

Based on the ANOVA results, we fail to reject the null hypothesis, as the p-value (0.849) is greater than the significance level (0.05). This suggests that the average victim age does not differ significantly across different cities in India from 2020 to 2024. In other words, geographic location (city) does not have a statistically significant impact on victim age for the given years.
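
Reading the p-value off the printed table by hand is error-prone, so it can help to pull it from the fitted object directly. The following is a minimal sketch on simulated data (the `toy` data frame, its city labels, and the `Victim_Age` column name are all hypothetical); it relies on the fact that `summary()` of an `aov` fit returns a list whose first element is the ANOVA table.

```r
set.seed(123)  # reproducible simulated data

# Simulated stand-in for the real data: ages drawn from the same
# distribution in every city, so no true difference exists
toy <- data.frame(
  City = factor(rep(c("City A", "City B", "City C"), each = 100)),
  Victim_Age = sample(10:79, size = 300, replace = TRUE)
)

fit <- aov(Victim_Age ~ City, data = toy)

# summary() on an aov fit returns a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]  # row 1 is the City term
p_value <- anova_table[["Pr(>F)"]][1]

cat(sprintf("F = %.3f, p = %.3f\n", f_value, p_value))
decision <- if (p_value < 0.05) "reject" else "fail to reject"
cat("Decision at alpha = 0.05:", decision, "the null hypothesis\n")
```

Embedding `p_value` in the report text (for example through inline R code) keeps the written interpretation in step with the model output.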

8.2 Two-Way ANOVA: Does the average victim age differ significantly across different cities and crime domains in India from 2020 to 2024?

# Two-way ANOVA (City * `Crime Domain` expands to both main effects plus their interaction)
anova_two_way <- aov(`Victim Age` ~ City * `Crime Domain`, data = crime_filtered)
# Summary of Two-way ANOVA
summary(anova_two_way)

##                        Df  Sum Sq Mean Sq F value Pr(>F)
## City                   13    3218   247.5   0.608  0.849
## `Crime Domain`          3     150    49.8   0.122  0.947
## City:`Crime Domain`    39   13440   344.6   0.847  0.738
## Residuals           12692 5164288   406.9

#Visualisation - Box Plot
ggplot(crime_a, aes(x = City, y = `Victim Age`, fill = `Crime Domain`)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Distribution of Victim Age Across Cities and Crime Domains", x = "City", y = "Victim Age")

Interpretation

Based on these results, we fail to reject the null hypothesis for all three factors (City, Crime Domain, and their interaction). There is no statistically significant difference in victim age across cities or crime domains, nor is there any significant interaction between the two factors. Thus, victim age seems to be independent of city and crime domain in this dataset from 2020 to 2024.
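
The Df column of the two-way table can be reproduced by hand, which is a quick sanity check on the model. The sketch below assumes 14 cities and 4 crime domains (inferred from the Df values of 13 and 3) and 12,748 rows in total (inferred from the one-way table: 13 + 12,734 + 1). None of these counts is printed explicitly in the output, so treat them as assumptions.

```r
n_cities  <- 14     # assumption: Df for City is 13
n_domains <- 4      # assumption: Df for Crime Domain is 3
n_rows    <- 12748  # assumption: 13 + 12734 + 1 from the one-way ANOVA

df_city        <- n_cities - 1                   # main effect of City
df_domain      <- n_domains - 1                  # main effect of Crime Domain
df_interaction <- df_city * df_domain            # City x Crime Domain
df_residual    <- n_rows - n_cities * n_domains  # valid when every cell has data

print(c(City = df_city, Domain = df_domain,
        Interaction = df_interaction, Residuals = df_residual))
```

Under these assumptions the values come out as 13, 3, 39, and 12,692, matching the Df column above.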

9. Bar Chart

What is the frequency of different crime domains in each city in India between 2020–2024?

ggplot(crime_a, aes(x = City, fill = `Crime Domain`)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Frequency of Different Crime Domains Across Cities (2020–2024)",
    x = "City",
    y = "Number of Cases",
    fill = "Crime Domain"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Interpretation

The chart shows that “Other Crime” is consistently the most reported crime domain across nearly all cities, with Delhi and Hyderabad showing the highest counts overall. Cities like Mumbai, Bangalore, and Kolkata also report high levels of Violent Crime, while Traffic Fatality and Fire Accidents appear relatively lower in frequency across all cities. This suggests that urban centers face a broader range of criminal activity, with non-violent or miscellaneous crimes dominating.

10. Pie Chart

What is the distribution of different crime domains in India between 2020–2024?

domain_counts <- crime_a %>%
  count(`Crime Domain`, sort = TRUE)

# Create pie chart
ggplot(domain_counts, aes(x = "", y = n, fill = `Crime Domain`)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(
    title = "Distribution of Crimes by Domain",
    fill = "Crime Domain"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

Interpretation

The pie chart shows that “Other Crime” accounts for the largest share of crimes in India between 2020 and 2024, making up well over half of all reported cases. Violent Crimes follow as the second most frequent category. Fire Accidents and Traffic Fatalities represent smaller portions, indicating they are relatively less common. This suggests that preventive strategies should prioritize the “Other Crime” and “Violent Crime” categories.
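
Shares such as "well over half" are easier to verify when the percentages are computed alongside the counts. The sketch below uses base R on a hypothetical count table (the numbers and the exact domain labels are invented for illustration) to add a percentage column and a ready-made label.

```r
# Hypothetical counts, invented for illustration only
domain_counts <- data.frame(
  Crime_Domain = c("Other Crime", "Violent Crime", "Fire Accident", "Traffic Fatality"),
  n = c(600, 250, 100, 50)
)

# Percentage share of each domain, rounded to one decimal place
domain_counts$share <- round(100 * domain_counts$n / sum(domain_counts$n), 1)

# Text label combining the domain name and its share
domain_counts$label <- paste0(domain_counts$Crime_Domain, " (", domain_counts$share, "%)")

print(domain_counts[, c("Crime_Domain", "n", "share")])
```

A column like `label` could then be drawn on the chart with `geom_text(aes(label = label), position = position_stack(vjust = 0.5))`.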

11. Histogram

What is the distribution of victim ages in the crime dataset?

ggplot(crime_a, aes(x = `Victim Age`)) +
  geom_histogram(binwidth = 5, fill = "#69b3a2", color = "black", alpha = 0.8) +
  labs(
    title = "Distribution of Victim Ages",
    x = "Victim Age (in years)",
    y = "Number of Victims"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

Interpretation

The histogram displays a fairly uniform distribution of victim ages, ranging approximately from age 5 to 85. Each age group seems to have a similar number of victims, except for slightly fewer cases in the youngest (0–10) and oldest (80+) age brackets. This suggests that crime victimization in the dataset is not strongly age-dependent and affects individuals of nearly all ages relatively equally.

12. Scatter Plot

What is the relationship between victim age and number of days to case closure in India from 2020 to 2024?

# Scatterplot: Victim Age vs Case Closure Days
ggplot(crime_a, aes(x = D2Close, y = `Victim Age`, color = `Victim Gender`)) +
  geom_point(alpha = 0.8) +        # Scatter plot points
  labs(title = "Scatterplot of Victim Age vs Duration of Case (D2Close)",
       x = "Duration (Days between Date Reported and Date Closed)",
       y = "Victim Age",
       color = "Victim Gender") +
  theme_minimal() +                 # Minimal theme for better readability
  theme(legend.position = "top")    # Position the legend on top

## Warning: Removed 6340 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation

No strong relationship exists between victim age and the number of days to case closure: case durations vary widely across all age groups. The majority of cases are resolved within 100 days, suggesting efficient processing for most incidents. Long-duration cases (over 400 days) appear throughout all age ranges, indicating that delays are not age-specific. Gender distribution is broad, with female (F), male (M), and other (X) victims all showing similar patterns across age and case duration. Note that 6,340 rows were removed from the plot because of missing values, most likely cases with no D2Close value, so the pattern describes only the cases that could be plotted.
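
A visual impression of "no strong relationship" can be backed by a correlation coefficient. The sketch below works on simulated data (the `toy` data frame and its values are hypothetical) and uses `use = "complete.obs"` so that rows with a missing closure time are skipped, mirroring the rows dropped from the scatter plot. Spearman's rank correlation is included because closure times are heavily right-skewed.

```r
set.seed(2024)  # reproducible simulated data

# Simulated stand-in: ages and closure times generated independently of each other
toy <- data.frame(
  Victim_Age = sample(10:79, size = 500, replace = TRUE),
  D2Close    = round(rexp(500, rate = 1 / 60))  # right-skewed durations in days
)
toy$D2Close[sample(500, 200)] <- NA  # mimic cases with no closing date

# use = "complete.obs" drops rows where either value is missing
pearson  <- cor(toy$Victim_Age, toy$D2Close, use = "complete.obs")
spearman <- cor(toy$Victim_Age, toy$D2Close, use = "complete.obs", method = "spearman")

cat(sprintf("Pearson r = %.3f, Spearman rho = %.3f\n", pearson, spearman))
cat("Rows used:", sum(complete.cases(toy)), "of", nrow(toy), "\n")
```

Reporting the number of rows actually used makes it clear how much of the dataset the coefficient describes.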

13. Line Graph

# Extract the year of occurrence
crime_a <- crime_a %>%
  mutate(Year = format(`Date of Occurrence`, "%Y"))

# Line Graph: Proportion of Male vs Female Crime Victims over the Years (2020-2024)
crime_a %>%
  group_by(Year, `Victim Gender`) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(Year) %>%
  mutate(proportion = count / sum(count) * 100) %>%
  ggplot(aes(x = Year, y = proportion, color = `Victim Gender`, group = `Victim Gender`)) +
  geom_line(linewidth = 1) +                            # Line for each gender (linewidth replaces the deprecated size)
  geom_point(size = 2) +                                # Points for each year
  labs(title = "Proportion of Male vs Female Crime Victims over the Years (2020-2024)",
       x = "Year of Occurrence",
       y = "Proportion of Victims (%)") +
  scale_color_manual(values = c("M" = "blue", "F" = "pink", "X" = "green")) +  # Color for Male, Female, Other
  theme_minimal() +                                     # Clean theme
  theme(legend.title = element_blank(),                 # Remove legend title
        legend.position = "bottom")                     # Place the legend at the bottom

Interpretation

Females (F) consistently make up the largest proportion of crime victims, with a noticeable dip in 2022 followed by a sharp rebound by 2024. Males (M) remain relatively stable over the years, with only slight fluctuations around the 32–35% range. Third gender/unspecified (X) shows a striking spike in 2022, temporarily matching male victim levels, but then declines sharply again—likely due to a data reporting anomaly or specific events that year.

14. Pair Plot

How do key numerical features like Victim Age, Number of Days to Closure, and Year relate to each other, and do these patterns vary by Victim Gender?

# Drop rows with NA values in the relevant columns
crime_a_cleaned <- crime_a %>%
  drop_na(`Date Case Closed`, `Date Reported`, `Date of Occurrence`)

# Create a new variable 'D2Close' that calculates the difference between 'Date Case Closed' and 'Date Reported'
crime_a_cleaned$D2Close <- as.numeric(crime_a_cleaned$`Date Case Closed` - crime_a_cleaned$`Date Reported`)

# Extract the year from 'Date of Occurrence'
crime_a_cleaned <- crime_a_cleaned %>%
  mutate(Year = format(`Date of Occurrence`, "%Y"))

# Convert the 'Year' column to numeric
crime_subset_pair <- crime_a_cleaned %>%
  select(`Victim Age`, D2Close, Year, `Victim Gender`) %>%
  mutate(Year = as.numeric(Year))

# Pair Plot using ggpairs to visualize relationships

ggpairs(
  crime_subset_pair,
  aes(color = `Victim Gender`, alpha = 0.6),
  lower = list(continuous = "smooth"),
  upper = list(continuous = "cor"),
  diag = list(continuous = "densityDiag"),
  progress = FALSE  
)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation

This pair plot combines density plots, scatter plots, correlation values, and boxplots to show relationships among Victim Age, Days to Case Closure (D2Close), and Year, differentiated by Victim Gender (F: Female, M: Male, X: Unknown/Other).

Victim Age: The distribution is fairly uniform across all genders, with a slight density peak in the middle age ranges. Boxplots show that females and males have similar age distributions, while the “X” category (possibly unknown/other) is slightly more spread.
Days to Case Closure (D2Close): The distribution is highly right-skewed. Most cases are closed quickly (within 100 days), but a few take a very long time (outliers). Gender-wise, there are no major differences in closure time across F, M, and X.
Year: A slight positive correlation (0.031) is observed between Year and D2Close, meaning more recent cases may be taking slightly longer. For females, this trend is somewhat stronger (0.054, marked ** for significance in the plot) but still weak. Males and others show negligible correlation.
Correlations: Victim Age vs. D2Close is very weak (0.009), suggesting age is not related to how quickly a case is closed. D2Close vs. Year is close to zero overall, with the gender variation noted above. Victim Age vs. Year is slightly positive (older victims in recent years) but very weak.
Gender Variation: Female victims show the strongest trend over time (age and case closure), though it remains weak. Males and the “X” category show minimal patterns, which may reflect more variability or under-reporting.

Summary of Results

Case Closure Time Model

Crime Domain: Violent Crimes take ~75 days longer to close (p < 0.001). Traffic Fatalities close ~34 days faster (p < 0.001).
Year & Month: Weak/non-significant effects; slight downward trend over years (≈ 2 days faster per year, p ≈ 0.10).
City: No significant effect on closure times.
Model Fit: R² ≈ 0.05, so only ~5% of variance is explained.

Victim Age ANOVA

Cities: No significant differences in mean victim age across cities (p = 0.849).
Crime Domains: No significant differences by domain (p = 0.947).
City × Domain Interaction: Not significant (p = 0.738).

Delhi Crime Trend

A degree-2 polynomial fitted to the five yearly crime counts (R² ≈ 0.94, overall p ≈ 0.064) shows a slight rise to 2021 followed by a decline through 2024.

Gender & Age Trends

Average Victim Age: Minor year-to-year variation; no dramatic shifts.
Gender Proportion: Female victims form the largest share throughout 2020–2024 and males stay around 32–35%, while the X category spikes in 2022 to roughly the male level before falling back.

Distribution & Correlations

Pie Chart: “Other Crime” makes up well over half of all cases, with “Violent Crime” the second most frequent domain.
Histogram: Victim ages are fairly uniformly distributed between roughly 5 and 85 years, with slightly fewer cases in the youngest and oldest brackets.
Pair Plot & Scatter: No strong linear correlations among Victim Age, Days to Closure, and Year; all reported coefficients are close to zero.

Key Takeaways & Insights

Crime Domain Drives Complexity: Violent crimes take substantially longer to close, while traffic fatalities close faster.
Limited Temporal Effects: Year-to-year improvements in closure speed are marginal; monthly seasonality is negligible.
Geography Matters Less: City of occurrence has little bearing on either victim age or closure duration.
Victim Demographics: Age distributions are fairly uniform across cities and crime types. Gender proportions are broadly stable over time apart from the 2022 spike in the X category.
Model Limitations & Next Steps: The closure-time model has low explanatory power (R² ≈ 0.05), which suggests key variables are missing (e.g., case complexity scores, resource levels, socio-economic indicators). The Delhi trend model rests on only five yearly observations. Consider richer feature engineering (e.g., interaction terms, severity scores) to capture hidden patterns.

Conclusion

Across multiple analyses—from regression and ANOVA to polynomial trend modeling and visual explorations—the type of crime emerges as the most consistent driver of case complexity and resolution time. Violent crimes demand substantially longer investigative efforts, while traffic fatalities are resolved more swiftly. In contrast, temporal factors (year, month) and geographic location (city) exert only marginal influence on how long cases take to close or on who the typical victim is. Victim demographics (age, gender) likewise show remarkably stable patterns, with no significant shifts across cities or crime domains between 2020 and 2024.

Future work should focus on incorporating deeper operational and socio-economic data to build more robust, actionable models.

Analyzing Crime Patterns & Trends in Indian Cities (2020–2024)

Divakar Kumar & Om Chauhan

2025-04-28
