The analysis of crime cases data

Exploratory Data Analysis (EDA) and Statistical Inference

Author

Madonsela, Nkosinathi (231717962)

Published

2025

1 Introduction

The purpose of this project is to analyse crime data, understand the factors that influence crime cases, and provide insights into how the impact of crime can be minimised.

2 Data

The dataset includes the following variables.

  • id : Numerical identifier of the crime case.

  • region : Categorical data stating the region in which the crime occurred.

  • complaint_type : Categorical data describing the type of complaint based on priority, e.g. alpha for lower priority and charlie for medium priority.

  • number_of_arrests : Numerical data for the number of suspects arrested for each crime case.

  • incident_impact_score : Numerical data giving the weight of the impact of the crime case.

  • response_time : Numerical data representing the time it took the police task team to respond to the crime incident.

Code
library(tidyverse)
library(knitr)

kable(head(Crime_data))
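| id | region        | complaint_type | number_of_arrests | incident_impact_score | response_time |
|---:|:--------------|:---------------|------------------:|----------------------:|--------------:|
|  1 | KwaZulu-Natal | alpha          |                 4 |                   5.8 |         26.20 |
|  2 | KwaZulu-Natal | charlie        |                 5 |                   7.7 |         24.93 |
|  3 | Eastern Cape  | alpha          |                 7 |                   6.6 |         27.59 |
|  4 | KwaZulu-Natal | charlie        |                 5 |                   6.7 |         24.47 |
|  5 | Western Cape  | alpha          |                 9 |                   6.9 |         11.18 |
|  6 | KwaZulu-Natal | charlie        |                 7 |                   6.0 |         26.12 |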

3 Descriptive Statistics/ EDA

Perform descriptive analysis/ exploratory data analysis. Include at least 3 graphs (of different types)

Code
#| label: descriptive-stats
library(skimr)  # provides skim_with() and sfl()
my_skim <- skim_with(
  numeric = sfl(Average = mean, 
                Standard_Deviation = sd, 
                Minimum = min, 
                Percentile_25th = ~ quantile(., .25, type = 6),
                Median = median,
                Percentile_75th = ~ quantile(., .75, type = 6),
                Maximum = max),
  append = FALSE
)

# Summary statistics
my_skim(Crime_data)
Data summary
Name Crime_data
Number of rows 232
Number of columns 6
_______________________
Column type frequency:
character 2
numeric 4
________________________
Group variables None

Variable type: character

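| skim_variable  | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:---------------|----------:|--------------:|----:|----:|------:|---------:|-----------:|
| region         |         0 |             1 |   7 |  13 |     0 |        6 |          0 |
| complaint_type |         0 |             1 |   5 |   7 |     0 |        3 |          0 |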

Variable type: numeric

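| skim_variable         | n_missing | complete_rate | Average | Standard_Deviation | Minimum | Percentile_25th | Median | Percentile_75th | Maximum |
|:----------------------|----------:|--------------:|--------:|-------------------:|--------:|----------------:|-------:|----------------:|--------:|
| id                    |         0 |             1 |  116.50 |              67.12 |    1.00 |           58.25 | 116.50 |          174.75 |  232.00 |
| number_of_arrests     |         0 |             1 |    6.16 |               2.11 |    0.00 |            5.00 |   6.00 |            7.00 |   12.00 |
| incident_impact_score |         0 |             1 |    6.69 |               1.15 |    2.90 |            6.00 |   6.70 |            7.40 |    9.40 |
| response_time         |         0 |             1 |   23.35 |              13.07 |    6.82 |           13.77 |  21.09 |           27.77 |   80.59 |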

The data set contains 232 crime records, which provides a reasonable sample size for this exploratory investigation. There are no missing values, so the data is complete and ready for analysis without the need for imputation.

The data set includes both numeric variables (e.g., number of arrests, incident impact score, response time) and categorical variables (e.g., region, complaint type). Each of these will be discussed in turn.

3.0.1 Numeric Variables

  • number_of_arrests

This is a discrete variable counting the arrests made in connection with each complaint. Counts range from 0 to 12, with a mean of 6.16 and a standard deviation of 2.11. The middle half of cases fall between 5 and 7 arrests, so most complaints lead to a handful of arrests, while a few result in considerably more, possibly reflecting larger or more serious incidents.

  • incident_impact_score

A continuous variable reflecting the severity of the incident, measured on a numeric scale. Scores range from 2.9 to 9.4, with a mean of 6.69 and a standard deviation of 1.15. Most incidents therefore fall within a similar impact range, although values near the extremes may indicate unusually mild or severe cases.

  • response_time

A continuous variable representing the time taken to respond to an incident (measured in minutes). Response times vary greatly, from 6.82 to 80.59 minutes, with a mean of 23.35 and a median of 21.09. The wide spread (standard deviation 13.07) points to differences in response between regions, which the plots in this section explore.

Skewness provides a numerical summary of the shape of the numerical variables in the sample data. The skewness of the numeric variables in the crime data is summarised below.

3.1 Skewness Analysis

  • Number of Arrests

The skewness value is 0.203, which is very close to zero. This indicates that the distribution of the number of arrests is almost symmetric, with no significant tail bias.

  • Incident Impact Score

The skewness is -0.226, slightly negative. This suggests a mild left-skew, meaning that extreme lower values are slightly more frequent than extreme higher values, but overall the distribution is fairly balanced.

  • Response Time

The skewness is 1.933, which is strongly positive. This indicates a right-skewed distribution, meaning that there are some incidents with very long response times, pulling the tail to the right. This could highlight outliers or exceptional delays in the dataset.
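The report does not show how the skewness values were computed. As a minimal sketch, assuming a moment-based estimator (the ratio of the third central moment to the 1.5 power of the second), the calculation could look like the following. The helper `skewness_g1` and the toy vector are illustrative and not part of the original analysis; packages that apply small-sample corrections may give slightly different values from those quoted above.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# `skewness_g1` is a hypothetical helper name used for illustration.
skewness_g1 <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Toy vector (made up) with a long right tail, so the skewness is positive
toy <- c(1, 2, 2, 3, 3, 3, 4, 20)
toy_skew <- skewness_g1(toy)

# Applied to the numeric columns of the crime data, if it is loaded
if (exists("Crime_data")) {
  sapply(
    Crime_data[c("number_of_arrests", "incident_impact_score", "response_time")],
    skewness_g1
  )
}
```

A positive result indicates a longer right tail, a negative result a longer left tail, and a value near zero an approximately symmetric distribution.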

Code
library(dplyr)
library(ggplot2)

# Calculate average response time by region
avg_response <- Crime_data %>%
  group_by(region) %>%
  summarise(avg_response_time = mean(response_time, na.rm = TRUE))

# Plot
ggplot(avg_response, aes(x = reorder(region, avg_response_time), y = avg_response_time)) +
  geom_col(fill = "steelblue") +  # geom_col() plots the pre-computed means directly
  coord_flip() +  # flip for better readability
  labs(
    title = "Average Response Time by Region",
    x = "Region",
    y = "Average Response Time (minutes)"
  ) +
  theme_minimal()

Regional differences

This plot shows the average response time for each province. Read together with the regional arrest averages, it suggests the following:

  • Northern Cape has the longest response time (~64 min) but not particularly high arrests or impact.

  • Gauteng has the shortest response time (~12 min) and relatively high arrests (~6.2 on average).

  • Western Cape shows the highest arrests (~6.7) with moderate response times (~21 min).
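The regional comparison above quotes average arrests as well as average response times, while the plotting code only computes the latter. A possible way to obtain both in one table is sketched below. The helper `summarise_by_region` and the toy tibble are hypothetical; only the column names are taken from the dataset description.

```r
library(dplyr)

# Hypothetical helper: per-region case counts and means of
# response time and arrests, ordered from fastest to slowest response.
summarise_by_region <- function(data) {
  data %>%
    group_by(region) %>%
    summarise(
      n_cases = n(),
      avg_response_time = mean(response_time, na.rm = TRUE),
      avg_arrests = mean(number_of_arrests, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    arrange(avg_response_time)
}

# Toy data (made up) standing in for the crime data
toy <- tibble(
  region = c("A", "A", "B", "B"),
  response_time = c(10, 20, 40, 60),
  number_of_arrests = c(6, 8, 3, 5)
)
toy_summary <- summarise_by_region(toy)

# With the real data loaded, the same call would be:
# summarise_by_region(Crime_data)
```

Including `n_cases` makes it easier to judge how much weight each regional average deserves, since a mean based on a handful of cases is less stable.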

Code
library(ggplot2)
library(dplyr)

# Box plot of the number of arrests for each complaint type
ggplot(Crime_data, aes(x = complaint_type, y = number_of_arrests)) +
  geom_boxplot(fill = "tomato", outlier.color = "red", outlier.shape = 16) +
  coord_flip() +  # makes labels easier to read
  labs(
    title = "Distribution of Arrests by Complaint Type",
    x = "Complaint Type",
    y = "Number of Arrests"
  ) +
  theme_minimal()

The box plot shows how the number of arrests is distributed within each complaint type. Some complaint types have higher typical arrest counts than others, and the outliers mark cases where an unusually high number of arrests occurred. For instance, bravo complaints have a higher median, suggesting that they are more likely to result in enforcement action.

Code
library(ggplot2)

ggplot(Crime_data, aes(x = response_time, y = number_of_arrests, color = region)) +
  geom_point(alpha = 0.7, size = 3) +   # scatter points
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") + # optional trend line
  labs(
    title = "Response Time vs. Number of Arrests by Region",
    x = "Response Time",
    y = "Number of Arrests",
    color = "Region"
  ) +
  theme_minimal()

The scatter plot shows that the highest arrest counts occur in KwaZulu-Natal and Gauteng, and that these points are concentrated at shorter response times. This suggests that these regions may be responding faster to complaints and also achieving more arrests than other provinces, although the plot alone does not establish a causal link.

4 Statistical Inference

4.1 Estimation

I use the method of maximum likelihood to fit a Poisson distribution to number_of_arrests. For a Poisson distribution, the maximum likelihood estimate of the rate parameter is the sample mean.

Code
lambda_hat <- round(mean(Crime_data$number_of_arrests), 3)

The estimated rate is 6.159, so the fitted distribution is Poisson(6.159).
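For reference, the reason the sample mean serves as the estimate can be sketched as follows, assuming the arrest counts $x_1, \dots, x_n$ are independent draws from a Poisson($\lambda$) distribution (an assumption of the model, not something the data guarantees):

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \\
\ell(\lambda) = \log L(\lambda) &= -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big)\log\lambda - \sum_{i=1}^{n}\log(x_i!) \\
\frac{d\ell}{d\lambda} &= -n + \frac{1}{\lambda}\sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \\
\frac{d^2\ell}{d\lambda^2} &= -\frac{1}{\lambda^2}\sum_{i=1}^{n} x_i < 0
\end{align*}
```

The negative second derivative (whenever at least one count is positive) confirms that the stationary point is a maximum, which here gives $\hat{\lambda} = \bar{x} = 6.159$ with $n = 232$.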

4.1.1 Goodness-of-fit Test

Provide a plot to compare the fit of the observed and fitted data. Perform goodness-of-fit test.
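One way this check could be approached is a chi-square goodness-of-fit test that compares observed counts with the counts expected under the fitted Poisson distribution. The sketch below is illustrative only: `poisson_gof` is a hypothetical helper, the pooling rule (tail cells with an expected count of at least 5) is a common rule of thumb rather than a requirement, and the data are simulated, so no result for the crime data is implied.

```r
# Chi-square goodness-of-fit sketch for a Poisson fit.
# `poisson_gof` is a hypothetical helper, not part of the original report.
poisson_gof <- function(x, min_expected = 5) {
  n <- length(x)
  lambda_hat <- mean(x)  # Poisson maximum likelihood estimate
  ks <- 0:max(x)

  # Pool the tails: lowest cell is "<= lo", highest cell is ">= hi",
  # each chosen so that its expected count reaches min_expected.
  lo <- min(ks[n * ppois(ks, lambda_hat) >= min_expected])
  hi <- max(ks[n * ppois(ks - 1, lambda_hat, lower.tail = FALSE) >= min_expected])
  stopifnot(hi - lo >= 2)  # need at least 3 cells for 1 degree of freedom

  cells <- pmin(pmax(x, lo), hi)
  observed <- as.vector(table(factor(cells, levels = lo:hi)))

  probs <- dpois(lo:hi, lambda_hat)
  probs[1] <- ppois(lo, lambda_hat)
  probs[length(probs)] <- ppois(hi - 1, lambda_hat, lower.tail = FALSE)
  expected <- n * probs

  statistic <- sum((observed - expected)^2 / expected)
  df <- length(observed) - 1 - 1  # one extra df lost for estimating lambda
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

  list(
    table = data.frame(cell = lo:hi, observed = observed, expected = expected),
    statistic = statistic, df = df, p_value = p_value
  )
}

# Simulated stand-in for number_of_arrests (not the crime data)
set.seed(1)
sim_arrests <- rpois(232, lambda = 6)
gof <- poisson_gof(sim_arrests)

# With the real data loaded, the same call would be:
# poisson_gof(Crime_data$number_of_arrests)
```

The `table` element holds the observed and expected counts side by side, so it can also feed a bar chart comparing the two. A small p-value would indicate that the Poisson model does not describe the counts well.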

4.2 Interval Estimation

Code
#| label: confidence-interval
library(confintr)

# 95% t-based confidence interval for the mean number of arrests
xbar <- mean(Crime_data$number_of_arrests)   # sample mean
su <- sd(Crime_data$number_of_arrests)       # sample standard deviation
n <- nrow(Crime_data)
alpha <- 0.05
t_crit <- qt(p = alpha / 2, df = n - 1, lower.tail = FALSE)
moe <- t_crit * su / sqrt(n)                 # margin of error
ll <- round(xbar - moe, 3)
ul <- round(xbar + moe, 3)
Crime_data %>% glimpse()
Rows: 232
Columns: 6
$ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ region                <chr> "KwaZulu-Natal", "KwaZulu-Natal", "Eastern Cape"…
$ complaint_type        <chr> "alpha", "charlie", "alpha", "charlie", "alpha",…
$ number_of_arrests     <dbl> 4, 5, 7, 5, 9, 7, 6, 4, 4, 3, 6, 9, 6, 1, 10, 7,…
$ incident_impact_score <dbl> 5.8, 7.7, 6.6, 6.7, 6.9, 6.0, 6.5, 9.1, 7.4, 4.7…
$ response_time         <dbl> 26.20, 24.93, 27.59, 24.47, 11.18, 26.12, 33.00,…

The 95% confidence interval for the mean number of arrests is [5.886; 6.433], which means that we are 95% confident that the population mean number of arrests lies within this interval.
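As a sanity check on a hand-computed t interval of this kind, the result can be compared with the interval returned by base R's `t.test()`. The sketch below uses simulated counts rather than the crime data, and `t_interval` is a hypothetical helper name.

```r
# Hand-computed t interval for a mean, cross-checked against t.test().
# `t_interval` is a hypothetical helper used for illustration.
t_interval <- function(x, conf_level = 0.95) {
  n <- length(x)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  moe <- t_crit * sd(x) / sqrt(n)  # margin of error
  c(lower = mean(x) - moe, upper = mean(x) + moe)
}

# Simulated stand-in for number_of_arrests (not the crime data)
set.seed(42)
toy_arrests <- rpois(232, lambda = 6)

manual <- t_interval(toy_arrests)
builtin <- t.test(toy_arrests, conf.level = 0.95)$conf.int

# With the real data loaded, the same check would be:
# t_interval(Crime_data$number_of_arrests)
```

Both approaches use the same formula, so `manual` and `builtin` should agree up to floating-point error; a mismatch would point to a slip in the manual calculation.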

4.3 Hypothesis testing

Perform a hypothesis test of your choosing.

5 Summary of Insights

Summarise key findings from your EDA and statistical analysis.

Note: no new information need be presented here, it is a summary of insights already noted in previous sections.

6 Conclusion

Conclude report.

6.1 Recommendations

What would you recommend based on your findings?