The analysis of crime cases data

1 Introduction

The purpose of this project is to analyse crime data, understand the factors influencing crime cases, and provide insights on how the impact of crime can be minimised.

2 Data

The dataset includes the following variables.

id : Numerical identifier of the crime case.
region : Categorical variable stating the region in which the crime occurred.
complaint_type : Categorical variable describing the type of complaint based on priority, e.g. alpha for lower priority and charlie for medium priority.
number_of_arrests : Numerical variable giving the number of suspects arrested for each crime case.
incident_impact_score : Numerical variable giving the weight of the impact of the crime case.
response_time : Numerical variable representing the time it took the police task team to respond to the crime incident.

Code

library(tidyverse)
library(knitr)

kable(head(Crime_data))

id	region	complaint_type	number_of_arrests	incident_impact_score	response_time
1	KwaZulu-Natal	alpha	4	5.8	26.20
2	KwaZulu-Natal	charlie	5	7.7	24.93
3	Eastern Cape	alpha	7	6.6	27.59
4	KwaZulu-Natal	charlie	5	6.7	24.47
5	Western Cape	alpha	9	6.9	11.18
6	KwaZulu-Natal	charlie	7	6.0	26.12

3 Descriptive Statistics/ EDA

Perform descriptive analysis/ exploratory data analysis. Include at least 3 graphs (of different types)

Code

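#| label: descriptive-stats
library(skimr)  # provides skim_with() and sfl()

# Custom skim function reporting a fixed set of summary statistics for numeric columns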
my_skim <- skim_with(
  numeric = sfl(Average = mean, 
                Standard_Deviation = sd, 
                Minimum = min, 
                Percentile_25th = ~ quantile(., .25, type = 6),
                Median = median,
                Percentile_75th = ~ quantile(., .75, type = 6),
                Maximum = max),
  append = FALSE
)

# Summary statistics
my_skim(Crime_data)

Data summary
Name	Crime_data
Number of rows	232
Number of columns	6
_______________________
Column type frequency:
character	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
region	0	1	7	13	0	6	0
complaint_type	0	1	5	7	0	3	0

Variable type: numeric

skim_variable	complete_rate	Average	Standard_Deviation	Minimum	Percentile_25th	Median	Percentile_75th	Maximum
id	1	116.50	67.12	1.00	58.25	116.50	174.75	232.00
number_of_arrests	1	6.16	2.11	0.00	5.00	6.00	7.00	12.00
incident_impact_score	1	6.69	1.15	2.90	6.00	6.70	7.40	9.40
response_time	1	23.35	13.07	6.82	13.77	21.09	27.77	80.59

The data set contains 232 crime records, which provides a reasonable sample size for this exploratory investigation. There are no missing values, so the data are complete and ready for analysis without the need for imputation or cleaning.

The data set includes both numeric variables (e.g., number of arrests, incident impact score, response time) and categorical variables (e.g., region, complaint type). Each of these will be discussed in turn.

3.0.1 Numeric Variables

number_of_arrests

This is a discrete variable recording how many arrests were made in connection with each complaint. Arrests range from 0 to 12, with a mean of 6.16 and a median of 6 per complaint. The moderate spread (standard deviation 2.11) indicates that while most complaints lead to a handful of arrests, some result in considerably more, possibly reflecting larger or more serious incidents.

incident_impact_score

A continuous variable reflecting the severity of the incident, measured on a numeric scale. The mean score is 6.69 (median 6.70), with relatively low variability across complaints (standard deviation 1.15). This suggests that most incidents fall within a similar impact range, although the extremes (2.9 and 9.4) may indicate unusually mild or severe cases.

response_time

A continuous variable representing the time taken to respond to an incident (measured in minutes). Individual response times range from 6.82 to 80.59 minutes, with a mean of 23.35 and a median of 21.09; regional averages also differ widely, reaching over 60 minutes in the Northern Cape. The wide spread points to disparities in policing efficiency between regions, and faster responses appear to be associated with higher arrest counts.

Skewness provides a numerical summary of the shape of each numeric variable's distribution in the sample. The skewness of the numeric variables in the crime data is summarised below.

3.1 Skewness Analysis

Number of Arrests

The skewness value is 0.203, which is very close to zero. This indicates that the distribution of the number of arrests is almost symmetric, with no significant tail bias.

Incident Impact Score

The skewness is -0.226, slightly negative. This suggests a mild left skew: the lower tail is slightly longer than the upper tail, but overall the distribution is fairly balanced.

Response Time

The skewness is 1.933, which is strongly positive. This indicates a right-skewed distribution, meaning that there are some incidents with very long response times, pulling the tail to the right. This could highlight outliers or exceptional delays in the dataset.
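The code that produced these skewness values is not shown. The following is a minimal sketch of one way to compute them, assuming the moment-based coefficient g1 = m3 / m2^(3/2); the estimator actually used in this report is not stated, so results may differ slightly. The helper `sample_skewness` is a hypothetical name, and the data vector is only the first 16 `number_of_arrests` values visible in the `glimpse()` output later in the report, used for illustration.

```r
# Hypothetical helper: moment-based sample skewness g1 = m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  m2 <- mean(centred^2)  # second central moment
  m3 <- mean(centred^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Illustrative data only: first 16 arrest counts shown in the glimpse() output
arrests_sample <- c(4, 5, 7, 5, 9, 7, 6, 4, 4, 3, 6, 9, 6, 1, 10, 7)
sample_skewness(arrests_sample)

# Sanity checks: a symmetric sample gives 0, a long right tail gives a positive value
sample_skewness(c(1, 2, 3, 4, 5))
sample_skewness(c(1, 1, 1, 2, 10))
```

On the full data the same helper could be applied column by column, for example with `sapply(Crime_data[c("number_of_arrests", "incident_impact_score", "response_time")], sample_skewness)`.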

Code

library(dplyr)
library(ggplot2)

# Calculate average response time by region
avg_response <- Crime_data %>%
  group_by(region) %>%
  summarise(avg_response_time = mean(response_time, na.rm = TRUE))

# Horizontal bar chart, regions ordered by average response time
ggplot(avg_response, aes(x = reorder(region, avg_response_time), y = avg_response_time)) +
  geom_col(fill = "steelblue") +  # geom_col() is equivalent to geom_bar(stat = "identity")
  coord_flip() +  # flip for better readability
  labs(
    title = "Average Response Time by Region",
    x = "Region",
    y = "Average Response Time (minutes)"
  ) +
  theme_minimal()

Regional differences

This plot shows the average response time for each province:

Northern Cape has the longest response time (~64 min) but not particularly high arrests or impact.
Gauteng has the shortest response time (~12 min) and relatively high arrests (~6.2 on average).
Western Cape shows the highest arrests (~6.7) with moderate response times (~21 min).
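The regional averages for arrests quoted above are not produced by the plotting code, which only summarises response time. A minimal sketch of a combined regional summary follows. It uses only the six rows shown by `head(Crime_data)` in the Data section, so the numbers it prints are illustrative and will not match the figures quoted above; `crime_head` and `regional_summary` are hypothetical names.

```r
library(dplyr)

# First six rows of Crime_data as displayed in the Data section (illustration only)
crime_head <- tibble::tribble(
  ~id, ~region,          ~complaint_type, ~number_of_arrests, ~incident_impact_score, ~response_time,
  1,   "KwaZulu-Natal",  "alpha",         4,                  5.8,                    26.20,
  2,   "KwaZulu-Natal",  "charlie",       5,                  7.7,                    24.93,
  3,   "Eastern Cape",   "alpha",         7,                  6.6,                    27.59,
  4,   "KwaZulu-Natal",  "charlie",       5,                  6.7,                    24.47,
  5,   "Western Cape",   "alpha",         9,                  6.9,                    11.18,
  6,   "KwaZulu-Natal",  "charlie",       7,                  6.0,                    26.12
)

# One row per region: case count plus the mean of each numeric measure
regional_summary <- crime_head %>%
  group_by(region) %>%
  summarise(
    n_cases = n(),
    avg_arrests = mean(number_of_arrests),
    avg_impact = mean(incident_impact_score),
    avg_response_time = mean(response_time),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_response_time))

regional_summary
```

Replacing `crime_head` with `Crime_data` would give the summary for all 232 records.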

Code

library(ggplot2)
library(dplyr)

# Box plot of the number of arrests for each complaint type
ggplot(Crime_data, aes(x = complaint_type, y = number_of_arrests)) +
  geom_boxplot(fill = "tomato", outlier.color = "red", outlier.shape = 16) +
  coord_flip() +  # makes labels easier to read
  labs(
    title = "Distribution of Arrests by Complaint Type",
    x = "Complaint Type",
    y = "Number of Arrests"
  ) +
  theme_minimal()

The box plot shows that certain complaint types consistently result in higher numbers of arrests, while others lead to fewer. Outliers indicate cases where unusually high numbers of arrests occurred. For instance, bravo complaints have a higher median, suggesting that they are more likely to result in enforcement action.

Code

library(ggplot2)

ggplot(Crime_data, aes(x = response_time, y = number_of_arrests, color = region)) +
  geom_point(alpha = 0.7, size = 3) +   # scatter points
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") + # optional trend line
  labs(
    title = "Response Time vs. Number of Arrests by Region",
    x = "Response Time",
    y = "Number of Arrests",
    color = "Region"
  ) +
  theme_minimal()

The scatter plot suggests that the highest arrest counts occur in KwaZulu-Natal and Gauteng, and that these are heavily concentrated at shorter response times. This is consistent with these regions both responding faster to complaints and achieving more arrests than other provinces, although the plot alone does not establish a causal link.

4 Statistical Inference

4.1 Estimation

I am going to use the method of maximum likelihood to fit a Poisson distribution to number_of_arrests. For a Poisson model, the maximum likelihood estimate of the rate lambda is the sample mean.

Code

# The maximum likelihood estimate of the Poisson rate is the sample mean
lambda_hat <- round(mean(Crime_data$number_of_arrests), 3)

The estimated lambda is 6.159, so the fitted distribution is Poisson(6.159).
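The closed-form estimate can be checked numerically by maximising the Poisson log-likelihood directly. The sketch below does this with `optimize()`; it is an illustration, not part of the original analysis. The data are only the first 16 `number_of_arrests` values visible in the `glimpse()` output later in the report, so the estimate printed here is not the 6.159 obtained from the full data. `poisson_loglik` is a hypothetical helper name.

```r
# Illustrative data only: first 16 arrest counts shown in the glimpse() output
arrests_sample <- c(4, 5, 7, 5, 9, 7, 6, 4, 4, 3, 6, 9, 6, 1, 10, 7)

# Poisson log-likelihood of the sample as a function of lambda
poisson_loglik <- function(lambda, x) {
  sum(dpois(x, lambda = lambda, log = TRUE))
}

# Maximise the log-likelihood over a plausible range of lambda
fit <- optimize(poisson_loglik, interval = c(0.01, 20),
                x = arrests_sample, maximum = TRUE)

lambda_numeric <- fit$maximum
lambda_closed_form <- mean(arrests_sample)

# The two estimates should agree up to the optimiser's tolerance
c(numeric = lambda_numeric, closed_form = lambda_closed_form)
```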

4.1.1 Goodness-of-fit Test

Provide a plot to compare the fit of the observed and fitted data. Perform goodness-of-fit test.
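One common way to carry out this step is Pearson's chi-square goodness-of-fit test with the Poisson rate estimated from the data. The sketch below is an assumption about how it could be done, not a result from this report: it runs on simulated counts (a stand-in for `Crime_data$number_of_arrests`), and the choice of bins, with both tails pooled, is illustrative.

```r
set.seed(1)

# Simulated stand-in for Crime_data$number_of_arrests (assumption, not the real data)
arrests <- rpois(232, lambda = 6)
lambda_hat <- mean(arrests)  # maximum likelihood estimate of the Poisson rate

# Bins: <=2, 3, 4, ..., 9, >=10 (tails pooled so expected counts are not too small)
bin_labels <- c("<=2", 3:9, ">=10")
observed <- table(cut(arrests, breaks = c(-Inf, 2:9, Inf), labels = bin_labels))

# Expected counts under Poisson(lambda_hat) for the same bins
probs <- c(ppois(2, lambda_hat),
           dpois(3:9, lambda_hat),
           ppois(9, lambda_hat, lower.tail = FALSE))
expected <- length(arrests) * probs

# Pearson chi-square statistic; one extra degree of freedom is lost for estimating lambda
chi_sq <- sum((as.numeric(observed) - expected)^2 / expected)
df <- length(expected) - 1 - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
c(chi_sq = chi_sq, df = df, p_value = p_value)

# Side-by-side bars comparing observed and fitted counts
barplot(rbind(Observed = as.numeric(observed), Expected = expected),
        beside = TRUE, names.arg = bin_labels, legend.text = TRUE,
        xlab = "Number of arrests", ylab = "Frequency")
```

A small p-value would indicate that the Poisson model fits poorly; a large one means the data give no evidence against it. The degrees of freedom are computed by hand because `chisq.test()` does not account for the estimated parameter.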

4.2 Interval Estimation

Code

#| label: confidence-interval
library(confintr)

# 95% t-interval for the mean number of arrests
xbar <- mean(Crime_data$number_of_arrests)
su <- sd(Crime_data$number_of_arrests)  # sample standard deviation
n <- nrow(Crime_data)
alpha <- 0.05
t_crit <- qt(p = alpha / 2, df = n - 1, lower.tail = FALSE)
moe <- t_crit * su / sqrt(n)  # margin of error
ll <- round(xbar - moe, 3)
ul <- round(xbar + moe, 3)
Crime_data %>% glimpse()

Rows: 232
Columns: 6
$ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ region                <chr> "KwaZulu-Natal", "KwaZulu-Natal", "Eastern Cape"…
$ complaint_type        <chr> "alpha", "charlie", "alpha", "charlie", "alpha",…
$ number_of_arrests     <dbl> 4, 5, 7, 5, 9, 7, 6, 4, 4, 3, 6, 9, 6, 1, 10, 7,…
$ incident_impact_score <dbl> 5.8, 7.7, 6.6, 6.7, 6.9, 6.0, 6.5, 9.1, 7.4, 4.7…
$ response_time         <dbl> 26.20, 24.93, 27.59, 24.47, 11.18, 26.12, 33.00,…

The 95% confidence interval for the mean number of arrests is [5.886; 6.433], which means that we are 95% confident that the population mean number of arrests lies within this interval.
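The manual t-interval calculation can be cross-checked against base R's `t.test()`, which reports the same interval. The sketch below illustrates this on the first 16 `number_of_arrests` values visible in the `glimpse()` output above, so its interval is not the one reported for the full data.

```r
# Illustrative data only: first 16 arrest counts shown in the glimpse() output
arrests_sample <- c(4, 5, 7, 5, 9, 7, 6, 4, 4, 3, 6, 9, 6, 1, 10, 7)

# Manual 95% t-interval: xbar +/- t * s / sqrt(n)
n <- length(arrests_sample)
xbar <- mean(arrests_sample)
se <- sd(arrests_sample) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
manual_ci <- c(xbar - t_crit * se, xbar + t_crit * se)

# Same interval from the built-in one-sample t-test
builtin_ci <- as.numeric(t.test(arrests_sample, conf.level = 0.95)$conf.int)

all.equal(manual_ci, builtin_ci)
```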

4.3 Hypothesis testing

Perform a hypothesis test of your choosing.

5 Summary of Insights

Summarise key findings from your EDA and statistical analysis.

Note: no new information need be presented here, it is a summary of insights already noted in previous sections.

6 Conclusion

Conclude report.

6.1 Recommendations

What would you recommend based on your findings?