Analysis of Crime Case Data

Exploratory Data Analysis (EDA) and Statistical Inference

Author

Madonsela, Nkosinathi (231717962)

Published

2025

1 Introduction

The purpose of this project is to analyse crime case data, understand the factors that influence those cases, and provide insights into how the impact of crime can be reduced.

2 Data

The dataset includes the following variables.

  • id : Numerical identifier of the crime case.

  • region : Categorical variable stating the region in which the crime occurred.

  • complaint_type : Categorical variable describing the type of complaint by priority, e.g. alpha for lower priority and charlie for medium priority.

  • number_of_arrests : Numerical variable giving the number of suspects arrested for each crime case.

  • incident_impact_score : Numerical variable giving the weight of the impact of the crime case.

  • response_time : Numerical variable giving the time it took the police task team to respond to the crime incident.

Code
library(tidyverse)
library(knitr)

# Preview the first six rows (Crime_data is assumed to be loaded already)
kable(head(Crime_data))
id  region         complaint_type  number_of_arrests  incident_impact_score  response_time
1   KwaZulu-Natal  alpha           4                  5.8                    26.20
2   KwaZulu-Natal  charlie         5                  7.7                    24.93
3   Eastern Cape   alpha           7                  6.6                    27.59
4   KwaZulu-Natal  charlie         5                  6.7                    24.47
5   Western Cape   alpha           9                  6.9                    11.18
6   KwaZulu-Natal  charlie         7                  6.0                    26.12

3 Descriptive Statistics / EDA

Perform descriptive analysis / exploratory data analysis. Include at least 3 graphs (of different types).

Code
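library(skimr)  # provides skim_with() and sfl()

# Custom summary: replace skimr's default numeric statistics with the ones below
my_skim <- skim_with(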
  numeric = sfl(Average = mean, 
                Standard_Deviation = sd, 
                Minimum = min, 
                Percentile_25th = ~ quantile(., .25, type = 6),
                Median = median,
                Percentile_75th = ~ quantile(., .75, type = 6),
                Maximum = max),
  append = FALSE
)

# Summary statistics
my_skim(Crime_data)
Data summary
Name Crime_data
Number of rows 232
Number of columns 6
_______________________
Column type frequency:
character 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
region 0 1 7 13 0 6 0
complaint_type 0 1 5 7 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate Average Standard_Deviation Minimum Percentile_25th Median Percentile_75th Maximum
id 0 1 116.50 67.12 1.00 58.25 116.50 174.75 232.00
number_of_arrests 0 1 6.16 2.11 0.00 5.00 6.00 7.00 12.00
incident_impact_score 0 1 6.69 1.15 2.90 6.00 6.70 7.40 9.40
response_time 0 1 23.35 13.07 6.82 13.77 21.09 27.77 80.59

The data set contains 232 crime records, which is a reasonable sample size for this exploratory investigation. There are no missing values (the complete rate is 1 for every variable), so the data can be analysed without imputation.

The data set includes both numeric variables (e.g., number of arrests, incident impact score, response time) and categorical variables (e.g., region, complaint type). Each of these will be discussed in turn.

3.0.1 Numeric Variables

  • number_of_arrests

This discrete variable counts the arrests made in connection with each case. Arrests range from 0 to 12, with a mean of 6.16 and a median of 6 per case. The moderate spread (standard deviation 2.11) indicates that the middle 50% of cases lead to between 5 and 7 arrests, while a few result in considerably more, possibly reflecting larger or more serious incidents.

  • incident_impact_score

This continuous variable measures the severity of the incident on a numeric scale. The mean score is 6.69 (median 6.70) with a standard deviation of 1.15, so variability across cases is relatively low. The middle 50% of incidents score between 6.0 and 7.4, although the extremes (minimum 2.9, maximum 9.4) may indicate unusually mild or highly severe cases.

  • response_time

This continuous variable records the time taken to respond to an incident, measured in minutes. Response times vary widely, from 6.82 to 80.59 minutes, with a mean of 23.35 and a median of 21.09. The wide spread (standard deviation 13.07) may reflect differences in policing efficiency between regions; for example, the Western Cape case in the preview above was reached in about 11 minutes. The summaries shown here do not yet establish regional disparities, or any association between faster responses and higher arrest counts, so both remain to be checked.

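Skewness is a numerical summary of the asymmetry of a distribution: values near zero indicate an approximately symmetric shape, positive values a longer right tail, and negative values a longer left tail. The skewness of the numeric variables in the crime data is summarised below.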

Code
library(dplyr)

# Sample skewness of each numeric variable (the e1071 package must be installed)
knitr::kable(
  Crime_data %>%
    summarise(
      number_of_arrests_skewness = e1071::skewness(number_of_arrests),
      incident_impact_score_skewness = e1071::skewness(incident_impact_score),
      response_time_skewness = e1071::skewness(response_time)
    )
)
number_of_arrests_skewness incident_impact_score_skewness response_time_skewness
0.2033865 -0.2263861 1.933137
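
Section 3 calls for at least three graphs of different types. The following is a minimal sketch of how they could be produced with ggplot2, assuming Crime_data has the column names shown in the preview table. The helper name eda_plots and the choice of plots (histogram, box plot, bar chart) are illustrative assumptions, not part of the original report.

```r
library(ggplot2)

# Hypothetical helper (not from the original report): builds three plots of
# different types from a data frame that uses the Crime_data column names.
eda_plots <- function(df) {
  list(
    # Histogram: shape of the response time distribution
    response_hist = ggplot(df, aes(x = response_time)) +
      geom_histogram(bins = 30) +
      labs(x = "Response time (minutes)", y = "Number of cases"),
    # Box plot: response time compared across regions
    response_by_region = ggplot(df, aes(x = region, y = response_time)) +
      geom_boxplot() +
      labs(x = "Region", y = "Response time (minutes)"),
    # Bar chart: number of cases per complaint type
    complaint_counts = ggplot(df, aes(x = complaint_type)) +
      geom_bar() +
      labs(x = "Complaint type", y = "Number of cases")
  )
}

# Draw the plots only if the data frame has been loaded in this session
if (exists("Crime_data")) {
  for (p in eda_plots(Crime_data)) print(p)
}
```

The box plot by region is one way to check whether the wide spread in response times is tied to particular regions.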

4 Statistical Inference

4.1 Estimation

Use an estimation method to fit a distribution to a discrete numerical variable.
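
As one possible approach, number_of_arrests (the only discrete count in the data) could be modelled with a Poisson distribution fitted by maximum likelihood, where the estimate of the rate is simply the sample mean. This is a sketch under that assumption; the helper name fit_poisson is hypothetical. The summary table gives a variance of about 4.45 (2.11 squared) against a mean of 6.16, so the Poisson model may be too dispersed for these data and the fit should be checked rather than assumed.

```r
# Hypothetical helper: maximum-likelihood Poisson fit for a vector of counts.
fit_poisson <- function(x) {
  stopifnot(all(x >= 0), all(x == round(x)))
  lambda_hat <- mean(x)          # the Poisson MLE of the rate is the sample mean
  values <- 0:max(x)
  comparison <- data.frame(
    arrests = values,
    # Share of cases observed at each count
    observed_prop = as.vector(table(factor(x, levels = values))) / length(x),
    # Share predicted by the fitted Poisson model
    fitted_prop = dpois(values, lambda_hat)
  )
  list(
    lambda_hat = lambda_hat,
    comparison = comparison,
    dispersion = var(x) / mean(x)  # close to 1 if a Poisson model is plausible
  )
}

# Apply to the report's data only if it has been loaded in this session
if (exists("Crime_data")) {
  fit <- fit_poisson(Crime_data$number_of_arrests)
  print(fit$lambda_hat)
  print(fit$comparison)
}
```

If the dispersion ratio is well below 1, a binomial model may be worth considering as an alternative.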

4.1.1 Goodness-of-fit Test

Provide a plot to compare the fit of the observed and fitted data. Perform goodness-of-fit test.
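
A minimal sketch of a chi-square goodness-of-fit test for a Poisson model of number_of_arrests, written in base R. The helper name poisson_gof, the pooling rule (merge tail cells until each expected count is at least 5), and the Poisson model itself are assumptions made for illustration. One degree of freedom is subtracted for the estimated rate; because the rate is estimated from the ungrouped data, the chi-square reference distribution is only approximate.

```r
# Hypothetical helper: chi-square goodness-of-fit test for a Poisson model.
poisson_gof <- function(x, min_expected = 5) {
  n <- length(x)
  k_max <- max(x)
  stopifnot(k_max >= 2)
  lambda_hat <- mean(x)  # Poisson MLE of the rate

  # Cells 0, 1, ..., k_max - 1 plus an open ">= k_max" cell, so that the
  # fitted probabilities sum to one
  probs <- c(dpois(0:(k_max - 1), lambda_hat),
             ppois(k_max - 1, lambda_hat, lower.tail = FALSE))
  observed <- as.vector(table(factor(x, levels = 0:k_max)))
  expected <- n * probs

  # Pool sparse cells in the left tail into their right-hand neighbour
  while (length(expected) > 3 && expected[1] < min_expected) {
    expected <- c(expected[1] + expected[2], expected[-(1:2)])
    observed <- c(observed[1] + observed[2], observed[-(1:2)])
  }
  # Pool sparse cells in the right tail into their left-hand neighbour
  m <- length(expected)
  while (m > 3 && expected[m] < min_expected) {
    expected <- c(expected[seq_len(m - 2)], expected[m - 1] + expected[m])
    observed <- c(observed[seq_len(m - 2)], observed[m - 1] + observed[m])
    m <- length(expected)
  }

  statistic <- sum((observed - expected)^2 / expected)
  df <- length(expected) - 1 - 1  # cells - 1 - one estimated parameter
  list(observed = observed, expected = expected, statistic = statistic,
       df = df, p_value = pchisq(statistic, df, lower.tail = FALSE))
}

# Compare observed and fitted counts only if the data has been loaded
if (exists("Crime_data")) {
  gof <- poisson_gof(Crime_data$number_of_arrests)
  barplot(rbind(gof$observed, gof$expected), beside = TRUE,
          legend.text = c("Observed", "Fitted"),
          xlab = "Pooled arrest-count cell", ylab = "Number of cases")
  print(gof[c("statistic", "df", "p_value")])
}
```

A small p-value would indicate that the observed arrest counts are not well described by the fitted Poisson distribution.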

4.2 Interval Estimation

Provide a confidence interval.
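
One option is a t-based confidence interval for the mean incident_impact_score, which is roughly symmetric according to its skewness (about -0.23). The sketch below is an assumption about which interval to report, and the helper names are hypothetical. It computes the interval either from raw data or from the rounded values in the summary table (mean 6.69, standard deviation 1.15, n = 232).

```r
# Hypothetical helper: t interval for a mean from summary statistics.
mean_ci_from_summary <- function(xbar, s, n, level = 0.95) {
  se <- s / sqrt(n)                                # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)    # two-sided critical value
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Hypothetical helper: the same interval computed from a raw numeric vector.
mean_ci <- function(x, level = 0.95) {
  mean_ci_from_summary(mean(x), sd(x), length(x), level)
}

# Interval from the rounded values reported in the summary table
print(mean_ci_from_summary(xbar = 6.69, s = 1.15, n = 232))

# Interval from the raw data, only if it has been loaded in this session
if (exists("Crime_data")) {
  print(mean_ci(Crime_data$incident_impact_score))
}
```

With the rounded summary values, the 95% interval is roughly 6.54 to 6.84; the interval from the raw data may differ slightly because of rounding.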

4.3 Hypothesis testing

Perform a hypothesis test of your choosing.
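
One candidate test is whether response times differ between regions. Because response_time is strongly right-skewed (skewness about 1.93), the sketch below uses the rank-based Kruskal-Wallis test rather than a one-way ANOVA. The choice of test, the hypotheses, and the helper name are assumptions for illustration. The null hypothesis is that response times follow the same distribution in every region; the alternative is that at least one region differs.

```r
# Hypothetical helper: Kruskal-Wallis test of response time across regions.
# Expects a data frame with the Crime_data columns response_time and region.
response_by_region_test <- function(df) {
  kruskal.test(response_time ~ region, data = df)
}

# Run on the report's data only if it has been loaded in this session
if (exists("Crime_data")) {
  print(response_by_region_test(Crime_data))
}
```

A p-value below the chosen significance level (for example 0.05) would be evidence that response times are not the same across all regions; it would not by itself say which regions differ.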

5 Summary of Insights

Summarise key findings from your EDA and statistical analysis.

Note: no new information need be presented here, it is a summary of insights already noted in previous sections.

6 Conclusion

Conclude report.

6.1 Recommendations

What would you recommend based on your findings?