Introduction

This report presents a descriptive quantitative analysis of alleged offender incidents. The dataset records, for each year, the offence type (for example, homicide, blackmail and extortion) and the outcome of the incident: “Arrest,” “Not authorised,” “Other,” or “Summons.” It comes from the data tables for the year ending June 2024. The analysis compares the number of incidents across these outcomes, formulates a hypothesis about how incidents are distributed between outcomes, and uses a t-test and regression analysis to assess the association between outcome and the number of incidents.

The analysis consists of the following steps: exploratory data analysis, hypothesis testing, regression analysis, and visualizations to summarize key findings.

Problem Statement

Data Loading

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

# Load the Excel dataset (the absolute path is machine-specific; adjust it to the local file location)
data <- read_excel("C:/Users/naidu/Downloads/Data_Tables_Alleged_Offender_Incidents_Visualisation_Year_Ending_June_2024.xlsx")

# Display the first few rows of the dataset
head(data)
## # A tibble: 6 × 6
##    Year `Year ending` `Offence Division`          `Offence Subdivision`  Outcome
##   <dbl> <chr>         <chr>                       <chr>                  <chr>  
## 1  2024 June          A Crimes against the person A10 Homicide and rela… Arrest 
## 2  2024 June          A Crimes against the person A10 Homicide and rela… Not au…
## 3  2024 June          A Crimes against the person A10 Homicide and rela… Other  
## 4  2024 June          A Crimes against the person A10 Homicide and rela… Summons
## 5  2024 June          A Crimes against the person A20 Assault and relat… Arrest 
## 6  2024 June          A Crimes against the person A20 Assault and relat… Not au…
## # ℹ 1 more variable: `Alleged Offender Incidents` <dbl>

Data Exploration and Cleaning

# Check the structure of the dataset
str(data)
## tibble [1,014 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Year                      : num [1:1014] 2024 2024 2024 2024 2024 ...
##  $ Year ending               : chr [1:1014] "June" "June" "June" "June" ...
##  $ Offence Division          : chr [1:1014] "A Crimes against the person" "A Crimes against the person" "A Crimes against the person" "A Crimes against the person" ...
##  $ Offence Subdivision       : chr [1:1014] "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" ...
##  $ Outcome                   : chr [1:1014] "Arrest" "Not authorised" "Other" "Summons" ...
##  $ Alleged Offender Incidents: num [1:1014] 172 56 55 33 7990 ...
# Check for missing values
sum(is.na(data))
## [1] 0
# View summary statistics
summary(data)
##       Year      Year ending        Offence Division   Offence Subdivision
##  Min.   :2015   Length:1014        Length:1014        Length:1014        
##  1st Qu.:2017   Class :character   Class :character   Class :character   
##  Median :2020   Mode  :character   Mode  :character   Mode  :character   
##  Mean   :2020                                                            
##  3rd Qu.:2022                                                            
##  Max.   :2024                                                            
##    Outcome          Alleged Offender Incidents
##  Length:1014        Min.   :    2.0           
##  Class :character   1st Qu.:   52.0           
##  Mode  :character   Median :  339.5           
##                     Mean   : 1593.5           
##                     3rd Qu.: 1712.0           
##                     Max.   :29483.0
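The single total from sum(is.na(data)) confirms that nothing is missing, but it would not show where a gap sits if one appeared in a later release. A minimal sketch of a per-column check follows. The data frame `toy` and its values are hypothetical stand-ins, not rows from the real dataset.

```r
# Hypothetical toy data frame standing in for the real dataset
toy <- data.frame(
  Outcome = c("Arrest", "Other", NA, "Summons"),
  Incidents = c(172, NA, 55, 33)
)

# Count missing values per column instead of one grand total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Count rows that are exact duplicates of an earlier row
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)
```

Applied to the real tibble, the same two calls would be colSums(is.na(data)) and sum(duplicated(data)).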

Descriptive Statistics and Visualisation

# Group the data by Outcome and summarise the incident counts
outcome_summary <- data %>%
  group_by(Outcome) %>%
  summarise(
    Total_Incidents = sum(`Alleged Offender Incidents`),  # Backticks are required because the column name contains spaces
    Mean_Incidents = mean(`Alleged Offender Incidents`),
    Median_Incidents = median(`Alleged Offender Incidents`),
    SD_Incidents = sd(`Alleged Offender Incidents`)
  )

# Display summary
print(outcome_summary)
## # A tibble: 4 × 5
##   Outcome        Total_Incidents Mean_Incidents Median_Incidents SD_Incidents
##   <chr>                    <dbl>          <dbl>            <dbl>        <dbl>
## 1 Arrest                  631380          2515.              852        3771.
## 2 Not authorised          273923          1058.              315        2029.
## 3 Other                   275137          1142.              129        2762.
## 4 Summons                 435329          1655.              457        2745.
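The totals are easier to compare as shares of all incidents. The sketch below takes the four Total_Incidents values printed above and converts them to percentages. The vector names `totals` and `shares` are introduced here for illustration only.

```r
# Totals copied from the outcome_summary output above
totals <- c(
  "Arrest" = 631380,
  "Not authorised" = 273923,
  "Other" = 275137,
  "Summons" = 435329
)

# Share of all alleged offender incidents per outcome, in percent
shares <- round(100 * totals / sum(totals), 1)
print(shares)
```

In every group the mean is several times the median, which points to a strongly right-skewed distribution of incident counts.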

Descriptive Statistics

# Bar plot showing total incidents by outcome
ggplot(outcome_summary, aes(x = Outcome, y = Total_Incidents, fill = Outcome)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Alleged Offender Incidents by Outcome",
    x = "Outcome",
    y = "Total Incidents"
  ) +
  theme_minimal()

# Box plot showing the spread of incidents across outcomes
ggplot(data, aes(x = Outcome, y = `Alleged Offender Incidents`, fill = Outcome)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Alleged Offender Incidents by Outcome",
    x = "Outcome",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

# Histogram of alleged offender incidents
ggplot(data, aes(x = `Alleged Offender Incidents`, fill = Outcome)) +
  geom_histogram(binwidth = 500, position = "dodge") +
  labs(
    title = "Histogram of Alleged Offender Incidents",
    x = "Alleged Offender Incidents",
    y = "Count"
  ) +
  theme_minimal()

# Scatter plot of incidents by year
ggplot(data, aes(x = Year, y = `Alleged Offender Incidents`, color = Outcome)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Alleged Offender Incidents by Year",
    x = "Year",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

Visualizations

Hypothesis Testing

# Split the data into two groups: "Arrest" versus all non-arrest outcomes (Not authorised, Other, Summons)
arrest_incidents <- data$`Alleged Offender Incidents`[data$Outcome == "Arrest"]
other_incidents <- data$`Alleged Offender Incidents`[data$Outcome != "Arrest"]

# Perform a two-sample t-test (t.test() defaults to Welch's unequal-variance version)
t_test_result <- t.test(arrest_incidents, other_incidents, alternative = "two.sided")

# Display t-test result
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  arrest_incidents and other_incidents
## t = 4.8012, df = 327.86, p-value = 2.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   723.2551 1727.3493
## sample estimates:
## mean of x mean of y 
##  2515.458  1290.156
# Display confidence interval
print(t_test_result$conf.int)
## [1]  723.2551 1727.3493
## attr(,"conf.level")
## [1] 0.95
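The test above compares the mean number of incidents for Arrest against the mean for all other outcomes combined. The null hypothesis is that the two means are equal, and the alternative is that they differ. As a sketch of what the Welch test computes, the example below rebuilds the t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with t.test(). The samples `x` and `y` are hypothetical illustrative values, not taken from the dataset.

```r
# Hypothetical toy samples; the values are illustrative only
x <- c(172, 7990, 852, 1300, 2600)
y <- c(56, 55, 33, 315, 129, 457)

# Per-sample variance of the mean (variance divided by sample size)
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over the unpooled standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() uses var.equal = FALSE by default, which is the Welch test
res <- t.test(x, y, alternative = "two.sided")

print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(res$parameter)))
```

The non-integer degrees of freedom in the report's output (df = 327.86) come from this approximation.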

Regression Analysis

# Convert Outcome to factor for regression
data$Outcome <- factor(data$Outcome)

# Fit a linear regression model
regression_model <- lm(`Alleged Offender Incidents` ~ Outcome, data = data)

# Display the summary of the regression model
summary(regression_model)
## 
## Call:
## lm(formula = `Alleged Offender Incidents` ~ Outcome, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2513.5 -1291.5  -960.1   255.0 28341.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2515.5      182.3  13.801  < 2e-16 ***
## OutcomeNot authorised  -1457.8      255.8  -5.700 1.57e-08 ***
## OutcomeOther           -1373.8      260.4  -5.275 1.62e-07 ***
## OutcomeSummons          -860.2      254.8  -3.376 0.000764 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2888 on 1010 degrees of freedom
## Multiple R-squared:  0.03858,    Adjusted R-squared:  0.03572 
## F-statistic: 13.51 on 3 and 1010 DF,  p-value: 1.202e-08
# Visualize the regression result: with a categorical predictor the fitted
# values are the group means, so mark each outcome's mean over the raw points
ggplot(data, aes(x = Outcome, y = `Alleged Offender Incidents`)) +
  geom_point(alpha = 0.3) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
  labs(
    title = "Regression Analysis: Incidents vs. Outcome",
    x = "Outcome",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()
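Because Outcome is a factor, lm() uses treatment coding by default. The intercept is the mean of the reference level (Arrest, the first level alphabetically), and each remaining coefficient is that outcome's mean minus the Arrest mean. This is consistent with the output above, where the intercept of 2515.5 matches the Arrest mean from the descriptive summary. The sketch below shows the same relationship on a hypothetical toy data frame whose values are illustrative only.

```r
# Hypothetical toy data; the values are illustrative, not from the dataset
toy <- data.frame(
  Outcome = factor(rep(c("Arrest", "Other", "Summons"), each = 3)),
  Incidents = c(10, 20, 30, 1, 2, 3, 4, 6, 8)
)

# Regress incidents on the categorical outcome
fit <- lm(Incidents ~ Outcome, data = toy)

# Group means computed directly
group_means <- tapply(toy$Incidents, toy$Outcome, mean)

# The intercept is the reference-level mean; the other coefficients are
# differences between each level's mean and the reference-level mean
print(coef(fit))
print(group_means - group_means[["Arrest"]])
```

The model therefore restates the group means from the descriptive summary and adds standard errors for the differences between them.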

Discussion

References