Introduction

This report presents a descriptive quantitative analysis of alleged offender incidents, based on a dataset that records the offence type (including homicide and related offences, and blackmail and extortion) and the outcome of each incident. The outcome categories are “Arrest,” “Not authorised,” “Other,” and “Summons.” The dataset is the release for the year ending June 2024 and covers the years 2015 to 2024. The analysis compares the number of incidents across these outcomes, formulates a hypothesis about how incidents are distributed, and uses hypothesis testing and regression analysis to assess the association between outcome and the number of incidents.

The analysis consists of the following steps: exploratory data analysis, hypothesis testing, regression analysis, and visualizations to summarize key findings.

Problem Statement

Data Loading

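# Load the packages used in this report
library(readxl)   # read_excel()
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # plots

# Load the Excel dataset
data <- read_excel("C:/Users/naidu/Downloads/Data_Tables_Alleged_Offender_Incidents_Visualisation_Year_Ending_June_2024.xlsx")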

# View the first few rows of the dataset
head(data)

Data Cont.

## Step 3: Clean and explore the dataset

# Check the structure of the dataset
str(data)
## tibble [1,014 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Year                      : num [1:1014] 2024 2024 2024 2024 2024 ...
##  $ Year ending               : chr [1:1014] "June" "June" "June" "June" ...
##  $ Offence Division          : chr [1:1014] "A Crimes against the person" "A Crimes against the person" "A Crimes against the person" "A Crimes against the person" ...
##  $ Offence Subdivision       : chr [1:1014] "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" ...
##  $ Outcome                   : chr [1:1014] "Arrest" "Not authorised" "Other" "Summons" ...
##  $ Alleged Offender Incidents: num [1:1014] 172 56 55 33 7990 ...
# Check for missing values
sum(is.na(data))
## [1] 0
# View summary statistics
summary(data)
##       Year      Year ending        Offence Division   Offence Subdivision
##  Min.   :2015   Length:1014        Length:1014        Length:1014        
##  1st Qu.:2017   Class :character   Class :character   Class :character   
##  Median :2020   Mode  :character   Mode  :character   Mode  :character   
##  Mean   :2020                                                            
##  3rd Qu.:2022                                                            
##  Max.   :2024                                                            
##    Outcome          Alleged Offender Incidents
##  Length:1014        Min.   :    2.0           
##  Class :character   1st Qu.:   52.0           
##  Mode  :character   Median :  339.5           
##                     Mean   : 1593.5           
##                     3rd Qu.: 1712.0           
##                     Max.   :29483.0
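
The single total returned by sum(is.na(data)) would not show which column holds missing values if there were any. A minimal sketch of a per-column check follows; the `toy` data frame and its values are invented for illustration and are not the report's data.

```r
# Toy data frame that mimics the dataset's column names (values are invented)
toy <- data.frame(
  Year = c(2023, 2024, 2024, NA),
  Outcome = c("Arrest", "Summons", NA, "Other"),
  `Alleged Offender Incidents` = c(172, 33, 55, 56),
  check.names = FALSE  # keep the column name that contains spaces
)

# Missing values per column rather than one overall total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows per outcome, with missing outcomes counted separately
outcome_counts <- table(toy$Outcome, useNA = "ifany")
print(outcome_counts)
```

Applied to the real data, the same two calls would confirm column by column that nothing is missing and show how many records each outcome contributes.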

Descriptive Statistics and Visualisation

# Group the data by Outcome and summarise the incident counts
outcome_summary <- data %>%
  group_by(Outcome) %>%
  summarise(
    Total_Incidents = sum(`Alleged Offender Incidents`),  # Backticks around the column name
    Mean_Incidents = mean(`Alleged Offender Incidents`),
    Median_Incidents = median(`Alleged Offender Incidents`),
    SD_Incidents = sd(`Alleged Offender Incidents`)
  )

# Display summary
print(outcome_summary)
## # A tibble: 4 × 5
##   Outcome        Total_Incidents Mean_Incidents Median_Incidents SD_Incidents
##   <chr>                    <dbl>          <dbl>            <dbl>        <dbl>
## 1 Arrest                  631380          2515.              852        3771.
## 2 Not authorised          273923          1058.              315        2029.
## 3 Other                   275137          1142.              129        2762.
## 4 Summons                 435329          1655.              457        2745.

Descriptive Statistics

  • The dataset was grouped by Outcome, and summary statistics were calculated to understand the distribution of incidents across different categories. The summary statistics are shown in the table.

  • As the table above shows, “Arrest” accounts for the most incidents (631,380 in total), followed by “Summons” (435,329), while “Other” (275,137) and “Not authorised” (273,923) have the fewest. The mean number of incidents per record is also highest for “Arrest,” at about 2,515, compared with 1,058 to 1,655 for the other outcomes. In every category the median is well below the mean, which points to a right-skewed distribution.
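
The totals in the summary table can also be expressed as shares of all incidents, which makes the ranking of outcomes easier to read. The sketch below copies the four totals from the table by hand; the object names `totals`, `shares` and `ranking` are illustrative.

```r
# Totals copied from the outcome_summary table
totals <- c(
  "Arrest" = 631380,
  "Not authorised" = 273923,
  "Other" = 275137,
  "Summons" = 435329
)

# Share of all alleged offender incidents accounted for by each outcome
shares <- totals / sum(totals)
print(round(100 * shares, 1))

# Outcomes ordered from the largest to the smallest total
ranking <- names(sort(totals, decreasing = TRUE))
print(ranking)
```

On these figures “Arrest” makes up roughly 39% of all recorded incidents, and “Not authorised” rather than “Summons” sits at the bottom of the ranking.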

# Bar plot showing total incidents by outcome
ggplot(outcome_summary, aes(x = Outcome, y = Total_Incidents, fill = Outcome)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Alleged Offender Incidents by Outcome",
    x = "Outcome",
    y = "Total Incidents"
  ) +
  theme_minimal()

# Box plot showing the spread of incidents across outcomes
ggplot(data, aes(x = Outcome, y = `Alleged Offender Incidents`, fill = Outcome)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Alleged Offender Incidents by Outcome",
    x = "Outcome",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

# Histogram of alleged offender incidents
ggplot(data, aes(x = `Alleged Offender Incidents`, fill = Outcome)) +
  geom_histogram(binwidth = 500, position = "dodge") +
  labs(
    title = "Histogram of Alleged Offender Incidents",
    x = "Alleged Offender Incidents",
    y = "Count"
  ) +
  theme_minimal()

# Scatter plot of incidents by year
ggplot(data, aes(x = Year, y = `Alleged Offender Incidents`, color = Outcome)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Alleged Offender Incidents by Year",
    x = "Year",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

Visualizations

  • To better understand the distribution of incidents by outcome, several visualizations were created:

  • Bar Plot of Total Incidents by Outcome: The bar plot shows the total number of incidents for each outcome type. It compares “Arrest” with the other outcomes and shows that “Arrest” records a markedly higher total.

  • Box Plot of Incidents by Outcome: The box plot is useful for understanding the spread of incident counts within each of the four outcomes (Fogliato et al., 2023). Comparing the “Arrest” distribution with those of “Not authorised,” “Other,” and “Summons” shows that variability is notably higher for “Arrest,” which is consistent with its larger standard deviation (about 3,771, against 2,029 to 2,762 for the other outcomes).

  • Histogram of Alleged Offender Incidents: The histogram shows the frequency distribution of alleged offender incident counts. Most records fall in the lowest bins, which means that the majority of records have relatively few incidents while a small number have very many.

  • Scatter Plot of Alleged Offender Incidents by Year: This scatter plot is useful for examining the data over time. It suggests that the highest incident counts occur in particular years, with no consistent upward or downward trend.

  • These visualizations present an overview of how the frequency of alleged offender incidents differs in terms of outcomes and time.
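
The right skew visible in the histogram and box plot can be quantified from the summary(data) output alone. The sketch below copies those figures by hand and applies the usual 1.5 × IQR rule for flagging outliers; note that these quartiles describe the whole dataset, not each outcome separately, and the object names are illustrative.

```r
# Quartiles, extremes and mean copied from the summary(data) output
incident_summary <- c(
  min = 2, q1 = 52, median = 339.5, mean = 1593.5, q3 = 1712, max = 29483
)

# A mean far above the median indicates a long right tail
mean_to_median <- incident_summary[["mean"]] / incident_summary[["median"]]

# Tukey's rule: values above Q3 + 1.5 * IQR are treated as outliers
iqr <- incident_summary[["q3"]] - incident_summary[["q1"]]
upper_fence <- incident_summary[["q3"]] + 1.5 * iqr

print(mean_to_median)
print(upper_fence)
print(incident_summary[["max"]] > upper_fence)
```

The mean is more than four times the median, and the maximum of 29,483 lies far beyond the upper fence, which agrees with the reading that a few records carry very large counts.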

Hypothesis Testing

# Split the data into two groups: "Arrest" and all other outcomes combined
arrest_incidents <- data$`Alleged Offender Incidents`[data$Outcome == "Arrest"]
other_incidents <- data$`Alleged Offender Incidents`[data$Outcome != "Arrest"]

# Perform two-sample t-test
t_test_result <- t.test(arrest_incidents, other_incidents, alternative = "two.sided")

# Display t-test result
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  arrest_incidents and other_incidents
## t = 4.8012, df = 327.86, p-value = 2.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   723.2551 1727.3493
## sample estimates:
## mean of x mean of y 
##  2515.458  1290.156
# Display confidence interval
print(t_test_result$conf.int)
## [1]  723.2551 1727.3493
## attr(,"conf.level")
## [1] 0.95
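
The output is labelled a Welch test because t.test() does not assume equal variances by default. As a rough illustration of what that involves, the sketch below rebuilds the Welch statistic and its degrees of freedom by hand and compares them with t.test(). The two samples are simulated and are an assumption made for illustration; they are not the report's data.

```r
set.seed(1)  # reproducible toy data

# Two invented samples with different sizes and spreads
group_a <- rnorm(40, mean = 2500, sd = 3000)
group_b <- rnorm(120, mean = 1300, sd = 2000)

# Variance of each sample mean: each group keeps its own variance
v_a <- var(group_a) / length(group_a)
v_b <- var(group_b) / length(group_b)

# Welch t statistic: difference in means over the combined standard error
manual_t <- (mean(group_a) - mean(group_b)) / sqrt(v_a + v_b)

# Welch-Satterthwaite approximation to the degrees of freedom
manual_df <- (v_a + v_b)^2 /
  (v_a^2 / (length(group_a) - 1) + v_b^2 / (length(group_b) - 1))

# t.test() uses the Welch form unless var.equal = TRUE is set
welch <- t.test(group_a, group_b)
print(c(manual_t = manual_t, builtin_t = unname(welch$statistic)))
print(c(manual_df = manual_df, builtin_df = unname(welch$parameter)))
```

The non-integer degrees of freedom in the report's output (327.86) come from this approximation, which adjusts for the unequal group sizes and variances.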

Regression Analysis

# Convert Outcome to factor for regression
data$Outcome <- factor(data$Outcome)

# Fit a linear regression model
regression_model <- lm(`Alleged Offender Incidents` ~ Outcome, data = data)

# Display the summary of the regression model
summary(regression_model)
## 
## Call:
## lm(formula = `Alleged Offender Incidents` ~ Outcome, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2513.5 -1291.5  -960.1   255.0 28341.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2515.5      182.3  13.801  < 2e-16 ***
## OutcomeNot authorised  -1457.8      255.8  -5.700 1.57e-08 ***
## OutcomeOther           -1373.8      260.4  -5.275 1.62e-07 ***
## OutcomeSummons          -860.2      254.8  -3.376 0.000764 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2888 on 1010 degrees of freedom
## Multiple R-squared:  0.03858,    Adjusted R-squared:  0.03572 
## F-statistic: 13.51 on 3 and 1010 DF,  p-value: 1.202e-08
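# Visualize the regression result: raw points with each outcome's mean overlaid.
# With a categorical predictor the fitted values are the group means, so
# geom_smooth(method = "lm") does not produce a meaningful line here.
ggplot(data, aes(x = Outcome, y = `Alleged Offender Incidents`)) +
  geom_jitter(width = 0.2, alpha = 0.4) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
  labs(
    title = "Regression Analysis: Incidents vs. Outcome",
    x = "Outcome",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()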
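
With a factor as the only predictor, the regression coefficients are group means in disguise: the intercept is the mean of the reference level (“Arrest,” the first level alphabetically) and each remaining coefficient is that outcome's mean minus the reference mean. In the output above, for example, 2515.5 − 1457.8 ≈ 1057.7, which matches the “Not authorised” mean of about 1,058 in the descriptive table. The sketch below shows the same relationship on a small invented data frame; `toy` and its values are assumptions for illustration only.

```r
# Invented data: three outcome groups with different typical counts
toy <- data.frame(
  outcome = factor(rep(c("Arrest", "Other", "Summons"), each = 4)),
  incidents = c(30, 50, 40, 60,  10, 20, 15, 35,  25, 30, 20, 45)
)

# Group means computed directly
group_means <- tapply(toy$incidents, toy$outcome, mean)

# Linear model with the factor as the only predictor
fit <- lm(incidents ~ outcome, data = toy)
coefs <- coef(fit)

# Intercept = mean of the reference level ("Arrest");
# each other coefficient = that group's mean minus the reference mean
rebuilt_means <- c(
  Arrest = coefs[["(Intercept)"]],
  Other = coefs[["(Intercept)"]] + coefs[["outcomeOther"]],
  Summons = coefs[["(Intercept)"]] + coefs[["outcomeSummons"]]
)

print(group_means)
print(rebuilt_means)
```

Read this way, the model restates the differences in mean incidents between outcomes, and the low R-squared (about 0.04) indicates that outcome alone explains little of the variation in incident counts.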

Discussion

References