Introduction

This report presents a descriptive and inferential analysis of alleged offender incidents, using a dataset that records the offence type (for example, homicide and related offences, and blackmail and extortion) and the outcome recorded for the incidents. The outcomes are “Arrest,” “Not authorised,” “Other,” and “Summons.” The data come from the release for the year ending June 2024 and cover the years 2015 to 2024. The report compares incident counts across these outcomes, states and tests a hypothesis about the difference in counts, and uses regression analysis to estimate the association between outcome type and the number of incidents.

The analysis consists of the following steps: exploratory data analysis, hypothesis testing, regression analysis, and visualizations to summarize key findings.

Problem Statement

Does the number of alleged offender incidents differ across outcome types (Arrest, Not authorised, Other, Summons) in the data for the year ending June 2024?
How does the number of alleged offender incidents vary by outcome, and can the number of incidents be predicted from the outcome type?
Is the average number of incidents associated with “Arrest” outcomes greater than the average for all other outcomes?

Data Loading

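# Load the packages used below (readxl: read_excel; dplyr: %>%, group_by; ggplot2: plots)
library(readxl)
library(dplyr)
library(ggplot2)
# Load the Excel dataset
data <- read_excel("C:/Users/naidu/Downloads/Data_Tables_Alleged_Offender_Incidents_Visualisation_Year_Ending_June_2024.xlsx")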

# View the first few rows of the dataset
head(data)

Data Cont.

Clean and explore the dataset

# Check the structure of the dataset
str(data)

## tibble [1,014 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Year                      : num [1:1014] 2024 2024 2024 2024 2024 ...
##  $ Year ending               : chr [1:1014] "June" "June" "June" "June" ...
##  $ Offence Division          : chr [1:1014] "A Crimes against the person" "A Crimes against the person" "A Crimes against the person" "A Crimes against the person" ...
##  $ Offence Subdivision       : chr [1:1014] "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" "A10 Homicide and related offences & A60 Blackmail and extortion" ...
##  $ Outcome                   : chr [1:1014] "Arrest" "Not authorised" "Other" "Summons" ...
##  $ Alleged Offender Incidents: num [1:1014] 172 56 55 33 7990 ...

# Check for missing values
sum(is.na(data))

## [1] 0

# View summary statistics
summary(data)

##       Year      Year ending        Offence Division   Offence Subdivision
##  Min.   :2015   Length:1014        Length:1014        Length:1014        
##  1st Qu.:2017   Class :character   Class :character   Class :character   
##  Median :2020   Mode  :character   Mode  :character   Mode  :character   
##  Mean   :2020                                                            
##  3rd Qu.:2022                                                            
##  Max.   :2024                                                            
##    Outcome          Alleged Offender Incidents
##  Length:1014        Min.   :    2.0           
##  Class :character   1st Qu.:   52.0           
##  Mode  :character   Median :  339.5           
##                     Mean   : 1593.5           
##                     3rd Qu.: 1712.0           
##                     Max.   :29483.0

The dataset contains 1,014 rows and six columns: Year, Year ending, Offence Division, Offence Subdivision, Outcome and Alleged Offender Incidents. The measure of interest is the count of alleged offender incidents recorded in each row.
The first step was to clean and explore the data. The structure check shows two numeric columns and four character columns, and the missing-value check returned 0, so no values are missing. The data cover the years 2015 to 2024, with incidents categorised by outcome. The summary statistics show that the number of incidents per row ranges from 2 to 29,483, with a median of 339.5 and a mean of 1,593.5.
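The single total returned by `sum(is.na(data))` does not show which column a gap would sit in. As a minimal sketch, the per-column variant below runs on a small made-up data frame: the `toy` object and its values are illustrative assumptions, not rows from the dataset, and it only borrows the column names reported by `str(data)`.

```r
# Toy data frame reusing the dataset's column names (all values are made up)
toy <- data.frame(
  Year = c(2023, 2024, 2024),
  Outcome = c("Arrest", "Summons", "Other"),
  `Alleged Offender Incidents` = c(120, 45, NA),
  check.names = FALSE  # keep the spaces in the column name
)

# Count missing values per column rather than in total
na_by_column <- colSums(is.na(toy))
print(na_by_column)

# Count how many rows each outcome contributes
rows_per_outcome <- table(toy$Outcome)
print(rows_per_outcome)
```

Applied to the real `data` object, the same two calls would show whether any single column has gaps and whether the four outcomes are represented by similar numbers of rows.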

Descriptive Statistics and Visualisation

# Grouping the data by Outcome and summarizing incidents using the correct column name
outcome_summary <- data %>%
  group_by(Outcome) %>%
  summarise(
    Total_Incidents = sum(`Alleged Offender Incidents`),  # Backticks around the column name
    Mean_Incidents = mean(`Alleged Offender Incidents`),
    Median_Incidents = median(`Alleged Offender Incidents`),
    SD_Incidents = sd(`Alleged Offender Incidents`)
  )

# Display summary
print(outcome_summary)

## # A tibble: 4 × 5
##   Outcome        Total_Incidents Mean_Incidents Median_Incidents SD_Incidents
##   <chr>                    <dbl>          <dbl>            <dbl>        <dbl>
## 1 Arrest                  631380          2515.              852        3771.
## 2 Not authorised          273923          1058.              315        2029.
## 3 Other                   275137          1142.              129        2762.
## 4 Summons                 435329          1655.              457        2745.

Descriptive Statistics

The dataset was grouped by Outcome, and summary statistics were calculated to describe the distribution of incidents in each category. The results are shown in the table above.
The table shows that “Arrest” has the highest total (631,380) and the highest mean (about 2,515 incidents per row). “Summons” is second (mean about 1,655), while “Not authorised” and “Other” have the lowest totals and means (about 1,058 and 1,142). In every category the median is far below the mean, which points to a right-skewed distribution.
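The totals in the summary table can also be expressed as shares of all incidents, and dividing each total by its mean gives the number of rows behind each outcome. The sketch below uses only the figures printed in the table; the vector names are chosen here for illustration, and the means are the rounded values shown in the output.

```r
# Totals and (rounded) means copied from the summary table in the report
total_incidents <- c(Arrest = 631380, `Not authorised` = 273923,
                     Other = 275137, Summons = 435329)
mean_incidents <- c(Arrest = 2515, `Not authorised` = 1058,
                    Other = 1142, Summons = 1655)

# Share of all incidents attributed to each outcome, in percent
share <- total_incidents / sum(total_incidents)
print(round(100 * share, 1))

# Implied number of rows per outcome (total divided by mean, rounded)
implied_rows <- round(total_incidents / mean_incidents)
print(implied_rows)
print(sum(implied_rows))
```

The implied row counts should add up to the 1,014 rows reported by `str(data)`, which is a quick consistency check on the table.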

# Bar plot showing total incidents by outcome
ggplot(outcome_summary, aes(x = Outcome, y = Total_Incidents, fill = Outcome)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Alleged Offender Incidents by Outcome",
    x = "Outcome",
    y = "Total Incidents"
  ) +
  theme_minimal()

# Box plot showing the spread of incidents across outcomes
ggplot(data, aes(x = Outcome, y = `Alleged Offender Incidents`, fill = Outcome)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Alleged Offender Incidents by Outcome",
    x = "Outcome",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

# Histogram of alleged offender incidents
ggplot(data, aes(x = `Alleged Offender Incidents`, fill = Outcome)) +
  geom_histogram(binwidth = 500, position = "dodge") +
  labs(
    title = "Histogram of Alleged Offender Incidents",
    x = "Alleged Offender Incidents",
    y = "Count"
  ) +
  theme_minimal()

# Scatter plot of incidents by year
ggplot(data, aes(x = Year, y = `Alleged Offender Incidents`, color = Outcome)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Alleged Offender Incidents by Year",
    x = "Year",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

Visualizations

To better understand the distribution of incidents by outcome, several visualizations were created:
Bar Plot of Total Incidents by Outcome: The bar plot shows the total number of incidents for each outcome type. “Arrest” has by far the highest total, followed by “Summons”; “Not authorised” and “Other” are lowest and nearly equal.
Box Plot of Incidents by Outcome: The box plot shows the spread of incident counts within each of the four outcomes (Fogliato et al., 2024). The spread is widest for “Arrest,” whose standard deviation (3,771) exceeds those of “Other” (2,762), “Summons” (2,745) and “Not authorised” (2,029).
Histogram of Alleged Offender Incidents: The histogram shows how often each range of incident counts occurs. Most rows fall in the lowest bins and a few have very large counts, so the distribution is strongly right-skewed.
Scatter Plot of Alleged Offender Incidents by Year: The scatter plot shows how incident counts vary over time. It suggests that counts are higher in some years than in others, with no consistent upward or downward trend.
Together, these visualizations summarise how the number of alleged offender incidents differs across outcomes and over time.
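The skew visible in the histogram can be checked numerically by comparing each outcome's mean with its median, since a mean well above the median indicates a long right tail. The sketch below uses the rounded means and the medians printed in the summary table; the vector names are illustrative.

```r
# Means (rounded) and medians copied from the summary table in the report
mean_incidents <- c(Arrest = 2515, `Not authorised` = 1058,
                    Other = 1142, Summons = 1655)
median_incidents <- c(Arrest = 852, `Not authorised` = 315,
                      Other = 129, Summons = 457)

# A ratio above 1 means the mean exceeds the median (right skew)
ratio <- mean_incidents / median_incidents
print(round(ratio, 2))

# Flag which outcomes show right skew by this rule of thumb
right_skewed <- ratio > 1
print(right_skewed)
```

With this much skew, a log-scaled axis (for example `scale_x_log10()` in ggplot2) may make the histogram and box plot easier to read.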

Hypothesis Testing

# Split the data into two groups: "Arrest" rows and rows with any other outcome
arrest_incidents <- data$`Alleged Offender Incidents`[data$Outcome == "Arrest"]
other_incidents <- data$`Alleged Offender Incidents`[data$Outcome != "Arrest"]

# Perform a two-sample t-test (t.test() uses the Welch unequal-variance version by default)
t_test_result <- t.test(arrest_incidents, other_incidents, alternative = "two.sided")

# Display t-test result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  arrest_incidents and other_incidents
## t = 4.8012, df = 327.86, p-value = 2.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   723.2551 1727.3493
## sample estimates:
## mean of x mean of y 
##  2515.458  1290.156

# Display confidence interval
print(t_test_result$conf.int)

## [1]  723.2551 1727.3493
## attr(,"conf.level")
## [1] 0.95

A hypothesis test was conducted to determine whether the mean number of incidents differs between “Arrest” and all other outcomes combined. The hypotheses were:
Null Hypothesis (H₀): The mean number of incidents for “Arrest” is equal to the mean number of incidents for all other outcomes.
Alternative Hypothesis (H₁): The mean number of incidents for “Arrest” is different from the mean number of incidents for all other outcomes.
A two-sided Welch two-sample t-test gave a t-statistic of 4.8012 on about 327.86 degrees of freedom, with a p-value of 2.401e-06. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the mean number of incidents for “Arrest” (2,515.46) differs from the mean for the other outcomes (1,290.16).
The 95% confidence interval for the difference in means is (723.26, 1727.35). Because the interval lies entirely above zero, the mean for “Arrest” is estimated to be higher than the mean for the other outcomes by roughly 723 to 1,727 incidents.
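The reported interval and p-value can be reproduced from the other numbers in the t-test output, which makes the link between the t-statistic, the standard error and the confidence interval explicit. This is a sketch that uses only the printed values; the variable names are chosen here for illustration, and small differences from the printed output come from rounding of the t-statistic.

```r
# Values copied from the Welch t-test output in the report
mean_arrest <- 2515.458
mean_other  <- 1290.156
t_stat      <- 4.8012
df_welch    <- 327.86

# Difference in means, and the standard error implied by t = difference / SE
diff_means <- mean_arrest - mean_other
std_error  <- diff_means / t_stat

# 95% confidence interval: difference +/- critical t value * SE
t_crit <- qt(0.975, df = df_welch)
ci <- diff_means + c(-1, 1) * t_crit * std_error
print(round(ci, 2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
print(p_value)
```

The third research question is directional, so a one-sided version could also be run with `alternative = "greater"` in `t.test()`.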

Regression Analysis

# Convert Outcome to factor for regression
data$Outcome <- factor(data$Outcome)

# Fit a linear regression model
regression_model <- lm(`Alleged Offender Incidents` ~ Outcome, data = data)

# Display the summary of the regression model
summary(regression_model)

## 
## Call:
## lm(formula = `Alleged Offender Incidents` ~ Outcome, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2513.5 -1291.5  -960.1   255.0 28341.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2515.5      182.3  13.801  < 2e-16 ***
## OutcomeNot authorised  -1457.8      255.8  -5.700 1.57e-08 ***
## OutcomeOther           -1373.8      260.4  -5.275 1.62e-07 ***
## OutcomeSummons          -860.2      254.8  -3.376 0.000764 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2888 on 1010 degrees of freedom
## Multiple R-squared:  0.03858,    Adjusted R-squared:  0.03572 
## F-statistic: 13.51 on 3 and 1010 DF,  p-value: 1.202e-08

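# Visualize the regression result: raw points plus the fitted value for each outcome
# (with a categorical predictor, the fitted values are the group means)
ggplot(data, aes(x = Outcome, y = `Alleged Offender Incidents`)) +
  geom_point(alpha = 0.3) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +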
  labs(
    title = "Regression Analysis: Incidents vs. Outcome",
    x = "Outcome",
    y = "Alleged Offender Incidents"
  ) +
  theme_minimal()

A linear regression was then fitted to examine the relationship between outcome type and the number of incidents. Outcome was treated as a categorical predictor, and the number of alleged offender incidents was the dependent variable.
“Arrest” is the reference category, so the intercept (2,515.5) is the mean number of incidents for “Arrest,” and each coefficient is the difference between that outcome's mean and the “Arrest” mean (Diaz et al., 2024). The coefficients for “Not authorised” (-1,457.8), “Other” (-1,373.8) and “Summons” (-860.2) are all negative, and all have p-values below 0.05, so each of these outcomes has a significantly lower mean number of incidents than “Arrest.”
The adjusted R-squared of 0.0357 means that the model accounts for only about 3.57% of the variability in the number of incidents. This suggests that factors outside the model account for most of the variation.
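Because the only predictor is a factor, the fitted value for each outcome is the intercept plus that outcome's coefficient, and the R-squared follows from the F-statistic and its degrees of freedom. The sketch below rebuilds both from the printed regression summary; the variable names are illustrative, and the inputs are the rounded values shown in the output.

```r
# Coefficients copied from the regression summary ("Arrest" is the reference level)
intercept <- 2515.5
coefs <- c(`Not authorised` = -1457.8, Other = -1373.8, Summons = -860.2)

# Fitted mean for each outcome = intercept + that outcome's coefficient
fitted_means <- c(Arrest = intercept, intercept + coefs)
print(fitted_means)

# R-squared implied by the F-statistic: F * df1 / (F * df1 + df2)
f_stat <- 13.51
df1 <- 3
df2 <- 1010
r_squared <- f_stat * df1 / (f_stat * df1 + df2)
print(round(r_squared, 4))
```

The fitted means should agree with the Mean_Incidents column of the descriptive table, and the R-squared with the Multiple R-squared line of the summary, up to rounding.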

Discussion

The hypothesis test confirms that “Arrest” outcomes differ from the other outcomes in the mean number of incidents: the mean for “Arrest” is higher than the mean for the non-arrest outcomes.
The regression analysis further supports this finding. The negative coefficients mean that “Not authorised,” “Other,” and “Summons” outcomes have lower mean incident counts than the “Arrest” outcome.
The low R-squared of the regression model indicates that outcome type alone explains little of the variability in the number of incidents. Future work could add other variables, such as offence type, location, or demographics, to build a better model.
In conclusion, grouping the data by outcome reveals clear differences in the number of incidents across outcomes (Boateng et al., 2024). “Arrest” outcomes are associated with more incidents than any other outcome, and further research could identify additional factors that influence these counts. These results may have implications for policing policy and crime prevention measures.

References

Boateng, F.D., Pryce, D.K., Dzordzormenyoh, M.K., Hsieh, M.L. and Cuff, A., 2024. Empirical Examination of Factors that Influence Official Decisions in Criminal Cases Against Police Officers. American Journal of Criminal Justice, 49(3), pp.462-484.
Fogliato, R., Kuchibhotla, A.K., Lipton, Z., Nagin, D., Xiang, A. and Chouldechova, A., 2024. Estimating the likelihood of arrest from police records in presence of unreported crimes. The Annals of Applied Statistics, 18(2), pp.1253-1274.
Taaka, S.S., Tamatea, A. and Polaschek, D.L., 2024. Predicting Physical Violence Against Corrections Officers Across Three Levels of Severity Using Individual and Environmental Characteristics. Journal of Interpersonal Violence, p.08862605241287802.
McCarthy, A., Fox, B. and Verona, E., 2024. The relationship between psychopathy facets and types of criminal offences. Journal of Investigative Psychology and Offender Profiling, p.e1628.
Diaz, C.L., Lowder, E.M., Bohmert, M.N., Ying, M. and Hatfield, T., 2024. A retrospective study of the role of probation revocation in future criminal justice involvement. Journal of Criminal Justice, 93, p.102225.
Gonaygunta, H., Meduri, S.S., Podicheti, S. and Nadella, G.S., 2023. The Impact of Virtual Reality on Social Interaction and Relationship via Statistical Analysis. International Journal of Machine Learning for Sustainable Development, 5(2), pp.1-20.

Assignment 2

Statistical Analysis of Alleged Offender Incidents

Anamala venkata Nagendra and S4024233

2024-10-14