Introduction

A hate crime is defined as any criminal offence which is perceived, by the victim or any other person, to be motivated by hostility or prejudice towards a personal characteristic
The police force primarily focuses on five strands of hate crime: race or ethnicity, religion or beliefs, sexual orientation, disability, and transgender identity
Data on these offences can improve police response

Introduction (Cont.)

Although the data focuses on five strands of hate crime, only sexual orientation and transgender identity will be analysed in this report
Therefore, rather than focusing on hate crime as a whole, I will be focusing on hate crime within the LGBTQ+ community

Problem Statement

In this report, I will test whether there is a statistically significant difference between the average number of offences motivated by sexual orientation and the average number motivated by transgender identity, using a two-sample t-test
Summary statistics will be computed for the Number_of_offences variable, comparing the motivating factors sexual orientation and transgender identity
Outliers will be found using boxplot visualisations and missing values will be removed
QQ plots will be used to visually check for normality

Data

The data set can be found here: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1023568/prc-hate-crime-open-data-121021.ods
Variables
- Financial Year: [2011/12, 2012/13, 2013/14, 2014/15, 2015/16, 2016/17, 2017/18, 2018/19, 2019/20, 2020/21] FACTOR
- Force Name: Police forces across England and Wales [Avon and Somerset, Bedfordshire, British Transport Police, Cambridgeshire, …] FACTOR
- Motivating Factor: Type of hate crime [Disability, Race, Religion, Sexual orientation, Transgender identity] FACTOR
- Number of Offences: Number of hate crimes committed NUMERIC

Raw Data Set

hatecrime = read.csv('C:\\Users\\barcus\\OneDrive - RMIT University\\Desktop\\Applied Analytics\\HateCrime.csv')
knitr::kable(head(hatecrime,5))

ï..Financial.Year	Force.Name	Motivating.factor	Number.of.offences
2011/12	Avon and Somerset	Disability	113
2011/12	Bedfordshire	Disability	6
2011/12	British Transport Police	Disability	25
2011/12	Cambridgeshire	Disability	6
2011/12	Cheshire	Disability	7

Preprocessing

Importing the data
Renaming the columns so that words are separated by underscores (for example, Number_of_offences)
Converting the variable Number_of_offences to a numeric data type by removing commas
Filtering out Disability, Race, and Religion from the Motivating_factor variable
Converting character data types to factor data types
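
The steps above could look roughly like the sketch below. It is not the original preprocessing code, which is not shown in these slides: it uses base R and a small made-up data frame (`raw`) in place of the imported file, and the underscore naming is inferred from the tidied table.

```r
# Hypothetical stand-in for the imported data; the values are illustrative only
raw <- data.frame(
  Financial.Year     = c("2011/12", "2011/12", "2011/12", "2011/12"),
  Force.Name         = c("Force A", "Force A", "Force B", "Force B"),
  Motivating.factor  = c("Race", "Sexual orientation",
                         "Transgender identity", "Sexual orientation"),
  Number.of.offences = c("1,234", "113", "16", NA),
  stringsAsFactors   = FALSE
)

# Use underscores instead of dots in the column names
names(raw) <- gsub(".", "_", names(raw), fixed = TRUE)

# Strip thousands separators, then convert the counts to numeric
raw$Number_of_offences <- as.numeric(gsub(",", "", raw$Number_of_offences))

# Keep only the two strands analysed in this report
keep <- c("Sexual orientation", "Transgender identity")
hatecrimefiltered <- raw[raw$Motivating_factor %in% keep, ]

# Convert the remaining character columns to factors
char_cols <- sapply(hatecrimefiltered, is.character)
hatecrimefiltered[char_cols] <- lapply(hatecrimefiltered[char_cols], factor)

str(hatecrimefiltered)
```

Filtering before the factor conversion means the unused levels (Disability, Race, Religion) never appear in Motivating_factor, so later grouped summaries and plots only show the two strands of interest.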

Filtered & Tidied Data

knitr::kable(head(hatecrimefiltered,5))

Financial_Year	Force_Name	Motivating_factor	Number_of_offences
2011/12	Avon and Somerset	Transgender identity	16
2011/12	Bedfordshire	Transgender identity	1
2011/12	British Transport Police	Transgender identity	5
2011/12	Cambridgeshire	Transgender identity	1
2011/12	Cheshire	Transgender identity	5

Descriptive Statistics and Visualisation

The average number of offences motivated by sexual orientation is higher than the average number of offences motivated by transgender identity
In both groups the mean is well above the median, which indicates that the data are positively skewed

# Summary statistics for Number_of_offences, by motivating factor
CrimeByMotivatingFactor <- hatecrimefiltered %>%
  group_by(Motivating_factor) %>%
  summarise(
    Min     = min(Number_of_offences, na.rm = TRUE),
    Q1      = quantile(Number_of_offences, probs = .25, na.rm = TRUE),
    Median  = median(Number_of_offences, na.rm = TRUE),
    Q3      = round(quantile(Number_of_offences, probs = .75, na.rm = TRUE), 1),
    Max     = max(Number_of_offences, na.rm = TRUE),
    Mean    = round(mean(Number_of_offences, na.rm = TRUE), 1),
    SD      = round(sd(Number_of_offences, na.rm = TRUE), 1),
    n       = n(),                                  # row count, including rows with a missing value
    Missing = sum(is.na(Number_of_offences))
  )
knitr::kable(CrimeByMotivatingFactor)

Motivating_factor	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Sexual orientation	5	56	121	240	3035	218.1	341.8	440	1
Transgender identity	0	6	17	37	292	30.3	39.8	440	1

Histograms with Normal Overlay

plotNormalHistogram(hatecrime_sexuality$Number_of_offences, col = 'red', main = 'Number of Offences Motivated By Sexuality - Distribution', length = 500)
plotNormalHistogram(hatecrime_trans$Number_of_offences, col = 'red', main = 'Number of Offences Motivated By Transgender Identity - Distribution', length = 500)

Note: Missing values were removed as they made up less than 5% of the data
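
As a quick check of the 5% figure, the summary table reports 1 missing value out of 440 rows in each group. A minimal sketch of the arithmetic, with those counts typed in by hand rather than read from the data:

```r
# Counts taken from the summary table (n and Missing columns)
n_rows    <- c(sexual_orientation = 440, transgender_identity = 440)
n_missing <- c(sexual_orientation = 1,   transgender_identity = 1)

# Percentage of rows with a missing Number_of_offences value
pct_missing <- 100 * sum(n_missing) / sum(n_rows)
round(pct_missing, 2)
## [1] 0.23

# Rows left in each group once missing values are dropped
n_rows - n_missing
```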

Visual Representation of Outliers

92 outliers were detected using Tukey’s method of outlier detection, applied to the number of offences for both motivating factors combined

is.outlier = function(x){(x < summary(x)[2] - 1.5*IQR(x))|(x > summary(x)[5] + 1.5*IQR(x))}
sum(is.outlier(hatecrimecomplete$Number_of_offences))

## [1] 92
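
The count above pools both motivating factors, whereas a grouped boxplot draws its whiskers separately for each group. A per-group count could be sketched as below. `tukey_outlier` and the toy data frame `toy` are illustrative names, not objects from the original analysis, and `quantile()` may place the quartiles slightly differently from the hinges used by `boxplot()`.

```r
# Flag values beyond 1.5 * IQR from the quartiles (Tukey's fences)
tukey_outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  (x < q[1] - fence) | (x > q[2] + fence)
}

# Made-up counts for two groups, for illustration only
toy <- data.frame(
  Motivating_factor  = rep(c("A", "B"), each = 5),
  Number_of_offences = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

# Number of flagged values within each group
outliers_by_group <- tapply(toy$Number_of_offences, toy$Motivating_factor,
                            function(x) sum(tukey_outlier(x)))
outliers_by_group
```

Counting within groups matters here because the two strands sit on very different scales, so fences computed on the pooled data will mostly flag the larger sexual orientation counts.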

boxplot(Number_of_offences~Motivating_factor, data = hatecrimecomplete, xlab = "Motivating Factor",
        ylab = "Number of Offences", main = "Number Of Offences Motivated By Sexuality Compared To Transgender Identity", col=c("blue", "pink"))

Hypothesis Testing

H0: There is no difference between the mean number of offences motivated by sexual orientation and the mean number motivated by transgender identity

\[H_0: \mu_1 - \mu_2 = 0 \]

HA: There is a difference between the mean number of offences motivated by sexual orientation and the mean number motivated by transgender identity

\[H_A: \mu_1 - \mu_2 \ne 0 \]

Here \(\mu_1\) is the mean for sexual orientation and \(\mu_2\) is the mean for transgender identity

Hypothesis Testing - QQ Plots

QQ plots are used here to check for normality, although with both samples larger than 30 the t-test does not depend on the data being normally distributed

hatecrime_sexuality$Number_of_offences %>% qqPlot(dist="norm", main = 'Number Of Offences Motivated By Sexuality')

## [1] 377 421

hatecrime_trans$Number_of_offences %>% qqPlot(dist="norm", main = 'Number Of Offences Motivated By Transgender Identity')

## [1] 377 421

Hypothesis Testing - Assumptions

Normality

Neither QQ plot is consistent with a normal distribution, as many data points lie outside the 95% confidence band. These points are concentrated in the right tail, indicating a positive skew
However, because both samples are large (n = 439 per group after removing one missing value from each), I can continue with the test

Central Limit Theorem

Because each sample size is greater than 30, I can assume that the sampling distribution of each mean is approximately normal, regardless of whether the underlying population distribution is
Therefore, I can still perform a two-sample t-test
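
The central limit theorem argument can be illustrated with a small simulation. This is a sketch on synthetic data, not on the hate crime counts: it draws repeated samples from a strongly right-skewed exponential distribution and looks at how the sample means behave. The sample size of 439 is an assumption based on 440 rows per group minus one missing value.

```r
set.seed(1)  # make the simulation reproducible

n_per_sample <- 439   # assumed group size after dropping one missing value
n_samples    <- 2000  # number of repeated samples

# Mean of each repeated sample from a right-skewed Exp(1) population
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1)))

# Exp(1) has mean 1 and standard deviation 1, so the CLT predicts
# sample means centred on 1 with standard error 1 / sqrt(n_per_sample)
c(mean = mean(sample_means),
  sd = sd(sample_means),
  predicted_se = 1 / sqrt(n_per_sample))

# The histogram of sample means should look close to a bell curve
hist(sample_means, breaks = 40, main = "Sample means of skewed data")
```

The individual draws are heavily skewed, but the means of samples this large cluster symmetrically around the population mean, which is the property the t-test relies on.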

Homogeneity of Variance - Levene’s Test

\[H_0: \sigma^2_1 = \sigma^2_2 \]

\[H_A: \sigma^2_1 \ne \sigma^2_2 \]

leveneTest(Number_of_offences~Motivating_factor, data = hatecrimecomplete) %>%
 as.data.frame()

The p-value for Levene’s test of equal variance for the number of offences motivated by sexual orientation compared to transgender identity was p = 1.44e-18
Since the p-value is less than 0.05, the result is statistically significant and equal variances cannot be assumed
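
For readers without the `car` package, the idea behind the test can be sketched in base R. By default `leveneTest()` centres on the group medians (the Brown-Forsythe variant), which amounts to a one-way ANOVA on the absolute deviations from each group's median. The data below are simulated purely for illustration, and `toy`, `abs_dev` and `levene_p` are hypothetical names.

```r
set.seed(42)

# Simulated groups with clearly different spread (illustrative only)
toy <- data.frame(
  group = factor(rep(c("narrow", "wide"), each = 100)),
  value = c(rnorm(100, mean = 50, sd = 1), rnorm(100, mean = 50, sd = 10))
)

# Absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on the deviations: a small p-value suggests unequal variances
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_p <- levene_fit[["Pr(>F)"]][1]
levene_p < 0.05
```

Centring on the median rather than the mean makes the test less sensitive to skewed data, which is relevant for offence counts with a long right tail.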

Hypothesis Testing - Two-sample t-test

Because equal variances cannot be assumed, the two-sample t-test is run with var.equal = FALSE, which gives Welch's version of the test

t.test(
  Number_of_offences~Motivating_factor, 
  data = hatecrimecomplete,
  var.equal = FALSE,
  alternative = "two.sided"
)

## 
##  Welch Two Sample t-test
## 
## data:  Number_of_offences by Motivating_factor
## t = 11.432, df = 449.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Sexual orientation and group Transgender identity is not equal to 0
## 95 percent confidence interval:
##  155.4816 220.0355
## sample estimates:
##   mean in group Sexual orientation mean in group Transgender identity 
##                          218.10478                           30.34624

# Difference between the two means

218.10478 - 30.34624

## [1] 187.7585
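
The test statistic, degrees of freedom and confidence interval reported by `t.test()` can be approximately reproduced from the summary statistics. The sketch below assumes 439 complete observations per group (440 rows minus one missing value each) and uses the standard deviations as rounded in the summary table, so small rounding differences are expected.

```r
# Group means from the t.test() output; SDs (rounded) from the summary table
m1 <- 218.10478; s1 <- 341.8; n1 <- 439   # sexual orientation (n assumed)
m2 <- 30.34624;  s2 <- 39.8;  n2 <- 439   # transgender identity (n assumed)

# Squared standard error contributed by each group
v1 <- s1^2 / n1
v2 <- s2^2 / n2

se     <- sqrt(v1 + v2)        # standard error of the difference in means
t_stat <- (m1 - m2) / se       # Welch t statistic

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 95% confidence interval for the difference in means
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 2)
```

Because the sexual orientation group has a far larger variance, it dominates both the standard error and the degrees of freedom, which is why the df is close to one group's size rather than the combined total.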

Hypothesis Testing - Interpretation

A two-sample t-test was used to test for a significant difference between the average number of offences motivated by sexual orientation and the average number of offences motivated by transgender identity. While the distribution of the number of offences does not appear normal for either group, the central limit theorem justifies proceeding with a two-sample t-test because both samples are large (n = 439 per group after removing one missing value from each). Levene’s test of homogeneity of variance indicated that equal variances could not be assumed. The two-sample t-test, not assuming equal variances, found a statistically significant difference between the mean number of offences motivated by sexual orientation and transgender identity, t(df = 449.9) = 11.43, p < 2.2e-16, 95% CI for the difference in means [155.48, 220.04]. The results suggest that the mean number of offences motivated by sexual orientation is significantly higher than the mean number motivated by transgender identity

Discussion

Between 2011/12 and 2020/21, the mean number of offences motivated by sexual orientation was 218.10 per police force per year, whereas the mean number motivated by transgender identity was 30.35. This is a large difference: on average, there were 187.76 more offences motivated by sexual orientation than by transgender identity.
This was supported by a two-sample t-test which produced a p-value less than 0.05, allowing us to reject the null hypothesis that there is no difference between the mean number of offences
One strength was the sample size (440 observations for each group, one of which was missing), because larger samples give more precise estimates of the population means
A limitation is that the sample might not reflect the population due to human bias or error. Minority groups are less likely to report crimes to the police, so the data will only reflect reported cases of hate crime
A possible extension is to expand the data to include not only hate crimes reported to the police but also incidents reported online, for example on Reddit. This could provide a more accurate representation of crime committed due to prejudice.
A further analysis into race and gender within these two groups might provide more interesting results
In conclusion, while there are significantly more recorded hate crimes motivated by sexual orientation than by transgender identity, the LGBTQ+ community continually faces prejudice in society

MATH1324 Applied Analytics Assignment 2

Two-Sample T-Test
