MATH1324 Applied Analytics Assignment 2

Two-Sample T-Test

Belle Arcus s3951382

Last updated: 26 May, 2022

Introduction

Introduction (Cont.)

Problem Statement

Data

Raw Data Set

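# fileEncoding = 'UTF-8-BOM' strips the byte-order mark that otherwise corrupts the first column name
hatecrime = read.csv('C:\\Users\\barcus\\OneDrive - RMIT University\\Desktop\\Applied Analytics\\HateCrime.csv', fileEncoding = 'UTF-8-BOM')
knitr::kable(head(hatecrime,5))
| Financial.Year | Force.Name | Motivating.factor | Number.of.offences |
|:--|:--|:--|--:|
| 2011/12 | Avon and Somerset | Disability | 113 |
| 2011/12 | Bedfordshire | Disability | 6 |
| 2011/12 | British Transport Police | Disability | 25 |
| 2011/12 | Cambridgeshire | Disability | 6 |
| 2011/12 | Cheshire | Disability | 7 |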

Preprocessing

Filtered & Tidied Data

knitr::kable(head(hatecrimefiltered,5))
| Financial_Year | Force_Name | Motivating_factor | Number_of_offences |
|:--|:--|:--|--:|
| 2011/12 | Avon and Somerset | Transgender identity | 16 |
| 2011/12 | Bedfordshire | Transgender identity | 1 |
| 2011/12 | British Transport Police | Transgender identity | 5 |
| 2011/12 | Cambridgeshire | Transgender identity | 1 |
| 2011/12 | Cheshire | Transgender identity | 5 |
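The preprocessing code itself is not shown on the slide. As a rough sketch only, assuming dplyr, the filtered and tidied table could be produced along the following lines. The data frame `toy_raw` and its values are made up for illustration and are not taken from the data set; `toy_filtered` is likewise a hypothetical name.

```r
library(dplyr)

# Hypothetical stand-in for the raw data; the values are made up.
toy_raw <- data.frame(
  Financial.Year     = c("2011/12", "2011/12", "2011/12", "2011/12"),
  Force.Name         = c("Force A", "Force A", "Force A", "Force B"),
  Motivating.factor  = c("Disability", "Sexual orientation",
                         "Transgender identity", "Race"),
  Number.of.offences = c(10, 40, 3, 90)
)

toy_filtered <- toy_raw %>%
  # Tidy the column names (new name = old name).
  rename(
    Financial_Year     = Financial.Year,
    Force_Name         = Force.Name,
    Motivating_factor  = Motivating.factor,
    Number_of_offences = Number.of.offences
  ) %>%
  # Keep only the two motivating factors being compared.
  filter(Motivating_factor %in% c("Sexual orientation", "Transgender identity"))
```

Only the two groups of interest remain in `toy_filtered`, which mirrors the structure of the table above.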

Descriptive Statistics and Visualisation

CrimeByMotivatingFactor <- hatecrimefiltered %>%
  group_by(Motivating_factor) %>%
  summarise(
    Min = min(Number_of_offences, na.rm = TRUE),
    Q1 = quantile(Number_of_offences, probs = .25, na.rm = TRUE),
    Median = median(Number_of_offences, na.rm = TRUE),
    Q3 = round(quantile(Number_of_offences, probs = .75, na.rm = TRUE), 1),
    Max = max(Number_of_offences, na.rm = TRUE),
    Mean = round(mean(Number_of_offences, na.rm = TRUE), 1),
    SD = round(sd(Number_of_offences, na.rm = TRUE), 1),
    n = n(),  # row count per group, including rows with a missing value
    Missing = sum(is.na(Number_of_offences))
  )
knitr::kable(CrimeByMotivatingFactor)
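| Motivating_factor | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| Sexual orientation | 5 | 56 | 121 | 240 | 3035 | 218.1 | 341.8 | 440 | 1 |
| Transgender identity | 0 | 6 | 17 | 37 | 292 | 30.3 | 39.8 | 440 | 1 |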

Histograms with Normal Overlay

plotNormalHistogram(hatecrime_sexuality$Number_of_offences, col = 'red', main = 'Number of Offences Motivated By Sexuality - Distribution', length = 500)
plotNormalHistogram(hatecrime_trans$Number_of_offences, col = 'red', main = 'Number of Offences Motivated By Transgender Identity - Distribution', length = 500)

Note: Missing values were removed because they made up less than 5% of the data (1 of 440 observations in each group).

Visual Representation of Outliers

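# Flag values more than 1.5 * IQR below the first quartile or above the third quartile
is.outlier = function(x){
  q = quantile(x, probs = c(0.25, 0.75))
  (x < q[1] - 1.5*IQR(x)) | (x > q[2] + 1.5*IQR(x))
}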
sum(is.outlier(hatecrimecomplete$Number_of_offences))
## [1] 92
boxplot(Number_of_offences~Motivating_factor, data = hatecrimecomplete, xlab = "Motivating Factor",
        ylab = "Number of Offences", main = "Number Of Offences Motivated By Sexuality Compared To Transgender Identity", col=c("blue", "pink"))
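The outlier count above is computed on the pooled `Number_of_offences` column, whereas the boxplot draws its whiskers separately for each motivating factor, so the two can disagree. A minimal base R sketch of applying the same 1.5 * IQR rule within each group follows; the data frame `toy` and the helper `flag_outliers` are hypothetical, and the counts are made up.

```r
# Made-up counts for two groups; not taken from the hate crime data.
toy <- data.frame(
  group = rep(c("A", "B"), each = 6),
  count = c(5, 6, 7, 8, 9, 60, 1, 2, 2, 3, 3, 40)
)

# Flag values more than 1.5 * IQR outside the quartiles.
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# ave() applies the rule within each group, so every value is judged
# against the quartiles of its own group rather than the pooled data.
toy$outlier <- as.logical(ave(toy$count, toy$group, FUN = flag_outliers))

# Number of flagged values per group.
outliers_per_group <- tapply(toy$outlier, toy$group, sum)
```

Flagging within groups matters here because the two groups sit on very different scales, so a pooled rule would judge the smaller group against quartiles dominated by the larger one.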

Hypothesis Testing

H0: There is no difference between the mean number of offences motivated by sexual orientation (\(\mu_1\)) and the mean number of offences motivated by transgender identity (\(\mu_2\)).

\[H_0: \mu_1 - \mu_2 = 0 \]

HA: There is a difference between the mean number of offences motivated by sexual orientation and the mean number of offences motivated by transgender identity.

\[H_A: \mu_1 - \mu_2 \ne 0 \]

Hypothesis Testing - QQ Plots

hatecrime_sexuality$Number_of_offences %>% qqPlot(dist="norm", main = 'Number Of Offences Motivated By Sexuality')
## [1] 377 421
hatecrime_trans$Number_of_offences %>% qqPlot(dist="norm", main = 'Number Of Offences Motivated By Transgender Identity')
## [1] 377 421

Hypothesis Testing - Assumptions

Normality

Central Limit Theorem

Homogeneity of Variance - Levene’s Test

\[H_0: \sigma^2_1 = \sigma^2_2 \] \[H_A: \sigma^2_1 \ne \sigma^2_2 \]

leveneTest(Number_of_offences~Motivating_factor, data = hatecrimecomplete) %>%
 as.data.frame()
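To make the test above less of a black box: with its default median centring, `leveneTest` from the car package is equivalent to a one-way ANOVA on each observation's absolute deviation from its group median. The sketch below reproduces that idea in base R on simulated data; `toy`, `abs_dev`, `levene_fit` and `levene_p` are hypothetical names, and the numbers are not from the hate crime data.

```r
set.seed(1)

# Simulated data: two groups with the same mean but different spread.
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 50)),
  y     = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 10, sd = 5))
)

# Absolute deviation of each value from its own group median.
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# One-way ANOVA on the deviations; the F test for `group` is the
# median-centred Levene (Brown-Forsythe) test.
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_p <- levene_fit[["Pr(>F)"]][1]
```

A small `levene_p` indicates that the typical deviation differs between groups, which is the situation in which the equal-variance assumption is dropped.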

Hypothesis Testing - Two-sample t-test

t.test(
  Number_of_offences~Motivating_factor, 
  data = hatecrimecomplete,
  var.equal = FALSE,
  alternative = "two.sided"
)
## 
##  Welch Two Sample t-test
## 
## data:  Number_of_offences by Motivating_factor
## t = 11.432, df = 449.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Sexual orientation and group Transgender identity is not equal to 0
## 95 percent confidence interval:
##  155.4816 220.0355
## sample estimates:
##   mean in group Sexual orientation mean in group Transgender identity 
##                          218.10478                           30.34624
# Difference between the two means

218.10478 - 30.34624
## [1] 187.7585
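As a cross-check, the Welch statistic, its degrees of freedom and the confidence interval can be recomputed by hand from the summary values reported earlier. This is a sketch under two assumptions: the standard deviations are the rounded values from the descriptive table, and each group contributes 439 observations (440 rows minus the one missing value). The variable names are hypothetical.

```r
# Group means from the t-test output; SDs from the descriptive table (rounded).
mean_so <- 218.10478   # sexual orientation
mean_tg <- 30.34624    # transgender identity
sd_so   <- 341.8
sd_tg   <- 39.8

# Assumption: 440 rows per group minus 1 missing value each.
n_so <- 439
n_tg <- 439

# Per-group variance of the sample mean.
v_so <- sd_so^2 / n_so
v_tg <- sd_tg^2 / n_tg

# Welch t statistic: difference in means over its standard error.
se     <- sqrt(v_so + v_tg)
t_stat <- (mean_so - mean_tg) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_so + v_tg)^2 / (v_so^2 / (n_so - 1) + v_tg^2 / (n_tg - 1))

# Two-sided p-value and 95% confidence interval for the difference.
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci      <- (mean_so - mean_tg) + c(-1, 1) * qt(0.975, df_welch) * se
```

Because the standard deviations are rounded, the results should land close to, but not exactly on, the values printed by `t.test`.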

Hypothesis Testing - Interpretation

A two-sample t-test was used to test for a difference between the mean number of offences motivated by sexual orientation and the mean number of offences motivated by transgender identity. Neither group's distribution of offence counts appears normal, but with large samples (n = 440 per group, with one missing value removed from each) the central limit theorem implies that the sampling distribution of each group mean is approximately normal, so the t-test remains appropriate. Levene’s test of homogeneity of variance suggested that equal variances could not be assumed, so the Welch version of the test was used. It found a statistically significant difference between the two means, t(449.9) = 11.43, p < 0.001, 95% CI for the difference in means [155.48, 220.04]. The mean number of offences motivated by sexual orientation (218.1) was significantly higher than the mean number motivated by transgender identity (30.3), a difference of about 187.8 offences.
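The appeal to the central limit theorem can be illustrated with a short simulation. The sketch below draws repeated samples from a right-skewed log-normal population, chosen only as a stand-in for skewed count data and not fitted to the hate crime data, and compares the skewness of the raw values with the skewness of the sample means. The `skewness` helper and all parameter values are assumptions made for illustration.

```r
set.seed(42)

n    <- 439    # observations per sample, matching the assumed group size
reps <- 2000   # number of simulated samples

# Simple moment-based skewness; 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of individual draws from the skewed population.
raw_skew <- skewness(rlnorm(10000, meanlog = 4, sdlog = 1))

# Skewness of the means of many samples of size n.
sample_means <- replicate(reps, mean(rlnorm(n, meanlog = 4, sdlog = 1)))
mean_skew    <- skewness(sample_means)
```

The sample means are far less skewed than the individual values, which is the behaviour the t-test relies on. A simulation like this illustrates the argument for one assumed population rather than proving it for the real data.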

Discussion

References