This document analyses a dataset, contributed by the University of Leeds, on crime rates in the United States over a decade-long observational study. The data are compared between the initial year and ten years later. We use several of the 27 variables in the dataset to fit regression lines and determine whether the number of males per 1000 females can predict crime rate, and whether the number of families earning below half the national average wage can predict crime rate. Our final evaluation finds that multiple variables work concurrently to determine crime rate, and that both the ratio of males to females and the number of families below half wage are only weakly correlated with crime rate.
#Load Data
data = read.csv("https://www.sheffield.ac.uk/polopoly_fs/1.886044!/file/Crime_R.csv")
Our data is taken from the Mathematics and Statistics Resource Centre (MASH), based at the University of Sheffield. It was originally contributed by Katy Dobson of the University of Leeds, who investigated crime rates in 47 US states over a 10-year observation period. The data was sourced from https://www.sheffield.ac.uk/mash/statistics/datasets.
The dataset is used by various universities and is a primary teaching dataset for MASH, which adds to its reliability and credibility. The data provides clear values, contains no missing data, and spans a decade of observations of a range of variables that may be related to crime rate in the US. A wide variety of states are assessed, although not named, along with factors including wages, the ratio of males to females, the number of families of a lower socio-economic background (earning below half wage), and the employment and labour force of each state. These metrics allow crime rates across the US to be measured and estimated, which supports the validity of the dataset, as it measures what it is intended to measure.
The data is both benefited and limited by its large number of variables, which can make the dataset harder to interpret and increase the likelihood of observer bias. The large number of variables can also ‘drown out’ outliers, overshadowing the important correlations that define crime rate statistics. The data does not state the decade in which it was collected, nor the currency of the median wage (presumably USD). Assumptions must therefore be made about whether the crime rates represent only tried and/or convicted criminals, about state populations, which state each observation corresponds to, the number of false arrests, and the severity of the crimes. Furthermore, the data is based in the US, so its application outside the country may be limited by social and political differences.
#Number of Observations and Variables
str(data)
## 'data.frame': 47 obs. of 27 variables:
## $ ï..CrimeRate : num 45.5 52.3 56.6 60.3 64.2 67.6 70.5 73.2 75 78.1 ...
## $ Youth : int 135 140 157 139 126 128 130 143 141 133 ...
## $ Southern : int 0 0 1 1 0 0 0 0 0 0 ...
## $ Education : num 12.4 10.9 11.2 11.9 12.2 13.5 14.1 12.9 12.9 11.4 ...
## $ ExpenditureYear0 : int 69 55 47 46 106 67 63 66 56 51 ...
## $ LabourForce : int 540 535 512 480 599 624 641 537 523 599 ...
## $ Males : int 965 1045 962 968 989 972 984 977 968 1024 ...
## $ MoreMales : int 0 1 0 0 0 0 0 0 0 1 ...
## $ StateSize : int 6 6 22 19 40 28 14 10 4 7 ...
## $ YouthUnemployment : int 80 135 97 135 78 77 70 114 107 99 ...
## $ MatureUnemployment : int 22 40 34 53 25 25 21 35 37 27 ...
## $ HighYouthUnemploy : int 1 1 0 0 1 1 1 1 0 1 ...
## $ Wage : int 564 453 288 457 593 507 486 487 489 425 ...
## $ BelowWage : int 139 200 276 249 171 206 196 166 170 225 ...
## $ CrimeRate10 : num 26.5 35.9 37.1 42.7 46.7 47.9 50.6 55.9 61.8 65.4 ...
## $ Youth10 : int 135 135 153 139 125 128 153 143 153 134 ...
## $ Education10 : num 12.5 10.9 11 11.8 12.2 13.8 14.1 13 12.9 11.2 ...
## $ ExpenditureYear10 : int 71 54 44 41 97 60 57 63 54 47 ...
## $ LabourForce10 : int 564 540 529 497 602 621 641 549 538 600 ...
## $ Males10 : int 974 1039 959 983 989 983 993 973 968 1024 ...
## $ MoreMales10 : int 0 1 0 0 0 0 0 0 0 1 ...
## $ StateSize10 : int 6 7 24 20 42 28 14 11 5 7 ...
## $ YouthUnemploy10 : int 82 138 98 131 79 81 71 119 110 97 ...
## $ MatureUnemploy10 : int 20 39 33 50 24 24 23 36 36 28 ...
## $ HighYouthUnemploy10: int 1 1 0 0 1 1 1 1 1 1 ...
## $ Wage10 : int 632 521 359 510 660 571 556 561 550 499 ...
## $ BelowWage10 : int 142 210 256 235 162 199 176 168 126 215 ...
#Size of data
dim(data)
## [1] 47 27
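The str() output above lists the first column as ï..CrimeRate rather than CrimeRate, which typically indicates that the CSV file begins with a UTF-8 byte-order mark (BOM). Because $ on a data frame allows partial matching, data$CrimeRate may then silently resolve to CrimeRate10. The sketch below reproduces the effect on a small temporary file (the file and its values are illustrative, not the real dataset) and shows one possible remedy, assuming the file is UTF-8 with a BOM: passing fileEncoding = "UTF-8-BOM" to read.csv().

```r
# Illustrative only: build a tiny CSV that starts with a UTF-8 byte-order mark.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
csv_text <- "CrimeRate,CrimeRate10\n45.5,26.5\n52.3,35.9\n"
writeBin(c(bom, charToRaw(csv_text)), tmp)

# Telling read.csv about the BOM keeps the first column name clean.
fixed <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(fixed)

# Partial matching with `$`: if the first column name is mangled,
# "CrimeRate" is a unique prefix of "CrimeRate10" and is matched to it.
mangled <- fixed
names(mangled)[1] <- "x..CrimeRate"   # stand-in for the mangled name
partial <- mangled$CrimeRate          # returns the CrimeRate10 column
exact   <- mangled[["CrimeRate"]]     # exact matching returns NULL instead

unlink(tmp)
```

Checking names(data) straight after loading, and using [[ ]] with the full column name, are simple ways to make sure the year-0 and year-10 crime rates are not confused.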
There are a total of 27 variables in the dataset, of which three were investigated: crime rate (number of offences per million population), families earning below half wage (per 1000 families), and males (per 1000 females). Each of the 47 observations corresponds to one US state.
Crime rate is statistically associated with males committing more offences than females. According to law.jrank.org (https://law.jrank.org/pages/1250/Gender-Crime-Differences-between-male-female-offending-patterns.html), women have lower arrest rates and “constitute less than 20 percent of arrests for most crime categories” in the United States. The dataset used in our research records the sex ratio of each state under the variable “Males”, defined as males per 1000 females. If the trends described by law.jrank.org and various government sites are correct, we should see higher crime rates in states with more males per 1000 females.
Comparing the overall crime rate with the Males variable, we notice that crime rates rise across the range of 950 to 1000 males per 1000 females (states with fewer males than females). This is consistent with male crime occurring more often than female crime; otherwise we would expect the crime rate to be unrelated to the sex ratio of a state.
The boxplots show that the two variables sit on very different scales: the mean of Males (983 per 1000 females) exceeds the mean crime rate (102.07 offences per million population) by about 880, and the median of Males (977) is over 9 times the median crime rate (103.5). Because Males records the sex ratio of the population rather than the number of offences committed by males, these differences alone do not show that the majority of crime is male-oriented. The Males data notably has outliers above the upper fence, which have been removed to improve accuracy; however, due to our limited knowledge of R we were unable to remove the lower-quartile outlier, which may have swayed the results.
#Plotting the Scatter plot
Male = data$Males
Crime_Rate = data$CrimeRate
#Outlier For Crime Rate
hist(data$CrimeRate, xlab= "Crime Rate", main = "Crime Rate by state", col="brown")
boxplot(data$CrimeRate, xlab="Crime Rate", main = "Crime Rate by state", horizontal = TRUE)
IQR(data$CrimeRate)
## [1] 53.9
summary(data$CrimeRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.50 76.35 103.50 102.07 130.25 178.20
#Outlier For Males
summary(data$Males)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 934.0 964.5 977.0 983.0 992.0 1071.0
boxplot(data$Males, xlab="Males(per 1000 females)", main = "Males", horizontal = TRUE)
IQR(data$Males)
## [1] 27.5
hist(data$Males, xlab= "Males(per 1000 females)", main = "Males", col="brown")
#Data Cleaning
#Upper fence = Q3 + 1.5*IQR = 992.0 + 1.5*27.5 = 1033.25
#Lower fence = Q1 - 1.5*IQR = 964.5 - 1.5*27.5 = 923.25
#Keep only the states below the upper fence
Male_clean = data$Males[data$Males < 1033.25]
Male_clean2 = as.numeric(Male_clean)
Crime_clean = data$CrimeRate[data$Males < 1033.25]
plot(Male_clean2, Crime_clean, xlab = "Males per 1000 females", ylab = "Crime Rate(number of offences per million population)", main = "Males vs Crime rate", pch = 20)
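The fences used in the cleaning step were typed in by hand from the summary() and IQR() output. As a minimal sketch, they could instead be computed from the data, which also makes it easy to apply the lower fence. The helper name iqr_fences and the example vector are hypothetical and not part of the original analysis.

```r
# Return the 1.5 * IQR fences for a numeric vector.
# quantile() uses the same default method as summary() and IQR().
iqr_fences <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  list(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Hypothetical values, not taken from the dataset.
x <- c(10, 12, 14, 16, 18, 100)
fences <- iqr_fences(x)

# Logical index of the observations that lie inside both fences.
keep <- x >= fences$lower & x <= fences$upper
x_clean <- x[keep]
```

The same logical vector keep can be used to subset the crime rate values, so that each remaining Males value stays paired with the crime rate of the same state.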
How are Males related to Crime Rate?
#Regression and Scatter plot
plot(Male_clean2, Crime_clean, xlab = "Males per 1000 females", ylab = "Crime Rate(number of offences per million population)", main = "Males vs Crime Rate", pch = 20)
line = lm(Crime_clean ~ Male_clean2)
line$coefficients
## (Intercept) Male_clean2
## -80.2190518 0.1857596
abline(line, col = "red")
#Residuals
plot(Male_clean2, line$residuals, xlab = "Males per 1000 Females", ylab = "Residuals", pch = 16, col = "red")
abline(h = 0)
A linear relationship is difficult to assume; however, the residual plot shows roughly homoscedastic scatter about zero, which suggests that a linear trendline is appropriate. A correlation coefficient of +0.2 nevertheless indicates only a weak trend. This is likely due to the spread of the data, which contains various outliers and makes the fit less accurate. We do, however, notice a growth in offences per million people as the ratio of males to females increases. This corroborates the trends reported by the Statista Research Department, in association with the US government, in 2021.
With a weak correlation coefficient of +0.2 between the two variables, it is unlikely that this trendline can accurately predict crime rate from the males per 1000 females of a state alone. The regression line does show a trend, which agrees with various sources on the greater number of offences committed by males compared to females, but it does not provide conclusive evidence that an increase in males per 1000 females is likely to bring a higher rate of offences and crime.
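To make the weakness of the fit concrete, the reported figures can be plugged in directly. The sketch below uses only the coefficients and the correlation quoted above; the choice of 977 males per 1000 females (the median reported earlier) as the input is ours, for illustration.

```r
# Values quoted in the text and output above.
intercept <- -80.2190518
slope     <- 0.1857596
r         <- 0.2            # approximate correlation quoted in the text

# Share of the variation in crime rate explained by a simple linear fit.
r_squared <- r^2

# Fitted crime rate at the median of Males (977 per 1000 females).
predicted_at_median <- intercept + slope * 977

# Change in fitted crime rate for 10 extra males per 1000 females.
change_per_10 <- slope * 10
```

With r around 0.2, r_squared is about 0.04, so the line accounts for roughly 4% of the variation in crime rate, which is why it is a poor predictor on its own.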
Socio-economic status is a common talking point regarding the likelihood of crime. Many studies have drawn correlations indicating that people born into families with low wages are more likely to commit crimes. We have used the number of families below half wage (assumed to be half the national average wage) and observed its correlation with crime rate, defined as the number of offences per one million residents.
#Plotting the histogram with the Median
NoFamWage10 = data$BelowWage10
CrimeRate = data$CrimeRate
hist(NoFamWage10, xlab = "Families below Half Wage per 1000", col = "pink", main="Frequency of Families below Half Wage per 1K over 10yrs", ylim = c(0,15))
abline(v = median(NoFamWage10), col= "green")
#Boxplot
boxplot(data$BelowWage10, xlab = "Families below Half Wage per 1K", main = "Amount of Families under Half Wage per 1K over 10yrs", horizontal = TRUE)
The histogram shows the frequency of families below half the average wage across the US states. The most common bin is 160 to 180 families per 1000, with the highest frequency of states at 12.5%. Observing the boxplot, this is why the data is more spread out between the median and the third quartile than between the first quartile and the median.
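The modal bin and its share of states can be read from the histogram object rather than from the plot. The following is a minimal sketch on hypothetical values, assuming 20-wide bins to match the 160 to 180 bin described above.

```r
# Hypothetical values standing in for families below half wage per 1000.
x <- c(126, 142, 162, 165, 168, 171, 176, 199, 210, 215, 235, 256)

# Bin without drawing, using 20-wide bins.
h <- hist(x, breaks = seq(120, 260, by = 20), plot = FALSE)

# Share of observations in each bin and the bin with the largest share.
share <- h$counts / sum(h$counts)
modal <- which.max(h$counts)
modal_bin <- c(h$breaks[modal], h$breaks[modal + 1])
modal_share <- share[modal]
```

Multiplying modal_share by 100 gives the percentage of states in the most common bin.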
How are Families that are below half US average wage related to Crime Rate?
# Scatter Plot and Regression
plot(NoFamWage10, CrimeRate,xlab = "Families Below Half Wage(per 1000)", ylab = "Offences per 1 Million People", main = "Crime Rate vs Families Below Half Wage ", ylim = c(0, 200))
cor(NoFamWage10, CrimeRate)
## [1] -0.06596506
line = lm(CrimeRate ~ NoFamWage10)
line$coefficients
## (Intercept) NoFamWage10
## 114.97006421 -0.06685335
abline(line, col="orange")
The linear regression line shows a slight negative Pearson’s correlation coefficient of -0.07 between offences per million people and the number of families under half wage. This is surprising, as the preconception was that the lower the socio-economic status of families, the higher their contribution to crime rate. A multitude of factors could explain this, including that poorer states have a lower population than richer states.
New York, for example, has a population of over 20 million, yet has a lower crime rate per person than Louisiana, which has a population of around 5 million. This may result in poorer families in richer states being less likely to commit crime, while poorer families in more rural states potentially contribute more to the crime rate but are overshadowed by population differences. The data shows that, across the 16 southern states observed, the mean number of families below half wage per 1000 families is 233, compared with 172 in the northern states. Together with the higher crime rates reported for southern states, this suggests that the crime rates and populations of northern states do not represent those of southern states. A correlation between families below half wage and crime rate is much more likely within the southern states, and further investigation would likely show clearer trends. The southern average of 94 offences per million population, compared with the northern average of 107.1, is not surprising considering the much larger northern population; however, the greater poverty in southern states leads to a higher density of crime relative to their northern counterparts.
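Regional averages such as those quoted above can be obtained in one step by grouping on the Southern indicator. This is a sketch on made-up rows: the column names follow the dataset, but the values are hypothetical and the choice of aggregate() is ours.

```r
# Hypothetical rows; column names follow the dataset, values are made up.
toy <- data.frame(
  Southern    = c(0, 0, 1, 1, 0, 1),
  BelowWage10 = c(142, 162, 256, 235, 176, 210),
  CrimeRate10 = c(110, 95, 90, 100, 120, 80)
)

# Mean of each variable within each region (0 = northern, 1 = southern).
region_means <- aggregate(cbind(BelowWage10, CrimeRate10) ~ Southern,
                          data = toy, FUN = mean)
region_means
```

Each row of region_means holds the averages for one region, so the northern and southern figures can be compared side by side.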
#Histogram
CrimeRateSouthern = data$CrimeRate10[data$Southern == 1]
CrimeRateNorthern = data$CrimeRate10[data$Southern == 0]
#Histogram For Crime Rate in Southern States with the Median
hist(CrimeRateSouthern, freq = FALSE, xlim = c(0, 200), ylim = c(0, 0.02))
abline(v = median(CrimeRateSouthern))
#Histogram For Crime Rate in Northern States with the Median
hist(CrimeRateNorthern, freq = FALSE, xlim = c(0, 200), ylim = c(0, 0.02))
abline(v = median(CrimeRateNorthern))
#Residual
plot(NoFamWage10, line$residuals, xlab = "Families below Half Wage Per 1K", ylab = "Residuals", pch = 16, col ="red")
abline(h = 0)
Data from World Population Review indicates that southern states have a higher crime rate than their northern counterparts but are overshadowed by population differences. This agrees with our analysis that families below half wage in poorer states are more likely to commit crime, whereas families below half the national wage in richer states (with larger populations) are less likely to do so. Therefore, using the national crime rate alone, it is not possible to determine the likelihood of crime from the number of families under half wage; the analysis instead requires the northern and southern states to be considered separately.
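The proposed division into northern and southern states could be explored by computing the correlation and the regression slope within each region. The sketch below uses made-up numbers purely to show the mechanics; it does not reproduce any result from the dataset.

```r
# Hypothetical data: column names follow the dataset, values are made up.
toy <- data.frame(
  Southern    = c(0, 0, 0, 0, 1, 1, 1, 1),
  BelowWage10 = c(140, 160, 180, 200, 200, 220, 240, 260),
  CrimeRate10 = c(120, 110, 105, 95, 70, 85, 90, 110)
)

# Pearson correlation within each region (0 = northern, 1 = southern).
by_region <- split(toy, toy$Southern)
region_cor <- sapply(by_region, function(d) cor(d$BelowWage10, d$CrimeRate10))

# Separate regression lines for each region.
region_fit <- lapply(by_region, function(d) lm(CrimeRate10 ~ BelowWage10, data = d))
region_slope <- sapply(region_fit, function(f) unname(coef(f)[2]))
```

In this toy example the two regions have slopes of opposite sign, which illustrates how a pooled correlation near zero can hide different within-region relationships.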
The data does not state when it was observed. Gramlich (2022) reports, using Bureau of Justice Statistics survey data, that the overall violent crime rate in the US fell by 74% and the property crime rate fell by 71% from 1993 to 2019. Comparing the initial crime rates with the crime rates after 10 years in our dataset, there was no significant overall change in either the mean or the median. The initial median and mean were 103 and 102.80 offences per million population, whereas the final median and mean were 103.5 and 102.07 offences per million population. This suggests that the data is not from the same time period as the national figures.
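One way to check the claim of no significant change is a paired comparison of the two years. The sketch below is run on hypothetical values for five states and assumes that the year-0 column is available under the exact name CrimeRate; the paired t-test is our choice of method, not one used in the original analysis.

```r
# Hypothetical crime rates for the same five states at year 0 and year 10.
toy <- data.frame(
  CrimeRate   = c(45.5, 52.3, 56.6, 60.3, 64.2),
  CrimeRate10 = c(46.0, 51.0, 58.0, 59.5, 65.0)
)

# [[ ]] matches column names exactly, so the two years cannot be confused.
year0  <- toy[["CrimeRate"]]
year10 <- toy[["CrimeRate10"]]

# Compare location summaries, then test the mean paired difference.
location <- rbind(year0  = c(mean = mean(year0),  median = median(year0)),
                  year10 = c(mean = mean(year10), median = median(year10)))
paired <- t.test(year10, year0, paired = TRUE)
paired$p.value
```

A large p-value here would be consistent with no systematic change between the two years, although with only 47 states the test has limited power.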
The US Department of Justice concluded that people living under the Federal Poverty Level (FPL) were more likely to report violent crimes than people above the FPL (Harrel et al., 2022). Gramlich (2022) likewise notes that in 2019 only 40.9% of violent crimes and 32.5% of property crimes were reported to the police. Taking this into consideration, the crime rate in the dataset might not be the total crime rate, and the share of crimes that are reported is likely to vary between states.
Gramlich, J. (2022). What the data says (and doesn’t say) about crime in the United States. Pew Research Center. Retrieved 11 April 2022, from https://www.pewresearch.org/fact-tank/2020/11/20/facts-about-crime-in-the-u-s/.
InfoPlease. (2022). State Population by Rank. Retrieved 5 April 2022, from https://www.infoplease.com/us/states/state-population-by-rank.
Sheffield, U. (2021). Datasets for teaching - Statistics - MASH - The University of Sheffield. Retrieved 5 April 2022, from https://www.sheffield.ac.uk/mash/statistics/datasets.
Worldpopulationreview.com. (2022). Crime Rate by State 2022. Retrieved 9 April 2022, from https://worldpopulationreview.com/state-rankings/crime-rate-by-state.
Statista. (2022). Crime rate by state U.S. 2020. Retrieved 11 April 2022, from https://www.statista.com/statistics/301549/us-crimes-committed-state/.
Harrel, E., Langton, L., Berzofsky, M., Couzens, L., & Smiley-McDonald, H. (2022). Household Poverty and Nonfatal Violent Victimization, 2008–2012. Bjs.ojp.gov. Retrieved 11 April 2022, from https://bjs.ojp.gov/content/pub/pdf/hpnvv0812.pdf.