Crime Data Analysis Presentation

Xinyi Zhu

2023-02-26

Abstract

Introduction:

Across the US, the number of recorded murders rose by more than 25% in 2020, a steep rise from previous years. The overall crime rate in NYC was up around 11% in October 2021 compared with the same month of the previous year. In addition, both violent and non-violent crime impose significant costs on Americans. Moreover, the costs borne by the US government and official organizations for this level of criminal activity are significant as well.

Situations and Problem

According to newly updated FBI data, anti-Asian hate crimes rose by more than 70% in 2020. Shocked and concerned by this, we chose the Crime Rate dataset to study the socioeconomic factors that affect violent and non-violent crime rates within US communities. In this way we can understand how various factors affect crime rates and which ones are most significantly correlated with violent and non-violent criminal behavior.

Primary objective

We surveyed data sources and papers to form a complete picture of the topic of crime rates and to understand exploratory data analysis techniques.

Dataset

Data Description:

* Population
* Ethnicity
* Age
* Economy
* Social and Education
* Crime Related Factors

Question & Hypothesis

The goal of our analysis of this crime dataset is to determine how the significant factors differ between violent and non-violent crimes.

Technical Approach - Method 1

Linear regression

The goal of linear regression is to model the relationship between the explanatory and dependent variables by fitting a linear equation to the observed data. We can use this regression model to predict Y for a given X. The function we will use to implement linear regression is lm(). Its usage is as follows:

The main arguments that the lm() function takes are formula and data. The data is typically a data frame, for example one read from a CSV file, and the formula is an object of the formula class. For example, the R default dataset “cars” includes two variables, speed and dist (stopping distance). We can set the formula to “dist ~ speed”. The code is as follows:

Carsmod = lm(dist ~ speed, data = cars)

The output includes two coefficients: the intercept and the coefficient for speed. The linear regression model fitted to the input data can be written as follows:

dist = Intercept_coefficient + speed_coefficient * speed
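As a minimal sketch of the cars example, the following R code fits the model, reads off the two coefficients, and predicts a stopping distance. The object names fit and new_speed and the chosen speed of 15 mph are illustrative assumptions, not part of the original analysis.

```r
# Fit a simple linear regression on R's built-in cars dataset.
fit <- lm(dist ~ speed, data = cars)

# coef() returns a named vector with entries "(Intercept)" and "speed".
b <- coef(fit)
print(b)

# Predict the stopping distance for a hypothetical speed of 15 mph.
new_speed <- data.frame(speed = 15)  # illustrative value
pred <- predict(fit, newdata = new_speed)

# The prediction is the same as intercept + slope * speed.
manual <- b[["(Intercept)"]] + b[["speed"]] * 15
print(c(predicted = unname(pred), manual = manual))
```

Calling summary(fit) on the same object prints the full coefficient table, including the significance stars used in the factor lists.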

Violent Crime: Factors with statistical significance (***)

* RacePctWhite - percentage of population that is Caucasian
* PctUnemployed - percentage of people 16 and over, in the labor force, and unemployed
* NumImmig - total number of people known to be foreign born
* NumStreet - number of homeless people counted in the street
* LemasSwFTFieldOps - number of sworn full time police officers in field operations

Non-Violent Crime: Factors with statistical significance (***)

* TotalPctDiv - percentage of population who are divorced

From the linear regression analysis above, we can conclude that different factors have a significant influence on violent and non-violent crime.

Technical Approach - Method 2

K-means

K-means is an unsupervised machine learning algorithm used to cluster data. The objective of the K-means algorithm is to minimize the sum of squared distances between each data point and its cluster centroid. The function we will use to implement K-means is kmeans(). Its usage is as follows:

The main arguments that the kmeans() function takes are x and centers. The x argument is a numeric matrix of data or an object that can be coerced to such a matrix, and centers is the number of clusters (K). Choosing a good “K” is a very interesting topic; usually, we use an elbow plot to find the turning point and use that value as K. The output includes the centers of the clusters and a clustering vector that gives the cluster assigned to each data point.
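A minimal sketch of a kmeans() call is shown here. It assumes R's built-in faithful dataset as stand-in data and K = 2. The seed, the scaling step, and nstart = 25 are illustrative choices rather than settings taken from the original analysis.

```r
# Stand-in data: the built-in faithful dataset (eruption length, waiting time).
x <- scale(faithful)  # standardize so both columns contribute equally

set.seed(42)          # K-means starts from random centers; fix the seed
km <- kmeans(x, centers = 2, nstart = 25)

# Cluster centers: one row per cluster, one column per variable.
print(km$centers)

# Clustering vector: the cluster label assigned to each data point.
print(head(km$cluster))

# Total within-cluster sum of squares, the quantity K-means minimizes.
print(km$tot.withinss)
```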

Analysis - 1

Linear regression model

Check whether the assumptions for a linear regression model hold:

Linearity: The relationship between X and the mean of Y is linear. (y vs. x scatter plot)

Homoscedasticity: The variance of the residuals is the same for any value of X. (residuals vs. x plot) The residual plot does not show a specific pattern, so the linearity and constant-variance assumptions appear to hold.

Independence: Observations are independent of each other. In addition, the explanatory variables should not be strongly correlated with one another, which we check with the VIF test.

The VIF test shows that some of the variables are highly collinear (VIF > 5). However, we consider these variables to be good explanatory variables and decided to keep them in the model. In addition, the signs of their coefficients align with our expectations.

Normality: For any fixed value of X, Y is normally distributed. (QQ plot)

The QQ plot shows that the residuals are approximately normal, although with heavy tails. Overall, we consider the residuals to be normal.
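The assumption checks can be sketched in base R as follows. This is an illustration only: it uses the built-in mtcars dataset and a hypothetical model (mpg on wt, hp, and disp) instead of the crime data, and it computes the VIF by hand from the inverse of the predictors' correlation matrix rather than with a dedicated package.

```r
# Stand-in model on built-in data; NOT the crime model from the slides.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- resid(fit)

# Draw the diagnostic plots only in an interactive session.
if (interactive()) {
  # Linearity / homoscedasticity: residuals should scatter evenly around 0.
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Normality: points should follow the reference line in the QQ plot.
  qqnorm(res)
  qqline(res)
}

# Multicollinearity: the VIF of predictor j is the j-th diagonal element
# of the inverse correlation matrix of the predictors.
X <- model.matrix(fit)[, -1]  # drop the intercept column
vif <- setNames(diag(solve(cor(X))), colnames(X))
print(round(vif, 2))

# Predictors above the common rule-of-thumb threshold of 5.
print(names(vif)[vif > 5])
```

A VIF of 1 means a predictor is uncorrelated with the others, and larger values indicate stronger collinearity.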

Assess the performance of the model:

* p-values of the coefficients of all independent variables
* R-squared value

The summary table shows that, except for the PctNotHSGrad variable, all variables are significant at the 10% significance level. The adjusted R-squared is 0.6799, which means the explanatory variables explain 67.99% of the variance in the response variable. Overall, the model performance is satisfactory.
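The two performance measures can be read from the summary object, as in the sketch below. It again assumes the built-in mtcars dataset and a hypothetical three-variable model as stand-ins, so the numbers it prints are unrelated to the 0.6799 reported for the crime model.

```r
# Stand-in model on built-in data; NOT the crime model from the slides.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value, and p-value.
print(s$coefficients)

# p-values of the explanatory variables (the intercept row is dropped).
pvals <- s$coefficients[-1, "Pr(>|t|)"]

# Variables that are, and are not, significant at the 10% level.
significant <- names(pvals)[pvals < 0.10]
not_significant <- names(pvals)[pvals >= 0.10]
print(significant)
print(not_significant)

# Adjusted R-squared: share of variance explained, penalized for model size.
adj_r2 <- s$adj.r.squared
print(adj_r2)
```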

Analysis - 2

K-means clustering algorithm

Elbow method: This method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick k at the point where the SSE starts to flatten out and form an elbow. We’ll use the geyser dataset, evaluate the SSE for different values of k, and see where the curve forms an elbow and flattens out.
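The elbow method can be sketched as follows. The code assumes that the geyser dataset refers to R's built-in faithful data (Old Faithful eruptions). The range k = 1 to 10, the seed, and nstart = 25 are illustrative choices.

```r
# Assumption: "geyser dataset" is taken to be the built-in faithful data.
x <- scale(faithful)

set.seed(42)
ks <- 1:10

# SSE for each k: the total within-cluster sum of squares from kmeans().
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
print(data.frame(k = ks, SSE = round(wss, 2)))

# Elbow plot (drawn only in an interactive session).
if (interactive()) {
  plot(ks, wss, type = "b",
       xlab = "Number of clusters k", ylab = "SSE (total within-cluster SS)")
}
```

The SSE keeps shrinking as k grows, so the elbow is read as the value of k after which the decrease becomes small.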

Silhouette analysis: Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:

* Compute the average distance to all other data points in the same cluster (ai).
* Compute the average distance to all data points in the closest neighboring cluster (bi).
* Compute the coefficient: si = (bi - ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1]:

* If it is 0, the sample is very close to the neighboring clusters.
* If it is 1, the sample is far away from the neighboring clusters.
* If it is -1, the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible, close to 1, for a good clustering.
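The silhouette steps can be written out directly in base R, as in this sketch. It assumes the built-in faithful dataset and K = 2 as stand-ins. The helper silhouette_width() is a hypothetical name introduced for illustration, and returning 0 for a point that is alone in its cluster is a common convention assumed here.

```r
# Stand-in data and clustering; the settings are illustrative.
x <- scale(faithful)
set.seed(42)
cl <- kmeans(x, centers = 2, nstart = 25)$cluster

d <- as.matrix(dist(x))  # pairwise Euclidean distances

# Hypothetical helper: silhouette coefficient of sample i.
silhouette_width <- function(i) {
  same <- cl == cl[i]
  same[i] <- FALSE                 # exclude the sample itself
  if (!any(same)) return(0)        # singleton cluster (assumed convention)
  a <- mean(d[i, same])            # mean distance within its own cluster
  others <- setdiff(unique(cl), cl[i])
  # mean distance to each other cluster; keep the closest one
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

s <- sapply(seq_len(nrow(x)), silhouette_width)

# Average silhouette width: closer to 1 means better-separated clusters.
print(mean(s))
```

Repeating this for several values of K and comparing the average widths is one way to choose K alongside the elbow plot.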

Conclusion

From the linear regression and other statistical analyses above, we can draw the following conclusions:

* Significant factors for the violent crime rate include:
  * Negative correlation: ethnicity (percentage of Caucasian population), total number of immigrants, and police field operations.
  * Positive correlation: unemployment and the total number of homeless people in the street.
* The significant factor for the non-violent crime rate is the percentage of divorced people, with a positive correlation.
* Non-violent crime is expected to have a higher volume than violent crime across US communities.
* A lack of police officers also clearly affects the crime rate in communities with large populations.

Solution

* Improve social security policies and systems to reduce the number of homeless people.
* Help people returning from prison or people with lower education levels find secure living-wage employment.
* Use public surveillance cameras to support the investigation of high-profile crimes.
* Have the government invest in education and after-school programs to increase the share of people with higher levels of education.
* Impose strict punishment for racial hatred and racist behavior.
* Step up law enforcement efforts on violent offenders, stem the trafficking of illegal guns, and make real investments in communities to intervene in and prevent gun violence.

Reference

Rivera, J. (2019, August 19). Non-Violent vs. Violent Crimes. LegalMatch Law Library. Retrieved February 23, 2022, from https://www.legalmatch.com/law-library/article/non-violent-vs-violent-crimes.html

Hate Crime Recorded by Law Enforcement, 2010–2019. (2021, October). Bureau of Justice Statistics. Retrieved February 23, 2022, from https://bjs.ojp.gov/library/publications/hate-crime-recorded-law-enforcement-2010-2019