This report visualises crime statistics for Houston, USA, from January to May 2019 in order to provide insights to interested parties.
The purpose of this document is to give readers useful insights into crime-related information so that they can take further action. The document first imports and processes the data into a usable format, then explores and analyses the dataset along its temporal and spatial dimensions.
In this report, we use Monthly Crime Data By Street And Police Beat for Houston, provided by the Houston Police Department. We downloaded the data for January to May 2019 in Excel format and imported them into R for analysis.
The following data import steps are performed:
* Imported the source Excel files for Jan 2019 to May 2019 into R
* Removed blank columns that were created during the import
* Combined the five months of data into a single data frame
We have to perform certain data pre-processing tasks before the exploratory analysis. The purpose of pre-processing is to ensure that the data are stored properly and in the desired format.
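The last two import steps can be sketched as follows. This is a minimal illustration in Python (the report's own analysis was done in R), and it assumes each monthly file has already been read into a list of row dictionaries. The column `Extra`, the sample values and the helper names are all hypothetical.

```python
# Illustrative sketch: each monthly table is assumed to be already read into a
# list of dicts (one dict per row). All column names and values here are made up.
jan = [
    {"Occurrence Date": "2019-01-03", "Beat": "1A30", "Extra": None},
    {"Occurrence Date": "2019-01-09", "Beat": "2A50", "Extra": ""},
]
feb = [
    {"Occurrence Date": "2019-02-14", "Beat": "1A20", "Extra": None},
]


def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def drop_blank_columns(rows):
    """Remove columns whose value is blank in every row of the table."""
    columns = list(rows[0].keys()) if rows else []
    keep = [c for c in columns if not all(is_blank(r.get(c)) for r in rows)]
    return [{c: r.get(c) for c in keep} for r in rows]


def combine_months(monthly_tables):
    """Clean each monthly table, then stack the rows into one dataset."""
    combined = []
    for rows in monthly_tables:
        combined.extend(drop_blank_columns(rows))
    return combined


crimes = combine_months([jan, feb])
```

Dropping blank columns per month, as sketched here, would also drop a genuine column that happens to be empty for a whole month, so it may be safer to combine first and drop afterwards if that can occur.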
The following steps have been performed:
1. Transformed the values of each column to the most suitable type. For example, the variable “Occurrence Date” was initially stored as character, so it was converted to ‘date’ format for further analysis. The other variables were converted to factors.
2. Added two new columns, weekday and month week, which map the Occurrence Date to the day of the week (Mon to Sun) and the week of the month (1 to 5).
3. Reduced the number of levels of NIBRS Description from 59 to 24 and stored the result in a new factor variable, ‘offense’.
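The three steps above can be sketched as follows, in Python for illustration (the report itself uses R). Several details are assumptions and not taken from the source: the date format `%Y-%m-%d`, the definition of month week as days 1-7 for week 1 up to days 29-31 for week 5, and the entries of `OFFENSE_GROUPS`, which stand in for the report's actual 59-to-24 mapping.

```python
from datetime import date, datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical grouping of detailed NIBRS descriptions into broader offenses.
# The report's real mapping is not shown, so these entries are made up.
OFFENSE_GROUPS = {
    "Simple assault": "Assault",
    "Aggravated assault": "Assault",
    "Shoplifting": "Theft",
    "Theft from motor vehicle": "Theft",
}


def parse_occurrence_date(text, fmt="%Y-%m-%d"):
    """Convert a character date into a date object (format is an assumption)."""
    return datetime.strptime(text, fmt).date()


def weekday_label(d):
    """Map a date to Mon..Sun; date.weekday() returns 0 for Monday."""
    return WEEKDAYS[d.weekday()]


def month_week(d):
    """Assumed definition: days 1-7 are week 1, ..., days 29-31 are week 5."""
    return (d.day - 1) // 7 + 1


def offense_group(description):
    """Collapse a detailed description into a broader group, else 'Other'."""
    return OFFENSE_GROUPS.get(description, "Other")


d = parse_occurrence_date("2019-05-31")
print(weekday_label(d), month_week(d), offense_group("Shoplifting"))
```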
| Occurrence Date | Occurrence Hour | NIBRS Description | Offense Count | Beat | Premise | Block Range | Street Name | Street Type | Suffix | ZIP | Total Obs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 129 | 0 | 387 | 0 | 8545 | 87237 | 59212 | 101642 |
The table above gives the number of missing values (NAs) in each column, together with the total number of observations.
Missing value treatment:
* The majority of NAs come from Suffix, ZIP and Street Type. A small portion of Beat and Block Range values are also missing.
    + Suffix is only recorded for streets identified by a number rather than a name, so blank values in this variable are meaningful and should not be removed or replaced. No further processing is needed.
    + Most of the unknown values in ZIP arise because this variable did not exist before April 2019; they account for more than half of the observations (59,212 of 101,642). Although ZIP code is useful in spatial analysis, we prefer not to handle the missing values in this variable, as there are too many of them and we cannot ensure the accuracy of any matching or imputation. We should not remove these records either, as doing so would greatly affect the integrity of the data. Instead, we will use beat for the spatial analysis.
    + Street Type should be read together with Street Name. We will not remove or replace the missing values in Street Type, as this variable is not crucial for spatial analysis; Street Name can be used to identify the location.
    + Beat has 129 missing values in total, so we cannot determine the beats where these offenses happened. We do not want to remove these offenses just because the beat is missing, as they may contain important information, such as rarely occurring offenses like homicides. We will apply special treatment to these records only when we reach the analyses that use beat.
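The approach described above, counting missing values per column and setting aside records with a missing beat only for beat-based analysis, could look roughly like this. It is a Python sketch for illustration (the report uses R); the sample rows and the helper names `missing_counts` and `with_known_beat` are hypothetical.

```python
def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def missing_counts(rows, columns):
    """Count missing values in each of the given columns."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if is_missing(row.get(c)):
                counts[c] += 1
    return counts


def with_known_beat(rows):
    """Keep only rows with a beat; use this just for beat-based analysis."""
    return [r for r in rows if not is_missing(r.get("Beat"))]


# Made-up sample rows, not taken from the Houston data.
sample = [
    {"Beat": "1A30", "ZIP": None, "Suffix": None},
    {"Beat": None, "ZIP": "77002", "Suffix": None},
    {"Beat": "2A50", "ZIP": None, "Suffix": "W"},
]

print(missing_counts(sample, ["Beat", "ZIP", "Suffix"]))
print(len(sample), len(with_known_beat(sample)))
```

Keeping the full dataset and filtering only at the point of use means the temporal analysis still sees every offense, including those without a beat.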
Among the 24 types of offenses, Theft, Assault and Destruction were the most common (excluding Other) from January to May 2019.
We noted an increasing trend in the total number of offenses, despite a drop in February.
**We dive into the offense level. Most offenses show no clear common temporal pattern.**
**Most offenses occurred from Tuesday to Saturday. Monday and Sunday had fewer offenses than the other days.**
We then look at the offense level. No common pattern can be seen, but in general there are fewer offenses on weekends, except for Alcohol, Arson, Assault, Drive, Robbery and Weapons.
No significant differences can be seen in the number of offenses across the weeks of the month, except for week 5. The large difference between week 5 and the other weeks can be attributed to the fact that week 5 is always an incomplete week.
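To make the week 5 effect concrete, the sketch below counts how many calendar days fall into each week of the month over January to May 2019. It is written in Python for illustration (the report uses R) and assumes that week of the month is defined as days 1-7 for week 1 up to days 29-31 for week 5; the report does not state its exact definition.

```python
import calendar


def days_per_month_week(year, months):
    """Count calendar days falling in each week of the month (1 to 5)."""
    totals = {week: 0 for week in range(1, 6)}
    for month in months:
        n_days = calendar.monthrange(year, month)[1]  # number of days in month
        for day in range(1, n_days + 1):
            totals[(day - 1) // 7 + 1] += 1
    return totals


days = days_per_month_week(2019, range(1, 6))
print(days)  # {1: 35, 2: 35, 3: 35, 4: 35, 5: 11}
```

Under this assumption, week 5 covers only 11 days against 35 for each of the other weeks, so dividing each week's offense count by its number of days would give a fairer comparison.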
In terms of the distribution of offenses across the day, most offenses occurred in the afternoon, with 12 p.m., 5 p.m. and 6 p.m. as the peak hours. Fewer offenses occurred in the early morning, from 1 a.m. to 8 a.m.
**At the offense level, most offense types show a similar pattern, in which offenses are concentrated in the second half of the day. The exceptions are Drive, where most offenses occurred from 12 a.m. to 2 a.m., and Weapons, where most occurred from 11 p.m. to 1 a.m.**
**We combine weekday and time of day for a more holistic view. In general, there are fewer offenses in the early morning, especially on weekdays. At weekends, the number of offenses from 12 a.m. to 2 a.m. is higher than in the same period on weekdays. Most offenses occurred at 12 p.m. and 6 p.m., especially on weekdays, and Wednesday has the highest number of offenses in these two peak hours.**
At the offense level, we see a similar pattern: most offenses occurred in the latter half of the day.
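Combining weekday and hour amounts to a two-way tabulation of offense counts. The following Python sketch (for illustration; the report uses R) shows one way to build it. The field names `weekday` and `hour` and the sample records are hypothetical; `Offense Count` follows the column name in the dataset.

```python
from collections import Counter


def weekday_hour_counts(rows):
    """Sum offense counts for every (weekday, hour) combination."""
    counts = Counter()
    for row in rows:
        counts[(row["weekday"], row["hour"])] += row.get("Offense Count", 1)
    return counts


# Made-up records, not taken from the Houston data.
records = [
    {"weekday": "Wed", "hour": 12, "Offense Count": 2},
    {"weekday": "Wed", "hour": 12, "Offense Count": 1},
    {"weekday": "Sat", "hour": 1, "Offense Count": 1},
]

counts = weekday_hour_counts(records)
peak = max(counts, key=counts.get)  # busiest (weekday, hour) cell
print(peak, counts[peak])
```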
#### b. Spatial Analysis

**There are 126 beats in Houston, with varying levels of offense occurrence. Beats 17E10 and 14D20 have relatively high numbers of offenses.**
The large number of beats obscures the details, so we focus on the 20 beats with the highest number of offenses. Crime levels vary across beats. For example, Beats 1A30, 2A50, 1A20, 5F30 and 10H30 see the highest numbers of thefts, while Beats HCC7, HCC4 and HCC3 have no thefts.
Most offenses occurred in Residence & Home. Highway, Road, Street, Alley and Parking Lot, Garage are also premise types that deserve more attention.
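Selecting the beats with the most offenses can be sketched as below, in Python for illustration (the report uses R). Records without a beat are skipped at this point only, in line with the missing value treatment described earlier. The helper name `top_beats` and the sample records are hypothetical.

```python
from collections import Counter


def top_beats(rows, n=20):
    """Return the n beats with the most offenses as (beat, total) pairs."""
    totals = Counter()
    for row in rows:
        beat = row.get("Beat")
        if beat:  # skip records whose beat is missing
            totals[beat] += row.get("Offense Count", 1)
    return totals.most_common(n)


# Made-up records, not taken from the Houston data.
records = [
    {"Beat": "1A30", "Offense Count": 2},
    {"Beat": "2A50", "Offense Count": 1},
    {"Beat": "1A30", "Offense Count": 1},
    {"Beat": None, "Offense Count": 5},
]

print(top_beats(records, n=1))
```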
**At the offense level, we want to understand the most common premise type for the common offenses, namely Theft, Assault, Destruction, Burglary, Drug, Fraud and Robbery.**
The most common premise type varies across offense types. Most thefts and robberies occurred in Parking Lot & Garage, while Residence & Home is the most common premise type for Assault, Destruction, Burglary and Fraud.
We want to know which streets are more dangerous in terms of the number of offenses. 4400-4499 NORTH has the highest number of offenses, followed by 5000-5099 WESTHEIMER.
From this exploratory analysis, we noted that the most common offenses are Theft, Assault, Other, Destruction and Burglary. There is an upward trend in the number of offenses in the first five months of 2019, except for a mild drop in February. No significant difference can be seen across the weeks of the month in terms of the number of offenses, except for the fifth week, which is usually incomplete. More offenses occurred on Wednesday, while Monday and Sunday have relatively lower counts. In general, offenses are more likely to occur in the second half of the day, with 12 p.m. and 6 p.m. as the peak hours.
There are huge gaps in crime levels among beats. Some beats have no major crimes, while others, such as 17E10, have a significantly larger number of offenses. Most offenses happened in Residence & Home. Highway, Road, Street, Alley and Parking Lot, Garage are also places that deserve attention. We identified 4400-4499 EAST FWY as a location where offenses occur frequently.
We should be aware that each offense type has its own variation in frequency across months, weeks, days and locations. The summary above generalises the overall observations and does not describe any specific offense.