This report covers the pre-module assignment 2 of the course Data Mining in R. The goal of this report is to perform an exploratory analysis of the dataset of crime between January and May 2019 provided by the Houston Police Department. This report was created on Sat Jul 20 13:04:14 2019.

Initial data import

First, I accessed the website containing the datasets with the crime statistics for Houston - https://www.houstontx.gov/police/cs/Monthly_Crime_Data_by_Street_and_Police_Beat.htm

Monthly statistics were available for the period from January 2019 to May 2019.

I downloaded the datasets in xlsx (Excel) format and converted them to csv format for ease of import.

The datasets contain the following fields: occurrence date and hour, crime description, crime count, beat, premise type, block range, street name and suffix. For April and May, the zip code where the crime occurred is also available.
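Because the April and May files carry an extra zip code column, the monthly files cannot be stacked directly. The following is a minimal sketch of one way to combine them in base R, not necessarily the approach used for this report. The helper `bind_fill`, the column name `ZipCode` and the file name in the comment are all hypothetical.

```r
# Stack data frames whose column sets differ, filling absent columns with NA.
# (bind_fill is a hypothetical helper written for this illustration.)
bind_fill <- function(frames) {
  all_cols <- unique(unlist(lapply(frames, names)))
  padded <- lapply(frames, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))
    }
    df[, all_cols, drop = FALSE]
  })
  combined <- do.call(rbind, padded)
  rownames(combined) <- NULL
  combined
}

# Toy stand-ins for two monthly files. In practice each would come from
# something like read.csv("jan19.csv", stringsAsFactors = FALSE),
# where the file name is an assumption.
jan <- data.frame(Beat = c("1A10", "17E10"), Count = c(1, 2),
                  stringsAsFactors = FALSE)
apr <- data.frame(Beat = "14D20", Count = 1, ZipCode = "77002",
                  stringsAsFactors = FALSE)

crime <- bind_fill(list(jan, apr))
print(crime)
```

Rows from months without a zip code end up with NA in that column, so later steps can still use the field where it exists.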

As part of the data preparation I performed a number of tasks:

Variable Analysis

##  [1] "Aggravated Assault"                       
##  [2] "All other larceny"                        
##  [3] "All other offenses"                       
##  [4] "Animal Cruelty"                           
##  [5] "Arson"                                    
##  [6] "Assisting or promoting prostitution"      
##  [7] "Bad checks"                               
##  [8] "Betting/wagering"                         
##  [9] "Bribery"                                  
## [10] "Burglary, Breaking and Entering"          
## [11] "Counterfeiting, forgery"                  
## [12] "Credit card, ATM fraud"                   
## [13] "Curfew, loitering, vagrancy violations"   
## [14] "Destruction, damage, vandalism"           
## [15] "Disorderly conduct"                       
## [16] "Driving under the influence"              
## [17] "Drug equipment violations"                
## [18] "Drug, narcotic violations"                
## [19] "Drunkenness"                              
## [20] "Embezzlement"                             
## [21] "Extortion, Blackmail"                     
## [22] "False pretenses, swindle"                 
## [23] "Family offenses, no violence"             
## [24] "Forcible fondling"                        
## [25] "Forcible rape"                            
## [26] "Forcible sodomy"                          
## [27] "From coin-operated machine or device"     
## [28] "Gambling equipment violations"            
## [29] "Hacking/Computer Invasion"                
## [30] "Human Trafficking/Commercial Sex Act"     
## [31] "Identify theft"                           
## [32] "Impersonation"                            
## [33] "Intimidation"                             
## [34] "Justifiable homicide"                     
## [35] "Kidnapping, abduction"                    
## [36] "Liquor law violations"                    
## [37] "Motor vehicle theft"                      
## [38] "Murder, non-negligent"                    
## [39] "Negligent manslaughter"                   
## [40] "Peeping tom"                              
## [41] "Pocket-picking"                           
## [42] "Pornographs, obscene material"            
## [43] "Promoting gambling"                       
## [44] "Prostitution"                             
## [45] "Purchasing prostitution"                  
## [46] "Purse-snatching"                          
## [47] "Robbery"                                  
## [48] "Runaway"                                  
## [49] "Shoplifting"                              
## [50] "Simple assault"                           
## [51] "Statutory rape"                           
## [52] "Stolen property offenses"                 
## [53] "Theft from building"                      
## [54] "Theft from motor vehicle"                 
## [55] "Theft of motor vehicle parts or accessory"
## [56] "Trespass of real property"                
## [57] "Weapon law violations"                    
## [58] "Welfare fraud"                            
## [59] "Wire fraud"

The table below shows the top 10 and bottom 10 rows of the dataset after the aforementioned data preparation and processing tasks. Some columns have been omitted. Overall, the dataset contains 98842 rows.

Date        nMonth  nWeekDay  Part_of_day  Hour  Description                                Beat   Count
2019-03-12  Mar     Tue       Aft-Ev       16    Purchasing prostitution                    19G10  10
2019-03-15  Mar     Fri       Aft-Ev       20    Intimidation                               7C30   10
2019-02-02  Feb     Sat       Night        00    Destruction, damage, vandalism             24C10  8
2019-02-13  Feb     Wed       Night        11    Purchasing prostitution                    13D10  8
2019-01-26  Jan     Sat       Night        21    Simple assault                             22B30  7
2019-03-09  Mar     Sat       Night        02    Theft from motor vehicle                   18F20  7
2019-05-25  May     Sat       Night        23    Aggravated Assault                         3B50   7
2019-01-04  Jan     Fri       Aft-Ev       16    Aggravated Assault                         3B40   6
2019-01-31  Jan     Thu       Aft-Ev       16    Purchasing prostitution                    19G10  6
2019-02-09  Feb     Sat       Aft-Ev       15    Theft of motor vehicle parts or accessory  18F50  6

Date        nMonth  nWeekDay  Part_of_day  Hour  Description                                Beat   Count
2019-05-30  May     Thu       Night        23    Theft from motor vehicle                   22B20  1
2019-05-30  May     Thu       Night        23    Theft from motor vehicle                   2A30   1
2019-05-30  May     Thu       Night        23    Theft from motor vehicle                   5F30   1
2019-05-30  May     Thu       Night        23    Theft of motor vehicle parts or accessory  12D20  1
2019-05-30  May     Thu       Night        23    Theft of motor vehicle parts or accessory  5F20   1
2019-01-11  Jan     Fri       Aft-Ev       15    Motor vehicle theft                        20G60  0
2019-01-15  Jan     Tue       Night        11    Motor vehicle theft                        20G50  0
2019-04-19  Apr     Fri       Aft-Ev       16    Motor vehicle theft                        8C50   0
2019-04-19  Apr     Fri       Night        03    Intimidation                               19G10  0
2019-04-27  Apr     Sat       Aft-Ev       17    Aggravated Assault                         5F20   0
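The derived columns in the table (nMonth, nWeekDay, Part_of_day and Hour) can be computed from the occurrence date and hour. The sketch below shows one possible derivation in base R. The function name `derive_time_fields` is hypothetical, and the hour boundaries for each part of the day are assumptions made for illustration, so they may not match the ones used to build the table above.

```r
# Derive month, weekday and part-of-day fields from a date and an hour.
# (derive_time_fields is a hypothetical helper, not the report's own code.)
derive_time_fields <- function(date, hour) {
  date <- as.Date(date)
  hour <- as.integer(hour)
  day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

  # ASSUMED boundaries: Morning 06-11, Aft-Ev 12-20, Night otherwise.
  part <- ifelse(hour >= 6 & hour <= 11, "Morning",
                 ifelse(hour >= 12 & hour <= 20, "Aft-Ev", "Night"))

  data.frame(
    Date = date,
    # month.abb and the wday index avoid locale-dependent names.
    nMonth = factor(month.abb[as.integer(format(date, "%m"))],
                    levels = month.abb),
    nWeekDay = factor(day_names[as.POSIXlt(date)$wday + 1],
                      levels = day_names),
    Part_of_day = factor(part, levels = c("Morning", "Aft-Ev", "Night")),
    Hour = sprintf("%02d", hour),
    stringsAsFactors = FALSE
  )
}

out <- derive_time_fields(c("2019-03-12", "2019-02-02", "2019-01-15"),
                          c("16", "00", "09"))
print(out)
```

Keeping all twelve months and all seven weekdays as factor levels means that months with no data still appear in summaries with a count of zero.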

High-level Data Analysis

The table below shows the summary statistics (minimum, quartiles, mean and maximum for the date and numeric variables, and counts for factor variables). We can already start to draw some interesting conclusions.

##       Date                nMonth      nWeekDay     Part_of_day   
##  Min.   :2019-01-01   May    :21468   Sun:13191   Morning:    0  
##  1st Qu.:2019-02-09   Apr    :20321   Mon:13584   Aft-Ev :41515  
##  Median :2019-03-20   Mar    :19685   Tue:14105   Night  :57327  
##  Mean   :2019-03-18   Jan    :19392   Wed:14732                  
##  3rd Qu.:2019-04-26   Feb    :17976   Thu:14504                  
##  Max.   :2019-05-30   Jun    :    0   Fri:14502                  
##                       (Other):    0   Sat:14224                  
##       Hour                                Description         Beat      
##  18     : 5964   Theft from motor vehicle       :11773   17E10  : 2031  
##  12     : 5905   Simple assault                 :10951   14D20  : 1974  
##  17     : 5734   All other offenses             : 9068   22B20  : 1892  
##  20     : 5503   Destruction, damage, vandalism : 8497   15E40  : 1867  
##  19     : 5291   All other larceny              : 7317   19G10  : 1835  
##  21     : 5245   Burglary, Breaking and Entering: 6362   (Other):89119  
##  (Other):65200   (Other)                        :44874   NA's   :  124  
##      Count       
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 1.000  
##  Mean   : 1.061  
##  3rd Qu.: 1.000  
##  Max.   :10.000  
## 

Detailed Data Analysis

Analysis of month, hours, days of the week and part of the day

Below are four graphics showing the crimes per month, hour of the day, day of the week and part of the day, respectively.

These graphics confirm some of the conclusions drawn in the previous section.

  • Positive trend in crime from January to May.
  • The crime count by hour of the day shows some interesting patterns. During typical sleeping hours, crime is considerably lower. In contrast, crime is especially high right after lunch (1pm) and right after work (6-7pm).
  • Crime is higher on Wednesdays, Thursdays and Fridays, with lower counts on Sundays.
  • Consistent with the hourly pattern, crime is lowest in the morning, highest during the afternoon/evening and decreases again at night.
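A chart such as the crimes-per-hour graphic can be built by summing the Count column for each hour and plotting the totals. The following is a minimal base R sketch on toy data. The helper name `hourly_totals` is hypothetical and the plotting choices are assumptions, not the code behind the graphics in this report.

```r
# Sum crime counts per hour, keeping all 24 hours even when some are empty.
# (hourly_totals is a hypothetical helper written for this illustration.)
hourly_totals <- function(hour, count) {
  hour <- factor(as.integer(hour), levels = 0:23)
  totals <- tapply(count, hour, sum)
  totals <- setNames(as.vector(totals), levels(hour))
  totals[is.na(totals)] <- 0  # hours with no rows come back as NA
  totals
}

# Toy data: hour stored as zero-padded text, as in the table of rows.
totals <- hourly_totals(c("18", "18", "12", "03"), c(1, 2, 1, 1))

barplot(totals, xlab = "Hour of day", ylab = "Crimes",
        main = "Crimes by hour of the day")
```

Fixing the factor levels to 0 through 23 keeps quiet hours on the axis, which makes the dip during sleeping hours visible instead of silently dropping those bars.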

Analysis of crime types

First, the following table shows the top and bottom 10 counts by type of crime committed in Houston between January and May 2019.

Description                             Count
Simple assault                          12910
Theft from motor vehicle                12741
All other offenses                      9308
Destruction, damage, vandalism          8690
All other larceny                       7459
Burglary, Breaking and Entering         6547
Intimidation                            6051
Aggravated Assault                      5561
Motor vehicle theft                     4983
Drug, narcotic violations               4088

Description                             Count
Curfew, loitering, vagrancy violations  23
Peeping tom                             13
Runaway                                 9
Promoting gambling                      6
Welfare fraud                           5
Betting/wagering                        4
Bribery                                 4
Gambling equipment violations           4
Justifiable homicide                    4
Negligent manslaughter                  2
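The totals in this table are larger than the per-description row counts in the earlier summary (for example 12741 against 11773 for theft from motor vehicle). That would be consistent with the table summing the Count column rather than counting rows. The sketch below illustrates the difference on toy data; it assumes that interpretation and is not taken from the report's code.

```r
# Toy data: one row can stand for more than one crime via Count.
crimes <- data.frame(
  Description = c("Robbery", "Arson", "Robbery", "Shoplifting"),
  Count = c(1, 1, 3, 2),
  stringsAsFactors = FALSE
)

# Total crimes per type: sum of Count, sorted from most to least frequent.
by_type <- aggregate(Count ~ Description, data = crimes, FUN = sum)
by_type <- by_type[order(-by_type$Count), ]
rownames(by_type) <- NULL

# Number of rows per type, which is what summary() reports for a factor.
row_counts <- table(crimes$Description)

print(head(by_type, 2))  # most frequent types
print(tail(by_type, 2))  # least frequent types
print(row_counts)
```

In the toy data, Robbery has two rows but a total of four crimes, which is the same kind of gap seen between the two tables in this report.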

The following graph shows crime counts by type and day of the week. Several interesting observations can be drawn:

  • Some of the worst possible crimes are very infrequent, e.g. manslaughter or homicide.
  • The most frequent crimes are theft from motor vehicle, simple assault, other offenses, destruction/damage/vandalism and other larceny.
  • Certain crimes are comparatively more prominent on Fridays and Saturdays, such as simple assault, destruction/damage/vandalism, aggravated assault or DUIs.

When looking at the crime types by part of the day we can observe:

  • Shoplifting is considerably higher during the afternoon/evening, which is quite logical as this covers most business opening hours.
  • DUIs occur significantly more often at night, and far less in the mornings or afternoons/evenings. Burglary, Breaking and Entering is comparatively more common in the mornings.

Another interesting graph is the distribution of crimes during Friday and Saturday nights. As can be observed, the top crimes differ from those seen across all days of the week and times of the day, with more crimes associated with alcohol, such as assault, DUI or vandalism.
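The comparison between Friday and Saturday nights and the overall picture can be made explicit by computing each crime type's share of the total in both subsets. The following base R sketch uses toy data; the helper `share` is hypothetical and the numbers are invented for illustration only.

```r
# Toy data with the same column names as the prepared dataset.
crimes <- data.frame(
  Description = c("Shoplifting", "Driving under the influence",
                  "Simple assault", "Driving under the influence",
                  "Shoplifting"),
  nWeekDay = c("Tue", "Fri", "Sat", "Sat", "Fri"),
  Part_of_day = c("Aft-Ev", "Night", "Night", "Night", "Aft-Ev"),
  Count = c(3, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Share of the total crime count taken by each description.
# (share is a hypothetical helper written for this illustration.)
share <- function(d) {
  totals <- tapply(d$Count, d$Description, sum)
  totals / sum(totals)
}

is_weekend_night <- crimes$nWeekDay %in% c("Fri", "Sat") &
  crimes$Part_of_day == "Night"
weekend_share <- share(crimes[is_weekend_night, ])
overall_share <- share(crimes)

comparison <- data.frame(
  Description = names(weekend_share),
  Weekend_night = as.vector(weekend_share),
  Overall = as.vector(overall_share[names(weekend_share)]),
  stringsAsFactors = FALSE
)
print(comparison)
```

A type whose weekend-night share is well above its overall share is over-represented in that time window, which is the pattern described for DUI and assault.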

Analysis of Beats

First, the following table shows the top and bottom 10 crime counts by beat in Houston between January and May 2019. There is a lot of variation: the top 10 range from 2141 to 1869 crimes, while the bottom 10 range from 12 to only 1 crime.

Beat   Count
17E10  2141
14D20  2132
15E40  2035
22B20  2029
19G10  2016
1A10   1904
17E40  1899
12D10  1888
1A20   1885
7C20   1869

Beat   Count
5F40   12
6B50   11
21I40  10
7C50   8
HCC5   6
23J40  5
HCC7   4
21I30  3
HCC4   2
HCC3   1

The graph below shows the beat where crimes occurred by part of the day. While the proportions are fairly consistent, there are some beats where crimes are much more prominent at night (e.g. 1A20 or 1A30), while in other beats the afternoon/evening shows comparatively more crimes. It would be interesting to analyze whether some of the former are more residential areas and some of the latter are business/working areas, but I haven't analyzed this further.
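Since beats differ widely in total crime, comparing them by part of the day is easier with proportions within each beat than with raw counts. The following base R sketch computes those row proportions on toy data. The beat labels are taken from the report but the counts are invented.

```r
# Toy data: crime counts by beat and part of the day.
crimes <- data.frame(
  Beat = c("1A20", "1A20", "1A20", "17E10", "17E10"),
  Part_of_day = c("Night", "Night", "Aft-Ev", "Aft-Ev", "Night"),
  Count = c(2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Cross-tabulate the summed counts, then convert each row (beat) to shares.
by_beat <- xtabs(Count ~ Beat + Part_of_day, data = crimes)
shares <- prop.table(by_beat, margin = 1)

print(round(shares, 2))
```

With `margin = 1` each row sums to 1, so a beat with a high Night share stands out regardless of how many crimes it has in total.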

Detailed analysis of the worst types of crimes

First, I reviewed the list of crimes and selected the ones that were remarkably harmful, based on purely subjective criteria. I considered manslaughter, homicide, murder, rape, fondling, kidnapping, sodomy, arson and human trafficking. The goal is to analyze whether there is any observable pattern for these types of crimes.

The first table shows that sexual crimes (i.e. forcible rape, sodomy and fondling) are the most common. Negligent manslaughter is the least common, with 2 occurrences.

Description                           Count
Forcible rape                         278
Forcible fondling                     214
Forcible sodomy                       140
Arson                                 96
Murder, non-negligent                 92
Kidnapping, abduction                 75
Human Trafficking/Commercial Sex Act  37
Justifiable homicide                  4
Negligent manslaughter                2

In order to draw conclusions on an aggregated basis, I analyzed the incidence of the worst crimes (regardless of the description type) by day of the week, month, part of the day and beat.
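The selection and aggregation described above can be sketched in base R as a filter on the description followed by a tally per weekday. The list of descriptions is the one from the table above; the toy rows and variable names are assumptions for illustration.

```r
# Descriptions treated as the worst crimes, as listed in the table above.
serious_types <- c(
  "Forcible rape", "Forcible fondling", "Forcible sodomy", "Arson",
  "Murder, non-negligent", "Kidnapping, abduction",
  "Human Trafficking/Commercial Sex Act", "Justifiable homicide",
  "Negligent manslaughter"
)

day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Toy data with the same column names as the prepared dataset.
crimes <- data.frame(
  Description = c("Forcible rape", "Shoplifting", "Arson", "Forcible rape"),
  nWeekDay = factor(c("Sat", "Mon", "Fri", "Sat"), levels = day_names),
  Count = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Keep only the selected types, then total them per weekday
# regardless of the individual description.
serious <- crimes[crimes$Description %in% serious_types, ]
by_day <- xtabs(Count ~ nWeekDay, data = serious)

print(by_day)
```

Swapping `nWeekDay` for another column in the formula gives the same aggregated view by month, part of the day or beat.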

It is very interesting to see that the worst crimes are markedly more frequent as the weekend approaches, on Fridays and Saturdays, and at night (as opposed to the overall crime dataset, which concentrated more in the afternoon/evening). April was also a particularly bad month for the worst crimes. In terms of beats, there is high variability in the worst crimes from beat to beat.

Next, I analyzed the breakdown of the worst crimes by time of the day, day of the week, month and hour of the day. The following graphs show the results obtained. Several conclusions can be drawn from these graphs, e.g. the prominence of rape at night during the weekends, forcible fondling on Fridays during the day, and an increase in arson and kidnapping in May.