I. Data cleaning

1) Data reading

First, let us import the libraries that we need.

Then, let us import the data. Here I used the fread function from the data.table package because it imports data faster than read.csv.
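
As a minimal sketch of this step, the snippet below writes a tiny CSV to a temporary file and reads it back with fread. The file content and the name csv_path are made up for illustration; the real path of the crime export is not given in this report.

```r
library(data.table)

# Hypothetical stand-in for the real crime export (path and rows are invented)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Case Number,Date",
  "10000092,HY189866,03/18/2015 07:44:00 PM",
  "10000094,HY190059,03/18/2015 11:00:00 PM"
), csv_path)

# fread detects the separator and column types automatically
dt <- fread(csv_path)
print(dt)
```

On a file with several million rows, one can time both readers with system.time() to see the difference on one's own machine.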

2) Data exploration

##          ID Case Number                   Date                  Block IUCR
## 1: 10000092    HY189866 03/18/2015 07:44:00 PM        047XX W OHIO ST 041A
## 2: 10000094    HY190059 03/18/2015 11:00:00 PM 066XX S MARSHFIELD AVE 4625
## 3: 10000095    HY190052 03/18/2015 10:45:00 PM  044XX S LAKE PARK AVE 0486
## 4: 10000096    HY190054 03/18/2015 10:30:00 PM   051XX S MICHIGAN AVE 0460
## 5: 10000097    HY189976 03/18/2015 09:00:00 PM       047XX W ADAMS ST 031A
## 6: 10000098    HY190032 03/18/2015 10:00:00 PM    049XX S DREXEL BLVD 0460
##     Primary Type             Description Location Description Arrest Domestic
## 1:       BATTERY     AGGRAVATED: HANDGUN               STREET  FALSE    FALSE
## 2: OTHER OFFENSE        PAROLE VIOLATION               STREET   TRUE    FALSE
## 3:       BATTERY DOMESTIC BATTERY SIMPLE            APARTMENT  FALSE     TRUE
## 4:       BATTERY                  SIMPLE            APARTMENT  FALSE    FALSE
## 5:       ROBBERY          ARMED: HANDGUN             SIDEWALK  FALSE    FALSE
## 6:       BATTERY                  SIMPLE            APARTMENT  FALSE    FALSE
##    Beat District Ward Community Area FBI Code X Coordinate Y Coordinate Year
## 1: 1111       11   28             25      04B      1144606      1903566 2015
## 2:  725        7   15             67       26      1166468      1860715 2015
## 3:  222        2    4             39      08B      1185075      1875622 2015
## 4:  225        2    3             40      08B      1178033      1870804 2015
## 5: 1113       11   28             25       03      1144920      1898709 2015
## 6:  223        2    4             39      08B      1183018      1872537 2015
##                Updated On Latitude Longitude                      Location
## 1: 02/10/2018 03:50:01 PM 41.89140 -87.74438 (41.891398861, -87.744384567)
## 2: 02/10/2018 03:50:01 PM 41.77337 -87.66532 (41.773371528, -87.665319468)
## 3: 02/10/2018 03:50:01 PM 41.81386 -87.59664  (41.81386068, -87.596642837)
## 4: 02/10/2018 03:50:01 PM 41.80080 -87.62262 (41.800802415, -87.622619343)
## 5: 02/10/2018 03:50:01 PM 41.87806 -87.74335 (41.878064761, -87.743354013)
## 6: 02/10/2018 03:50:01 PM 41.80544 -87.60428 (41.805443345, -87.604283976)
## Classes 'data.table' and 'data.frame':   6635842 obs. of  22 variables:
##  $ ID                  : int  10000092 10000094 10000095 10000096 10000097 10000098 10000099 10000100 10000101 10000104 ...
##  $ Case Number         : chr  "HY189866" "HY190059" "HY190052" "HY190054" ...
##  $ Date                : chr  "03/18/2015 07:44:00 PM" "03/18/2015 11:00:00 PM" "03/18/2015 10:45:00 PM" "03/18/2015 10:30:00 PM" ...
##  $ Block               : chr  "047XX W OHIO ST" "066XX S MARSHFIELD AVE" "044XX S LAKE PARK AVE" "051XX S MICHIGAN AVE" ...
##  $ IUCR                : chr  "041A" "4625" "0486" "0460" ...
##  $ Primary Type        : chr  "BATTERY" "OTHER OFFENSE" "BATTERY" "BATTERY" ...
##  $ Description         : chr  "AGGRAVATED: HANDGUN" "PAROLE VIOLATION" "DOMESTIC BATTERY SIMPLE" "SIMPLE" ...
##  $ Location Description: chr  "STREET" "STREET" "APARTMENT" "APARTMENT" ...
##  $ Arrest              : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
##  $ Domestic            : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Beat                : int  1111 725 222 225 1113 223 733 213 912 511 ...
##  $ District            : int  11 7 2 2 11 2 7 2 9 5 ...
##  $ Ward                : int  28 15 4 3 28 4 17 3 11 6 ...
##  $ Community Area      : int  25 67 39 40 25 39 68 38 59 49 ...
##  $ FBI Code            : chr  "04B" "26" "08B" "08B" ...
##  $ X Coordinate        : int  1144606 1166468 1185075 1178033 1144920 1183018 1170859 1178746 1164279 1179637 ...
##  $ Y Coordinate        : int  1903566 1860715 1875622 1870804 1898709 1872537 1858210 1876914 1880656 1840444 ...
##  $ Year                : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Updated On          : chr  "02/10/2018 03:50:01 PM" "02/10/2018 03:50:01 PM" "02/10/2018 03:50:01 PM" "02/10/2018 03:50:01 PM" ...
##  $ Latitude            : num  41.9 41.8 41.8 41.8 41.9 ...
##  $ Longitude           : num  -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ Location            : chr  "(41.891398861, -87.744384567)" "(41.773371528, -87.665319468)" "(41.81386068, -87.596642837)" "(41.800802415, -87.622619343)" ...
##  - attr(*, ".internal.selfref")=<externalptr>

The output of str() shows that the dataset contains 6635842 observations of 22 variables.
Here is a list of descriptions of the variables:

  • ID: a unique identifier for each record.
  • Case Number: the Chicago Police Department RD Number (Record Division Number), which is unique to the incident.
  • Date: the date the incident occurred.
  • Block: the partially redacted address where the incident occurred.
  • IUCR: the Illinois Uniform Crime Reporting code.
  • Primary Type: the primary description of the IUCR code.
  • Description: the secondary description of the IUCR code, a subcategory of the primary description.
  • Location Description: the location where the incident occurred.
  • Arrest: indicates whether or not an arrest was made for the incident (TRUE if an arrest was made, and FALSE if an arrest was not made).
  • Domestic: indicates whether or not the incident was domestic-related, meaning that it was committed against a family member (TRUE if it was domestic, and FALSE if it was not domestic).
  • Beat: indicates the area, or “beat” in which the incident occurred. This is the smallest police geographical area defined by the Chicago police department.
  • District: indicates the police district in which the incident occurred. Three to five beats make up a police sector and three sectors make up a police district. The Chicago police department has 22 police districts and one headquarters.
  • Ward: indicates the ward (City Council district) where the incident occurred.
  • Community Area: indicates the community area in which the incident occurred. Chicago has 77 community areas. The community areas were devised in an attempt to create socially homogeneous regions.
  • FBI Code: indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
  • X Coordinate: indicates the x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Y Coordinate: indicates the y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Year: the year in which the incident occurred.
  • Updated On: the date and time the record was last updated.
  • Latitude: the latitude of the location at which the incident occurred.
  • Longitude: the longitude of the location at which the incident occurred.
  • Location: the combination of latitude and longitude.

Moreover, we remark that the types of the variables Date, IUCR, Primary Type, Beat, District, Ward, Community Area, FBI Code and Updated On are incorrect. The two date-related variables, Date and Updated On, should be coded as date-time values, while the other categorical variables should be coded as factors. As we are going to delete some useless columns, we will do the type conversion at the end of the data cleaning part.

The function summary() provides a detailed summary of the data.

##        ID           Case Number            Date              Block          
##  Min.   :     634   Length:6635842     Length:6635842     Length:6635842    
##  1st Qu.: 3391616   Class :character   Class :character   Class :character  
##  Median : 6119792   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 6137996                                                           
##  3rd Qu.: 8712503                                                           
##  Max.   :11364574                                                           
##                                                                             
##      IUCR           Primary Type       Description        Location Description
##  Length:6635842     Length:6635842     Length:6635842     Length:6635842      
##  Class :character   Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character    
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##    Arrest         Domestic            Beat         District         Ward       
##  Mode :logical   Mode :logical   Min.   : 111   Min.   : 1.0   Min.   : 1.0    
##  FALSE:4786167   FALSE:5768582   1st Qu.: 622   1st Qu.: 6.0   1st Qu.:10.0    
##  TRUE :1849675   TRUE :867260    Median :1111   Median :10.0   Median :22.0    
##                                  Mean   :1193   Mean   :11.3   Mean   :22.7    
##                                  3rd Qu.:1731   3rd Qu.:17.0   3rd Qu.:34.0    
##                                  Max.   :2535   Max.   :31.0   Max.   :50.0    
##                                                 NA's   :51     NA's   :614854  
##  Community Area     FBI Code          X Coordinate      Y Coordinate    
##  Min.   : 0.0     Length:6635842     Min.   :      0   Min.   :      0  
##  1st Qu.:23.0     Class :character   1st Qu.:1152930   1st Qu.:1859170  
##  Median :32.0     Mode  :character   Median :1165964   Median :1890459  
##  Mean   :37.6                        Mean   :1164504   Mean   :1885693  
##  3rd Qu.:58.0                        3rd Qu.:1176352   3rd Qu.:1909321  
##  Max.   :77.0                        Max.   :1205119   Max.   :1951622  
##  NA's   :616030                      NA's   :58603     NA's   :58603    
##       Year       Updated On           Latitude       Longitude     
##  Min.   :2001   Length:6635842     Min.   :36.62   Min.   :-91.69  
##  1st Qu.:2004   Class :character   1st Qu.:41.77   1st Qu.:-87.71  
##  Median :2008   Mode  :character   Median :41.86   Median :-87.67  
##  Mean   :2008                      Mean   :41.84   Mean   :-87.67  
##  3rd Qu.:2012                      3rd Qu.:41.91   3rd Qu.:-87.63  
##  Max.   :2018                      Max.   :42.02   Max.   :-87.52  
##                                    NA's   :58603   NA's   :58603   
##    Location        
##  Length:6635842    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

3) Data preprocessing

Now, let us clean the original dataset (restricted to the past 5 years, because my computer cannot handle the whole dataset) to get a final dataset named dttest that provides useful information for our analysis.

First, we remark that the data are stored at the crime incident level: each observation in the data table records one crime incident. Each incident has a unique identifier, given by each of the first two columns, ID and Case Number. As we only need one identifier per incident, we will delete the ID column.

Moreover, we decide to use only the variable Primary Type as the description of the crime incident. Hence we will delete the columns IUCR, Description, and FBI Code.

We rename some variables to simplify our code.
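
A minimal sketch of dropping and renaming columns, written in base R on a toy data frame (the rows are invented). The new names Case, Type, Locdescrip and Community are the ones that appear in the outputs below; the helper names drop_cols and new_names are only illustrative, and the actual code of this report may differ.

```r
df <- data.frame(
  `ID` = 1:2,
  `Case Number` = c("HY189866", "HY190059"),
  `IUCR` = c("041A", "4625"),
  `Primary Type` = c("BATTERY", "OTHER OFFENSE"),
  `Description` = c("AGGRAVATED: HANDGUN", "PAROLE VIOLATION"),
  `Location Description` = c("STREET", "STREET"),
  `Community Area` = c(25L, 67L),
  `FBI Code` = c("04B", "26"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Drop the columns we decided not to use
drop_cols <- c("ID", "IUCR", "Description", "FBI Code")
df <- df[, !(names(df) %in% drop_cols), drop = FALSE]

# Rename the remaining long names to short ones
new_names <- c("Case Number" = "Case", "Primary Type" = "Type",
               "Location Description" = "Locdescrip",
               "Community Area" = "Community")
hit <- names(df) %in% names(new_names)
names(df)[hit] <- unname(new_names[names(df)[hit]])
print(names(df))
```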

3.1) Duplicates

By using any(duplicated()), we remark that some instances are duplicated, which means that two or more rows have the same Case Number. These duplicated rows need to be removed. The first output below is the check before the removal and the second one is the check after it.

## [1] TRUE
## [1] FALSE
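
The check and the removal can be sketched as follows on a toy data frame (the rows are invented; only the column name Case comes from our dataset). We keep the first row of each Case, which is an assumption about which duplicate to retain.

```r
df <- data.frame(
  Case = c("HY1", "HY2", "HY1", "HY3"),
  Type = c("THEFT", "BATTERY", "THEFT", "ROBBERY"),
  stringsAsFactors = FALSE
)

# Is any Case value repeated?
dup_before <- any(duplicated(df$Case))

# Keep only the first occurrence of each Case
df <- df[!duplicated(df$Case), ]

dup_after <- any(duplicated(df$Case))
print(c(dup_before, dup_after))
```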

3.2) Missing values

By using any(is.na()), we remark that there are some missing values in the dataset. Depending on the meaning and type of the variable, these missing values need to be either substituted logically or removed.

## [1] TRUE
##           ID         Case         Date        Block         IUCR         Type 
##            0            1            0            0            0            0 
##  Description   Locdescrip       Arrest     Domestic         Beat     District 
##            0         2751            0            0            0            5 
##         Ward    Community     FBI Code X Coordinate Y Coordinate         Year 
##            6            1            0        11489        11489            0 
##   Updated On     Latitude    Longitude     Location 
##            0        11489        11489        11489

First, we remark that certain records lack the precise location where the crime occurred. In other words, there are missing values in X Coordinate, Y Coordinate, Latitude, Longitude and Location. We tried to replace the NAs in the Latitude column with the values of rows having the same X Coordinate, since both variables encode the address. The number of NAs in the Latitude column did not change, so we conclude that these values cannot be substituted using logical connections with other variables. However, since the percentage of these missing values is relatively small (11489 records), we can safely ignore these records.

##           ID         Case         Date        Block         IUCR         Type 
##            0            1            0            0            0            0 
##  Description   Locdescrip       Arrest     Domestic         Beat     District 
##            0         2751            0            0            0            5 
##         Ward    Community     FBI Code X Coordinate Y Coordinate         Year 
##            6            1            0        11489        11489            0 
##   Updated On     Latitude    Longitude     Location 
##            0        11489        11489        11489
##           ID         Case         Date        Block         IUCR         Type 
##            0            1            0            0            0            0 
##  Description   Locdescrip       Arrest     Domestic         Beat     District 
##            0         2094            0            0            0            5 
##         Ward    Community     FBI Code X Coordinate Y Coordinate         Year 
##            6            1            0            0            0            0 
##   Updated On     Latitude    Longitude     Location 
##            0            0            0            0
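
A minimal sketch of how such per-column NA counts can be produced and how the rows without coordinates can be dropped, using a toy data frame (the values are invented; only the column names come from our dataset):

```r
df <- data.frame(
  Case      = c("HY1", "HY2", "HY3", "HY4"),
  Latitude  = c(41.89, NA, 41.81, NA),
  Longitude = c(-87.74, NA, -87.59, NA),
  stringsAsFactors = FALSE
)

# Number of NAs per column, and share of rows without a latitude
na_before <- colSums(is.na(df))
share_missing <- mean(is.na(df$Latitude))

# Keep only the rows where both coordinates are known
df <- df[!is.na(df$Latitude) & !is.na(df$Longitude), ]
na_after <- colSums(is.na(df))

print(na_before)
print(na_after)
```

Looking at share_missing before deleting anything is a simple way to confirm that the loss of records is acceptable.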

Second, we find that one value in the Case Number column is missing, which seems to be a data recording issue, so we can safely ignore the associated observation.

##           ID         Case         Date        Block         IUCR         Type 
##            0            0            0            0            0            0 
##  Description   Locdescrip       Arrest     Domestic         Beat     District 
##            0         2094            0            0            0            5 
##         Ward    Community     FBI Code X Coordinate Y Coordinate         Year 
##            6            1            0            0            0            0 
##   Updated On     Latitude    Longitude     Location 
##            0            0            0            0

Finally, we find that there are some missing values in the columns Locdescrip, District, and Community. If an observation has the same value in the Beat and/or Location columns as another record without missing values, the two incidents occurred at the same place and should share the same value in these columns, so the NAs can be substituted using this logical connection. Otherwise, we remove the observations whose NAs cannot be replaced properly.

##           ID         Case         Date        Block         IUCR         Type 
##            0            0            0            0            0            0 
##  Description   Locdescrip       Arrest     Domestic         Beat     District 
##            0          277            0            0            0            0 
##         Ward    Community     FBI Code X Coordinate Y Coordinate         Year 
##            0            0            0            0            0            0 
##   Updated On     Latitude    Longitude     Location 
##            0            0            0            0
## [1] FALSE
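
The substitution idea can be sketched as follows for District, assuming that a Beat always belongs to a single District. The toy values and the helper names known, missing and idx are illustrative, not the code used in this report.

```r
df <- data.frame(
  Beat     = c(1111, 1111, 725, 725, 999),
  District = c(11, NA, 7, 7, NA)
)

# Rows whose District is known serve as the reference
known   <- df[!is.na(df$District), ]
missing <- is.na(df$District)

# For each row with a missing District, find the first reference row with the same Beat
idx <- match(df$Beat[missing], known$Beat)
df$District[missing] <- known$District[idx]

# Rows that still have an NA (no reference row with the same Beat) are removed
df <- df[!is.na(df$District), ]
print(df)
```

Here the second row receives District 11 from the first row, while the row with Beat 999 has no counterpart and is dropped.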

3.3) Obviously invalid values and inconsistencies

To be cautious, let us check for obviously invalid values and inconsistencies here.

## integer(0)

Hence, there are no obviously invalid values in the Year column.

As for inconsistencies, we know from the description of the dataset that there are 22 police districts and 77 community areas. Let us check whether our data are consistent with this.

## [1] 22
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    14 
## 59138 50003 56747 67969 52296 71919 64860 76180 55077 54377 83988 57470 42303 
##    15    16    17    18    19    20    22    24    25 
## 50358 40670 34027 57553 52607 19431 38051 33061 64826
## [1] 78
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##     3 16839 14263 16218  8042  5661 25157 18398 47993  1065  4811  4578  1887 
##    13    14    15    16    17    18    19    20    21    22    23    24    25 
##  3973 10600 14117 12478  6445  2475 20413  6933  9912 21688 37101 33485 75946 
##    26    27    28    29    30    31    32    33    34    35    36    37    38 
## 26467 23544 38779 39454 20222 11231 41042  8400  4647 11701  2975  3987 14842 
##    39    40    41    42    43    44    45    46    47    48    49    50    51 
##  6456 12899  7182 17216 39751 28255  5797 21715  1658  6073 30545  5016  8290 
##    52    53    54    55    56    57    58    59    60    61    62    63    64 
##  5465 17673  5786  2474  8176  3938 10949  4434  6874 21377  4381 10295  3789 
##    65    66    67    68    69    70    71    72    73    74    75    76    77 
##  8680 26741 32226 29434 29711 10325 34243  3757 13731  2403  8887  8113 10424

We remark that there are 22 districts and 78 community areas in our dataset: the number of districts matches the description, but there is one community area more than expected.
For the districts, note that summary() on the full dataset reports a maximum district code of 31. After searching on Google, I found that 31 is the code of the headquarters of the Chicago Police Department and that some areas are controlled by district #31, so such records can be kept.
However, for the community areas, we remark that some records have 0 as their community area code. As I could not find any explanation of the code 0 on the Internet, and since the number of such cases is very small (3 records), I decided to remove these data.
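
A minimal sketch of this consistency check on toy values (the data are invented; the valid range 1 to 77 comes from the description of the dataset):

```r
df <- data.frame(
  District  = c(1, 2, 2, 25),
  Community = c(0, 25, 77, 8)
)

# Number of distinct codes and their frequencies
n_district  <- length(unique(df$District))
n_community <- length(unique(df$Community))
print(table(df$Community))

# Community codes outside the documented range 1..77
invalid <- !(df$Community %in% 1:77)

# Remove the records with an invalid community code
df <- df[!invalid, ]
print(df)
```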

As we do not need all the variables of the original dataset, either for the operations below or for further analysis, we can now delete these useless columns: Block, Ward, X Coordinate, Y Coordinate and Updated On.

3.4) Variable Date

From the str() output, we know that Date is stored as a string variable. To make R understand that it is in fact a date, we can use the as.POSIXlt() function.
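
A minimal sketch of this conversion on three sample strings. The format string is inferred from the values shown by str(); the time zone "UTC" is an assumption used here only to keep the wall-clock times as written. Note that %p (AM/PM) depends on the locale and may not be recognised in some non-English locales.

```r
raw <- c("03/18/2015 07:44:00 PM",
         "03/18/2015 11:00:00 PM",
         "01/02/2016 12:05:00 AM")

# %I is the 12-hour clock hour and %p the AM/PM indicator
parsed <- as.POSIXlt(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# POSIXlt stores the components separately, e.g. the hour on a 24-hour clock
hours <- parsed$hour
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
print(hours)
```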

In order to analyze the number of crimes at different times of the day, we prepare four time intervals: from 0H to 5H, from 6H to 11H, from 12H to 17H and from 18H to 24H. Then we assign each observation to one of these four time intervals.
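
One possible way to build these intervals is cut() on the hour of the day. This is only a sketch: the label "18-24H" appears in our final dataset, while the three other labels are assumptions.

```r
# Hours of the day (0 to 23) for a few hypothetical incidents
hour <- c(0, 5, 6, 11, 12, 17, 18, 23)

# right = FALSE makes the intervals [0,6), [6,12), [12,18), [18,24)
tint <- cut(hour,
            breaks = c(0, 6, 12, 18, 24),
            right  = FALSE,
            labels = c("0-6H", "6-12H", "12-18H", "18-24H"))
print(table(tint))
```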

Finally, in order to analyze how crime incidents evolve across weekdays and months, we create two more variables, Day and Month. Moreover, we can also compare the number of incidents that occurred during different quarters/seasons of the year. Hence, let us prepare four season intervals, SPRING, SUMMER, FALL, and WINTER, and assign each observation to one of them.
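
A sketch of how these variables could be derived in base R. The mapping of months to seasons is an assumption (March to May as SPRING, and so on); it is at least consistent with our final dataset, where 18 March falls in SPRING. The names of days and months returned by weekdays() and months() depend on the locale of the R session.

```r
dates <- as.Date(c("2015-03-18", "2015-07-04", "2015-10-31", "2015-12-25"))

# Locale-dependent abbreviated names of the weekday and of the month
day   <- weekdays(dates, abbreviate = TRUE)
month <- months(dates, abbreviate = TRUE)

# Season looked up from the month number (assumed mapping, January first)
month_num <- as.integer(format(dates, "%m"))
season_by_month <- c("WINTER", "WINTER", "SPRING", "SPRING", "SPRING", "SUMMER",
                     "SUMMER", "SUMMER", "FALL", "FALL", "FALL", "WINTER")
season <- factor(season_by_month[month_num],
                 levels = c("SPRING", "SUMMER", "FALL", "WINTER"))
print(data.frame(dates, day, month, season))
```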

3.5) Description of the crime incident

Here we use the Type column to distinguish different incident types. Let us take a look at this column.

## 
##                             ARSON                           ASSAULT 
##                              1956                             81281 
##                           BATTERY                          BURGLARY 
##                            220539                             59984 
## CONCEALED CARRY LICENSE VIOLATION               CRIM SEXUAL ASSAULT 
##                               209                              6228 
##                   CRIMINAL DAMAGE                 CRIMINAL TRESPASS 
##                            128969                             30348 
##                DECEPTIVE PRACTICE                          GAMBLING 
##                             72899                              1156 
##                          HOMICIDE                 HUMAN TRAFFICKING 
##                              2493                                39 
##  INTERFERENCE WITH PUBLIC OFFICER                      INTIMIDATION 
##                              5337                               574 
##                        KIDNAPPING              LIQUOR LAW VIOLATION 
##                               891                              1231 
##               MOTOR VEHICLE THEFT                         NARCOTICS 
##                             47105                             80985 
##                      NON-CRIMINAL  NON-CRIMINAL (SUBJECT SPECIFIED) 
##                               130                                 5 
##                    NON - CRIMINAL                         OBSCENITY 
##                                35                               252 
##        OFFENSE INVOLVING CHILDREN          OTHER NARCOTIC VIOLATION 
##                              9805                                30 
##                     OTHER OFFENSE                      PROSTITUTION 
##                             76428                              4816 
##                  PUBLIC INDECENCY            PUBLIC PEACE VIOLATION 
##                                51                              9036 
##                           ROBBERY                       SEX OFFENSE 
##                             47631                              4086 
##                          STALKING                             THEFT 
##                               735                            270671 
##                 WEAPONS VIOLATION 
##                             16973
## [1] 33

We remark that there exist 33 different incident types. In order to simplify our analysis without loss of generality, we can regroup some “small” types into one type.
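
The exact grouping is not reproduced here. As an illustration only, the sketch below first merges the spelling variants of NON-CRIMINAL and then collapses every type with fewer than a hypothetical threshold of 3 occurrences into OTHER. The toy counts and the threshold are assumptions.

```r
type <- c(rep("THEFT", 5), rep("BATTERY", 4), "OBSCENITY", "STALKING",
          "NON-CRIMINAL", "NON - CRIMINAL", "NON-CRIMINAL (SUBJECT SPECIFIED)")

# Merge the spelling variants of the same category
type[grepl("^NON ?- ?CRIMINAL", type)] <- "NON-CRIMINAL"

# Types rarer than the (hypothetical) threshold are regrouped as OTHER
counts <- table(type)
small  <- names(counts)[counts < 3]
type_grouped <- ifelse(type %in% small, "OTHER", type)
print(table(type_grouped))
```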

Similarly, we can also regroup some location descriptions.

dttest[["Locdescrip"]] <- ifelse(dttest[["Locdescrip"]] %in% c("VEHICLE-COMMERCIAL", "VEHICLE - DELIVERY TRUCK", "VEHICLE - OTHER RIDE SERVICE", "VEHICLE - OTHER RIDE SHARE SERVICE (E.G., UBER, LYFT)", "VEHICLE NON-COMMERCIAL", "TRAILER", "TRUCK", "DELIVERY TRUCK", "TAXICAB", "OTHER COMMERCIAL TRANSPORTATION"), "VEHICLE", 
                   ifelse(dttest[["Locdescrip"]] %in% c("BAR OR TAVERN", "TAVERN", "TAVERN/LIQUOR STORE"), "TAVERN",
                   ifelse(dttest[["Locdescrip"]] %in% c("SCHOOL YARD", "SCHOOL, PRIVATE, BUILDING", "SCHOOL, PRIVATE, GROUNDS", "SCHOOL, PUBLIC, BUILDING", "SCHOOL, PUBLIC, GROUNDS", "COLLEGE/UNIVERSITY GROUNDS", "COLLEGE/UNIVERSITY RESIDENCE HALL"), "SCHOOL",
                   ifelse(dttest[["Locdescrip"]] %in% c("RESIDENCE", "RESIDENCE-GARAGE", "RESIDENCE PORCH/HALLWAY", "RESIDENTIAL YARD (FRONT/BACK)", "DRIVEWAY - RESIDENTIAL", "GARAGE", "HOUSE", "PORCH", "YARD"), "RESIDENCE", 
                   ifelse(dttest[["Locdescrip"]] %in% c("PARKING LOT", "PARKING LOT/GARAGE(NON.RESID.)", "POLICE FACILITY/VEH PARKING LOT"), "PARKING", 
                   ifelse(dttest[["Locdescrip"]] %in% c("OTHER", "OTHER RAILROAD PROP / TRAIN DEPOT", "ABANDONED BUILDING", "ANIMAL HOSPITAL", "ATHLETIC CLUB", "BASEMENT", "BOAT/WATERCRAFT", "CHURCH", "CHURCH/SYNAGOGUE/PLACE OF WORSHIP", "COIN OPERATED MACHINE", "CONSTRUCTION SITE", "SEWER", "STAIRWELL", "VACANT LOT", "VACANT LOT/LAND", "VESTIBULE", "WOODED AREA", "FARM", "FACTORY", "FACTORY/MANUFACTURING BUILDING", "FEDERAL BUILDING", "FIRE STATION", "FOREST PRESERVE", "GOVERNMENT BUILDING", "GOVERNMENT BUILDING/PROPERTY", "JAIL / LOCK-UP FACILITY", "LIBRARY", "MOVIE HOUSE/THEATER", "POOL ROOM", "SPORTS ARENA/STADIUM", "WAREHOUSE", "AUTO", "AUTO / BOAT / RV DEALERSHIP", "CEMETARY"), "OTHERS", 
                   ifelse(dttest[["Locdescrip"]] %in% c("COMMERCIAL / BUSINESS OFFICE"), "BIGBUSINESS", 
                   ifelse(dttest[["Locdescrip"]] %in% c("PARK PROPERTY"), "PARK", 
                   ifelse(dttest[["Locdescrip"]] %in% c("ATM (AUTOMATIC TELLER MACHINE)", "BANK", "CREDIT UNION", "CURRENCY EXCHANGE", "SAVINGS AND LOAN"), "BANK", 
                   ifelse(dttest[["Locdescrip"]] %in% c("HOTEL", "HOTEL/MOTEL"), "HOTEL", 
                   ifelse(dttest[["Locdescrip"]] %in% c("HOSPITAL", "HOSPITAL BUILDING/GROUNDS", "DAY CARE CENTER", "NURSING HOME", "NURSING HOME/RETIREMENT HOME", "MEDICAL/DENTAL OFFICE"), "HEALTH", 
                   ifelse(dttest[["Locdescrip"]] %in% c("ALLEY", "BOWLING ALLEY"), "ALLEY", 
                   ifelse(dttest[["Locdescrip"]] %in% c("CHA APARTMENT", "CHA HALLWAY/STAIRWELL/ELEVATOR", "CHA PARKING LOT", "CHA PARKING LOT/GROUNDS"), "CHA", 
                   ifelse(dttest[["Locdescrip"]] %in% c("CTA BUS", "CTA BUS STOP", "CTA GARAGE / OTHER PROPERTY", "CTA PLATFORM", "CTA STATION", "CTA TRACKS - RIGHT OF WAY", "CTA TRAIN", "CTA \"\"L\"\" TRAIN"), "CTA", 
                   ifelse(dttest[["Locdescrip"]] %in% c("AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA", "AIRPORT BUILDING NON-TERMINAL - SECURE AREA", "AIRPORT EXTERIOR - NON-SECURE AREA", "AIRPORT EXTERIOR - SECURE AREA", "AIRPORT PARKING LOT", "AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL LOWER LEVEL - SECURE AREA", "AIRPORT TERMINAL MEZZANINE - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - SECURE AREA", "AIRPORT TRANSPORTATION SYSTEM (ATS)", "AIRPORT VENDING ESTABLISHMENT", "AIRPORT/AIRCRAFT", "AIRCRAFT"), "AIRPORT", 
                   ifelse(dttest[["Locdescrip"]] %in% c("APPLIANCE STORE", "BARBERSHOP", "CAR WASH", "CLEANING STORE", "CONVENIENCE STORE", "DEPARTMENT STORE", "DRUG STORE", "GARAGE/AUTO REPAIR", "GAS STATION", "GAS STATION DRIVE/PROP.", "GROCERY FOOD STORE", "NEWSSTAND", "OFFICE", "PAWN SHOP", "RETAIL STORE", "SMALL RETAIL STORE"), "STORE",
                   ifelse(dttest[["Locdescrip"]] %in% c("BRIDGE", "DRIVEWAY", "GANGWAY", "HIGHWAY/EXPRESSWAY", "LAKEFRONT/WATERFRONT/RIVERBANK", "SIDEWALK", "STREET", "HALLWAY"), "STREET",
                   dttest[["Locdescrip"]])))))))))))))))))
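
The nested ifelse() calls above are hard to extend. As an alternative sketch (not the code used in this report), the same kind of regrouping can be written with a lookup vector. Only three of the groups are reproduced, and the names groups, lookup and loc are illustrative.

```r
# Each element lists the raw descriptions that belong to one group
groups <- list(
  TAVERN = c("BAR OR TAVERN", "TAVERN", "TAVERN/LIQUOR STORE"),
  HOTEL  = c("HOTEL", "HOTEL/MOTEL"),
  PARK   = c("PARK PROPERTY")
)

# Named vector: raw description -> group name
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

loc <- c("BAR OR TAVERN", "HOTEL/MOTEL", "APARTMENT", "PARK PROPERTY", NA)

# Descriptions without an entry in the lookup keep their original value
mapped    <- unname(lookup[loc])
regrouped <- ifelse(is.na(mapped), loc, mapped)
print(regrouped)
```

As with the nested ifelse(), a description listed in two groups would be assigned to the first one, because indexing by name returns the first match.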

3.6) Final steps

Finally, let us reorder the columns and normalize the types of the variables.
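
A minimal sketch of this last step on a toy data frame (the rows are invented and only three columns are shown): the categorical columns are converted to factors and the columns are put in the desired order.

```r
df <- data.frame(
  District = c(11L, 7L),
  Case     = c("HY189866", "HY190059"),
  Type     = c("BATTERY", "OTHER"),
  stringsAsFactors = FALSE
)

# Convert the categorical columns to factors
factor_cols <- c("Type", "District")
df[factor_cols] <- lapply(df[factor_cols], factor)

# Reorder the columns
df <- df[, c("Case", "Type", "District")]
str(df)
```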

Here is a general overview of our dataset after data cleaning. We will use this dataset dttest for further analysis.

## Rows: 1,182,908
## Columns: 17
## $ Case       <chr> "HY189866", "HY190059", "HY190052", "HY190054", "HY18997...
## $ Date       <dttm> 2015-03-18 19:44:00, 2015-03-18 23:00:00, 2015-03-18 22...
## $ Year       <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20...
## $ Month      <ord> mars, mars, mars, mars, mars, mars, mars, mars, mars, ma...
## $ Day        <ord> mer\., mer\., mer\., mer\., mer\., mer\., mer\., mer\., ...
## $ Season     <fct> SPRING, SPRING, SPRING, SPRING, SPRING, SPRING, SPRING, ...
## $ Tint       <fct> 18-24H, 18-24H, 18-24H, 18-24H, 18-24H, 18-24H, 18-24H, ...
## $ Type       <fct> BATTERY, OTHER, BATTERY, BATTERY, ROBBERY, BATTERY, BATT...
## $ Arrest     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ Domestic   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FAL...
## $ Locdescrip <fct> STREET, STREET, APARTMENT, APARTMENT, STREET, APARTMENT,...
## $ Beat       <fct> 1111, 725, 222, 225, 1113, 223, 733, 213, 912, 511, 533,...
## $ District   <fct> 11, 7, 2, 2, 11, 2, 7, 2, 9, 5, 5, 6, 4, 12, 15, 4, 14, ...
## $ Community  <fct> 25, 67, 39, 40, 25, 39, 68, 38, 59, 49, 54, 69, 46, 28, ...
## $ Latitude   <dbl> 41.89140, 41.77337, 41.81386, 41.80080, 41.87806, 41.805...
## $ Longitude  <dbl> -87.74438, -87.66532, -87.59664, -87.62262, -87.74335, -...
## $ Location   <chr> "(41.891398861, -87.744384567)", "(41.773371528, -87.665...

II. Analysis and Visualization

1) Number of crimes

In this part, we are trying to answer the question: How has crime evolved over time in Chicago?

1.1) Evolution over years

At first, let us plot the number of crimes for each year from 2014 to 2018. We can see that crime in Chicago has been decreasing over the years. From 2014 to 2017, the number of crimes was roughly constant, then there was a significant decrease from 2017 to 2018. This significant decrease arises because we do not have data for the whole year of 2018. Hence, that decline does not necessarily imply a sudden improvement of the crime situation in Chicago.
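
To avoid being misled by the incomplete last year, one option is to compare the years over the same months only. The sketch below is an illustration on invented dates, not part of the original analysis: it counts incidents per year, first on all the data and then restricted to the months available in the last year.

```r
dates <- as.Date(c("2016-01-15", "2016-06-10", "2016-11-02",
                   "2017-02-20", "2017-08-05", "2017-12-24",
                   "2018-01-09", "2018-02-14"))
year  <- as.integer(format(dates, "%Y"))
month <- as.integer(format(dates, "%m"))

# Raw counts: the last year looks low simply because it is incomplete
raw_counts <- table(year)

# Like-for-like counts: keep only the months observed in the last year
last_month  <- max(month[year == max(year)])
fair_counts <- table(year[month <= last_month])

print(raw_counts)
print(fair_counts)
```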

However, the general decreasing trend could be interpreted as an improvement in the efficiency of the Chicago Police Department, because we can attribute the decline in the number of incidents to a stronger reputation of the Chicago Police Department.

1.2) Evolution in different time dimensions

The following plot shows how the number of crimes changes along different time dimensions: by time intervals, by weekdays, by months and by seasons.

We can see that, from 2014 to 2018 in Chicago, there is no big difference in the frequency of crimes across weekdays.

However, we can find some patterns in the occurrence of crimes:

  • Most incidents happened in the second part of the day, i.e., in the afternoon and at night.
  • Incidents were more likely to occur on Fridays and Saturdays.
  • Crimes were more likely to happen in May and in June, while they were less frequent in November and in December.
  • Most crimes happened in summer and there were relatively fewer incidents in winter. This is consistent with the results obtained from the figure Evolution by months. This result is also plausible, since temperature may affect people’s emotions and hence the frequency of crime incidents: during summer months, high temperatures may make people more emotional and incidents more frequent, and vice versa in winter months.

# By time intervals
p1 <- dttest %>%
  group_by(Tint) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Tint, y = Count)) +
  geom_bar(aes(x = Tint, y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Time intervals", y = "Number of crimes", title = "Evolution by time intervals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) + 
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# By weekdays
p2 <- dttest %>%
  group_by(Day) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Day, y = Count)) +
  geom_bar(aes(x = factor(Day, levels = c("lun\\.", "mar\\.", "mer\\.", "jeu\\.", "ven\\.", "sam\\.", "dim\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Weekdays", y = "Number of crimes", title = "Evolution by weekdays") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) + 
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# By months
p3 <- dttest %>%
  group_by(Month) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Month, y = Count)) +
  geom_bar(aes(x = factor(Month, levels = c("janv\\.", "févr\\.", "mars", "avr\\.", "mai", "juin", "juil\\.", "août", "sept\\.", "oct\\.", "nov\\.", "déc\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Months", y = "Number of crimes", title = "Evolution by months") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# By seasons
p4 <- dttest %>%
  group_by(Season) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Season, y = Count)) +
  geom_bar(aes(x = factor(Season, level = c("SPRING", "SUMMER", "FALL", "WINTER")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Seasons", y = "Number of crimes", title = "Evolution by seasons") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# Combine plots into one plot
ggarrange(p1, p2, p4, p3, ncol = 2, nrow = 2)
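
The weekday and month levels in the code above are French abbreviations, presumably because the labels were generated under a French locale (this is an assumption; the labelling step is not shown here). As a minimal sketch, weekday labels can be derived in a locale-independent way with base R. The vector `dates` and the English label set are illustrative and not part of the original analysis.

```r
# Illustrative dates (not taken from the full data set)
dates <- as.Date(c("2015-03-18", "2015-03-20", "2015-03-21", "2015-03-22"))

# as.POSIXlt()$wday is 0 for Sunday up to 6 for Saturday, whatever the locale
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
wday_index <- as.POSIXlt(dates)$wday + 1

# Fix the level order Monday..Sunday so bar charts are drawn in that order
Day <- factor(day_names[wday_index],
              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
print(Day)
```

With labels built this way, the `factor(Day, level = ...)` calls would not need locale-specific strings.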

1.3) Evolution in different places of the city

In this part, we will make use of the column Location Description and try to answer the question: In which places is crime more likely to happen?

The following plot shows how the number of crimes varies across places. Most crimes happen on the Street, followed by Residences, Apartments, Stores and Other places.

## [1] "STREET"    "RESIDENCE" "APARTMENT" "STORE"     "OTHERS"
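
The code that produced these five categories is not shown in this part. A minimal sketch of one way to obtain such a grouping, assuming the less frequent places are simply lumped into "OTHERS" (the `places` vector and the `keep` list below are made-up illustrations):

```r
# Toy stand-in for the `Location Description` column (values are made up)
places <- c("STREET", "STREET", "RESIDENCE", "APARTMENT", "SIDEWALK",
            "STORE", "STREET", "ALLEY", "RESIDENCE", "APARTMENT")

# Hypothetical list of categories to keep; everything else becomes "OTHERS"
keep <- c("STREET", "RESIDENCE", "APARTMENT", "STORE")
place_group <- ifelse(places %in% keep, places, "OTHERS")

# Order the factor levels as in the output above, then count incidents
place_group <- factor(place_group, levels = c(keep, "OTHERS"))
place_counts <- table(place_group)
print(place_counts)
```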

1.4) Evolution in different locations of the city

In this part, we will make use of the columns Latitude and Longitude and try to answer the question: In which location is crime more likely to happen?

The following plot shows that most crimes happened in the mid-west areas of Chicago, especially in community #25. Crimes were also frequent in communities #8, #32 and #43, along the east side of the city. We also note that incidents were infrequent in the very central areas of Chicago (i.e. areas #57, #59 and #34), and that the number of crimes increases from this central area towards the north and the south.

## [1] "area"       "area_num_1" "area_numbe" "comarea"    "comarea_id"
## [6] "community"  "perimeter"  "shape_area" "shape_len"
# Transform the map as a data frame
dfcommu <- fortify(mapcomu, region = "area_numbe")

# Extract number of crimes for each community
temp <- dttest %>%
  group_by(Community) %>%
  summarise(Count = n())

temp$id <- 1:77

# Merge two data frames
temp2df <- merge(dfcommu, temp, by = "id", all.x = TRUE)

temp2df <- temp2df[order(temp2df$order), ]

# Extract community numbers
communum <- aggregate(cbind(long, lat) ~ Community, data = temp2df, FUN = function(x) mean(range(x)))

# Basic plot
locplot <- ggplot() +
  geom_polygon(data = temp2df, aes(x = long, y = lat, group = Community, fill = Count), color = "black", size = 0.25) + 
  coord_map() + 
  scale_fill_gradient(low = "white", high = "red") + 
  theme_nothing(legend = T) +
  labs(title = "Number of crimes per community") + 
  geom_text(data = communum, aes(x = long, y = lat, label = Community), size = 3, fontface = "bold")

# Import the police station
dfpolice <- fread(file = "C:/Users/ZHAO Hanlin/Desktop/RProject (DDL 2020-1-9)/Police_Stations_-_Map.csv", header = T, sep = ",", na.strings = "")

# Extract police stations' locations
dfpolice$LOCATION <- gsub("[(*)]", "", dfpolice$LOCATION)
policeloc <-str_split_fixed(dfpolice$LOCATION, ", ", 2)
policeloc <- as.data.frame(policeloc)
colnames(policeloc) <- c("lat", "long")
policeloc$lat <- as.numeric(as.character(policeloc$lat))
policeloc$long <- as.numeric(as.character(policeloc$long))
policeloc$id <- dfpolice$DISTRICT

# Plot police stations (by using black triangles) on the map
locplot <- locplot +
  geom_point(data = policeloc, aes(x = long, y = lat), size = 1, shape = 24, fill = "black")

# Bar chart of the number of crimes per community
tempplot <- dttest %>%
  group_by(Community) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Community, y = Count)) +
  geom_bar(aes(x = reorder(Community, Count), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Community number", y = "Number of crimes", title = "Evolution by community areas") +
  theme_minimal() +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

locplot

2) Types of crimes

In this part, we try to answer the question: How have crime types evolved over time in Chicago?

2.1) Evolution over years

At first, let us plot the number of different crime types for each year from 2014 to 2018.

We can see that the most frequent crime types in Chicago over these five years are Theft, Battery and Damage. In contrast, the least frequent are ARSON and HOMICIDE.

From 2014 to 2018, we can see a general decreasing trend in most crime types, while the number of Weapon and Homicide cases stayed almost constant over the years. This can be seen as a good sign, since fewer and fewer crimes were committed. The general decreasing trend could be interpreted as an improvement in the efficiency of the Chicago Police Department.
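
The yearly counts behind such a plot can be sketched as a simple cross-tabulation. The example below uses toy data and base R only; the column names `Type` and `Year` follow those used elsewhere in this report, and the numbers are invented for illustration.

```r
# Toy data: one row per incident (values are made up)
crimes <- data.frame(
  Year = c(2014, 2014, 2014, 2015, 2015, 2016),
  Type = c("THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY", "THEFT")
)

# Cross-tabulate: one row per crime type, one column per year
type_by_year <- table(crimes$Type, crimes$Year)
print(type_by_year)

# Change in counts between the first and the last year, per type
trend <- type_by_year[, ncol(type_by_year)] - type_by_year[, 1]
print(trend)
```

A negative entry in `trend` corresponds to a crime type that became less frequent over the period.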

2.2) Evolution in different time dimensions

The following plot shows how the different crime types vary along different time dimensions: by time intervals, by weekdays, by months and by seasons.

From these heat maps, we find that, from 2014 to 2018 in Chicago:
- THEFT occurred frequently in all time intervals, especially from 6h to 24h, with most THEFT crimes occurring in the afternoon, i.e. from 12h to 18h. BATTERY crimes, however, happened mostly in the evening, i.e. from 18h to 24h. Some crimes such as NARCOTICS and DAMAGE occurred more in the evening, while DECEIVE crimes happened more frequently in the morning. This is plausible, since the timing of each crime type is consistent with its characteristics: in general, people want to stay hidden when committing NARCOTICS and DAMAGE crimes, while DECEIVE incidents mostly involve businesspeople and so usually take place during office hours.
- THEFT was the most frequent crime on each day of the week. BATTERY and DAMAGE crimes happened mostly during the weekend. For the reason given above, DECEIVE crimes occurred more on weekdays than during the weekend.
- THEFT was the most frequent crime throughout the year, followed by BATTERY, DAMAGE and ASSAULT. As concluded above, almost all types occurred more often during the summer months, i.e. from May to August.
- THEFT occurred frequently in all four seasons, and most frequently in summer. BATTERY, DAMAGE and ASSAULT cases were most frequent in summer and least frequent in winter.
- The frequency of most incident types, such as ARSON, BURGLARY, HOMICIDE, HUMANCHILD, MOTO, ROBBERY, SEX, SOCIETY, TRESPASS and WEAPONS, did not change much whatever the time dimension (time intervals, weekdays, months or seasons).

# Transform the type
dttest[, c("Month", "Day", "Season", "Tint")] <- lapply(dttest[, c("Month", "Day", "Season", "Tint")], as.character)

# By time intervals
p1 <- dttest %>%
  group_by(Type, Tint) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Tint, y = reorder(Type, Count))) +
  geom_tile(aes(fill = Count)) + 
  scale_x_discrete("Time intervals", expand = c(0, 0), position = "top") +
  scale_y_discrete("Crime types", expand = c(0, -2)) +
  scale_fill_gradient("Number of crimes", low = "white", high = "red") +
  ggtitle("Evolution by time intervals") +
  theme_bw() +
  theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))

# By weekdays
p2 <- dttest %>%
  group_by(Type, Day) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Day, y = reorder(Type, Count))) +
  geom_tile(aes(fill = Count)) + 
  scale_x_discrete("Weekdays", expand = c(0, 0), position = "top") +
  scale_y_discrete("Crime types", expand = c(0, -2)) +
  scale_fill_gradient("Number of crimes", low = "white", high = "red") +
  ggtitle("Evolution by weekdays") +
  theme_bw() +
  theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))

# By months
p3 <- dttest %>%
  group_by(Type, Month) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Month, y = reorder(Type, Count))) +
  geom_tile(aes(fill = Count)) + 
  scale_x_discrete("Months", expand = c(0, 0), position = "top") +
  scale_y_discrete("Crime types", expand = c(0, -2)) +
  scale_fill_gradient("Number of crimes", low = "white", high = "red") +
  ggtitle("Evolution by months") +
  theme_bw() +
  theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))

# By seasons
p4 <- dttest %>%
  group_by(Type, Season) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Season, y = reorder(Type, Count))) +
  geom_tile(aes(fill = Count)) + 
  scale_x_discrete("Seasons", expand = c(0, 0), position = "top") +
  scale_y_discrete("Crime types", expand = c(0, -2)) +
  scale_fill_gradient("Number of crimes", low = "white", high = "red") +
  ggtitle("Evolution by seasons") +
  theme_bw() +
  theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))

# Combine plots into one plot
ggarrange(p1, p2, p4, p3, ncol = 2, nrow = 2)

2.3) Evolution in different places of the city

In this part, we will make use of the column Location Description and try to answer the question: Which crime types are more likely to happen in which places?

The following plot shows how the crime types vary across places. Theft and Battery cases happened most often on the street and in residences, where people stay longer, which creates opportunities for theft. Moreover, in these places people have enough space to fight, so this result makes sense. Similarly, we find that Damage, Narcotics, Robbery and Moto cases were recorded almost entirely on the street.

2.4) Evolution in different locations of the city

In this part, we will make use of the columns Latitude and Longitude and try to answer the question: In which locations are the different crime types more likely to happen?

We can see that areas 8, 32, 28, 24 and 25 are particularly dangerous in terms of Theft, that community areas 25, 43 and 29 stand out in terms of Battery, that most Narcotics crimes are concentrated in areas 25, 23 and 29, and that Deceive cases are most frequent in the same areas as Theft cases: communities 8 and 32.

2.5) Domestic crimes

We can see that the number of domestic crimes increased slightly from 2014 to 2016 and then decreased from 2016 to 2018.
Moreover, most domestic crimes happened in the evening (18-24h) and during the weekend, when people are generally at home, which makes sense. There were also more domestic cases in summer, as for other cases, since high temperatures may make people more emotional and more prone to committing crimes. Finally, we find that most domestic cases happened on the street, in residences and in apartments. The latter two places are easy to understand. Street, the most frequent place for domestic crimes, makes sense if we understand these as cases involving conflicts between family members, which can also happen on the street.

# By time intervals
p1 <- dttest %>%
  filter(Domestic == T) %>%
  group_by(Tint) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Tint, y = Count)) +
  geom_bar(aes(x = Tint, y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Time intervals", y = "Number of domestic crimes", title = "Evolution by time intervals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) + 
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# By weekdays
p2 <- dttest %>%
  filter(Domestic == T) %>%
  group_by(Day) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Day, y = Count)) +
  geom_bar(aes(x = factor(Day, level = c("lun\\.", "mar\\.", "mer\\.", "jeu\\.", "ven\\.", "sam\\.", "dim\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Weekdays", y = "Number of domestic crimes", title = "Evolution by weekdays") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) + 
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# By months
p3 <- dttest %>%
  filter(Domestic == T) %>%
  group_by(Month) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Month, y = Count)) +
  geom_bar(aes(x = factor(Month, level = c("janv\\.", "févr\\.", "mars", "avr\\.", "mai", "juin", "juil\\.", "août", "sept\\.", "oct\\.", "nov\\.", "déc\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Months", y = "Number of domestic crimes", title = "Evolution by months") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# By seasons
p4 <- dttest %>%
  filter(Domestic == T) %>%
  group_by(Season) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Season, y = Count)) +
  geom_bar(aes(x = factor(Season, level = c("SPRING", "SUMMER", "FALL", "WINTER")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
  labs(x = "Seasons", y = "Number of domestic crimes", title = "Evolution by seasons") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
  theme(axis.title.x=element_blank()) +
  theme(axis.title.y=element_blank())

# Combine plots into one plot
ggarrange(p1, p2, p4, p3, ncol = 2, nrow = 2)

III. Efficiency

3.1) Number of crimes

Firstly, we can measure efficiency by the number of crimes over the years. From our previous analysis, we know that there is a decreasing trend, so from this point of view the Chicago Police Department has become more efficient.

3.2) Arrest rate

However, if we look at the evolution of the arrest rate for each year, we can see that it is decreasing, meaning that a smaller and smaller share of reported crimes led to an arrest. This suggests that the Chicago Police Department has become less and less efficient.
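
As a minimal sketch of how such a yearly arrest rate can be computed, assuming the rate is defined as the share of incidents with `Arrest == TRUE` in a given year (the data below is invented for illustration):

```r
# Toy data: one row per incident (values are made up)
crimes <- data.frame(
  Year   = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015),
  Arrest = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the share of TRUE values,
# so this gives the proportion of incidents with an arrest, per year
arrest_rate <- tapply(crimes$Arrest, crimes$Year, mean)
print(arrest_rate)
```

In this toy example the rate falls from 0.5 in 2014 to 0.25 in 2015, the kind of decline described above.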

3.3) Number of crimes and arrest rate in different locations

We can also see that, in general, the number of crimes decreased even in the 10 most dangerous community areas, which also points to an improvement in police efficiency. However, the arrest rate also decreased in these 10 areas, which suggests a deterioration in police efficiency. It is worth noting that between 2016 and 2017 the arrest rate increased in almost all of these 10 community areas (except community 28), which may suggest an improvement in police efficiency in these areas. Since we do not have complete data for 2018, we cannot conclude that efficiency decreased between 2017 and 2018, even though the graph shows a lower rate in this period.
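
The selection of the most dangerous areas and their yearly arrest rates can be sketched as follows. This uses toy data and base R, takes the top 2 areas instead of the top 10 to keep the example small, and assumes "most dangerous" means the highest total number of incidents. The column names `Community`, `Year` and `Arrest` are assumed to match the cleaned data.

```r
# Toy data: one row per incident (values are made up)
crimes <- data.frame(
  Community = c(25, 25, 25, 25, 8, 8, 8, 32, 32, 57),
  Year      = c(2016, 2016, 2017, 2017, 2016, 2017, 2017, 2016, 2017, 2016),
  Arrest    = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Rank communities by total number of incidents and keep the top N
top_n <- 2
counts <- sort(table(crimes$Community), decreasing = TRUE)
top_areas <- as.integer(names(counts)[seq_len(top_n)])

# Arrest rate per (community, year) within the top areas only
top <- crimes[crimes$Community %in% top_areas, ]
rates <- aggregate(Arrest ~ Community + Year, data = top, FUN = mean)
print(rates)
```

Comparing the rows of `rates` for the same community across years shows whether the arrest rate rose or fell in that area.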

3.4) Number of crimes and arrest rate for different crime types

Finally, let us use the arrest rate to analyse police efficiency for each crime type. In general, there is a decreasing trend in the arrest rate for almost all crime types, which suggests worsening efficiency (with an especially marked decline for Narcotics and Battery cases). Moreover, even though many crimes were reported each year, the arrest rate for most crime types is very low and stays low, which also suggests low efficiency.

In conclusion, if the arrest rate is a good measure of police efficiency, then the work of the Chicago police was not effective enough, at least between 2014 and 2018.