First, let us import the libraries that we need.
Then, let us import the data. Here I used the fread function from the data.table package because it imports large files faster than read.csv.
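As a rough illustration of the import step, the sketch below writes a tiny CSV file and reads it back with read.csv() and, if data.table happens to be installed, with fread(). The sample rows and the temporary file are assumptions made for this example only; the name of the real crime file is not given here.

```r
# Hypothetical two-row sample using a few of the real column names
sample_rows <- data.frame(
  ID = c(10000092L, 10000094L),
  `Case Number` = c("HY189866", "HY190059"),
  Arrest = c(FALSE, TRUE),
  check.names = FALSE
)

# Write the sample to a temporary file (stand-in for the real CSV)
tmp <- tempfile(fileext = ".csv")
write.csv(sample_rows, tmp, row.names = FALSE)

# Base R reader; check.names = FALSE keeps the space in "Case Number"
df_base <- read.csv(tmp, check.names = FALSE, stringsAsFactors = FALSE)
str(df_base)

# data.table reader, tried only when the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  dt_fast <- data.table::fread(tmp)
  print(identical(names(dt_fast), names(df_base)))
}
```

On a file with several million rows, the difference in reading time between the two functions is what motivates the choice of fread.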
## ID Case Number Date Block IUCR
## 1: 10000092 HY189866 03/18/2015 07:44:00 PM 047XX W OHIO ST 041A
## 2: 10000094 HY190059 03/18/2015 11:00:00 PM 066XX S MARSHFIELD AVE 4625
## 3: 10000095 HY190052 03/18/2015 10:45:00 PM 044XX S LAKE PARK AVE 0486
## 4: 10000096 HY190054 03/18/2015 10:30:00 PM 051XX S MICHIGAN AVE 0460
## 5: 10000097 HY189976 03/18/2015 09:00:00 PM 047XX W ADAMS ST 031A
## 6: 10000098 HY190032 03/18/2015 10:00:00 PM 049XX S DREXEL BLVD 0460
## Primary Type Description Location Description Arrest Domestic
## 1: BATTERY AGGRAVATED: HANDGUN STREET FALSE FALSE
## 2: OTHER OFFENSE PAROLE VIOLATION STREET TRUE FALSE
## 3: BATTERY DOMESTIC BATTERY SIMPLE APARTMENT FALSE TRUE
## 4: BATTERY SIMPLE APARTMENT FALSE FALSE
## 5: ROBBERY ARMED: HANDGUN SIDEWALK FALSE FALSE
## 6: BATTERY SIMPLE APARTMENT FALSE FALSE
## Beat District Ward Community Area FBI Code X Coordinate Y Coordinate Year
## 1: 1111 11 28 25 04B 1144606 1903566 2015
## 2: 725 7 15 67 26 1166468 1860715 2015
## 3: 222 2 4 39 08B 1185075 1875622 2015
## 4: 225 2 3 40 08B 1178033 1870804 2015
## 5: 1113 11 28 25 03 1144920 1898709 2015
## 6: 223 2 4 39 08B 1183018 1872537 2015
## Updated On Latitude Longitude Location
## 1: 02/10/2018 03:50:01 PM 41.89140 -87.74438 (41.891398861, -87.744384567)
## 2: 02/10/2018 03:50:01 PM 41.77337 -87.66532 (41.773371528, -87.665319468)
## 3: 02/10/2018 03:50:01 PM 41.81386 -87.59664 (41.81386068, -87.596642837)
## 4: 02/10/2018 03:50:01 PM 41.80080 -87.62262 (41.800802415, -87.622619343)
## 5: 02/10/2018 03:50:01 PM 41.87806 -87.74335 (41.878064761, -87.743354013)
## 6: 02/10/2018 03:50:01 PM 41.80544 -87.60428 (41.805443345, -87.604283976)
## Classes 'data.table' and 'data.frame': 6635842 obs. of 22 variables:
## $ ID : int 10000092 10000094 10000095 10000096 10000097 10000098 10000099 10000100 10000101 10000104 ...
## $ Case Number : chr "HY189866" "HY190059" "HY190052" "HY190054" ...
## $ Date : chr "03/18/2015 07:44:00 PM" "03/18/2015 11:00:00 PM" "03/18/2015 10:45:00 PM" "03/18/2015 10:30:00 PM" ...
## $ Block : chr "047XX W OHIO ST" "066XX S MARSHFIELD AVE" "044XX S LAKE PARK AVE" "051XX S MICHIGAN AVE" ...
## $ IUCR : chr "041A" "4625" "0486" "0460" ...
## $ Primary Type : chr "BATTERY" "OTHER OFFENSE" "BATTERY" "BATTERY" ...
## $ Description : chr "AGGRAVATED: HANDGUN" "PAROLE VIOLATION" "DOMESTIC BATTERY SIMPLE" "SIMPLE" ...
## $ Location Description: chr "STREET" "STREET" "APARTMENT" "APARTMENT" ...
## $ Arrest : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
## $ Domestic : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ Beat : int 1111 725 222 225 1113 223 733 213 912 511 ...
## $ District : int 11 7 2 2 11 2 7 2 9 5 ...
## $ Ward : int 28 15 4 3 28 4 17 3 11 6 ...
## $ Community Area : int 25 67 39 40 25 39 68 38 59 49 ...
## $ FBI Code : chr "04B" "26" "08B" "08B" ...
## $ X Coordinate : int 1144606 1166468 1185075 1178033 1144920 1183018 1170859 1178746 1164279 1179637 ...
## $ Y Coordinate : int 1903566 1860715 1875622 1870804 1898709 1872537 1858210 1876914 1880656 1840444 ...
## $ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ Updated On : chr "02/10/2018 03:50:01 PM" "02/10/2018 03:50:01 PM" "02/10/2018 03:50:01 PM" "02/10/2018 03:50:01 PM" ...
## $ Latitude : num 41.9 41.8 41.8 41.8 41.9 ...
## $ Longitude : num -87.7 -87.7 -87.6 -87.6 -87.7 ...
## $ Location : chr "(41.891398861, -87.744384567)" "(41.773371528, -87.665319468)" "(41.81386068, -87.596642837)" "(41.800802415, -87.622619343)" ...
## - attr(*, ".internal.selfref")=<externalptr>
By using the function str(), we can see that there are 6635842 observations of 22 variables in the dataset.
Here is a list of descriptions of the variables:
- Arrest: indicates whether an arrest was made (TRUE if an arrest was made, and FALSE if an arrest was not made).
- Domestic: indicates whether the incident was domestic (TRUE if it was domestic, and FALSE if it was not domestic).
Moreover, we remark that the types of the variables Date, IUCR, Primary Type, Beat, District, Ward, Community Area, FBI Code and Updated On are not appropriate. The two date-related variables Date and Updated On should be coded as dates, while the other categorical variables should be coded as factors. As we are going to delete some unneeded columns, we will do the type conversion at the end of the data cleaning part.
The function summary() provides a detailed summary of the data.
## ID Case Number Date Block
## Min. : 634 Length:6635842 Length:6635842 Length:6635842
## 1st Qu.: 3391616 Class :character Class :character Class :character
## Median : 6119792 Mode :character Mode :character Mode :character
## Mean : 6137996
## 3rd Qu.: 8712503
## Max. :11364574
##
## IUCR Primary Type Description Location Description
## Length:6635842 Length:6635842 Length:6635842 Length:6635842
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Arrest Domestic Beat District Ward
## Mode :logical Mode :logical Min. : 111 Min. : 1.0 Min. : 1.0
## FALSE:4786167 FALSE:5768582 1st Qu.: 622 1st Qu.: 6.0 1st Qu.:10.0
## TRUE :1849675 TRUE :867260 Median :1111 Median :10.0 Median :22.0
## Mean :1193 Mean :11.3 Mean :22.7
## 3rd Qu.:1731 3rd Qu.:17.0 3rd Qu.:34.0
## Max. :2535 Max. :31.0 Max. :50.0
## NA's :51 NA's :614854
## Community Area FBI Code X Coordinate Y Coordinate
## Min. : 0.0 Length:6635842 Min. : 0 Min. : 0
## 1st Qu.:23.0 Class :character 1st Qu.:1152930 1st Qu.:1859170
## Median :32.0 Mode :character Median :1165964 Median :1890459
## Mean :37.6 Mean :1164504 Mean :1885693
## 3rd Qu.:58.0 3rd Qu.:1176352 3rd Qu.:1909321
## Max. :77.0 Max. :1205119 Max. :1951622
## NA's :616030 NA's :58603 NA's :58603
## Year Updated On Latitude Longitude
## Min. :2001 Length:6635842 Min. :36.62 Min. :-91.69
## 1st Qu.:2004 Class :character 1st Qu.:41.77 1st Qu.:-87.71
## Median :2008 Mode :character Median :41.86 Median :-87.67
## Mean :2008 Mean :41.84 Mean :-87.67
## 3rd Qu.:2012 3rd Qu.:41.91 3rd Qu.:-87.63
## Max. :2018 Max. :42.02 Max. :-87.52
## NA's :58603 NA's :58603
## Location
## Length:6635842
## Class :character
## Mode :character
##
##
##
##
Now, let us clean the original dataset (restricted to the past 5 years, because my computer cannot handle the whole dataset) to get a final dataset named dttest that provides useful information for our analysis.
Firstly, we remark that the data are stored at the crime incident level: each observation in the data table records one crime incident. Each incident has a unique identifier, given by the first two columns ID and Case Number. As we only need one identifier for each incident, we will delete the ID column.
Moreover, we decide to use only the variable Primary Type as the description of the crime incident. Hence we will delete the columns IUCR, Description, and FBI Code.
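The restriction to recent years and the removal of columns can be sketched in base R as follows. The toy rows are hypothetical, and the cutoff year 2014 is an assumption based on the period 2014 to 2018 analysed later.

```r
# Toy stand-in for the crime table (hypothetical rows, real column names)
crimes <- data.frame(
  ID = 1:4,
  `Case Number` = c("A1", "A2", "A3", "A4"),
  IUCR = c("041A", "4625", "0486", "0460"),
  Description = c("SIMPLE", "SIMPLE", "SIMPLE", "SIMPLE"),
  `FBI Code` = c("04B", "26", "08B", "08B"),
  Year = c(2012L, 2014L, 2016L, 2018L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only the most recent years (assumed cutoff: 2014)
recent <- crimes[crimes$Year >= 2014, ]

# Drop the columns that are not needed for the analysis
drop_cols <- c("ID", "IUCR", "Description", "FBI Code")
recent <- recent[, !(names(recent) %in% drop_cols), drop = FALSE]

print(recent)
```

With a data.table, the same columns could instead be removed by reference by assigning NULL to them with the := operator.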
We rename some variables to simplify our code.
# Rename some variables
setnames(dttest, c("Case Number", "Primary Type", "Location Description", "Community Area"), c("Case", "Type", "Locdescrip", "Community"))
By using the function any(duplicated()), we remark that some instances are duplicated, which means that there are two or more rows having the same Case Number. These duplicated rows need to be removed.
## [1] TRUE
# Remove duplicates according to Case Number
dttest <- dttest[!duplicated(dttest[["Case"]])]
# Test again to ensure that there are no more duplicates
any(duplicated(dttest[["Case"]]))
## [1] FALSE
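To make the behaviour of duplicated() concrete, here is a small sketch on hypothetical case numbers. It shows that only the first row of each case number is kept, even when the later rows differ in other columns.

```r
# Hypothetical incidents; "HY1" and "HY2" each appear twice
cases <- data.frame(
  Case = c("HY1", "HY2", "HY1", "HY3", "HY2"),
  Type = c("THEFT", "BATTERY", "THEFT", "ARSON", "ROBBERY"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of a value
dup_flag <- duplicated(cases$Case)
print(any(dup_flag))

# Keep the first occurrence of every case number
deduped <- cases[!dup_flag, ]
print(deduped)
```

Note that the second "HY2" row is dropped although its Type differs from the first one, so this rule silently prefers the earliest record.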
By using the function any(is.na()), we remark that there are some missing values in the dataset. Depending on the meaning and type of the variable, these missing values need to be either substituted logically or removed.
## [1] TRUE
## ID Case Date Block IUCR Type
## 0 1 0 0 0 0
## Description Locdescrip Arrest Domestic Beat District
## 0 2751 0 0 0 5
## Ward Community FBI Code X Coordinate Y Coordinate Year
## 6 1 0 11489 11489 0
## Updated On Latitude Longitude Location
## 0 11489 11489 11489
Firstly, we remark that certain records do not have any information about the location where the crime occurred. In other words, there are some missing values in X Coordinate, Y Coordinate, Latitude, Longitude and Location. We tried to replace the NAs in the Latitude column with the values of rows having the same X Coordinate, since both columns encode the address information. The number of NAs in the Latitude column did not change, so we conclude that these values cannot be substituted using logical connections with other variables. However, since the percentage of these missing values is relatively small, we can safely ignore these records.
# Try to replace NAs with similar values
dttest$`Latitude` <- na.omit(dttest$`Latitude`)[match(dttest$`X Coordinate`, na.omit(dttest$`X Coordinate`))]
colSums(is.na(dttest))
## ID Case Date Block IUCR Type
## 0 1 0 0 0 0
## Description Locdescrip Arrest Domestic Beat District
## 0 2751 0 0 0 5
## Ward Community FBI Code X Coordinate Y Coordinate Year
## 6 1 0 11489 11489 0
## Updated On Latitude Longitude Location
## 0 11489 11489 11489
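One caveat about the one-line assignment above: it rewrites the whole Latitude column, so a row that already has a latitude receives the first latitude found for the same X Coordinate. A more conservative variant, sketched below with a hypothetical helper fill_from_key() and toy values, only touches the entries that are actually missing.

```r
# Fill missing entries of `value` using rows that share the same `key`.
# Existing (non-missing) values are left untouched.
fill_from_key <- function(value, key) {
  known <- !is.na(value) & !is.na(key)
  # For each row, the value of the first known row with the same key
  lookup <- value[known][match(key, key[known])]
  ifelse(is.na(value), lookup, value)
}

# Toy example: the second row can be filled, the fourth cannot
lat <- c(41.8, NA, 41.9, NA)
xcoord <- c(100, 100, 200, NA)
filled <- fill_from_key(lat, xcoord)
print(filled)

# Known values are kept even when they share a key with another row
kept <- fill_from_key(c(1, 2, NA), c("a", "a", "a"))
print(kept)
```

When the key is missing as well, as for the records discussed here, nothing can be filled and the count of NAs stays the same.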
# Remove NA in latitude, longitude, location
dttest <- dttest[!is.na(dttest[["Latitude"]])]
# We do not need to repeat the operation for each column: after removing the rows with missing values in the column "Latitude", all missing values concerning the location disappear, since these columns are related.
colSums(is.na(dttest))
## ID Case Date Block IUCR Type
## 0 1 0 0 0 0
## Description Locdescrip Arrest Domestic Beat District
## 0 2094 0 0 0 5
## Ward Community FBI Code X Coordinate Y Coordinate Year
## 6 1 0 0 0 0
## Updated On Latitude Longitude Location
## 0 0 0 0
Secondly, we find that one of the values in Case Number is missing, which seems to be a data recording issue, so we can safely ignore the associated observation.
## ID Case Date Block IUCR Type
## 0 0 0 0 0 0
## Description Locdescrip Arrest Domestic Beat District
## 0 2094 0 0 0 5
## Ward Community FBI Code X Coordinate Y Coordinate Year
## 6 1 0 0 0 0
## Updated On Latitude Longitude Location
## 0 0 0 0
Finally, we find that there are some missing values in the columns Locdescrip, District, Ward and Community. If an observation has the same value in the Beat and/or Location columns as another record without missing values, the two incidents occurred at the same place and should therefore share the same values in these columns. In that case, the NAs can be substituted using this logical connection. Otherwise, we need to remove the observations with NAs that cannot be replaced properly.
# Replace NAs for Location Description using records in Location
dttest$`Locdescrip` <- na.omit(dttest$`Locdescrip`)[match(dttest$`Location`, na.omit(dttest$`Location`))]
# Replace NAs for District using records in Beat
dttest$`District` <- na.omit(dttest$`District`)[match(dttest$`Beat`, na.omit(dttest$`Beat`))]
# Replace NAs for Ward using records in Location
dttest$`Ward` <- na.omit(dttest$`Ward`)[match(dttest$`Location`, na.omit(dttest$`Location`))]
# Replace NAs for Community Area using records in Location
dttest$`Community` <- na.omit(dttest$`Community`)[match(dttest$`Location`, na.omit(dttest$`Location`))]
colSums(is.na(dttest))
## ID Case Date Block IUCR Type
## 0 0 0 0 0 0
## Description Locdescrip Arrest Domestic Beat District
## 0 277 0 0 0 0
## Ward Community FBI Code X Coordinate Y Coordinate Year
## 0 0 0 0 0 0
## Updated On Latitude Longitude Location
## 0 0 0 0
# Remove the observations containing NAs that cannot be replaced in the column Locdescrip
dttest <- dttest[!is.na(dttest[["Locdescrip"]])]
# Test again to make sure that there are no more missing values
any(is.na(dttest))
## [1] FALSE
To be cautious, let us check for obviously invalid values and inconsistencies here.
## integer(0)
Hence, there are no obviously invalid values in the Year column.
As for inconsistencies, we know from the description of the dataset that there are 22 police districts and 77 community areas. Let us check whether our data are consistent with this.
## [1] 22
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14
## 59138 50003 56747 67969 52296 71919 64860 76180 55077 54377 83988 57470 42303
## 15 16 17 18 19 20 22 24 25
## 50358 40670 34027 57553 52607 19431 38051 33061 64826
## [1] 78
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 3 16839 14263 16218 8042 5661 25157 18398 47993 1065 4811 4578 1887
## 13 14 15 16 17 18 19 20 21 22 23 24 25
## 3973 10600 14117 12478 6445 2475 20413 6933 9912 21688 37101 33485 75946
## 26 27 28 29 30 31 32 33 34 35 36 37 38
## 26467 23544 38779 39454 20222 11231 41042 8400 4647 11701 2975 3987 14842
## 39 40 41 42 43 44 45 46 47 48 49 50 51
## 6456 12899 7182 17216 39751 28255 5797 21715 1658 6073 30545 5016 8290
## 52 53 54 55 56 57 58 59 60 61 62 63 64
## 5465 17673 5786 2474 8176 3938 10949 4434 6874 21377 4381 10295 3789
## 65 66 67 68 69 70 71 72 73 74 75 76 77
## 8680 26741 32226 29434 29711 10325 34243 3757 13731 2403 8887 8113 10424
We remark that there are 22 districts and 78 community areas in our dataset: the number of districts matches the description, but there is one more community area than expected.
Regarding districts, the summary of the full dataset above shows a maximum district code of 31, which is not one of the 22 regular districts. After searching on Google, I found that 31 is the code of the headquarters of the Chicago Police Department and that some areas are handled by district #31, so such records can be kept.
However, for the community areas, we remark that some records have 0 as their community area code. As I could not find any explanation of the code 0 on the Internet, and since only 3 cases have this code, I decided to remove these records.
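The consistency check and the removal of the code 0 can be sketched as follows. The vector of community codes is hypothetical, and the valid range 1 to 77 is taken from the dataset description mentioned above.

```r
# Hypothetical community area codes, including the unexplained code 0
community <- c(25L, 67L, 0L, 39L, 25L, 0L, 77L)

# Number of distinct codes and number of records per code
n_codes <- length(unique(community))
counts <- table(community)
print(n_codes)
print(counts)

# Keep only the records whose code is one of the 77 documented areas
valid <- community %in% 1:77
kept <- community[valid]
print(kept)
```

On the real table, the logical vector valid would be used to subset the rows of dttest rather than a single vector.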
As we need these variables neither for the operations below nor for further analysis, we can now delete the following columns: Block, Ward, X Coordinate, Y Coordinate and Updated On.
From the str() function, we know that the Date column is stored as a string variable. To make R understand that it is in fact a date, we can use the parse_date_time() function from the lubridate package (base R's as.POSIXlt() is an alternative).
# Transform the variable Date from string to date object
dttest[["Date"]] <- parse_date_time(dttest[["Date"]], orders = "mdY IMSp")
In order to analyze the number of crimes in different time intervals of the day, we prepare four time intervals: from 0H to 5H, from 6H to 11H, from 12H to 17H and from 18H to 24H. Then we match each observation to one of these four time intervals.
# Create the break points of the four time intervals
tint <- c(0, 5.9, 11.9, 17.9, 23.9)
# Extract hours
hours <- hour(dttest[["Date"]])
# Match each hour to its interval
dttest[["Tint"]] <- cut(hours, breaks = tint, labels = c("0-5H", "6-11H", "12-17H", "18-24H"), include.lowest = TRUE)
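To check how the break points assign hours to intervals, here is a small base R sketch on a few hand-picked hours (the hour values are chosen for illustration). It also shows why include.lowest is needed: without it, the hour 0 falls outside the first interval.

```r
# Boundary hours of each interval
hours <- c(0, 5, 6, 11, 12, 17, 18, 23)
breaks <- c(0, 5.9, 11.9, 17.9, 23.9)
labels <- c("0-5H", "6-11H", "12-17H", "18-24H")

# Intervals are (lower, upper]; include.lowest closes the first one at 0
slot <- cut(hours, breaks = breaks, labels = labels, include.lowest = TRUE)
print(data.frame(hours, slot))

# Without include.lowest, midnight (hour 0) is not assigned to any interval
dropped <- cut(0, breaks = breaks, labels = labels)
print(dropped)
```

Each pair of boundary hours lands in the expected interval, so the fractional break points behave like integer bins on whole hours.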
Finally, in order to analyze the evolution of crime incidents over weekdays and months, we create two more variables, Day and Month. Moreover, we can also compare the number of incidents that occurred during different quarters/seasons of the year. Hence, let us prepare four season intervals: SPRING, SUMMER, FALL, and WINTER, and match each observation to one of them.
# Create the column Day showing the weekday when the incident occurred
dttest[["Day"]] <- wday(dttest[["Date"]], label = TRUE)
# Create the column Month showing the month when the incident occurred
dttest[["Month"]] <- month(dttest[["Date"]], label = TRUE)
# Extract quarters
quarters <- quarter(dttest$Date)
# Create the break points of the four season intervals
sint <- c(0.9, 1.9, 2.9, 3.9, 4.9)
# Match each quarter to its season label
dttest[["Season"]] <- cut(quarters, breaks = sint, labels = c("SPRING", "SUMMER", "FALL", "WINTER"))
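Since the season labels are attached to calendar quarters, it helps to see which months end up under which label. The sketch below reproduces the mapping in base R, computing the quarter from the month number instead of calling quarter(); this substitution is an assumption made to keep the example free of package dependencies.

```r
# Month numbers 1..12 and their calendar quarter (1..4)
months <- 1:12
quarter_num <- (months - 1) %/% 3 + 1

# Same break points and labels as in the Season column
season <- cut(quarter_num,
              breaks = c(0.9, 1.9, 2.9, 3.9, 4.9),
              labels = c("SPRING", "SUMMER", "FALL", "WINTER"))

# Named vector: month abbreviation -> season label
mapping <- setNames(as.character(season), month.abb)
print(mapping)
```

With this definition, SPRING covers January to March and SUMMER covers April to June, which should be kept in mind when the seasonal results are interpreted.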
Here we use the Type column to distinguish different incident types. Let us take a look at this column.
##
## ARSON ASSAULT
## 1956 81281
## BATTERY BURGLARY
## 220539 59984
## CONCEALED CARRY LICENSE VIOLATION CRIM SEXUAL ASSAULT
## 209 6228
## CRIMINAL DAMAGE CRIMINAL TRESPASS
## 128969 30348
## DECEPTIVE PRACTICE GAMBLING
## 72899 1156
## HOMICIDE HUMAN TRAFFICKING
## 2493 39
## INTERFERENCE WITH PUBLIC OFFICER INTIMIDATION
## 5337 574
## KIDNAPPING LIQUOR LAW VIOLATION
## 891 1231
## MOTOR VEHICLE THEFT NARCOTICS
## 47105 80985
## NON-CRIMINAL NON-CRIMINAL (SUBJECT SPECIFIED)
## 130 5
## NON - CRIMINAL OBSCENITY
## 35 252
## OFFENSE INVOLVING CHILDREN OTHER NARCOTIC VIOLATION
## 9805 30
## OTHER OFFENSE PROSTITUTION
## 76428 4816
## PUBLIC INDECENCY PUBLIC PEACE VIOLATION
## 51 9036
## ROBBERY SEX OFFENSE
## 47631 4086
## STALKING THEFT
## 735 270671
## WEAPONS VIOLATION
## 16973
## [1] 33
We remark that there are 33 different incident types. In order to simplify our analysis without loss of generality, we can regroup some “small” types into one type.
# Regroup some "small" types
dttest[["Type"]] <- ifelse(dttest[["Type"]] %in% c("CRIMINAL DAMAGE"), "DAMAGE",
ifelse(dttest[["Type"]] %in% c("DECEPTIVE PRACTICE"), "DECEIVE",
ifelse(dttest[["Type"]] %in% c("KIDNAPPING", "OFFENSE INVOLVING CHILDREN", "HUMAN TRAFFICKING"), "HUMANCHILD",
ifelse(dttest[["Type"]] %in% c("NARCOTICS", "OTHER NARCOTIC VIOLATION"), "NARCOTICS",
ifelse(dttest[["Type"]] %in% c("MOTOR VEHICLE THEFT"), "MOTO",
ifelse(dttest[["Type"]] %in% c("OTHER OFFENSE"), "OTHER",
ifelse(dttest[["Type"]] %in% c("CRIM SEXUAL ASSAULT", "PROSTITUTION", "SEX OFFENSE"), "SEX",
ifelse(dttest[["Type"]] %in% c("GAMBLING", "INTERFERENCE WITH PUBLIC OFFICER", "INTIMIDATION", "LIQUOR LAW VIOLATION", "OBSCENITY", "PUBLIC INDECENCY", "PUBLIC PEACE VIOLATION", "STALKING", "NON-CRIMINAL", "NON-CRIMINAL (SUBJECT SPECIFIED)", "NON - CRIMINAL"), "SOCIETY",
ifelse(dttest[["Type"]] %in% c("CRIMINAL TRESPASS"), "TRESPASS",
ifelse(dttest[["Type"]] %in% c("CONCEALED CARRY LICENSE VIOLATION", "WEAPONS VIOLATION"), "WEAPONS", dttest[["Type"]]))))))))))
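The nested ifelse() calls work, but they become hard to read as the number of groups grows. An alternative is a named lookup vector, sketched below with a hypothetical helper recode_type() and only a subset of the groups used above.

```r
# Lookup table: original type -> regrouped type (subset, for illustration)
regroup <- c(
  "CRIMINAL DAMAGE" = "DAMAGE",
  "DECEPTIVE PRACTICE" = "DECEIVE",
  "MOTOR VEHICLE THEFT" = "MOTO",
  "OTHER NARCOTIC VIOLATION" = "NARCOTICS",
  "CONCEALED CARRY LICENSE VIOLATION" = "WEAPONS",
  "WEAPONS VIOLATION" = "WEAPONS"
)

# Replace the values found in the lookup table, keep all others unchanged
recode_type <- function(x, map) {
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

types <- c("THEFT", "CRIMINAL DAMAGE", "WEAPONS VIOLATION",
           "BATTERY", "OTHER NARCOTIC VIOLATION")
recoded <- recode_type(types, regroup)
print(recoded)
```

Adding or moving a type then only requires editing one entry of the lookup vector, and the same helper could be reused for the location descriptions.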
Similarly, we can also regroup some location descriptions.
dttest[["Locdescrip"]] <- ifelse(dttest[["Locdescrip"]] %in% c("VEHICLE-COMMERCIAL", "VEHICLE - DELIVERY TRUCK", "VEHICLE - OTHER RIDE SERVICE", "VEHICLE - OTHER RIDE SHARE SERVICE (E.G., UBER, LYFT)", "VEHICLE NON-COMMERCIAL", "TRAILER", "TRUCK", "DELIVERY TRUCK", "TAXICAB", "OTHER COMMERCIAL TRANSPORTATION"), "VEHICLE",
ifelse(dttest[["Locdescrip"]] %in% c("BAR OR TAVERN", "TAVERN", "TAVERN/LIQUOR STORE"), "TAVERN",
ifelse(dttest[["Locdescrip"]] %in% c("SCHOOL YARD", "SCHOOL, PRIVATE, BUILDING", "SCHOOL, PRIVATE, GROUNDS", "SCHOOL, PUBLIC, BUILDING", "SCHOOL, PUBLIC, GROUNDS", "COLLEGE/UNIVERSITY GROUNDS", "COLLEGE/UNIVERSITY RESIDENCE HALL"), "SCHOOL",
ifelse(dttest[["Locdescrip"]] %in% c("RESIDENCE", "RESIDENCE-GARAGE", "RESIDENCE PORCH/HALLWAY", "RESIDENTIAL YARD (FRONT/BACK)", "DRIVEWAY - RESIDENTIAL", "GARAGE", "HOUSE", "PORCH", "YARD"), "RESIDENCE",
ifelse(dttest[["Locdescrip"]] %in% c("PARKING LOT", "PARKING LOT/GARAGE(NON.RESID.)", "POLICE FACILITY/VEH PARKING LOT"), "PARKING",
ifelse(dttest[["Locdescrip"]] %in% c("OTHER", "OTHER RAILROAD PROP / TRAIN DEPOT", "ABANDONED BUILDING", "ANIMAL HOSPITAL", "ATHLETIC CLUB", "BASEMENT", "BOAT/WATERCRAFT", "CHURCH", "CHURCH/SYNAGOGUE/PLACE OF WORSHIP", "COIN OPERATED MACHINE", "CONSTRUCTION SITE", "SEWER", "STAIRWELL", "VACANT LOT", "VACANT LOT/LAND", "VESTIBULE", "WOODED AREA", "FARM", "FACTORY", "FACTORY/MANUFACTURING BUILDING", "FEDERAL BUILDING", "FIRE STATION", "FOREST PRESERVE", "GOVERNMENT BUILDING", "GOVERNMENT BUILDING/PROPERTY", "JAIL / LOCK-UP FACILITY", "LIBRARY", "MOVIE HOUSE/THEATER", "POOL ROOM", "SPORTS ARENA/STADIUM", "WAREHOUSE", "AUTO", "AUTO / BOAT / RV DEALERSHIP", "CEMETARY"), "OTHERS",
ifelse(dttest[["Locdescrip"]] %in% c("COMMERCIAL / BUSINESS OFFICE"), "BIGBUSINESS",
ifelse(dttest[["Locdescrip"]] %in% c("PARK PROPERTY"), "PARK",
ifelse(dttest[["Locdescrip"]] %in% c("ATM (AUTOMATIC TELLER MACHINE)", "BANK", "CREDIT UNION", "CURRENCY EXCHANGE", "SAVINGS AND LOAN"), "BANK",
ifelse(dttest[["Locdescrip"]] %in% c("HOTEL", "HOTEL/MOTEL"), "HOTEL",
ifelse(dttest[["Locdescrip"]] %in% c("HOSPITAL", "HOSPITAL BUILDING/GROUNDS", "DAY CARE CENTER", "NURSING HOME", "NURSING HOME/RETIREMENT HOME", "MEDICAL/DENTAL OFFICE"), "HEALTH",
ifelse(dttest[["Locdescrip"]] %in% c("ALLEY", "BOWLING ALLEY"), "ALLEY",
ifelse(dttest[["Locdescrip"]] %in% c("CHA APARTMENT", "CHA HALLWAY/STAIRWELL/ELEVATOR", "CHA PARKING LOT", "CHA PARKING LOT/GROUNDS"), "CHA",
ifelse(dttest[["Locdescrip"]] %in% c("CTA BUS", "CTA BUS STOP", "CTA GARAGE / OTHER PROPERTY", "CTA PLATFORM", "CTA STATION", "CTA TRACKS - RIGHT OF WAY", "CTA TRAIN", "CTA \"\"L\"\" TRAIN"), "CTA",
ifelse(dttest[["Locdescrip"]] %in% c("AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA", "AIRPORT BUILDING NON-TERMINAL - SECURE AREA", "AIRPORT EXTERIOR - NON-SECURE AREA", "AIRPORT EXTERIOR - SECURE AREA", "AIRPORT PARKING LOT", "AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL LOWER LEVEL - SECURE AREA", "AIRPORT TERMINAL MEZZANINE - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - SECURE AREA", "AIRPORT TRANSPORTATION SYSTEM (ATS)", "AIRPORT VENDING ESTABLISHMENT", "AIRPORT/AIRCRAFT", "AIRCRAFT"), "AIRPORT",
ifelse(dttest[["Locdescrip"]] %in% c("APPLIANCE STORE", "BARBERSHOP", "CAR WASH", "CLEANING STORE", "CONVENIENCE STORE", "DEPARTMENT STORE", "DRUG STORE", "GARAGE/AUTO REPAIR", "GAS STATION", "GAS STATION DRIVE/PROP.", "GROCERY FOOD STORE", "NEWSSTAND", "OFFICE", "PAWN SHOP", "RETAIL STORE", "SMALL RETAIL STORE"), "STORE",
ifelse(dttest[["Locdescrip"]] %in% c("BRIDGE", "DRIVEWAY", "GANGWAY", "HIGHWAY/EXPRESSWAY", "LAKEFRONT/WATERFRONT/RIVERBANK", "SIDEWALK", "STREET", "HALLWAY"), "STREET",
dttest[["Locdescrip"]])))))))))))))))))
At the end, let us reorder the columns and normalize the types of the variables.
# Set dttest as data.frame
dttest <- as.data.frame(dttest)
# Reorder columns
dttest <- dttest[c("Case", "Date", "Year", "Month", "Day", "Season", "Tint", "Type", "Arrest", "Domestic", "Locdescrip", "Beat", "District", "Community", "Latitude", "Longitude", "Location")]
# Normalize variables
dttest[, c("Beat", "Type", "District", "Community", "Month", "Day", "Locdescrip")] <- lapply(dttest[, c("Beat", "Type", "District", "Community", "Month", "Day", "Locdescrip")], as.factor)
Here is a general overview of our dataset after data cleaning. We will use this dataset dttest for further analysis.
## Rows: 1,182,908
## Columns: 17
## $ Case <chr> "HY189866", "HY190059", "HY190052", "HY190054", "HY18997...
## $ Date <dttm> 2015-03-18 19:44:00, 2015-03-18 23:00:00, 2015-03-18 22...
## $ Year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20...
## $ Month <ord> mars, mars, mars, mars, mars, mars, mars, mars, mars, ma...
## $ Day <ord> mer\., mer\., mer\., mer\., mer\., mer\., mer\., mer\., ...
## $ Season <fct> SPRING, SPRING, SPRING, SPRING, SPRING, SPRING, SPRING, ...
## $ Tint <fct> 18-24H, 18-24H, 18-24H, 18-24H, 18-24H, 18-24H, 18-24H, ...
## $ Type <fct> BATTERY, OTHER, BATTERY, BATTERY, ROBBERY, BATTERY, BATT...
## $ Arrest <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ Domestic <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FAL...
## $ Locdescrip <fct> STREET, STREET, APARTMENT, APARTMENT, STREET, APARTMENT,...
## $ Beat <fct> 1111, 725, 222, 225, 1113, 223, 733, 213, 912, 511, 533,...
## $ District <fct> 11, 7, 2, 2, 11, 2, 7, 2, 9, 5, 5, 6, 4, 12, 15, 4, 14, ...
## $ Community <fct> 25, 67, 39, 40, 25, 39, 68, 38, 59, 49, 54, 69, 46, 28, ...
## $ Latitude <dbl> 41.89140, 41.77337, 41.81386, 41.80080, 41.87806, 41.805...
## $ Longitude <dbl> -87.74438, -87.66532, -87.59664, -87.62262, -87.74335, -...
## $ Location <chr> "(41.891398861, -87.744384567)", "(41.773371528, -87.665...
In this part, we are trying to answer the question: How has crime evolved over time in Chicago?
At first, let us plot the number of crimes for each year from 2014 to 2018. We can see that crime in Chicago has been decreasing over the years. From 2014 to 2017, the number of crimes was on average constant, then there was a significant decrease from 2017 to 2018. The reason for this significant decrease is that we do not have data for the whole year of 2018. Hence, that decline does not necessarily imply a sudden improvement of the crime situation in Chicago.
However, the general decreasing trend could be interpreted as an improvement in the efficiency of the Chicago Police Department, because we can attribute the decline in the number of incidents to a stronger reputation of the Chicago Police Department.
dttest %>%
group_by(Year) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Year, y = Count)) +
geom_line(colour = "red") +
geom_point(colour = "red") +
geom_bar(aes(x = Year, y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Year", y = "Number of crimes", title = "Evolution of number of crimes") +
geom_text(aes(x = Year, y = Count, label = Count), size = 3, vjust = -1, position = position_dodge(0.9)) +
theme_minimal() +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
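Because the last year is incomplete, raw yearly totals are not directly comparable. One possible correction, sketched below on hypothetical dates, is to divide each yearly total by the number of months actually observed in that year.

```r
# Hypothetical incident dates: a full-looking 2017 and a partial 2018
dates <- as.Date(c("2017-01-15", "2017-06-03", "2017-06-20", "2017-12-24",
                   "2018-01-05", "2018-02-11"))
year <- format(dates, "%Y")
month <- format(dates, "%m")

# Total incidents per year
total <- c(table(year))

# Number of distinct months with at least one incident, per year
months_seen <- tapply(month, year, function(m) length(unique(m)))

# Incidents per observed month, comparable across partial years
per_month <- total / months_seen
print(data.frame(year = names(total),
                 total = as.vector(total),
                 months_seen = as.vector(months_seen),
                 per_month = as.vector(per_month)))
```

This is only a rough normalisation, since it ignores seasonality: the months observed in a partial year are not necessarily representative of the whole year.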
The following plot shows how the number of crimes changes along different time dimensions: by time intervals, by weekdays, by months and by seasons.
We can see that, from 2014 to 2018 in Chicago, there is no big difference in the frequency of crimes across weekdays.
However, we can find some patterns in the occurrence of crimes:
- Most incidents happened in the second part of the day, i.e., in the afternoon and at night.
- Incidents were more likely to occur on Fridays and Saturdays.
- Crimes were more likely to happen in May and in June, while they were less frequent in November and in December.
- Most crimes happened in summer and there were relatively fewer incidents in winter. This is consistent with the results obtained from the figure Evolution by months. This result is also plausible, since temperature may affect people’s emotions and hence have an impact on the frequency of crime incidents: during summer months, incidents may be more frequent because high temperatures make people more emotional, and vice versa in winter months.
# By time intervals
p1 <- dttest %>%
group_by(Tint) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Tint, y = Count)) +
geom_bar(aes(x = Tint, y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Time intervals", y = "Number of crimes", title = "Evolution by time intervals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By weekdays
p2 <- dttest %>%
group_by(Day) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Day, y = Count)) +
geom_bar(aes(x = factor(Day, level = c("lun\\.", "mar\\.", "mer\\.", "jeu\\.", "ven\\.", "sam\\.", "dim\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Weekdays", y = "Number of crimes", title = "Evolution by weekdays") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By months
p3 <- dttest %>%
group_by(Month) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Month, y = Count)) +
geom_bar(aes(x = factor(Month, level = c("janv\\.", "févr\\.", "mars", "avr\\.", "mai", "juin", "juil\\.", "août", "sept\\.", "oct\\.", "nov\\.", "déc\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Months", y = "Number of crimes", title = "Evolution by months") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By seasons
p4 <- dttest %>%
group_by(Season) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Season, y = Count)) +
geom_bar(aes(x = factor(Season, level = c("SPRING", "SUMMER", "FALL", "WINTER")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Seasons", y = "Number of crimes", title = "Evolution by seasons") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# Combine plots into one plot
ggarrange(p1, p2, p4, p3, ncol = 2, nrow = 2)
In this part, we will make use of the column Location Description and try to answer the question: In which places is crime more likely to happen?
The following plot shows how the number of crimes varies across places. We can see that most crimes happen on the Street, then in places such as Residences, Apartments, Stores and Other places.
# All categories
p1 <- dttest %>%
group_by(Locdescrip) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Locdescrip, y = Count)) +
geom_bar(aes(x = reorder(Locdescrip, Count), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Places", y = "Number of crimes", title = "Evolution by places") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# Since the difference between places is quite significant, we can extract the data only for the top5 frequent places and plot for these data.
# Find top5 most frequent places
top5 <- head(names((sort(table(dttest$Locdescrip), decreasing = TRUE))), 5)
print(top5)
## [1] "STREET" "RESIDENCE" "APARTMENT" "STORE" "OTHERS"
p2 <- filter(dttest, Locdescrip %in% top5) %>%
group_by(Locdescrip) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Locdescrip, y = Count)) +
geom_bar(aes(x = reorder(Locdescrip, Count), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Places", y = "Number of crimes", title = "Evolution by top 5 places") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# Combine plots into one plot
ggarrange(p1, p2, ncol = 2, nrow = 1)
In this part, we will make use of the columns Latitude and Longitude and try to answer the question: In which location is crime more likely to happen?
The following plot shows that most crimes happened in the mid-west areas of Chicago, especially in community #25. Crimes were also frequent in communities #8, #32 and #43, along the east side of the city. We also remark that incidents were not frequent in the very central areas of Chicago (i.e. areas #57, #59, #34) and that the number of crimes increases from this central area towards the north and the south.
# Import and plot the shape file
mapcomu <- readShapePoly("C:/Users/ZHAO Hanlin/Desktop/RProject (DDL 2020-1-9)/Boundaries - Community Areas (current)/geo_export_b5591d25-0f4c-476f-8429-1f14d7129d9b.shp")
names(mapcomu)
## [1] "area" "area_num_1" "area_numbe" "comarea" "comarea_id"
## [6] "community" "perimeter" "shape_area" "shape_len"
# Transform the map as a data frame
dfcommu <- fortify(mapcomu, region = "area_numbe")
# Extract number of crimes for each community
temp <- dttest %>%
group_by(Community) %>%
summarise(Count = n())
temp$id <- 1:77
# Merge two data frames
temp2df <- merge(dfcommu, temp, by = "id", all.x = TRUE)
temp2df <- temp2df[order(temp2df$order), ]
# Extract community numbers
communum <- aggregate(cbind(long, lat) ~ Community, data = temp2df, FUN = function(x) mean(range(x)))
# Basic plot
locplot <- ggplot() +
geom_polygon(data = temp2df, aes(x = long, y = lat, group = Community, fill = Count), color = "black", size = 0.25) +
coord_map() +
scale_fill_gradient(low = "white", high = "red") +
theme_nothing(legend = T) +
labs(title = "Number of crimes per community") +
geom_text(data = communum, aes(x = long, y = lat, label = Community), size = 3, fontface = "bold")
# Import the police station
dfpolice <- fread(file = "C:/Users/ZHAO Hanlin/Desktop/RProject (DDL 2020-1-9)/Police_Stations_-_Map.csv", header = T, sep = ",", na.strings = "")
# Extract police stations' locations
dfpolice$LOCATION <- gsub("[(*)]", "", dfpolice$LOCATION)
policeloc <-str_split_fixed(dfpolice$LOCATION, ", ", 2)
policeloc <- as.data.frame(policeloc)
colnames(policeloc) <- c("lat", "long")
policeloc$lat <- as.numeric(as.character(policeloc$lat))
policeloc$long <- as.numeric(as.character(policeloc$long))
policeloc$id <- dfpolice$DISTRICT
# Plot police stations (by using black triangles) on the map
locplot <- locplot +
geom_point(data = policeloc, aes(x = long, y = lat), size = 1, shape = 24, fill = "black")
# Bar chart of crimes per community (stored in tempplot, not displayed here)
tempplot <- dttest %>%
group_by(Community) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Community, y = Count)) +
geom_bar(aes(x = reorder(Community, Count), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Community number", y = "Number of crimes", title = "Evolution by community areas") +
theme_minimal() +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
locplot
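The merge step above needs the crime counts and the map polygons to share a key. A minimal sketch of a join keyed on the community number itself rather than on row position, using made-up toy data: the data frames `polygons` and `counts` are hypothetical stand-ins for `dfcommu` and `temp`, and the character `id` column is assumed to mirror what `fortify()` produces.

```r
# Toy stand-in for the fortified map: one row per polygon vertex,
# with the community number stored as a character id (assumption).
polygons <- data.frame(
  id    = c("1", "1", "2", "2", "3", "3"),
  long  = c(-87.70, -87.69, -87.66, -87.65, -87.62, -87.61),
  lat   = c(41.90, 41.91, 41.85, 41.86, 41.80, 41.81),
  order = 1:6,
  stringsAsFactors = FALSE
)

# Toy stand-in for the per-community counts; community 2 has no record.
counts <- data.frame(Community = c(3, 1), Count = c(40, 15))

# Build the key from the community number instead of the row position.
counts$id <- as.character(counts$Community)

# Left join: every polygon vertex is kept, unmatched communities get NA.
joined <- merge(polygons, counts, by = "id", all.x = TRUE)
joined <- joined[order(joined$order), ]
print(joined)
```

With a keyed join, a community that is missing from the counts shows up as `NA` on the map instead of silently shifting the counts of the other communities.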
In this part, we try to answer the question: How have crime types evolved over time in Chicago?
First, let us plot the number of crimes of each type for each year from 2014 to 2018.
We can see that the most frequent crime types in Chicago over the last five years were Theft, Battery and Damage. In contrast, the least frequent were Arson and Homicide.
From 2014 to 2018, most crime types show a generally decreasing trend, while the numbers of Weapon and Homicide cases stayed almost constant. This can be seen as a good sign, since fewer and fewer crimes were committed. The general decreasing trend could be interpreted as an improvement in the efficiency of the Chicago Police Department.
# Types and number of crimes
p1 <- dttest %>%
group_by(Type) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Type, y = Count)) +
geom_bar(aes(x = reorder(Type, Count), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
coord_flip() +
labs(x = "Type", y = "Number of crimes", title = "Evolution of number of crimes for different types") +
theme_minimal() +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# Evolution over years
p2 <- dttest %>%
group_by(Year, Type) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Year, y = Count, fill = Type)) +
geom_area() +
labs(x = "Years", y = "Number of crimes", title = "Evolution of crime types over years")
# Combine plots
ggarrange(p1, p2, ncol = 2, nrow = 1)
# Evolution over years multiplots
dttest %>%
group_by(Year, Type) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Year, y = Count)) +
geom_smooth(method = "lm") +
geom_point()+
facet_wrap(~Type, ncol = 4, scales = "free") +
labs(x = "Years", y = "Number of crimes", title = "Evolution of crime types over years")
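The decreasing trend read from the fitted lines can also be checked numerically. A minimal sketch, on made-up counts, that fits the same kind of linear model per crime type and extracts the yearly slope. The data frame `yearly` and its numbers are hypothetical and are not taken from the Chicago data.

```r
# Hypothetical yearly counts for two crime types.
yearly <- data.frame(
  Year  = rep(2014:2018, times = 2),
  Type  = rep(c("THEFT", "HOMICIDE"), each = 5),
  Count = c(100, 95, 90, 85, 80,   # falls by 5 per year
            10, 10, 10, 10, 10)    # constant
)

# Fit Count ~ Year separately for each type and keep the slope.
slopes <- sapply(split(yearly, yearly$Type), function(d) {
  unname(coef(lm(Count ~ Year, data = d))["Year"])
})
print(round(slopes, 2))
```

A negative slope corresponds to a falling line in the matching facet, and a slope close to zero to a flat one.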
The following plot shows how the different crime types vary along several time dimensions: by time interval, by weekday, by month and by season.
From these heat maps, we find that, from 2014 to 2018 in Chicago:
- THEFT occurred a lot in all time intervals, especially from 6h to 24h, and most THEFT crimes occurred in the afternoon, i.e. from 12h to 18h. BATTERY crimes, however, happened most often in the evening, i.e. from 18h to 24h. Some crimes such as NARCOTICS and DAMAGE occurred more in the evening, while DECEIVE crimes happened more frequently in the morning. This result is plausible, since the timing of each crime type is consistent with its characteristics: in general, people want to stay hidden when they commit NARCOTICS and DAMAGE crimes, whereas DECEIVE incidents mostly involve business people and are therefore usually committed during office hours.
- THEFT was the most frequent crime on each day of the week. BATTERY and DAMAGE crimes happened mostly during the weekend. For the reason explained above, DECEIVE crimes occurred more on weekdays than during the weekend.
- THEFT was the most frequent crime throughout the year, followed by BATTERY, DAMAGE and ASSAULT. Consistent with our earlier conclusion, almost all types occurred more often during the summer months, i.e. from May to August.
- THEFT occurred a lot in all four seasons, and most frequently in summer. BATTERY, DAMAGE and ASSAULT cases occurred most frequently in summer and least frequently in winter.
- The frequency of most incident types, such as ARSON, BURGLARY, HOMICIDE, HUMANCHILD, MOTO, ROBBERY, SEX, SOCIETY, TRESPASS and WEAPONS, did not change much along any time dimension (time interval, weekday, month or season).
# Transform the type
dttest[, c("Month", "Day", "Season", "Tint")] <- lapply(dttest[, c("Month", "Day", "Season", "Tint")], as.character)
# By time intervals
p1 <- dttest %>%
group_by(Type, Tint) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Tint, y = reorder(Type, Count))) +
geom_tile(aes(fill = Count)) +
scale_x_discrete("Time intervals", expand = c(0, 0), position = "top") +
scale_y_discrete("Crime types", expand = c(0, -2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "red") +
ggtitle("Evolution by time intervals") +
theme_bw() +
theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# By weekdays
p2 <- dttest %>%
group_by(Type, Day) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Day, y = reorder(Type, Count))) +
geom_tile(aes(fill = Count)) +
scale_x_discrete("Weekdays", expand = c(0, 0), position = "top") +
scale_y_discrete("Crime types", expand = c(0, -2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "red") +
ggtitle("Evolution by weekdays") +
theme_bw() +
theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# By months
p3 <- dttest %>%
group_by(Type, Month) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Month, y = reorder(Type, Count))) +
geom_tile(aes(fill = Count)) +
scale_x_discrete("Months", expand = c(0, 0), position = "top") +
scale_y_discrete("Crime types", expand = c(0, -2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "red") +
ggtitle("Evolution by months") +
theme_bw() +
theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# By seasons
p4 <- dttest %>%
group_by(Type, Season) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Season, y = reorder(Type, Count))) +
geom_tile(aes(fill = Count)) +
scale_x_discrete("Seasons", expand = c(0, 0), position = "top") +
scale_y_discrete("Crime types", expand = c(0, -2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "red") +
ggtitle("Evolution by seasons") +
theme_bw() +
theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# Combine plots into one plot
ggarrange(p1, p2, p4, p3, ncol = 2, nrow = 2)
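The heat maps rely on a `Tint` column whose construction is not shown in this part. A minimal sketch of how such 6-hour intervals could be derived from the hour of the day with `cut()` and then cross-tabulated with the crime type. The break points, the labels and the toy vectors `hour` and `type` are assumptions made for illustration.

```r
# Hypothetical hours of day (0-23) for a few incidents.
hour <- c(0, 5, 6, 11, 12, 17, 18, 23)

# Four 6-hour bins; right = FALSE makes each bin [start, end).
tint <- cut(hour,
            breaks = c(0, 6, 12, 18, 24),
            labels = c("00-06", "06-12", "12-18", "18-24"),
            right = FALSE)

# Made-up crime types, one per incident.
type <- c("THEFT", "THEFT", "DECEIVE", "DECEIVE",
          "THEFT", "THEFT", "BATTERY", "BATTERY")

# Count incidents per (type, interval) pair, as the heat maps do.
counts <- as.data.frame(table(Type = type, Tint = tint),
                        responseName = "Count")
print(counts)
```

Unlike `group_by()` followed by `summarise()`, `table()` also returns the empty combinations with a count of 0, which gives a complete grid of tiles.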
In this part, we make use of the column Location Description and try to answer the question: In which places are different crime types more likely to happen?
The following plot shows how the crime types vary across places. We can see that Theft and Battery cases happened most often on the street and in residences, where people stay longer, which creates opportunities for theft. Moreover, in these places people have enough space to fight. Therefore, this result makes sense. Similarly, we find that Damage, Narcotics, Robbery and Moto were also recorded almost entirely on the street.
# Since the differences between places and between crime types are quite large, we keep only the 10 most frequent places and the 10 most frequent types, and plot these data.
# Find top 10 most frequent places
top10P <- head(names((sort(table(dttest$Locdescrip), decreasing = TRUE))), 10)
# Find top10 most frequent crime types
top10T <- head(names((sort(table(dttest$Type), decreasing = TRUE))), 10)
# Plot
filter(dttest, Locdescrip %in% top10P) %>%
filter(Type %in% top10T) %>%
group_by(Type, Locdescrip) %>%
summarise(Count = n()) %>%
ggplot(aes(x = reorder(Locdescrip, Count), y = reorder(Type, Count))) +
geom_tile(aes(fill = Count)) +
scale_x_discrete("Places", expand = c(0, 0), position = "top") +
scale_y_discrete("Crime types", expand = c(0, -2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "red") +
ggtitle("Evolution by places") +
theme_bw() +
theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
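The "top 10" selection above uses the same expression for places and for crime types. A minimal sketch of that expression wrapped in a small helper: the function name `top_values` and the toy vector `places` are hypothetical and do not appear in the original code.

```r
# Return the n most frequent values of x, most frequent first.
top_values <- function(x, n) {
  head(names(sort(table(x), decreasing = TRUE)), n)
}

# Made-up location descriptions.
places <- c("STREET", "STREET", "STREET",
            "RESIDENCE", "RESIDENCE", "APARTMENT")

top2 <- top_values(places, 2)
print(top2)
```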
In this part, we make use of the columns Latitude and Longitude and try to answer the question: In which locations are different crime types more likely to happen?
We can see that areas 8, 32, 28, 24 and 25 are particularly dangerous in terms of Theft, that community areas 25, 43 and 29 stand out in terms of Battery, that most Narcotics crimes are concentrated in areas 25, 23 and 29, and that Deceive cases are most frequent in the same areas as Theft cases: communities 8 and 32.
# Find top10 most dangerous community areas
top10C <- head(names((sort(table(dttest$Community), decreasing = TRUE))), 10)
# Plot
filter(dttest, Type %in% top10T) %>%
filter(Community %in% top10C) %>%
group_by(Type, Community) %>%
summarise(Count = n()) %>%
ggplot(aes(x = reorder(Community, Count), y = reorder(Type, Count))) +
geom_tile(aes(fill = Count)) +
scale_x_discrete("Community areas", expand = c(0, 0), position = "top") +
scale_y_discrete("Crime types", expand = c(0, -2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "red") +
ggtitle("Evolution by areas") +
theme_bw() +
theme(panel.grid.major =element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
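Because all rows of the heat map share one colour scale, very frequent types such as Theft dominate it. One possible complement, sketched here on made-up data, is to convert each row to shares so that every crime type sums to 1 across community areas. The vectors `type` and `community` are hypothetical.

```r
# Made-up incidents: crime type and community area.
type      <- c("THEFT", "THEFT", "THEFT", "THEFT", "BATTERY", "BATTERY")
community <- c("8", "8", "8", "25", "25", "25")

# Counts per (type, community) pair.
tab <- table(Type = type, Community = community)

# margin = 1 normalises within each row, i.e. within each crime type.
shares <- prop.table(tab, margin = 1)
print(shares)
```

Read row by row, the shares show where each type is concentrated, regardless of how frequent the type is overall.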
We can see that the number of domestic crimes increased slightly from 2014 to 2016 and then decreased from 2016 to 2018.
Moreover, most domestic crimes happened in the evening (18h to 24h) and during the weekend, when people are generally at home, which makes sense. There were also more domestic cases in summer, as for other crimes, possibly because high temperatures make people more emotional and more likely to commit crimes. Finally, we find that most domestic cases happened on the street, in residences and in apartments. The latter two places are easy to understand. Street, the most frequent place for domestic crimes, makes sense if we read these cases as conflicts between family members, which can also happen on the street.
# Numbers
dttest %>%
filter(Domestic == T) %>%
group_by(Year) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Year, y = Count)) +
geom_bar(aes(x = Year, y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Year", y = "Number of crimes", title = "Evolution of number of domestic crimes in different years") +
theme_minimal() +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By time intervals
p1 <- dttest %>%
filter(Domestic == T) %>%
group_by(Tint) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Tint, y = Count)) +
geom_bar(aes(x = Tint, y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Time intervals", y = "Number of domestic crimes", title = "Evolution by time intervals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By weekdays
p2 <- dttest %>%
filter(Domestic == T) %>%
group_by(Day) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Day, y = Count)) +
geom_bar(aes(x = factor(Day, level = c("lun\\.", "mar\\.", "mer\\.", "jeu\\.", "ven\\.", "sam\\.", "dim\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Weekdays", y = "Number of domestic crimes", title = "Evolution by weekdays") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By months
p3 <- dttest %>%
filter(Domestic == T) %>%
group_by(Month) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Month, y = Count)) +
geom_bar(aes(x = factor(Month, level = c("janv\\.", "févr\\.", "mars", "avr\\.", "mai", "juin", "juil\\.", "août", "sept\\.", "oct\\.", "nov\\.", "déc\\.")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Months", y = "Number of domestic crimes", title = "Evolution by months") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# By seasons
p4 <- dttest %>%
filter(Domestic == T) %>%
group_by(Season) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Season, y = Count)) +
geom_bar(aes(x = factor(Season, level = c("SPRING", "SUMMER", "FALL", "WINTER")), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Seasons", y = "Number of domestic crimes", title = "Evolution by seasons") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
# Combine plots into one plot
ggarrange(p1, p2, p4, p3, ncol = 2, nrow = 2)
# Locations
dttest %>%
filter(Domestic == T) %>%
group_by(Locdescrip) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Locdescrip, y = Count)) +
geom_bar(aes(x = reorder(Locdescrip, Count), y = Count), stat = "identity", fill = "#6495ED", width = 0.3, position=position_dodge(0.4)) +
labs(x = "Places", y = "Number of crimes", title = "Evolution by places") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 75,vjust = 1,hjust = 1)) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
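The weekday and month labels used as factor levels above (`lun\\.`, `janv\\.` and so on) depend on the locale in which the dates were processed. A minimal sketch of a locale-independent alternative, assuming a `Date` column is available: `format(x, "%u")` returns the ISO weekday number (1 for Monday to 7 for Sunday), which can then be mapped to fixed labels. The vector `dates` and the label set are made up for illustration.

```r
# Hypothetical incident dates (Monday, Wednesday, Saturday, Sunday).
dates <- as.Date(c("2015-03-16", "2015-03-18", "2015-03-21", "2015-03-22"))

# ISO weekday number: 1 = Monday, ..., 7 = Sunday, whatever the locale.
day_num <- as.integer(format(dates, "%u"))

# Fixed English labels, in calendar order.
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
day <- factor(day_labels[day_num], levels = day_labels)

# Weekend flag derived from the same number.
is_weekend <- day_num >= 6

print(data.frame(dates, day, is_weekend))
```

Because the factor levels are fixed in calendar order, the bars are plotted Monday to Sunday without listing locale-specific strings.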
First, we can measure efficiency by the number of crimes over the years. From our previous analysis, we know that there is a decreasing trend, so from this point of view the Chicago Police Department has become more efficient.
However, if we look at the evolution of the arrest rate for each year, we can see that the rate is decreasing, meaning that fewer and fewer of the people who commit crimes are caught by the police. This suggests that the Chicago Police Department has become less and less efficient.
# Extract data
temp <- dttest %>%
filter(Arrest == T) %>%
group_by(Year) %>%
summarise(Count = n())
# Compute the rates: arrests in each year divided by the total number of records in dttest (all years)
temp$rate <- temp$Count / nrow(dttest)
# Plot
ggplot(temp, aes(x = Year, y = rate)) +
geom_line() +
theme_minimal() +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank())
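The rate computed above divides each year's arrest count by `nrow(dttest)`, i.e. by the number of records across all years, so it also falls whenever the yearly number of crimes falls. A minimal sketch, on made-up data, contrasting this with a rate in which each year's arrests are divided by that year's own records. The data frame `crimes` is hypothetical.

```r
# Toy data: 4 crimes in 2014 and 2 in 2015, half of them with an arrest.
crimes <- data.frame(
  Year   = c(2014, 2014, 2014, 2014, 2015, 2015),
  Arrest = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Dataset-wide denominator: arrests in the year / all records.
share_of_all <- tapply(crimes$Arrest, crimes$Year, sum) / nrow(crimes)

# Within-year denominator: the mean of a logical vector is the share of TRUE.
within_year <- tapply(crimes$Arrest, crimes$Year, mean)

print(share_of_all)
print(within_year)
```

In this toy example the within-year rate is 0.5 in both years, while the dataset-wide version drops from 2014 to 2015 only because fewer crimes were recorded.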
We can also see that, in general, the number of crimes decreased even in the 10 most dangerous community areas, which also points to an improvement in police efficiency. However, the arrest rate also decreased in these 10 areas, which suggests a deterioration of police efficiency. It is interesting to note that between 2016 and 2017 the arrest rate increased in almost every one of these 10 community areas (except community 28), which may suggest an improvement in police efficiency in these areas. Since we do not have complete data for 2018, we cannot say that efficiency decreased between 2017 and 2018, even though the graph shows a lower rate over this period.
# Find top10 most dangerous community areas
top10C <- head(names((sort(table(dttest$Community), decreasing = TRUE))), 10)
# Plot number of crimes
filter(dttest, Community %in% top10C) %>%
group_by(Year, Community) %>%
summarise(Count = n()) %>%
ggplot(aes(x = Year, y = Count)) +
geom_smooth(method = "lm") +
geom_point()+
facet_wrap(~Community, ncol = 4, scales = "free") +
labs(x = "Years", y = "Number of crimes", title = "Evolution of number of crimes in different community areas over years")
# Extract data
temp <- dttest %>%
filter(Arrest == T, Community %in% top10C) %>%
group_by(Year, Community) %>%
summarise(Count = n())
# Compute the rates: arrests per year and community divided by the total number of records in dttest
temp$rate <- temp$Count / nrow(dttest)
# Plot
ggplot(temp, aes(x = Year, y = rate)) +
geom_line() +
facet_wrap(~Community, ncol = 4, scales = "free") +
labs(x = "Years", y = "Crime rates", title = "Evolution of arrested crime rates in different community areas over years")
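The caveat about the incomplete 2018 data can be checked directly. A minimal sketch that flags years whose last recorded month is before December, assuming the month is available as a number from 1 to 12. In this report `Month` is stored as a label, so the numeric month and the toy data frame `records` are assumptions.

```r
# Toy records: 2017 runs to December, 2018 stops in February.
records <- data.frame(
  Year  = c(2017, 2017, 2017, 2018, 2018),
  Month = c(1, 6, 12, 1, 2)
)

# Last month observed in each year.
last_month <- tapply(records$Month, records$Year, max)

# Years that end before December are treated as partial.
partial_years <- names(last_month)[last_month < 12]
print(partial_years)
```

Yearly counts for a partial year are not comparable with those of full years, which is why the apparent drop in 2018 should be read with care.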
Finally, let us use the arrest rate to analyse how efficiently the police deal with each crime type. In general, there is a decreasing trend in the arrest rate for almost all crime types, which suggests declining efficiency (with an especially marked decline for Narcotics and Battery cases). Moreover, even though many crimes were reported each year, the arrest rate for most crime types is very low and stays low, which also suggests low efficiency.
In conclusion, if the arrest rate is a good measure of police efficiency, then the work of the Chicago police was not effective enough, at least between 2014 and 2018.
# Extract data
temp <- filter(dttest, Arrest == T) %>%
group_by(Year, Type) %>%
summarise(Count = n())
# Compute the rates: arrests per year and type divided by the total number of records in dttest
temp$rate <- temp$Count / nrow(dttest)
# Plot
ggplot(temp, aes(x = Year, y = rate, colour = Type)) +
geom_line()
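With a dataset-wide denominator, the rate for any single year and crime type is small by construction, whatever the police do. A sketch, on hypothetical toy data, of the arrest rate computed within each year and type using `aggregate()` from base R. The data frame `crimes` and the resulting `rates` are illustrative only.

```r
# Toy data: two crime types over two years.
crimes <- data.frame(
  Year   = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015),
  Type   = c("NARCOTICS", "NARCOTICS", "THEFT", "THEFT",
             "NARCOTICS", "NARCOTICS", "THEFT", "THEFT"),
  Arrest = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Share of records with an arrest within each (Year, Type) group.
rates <- aggregate(Arrest ~ Year + Type, data = crimes, FUN = mean)
names(rates)[names(rates) == "Arrest"] <- "rate"
print(rates)
```

Each rate now lies between 0 and 1 and can be compared across types, since it no longer depends on how frequent a type is in the whole dataset.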