Executive Summary

This project analyses crime against women in Delhi. The data was obtained from Safecity, a platform (also available as a mobile application) that crowdsources personal stories of sexual harassment and abuse in public spaces. The idea is to make this data useful for individuals, local communities and local administration, so that they can identify the factors behind behaviour that leads to violence and work on strategies for solutions. One of the long-term goals of this project is to help police departments understand, predict and, where possible, prevent crime or mitigate the damage it causes.

The article is structured as follows. In section 1, the data is pre-processed and prepared for analysis. Section 2 explores the data and draws insights from it, which are summarised in the conclusions in section 3. An ETS (Error, Trend, Seasonality) model is also used for forecasting and for testing for the presence of seasonality.

1. Data Pre-Processing

#import required packages
library(ggplot2)
library(ggmap)
library(ggfortify)
library(reshape2)
library(RJSONIO)
library(RCurl)
library(forecast)
Delhi2<- read.csv("DelhiUpdatedDashboard1.csv", header = T)
head(names(Delhi2)); tail(names(Delhi2))
## [1] "INCIDENT.ID"       "INCIDENT.TITLE"    "DATE"             
## [4] "YEAR"              "MONTH"             "WEEK.OF.THE.MONTH"
## [1] "Flag.Train" "Flag.Bus"   "Flag.Metro" "Flag.Auto"  "Flag.Rail" 
## [6] "Overall"

The data contains 4513 observations and 67 variables. The first and last few variable names are shown above.

Variables such as Please Check, Description, Car, Bus, Auto and all the variables that follow them are not required, so we omit them from the data frame.

Delhi2<- Delhi2[ ,-c(1,2,12, 38:67)]

#This is what the data looks like
head(Delhi2,3)[1:10]
##         DATE YEAR    MONTH WEEK.OF.THE.MONTH       DAY  TIME
## 1 18-08-2004 2004   August                 3 Wednesday 08:00
## 2 03-11-2004 2004 November                 1 Wednesday 08:00
## 3 21-01-2005 2005  January                 3    Friday 18:07
##   QUARTER.OF.THE.DAY HOUR.OF.THE.DAY MINUTE.OF.THE.HOUR Catcalls.Whistles
## 1                  2               8                  0                 0
## 2                  2               8                  0                 0
## 3                  4              18                  7                 0

Each observation contains the date, year, month, location, coordinates and type of crime. The location addresses are not in a standard form, so they cannot be grouped into broad, meaningful categories or used for analysis directly. It was therefore decided to group the areas by postal pin code. We use the coordinates provided to get the address in standard form through Google's geocoding API (reverse geocoding); packages such as revgeo or ggmap can also be used for this. Pin codes are used instead of area names because adequate data is not available for all areas in Delhi. Here is a function for extracting the pin codes from Google.

#function for reverse geocoding
get_pincode<- function(latitude, longitude, API=NULL){
  #one request URL per coordinate pair (paste0 is vectorised)
  url<- paste0("https://maps.googleapis.com/maps/api/geocode/json?latlng=",latitude,
               ",",longitude,"&key=",API)
  zips<- character(length(url))
  
  for(k in seq_along(url)){
    #reset for every request so that a pin code from an earlier row is never reused
    zip<- "Postcode Not Found"
    #NULL marks a failed download or a response that could not be parsed
    returned_data1<- tryCatch(fromJSON(getURL(url[k])), 
                              error = function(e) NULL)
    
    if(is.list(returned_data1) && length(returned_data1$results) > 0){
      #look for the postal_code component of the first (most specific) result
      for(comp in returned_data1$results[[1]]$address_components){
        if("postal_code" %in% comp$types){
          zip<- comp$long_name
          break
        }
      }
    }
    zips[k]<- zip
  }
  return(data.frame(zip = zips, stringsAsFactors = FALSE))
}

Just input the vector of latitudes and longitudes to the function and voila!

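pincodes<- get_pincode(latitude = Delhi2$LATITUDE, longitude = Delhi2$LONGITUDE, 
            API = "api key" )
#the function returns a column called zip; store it as pincodes, the name used in the rest of the analysis
Delhi2$pincodes<- pincodes$zip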
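
The extraction step inside the function can be tried without calling Google at all. The sketch below is only an illustration: `extract_pincode` is a hypothetical helper, and `fake_response` is a hand-built list that mimics the nesting of a parsed geocoding response (`results`, then `address_components`, each with `long_name`, `short_name` and `types`). The values in it are made up.

```r
# Hypothetical helper (not part of the original code): pull the postal code
# out of one parsed geocoding response, or return a fallback string.
extract_pincode <- function(parsed, missing = "Postcode Not Found") {
  if (!is.list(parsed) || length(parsed$results) == 0) return(missing)
  for (comp in parsed$results[[1]]$address_components) {
    if ("postal_code" %in% comp$types) return(comp$long_name)
  }
  missing
}

# Hand-built stand-in for a parsed response; values are for illustration only
fake_response <- list(
  results = list(list(
    address_components = list(
      list(long_name = "New Delhi", short_name = "New Delhi",
           types = c("locality", "political")),
      list(long_name = "110049", short_name = "110049",
           types = "postal_code")
    )
  )),
  status = "OK"
)

extract_pincode(fake_response)                                   # postal code found
extract_pincode(list(results = list(), status = "ZERO_RESULTS")) # no results
extract_pincode("Issue retrieving address from google")          # failed request
```

Keeping the parsing in a small function of its own makes it easy to check the fallback behaviour before spending API quota on thousands of coordinates.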

Some pin codes are faulty, for example because they have only 5 digits. One reason Google returns faulty pin codes is that the addresses were not written properly. Rows corresponding to these pin codes should be removed. Addresses should be written in a standard form; one should avoid writing addresses like “New Delhi, Delhi, India”, “Bus No. 729” or “Rohini, Delhi”. Addresses should be as accurate as possible and in a standard format such as “Central School, Andrews Ganj, New Delhi, Delhi 110049”, so that they can be used effectively for analysis.

#Delhi3 is the cleaned data frame created further below
head(table(Delhi3$pincodes))
## 
## 110001 110002 110003 110005 110006 110007 
##    267    377     35     69    153    259
#rows to be omitted
bad_pincodes<- c("76750","78758","97045","400706","441217","442604","560070")
#logical indexing is safe even when none of the codes is present,
#whereas Delhi2[-which(...), ] would drop every row in that case
Delhi2<- Delhi2[!(Delhi2$pincodes %in% bad_pincodes), ]
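
A format check can flag malformed codes automatically instead of listing them by hand. This is a minimal sketch under one stated assumption: a well-formed Indian pin code is exactly six digits and does not start with 0. The helper name `is_valid_pincode` and the example vector are made up for illustration.

```r
# Assumption: a well-formed pin code is exactly six digits, first digit 1-9
is_valid_pincode <- function(p) grepl("^[1-9][0-9]{5}$", as.character(p))

codes <- c("110001", "76750", "Postcode Not Found", "560070", NA)
valid <- is_valid_pincode(codes)
data.frame(code = codes, valid = valid)
```

A check like this only catches malformed values. A well-formed code that lies outside the study region still passes, so such codes need an explicit list as in the code above.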

Let’s look at the pin codes that occur fewer than five times; many of them occur only once. They are removed because they are too few to provide meaningful insights. After making all the necessary changes, the data is stored in Delhi3.

table(pincodes)[which((table(pincodes))<5)]

##pincodes

# 76750  78758  97045 110004 110011 110012 110036 110040 110046 110061 110072 110080 110081 110084 110087 
#    1      1      1      2      2      3      2      2      3      2      1      4      1      4      4 
#121003 121006 121010 122007 122009 122016 122017 122018 122022 122051 131402 143001 201001 201003 201004 
#    3      4      1      1      1      1      1      1      1      2      1      2      4      2      1 
#201005 201007 201011 201305 201308 201313 203201 206244 250001 250002 250003 250103 301714 321022 380026 
#    2      1      2      1      1      2      1      1      2      1      2      1      1      1      1 
#400706 442604 560070 
#     1      1      1 

# Removing the rows with these pincodes
x<- table(Delhi2$pincodes) #number of reports per pincode
Delhi3<- Delhi2[Delhi2$pincodes %in% names(x[x>=5]), ]
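
The same filtering idea on a tiny made-up data frame shows what the step keeps and drops. The pin codes and counts below are invented for the illustration.

```r
# Toy data: two frequent pin codes and two rare ones (made-up counts)
toy <- data.frame(
  pincodes = c(rep("110001", 6), rep("110002", 5), rep("110004", 2), "122007"),
  stringsAsFactors = FALSE
)

counts <- table(toy$pincodes)          # reports per pin code
keep   <- names(counts[counts >= 5])   # pin codes with at least five reports
toy_clean <- toy[toy$pincodes %in% keep, , drop = FALSE]

table(toy_clean$pincodes)
```

If the pin code column is a factor, calling `droplevels()` on the result removes the now-empty levels so that they do not show up as zero counts in later tables.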

Now we are ready to head to the next part.

2. Exploring Data and Getting Insights

Are crimes equiprobable in all months of the year?

data_month<- as.data.frame(table(Delhi3$MONTH))

ggplot(data = data_month, aes(x=Var1, y=Freq))+geom_bar(stat = "identity", width = 0.5, fill = topo.colors(1))+coord_flip()+ggtitle("Crimes against Women")+xlab("Months")+ylab("Frequency")

The number of reported crimes rises sharply in February and again from September through November. The question is whether this is just chance or whether the crime rate really depends on the month of the year. A chi-square test was employed to test the hypothesis that the crime rate in Delhi does not depend on the month of the year.

Delhi3$MONTH<- factor(Delhi3$MONTH, levels = month.name)
month<- as.matrix(table(Delhi3$MONTH))
chisq.test(month)
## 
##  Chi-squared test for given probabilities
## 
## data:  month
## X-squared = 2578.6, df = 11, p-value < 2.2e-16

The p-value is very small, so we reject the null hypothesis: the occurrence of crime does depend on the month of the year. February is most probably the most unsafe month of all.
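
To make the test concrete, the statistic can be computed by hand as the sum of (observed - expected)^2 / expected over the 12 months, with 11 degrees of freedom. The sketch below uses made-up monthly counts, not the Safecity data. It also shows one possible refinement that is not in the original analysis: because months differ in length, the null hypothesis can use expected counts proportional to the number of days in each month.

```r
# Made-up monthly counts, for illustration only
obs <- c(30, 55, 28, 31, 29, 33, 30, 27, 48, 46, 50, 32)
names(obs) <- month.name

# Null hypothesis: every month is equally likely
expected <- rep(sum(obs) / 12, 12)
stat     <- sum((obs - expected)^2 / expected)
p_value  <- pchisq(stat, df = length(obs) - 1, lower.tail = FALSE)

# chisq.test() gives the same statistic and p-value
res <- chisq.test(obs)
c(by_hand = stat, chisq_test = unname(res$statistic))

# Alternative null: expected counts proportional to days per month (non-leap year)
days     <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
res_days <- chisq.test(obs, p = days / sum(days))
res_days$p.value
```

With real data the two null hypotheses usually lead to the same conclusion, but the second one avoids counting February's shorter length as evidence in either direction.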

d2013<-as.data.frame(table(Delhi3[Delhi3$YEAR==2013, ]$MONTH))
d2014<-as.data.frame(table(Delhi3[Delhi3$YEAR==2014, ]$MONTH))
d2015<-as.data.frame(table(Delhi3[Delhi3$YEAR==2015, ]$MONTH))
d2016<- as.data.frame(table(Delhi3[Delhi3$YEAR==2016, ]$MONTH))
timeseries<-rbind(d2013, d2014, d2015, d2016)

{plot.ts( timeseries$Freq[13:24], type="l", xlab = "Months", ylab = "Frequency")
lines(1:12, y=timeseries$Freq[25:36], lty = 2)
lines(1:12, y=timeseries$Freq[37:48], lty = 3, col = "dark grey")
lines(1:12, y=timeseries$Freq[1:12], lty = 3, col = "red")
legend(x=5, y=600, legend = c("2013", "2014", "2015", "2016"), lty = c(3,1,2,3), col = c("red","black","black","dark grey"))}


Are crimes equally probable in all weeks of the month?

#for the month of february
table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="February"])
test1<-chisq.test(table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="February"]))
#september
table2<-table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="September"])
test2<- chisq.test(table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="September"]))
#October
table3<-table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="October"])
test3<-chisq.test(table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="October"]))
#November
table4<-table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="November"])
test4<-chisq.test(table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="November"]))

After applying the chi-square goodness-of-fit test to each of the tables above, we obtained p-values smaller than 2.2e-16. This means that crimes are not equally probable in all weeks of the month: some weeks are more unsafe than others.
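
The four tests above repeat the same call, so they can also be written as a loop. The sketch below runs on a simulated data frame that only borrows the column names `MONTH` and `WEEK.OF.THE.MONTH`; the values are random and the weeks are restricted to 1 to 4 for simplicity. The Holm adjustment at the end is an addition, not part of the original analysis; it is one common way to account for running several tests on the same data.

```r
# Simulated stand-in for Delhi3 (random values, weeks 1-4 only)
set.seed(1)
months_of_interest <- c("February", "September", "October", "November")
toy <- data.frame(
  MONTH = sample(months_of_interest, 400, replace = TRUE),
  WEEK.OF.THE.MONTH = sample(1:4, 400, replace = TRUE,
                             prob = c(0.2, 0.4, 0.1, 0.3))
)

# One goodness-of-fit test per month, stored in a named list
week_tests <- lapply(months_of_interest, function(m) {
  weeks <- factor(toy$WEEK.OF.THE.MONTH[toy$MONTH == m], levels = 1:4)
  chisq.test(table(weeks))
})
names(week_tests) <- months_of_interest

p_values   <- sapply(week_tests, function(res) res$p.value)
p_adjusted <- p.adjust(p_values, method = "holm")
p_adjusted
```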

feb<- table(Delhi3$WEEK.OF.THE.MONTH[Delhi3$MONTH=="February"])
feb<- data.frame(feb)[-1]
#similarly, we stored the data from all other months in jan, mar, april, etc.

mdf<- as.data.frame(cbind(jan[1:4,], feb[1:4,], mar[-5,], april[-5,], may[-5,],june[-5, ],july[-5, ],august[-5,],sep[-5,],oct[-5,], nov[-5,],dec[-5,]))
#data.frame() keeps the weekly counts numeric; cbind() with month.name would turn them into text
mdf2<- data.frame(Months = month.name, t(mdf))
mdf4<- mdf2[c(2,9,10,11),] #February, September, October and November
colnames(mdf4)<- c("Months","Week1","Week2","Week3","Week4")
mdf3<- melt( mdf4, id.vars = "Months")
ggplot(mdf3, aes(x=variable, y=value, group  = Months, colour = Months))+geom_line(size=1)+xlab("Weeks of the month")+ylab("Frequency")+ggtitle("Trend of crime rate by weeks \n of the Month")

Data from the most unsafe months was used to construct the plot above. In each of these months the crime rate rises sharply in the second week, drops in the third week and rises again in the fourth week. According to our data, the second and fourth weeks are the most unsafe. Considering the data for all months, the second week is more unsafe than the other weeks of the month.


Do crimes occur equally frequently on all days of the week?

Delhi3$DAY<- factor(Delhi3$DAY, levels = c("Monday","Tuesday", "Wednesday","Thursday","Friday","Saturday","Sunday"))
 table(Delhi3$DAY)
## 
##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##       357       567       654       457       824       926       462
test11<-chisq.test(table(Delhi3$DAY[Delhi3$YEAR==2013]))
test12<-chisq.test(table(Delhi3$DAY[Delhi3$YEAR==2014])) 
test13<-chisq.test(table(Delhi3$DAY[Delhi3$YEAR==2015]))
test14<-chisq.test(table(Delhi3$DAY[Delhi3$YEAR==2016]))
   
day2013<- data.frame(table(Delhi3$DAY[Delhi3$YEAR==2013]), Year = rep(2013,7))
day2014<- data.frame(table(Delhi3$DAY[Delhi3$YEAR==2014]), Year = rep(2014,7))
day2015<- data.frame(table(Delhi3$DAY[Delhi3$YEAR==2015]), Year = rep(2015,7))
day2016<- data.frame(table(Delhi3$DAY[Delhi3$YEAR==2016]), Year = rep(2016,7))
day<- rbind(day2013,day2014,day2015,day2016 )

ggplot(day, aes(x=Var1, y = Freq, group = factor(Year), colour = factor(Year)))+geom_line()+xlab("Day of the Week")+labs(colour = "Year") #the legend belongs to the colour aesthetic, not fill

Crimes do not occur equally frequently on all days of the week. They seem to peak towards the weekend, on Friday and Saturday, followed by a sharp drop on Sunday. In 2013, however, most of the crimes took place on weekdays.


Are women equally likely to fall victim to these crimes in all areas of Delhi?

test10<- chisq.test(as.data.frame(table(Delhi3$pincodes))$Freq)

The chi-square test yields a highly significant p-value: crimes against women are not equally probable in all areas of Delhi. Some areas are safer than others.
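
A significant test says that the areas differ, but not which ones stand out. One way to look at that, sketched here as an addition to the original analysis, is the standardized residuals returned by `chisq.test()`. The counts are the six pin code frequencies printed earlier in this article; testing only these six is purely for illustration.

```r
# The six pin code counts shown earlier, used here as a small example
counts <- c("110001" = 267, "110002" = 377, "110003" = 35,
            "110005" = 69, "110006" = 153, "110007" = 259)

res <- chisq.test(counts)

# Positive residual: more reports than expected under equal probabilities;
# negative residual: fewer than expected
stdres <- as.vector(res$stdres)
names(stdres) <- names(counts)
round(stdres, 1)

most_over  <- names(counts)[which.max(stdres)]
most_under <- names(counts)[which.min(stdres)]
```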

data_location<-as.data.frame(table(Delhi3$pincodes))
#30 most unsafe areas
top30<- data_location[order(data_location$Freq, decreasing = T), ][1:30, ]

#plot
{ggplot(data = top30, aes(x=reorder(Var1,Freq), y=Freq ))+geom_bar(stat= "identity", fill = c(topo.colors(40)[1:10],topo.colors(40)[1:10],topo.colors(40)[1:10]))+coord_flip()+ggtitle("Crimes against Women \n in various Locations of Delhi")+theme(axis.text.y =element_text(size=7))+xlab("Pincodes of Areas")+ylab("Frequency")} #reorder was used to arrange the bars from highest to shortest

According to the data, the most unsafe areas[1] in Delhi lie in the district of Central Delhi: Ajmeri Gate, Darya Ganj, Indraprastha, Rajghat, etc. Then come South-West Delhi (Amberhai, Dwarka sec-6), Connaught Place and North Delhi (Kamla Nagar, Roop Nagar, North Campus). The five least dangerous areas of Delhi-NCR, in order from the lowest to the highest occurrence of crimes, are: North West Delhi (Sarai Rohilla, Inderlok, Keshav Puram, Rampura, Onkar Nagar), South West Delhi (Aya Nagar, Jaunapur, Mandi, Arjungarh), Faridabad (Sector-16, sector-16a, sector-18, Bhaskola, Kheri Kalan, Faridabad City), Gautam Buddha Nagar (sec-34 noida, sec-55 noida, Khora Gaon, Chhajarsi) and South Delhi (Madanpur Khadar, Sarita Vihar).

Geographic Heatmap

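#download base map
#the key below is a placeholder: keep real API keys out of published code
#(newer ggmap versions take the key via register_google(key = ...) instead of api_key)
delhimap<-get_map(location = c(lon =77.18454 , lat=28.61768 ), maptype = "terrain", zoom = 11
                  , api_key = "api key", source ="google") 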

ggmap(delhimap, extent = "device")+
  geom_density2d(data = Delhi3, aes(x=LONGITUDE, y=LATITUDE),size = 0.5)+
  stat_density2d(data = Delhi3, aes(x=LONGITUDE, y=LATITUDE, fill = ..level.., alpha= ..level..), size=0.01, geom = "polygon", bins = 20)+
  scale_fill_gradient(low = "green", high = "red")+guides(alpha=FALSE)

The heatmap clearly shows that Central Delhi and South-West Delhi are the hotspots of crime, which is consistent with what we found earlier.
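
The contours in such a heatmap come from a two-dimensional kernel density estimate, which ggplot2 computes with `MASS::kde2d`. The sketch below runs that estimate directly on simulated points and reads off the location of the highest density. The two cluster centres are made up and are not real incident locations.

```r
library(MASS)  # recommended package that ships with R

set.seed(42)
# Simulated points around two made-up centres (a larger and a smaller cluster)
lon <- c(rnorm(300, 77.23, 0.02), rnorm(150, 77.05, 0.03))
lat <- c(rnorm(300, 28.64, 0.02), rnorm(150, 28.59, 0.03))

# Density estimate on a 50 x 50 grid; z[i, j] is the density at (x[i], y[j])
dens <- kde2d(lon, lat, n = 50)

# Grid cell with the highest estimated density
peak     <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_lon <- dens$x[peak[1, "row"]]
peak_lat <- dens$y[peak[1, "col"]]
c(lon = peak_lon, lat = peak_lat)
```

Reading the peak off the grid gives a numeric hotspot location that can be compared with the pin code ranking, rather than relying on the colours of the map alone.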


Do crimes occur equally frequently at all hours of the day?

{par(mfrow = c(2,2), oma = c(0,0,0,0))
#fixing the levels to 0:23 keeps hours with no reports, so every bar is labelled with its true hour
hours<- factor(Delhi3$HOUR.OF.THE.DAY, levels = 0:23)
barplot(table(hours[Delhi3$YEAR==2013]), main = "Year 2013", col =  topo.colors(10)[1])
barplot(table(hours[Delhi3$YEAR==2014]), main = "Year 2014", col = "cyan")
barplot(table(hours[Delhi3$YEAR==2015]), main = "Year 2015", col = "light blue")
barplot(table(hours[Delhi3$YEAR==2016]), main = "Year 2016", col = "turquoise") }

For 2013, 2014 and 2015 there seems to be a pattern: most crimes occurred between 10 a.m. and 2 p.m. (both included), with a second peak at night from 10 p.m. to 1 a.m. In 2016, however, the total number of crimes reported decreased considerably and no such pattern can be seen; most of the crimes took place during the evening and at night.

test5<- chisq.test(table(Delhi3$HOUR.OF.THE.DAY))
test6<- chisq.test(table(Delhi3$HOUR.OF.THE.DAY[Delhi3$YEAR==2013]))
test7<- chisq.test(table(Delhi3$HOUR.OF.THE.DAY[Delhi3$YEAR==2014]))
test8<- chisq.test(table(Delhi3$HOUR.OF.THE.DAY[Delhi3$YEAR==2015]))
test9<- chisq.test(table(Delhi3$HOUR.OF.THE.DAY[Delhi3$YEAR==2016]))

Time Series Analysis

d2014<-as.data.frame(table(Delhi3$MONTH[Delhi3$YEAR==2014]))
d2015<-as.data.frame(table(Delhi3$MONTH[Delhi3$YEAR==2015]))
d2016<-as.data.frame(table(Delhi3$MONTH[Delhi3$YEAR==2016]))
timeseries<- rbind(d2014,d2015, d2016)
tser<- ts(timeseries$Freq, start = 2014, frequency = 12)

#plot 
autoplot(tser)+ggtitle("Monthly Time Series \n of Crimes Reported ")+theme(plot.title = element_text(color = "cyan4"), panel.background = element_rect(fill ="lightblue"),axis.title.x = element_text(face = "bold", colour = "cyan4"),axis.title.y = element_text(face = "bold", colour = "cyan4"), axis.text = element_text(face="bold", colour = "blue4") )+xlab("Year")+ylab("Frequency")

In the plot above, no seasonality is visible. Seasonality is a characteristic of the time series. If there is a pattern in the time series that recurs or repeats over a one-year period, then the data is said to have seasonality.
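
A quick way to look for a yearly pattern without fitting a model is to average the series by calendar month. The sketch below does this on a simulated series with a built-in rise every February and September; the numbers are invented and only the idea carries over to `tser`.

```r
set.seed(7)
# Simulated monthly counts for three years with extra reports in February and September
bump <- rep(c(0, 60, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0), times = 3)
sim  <- ts(50 + bump + round(rnorm(36, sd = 5)), start = 2014, frequency = 12)

# cycle() gives the position of each observation within the year (1 = January)
monthly_mean <- as.vector(tapply(as.numeric(sim), as.integer(cycle(sim)), mean))
names(monthly_mean) <- month.abb
round(monthly_mean)

# The two months with the highest average count
top_two <- names(sort(monthly_mean, decreasing = TRUE))[1:2]
top_two
```

If the same months come out on top year after year, that is informal evidence of seasonality, which a formal test can then confirm or reject.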

In this case, in light of the past data, we expect the crime rate to go up every year in February and near the end of the year in September, October and November. But is the series actually seasonal, or is it just sampling error? We perform a simple statistical test for seasonality.

library(fma)
fit1<- ets(tser)
fit2<- ets(tser, model = "ANN")
deviance<- 2*c(logLik(fit1)- logLik(fit2))
df<- attributes(logLik(fit1))$df-attributes(logLik(fit2))$df
1-pchisq(deviance, df)
## [1] 2.41962e-12

The resulting p-value is 2.41962e-12, so the additional seasonal component is significant. This means that in the coming years we can most likely expect a rise in the frequency of crimes at the beginning of the year in February and near the end of the year in and around September.
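
The test used here is a likelihood-ratio test: twice the difference in log-likelihood between the two fitted models is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of parameters. The sketch below wraps that arithmetic in a hypothetical helper, `lr_test`, and applies it to made-up log-likelihoods. Using `lower.tail = FALSE` rather than `1 - pchisq(...)` avoids rounding very small p-values to zero.

```r
# Hypothetical helper: likelihood-ratio test from two log-likelihoods
lr_test <- function(loglik_full, loglik_restricted, df_diff) {
  stat <- 2 * (loglik_full - loglik_restricted)
  c(statistic = stat,
    p.value   = pchisq(stat, df = df_diff, lower.tail = FALSE))
}

# Made-up log-likelihoods and degrees of freedom, for illustration only
out <- lr_test(loglik_full = -180.2, loglik_restricted = -215.9, df_diff = 13)
out
```

Since `ets()` without a `model` argument chooses the model automatically, it is worth printing `fit1` to confirm that the selected model really contains a seasonal component before reading the result as a test for seasonality.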

Forecasts from ETS model

{par(bg="aquamarine")
plot(forecast(ets(tser)), ylim= c(0,800), xlab="Year",ylab="Frequency", main = "Forecast from ETS model", sub="Crimes in Future", lty = 1, col = "black", lwd = 1)}

As expected, the predicted values are high for the months of February and September.

3. End Notes and Conclusion

Identifying and analysing patterns and trends in crime can help law enforcement agencies deploy resources more effectively.

The frequency of crime in Delhi depends on the month of the year, the week of the month, the day of the week, the location and the hour of the day.

Women should be careful while travelling to these areas. They should carry pepper spray in a pocket or in a side pocket of their bag, where it can be reached easily. Police should patrol these areas, especially during the hours when crimes are more likely to occur, i.e. around midday (10 a.m. to 2 p.m.) and at night.

Patrolling should be more intensive on the weekdays and in the months during which crimes are highly probable in a given area. For example, the areas that experience the most crime during the month of February are the ones with the following pincodes:

head(sort(table(Delhi3$pincodes[Delhi3$MONTH=="February"]), decreasing = T))
## 
## 110075 110078 110002 110007 110059 110001 
##    223     82     81     64     48     41

These areas should be patrolled more intensively during the month of February.


  1. Pincodes of the areas were used in place of the names of the areas. The areas reported in parentheses are the ones that a particular pincode covers, and the district is mentioned before them.