MATH2349 Semester 2, 2019

Required packages

To initiate, we ran the packages readr , dplyr , tidyr , magrittr , lubridate , hmisc , plyr , data.table. These packages will enable us to carry out the data preprocessing tasks and reach the expected conclusion successfully.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:lubridate':
## 
##     here

## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

Executive Summary

For our 3rd assignment of Data Preprocessing course, we chose a dataset from Kaggle.com that consists of crime incident reports of Boston for about 3 years 3 months. The data showed different aspects of offense such as - where, when it happened, description of it, offense code, area code etc. There were two parts of data provided in the source, and we merged them by one of their common variables. The merged dataset had 17 variables in it with numerics and characters.

To preprocess the data, first we converted one of the character variables in a factor variable where it indicated if there was any shooting involved in the offense. To tidy the data, we scrapped off some repeatative variables such as - longitude, latitude, day of the week, year, hour of the day. To tidy it up even more, we created a new variable called “quarter” from existing variable “month” that gives more insight into crimes in Boston indicating what number of offense was occuring in which quarter of the year. Later we scanned the data for NA, infinite and NaN values. All NA values found in the variable were replaced with a suitable value. We also created a new variable again that showed frequency of offense in respected areas, and scanned it for outliers. The outliers were not modified as it had specific significance in the data. Lastly, to decrease the skewness of “frequency” variable, we converted it into normal distribution using log transformation. We also transformed the type of OCCURED_ON_DATE variable using strptime function.

Data

For this assignment we chose Crimes in Boston dataset from www.kaggle.com (https://www.kaggle.com/AnalyzeBoston/crimes-in-boston) . In this dataset crime incident reports are provided by Boston Police Department (BPD) to document the initial details surrounding an incident, focused on capturing the type of incident as well as when and where it occurred. Records begin in June 14, 2015 and continue to September 3, 2018. The data was provided by Analyze Boston.

The first set of data “crime.csv”, saved as data1 provides the following variables of an occured crime - INCIDENT_NUMBER, OFFENSE_CODE, OFFENSE_CODE_GROUP, OFFENSE_DESCRIPTION, DISTRICT, REPORTING_AREA(area code of the place where the offense took place), SHOOTING (to show if any shooting occur or not), OCCURRED_ON_DATE, YEAR, MONTH, DAY_OF_WEEK, HOUR, UCR_PART (in which Uniform Crime Report part it falls into), STREET, Lat (Lattitude), Long (Longitude), Location. It has 319073 observations of 16 variables.

The second set of data “offense_codes.csv” with 576 observations of 2 variables, saved as data2, provides description of offense through OFFENSE CODE and NAME. As data2 had duplicate values in it, we selected the unique ones only and saved it as uniq which now has 493 observations of 2 variables.

We merged the two datasets using inner join on the offense codes for each occured crime. After merging, the number of observations increase as multiple offenses are under one Offense Code. We saved the merged dataset as d1. d1 now consists of 416248 observations of 17 variables.

data1 <- read.csv("crime.csv", stringsAsFactors=FALSE)
head(data1)

data2 <- read.csv("offense_codes.csv", stringsAsFactors = FALSE)
head(data2)

uniq <- unique(data2)

d1 <- uniq %>% inner_join(data1,by = c("OFFENSE_CODE"="OFFENSE_CODE"))
head(d1)

Understand

In this step we first display the structure of the data and it shows that we have integers, characters. We chose to convert the SHOOTING variable into a factor with two levels - Yes or No. The is.factor function shows succesful conversion of the variable.

str(d1)

## 'data.frame':    416248 obs. of  17 variables:
##  $ OFFENSE_CODE      : int  612 612 612 612 612 612 612 612 612 612 ...
##  $ NAME              : chr  "LARCENY PURSE SNATCH - NO FORCE " "LARCENY PURSE SNATCH - NO FORCE " "LARCENY PURSE SNATCH - NO FORCE " "LARCENY PURSE SNATCH - NO FORCE " ...
##  $ INCIDENT_NUMBER   : chr  "I182070353" "I182070281" "I182064832" "I182063454" ...
##  $ OFFENSE_CODE_GROUP: chr  "Larceny" "Larceny" "Larceny" "Larceny" ...
##  $ DISTRICT          : chr  "A1" "D4" "B2" "C11" ...
##  $ REPORTING_AREA    : int  117 143 288 394 173 NA 352 356 78 422 ...
##  $ SHOOTING          : chr  "N" "N" "N" "N" ...
##  $ OCCURRED_ON_DATE  : chr  "1/09/2018 14:20" "1/09/2018 11:00" "14/08/2018 10:57" "10/08/2018 1:41" ...
##  $ YEAR              : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ MONTH             : int  9 9 8 8 8 7 7 7 7 7 ...
##  $ DAY_OF_WEEK       : chr  "Saturday" "Saturday" "Tuesday" "Friday" ...
##  $ HOUR              : int  14 11 10 1 13 1 16 16 12 12 ...
##  $ UCR_PART          : chr  "Part One" "Part One" "Part One" "Part One" ...
##  $ STREET            : chr  "BOYLSTON ST" "BOYLSTON ST" "WASHINGTON ST" "ASHMONT ST" ...
##  $ Lat               : num  42.4 42.3 42.3 42.3 42.3 ...
##  $ Long              : num  -71.1 -71.1 -71.1 -71.1 -71.1 ...
##  $ Location          : chr  "(42.35234110, -71.06432465)" "(42.34862382, -71.08277637)" "(42.32925905, -71.08482402)" "(42.28600087, -71.06344648)" ...

d1$SHOOTING <- factor(d1$SHOOTING, levels = c("Y","N"),
                         labels = c("Yes", "No"),ordered = TRUE)
is.factor(d1$SHOOTING)

## [1] TRUE

Tidy & Manipulate Data I

To make the data tidy, we dropped the following variables -

YEAR (as it is already visible in the OCCOURED_ON_DATE variable)
DAY_OF_WEEK (as it can already be derived from OCCOURED_ON_DATE variable)
HOUR (as it is already visible in the OCCOURED_ON_DATE variable)
Lat (As the lattitude can already be derived from Location variable)
Long (As the longitude can already be derived from Location variable)

We could have scrapped off the MONTH variable also for the same reason we dropped YEAR, but we have kept it for now to deal with it in the next phase.

d1 <- d1 %>% select(-(YEAR),-(DAY_OF_WEEK:HOUR),-(Lat:Long))
head(d1)

Tidy & Manipulate Data II

In this step we created a new variable called QUARTER using the MONTH variable. We used if_else function to put months 1,2,3 in First Quarter; months 4,5,6 in Second Quarter; months 7,8,9 in Third Quarter; and remaining months 10,11,12 in Fourth Quarter. We ordered them as - “First Quarter”,“Second Quarter”,“Third Quarter”, “Fourth Quarter” which shows the four quarters of a year. This variable indicates the frequency of offenses in each quarter of the year. The summary of QUARTER variable shows the number of offenses that took place in these 4 quarters.

d1$QUARTER <- if_else(d1$MONTH== 1 |d1$MONTH== 2 |d1$MONTH== 3,
                      "First Quarter",
                      if_else(d1$MONTH== 4 |d1$MONTH== 5|d1$MONTH== 6,
                              "Second Quarter",
                              if_else(d1$MONTH== 7|d1$MONTH== 8|d1$MONTH== 9,
                                      "Third Quarter", "Fourth Quarter")))


d1$QUARTER <-factor(d1$QUARTER, levels = c("First Quarter","Second Quarter","Third Quarter", "Fourth Quarter"))

summary(d1$QUARTER)

##  First Quarter Second Quarter  Third Quarter Fourth Quarter 
##          89852         105234         125780          95382

head(d1)

Scan I

We now scan all columns of the data for missing values with colSums function. It shows that there are 25379 NAs in REPORTING_AREA.

We replace the NAs in REPORTING_AREA with 0, as these are area codes for representing the area where the offense took place. Unknown areas are better to be replaced with 0 here.

Now we check for any more NAs in the variable, and the sum of NAs in this variable is 0 which means all the NAs were successfully replaced with 0.

Next, we check the data for infinite values using the lapply function and put it in a matrix called test_inf. We scan the matrix for infinite values and it shows there is no infinite value.
Next, we check the data for NaN values using the lapply function and put it in a matrix called test_nan. We scan the matrix for NaN values and it shows there is no NaN value.

colSums(is.na(d1)) #check NA

##       OFFENSE_CODE               NAME    INCIDENT_NUMBER 
##                  0                  0                  0 
## OFFENSE_CODE_GROUP           DISTRICT     REPORTING_AREA 
##                  0                  0              25379 
##           SHOOTING   OCCURRED_ON_DATE              MONTH 
##                  0                  0                  0 
##           UCR_PART             STREET           Location 
##                  0                  0                  0 
##            QUARTER 
##                  0

d1 <- d1 %>% impute(d1$REPORTING_AREA, fun = 0) 
sum(is.na(d1$REPORTING_AREA))

## [1] 0

test_inf <- do.call(cbind, lapply(d1, is.infinite)) #check infinite
any(test_inf=="TRUE")

## [1] FALSE

test_nan <- do.call(cbind, lapply(d1, is.nan)) #check NaN
any(test_nan=="TRUE")

## [1] FALSE

Scan II

We create a new data frame “df” that counts the frequency of occurances in an area, with 2 variables “REPORTING_AREA” (renamed later) and “freq” indicating frequency. Now we join this data frame to the main dataset d1 by REPORTING_AREA variable and save this new dataset as d2.

We took this step to create a new variable “freq” that indicates the frequency of offenses happening in an area, as previously we did not have any other numerical variable suitable for scanning outliers. Now we create a boxplot to visualize the outliers saving it as outliers. In “unique” we list the outliers’ unique values of REPORTING_AREAs.

We can see that there are 4 areas where we have outliers - * There are 25379 outliers in one area. This is actually the one that we replaced with area code 0 before. * The other 3 reporting areas with outliers have 2881, 3370, 2859 outleirs in each area.

These outlying values indicate that these Reporting Areas have a high frequency of recurring offenses which bears a significance in this whole dataset. So we decided to keep all the outliers as it is. Also, even we wanted to, we could not replace these area codes with any other valid values.

df <- count(d1$REPORTING_AREA)
setnames(df, old = "x", new = "REPORTING_AREA")

d2 <- left_join(d1, df, by = "REPORTING_AREA")
head(d2)

boxplot(d2$freq)
outliers<- boxplot(d2$freq)$out

uniq_out <- unique(outliers)
uniq_out

## [1]  2881 25397  3370  2859

Transform

We check the distribution of the “freq” variable, using a histogram to visualize. The histogram shows that the distribution is heavily right-skewed. To decrease the skewness and convert the distribution into a normal distribution, we use the Log Transformation and save it in log_freq. Checking the distribution again, we find that it has been converted into a normal distribution.
Finally, we transform the OCCURED_ON_DATE variable. When we checked the structure before, it was a character. We use strptime function to convert it to a date-time type variable.

hist(d2$freq)

log_freq <- log10(d2$freq)   

hist(log_freq)

d2$OCCURRED_ON_DATE <- strptime(d2$OCCURRED_ON_DATE, format="%d/%m/%Y %H:%M")
str(d2$OCCURRED_ON_DATE)

##  POSIXlt[1:416248], format: "2018-09-01 14:20:00" "2018-09-01 11:00:00" ...