library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(caret)
library(editrules)
library(forecast)
library(stringr)
library(outliers)
library(lubridate)
To fulfill the dataset requirements, I chose the Boston crimes dataset and its accompanying crime-codes dataset from https://www.kaggle.com/AnalyzeBoston/crimes-in-boston.
After that, I merged the datasets with a left join on their common attribute.
Then, I inspected all the variables required for the analysis and removed the unnecessary or redundant ones.
The longitude and latitude of each crime were used to derive a meaningful variable, "LOCATION", computed with the 'Haversine' formula from https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula/23095329#23095329.
After that, I used the is.na() function to find all the NA values and handled the missing data accordingly.
Next, we looked for outliers with a boxplot and capped them accordingly.
Finally, I plotted a histogram of the "LOCATION" variable to check for normality, then applied a Box-Cox transformation to move it towards normality, followed by min-max normalization.
I took the Boston crimes data from https://www.kaggle.com/AnalyzeBoston/crimes-in-boston. It records every crime that occurred in Boston, together with the linked descriptions and the times recorded by the Boston Police. The files were saved locally as bostcrime.csv and crime codes.csv.
The following attributes are in bostcrime.csv - DATASET A: 1) INCIDENT_NUMBER 2) OFFENSE_CODE 3) OFFENSE_CODE_GROUP 4) OFFENSE_DESCRIPTION 5) DISTRICT 6) REPORTING_AREA 7) SHOOTING 8) OCCURRED_ON_DATE 9) YEAR 10) MONTH 11) DAY_OF_WEEK 12) HOUR 13) UCR_PART 14) STREET 15) Lat 16) Long 17) Location
Attributes for crimecodes.csv - DATASET B: 1) CODE 2) NAME
## DATASET 1:
bostcrime <- read.csv("/Users/preritmiglani/Desktop/bostcrime.csv",stringsAsFactors = FALSE)
## DATASET 2:
crimecodes<- read.csv("/Users/preritmiglani/Desktop/crime codes.csv",stringsAsFactors = FALSE)
head(bostcrime)
## Head for crimecodes.csv
head(crimecodes)
## To check the number of attributes and records for the datasets
dim(bostcrime)
[1] 319073 17
dim(crimecodes)
[1] 576 2
## Now, I rename the bostcrime column 'OFFENSE_CODE' to 'CODE' to match the crimecodes dataset before merging
colnames(bostcrime)[colnames(bostcrime) == 'OFFENSE_CODE'] <- 'CODE'
## Now merging both the datasets
crime_comb <- bostcrime %>% left_join(crimecodes, by=c('CODE'))
dim(crime_comb)
[1] 577880 18
head(crime_comb)
## The merged dataset has far more records than bostcrime because crimecodes contains duplicate CODE values, so we keep only the distinct codes
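## A quick check (sketch, not part of the original script) confirms the duplicate codes:
sum(duplicated(crimecodes$CODE))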
crimecodes <- crimecodes %>% distinct(CODE, .keep_all= TRUE)
## Now we apply the left join again and check the result
crime_comb <- bostcrime %>% left_join(crimecodes, by=c('CODE'))
dim(crime_comb)
[1] 319073 18
## Checking the head of the merged dataset
head(crime_comb)
Since we imported with stringsAsFactors = FALSE, we now check the structure of the merged dataset. 'UCR_PART' relates to the severity of the crime, so it is an ordinal categorical variable and we label it accordingly; we also convert 'DAY_OF_WEEK' to a factor. The 'OCCURRED_ON_DATE' variable contains the year, month, day, and time, which are already present as separate columns, so we remove it as redundant; 'DISTRICT', 'REPORTING_AREA', and 'SHOOTING' are removed as well because they play no role in the analysis.
## Checking the structure of the merged dataset
str(crime_comb)
'data.frame': 319073 obs. of 18 variables:
$ INCIDENT_NUMBER : chr "I182070945" "I182070943" "I182070941" "I182070940" ...
$ CODE : int 619 1402 3410 3114 3114 3820 724 3301 301 3301 ...
$ OFFENSE_CODE_GROUP : chr "Larceny" "Vandalism" "Towed" "Investigate Property" ...
$ OFFENSE_DESCRIPTION: chr "LARCENY ALL OTHERS" "VANDALISM" "TOWED MOTOR VEHICLE" "INVESTIGATE PROPERTY" ...
$ DISTRICT : chr "D14" "C11" "D4" "D4" ...
$ REPORTING_AREA : int 808 347 151 272 421 398 330 584 177 364 ...
$ SHOOTING : chr "" "" "" "" ...
$ OCCURRED_ON_DATE : chr "2018-09-02 13:00:00" "2018-08-21 00:00:00" "2018-09-03 19:27:00" "2018-09-03 21:16:00" ...
$ YEAR : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
$ MONTH : int 9 8 9 9 9 9 9 9 9 9 ...
$ DAY_OF_WEEK : chr "Sunday" "Tuesday" "Monday" "Monday" ...
$ HOUR : int 13 0 19 21 21 21 21 20 20 20 ...
$ UCR_PART : chr "Part One" "Part Two" "Part Three" "Part Three" ...
$ STREET : chr "LINCOLN ST" "HECLA ST" "CAZENOVE ST" "NEWCOMB ST" ...
$ Lat : num 42.4 42.3 42.3 42.3 42.3 ...
$ Long : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
$ Location : chr "(42.35779134, -71.13937053)" "(42.30682138, -71.06030035)" "(42.34658879, -71.07242943)" "(42.33418175, -71.07866441)" ...
$ NAME : chr "LARCENY ALL OTHERS" "VANDALISM" "TOWED MOTOR VEHICLE" "INVESTIGATE PROPERTY" ...
## Removing the attributes which are not necessary
crime_comb<-crime_comb %>% select(-c("SHOOTING","OCCURRED_ON_DATE","DISTRICT","REPORTING_AREA"))
## Converting UCR_PART to an ordered factor labelled by severity:
## UCR Part One covers the most serious offenses, Part Three the least
crime_comb$UCR_PART <- factor(crime_comb$UCR_PART,
                              levels = c("Part Three", "Part Two", "Part One"),
                              labels = c("Low", "Mild", "Severe"),
                              ordered = TRUE)
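## With an ordered factor, severity comparisons work directly, e.g. counting
## crimes rated more severe than "Mild" (an illustrative check, not part of
## the original script):
sum(crime_comb$UCR_PART > "Mild", na.rm = TRUE)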
## Converting to factor
crime_comb$DAY_OF_WEEK <- factor(crime_comb$DAY_OF_WEEK)
## Checking the structure again
str(crime_comb)
'data.frame': 319073 obs. of 14 variables:
$ INCIDENT_NUMBER : chr "I182070945" "I182070943" "I182070941" "I182070940" ...
$ CODE : int 619 1402 3410 3114 3114 3820 724 3301 301 3301 ...
$ OFFENSE_CODE_GROUP : chr "Larceny" "Vandalism" "Towed" "Investigate Property" ...
$ OFFENSE_DESCRIPTION: chr "LARCENY ALL OTHERS" "VANDALISM" "TOWED MOTOR VEHICLE" "INVESTIGATE PROPERTY" ...
$ YEAR : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
$ MONTH : int 9 8 9 9 9 9 9 9 9 9 ...
$ DAY_OF_WEEK : Factor w/ 7 levels "Friday","Monday",..: 4 6 2 2 2 2 2 2 2 2 ...
$ HOUR : int 13 0 19 21 21 21 21 20 20 20 ...
$ UCR_PART : Ord.factor w/ 3 levels "Low"<"Mild"<..: 3 2 1 1 1 1 3 1 3 1 ...
$ STREET : chr "LINCOLN ST" "HECLA ST" "CAZENOVE ST" "NEWCOMB ST" ...
$ Lat : num 42.4 42.3 42.3 42.3 42.3 ...
$ Long : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
$ Location : chr "(42.35779134, -71.13937053)" "(42.30682138, -71.06030035)" "(42.34658879, -71.07242943)" "(42.33418175, -71.07866441)" ...
$ NAME : chr "LARCENY ALL OTHERS" "VANDALISM" "TOWED MOTOR VEHICLE" "INVESTIGATE PROPERTY" ...
We can say that the dataset is now tidy: each variable has its own column and each row is a single observation. The (Latitude, Longitude) pair gives the location of the crime, which is used further below, and the date components are already stored in separate columns along with the time details, so the combined attribute OCCURRED_ON_DATE was removed from the dataset.
First, we see that Lat and Long give the location of each crime, so we will add a new column, LOCATION, to the dataset for the analysis. To find the centroid of Long and Lat, we compute the mean of both variables.
## Some Lat and Long values are invalid placeholders, so we first pattern-match the columns and compute the mean from the valid cluster only
## Take the mean of Lat by using the clustered centre
centroid_lat <- crime_comb %>% filter(str_detect(Lat, regex("4[0-9]{1,}\\.[0-9]{4,}"))) %>% summarise(mean(Lat))
## Take the mean of Long by using the clustered centre
centroid_long <- crime_comb %>% filter(str_detect(Long, regex("-7[0-9]{1,}\\.[0-9]{4,}"))) %>% summarise(mean(Long))
## Now, to calculate the distance between two points, we will use the Haversine formula from the source https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula/23095329#23095329
hav.location <- function(long1, lat1, long2, lat2) {
  R <- 6371  # Earth's radius in km
  ## Convert degrees to radians before applying the trigonometric functions
  long1 <- long1 * pi / 180
  lat1 <- lat1 * pi / 180
  long2 <- long2 * pi / 180
  lat2 <- lat2 * pi / 180
  diff.long <- long2 - long1
  diff.lat <- lat2 - lat1
  a <- sin(diff.lat/2)^2 + cos(lat1) * cos(lat2) * sin(diff.long/2)^2
  b <- 2 * asin(pmin(1, sqrt(a)))
  d <- R * b  # distance in km
  return(d)
}
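For reference, the function above implements the standard haversine formula,

$$d = 2R\,\arcsin\!\left(\sqrt{\sin^2\!\left(\tfrac{\Delta\phi}{2}\right) + \cos\phi_1\,\cos\phi_2\,\sin^2\!\left(\tfrac{\Delta\lambda}{2}\right)}\right)$$

where $\phi$ denotes latitude, $\lambda$ denotes longitude (both in radians, hence the degree-to-radian conversion above), and $R$ is the Earth's radius.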
## Add the LOCATION column
crime_comb <- crime_comb %>% mutate(LOCATION = hav.location(centroid_long[[1]], centroid_lat[[1]], Long, Lat ))
## Head after mutating
head(crime_comb %>% select(Lat,Long,LOCATION))
We will check for missing values in this section.
## First, we check the total number of missing values per column
colSums(is.na(crime_comb))
INCIDENT_NUMBER CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
0 0 0 0
YEAR MONTH DAY_OF_WEEK HOUR
0 0 0 0
UCR_PART STREET Lat Long
1322 0 19999 19999
Location NAME LOCATION
0 0 19999
## We apply the same pattern matching to the Lat and Long values
## and replace any value that does not match with NA
crime_comb$Lat[str_detect(crime_comb$Lat, regex("4[0-9]{1,}\\.[0-9]{4,}")) == FALSE] <- NA
crime_comb$Long[str_detect(crime_comb$Long, regex("-7[0-9]{1,}\\.[0-9]{4,}")) == FALSE] <- NA
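## Note: str_detect() works on these numeric columns because they are coerced
## to character. An alternative, purely numeric check is sketched below (the
## bounds are assumptions, chosen because valid Boston coordinates cluster
## near Lat 42 and Long -71; it is not exactly equivalent, since the regex
## also drops coordinates recorded with fewer than four decimal places):
crime_comb$Lat[crime_comb$Lat < 40 | crime_comb$Lat > 45] <- NA
crime_comb$Long[crime_comb$Long < -72 | crime_comb$Long > -70] <- NA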
## Again checking for the missing values
colSums(is.na(crime_comb))
INCIDENT_NUMBER CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
0 0 0 0
YEAR MONTH DAY_OF_WEEK HOUR
0 0 0 0
UCR_PART STREET Lat Long
1322 0 20744 20756
Location NAME LOCATION
0 0 19999
## Now, we will remove the records with missing values.
crime_comb <- na.omit(crime_comb)
## Checking the missing values
colSums(is.na(crime_comb))
INCIDENT_NUMBER CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
0 0 0 0
YEAR MONTH DAY_OF_WEEK HOUR
0 0 0 0
UCR_PART STREET Lat Long
0 0 0 0
Location NAME LOCATION
0 0 0
## We will now create a function to check for infinite and NaN values in the dataset
is.special_value <- function(x){
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}
## Apply the function to the dataset
colSums(sapply(crime_comb[, c("Lat", "Long", "LOCATION")], is.special_value))
Lat Long LOCATION
0 0 0
## Checking the dimensions
dim(crime_comb)
[1] 297076 15
First, we create a boxplot of the univariate LOCATION variable to check for outliers. We would also like to flag outliers with z-scores, but that requires approximate normality, and the histogram below shows that normality is not attained, so z-scores cannot be used for this dataset.
## We will create a box plot
boxplot(crime_comb$LOCATION, main="CRIME IN BOSTON", ylab="LOCATION",col="grey")
## Create a histogram to check for normality
hist(crime_comb$LOCATION)
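## Had LOCATION been approximately normal, z-scores could have flagged
## outliers, e.g. as follows (illustrative only, since normality does not
## hold here):
z <- scale(crime_comb$LOCATION)
sum(abs(z) > 3)  # observations more than 3 standard deviations from the mean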
## Create a function to cap outlier values
cap <- function(x) {
  qtiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  ## Values above Q3 + 1.5 * IQR are replaced with the 95th percentile
  x[x > qtiles[3] + 1.5 * IQR(x)] <- qtiles[4]
  x
}
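## Quick illustration of cap() on toy data (hypothetical values, not part of
## the dataset): the extreme value 10 is pulled back to the 95th percentile
set.seed(1)
toy <- c(rnorm(100), 10)
max(cap(toy))  # no longer 10 after capping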
## Apply the function to cap outliers
crime_comb$LOCATION <- crime_comb$LOCATION %>% cap()
## Now, lets check again to see the outliers
boxplot(crime_comb$LOCATION, main="CRIME IN BOSTON", ylab="LOCATION",col="grey")
We now take the LOCATION variable and transform its values to correct the skewness seen in the histogram. I used Box-Cox here, as it is well suited to transforming a skewed variable towards normality.
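For reference, the classic Box-Cox transform of a positive variable $y$ is

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt] \log y, & \lambda = 0, \end{cases}$$

and `BoxCox()` from the forecast package with `lambda = "auto"` estimates $\lambda$ from the data automatically.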
## Histogram of the LOCATION variable against crime frequency
hist(crime_comb$LOCATION,
xlab = "LOCATION",
col = "grey",
ylab = "CRIME FREQUENCY")
## Using Box-cox for transformation
boxcox_crime<- BoxCox(crime_comb$LOCATION,lambda = "auto")
## Again we will make a histogram to check for the transformation
hist(boxcox_crime,xlab = "LOCATION",
col = "green",
ylab = "CRIME FREQUENCY")
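The overview also mentioned min-max normalization as the final step; a minimal sketch of that step (an assumption about the intended scaling, mapping the transformed values to the [0, 1] range) would be:

## Min-max normalization of the Box-Cox-transformed variable (sketch)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
boxcox_scaled <- minmax(boxcox_crime)
summary(boxcox_scaled)  # values now lie in [0, 1]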