By using the following code block we are reading the csv file from the location in my system. We are calling the head function to provide the output of first 10 observations in the dataset. Here, we have used the head function in R to print the first 10 observations of a dataset.
crime_dataset <- read.csv("C:/DataWrangling/Assignment1/Week 1/week1_cincy_crimes.csv", header = TRUE)
head(crime_dataset,10)
## instanceid closed opening
## 1 92A296AB-D1B7-40CE-BD96-209CFF141FDA J--CLOSED <NA>
## 2 44ACB102-5B1D-40F8-9E2B-1F823A26705D Z--EARLY CLOSED <NA>
## 3 2CED4B80-3AB7-46DF-BBD9-3531A0C6727A D--VICTIM REFUSED TO COOPERATE <NA>
## 4 EEB41765-CBC3-476C-BDBE-4273B4E0CC7E J--CLOSED <NA>
## 5 F4622DF5-8274-4290-AB0E-73CB9A720905 J--CLOSED <NA>
## 6 EF456ED0-031E-4171-8C96-29CF91BC9A9B Z--EARLY CLOSED <NA>
## 7 0859E5C0-4543-469D-910E-D14F603AB5BC H--WARRANT ISSUED <NA>
## 8 9B091265-0352-4198-A19E-B960BDE15091 Z--EARLY CLOSED <NA>
## 9 D2DAF74C-1991-4E51-B81C-6BE79F7DA0F7 Z--EARLY CLOSED <NA>
## 10 43EEB437-DF03-47D0-AB01-951B9EBBFA04 J--CLOSED <NA>
## dayofweek victim_gender totalnumbervictims totalsuspects
## 1 SATURDAY FEMALE 1 1
## 2 THURSDAY FEMALE 1 NA
## 3 TUESDAY FEMALE 1 1
## 4 WEDNESDAY FEMALE 1 NA
## 5 TUESDAY FEMALE 1 NA
## 6 SUNDAY FEMALE 2 NA
## 7 TUESDAY MALE 1 1
## 8 WEDNESDAY <NA> 2 NA
## 9 FRIDAY <NA> 1 1
## 10 WEDNESDAY FEMALE 1 NA
In the structure we can see that we have 4 categorical variables such as (instanceid,closed,opening,dayofweek,victim_gender). We also have 2 continous variables like totalnumbervictims and totalsuspects.
str(crime_dataset)
## 'data.frame': 21153 obs. of 7 variables:
## $ instanceid : Factor w/ 15912 levels "00012F44-F700-43A2-B1FA-33EAC07810A6",..: 9249 4352 2808 14897 15248 14946 518 9798 13140 4301 ...
## $ closed : Factor w/ 12 levels "A--DEATH OF OFFENDER",..: 9 12 3 9 9 12 7 12 12 9 ...
## $ opening : Factor w/ 7 levels "??","1 - DOOR",..: NA NA NA NA NA NA NA NA NA NA ...
## $ dayofweek : Factor w/ 7 levels "FRIDAY","MONDAY",..: 3 5 6 7 6 4 6 7 1 7 ...
## $ victim_gender : Factor w/ 6 levels "F - FEMALE","FEMALE",..: 2 2 2 2 2 2 4 NA NA 2 ...
## $ totalnumbervictims: int 1 1 1 1 1 2 1 2 1 1 ...
## $ totalsuspects : int 1 NA 1 NA NA NA 1 NA 1 NA ...
Yes the variable names required to be changed to have a clearer understanding of what a particular column represents. We have made a few changes in the following dataset. In this case we have renamed totalnumbervictims as “TotalVictims”, dayofweek as “day_of_week”,instance_id as “crime_id” & “closed” as “CrimeStatus”.
colnames(crime_dataset)[colnames(crime_dataset) == "totalnumbervictims"] <- "TotalVictims"
colnames(crime_dataset)[colnames(crime_dataset) == "dayofweek"] <- "day_of_week"
colnames(crime_dataset)[colnames(crime_dataset) == "instanceid"] <- "CrimeID"
colnames(crime_dataset)[colnames(crime_dataset) == "closed"] <- "CrimeStatus"
Printing the column names of the dataset after the changes are been made.
colnames(crime_dataset)
## [1] "CrimeID" "CrimeStatus" "opening" "day_of_week"
## [5] "victim_gender" "TotalVictims" "totalsuspects"
From the below code we can see the different types for the variables.
sapply(crime_dataset,class)
## CrimeID CrimeStatus opening day_of_week victim_gender
## "factor" "factor" "factor" "factor" "factor"
## TotalVictims totalsuspects
## "integer" "integer"
In the given dataset we don’t think that any variables datatype needs to be changed. Here CrimeId is a categorical variable and it has unique values. So its isnt appropriate to change its type as we won’t be using it to withdraw any information from it.
So the final structure of the dataframe remains the same.
str(crime_dataset)
## 'data.frame': 21153 obs. of 7 variables:
## $ CrimeID : Factor w/ 15912 levels "00012F44-F700-43A2-B1FA-33EAC07810A6",..: 9249 4352 2808 14897 15248 14946 518 9798 13140 4301 ...
## $ CrimeStatus : Factor w/ 12 levels "A--DEATH OF OFFENDER",..: 9 12 3 9 9 12 7 12 12 9 ...
## $ opening : Factor w/ 7 levels "??","1 - DOOR",..: NA NA NA NA NA NA NA NA NA NA ...
## $ day_of_week : Factor w/ 7 levels "FRIDAY","MONDAY",..: 3 5 6 7 6 4 6 7 1 7 ...
## $ victim_gender: Factor w/ 6 levels "F - FEMALE","FEMALE",..: 2 2 2 2 2 2 4 NA NA 2 ...
## $ TotalVictims : int 1 1 1 1 1 2 1 2 1 1 ...
## $ totalsuspects: int 1 NA 1 NA NA NA 1 NA 1 NA ...
In the given dataset there are 487 missing values in ‘closed’ , 19562 in ‘opening’ , 363 in ‘day_of_week’, 3165 in ‘victim_gender’, 20 in ‘Total Victim’ and 10039 in ’totalsuspects’columns.
colSums(is.na(crime_dataset))
## CrimeID CrimeStatus opening day_of_week victim_gender
## 0 487 19562 363 3165
## TotalVictims totalsuspects
## 20 10039
We won’t remove the entire observation in case of missing values.In this case we will first investigate the missing value. If most of the values in an observation are missing and the observation doesn’t give any value to the model that might be developed in the future then in that case we can remove the observation whereas in the other case we should keep the observation.
The data has the same CrimeID repeating multiple times. This is because each observation corresponds to a particular victim of a given CrimID. A good way to deal with this would be to group the original data by CrimeID and add 3 new columns
“number_of_male_victims” , “number_of_female_victims” and
“number_of_victims_with_gender_unknown”. (The data has not been grouped for this assignment.)
The variable CrimeStatus has NA values as well as a level ‘Unknown’, which represents the same inference. Hence NA values has been replaced with the level ‘Unknown’.
The variable ‘opening’ has a number preceeding the category levels such as ‘1-Door’. These preceeding numbers have been removed. This has been done as the variable is not an ordered categorical value, but an unordered categorical value. The levels ‘NA’ and ‘??’ and ‘Unknown’ have been converted as a single level represented by ‘Unknown’.
The variable ‘victim_gender’ uses mulitple levels to indicate the same gender. These levels have been merged. Missing values have been imputed the value ‘UNKNOWN’.
Observations have not been removed at this stage as ‘UNKNOWN’ values also serve helpful in drawing insights.
sapply(crime_dataset[,-c(1)],unique)
## $CrimeStatus
## [1] J--CLOSED Z--EARLY CLOSED
## [3] D--VICTIM REFUSED TO COOPERATE H--WARRANT ISSUED
## [5] F--CLEARED BY ARREST - ADULT I--INVESTIGATION PENDING
## [7] K--UNFOUNDED G--CLEARED BY ARREST - JUVENILE
## [9] <NA> B--PROSECUTION DECLINED
## [11] A--DEATH OF OFFENDER U--UNKNOWN
## [13] E--JUVENILE/NO CUSTODY
## 12 Levels: A--DEATH OF OFFENDER ... Z--EARLY CLOSED
##
## $opening
## [1] <NA> 1 - DOOR 2 - WINDOW 5 - OTHER 3 - GARAGE
## [6] 6 - UNKNOWN 4 - SKYLIGHT ??
## 7 Levels: ?? 1 - DOOR 2 - WINDOW 3 - GARAGE 4 - SKYLIGHT ... 6 - UNKNOWN
##
## $day_of_week
## [1] SATURDAY THURSDAY TUESDAY WEDNESDAY SUNDAY FRIDAY MONDAY
## [8] <NA>
## Levels: FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
##
## $victim_gender
## [1] FEMALE MALE <NA>
## [4] UNKNOWN F - FEMALE M - MALE
## [7] NON-PERSON (BUSINESS
## Levels: F - FEMALE FEMALE M - MALE MALE NON-PERSON (BUSINESS UNKNOWN
##
## $TotalVictims
## [1] 1 2 18 3 4 6 126 5 8 NA 9 7
##
## $totalsuspects
## [1] 1 NA 2 4 5 3 7 6 8
# CrimeStatus
crime_dataset$CrimeStatus[is.na(crime_dataset$CrimeStatus)] <- 'U--UNKNOWN'
#opening
crime_dataset$opening[is.na(crime_dataset$opening)] <- '6 - UNKNOWN'
crime_dataset$opening[crime_dataset$opening %in% '??'] <- '6 - UNKNOWN'
crime_dataset$opening <- as.character(crime_dataset$opening)
crime_dataset$opening[crime_dataset$opening == '1 - DOOR'] <- 'DOOR'
crime_dataset$opening[crime_dataset$opening == '2 - WINDOW'] <- 'WINDOW'
crime_dataset$opening[crime_dataset$opening == '3 - GARAGE'] <- 'GARAGE'
crime_dataset$opening[crime_dataset$opening == '4 - SKYLIGHT'] <- 'SKYLIGHT'
crime_dataset$opening[crime_dataset$opening == '5 - OTHER'] <- 'OTHER'
crime_dataset$opening[crime_dataset$opening == '6 - UNKNOWN'] <- 'UNKNOWN'
crime_dataset$opening <- as.factor(crime_dataset$opening)
#victim_gender
crime_dataset$victim_gender <- as.character(crime_dataset$victim_gender)
crime_dataset$victim_gender[is.na(crime_dataset$victim_gender)] <- 'UNKNOWN'
crime_dataset$victim_gender[crime_dataset$victim_gender=='F - FEMALE'] <- 'FEMALE'
crime_dataset$victim_gender[crime_dataset$victim_gender=='M - MALE'] <- 'MALE'
crime_dataset$victim_gender <- as.factor(crime_dataset$victim_gender)
sapply(crime_dataset[,-c(1)],unique)
## $CrimeStatus
## [1] J--CLOSED Z--EARLY CLOSED
## [3] D--VICTIM REFUSED TO COOPERATE H--WARRANT ISSUED
## [5] F--CLEARED BY ARREST - ADULT I--INVESTIGATION PENDING
## [7] K--UNFOUNDED G--CLEARED BY ARREST - JUVENILE
## [9] U--UNKNOWN B--PROSECUTION DECLINED
## [11] A--DEATH OF OFFENDER E--JUVENILE/NO CUSTODY
## 12 Levels: A--DEATH OF OFFENDER ... Z--EARLY CLOSED
##
## $opening
## [1] UNKNOWN DOOR WINDOW OTHER GARAGE SKYLIGHT
## Levels: DOOR GARAGE OTHER SKYLIGHT UNKNOWN WINDOW
##
## $day_of_week
## [1] SATURDAY THURSDAY TUESDAY WEDNESDAY SUNDAY FRIDAY MONDAY
## [8] <NA>
## Levels: FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
##
## $victim_gender
## [1] FEMALE MALE UNKNOWN
## [4] NON-PERSON (BUSINESS
## Levels: FEMALE MALE NON-PERSON (BUSINESS UNKNOWN
##
## $TotalVictims
## [1] 1 2 18 3 4 6 126 5 8 NA 9 7
##
## $totalsuspects
## [1] 1 NA 2 4 5 3 7 6 8
# Identifying outliers in numeric variables
summary(crime_dataset$TotalVictims)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 1.000 2.092 1.000 126.000 20
b1 <- boxplot(crime_dataset$TotalVictims)
hist(crime_dataset$TotalVictims)
nrow(subset(crime_dataset,crime_dataset$TotalVictims %in% b1$out))/nrow(crime_dataset)
## [1] 0.2056446
summary(crime_dataset$totalsuspects)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 1.000 1.579 2.000 8.000 10039
b2 <- boxplot(crime_dataset$totalsuspects)
hist(crime_dataset$totalsuspects)
nrow(subset(crime_dataset,crime_dataset$totalsuspects %in% b2$out))/nrow(crime_dataset)
## [1] 0.03139035
For categorical variables such as ‘opening’,‘victim_gender’ and ‘CrimeStatus’ , missing values have been imputed with an existing category that represents unknown data. E.g. NA in the variable ‘opening’ has been imputed the value ‘UNKNOWN’ which was an already existing level within the variable (in the original dataset). (Code for this point has already been run in question 6)
The variable ‘TotalVictims’ has 20 missing values. The variable’totalsuspects’ has 47.5% missing values.These have not been removed or imputed, as crime data is a very subjective dataset. Each crime is unique and hence it is not a good practice to impute the values of victims or suspects with the median count/average value.
# % of missing values in 'TotalVictims'
sum(is.na(crime_dataset$TotalVictims))/nrow(crime_dataset)
## [1] 0.0009454924
# % of missing values in 'totalsuspects'
sum(is.na(crime_dataset$totalsuspects))/nrow(crime_dataset)
## [1] 0.4745899
For Questions 9 and 10, we decided on a few visualizations of sinigicance: The Frequncey of the Number of Suspects, the Frequency of the Number of Victims, Number of Victims for each Gender, The Number of Crimes for each Day of the Week,and the Number of Crimes per Opening Type.
Frequency of the Number of Suspects
hist(crime_dataset$totalsuspects, probability = T, main="Total Suspect Histogram", xlab="Total Suspects")
Frequency of Number of Victims
hist(crime_dataset$TotalVictims, probability= T, main= "Total Victims Histogram", xlab= "Total Victims")
Number of Victims per Gender
plot(crime_dataset$victim_gender, xlab = "Gender of Victims" , ylab = "Number of Victims", main = "Number of Victims per Gender", col = "red")
Number of Crimes by Day of Week
daytable=table(crime_dataset$day_of_week)
daytable
##
## FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
## 3062 3020 2925 2883 2925 3048 2927
daytable=daytable[c(4,2,6,7,5,1,3)]
barplot(daytable,
xlab = "Days of Weeks" , ylab = "Number of Crimes", main = "Number of Crimes each day", col = "gold",
border="red",
density=50,
)
Number of Crimes per Opening Type
barplot(table(crime_dataset$opening),
xlab = "Opening" , ylab = "Number of Crime", main = "Number of Crimes per Opening Type", col = "gold",
border="red",
density=100,
)