BANA 7025
Group number 7
(Ajinkya Dalvi, Anjali Shalimar, Vincent Chiang)
Assignment 1

Part 1 - Acquainting yourself with the data

Question 1

By using the following code block we are reading the csv file from the location in my system. We are calling the head function to provide the output of first 10 observations in the dataset. Here, we have used the head function in R to print the first 10 observations of a dataset.

crime_dataset <- read.csv("C:/DataWrangling/Assignment1/Week 1/week1_cincy_crimes.csv", header = TRUE)
head(crime_dataset,10)
##                              instanceid                         closed opening
## 1  92A296AB-D1B7-40CE-BD96-209CFF141FDA                      J--CLOSED    <NA>
## 2  44ACB102-5B1D-40F8-9E2B-1F823A26705D                Z--EARLY CLOSED    <NA>
## 3  2CED4B80-3AB7-46DF-BBD9-3531A0C6727A D--VICTIM REFUSED TO COOPERATE    <NA>
## 4  EEB41765-CBC3-476C-BDBE-4273B4E0CC7E                      J--CLOSED    <NA>
## 5  F4622DF5-8274-4290-AB0E-73CB9A720905                      J--CLOSED    <NA>
## 6  EF456ED0-031E-4171-8C96-29CF91BC9A9B                Z--EARLY CLOSED    <NA>
## 7  0859E5C0-4543-469D-910E-D14F603AB5BC              H--WARRANT ISSUED    <NA>
## 8  9B091265-0352-4198-A19E-B960BDE15091                Z--EARLY CLOSED    <NA>
## 9  D2DAF74C-1991-4E51-B81C-6BE79F7DA0F7                Z--EARLY CLOSED    <NA>
## 10 43EEB437-DF03-47D0-AB01-951B9EBBFA04                      J--CLOSED    <NA>
##    dayofweek victim_gender totalnumbervictims totalsuspects
## 1   SATURDAY        FEMALE                  1             1
## 2   THURSDAY        FEMALE                  1            NA
## 3    TUESDAY        FEMALE                  1             1
## 4  WEDNESDAY        FEMALE                  1            NA
## 5    TUESDAY        FEMALE                  1            NA
## 6     SUNDAY        FEMALE                  2            NA
## 7    TUESDAY          MALE                  1             1
## 8  WEDNESDAY          <NA>                  2            NA
## 9     FRIDAY          <NA>                  1             1
## 10 WEDNESDAY        FEMALE                  1            NA

Question 2

In the structure we can see that we have 4 categorical variables such as (instanceid,closed,opening,dayofweek,victim_gender). We also have 2 continous variables like totalnumbervictims and totalsuspects.

str(crime_dataset)
## 'data.frame':    21153 obs. of  7 variables:
##  $ instanceid        : Factor w/ 15912 levels "00012F44-F700-43A2-B1FA-33EAC07810A6",..: 9249 4352 2808 14897 15248 14946 518 9798 13140 4301 ...
##  $ closed            : Factor w/ 12 levels "A--DEATH OF OFFENDER",..: 9 12 3 9 9 12 7 12 12 9 ...
##  $ opening           : Factor w/ 7 levels "??","1 - DOOR",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ dayofweek         : Factor w/ 7 levels "FRIDAY","MONDAY",..: 3 5 6 7 6 4 6 7 1 7 ...
##  $ victim_gender     : Factor w/ 6 levels "F - FEMALE","FEMALE",..: 2 2 2 2 2 2 4 NA NA 2 ...
##  $ totalnumbervictims: int  1 1 1 1 1 2 1 2 1 1 ...
##  $ totalsuspects     : int  1 NA 1 NA NA NA 1 NA 1 NA ...

Question 3

Yes the variable names required to be changed to have a clearer understanding of what a particular column represents. We have made a few changes in the following dataset. In this case we have renamed totalnumbervictims as “TotalVictims”, dayofweek as “day_of_week”,instance_id as “crime_id” & “closed” as “CrimeStatus”.

colnames(crime_dataset)[colnames(crime_dataset) == "totalnumbervictims"] <- "TotalVictims"
colnames(crime_dataset)[colnames(crime_dataset) == "dayofweek"] <- "day_of_week"
colnames(crime_dataset)[colnames(crime_dataset) == "instanceid"] <- "CrimeID"
colnames(crime_dataset)[colnames(crime_dataset) == "closed"] <- "CrimeStatus"

Printing the column names of the dataset after the changes are been made.

colnames(crime_dataset)
## [1] "CrimeID"       "CrimeStatus"   "opening"       "day_of_week"  
## [5] "victim_gender" "TotalVictims"  "totalsuspects"

Question 4

From the below code we can see the different types for the variables.

sapply(crime_dataset,class)
##       CrimeID   CrimeStatus       opening   day_of_week victim_gender 
##      "factor"      "factor"      "factor"      "factor"      "factor" 
##  TotalVictims totalsuspects 
##     "integer"     "integer"

In the given dataset we don’t think that any variables datatype needs to be changed. Here CrimeId is a categorical variable and it has unique values. So its isnt appropriate to change its type as we won’t be using it to withdraw any information from it.

So the final structure of the dataframe remains the same.

str(crime_dataset)
## 'data.frame':    21153 obs. of  7 variables:
##  $ CrimeID      : Factor w/ 15912 levels "00012F44-F700-43A2-B1FA-33EAC07810A6",..: 9249 4352 2808 14897 15248 14946 518 9798 13140 4301 ...
##  $ CrimeStatus  : Factor w/ 12 levels "A--DEATH OF OFFENDER",..: 9 12 3 9 9 12 7 12 12 9 ...
##  $ opening      : Factor w/ 7 levels "??","1 - DOOR",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ day_of_week  : Factor w/ 7 levels "FRIDAY","MONDAY",..: 3 5 6 7 6 4 6 7 1 7 ...
##  $ victim_gender: Factor w/ 6 levels "F - FEMALE","FEMALE",..: 2 2 2 2 2 2 4 NA NA 2 ...
##  $ TotalVictims : int  1 1 1 1 1 2 1 2 1 1 ...
##  $ totalsuspects: int  1 NA 1 NA NA NA 1 NA 1 NA ...

Question 5

In the given dataset there are 487 missing values in ‘closed’ , 19562 in ‘opening’ , 363 in ‘day_of_week’, 3165 in ‘victim_gender’, 20 in ‘Total Victim’ and 10039 in ’totalsuspects’columns.

colSums(is.na(crime_dataset))
##       CrimeID   CrimeStatus       opening   day_of_week victim_gender 
##             0           487         19562           363          3165 
##  TotalVictims totalsuspects 
##            20         10039

We won’t remove the entire observation in case of missing values.In this case we will first investigate the missing value. If most of the values in an observation are missing and the observation doesn’t give any value to the model that might be developed in the future then in that case we can remove the observation whereas in the other case we should keep the observation.

Part 2 - Data Cleaning

Question 6

Summary

  1. The data has the same CrimeID repeating multiple times. This is because each observation corresponds to a particular victim of a given CrimID. A good way to deal with this would be to group the original data by CrimeID and add 3 new columns
    “number_of_male_victims” , “number_of_female_victims” and
    “number_of_victims_with_gender_unknown”. (The data has not been grouped for this assignment.)

  2. The variable CrimeStatus has NA values as well as a level ‘Unknown’, which represents the same inference. Hence NA values has been replaced with the level ‘Unknown’.

  3. The variable ‘opening’ has a number preceeding the category levels such as ‘1-Door’. These preceeding numbers have been removed. This has been done as the variable is not an ordered categorical value, but an unordered categorical value. The levels ‘NA’ and ‘??’ and ‘Unknown’ have been converted as a single level represented by ‘Unknown’.

  4. The variable ‘victim_gender’ uses mulitple levels to indicate the same gender. These levels have been merged. Missing values have been imputed the value ‘UNKNOWN’.

  5. Observations have not been removed at this stage as ‘UNKNOWN’ values also serve helpful in drawing insights.

sapply(crime_dataset[,-c(1)],unique)
## $CrimeStatus
##  [1] J--CLOSED                       Z--EARLY CLOSED                
##  [3] D--VICTIM REFUSED TO COOPERATE  H--WARRANT ISSUED              
##  [5] F--CLEARED BY ARREST - ADULT    I--INVESTIGATION PENDING       
##  [7] K--UNFOUNDED                    G--CLEARED BY ARREST - JUVENILE
##  [9] <NA>                            B--PROSECUTION DECLINED        
## [11] A--DEATH OF OFFENDER            U--UNKNOWN                     
## [13] E--JUVENILE/NO CUSTODY         
## 12 Levels: A--DEATH OF OFFENDER ... Z--EARLY CLOSED
## 
## $opening
## [1] <NA>         1 - DOOR     2 - WINDOW   5 - OTHER    3 - GARAGE  
## [6] 6 - UNKNOWN  4 - SKYLIGHT ??          
## 7 Levels: ?? 1 - DOOR 2 - WINDOW 3 - GARAGE 4 - SKYLIGHT ... 6 - UNKNOWN
## 
## $day_of_week
## [1] SATURDAY  THURSDAY  TUESDAY   WEDNESDAY SUNDAY    FRIDAY    MONDAY   
## [8] <NA>     
## Levels: FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
## 
## $victim_gender
## [1] FEMALE               MALE                 <NA>                
## [4] UNKNOWN              F - FEMALE           M - MALE            
## [7] NON-PERSON (BUSINESS
## Levels: F - FEMALE FEMALE M - MALE MALE NON-PERSON (BUSINESS UNKNOWN
## 
## $TotalVictims
##  [1]   1   2  18   3   4   6 126   5   8  NA   9   7
## 
## $totalsuspects
## [1]  1 NA  2  4  5  3  7  6  8
# CrimeStatus
crime_dataset$CrimeStatus[is.na(crime_dataset$CrimeStatus)] <- 'U--UNKNOWN'

#opening
crime_dataset$opening[is.na(crime_dataset$opening)] <- '6 - UNKNOWN'
crime_dataset$opening[crime_dataset$opening %in% '??'] <- '6 - UNKNOWN'
crime_dataset$opening <-  as.character(crime_dataset$opening)
crime_dataset$opening[crime_dataset$opening == '1 - DOOR'] <- 'DOOR'
crime_dataset$opening[crime_dataset$opening == '2 - WINDOW'] <- 'WINDOW'
crime_dataset$opening[crime_dataset$opening == '3 - GARAGE'] <- 'GARAGE'
crime_dataset$opening[crime_dataset$opening == '4 - SKYLIGHT'] <- 'SKYLIGHT'
crime_dataset$opening[crime_dataset$opening == '5 - OTHER'] <- 'OTHER'
crime_dataset$opening[crime_dataset$opening == '6 - UNKNOWN'] <- 'UNKNOWN'
crime_dataset$opening <-  as.factor(crime_dataset$opening)

#victim_gender
crime_dataset$victim_gender <- as.character(crime_dataset$victim_gender)
crime_dataset$victim_gender[is.na(crime_dataset$victim_gender)] <- 'UNKNOWN'
crime_dataset$victim_gender[crime_dataset$victim_gender=='F - FEMALE'] <- 'FEMALE'
crime_dataset$victim_gender[crime_dataset$victim_gender=='M - MALE'] <- 'MALE'
crime_dataset$victim_gender <- as.factor(crime_dataset$victim_gender)

sapply(crime_dataset[,-c(1)],unique)
## $CrimeStatus
##  [1] J--CLOSED                       Z--EARLY CLOSED                
##  [3] D--VICTIM REFUSED TO COOPERATE  H--WARRANT ISSUED              
##  [5] F--CLEARED BY ARREST - ADULT    I--INVESTIGATION PENDING       
##  [7] K--UNFOUNDED                    G--CLEARED BY ARREST - JUVENILE
##  [9] U--UNKNOWN                      B--PROSECUTION DECLINED        
## [11] A--DEATH OF OFFENDER            E--JUVENILE/NO CUSTODY         
## 12 Levels: A--DEATH OF OFFENDER ... Z--EARLY CLOSED
## 
## $opening
## [1] UNKNOWN  DOOR     WINDOW   OTHER    GARAGE   SKYLIGHT
## Levels: DOOR GARAGE OTHER SKYLIGHT UNKNOWN WINDOW
## 
## $day_of_week
## [1] SATURDAY  THURSDAY  TUESDAY   WEDNESDAY SUNDAY    FRIDAY    MONDAY   
## [8] <NA>     
## Levels: FRIDAY MONDAY SATURDAY SUNDAY THURSDAY TUESDAY WEDNESDAY
## 
## $victim_gender
## [1] FEMALE               MALE                 UNKNOWN             
## [4] NON-PERSON (BUSINESS
## Levels: FEMALE MALE NON-PERSON (BUSINESS UNKNOWN
## 
## $TotalVictims
##  [1]   1   2  18   3   4   6 126   5   8  NA   9   7
## 
## $totalsuspects
## [1]  1 NA  2  4  5  3  7  6  8

Question 7

Summary

  1. 20.6% of the values in the variable ‘TotalVictims’ are outliers
  2. 3.13% of the values in the variable ‘total suspects’ are outliers
  3. Each row of the data corresponsds to a particular victim within a caseID. Hence a CaseID with 126 victims would have 126 rows in the dataset. Such data should hence be grouped before using for further analysis. A good way to deal with this would be to group the
    original data by CrimeID and add 3 new columns
    “number_of_male_victims”,“number_of_female_victims” and
    “number_of_victims_with_gender_unknown”. (The data has not been grouped for this assignment.)
# Identifying outliers in numeric variables
summary(crime_dataset$TotalVictims)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   2.092   1.000 126.000      20
b1 <- boxplot(crime_dataset$TotalVictims)

hist(crime_dataset$TotalVictims)

nrow(subset(crime_dataset,crime_dataset$TotalVictims %in% b1$out))/nrow(crime_dataset)
## [1] 0.2056446
summary(crime_dataset$totalsuspects)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.579   2.000   8.000   10039
b2 <- boxplot(crime_dataset$totalsuspects)

hist(crime_dataset$totalsuspects)

nrow(subset(crime_dataset,crime_dataset$totalsuspects %in% b2$out))/nrow(crime_dataset)
## [1] 0.03139035

Question 8

  1. For categorical variables such as ‘opening’,‘victim_gender’ and ‘CrimeStatus’ , missing values have been imputed with an existing category that represents unknown data. E.g. NA in the variable ‘opening’ has been imputed the value ‘UNKNOWN’ which was an already existing level within the variable (in the original dataset). (Code for this point has already been run in question 6)

  2. The variable ‘TotalVictims’ has 20 missing values. The variable’totalsuspects’ has 47.5% missing values.These have not been removed or imputed, as crime data is a very subjective dataset. Each crime is unique and hence it is not a good practice to impute the values of victims or suspects with the median count/average value.

# % of missing values in 'TotalVictims'
sum(is.na(crime_dataset$TotalVictims))/nrow(crime_dataset)
## [1] 0.0009454924
# % of missing values in 'totalsuspects'
sum(is.na(crime_dataset$totalsuspects))/nrow(crime_dataset)
## [1] 0.4745899

Part 3 - Exploratory Data Analysis (EDA)

Questions 9-10

For Questions 9 and 10, we decided on a few visualizations of sinigicance: The Frequncey of the Number of Suspects, the Frequency of the Number of Victims, Number of Victims for each Gender, The Number of Crimes for each Day of the Week,and the Number of Crimes per Opening Type.

Frequency of the Number of Suspects

hist(crime_dataset$totalsuspects, probability = T, main="Total Suspect Histogram", xlab="Total Suspects")

Frequency of Number of Victims

hist(crime_dataset$TotalVictims, probability= T, main= "Total Victims Histogram", xlab= "Total Victims")

Number of Victims per Gender

plot(crime_dataset$victim_gender, xlab = "Gender of Victims" , ylab = "Number of Victims", main = "Number of Victims per Gender", col = "red")

Number of Crimes by Day of Week

daytable=table(crime_dataset$day_of_week)
daytable
## 
##    FRIDAY    MONDAY  SATURDAY    SUNDAY  THURSDAY   TUESDAY WEDNESDAY 
##      3062      3020      2925      2883      2925      3048      2927
daytable=daytable[c(4,2,6,7,5,1,3)]

barplot(daytable,
        xlab = "Days of Weeks" , ylab = "Number of Crimes", main = "Number of Crimes each day", col = "gold",
        border="red",
        density=50,
)

Number of Crimes per Opening Type

barplot(table(crime_dataset$opening),
        xlab = "Opening" , ylab = "Number of Crime", main = "Number of Crimes per Opening Type", col = "gold",
        border="red",
        density=100,
)