2.2 Primary Data Preparation
To reduce the number of variables in the dataset, variables with a high proportion of NAs (greater than 50%) were removed, as they were deemed of little practical use for prediction or data exploration. Most of these variables came from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey. Removing them reduced the number of variables to 125. We also observed that the variables householdsize and PersPerOccupHous represent the same information: the mean number of people per household. PersPerOccupHous was therefore removed from the dataset, which reduced the number of variables to 124.
#Let's create a table that flags NAs (TRUE = missing)
na_table <- as.data.frame(is.na(crime))
#Proportion of NAs per variable; values above 0.5 mark variables to remove
colMeans(na_table)
#Remove variables with more than 50% NAs:
#LemasSwornFT
#LemasSwFTPerPop
#LemasSwFTFieldOps
#LemasSwFTFieldPerPop
#LemasTotalReq
#LemasTotReqPerPop
#PolicReqPerOffic
#PolicPerPop
#RacialMatchCommPol
#PctPolicWhite
#PctPolicBlack
#PctPolicHisp
#PctPolicAsian
#PctPolicMinor
#OfficAssgnDrugUnits
#NumKindsDrugsSeiz
#PolicAveOTWorked
#PolicCars
#PolicOperBudg
#LemasPctPolicOnPatr
#LemasGangUnitDeploy
#PolicBudgPerPop
#Dropping these variables
drop.cols <- c('LemasSwornFT','LemasSwFTPerPop','LemasSwFTFieldOps','LemasSwFTFieldPerPop',
'LemasTotalReq','LemasTotReqPerPop','PolicReqPerOffic','PolicPerPop','RacialMatchCommPol',
'PctPolicWhite','PctPolicBlack','PctPolicHisp','PctPolicAsian','PctPolicMinor',
'OfficAssgnDrugUnits','NumKindsDrugsSeiz','PolicAveOTWorked','PolicCars',
'PolicOperBudg','LemasPctPolicOnPatr','LemasGangUnitDeploy','PolicBudgPerPop')
#all_of() makes explicit that drop.cols is an external character vector of column names
crime <- crime %>% select(-all_of(drop.cols))
#Dataset now has 125 variables - 22 variables were removed
#householdsize and PersPerOccupHous represent the same info - remove one
crime <- crime %>% select(-PersPerOccupHous)
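
The two steps above could also be derived from the data rather than typed by hand. The following is a minimal sketch in base R, not the code used for this analysis. The data frame toy and its values are hypothetical stand-ins for crime, and the names na_share, high_na, toy_reduced and r are introduced only for illustration. It computes the share of NAs per column, drops the columns above the 50% threshold, and then checks how strongly householdsize and PersPerOccupHous are correlated before one of them is removed.

```r
# Hypothetical toy data standing in for the crime data frame (values are made up)
toy <- data.frame(
  householdsize    = c(2.1, 2.8, 3.0, 2.5, 2.6, 3.3),
  PersPerOccupHous = c(2.2, 2.9, 3.1, 2.5, 2.7, 3.4),
  PolicCars        = c(NA, NA, 12, NA, NA, 30),
  PctPolicWhite    = c(NA, 80, NA, NA, 75, NA),
  population       = c(1200, 5400, NA, 8100, 2300, 9900)
)

# Proportion of missing values in each column
na_share <- colMeans(is.na(toy))

# Names of the columns with more than 50% missing values
high_na <- names(na_share)[na_share > 0.5]

# Drop those columns; drop = FALSE keeps the result a data frame
toy_reduced <- toy[, !(names(toy) %in% high_na), drop = FALSE]

# Pearson correlation between the two household-size variables,
# using only rows where both values are present
r <- cor(toy_reduced$householdsize, toy_reduced$PersPerOccupHous,
         use = "complete.obs")

print(high_na)
print(r)
```

Applied to the real data, comparing high_na with drop.cols would be one way to confirm that the hand-typed list matches the 50% rule. A correlation close to 1 would support treating the two household-size variables as redundant, although it does not by itself show that they measure the same quantity.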