Well known for developing the boxplot and Tukey's test, this American mathematician, a chemist by training with a PhD in mathematics, founded the statistics department at Princeton in 1965.
He introduced EDA, or Exploratory Data Analysis, which relies on graphics to understand how the data behave and so reach a better first approximation.
Its starting point is that the analysis should be primarily graphical, combining descriptive statistics with the process that generated the data. In consequence, EDA is an approach that supplies the techniques for handling the data.
The advantages of this approach are:
Baltimore is the 17th most dangerous city in the US. Although the crime rate has fallen over the last two decades, the violent crime rate in 2017 was 1,417 per 100,000 inhabitants. The goal of this exploratory study is to analyze crime with the tools of EDA, adding the value that data science brings in terms of data visualization.
Packages
library("data.table")
library("ggplot2")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggmap")
library("maps")
library("mapdata")
library("sqldf")
## Warning: package 'sqldf' was built under R version 3.4.1
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
## Reason: image not found
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.4.1
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
library("gganimate")
library("animation")
Import the data
Baltimore_crimes<-read.csv("/Users/danieljimenez/Desktop/Rmemories/BPD_Part_1_Victim_Based_Crime_Data.csv")
Structure of the dataset
glimpse(Baltimore_crimes)
## Observations: 276,529
## Variables: 15
## $ CrimeDate <fctr> 09/02/2017, 09/02/2017, 09/02/2017, 09/02/201...
## $ CrimeTime <fctr> 23:30:00, 23:00:00, 22:53:00, 22:50:00, 22:31...
## $ CrimeCode <fctr> 3JK, 7A, 9S, 4C, 4E, 5A, 1F, 3B, 4C, 4E, 4C, ...
## $ Location <fctr> 4200 AUDREY AVE, 800 NEWINGTON AVE, 600 RADNO...
## $ Description <fctr> ROBBERY - RESIDENCE, AUTO THEFT, SHOOTING, AG...
## $ Inside.Outside <fctr> I, O, Outside, I, O, I, Outside, O, O, I, O, ...
## $ Weapon <fctr> KNIFE, , FIREARM, OTHER, HANDS, , FIREARM, , ...
## $ Post <int> 913, 133, 524, 934, 113, 922, 232, 123, 641, 3...
## $ District <fctr> SOUTHERN, CENTRAL, NORTHERN, SOUTHERN, CENTRA...
## $ Neighborhood <fctr> Brooklyn, Reservoir Hill, Winston-Govans, Car...
## $ Longitude <dbl> -76.60541, -76.63217, -76.60697, -76.64526, -7...
## $ Latitude <dbl> 39.22951, 39.31360, 39.34768, 39.28315, 39.287...
## $ Location.1 <fctr> (39.2295100000, -76.6054100000), (39.31360000...
## $ Premise <fctr> ROW/TOWNHO, STREET, Street, ROW/TOWNHO, STREE...
## $ Total.Incidents <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
names(Baltimore_crimes)
## [1] "CrimeDate" "CrimeTime" "CrimeCode"
## [4] "Location" "Description" "Inside.Outside"
## [7] "Weapon" "Post" "District"
## [10] "Neighborhood" "Longitude" "Latitude"
## [13] "Location.1" "Premise" "Total.Incidents"
The dataset contains important descriptive variables, such as the date of the crime, its location, a description of the offense, the weapon used, the neighborhood, and others.
head(Baltimore_crimes,20)
## CrimeDate CrimeTime CrimeCode Location
## 1 09/02/2017 23:30:00 3JK 4200 AUDREY AVE
## 2 09/02/2017 23:00:00 7A 800 NEWINGTON AVE
## 3 09/02/2017 22:53:00 9S 600 RADNOR AV
## 4 09/02/2017 22:50:00 4C 1800 RAMSAY ST
## 5 09/02/2017 22:31:00 4E 100 LIGHT ST
## 6 09/02/2017 22:00:00 5A CHERRYCREST RD
## 7 09/02/2017 21:15:00 1F 3400 HARMONY CT
## 8 09/02/2017 21:35:00 3B 400 W LANVALE ST
## 9 09/02/2017 21:00:00 4C 2300 LYNDHURST AVE
## 10 09/02/2017 21:00:00 4E 1200 N ELLWOOD AVE
## 11 09/02/2017 21:00:00 4C 2300 LYNDHURST AVE
## 12 09/02/2017 20:56:00 3CF 3600 EDMONDSON AVE
## 13 09/02/2017 20:55:00 6C 5100 PARK HEIGHTS AVE
## 14 09/02/2017 20:10:00 4C 3900 GWYNNS FALLS PKWY
## 15 09/02/2017 20:00:00 6D 5500 SUMMERFIELD AVE
## 16 09/02/2017 19:52:00 5D 2200 VAN DEMAN ST
## 17 09/02/2017 18:08:00 9S 1200 E LAFAYETTE AV
## 18 09/02/2017 18:08:00 1F 1200 E LAFAYETTE AV
## 19 09/02/2017 18:16:00 4E 1000 N EUTAW ST
## 20 09/02/2017 18:00:00 6G 100 S BROADWAY
## Description Inside.Outside Weapon Post District
## 1 ROBBERY - RESIDENCE I KNIFE 913 SOUTHERN
## 2 AUTO THEFT O 133 CENTRAL
## 3 SHOOTING Outside FIREARM 524 NORTHERN
## 4 AGG. ASSAULT I OTHER 934 SOUTHERN
## 5 COMMON ASSAULT O HANDS 113 CENTRAL
## 6 BURGLARY I 922 SOUTHERN
## 7 HOMICIDE Outside FIREARM 232 SOUTHEASTERN
## 8 ROBBERY - STREET O 123 CENTRAL
## 9 AGG. ASSAULT O OTHER 641 NORTHWESTERN
## 10 COMMON ASSAULT I HANDS 332 EASTERN
## 11 AGG. ASSAULT O OTHER 641 NORTHWESTERN
## 12 ROBBERY - COMMERCIAL I FIREARM 844 SOUTHWESTERN
## 13 LARCENY 614 NORTHWESTERN
## 14 AGG. ASSAULT O OTHER 641 NORTHWESTERN
## 15 LARCENY FROM AUTO O 444 NORTHEASTERN
## 16 BURGLARY I 243 SOUTHEASTERN
## 17 SHOOTING Outside FIREARM 343 EASTERN
## 18 HOMICIDE Outside FIREARM 343 EASTERN
## 19 COMMON ASSAULT O HANDS 132 CENTRAL
## 20 LARCENY I 212 SOUTHEASTERN
## Neighborhood Longitude Latitude
## 1 Brooklyn -76.60541 39.22951
## 2 Reservoir Hill -76.63217 39.31360
## 3 Winston-Govans -76.60697 39.34768
## 4 Carrollton Ridge -76.64526 39.28315
## 5 Downtown West -76.61365 39.28756
## 6 Cherry Hill -76.62131 39.24867
## 7 Canton -76.56827 39.28202
## 8 Upton -76.62789 39.30254
## 9 Windsor Hills -76.68365 39.31370
## 10 Berea -76.57419 39.30551
## 11 Windsor Hills -76.68365 39.31370
## 12 Edgewood -76.67759 39.29402
## 13 Central Park Heights -76.67511 39.34861
## 14 Windsor Hills -76.68169 39.31400
## 15 Frankford -76.54270 39.33288
## 16 Holabird Industrial Park -76.53557 39.26533
## 17 Oliver -76.60246 39.31038
## 18 Oliver -76.60246 39.31038
## 19 Madison Park -76.62256 39.30083
## 20 Washington Hill -76.59390 39.29020
## Location.1 Premise Total.Incidents
## 1 (39.2295100000, -76.6054100000) ROW/TOWNHO 1
## 2 (39.3136000000, -76.6321700000) STREET 1
## 3 (39.3476800000, -76.6069700000) Street 1
## 4 (39.2831500000, -76.6452600000) ROW/TOWNHO 1
## 5 (39.2875600000, -76.6136500000) STREET 1
## 6 (39.2486700000, -76.6213100000) ROW/TOWNHO 1
## 7 (39.2820200000, -76.5682700000) Street 1
## 8 (39.3025400000, -76.6278900000) STREET 1
## 9 (39.3137000000, -76.6836500000) STREET 1
## 10 (39.3055100000, -76.5741900000) ROW/TOWNHO 1
## 11 (39.3137000000, -76.6836500000) STREET 1
## 12 (39.2940200000, -76.6775900000) RETAIL/SMA 1
## 13 (39.3486100000, -76.6751100000) 1
## 14 (39.3140000000, -76.6816900000) STREET 1
## 15 (39.3328800000, -76.5427000000) YARD 1
## 16 (39.2653300000, -76.5355700000) OTHER - IN 1
## 17 (39.3103800000, -76.6024600000) Street 1
## 18 (39.3103800000, -76.6024600000) Street 1
## 19 (39.3008300000, -76.6225600000) STREET 1
## 20 (39.2902000000, -76.5939000000) CONVENIENC 1
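Note that there are missing values, so the dataset is cleaned, which reduces it from 276,529 observations to 274,318. na.omit() drops only rows that contain NA; empty text fields (for example in Weapon) are not NA and remain, as the second glimpse() output shows.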
Baltimore_crimes<-na.omit(Baltimore_crimes)
glimpse(Baltimore_crimes)
## Observations: 274,318
## Variables: 15
## $ CrimeDate <fctr> 09/02/2017, 09/02/2017, 09/02/2017, 09/02/201...
## $ CrimeTime <fctr> 23:30:00, 23:00:00, 22:53:00, 22:50:00, 22:31...
## $ CrimeCode <fctr> 3JK, 7A, 9S, 4C, 4E, 5A, 1F, 3B, 4C, 4E, 4C, ...
## $ Location <fctr> 4200 AUDREY AVE, 800 NEWINGTON AVE, 600 RADNO...
## $ Description <fctr> ROBBERY - RESIDENCE, AUTO THEFT, SHOOTING, AG...
## $ Inside.Outside <fctr> I, O, Outside, I, O, I, Outside, O, O, I, O, ...
## $ Weapon <fctr> KNIFE, , FIREARM, OTHER, HANDS, , FIREARM, , ...
## $ Post <int> 913, 133, 524, 934, 113, 922, 232, 123, 641, 3...
## $ District <fctr> SOUTHERN, CENTRAL, NORTHERN, SOUTHERN, CENTRA...
## $ Neighborhood <fctr> Brooklyn, Reservoir Hill, Winston-Govans, Car...
## $ Longitude <dbl> -76.60541, -76.63217, -76.60697, -76.64526, -7...
## $ Latitude <dbl> 39.22951, 39.31360, 39.34768, 39.28315, 39.287...
## $ Location.1 <fctr> (39.2295100000, -76.6054100000), (39.31360000...
## $ Premise <fctr> ROW/TOWNHO, STREET, Street, ROW/TOWNHO, STREE...
## $ Total.Incidents <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
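As a hedged illustration of what `na.omit()` does and does not remove, the sketch below uses a small made-up data frame (the values are invented, not taken from the BPD file). It assumes, as the `glimpse()` output suggests, that empty text fields were read as empty strings rather than as `NA`.

```r
# Toy data frame (invented values) mimicking two columns of the dataset
toy <- data.frame(
  Weapon    = c("KNIFE", "", "FIREARM", ""),
  Longitude = c(-76.61, -76.63, NA, -76.65),
  stringsAsFactors = TRUE
)

# Count NA values and empty strings per column
na_per_column    <- colSums(is.na(toy))
empty_per_column <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))

# na.omit() drops only the row with an NA; rows holding "" survive
cleaned <- na.omit(toy)

print(na_per_column)
print(empty_per_column)
print(nrow(cleaned))
```

Running the two counts before cleaning shows which columns are responsible for the dropped rows and which ones still need a separate treatment of empty strings.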
Transforming the dataset
Baltimore_crimes$CrimeDate <- as.Date(Baltimore_crimes$CrimeDate, format = "%m/%d/%Y") # Convert the text dates to Date objects
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2018c.1.0/zoneinfo/America/Bogota'
Baltimore_crimes$CrimeTimehour <- as.integer(substr(as.character(Baltimore_crimes$CrimeTime), 1, 2)) # Hour of the crime as an integer
Baltimore_crimes$Year <- as.numeric(format(Baltimore_crimes$CrimeDate, "%Y"))
Baltimore_crimes$Month <- as.numeric(format(Baltimore_crimes$CrimeDate, "%m"))
Baltimore_crimes$Day <- as.numeric(format(Baltimore_crimes$CrimeDate, "%d"))
# Label hour 0 as '(0,6]'; note that the cut() call further down rebuilds CrimeTimeLine from scratch and replaces this value
Baltimore_crimes$CrimeTimeLine[Baltimore_crimes$CrimeTimehour == 0] <- '(0,6]'
Baltimore_crimes$weekday <- wday(Baltimore_crimes$CrimeDate, label = TRUE) # Day of the week as a labelled factor
Baltimore_crimes$hour <- as.numeric(hour(hms(as.character(Baltimore_crimes$CrimeTime)))) # Hour parsed with lubridate
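The date handling above can be tried on a couple of values without loading the full file. This is a minimal sketch in base R only; the two dates and times are invented examples in the same text format as `CrimeDate` and `CrimeTime`, and `weekdays()` is used here as a base-R stand-in for `wday(label = TRUE)`.

```r
# Invented example values in the same text format as CrimeDate and CrimeTime
crime_date <- c("09/02/2017", "12/31/2016")
crime_time <- c("23:30:00", "00:15:00")

# Parse month/day/year text into Date objects
parsed <- as.Date(crime_date, format = "%m/%d/%Y")

# Extract the numeric components of each date
year_num  <- as.numeric(format(parsed, "%Y"))
month_num <- as.numeric(format(parsed, "%m"))
day_num   <- as.numeric(format(parsed, "%d"))

# The first two characters of the time string give the hour
hour_num <- as.integer(substr(crime_time, 1, 2))

# Weekday name; the text returned depends on the session locale
weekday_name <- weekdays(parsed)

print(data.frame(parsed, year_num, month_num, day_num, hour_num, weekday_name))
```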
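Adjusting the dataset
# Intended to set the weapon to 'OTHER' for rapes with no weapon recorded; note that CrimeGroup is only created further down, so, run in this order, the condition matches no rows
Baltimore_crimes$Weapon[Baltimore_crimes$Weapon == '' & Baltimore_crimes$CrimeGroup == 'RAPE'] <- 'OTHER'
Baltimore_crimes$Weapon[Baltimore_crimes$Weapon == ""] <- NA # Treat empty weapon fields as missing
# Group crimes by type, based on the text of Description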
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"ROBBERY")] <- "ROBBERY"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"ASSAULT")] <- "ASSAULT"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"LARCENY")] <- "LARCENY"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"ARSON")] <- "ARSON"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"RAPE")] <- "RAPE"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"SHOOTING")] <- "SHOOTING"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"AUTO THEFT")] <- "AUTO THEFT"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"HOMICIDE")] <- "HOMICIDE"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"BURGLARY")] <- "BURGLARY"
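The nine assignments above follow one pattern, so they can also be written as a loop. The sketch below is one possible compact form, not the post's own code: it uses base-R `grepl()` instead of `data.table::like()`, and the descriptions are invented values in the style of the `Description` column.

```r
# Invented descriptions in the style of the Description column
description <- c("ROBBERY - STREET", "AGG. ASSAULT", "LARCENY FROM AUTO",
                 "AUTO THEFT", "COMMON ASSAULT", "HOMICIDE")

# Keywords in the same order as the assignments above;
# when several keywords match, the later one overwrites the earlier one
keywords <- c("ROBBERY", "ASSAULT", "LARCENY", "ARSON", "RAPE",
              "SHOOTING", "AUTO THEFT", "HOMICIDE", "BURGLARY")

crime_group <- rep(NA_character_, length(description))
for (key in keywords) {
  # fixed = TRUE treats the keyword as literal text, not as a regular expression
  crime_group[grepl(key, description, fixed = TRUE)] <- key
}

print(crime_group)
```

Descriptions that match no keyword stay `NA`, which makes unclassified offenses easy to spot.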
Crime scores
These scores were established by the Connecticut assembly and range from 1 to 10.
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="ROBBERY"] <-7
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="AUTO THEFT"] <-6
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="SHOOTING"] <-6
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="ASSAULT"] <-8
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="BURGLARY"] <-6
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="HOMICIDE"] <-10
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="LARCENY"] <-4
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="ARSON"] <-5
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="RAPE"] <-9
Baltimore_crimes$CrimeScore <- as.factor(Baltimore_crimes$CrimeScore)
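The same mapping from crime group to score can be expressed as a named lookup vector. This is a sketch of an alternative, with the scores copied from the assignments above and an invented sample of crime groups.

```r
# Scores copied from the assignments above, stored as a named lookup vector
score_lookup <- c(ROBBERY = 7, "AUTO THEFT" = 6, SHOOTING = 6, ASSAULT = 8,
                  BURGLARY = 6, HOMICIDE = 10, LARCENY = 4, ARSON = 5, RAPE = 9)

# Invented sample of crime groups
crime_group <- c("LARCENY", "HOMICIDE", "AUTO THEFT", "ASSAULT")

# Index the lookup by name; unname() drops the names from the result
crime_score <- unname(score_lookup[crime_group])

print(crime_score)
```

Keeping the scores in one vector makes it easier to review or change them in a single place.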
Time window in which the crime takes place
Baltimore_crimes$CrimeTimeLine <- cut(Baltimore_crimes$CrimeTimehour, breaks=c(0,6,12,18,24), right=TRUE)
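With `right=TRUE` the intervals are open on the left, so an hour of exactly 0 falls outside `(0,6]` and becomes `NA`. The sketch below, on invented hours, compares that binning with a left-closed alternative; the alternative is a suggestion, not what produced the results in this post.

```r
# Invented hours covering the edge cases 0, 6 and 23
hours <- c(0, 5, 6, 11, 12, 18, 23)

# Binning as in the line above: intervals open on the left, so 0 maps to NA
bins_right <- cut(hours, breaks = c(0, 6, 12, 18, 24), right = TRUE)

# Alternative: left-closed intervals [0,6), [6,12), [12,18), [18,24]
bins_left <- cut(hours, breaks = c(0, 6, 12, 18, 24),
                 right = FALSE, include.lowest = TRUE)

print(as.character(bins_right))
print(as.character(bins_left))
```

With the left-closed version midnight lands in the first window, and `include.lowest = TRUE` keeps a value of 24 inside the last one.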
Some descriptive statistics
summary(Baltimore_crimes)
## CrimeDate CrimeTime CrimeCode
## Min. :2012-01-01 18:00:00: 6700 4E : 45159
## 1st Qu.:2013-06-05 17:00:00: 6446 6D : 35850
## Median :2014-11-07 16:00:00: 6012 5A : 25601
## Mean :2014-11-09 12:00:00: 5852 7A : 24940
## 3rd Qu.:2016-04-30 20:00:00: 5787 6G : 15821
## Max. :2017-09-02 21:00:00: 5701 6C : 12966
## (Other) :237820 (Other):113981
## Location Description Inside.Outside
## 200 E PRATT ST : 654 LARCENY :60092 : 10205
## 300 LIGHT ST : 567 COMMON ASSAULT :45159 I :131295
## 1500 RUSSELL ST: 556 BURGLARY :42359 Inside : 632
## 3500 BOSTON ST : 405 LARCENY FROM AUTO:35850 O :128350
## 1200 W PRATT ST: 371 AGG. ASSAULT :27315 Outside: 3836
## 0 LIGHT ST : 363 AUTO THEFT :26532
## (Other) :271402 (Other) :37011
## Weapon Post District
## : 0 Min. :111.0 NORTHEASTERN:42707
## FIREARM: 22228 1st Qu.:243.0 SOUTHEASTERN:38023
## HANDS : 48602 Median :511.0 SOUTHERN :31616
## KNIFE : 9564 Mean :506.2 CENTRAL :31397
## OTHER : 14490 3rd Qu.:731.0 NORTHERN :31273
## NA's :179434 Max. :943.0 NORTHWESTERN:27895
## (Other) :71407
## Neighborhood Longitude Latitude
## Downtown : 9048 Min. :-76.71 Min. :39.20
## Frankford : 6642 1st Qu.:-76.65 1st Qu.:39.29
## Belair-Edison : 5977 Median :-76.61 Median :39.30
## Brooklyn : 4516 Mean :-76.62 Mean :39.31
## Cherry Hill : 4086 3rd Qu.:-76.59 3rd Qu.:39.33
## Sandtown-Winchester: 4026 Max. :-76.53 Max. :39.37
## (Other) :240023
## Location.1 Premise
## (39.3180000000, -76.6582100000): 535 STREET :99800
## (39.2854400000, -76.6134000000): 481 ROW/TOWNHO:60240
## (39.2876100000, -76.5398200000): 441 APT/CONDO :11908
## (39.2740800000, -76.6276900000): 417 PARKING LO:11873
## (39.3186800000, -76.6539900000): 408 OTHER - IN:11377
## (39.2865900000, -76.6120400000): 390 :10683
## (Other) :271646 (Other) :68437
## Total.Incidents CrimeTimehour Year Month
## Min. :1 Min. : 0.00 Min. :2012 Min. : 1.000
## 1st Qu.:1 1st Qu.: 9.00 1st Qu.:2013 1st Qu.: 4.000
## Median :1 Median :15.00 Median :2014 Median : 6.000
## Mean :1 Mean :13.29 Mean :2014 Mean : 6.448
## 3rd Qu.:1 3rd Qu.:19.00 3rd Qu.:2016 3rd Qu.: 9.000
## Max. :1 Max. :24.00 Max. :2017 Max. :12.000
##
## Day CrimeTimeLine weekday hour
## Min. : 1.00 (0,6] :37539 Sun :36902 Min. : 0.00
## 1st Qu.: 8.00 (6,12] :62001 Mon :39929 1st Qu.: 9.00
## Median :16.00 (12,18]:88466 Tues :39421 Median :15.00
## Mean :15.81 (18,24]:73650 Wed :39461 Mean :13.29
## 3rd Qu.:23.00 NA's :12662 Thurs:39088 3rd Qu.:19.00
## Max. :31.00 Fri :41242 Max. :24.00
## Sat :38275
## CrimeGroup CrimeScore
## Length:274318 4 :95942
## Class :character 5 : 1444
## Mode :character 6 :71800
## 7 :25981
## 8 :75961
## 9 : 1631
## 10: 1559
ggplot(data = Baltimore_crimes, aes(Weapon)) +
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'red', color = 'black') +
  labs(x = "Weapon used", y = "Proportion of incidents")
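The bar heights in this plot are counts divided by the total count. As a rough cross-check, the same shares can be computed as a table in base R. The weapon values below are invented, with `NA` standing for incidents where no weapon was recorded.

```r
# Invented weapon values; NA stands for incidents with no weapon recorded
weapon <- c("HANDS", "FIREARM", NA, "HANDS", "KNIFE", NA, "HANDS", "OTHER")

# Counts including the NA category, then shares that sum to 1
weapon_counts <- table(weapon, useNA = "ifany")
weapon_share  <- prop.table(weapon_counts)

print(round(weapon_share, 3))
```

Including the `NA` category matters here, because leaving it out would inflate the share of every recorded weapon.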
ggplot(subset(Baltimore_crimes, !is.na(District))) +
  aes(x = District) +
  geom_bar(stat = "count", fill = 'skyblue') +
  geom_text(stat = "count", aes(label = ..count..), vjust = -1) +
  labs(title = "Number of incidents by district", x = "District", y = "Incidents") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(limits = c(0, 45000))
ggplot(subset(Baltimore_crimes, !is.na(District))) +
  aes(x = Year, color = District) +
  geom_line(stat = "count") +
  scale_x_continuous(breaks = seq(2012, 2017, 1)) +
  scale_y_continuous(breaks = seq(5000, 50000, 5000)) +
  labs(title = "Incidents per year by district", x = "Year", y = "Incidents")
ggplot(Baltimore_crimes) +
  aes(x = weekday) +
  geom_bar(color = 'black', fill = "skyblue") +
  scale_y_continuous(breaks = seq(5000, 45000, 5000), limits = c(0, 45000)) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -1) +
  labs(title = "Incidents by day of the week", x = "Day of the week", y = "Incidents")
This plot shows that Friday is the day with the most recorded incidents in Baltimore (41,242) over the period covered by the data, January 2012 to September 2017. But which is the most dangerous hour, and where do most crimes take place?
## Warning: Removed 1 rows containing non-finite values (stat_count).
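One simple way to approach the question of the busiest hour or day is a frequency table followed by `which.max()`. The sketch below uses invented hours and weekdays, so its result says nothing about Baltimore; it only shows the mechanics.

```r
# Invented incident hours and weekdays
incident_hour    <- c(18, 17, 18, 12, 18, 17, 0, 21)
incident_weekday <- c("Fri", "Fri", "Mon", "Sat", "Fri", "Sun", "Mon", "Fri")

# Frequency tables
by_hour    <- table(incident_hour)
by_weekday <- table(incident_weekday)

# Name of the most frequent category in each table
busiest_hour    <- as.integer(names(which.max(by_hour)))
busiest_weekday <- names(which.max(by_weekday))

print(busiest_hour)
print(busiest_weekday)
```

In the `summary()` output above, 18:00:00 is the most frequent exact value of `CrimeTime`; tabulating `CrimeTimehour` in the same way would give the ranking by hour.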
# CrimeScore is a factor: convert it through character so the mean uses the scores, not the internal level codes
Baltimore_crimes.District.CrimeScore <- Baltimore_crimes[!is.na(Baltimore_crimes$District),] %>%
  group_by(District, Year) %>%
  summarise(mean_CrimeScore = mean(as.integer(as.character(CrimeScore))))
ggplot(data = Baltimore_crimes.District.CrimeScore, aes(x = Year, y = mean_CrimeScore, color = District)) +
  geom_line() +
  labs(title = "Mean crime score by district\n", x = "Year", y = "Mean score", color = "District\n")
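Because `CrimeScore` was converted to a factor earlier, averaging it requires care: `as.integer()` applied to a factor returns the internal level codes, not the numbers printed as labels. The sketch below shows the difference on invented scores and district labels, using base-R `tapply()` as a stand-in for the `group_by()` and `summarise()` pipeline.

```r
# Invented scores stored as a factor, as CrimeScore is after as.factor()
score_factor <- factor(c(4, 10, 6, 8, 4))

# as.integer() on a factor returns the internal level codes (1, 2, ...)
level_codes <- as.integer(score_factor)

# Converting through character recovers the original numbers
true_scores <- as.integer(as.character(score_factor))

# Mean score per group with base R; the district labels are invented
district <- c("CENTRAL", "CENTRAL", "SOUTHERN", "SOUTHERN", "SOUTHERN")
mean_by_district <- tapply(true_scores, district, mean)

print(level_codes)
print(true_scores)
print(mean_by_district)
```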
ggplot(data = Baltimore_crimes, aes(x = Year)) +
  geom_bar(colour = "black", fill = 'red') +
  labs(title = "Number of crimes per year", x = "Year", y = "Incidents") +
  geom_text(stat = "count", aes(label = ..count..), vjust = 8) +
  ylim(0, 55000)
The first, and most important, conclusions are:
Data Scientist and Quantitative Developer, MSc in Statistics (c)