Who is John Tukey?

This American mathematician is best known for developing the boxplot and for Tukey's test. A chemist by training with a PhD in mathematics, he founded the Department of Statistics at Princeton in 1965.

He introduced EDA, or Exploratory Data Analysis, which relies on graphics to understand how the data behave and so reach a better first approximation.

Philosophy of exploratory data analysis

The starting point is that the analysis should be primarily graphical, combining descriptive statistics with an understanding of the process that produced the data. Consequently, EDA is an approach that supplies the techniques for working with the data.

The advantages of this approach are:

  1. A broader knowledge of the data;
  2. Uncovering the structures and substructures present in them;
  3. Identifying which variables are important; and
  4. Testing assumptions.
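
As a small illustration of pairing a graphical view with descriptive statistics, the sketch below computes Tukey's five-number summary and the statistics behind a boxplot. The vector x is made up for the example, with one value chosen so that it is flagged as an outlier.

```r
# Hypothetical sample; 40 is deliberately far from the rest
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 40)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

# Statistics behind a boxplot: the whiskers stop at the most extreme points
# within 1.5 times the hinge spread; points beyond that are returned in $out
bp <- boxplot.stats(x)

print(five)
print(bp$stats)
print(bp$out)

# boxplot(x) would draw the same summary graphically
```

Comparing the two results shows how the boxplot separates the bulk of the data from the points it treats as unusual.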

About the dataset to explore

Baltimore is the 17th most dangerous city in the US. Although its crime rate has fallen over the last two decades, the violent crime rate in 2017 was 1,417 per 100,000 inhabitants. The goal of this exploratory study is to examine and analyze crime with the tools of EDA, adding the value that data science brings in terms of data visualization.

Importing the dataset and first look

Packages

library("data.table")
library("ggplot2")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.4.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggmap")
library("maps")
library("mapdata")
library("sqldf")
## Warning: package 'sqldf' was built under R version 3.4.1
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
##   Reason: image not found
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.4.1
library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
library('gganimate')
library('animation')

Importing the data

Baltimore_crimes<-read.csv("/Users/danieljimenez/Desktop/Rmemories/BPD_Part_1_Victim_Based_Crime_Data.csv")

Structure of the dataset

glimpse(Baltimore_crimes)
## Observations: 276,529
## Variables: 15
## $ CrimeDate       <fctr> 09/02/2017, 09/02/2017, 09/02/2017, 09/02/201...
## $ CrimeTime       <fctr> 23:30:00, 23:00:00, 22:53:00, 22:50:00, 22:31...
## $ CrimeCode       <fctr> 3JK, 7A, 9S, 4C, 4E, 5A, 1F, 3B, 4C, 4E, 4C, ...
## $ Location        <fctr> 4200 AUDREY AVE, 800 NEWINGTON AVE, 600 RADNO...
## $ Description     <fctr> ROBBERY - RESIDENCE, AUTO THEFT, SHOOTING, AG...
## $ Inside.Outside  <fctr> I, O, Outside, I, O, I, Outside, O, O, I, O, ...
## $ Weapon          <fctr> KNIFE, , FIREARM, OTHER, HANDS, , FIREARM, , ...
## $ Post            <int> 913, 133, 524, 934, 113, 922, 232, 123, 641, 3...
## $ District        <fctr> SOUTHERN, CENTRAL, NORTHERN, SOUTHERN, CENTRA...
## $ Neighborhood    <fctr> Brooklyn, Reservoir Hill, Winston-Govans, Car...
## $ Longitude       <dbl> -76.60541, -76.63217, -76.60697, -76.64526, -7...
## $ Latitude        <dbl> 39.22951, 39.31360, 39.34768, 39.28315, 39.287...
## $ Location.1      <fctr> (39.2295100000, -76.6054100000), (39.31360000...
## $ Premise         <fctr> ROW/TOWNHO, STREET, Street, ROW/TOWNHO, STREE...
## $ Total.Incidents <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
names(Baltimore_crimes)
##  [1] "CrimeDate"       "CrimeTime"       "CrimeCode"      
##  [4] "Location"        "Description"     "Inside.Outside" 
##  [7] "Weapon"          "Post"            "District"       
## [10] "Neighborhood"    "Longitude"       "Latitude"       
## [13] "Location.1"      "Premise"         "Total.Incidents"

The dataset contains important descriptive variables, such as the date of the crime, its location, the description of the offense, the weapon used, the neighborhood, and others.

head(Baltimore_crimes,20)
##     CrimeDate CrimeTime CrimeCode               Location
## 1  09/02/2017  23:30:00       3JK        4200 AUDREY AVE
## 2  09/02/2017  23:00:00        7A      800 NEWINGTON AVE
## 3  09/02/2017  22:53:00        9S          600 RADNOR AV
## 4  09/02/2017  22:50:00        4C         1800 RAMSAY ST
## 5  09/02/2017  22:31:00        4E           100 LIGHT ST
## 6  09/02/2017  22:00:00        5A         CHERRYCREST RD
## 7  09/02/2017  21:15:00        1F        3400 HARMONY CT
## 8  09/02/2017  21:35:00        3B       400 W LANVALE ST
## 9  09/02/2017  21:00:00        4C     2300 LYNDHURST AVE
## 10 09/02/2017  21:00:00        4E     1200 N ELLWOOD AVE
## 11 09/02/2017  21:00:00        4C     2300 LYNDHURST AVE
## 12 09/02/2017  20:56:00       3CF     3600 EDMONDSON AVE
## 13 09/02/2017  20:55:00        6C  5100 PARK HEIGHTS AVE
## 14 09/02/2017  20:10:00        4C 3900 GWYNNS FALLS PKWY
## 15 09/02/2017  20:00:00        6D   5500 SUMMERFIELD AVE
## 16 09/02/2017  19:52:00        5D      2200 VAN DEMAN ST
## 17 09/02/2017  18:08:00        9S    1200 E LAFAYETTE AV
## 18 09/02/2017  18:08:00        1F    1200 E LAFAYETTE AV
## 19 09/02/2017  18:16:00        4E        1000 N EUTAW ST
## 20 09/02/2017  18:00:00        6G         100 S BROADWAY
##             Description Inside.Outside  Weapon Post     District
## 1   ROBBERY - RESIDENCE              I   KNIFE  913     SOUTHERN
## 2            AUTO THEFT              O          133      CENTRAL
## 3              SHOOTING        Outside FIREARM  524     NORTHERN
## 4          AGG. ASSAULT              I   OTHER  934     SOUTHERN
## 5        COMMON ASSAULT              O   HANDS  113      CENTRAL
## 6              BURGLARY              I          922     SOUTHERN
## 7              HOMICIDE        Outside FIREARM  232 SOUTHEASTERN
## 8      ROBBERY - STREET              O          123      CENTRAL
## 9          AGG. ASSAULT              O   OTHER  641 NORTHWESTERN
## 10       COMMON ASSAULT              I   HANDS  332      EASTERN
## 11         AGG. ASSAULT              O   OTHER  641 NORTHWESTERN
## 12 ROBBERY - COMMERCIAL              I FIREARM  844 SOUTHWESTERN
## 13              LARCENY                         614 NORTHWESTERN
## 14         AGG. ASSAULT              O   OTHER  641 NORTHWESTERN
## 15    LARCENY FROM AUTO              O          444 NORTHEASTERN
## 16             BURGLARY              I          243 SOUTHEASTERN
## 17             SHOOTING        Outside FIREARM  343      EASTERN
## 18             HOMICIDE        Outside FIREARM  343      EASTERN
## 19       COMMON ASSAULT              O   HANDS  132      CENTRAL
## 20              LARCENY              I          212 SOUTHEASTERN
##                Neighborhood Longitude Latitude
## 1                  Brooklyn -76.60541 39.22951
## 2            Reservoir Hill -76.63217 39.31360
## 3            Winston-Govans -76.60697 39.34768
## 4          Carrollton Ridge -76.64526 39.28315
## 5             Downtown West -76.61365 39.28756
## 6               Cherry Hill -76.62131 39.24867
## 7                    Canton -76.56827 39.28202
## 8                     Upton -76.62789 39.30254
## 9             Windsor Hills -76.68365 39.31370
## 10                    Berea -76.57419 39.30551
## 11            Windsor Hills -76.68365 39.31370
## 12                 Edgewood -76.67759 39.29402
## 13     Central Park Heights -76.67511 39.34861
## 14            Windsor Hills -76.68169 39.31400
## 15                Frankford -76.54270 39.33288
## 16 Holabird Industrial Park -76.53557 39.26533
## 17                   Oliver -76.60246 39.31038
## 18                   Oliver -76.60246 39.31038
## 19             Madison Park -76.62256 39.30083
## 20          Washington Hill -76.59390 39.29020
##                         Location.1    Premise Total.Incidents
## 1  (39.2295100000, -76.6054100000) ROW/TOWNHO               1
## 2  (39.3136000000, -76.6321700000)     STREET               1
## 3  (39.3476800000, -76.6069700000)     Street               1
## 4  (39.2831500000, -76.6452600000) ROW/TOWNHO               1
## 5  (39.2875600000, -76.6136500000)     STREET               1
## 6  (39.2486700000, -76.6213100000) ROW/TOWNHO               1
## 7  (39.2820200000, -76.5682700000)     Street               1
## 8  (39.3025400000, -76.6278900000)     STREET               1
## 9  (39.3137000000, -76.6836500000)     STREET               1
## 10 (39.3055100000, -76.5741900000) ROW/TOWNHO               1
## 11 (39.3137000000, -76.6836500000)     STREET               1
## 12 (39.2940200000, -76.6775900000) RETAIL/SMA               1
## 13 (39.3486100000, -76.6751100000)                          1
## 14 (39.3140000000, -76.6816900000)     STREET               1
## 15 (39.3328800000, -76.5427000000)       YARD               1
## 16 (39.2653300000, -76.5355700000) OTHER - IN               1
## 17 (39.3103800000, -76.6024600000)     Street               1
## 18 (39.3103800000, -76.6024600000)     Street               1
## 19 (39.3008300000, -76.6225600000)     STREET               1
## 20 (39.2902000000, -76.5939000000) CONVENIENC               1

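Note that there are missing values, so the dataset is cleaned with na.omit, which reduces it from 276,529 observations to 274,318. This step only drops rows that contain NA; fields that are simply empty strings, such as the blank Weapon entries, are left in place.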

Baltimore_crimes<-na.omit(Baltimore_crimes)
glimpse(Baltimore_crimes)
## Observations: 274,318
## Variables: 15
## $ CrimeDate       <fctr> 09/02/2017, 09/02/2017, 09/02/2017, 09/02/201...
## $ CrimeTime       <fctr> 23:30:00, 23:00:00, 22:53:00, 22:50:00, 22:31...
## $ CrimeCode       <fctr> 3JK, 7A, 9S, 4C, 4E, 5A, 1F, 3B, 4C, 4E, 4C, ...
## $ Location        <fctr> 4200 AUDREY AVE, 800 NEWINGTON AVE, 600 RADNO...
## $ Description     <fctr> ROBBERY - RESIDENCE, AUTO THEFT, SHOOTING, AG...
## $ Inside.Outside  <fctr> I, O, Outside, I, O, I, Outside, O, O, I, O, ...
## $ Weapon          <fctr> KNIFE, , FIREARM, OTHER, HANDS, , FIREARM, , ...
## $ Post            <int> 913, 133, 524, 934, 113, 922, 232, 123, 641, 3...
## $ District        <fctr> SOUTHERN, CENTRAL, NORTHERN, SOUTHERN, CENTRA...
## $ Neighborhood    <fctr> Brooklyn, Reservoir Hill, Winston-Govans, Car...
## $ Longitude       <dbl> -76.60541, -76.63217, -76.60697, -76.64526, -7...
## $ Latitude        <dbl> 39.22951, 39.31360, 39.34768, 39.28315, 39.287...
## $ Location.1      <fctr> (39.2295100000, -76.6054100000), (39.31360000...
## $ Premise         <fctr> ROW/TOWNHO, STREET, Street, ROW/TOWNHO, STREE...
## $ Total.Incidents <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
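
The behaviour of na.omit can be checked on a toy data frame. This is a minimal sketch with made-up rows (the data frame toy is hypothetical); it suggests that rows with NA are dropped while rows whose text fields are empty strings are kept.

```r
# Hypothetical rows: two are complete, two contain an NA in a numeric column
toy <- data.frame(
  Weapon   = c("KNIFE", "", "FIREARM", ""),
  Post     = c(913L, 133L, NA, 922L),
  Latitude = c(39.22951, 39.31360, 39.34768, NA),
  stringsAsFactors = TRUE
)

clean <- na.omit(toy)

print(nrow(toy))    # rows before cleaning
print(nrow(clean))  # rows after cleaning

# Empty strings are ordinary values, so they are still present
n_blank <- sum(clean$Weapon == "")
print(n_blank)
```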

Transforming the dataset

Baltimore_crimes$CrimeDate<-as.Date(Baltimore_crimes$CrimeDate, format="%m/%d/%Y") # convert the text dates to Date
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2018c.1.0/zoneinfo/America/Bogota'
Baltimore_crimes$CrimeTimehour <- as.integer(substr(Baltimore_crimes$CrimeTime, 1, 2)) # hour of day, taken from the first two characters
Baltimore_crimes$Year <- as.numeric(format(Baltimore_crimes$CrimeDate, "%Y"))
Baltimore_crimes$Month <- as.numeric(format(Baltimore_crimes$CrimeDate, "%m"))
Baltimore_crimes$Day <- as.numeric(format(Baltimore_crimes$CrimeDate, "%d"))
# Provisional label for midnight; CrimeTimeLine is rebuilt with cut() further down, which overwrites this column
Baltimore_crimes$CrimeTimeLine[Baltimore_crimes$CrimeTimehour == 0] <- '(0,6]'
Baltimore_crimes$weekday <- wday(Baltimore_crimes$CrimeDate, label=TRUE)
Baltimore_crimes$hour <- as.numeric(hour(hms(as.character(Baltimore_crimes$CrimeTime))))
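
The date and time handling above can be tried on a few hand-written values. The vectors dates and times below are illustrative, not taken from the file, and the sketch uses only base R, so lubridate is not needed for it.

```r
# Hypothetical values in the same text formats as CrimeDate and CrimeTime
dates <- c("09/02/2017", "01/15/2012", "12/31/2016")
times <- c("23:30:00", "00:05:00", "24:00:00")

d <- as.Date(dates, format = "%m/%d/%Y")   # month/day/year

year  <- as.integer(format(d, "%Y"))
month <- as.integer(format(d, "%m"))
day   <- as.integer(format(d, "%d"))
hour  <- as.integer(substr(times, 1, 2))   # first two characters are the hour

print(data.frame(d, year, month, day, hour))
```

A time written as 24:00:00 yields hour 24 under this approach, which would be consistent with the maximum of 24 reported for CrimeTimehour in the summary.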

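Adjusting the dataset

# Note: CrimeGroup has not been created yet at this point, so the condition below appears to match no rows;
# the line would have to run after the CrimeGroup assignments to take effect.
Baltimore_crimes$Weapon[Baltimore_crimes$Weapon =='' & Baltimore_crimes$CrimeGroup =='RAPE']<- 'OTHER'
Baltimore_crimes$Weapon[Baltimore_crimes$Weapon==""] <- NA
# Group crimes by type, based on the text in Description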
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"ROBBERY")] <- "ROBBERY"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"ASSAULT")] <- "ASSAULT"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"LARCENY")] <- "LARCENY"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"ARSON")] <- "ARSON"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"RAPE")] <- "RAPE"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"SHOOTING")] <- "SHOOTING"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"AUTO THEFT")] <- "AUTO THEFT"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"HOMICIDE")] <- "HOMICIDE"
Baltimore_crimes$CrimeGroup[like(Baltimore_crimes$Description,"BURGLARY")] <- "BURGLARY"
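
The grouping can also be written as a loop over the patterns, and the order of the steps matters for the Weapon rule, since CrimeGroup has to exist before it can be used in a condition. The following is a sketch on a hypothetical data frame toy, using base grepl in place of like from data.table; it is one possible arrangement, not necessarily the one originally intended.

```r
# Hypothetical records
toy <- data.frame(
  Description = c("ROBBERY - STREET", "AGG. ASSAULT", "LARCENY FROM AUTO",
                  "RAPE", "AUTO THEFT"),
  Weapon = c("", "KNIFE", "", "", ""),
  stringsAsFactors = FALSE
)

groups <- c("ROBBERY", "ASSAULT", "LARCENY", "ARSON", "RAPE",
            "SHOOTING", "AUTO THEFT", "HOMICIDE", "BURGLARY")

# Step 1: build the group from the description text
toy$CrimeGroup <- NA_character_
for (g in groups) {
  toy$CrimeGroup[grepl(g, toy$Description, fixed = TRUE)] <- g
}

# Step 2: only now can the group be used to fill in the weapon
toy$Weapon[toy$Weapon == "" & toy$CrimeGroup == "RAPE"] <- "OTHER"
toy$Weapon[toy$Weapon == ""] <- NA

print(toy)
```

Applying this order to the full dataset would change the Weapon counts relative to those shown in the summary, so it is offered only as an option to consider.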

Crime scores

These scores were established by the Connecticut assembly and range from 1 to 10.

Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="ROBBERY"] <-7
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="AUTO THEFT"] <-6
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="SHOOTING"] <-6
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="ASSAULT"] <-8
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="BURGLARY"] <-6
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="HOMICIDE"] <-10
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="LARCENY"] <-4
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="ARSON"] <-5
Baltimore_crimes$CrimeScore[Baltimore_crimes$CrimeGroup=="RAPE"] <-9
Baltimore_crimes$CrimeScore <- as.factor(Baltimore_crimes$CrimeScore)
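
Because CrimeScore ends up stored as a factor, turning it back into numbers needs some care: as.integer() applied to a factor returns the internal level codes, not the labels. A minimal sketch with made-up scores:

```r
# Hypothetical scores stored as a factor
score <- as.factor(c(7, 4, 10, 4, 8))

codes  <- as.integer(score)                # positions of the levels
values <- as.integer(as.character(score))  # the scores themselves

print(codes)
print(values)

# The two means differ, so the conversion route matters for any average
print(mean(codes))
print(mean(values))
```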

Time window in which the crime occurs

Baltimore_crimes$CrimeTimeLine <- cut(Baltimore_crimes$CrimeTimehour, breaks=c(0,6,12,18,24), right=TRUE)
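
With right=TRUE the intervals are open on the left, so an hour of exactly 0 falls outside (0,6] and becomes NA; this would be consistent with the NA count reported for CrimeTimeLine in the summary. One possible adjustment, shown here on made-up hours, is include.lowest = TRUE, which closes the first interval.

```r
# Hypothetical hours, including both boundary values
hours  <- c(0, 3, 6, 7, 12, 18, 23, 24)
breaks <- c(0, 6, 12, 18, 24)

as_is    <- cut(hours, breaks = breaks, right = TRUE)
adjusted <- cut(hours, breaks = breaks, right = TRUE, include.lowest = TRUE)

# Side by side: hour 0 is NA in the first column and binned in the second
print(data.frame(hours, as_is, adjusted))
```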

EDA on Baltimore_crimes

Some descriptive statistics

summary(Baltimore_crimes)
##    CrimeDate             CrimeTime        CrimeCode     
##  Min.   :2012-01-01   18:00:00:  6700   4E     : 45159  
##  1st Qu.:2013-06-05   17:00:00:  6446   6D     : 35850  
##  Median :2014-11-07   16:00:00:  6012   5A     : 25601  
##  Mean   :2014-11-09   12:00:00:  5852   7A     : 24940  
##  3rd Qu.:2016-04-30   20:00:00:  5787   6G     : 15821  
##  Max.   :2017-09-02   21:00:00:  5701   6C     : 12966  
##                       (Other) :237820   (Other):113981  
##             Location                 Description    Inside.Outside  
##  200 E PRATT ST :   654   LARCENY          :60092          : 10205  
##  300 LIGHT ST   :   567   COMMON ASSAULT   :45159   I      :131295  
##  1500 RUSSELL ST:   556   BURGLARY         :42359   Inside :   632  
##  3500 BOSTON ST :   405   LARCENY FROM AUTO:35850   O      :128350  
##  1200 W PRATT ST:   371   AGG. ASSAULT     :27315   Outside:  3836  
##  0 LIGHT ST     :   363   AUTO THEFT       :26532                   
##  (Other)        :271402   (Other)          :37011                   
##      Weapon            Post               District    
##         :     0   Min.   :111.0   NORTHEASTERN:42707  
##  FIREARM: 22228   1st Qu.:243.0   SOUTHEASTERN:38023  
##  HANDS  : 48602   Median :511.0   SOUTHERN    :31616  
##  KNIFE  :  9564   Mean   :506.2   CENTRAL     :31397  
##  OTHER  : 14490   3rd Qu.:731.0   NORTHERN    :31273  
##  NA's   :179434   Max.   :943.0   NORTHWESTERN:27895  
##                                   (Other)     :71407  
##               Neighborhood      Longitude         Latitude    
##  Downtown           :  9048   Min.   :-76.71   Min.   :39.20  
##  Frankford          :  6642   1st Qu.:-76.65   1st Qu.:39.29  
##  Belair-Edison      :  5977   Median :-76.61   Median :39.30  
##  Brooklyn           :  4516   Mean   :-76.62   Mean   :39.31  
##  Cherry Hill        :  4086   3rd Qu.:-76.59   3rd Qu.:39.33  
##  Sandtown-Winchester:  4026   Max.   :-76.53   Max.   :39.37  
##  (Other)            :240023                                   
##                            Location.1           Premise     
##  (39.3180000000, -76.6582100000):   535   STREET    :99800  
##  (39.2854400000, -76.6134000000):   481   ROW/TOWNHO:60240  
##  (39.2876100000, -76.5398200000):   441   APT/CONDO :11908  
##  (39.2740800000, -76.6276900000):   417   PARKING LO:11873  
##  (39.3186800000, -76.6539900000):   408   OTHER - IN:11377  
##  (39.2865900000, -76.6120400000):   390             :10683  
##  (Other)                        :271646   (Other)   :68437  
##  Total.Incidents CrimeTimehour        Year          Month       
##  Min.   :1       Min.   : 0.00   Min.   :2012   Min.   : 1.000  
##  1st Qu.:1       1st Qu.: 9.00   1st Qu.:2013   1st Qu.: 4.000  
##  Median :1       Median :15.00   Median :2014   Median : 6.000  
##  Mean   :1       Mean   :13.29   Mean   :2014   Mean   : 6.448  
##  3rd Qu.:1       3rd Qu.:19.00   3rd Qu.:2016   3rd Qu.: 9.000  
##  Max.   :1       Max.   :24.00   Max.   :2017   Max.   :12.000  
##                                                                 
##       Day        CrimeTimeLine    weekday           hour      
##  Min.   : 1.00   (0,6]  :37539   Sun  :36902   Min.   : 0.00  
##  1st Qu.: 8.00   (6,12] :62001   Mon  :39929   1st Qu.: 9.00  
##  Median :16.00   (12,18]:88466   Tues :39421   Median :15.00  
##  Mean   :15.81   (18,24]:73650   Wed  :39461   Mean   :13.29  
##  3rd Qu.:23.00   NA's   :12662   Thurs:39088   3rd Qu.:19.00  
##  Max.   :31.00                   Fri  :41242   Max.   :24.00  
##                                  Sat  :38275                  
##   CrimeGroup        CrimeScore
##  Length:274318      4 :95942  
##  Class :character   5 : 1444  
##  Mode  :character   6 :71800  
##                     7 :25981  
##                     8 :75961  
##                     9 : 1631  
##                     10: 1559

Visualizing the types of weapon

ggplot(data=Baltimore_crimes , aes(Weapon)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)), fill='red', color='Black') + 
  labs(x= "Weapon used", y="Proportion of incidents") 
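
The proportions shown in the bar chart can also be tabulated directly. This is a sketch with a hypothetical weapon vector; useNA = "ifany" keeps the missing values as their own category so that the shares add up to one.

```r
# Hypothetical weapon records, with NA for incidents without a recorded weapon
weapon <- c("FIREARM", "HANDS", NA, "HANDS", NA, NA, "KNIFE", NA)

counts <- table(weapon, useNA = "ifany")
shares <- prop.table(counts)

print(round(shares, 3))
```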

ggplot(subset(Baltimore_crimes,!is.na(District)))+
  aes(x=District)+
  geom_bar(stat = "count",fill='skyblue') + 
  geom_text(stat="count",aes(label=..count..),vjust=-1)+
  labs(title="Frequency of incidents by district",x="District",y="Incidents")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(limits=c(0,45000))

ggplot(subset(Baltimore_crimes,!is.na(District) ))+
  aes(x=Year, color=District)+
  geom_line(stat="count")+
  scale_x_continuous(breaks = seq(2012,2017,1))+
  scale_y_continuous(breaks = seq(5000,50000,5000))+
  labs(title="Frequency of incidents by year and district",x="Year",y="Incidents")

ggplot(Baltimore_crimes)+
  aes(x=weekday)+
  geom_bar(color='black',fill="skyblue")+
  scale_y_continuous(breaks = seq(5000,45000,5000),limits = c(0,45000))+ 
  geom_text(stat="count",aes(label=..count..),vjust=-1)+
  labs(title="Incidents by day of the week",x="Day of the week",y="Incidents")

This observation shows that Friday is the day with the most recorded incidents in Baltimore (41,242) over the period covered by the data, January 2012 to early September 2017. But what is the most dangerous hour, and where are most crimes committed?

## Warning: Removed 1 rows containing non-finite values (stat_count).
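
The question about the most dangerous hour can be approached by tabulating incidents per hour and per day of the week. The sketch below uses made-up records in a hypothetical data frame toy; format(date, "%u") returns the ISO weekday number (1 is Monday, 7 is Sunday), which avoids locale-dependent day names.

```r
# Hypothetical incidents: two on a Friday evening, one Saturday, one Monday
toy <- data.frame(
  date = as.Date(c("2017-09-01", "2017-09-01", "2017-09-02", "2017-08-28")),
  hour = c(18L, 18L, 23L, 9L)
)
toy$weekday_num <- as.integer(format(toy$date, "%u"))

# Incident counts by hour and by weekday
by_hour <- table(toy$hour)
by_day  <- table(toy$weekday_num)

# Category with the highest count in each table
busiest_hour <- as.integer(names(by_hour)[which.max(by_hour)])
busiest_day  <- as.integer(names(by_day)[which.max(by_day)])

print(busiest_hour)
print(busiest_day)
```

On the real data the same tabulation would use the hour and weekday columns created earlier.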

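# CrimeScore is a factor: convert through character so the mean uses the scores, not the level codes
Baltimore_crimes.District.CrimeScore <- Baltimore_crimes[!is.na(Baltimore_crimes$District),] %>% 
  group_by(District, Year) %>% 
  summarise(mean_CrimeScore = mean(as.integer(as.character(CrimeScore))))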

colnames(Baltimore_crimes.District.CrimeScore)[1] <- "District"



ggplot(data = Baltimore_crimes.District.CrimeScore, aes(x = Year, y =mean_CrimeScore, color = District ) )+
  geom_line() + 
  labs(title = "Mean crime score by district\n", x = "Year", y = "Mean score", color = "District\n")

Categorization

ggplot(data=Baltimore_crimes, aes(x=Year)) + 
  geom_bar(colour="black", fill='red') + 
  labs(title="Frequency of offenses",x="Year",y="Incidents")+
  geom_text(stat="count",aes(label=..count..),vjust=8)+
  ylim(0,55000)

The first conclusions, which are also important ones, are:

  • On Fridays, in the early morning hours, the central area is where most crimes occur, committed with unidentified weapons.

  1. Data Scientist and Quantitative Developer, MSc in Statistics (c)