Baltimore Crime Rate Final Project

Introduction

Crime is an issue that should be studied thoroughly and taken seriously throughout the world. Our dataset covers crime in Baltimore from 2012 to 2017, and the main purpose of this analysis is to compare crime trends across those years. For each year, we plan to find the total number of crimes, the count of each type of weapon used, how many crimes occurred inside and outside, and the premise of each crime. Looking at trends in these variables should help in predicting how they will develop in Baltimore in the coming years. Such trends can help police officers, FBI agents, and citizens throughout Baltimore prepare for and protect themselves from crime. To address these questions, we plan to plot histograms and scatterplots of the column variables and examine their skewness and trends.

Packages Required

library(readxl)
library(tidyverse)
library(stringr)
library(ggvis)
library(ggmap)
library(pander)

readxl: To import the dataset into R using the “read_excel” function.
tidyverse: To manipulate and clean the data. It supplies the pipe operator and the separate function, which we use to split the CrimeTime column into date and time so that the default date of 1899-12-31 can be deleted. It also loads ggplot2, which draws the plots in this report.
stringr: To do consistent formatting for the Inside/Outside column. This will be explained further on in the report.
ggvis: To create some visualizations and graphical charts to represent the analysis that we did on the cleaned data.
ggmap: To show a geographic representation of where Baltimore is located and to show the longitude and latitude of where crimes occurred in Baltimore.
pander: To get the output in a table format.

Data Preparation

We searched thoroughly for a suitable dataset and found one on Kaggle. The original purpose of this data was to analyze many factors contributing to crime in the neighborhoods of Baltimore from 2012 to 2017. These factors are the specific location of each crime, the type of weapon used, the date and time of the crime, and whether the crime occurred inside or outside. We originally saved our dataset as an xlsx file and used the code below to import it and obtain basic findings such as the dimensions, structure, and summary of the original dataset.

crime <- read_excel("Baltimore Crime Data.xlsx")
names(crime)
dim(crime)
str(crime)
head(crime)
summary(crime)

The original dataset had 276,529 observations and 15 variables. Studying each variable is key to analyzing a dataset. We decided that one column, Location 1, wasn’t needed because it is redundant with the Longitude and Latitude variables. We removed it, the 13th column, with the command below, reducing the number of variables from 15 to 14:

crime <- crime[,-c(13)]
dim(crime)

## [1] 276529     14
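
Dropping a column by position works here, but it breaks silently if the column order ever changes. As a minimal sketch, the same step can be done by name. The toy data frame below is a hypothetical stand-in for the imported data, not the real file:

```r
# Toy stand-in for the imported data (hypothetical values); the column name
# "Location 1" is the one mentioned in the report.
toy <- data.frame(
  Longitude = c(-76.605, -76.632),
  Latitude = c(39.230, 39.314),
  `Location 1` = c("(39.230, -76.605)", "(39.314, -76.632)"),
  check.names = FALSE
)

# Keep every column except the redundant one, selecting by name.
toy_clean <- toy[, names(toy) != "Location 1", drop = FALSE]
names(toy_clean)
```

Selecting by name also makes the intent readable without counting columns.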

Missing Values

The next step in the analysis was to look at missing values. There were quite a few missing values in the original dataset. These missing values were spread out over a variety of column variables. For us to get a better feel for how many missing values were in each column, the following code was used:

crime %>% map_dbl(~sum(is.na(.)))

##       CrimeDate       CrimeTime       CrimeCode        Location     Description  Inside/Outside          Weapon            Post        District    Neighborhood       Longitude        Latitude         Premise Total Incidents 
##               0               0               0            2207               0           10279          180952             224              80            2740            2204            2204           10757               0

The Weapon variable had by far the most missing values (180,952), while the Inside/Outside and Premise variables each had over 10,000. The Location, Neighborhood, Longitude, and Latitude variables each had over 2,000 missing values. The Post and District variables had only 224 and 80, and the rest of the variables had none. How missing values should be handled in possible modeling techniques is left to the next steps of this project. For now, we set the missing values aside and focus on cleaning up the rest of the data.
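
Raw counts are easier to judge as a share of all rows. Weapon's 180,952 missing values, for example, are roughly 65% of the 276,529 observations. The sketch below shows one way to get both figures per column in base R, using a small hypothetical data frame rather than the real data:

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(
  Weapon = c("KNIFE", NA, "FIREARM", NA),
  Premise = c("STREET", "STREET", NA, "ROW/TOWNHO"),
  Post = c(913, 133, 524, 934)
)

# Number of missing values per column, and the same as a percentage of rows.
na_count <- colSums(is.na(toy))
na_share <- round(100 * colMeans(is.na(toy)), 1)
data.frame(na_count, na_share)
```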

Date Formatting

Looking at the CrimeTime column of our original dataset, we saw that R displays each value with a default date of 1899-12-31 in front of the time. That date is incorrect, and the 1st column already shows the correct date of each crime incident. We therefore split CrimeTime into a date part and a time part and then deleted the date part. The tidyverse library supplies the pipe operator and the separate function used for this task.

head(crime)

## # A tibble: 6 x 14
##   CrimeDate           CrimeTime           CrimeCode Location          Description         `Inside/Outside` Weapon   Post District Neighborhood     Longitude Latitude Premise    `Total Incidents`
##   <dttm>              <dttm>              <chr>     <chr>             <chr>               <chr>            <chr>   <dbl> <chr>    <chr>                <dbl>    <dbl> <chr>                  <dbl>
## 1 2017-09-02 00:00:00 1899-12-31 23:30:00 3JK       4200 AUDREY AVE   ROBBERY - RESIDENCE I                KNIFE     913 SOUTHERN Brooklyn             -76.6     39.2 ROW/TOWNHO              1.00
## 2 2017-09-02 00:00:00 1899-12-31 23:00:00 7A        800 NEWINGTON AVE AUTO THEFT          O                <NA>      133 CENTRAL  Reservoir Hill       -76.6     39.3 STREET                  1.00
## 3 2017-09-02 00:00:00 1899-12-31 22:53:00 9S        600 RADNOR AV     SHOOTING            Outside          FIREARM   524 NORTHERN Winston-Govans       -76.6     39.3 Street                  1.00
## 4 2017-09-02 00:00:00 1899-12-31 22:50:00 4C        1800 RAMSAY ST    AGG. ASSAULT        I                OTHER     934 SOUTHERN Carrollton Ridge     -76.6     39.3 ROW/TOWNHO              1.00
## 5 2017-09-02 00:00:00 1899-12-31 22:31:00 4E        100 LIGHT ST      COMMON ASSAULT      O                HANDS     113 CENTRAL  Downtown West        -76.6     39.3 STREET                  1.00
## 6 2017-09-02 00:00:00 1899-12-31 22:00:00 5A        CHERRYCREST RD    BURGLARY            I                <NA>      922 SOUTHERN Cherry Hill          -76.6     39.2 ROW/TOWNHO              1.00

crime <- crime %>% separate(CrimeTime, c("date", "Time"), sep = " ")
crime <- crime[,-c(2)]
head(crime)

## # A tibble: 6 x 14
##   CrimeDate           Time     CrimeCode Location          Description         `Inside/Outside` Weapon   Post District Neighborhood     Longitude Latitude Premise    `Total Incidents`
##   <dttm>              <chr>    <chr>     <chr>             <chr>               <chr>            <chr>   <dbl> <chr>    <chr>                <dbl>    <dbl> <chr>                  <dbl>
## 1 2017-09-02 00:00:00 23:30:00 3JK       4200 AUDREY AVE   ROBBERY - RESIDENCE I                KNIFE     913 SOUTHERN Brooklyn             -76.6     39.2 ROW/TOWNHO              1.00
## 2 2017-09-02 00:00:00 23:00:00 7A        800 NEWINGTON AVE AUTO THEFT          O                <NA>      133 CENTRAL  Reservoir Hill       -76.6     39.3 STREET                  1.00
## 3 2017-09-02 00:00:00 22:53:00 9S        600 RADNOR AV     SHOOTING            Outside          FIREARM   524 NORTHERN Winston-Govans       -76.6     39.3 Street                  1.00
## 4 2017-09-02 00:00:00 22:50:00 4C        1800 RAMSAY ST    AGG. ASSAULT        I                OTHER     934 SOUTHERN Carrollton Ridge     -76.6     39.3 ROW/TOWNHO              1.00
## 5 2017-09-02 00:00:00 22:31:00 4E        100 LIGHT ST      COMMON ASSAULT      O                HANDS     113 CENTRAL  Downtown West        -76.6     39.3 STREET                  1.00
## 6 2017-09-02 00:00:00 22:00:00 5A        CHERRYCREST RD    BURGLARY            I                <NA>      922 SOUTHERN Cherry Hill          -76.6     39.2 ROW/TOWNHO              1.00

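The 1st column is stored as a date-time in yyyy-mm-dd format with a 00:00:00 time attached. We converted it to a plain date with as.Date, which drops the time component; the values still print as yyyy-mm-dd, as the later output shows. A format argument has no effect when as.Date is given a date-time, so it is left out here. When dates need to be shown to readers as mm/dd/yyyy, format(crime$CrimeDate, "%m/%d/%Y") produces that as text. We then added a Year column showing just the year of each crime observation. This will be helpful in the visualizations later in the report.

# Convert the date-time to a Date (drops the 00:00:00 time component)
crime$CrimeDate <- as.Date(crime$CrimeDate)
# Extract the four-digit year as a number
crime$Year <- as.numeric(format(crime$CrimeDate, "%Y"))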
head(crime)
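
A short sketch of the difference between how a date is stored and how it is displayed may be useful here. It uses a single made-up value rather than the crime data: the Date object keeps its ISO form, and format() returns a character string for display only.

```r
# One example date-time, similar to a CrimeDate value.
stamp <- as.POSIXct("2017-09-02 00:00:00", tz = "UTC")

# Converting to Date drops the time of day; the value prints as yyyy-mm-dd.
d <- as.Date(stamp, tz = "UTC")

# A display string in mm/dd/yyyy; this is text, not a Date.
shown <- format(d, "%m/%d/%Y")

# The year as a number, as done for the Year column.
year <- as.numeric(format(d, "%Y"))
```

Keeping the column as a Date and formatting only at display time preserves sorting and date arithmetic.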

Rearranging Columns

It is important for datasets to have similar columns grouped next to each other. It was awkward to have the Location column between the CrimeCode and Description columns, as there isn’t much of a logical link between them. Because of this, we moved the 4th column, Location, to come after the Neighborhood column, since location and neighborhood are related to each other. We also moved the Year column to sit between the CrimeDate and Time columns.

crime <- crime[c(1,15,2,3,5:10,4,11:14)]
names(crime)

##  [1] "CrimeDate"       "Year"            "Time"            "CrimeCode"       "Description"     "Inside/Outside"  "Weapon"          "Post"            "District"        "Neighborhood"    "Location"        "Longitude"       "Latitude"        "Premise"         "Total Incidents"
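
The index vector above depends on the exact column positions at this point in the script. As a hedged alternative, the same kind of reordering can be written with column names. The data frame below is a hypothetical five-column stand-in, not the full dataset:

```r
# Hypothetical stand-in with a few of the report's columns.
toy <- data.frame(
  CrimeDate = as.Date("2017-09-02"),
  Time = "23:30:00",
  Location = "4200 AUDREY AVE",
  Neighborhood = "Brooklyn",
  Year = 2017
)

# State the desired order by name, then index with it.
wanted <- c("CrimeDate", "Year", "Time", "Neighborhood", "Location")
toy <- toy[, wanted]
names(toy)
```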

Trimming the decimal values

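The latitude and longitude fields originally had inconsistent numbers of decimal places, so we rounded them to 3 decimal places using the round function. The tibble output further below still prints these columns to 3 significant digits (for example -76.6), so the rounding is not visible there.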

crime$Longitude <- round(crime$Longitude, digits = 3)
crime$Latitude <- round(crime$Latitude, digits = 3)
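
To get a feel for what 3 decimal places means on the ground, the arithmetic below converts a step of 0.001 degrees into meters. It assumes roughly 111 km per degree of latitude (a common approximation, not a value from the dataset) and uses 39.3 degrees, about the median latitude in the summary output:

```r
km_per_degree <- 111    # assumed approximate length of one degree of latitude
lat_baltimore <- 39.3   # about the median latitude reported by summary()
step <- 0.001           # resolution after rounding to 3 decimal places

# North-south distance covered by one step.
lat_m <- step * km_per_degree * 1000

# East-west distance shrinks with the cosine of the latitude.
lon_m <- step * km_per_degree * cos(lat_baltimore * pi / 180) * 1000

round(c(lat_m, lon_m))
```

Under these assumptions each rounded coordinate is accurate to about a city block, which is in line with the block-level addresses in the Location column.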

Maintaining Data Consistency

Having consistent values in categorical columns is very important. In the Inside/Outside column, some of the values originally read “Inside” or “Outside”, but plenty of values read “I” and “O”. To rectify this, the str_replace_all command was used to change the “Outside” values to “O” and then the “Inside” values to “I”.

crime <- crime %>%
  mutate(
    `Inside/Outside` = str_replace_all(`Inside/Outside`, "Outside", "O"),
    `Inside/Outside` = str_replace_all(`Inside/Outside`, "Inside", "I")
  )

The Premise variable had the same values in both upper and lower case. To keep the data consistent, we converted all values to upper case using the toupper function.

crime$Premise <- toupper(crime$Premise)
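
After a recoding step like this, it helps to confirm that only the expected codes remain. The sketch below uses a made-up vector and exact matching in base R, as an alternative to the pattern-based replacement above, and then tabulates the result including missing values:

```r
# Made-up values mixing the short and long spellings.
io <- c("I", "O", "Outside", "Inside", NA, "I")

# Exact-match recoding; %in% treats NA as a non-match, so NAs stay NA.
io[io %in% "Outside"] <- "O"
io[io %in% "Inside"] <- "I"

# Only I, O and NA should remain.
table(io, useNA = "ifany")
```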

The first 10 observations of the cleaned, condensed data at this point look like this:

head(crime, 10)

## # A tibble: 10 x 15
##    CrimeDate   Year Time     CrimeCode Description         `Inside/Outside` Weapon   Post District     Neighborhood     Location           Longitude Latitude Premise    `Total Incidents`
##    <date>     <dbl> <chr>    <chr>     <chr>               <chr>            <chr>   <dbl> <chr>        <chr>            <chr>                  <dbl>    <dbl> <chr>                  <dbl>
##  1 2017-09-02  2017 23:30:00 3JK       ROBBERY - RESIDENCE I                KNIFE     913 SOUTHERN     Brooklyn         4200 AUDREY AVE        -76.6     39.2 ROW/TOWNHO              1.00
##  2 2017-09-02  2017 23:00:00 7A        AUTO THEFT          O                <NA>      133 CENTRAL      Reservoir Hill   800 NEWINGTON AVE      -76.6     39.3 STREET                  1.00
##  3 2017-09-02  2017 22:53:00 9S        SHOOTING            O                FIREARM   524 NORTHERN     Winston-Govans   600 RADNOR AV          -76.6     39.3 STREET                  1.00
##  4 2017-09-02  2017 22:50:00 4C        AGG. ASSAULT        I                OTHER     934 SOUTHERN     Carrollton Ridge 1800 RAMSAY ST         -76.6     39.3 ROW/TOWNHO              1.00
##  5 2017-09-02  2017 22:31:00 4E        COMMON ASSAULT      O                HANDS     113 CENTRAL      Downtown West    100 LIGHT ST           -76.6     39.3 STREET                  1.00
##  6 2017-09-02  2017 22:00:00 5A        BURGLARY            I                <NA>      922 SOUTHERN     Cherry Hill      CHERRYCREST RD         -76.6     39.2 ROW/TOWNHO              1.00
##  7 2017-09-02  2017 21:15:00 1F        HOMICIDE            O                FIREARM   232 SOUTHEASTERN Canton           3400 HARMONY CT        -76.6     39.3 STREET                  1.00
##  8 2017-09-02  2017 21:35:00 3B        ROBBERY - STREET    O                <NA>      123 CENTRAL      Upton            400 W LANVALE ST       -76.6     39.3 STREET                  1.00
##  9 2017-09-02  2017 21:00:00 4C        AGG. ASSAULT        O                OTHER     641 NORTHWESTERN Windsor Hills    2300 LYNDHURST AVE     -76.7     39.3 STREET                  1.00
## 10 2017-09-02  2017 21:00:00 4E        COMMON ASSAULT      I                HANDS     332 EASTERN      Berea            1200 N ELLWOOD AVE     -76.6     39.3 ROW/TOWNHO              1.00

To get a clearer picture of the cleaned data, we ran the structure and summary commands to see what type of variable each column represented and summary statistics of the variable columns.

str(crime)

## Classes 'tbl_df', 'tbl' and 'data.frame':    276529 obs. of  15 variables:
##  $ CrimeDate      : Date, format: "2017-09-02" "2017-09-02" "2017-09-02" "2017-09-02" ...
##  $ Year           : num  2017 2017 2017 2017 2017 ...
##  $ Time           : chr  "23:30:00" "23:00:00" "22:53:00" "22:50:00" ...
##  $ CrimeCode      : chr  "3JK" "7A" "9S" "4C" ...
##  $ Description    : chr  "ROBBERY - RESIDENCE" "AUTO THEFT" "SHOOTING" "AGG. ASSAULT" ...
##  $ Inside/Outside : chr  "I" "O" "O" "I" ...
##  $ Weapon         : chr  "KNIFE" NA "FIREARM" "OTHER" ...
##  $ Post           : num  913 133 524 934 113 922 232 123 641 332 ...
##  $ District       : chr  "SOUTHERN" "CENTRAL" "NORTHERN" "SOUTHERN" ...
##  $ Neighborhood   : chr  "Brooklyn" "Reservoir Hill" "Winston-Govans" "Carrollton Ridge" ...
##  $ Location       : chr  "4200 AUDREY AVE" "800 NEWINGTON AVE" "600 RADNOR AV" "1800 RAMSAY ST" ...
##  $ Longitude      : num  -76.6 -76.6 -76.6 -76.6 -76.6 ...
##  $ Latitude       : num  39.2 39.3 39.3 39.3 39.3 ...
##  $ Premise        : chr  "ROW/TOWNHO" "STREET" "STREET" "ROW/TOWNHO" ...
##  $ Total Incidents: num  1 1 1 1 1 1 1 1 1 1 ...

summary(crime)

##    CrimeDate               Year          Time            CrimeCode         Description        Inside/Outside        Weapon               Post         District         Neighborhood         Location           Longitude         Latitude       Premise          Total Incidents
##  Min.   :2012-01-01   Min.   :2012   Length:276529      Length:276529      Length:276529      Length:276529      Length:276529      Min.   :  2.0   Length:276529      Length:276529      Length:276529      Min.   :-76.71   Min.   :39.20   Length:276529      Min.   :1      
##  1st Qu.:2013-06-04   1st Qu.:2013   Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.:243.0   Class :character   Class :character   Class :character   1st Qu.:-76.65   1st Qu.:39.29   Class :character   1st Qu.:1      
##  Median :2014-11-05   Median :2014   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :511.0   Mode  :character   Mode  :character   Mode  :character   Median :-76.61   Median :39.30   Mode  :character   Median :1      
##  Mean   :2014-11-07   Mean   :2014                                                                                                  Mean   :506.3                                                            Mean   :-76.62   Mean   :39.31                      Mean   :1      
##  3rd Qu.:2016-04-27   3rd Qu.:2016                                                                                                  3rd Qu.:731.0                                                            3rd Qu.:-76.59   3rd Qu.:39.33                      3rd Qu.:1      
##  Max.   :2017-09-02   Max.   :2017                                                                                                  Max.   :945.0                                                            Max.   :-76.53   Max.   :39.37                      Max.   :1      
##                                                                                                                                     NA's   :224                                                              NA's   :2204     NA's   :2204

Looking at the structure of the cleaned dataset, there is 1 date variable (CrimeDate), 9 character variables (Time, CrimeCode, Description, Inside/Outside, Weapon, District, Neighborhood, Location, and Premise), and 5 numerical variables (Year, Post, Longitude, Latitude, and Total Incidents).

Exploratory Data Analysis

Cuts: Time of crime broken into 6 categories

Looking at all our variables, we decided the first step was to see whether any new variables should be created to make the data and the plots easier to interpret. We divided the time variable into 6 cuts, or categories, of 4 hours each. Because cut with 4-hour breaks closes each interval on the left, the bins are: 00:00:00 to 03:59:59 is cut 1; 04:00:00 to 07:59:59 is cut 2; 08:00:00 to 11:59:59 is cut 3; 12:00:00 to 15:59:59 is cut 4; 16:00:00 to 19:59:59 is cut 5; and 20:00:00 to 23:59:59 is cut 6. The output below confirms this: the crimes recorded at 16:00:00 and 20:00:00 fall in cuts 5 and 6 respectively. We then explore a plot of the number of crimes in each time window.

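# Parse the time strings; as.POSIXct attaches the date on which the code is run
crime$Time <- as.POSIXct(crime$Time, format = "%H:%M:%S")
# Bin into 4-hour windows numbered 1 to 6; data.frame() also turns names like Inside/Outside into Inside.Outside
crime <- data.frame(crime, cuts = cut(crime$Time, breaks = "4 hours", labels = FALSE))
# Time-of-day string; note that the column selection below does not keep it
crime$time <- format(crime$Time, format = "%T")
# Place cuts right after Time
crime <- crime[c(1,2,3,16,4:15)]
head(crime, 25)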

##     CrimeDate Year                Time cuts CrimeCode          Description Inside.Outside  Weapon Post     District             Neighborhood               Location Longitude Latitude    Premise Total.Incidents
## 1  2017-09-02 2017 2018-04-22 23:30:00    6       3JK  ROBBERY - RESIDENCE              I   KNIFE  913     SOUTHERN                 Brooklyn        4200 AUDREY AVE   -76.605   39.230 ROW/TOWNHO               1
## 2  2017-09-02 2017 2018-04-22 23:00:00    6        7A           AUTO THEFT              O    <NA>  133      CENTRAL           Reservoir Hill      800 NEWINGTON AVE   -76.632   39.314     STREET               1
## 3  2017-09-02 2017 2018-04-22 22:53:00    6        9S             SHOOTING              O FIREARM  524     NORTHERN           Winston-Govans          600 RADNOR AV   -76.607   39.348     STREET               1
## 4  2017-09-02 2017 2018-04-22 22:50:00    6        4C         AGG. ASSAULT              I   OTHER  934     SOUTHERN         Carrollton Ridge         1800 RAMSAY ST   -76.645   39.283 ROW/TOWNHO               1
## 5  2017-09-02 2017 2018-04-22 22:31:00    6        4E       COMMON ASSAULT              O   HANDS  113      CENTRAL            Downtown West           100 LIGHT ST   -76.614   39.288     STREET               1
## 6  2017-09-02 2017 2018-04-22 22:00:00    6        5A             BURGLARY              I    <NA>  922     SOUTHERN              Cherry Hill         CHERRYCREST RD   -76.621   39.249 ROW/TOWNHO               1
## 7  2017-09-02 2017 2018-04-22 21:15:00    6        1F             HOMICIDE              O FIREARM  232 SOUTHEASTERN                   Canton        3400 HARMONY CT   -76.568   39.282     STREET               1
## 8  2017-09-02 2017 2018-04-22 21:35:00    6        3B     ROBBERY - STREET              O    <NA>  123      CENTRAL                    Upton       400 W LANVALE ST   -76.628   39.303     STREET               1
## 9  2017-09-02 2017 2018-04-22 21:00:00    6        4C         AGG. ASSAULT              O   OTHER  641 NORTHWESTERN            Windsor Hills     2300 LYNDHURST AVE   -76.684   39.314     STREET               1
## 10 2017-09-02 2017 2018-04-22 21:00:00    6        4E       COMMON ASSAULT              I   HANDS  332      EASTERN                    Berea     1200 N ELLWOOD AVE   -76.574   39.306 ROW/TOWNHO               1
## 11 2017-09-02 2017 2018-04-22 21:00:00    6        4C         AGG. ASSAULT              O   OTHER  641 NORTHWESTERN            Windsor Hills     2300 LYNDHURST AVE   -76.684   39.314     STREET               1
## 12 2017-09-02 2017 2018-04-22 20:56:00    6       3CF ROBBERY - COMMERCIAL              I FIREARM  844 SOUTHWESTERN                 Edgewood     3600 EDMONDSON AVE   -76.678   39.294 RETAIL/SMA               1
## 13 2017-09-02 2017 2018-04-22 20:55:00    6        6C              LARCENY           <NA>    <NA>  614 NORTHWESTERN     Central Park Heights  5100 PARK HEIGHTS AVE   -76.675   39.349       <NA>               1
## 14 2017-09-02 2017 2018-04-22 20:10:00    6        4C         AGG. ASSAULT              O   OTHER  641 NORTHWESTERN            Windsor Hills 3900 GWYNNS FALLS PKWY   -76.682   39.314     STREET               1
## 15 2017-09-02 2017 2018-04-22 20:00:00    6        6D    LARCENY FROM AUTO              O    <NA>  444 NORTHEASTERN                Frankford   5500 SUMMERFIELD AVE   -76.543   39.333       YARD               1
## 16 2017-09-02 2017 2018-04-22 19:52:00    5        5D             BURGLARY              I    <NA>  243 SOUTHEASTERN Holabird Industrial Park      2200 VAN DEMAN ST   -76.536   39.265 OTHER - IN               1
## 17 2017-09-02 2017 2018-04-22 18:08:00    5        9S             SHOOTING              O FIREARM  343      EASTERN                   Oliver    1200 E LAFAYETTE AV   -76.602   39.310     STREET               1
## 18 2017-09-02 2017 2018-04-22 18:08:00    5        1F             HOMICIDE              O FIREARM  343      EASTERN                   Oliver    1200 E LAFAYETTE AV   -76.602   39.310     STREET               1
## 19 2017-09-02 2017 2018-04-22 18:16:00    5        4E       COMMON ASSAULT              O   HANDS  132      CENTRAL             Madison Park        1000 N EUTAW ST   -76.623   39.301     STREET               1
## 20 2017-09-02 2017 2018-04-22 18:00:00    5        6G              LARCENY              I    <NA>  212 SOUTHEASTERN          Washington Hill         100 S BROADWAY   -76.594   39.290 CONVENIENC               1
## 21 2017-09-02 2017 2018-04-22 17:17:00    5        5A             BURGLARY              I    <NA>  426 NORTHEASTERN               Waltherson     4000 RIDGECROFT RD   -76.557   39.336 ROW/TOWNHO               1
## 22 2017-09-02 2017 2018-04-22 16:55:00    5       3CF ROBBERY - COMMERCIAL              I FIREARM  513     NORTHERN           Better Waverly       600 HOMESTEAD ST   -76.608   39.327 RETAIL/SMA               1
## 23 2017-09-02 2017 2018-04-22 16:00:00    5        4C         AGG. ASSAULT              O   OTHER  731      WESTERN                Mondawmin     2000 N BENTALOU ST   -76.654   39.311     STREET               1
## 24 2017-09-02 2017 2018-04-22 15:46:00    4        3B     ROBBERY - STREET              O    <NA>  612 NORTHWESTERN              Park Circle   AV & REISTERSTOWN RD   -76.660   39.329     STREET               1
## 25 2017-09-02 2017 2018-04-22 15:00:00    4        6D    LARCENY FROM AUTO              O    <NA>  415 NORTHEASTERN  Morgan State University       PY & ECHODALE AV   -76.576   39.355     STREET               1
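
In the output above, the rows at 16:00:00 and 20:00:00 land in cuts 5 and 6, so each bin includes its starting time. The sketch below reproduces that binning with plain hour arithmetic on a few made-up time strings. It is an equivalent illustration, not the cut call used in the report:

```r
# Made-up times chosen to sit on or next to bin boundaries.
times <- c("00:00:00", "03:59:59", "04:00:00", "16:00:00", "20:00:00", "23:30:00")

# Take the hour and map each block of 4 hours to bins 1 through 6.
hours <- as.integer(substr(times, 1, 2))
bins <- hours %/% 4 + 1

data.frame(times, bins)
```

This makes the boundary rule explicit: 04:00:00 belongs to bin 2, not bin 1.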

After some research into different types of plots, we found that area plots show dips and peaks in data well. The area plot of the time cuts makes it evident that most crimes occur from late afternoon into the night. Cut 5, representing 4:00 P.M. to 7:59 P.M., had the highest frequency with 63,633 crime observations. Cut 6, the night hours of 8:00 P.M. to 11:59 P.M., had the second highest with 59,111. The fewest crimes (19,839) took place in cut 2, from 4:00 A.M. to 7:59 A.M., presumably because most people are asleep or just waking up during this period. A table of these values is also shown for anyone who wants to quickly look up the count for each time cut.

ggplot(crime, aes(x = cuts)) +
  geom_area(stat = "count", fill = "red") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Crimes by Time Bin", x = "Time Bin", y = "Number of Incidents")

# Note: this overwrites the time of day in Time with a plain date; the bins remain in cuts
crime$Time <- as.Date(crime$Time)
crime %>%
  group_by(cuts) %>%
  summarise(count = n()) %>%
  pander(type = 'grid')

cuts	count
1	37927
2	19839
3	41288
4	54731
5	63633
6	59111

Inside vs Outside

Next, we wanted to find whether most crimes were committed inside or outside. We plotted the counts and found them to be more or less equal, with 133,619 crimes occurring outside and 132,631 committed inside. As explained earlier, 10,279 observations are missing this value, and the plot shows them alongside the inside and outside counts for comparison.

ggplot(crime) +
  aes(x = `Inside.Outside`) +
  geom_bar(stat = "count", fill = "orange") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Inside vs Outside Crimes", x = "Inside or Outside", y = "Number of Incidents") +
  theme(axis.text.x = element_text(size = 6, angle = 60)) +
  scale_y_continuous(limits = c(0, 200000))
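
The three counts quoted above can also be expressed as shares of all observations. This is plain arithmetic on the numbers reported in this section:

```r
# Counts taken from the text: outside, inside, and missing.
counts <- c(Outside = 133619, Inside = 132631, Missing = 10279)

# Share of each category in percent, rounded to one decimal place.
pct <- round(100 * counts / sum(counts), 1)
pct
```

The three counts add up to the 276,529 observations in the dataset, with outside and inside crimes each at about 48% and missing values at under 4%.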

Yearwise Crime

We were also interested in how the number of crimes varied over the years, and below is how we tabulated and plotted it. Finding upward and downward trends across years will help us decide whether some modeling with the yearly count as a response variable is worthwhile. The yearly totals were nearly flat from 2012 to 2013 (49,575 and 49,571), dipped to 45,969 in 2014, and recovered to 48,841 in 2015 and 48,749 in 2016. 2017 only has 33,824 crimes because the published data ends at the beginning of September 2017, covering about 2/3 of the year. The total for the whole of 2017 should be close to that of the previous years, and this will be studied further in possible future modeling of the year variable.

aggregate(Total.Incidents~Year, crime, sum) %>%
  pander(type = 'grid')

Year	Total.Incidents
2012	49575
2013	49571
2014	45969
2015	48841
2016	48749
2017	33824

ggplot(data = crime, aes(x = Year)) +
  geom_line(stat = "count", colour = "blue") +
  labs(title = "Yearly Trend") +
  scale_y_continuous(limits = c(30000, 50000))

Another useful visualization technique is the line chart. We wanted to show the trend in crime frequency by year, with lines connecting the counts for each year. The trend is flat from 2012 to 2013, dips in 2014, and recovers in 2015 to a level that holds in 2016. 2017 shows a low count, as pointed out earlier, because the data only goes to the beginning of September 2017. We used a zoomed-in version of this line plot because all the yearly counts were between 30,000 and 50,000.
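
The claim that a full 2017 would look like the earlier years can be checked with a rough projection. The sketch below scales the 2017 count to a full year, assuming crimes were spread evenly over the year and that the data runs through 2017-09-02 (the latest CrimeDate in the summary output). Both are simplifying assumptions, and seasonality is ignored:

```r
observed_2017 <- 33824   # 2017 total from the table above

# Days covered in 2017, counting both the first and the last day.
days_observed <- as.numeric(as.Date("2017-09-02") - as.Date("2017-01-01")) + 1

# Naive full-year projection under a constant daily rate.
projected_2017 <- observed_2017 * 365 / days_observed
round(projected_2017)
```

Under these assumptions the projection comes out at roughly 50,000 incidents, in the same range as the 2012 to 2016 totals.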

Districtwise Crime

Looking at the dataset, the first question we had was how crime is distributed across districts. Are some districts safer or more dangerous than the rest? We explored this with a bar plot using the ggplot2 package in R, as below:

ggplot(crime) +
  aes(x = District) +
  geom_bar(stat = "count", fill = "green") +
  labs(title = "Number of Incidents by District", x = "Districts", y = "Number of Incidents") +
  theme(axis.text.x = element_text(size = 6, angle = 60)) +
  scale_y_continuous(limits = c(0, 50000))

Description of Crime

There were 15 categories of crime in the Baltimore crime dataset. When a variable has a large number of categories, a table is usually a more efficient way to communicate findings. Using the code below, we ordered the crime types by count. The output showed that larceny, common assault, and burglary were the most frequent crimes in the time period of the dataset, with 60,528, 45,518, and 42,538 incidents respectively. Homicide, robbery (carjacking), and arson were the least frequent types of crime.

crime %>%
  group_by(Description) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  pander(type = 'grid')

Description	count
LARCENY	60528
COMMON ASSAULT	45518
BURGLARY	42538
LARCENY FROM AUTO	36295
AGG. ASSAULT	27513
AUTO THEFT	26838
ROBBERY - STREET	17691
ROBBERY - COMMERCIAL	4141
ASSAULT BY THREAT	3503
SHOOTING	2910
ROBBERY - RESIDENCE	2866
RAPE	1637
HOMICIDE	1559
ROBBERY - CARJACKING	1528
ARSON	1464
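
To see how concentrated the crime types are, the counts in the table above can be turned into percentages. The sketch below re-enters the table values by hand and computes each share and the combined share of the top three categories:

```r
# Counts copied from the table above, already sorted in descending order.
counts <- c(
  "LARCENY" = 60528, "COMMON ASSAULT" = 45518, "BURGLARY" = 42538,
  "LARCENY FROM AUTO" = 36295, "AGG. ASSAULT" = 27513, "AUTO THEFT" = 26838,
  "ROBBERY - STREET" = 17691, "ROBBERY - COMMERCIAL" = 4141,
  "ASSAULT BY THREAT" = 3503, "SHOOTING" = 2910, "ROBBERY - RESIDENCE" = 2866,
  "RAPE" = 1637, "HOMICIDE" = 1559, "ROBBERY - CARJACKING" = 1528, "ARSON" = 1464
)

# Percentage of all incidents per category.
share <- round(100 * counts / sum(counts), 1)

# Combined share of the three most frequent categories.
top3_share <- round(100 * sum(counts[1:3]) / sum(counts), 1)
top3_share
```

Larceny, common assault, and burglary together account for a little over half of all recorded incidents.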

Premise: What kind of area crime occurred

It is extremely important to know in what kinds of places in a city crime occurs. This information can help citizens, police officers, and law enforcement officials decide where to keep a close watch in order to catch suspects and prevent future crime. A table was made of the 10 most frequent Premise values for crimes in Baltimore from 2012 to 2017. The findings show that streets and row houses or townhouses are where most crimes occurred: 103,802 crimes occurred in street areas and 60,502 in row houses or townhouses. Note that missing values (NA) rank sixth in the table with 10,757 observations.

crime %>%
  group_by(Premise) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(n = 10) %>%
  pander(type = 'grid')

Premise	count
STREET	103802
ROW/TOWNHO	60502
PARKING LO	12176
APT/CONDO	12002
OTHER - IN	11459
NA	10757
SCHOOL	7608
CONVENIENC	4314
RETAIL/SMA	3725
OTHER - OU	3423

Crime: Geographically Spread

Some readers may not know exactly where Baltimore is on a map or what areas surround it. Using the ggmap package with an approximate longitude and latitude for Baltimore, a map showing the city's location can be drawn. To see more specifically where crimes occurred, we then plotted the longitude and latitude of each crime so that one can see how crimes are spread across the city.

balt <- c(lon = -76.7605701, lat = 39.2846225)
balt_map <- get_map(location = balt, zoom = 10)
ggmap(balt_map)

ggplot(crime, aes(Longitude, Latitude)) +
  geom_point(colour = "red")
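
As noted in the Missing Values section, 2,204 rows have no longitude or latitude. ggplot2 drops such rows with a warning when drawing points. A minimal sketch of filtering them explicitly beforehand, using a small made-up data frame, could look like this:

```r
# Made-up coordinates with missing values in two of the three rows.
toy <- data.frame(
  Longitude = c(-76.605, NA, -76.632),
  Latitude = c(39.230, 39.314, NA)
)

# Keep only rows where both coordinates are present.
has_coords <- complete.cases(toy[, c("Longitude", "Latitude")])
plot_data <- toy[has_coords, ]
nrow(plot_data)
```

Filtering first makes the number of plotted points explicit instead of leaving it to a warning message.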

Summary

The goal of this study was to import, clean, and manipulate the dataset in order to analyze crime trends in Baltimore from 2012 to 2017, and to understand in which districts, on which premises, and at what times of day crimes occurred. Knowing this can help police officers, law enforcement officials, and citizens be aware of which areas and which times of day call for extra care. The analysis focused on some key questions: which districts had the most reported crimes, what the most common types of crime were, and whether the number of crimes showed an increasing or decreasing trend from 2012 to 2017. We hope the outcome helps the authorities and citizens of Baltimore be more vigilant over the coming years.

We gained some interesting insights. The most crimes happened on the streets, the most common type of crime was larceny, and the 4:00 P.M. to 8:00 P.M. window had the most crime occurrences. Taken together, these findings are plausible: larceny does not involve breaking and entering into a building and is more likely to happen on the streets, and 4:00 P.M. to 8:00 P.M. is an active time for people to be outside in a crowded city like Baltimore, which gives more opportunities for larceny and other theft. The second most common window, 8:00 P.M. to midnight, also makes sense, since offenders may be out late at night or may try to break into a house after dark to avoid being caught. The visualizations and tables we built helped us make sense of the data and find the key patterns and trends.

One thing to note is that we have not built any regression models, because most of the variables are categorical and only a few are numerical. It is possible to build regression models on categorical variables, but for a dataset such as this we felt regression analysis would be more useful if more numerical variables were present. We also had a lot of NA values, and many missing values in certain columns would affect the accuracy of a model. For future improvements, imputation methods such as mean or median imputation, or more in-depth methods, could deal more effectively with the missing values and give clearer regression results. This project helped us appreciate the intricacies of R programming and how to analyze a dataset in steps to find the story that the Baltimore crime data has to tell. We recommend that police officers, law enforcement officials, and the citizens of Baltimore increase security, pay particular attention to street areas, be watchful of theft, and take extra care or stay with a group of trustworthy people between 4:00 P.M. and 12:00 A.M. Baltimore is a great city, and we hope that its crime rate will decrease in the future.
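
As a pointer for the future work mentioned above, median imputation of a numeric column can be sketched in a few lines of base R. The vector below is made up. Whether imputation is appropriate at all depends on the variable, and this is only an illustration of the mechanics:

```r
# A made-up numeric column with missing values.
x <- c(4, NA, 10, 6, NA, 8)

# Replace each missing value with the median of the observed values.
x_median <- median(x, na.rm = TRUE)
x_imputed <- ifelse(is.na(x), x_median, x)
x_imputed
```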