Assignment Instructions

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the ‘wide’ datasets identified in the Week 5 Discussion items. For each of the three chosen datasets:
    • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a ‘wide’ structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
    • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
    • Perform the analysis requested in the discussion item.
    • Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Dataset Three - NYPD Crime Data

The third dataset contains historical New York City crime data from the nyc.gov NYPD website.

Load the raw data into R from csv file

I downloaded the data for the “Seven Major Felony Offenses” as an excel file and then saved that as a csv file for importing into R. The data set contains totals for each of the seven major felony offenses (Murder & Non Negl. Manslaughter, Rape, Robbery, Felony Assault, Burglary, Grand Larceny, and Grand Larceny of Motor Vehicle) for each year and a grand total line for each year for each precinct.

library(RCurl) 
x <- getURL("https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Project2/DS3-1-seven-major-felony-offenses-2000-2017.csv") 
crime <-data.frame(read.csv(text=x, header=FALSE, stringsAsFactors = FALSE))
dim(crime)
## [1] 639  20
head(crime, 5)
##                            V1
## 1 Seven Major Felony Offenses
## 2                            
## 3                         PCT
## 4                           1
## 5                            
##                                                                V2   V3
## 1                                                                   NA
## 2                                                                   NA
## 3                                                           CRIME 2000
## 4 MURDER & NON NEGL. MANSLAUGHTER                                    3
## 5                          RAPE                                     12
##     V4   V5   V6   V7   V8   V9  V10  V11  V12  V13  V14  V15  V16  V17
## 1   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 2   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 4    1    2    2    2    0    1    0    1    2    2    0    1    0    0
## 5    5   10   11   11    5    4    7   12    4    6   13   10   12    8
##    V18  V19  V20
## 1   NA   NA   NA
## 2   NA   NA   NA
## 3 2015 2016 2017
## 4    1    0    9
## 5    7    9   18

Fix header and remove blank rows

First let’s remove the two blank rows at the top and make the third row the header row.

# make the third row the header row
colnames(crime) <- crime[3,]
# then remove the first 3 rows
crime <- crime[-c(1:3),]
# reset the row indexing
rownames(crime) <- NULL

OK, now let’s check out the top and bottom of our data just to get a sense of what is there…

head(crime, 5)
##   PCT                                                           CRIME 2000
## 1   1 MURDER & NON NEGL. MANSLAUGHTER                                    3
## 2                              RAPE                                     12
## 3                              ROBBERY                                 252
## 4                              FELONY ASSAULT                          139
## 5                              BURGLARY                                475
##   2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 1    1    2    2    2    0    1    0    1    2    2    0    1    0    0
## 2    5   10   11   11    5    4    7   12    4    6   13   10   12    8
## 3  188  210  164  154  172  119  128  108  106   96  102   81   68   50
## 4  164  147  134  129  121   94   90   83   83   68   94  110   87   76
## 5  437  297  403  246  293  255  231  189  207  170  155  188  171  145
##   2015 2016 2017
## 1    1    0    9
## 2    7    9   18
## 3   79   60   69
## 4   86   78   91
## 5  138  117   65
tail(crime[,1:10], 14)
##                                                                                                                                                                                                                                                            PCT
## 623                                                                                                                                                                                                                                                           
## 624                                                                                                                                                                                                                                                           
## 625                                                                                                                                                                                                                                                           
## 626                                                                                                                                                                                                                                                           
## 627                                                                                                                                                                                                                                                           
## 628                                                                                                                                                                                                                                          STATISTICAL NOTES
## 629                                                                                                              1. 2000-2005 Data Source-Historical Comfinal data including Complaint Follow-Up data.  Compiled from aggregated monthly tapes 2000 thru 2005.
## 630                                                                                                                                          2. 2006-2016 Data Source-CDW Omniform System and S-DD5 System (Complaint Follow Up) data by record create date.  
## 631                                                                                                                            3. Murder & Non-Negligent Manslaughter data source: 2000-2005 Historical Comfinal Data, 2006-2016 Shooting & Homicide Database.
## 632 4. 2000-2009 data as of 12/8/2010.  2010 data as of 1/18/2011. 2011 data as of 1/18/2012. 2012 data as of 4/1/2013. 2013 data as of 1/17/2014. 2014 data as of 1/16/2015. 2015 data as of 1/18/2016. 2016 data as of 1/16/2017. 2017 data as of 1/17/2018.
## 633                                                                                                                                                                                                                                                           
## 634                                         5. On Sept. 28, 2012, there was a re-alignment of the boundaries of the 077, 078, and 088 precincts.  Therefore statistics for the 077, 078, and 088 precincts following 2011 are not comparable to earlier years.
## 635                                                                         6. The 121 pct was created on 7-1-2013 from parts of the 120 and 122 precinct.  Therefore statistics for 120 and 122 precincts following 2012 are not comparable to earlier years.
## 636                                                            7. As of 1-1-2014 complaints occurring within the jurisdiction of the Department of Correction have been disaggregated from the precinct crime totals and are denoted in "Pct" column as "DOC".
##                                                                CRIME 2000
## 623 GRAND LARCENY OF MOTOR VEHICLE                                     NA
## 624                                TOTAL SEVEN MAJOR FELONY OFFENSES   NA
## 625                                                                    NA
## 626                                                                    NA
## 627                                                                    NA
## 628                                                                    NA
## 629                                                                    NA
## 630                                                                    NA
## 631                                                                    NA
## 632                                                                    NA
## 633                                                                    NA
## 634                                                                    NA
## 635                                                                    NA
## 636                                                                    NA
##     2001 2002 2003 2004 2005 2006 2007
## 623   NA   NA   NA   NA   NA   NA   NA
## 624   NA   NA   NA   NA   NA   NA   NA
## 625   NA   NA   NA   NA   NA   NA   NA
## 626   NA   NA   NA   NA   NA   NA   NA
## 627   NA   NA   NA   NA   NA   NA   NA
## 628   NA   NA   NA   NA   NA   NA   NA
## 629   NA   NA   NA   NA   NA   NA   NA
## 630   NA   NA   NA   NA   NA   NA   NA
## 631   NA   NA   NA   NA   NA   NA   NA
## 632   NA   NA   NA   NA   NA   NA   NA
## 633   NA   NA   NA   NA   NA   NA   NA
## 634   NA   NA   NA   NA   NA   NA   NA
## 635   NA   NA   NA   NA   NA   NA   NA
## 636   NA   NA   NA   NA   NA   NA   NA

Remove ‘Statistical Notes’ from the bottom of the dataset

The bottom of the dataset has some info that really belongs in a separate file or dataframe because it is metadata, not actually part of the dataset. Looks like our data actually ends on row 624 and our “Statistical Notes” start on Row 628, so let’s remove those rows and put the “Statistical Notes” in a separate dataframe for now just so we don’t lose that info.

# Put the Statistical notes in a separate dataframe 
StatisticalNotes <- crime[-(1:627),1]

# Remove the Statistical Notes and blank rows from the bottom of the crime dataframe
crime <- crime[(1:624),]

Fill in the precinct in rows where it’s blank

Every 8 rows refers to the “Seven Major Felony Offenses” plus a total row for each precinct in the dataset. The precinct number was only filled in on the first row of each section. To tidy our data we need to fill it in on the other seven rows for the precinct.

library(tidyverse)
for (x in 2:8){
  crime[seq(x,624,8),1] <- crime[seq(1,624,8),1]
}
glimpse(crime)
## Observations: 624
## Variables: 20
## $ PCT    <chr> "1", "1", "1", "1", "1", "1", "1", "1", "5", "5", "5", ...
## $ CRIME  <chr> "MURDER & NON NEGL. MANSLAUGHTER                       ...
## $ `2000` <int> 3, 12, 252, 139, 475, 2358, 199, 3438, 4, 9, 351, 190, ...
## $ `2001` <int> 1, 5, 188, 164, 437, 2117, 157, 3069, 3, 3, 261, 147, 2...
## $ `2002` <int> 2, 10, 210, 147, 297, 1896, 153, 2715, 0, 12, 196, 108,...
## $ `2003` <int> 2, 11, 164, 134, 403, 1803, 132, 2649, 3, 7, 148, 100, ...
## $ `2004` <int> 2, 11, 154, 129, 246, 1708, 133, 2383, 1, 7, 136, 95, 1...
## $ `2005` <int> 0, 5, 172, 121, 293, 1624, 90, 2305, 4, 13, 134, 80, 15...
## $ `2006` <int> 1, 4, 119, 94, 255, 1462, 78, 2013, 2, 5, 132, 104, 152...
## $ `2007` <int> 0, 7, 128, 90, 231, 1395, 63, 1914, 2, 4, 151, 129, 122...
## $ `2008` <int> 1, 12, 108, 83, 189, 1258, 57, 1708, 2, 2, 131, 151, 15...
## $ `2009` <int> 2, 4, 106, 83, 207, 1122, 54, 1578, 1, 6, 125, 153, 141...
## $ `2010` <int> 2, 6, 96, 68, 170, 1065, 38, 1445, 1, 10, 106, 128, 127...
## $ `2011` <int> 0, 13, 102, 94, 155, 979, 59, 1402, 0, 11, 127, 117, 12...
## $ `2012` <int> 1, 10, 81, 110, 188, 985, 41, 1416, 3, 12, 122, 157, 13...
## $ `2013` <int> 0, 12, 68, 87, 171, 1051, 27, 1416, 3, 5, 102, 167, 92,...
## $ `2014` <int> 0, 8, 50, 76, 145, 896, 33, 1208, 2, 4, 76, 125, 101, 5...
## $ `2015` <int> 1, 7, 79, 86, 138, 1114, 21, 1446, 1, 10, 100, 140, 86,...
## $ `2016` <int> 0, 9, 60, 78, 117, 1081, 50, 1395, 0, 8, 102, 135, 81, ...
## $ `2017` <int> 9, 18, 69, 91, 65, 1068, 17, 1337, 0, 7, 89, 151, 86, 5...
sample_n(crime, 5)
##     PCT                                                            CRIME
## 72   17                                TOTAL SEVEN MAJOR FELONY OFFENSES
## 151  30 GRAND LARCENY OF MOTOR VEHICLE                                  
## 12    5                           FELONY ASSAULT                        
## 348  71                           FELONY ASSAULT                        
## 195  42                           ROBBERY                               
##     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 72  2028 1990 1801 1796 1667 1397 1391 1277 1269 1018  990  937 1020 1024
## 151  167  120   99  110   87   80   69   55   62   47   49   25   28   35
## 12   190  147  108  100   95   80  104  129  151  153  128  117  157  167
## 348  589  506  455  421  377  369  344  324  312  299  293  385  347  372
## 195  424  345  387  409  373  456  396  380  348  267  324  325  334  320
##     2014 2015 2016 2017
## 72   909  986 1000 1023
## 151   22   25   31   22
## 12   125  140  135  151
## 348  389  373  371  303
## 195  307  329  321  244

Remove total rows and convert our wide data to long

The tidyr gather function makes it easy to make our wide data long so that it is easier to analyze.

tidycrime <- crime %>%
  filter(CRIME != 'TOTAL SEVEN MAJOR FELONY OFFENSES') %>%
  gather("Year", "n", 3:20)

sample_n(tidycrime, 10)
##      PCT                                                           CRIME
## 2651 109                          BURGLARY                              
## 4201  88 MURDER & NON NEGL. MANSLAUGHTER                                
## 4017  45                          GRAND LARCENY                         
## 7172  19                          FELONY ASSAULT                        
## 2665 111                          BURGLARY                              
## 2868  32                          BURGLARY                              
## 1464  84 MURDER & NON NEGL. MANSLAUGHTER                                
## 2108 110 MURDER & NON NEGL. MANSLAUGHTER                                
## 7677   9                          BURGLARY                              
## 922   84                          BURGLARY                              
##      Year   n
## 2651 2004 568
## 4201 2007  10
## 4017 2007 444
## 7172 2013  97
## 2665 2004 381
## 2868 2005 161
## 1464 2002   3
## 2108 2003   8
## 7677 2014 138
## 922  2001 306

Last but not least…

Let’s look at the ‘Statistical Notes’ to see if there’s anything in there we need to know before doing our analysis.

StatisticalNotes
## [1] "STATISTICAL NOTES"                                                                                                                                                                                                                                         
## [2] "1. 2000-2005 Data Source-Historical Comfinal data including Complaint Follow-Up data.  Compiled from aggregated monthly tapes 2000 thru 2005."                                                                                                             
## [3] "2. 2006-2016 Data Source-CDW Omniform System and S-DD5 System (Complaint Follow Up) data by record create date.  "                                                                                                                                         
## [4] "3. Murder & Non-Negligent Manslaughter data source: 2000-2005 Historical Comfinal Data, 2006-2016 Shooting & Homicide Database."                                                                                                                           
## [5] "4. 2000-2009 data as of 12/8/2010.  2010 data as of 1/18/2011. 2011 data as of 1/18/2012. 2012 data as of 4/1/2013. 2013 data as of 1/17/2014. 2014 data as of 1/16/2015. 2015 data as of 1/18/2016. 2016 data as of 1/16/2017. 2017 data as of 1/17/2018."
## [6] ""                                                                                                                                                                                                                                                          
## [7] "5. On Sept. 28, 2012, there was a re-alignment of the boundaries of the 077, 078, and 088 precincts.  Therefore statistics for the 077, 078, and 088 precincts following 2011 are not comparable to earlier years."                                        
## [8] "6. The 121 pct was created on 7-1-2013 from parts of the 120 and 122 precinct.  Therefore statistics for 120 and 122 precincts following 2012 are not comparable to earlier years."                                                                        
## [9] "7. As of 1-1-2014 complaints occurring within the jurisdiction of the Department of Correction have been disaggregated from the precinct crime totals and are denoted in \"Pct\" column as \"DOC\"."

There are definitely some notes in there that we should keep in mind as we do our analysis. For example, “The 121 pct was created on 7-1-2013 from parts of the 120 and 122 precinct. Therefore statistics for 120 and 122 precincts following 2012 are not comparable to earlier years.” So if we are looking at specific precincts we need to keep things like this in mind. If we want to look at the entire date range we would have to aggregate those three precincts in order to get relevant statistics.

Exploratory data analysis

First let’s see the totals by type of crime…

tidycrime %>%
  group_by(CRIME) %>%
  summarise(Totals = sum(n, na.rm=TRUE)) %>%
  arrange(desc(Totals))
## # A tibble: 7 x 2
##   CRIME                                                             Totals
##   <chr>                                                              <int>
## 1 "GRAND LARCENY                         "                          800558
## 2 "BURGLARY                              "                          398576
## 3 "ROBBERY                               "                          391059
## 4 "FELONY ASSAULT                        "                          350457
## 5 "GRAND LARCENY OF MOTOR VEHICLE                                 … 268367
## 6 "RAPE                                  "                           28704
## 7 "MURDER & NON NEGL. MANSLAUGHTER                                "   8818

We can get totals by precinct, but without population numbers it’s hard to compare one precinct to another. Also, the numbers are affected by the ‘Statistical Notes’ we looked at earlier, so this may not be very useful info…

tidycrime %>%
  group_by(PCT) %>%
  summarise(Totals = sum(n, na.rm=TRUE)) %>%
  arrange(desc(Totals))
## # A tibble: 78 x 2
##    PCT   Totals
##    <chr>  <int>
##  1 14     70369
##  2 75     65723
##  3 18     56845
##  4 43     48254
##  5 67     46789
##  6 19     46388
##  7 109    45972
##  8 70     44786
##  9 52     44684
## 10 13     44637
## # ... with 68 more rows

Let’s look at just the 120, 121, and 122 precincts together since those numbers can’t really be separated and let’s add the 123rd precint too since that would encompass all of Staten Island, which is where I live…

SICrime <- tidycrime %>%
  filter(PCT %in% c('121', '122', '123', '124')) %>%
  group_by(Year, CRIME) %>%
  summarise(Totals = sum(n, na.rm=TRUE)) %>%
  arrange(desc(Year))

SICrime
## # A tibble: 126 x 3
## # Groups:   Year [18]
##    Year  CRIME                                                      Totals
##    <chr> <chr>                                                       <int>
##  1 2017  "BURGLARY                              "                      261
##  2 2017  "FELONY ASSAULT                        "                      335
##  3 2017  "GRAND LARCENY                         "                      767
##  4 2017  "GRAND LARCENY OF MOTOR VEHICLE                          …    145
##  5 2017  "MURDER & NON NEGL. MANSLAUGHTER                         …      8
##  6 2017  "RAPE                                  "                       20
##  7 2017  "ROBBERY                               "                      186
##  8 2016  "BURGLARY                              "                      339
##  9 2016  "FELONY ASSAULT                        "                      354
## 10 2016  "GRAND LARCENY                         "                      863
## # ... with 116 more rows

Now let’s graph that to see the trend…

ggplot(SICrime, aes(x=Year, y=Totals, group=CRIME)) +
  geom_line(aes(color=CRIME)) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Now let’s compare that to the rest of the city…

NotSICrime <- tidycrime %>%
  filter(!(PCT %in% c('121', '122', '123', '124'))) %>%
  group_by(Year, CRIME) %>%
  summarise(Totals = sum(n, na.rm=TRUE)) %>%
  arrange(desc(Year))

NotSICrime
## # A tibble: 126 x 3
## # Groups:   Year [18]
##    Year  CRIME                                                      Totals
##    <chr> <chr>                                                       <int>
##  1 2017  "BURGLARY                              "                    11822
##  2 2017  "FELONY ASSAULT                        "                    19717
##  3 2017  "GRAND LARCENY                         "                    42383
##  4 2017  "GRAND LARCENY OF MOTOR VEHICLE                          …   5531
##  5 2017  "MURDER & NON NEGL. MANSLAUGHTER                         …    284
##  6 2017  "RAPE                                  "                     1429
##  7 2017  "ROBBERY                               "                    13770
##  8 2016  "BURGLARY                              "                    12651
##  9 2016  "FELONY ASSAULT                        "                    20493
## 10 2016  "GRAND LARCENY                         "                    43416
## # ... with 116 more rows

Now let’s graph that to see if the trends look similar to those we saw on Statne Island…

ggplot(NotSICrime, aes(x=Year, y=Totals, group=CRIME)) +
  geom_line(aes(color=CRIME)) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))