The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
The third dataset contains historical New York City crime data from the nyc.gov NYPD website.
I downloaded the data for the “Seven Major Felony Offenses” as an excel file and then saved that as a csv file for importing into R. The data set contains totals for each of the seven major felony offenses (Murder & Non Negl. Manslaughter, Rape, Robbery, Felony Assault, Burglary, Grand Larceny, and Grand Larceny of Motor Vehicle) for each year and a grand total line for each year for each precinct.
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Project2/DS3-1-seven-major-felony-offenses-2000-2017.csv")
crime <-data.frame(read.csv(text=x, header=FALSE, stringsAsFactors = FALSE))
dim(crime)
## [1] 639 20
head(crime, 5)
## V1
## 1 Seven Major Felony Offenses
## 2
## 3 PCT
## 4 1
## 5
## V2 V3
## 1 NA
## 2 NA
## 3 CRIME 2000
## 4 MURDER & NON NEGL. MANSLAUGHTER 3
## 5 RAPE 12
## V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
## 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 4 1 2 2 2 0 1 0 1 2 2 0 1 0 0
## 5 5 10 11 11 5 4 7 12 4 6 13 10 12 8
## V18 V19 V20
## 1 NA NA NA
## 2 NA NA NA
## 3 2015 2016 2017
## 4 1 0 9
## 5 7 9 18
First let’s remove the two blank rows at the top and make the third row the header row.
# make the third row the header row
colnames(crime) <- crime[3,]
# then remove the first 3 rows
crime <- crime[-c(1:3),]
# reset the row indexing
rownames(crime) <- NULL
OK, now let’s check out the top and bottom of our data just to get a sense of what is there…
head(crime, 5)
## PCT CRIME 2000
## 1 1 MURDER & NON NEGL. MANSLAUGHTER 3
## 2 RAPE 12
## 3 ROBBERY 252
## 4 FELONY ASSAULT 139
## 5 BURGLARY 475
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 1 1 2 2 2 0 1 0 1 2 2 0 1 0 0
## 2 5 10 11 11 5 4 7 12 4 6 13 10 12 8
## 3 188 210 164 154 172 119 128 108 106 96 102 81 68 50
## 4 164 147 134 129 121 94 90 83 83 68 94 110 87 76
## 5 437 297 403 246 293 255 231 189 207 170 155 188 171 145
## 2015 2016 2017
## 1 1 0 9
## 2 7 9 18
## 3 79 60 69
## 4 86 78 91
## 5 138 117 65
tail(crime[,1:10], 14)
## PCT
## 623
## 624
## 625
## 626
## 627
## 628 STATISTICAL NOTES
## 629 1. 2000-2005 Data Source-Historical Comfinal data including Complaint Follow-Up data. Compiled from aggregated monthly tapes 2000 thru 2005.
## 630 2. 2006-2016 Data Source-CDW Omniform System and S-DD5 System (Complaint Follow Up) data by record create date.
## 631 3. Murder & Non-Negligent Manslaughter data source: 2000-2005 Historical Comfinal Data, 2006-2016 Shooting & Homicide Database.
## 632 4. 2000-2009 data as of 12/8/2010. 2010 data as of 1/18/2011. 2011 data as of 1/18/2012. 2012 data as of 4/1/2013. 2013 data as of 1/17/2014. 2014 data as of 1/16/2015. 2015 data as of 1/18/2016. 2016 data as of 1/16/2017. 2017 data as of 1/17/2018.
## 633
## 634 5. On Sept. 28, 2012, there was a re-alignment of the boundaries of the 077, 078, and 088 precincts. Therefore statistics for the 077, 078, and 088 precincts following 2011 are not comparable to earlier years.
## 635 6. The 121 pct was created on 7-1-2013 from parts of the 120 and 122 precinct. Therefore statistics for 120 and 122 precincts following 2012 are not comparable to earlier years.
## 636 7. As of 1-1-2014 complaints occurring within the jurisdiction of the Department of Correction have been disaggregated from the precinct crime totals and are denoted in "Pct" column as "DOC".
## CRIME 2000
## 623 GRAND LARCENY OF MOTOR VEHICLE NA
## 624 TOTAL SEVEN MAJOR FELONY OFFENSES NA
## 625 NA
## 626 NA
## 627 NA
## 628 NA
## 629 NA
## 630 NA
## 631 NA
## 632 NA
## 633 NA
## 634 NA
## 635 NA
## 636 NA
## 2001 2002 2003 2004 2005 2006 2007
## 623 NA NA NA NA NA NA NA
## 624 NA NA NA NA NA NA NA
## 625 NA NA NA NA NA NA NA
## 626 NA NA NA NA NA NA NA
## 627 NA NA NA NA NA NA NA
## 628 NA NA NA NA NA NA NA
## 629 NA NA NA NA NA NA NA
## 630 NA NA NA NA NA NA NA
## 631 NA NA NA NA NA NA NA
## 632 NA NA NA NA NA NA NA
## 633 NA NA NA NA NA NA NA
## 634 NA NA NA NA NA NA NA
## 635 NA NA NA NA NA NA NA
## 636 NA NA NA NA NA NA NA
The bottom of the dataset has some info that really belongs in a separate file or dataframe because it is metadata, not actually part of the dataset. Looks like our data actually ends on row 624 and our “Statistical Notes” start on Row 628, so let’s remove those rows and put the “Statistical Notes” in a separate dataframe for now just so we don’t lose that info.
# Put the Statistical notes in a separate dataframe
StatisticalNotes <- crime[-(1:627),1]
# Remove the Statistical Notes and blank rows from the bottom of the crime dataframe
crime <- crime[(1:624),]
Every 8 rows refers to the “Seven Major Felony Offenses” plus a total row for each precinct in the dataset. The precinct number was only filled in on the first row of each section. To tidy our data we need to fill it in on the other seven rows for the precinct.
library(tidyverse)
for (x in 2:8){
crime[seq(x,624,8),1] <- crime[seq(1,624,8),1]
}
glimpse(crime)
## Observations: 624
## Variables: 20
## $ PCT <chr> "1", "1", "1", "1", "1", "1", "1", "1", "5", "5", "5", ...
## $ CRIME <chr> "MURDER & NON NEGL. MANSLAUGHTER ...
## $ `2000` <int> 3, 12, 252, 139, 475, 2358, 199, 3438, 4, 9, 351, 190, ...
## $ `2001` <int> 1, 5, 188, 164, 437, 2117, 157, 3069, 3, 3, 261, 147, 2...
## $ `2002` <int> 2, 10, 210, 147, 297, 1896, 153, 2715, 0, 12, 196, 108,...
## $ `2003` <int> 2, 11, 164, 134, 403, 1803, 132, 2649, 3, 7, 148, 100, ...
## $ `2004` <int> 2, 11, 154, 129, 246, 1708, 133, 2383, 1, 7, 136, 95, 1...
## $ `2005` <int> 0, 5, 172, 121, 293, 1624, 90, 2305, 4, 13, 134, 80, 15...
## $ `2006` <int> 1, 4, 119, 94, 255, 1462, 78, 2013, 2, 5, 132, 104, 152...
## $ `2007` <int> 0, 7, 128, 90, 231, 1395, 63, 1914, 2, 4, 151, 129, 122...
## $ `2008` <int> 1, 12, 108, 83, 189, 1258, 57, 1708, 2, 2, 131, 151, 15...
## $ `2009` <int> 2, 4, 106, 83, 207, 1122, 54, 1578, 1, 6, 125, 153, 141...
## $ `2010` <int> 2, 6, 96, 68, 170, 1065, 38, 1445, 1, 10, 106, 128, 127...
## $ `2011` <int> 0, 13, 102, 94, 155, 979, 59, 1402, 0, 11, 127, 117, 12...
## $ `2012` <int> 1, 10, 81, 110, 188, 985, 41, 1416, 3, 12, 122, 157, 13...
## $ `2013` <int> 0, 12, 68, 87, 171, 1051, 27, 1416, 3, 5, 102, 167, 92,...
## $ `2014` <int> 0, 8, 50, 76, 145, 896, 33, 1208, 2, 4, 76, 125, 101, 5...
## $ `2015` <int> 1, 7, 79, 86, 138, 1114, 21, 1446, 1, 10, 100, 140, 86,...
## $ `2016` <int> 0, 9, 60, 78, 117, 1081, 50, 1395, 0, 8, 102, 135, 81, ...
## $ `2017` <int> 9, 18, 69, 91, 65, 1068, 17, 1337, 0, 7, 89, 151, 86, 5...
sample_n(crime, 5)
## PCT CRIME
## 72 17 TOTAL SEVEN MAJOR FELONY OFFENSES
## 151 30 GRAND LARCENY OF MOTOR VEHICLE
## 12 5 FELONY ASSAULT
## 348 71 FELONY ASSAULT
## 195 42 ROBBERY
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 72 2028 1990 1801 1796 1667 1397 1391 1277 1269 1018 990 937 1020 1024
## 151 167 120 99 110 87 80 69 55 62 47 49 25 28 35
## 12 190 147 108 100 95 80 104 129 151 153 128 117 157 167
## 348 589 506 455 421 377 369 344 324 312 299 293 385 347 372
## 195 424 345 387 409 373 456 396 380 348 267 324 325 334 320
## 2014 2015 2016 2017
## 72 909 986 1000 1023
## 151 22 25 31 22
## 12 125 140 135 151
## 348 389 373 371 303
## 195 307 329 321 244
The tidyr gather function makes it easy to make our wide data long so that it is easier to analyze.
tidycrime <- crime %>%
filter(CRIME != 'TOTAL SEVEN MAJOR FELONY OFFENSES') %>%
gather("Year", "n", 3:20)
sample_n(tidycrime, 10)
## PCT CRIME
## 2651 109 BURGLARY
## 4201 88 MURDER & NON NEGL. MANSLAUGHTER
## 4017 45 GRAND LARCENY
## 7172 19 FELONY ASSAULT
## 2665 111 BURGLARY
## 2868 32 BURGLARY
## 1464 84 MURDER & NON NEGL. MANSLAUGHTER
## 2108 110 MURDER & NON NEGL. MANSLAUGHTER
## 7677 9 BURGLARY
## 922 84 BURGLARY
## Year n
## 2651 2004 568
## 4201 2007 10
## 4017 2007 444
## 7172 2013 97
## 2665 2004 381
## 2868 2005 161
## 1464 2002 3
## 2108 2003 8
## 7677 2014 138
## 922 2001 306
Let’s look at the ‘Statistical Notes’ to see if there’s anything in there we need to know before doing our analysis.
StatisticalNotes
## [1] "STATISTICAL NOTES"
## [2] "1. 2000-2005 Data Source-Historical Comfinal data including Complaint Follow-Up data. Compiled from aggregated monthly tapes 2000 thru 2005."
## [3] "2. 2006-2016 Data Source-CDW Omniform System and S-DD5 System (Complaint Follow Up) data by record create date. "
## [4] "3. Murder & Non-Negligent Manslaughter data source: 2000-2005 Historical Comfinal Data, 2006-2016 Shooting & Homicide Database."
## [5] "4. 2000-2009 data as of 12/8/2010. 2010 data as of 1/18/2011. 2011 data as of 1/18/2012. 2012 data as of 4/1/2013. 2013 data as of 1/17/2014. 2014 data as of 1/16/2015. 2015 data as of 1/18/2016. 2016 data as of 1/16/2017. 2017 data as of 1/17/2018."
## [6] ""
## [7] "5. On Sept. 28, 2012, there was a re-alignment of the boundaries of the 077, 078, and 088 precincts. Therefore statistics for the 077, 078, and 088 precincts following 2011 are not comparable to earlier years."
## [8] "6. The 121 pct was created on 7-1-2013 from parts of the 120 and 122 precinct. Therefore statistics for 120 and 122 precincts following 2012 are not comparable to earlier years."
## [9] "7. As of 1-1-2014 complaints occurring within the jurisdiction of the Department of Correction have been disaggregated from the precinct crime totals and are denoted in \"Pct\" column as \"DOC\"."
There are definitely some notes in there that we should keep in mind as we do our analysis. For example, “The 121 pct was created on 7-1-2013 from parts of the 120 and 122 precinct. Therefore statistics for 120 and 122 precincts following 2012 are not comparable to earlier years.” So if we are looking at specific precincts we need to keep things like this in mind. If we want to look at the entire date range we would have to aggregate those three precincts in order to get relevant statistics.
First let’s see the totals by type of crime…
tidycrime %>%
group_by(CRIME) %>%
summarise(Totals = sum(n, na.rm=TRUE)) %>%
arrange(desc(Totals))
## # A tibble: 7 x 2
## CRIME Totals
## <chr> <int>
## 1 "GRAND LARCENY " 800558
## 2 "BURGLARY " 398576
## 3 "ROBBERY " 391059
## 4 "FELONY ASSAULT " 350457
## 5 "GRAND LARCENY OF MOTOR VEHICLE … 268367
## 6 "RAPE " 28704
## 7 "MURDER & NON NEGL. MANSLAUGHTER " 8818
We can get totals by precinct, but without population numbers it’s hard to compare one precinct to another. Also, the numbers are affected by the ‘Statistical Notes’ we looked at earlier, so this may not be very useful info…
tidycrime %>%
group_by(PCT) %>%
summarise(Totals = sum(n, na.rm=TRUE)) %>%
arrange(desc(Totals))
## # A tibble: 78 x 2
## PCT Totals
## <chr> <int>
## 1 14 70369
## 2 75 65723
## 3 18 56845
## 4 43 48254
## 5 67 46789
## 6 19 46388
## 7 109 45972
## 8 70 44786
## 9 52 44684
## 10 13 44637
## # ... with 68 more rows
Let’s look at just the 120, 121, and 122 precincts together since those numbers can’t really be separated and let’s add the 123rd precint too since that would encompass all of Staten Island, which is where I live…
SICrime <- tidycrime %>%
filter(PCT %in% c('121', '122', '123', '124')) %>%
group_by(Year, CRIME) %>%
summarise(Totals = sum(n, na.rm=TRUE)) %>%
arrange(desc(Year))
SICrime
## # A tibble: 126 x 3
## # Groups: Year [18]
## Year CRIME Totals
## <chr> <chr> <int>
## 1 2017 "BURGLARY " 261
## 2 2017 "FELONY ASSAULT " 335
## 3 2017 "GRAND LARCENY " 767
## 4 2017 "GRAND LARCENY OF MOTOR VEHICLE … 145
## 5 2017 "MURDER & NON NEGL. MANSLAUGHTER … 8
## 6 2017 "RAPE " 20
## 7 2017 "ROBBERY " 186
## 8 2016 "BURGLARY " 339
## 9 2016 "FELONY ASSAULT " 354
## 10 2016 "GRAND LARCENY " 863
## # ... with 116 more rows
Now let’s graph that to see the trend…
ggplot(SICrime, aes(x=Year, y=Totals, group=CRIME)) +
geom_line(aes(color=CRIME)) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Now let’s compare that to the rest of the city…
NotSICrime <- tidycrime %>%
filter(!(PCT %in% c('121', '122', '123', '124'))) %>%
group_by(Year, CRIME) %>%
summarise(Totals = sum(n, na.rm=TRUE)) %>%
arrange(desc(Year))
NotSICrime
## # A tibble: 126 x 3
## # Groups: Year [18]
## Year CRIME Totals
## <chr> <chr> <int>
## 1 2017 "BURGLARY " 11822
## 2 2017 "FELONY ASSAULT " 19717
## 3 2017 "GRAND LARCENY " 42383
## 4 2017 "GRAND LARCENY OF MOTOR VEHICLE … 5531
## 5 2017 "MURDER & NON NEGL. MANSLAUGHTER … 284
## 6 2017 "RAPE " 1429
## 7 2017 "ROBBERY " 13770
## 8 2016 "BURGLARY " 12651
## 9 2016 "FELONY ASSAULT " 20493
## 10 2016 "GRAND LARCENY " 43416
## # ... with 116 more rows
Now let’s graph that to see if the trends look similar to those we saw on Statne Island…
ggplot(NotSICrime, aes(x=Year, y=Totals, group=CRIME)) +
geom_line(aes(color=CRIME)) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))