The goal for this project is to tidy and analyze data sets presented in the week 5 discussion section.
New York City Compstat Historcial Crime Data by Precinct Citywide Seven Major Felony Offenses 2000-2017
United States Census Bureau Data by Enterpise Employment Size: Counties (State by County, totals)
Stanford Mass Shootings in America (MSA) A high quality dataset from 19966-2016 with method, definitions, and references
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr)
Import the data and utilize the skip feature in order to arrive at the first headers with data.
compstat <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/seven-major-felony-offenses-2000-2016%20(2).csv", skip = 4)
## Parsed with column specification:
## cols(
## OFFENSE = col_character(),
## `2000` = col_number(),
## `2001` = col_number(),
## `2002` = col_number(),
## `2003` = col_number(),
## `2004` = col_number(),
## `2005` = col_number(),
## `2006` = col_number(),
## `2007` = col_number(),
## `2008` = col_number(),
## `2009` = col_number(),
## `2010` = col_number(),
## `2011` = col_number(),
## `2012` = col_number(),
## `2013` = col_number(),
## `2014` = col_number(),
## `2015` = col_number(),
## `2016` = col_number()
## )
First step is to look at the dataset using the View feature
View(compstat)
After viewing the dataset we can see that row 9 provides totals which we won’t need. In addition, rows 10 - 16 are no longer needed so we will remove them using the following code:
compstat <- compstat[-c(8:18),]
Next we take a look at the data using the glimpse function
glimpse(compstat)
## Observations: 7
## Variables: 18
## $ OFFENSE <chr> "MURDER & NON-NEGL. MANSLAUGHTER", "RAPE", "ROBBERY", ...
## $ `2000` <dbl> 673, 2068, 32562, 25924, 38352, 49631, 35442
## $ `2001` <dbl> 649, 1981, 28202, 23453, 32763, 46329, 29531
## $ `2002` <dbl> 587, 2144, 27229, 21147, 31275, 45771, 26656
## $ `2003` <dbl> 597, 2070, 25989, 19139, 29110, 46751, 23413
## $ `2004` <dbl> 570, 1905, 24373, 18622, 26976, 48763, 20884
## $ `2005` <dbl> 539, 1858, 24722, 17750, 24117, 48243, 18246
## $ `2006` <dbl> 596, 1525, 23739, 17309, 23143, 46625, 15745
## $ `2007` <dbl> 496, 1351, 21809, 17493, 21762, 44924, 13174
## $ `2008` <dbl> 523, 1299, 22401, 16284, 20725, 44242, 12482
## $ `2009` <dbl> 471, 1205, 18601, 16773, 19430, 39580, 10670
## $ `2010` <dbl> 536, 1373, 19486, 16956, 18600, 37835, 10329
## $ `2011` <dbl> 515, 1420, 19717, 18482, 18720, 38501, 9314
## $ `2012` <dbl> 419, 1445, 20144, 19381, 19168, 42497, 8093
## $ `2013` <dbl> 335, 1378, 19128, 20297, 17429, 45368, 7400
## $ `2014` <dbl> 333, 1352, 16539, 20207, 16765, 43862, 7664
## $ `2015` <dbl> 352, 1438, 16931, 20270, 15125, 44005, 7332
## $ `2016` <dbl> 335, 1438, 15500, 20847, 12990, 44279, 6327
The compstat dataset currently has 7 observations and 18 variables. In order to tidy the data we use the gather feature in order to view the offenses, year, and totals.
compstat <- gather(compstat,"YEAR","TOTALS",2:18)
dim(compstat)
## [1] 119 3
After using gather we arrive at 3 variables and 119 observations making it easier to conduct analysis.
First it will be interesting to look at the total number of crimes by year. We use the piping to group the data by year and then add the seven major crimes and create a bar chart.
compstat %>% group_by(YEAR) %>% summarise(CRIMES = sum(TOTALS)) %>%
ggplot(mapping = aes(x = YEAR, y = CRIMES)) +
geom_bar(stat="identity", fill = "#6C9552")
Based on the graph we can see that overall crime in New York has decreased significantly since 2000. However, crime started to increase slightly in 2012 & 2013 and dropped to its lowest point in 2016.
Create a dataset that will filter on the criminal offense of murder and view the years that recorded the highest values
compstat_murder <- filter(compstat,OFFENSE =="MURDER & NON-NEGL. MANSLAUGHTER")
compstat_murder %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(5)
## Selecting by TOTALS
## # A tibble: 5 x 2
## YEAR TOTALS
## <chr> <dbl>
## 1 2002 587
## 2 2006 596
## 3 2003 597
## 4 2001 649
## 5 2000 673
In 2000 New York City recorded 673 homicides making it the deadliest year in the last decade. Using kable we will list he entire list.
kable(compstat_murder[2:3], format = "html", caption = "New York City Homicide Totals (2000 - 2016)")
| YEAR | TOTALS |
|---|---|
| 2000 | 673 |
| 2001 | 649 |
| 2002 | 587 |
| 2003 | 597 |
| 2004 | 570 |
| 2005 | 539 |
| 2006 | 596 |
| 2007 | 496 |
| 2008 | 523 |
| 2009 | 471 |
| 2010 | 536 |
| 2011 | 515 |
| 2012 | 419 |
| 2013 | 335 |
| 2014 | 333 |
| 2015 | 352 |
| 2016 | 335 |
Next we will calculate an estimated murder rate for each year in compstat_murder based on census data retrieved from https://en.wikipedia.org/wiki/Demographics_of_New_York_City. In order to do so we create a vector entitled population which will store the population for 2000 and 2010 and use the mean value of 8,091,711.
Using mutate we then create Per100 which divides the estimated population by 100000 and then create murder_rate that will divide TOTALS by Per100
population <- c(8008288,8175133)
mean(population)
## [1] 8091711
compstat_murder %>% mutate(Per100 = 8091711/100000, MurderRate = TOTALS/Per100) %>% arrange(desc(MurderRate)) %>% select(YEAR, TOTALS, MurderRate)
## # A tibble: 17 x 3
## YEAR TOTALS MurderRate
## <chr> <dbl> <dbl>
## 1 2000 673 8.32
## 2 2001 649 8.02
## 3 2003 597 7.38
## 4 2006 596 7.37
## 5 2002 587 7.25
## 6 2004 570 7.04
## 7 2005 539 6.66
## 8 2010 536 6.62
## 9 2008 523 6.46
## 10 2011 515 6.36
## 11 2007 496 6.13
## 12 2009 471 5.82
## 13 2012 419 5.18
## 14 2015 352 4.35
## 15 2013 335 4.14
## 16 2016 335 4.14
## 17 2014 333 4.12
In unlisted analysis we saw that 2000 reported the largest amount of criminal activity for each offense with the exception of rape.
compstat_rape <- filter(compstat,OFFENSE =="RAPE")
compstat_rape %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(5)
## Selecting by TOTALS
## # A tibble: 5 x 2
## YEAR TOTALS
## <chr> <dbl>
## 1 2004 1905
## 2 2001 1981
## 3 2000 2068
## 4 2003 2070
## 5 2002 2144
2002 led the city in the numbers of rapes with 2,144.
kable(compstat_rape[2:3], format = "html", caption = "New York City Rape Totals (2000 - 2016)")
| YEAR | TOTALS |
|---|---|
| 2000 | 2068 |
| 2001 | 1981 |
| 2002 | 2144 |
| 2003 | 2070 |
| 2004 | 1905 |
| 2005 | 1858 |
| 2006 | 1525 |
| 2007 | 1351 |
| 2008 | 1299 |
| 2009 | 1205 |
| 2010 | 1373 |
| 2011 | 1420 |
| 2012 | 1445 |
| 2013 | 1378 |
| 2014 | 1352 |
| 2015 | 1438 |
| 2016 | 1438 |
To reduce redundancy we will look at the years that reported the least amount of robberies, felony assualts, and grand larceny by adding a (-) negative sign to the top_n feature.
compstat_robbery <- filter(compstat,OFFENSE =="ROBBERY")
compstat_robbery %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)
## Selecting by TOTALS
## # A tibble: 5 x 2
## YEAR TOTALS
## <chr> <dbl>
## 1 2016 15500
## 2 2014 16539
## 3 2015 16931
## 4 2009 18601
## 5 2013 19128
2016 recorded the least amount of robberies with 15,500.
kable(compstat_robbery[2:3], format = "html", caption = "New York City Robbery Totals (2000 - 2016)")
| YEAR | TOTALS |
|---|---|
| 2000 | 32562 |
| 2001 | 28202 |
| 2002 | 27229 |
| 2003 | 25989 |
| 2004 | 24373 |
| 2005 | 24722 |
| 2006 | 23739 |
| 2007 | 21809 |
| 2008 | 22401 |
| 2009 | 18601 |
| 2010 | 19486 |
| 2011 | 19717 |
| 2012 | 20144 |
| 2013 | 19128 |
| 2014 | 16539 |
| 2015 | 16931 |
| 2016 | 15500 |
compstat_fa <- filter(compstat,OFFENSE =="FELONY ASSAULT")
compstat_fa %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)
## Selecting by TOTALS
## # A tibble: 5 x 2
## YEAR TOTALS
## <chr> <dbl>
## 1 2008 16284
## 2 2009 16773
## 3 2010 16956
## 4 2006 17309
## 5 2007 17493
2008 recorded the least amount of felony assaults with 16,284.
kable(compstat_fa[2:3], format = "html", caption = "New York City Felony Assualt Totals (2000 - 2016)")
| YEAR | TOTALS |
|---|---|
| 2000 | 25924 |
| 2001 | 23453 |
| 2002 | 21147 |
| 2003 | 19139 |
| 2004 | 18622 |
| 2005 | 17750 |
| 2006 | 17309 |
| 2007 | 17493 |
| 2008 | 16284 |
| 2009 | 16773 |
| 2010 | 16956 |
| 2011 | 18482 |
| 2012 | 19381 |
| 2013 | 20297 |
| 2014 | 20207 |
| 2015 | 20270 |
| 2016 | 20847 |
compstat_gl <- filter(compstat,OFFENSE =="GRAND LARCENY")
compstat_gl %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)
## Selecting by TOTALS
## # A tibble: 5 x 2
## YEAR TOTALS
## <chr> <dbl>
## 1 2010 37835
## 2 2011 38501
## 3 2009 39580
## 4 2012 42497
## 5 2014 43862
2010 recorded the least amount of grand larceny with 37,835.
kable(compstat_rape[2:3], format = "html", caption = "New York City Grand Larceny Totals (2000 - 2016)")
| YEAR | TOTALS |
|---|---|
| 2000 | 2068 |
| 2001 | 1981 |
| 2002 | 2144 |
| 2003 | 2070 |
| 2004 | 1905 |
| 2005 | 1858 |
| 2006 | 1525 |
| 2007 | 1351 |
| 2008 | 1299 |
| 2009 | 1205 |
| 2010 | 1373 |
| 2011 | 1420 |
| 2012 | 1445 |
| 2013 | 1378 |
| 2014 | 1352 |
| 2015 | 1438 |
| 2016 | 1438 |
Burglary and Grand Larceny Motor Vehicles were omitted because the both followed the trend of 2000 recording the lowest numbers while 2016 recorded the most.
Based on the analysis we can say with confidence that crime has decreased in New York City within the past decade.
Import the data and utilize the skip feature in order to arrive at the first headers with data. We also noticed that the headers all have spaces between the names so we create a vector that will hold the names and apply.
names <-c("FIPS_StateCode","StateDescr","FIPS_CountyCode",
"CtyDescription","EntEmploySize","NumFirms",
"NumEstablishments","Employment","ERF",
"ENF","AnnualPayroll","APNG","X13","X14","X15")
uscb <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/county_totals_2015.csv", skip = 5, col_names = names)
## Parsed with column specification:
## cols(
## FIPS_StateCode = col_character(),
## StateDescr = col_character(),
## FIPS_CountyCode = col_character(),
## CtyDescription = col_character(),
## EntEmploySize = col_character(),
## NumFirms = col_character(),
## NumEstablishments = col_character(),
## Employment = col_character(),
## ERF = col_character(),
## ENF = col_character(),
## AnnualPayroll = col_character(),
## APNG = col_character(),
## X13 = col_character(),
## X14 = col_character(),
## X15 = col_character()
## )
First step is to look at the dataset using the View feature
View(uscb)
After viewing the dataset we can see that rows 1 & 2 have NA values and can be removed.
uscb <- uscb[-c(1:2),]
Next we take a look at the data using the glimpse function
glimpse(uscb)
## Observations: 15,855
## Variables: 15
## $ FIPS_StateCode <chr> NA, "01", "01", "01", "01", "01", "01", "01"...
## $ StateDescr <chr> NA, "Alabama", "Alabama", "Alabama", "Alabam...
## $ FIPS_CountyCode <chr> NA, "001", "001", "001", "001", "001", "003"...
## $ CtyDescription <chr> NA, "Autauga", "Autauga", "Autauga", "Autaug...
## $ EntEmploySize <chr> NA, "01: Total", "02: <20", "03: 20-99", ...
## $ NumFirms <chr> NA, "788", "526", "75", "35", "152", "4,431"...
## $ NumEstablishments <chr> NA, "844", "527", "81", "51", "185", "5,127"...
## $ Employment <chr> NA, "10,454", "2,216", "2,143", "1,161", "4,...
## $ ERF <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ENF <chr> NA, "G", "G", "G", "G", "G", "G", "G", "G", ...
## $ AnnualPayroll <chr> NA, "321,433", "63,553", "57,490", "29,074",...
## $ APNG <chr> NA, "G", "G", "G", "G", "G", "G", "G", "G", ...
## $ X13 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X14 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X15 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
The uscb dataset currently has 15 observations and 15854 variables. We will look at the city of North Carolina so we will first create a dataset entitled nc and view the dimensions using dim
nc <- filter(uscb,StateDescr=="North Carolina")
dim(nc)
## [1] 505 15
Next we will condense the amount of variables to look at while using select and choosing the following:
nc <- nc %>% select(CtyDescription,EntEmploySize,
NumFirms,NumEstablishments,Employment,AnnualPayroll)
To get an idea on how many counties we have our next step is to use the unique feature:
unique(nc$CtyDescription)
## [1] "Alamance" "Alexander"
## [3] "Alleghany" "Anson"
## [5] "Ashe" "Avery"
## [7] "Beaufort" "Bertie"
## [9] "Bladen" "Brunswick"
## [11] "Buncombe" "Burke"
## [13] "Cabarrus" "Caldwell"
## [15] "Camden" "Carteret"
## [17] "Caswell" "Catawba"
## [19] "Chatham" "Cherokee"
## [21] "Chowan" "Clay"
## [23] "Cleveland" "Columbus"
## [25] "Craven" "Cumberland"
## [27] "Currituck" "Dare"
## [29] "Davidson" "Davie"
## [31] "Duplin" "Durham"
## [33] "Edgecombe" "Forsyth"
## [35] "Franklin" "Gaston"
## [37] "Gates" "Graham"
## [39] "Granville" "Greene"
## [41] "Guilford" "Halifax"
## [43] "Harnett" "Haywood"
## [45] "Henderson" "Hertford"
## [47] "Hoke" "Hyde"
## [49] "Iredell" "Jackson"
## [51] "Johnston" "Jones"
## [53] "Lee" "Lenoir"
## [55] "Lincoln" "McDowell"
## [57] "Macon" "Madison"
## [59] "Martin" "Mecklenburg"
## [61] "Mitchell" "Montgomery"
## [63] "Moore" "Nash"
## [65] "New Hanover" "Northampton"
## [67] "Onslow" "Orange"
## [69] "Pamlico" "Pasquotank"
## [71] "Pender" "Perquimans"
## [73] "Person" "Pitt"
## [75] "Polk" "Randolph"
## [77] "Richmond" "Robeson"
## [79] "Rockingham" "Rowan"
## [81] "Rutherford" "Sampson"
## [83] "Scotland" "Stanly"
## [85] "Stokes" "Surry"
## [87] "Swain" "Transylvania"
## [89] "Tyrrell" "Union"
## [91] "Vance" "Wake"
## [93] "Warren" "Washington"
## [95] "Watauga" "Wayne"
## [97] "Wilkes" "Wilson"
## [99] "Yadkin" "Yancey"
## [101] "Statewide North Carolina"
nc has 101 unique counties in the dataset
Before we tidy the data we will first convert NumFirms, Employment, AnnualPayroll to a numeric variable in order to delve deeper in the dataset. We experienced warnings known as NA coercion due to commas in the character variables so we used the gsub function in order to remove the commas.
nc$NumFirms <- as.numeric(gsub(",","",nc$NumFirms))
nc$NumEstablishments <- as.numeric(gsub(",","",nc$NumEstablishments))
nc$Employment <- as.numeric(gsub(",","",nc$Employment))
nc$AnnualPayroll <- as.numeric(gsub(",","",nc$AnnualPayroll))
Identify counties that have the highest annual payroll based on the EntEmploySize being equal to 01: Total
nc %>% filter(EntEmploySize == "01: Total") %>% arrange(AnnualPayroll) %>% select(CtyDescription,NumFirms,NumEstablishments,Employment,AnnualPayroll) %>% top_n(5)
## Selecting by AnnualPayroll
## # A tibble: 5 x 5
## CtyDescription NumFirms NumEstablishments Employment AnnualPayroll
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Forsyth 6940 8366 161704 7870272
## 2 Durham 5949 7103 176821 11375527
## 3 Guilford 10956 13246 258166 11472846
## 4 Wake 22708 27405 431040 22263004
## 5 Mecklenburg 23793 30039 588844 35515013
Create a dataset entitled nc_top5 based on the previous findings.
nc_top5 <- nc %>% filter(CtyDescription =="Forsyth" |
CtyDescription =="Durham" |
CtyDescription =="Guilford" |
CtyDescription =="Wake" |
CtyDescription =="Mecklenburg")
kable(nc_top5, format = "html")
| CtyDescription | EntEmploySize | NumFirms | NumEstablishments | Employment | AnnualPayroll |
|---|---|---|---|---|---|
| Durham | 01: Total | 5949 | 7103 | 176821 | 11375527 |
| Durham | 02: <20 | 4301 | 4313 | 17651 | 757712 |
| Durham | 03: 20-99 | 684 | 776 | 22441 | 1032481 |
| Durham | 04: 100-499 | 245 | 345 | 16495 | 951624 |
| Durham | 05: 500+ | 719 | 1669 | 120234 | 8633710 |
| Forsyth | 01: Total | 6940 | 8366 | 161704 | 7870272 |
| Forsyth | 02: <20 | 5336 | 5364 | 21748 | 819344 |
| Forsyth | 03: 20-99 | 717 | 822 | 24380 | 924506 |
| Forsyth | 04: 100-499 | 244 | 400 | 18075 | 657531 |
| Forsyth | 05: 500+ | 643 | 1780 | 97501 | 5468891 |
| Guilford | 01: Total | 10956 | 13246 | 258166 | 11472846 |
| Guilford | 02: <20 | 8404 | 8452 | 34113 | 1266283 |
| Guilford | 03: 20-99 | 1092 | 1223 | 35662 | 1368081 |
| Guilford | 04: 100-499 | 491 | 769 | 38941 | 1619685 |
| Guilford | 05: 500+ | 969 | 2802 | 149450 | 7218797 |
| Mecklenburg | 01: Total | 23793 | 30039 | 588844 | 35515013 |
| Mecklenburg | 02: <20 | 18567 | 18669 | 69544 | 3098478 |
| Mecklenburg | 03: 20-99 | 2347 | 2625 | 79335 | 3486544 |
| Mecklenburg | 04: 100-499 | 998 | 1553 | 70456 | 3414091 |
| Mecklenburg | 05: 500+ | 1881 | 7192 | 369509 | 25515900 |
| Wake | 01: Total | 22708 | 27405 | 431040 | 22263004 |
| Wake | 02: <20 | 18192 | 18271 | 69640 | 2901460 |
| Wake | 03: 20-99 | 2291 | 2615 | 77462 | 3204235 |
| Wake | 04: 100-499 | 774 | 1298 | 60199 | 3087631 |
| Wake | 05: 500+ | 1451 | 5221 | 223739 | 13069678 |
Create a variable entitled FirmEmpRatio and EstEmpRario to obtain a ratios in order to eliminate size disparity.
nc_top5 <- nc_top5 %>% mutate(FirmEmpRatio = NumFirms/Employment, EstEmpRatio = NumEstablishments/Employment) %>% select(CtyDescription,EntEmploySize, FirmEmpRatio,EstEmpRatio, AnnualPayroll)
kable(nc_top5, format = "html")
| CtyDescription | EntEmploySize | FirmEmpRatio | EstEmpRatio | AnnualPayroll |
|---|---|---|---|---|
| Durham | 01: Total | 0.0336442 | 0.0401706 | 11375527 |
| Durham | 02: <20 | 0.2436689 | 0.2443488 | 757712 |
| Durham | 03: 20-99 | 0.0304799 | 0.0345796 | 1032481 |
| Durham | 04: 100-499 | 0.0148530 | 0.0209154 | 951624 |
| Durham | 05: 500+ | 0.0059800 | 0.0138813 | 8633710 |
| Forsyth | 01: Total | 0.0429179 | 0.0517365 | 7870272 |
| Forsyth | 02: <20 | 0.2453559 | 0.2466434 | 819344 |
| Forsyth | 03: 20-99 | 0.0294094 | 0.0337162 | 924506 |
| Forsyth | 04: 100-499 | 0.0134993 | 0.0221300 | 657531 |
| Forsyth | 05: 500+ | 0.0065948 | 0.0182562 | 5468891 |
| Guilford | 01: Total | 0.0424378 | 0.0513081 | 11472846 |
| Guilford | 02: <20 | 0.2463577 | 0.2477648 | 1266283 |
| Guilford | 03: 20-99 | 0.0306208 | 0.0342942 | 1368081 |
| Guilford | 04: 100-499 | 0.0126088 | 0.0197478 | 1619685 |
| Guilford | 05: 500+ | 0.0064838 | 0.0187487 | 7218797 |
| Mecklenburg | 01: Total | 0.0404063 | 0.0510135 | 35515013 |
| Mecklenburg | 02: <20 | 0.2669821 | 0.2684488 | 3098478 |
| Mecklenburg | 03: 20-99 | 0.0295834 | 0.0330875 | 3486544 |
| Mecklenburg | 04: 100-499 | 0.0141649 | 0.0220421 | 3414091 |
| Mecklenburg | 05: 500+ | 0.0050905 | 0.0194637 | 25515900 |
| Wake | 01: Total | 0.0526819 | 0.0635788 | 22263004 |
| Wake | 02: <20 | 0.2612292 | 0.2623636 | 2901460 |
| Wake | 03: 20-99 | 0.0295758 | 0.0337585 | 3204235 |
| Wake | 04: 100-499 | 0.0128574 | 0.0215618 | 3087631 |
| Wake | 05: 500+ | 0.0064852 | 0.0233352 | 13069678 |
View the counties with the highest total FirmEmpRatio based on the 01: Total data element from the EntEmploySize variable:
nc_top5 %>% filter(EntEmploySize == "01: Total") %>% arrange(desc(FirmEmpRatio)) %>% select(CtyDescription,FirmEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
## CtyDescription FirmEmpRatio AnnualPayroll
## <chr> <dbl> <dbl>
## 1 Wake 0.0527 22263004
## 2 Forsyth 0.0429 7870272
## 3 Guilford 0.0424 11472846
## 4 Mecklenburg 0.0404 35515013
## 5 Durham 0.0336 11375527
Based on these findings we see that Wake county has highest FirmEmpRatio but that variable doesn’t correlate with AnnualPayroll
Next, we view the counties with the highest total EstEmpRatio based on the 01: Total data element from the EntEmploySize variable:
nc_top5 %>% filter(EntEmploySize == "01: Total") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
## CtyDescription EstEmpRatio AnnualPayroll
## <chr> <dbl> <dbl>
## 1 Wake 0.0636 22263004
## 2 Forsyth 0.0517 7870272
## 3 Guilford 0.0513 11472846
## 4 Mecklenburg 0.0510 35515013
## 5 Durham 0.0402 11375527
Again Wake County is the leader in regards to EstEmpRatio but we also see that doesn’t correlate with AnnualPayroll
We will now take a look each level of the EntEmploySize against the EstEmpRatio:
nc_top5 %>% filter(EntEmploySize == "02: <20") %>%
ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
geom_bar(stat="identity", fill = "#010134")
nc_top5 %>% filter(EntEmploySize == "02: <20") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
## CtyDescription EstEmpRatio AnnualPayroll
## <chr> <dbl> <dbl>
## 1 Mecklenburg 0.268 3098478
## 2 Wake 0.262 2901460
## 3 Guilford 0.248 1266283
## 4 Forsyth 0.247 819344
## 5 Durham 0.244 757712
For employment size less than 20 Meckenburg couny has a slight lead over Wake County with Durham county at the bottom.
nc_top5 %>% filter(EntEmploySize == "03: 20-99") %>%
ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
geom_bar(stat="identity", fill = "#afd965")
nc_top5 %>% filter(EntEmploySize == "03: 20-99") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
## CtyDescription EstEmpRatio AnnualPayroll
## <chr> <dbl> <dbl>
## 1 Durham 0.0346 1032481
## 2 Guilford 0.0343 1368081
## 3 Wake 0.0338 3204235
## 4 Forsyth 0.0337 924506
## 5 Mecklenburg 0.0331 3486544
For employment size between 20 and 99 Durham couny has a slight lead over Guilford County with Mecklenburg county at the bottom.
nc_top5 %>% filter(EntEmploySize == "04: 100-499") %>%
ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
geom_bar(stat="identity", fill = "#E7DFC3")
nc_top5 %>% filter(EntEmploySize == "04: 100-499") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
## CtyDescription EstEmpRatio AnnualPayroll
## <chr> <dbl> <dbl>
## 1 Forsyth 0.0221 657531
## 2 Mecklenburg 0.0220 3414091
## 3 Wake 0.0216 3087631
## 4 Durham 0.0209 951624
## 5 Guilford 0.0197 1619685
For employment size between 100 and 499 Forsyth couny has a slight lead over Mecklenburg County with Guilford county at the bottom.
nc_top5 %>% filter(EntEmploySize == "05: 500+") %>%
ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
geom_bar(stat="identity", fill = "#616465")
nc_top5 %>% filter(EntEmploySize == "05: 500+") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
## CtyDescription EstEmpRatio AnnualPayroll
## <chr> <dbl> <dbl>
## 1 Wake 0.0233 13069678
## 2 Mecklenburg 0.0195 25515900
## 3 Guilford 0.0187 7218797
## 4 Forsyth 0.0183 5468891
## 5 Durham 0.0139 8633710
For employment size of 500 and over Wake couny has a sizable lead over Mecklenburg County with Durham county at the bottom.
Based on the analysis we can say that out of the top five counties in the state of North Carolina with the highest AnnualPayroll Wake County has the greatest variability in employment size. It was never had the bottom of any category and remained near the top and had the highest position of employment size of 500 and higher.
Import the data and utilize the skip feature in order to arrive at the first headers with data.
mass_shootings <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/mass_shooting_events_stanford_msa_release_06142016.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## CaseID = col_integer(),
## Latitude = col_double(),
## Longitude = col_double(),
## `Number of Victim Fatalities` = col_integer(),
## `Total Number of Fatalities` = col_integer(),
## `Number of Victims Injured` = col_integer(),
## `Total Number of Victims` = col_integer()
## )
## See spec(...) for full column specifications.
First step is to look at the dataset using the View feature
View(mass_shootings)
Next we will use the glimpse function to dig deeper in the mass_shootings datset
glimpse(mass_shootings)
## Observations: 347
## Variables: 48
## $ CaseID <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ Title <chr> "University of Texas at...
## $ Location <chr> "Austin, Texas", "Mesa,...
## $ City <chr> "Austin", "Mesa", "New ...
## $ State <chr> "Texas", "Arizona", "Lo...
## $ Latitude <dbl> 30.19889, 33.42269, 30....
## $ Longitude <dbl> -97.84416, -111.81632, ...
## $ `Number of Victim Fatalities` <int> 16, 5, 9, 1, 3, 1, 7, 2...
## $ `Total Number of Fatalities` <int> 17, 5, 10, 1, 3, 1, 7, ...
## $ `Number of Victims Injured` <int> 32, 1, 13, 3, 7, 7, 2, ...
## $ `Total Number of Victims` <int> 48, 6, 22, 4, 10, 8, 9,...
## $ Description <chr> "On August 1, 1966, a 2...
## $ Date <chr> "8/1/1966", "11/12/1966...
## $ `Day of Week` <chr> "Monday", "Saturday", "...
## $ `Date - Detailed` <chr> "Monday, August 01, 196...
## $ `Shooter Name` <chr> "Charles S. Whitman", "...
## $ `Shooter Age(s)` <chr> "25", "18", "23", "14",...
## $ `Average Shooter Age` <chr> "25", "18", "23", "14",...
## $ `Shooter Sex` <chr> "Male", "Male", "Male",...
## $ `Shooter Race` <chr> "White American or Euro...
## $ `Type of Gun - Detailed` <chr> "Remington 700 ADL (6mm...
## $ `Type of Gun - General` <chr> "Multiple guns", "Handg...
## $ `Number of Shotguns` <chr> "1", "0", "0", "0", "1"...
## $ `Number of Rifles` <chr> "3", "0", "1", "0", "1"...
## $ `Number of Handguns` <chr> "3", "1", "1", "2", "0"...
## $ `Total Number of Guns` <chr> "7", "1", "2", "2", "2"...
## $ `Number of Automatic Guns` <chr> "0", "0", "0", "0", "0"...
## $ `Number of Semi-Automatic Guns` <chr> "1", "0", "1", "0", "0"...
## $ `Fate of Shooter at the scene` <chr> "Deceased", "Arrested",...
## $ `Shooter's Cause of Death` <chr> "Killed", "Not applicab...
## $ `School Related` <chr> "Yes", "Yes", "No", "Ye...
## $ `Place Type` <chr> "College/University/Adu...
## $ `Relationship to Incident Location` <chr> "Place of schooling", "...
## $ `Targeted Victim/s - Detailed` <chr> "Family, students, teac...
## $ `Targeted Victim/s - General` <chr> "General public", "Stud...
## $ `Possible Motive - Detailed` <chr> "A task force of profes...
## $ `Possible Motive - General` <chr> "Mental illness", "Ment...
## $ `History of Mental Illness - Detailed` <chr> "Shooter had visited a...
## $ `History of Mental Illness - General` <chr> "Yes", "Yes", "Yes", "Y...
## $ `Data Source 1` <chr> "http://en.wikipedia.or...
## $ `Data Source 2` <chr> "Lexington Examiner (KY...
## $ `Data Source 3` <chr> "http://news.google.com...
## $ `Data Source 4` <chr> "http://news.google.com...
## $ `Data Source 5` <chr> "http://books.google.co...
## $ `Data Source 6` <chr> NA, NA, NA, NA, NA, NA,...
## $ `Data Source 7` <chr> NA, NA, NA, NA, NA, NA,...
## $ `Military Experience` <chr> "Yes", "Unknown", "Unkn...
## $ Class <chr> "SPK", "MS", "SPK", "MS...
The mass_shootings dataset is rather large like the uscb dataset but differs with the amount of options to pull data from the untidy set.
After first glance we will be explore school shootings but first we will conduct some preliminary data analysis to find a few other data elements.
First we will pull filter the mass_shootings dataset for school shootings and create a dataset entitled school:
names(mass_shootings)
## [1] "CaseID"
## [2] "Title"
## [3] "Location"
## [4] "City"
## [5] "State"
## [6] "Latitude"
## [7] "Longitude"
## [8] "Number of Victim Fatalities"
## [9] "Total Number of Fatalities"
## [10] "Number of Victims Injured"
## [11] "Total Number of Victims"
## [12] "Description"
## [13] "Date"
## [14] "Day of Week"
## [15] "Date - Detailed"
## [16] "Shooter Name"
## [17] "Shooter Age(s)"
## [18] "Average Shooter Age"
## [19] "Shooter Sex"
## [20] "Shooter Race"
## [21] "Type of Gun - Detailed"
## [22] "Type of Gun - General"
## [23] "Number of Shotguns"
## [24] "Number of Rifles"
## [25] "Number of Handguns"
## [26] "Total Number of Guns"
## [27] "Number of Automatic Guns"
## [28] "Number of Semi-Automatic Guns"
## [29] "Fate of Shooter at the scene"
## [30] "Shooter's Cause of Death"
## [31] "School Related"
## [32] "Place Type"
## [33] "Relationship to Incident Location"
## [34] "Targeted Victim/s - Detailed"
## [35] "Targeted Victim/s - General"
## [36] "Possible Motive - Detailed"
## [37] "Possible Motive - General"
## [38] "History of Mental Illness - Detailed"
## [39] "History of Mental Illness - General"
## [40] "Data Source 1"
## [41] "Data Source 2"
## [42] "Data Source 3"
## [43] "Data Source 4"
## [44] "Data Source 5"
## [45] "Data Source 6"
## [46] "Data Source 7"
## [47] "Military Experience"
## [48] "Class"
school <- mass_shootings %>% filter(`School Related`=="Yes")
dim(school)
## [1] 73 48
There are 73 instances where there were school shootings
We will now condense the school dataset in order to explore a few variables:
school <- school %>% select(`State`,`Total Number of Fatalities`,
`Total Number of Victims`,
`Day of Week`,
`Shooter Race`,
`Possible Motive - General`,
`Place Type`,
`Shooter Sex`)
We will explore the variables to see if it makes sense to delve deeper or eliminate them from the data set excluding the number of fatalities,victims, and day of the week:
school %>% count(`State`) %>% arrange(desc(n))
## # A tibble: 33 x 2
## State n
## <chr> <int>
## 1 California 10
## 2 Ohio 5
## 3 Arizona 4
## 4 Illinois 4
## 5 Washington 4
## 6 Michigan 3
## 7 Nevada 3
## 8 New York 3
## 9 Oregon 3
## 10 Pennsylvania 3
## # ... with 23 more rows
California has the most school related shootings with 10 but there is limited variability therefore it we will not pursue any further analysis.
school %>% count(`Shooter Race`) %>% arrange(desc(n))
## # A tibble: 7 x 2
## `Shooter Race` n
## <chr> <int>
## 1 White American or European American 42
## 2 Black American or African American 14
## 3 Asian American 7
## 4 Some other race 5
## 5 Native American or Alaska Native 2
## 6 Unknown 2
## 7 Two or more races 1
Whites have the most school related shootings with 42 while Blacks come in behind with 14 so we will explore the total number of victims and fatalities.
ggplot(data = school) +
geom_point(mapping = aes(x = `Shooter Race`,
y = `Total Number of Victims`,
color = `Shooter Race`)) +
theme(axis.text.x=element_blank())
#removes the text from the x-axis
We already knew from the count that Whites had the most school shootings, but it seems that Asian Americans logged in the most victims in a shooting.
ggplot(data = school) +
geom_point(mapping = aes(x = `Shooter Race`,
y = `Total Number of Fatalities`,
color = `Shooter Race`)) +
theme(axis.text.x=element_blank())
school %>% count(`Possible Motive - General`) %>% arrange(desc(n))
## # A tibble: 14 x 2
## `Possible Motive - General` n
## <chr> <int>
## 1 Mental illness 19
## 2 Multiple motives 10
## 3 Unknown 8
## 4 Social dispute 7
## 5 Failure 6
## 6 Expulsion/Suspension 5
## 7 Harassment 5
## 8 Terminated/Denied/Reprimanded 4
## 9 Rejection 3
## 10 Domestic dispute 2
## 11 Legal dispute 1
## 12 Political/Religious ideals 1
## 13 Race 1
## 14 Social Dispute 1
Mental illness leads the numbers with 19, and there seems to be variability here once we get past mental illness, we will conduct the same analysis from prior which surround the number of victims and fatalities.
ggplot(data = school) +
geom_point(mapping = aes(x = `Possible Motive - General`,
y = `Total Number of Victims`,
color = `Possible Motive - General`)) +
theme(axis.text.x=element_blank())
#removes the text from the x-axis
ggplot(data = school) +
geom_point(mapping = aes(x = `Possible Motive - General`,
y = `Total Number of Fatalities`,
color = `Possible Motive - General`)) +
theme(axis.text.x=element_blank())
#removes the text from the x-axis
Although mental illness seems like the driving force behind school shootings we see that there are quite a few of these acts that are unknown. The unknown factor also claimed the second most fatalities.
school %>% count(`Place Type`) %>% arrange(desc(n))
## # A tibble: 9 x 2
## `Place Type` n
## <chr> <int>
## 1 College/University/Adult education 27
## 2 Secondary school 21
## 3 Primary school 16
## 4 Residential home/Neighborhood 4
## 5 Place of worship 1
## 6 Restaurant/Cafe? 1
## 7 "Retail/ Wholesale/Services facility\rand Primary school" 1
## 8 Secondary School 1
## 9 Street/Highway 1
The top three places were Colleges, Secondary Schools, and Primary Schools with counts of 27, 21, and 16 respectively. This data element will be excluded.
school %>% count(`Shooter Sex`) %>% arrange(desc(n))
## # A tibble: 2 x 2
## `Shooter Sex` n
## <chr> <int>
## 1 Male 70
## 2 Female 3
The disparities are huge with 70 males and 3 females; data will be excluded.
school %>% count(`Day of Week`) %>% arrange(desc(n))
## # A tibble: 7 x 2
## `Day of Week` n
## <chr> <int>
## 1 Thursday 19
## 2 Monday 18
## 3 Friday 15
## 4 Tuesday 10
## 5 Wednesday 6
## 6 Sunday 3
## 7 Saturday 2
What stands out the most is the fact there were 5 school shootings that occurred on the weekend; 2 and 3 respectively. Out of the school/work week it seems that Wednesday is the safest day with only 6 school shootings while Thursday and Monday are the most violent.
ggplot(data = school) +
geom_point(mapping = aes(x = `Day of Week`,
y = `Total Number of Fatalities`,
color = `Day of Week`)) +
theme(axis.text.x=element_blank())
#removes the text from the x-axis
Even though Thursday logged the most school shootings we see that Monday has registered the most victims.
ggplot(data = school) +
geom_point(mapping = aes(x = `Day of Week`,
y = `Total Number of Fatalities`,
color = `Day of Week`)) +
theme(axis.text.x=element_blank())
With 15 school shootings Friday took the second spot in the number of fatalities.
The issue with school shootings has become a hot issue once again in the US, and what I found most interesting was the days of the week these horrific acts occurred.