Goal

The goal for this project is to tidy and analyze data sets presented in the week 5 discussion section.

Data Sets

New York City Compstat Historcial Crime Data by Precinct Citywide Seven Major Felony Offenses 2000-2017

United States Census Bureau Data by Enterpise Employment Size: Counties (State by County, totals)

Stanford Mass Shootings in America (MSA) A high quality dataset from 19966-2016 with method, definitions, and references

Load packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr)

New York City Compstat Data (Import and Tidy)

Import the data and utilize the skip feature in order to arrive at the first headers with data.

compstat <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/seven-major-felony-offenses-2000-2016%20(2).csv", skip = 4)
## Parsed with column specification:
## cols(
##   OFFENSE = col_character(),
##   `2000` = col_number(),
##   `2001` = col_number(),
##   `2002` = col_number(),
##   `2003` = col_number(),
##   `2004` = col_number(),
##   `2005` = col_number(),
##   `2006` = col_number(),
##   `2007` = col_number(),
##   `2008` = col_number(),
##   `2009` = col_number(),
##   `2010` = col_number(),
##   `2011` = col_number(),
##   `2012` = col_number(),
##   `2013` = col_number(),
##   `2014` = col_number(),
##   `2015` = col_number(),
##   `2016` = col_number()
## )

First step is to look at the dataset using the View feature

View(compstat)

After viewing the dataset we can see that row 9 provides totals which we won’t need. In addition, rows 10 - 16 are no longer needed so we will remove them using the following code:

compstat <- compstat[-c(8:18),]

Next we take a look at the data using the glimpse function

glimpse(compstat)
## Observations: 7
## Variables: 18
## $ OFFENSE <chr> "MURDER & NON-NEGL. MANSLAUGHTER", "RAPE", "ROBBERY", ...
## $ `2000`  <dbl> 673, 2068, 32562, 25924, 38352, 49631, 35442
## $ `2001`  <dbl> 649, 1981, 28202, 23453, 32763, 46329, 29531
## $ `2002`  <dbl> 587, 2144, 27229, 21147, 31275, 45771, 26656
## $ `2003`  <dbl> 597, 2070, 25989, 19139, 29110, 46751, 23413
## $ `2004`  <dbl> 570, 1905, 24373, 18622, 26976, 48763, 20884
## $ `2005`  <dbl> 539, 1858, 24722, 17750, 24117, 48243, 18246
## $ `2006`  <dbl> 596, 1525, 23739, 17309, 23143, 46625, 15745
## $ `2007`  <dbl> 496, 1351, 21809, 17493, 21762, 44924, 13174
## $ `2008`  <dbl> 523, 1299, 22401, 16284, 20725, 44242, 12482
## $ `2009`  <dbl> 471, 1205, 18601, 16773, 19430, 39580, 10670
## $ `2010`  <dbl> 536, 1373, 19486, 16956, 18600, 37835, 10329
## $ `2011`  <dbl> 515, 1420, 19717, 18482, 18720, 38501, 9314
## $ `2012`  <dbl> 419, 1445, 20144, 19381, 19168, 42497, 8093
## $ `2013`  <dbl> 335, 1378, 19128, 20297, 17429, 45368, 7400
## $ `2014`  <dbl> 333, 1352, 16539, 20207, 16765, 43862, 7664
## $ `2015`  <dbl> 352, 1438, 16931, 20270, 15125, 44005, 7332
## $ `2016`  <dbl> 335, 1438, 15500, 20847, 12990, 44279, 6327

The compstat dataset currently has 7 observations and 18 variables. In order to tidy the data we use the gather feature in order to view the offenses, year, and totals.

compstat <- gather(compstat,"YEAR","TOTALS",2:18)
dim(compstat)
## [1] 119   3

After using gather we arrive at 3 variables and 119 observations making it easier to conduct analysis.

New York City Compstat Data (Analyze)

First it will be interesting to look at the total number of crimes by year. We use the piping to group the data by year and then add the seven major crimes and create a bar chart.

compstat %>% group_by(YEAR) %>% summarise(CRIMES = sum(TOTALS)) %>%
    ggplot(mapping = aes(x = YEAR, y = CRIMES)) +
               geom_bar(stat="identity", fill = "#6C9552")

Based on the graph we can see that overall crime in New York has decreased significantly since 2000. However, crime started to increase slightly in 2012 & 2013 and dropped to its lowest point in 2016.

Create a dataset that will filter on the criminal offense of murder and view the years that recorded the highest values

compstat_murder <- filter(compstat,OFFENSE =="MURDER & NON-NEGL. MANSLAUGHTER") 

compstat_murder %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(5) 
## Selecting by TOTALS
## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2002     587
## 2 2006     596
## 3 2003     597
## 4 2001     649
## 5 2000     673

In 2000 New York City recorded 673 homicides making it the deadliest year in the last decade. Using kable we will list he entire list.

kable(compstat_murder[2:3], format = "html", caption = "New York City Homicide Totals (2000 - 2016)")
New York City Homicide Totals (2000 - 2016)
YEAR TOTALS
2000 673
2001 649
2002 587
2003 597
2004 570
2005 539
2006 596
2007 496
2008 523
2009 471
2010 536
2011 515
2012 419
2013 335
2014 333
2015 352
2016 335

Next we will calculate an estimated murder rate for each year in compstat_murder based on census data retrieved from https://en.wikipedia.org/wiki/Demographics_of_New_York_City. In order to do so we create a vector entitled population which will store the population for 2000 and 2010 and use the mean value of 8,091,711.

Using mutate we then create Per100 which divides the estimated population by 100000 and then create murder_rate that will divide TOTALS by Per100

population <- c(8008288,8175133)
mean(population)
## [1] 8091711
compstat_murder %>% mutate(Per100 = 8091711/100000, MurderRate = TOTALS/Per100) %>% arrange(desc(MurderRate)) %>% select(YEAR, TOTALS, MurderRate)
## # A tibble: 17 x 3
##    YEAR  TOTALS MurderRate
##    <chr>  <dbl>      <dbl>
##  1 2000     673       8.32
##  2 2001     649       8.02
##  3 2003     597       7.38
##  4 2006     596       7.37
##  5 2002     587       7.25
##  6 2004     570       7.04
##  7 2005     539       6.66
##  8 2010     536       6.62
##  9 2008     523       6.46
## 10 2011     515       6.36
## 11 2007     496       6.13
## 12 2009     471       5.82
## 13 2012     419       5.18
## 14 2015     352       4.35
## 15 2013     335       4.14
## 16 2016     335       4.14
## 17 2014     333       4.12

In unlisted analysis we saw that 2000 reported the largest amount of criminal activity for each offense with the exception of rape.

Rape

compstat_rape <- filter(compstat,OFFENSE =="RAPE") 

compstat_rape %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(5)
## Selecting by TOTALS
## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2004    1905
## 2 2001    1981
## 3 2000    2068
## 4 2003    2070
## 5 2002    2144

2002 led the city in the numbers of rapes with 2,144.

kable(compstat_rape[2:3], format = "html", caption = "New York City Rape Totals (2000 - 2016)")
New York City Rape Totals (2000 - 2016)
YEAR TOTALS
2000 2068
2001 1981
2002 2144
2003 2070
2004 1905
2005 1858
2006 1525
2007 1351
2008 1299
2009 1205
2010 1373
2011 1420
2012 1445
2013 1378
2014 1352
2015 1438
2016 1438

To reduce redundancy we will look at the years that reported the least amount of robberies, felony assualts, and grand larceny by adding a (-) negative sign to the top_n feature.

Robbery

compstat_robbery <- filter(compstat,OFFENSE =="ROBBERY") 

compstat_robbery %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)
## Selecting by TOTALS
## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2016   15500
## 2 2014   16539
## 3 2015   16931
## 4 2009   18601
## 5 2013   19128

2016 recorded the least amount of robberies with 15,500.

kable(compstat_robbery[2:3], format = "html", caption = "New York City Robbery Totals (2000 - 2016)")
New York City Robbery Totals (2000 - 2016)
YEAR TOTALS
2000 32562
2001 28202
2002 27229
2003 25989
2004 24373
2005 24722
2006 23739
2007 21809
2008 22401
2009 18601
2010 19486
2011 19717
2012 20144
2013 19128
2014 16539
2015 16931
2016 15500

Felony Assualt

compstat_fa <- filter(compstat,OFFENSE =="FELONY ASSAULT") 

compstat_fa %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)
## Selecting by TOTALS
## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2008   16284
## 2 2009   16773
## 3 2010   16956
## 4 2006   17309
## 5 2007   17493

2008 recorded the least amount of felony assaults with 16,284.

kable(compstat_fa[2:3], format = "html", caption = "New York City Felony Assualt Totals (2000 - 2016)")
New York City Felony Assualt Totals (2000 - 2016)
YEAR TOTALS
2000 25924
2001 23453
2002 21147
2003 19139
2004 18622
2005 17750
2006 17309
2007 17493
2008 16284
2009 16773
2010 16956
2011 18482
2012 19381
2013 20297
2014 20207
2015 20270
2016 20847

Grand Larceny

compstat_gl <- filter(compstat,OFFENSE =="GRAND LARCENY") 

compstat_gl %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)
## Selecting by TOTALS
## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2010   37835
## 2 2011   38501
## 3 2009   39580
## 4 2012   42497
## 5 2014   43862

2010 recorded the least amount of grand larceny with 37,835.

kable(compstat_rape[2:3], format = "html", caption = "New York City Grand Larceny Totals (2000 - 2016)")
New York City Grand Larceny Totals (2000 - 2016)
YEAR TOTALS
2000 2068
2001 1981
2002 2144
2003 2070
2004 1905
2005 1858
2006 1525
2007 1351
2008 1299
2009 1205
2010 1373
2011 1420
2012 1445
2013 1378
2014 1352
2015 1438
2016 1438

Burglary and Grand Larceny Motor Vehicles were omitted because the both followed the trend of 2000 recording the lowest numbers while 2016 recorded the most.

New York City Compstat Data (outcome)

Based on the analysis we can say with confidence that crime has decreased in New York City within the past decade.

United States Census Bureau (Import and Tidy)

Import the data and utilize the skip feature in order to arrive at the first headers with data. We also noticed that the headers all have spaces between the names so we create a vector that will hold the names and apply.

names <-c("FIPS_StateCode","StateDescr","FIPS_CountyCode",
          "CtyDescription","EntEmploySize","NumFirms",
          "NumEstablishments","Employment","ERF",
          "ENF","AnnualPayroll","APNG","X13","X14","X15")
uscb <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/county_totals_2015.csv", skip = 5, col_names = names)
## Parsed with column specification:
## cols(
##   FIPS_StateCode = col_character(),
##   StateDescr = col_character(),
##   FIPS_CountyCode = col_character(),
##   CtyDescription = col_character(),
##   EntEmploySize = col_character(),
##   NumFirms = col_character(),
##   NumEstablishments = col_character(),
##   Employment = col_character(),
##   ERF = col_character(),
##   ENF = col_character(),
##   AnnualPayroll = col_character(),
##   APNG = col_character(),
##   X13 = col_character(),
##   X14 = col_character(),
##   X15 = col_character()
## )

First step is to look at the dataset using the View feature

View(uscb)

After viewing the dataset we can see that rows 1 & 2 have NA values and can be removed.

uscb <- uscb[-c(1:2),]

Next we take a look at the data using the glimpse function

glimpse(uscb)
## Observations: 15,855
## Variables: 15
## $ FIPS_StateCode    <chr> NA, "01", "01", "01", "01", "01", "01", "01"...
## $ StateDescr        <chr> NA, "Alabama", "Alabama", "Alabama", "Alabam...
## $ FIPS_CountyCode   <chr> NA, "001", "001", "001", "001", "001", "003"...
## $ CtyDescription    <chr> NA, "Autauga", "Autauga", "Autauga", "Autaug...
## $ EntEmploySize     <chr> NA, "01:  Total", "02:  <20", "03:  20-99", ...
## $ NumFirms          <chr> NA, "788", "526", "75", "35", "152", "4,431"...
## $ NumEstablishments <chr> NA, "844", "527", "81", "51", "185", "5,127"...
## $ Employment        <chr> NA, "10,454", "2,216", "2,143", "1,161", "4,...
## $ ERF               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ENF               <chr> NA, "G", "G", "G", "G", "G", "G", "G", "G", ...
## $ AnnualPayroll     <chr> NA, "321,433", "63,553", "57,490", "29,074",...
## $ APNG              <chr> NA, "G", "G", "G", "G", "G", "G", "G", "G", ...
## $ X13               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X14               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X15               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

The uscb dataset currently has 15 observations and 15854 variables. We will look at the city of North Carolina so we will first create a dataset entitled nc and view the dimensions using dim

nc <- filter(uscb,StateDescr=="North Carolina")
dim(nc)
## [1] 505  15

Next we will condense the amount of variables to look at while using select and choosing the following:

nc <- nc %>% select(CtyDescription,EntEmploySize,
              NumFirms,NumEstablishments,Employment,AnnualPayroll)

To get an idea on how many counties we have our next step is to use the unique feature:

unique(nc$CtyDescription)
##   [1] "Alamance"                 "Alexander"               
##   [3] "Alleghany"                "Anson"                   
##   [5] "Ashe"                     "Avery"                   
##   [7] "Beaufort"                 "Bertie"                  
##   [9] "Bladen"                   "Brunswick"               
##  [11] "Buncombe"                 "Burke"                   
##  [13] "Cabarrus"                 "Caldwell"                
##  [15] "Camden"                   "Carteret"                
##  [17] "Caswell"                  "Catawba"                 
##  [19] "Chatham"                  "Cherokee"                
##  [21] "Chowan"                   "Clay"                    
##  [23] "Cleveland"                "Columbus"                
##  [25] "Craven"                   "Cumberland"              
##  [27] "Currituck"                "Dare"                    
##  [29] "Davidson"                 "Davie"                   
##  [31] "Duplin"                   "Durham"                  
##  [33] "Edgecombe"                "Forsyth"                 
##  [35] "Franklin"                 "Gaston"                  
##  [37] "Gates"                    "Graham"                  
##  [39] "Granville"                "Greene"                  
##  [41] "Guilford"                 "Halifax"                 
##  [43] "Harnett"                  "Haywood"                 
##  [45] "Henderson"                "Hertford"                
##  [47] "Hoke"                     "Hyde"                    
##  [49] "Iredell"                  "Jackson"                 
##  [51] "Johnston"                 "Jones"                   
##  [53] "Lee"                      "Lenoir"                  
##  [55] "Lincoln"                  "McDowell"                
##  [57] "Macon"                    "Madison"                 
##  [59] "Martin"                   "Mecklenburg"             
##  [61] "Mitchell"                 "Montgomery"              
##  [63] "Moore"                    "Nash"                    
##  [65] "New Hanover"              "Northampton"             
##  [67] "Onslow"                   "Orange"                  
##  [69] "Pamlico"                  "Pasquotank"              
##  [71] "Pender"                   "Perquimans"              
##  [73] "Person"                   "Pitt"                    
##  [75] "Polk"                     "Randolph"                
##  [77] "Richmond"                 "Robeson"                 
##  [79] "Rockingham"               "Rowan"                   
##  [81] "Rutherford"               "Sampson"                 
##  [83] "Scotland"                 "Stanly"                  
##  [85] "Stokes"                   "Surry"                   
##  [87] "Swain"                    "Transylvania"            
##  [89] "Tyrrell"                  "Union"                   
##  [91] "Vance"                    "Wake"                    
##  [93] "Warren"                   "Washington"              
##  [95] "Watauga"                  "Wayne"                   
##  [97] "Wilkes"                   "Wilson"                  
##  [99] "Yadkin"                   "Yancey"                  
## [101] "Statewide North Carolina"

nc has 101 unique counties in the dataset

Before we tidy the data we will first convert NumFirms, Employment, AnnualPayroll to a numeric variable in order to delve deeper in the dataset. We experienced warnings known as NA coercion due to commas in the character variables so we used the gsub function in order to remove the commas.

nc$NumFirms <- as.numeric(gsub(",","",nc$NumFirms))
nc$NumEstablishments <- as.numeric(gsub(",","",nc$NumEstablishments))
nc$Employment <- as.numeric(gsub(",","",nc$Employment))
nc$AnnualPayroll <- as.numeric(gsub(",","",nc$AnnualPayroll))

Identify counties that have the highest annual payroll based on the EntEmploySize being equal to 01: Total

nc %>% filter(EntEmploySize == "01:  Total") %>% arrange(AnnualPayroll) %>%  select(CtyDescription,NumFirms,NumEstablishments,Employment,AnnualPayroll) %>% top_n(5)
## Selecting by AnnualPayroll
## # A tibble: 5 x 5
##   CtyDescription NumFirms NumEstablishments Employment AnnualPayroll
##   <chr>             <dbl>             <dbl>      <dbl>         <dbl>
## 1 Forsyth            6940              8366     161704       7870272
## 2 Durham             5949              7103     176821      11375527
## 3 Guilford          10956             13246     258166      11472846
## 4 Wake              22708             27405     431040      22263004
## 5 Mecklenburg       23793             30039     588844      35515013

Create a dataset entitled nc_top5 based on the previous findings.

nc_top5 <- nc %>% filter(CtyDescription =="Forsyth" |
           CtyDescription =="Durham" |
           CtyDescription =="Guilford" | 
           CtyDescription =="Wake" |
           CtyDescription =="Mecklenburg")
kable(nc_top5, format = "html")
CtyDescription EntEmploySize NumFirms NumEstablishments Employment AnnualPayroll
Durham 01: Total 5949 7103 176821 11375527
Durham 02: <20 4301 4313 17651 757712
Durham 03: 20-99 684 776 22441 1032481
Durham 04: 100-499 245 345 16495 951624
Durham 05: 500+ 719 1669 120234 8633710
Forsyth 01: Total 6940 8366 161704 7870272
Forsyth 02: <20 5336 5364 21748 819344
Forsyth 03: 20-99 717 822 24380 924506
Forsyth 04: 100-499 244 400 18075 657531
Forsyth 05: 500+ 643 1780 97501 5468891
Guilford 01: Total 10956 13246 258166 11472846
Guilford 02: <20 8404 8452 34113 1266283
Guilford 03: 20-99 1092 1223 35662 1368081
Guilford 04: 100-499 491 769 38941 1619685
Guilford 05: 500+ 969 2802 149450 7218797
Mecklenburg 01: Total 23793 30039 588844 35515013
Mecklenburg 02: <20 18567 18669 69544 3098478
Mecklenburg 03: 20-99 2347 2625 79335 3486544
Mecklenburg 04: 100-499 998 1553 70456 3414091
Mecklenburg 05: 500+ 1881 7192 369509 25515900
Wake 01: Total 22708 27405 431040 22263004
Wake 02: <20 18192 18271 69640 2901460
Wake 03: 20-99 2291 2615 77462 3204235
Wake 04: 100-499 774 1298 60199 3087631
Wake 05: 500+ 1451 5221 223739 13069678

United States Census Bureau (Analysis)

Create a variable entitled FirmEmpRatio and EstEmpRario to obtain a ratios in order to eliminate size disparity.

nc_top5 <- nc_top5 %>% mutate(FirmEmpRatio = NumFirms/Employment, EstEmpRatio = NumEstablishments/Employment) %>% select(CtyDescription,EntEmploySize, FirmEmpRatio,EstEmpRatio, AnnualPayroll)
kable(nc_top5, format = "html")
CtyDescription EntEmploySize FirmEmpRatio EstEmpRatio AnnualPayroll
Durham 01: Total 0.0336442 0.0401706 11375527
Durham 02: <20 0.2436689 0.2443488 757712
Durham 03: 20-99 0.0304799 0.0345796 1032481
Durham 04: 100-499 0.0148530 0.0209154 951624
Durham 05: 500+ 0.0059800 0.0138813 8633710
Forsyth 01: Total 0.0429179 0.0517365 7870272
Forsyth 02: <20 0.2453559 0.2466434 819344
Forsyth 03: 20-99 0.0294094 0.0337162 924506
Forsyth 04: 100-499 0.0134993 0.0221300 657531
Forsyth 05: 500+ 0.0065948 0.0182562 5468891
Guilford 01: Total 0.0424378 0.0513081 11472846
Guilford 02: <20 0.2463577 0.2477648 1266283
Guilford 03: 20-99 0.0306208 0.0342942 1368081
Guilford 04: 100-499 0.0126088 0.0197478 1619685
Guilford 05: 500+ 0.0064838 0.0187487 7218797
Mecklenburg 01: Total 0.0404063 0.0510135 35515013
Mecklenburg 02: <20 0.2669821 0.2684488 3098478
Mecklenburg 03: 20-99 0.0295834 0.0330875 3486544
Mecklenburg 04: 100-499 0.0141649 0.0220421 3414091
Mecklenburg 05: 500+ 0.0050905 0.0194637 25515900
Wake 01: Total 0.0526819 0.0635788 22263004
Wake 02: <20 0.2612292 0.2623636 2901460
Wake 03: 20-99 0.0295758 0.0337585 3204235
Wake 04: 100-499 0.0128574 0.0215618 3087631
Wake 05: 500+ 0.0064852 0.0233352 13069678

View the counties with the highest total FirmEmpRatio based on the 01: Total data element from the EntEmploySize variable:

nc_top5 %>% filter(EntEmploySize == "01:  Total") %>% arrange(desc(FirmEmpRatio)) %>% select(CtyDescription,FirmEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
##   CtyDescription FirmEmpRatio AnnualPayroll
##   <chr>                 <dbl>         <dbl>
## 1 Wake                 0.0527      22263004
## 2 Forsyth              0.0429       7870272
## 3 Guilford             0.0424      11472846
## 4 Mecklenburg          0.0404      35515013
## 5 Durham               0.0336      11375527

Based on these findings we see that Wake county has highest FirmEmpRatio but that variable doesn’t correlate with AnnualPayroll

Next, we view the counties with the highest total EstEmpRatio based on the 01: Total data element from the EntEmploySize variable:

nc_top5 %>% filter(EntEmploySize == "01:  Total") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Wake                0.0636      22263004
## 2 Forsyth             0.0517       7870272
## 3 Guilford            0.0513      11472846
## 4 Mecklenburg         0.0510      35515013
## 5 Durham              0.0402      11375527

Again Wake County is the leader in regards to EstEmpRatio but we also see that doesn’t correlate with AnnualPayroll

We will now take a look each level of the EntEmploySize against the EstEmpRatio:

Employment Size Less Than 20

nc_top5 %>% filter(EntEmploySize == "02:  <20") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#010134")

nc_top5 %>% filter(EntEmploySize == "02:  <20") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Mecklenburg          0.268       3098478
## 2 Wake                 0.262       2901460
## 3 Guilford             0.248       1266283
## 4 Forsyth              0.247        819344
## 5 Durham               0.244        757712

For employment size less than 20 Meckenburg couny has a slight lead over Wake County with Durham county at the bottom.

Employment Size between 20 and 99

nc_top5 %>% filter(EntEmploySize == "03:  20-99") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#afd965")

nc_top5 %>% filter(EntEmploySize == "03:  20-99") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Durham              0.0346       1032481
## 2 Guilford            0.0343       1368081
## 3 Wake                0.0338       3204235
## 4 Forsyth             0.0337        924506
## 5 Mecklenburg         0.0331       3486544

For employment size between 20 and 99 Durham couny has a slight lead over Guilford County with Mecklenburg county at the bottom.

Employment Size between 100 and 499

nc_top5 %>% filter(EntEmploySize == "04:  100-499") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#E7DFC3")

nc_top5 %>% filter(EntEmploySize == "04:  100-499") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Forsyth             0.0221        657531
## 2 Mecklenburg         0.0220       3414091
## 3 Wake                0.0216       3087631
## 4 Durham              0.0209        951624
## 5 Guilford            0.0197       1619685

For employment size between 100 and 499 Forsyth couny has a slight lead over Mecklenburg County with Guilford county at the bottom.

Employment Size of 500 and higher

nc_top5 %>% filter(EntEmploySize == "05:  500+") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#616465")

nc_top5 %>% filter(EntEmploySize == "05:  500+") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)
## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Wake                0.0233      13069678
## 2 Mecklenburg         0.0195      25515900
## 3 Guilford            0.0187       7218797
## 4 Forsyth             0.0183       5468891
## 5 Durham              0.0139       8633710

For employment size of 500 and over Wake couny has a sizable lead over Mecklenburg County with Durham county at the bottom.

United States Census Bureau (outcome)

Based on the analysis we can say that out of the top five counties in the state of North Carolina with the highest AnnualPayroll Wake County has the greatest variability in employment size. It was never had the bottom of any category and remained near the top and had the highest position of employment size of 500 and higher.

Stanford Mass Shootings in America (MSA) (Import and Tidy)

Import the data and utilize the skip feature in order to arrive at the first headers with data.

mass_shootings <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/mass_shooting_events_stanford_msa_release_06142016.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   CaseID = col_integer(),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   `Number of Victim Fatalities` = col_integer(),
##   `Total Number of Fatalities` = col_integer(),
##   `Number of Victims Injured` = col_integer(),
##   `Total Number of Victims` = col_integer()
## )
## See spec(...) for full column specifications.

First step is to look at the dataset using the View feature

View(mass_shootings)

Next we will use the glimpse function to dig deeper in the mass_shootings datset

glimpse(mass_shootings)
## Observations: 347
## Variables: 48
## $ CaseID                                 <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ Title                                  <chr> "University of Texas at...
## $ Location                               <chr> "Austin, Texas", "Mesa,...
## $ City                                   <chr> "Austin", "Mesa", "New ...
## $ State                                  <chr> "Texas", "Arizona", "Lo...
## $ Latitude                               <dbl> 30.19889, 33.42269, 30....
## $ Longitude                              <dbl> -97.84416, -111.81632, ...
## $ `Number of Victim Fatalities`          <int> 16, 5, 9, 1, 3, 1, 7, 2...
## $ `Total Number of Fatalities`           <int> 17, 5, 10, 1, 3, 1, 7, ...
## $ `Number of Victims Injured`            <int> 32, 1, 13, 3, 7, 7, 2, ...
## $ `Total Number of Victims`              <int> 48, 6, 22, 4, 10, 8, 9,...
## $ Description                            <chr> "On August 1, 1966, a 2...
## $ Date                                   <chr> "8/1/1966", "11/12/1966...
## $ `Day of Week`                          <chr> "Monday", "Saturday", "...
## $ `Date - Detailed`                      <chr> "Monday, August 01, 196...
## $ `Shooter Name`                         <chr> "Charles S. Whitman", "...
## $ `Shooter Age(s)`                       <chr> "25", "18", "23", "14",...
## $ `Average Shooter Age`                  <chr> "25", "18", "23", "14",...
## $ `Shooter Sex`                          <chr> "Male", "Male", "Male",...
## $ `Shooter Race`                         <chr> "White American or Euro...
## $ `Type of Gun - Detailed`               <chr> "Remington 700 ADL (6mm...
## $ `Type of Gun - General`                <chr> "Multiple guns", "Handg...
## $ `Number of Shotguns`                   <chr> "1", "0", "0", "0", "1"...
## $ `Number of Rifles`                     <chr> "3", "0", "1", "0", "1"...
## $ `Number of Handguns`                   <chr> "3", "1", "1", "2", "0"...
## $ `Total Number of Guns`                 <chr> "7", "1", "2", "2", "2"...
## $ `Number of Automatic Guns`             <chr> "0", "0", "0", "0", "0"...
## $ `Number of Semi-Automatic Guns`        <chr> "1", "0", "1", "0", "0"...
## $ `Fate of Shooter at the scene`         <chr> "Deceased", "Arrested",...
## $ `Shooter's Cause of Death`             <chr> "Killed", "Not applicab...
## $ `School Related`                       <chr> "Yes", "Yes", "No", "Ye...
## $ `Place Type`                           <chr> "College/University/Adu...
## $ `Relationship to Incident Location`    <chr> "Place of schooling", "...
## $ `Targeted Victim/s - Detailed`         <chr> "Family, students, teac...
## $ `Targeted Victim/s - General`          <chr> "General public", "Stud...
## $ `Possible Motive - Detailed`           <chr> "A task force of profes...
## $ `Possible Motive - General`            <chr> "Mental illness", "Ment...
## $ `History of Mental Illness - Detailed` <chr> "Shooter had visited  a...
## $ `History of Mental Illness - General`  <chr> "Yes", "Yes", "Yes", "Y...
## $ `Data Source 1`                        <chr> "http://en.wikipedia.or...
## $ `Data Source 2`                        <chr> "Lexington Examiner (KY...
## $ `Data Source 3`                        <chr> "http://news.google.com...
## $ `Data Source 4`                        <chr> "http://news.google.com...
## $ `Data Source 5`                        <chr> "http://books.google.co...
## $ `Data Source 6`                        <chr> NA, NA, NA, NA, NA, NA,...
## $ `Data Source 7`                        <chr> NA, NA, NA, NA, NA, NA,...
## $ `Military Experience`                  <chr> "Yes", "Unknown", "Unkn...
## $ Class                                  <chr> "SPK", "MS", "SPK", "MS...

The mass_shootings dataset is rather large like the uscb dataset but differs with the amount of options to pull data from the untidy set.

After first glance we will be explore school shootings but first we will conduct some preliminary data analysis to find a few other data elements.

School Shootings

First we will pull filter the mass_shootings dataset for school shootings and create a dataset entitled school:

names(mass_shootings)
##  [1] "CaseID"                              
##  [2] "Title"                               
##  [3] "Location"                            
##  [4] "City"                                
##  [5] "State"                               
##  [6] "Latitude"                            
##  [7] "Longitude"                           
##  [8] "Number of Victim Fatalities"         
##  [9] "Total Number of Fatalities"          
## [10] "Number of Victims Injured"           
## [11] "Total Number of Victims"             
## [12] "Description"                         
## [13] "Date"                                
## [14] "Day of Week"                         
## [15] "Date - Detailed"                     
## [16] "Shooter Name"                        
## [17] "Shooter Age(s)"                      
## [18] "Average Shooter Age"                 
## [19] "Shooter Sex"                         
## [20] "Shooter Race"                        
## [21] "Type of Gun - Detailed"              
## [22] "Type of Gun - General"               
## [23] "Number of Shotguns"                  
## [24] "Number of Rifles"                    
## [25] "Number of Handguns"                  
## [26] "Total Number of Guns"                
## [27] "Number of Automatic Guns"            
## [28] "Number of Semi-Automatic Guns"       
## [29] "Fate of Shooter at the scene"        
## [30] "Shooter's Cause of Death"            
## [31] "School Related"                      
## [32] "Place Type"                          
## [33] "Relationship to Incident Location"   
## [34] "Targeted Victim/s - Detailed"        
## [35] "Targeted Victim/s - General"         
## [36] "Possible Motive - Detailed"          
## [37] "Possible Motive - General"           
## [38] "History of Mental Illness - Detailed"
## [39] "History of Mental Illness - General" 
## [40] "Data Source 1"                       
## [41] "Data Source 2"                       
## [42] "Data Source 3"                       
## [43] "Data Source 4"                       
## [44] "Data Source 5"                       
## [45] "Data Source 6"                       
## [46] "Data Source 7"                       
## [47] "Military Experience"                 
## [48] "Class"
school <- mass_shootings %>% filter(`School Related`=="Yes")
dim(school)
## [1] 73 48

There are 73 instances where there were school shootings

We will now condense the school dataset in order to explore a few variables:

school <- school %>% select(`State`,`Total Number of Fatalities`,
                            `Total Number of Victims`,
                            `Day of Week`,
                            `Shooter Race`,
                            `Possible Motive - General`,
                            `Place Type`,
                            `Shooter Sex`)

Stanford Mass Shootings in America (MSA) (Analysis)

We will explore the variables to see if it makes sense to delve deeper or eliminate them from the data set excluding the number of fatalities,victims, and day of the week:

State

school %>% count(`State`) %>% arrange(desc(n))
## # A tibble: 33 x 2
##    State            n
##    <chr>        <int>
##  1 California      10
##  2 Ohio             5
##  3 Arizona          4
##  4 Illinois         4
##  5 Washington       4
##  6 Michigan         3
##  7 Nevada           3
##  8 New York         3
##  9 Oregon           3
## 10 Pennsylvania     3
## # ... with 23 more rows

California has the most school related shootings with 10 but there is limited variability therefore it we will not pursue any further analysis.

Shooter Race

school %>% count(`Shooter Race`) %>% arrange(desc(n))
## # A tibble: 7 x 2
##   `Shooter Race`                          n
##   <chr>                               <int>
## 1 White American or European American    42
## 2 Black American or African American     14
## 3 Asian American                          7
## 4 Some other race                         5
## 5 Native American or Alaska Native        2
## 6 Unknown                                 2
## 7 Two or more races                       1

Whites have the most school related shootings with 42 while Blacks come in behind with 14 so we will explore the total number of victims and fatalities.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Shooter Race`,
                           y = `Total Number of Victims`,
                           color = `Shooter Race`)) +
    theme(axis.text.x=element_blank()) 

#removes the text from the x-axis

We already knew from the count that Whites had the most school shootings, but it seems that Asian Americans logged in the most victims in a shooting.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Shooter Race`,
                           y = `Total Number of Fatalities`,
                           color = `Shooter Race`)) +
    theme(axis.text.x=element_blank()) 

Possible Motive - General

school %>% count(`Possible Motive - General`) %>% arrange(desc(n))
## # A tibble: 14 x 2
##    `Possible Motive - General`       n
##    <chr>                         <int>
##  1 Mental illness                   19
##  2 Multiple motives                 10
##  3 Unknown                           8
##  4 Social dispute                    7
##  5 Failure                           6
##  6 Expulsion/Suspension              5
##  7 Harassment                        5
##  8 Terminated/Denied/Reprimanded     4
##  9 Rejection                         3
## 10 Domestic dispute                  2
## 11 Legal dispute                     1
## 12 Political/Religious ideals        1
## 13 Race                              1
## 14 Social Dispute                    1

Mental illness leads the numbers with 19, and there seems to be variability here once we get past mental illness, we will conduct the same analysis from prior which surround the number of victims and fatalities.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Possible Motive - General`,
                           y = `Total Number of Victims`,
                           color = `Possible Motive - General`)) +
    theme(axis.text.x=element_blank()) 

#removes the text from the x-axis
ggplot(data = school) + 
  geom_point(mapping = aes(x = `Possible Motive - General`,
                           y = `Total Number of Fatalities`,
                           color = `Possible Motive - General`)) +
    theme(axis.text.x=element_blank()) 

#removes the text from the x-axis

Although mental illness seems like the driving force behind school shootings we see that there are quite a few of these acts that are unknown. The unknown factor also claimed the second most fatalities.

Place Type

school %>% count(`Place Type`) %>% arrange(desc(n))
## # A tibble: 9 x 2
##   `Place Type`                                                  n
##   <chr>                                                     <int>
## 1 College/University/Adult education                           27
## 2 Secondary school                                             21
## 3 Primary school                                               16
## 4 Residential home/Neighborhood                                 4
## 5 Place of worship                                              1
## 6 Restaurant/Cafe?                                              1
## 7 "Retail/ Wholesale/Services facility\rand Primary school"     1
## 8 Secondary School                                              1
## 9 Street/Highway                                                1

The top three places were Colleges, Secondary Schools, and Primary Schools with counts of 27, 21, and 16 respectively. This data element will be excluded.

Shooter Sex

school %>% count(`Shooter Sex`) %>% arrange(desc(n))
## # A tibble: 2 x 2
##   `Shooter Sex`     n
##   <chr>         <int>
## 1 Male             70
## 2 Female            3

The disparities are huge with 70 males and 3 females; data will be excluded.

Day of Week

school %>% count(`Day of Week`) %>% arrange(desc(n))
## # A tibble: 7 x 2
##   `Day of Week`     n
##   <chr>         <int>
## 1 Thursday         19
## 2 Monday           18
## 3 Friday           15
## 4 Tuesday          10
## 5 Wednesday         6
## 6 Sunday            3
## 7 Saturday          2

What stands out the most is the fact there were 5 school shootings that occurred on the weekend; 2 and 3 respectively. Out of the school/work week it seems that Wednesday is the safest day with only 6 school shootings while Thursday and Monday are the most violent.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Day of Week`,
                           y = `Total Number of Fatalities`,
                           color = `Day of Week`)) +
    theme(axis.text.x=element_blank()) 

#removes the text from the x-axis

Even though Thursday logged the most school shootings we see that Monday has registered the most victims.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Day of Week`,
                           y = `Total Number of Fatalities`,
                           color = `Day of Week`)) +
    theme(axis.text.x=element_blank()) 

With 15 school shootings Friday took the second spot in the number of fatalities.

Stanford Mass Shootings in America (MSA) (outcome)

The issue with school shootings has become a hot issue once again in the US, and what I found most interesting was the days of the week these horrific acts occurred.