Goal

The goal for this project is to tidy and analyze data sets presented in the week 5 discussion section.

Data Sets

New York City Compstat Historcial Crime Data by Precinct Citywide Seven Major Felony Offenses 2000-2017

United States Census Bureau Data by Enterpise Employment Size: Counties (State by County, totals)

Stanford Mass Shootings in America (MSA) A high quality dataset from 19966-2016 with method, definitions, and references

Load packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.3

## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## Warning: package 'tibble' was built under R version 3.4.3

## Warning: package 'tidyr' was built under R version 3.4.3

## Warning: package 'readr' was built under R version 3.4.3

## Warning: package 'purrr' was built under R version 3.4.3

## Warning: package 'dplyr' was built under R version 3.4.2

## Warning: package 'forcats' was built under R version 3.4.3

## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(knitr)

New York City Compstat Data (Import and Tidy)

Import the data and utilize the skip feature in order to arrive at the first headers with data.

compstat <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/seven-major-felony-offenses-2000-2016%20(2).csv", skip = 4)

## Parsed with column specification:
## cols(
##   OFFENSE = col_character(),
##   `2000` = col_number(),
##   `2001` = col_number(),
##   `2002` = col_number(),
##   `2003` = col_number(),
##   `2004` = col_number(),
##   `2005` = col_number(),
##   `2006` = col_number(),
##   `2007` = col_number(),
##   `2008` = col_number(),
##   `2009` = col_number(),
##   `2010` = col_number(),
##   `2011` = col_number(),
##   `2012` = col_number(),
##   `2013` = col_number(),
##   `2014` = col_number(),
##   `2015` = col_number(),
##   `2016` = col_number()
## )

First step is to look at the dataset using the View feature

View(compstat)

After viewing the dataset we can see that row 9 provides totals which we won’t need. In addition, rows 10 - 16 are no longer needed so we will remove them using the following code:

compstat <- compstat[-c(8:18),]

Next we take a look at the data using the glimpse function

glimpse(compstat)

## Observations: 7
## Variables: 18
## $ OFFENSE <chr> "MURDER & NON-NEGL. MANSLAUGHTER", "RAPE", "ROBBERY", ...
## $ `2000`  <dbl> 673, 2068, 32562, 25924, 38352, 49631, 35442
## $ `2001`  <dbl> 649, 1981, 28202, 23453, 32763, 46329, 29531
## $ `2002`  <dbl> 587, 2144, 27229, 21147, 31275, 45771, 26656
## $ `2003`  <dbl> 597, 2070, 25989, 19139, 29110, 46751, 23413
## $ `2004`  <dbl> 570, 1905, 24373, 18622, 26976, 48763, 20884
## $ `2005`  <dbl> 539, 1858, 24722, 17750, 24117, 48243, 18246
## $ `2006`  <dbl> 596, 1525, 23739, 17309, 23143, 46625, 15745
## $ `2007`  <dbl> 496, 1351, 21809, 17493, 21762, 44924, 13174
## $ `2008`  <dbl> 523, 1299, 22401, 16284, 20725, 44242, 12482
## $ `2009`  <dbl> 471, 1205, 18601, 16773, 19430, 39580, 10670
## $ `2010`  <dbl> 536, 1373, 19486, 16956, 18600, 37835, 10329
## $ `2011`  <dbl> 515, 1420, 19717, 18482, 18720, 38501, 9314
## $ `2012`  <dbl> 419, 1445, 20144, 19381, 19168, 42497, 8093
## $ `2013`  <dbl> 335, 1378, 19128, 20297, 17429, 45368, 7400
## $ `2014`  <dbl> 333, 1352, 16539, 20207, 16765, 43862, 7664
## $ `2015`  <dbl> 352, 1438, 16931, 20270, 15125, 44005, 7332
## $ `2016`  <dbl> 335, 1438, 15500, 20847, 12990, 44279, 6327

The compstat dataset currently has 7 observations and 18 variables. In order to tidy the data we use the gather feature in order to view the offenses, year, and totals.

compstat <- gather(compstat,"YEAR","TOTALS",2:18)
dim(compstat)

## [1] 119   3

After using gather we arrive at 3 variables and 119 observations making it easier to conduct analysis.

New York City Compstat Data (Analyze)

First it will be interesting to look at the total number of crimes by year. We use the piping to group the data by year and then add the seven major crimes and create a bar chart.

compstat %>% group_by(YEAR) %>% summarise(CRIMES = sum(TOTALS)) %>%
    ggplot(mapping = aes(x = YEAR, y = CRIMES)) +
               geom_bar(stat="identity", fill = "#6C9552")

Based on the graph we can see that overall crime in New York has decreased significantly since 2000. However, crime started to increase slightly in 2012 & 2013 and dropped to its lowest point in 2016.

Create a dataset that will filter on the criminal offense of murder and view the years that recorded the highest values

compstat_murder <- filter(compstat,OFFENSE =="MURDER & NON-NEGL. MANSLAUGHTER") 

compstat_murder %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(5)

## Selecting by TOTALS

## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2002     587
## 2 2006     596
## 3 2003     597
## 4 2001     649
## 5 2000     673

In 2000 New York City recorded 673 homicides making it the deadliest year in the last decade. Using kable we will list he entire list.

kable(compstat_murder[2:3], format = "html", caption = "New York City Homicide Totals (2000 - 2016)")

New York City Homicide Totals (2000 - 2016)
YEAR	TOTALS
2000	673
2001	649
2002	587
2003	597
2004	570
2005	539
2006	596
2007	496
2008	523
2009	471
2010	536
2011	515
2012	419
2013	335
2014	333
2015	352
2016	335

Next we will calculate an estimated murder rate for each year in compstat_murder based on census data retrieved from https://en.wikipedia.org/wiki/Demographics_of_New_York_City. In order to do so we create a vector entitled population which will store the population for 2000 and 2010 and use the mean value of 8,091,711.

Using mutate we then create Per100 which divides the estimated population by 100000 and then create murder_rate that will divide TOTALS by Per100

population <- c(8008288,8175133)
mean(population)

## [1] 8091711

compstat_murder %>% mutate(Per100 = 8091711/100000, MurderRate = TOTALS/Per100) %>% arrange(desc(MurderRate)) %>% select(YEAR, TOTALS, MurderRate)

## # A tibble: 17 x 3
##    YEAR  TOTALS MurderRate
##    <chr>  <dbl>      <dbl>
##  1 2000     673       8.32
##  2 2001     649       8.02
##  3 2003     597       7.38
##  4 2006     596       7.37
##  5 2002     587       7.25
##  6 2004     570       7.04
##  7 2005     539       6.66
##  8 2010     536       6.62
##  9 2008     523       6.46
## 10 2011     515       6.36
## 11 2007     496       6.13
## 12 2009     471       5.82
## 13 2012     419       5.18
## 14 2015     352       4.35
## 15 2013     335       4.14
## 16 2016     335       4.14
## 17 2014     333       4.12

In unlisted analysis we saw that 2000 reported the largest amount of criminal activity for each offense with the exception of rape.

Rape

compstat_rape <- filter(compstat,OFFENSE =="RAPE") 

compstat_rape %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(5)

## Selecting by TOTALS

## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2004    1905
## 2 2001    1981
## 3 2000    2068
## 4 2003    2070
## 5 2002    2144

2002 led the city in the numbers of rapes with 2,144.

kable(compstat_rape[2:3], format = "html", caption = "New York City Rape Totals (2000 - 2016)")

New York City Rape Totals (2000 - 2016)
YEAR	TOTALS
2000	2068
2001	1981
2002	2144
2003	2070
2004	1905
2005	1858
2006	1525
2007	1351
2008	1299
2009	1205
2010	1373
2011	1420
2012	1445
2013	1378
2014	1352
2015	1438
2016	1438

To reduce redundancy we will look at the years that reported the least amount of robberies, felony assualts, and grand larceny by adding a (-) negative sign to the top_n feature.

Robbery

compstat_robbery <- filter(compstat,OFFENSE =="ROBBERY") 

compstat_robbery %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)

## Selecting by TOTALS

## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2016   15500
## 2 2014   16539
## 3 2015   16931
## 4 2009   18601
## 5 2013   19128

2016 recorded the least amount of robberies with 15,500.

kable(compstat_robbery[2:3], format = "html", caption = "New York City Robbery Totals (2000 - 2016)")

New York City Robbery Totals (2000 - 2016)
YEAR	TOTALS
2000	32562
2001	28202
2002	27229
2003	25989
2004	24373
2005	24722
2006	23739
2007	21809
2008	22401
2009	18601
2010	19486
2011	19717
2012	20144
2013	19128
2014	16539
2015	16931
2016	15500

Felony Assualt

compstat_fa <- filter(compstat,OFFENSE =="FELONY ASSAULT") 

compstat_fa %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)

## Selecting by TOTALS

## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2008   16284
## 2 2009   16773
## 3 2010   16956
## 4 2006   17309
## 5 2007   17493

2008 recorded the least amount of felony assaults with 16,284.

kable(compstat_fa[2:3], format = "html", caption = "New York City Felony Assualt Totals (2000 - 2016)")

New York City Felony Assualt Totals (2000 - 2016)
YEAR	TOTALS
2000	25924
2001	23453
2002	21147
2003	19139
2004	18622
2005	17750
2006	17309
2007	17493
2008	16284
2009	16773
2010	16956
2011	18482
2012	19381
2013	20297
2014	20207
2015	20270
2016	20847

Grand Larceny

compstat_gl <- filter(compstat,OFFENSE =="GRAND LARCENY") 

compstat_gl %>% arrange(TOTALS) %>% select(YEAR,TOTALS) %>% top_n(-5)

## Selecting by TOTALS

## # A tibble: 5 x 2
##   YEAR  TOTALS
##   <chr>  <dbl>
## 1 2010   37835
## 2 2011   38501
## 3 2009   39580
## 4 2012   42497
## 5 2014   43862

2010 recorded the least amount of grand larceny with 37,835.

kable(compstat_rape[2:3], format = "html", caption = "New York City Grand Larceny Totals (2000 - 2016)")

New York City Grand Larceny Totals (2000 - 2016)
YEAR	TOTALS
2000	2068
2001	1981
2002	2144
2003	2070
2004	1905
2005	1858
2006	1525
2007	1351
2008	1299
2009	1205
2010	1373
2011	1420
2012	1445
2013	1378
2014	1352
2015	1438
2016	1438

Burglary and Grand Larceny Motor Vehicles were omitted because the both followed the trend of 2000 recording the lowest numbers while 2016 recorded the most.

New York City Compstat Data (outcome)

Based on the analysis we can say with confidence that crime has decreased in New York City within the past decade.

United States Census Bureau (Import and Tidy)

Import the data and utilize the skip feature in order to arrive at the first headers with data. We also noticed that the headers all have spaces between the names so we create a vector that will hold the names and apply.

names <-c("FIPS_StateCode","StateDescr","FIPS_CountyCode",
          "CtyDescription","EntEmploySize","NumFirms",
          "NumEstablishments","Employment","ERF",
          "ENF","AnnualPayroll","APNG","X13","X14","X15")
uscb <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/county_totals_2015.csv", skip = 5, col_names = names)

## Parsed with column specification:
## cols(
##   FIPS_StateCode = col_character(),
##   StateDescr = col_character(),
##   FIPS_CountyCode = col_character(),
##   CtyDescription = col_character(),
##   EntEmploySize = col_character(),
##   NumFirms = col_character(),
##   NumEstablishments = col_character(),
##   Employment = col_character(),
##   ERF = col_character(),
##   ENF = col_character(),
##   AnnualPayroll = col_character(),
##   APNG = col_character(),
##   X13 = col_character(),
##   X14 = col_character(),
##   X15 = col_character()
## )

First step is to look at the dataset using the View feature

View(uscb)

After viewing the dataset we can see that rows 1 & 2 have NA values and can be removed.

uscb <- uscb[-c(1:2),]

Next we take a look at the data using the glimpse function

glimpse(uscb)

## Observations: 15,855
## Variables: 15
## $ FIPS_StateCode    <chr> NA, "01", "01", "01", "01", "01", "01", "01"...
## $ StateDescr        <chr> NA, "Alabama", "Alabama", "Alabama", "Alabam...
## $ FIPS_CountyCode   <chr> NA, "001", "001", "001", "001", "001", "003"...
## $ CtyDescription    <chr> NA, "Autauga", "Autauga", "Autauga", "Autaug...
## $ EntEmploySize     <chr> NA, "01:  Total", "02:  <20", "03:  20-99", ...
## $ NumFirms          <chr> NA, "788", "526", "75", "35", "152", "4,431"...
## $ NumEstablishments <chr> NA, "844", "527", "81", "51", "185", "5,127"...
## $ Employment        <chr> NA, "10,454", "2,216", "2,143", "1,161", "4,...
## $ ERF               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ENF               <chr> NA, "G", "G", "G", "G", "G", "G", "G", "G", ...
## $ AnnualPayroll     <chr> NA, "321,433", "63,553", "57,490", "29,074",...
## $ APNG              <chr> NA, "G", "G", "G", "G", "G", "G", "G", "G", ...
## $ X13               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X14               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X15               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

The uscb dataset currently has 15 observations and 15854 variables. We will look at the city of North Carolina so we will first create a dataset entitled nc and view the dimensions using dim

nc <- filter(uscb,StateDescr=="North Carolina")
dim(nc)

## [1] 505  15

Next we will condense the amount of variables to look at while using select and choosing the following:

nc <- nc %>% select(CtyDescription,EntEmploySize,
              NumFirms,NumEstablishments,Employment,AnnualPayroll)

To get an idea on how many counties we have our next step is to use the unique feature:

unique(nc$CtyDescription)

##   [1] "Alamance"                 "Alexander"               
##   [3] "Alleghany"                "Anson"                   
##   [5] "Ashe"                     "Avery"                   
##   [7] "Beaufort"                 "Bertie"                  
##   [9] "Bladen"                   "Brunswick"               
##  [11] "Buncombe"                 "Burke"                   
##  [13] "Cabarrus"                 "Caldwell"                
##  [15] "Camden"                   "Carteret"                
##  [17] "Caswell"                  "Catawba"                 
##  [19] "Chatham"                  "Cherokee"                
##  [21] "Chowan"                   "Clay"                    
##  [23] "Cleveland"                "Columbus"                
##  [25] "Craven"                   "Cumberland"              
##  [27] "Currituck"                "Dare"                    
##  [29] "Davidson"                 "Davie"                   
##  [31] "Duplin"                   "Durham"                  
##  [33] "Edgecombe"                "Forsyth"                 
##  [35] "Franklin"                 "Gaston"                  
##  [37] "Gates"                    "Graham"                  
##  [39] "Granville"                "Greene"                  
##  [41] "Guilford"                 "Halifax"                 
##  [43] "Harnett"                  "Haywood"                 
##  [45] "Henderson"                "Hertford"                
##  [47] "Hoke"                     "Hyde"                    
##  [49] "Iredell"                  "Jackson"                 
##  [51] "Johnston"                 "Jones"                   
##  [53] "Lee"                      "Lenoir"                  
##  [55] "Lincoln"                  "McDowell"                
##  [57] "Macon"                    "Madison"                 
##  [59] "Martin"                   "Mecklenburg"             
##  [61] "Mitchell"                 "Montgomery"              
##  [63] "Moore"                    "Nash"                    
##  [65] "New Hanover"              "Northampton"             
##  [67] "Onslow"                   "Orange"                  
##  [69] "Pamlico"                  "Pasquotank"              
##  [71] "Pender"                   "Perquimans"              
##  [73] "Person"                   "Pitt"                    
##  [75] "Polk"                     "Randolph"                
##  [77] "Richmond"                 "Robeson"                 
##  [79] "Rockingham"               "Rowan"                   
##  [81] "Rutherford"               "Sampson"                 
##  [83] "Scotland"                 "Stanly"                  
##  [85] "Stokes"                   "Surry"                   
##  [87] "Swain"                    "Transylvania"            
##  [89] "Tyrrell"                  "Union"                   
##  [91] "Vance"                    "Wake"                    
##  [93] "Warren"                   "Washington"              
##  [95] "Watauga"                  "Wayne"                   
##  [97] "Wilkes"                   "Wilson"                  
##  [99] "Yadkin"                   "Yancey"                  
## [101] "Statewide North Carolina"

nc has 101 unique counties in the dataset

Before we tidy the data we will first convert NumFirms, Employment, AnnualPayroll to a numeric variable in order to delve deeper in the dataset. We experienced warnings known as NA coercion due to commas in the character variables so we used the gsub function in order to remove the commas.

nc$NumFirms <- as.numeric(gsub(",","",nc$NumFirms))
nc$NumEstablishments <- as.numeric(gsub(",","",nc$NumEstablishments))
nc$Employment <- as.numeric(gsub(",","",nc$Employment))
nc$AnnualPayroll <- as.numeric(gsub(",","",nc$AnnualPayroll))

Identify counties that have the highest annual payroll based on the EntEmploySize being equal to 01: Total

nc %>% filter(EntEmploySize == "01:  Total") %>% arrange(AnnualPayroll) %>%  select(CtyDescription,NumFirms,NumEstablishments,Employment,AnnualPayroll) %>% top_n(5)

## Selecting by AnnualPayroll

## # A tibble: 5 x 5
##   CtyDescription NumFirms NumEstablishments Employment AnnualPayroll
##   <chr>             <dbl>             <dbl>      <dbl>         <dbl>
## 1 Forsyth            6940              8366     161704       7870272
## 2 Durham             5949              7103     176821      11375527
## 3 Guilford          10956             13246     258166      11472846
## 4 Wake              22708             27405     431040      22263004
## 5 Mecklenburg       23793             30039     588844      35515013

Create a dataset entitled nc_top5 based on the previous findings.

nc_top5 <- nc %>% filter(CtyDescription =="Forsyth" |
           CtyDescription =="Durham" |
           CtyDescription =="Guilford" | 
           CtyDescription =="Wake" |
           CtyDescription =="Mecklenburg")
kable(nc_top5, format = "html")

CtyDescription	EntEmploySize	NumFirms	NumEstablishments	Employment	AnnualPayroll
Durham	01: Total	5949	7103	176821	11375527
Durham	02: <20	4301	4313	17651	757712
Durham	03: 20-99	684	776	22441	1032481
Durham	04: 100-499	245	345	16495	951624
Durham	05: 500+	719	1669	120234	8633710
Forsyth	01: Total	6940	8366	161704	7870272
Forsyth	02: <20	5336	5364	21748	819344
Forsyth	03: 20-99	717	822	24380	924506
Forsyth	04: 100-499	244	400	18075	657531
Forsyth	05: 500+	643	1780	97501	5468891
Guilford	01: Total	10956	13246	258166	11472846
Guilford	02: <20	8404	8452	34113	1266283
Guilford	03: 20-99	1092	1223	35662	1368081
Guilford	04: 100-499	491	769	38941	1619685
Guilford	05: 500+	969	2802	149450	7218797
Mecklenburg	01: Total	23793	30039	588844	35515013
Mecklenburg	02: <20	18567	18669	69544	3098478
Mecklenburg	03: 20-99	2347	2625	79335	3486544
Mecklenburg	04: 100-499	998	1553	70456	3414091
Mecklenburg	05: 500+	1881	7192	369509	25515900
Wake	01: Total	22708	27405	431040	22263004
Wake	02: <20	18192	18271	69640	2901460
Wake	03: 20-99	2291	2615	77462	3204235
Wake	04: 100-499	774	1298	60199	3087631
Wake	05: 500+	1451	5221	223739	13069678

United States Census Bureau (Analysis)

Create a variable entitled FirmEmpRatio and EstEmpRario to obtain a ratios in order to eliminate size disparity.

nc_top5 <- nc_top5 %>% mutate(FirmEmpRatio = NumFirms/Employment, EstEmpRatio = NumEstablishments/Employment) %>% select(CtyDescription,EntEmploySize, FirmEmpRatio,EstEmpRatio, AnnualPayroll)
kable(nc_top5, format = "html")

CtyDescription	EntEmploySize	FirmEmpRatio	EstEmpRatio	AnnualPayroll
Durham	01: Total	0.0336442	0.0401706	11375527
Durham	02: <20	0.2436689	0.2443488	757712
Durham	03: 20-99	0.0304799	0.0345796	1032481
Durham	04: 100-499	0.0148530	0.0209154	951624
Durham	05: 500+	0.0059800	0.0138813	8633710
Forsyth	01: Total	0.0429179	0.0517365	7870272
Forsyth	02: <20	0.2453559	0.2466434	819344
Forsyth	03: 20-99	0.0294094	0.0337162	924506
Forsyth	04: 100-499	0.0134993	0.0221300	657531
Forsyth	05: 500+	0.0065948	0.0182562	5468891
Guilford	01: Total	0.0424378	0.0513081	11472846
Guilford	02: <20	0.2463577	0.2477648	1266283
Guilford	03: 20-99	0.0306208	0.0342942	1368081
Guilford	04: 100-499	0.0126088	0.0197478	1619685
Guilford	05: 500+	0.0064838	0.0187487	7218797
Mecklenburg	01: Total	0.0404063	0.0510135	35515013
Mecklenburg	02: <20	0.2669821	0.2684488	3098478
Mecklenburg	03: 20-99	0.0295834	0.0330875	3486544
Mecklenburg	04: 100-499	0.0141649	0.0220421	3414091
Mecklenburg	05: 500+	0.0050905	0.0194637	25515900
Wake	01: Total	0.0526819	0.0635788	22263004
Wake	02: <20	0.2612292	0.2623636	2901460
Wake	03: 20-99	0.0295758	0.0337585	3204235
Wake	04: 100-499	0.0128574	0.0215618	3087631
Wake	05: 500+	0.0064852	0.0233352	13069678

View the counties with the highest total FirmEmpRatio based on the 01: Total data element from the EntEmploySize variable:

nc_top5 %>% filter(EntEmploySize == "01:  Total") %>% arrange(desc(FirmEmpRatio)) %>% select(CtyDescription,FirmEmpRatio,AnnualPayroll)

## # A tibble: 5 x 3
##   CtyDescription FirmEmpRatio AnnualPayroll
##   <chr>                 <dbl>         <dbl>
## 1 Wake                 0.0527      22263004
## 2 Forsyth              0.0429       7870272
## 3 Guilford             0.0424      11472846
## 4 Mecklenburg          0.0404      35515013
## 5 Durham               0.0336      11375527

Based on these findings we see that Wake county has highest FirmEmpRatio but that variable doesn’t correlate with AnnualPayroll

Next, we view the counties with the highest total EstEmpRatio based on the 01: Total data element from the EntEmploySize variable:

nc_top5 %>% filter(EntEmploySize == "01:  Total") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)

## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Wake                0.0636      22263004
## 2 Forsyth             0.0517       7870272
## 3 Guilford            0.0513      11472846
## 4 Mecklenburg         0.0510      35515013
## 5 Durham              0.0402      11375527

Again Wake County is the leader in regards to EstEmpRatio but we also see that doesn’t correlate with AnnualPayroll

We will now take a look each level of the EntEmploySize against the EstEmpRatio:

Employment Size Less Than 20

nc_top5 %>% filter(EntEmploySize == "02:  <20") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#010134")

nc_top5 %>% filter(EntEmploySize == "02:  <20") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)

## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Mecklenburg          0.268       3098478
## 2 Wake                 0.262       2901460
## 3 Guilford             0.248       1266283
## 4 Forsyth              0.247        819344
## 5 Durham               0.244        757712

For employment size less than 20 Meckenburg couny has a slight lead over Wake County with Durham county at the bottom.

Employment Size between 20 and 99

nc_top5 %>% filter(EntEmploySize == "03:  20-99") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#afd965")

nc_top5 %>% filter(EntEmploySize == "03:  20-99") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)

## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Durham              0.0346       1032481
## 2 Guilford            0.0343       1368081
## 3 Wake                0.0338       3204235
## 4 Forsyth             0.0337        924506
## 5 Mecklenburg         0.0331       3486544

For employment size between 20 and 99 Durham couny has a slight lead over Guilford County with Mecklenburg county at the bottom.

Employment Size between 100 and 499

nc_top5 %>% filter(EntEmploySize == "04:  100-499") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#E7DFC3")

nc_top5 %>% filter(EntEmploySize == "04:  100-499") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)

## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Forsyth             0.0221        657531
## 2 Mecklenburg         0.0220       3414091
## 3 Wake                0.0216       3087631
## 4 Durham              0.0209        951624
## 5 Guilford            0.0197       1619685

For employment size between 100 and 499 Forsyth couny has a slight lead over Mecklenburg County with Guilford county at the bottom.

Employment Size of 500 and higher

nc_top5 %>% filter(EntEmploySize == "05:  500+") %>% 
    ggplot(mapping = aes(x = CtyDescription, y = EstEmpRatio))+
    geom_bar(stat="identity", fill = "#616465")

nc_top5 %>% filter(EntEmploySize == "05:  500+") %>% arrange(desc(EstEmpRatio)) %>% select(CtyDescription,EstEmpRatio,AnnualPayroll)

## # A tibble: 5 x 3
##   CtyDescription EstEmpRatio AnnualPayroll
##   <chr>                <dbl>         <dbl>
## 1 Wake                0.0233      13069678
## 2 Mecklenburg         0.0195      25515900
## 3 Guilford            0.0187       7218797
## 4 Forsyth             0.0183       5468891
## 5 Durham              0.0139       8633710

For employment size of 500 and over Wake couny has a sizable lead over Mecklenburg County with Durham county at the bottom.

United States Census Bureau (outcome)

Based on the analysis we can say that out of the top five counties in the state of North Carolina with the highest AnnualPayroll Wake County has the greatest variability in employment size. It was never had the bottom of any category and remained near the top and had the highest position of employment size of 500 and higher.

Stanford Mass Shootings in America (MSA) (Import and Tidy)

Import the data and utilize the skip feature in order to arrive at the first headers with data.

mass_shootings <- read_csv("https://raw.githubusercontent.com/LilesB/DATA-607---Project-2/master/mass_shooting_events_stanford_msa_release_06142016.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   CaseID = col_integer(),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   `Number of Victim Fatalities` = col_integer(),
##   `Total Number of Fatalities` = col_integer(),
##   `Number of Victims Injured` = col_integer(),
##   `Total Number of Victims` = col_integer()
## )

## See spec(...) for full column specifications.

First step is to look at the dataset using the View feature

View(mass_shootings)

Next we will use the glimpse function to dig deeper in the mass_shootings datset

glimpse(mass_shootings)

## Observations: 347
## Variables: 48
## $ CaseID                                 <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ Title                                  <chr> "University of Texas at...
## $ Location                               <chr> "Austin, Texas", "Mesa,...
## $ City                                   <chr> "Austin", "Mesa", "New ...
## $ State                                  <chr> "Texas", "Arizona", "Lo...
## $ Latitude                               <dbl> 30.19889, 33.42269, 30....
## $ Longitude                              <dbl> -97.84416, -111.81632, ...
## $ `Number of Victim Fatalities`          <int> 16, 5, 9, 1, 3, 1, 7, 2...
## $ `Total Number of Fatalities`           <int> 17, 5, 10, 1, 3, 1, 7, ...
## $ `Number of Victims Injured`            <int> 32, 1, 13, 3, 7, 7, 2, ...
## $ `Total Number of Victims`              <int> 48, 6, 22, 4, 10, 8, 9,...
## $ Description                            <chr> "On August 1, 1966, a 2...
## $ Date                                   <chr> "8/1/1966", "11/12/1966...
## $ `Day of Week`                          <chr> "Monday", "Saturday", "...
## $ `Date - Detailed`                      <chr> "Monday, August 01, 196...
## $ `Shooter Name`                         <chr> "Charles S. Whitman", "...
## $ `Shooter Age(s)`                       <chr> "25", "18", "23", "14",...
## $ `Average Shooter Age`                  <chr> "25", "18", "23", "14",...
## $ `Shooter Sex`                          <chr> "Male", "Male", "Male",...
## $ `Shooter Race`                         <chr> "White American or Euro...
## $ `Type of Gun - Detailed`               <chr> "Remington 700 ADL (6mm...
## $ `Type of Gun - General`                <chr> "Multiple guns", "Handg...
## $ `Number of Shotguns`                   <chr> "1", "0", "0", "0", "1"...
## $ `Number of Rifles`                     <chr> "3", "0", "1", "0", "1"...
## $ `Number of Handguns`                   <chr> "3", "1", "1", "2", "0"...
## $ `Total Number of Guns`                 <chr> "7", "1", "2", "2", "2"...
## $ `Number of Automatic Guns`             <chr> "0", "0", "0", "0", "0"...
## $ `Number of Semi-Automatic Guns`        <chr> "1", "0", "1", "0", "0"...
## $ `Fate of Shooter at the scene`         <chr> "Deceased", "Arrested",...
## $ `Shooter's Cause of Death`             <chr> "Killed", "Not applicab...
## $ `School Related`                       <chr> "Yes", "Yes", "No", "Ye...
## $ `Place Type`                           <chr> "College/University/Adu...
## $ `Relationship to Incident Location`    <chr> "Place of schooling", "...
## $ `Targeted Victim/s - Detailed`         <chr> "Family, students, teac...
## $ `Targeted Victim/s - General`          <chr> "General public", "Stud...
## $ `Possible Motive - Detailed`           <chr> "A task force of profes...
## $ `Possible Motive - General`            <chr> "Mental illness", "Ment...
## $ `History of Mental Illness - Detailed` <chr> "Shooter had visited  a...
## $ `History of Mental Illness - General`  <chr> "Yes", "Yes", "Yes", "Y...
## $ `Data Source 1`                        <chr> "http://en.wikipedia.or...
## $ `Data Source 2`                        <chr> "Lexington Examiner (KY...
## $ `Data Source 3`                        <chr> "http://news.google.com...
## $ `Data Source 4`                        <chr> "http://news.google.com...
## $ `Data Source 5`                        <chr> "http://books.google.co...
## $ `Data Source 6`                        <chr> NA, NA, NA, NA, NA, NA,...
## $ `Data Source 7`                        <chr> NA, NA, NA, NA, NA, NA,...
## $ `Military Experience`                  <chr> "Yes", "Unknown", "Unkn...
## $ Class                                  <chr> "SPK", "MS", "SPK", "MS...

The mass_shootings dataset is rather large like the uscb dataset but differs with the amount of options to pull data from the untidy set.

After first glance we will be explore school shootings but first we will conduct some preliminary data analysis to find a few other data elements.

School Shootings

First we will pull filter the mass_shootings dataset for school shootings and create a dataset entitled school:

names(mass_shootings)

##  [1] "CaseID"                              
##  [2] "Title"                               
##  [3] "Location"                            
##  [4] "City"                                
##  [5] "State"                               
##  [6] "Latitude"                            
##  [7] "Longitude"                           
##  [8] "Number of Victim Fatalities"         
##  [9] "Total Number of Fatalities"          
## [10] "Number of Victims Injured"           
## [11] "Total Number of Victims"             
## [12] "Description"                         
## [13] "Date"                                
## [14] "Day of Week"                         
## [15] "Date - Detailed"                     
## [16] "Shooter Name"                        
## [17] "Shooter Age(s)"                      
## [18] "Average Shooter Age"                 
## [19] "Shooter Sex"                         
## [20] "Shooter Race"                        
## [21] "Type of Gun - Detailed"              
## [22] "Type of Gun - General"               
## [23] "Number of Shotguns"                  
## [24] "Number of Rifles"                    
## [25] "Number of Handguns"                  
## [26] "Total Number of Guns"                
## [27] "Number of Automatic Guns"            
## [28] "Number of Semi-Automatic Guns"       
## [29] "Fate of Shooter at the scene"        
## [30] "Shooter's Cause of Death"            
## [31] "School Related"                      
## [32] "Place Type"                          
## [33] "Relationship to Incident Location"   
## [34] "Targeted Victim/s - Detailed"        
## [35] "Targeted Victim/s - General"         
## [36] "Possible Motive - Detailed"          
## [37] "Possible Motive - General"           
## [38] "History of Mental Illness - Detailed"
## [39] "History of Mental Illness - General" 
## [40] "Data Source 1"                       
## [41] "Data Source 2"                       
## [42] "Data Source 3"                       
## [43] "Data Source 4"                       
## [44] "Data Source 5"                       
## [45] "Data Source 6"                       
## [46] "Data Source 7"                       
## [47] "Military Experience"                 
## [48] "Class"

school <- mass_shootings %>% filter(`School Related`=="Yes")
dim(school)

## [1] 73 48

There are 73 instances where there were school shootings

We will now condense the school dataset in order to explore a few variables:

school <- school %>% select(`State`,`Total Number of Fatalities`,
                            `Total Number of Victims`,
                            `Day of Week`,
                            `Shooter Race`,
                            `Possible Motive - General`,
                            `Place Type`,
                            `Shooter Sex`)

Stanford Mass Shootings in America (MSA) (Analysis)

We will explore the variables to see if it makes sense to delve deeper or eliminate them from the data set excluding the number of fatalities,victims, and day of the week:

State

school %>% count(`State`) %>% arrange(desc(n))

## # A tibble: 33 x 2
##    State            n
##    <chr>        <int>
##  1 California      10
##  2 Ohio             5
##  3 Arizona          4
##  4 Illinois         4
##  5 Washington       4
##  6 Michigan         3
##  7 Nevada           3
##  8 New York         3
##  9 Oregon           3
## 10 Pennsylvania     3
## # ... with 23 more rows

California has the most school related shootings with 10 but there is limited variability therefore it we will not pursue any further analysis.

Shooter Race

school %>% count(`Shooter Race`) %>% arrange(desc(n))

## # A tibble: 7 x 2
##   `Shooter Race`                          n
##   <chr>                               <int>
## 1 White American or European American    42
## 2 Black American or African American     14
## 3 Asian American                          7
## 4 Some other race                         5
## 5 Native American or Alaska Native        2
## 6 Unknown                                 2
## 7 Two or more races                       1

Whites have the most school related shootings with 42 while Blacks come in behind with 14 so we will explore the total number of victims and fatalities.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Shooter Race`,
                           y = `Total Number of Victims`,
                           color = `Shooter Race`)) +
    theme(axis.text.x=element_blank())

#removes the text from the x-axis

We already knew from the count that Whites had the most school shootings, but it seems that Asian Americans logged in the most victims in a shooting.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Shooter Race`,
                           y = `Total Number of Fatalities`,
                           color = `Shooter Race`)) +
    theme(axis.text.x=element_blank())

Possible Motive - General

school %>% count(`Possible Motive - General`) %>% arrange(desc(n))

## # A tibble: 14 x 2
##    `Possible Motive - General`       n
##    <chr>                         <int>
##  1 Mental illness                   19
##  2 Multiple motives                 10
##  3 Unknown                           8
##  4 Social dispute                    7
##  5 Failure                           6
##  6 Expulsion/Suspension              5
##  7 Harassment                        5
##  8 Terminated/Denied/Reprimanded     4
##  9 Rejection                         3
## 10 Domestic dispute                  2
## 11 Legal dispute                     1
## 12 Political/Religious ideals        1
## 13 Race                              1
## 14 Social Dispute                    1

Mental illness leads the numbers with 19, and there seems to be variability here once we get past mental illness, we will conduct the same analysis from prior which surround the number of victims and fatalities.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Possible Motive - General`,
                           y = `Total Number of Victims`,
                           color = `Possible Motive - General`)) +
    theme(axis.text.x=element_blank())

#removes the text from the x-axis

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Possible Motive - General`,
                           y = `Total Number of Fatalities`,
                           color = `Possible Motive - General`)) +
    theme(axis.text.x=element_blank())

#removes the text from the x-axis

Although mental illness seems like the driving force behind school shootings we see that there are quite a few of these acts that are unknown. The unknown factor also claimed the second most fatalities.

Place Type

school %>% count(`Place Type`) %>% arrange(desc(n))

## # A tibble: 9 x 2
##   `Place Type`                                                  n
##   <chr>                                                     <int>
## 1 College/University/Adult education                           27
## 2 Secondary school                                             21
## 3 Primary school                                               16
## 4 Residential home/Neighborhood                                 4
## 5 Place of worship                                              1
## 6 Restaurant/Cafe?                                              1
## 7 "Retail/ Wholesale/Services facility\rand Primary school"     1
## 8 Secondary School                                              1
## 9 Street/Highway                                                1

The top three places were Colleges, Secondary Schools, and Primary Schools with counts of 27, 21, and 16 respectively. This data element will be excluded.

Shooter Sex

school %>% count(`Shooter Sex`) %>% arrange(desc(n))

## # A tibble: 2 x 2
##   `Shooter Sex`     n
##   <chr>         <int>
## 1 Male             70
## 2 Female            3

The disparities are huge with 70 males and 3 females; data will be excluded.

Day of Week

school %>% count(`Day of Week`) %>% arrange(desc(n))

## # A tibble: 7 x 2
##   `Day of Week`     n
##   <chr>         <int>
## 1 Thursday         19
## 2 Monday           18
## 3 Friday           15
## 4 Tuesday          10
## 5 Wednesday         6
## 6 Sunday            3
## 7 Saturday          2

What stands out the most is the fact there were 5 school shootings that occurred on the weekend; 2 and 3 respectively. Out of the school/work week it seems that Wednesday is the safest day with only 6 school shootings while Thursday and Monday are the most violent.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Day of Week`,
                           y = `Total Number of Fatalities`,
                           color = `Day of Week`)) +
    theme(axis.text.x=element_blank())

#removes the text from the x-axis

Even though Thursday logged the most school shootings we see that Monday has registered the most victims.

ggplot(data = school) + 
  geom_point(mapping = aes(x = `Day of Week`,
                           y = `Total Number of Fatalities`,
                           color = `Day of Week`)) +
    theme(axis.text.x=element_blank())

With 15 school shootings Friday took the second spot in the number of fatalities.

Stanford Mass Shootings in America (MSA) (outcome)

The issue with school shootings has become a hot issue once again in the US, and what I found most interesting was the days of the week these horrific acts occurred.

DATA 607 - Project 2

Brian K. Liles

March 10, 2018

Goal

Data Sets

Load packages

New York City Compstat Data (Import and Tidy)

New York City Compstat Data (Analyze)

Rape

Robbery

Felony Assualt

Grand Larceny

New York City Compstat Data (outcome)

United States Census Bureau (Import and Tidy)

United States Census Bureau (Analysis)

Employment Size Less Than 20

Employment Size between 20 and 99

Employment Size between 100 and 499

Employment Size of 500 and higher

United States Census Bureau (outcome)

Stanford Mass Shootings in America (MSA) (Import and Tidy)

School Shootings

Stanford Mass Shootings in America (MSA) (Analysis)

State

Shooter Race

Possible Motive - General

Place Type

Shooter Sex

Day of Week

Stanford Mass Shootings in America (MSA) (outcome)