Introduction

Below is a analysis of different data sets provided by classmates. This analysis will attempt to answer specific questions from the person who provided the data, or a question that seems appropriate during data tidying.

Libraries needed for analysis.

library(sf)
library(DBI)
library(RCurl)
library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)

Data 1 - NYC Restaurant Inspection

Get Dataset

The first data set that will be analyzed are all of the NYC Restaurant Inspection results from 2019 to 10/1/2020. The data will be pulled from a csv file that is on GitHub.

#get URL and import data from GitHub
site <- "https://raw.githubusercontent.com/ltcancel/Project2_Data607_F20/master/DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
df <- read_csv(url(site))

head(df)
## # A tibble: 6 x 13
##    CAMIS DBA   BORO  ZIPCODE `CUISINE DESCRI~ `INSPECTION DAT~
##    <dbl> <chr> <chr>   <dbl> <chr>            <chr>           
## 1 5.01e7 PUER~ Quee~   11385 Latin (Cuban, D~ 9/29/2018       
## 2 5.01e7 LIKK~ Broo~   11201 Chicken          4/11/2019       
## 3 5.00e7 PRES~ Quee~   11374 Jewish/Kosher    5/16/2016       
## 4 5.01e7 TONY~ Bronx   10468 Pizza            8/22/2019       
## 5 4.08e7 GRAD~ Manh~   10011 Italian          11/12/2019      
## 6 5.00e7 CASP~ Broo~   11235 Russian          5/31/2018       
## # ... with 7 more variables: `VIOLATION CODE` <chr>, `VIOLATION
## #   DESCRIPTION` <chr>, `CRITICAL FLAG` <chr>, SCORE <dbl>, GRADE <chr>,
## #   Latitude <dbl>, Longitude <dbl>

Tidy the data/Analysis

Q. How many violations were issued for restaurants in each borough for each year?

#separate date into its own columns
df <- df %>%
  separate('INSPECTION DATE',sep="/",into = c("month","day","year"),remove = FALSE)

#not every row has a value for BORO so I want to exclude this data. I also want to include anyone who does not have a violation description
df1Filter <- df %>% group_by(year) %>% count(BORO) %>% filter(BORO != 0 & !is.na('VIOLATION DESCRIPTION'))

head(df1Filter)
## # A tibble: 6 x 3
## # Groups:   year [2]
##   year  BORO              n
##   <chr> <chr>         <int>
## 1 2016  Bronx          1791
## 2 2016  Brooklyn       4693
## 3 2016  Manhattan      7208
## 4 2016  Queens         3380
## 5 2016  Staten Island   777
## 6 2017  Bronx          7008

After plotting the data, we can see that the number of violations has increased over the years. Manhattan consistently has the greatest number of violations. Brooklyn and Queens are very close, and Staten Island has the least number of violations. I think this data needs further analysis to see the percentage of restaurants with violations and the percentage of restaurants with an A grade. Manhattan may have a larger number of restaurants making their violations seem inflated when compared to the other boroughs.

ggplot(df1Filter) +
  geom_bar(mapping = aes(x=BORO, y = n, fill = BORO), stat = 'identity') +
  facet_wrap(~year) +
  coord_flip() +
  labs(title = "Resturant Insepction Violations",x="Borough",y="Violations")

Data 2 - Police Killings Since 2015

Introduction

This dataset is of all Police killings from January 1, 2015 to September 27, 2020. It was collected by The Washington Post and can be found on their GitHub.

Get Dataset

First we pull the most recent csv file and create a dataframe.

#get data file
site2 <- getURL("https://raw.githubusercontent.com/ltcancel/Project2_Data607_F20/master/fatal-police-shootings-data.csv")

df2 <- read.csv(text = site2,sep = ",",header = TRUE)
head(df2)
##   id               name       date  manner_of_death      armed age gender
## 1  3         Tim Elliot 2015-01-02             shot        gun  53      M
## 2  4   Lewis Lee Lembke 2015-01-02             shot        gun  47      M
## 3  5 John Paul Quintero 2015-01-03 shot and Tasered    unarmed  23      M
## 4  8    Matthew Hoffman 2015-01-04             shot toy weapon  32      M
## 5  9  Michael Rodriguez 2015-01-04             shot   nail gun  39      M
## 6 11  Kenneth Joe Brown 2015-01-04             shot        gun  18      M
##   race          city state signs_of_mental_illness threat_level
## 1    A       Shelton    WA                    True       attack
## 2    W         Aloha    OR                   False       attack
## 3    H       Wichita    KS                   False        other
## 4    W San Francisco    CA                    True       attack
## 5    H         Evans    CO                   False       attack
## 6    W       Guthrie    OK                   False       attack
##          flee body_camera longitude latitude is_geocoding_exact
## 1 Not fleeing       False  -123.122   47.247               True
## 2 Not fleeing       False  -122.892   45.487               True
## 3 Not fleeing       False   -97.281   37.695               True
## 4 Not fleeing       False  -122.422   37.763               True
## 5 Not fleeing       False  -104.692   40.384               True
## 6 Not fleeing       False   -97.423   35.877               True

Tidy the data/Analysis

Q: Which states had the most deadly shootings from 2015 - 2020

I tidied the data by splitting the date into 3 columns so I can do an analysis by year. Since this is such a large dataset, I took a subset of data that only includes the 5 states with the highest number of killings for each year.

#Split date into year, month, and day columns so we can makes graphs based on year
df2<- df2 %>%
  separate(date,sep="-",into = c("year","month","day"),remove = FALSE)
head(df2)
##   id               name       date year month day  manner_of_death
## 1  3         Tim Elliot 2015-01-02 2015    01  02             shot
## 2  4   Lewis Lee Lembke 2015-01-02 2015    01  02             shot
## 3  5 John Paul Quintero 2015-01-03 2015    01  03 shot and Tasered
## 4  8    Matthew Hoffman 2015-01-04 2015    01  04             shot
## 5  9  Michael Rodriguez 2015-01-04 2015    01  04             shot
## 6 11  Kenneth Joe Brown 2015-01-04 2015    01  04             shot
##        armed age gender race          city state signs_of_mental_illness
## 1        gun  53      M    A       Shelton    WA                    True
## 2        gun  47      M    W         Aloha    OR                   False
## 3    unarmed  23      M    H       Wichita    KS                   False
## 4 toy weapon  32      M    W San Francisco    CA                    True
## 5   nail gun  39      M    H         Evans    CO                   False
## 6        gun  18      M    W       Guthrie    OK                   False
##   threat_level        flee body_camera longitude latitude
## 1       attack Not fleeing       False  -123.122   47.247
## 2       attack Not fleeing       False  -122.892   45.487
## 3        other Not fleeing       False   -97.281   37.695
## 4       attack Not fleeing       False  -122.422   37.763
## 5       attack Not fleeing       False  -104.692   40.384
## 6       attack Not fleeing       False   -97.423   35.877
##   is_geocoding_exact
## 1               True
## 2               True
## 3               True
## 4               True
## 5               True
## 6               True
#top 5 states with the highest number of deaths per year
top5 <- df2 %>% group_by(year) %>% count(state) %>% top_n(5)
## Selecting by n
head(top5)
## # A tibble: 6 x 3
## # Groups:   year [2]
##   year  state     n
##   <chr> <fct> <int>
## 1 2015  AZ       42
## 2 2015  CA      190
## 3 2015  FL       61
## 4 2015  OK       32
## 5 2015  TX      100
## 6 2016  AZ       50

In the bar graph below, we can see that some states are on the top of the list for all 6 years. California has the greatest number of police killings each year, followed by Texas, and then Florida. 2020 is the only year where Florida has slightly more killings than Texas. Arizona has dropped from the top 5 in 2019.

#Barplot using the top 5 states with highest number of deaths per year data
ggplot(top5) +
  geom_bar(mapping = aes(x = state, y = n, fill = state), stat = 'identity') +
  facet_wrap(year~.) +
  labs(title = "Top 5 States with Deadly Shootings")

Q: What is the average number of police killings by city?

#get average shootings per year
yearCount <- df2 %>% group_by(year) %>% count(state)


stateAvg <- yearCount %>% 
  group_by(state) %>%
  summarise(Average = mean(n)) %>%
  arrange(desc(Average))

head(stateAvg)
## # A tibble: 6 x 2
##   state Average
##   <fct>   <dbl>
## 1 CA      140. 
## 2 TX       84  
## 3 FL       62.3
## 4 AZ       43.7
## 5 CO       34.8
## 6 GA       32.7

This graph is a bit crowded with bars, but we can clearly see that California and Texas has the greatest average.

ggplot(stateAvg) +
  geom_bar(mapping = aes(x = state, y = Average, fill = state), stat = 'identity') +
  coord_flip() +
  scale_y_continuous(breaks = seq(0,150,25))

Q: How many of the victims showed signs of mental illness

Each year we can see a large gap in the number of killings where the victim showed signs of mental illness versus those who did not. Most were reported to not have shown signs of mental illness.

ggplot(df2) +
  geom_bar(mapping = aes(x=year,fill=signs_of_mental_illness), position = "dodge")

Since the count looks very similar year over year, I created a subset of states with the highest total number of killings. There isn’t much of a significant difference. Each of the states has a gap between victims who showed signs of mental illness versus those who did not.

df2Filter <- df2 %>%
  filter(state %in% c("AZ","CA","CO","FL","TX"))

ggplot(df2Filter) +
  geom_bar(mapping = aes(x=year, fill=signs_of_mental_illness), position = "dodge") +
  facet_wrap(~state) +
  coord_flip()