Below is a analysis of different data sets provided by classmates. This analysis will attempt to answer specific questions from the person who provided the data, or a question that seems appropriate during data tidying.
Libraries needed for analysis.
library(sf)
library(DBI)
library(RCurl)
library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
The first data set that will be analyzed are all of the NYC Restaurant Inspection results from 2019 to 10/1/2020. The data will be pulled from a csv file that is on GitHub.
#get URL and import data from GitHub
site <- "https://raw.githubusercontent.com/ltcancel/Project2_Data607_F20/master/DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
df <- read_csv(url(site))
head(df)
## # A tibble: 6 x 13
## CAMIS DBA BORO ZIPCODE `CUISINE DESCRI~ `INSPECTION DAT~
## <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 5.01e7 PUER~ Quee~ 11385 Latin (Cuban, D~ 9/29/2018
## 2 5.01e7 LIKK~ Broo~ 11201 Chicken 4/11/2019
## 3 5.00e7 PRES~ Quee~ 11374 Jewish/Kosher 5/16/2016
## 4 5.01e7 TONY~ Bronx 10468 Pizza 8/22/2019
## 5 4.08e7 GRAD~ Manh~ 10011 Italian 11/12/2019
## 6 5.00e7 CASP~ Broo~ 11235 Russian 5/31/2018
## # ... with 7 more variables: `VIOLATION CODE` <chr>, `VIOLATION
## # DESCRIPTION` <chr>, `CRITICAL FLAG` <chr>, SCORE <dbl>, GRADE <chr>,
## # Latitude <dbl>, Longitude <dbl>
Q. How many violations were issued for restaurants in each borough for each year?
#separate date into its own columns
df <- df %>%
separate('INSPECTION DATE',sep="/",into = c("month","day","year"),remove = FALSE)
#not every row has a value for BORO so I want to exclude this data. I also want to include anyone who does not have a violation description
df1Filter <- df %>% group_by(year) %>% count(BORO) %>% filter(BORO != 0 & !is.na('VIOLATION DESCRIPTION'))
head(df1Filter)
## # A tibble: 6 x 3
## # Groups: year [2]
## year BORO n
## <chr> <chr> <int>
## 1 2016 Bronx 1791
## 2 2016 Brooklyn 4693
## 3 2016 Manhattan 7208
## 4 2016 Queens 3380
## 5 2016 Staten Island 777
## 6 2017 Bronx 7008
After plotting the data, we can see that the number of violations has increased over the years. Manhattan consistently has the greatest number of violations. Brooklyn and Queens are very close, and Staten Island has the least number of violations. I think this data needs further analysis to see the percentage of restaurants with violations and the percentage of restaurants with an A grade. Manhattan may have a larger number of restaurants making their violations seem inflated when compared to the other boroughs.
ggplot(df1Filter) +
geom_bar(mapping = aes(x=BORO, y = n, fill = BORO), stat = 'identity') +
facet_wrap(~year) +
coord_flip() +
labs(title = "Resturant Insepction Violations",x="Borough",y="Violations")
This dataset is of all Police killings from January 1, 2015 to September 27, 2020. It was collected by The Washington Post and can be found on their GitHub.
First we pull the most recent csv file and create a dataframe.
#get data file
site2 <- getURL("https://raw.githubusercontent.com/ltcancel/Project2_Data607_F20/master/fatal-police-shootings-data.csv")
df2 <- read.csv(text = site2,sep = ",",header = TRUE)
head(df2)
## id name date manner_of_death armed age gender
## 1 3 Tim Elliot 2015-01-02 shot gun 53 M
## 2 4 Lewis Lee Lembke 2015-01-02 shot gun 47 M
## 3 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23 M
## 4 8 Matthew Hoffman 2015-01-04 shot toy weapon 32 M
## 5 9 Michael Rodriguez 2015-01-04 shot nail gun 39 M
## 6 11 Kenneth Joe Brown 2015-01-04 shot gun 18 M
## race city state signs_of_mental_illness threat_level
## 1 A Shelton WA True attack
## 2 W Aloha OR False attack
## 3 H Wichita KS False other
## 4 W San Francisco CA True attack
## 5 H Evans CO False attack
## 6 W Guthrie OK False attack
## flee body_camera longitude latitude is_geocoding_exact
## 1 Not fleeing False -123.122 47.247 True
## 2 Not fleeing False -122.892 45.487 True
## 3 Not fleeing False -97.281 37.695 True
## 4 Not fleeing False -122.422 37.763 True
## 5 Not fleeing False -104.692 40.384 True
## 6 Not fleeing False -97.423 35.877 True
Q: Which states had the most deadly shootings from 2015 - 2020
I tidied the data by splitting the date into 3 columns so I can do an analysis by year. Since this is such a large dataset, I took a subset of data that only includes the 5 states with the highest number of killings for each year.
#Split date into year, month, and day columns so we can makes graphs based on year
df2<- df2 %>%
separate(date,sep="-",into = c("year","month","day"),remove = FALSE)
head(df2)
## id name date year month day manner_of_death
## 1 3 Tim Elliot 2015-01-02 2015 01 02 shot
## 2 4 Lewis Lee Lembke 2015-01-02 2015 01 02 shot
## 3 5 John Paul Quintero 2015-01-03 2015 01 03 shot and Tasered
## 4 8 Matthew Hoffman 2015-01-04 2015 01 04 shot
## 5 9 Michael Rodriguez 2015-01-04 2015 01 04 shot
## 6 11 Kenneth Joe Brown 2015-01-04 2015 01 04 shot
## armed age gender race city state signs_of_mental_illness
## 1 gun 53 M A Shelton WA True
## 2 gun 47 M W Aloha OR False
## 3 unarmed 23 M H Wichita KS False
## 4 toy weapon 32 M W San Francisco CA True
## 5 nail gun 39 M H Evans CO False
## 6 gun 18 M W Guthrie OK False
## threat_level flee body_camera longitude latitude
## 1 attack Not fleeing False -123.122 47.247
## 2 attack Not fleeing False -122.892 45.487
## 3 other Not fleeing False -97.281 37.695
## 4 attack Not fleeing False -122.422 37.763
## 5 attack Not fleeing False -104.692 40.384
## 6 attack Not fleeing False -97.423 35.877
## is_geocoding_exact
## 1 True
## 2 True
## 3 True
## 4 True
## 5 True
## 6 True
#top 5 states with the highest number of deaths per year
top5 <- df2 %>% group_by(year) %>% count(state) %>% top_n(5)
## Selecting by n
head(top5)
## # A tibble: 6 x 3
## # Groups: year [2]
## year state n
## <chr> <fct> <int>
## 1 2015 AZ 42
## 2 2015 CA 190
## 3 2015 FL 61
## 4 2015 OK 32
## 5 2015 TX 100
## 6 2016 AZ 50
In the bar graph below, we can see that some states are on the top of the list for all 6 years. California has the greatest number of police killings each year, followed by Texas, and then Florida. 2020 is the only year where Florida has slightly more killings than Texas. Arizona has dropped from the top 5 in 2019.
#Barplot using the top 5 states with highest number of deaths per year data
ggplot(top5) +
geom_bar(mapping = aes(x = state, y = n, fill = state), stat = 'identity') +
facet_wrap(year~.) +
labs(title = "Top 5 States with Deadly Shootings")
Q: What is the average number of police killings by city?
#get average shootings per year
yearCount <- df2 %>% group_by(year) %>% count(state)
stateAvg <- yearCount %>%
group_by(state) %>%
summarise(Average = mean(n)) %>%
arrange(desc(Average))
head(stateAvg)
## # A tibble: 6 x 2
## state Average
## <fct> <dbl>
## 1 CA 140.
## 2 TX 84
## 3 FL 62.3
## 4 AZ 43.7
## 5 CO 34.8
## 6 GA 32.7
This graph is a bit crowded with bars, but we can clearly see that California and Texas has the greatest average.
ggplot(stateAvg) +
geom_bar(mapping = aes(x = state, y = Average, fill = state), stat = 'identity') +
coord_flip() +
scale_y_continuous(breaks = seq(0,150,25))
Q: How many of the victims showed signs of mental illness
Each year we can see a large gap in the number of killings where the victim showed signs of mental illness versus those who did not. Most were reported to not have shown signs of mental illness.
ggplot(df2) +
geom_bar(mapping = aes(x=year,fill=signs_of_mental_illness), position = "dodge")
Since the count looks very similar year over year, I created a subset of states with the highest total number of killings. There isn’t much of a significant difference. Each of the states has a gap between victims who showed signs of mental illness versus those who did not.
df2Filter <- df2 %>%
filter(state %in% c("AZ","CA","CO","FL","TX"))
ggplot(df2Filter) +
geom_bar(mapping = aes(x=year, fill=signs_of_mental_illness), position = "dodge") +
facet_wrap(~state) +
coord_flip()