The dataset that I am working with is the US gun deaths dataset, sourced from The Washington Post. It includes 389,730 observations and 21 variables. The variables I will be working with are the victim’s age, the region, the victim’s race, the month, and the number of deaths. I would like to explore how race plays a role in the number of gun deaths in this dataset, as well as how time factors into it. Using the months makes it possible to see whether there are times during the calendar year when it is more or less likely to die by gunshot wound. I would also like to explore how region plays a role in the number of deaths by gunshot wound, and whether any patterns are consistent across the different regions. I cleaned my dataset by removing NA values from the victim’s age column and by changing the months from numbers to their actual names. I chose this dataset because I think there is meaning in using data from terrible events to find correlations that may save lives. I think it is important for people to be aware of the factors that could be putting their lives in more or less danger, and everyone has the right to do what they can to live as safely as possible.
library(tidyverse) # Loading the packages I will need (library() attaches packages that are already installed).
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(RColorBrewer)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
New names:
Rows: 389730 Columns: 21
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(15): region, state, victim_age, victim_sex, victim_race, victim_race_pl... dbl
(5): ...1, year, month, multiple_victim_count, incident_id lgl (1):
additional_victim
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
str(gundeaths) # Looking at the structure of my dataset in order to figure out what may need to be changed later.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `victim_age = as.numeric(victim_age)`.
Caused by warning:
! NAs introduced by coercion
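The coercion warning above is what `as.numeric()` produces when a character column contains entries that are not numbers. The following is a minimal sketch of the cleaning described in the introduction (dropping rows with a missing age and turning month numbers into names). It runs on a small invented data frame, not the real dataset; the `toy` and `cleaned` names and the "Unknown" placeholder are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the real data; every value here is invented for illustration.
toy <- data.frame(
  victim_age = c("34", "Unknown", "19", NA),
  month      = c(1, 7, 12, 2),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  # Non-numeric strings such as "Unknown" become NA at this step,
  # which is what triggers "NAs introduced by coercion".
  mutate(victim_age = suppressWarnings(as.numeric(victim_age))) %>%
  # Drop the rows whose age is missing after the conversion.
  filter(!is.na(victim_age)) %>%
  # month.name is a built-in vector of the 12 month names; setting the
  # factor levels keeps the months in calendar order on a plot axis.
  mutate(month_name = factor(month.name[month], levels = month.name))

print(cleaned)
```

Storing the month as a factor with `levels = month.name` matters for plotting: without it, the axis would sort the months alphabetically.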
categories <- gundeaths %>%
  filter(!is.na(victim_age), !is.na(victim_race)) %>%
  group_by(victim_age, victim_race) %>%
  summarise(count = n(), .groups = "drop")
# I used group_by to create a subset consisting of the 3 variables I wanted in my highcharter visualization.
highchart() %>%
  hc_chart(type = "line") %>%
  hc_title(text = "Gun Deaths by Victim Age and Race") %>%
  hc_xAxis(title = list(text = "Victim Age")) %>%
  hc_yAxis(title = list(text = "Number of Deaths")) %>%
  hc_add_series(
    data = categories,
    type = "line",
    hcaes(x = victim_age, y = count, group = victim_race)
  )
gundeaths1 <- gundeaths %>%
  filter(!is.na(month), !is.na(region)) %>%
  mutate(
    month_name = month.name[as.numeric(month)],
    month_name = factor(month_name, levels = month.name)
  )
deathsmonth <- gundeaths1 %>%
  group_by(month_name, region) %>%
  summarise(count = n(), .groups = "drop")
# In this chunk of code, I am creating another subset that uses the mutate function to change the months from 1-12 to the actual month names. I also filtered out any rows that may have NA values for the region or month. Then I used group_by to create another subset that could specifically be used for my scatterplot.
Source 2: “How do I use the mutate function in Rstudio to change the months to their names instead of numbers?” (ChatGPT)
ggplot(deathsmonth, aes(x = month_name, y = count, color = region)) +
  geom_point(size = 2.5, alpha = 0.6) +
  # month_name is a factor, so ggplot2 would otherwise group by month and region together,
  # leaving one point per group; grouping by region lets the smoother fit one curve per region.
  # linewidth replaces the size argument for lines, which was deprecated in ggplot2 3.4.0.
  geom_smooth(aes(group = region), method = "loess", se = FALSE, linewidth = 1.2) +
  scale_color_manual(values = c(
    "Midwest" = "green",
    "Northeast" = "purple",
    "Northwest" = "yellow",
    "Southeast" = "blue",
    "Southwest" = "orange"
  )) +
  labs(
    title = "Gun Death Variation by Month and Region",
    x = "Month",
    y = "Number of Deaths",
    color = "US Region"
  ) +
  theme_classic(base_size = 14) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'
Source 3: “What to do if my scatterplot using geom_point and geom_smooth will not appear in RStudio?” (ChatGPT)
Source 6: “How do I fix my linear regression plot not displaying in RStudio?” (ChatGPT)
For this linear regression, age is a statistically significant predictor of the number of gun deaths. The model output reports the p-value as 1.651; since a p-value cannot exceed 1, this figure is presumably the leading digits of a value printed in scientific notation (1.651e-…), which would put it far below the usual 0.05 threshold. The direction of the relationship comes from the slope rather than from the p-value: the number of deaths decreases as the victim’s age increases, so the correlation is negative. The r-squared value does hold some weight, though not very much. It is 0.2494, which means that about 25% of the variation in gun deaths can be tied to the victim’s age.
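The regression code itself is not shown in this write-up, so the following is only a sketch of how the slope, p-value, and r-squared discussed above can be read from a fitted model. It assumes the model regresses the number of deaths on victim age, and it uses synthetic counts generated in the code (the `age_counts` data frame and all of its values are invented), so its numbers will not match the ones reported for the real data.

```r
# Synthetic counts by age, invented purely for illustration (not the real data).
set.seed(42)
age_counts <- data.frame(victim_age = 15:80)
age_counts$count <- 900 - 9 * age_counts$victim_age +
  rnorm(nrow(age_counts), sd = 120)

# Assumed model form: number of deaths as a linear function of age.
fit <- lm(count ~ victim_age, data = age_counts)
fit_summary <- summary(fit)

# The sign of the slope gives the direction of the relationship.
slope <- coef(fit)[["victim_age"]]
# The p-value for the age coefficient is always between 0 and 1.
p_value <- fit_summary$coefficients["victim_age", "Pr(>|t|)"]
# R-squared is the share of variation in count explained by age.
r_squared <- fit_summary$r.squared

cat("slope:", round(slope, 2), "\n")
# Scientific notation keeps the exponent visible for very small p-values.
cat("p-value:", format(p_value, scientific = TRUE, digits = 4), "\n")
cat("R-squared:", round(r_squared, 4), "\n")
```

Printing the p-value with `format(..., scientific = TRUE)` makes the exponent hard to miss, which helps when a value such as 1.651e-20 might otherwise be copied down as 1.651.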
gundeaths3 <- gundeaths %>%
  filter(month %in% as.character(1:12)) %>%
  filter(region != "" & !is.na(region)) %>%
  mutate(month = as.integer(month)) %>%
  group_by(month, region) %>%
  summarise(count = n(), .groups = "drop")
month_categories <- sort(unique(gundeaths3$month))
region_categories <- sort(unique(gundeaths3$region))
highchart() %>%
  hc_chart(type = "heatmap") %>%
  hc_title(text = "Variation of Gun Deaths by Month and Region") %>%
  hc_xAxis(categories = month.name[month_categories], title = list(text = "Month")) %>%
  hc_yAxis(categories = region_categories, title = list(text = "Region")) %>%
  hc_add_series(
    data = gundeaths3 %>%
      mutate(
        x = match(month, month_categories) - 1,
        y = match(region, region_categories) - 1,
        value = count
      ) %>%
      select(x, y, value) %>%
      list_parse()
  ) %>%
  hc_colorAxis(minColor = "white", maxColor = "magenta") %>%
  hc_tooltip(pointFormat = "Month: {point.x}, Region: {point.y}<br>Deaths: {point.value}")
Source 8: “How to make the months in a dataset suit the numerical requirements for a heat map?” “How to create a heatmap in RStudio and remove NA’s that are not listed as NA?” (ChatGPT)
Gun violence has been a prevalent issue in the United States for a very long time, and unfortunately there hasn’t been a solution to the problem yet. While the dataset that I’m working with only covers deaths by gunshot wound in the US, there are many datasets and visualizations that show that many other countries experience far fewer deaths and injuries from shootings. However, I also think that it is important to look at data that focuses on the United States. There may not be laws going into place to protect the country from gun violence, but the data may still hold valuable information that could inform protests, efforts to push legislation, and personal choices. The Johns Hopkins Public Health Journal states, “For the third straight year, firearms killed more children and teens, ages 1 to 17, than any other cause including car crashes and cancer. Our analysis found 48,204 people, the second highest on record, died from gunshots in the U.S. in 2022, including 27,032 suicides, an all-time high for the country.” These figures are grim, and they show just how important data is. People may not feel inclined to make change if they are not fully aware of what needs to be changed, and that is where data comes in. The dataset that I worked with is broadly consistent with this quote: most of the people dying by gunshot wound are roughly between 12 and 30 years old. This is important to look into, because it could lead to preventative solutions.
My final visualization represents how gun deaths change from month to month throughout the year in each region of the United States. I wanted to be able to display which regions are “safer” and which regions are more “dangerous” in terms of gun deaths.
I noticed that the Northwest region has very few gun deaths and that its numbers are relatively consistent across the year. In every other region, however, there seems to be a trend where July, August, December and January have more gun deaths than months like February or November. I also noticed that the Southeast has the highest concentration of gun deaths, and that its numbers are also relatively consistent, just like the Northwest. If possible, I wish that I could have come up with a way to do this visualization with the states instead, but I was not sure how I could include every state in a comparative fashion without it being overwhelming. I also would have loved for the visualization to look more interesting, but I am content that I was able to produce a heatmap, because I used to always struggle with getting them to actually show up.
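One caveat when comparing raw monthly totals is that months differ in length: February has 28 or 29 days, while July, August, December and January each have 31. The sketch below shows one way to put the months on an equal footing by dividing each count by the number of days in the month. The counts are invented for a single hypothetical region, and the `monthly` and `days_in_month` names are assumptions; with totals pooled over several years, the day counts would also need to be multiplied by the number of years covered.

```r
# Invented monthly counts for one hypothetical region (not taken from the dataset).
monthly <- data.frame(
  month = 1:12,
  count = c(3100, 2700, 3000, 2950, 3050, 3000,
            3300, 3250, 3000, 3050, 2850, 3150)
)

# Days per month in a non-leap year; a leap year would use 29 for February.
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Month names as an ordered factor, matching the approach used earlier.
monthly$month_name <- factor(month.name[monthly$month], levels = month.name)

# Deaths per day removes the effect of month length from the comparison.
monthly$deaths_per_day <- monthly$count / days_in_month[monthly$month]

# Show the months from highest to lowest daily rate.
ranked <- monthly[order(-monthly$deaths_per_day),
                  c("month_name", "count", "deaths_per_day")]
print(ranked)
```

If a month such as February still ranks low after this adjustment, the dip is less likely to be an artifact of the shorter month.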