Source: Gun Violence Archive

Introduction:

Loaded in the necessary libraries

library(tidyverse)

## Warning: package 'lubridate' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

## Warning: package 'plotly' was built under R version 4.4.2

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Imported my dataset

data <- read_csv('us_gun_deaths.csv')

## New names:
## Rows: 389730 Columns: 21
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (15): region, state, victim_age, victim_sex, victim_race, victim_race_pl... dbl
## (5): ...1, year, month, multiple_victim_count, incident_id lgl (1):
## additional_victim
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Cleaned my data, focusing on the variables that I'll be using later. I first changed the NAs in victim_age to 'Unknown' so that the original values were kept while making it obvious which data was missing. A few values were labeled 'Less than one year old', so I changed those to 0. I ultimately removed the 'Unknown' rows once I decided they wouldn't be necessary.

cleaned_data <- data |>
  mutate(victim_age = ifelse(is.na(victim_age), 'Unknown', victim_age),  # Replace NAs with 'Unknown'
         victim_age = ifelse(victim_age == 'Less than one year old', '0', victim_age)) |>  # Replace 'Less than one year old' with '0' (still a character value)
  filter(victim_age != 'Unknown') # Drop the rows that were marked 'Unknown'
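As a minimal sketch of the same cleaning idea, the recode, the numeric conversion, and the removal of missing ages can be done in one pass. The `toy` data frame below is made up for illustration and is not part of the project data; only the `victim_age` column name and the 'Less than one year old' label come from the dataset.

```r
library(dplyr)

# Toy stand-in for the real data; these four rows are made up for illustration
toy <- data.frame(
  victim_age = c("37", "Less than one year old", NA, "53"),
  stringsAsFactors = FALSE
)

toy_clean <- toy |>
  mutate(
    # Recode the text label to "0" so every non-missing value looks like a number
    victim_age = if_else(victim_age == "Less than one year old", "0", victim_age),
    # Convert to numeric so age can be used directly in models and plots
    victim_age = as.numeric(victim_age)
  ) |>
  filter(!is.na(victim_age))  # Drop rows whose age is missing

str(toy_clean$victim_age)
```

Converting at this stage is a design choice rather than a requirement; it simply means age is stored as a number from the start instead of as text.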

Used the select function to remove columns/variables I will not be using.

sorted_data <- cleaned_data |>
  select(2, 5, 6, 8) #Used the select function to remove unnecessary columns
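Selecting by position works, but it depends on the column order in the CSV. A hedged alternative is to select by name. The sketch below uses a made-up `toy` data frame; the column names match those reported by `read_csv()` above, but the values and the column order are assumptions for illustration only.

```r
library(dplyr)

# Toy data frame; values and column order are made up for illustration
toy <- data.frame(
  region = c("West", "South"),
  state = c("CA", "MD"),
  year = c(1992, 2005),
  victim_age = c("37", "53"),
  victim_race = c("American Indian or Alaskan Native",
                  "American Indian or Alaskan Native"),
  stringsAsFactors = FALSE
)

# Select by name so the result does not depend on column positions
toy_selected <- toy |>
  select(state, year, victim_age, victim_race)

names(toy_selected)
```

Because the later steps only refer to state, year, victim_age, and victim_race, a name-based selection of those four columns should serve the same purpose as `select(2, 5, 6, 8)`.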

Used the filter function to filter out all states except for CA, MD, and NY. Also filtered out every race except for ‘American Indian or Alaskan Native’.

filtered_data <- sorted_data |>
  filter(state %in% c('CA', 'MD', 'NY') & victim_race == 'American Indian or Alaskan Native') #Used the filter function to filter the data I need

I was having trouble creating my visualizations and I realized that the victim_age values were characters.

str(filtered_data$victim_age)

##  chr [1:329] "37" "53" "33" "26" "35" "18" "35" "34" "22" "30" "24" "34" ...

Converted the values (character to numeric).

filtered_data$victim_age <- as.numeric(as.character(filtered_data$victim_age))

Linear Regression

Linear regression model

lr_model <- lm(year ~ victim_age, data = filtered_data)

Correlation Coefficient

cor(filtered_data$year, filtered_data$victim_age)

## [1] 0.1073053

Summary of the linear regression model

summary(lr_model)

## 
## Call:
## lm(formula = year ~ victim_age, data = filtered_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.752  -7.361  -2.623   7.075  20.378 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 1.996e+03  1.528e+00 1306.605   <2e-16 ***
## victim_age  8.692e-02  4.454e-02    1.952   0.0518 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.816 on 327 degrees of freedom
## Multiple R-squared:  0.01151,    Adjusted R-squared:  0.008492 
## F-statistic: 3.809 on 1 and 327 DF,  p-value: 0.05183

lin_reg <- ggplot(filtered_data, aes(x = year, y = victim_age)) +
  geom_point() + # Scatter plot of Year vs. Victim Age
  geom_smooth(method = "lm", se = TRUE, color = "red") + # Linear regression line
  labs(x = "Year", # Axis labels, title, and caption
       y = "Victim's Age",
       title = "Linear Regression: Year vs. Victim Age",
       caption = "Source: Johns Hopkins Bloomberg School of Public Health") +
  theme_bw()

lin_reg

## `geom_smooth()` using formula = 'y ~ x'

Linear Regression Analysis

Linear model equation: year = (0.087 × victim_age) + 1996.0

The coefficient for the victim_age variable is about 0.087. This means that for each additional year of victim age, the predicted year of the gun violence incident increases by about 0.087 years (roughly 32 days).

P-value: The p-value for the slope, 0.0518, is very close to the 0.05 significance level but does not fall below it. This means we fail to reject the null hypothesis of no linear relationship: the data do not provide sufficient evidence that year and victim_age are linearly related. It does not prove that the two variables are independent.

Adjusted R^2 value: The adjusted R^2 value of 0.008492 indicates that victim_age explains less than 1% of the variation in year, so the linear relationship between the two variables is very weak.
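The numbers above can be cross-checked with a little arithmetic. The model treats year as the response, while the regression plot draws victim_age against year. In simple linear regression that swap changes the slope but not the correlation, the R^2, or the slope's p-value. The sketch below illustrates this with simulated data (`x` and `y` are hypothetical stand-ins, not the project data) and then reproduces the reported R^2 from the reported correlation.

```r
# Simulated data, not the project data: x and y are hypothetical stand-ins
set.seed(110)
x <- rnorm(200)
y <- 0.2 * x + rnorm(200)

fit_yx <- lm(y ~ x)  # y regressed on x
fit_xy <- lm(x ~ y)  # x regressed on y

r <- cor(x, y)

# In simple linear regression, R-squared equals the squared correlation,
# regardless of which variable is treated as the response
r2_yx <- summary(fit_yx)$r.squared
r2_xy <- summary(fit_xy)$r.squared

# The slope p-value is also the same in both directions
p_yx <- summary(fit_yx)$coefficients["x", "Pr(>|t|)"]
p_xy <- summary(fit_xy)$coefficients["y", "Pr(>|t|)"]

# Arithmetic on the reported output: squaring the correlation of 0.1073053
# gives roughly 0.0115, in line with the Multiple R-squared of 0.01151
reported_r2 <- 0.1073053^2

# 0.087 years converted to days (assuming a 365-day year), a little under 32
days_per_age_year <- 0.087 * 365
```

So the weak relationship reported here does not depend on which variable is placed on which axis; only the slope and intercept would differ.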

Visualization #1

Creating my first visualization.

plot1 <- filtered_data |>
  ggplot(aes(state, victim_age, fill = state)) + #Using the state for the x-axis and the victim_age for the y-axis
  labs(x = "State", y = "Victim's Age",
       title = "Violin Plot of American Indian/Alaskan Native Gun Violence Victim's Age and State",
       caption = "Source: Johns Hopkins Bloomberg School of Public Health") +
  geom_violin() +
  scale_fill_brewer(palette = "Set1", name = "State", labels = c("California", "Maryland", "New York")) + #Changing the color palette
  theme_bw() #Changing the theme

plot1

This violin plot is a visual representation of the (American Indian or Alaskan Native) victims' ages when the gun violence occurred and the state it happened in. The highest age, around 75, and the lowest age, 0, are both from California. We can see that California has the largest range of ages of the three states. This is most likely because there were more acts of gun violence (more data to plot) against American Indians or Alaskan Natives within California.
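The explanation that California simply has more observations can be checked by counting rows and taking the age range per state. The following is a sketch on a made-up `toy` data frame; the same pipeline could be applied to `filtered_data`, and the values shown here are not real results.

```r
library(dplyr)

# Made-up rows standing in for filtered_data
toy <- data.frame(
  state = c("CA", "CA", "CA", "CA", "MD", "NY", "NY"),
  victim_age = c(0, 22, 41, 75, 30, 19, 48)
)

# Number of victims and age range for each state
state_summary <- toy |>
  group_by(state) |>
  summarise(
    n = n(),
    min_age = min(victim_age),
    max_age = max(victim_age),
    .groups = "drop"
  )

state_summary
```

By default `geom_violin()` scales every violin to the same area, so the plot alone does not show how many observations each state has. Setting `scale = "count"` in `geom_violin()` is one way to make the areas reflect the number of observations.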

Visualization #2

Creating my second visualization

plot2 <- filtered_data |>
  ggplot() +
  geom_point(aes(x = year, y = victim_age, color = state), size = 4) + #Creating the scatter plot
  labs(x = 'Year', y = 'Victim Age', color = 'State',
       title = "Comparison of the Year and Victim Age of Gun Violence Occurrences",
       caption = "Source: Johns Hopkins Bloomberg School of Public Health") +
  scale_color_brewer(palette = "Set1", name = "State", labels = c("California", "Maryland", "New York")) + #Changing the color palette
  theme_bw() #Changing the theme
  
interactive_plot <- ggplotly(plot2) #Named interactive_plot to avoid masking the base plot() function

interactive_plot

This scatter plot shows the year each act of gun violence took place against the age of the victim. I colored the data points by the state in which the gun violence happened. The plot is also interactive: hovering over a point shows its year, victim age, and state.

Essay

The dataset that I chose is from Johns Hopkins Bloomberg School of Public Health. It has data on gun violence in the US ranging from 1985 to 2018. It gives information on the state where each attack took place, when it happened (month and year as separate variables), the age of the victim, the race of the victim, the sex of the offender, etc. The variables I will be exploring are the year, the state where the attack happened, the age of the victim, and the victim's race.

I began cleaning the dataset by removing NAs. I also changed values named 'less than one year old' to 0 in order to keep my data consistent. I used the select function to remove all the columns except for the ones I decided to use (year, state, race of victim, age of victim). Then I used the filter function to filter out every state except for CA, MD, and NY, and every race except for American Indian/Alaskan Native.

I believe this is an important topic to explore because we've seen such drastic increases in gun violence across the country. The source does not explicitly explain how the data was collected; however, a lot of it, if not all, comes directly from the CDC. Although California has the highest count of gun violence towards American Indians/Alaskan Natives among the three states, it's important to note that it also has a much larger population. The CDC's National Center for Health Statistics (Stats of the States 1) does a wonderful job of illustrating this through an interactive map. It shows that although California has a high gun violence count, the rate is relatively low because of how large the population is.

The first visualization is a violin plot that represents the (American Indian or Alaskan Native) victim's age when the act of gun violence occurred and which state it happened in. The second visualization represents the year a specific act of gun violence took place and the age of the victim. The most interesting pattern arises in the second visualization. We can see that in 1992 there's an increase in the amount of gun violence towards American Indians/Alaskan Natives in New York. This pattern seems to continue into 1993 as well. I would've liked to include an interactive map that shows the three states and the other information I've included. This would be difficult because that data isn't available in this specific dataset.
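A pattern such as the 1992 increase in New York is easier to confirm from a table of yearly counts than from a scatter plot alone. The sketch below shows one way to build such a table. The `toy` data frame and the `n_victims` column name are made up for illustration, and the counts are not real results.

```r
library(dplyr)

# Made-up rows standing in for filtered_data (one row per victim)
toy <- data.frame(
  state = c("NY", "NY", "NY", "NY", "CA", "NY"),
  year = c(1991, 1992, 1992, 1992, 1992, 1993)
)

# Count victims for each state and year
yearly_counts <- toy |>
  count(state, year, name = "n_victims") |>
  arrange(state, year)

yearly_counts
```

Applied to `filtered_data`, a table like this would show whether the count for New York in 1992 and 1993 really stands out from the surrounding years.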

Works Cited: “Stats of the States - Firearm Mortality.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 1 Mar. 2022, www.cdc.gov/nchs/pressroom/sosmap/firearm_mortality/firearm.htm.

Final Project - 110

N Diker

2024-12-15