Project Two

Author

Marie-Anne Kemajou

The dataset that I am working with is the US gun deaths dataset, sourced from The Washington Post. It includes 389,730 observations and 21 variables. The variables that I will be working with are the victim’s age, the region, the victim’s race, the month, and the number of deaths (counted from the rows). I would like to explore how race plays a role in the number of gun deaths, as well as how time factors into the number of gun deaths. Using the months makes it possible to see whether there are times during the calendar year when it is more or less likely to die by gunshot wound. I would also like to explore how region plays a role in the number of deaths by gunshot wound, and whether any patterns are consistent across the different regions. I cleaned my dataset by handling NA values in the victim’s age column (filtering them out or replacing them with the mean age, depending on the visualization) and by changing the months from numbers to their actual names. I chose this dataset because I think there is meaning in using data from terrible things that have happened to find correlations that may save lives. I think it is important for people to be aware of what factors could potentially be putting their lives in more or less danger, and everyone has the right to do what they can to live their lives as safely as possible.

library(tidyverse) # Loading all of the packages I will need.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(RColorBrewer)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(ggfortify)

gundeaths <- read_csv("/Users/marieannekemajou/Documents/Data 110/us_gun_deaths.csv")

New names:
• `` -> `...1`
Rows: 389730 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): region, state, victim_age, victim_sex, victim_race, victim_race_pl...
dbl  (5): ...1, year, month, multiple_victim_count, incident_id
lgl  (1): additional_victim
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(gundeaths) # Looking at the structure of my dataset in order to figure out what may need to be changed later.

spc_tbl_ [389,730 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1                                     : num [1:389730] 0 1 2 3 4 5 6 7 8 9 ...
 $ year                                     : num [1:389730] 1985 1985 1985 1985 1985 ...
 $ month                                    : num [1:389730] 2 3 4 5 7 7 7 8 9 10 ...
 $ region                                   : chr [1:389730] "Southeast" "Southeast" "Southeast" "Southeast" ...
 $ state                                    : chr [1:389730] "AL" "AL" "AL" "AL" ...
 $ victim_age                               : chr [1:389730] "27" "61" "29" "45" ...
 $ victim_sex                               : chr [1:389730] "Male" "Male" "Female" "Male" ...
 $ victim_race                              : chr [1:389730] "White (includes Mexican-Americans)" "White (includes Mexican-Americans)" "White (includes Mexican-Americans)" "White (includes Mexican-Americans)" ...
 $ victim_race_plus_hispanic                : chr [1:389730] "White" "White" "White" "White" ...
 $ victim_ethnicity                         : chr [1:389730] "Not of Hispanic Origin" "Not of Hispanic Origin" "Not of Hispanic Origin" "Not of Hispanic Origin" ...
 $ weapon_used                              : chr [1:389730] "handgun" "handgun" "rifle" "handgun" ...
 $ victim_offender_split                    : chr [1:389730] "Single Victim/Single Offender" "Single Victim/Single Offender" "Single Victim/Single Offender" "Single Victim/Single Offender" ...
 $ offenders_relationship_to_victim         : chr [1:389730] "stranger" "acquaintance" "wife" "acquaintance" ...
 $ offenders_relationship_to_victim_grouping: chr [1:389730] "Offender Not Known to Victim" "Outside Family But Known to Victim" "Within Family" "Outside Family But Known to Victim" ...
 $ offender_sex                             : chr [1:389730] "Male" "Male" "Male" "Male" ...
 $ circumstance                             : chr [1:389730] "Felon Killed by Police" "Brawl Due to Influence of Alcohol" "Other (Other Than Felony Type)" "Other (Other Than Felony Type)" ...
 $ circumstance_grouping                    : chr [1:389730] "Justifiable Homicide" "Other Than Felony Type" "Other Than Felony Type" "Other Than Felony Type" ...
 $ extra_circumstance_info                  : chr [1:389730] "Felon Killed in Commission of a Crime" NA NA NA ...
 $ multiple_victim_count                    : num [1:389730] 0 0 0 0 1 1 0 0 0 0 ...
 $ incident_id                              : num [1:389730] 0 1 2 3 7 7 8 10 11 12 ...
 $ additional_victim                        : logi [1:389730] FALSE FALSE FALSE FALSE FALSE TRUE ...
 - attr(*, "spec")=
  .. cols(
  ..   ...1 = col_double(),
  ..   year = col_double(),
  ..   month = col_double(),
  ..   region = col_character(),
  ..   state = col_character(),
  ..   victim_age = col_character(),
  ..   victim_sex = col_character(),
  ..   victim_race = col_character(),
  ..   victim_race_plus_hispanic = col_character(),
  ..   victim_ethnicity = col_character(),
  ..   weapon_used = col_character(),
  ..   victim_offender_split = col_character(),
  ..   offenders_relationship_to_victim = col_character(),
  ..   offenders_relationship_to_victim_grouping = col_character(),
  ..   offender_sex = col_character(),
  ..   circumstance = col_character(),
  ..   circumstance_grouping = col_character(),
  ..   extra_circumstance_info = col_character(),
  ..   multiple_victim_count = col_double(),
  ..   incident_id = col_double(),
  ..   additional_victim = col_logical()
  .. )
 - attr(*, "problems")=<externalptr>

summary(gundeaths) # Looking at summary statistics to get more information about my dataset.

      ...1             year          month           region         
 Min.   :     0   Min.   :1985   Min.   : 1.000   Length:389730     
 1st Qu.: 97432   1st Qu.:1992   1st Qu.: 4.000   Class :character  
 Median :194864   Median :1999   Median : 7.000   Mode  :character  
 Mean   :194864   Mean   :2001   Mean   : 6.625                     
 3rd Qu.:292297   3rd Qu.:2009   3rd Qu.:10.000                     
 Max.   :389729   Max.   :2018   Max.   :12.000                     
    state            victim_age         victim_sex        victim_race       
 Length:389730      Length:389730      Length:389730      Length:389730     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 victim_race_plus_hispanic victim_ethnicity   weapon_used       
 Length:389730             Length:389730      Length:389730     
 Class :character          Class :character   Class :character  
 Mode  :character          Mode  :character   Mode  :character  
                                                                
                                                                
                                                                
 victim_offender_split offenders_relationship_to_victim
 Length:389730         Length:389730                   
 Class :character      Class :character                
 Mode  :character      Mode  :character                
                                                       
                                                       
                                                       
 offenders_relationship_to_victim_grouping offender_sex      
 Length:389730                             Length:389730     
 Class :character                          Class :character  
 Mode  :character                          Mode  :character  
                                                             
                                                             
                                                             
 circumstance       circumstance_grouping extra_circumstance_info
 Length:389730      Length:389730         Length:389730          
 Class :character   Class :character      Class :character       
 Mode  :character   Mode  :character      Mode  :character       
                                                                 
                                                                 
                                                                 
 multiple_victim_count  incident_id     additional_victim
 Min.   :0.00000       Min.   :     0   Mode :logical    
 1st Qu.:0.00000       1st Qu.:143242   FALSE:368864     
 Median :0.00000       Median :277662   TRUE :20866      
 Mean   :0.09735       Mean   :277689                    
 3rd Qu.:0.00000       3rd Qu.:414191                    
 Max.   :1.00000       Max.   :550566

colSums(is.na(gundeaths)) # Checking for columns that may have NA values.

                                     ...1 
                                        0 
                                     year 
                                        0 
                                    month 
                                        0 
                                   region 
                                        0 
                                    state 
                                        0 
                               victim_age 
                                     4410 
                               victim_sex 
                                        0 
                              victim_race 
                                        0 
                victim_race_plus_hispanic 
                                        0 
                         victim_ethnicity 
                                        0 
                              weapon_used 
                                        0 
                    victim_offender_split 
                                        0 
         offenders_relationship_to_victim 
                                        0 
offenders_relationship_to_victim_grouping 
                                        0 
                             offender_sex 
                                        0 
                             circumstance 
                                        0 
                    circumstance_grouping 
                                        0 
                  extra_circumstance_info 
                                   368867 
                    multiple_victim_count 
                                        0 
                              incident_id 
                                        0 
                        additional_victim 
                                        0

gundeaths4 <- gundeaths %>%
  mutate(
    victim_age = as.numeric(victim_age),
    victim_age = replace_na(victim_age, mean(victim_age, na.rm = TRUE))
  )

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `victim_age = as.numeric(victim_age)`.
Caused by warning:
! NAs introduced by coercion
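
The coercion warning above comes from values in victim_age that cannot be read as numbers. The following is a minimal sketch of that behavior on a toy vector, using base R only. The value "unknown" is an invented placeholder for illustration, not a value confirmed to be in the dataset, and ifelse() stands in for replace_na().

```r
# Toy ages; "unknown" is a made-up placeholder, not a value taken from the dataset
ages_raw <- c("27", "61", "unknown", "29")

# as.numeric() turns anything it cannot parse into NA (this is what triggers the warning)
ages_num <- suppressWarnings(as.numeric(ages_raw))

# Mean of the values that could be parsed: (27 + 61 + 29) / 3 = 39
age_mean <- mean(ages_num, na.rm = TRUE)

# Fill each NA with the mean, the same idea as replace_na() in the chunk above
ages_filled <- ifelse(is.na(ages_num), age_mean, ages_num)

print(ages_num)
print(ages_filled)
```

One consequence worth keeping in mind: every imputed row ends up with the same age, so when the data are later grouped by age those rows form a group of their own.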

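categories <- gundeaths %>% 
  # victim_age is read in as text, so convert it to a number before filtering and grouping
  mutate(victim_age = suppressWarnings(as.numeric(victim_age))) %>%
  filter(!is.na(victim_age), !is.na(victim_race)) %>%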
  group_by(victim_age, victim_race) %>%
  summarise(count = n(), .groups = "drop")
# I used group_by to create a subset consisting of 3 variables I wanted in my highcharter visualization.

highchart() %>%
  hc_chart(type = "line") %>%
  hc_title(text = "Gun Deaths by Victim Age and Race") %>%
  hc_xAxis(title = list(text = "Victim Age")) %>%
  hc_yAxis(title = list(text = "Number of Deaths")) %>%
  hc_add_series(
    data = categories,
    type = "line",
    hcaes(x = victim_age, y = count, group = victim_race)
  )

Source 1: https://rpubs.com/rsaidi/cont_var (Professor Saidi’s Code)

gundeaths1 <- gundeaths %>%
  filter(!is.na(month), !is.na(region)) %>%
  mutate(
    month_name = month.name[as.numeric(month)],
    month_name = factor(month_name, levels = month.name) 
  )
deathsmonth <- gundeaths1 %>%
  group_by(month_name, region) %>%
  summarise(count = n(), .groups = "drop")
# In this chunk of code, I am creating another subset that uses the mutate function to change the months from 1-12 to the actual month names. I also filtered out any rows that may have NA values for the region or month. Then I used group_by to create another subset that could specifically be used for my scatterplot.

Source 2: “How do I use the mutate function in RStudio to change the months to their names instead of numbers?” (ChatGPT)
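
The month conversion above relies on month.name, a character vector built into base R. Here is a small sketch on toy month numbers (the numbers are invented for illustration) showing the lookup and why the factor levels matter for plotting order.

```r
# month.name is a built-in character vector of the twelve English month names
month_numbers <- c(2, 7, 12, 7)

# Indexing by month number returns the matching names
month_labels <- month.name[month_numbers]

# A factor with levels = month.name keeps calendar order instead of alphabetical order
month_factor <- factor(month_labels, levels = month.name)

print(month_labels)
print(sort(unique(month_factor)))
```

Without the factor step, ggplot2 would arrange the month names alphabetically on the x-axis, starting with April.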

ggplot(deathsmonth, aes(x = month_name, y = count, color = region)) +
  geom_point(size = 2.5, alpha = 0.6) +                    
  geom_smooth(method = "loess", se = FALSE, linewidth = 1.2) +  
  scale_color_manual(values = c(
    "Midwest" = "green",
    "Northeast" = "purple",
    "Northwest" = "yellow",
    "Southeast" = "blue",
    "Southwest" = "orange"
  )) +
  labs(
    title = "Gun Death Variation by Month and Region",
    x = "Month",
    y = "Number of Deaths",
    color = "US Region"
  ) +
  theme_classic(base_size = 14) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))

`geom_smooth()` using formula = 'y ~ x'

Source 3: “What to do if my scatterplot using geom_point and geom_smooth will not appear in RStudio?” (ChatGPT)

Source 4: https://rpubs.com/rsaidi/1007730 (Professor Saidi)

deathsage <- gundeaths4 %>%
  filter(!is.na(victim_age)) %>%
  group_by(victim_age) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(victim_age = as.numeric(victim_age))

model <- lm(count ~ victim_age, data = deathsage)
summary(model)


Call:
lm(formula = count ~ victim_age, data = deathsage)

Residuals:
    Min      1Q  Median      3Q     Max 
-7906.6 -1528.7  -660.0   916.3 10506.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8382.76     902.37   9.290 4.59e-15 ***
victim_age    -90.16      15.88  -5.676 1.43e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4458 on 97 degrees of freedom
Multiple R-squared:  0.2493,    Adjusted R-squared:  0.2416 
F-statistic: 32.22 on 1 and 97 DF,  p-value: 1.432e-07

plot1 <- ggplot(deathsage, aes(x = victim_age, y = count)) +
  geom_point(size = 2, alpha = 0.7, color = "maroon") +
  geom_smooth(method = "lm", formula = y~x) +
  scale_y_continuous(limits = c(0, 10000)) +
  labs(
    title = "Gun Deaths by Age of Victim",
    x = "Victim Age",
    y = "Number of Gun Deaths"
  ) +
  theme_classic(base_size = 12)

plot1

Warning: Removed 16 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 16 rows containing missing values or values outside the scale range
(`geom_point()`).

Source 4: https://rpubs.com/rsaidi/1007730 (Professor Saidi)

Source 5: https://www.youtube.com/watch?v=-mGXnm0fHtI

Source 6: “How do I fix my linear regression plot not displaying in RStudio?” (ChatGPT)

For this linear regression, age is a statistically significant predictor of the number of gun deaths recorded at each age. This is made clear by the fact that the p-value is 1.43e-07, which is far below 0.05. The slope for victim_age is -90.16, which means that as the victim’s age increases by one year, the number of deaths decreases by about 90 on average, so there is a negative correlation. The fitted equation is count = 8382.76 - 90.16 × victim_age. The r-squared value does hold some weight, though not very much. The r-squared value is 0.2493, which means that about 25% of the variation in the gun deaths can be tied to the victim’s age.
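
As a rough illustration of what the fitted line implies, the sketch below plugs two ages into the coefficients printed in the summary(model) output above. The helper predict_count() is a hypothetical name introduced here for illustration; with the fitted model object available, predict(model, newdata = ...) would be the usual route.

```r
# Coefficients copied from the summary(model) output above
intercept <- 8382.76
slope <- -90.16

# Hypothetical helper: predicted number of deaths recorded at a given victim age
predict_count <- function(age) intercept + slope * age

ages <- c(20, 50)
predicted <- predict_count(ages)

print(data.frame(age = ages, predicted_count = predicted))
```

Under this line, the predicted count falls from about 6,580 at age 20 to about 3,875 at age 50. With an r-squared of roughly 0.25, these are rough trend values rather than precise estimates.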

gundeaths3 <- gundeaths %>%
  filter(month %in% as.character(1:12)) %>%
  filter(region != "" & !is.na(region)) %>%
  mutate(month = as.integer(month)) %>%
  group_by(month, region) %>%
  summarise(count = n(), .groups = "drop")

month_categories <- sort(unique(gundeaths3$month))
region_categories <- sort(unique(gundeaths3$region))

highchart() %>%
  hc_chart(type = "heatmap") %>%
  hc_title(text = "Variation of Gun Deaths by Month and Region") %>%
  hc_xAxis(categories = month.name[month_categories], title = list(text = "Month")) %>%
  hc_yAxis(categories = region_categories, title = list(text = "Region")) %>%
  hc_add_series(
    data = gundeaths3 %>%
      mutate(
        x = match(month, month_categories) - 1,    
        y = match(region, region_categories) - 1,
        value = count
      ) %>%
      select(x, y, value) %>%
      list_parse()
  ) %>%
  hc_colorAxis(minColor = "white", maxColor = "magenta") %>%
  hc_tooltip(pointFormat = "Month: {point.x}, Region: {point.y}<br>Deaths: {point.value}")

Source 7: https://rpubs.com/rsaidi/1274076 (Professor Saidi)

Source 8: “How to make the months in a dataset suit the numerical requirements for a heat map?” “How to create a heatmap in RStudio and remove NA’s that are not listed as NA?” (ChatGPT)
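
The heatmap chunk above places each cell by position rather than by name: Highcharts category axes count from 0, while match() in R counts from 1, which is why 1 is subtracted. Below is a minimal base R sketch of that index mapping. The category vectors and counts are toy values invented for illustration, not figures from the dataset.

```r
# Toy category vectors standing in for month_categories and region_categories
months <- c(1, 2, 3)
regions <- c("Midwest", "Northeast", "Southeast")

# One toy row per heatmap cell (counts are invented for illustration)
cells <- data.frame(
  month = c(1, 3, 2),
  region = c("Southeast", "Midwest", "Northeast"),
  count = c(10, 4, 7)
)

# match() gives 1-based positions; subtracting 1 gives the 0-based axis indices
cells$x <- match(cells$month, months) - 1
cells$y <- match(cells$region, regions) - 1

print(cells[, c("x", "y", "count")])
```

Because x and y are positions, a tooltip that prints point.x and point.y directly will show these index numbers rather than the month and region names.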

Gun violence has been a prevalent issue in the United States for a very long time, and unfortunately there hasn’t been a solution to the problem quite yet. While the dataset that I’m working with only covers the US in terms of death by gunshot wound, there are many datasets and visualizations out there that show many other countries do not deal with nearly as much fatality and injury from shootings. However, I also think that it is important to look at data that focuses on the United States. There may not be laws going into place to protect the country from gun violence, but there is always a chance that there is valuable information that could be the center of protesting, attempts to push legislation, and personal choices.

A report from the Johns Hopkins Center for Gun Violence Solutions states, “For the third straight year, firearms killed more children and teens, ages 1 to 17, than any other cause including car crashes and cancer. Our analysis found 48,204 people, the second highest on record, died from gunshots in the U.S. in 2022, including 27,032 suicides, an all-time high for the country.” This information is very unfortunate, and it shows just how important data is. People may not feel inclined to make change if they are not completely aware of what needs to be changed, and that is where data comes in. The dataset that I worked with falls relatively in line with this quote: the majority of those dying by gunshot wound are roughly between 12 and 30 years old. This is important to look into, because it could lead to preventative solutions.

My final visualization represents how gun deaths change throughout the year during different months, and for each region of the United States. I wanted to be able to display which regions are “safer” and which regions are more “dangerous” in terms of gun deaths in the United States. I noticed that the Northwest region has very few gun deaths and is relatively consistent across the months. However, in every other region, there seems to be a trend where July, August, December and January tend to have more gun deaths compared to months like February or November. I also noticed that the Southeast has the highest number of gun deaths, and that is also relatively consistent across the months, just like the Northwest. If possible, I wish that I could have come up with a way to do this visualization with the states instead, but I was not sure how I could include every state in a comparative fashion without it being overwhelming. I also would have loved for the visualization to look more interesting, but I am content that I was able to make a heatmap, because I used to always struggle with getting them to actually show up.

Works Cited

Source 1: https://rpubs.com/rsaidi/cont_var (Professor Saidi’s Code)

Source 2: “How do I use the mutate function in RStudio to change the months to their names instead of numbers?” (ChatGPT)

Source 3: “What to do if my scatterplot using geom_point and geom_smooth will not appear in RStudio?” (ChatGPT)

Source 4: https://rpubs.com/rsaidi/1007730 (Professor Saidi)

Source 5: https://www.youtube.com/watch?v=-mGXnm0fHtI

Source 6: “How do I fix my linear regression plot not displaying in RStudio?” (ChatGPT)

Source 7: https://rpubs.com/rsaidi/1274076 (Professor Saidi)

Source 8: “How to make the months in a dataset suit the numerical requirements for a heat map?” “How to create a heatmap in RStudio and remove NA’s that are not listed as NA?” (ChatGPT)

Source 9: https://publichealth.jhu.edu/center-for-gun-violence-solutions/research-reports/gun-violence-in-the-united-states (Essay Article)