DATA 110 Project 2

Author

Emilio Difilippantonio

Introduction

(Essay 1)

For this project, I will be analyzing the US Gun Deaths data set, which contains information about the year, month, region, and state of each gun death, the victim’s age, sex, race, and ethnicity, the weapon used, the offender’s sex and relation to the victim, the circumstance of the death, and other data gun deaths across America from 1985 to 2018. The circumstance of the killing briefly details what occured, such as if it was and argument, robbery, etc., while the ciicumstance grouping gives a more general overview, such as if it was a felony, a justifiable homicide, etc. Gun violence is a very prevelant issue in the US, and I wish to take a look at some of the data to see who is affected the most. I will be examining how many people of each race, age, and sex die each year in America due to gun violence. I will only be looking at data from 2000 to 2018, as data older than that is too old and not as useful. I will also be removing the first column of the data set, as it simply contains an index number, which is unnecessary as all the data is already indexed. Lastly, I will convert columns to either numeric, character, or logical. I would usually change the column names to remove spaces and capital letters, but in this data set, the column names are already in the form that I prefer, which means that there is no need to change them.

Source

# Loading in necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(ggfortify)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(htmltools)

# Loading in data set
setwd("/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory")
guns <- read_csv("us_gun_deaths.csv")
New names:
Rows: 389730 Columns: 21
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(15): region, state, victim_age, victim_sex, victim_race, victim_race_pl... dbl
(5): ...1, year, month, multiple_victim_count, incident_id lgl (1):
additional_victim
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
# Filtering for deaths that occured during or after 2000 CE
guns2 <- filter(guns, year >= 2000)

# Removing column one of the data set (index numbers)
guns3<- guns2[,-1]

# Converting numeric columns to numeric
guns3$year <- as.numeric(guns3$year)
guns3$month <- as.numeric(guns3$month)
guns3$victim_age <- as.numeric(guns3$victim_age)
Warning: NAs introduced by coercion
guns3$multiple_victim_count <- as.numeric(guns3$multiple_victim_count)
guns3$incident_id <- as.numeric(guns3$incident_id)

# Converting factor columns to factor
guns3$region <- as.factor(guns3$region)
guns3$state <- as.factor(guns3$state)
guns3$victim_sex <- as.factor(guns3$victim_sex)
guns3$victim_race <- as.factor(guns3$victim_race)
guns3$victim_race_plus_hispanic <-
  as.factor(guns3$victim_race_plus_hispanic)
guns3$victim_ethnicity <- as.factor(guns3$victim_ethnicity)
guns3$weapon_used <- as.factor(guns3$weapon_used)
guns3$victim_offender_split <- as.factor(guns3$victim_offender_split)
guns3$offenders_relationship_to_victim <-
  as.factor(guns3$offenders_relationship_to_victim)
guns3$offenders_relationship_to_victim_grouping <-
  as.factor(guns3$offenders_relationship_to_victim_grouping)
guns3$offender_sex <- as.factor(guns3$offender_sex)
guns3$circumstance <- as.factor(guns3$circumstance)
guns3$circumstance_grouping <- as.factor(guns3$circumstance_grouping)
guns3$extra_circumstance_info <- as.factor(guns3$extra_circumstance_info)

# Converting logical columns to logical
guns3$additional_victim <- as.logical(guns3$additional_victim)

Let the Journey Begin!

Now that we have filtered our data set and understand the variables, we will take a closer look to try to gain a deeper understanding of the data.

Linear Regression Analysis

First, we will look at the relationship between the avergae age of gun death victims and the year to determine whether the average age of gun death victims is increasing, decreasing, staying relatively the same, or changing unpredictably.

# Grouping the data by year
guns_grouped <- guns3 %>% group_by(year)

# Summarizing the average age of gun death victims in the US per year
avg_ages <- guns_grouped %>% summarize(age = mean(victim_age, na.rm = TRUE))

# Analyzing the relationship between the year and the average age of gun death victims in the US
summary(lm(age ~ year, data = avg_ages))

Call:
lm(formula = age ~ year, data = avg_ages)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.73209 -0.07832 -0.02896  0.14995  0.50161 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -236.18684   24.94856  -9.467 3.43e-08 ***
year           0.13325    0.01242  10.730 5.46e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2965 on 17 degrees of freedom
Multiple R-squared:  0.8713,    Adjusted R-squared:  0.8638 
F-statistic: 115.1 on 1 and 17 DF,  p-value: 5.459e-09

The equation to predict the average age of guns death victims in the US given the year is the following:

age = -236.18684 + 0.13325(year)

This means that the average age of gun deasth victims in the US increases by 0.13325 years (approximately 49 days, or 7 weeks) per year, which is a small but noticable number.

The p-value of our test is 0.000000005459, which is extremely small, meaning that there is almost certainly a correlation between average age of gun death victims and year. Our adjusted r-squared value is 0.8638, meaning that 86.38% of the variation in our data can be explained by our model. The low p-value and high adjuested r-squared value point towards our linear model being a good fit for the data. However, we can never be too careful. Let’s analyze our data a bit closer, shall we?

Diagnostic Plots

Next, we will look at some diagnostic plots to determine whether our model to predict the average age of gun death victims in the US given the year is an effective model.

# Creating diagnostic plots
autoplot(lm(age ~ year, data = avg_ages))

It seems like the first, seventh, and eigth data entries are outliers with high leverage. These are 2000, 2006, and 2007. In 2000, the average age of gun death victims was 30.8 years, while the predicted value for that year is 30.3. The residual is 0.5. In 2006, the average age of gun death victims was 30.6, but the predicted age was 31.1; 2006 has a residual of -0.5. In 2007, the average age of gun death victims was 30.5, but the predicted age was 31.2; 2007 had a residual of -0.7. 2000 had an average gun death victim age higher than expected, while 2006 and 2007 had average gun death victim ages lower than expected. Our increase per year is very small, so even minor differences in expected and actual values can significantly interfere with the model. This also shows that, despite a small increase, the average age of gun death victms in the US is the early 30’s.

Exploring the Data with Plots

Let’s use some visualizations to further explore the data.

Age

First, we’ll make a line chart looking at the increase in the average age of gun death victims in the US per year. I will be using boxplots in an admittedly unorthodox way, though I hope it will help us analyze the data. Boxplots show us the medians of data sets, not the means, so this visualization will likely differ from our expectations based on previous calculations that we made involving the year and the average age of gun deaths victims.

# Creating the plot
plot1 <- guns3 |>
  ggplot(aes(x = as.factor(year), y = victim_age)) +
  geom_boxplot() +
  labs(x = "Year", y = "Age",
       title = "Boxplot of Gun Death Victim Ages\nin the US from 2000 to 2018",
       caption = "Source: guns.periscopic.com") +
  theme(axis.text.x = element_text(angle = 90))

# Calling the plot
plot1
Warning: Removed 2030 rows containing non-finite values (`stat_boxplot()`).

Interesting………

Well, that wasn’t quite what I expected, but it does tell us something: half of the gun death victims in the US every year are between the ages of (approximately) 22 and 40. Additionaly, gun death victims in the US aged around 60 and older are considered outliers. Perhaps the most frightening thing is that a quarter of gun death victims in the US are 21 or younger.

Sex

Let’s look at the number of male versus female gun death victims in the US. Though comparing proportions is better than comparing numbers, the proportion of males and females in the US is about the same, so just looking at the raw data can still tell us quite a bit.

# Grouping the data by year and sex
guns_grouped_2 <- guns3 |> group_by(year, victim_sex)

# Summarizing the data grouped by year and sex
sexes <- guns_grouped_2 %>% summarize(total = n())
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Adding a column for the proportions
sexes <- mutate(sexes, proportion =
                if_else(year == 2000, total / 9179,
                 if_else(year == 2001, total / 9541,
                  if_else(year == 2002, total / 10155,
                   if_else(year == 2003, total / 10333,
                    if_else(year == 2004, total / 10023,
                     if_else(year == 2005, total / 10733,
                      if_else(year == 2006, total / 10890,
                       if_else(year == 2007, total / 10817,
                        if_else(year == 2008, total / 10212,
                         if_else(year == 2009, total / 9931,
                          if_else(year == 2010, total / 9594,
                           if_else(year == 2011, total / 9356,
                            if_else(year == 2012, total / 9735,
                             if_else(year == 2013, total / 9288,
                              if_else(year == 2014, total / 9077,
                               if_else(year == 2015, total / 10601,
                                if_else(year == 2016, total / 11975,
                                 if_else(year == 2017, total / 12034,
                                  if_else(year == 2018, total / 11356, NA))))))))))))))))))))

# Creating the plot
plot2 <- sexes |>
  ggplot(aes(x = year, y = proportion, color = victim_sex)) +
  geom_point() +
  geom_line() +
  ylim(0,1) +
  xlim(2000, 2018) +
  labs(x = "Year", y = "Proportion of Deaths", color = "Sex of Victim",
       title = "Proportion of Gun Death Victims of Each Sex\nin the US from 2000 to 2018",
       caption = "Source: guns.periscopic.com")

# Calling the plot
plot2

As we can see, around 85% of gun death victims in the US every year from 2000 to 2018 are male, while around 15% are female. The amount of victims of unknown sex is negligible.

Race

Next, let’s take a look at the number of gun death victims of each race in the US from 2000 to 2018.

# Grouping the data by year and race
guns_grouped_3 <- guns3 |> group_by(year, victim_race)

# Summarizing the data grouped by year and race
races <- guns_grouped_3 %>% summarize(total = n())
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Adding a column for the proportions
races <- mutate(races, proportion =
                if_else(year == 2000, total / 9179,
                 if_else(year == 2001, total / 9541,
                  if_else(year == 2002, total / 10155,
                   if_else(year == 2003, total / 10333,
                    if_else(year == 2004, total / 10023,
                     if_else(year == 2005, total / 10733,
                      if_else(year == 2006, total / 10890,
                       if_else(year == 2007, total / 10817,
                        if_else(year == 2008, total / 10212,
                         if_else(year == 2009, total / 9931,
                          if_else(year == 2010, total / 9594,
                           if_else(year == 2011, total / 9356,
                            if_else(year == 2012, total / 9735,
                             if_else(year == 2013, total / 9288,
                              if_else(year == 2014, total / 9077,
                               if_else(year == 2015, total / 10601,
                                if_else(year == 2016, total / 11975,
                                 if_else(year == 2017, total / 12034,
                                  if_else(year == 2018, total / 11356, NA))))))))))))))))))))

# Creating the plot
plot2 <- races |>
  ggplot(aes(x = year, y = proportion, color = victim_race)) +
  geom_point(alpha = 0.5) +
  geom_line(alpha = 0.5) +
  ylim(0,1) +
  xlim(2000, 2018) +
  labs(x = "Year", y = "Proportion of Deaths", color = "Race of Victim",
       title = "Proportion of Gun Death Victims of Each Race\nin the US from 2000 to 2018",
       caption = "Source: guns.periscopic.com") +
  guides(color = guide_legend(override.aes = list(alpha = 1)))

# Calling the plot
plot2

According to the US Census Bureau, the population estimates for the US population are as follows (US Census Bureau):

  • American Indian or Alaskan Native: 1.3%
  • Asian or Pacific Islander: 6.6%
  • Black: 13.6%
  • White (includes Mexican-Americans): 75.5%

(Essay 2)

Though black people make up only 13.6% of Americans, over 50% of gun death victims each year are black, and the number seems to be rising. On the other side of the coin, white people make up over three quarters of Americans, but only around 38 to 40 percent of gun death victims in the US since 2010 are white. This shows a clear disparity between the races, that is likely rooted in systemic racism and other societal factors such as poverty. As you can see in the visualization, black people in the US are much more susceptible to being killed by guns, but this is just the measure of the outcome of a specific kind of violence. Black people are likely more susceptible to many kinds of violence due to societal injustices, such as police brutality targeting black people and redlining, which forces black people to live in poorer neighborhoods with higher crime rates.

Race and Ethnicity

Now we’re going to look at US gun death victims from 2000 to 2018 based on race and ethnicity. This graph will include people of hispanic origin as another variable, whereas before they were counted as white.

# Grouping the data by year and race and ethnicity
guns_grouped_4 <- guns3 |> group_by(year, victim_race_plus_hispanic)

# Summarizing the data grouped by year and race and ethnicity
race_plus <- guns_grouped_4 %>% summarize(total = n())
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Adding a column for the proportions
race_plus <- mutate(race_plus, proportion =
                if_else(year == 2000, total / 9179,
                 if_else(year == 2001, total / 9541,
                  if_else(year == 2002, total / 10155,
                   if_else(year == 2003, total / 10333,
                    if_else(year == 2004, total / 10023,
                     if_else(year == 2005, total / 10733,
                      if_else(year == 2006, total / 10890,
                       if_else(year == 2007, total / 10817,
                        if_else(year == 2008, total / 10212,
                         if_else(year == 2009, total / 9931,
                          if_else(year == 2010, total / 9594,
                           if_else(year == 2011, total / 9356,
                            if_else(year == 2012, total / 9735,
                             if_else(year == 2013, total / 9288,
                              if_else(year == 2014, total / 9077,
                               if_else(year == 2015, total / 10601,
                                if_else(year == 2016, total / 11975,
                                 if_else(year == 2017, total / 12034,
                                  if_else(year == 2018, total / 11356, NA))))))))))))))))))))

# Creating the plot
plot3 <- race_plus |>
  ggplot(aes(x = year, y = proportion, color = victim_race_plus_hispanic)) +
  geom_point(alpha = 0.5) +
  geom_line(alpha = 0.5) +
  ylim(0,1) +
  xlim(2000, 2018) +
  labs(x = "Year", y = "Proportion of Deaths", color = "Race and Ethnicity of Victim",
       title = "Proportion of Gun Death Victims of Each Race and Ethnicity\nin the US from 2000 to 2018",
       caption = "Source: guns.periscopic.com") +
  guides(color = guide_legend(override.aes = list(alpha = 1)))

# Calling the plot
plot3

As we can see, many people counted as white in the previous graph are hispanic. We will further discuss this in the conclusion.

Relationship between Year, Race and Ethnicity, Sex, and Age

Let’s make a multifaceted visualization to further analyze our data set.

# Grouping the data by year and race, ethnicity, and sex
guns_grouped_5 <- guns3 |> group_by(year, victim_race_plus_hispanic, victim_sex)

# Summarizing the data grouped by year and race, ethnicity, and sex
race_sex <- guns_grouped_5 %>% summarize(age = mean(victim_age, na.rm = TRUE))
`summarise()` has grouped output by 'year', 'victim_race_plus_hispanic'. You
can override using the `.groups` argument.
# Filtering out unknown races and ethnicities
race_sex2 <- filter(race_sex, victim_race_plus_hispanic != "Unknown")

# Creating the plot
plot4 <- race_sex2 |>
  ggplot(aes(x = year, y = age, color = victim_sex)) +
  facet_wrap(~victim_race_plus_hispanic) +
  geom_point(alpha = 0.8) +
  geom_line(alpha = 0.5) +
  ylim(0,50) +
  xlim(2000, 2018) +
  labs(x = "Year", y = "Average Age of Victims", color = "Sex of Victim",
       title = "Average Age of Gun Death Victims of Each Race and Ethnicity\nand Sex in the US from 2000 to 2018",
       caption = "Source: guns.periscopic.com") +
  guides(color = guide_legend(override.aes = list(alpha = 1))) +
  theme_minimal() +
  scale_color_brewer(palette = "Set2") +
  theme(legend.position = c(0.8, 0.2))

# Calling the plot
plot4
Warning: Removed 8 rows containing missing values (`geom_point()`).

Let’s make interactive versions of each of these plots.

# Creating the data sets
american <- filter(race_sex2, victim_race_plus_hispanic == "American Indian or Alaskan Native")

asian <- filter(race_sex2, victim_race_plus_hispanic == "Asian or Pacific Islander")

black <- filter(race_sex2, victim_race_plus_hispanic == "Black")

hispanic <- filter(race_sex2, victim_race_plus_hispanic == "Hispanic Origin")

white <- filter(race_sex2, victim_race_plus_hispanic == "White")

# Creating the plots
plot5 <- highchart() |>
  hc_add_series(data = american,
                type = "line",
                hcaes(x = year, y = age, group = victim_sex)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Average Victim Age")) |>
  hc_title(text = "Average Age of  American Indian and Native Alaskan Gun Death Victims\nof Each Sex in the US from 2000 to 2018") |>
  hc_caption(text = "Source: guns.periscopic.com") |>
  hc_colors(c("blue", "red", "green"))

plot6 <- highchart() |>
  hc_add_series(data = asian,
                type = "line",
                hcaes(x = year, y = age, group = victim_sex)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Average Victim Age")) |>
  hc_title(text = "Average Age of Asian and Pacific Islander Gun Death Victims\nof Each Sex in the US from 2000 to 2018") |>
  hc_caption(text = "Source: guns.periscopic.com") |>
  hc_colors(c("blue", "red", "green"))

plot7 <- highchart() |>
  hc_add_series(data = black,
                type = "line",
                hcaes(x = year, y = age, group = victim_sex)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Average Victim Age")) |>
  hc_title(text = "Average Age of Black Gun Death Victims of Each Sex\nin the US from 2000 to 2018") |>
  hc_caption(text = "Source: guns.periscopic.com") |>
  hc_colors(c("blue", "red", "green"))

plot8 <- highchart() |>
  hc_add_series(data = hispanic,
                type = "line",
                hcaes(x = year, y = age, group = victim_sex)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Average Victim Age")) |>
  hc_title(text = "Average Age of Hispanic Gun Death Victims of Each Sex\nin the US from 2000 to 2018") |>
  hc_caption(text = "Source: guns.periscopic.com") |>
  hc_colors(c("blue", "red", "green"))

plot9 <- highchart() |>
  hc_add_series(data = white,
                type = "line",
                hcaes(x = year, y = age, group = victim_sex)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Average Victim Age")) |>
  hc_title(text = "Average Age White of Gun Death Victims of Each Sex\nin the US from 2000 to 2018") |>
  hc_caption(text = "Source: guns.periscopic.com") |>
  hc_colors(c("blue", "red", "green"))
# Calling plot 5
plot5

These numbers seem to fluctuate quite a bit, likely because so few American Indians and Alaskan Natives get killed by guns each year, which means that each data point has a lot of leverage.

# Calling plot 6
plot6

Interestingly, the average age of Asian and Pacific Islander gun death victims is consistently higher than women than for men, and from 2007 to 2009, the data points seem to move together. This warrants further investigation, but for now, let’s continue.

# Calling plot 7
plot7

This data seems to stay very consistent, and this is likely for a very sad reason: so many black people are killed by guns every year, so each data point has little leverage and the average is very predictable. Also, the average age of female victims is higher than the average age of male victims.

# Calling plot 8
plot8

It seems that the average age of hispanic gun death victims seems to be increasing faster than that of the other races, though the females are still older on average than the males.

# Calling plot 9
plot9

Once again, this is a very standardized graph with little variance and the same pattern of female gun death victims being older than their male counterparts.

Final Visualization

For our last visualization, let’s revisit a previous visualization but add interactivity.

# Creating a percentage variable
race_plus <- mutate(race_plus, percent = proportion * 100)

# Creating the plot
plot10 <- highchart() |>
  hc_add_series(data = race_plus,
                type = "line",
                hcaes(x = year, y = percent, group = victim_race_plus_hispanic)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Percent of Victims")) |>
  hc_title(text = "Percent of Gun Death Victims of Each Race and Ethnicity\nin the US from 2000 to 2018") |>
  hc_caption(text = "Source: guns.periscopic.com") |>
  hc_colors(c("red", "blue", "black", "darkgreen", "green", "gray")) |>
  hc_add_theme(hc_theme(chart = list(backgroundColor = "lightblue")))

# Calling the plot
plot10

Conclusion

(Essay 3)

As discussed earlier, black people are much more likely to be killed with a gun in the US than any other race. Now that we have included hispanic ethnicity into the visualization, we can see that a large amount of the people counted as white in the plot we previously discussed were hispanic. As of 2020, 18.9% of Americans were hispanic (source), though they only represent between 11% and 16% of people killed by guns every year. It is important to remember, however, that it was, and still is, and quickly growing demographic in America, so we would have to look at individual data from each year to determine whether or not the difference in expected and actual proportion of hispanic people killed by gun violence in that year is significantly different. Overall, we can see that certain demographics, such as black people, are at a significantly higher risk than usual of being killed by a gun, while other demographics, such as white people, are at a significantly lower risk.

Sadly, I wasn’t able to create the final visualization that I wanted. I created the multi-faceted plot seen earlier in the document as my final visualization, though a few hours before I had to submit the assignment, I realized that it wasn’t interactive. I wasn’t sure how to get plotly to work, and I wasn’t able to make a multi-faceted visualization with highcharter. There is a function called hw_grid() that allows you to make a grid of highcharter visualizations, but I wasn’t able to get it to work. I looked at many different sources, both on how to use and how to troubleshoot hw_grid(), but they were so complicated and advanced that they nearly made me cry. Like, seriously, that stuff was confusing. Here are the two places on which I spent the most time researching:

I hope to be able to make multi-faceted interactive visualizations in the future, but I’m proud of what I accomplished, especially since I only realized I needed to make the visualizatons interactive at the end but was still able to make 6 good visualizations (granted, 5 of them were pretty much clones of each other with different data, and the final visualization is just a previous visualization with interactivity, but it was still a scramble at the end and I was very stressed). I’m looking forward to seeing what I can accomplish on the final project and learning more about data visualizations, both in rStudio and other places (such as Tableau).