Exploratory Data Analysis

The data set we have chosen reports counts of COVID-19 deaths in the USA broken down by state, race, and age group, along with the distribution of COVID-19 deaths across these categories. All of the data were collected in 2020 and 2021, since the onset of the COVID-19 pandemic. The data set interests us not only because it is highly relevant, but also because we see strong potential to observe and expose inequities across age, race, and geography in real-world data. We also see an opportunity to join it with census data, which may give us a more in-depth understanding of these inequities as we take the next steps in our project.

Combining this data with economic data (income data, we hope) would let us build intersectional data visualizations. These visuals can help us understand the connections between race, socio-economic status, and geographical location. If we do find trends in the data, surprising or not, ggplot2 will allow us to illustrate them in a way that communicates our findings effectively to an audience of varying statistical literacy.

Our three main questions about the data are as follows: How do COVID-19 deaths vary by state? How do COVID-19 deaths vary by race? If we successfully supplement this data set with census or BEA data, how do COVID-19 deaths by state compare to economic development and personal income by state, using per-capita income as the metric?
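
The third question depends on joining state-level death counts to an income table. Below is a minimal sketch of that join with dplyr, assuming a hypothetical `income` table with one row per state. The data frame names, the `PerCapitaIncome` column, and every value are made-up placeholders, not real figures.

```r
library(dplyr)

# Hypothetical placeholder values, for illustration only (not real data)
deaths_by_state <- tibble(
  State = c("A", "B", "C"),
  Count.Deaths = c(100, 250, 40)
)
income <- tibble(
  State = c("A", "B"),
  PerCapitaIncome = c(50000, 62000)
)

# left_join() keeps every state in the deaths table;
# states missing from the income table get NA
joined <- left_join(deaths_by_state, income, by = "State")
joined
```

A left join seems like the safer default here because it never silently drops a state from the deaths data; unmatched states show up as `NA` and can be checked by hand.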

Graphics Exploration

Set Up

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Importing our final project data set

covidDeaths <- read.csv("covidDeaths.csv",
                        header=TRUE)
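
The long dotted column names that appear after import come from `read.csv()` itself: by default (`check.names = TRUE`) it passes the header row through `make.names()`, which replaces every character that is not valid in an R name with a dot. Here is a small sketch, assuming the raw headers looked like the strings below (the CSV header row is not shown in this report, so these are guesses).

```r
# Assumed raw headers; the real CSV header row is not shown in this report
raw_headers <- c(
  "Race/Hispanic origin",
  "Count of COVID-19 deaths",
  "Distribution of COVID-19 deaths (%)"
)

# make.names() swaps spaces, slashes, hyphens, parentheses, and % for dots
clean_headers <- make.names(raw_headers)
clean_headers
```

Passing `check.names = FALSE` to `read.csv()` would keep the headers as written in the file, at the cost of needing backticks around names that contain spaces.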

Renaming Variables

names(covidDeaths)[names(covidDeaths) == "Race.Hispanic.origin"] <- "Race.Hispanic"

names(covidDeaths)[names(covidDeaths) == "Count.of.COVID.19.deaths"] <- "Count.Deaths"

names(covidDeaths)[names(covidDeaths) == "Distribution.of.COVID.19.deaths...."] <- "Distribution.Deaths"

names(covidDeaths)[names(covidDeaths) == "Unweighted.distribution.of.population...."] <- "Unweight.Distribution.Pop"

names(covidDeaths)[names(covidDeaths) == "Weighted.distribution.of.population...."] <- "Weight.Distribution.Pop"

names(covidDeaths)[names(covidDeaths) == "Difference.between.COVID.19.and.unweighted.population.."] <- "Unweight.Difference"

names(covidDeaths)[names(covidDeaths) == "Difference.between.COVID.19.and.weighted.population.."] <- "Weight.Difference"
names(covidDeaths)

##  [1] "Data.as.of"                "Start.Date"               
##  [3] "End.Date"                  "State"                    
##  [5] "Race.Hispanic"             "Count.Deaths"             
##  [7] "Distribution.Deaths"       "Unweight.Distribution.Pop"
##  [9] "Weight.Distribution.Pop"   "Unweight.Difference"      
## [11] "Weight.Difference"         "AgeGroup"                 
## [13] "Suppression"
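
Since tidyverse is already loaded, the same renaming could also be written as a single `dplyr::rename()` call, which takes `new_name = old_name` pairs. This is a sketch on a tiny stand-in data frame (`toy` and its values are hypothetical), not a replacement for the code above.

```r
library(dplyr)

# Stand-in data frame with two of the imported column names; values are made up
toy <- data.frame(
  Race.Hispanic.origin = c("Hispanic", "Non-Hispanic White"),
  Count.of.COVID.19.deaths = c(10, 20)
)

# rename() takes new_name = old_name pairs
toy <- toy %>%
  rename(
    Race.Hispanic = Race.Hispanic.origin,
    Count.Deaths = Count.of.COVID.19.deaths
  )
names(toy)
```

Unlike the `names(x)[names(x) == ...]` pattern, `rename()` stops with an error if an old name is misspelled, which makes typos easier to catch.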

Early Data Visualization

These graphs definitely still need some work!

# graph 1 to turn in
ggplot(covidDeaths, aes(Distribution.Deaths, State, color = Race.Hispanic)) +
  geom_point()

## Warning: Removed 1076 rows containing missing values (geom_point).
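
The warning means ggplot2 dropped rows in which a plotted variable is `NA`. One way to see where the missing values sit before plotting is to count them per column and, if appropriate, filter them out explicitly. This is a sketch on a made-up stand-in data frame (`toy` and its values are hypothetical).

```r
library(dplyr)

# Stand-in data; values are made up
toy <- data.frame(
  State = c("A", "A", "B", "B"),
  Distribution.Deaths = c(12.5, NA, 30.1, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows that can actually be plotted
toy_complete <- toy %>%
  filter(!is.na(Distribution.Deaths))
nrow(toy_complete)
```

Filtering explicitly makes the dropped rows a deliberate choice rather than a side effect of the plotting call.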

# graph 2 to turn in
ggplot(covidDeaths, aes(Race.Hispanic, Weight.Difference, fill = Race.Hispanic)) + 
  geom_boxplot()

## Warning: Removed 1076 rows containing non-finite values (stat_boxplot).

# graph 3 to turn in
ggplot(covidDeaths, aes(Weight.Difference, Race.Hispanic)) +
  geom_jitter() + 
  geom_smooth(method="lm")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1076 rows containing non-finite values (stat_smooth).

## Warning: Removed 1076 rows containing missing values (geom_point).
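
Because `Race.Hispanic` is categorical, the linear fit in graph 3 is, as far as we understand ggplot2's handling of discrete axes, computed on the categories' axis positions rather than on a numeric response, so the line is hard to interpret. A per-group summary may express the comparison more directly. This is a sketch with made-up values (`toy` and the group labels are hypothetical).

```r
library(dplyr)

# Stand-in data; group labels and values are made up
toy <- data.frame(
  Race.Hispanic = c("Group 1", "Group 1", "Group 2", "Group 2"),
  Weight.Difference = c(-2, 4, 1, NA)
)

# Mean difference per group, plus how many values were missing
group_summary <- toy %>%
  group_by(Race.Hispanic) %>%
  summarise(
    mean_diff = mean(Weight.Difference, na.rm = TRUE),
    n_missing = sum(is.na(Weight.Difference))
  )
group_summary
```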

# graph 4 to turn in
covidDeathsAge <- covidDeaths %>%
  filter(!AgeGroup %in% c("All ages, standardized", "All ages, unadjusted"))
ggplot(covidDeathsAge, aes(Distribution.Deaths, AgeGroup)) +
  geom_point()

## Warning: Removed 989 rows containing missing values (geom_point).
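
Character columns are plotted in alphabetical order, which for age labels can put, for example, "5-14 years" after "45-54 years". Converting the column to a factor with explicit levels fixes the axis order. This is a sketch, assuming age labels of the form below (the exact labels in `AgeGroup` are not shown in this report).

```r
# Assumed label format; the real AgeGroup values may differ
ages <- c("45-54 years", "5-14 years", "0-4 years", "15-24 years")

# Explicit levels give the order used on the plot axis
age_levels <- c("0-4 years", "5-14 years", "15-24 years", "45-54 years")
ages_f <- factor(ages, levels = age_levels)

levels(ages_f)
sort(ages_f)
```

In the real data the same idea would be applied with something like `mutate(AgeGroup = factor(AgeGroup, levels = ...))` before calling `ggplot()`.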

# graph 5 to turn in
covidDeathsAIAN <- covidDeaths %>%
  filter(
    Race.Hispanic == "Non-Hispanic American Indian or Alaska Native",
    !AgeGroup %in% c("All ages, standardized", "All ages, unadjusted")
  )
#view(covidDeathsAIAN)
ggplot(covidDeathsAIAN, aes(Count.Deaths, fill = AgeGroup)) +
  geom_bar() +
  scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 318 rows containing non-finite values (stat_count).

## Warning: position_stack requires non-overlapping x intervals
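
Two things seem to be behind these warnings: `geom_bar()` counts rows at each distinct `Count.Deaths` value rather than plotting the death counts themselves, and `log10(0)` is `-Inf`, so zero counts fall off a log axis. If the goal is total deaths per age group, one option is to sum first and draw the totals with `geom_col()`. This is a sketch on made-up values (`toy`, its age labels, and its counts are hypothetical).

```r
library(dplyr)
library(ggplot2)

# Stand-in data; labels and counts are made up
toy <- data.frame(
  AgeGroup = c("0-24 years", "0-24 years", "25-64 years", "65+ years"),
  Count.Deaths = c(0, 3, 40, NA)
)

log10(0)  # -Inf, which is why zeros are dropped on a log scale

# Total deaths per age group, ignoring missing counts
totals <- toy %>%
  group_by(AgeGroup) %>%
  summarise(Total.Deaths = sum(Count.Deaths, na.rm = TRUE))

# geom_col() takes bar lengths from the data instead of counting rows
p <- ggplot(totals, aes(Total.Deaths, AgeGroup)) +
  geom_col()
p
```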

Project Milestone 2

Isabel Duxbury, Johnny Valdez

03/04/21
