Exploratory Analysis of OWID Covid Data

World Total Cases

Data as of June 10, 2020.

full_data %>%
  filter(iso_code == "OWID_WRL") %>%
  ggplot(aes(date, total_cases)) + geom_line(color = "blue") + 
  theme_minimal() +
  labs(title = "Total Covid Cases, World", subtitle = paste("Made on 6/12/20"),
       x = "Date", y="Total Cases")

By Continent

Many rows have no continent assigned, so the NA group in the plot is large.

ggplot(full_data) + geom_col(aes(date, new_cases, fill=continent))

# Facet by continent: one panel per continent, plus one for rows with no continent (NA)
ggplot(full_data) + geom_col(aes(date, new_cases, fill=continent)) +
  facet_wrap(~ continent)
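
To see what makes up the NA group, we can tabulate the locations that have no continent. This is a minimal sketch, assuming full_data has the location, continent, and new_cases columns used above. The helper name summarise_missing_continent and the toy data frame are hypothetical and only there to make the example runnable.

```r
library(dplyr)

# Hypothetical helper: rows and total new cases per location
# where the continent is missing.
summarise_missing_continent <- function(data) {
  data %>%
    filter(is.na(continent)) %>%
    group_by(location) %>%
    summarise(rows = n(),
              new_cases = sum(new_cases, na.rm = TRUE),
              .groups = "drop") %>%
    arrange(desc(new_cases))
}

# Toy stand-in data (made-up numbers, not real OWID figures)
toy <- data.frame(
  location  = c("World", "World", "Canada", "International"),
  continent = c(NA, NA, "North America", NA),
  new_cases = c(100, 120, 10, 5),
  stringsAsFactors = FALSE
)

summarise_missing_continent(toy)
```

If aggregate rows such as a world total show up in this table, excluding them before plotting by continent avoids counting the same cases twice.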

North America

When we look at just the countries in North America, we see that the United States has the highest case count by far. This makes sense, as it has the largest population in the region (among other factors).

However, this does not mean that other countries in North America are not burdened by Covid. Let’s look at total cases per million people, to gauge the effect on each country’s health system and resources.

ggplot(northamerica) + 
  geom_bar(aes(location, total_cases_per_million), stat="identity", fill="#f68060", alpha=.9, width=.4) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Covid Total Cases Per One Million People, North America", subtitle = paste("Made on 6/12/20"),
       x = "Country", y="Total Cases Per Million")

## Warning: Removed 9 rows containing missing values (position_stack).

Here we see that Montserrat, Canada, Bermuda, Panama, and several other countries in the region are experiencing a high burden of Covid cases relative to their populations. Still, the United States has the highest rate.
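
The northamerica data frame used in the plot above is not defined in the code shown here. One plausible way to build it is to keep the latest available row for each North American location. This is a sketch under that assumption, not necessarily how the original object was made; latest_by_continent and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: the most recent row per location within one continent.
# Rows with a missing continent are dropped by filter().
latest_by_continent <- function(data, continent_name) {
  data %>%
    filter(continent == continent_name) %>%
    group_by(location) %>%
    filter(date == max(date)) %>%
    ungroup()
}

# Toy stand-in data (made-up numbers, not real OWID figures)
toy <- data.frame(
  location  = c("Canada", "Canada", "Mexico", "France"),
  continent = c("North America", "North America", "North America", "Europe"),
  date      = as.Date(c("2020-06-09", "2020-06-10", "2020-06-10", "2020-06-10")),
  total_cases_per_million = c(2500, 2550, 950, 2300),
  stringsAsFactors = FALSE
)

latest_by_continent(toy, "North America")
```

With the real data, the equivalent call would be latest_by_continent(full_data, "North America"), assuming the continent label matches that spelling.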

United States

Here are the daily new cases in the US, with a smoothed trend line (a loess fit) overlaid.

unitedstates <- full_data %>%
  filter(location == "United States")

ggplot(unitedstates) + geom_col(aes(date, new_cases)) + geom_smooth(aes(date, new_cases)) +
  theme_minimal() +
  labs(title = "Covid Daily Incidence, United States", subtitle = paste("Made on 6/12/20"),
       x = "Date", y="New Cases")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
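
Note that geom_smooth() fits a loess curve here rather than computing an average. If a moving average is preferred, a trailing 7-day mean is a common alternative. The sketch below uses stats::filter() from base R; the helper name trailing_mean is hypothetical, and it assumes one row per day, sorted by date, with no missing values.

```r
# Hypothetical helper: trailing moving average over k observations.
# The first k - 1 values are NA because the window is not yet full.
# stats::filter is named explicitly because dplyr masks filter().
trailing_mean <- function(x, k = 7) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Small example with a 2-day window
trailing_mean(c(7, 14, 21, 28), k = 2)
# NA 10.5 17.5 24.5
```

Applied to the data above, this could look like mutate(unitedstates, new_cases_7d = trailing_mean(new_cases)), with the new column drawn using geom_line().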

Top 20 countries

Now let’s look at just the 20 countries with the most total cases. We can plot them in a few ways to compare their trajectories.

top_20_countries <- full_data %>%
  arrange(desc(total_cases)) %>%
  filter(iso_code %in% c("USA", "BRA", "RUS", "GBR", "IND", "ESP", "ITA", "PER", "DEU", "IRN", "TUR", "FRA", "CHL", "MEX", "PAK", "SAU", "CAN", "CHN", "QAT", "BGD"))

ggplot(top_20_countries) + geom_col(aes(date, total_cases, fill=location)) +
  theme_minimal() +
  labs(title = "Covid Total Cases, Top 20 Countries", subtitle = paste("As of 6/12/20"),
       x = "Date", y="Total Cases")+
  theme(legend.title = element_text(size = 10), legend.text = element_text(size = 8))

ggplot(top_20_countries, aes(x = date, y = total_cases, color = location)) + geom_line(size = 1) +
  theme_minimal() +
  labs(title = "Covid Total Cases, Top 20 Countries", subtitle = paste("As of 6/12/20"),
       x = "Date", y="Total Cases", color= "Country")+
  theme(legend.title = element_text(size = 10), legend.text = element_text(size = 8))
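
The country codes in top_20_countries are hard-coded, so the list goes stale when the data are refreshed. One option is to derive the codes from the data. This is a minimal sketch, assuming iso_code, date, and the ranking column exist in full_data and that aggregate rows use an "OWID_" prefix (as the world row does above). The helper top_locations and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: iso codes of the n locations with the largest value
# of `metric` on `on_date`, skipping aggregate rows such as "OWID_WRL".
top_locations <- function(data, metric, n, on_date) {
  data %>%
    filter(date == on_date,
           !is.na(iso_code),
           !startsWith(as.character(iso_code), "OWID_"),
           !is.na(.data[[metric]])) %>%
    arrange(desc(.data[[metric]])) %>%
    head(n) %>%
    pull(iso_code) %>%
    as.character()
}

# Toy stand-in data (made-up codes and numbers)
toy <- data.frame(
  iso_code    = c("OWID_WRL", "AAA", "BBB", "CCC", "AAA"),
  date        = as.Date(c("2020-06-10", "2020-06-10", "2020-06-10",
                          "2020-06-10", "2020-06-09")),
  total_cases = c(1000, 300, 500, 200, 250),
  stringsAsFactors = FALSE
)

top_locations(toy, "total_cases", n = 2, on_date = as.Date("2020-06-10"))
```

The result can be passed to filter(iso_code %in% ...) in place of the literal vector. The same helper could rank by new_cases or new_cases_per_million on a given date.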

However, total counts again hide the fact that some countries face a higher burden relative to their population size. Let’s look at cases per million.

ggplot(top_20_countries, aes(x = date, y = total_cases_per_million, color = location)) + geom_line(size = 1)+
  geom_dl(aes(label = location), method = list(dl.combine("first.points", "last.points")), cex = 0.8)+
  theme_minimal()+
  labs(title = "Covid Cases Per Million, Top 20 Countries", subtitle = paste("As of 6/12/20"),
       x = "Date", y="Total Cases per Million", color= "Country")

## Warning: Removed 8 row(s) containing missing values (geom_path).

In this graph we see that Qatar, Chile, and Peru have high case counts relative to the size of their populations.

Top 7 Countries with Highest Incidence on June 10

top_7_new_cases <- full_data %>%
  arrange(desc(new_cases)) %>%
  filter(iso_code %in% c("BRA", "USA", "IND", "RUS", "PAK", "MEX", "PER"))

ggplot(top_7_new_cases, aes(x = date, y = new_cases, color = location)) + 
  geom_line(size = 1) +
  geom_smooth(se = FALSE) +
  geom_dl(aes(label = location), method = list(dl.combine("first.points", "last.points")), cex = 0.8)+
  theme_minimal()+
  labs(title = "New Daily Covid Cases - Top 7 Countries by New Cases on June 10", subtitle = paste("As of 6/12/20"),
       x = "Date", y="New Daily Cases", color= "Country") +
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

When we split this out by country, we see that daily incidence is rising in all of these countries except the United States, where it is falling.

ggplot(top_7_new_cases, aes(x = date, y = new_cases, color = location)) + 
  geom_smooth(se = FALSE) +
  facet_wrap(~ location)+
  theme_minimal()+
  labs(title = "New Daily Covid Cases - Top 7 Countries by New Cases on June 10", subtitle = paste("As of 6/12/20"),
       x = "Date", y="New Daily Cases", color= "Country") +
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
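
To put a number on "increasing" versus "decreasing" instead of reading it off the smoothed curves, we can compare the mean of the most recent days with the mean of the days just before. This is a rough sketch, assuming one row per location per day and at least 2 * window rows per location. The helper recent_trend and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: per location, mean new cases over the last `window`
# rows versus the `window` rows immediately before them.
recent_trend <- function(data, window = 7) {
  data %>%
    group_by(location) %>%
    arrange(date, .by_group = TRUE) %>%
    summarise(
      recent   = mean(tail(new_cases, window), na.rm = TRUE),
      previous = mean(head(tail(new_cases, 2 * window), window), na.rm = TRUE),
      .groups  = "drop"
    ) %>%
    mutate(direction = if_else(recent > previous, "increasing", "not increasing"))
}

# Toy stand-in data (made-up numbers), using a 2-day window
toy <- data.frame(
  location  = rep(c("A", "B"), each = 4),
  date      = rep(as.Date("2020-06-07") + 0:3, times = 2),
  new_cases = c(1, 2, 5, 6, 9, 8, 3, 2),
  stringsAsFactors = FALSE
)

recent_trend(toy, window = 2)
```

This is a crude comparison and is sensitive to reporting delays and weekday effects, so it complements the smoothed plots and does not replace them.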

Incidence Rate (per million)

We can also look at the 7 countries with the highest incidence rate, that is, the number of new cases per one million people in the population.

top_7_incidence_rate <- full_data %>%
  arrange(desc(new_cases_per_million)) %>%
  filter(iso_code %in% c("QAT", "BHR", "CHL", "BRA", "KWT", "OMN", "PER"))

ggplot(top_7_incidence_rate, aes(x = date, y = new_cases_per_million, color = location)) + 
  geom_smooth(se = FALSE) +
  geom_dl(aes(label = location), method = list(dl.combine("first.points", "last.points")), cex = 0.8)+
  theme_minimal()+
  labs(title = "New Cases Per Million - Top 7 Countries with Highest Incidence Rate", subtitle = paste("As of 6/12/20"),
       x = "Date", y="Cases per Million", color= "Country") +
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 4 rows containing non-finite values (stat_smooth).
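
For reference, a per-million rate is conventionally the case count divided by the population, scaled by one million. The sketch below shows that arithmetic; per_million is a hypothetical helper, and the dataset's new_cases_per_million column is assumed (not confirmed here) to follow the same definition.

```r
# Hypothetical helper: cases per one million people.
per_million <- function(cases, population) {
  cases / population * 1e6
}

# Made-up example: 50 new cases in a population of 2 million
per_million(cases = 50, population = 2e6)
# 25
```

Because the denominator is the population, a small country can rank high on this measure with far fewer absolute cases than a large one.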