Research Question

What countries have the most cases of Covid-19? Which ones have the highest mortality rate from Covid-19? Out of those top countries, what were the cases and mortality rates over each month in 2020 since the pandemic hit?

Importance of this Topic

Covid-19 has affected all of our lives for almost 2 years now. It has had many effects on us personally, the Penn State community, our country, and the whole world. However, we often focus on the direct effects on our lives, like online schooling and missing out on things, and we forget how dangerous this virus has been. So, we wanted to further investigate the effects of Covid-19 across the world. We want to see how deadly this disease really is in the United States and in other countries. We want to find out exactly how many people have had it and have unfortunately died from it. We are also curious about the different months of this ongoing pandemic, when it was the deadliest, and what it is like now. We chose this topic because it is very relevant to us and we are interested in fully understanding the severity of this global pandemic.

Loading Packages

rm(list = ls())
library(tidyverse)
library(rvest)
library(utils)
library(dplyr)

First Data Source

wikipage <- "https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data"
tableList <- wikipage %>%
  read_html() %>%
  html_nodes(css = "table") %>%
  html_table(fill = TRUE)

CovidByCountry <-
  tableList[[1]] 

#Selecting only columns 2-4 because the rest are empty
CovidByCountry <-
  CovidByCountry[c(2, 3, 4)] 

#Converting Cases and Deaths to numerical values
CovidByCountry <-  
  CovidByCountry %>%
    mutate(Cases = as.numeric(gsub(",", "", Cases)),
           Deaths = as.numeric(gsub(",", "", Deaths)))
Warning: Problem with `mutate()` column `Cases`.
ℹ `Cases = as.numeric(gsub(",", "", Cases))`.
ℹ NAs introduced by coercion
Warning: Problem with `mutate()` column `Deaths`.
ℹ `Deaths = as.numeric(gsub(",", "", Deaths))`.
ℹ NAs introduced by coercion
head(CovidByCountry)
tail(CovidByCountry)
str(CovidByCountry)
tibble [197 × 3] (S3: tbl_df/tbl/data.frame)
 $ Location: chr [1:197] "World[a]" "European Union[b]" "United States" "India" ...
 $ Cases   : num [1:197] 2.71e+08 5.03e+07 5.02e+07 3.47e+07 2.22e+07 ...
 $ Deaths  : num [1:197] 5320822 873736 800343 476135 616970 ...

This data table shows the total covid cases and deaths for every country. It is a grand total from the start of the pandemic. There are 197 rows in the data, as we can see from using the str() function; these include aggregate rows such as the world and the European Union.

Second Data Source

EuropaData <- read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
EuropaData
head(EuropaData)
tail(EuropaData)
str(EuropaData)
'data.frame':   61900 obs. of  12 variables:
 $ dateRep                                                   : chr  "14/12/2020" "13/12/2020" "12/12/2020" "11/12/2020" ...
 $ day                                                       : int  14 13 12 11 10 9 8 7 6 5 ...
 $ month                                                     : int  12 12 12 12 12 12 12 12 12 12 ...
 $ year                                                      : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
 $ cases                                                     : int  746 298 113 63 202 135 200 210 234 235 ...
 $ deaths                                                    : int  6 9 11 10 16 13 6 26 10 18 ...
 $ countriesAndTerritories                                   : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ geoId                                                     : chr  "AF" "AF" "AF" "AF" ...
 $ countryterritoryCode                                      : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ popData2019                                               : int  38041757 38041757 38041757 38041757 38041757 38041757 38041757 38041757 38041757 38041757 ...
 $ continentExp                                              : chr  "Asia" "Asia" "Asia" "Asia" ...
 $ Cumulative_number_for_14_days_of_COVID.19_cases_per_100000: num  9.01 7.05 6.87 7.13 6.97 ...

This data table shows the covid cases and deaths reported every day in each country. The table includes 61,900 observations of 12 variables. Each row describes one country’s cases and deaths on a given day, so there are multiple rows for a single country.

Total Cases and Deaths by Country

#table displaying the sum of all COVID cases per country
EuropaCases <- 
  EuropaData %>%
  group_by(countriesAndTerritories) %>%
  summarise(totalcases = sum(cases)) %>%
  arrange(desc(totalcases))

#table displaying the sum of all COVID deaths per country
EuropaDeaths <- 
  EuropaData %>%
  group_by(countriesAndTerritories) %>%
  summarise(totaldeaths = sum(deaths)) %>%
  arrange(desc(totaldeaths))

#table combining the total cases and total deaths table, then showing the top 10 countries based on total number of cases
EuropaMerge <-
  merge(x=EuropaCases, y = EuropaDeaths, all=TRUE) %>%
  filter(totalcases > 1400000) %>%
  arrange(desc(totalcases))
EuropaMerge

First, we wanted to combine the total number of cases and deaths by country in order to see which country had the highest total case and death counts. This is a good start to help us answer our research question. We created two tables, one for deaths and the other for cases, with the totals of each and the name of the country. Then, we combined these two tables into one.

Total Cases and Deaths by Continent

#table showing the total number of cases and deaths broken up by continent
Continenttotals<-
  EuropaData%>%
  group_by(continentExp) %>%
    summarise(TotalCases = sum(cases, na.rm = TRUE), TotalDeaths = sum(deaths, na.rm = TRUE))
  
Continenttotals
Continenttotals %>%
  ggplot(aes(x = continentExp, y = TotalCases)) +
  geom_col(aes(color = continentExp, fill = continentExp))+
  ggtitle("Covid cases in each Continent")


Continenttotals %>%
  ggplot(aes(x = continentExp, y = TotalDeaths)) +
  geom_col(aes(color = continentExp, fill = continentExp))+
  ggtitle("Covid deaths in each Continent")

Next, we wanted to see the total number of deaths and cases per continent to see the spread of COVID worldwide. We visualized this spread in a bar graph, with each bar representing a continent. Obviously, as Antarctica is barren, it would not make sense to include it in the graph.

Visualization of Cases Worldwide in July

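#wrangling data to include only dates in July and dropping daily counts of 25,000 or more as outliers
#dateRep is formatted dd/mm/yyyy, so "^07" would match the 7th day of every month; the month column is used instead
EuropaData2 <- 
  EuropaData %>%
  filter(month == 7) %>%
  filter(cases < 25000)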


EuropaData2 %>%
  ggplot(aes(x = continentExp, fill = continentExp)) + 
  geom_density(alpha = .25) + 
  xlab("Continent")

This graph shows how the daily reports from July are distributed across continents, that is, how often countries in each continent appear in the July data. We overlaid the continents in a single density plot to show this.

Top 5 Countries with Highest Case totals

We wanted to find out which 5 countries had the highest number of cases of Covid-19. We excluded the world and European Union totals because we wanted to focus on individual countries. Using rank(), we made a table of these top 5 countries. They include the United States, India, Brazil, the United Kingdom, and Russia. The United States had the most with 49,833,439 cases. We also represented this data visually in a bar graph with the countries on the x-axis and the cases on the y-axis. Each country is shown in a different color.

Top5Cases <-
  CovidByCountry %>%
    filter(Location != "World[a]") %>%
    filter(Location != "European Union[b]") %>%
    filter( rank(desc(Cases)) <= 5 ) %>%
    select(Location, Cases)

Top5Cases

Top5Cases %>%
    ggplot(aes(x = Location, y = Cases)) +
  geom_col(aes(color = Location, fill = Location)) + 
  ggtitle("Highest Number of Cases in Top 5 Countries")

We did the same analysis on the total number of deaths. The top 5 countries here included the United States, India, Brazil, Russia, and Mexico. The only difference between the Top5Cases data and the Top5Deaths data is that the United Kingdom appears only in the first and Mexico only in the second: the United Kingdom was 4th for the highest cases and Mexico was 5th for the highest deaths. The other 4 countries were included in both tables. The United States was #1 again here with 796,764 deaths. We also made a bar chart for Top5Deaths that represents these 5 countries.

Top5Deaths <-
  CovidByCountry %>%
    filter(Location != "World[a]") %>%
    filter(Location != "European Union[b]") %>%
    filter( rank(desc(Deaths)) <= 5 ) %>%
    select(Location, Deaths)

Top5Deaths

Top5Deaths %>%
    ggplot(aes(x = Location, y = Deaths)) +
  geom_col(aes(color = Location, fill = Location)) + 
  ggtitle("Highest Number of Deaths in Top 5 Countries")

Mortality Rate

To answer part of our research question, “Which countries have the highest mortality rates of Covid-19?”, we used the first data set to compare the ratio of deaths to cases for each country. We added a variable “ratio” to represent the mortality rate, calculated by dividing the number of deaths by the number of cases for each country (strictly speaking, this is a case fatality ratio, since it only counts reported cases). We then ranked the countries based on their mortality rates. We removed the 2 rows that represent the whole world and the whole European Union, since we are only interested in individual countries. Then, we made a scatterplot of each country’s mortality rate to compare them. The x-axis represents the countries and the y-axis represents the mortality rate. We also used the variable Deaths to set the size of each point. The graph shows that most mortality rates were below .05, or 5%, with very few countries above that. The highest mortality rate is about 19%. However, it is represented by a very small dot, meaning there were very few deaths in that country. This is interesting because it shows that the country with the highest mortality rate has very few deaths, which suggests it may be an outlier.

MortalityRate <-
  CovidByCountry %>%
    group_by(Location) %>%
    mutate(ratio = Deaths/Cases)

MortalityRate

HighestMortalityRates <-
  MortalityRate %>%
    filter(Location != "World[a]") %>%
    filter(Location != "European Union[b]")

HighestMortalityRates$rank <- rank(HighestMortalityRates$ratio)

HighestMortalityRates <-  
  HighestMortalityRates %>%
    na.omit(ratio) %>%
    arrange(desc(rank)) 

HighestMortalityRates

TotalMortalityRates <-
  HighestMortalityRates %>%
    ggplot(aes(x = Location, y = ratio)) +
    geom_point(aes(size = Deaths)) +
    ggtitle("Mortality Rates of Covid-19") + 
    xlab("Country") +
    ylab("Mortality Rate")
TotalMortalityRates

After looking at the overall mortality rates, we wanted to focus on the 5 highest ones. To do this, we filtered for the countries with the highest ratio values and put those into a separate table. Then, we made a bar graph showing these 5 countries and their mortality rates. The top 5 countries were Yemen, Vanuatu, Peru, Mexico, and Sudan. The highest mortality rate was .19510740, or about 19.5%, which was in Yemen.

Top5HighestMortalityRates <-
  HighestMortalityRates %>%
    filter(rank >= 185) #Since there are 189 countries with values in 'ratio' we want ones with rank greater than or equal to 185.

Top5HighestMortalityRates

Top5HighestMortalityRates %>%
  ggplot(aes(x = Location, y = ratio)) +
  geom_col(aes(color = Location, fill = Location)) +
  ggtitle("Highest 5 Mortality Rates of Covid-19") +
  xlab("Country") +
  ylab("Mortality Rate")

Calculating Average Mortality Rate

To further our investigation of the mortality rate of Covid-19, we wanted to explore the average mortality rate. To find the mean across all countries, we used the mean() function. The mean of all of the ratio values is 0.02102764. After examining the scatterplot we created, we wanted to remove outliers, meaning countries with very few cases. For example, Yemen had the highest mortality rate but relatively few deaths. So, we filtered out countries that had 100 or fewer cases of Covid-19. Then, we used a for loop to go through the countries with more than 100 cases and add each ratio to a new variable, “avg”. We also kept track of how many countries met this condition in a variable “index” so that we could calculate the average. After the for loop ran, we divided “avg” by “index” to get the average mortality rate for countries with more than 100 cases of Covid-19. The result was 0.02058139, which is lower than the original mean. The change was not as large as we expected, but there was still a difference of about .05 percentage points. Because the first data source is scraped live, these values change slightly each time the notebook is re-run.

#Finding mean mortality rate with all data included
mean(HighestMortalityRates$ratio)
[1] 0.02084382
#Finding mean mortality rate with only countries with more than 100 cases of Covid-19
CasesOver100 <-
  HighestMortalityRates %>%
    filter(Cases > 100)
avg = 0
index = 0
for (i in CasesOver100$ratio) {
    avg <- avg + i
    index <- index + 1
}
avg <- avg/index
avg
[1] 0.0205223

Monthly Cases and Deaths

We decided to analyze the Covid-19 cases and deaths for the year 2020 in the countries with the most cases and deaths. To do this, we first selected the United States, India and Brazil because, based on our previous findings, they have the most cases. We grouped their data by month in order to see the changes over time. This let us see the monthly cases and deaths for the three countries.

USAmonthly<-
  EuropaData%>%
    filter(countriesAndTerritories== "United_States_of_America")%>%
  group_by(month) %>%
    summarise(Monthlycases = sum(cases, na.rm = TRUE),  Monthlydeaths = sum(deaths, na.rm = TRUE))%>%
  mutate(Country="USA")

Indiamonthly<-
  EuropaData%>%
    filter(countriesAndTerritories== "India")%>%
  group_by(month) %>%
    summarise(Monthlycases = sum(cases, na.rm = TRUE),  Monthlydeaths = sum(deaths, na.rm = TRUE))%>%
  mutate(Country="India")

Brazilmonthly<-
  EuropaData%>%
    filter(countriesAndTerritories== "Brazil")%>%
  group_by(month) %>%
    summarise(Monthlycases = sum(cases, na.rm = TRUE),  Monthlydeaths = sum(deaths, na.rm = TRUE))%>%
  mutate(Country="Brazil")

Merging tables

We merged the tables of monthly cases and deaths for the top 3 countries so we could graph them.

Firstmerge<-
  merge(x=USAmonthly, y=Indiamonthly, all=TRUE)

Top3<-
  merge(x=Firstmerge, y=Brazilmonthly, all=TRUE)

Graphing the top three countries

We graphed monthly Covid cases for the top three countries. The USA, Brazil and India are each represented by a different line. The cases are graphed over time in 2020. The United States had a spike in November and India had a spike in September. Overall, cases increased after the pandemic started. The drop at the end of the year should be read with caution: the most recent date in the data appears to be 14/12/2020, so December is only a partial month.

Top3%>%
  ggplot(aes(x=month, y=Monthlycases)) +
  geom_line(aes(linetype=Country, color=Country))+
  xlab("Month")+
  ylab("Cases")+
  ggtitle("Monthly Covid Cases for the Top Three Countries in 2020")+
  xlim(0,12)

We also graphed monthly Covid deaths for the top three countries. The USA, Brazil and India are represented by their respective lines here as well. The deaths are graphed over time in 2020. There was a spike in USA deaths in April, when the pandemic was new and unmanageable. Deaths increased for all of the countries in the summer as the pandemic picked up speed. Toward the end of 2020 the monthly deaths appear to decrease, which may partly reflect hospitals regaining control, although December is only partially covered by the data.

Top3%>%
  ggplot(aes(x=month, y=Monthlydeaths)) +
  geom_line(aes(linetype=Country, color=Country))+
  xlab("Month")+
  ylab("Deaths")+
  ggtitle("Monthly Covid Deaths for the Top Three Countries in 2020")+
  xlim(0,12)

Challenges we encountered

One of the main technical challenges we had to overcome came when we were trying to rank the countries from the first data source by their mortality rate. This is in the section entitled “Mortality Rate”. At first, after we used mutate() to add the “ratio” variable, we tried to use filter(rank(desc(ratio) >= 5)) in order to get the top 5 highest mortality rates. However, the resulting countries were not in order of descending ratio. We could not identify the root of this problem at the time, even after asking for help. A likely cause is that the table was still grouped by Location from the earlier group_by() call, so rank() was computed within each one-row group; in addition, filter() only keeps or drops rows and never reorders them. To overcome this issue, we changed how we found the countries with the highest mortality rates. Instead of calling rank() inside filter(), we created a new variable in the HighestMortalityRates table called “rank”. The values of this new variable are the rankings of the countries from lowest to highest mortality rate. We did this with HighestMortalityRates$rank <- rank(HighestMortalityRates$ratio). Once we had this new variable, we used the arrange() function to order the countries by their rankings. Finally, we were able to find the top 5 by filtering for the countries with ranks greater than or equal to 185, since there were a total of 189 countries and the ranking ran from lowest to highest mortality rate.

Another challenge we had was with our first data source. We were querying the numbers in this data set and were not getting the correct results. We then realized that the variables Cases and Deaths were stored as character strings instead of numbers. To overcome this issue, we used as.numeric() to convert the data from character to numeric type. The original numbers also contained commas in the character strings, so we first had to use the gsub() function, which matches text with regular expressions, to remove the commas before converting the data types.

Significant Findings and Conclusion

In analyzing the Covid-19 data sets, we were able to draw conclusions based on our findings. Our first step was finding the countries with the highest Covid-19 cases. These countries, in descending order, were the United States, India, Brazil, the United Kingdom and Russia. One reason these countries could have the highest number of cases is their large populations. When looking at the countries with the most deaths, the results were similar. The countries with the top 5 Covid-19 death counts are the United States, India, Brazil, Russia and Mexico. The two lists are very similar, with the exception that Mexico replaced the United Kingdom on the top deaths list.

To get a more accurate picture of which countries struggled most with losing people to Covid-19, we calculated the mortality rates. The countries with the highest mortality rates are Yemen, Vanuatu, Peru, Mexico and Sudan. One possible reason is that these countries may not have the resources or healthcare systems to keep up with the pandemic. Further analysis of the countries’ healthcare and economic systems, as well as their standards of living, would be needed to understand why they have higher mortality rates. We also calculated the average mortality rate across countries. The result was 2.05%, which is relatively low. With modern medicine, most people survived Covid-19.

Additionally, we graphed the countries with the highest numbers of cases and deaths in 2020 to see trends in the pandemic. We chose the top three countries because they were heavily impacted by the pandemic. In the cases graph, cases in all three countries increased steadily over the summer. Brazil’s cases began to come under control in August and India’s cases started to decline in September. The United States had large spikes in July and November as it struggled to keep cases under control despite enforcing masking requirements.

In the deaths graph, the United States had a very large spike in April. This is when the country was under lockdown, hospitals were overcrowded, and it was difficult to keep patients alive. Deaths in India and Brazil increased steadily once the pandemic started. Over the summer, the death counts leveled off as hospitals were better able to manage patients. At the end of the year, death counts declined. Since then, vaccines have been released to the public, saving many from Covid-19, although our monthly data end in December 2020 and cannot show this.

---
title: "Final Project Report"
author: "Anna Gillard, Katie Kelly, Kelly McVeigh"
date: "Due: December 15, 2021"
output: html_notebook
---


### Research Question

What countries have the most cases of Covid-19? Which ones have the highest mortality rate from Covid-19? Out of those top countries, what were the cases and mortality rates over each month in 2020 since the pandemic hit? 

### Importance of this Topic
Covid-19 has affected all of our lives for almost 2 years now. It has had many effects on us personally, the Penn State community, our country, and the whole world. However, we often focus on the direct effects on our lives, like online schooling and missing out on things, and we forget how dangerous this virus has been. So, we wanted to further investigate the effects of Covid-19 across the world. We want to see how deadly this disease really is in the United States and in other countries. We want to find out exactly how many people have had it and have unfortunately died from it. We are also curious about the different months of this ongoing pandemic, when it was the deadliest, and what it is like now. We chose this topic because it is very relevant to us and we are interested in fully understanding the severity of this global pandemic. 


### Loading Packages
```{r}
rm(list = ls())
library(tidyverse)
library(rvest)
library(utils)
library(dplyr)
```


### First Data Source
```{r}
wikipage <- "https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data"
tableList <- wikipage %>%
  read_html() %>%
  html_nodes(css = "table") %>%
  html_table(fill = TRUE)

CovidByCountry <-
  tableList[[1]] 

#Selecting only columns 2-4 because the rest are empty
CovidByCountry <-
  CovidByCountry[c(2, 3, 4)] 

#Converting Cases and Deaths to numerical values
CovidByCountry <-  
  CovidByCountry %>%
    mutate(Cases = as.numeric(gsub(",", "", Cases)),
           Deaths = as.numeric(gsub(",", "", Deaths)))

head(CovidByCountry)
tail(CovidByCountry)
str(CovidByCountry)
```
This data table shows the total covid cases and deaths for every country. It is a grand total from the start of the pandemic. There are 197 rows in the data, as we can see from using the str() function; these include aggregate rows such as the world and the European Union.
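
The conversion step above can be wrapped in a small helper. This is only a sketch on made-up values, not the scraped table: `parse_count` is a hypothetical name, and the `"unknown"` entry is an assumed stand-in for whatever non-numeric text causes the "NAs introduced by coercion" warnings.

```r
# Hypothetical helper: turn strings such as "1,234" into numbers.
# Anything that is not a plain number after removing commas becomes NA.
parse_count <- function(x) {
  cleaned <- gsub(",", "", x, fixed = TRUE)
  suppressWarnings(as.numeric(cleaned))
}

# Made-up example values (not real data)
raw <- c("1,234", "56", "unknown", "7,890,123")
parsed <- parse_count(raw)
parsed
sum(is.na(parsed))  # number of entries that could not be converted
```

Counting the NA values afterwards is one way to check how many rows were affected instead of only seeing the warning.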


### Second Data Source 
```{r}
EuropaData <- read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM")
EuropaData
head(EuropaData)
tail(EuropaData)
str(EuropaData)
```
This data table shows the covid cases and deaths reported every day in each country. The table includes 61,900 observations of 12 variables. Each row describes one country's cases and deaths on a given day, so there are multiple rows for a single country.

### Total Cases and Deaths by Country
```{r}
#table displaying the sum of all COVID cases per country
EuropaCases <- 
  EuropaData %>%
  group_by(countriesAndTerritories) %>%
  summarise(totalcases = sum(cases)) %>%
  arrange(desc(totalcases))

#table displaying the sum of all COVID deaths per country
EuropaDeaths <- 
  EuropaData %>%
  group_by(countriesAndTerritories) %>%
  summarise(totaldeaths = sum(deaths)) %>%
  arrange(desc(totaldeaths))

#table combining the total cases and total deaths table, then showing the top 10 countries based on total number of cases
EuropaMerge <-
  merge(x=EuropaCases, y = EuropaDeaths, all=TRUE) %>%
  filter(totalcases > 1400000) %>%
  arrange(desc(totalcases))
EuropaMerge
```
First, we wanted to combine the total number of cases and deaths by country in order to see which country had the highest total case and death counts. This is a good start to help us answer our research question. We created two tables, one for deaths and the other for cases, with the totals of each and the name of the country. Then, we combined these two tables into one. 
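
As an alternative to building two tables and merging them, both totals can be computed in one `summarise()` call. The sketch below uses a made-up data frame (`daily`) with the same column names as EuropaData; adding `na.rm = TRUE` is our assumption about how missing daily values should be handled.

```r
library(dplyr)

# Made-up daily reports (not real data)
daily <- data.frame(
  countriesAndTerritories = c("A", "A", "B", "B", "C"),
  cases  = c(10, 20, 5, NA, 100),
  deaths = c(1, 2, 0, 1, 7)
)

# One grouped summary gives both totals per country
totals <- daily %>%
  group_by(countriesAndTerritories) %>%
  summarise(
    totalcases  = sum(cases, na.rm = TRUE),
    totaldeaths = sum(deaths, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(totalcases))

totals
```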

### Total Cases and Deaths by Continent
```{r}
#table showing the total number of cases and deaths broken up by continent
Continenttotals<-
  EuropaData%>%
  group_by(continentExp) %>%
	summarise(TotalCases = sum(cases, na.rm = TRUE), TotalDeaths = sum(deaths, na.rm = TRUE))
  
Continenttotals
```


```{r}
Continenttotals %>%
  ggplot(aes(x = continentExp, y = TotalCases)) +
  geom_col(aes(color = continentExp, fill = continentExp))+
  ggtitle("Covid cases in each Continent")

Continenttotals %>%
  ggplot(aes(x = continentExp, y = TotalDeaths)) +
  geom_col(aes(color = continentExp, fill = continentExp))+
  ggtitle("Covid deaths in each Continent")
```
Next, we wanted to see the total number of deaths and cases per continent to see the spread of COVID worldwide. We visualized this spread in a bar graph, with each bar representing a continent. Obviously, as Antarctica is barren, it would not make sense to include it in the graph. 

### Visualization of Cases Worldwide in July 
```{r}
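#wrangling data to include only dates in July and dropping daily counts of 25,000 or more as outliers
#dateRep is formatted dd/mm/yyyy, so "^07" would match the 7th day of every month; the month column is used instead
EuropaData2 <- 
  EuropaData %>%
  filter(month == 7) %>%
  filter(cases < 25000)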


EuropaData2 %>%
  ggplot(aes(x = continentExp, fill = continentExp)) + 
  geom_density(alpha = .25) + 
  xlab("Continent")
```  
This graph shows how the daily reports from July are distributed across continents, that is, how often countries in each continent appear in the July data. We overlaid the continents in a single density plot to show this. 
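
Because dateRep is stored as text in dd/mm/yyyy order, a pattern anchored at the start of the string tests the day rather than the month. The short sketch below, on made-up dates, compares that pattern with a month test on parsed dates; the variable names are ours.

```r
# Made-up report dates in the dd/mm/yyyy layout used by dateRep
dateRep <- c("07/03/2020", "15/07/2020", "07/07/2020", "31/07/2020")

# A pattern anchored at the start of the string tests the DAY, not the month
starts_with_07 <- grepl("^07", dateRep)

# Parsing the string as a date lets us test the month directly
parsed <- as.Date(dateRep, format = "%d/%m/%Y")
is_july <- format(parsed, "%m") == "07"

data.frame(dateRep, starts_with_07, is_july)
```

In EuropaData itself the separate `month` column gives the same information without any parsing.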

### Top 5 Countries with Highest Case totals
We wanted to find out which 5 countries had the highest number of cases of Covid-19. We excluded the world and European Union totals because we wanted to focus on individual countries. Using rank(), we made a table of these top 5 countries. They include the United States, India, Brazil, the United Kingdom, and Russia. The United States had the most with 49,833,439 cases. We also represented this data visually in a bar graph with the countries on the x-axis and the cases on the y-axis. Each country is shown in a different color. 
```{r}
Top5Cases <-
  CovidByCountry %>%
    filter(Location != "World[a]") %>%
    filter(Location != "European Union[b]") %>%
    filter( rank(desc(Cases)) <= 5 ) %>%
    select(Location, Cases)

Top5Cases

Top5Cases %>%
	ggplot(aes(x = Location, y = Cases)) +
  geom_col(aes(color = Location, fill = Location)) + 
  ggtitle("Highest Number of Cases in Top 5 Countries")
```

We did the same analysis on the total number of deaths. The top 5 countries here included the United States, India, Brazil, Russia, and Mexico. The only difference between the Top5Cases data and the Top5Deaths data is that the United Kingdom appears only in the first and Mexico only in the second: the United Kingdom was 4th for the highest cases and Mexico was 5th for the highest deaths. The other 4 countries were included in both tables. The United States was #1 again here with 796,764 deaths. We also made a bar chart for Top5Deaths that represents these 5 countries. 
```{r}
Top5Deaths <-
  CovidByCountry %>%
    filter(Location != "World[a]") %>%
    filter(Location != "European Union[b]") %>%
    filter( rank(desc(Deaths)) <= 5 ) %>%
    select(Location, Deaths)

Top5Deaths

Top5Deaths %>%
	ggplot(aes(x = Location, y = Deaths)) +
  geom_col(aes(color = Location, fill = Location)) + 
  ggtitle("Highest Number of Deaths in Top 5 Countries")
```
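
A top-N table like the two above can also be produced with dplyr's `slice_max()`, which returns the rows already sorted from largest to smallest. This is a sketch on made-up totals, where "World" is an assumed stand-in for the aggregate rows that are filtered out.

```r
library(dplyr)

# Made-up totals (not real data); "World" stands in for an aggregate row
toy <- data.frame(
  Location = c("World", "A", "B", "C", "D", "E", "F"),
  Cases    = c(1000, 300, 250, NA, 200, 150, 100)
)

top3 <- toy %>%
  filter(Location != "World") %>%  # drop the aggregate row
  slice_max(Cases, n = 3)          # three largest values, largest first

top3
```

The row with a missing count is not selected here because enough non-missing values are available.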


### Mortality Rate 
To answer part of our research question, "Which countries have the highest mortality rates of Covid-19?", we used the first data set to compare the ratio of deaths to cases for each country. We added a variable "ratio" to represent the mortality rate, calculated by dividing the number of deaths by the number of cases for each country (strictly speaking, this is a case fatality ratio, since it only counts reported cases). We then ranked the countries based on their mortality rates. We removed the 2 rows that represent the whole world and the whole European Union, since we are only interested in individual countries. Then, we made a scatterplot of each country's mortality rate to compare them. The x-axis represents the countries and the y-axis represents the mortality rate. We also used the variable Deaths to set the size of each point. The graph shows that most mortality rates were below .05, or 5%, with very few countries above that. The highest mortality rate is about 19%. However, it is represented by a very small dot, meaning there were very few deaths in that country. This is interesting because it shows that the country with the highest mortality rate has very few deaths, which suggests it may be an outlier.  
```{r}
MortalityRate <-
  CovidByCountry %>%
  	group_by(Location) %>%
    mutate(ratio = Deaths/Cases)

MortalityRate

HighestMortalityRates <-
  MortalityRate %>%
    filter(Location != "World[a]") %>%
    filter(Location != "European Union[b]")

HighestMortalityRates$rank <- rank(HighestMortalityRates$ratio)

HighestMortalityRates <-  
  HighestMortalityRates %>%
    na.omit(ratio) %>%
    arrange(desc(rank)) 

HighestMortalityRates

TotalMortalityRates <-
  HighestMortalityRates %>%
  	ggplot(aes(x = Location, y = ratio)) +
    geom_point(aes(size = Deaths)) +
    ggtitle("Mortality Rates of Covid-19") + 
    xlab("Country") +
    ylab("Mortality Rate")
TotalMortalityRates
```
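
One detail of the chunk above is worth illustrating: while a table is grouped by Location, `rank()` inside `mutate()` or `filter()` is computed within each group, and each group here has only one row. The sketch below uses made-up counts; the column names `rank_in_group` and `rank_overall` are ours.

```r
library(dplyr)

# Made-up counts (not real data)
toy <- data.frame(
  Location = c("A", "B", "C", "D"),
  Cases    = c(1000, 200, 50, 400),
  Deaths   = c(10, 8, 5, 8)
)

# Grouped by Location, each group holds one row, so every rank is 1
grouped <- toy %>%
  group_by(Location) %>%
  mutate(ratio = Deaths / Cases,
         rank_in_group = rank(desc(ratio)))

# After ungroup(), ranks run across all countries
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(rank_overall = rank(desc(ratio))) %>%
  arrange(rank_overall)

ungrouped
```

Note that `filter()` on its own never reorders rows, so `arrange()` is still needed to show the ranking in order.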

After looking at the overall mortality rates, we wanted to focus on the 5 highest ones. To do this, we filtered for the countries with the highest ratio values and put those into a separate table. Then, we made a bar graph showing these 5 countries and their mortality rates. The top 5 countries were Yemen, Vanuatu, Peru, Mexico, and Sudan. The highest mortality rate was .19510740, or about 19.5%, which was in Yemen. 
```{r}
Top5HighestMortalityRates <-
  HighestMortalityRates %>%
    filter(rank >= 185) #Since there are 189 countries with values in 'ratio' we want ones with rank greater than or equal to 185.

Top5HighestMortalityRates

Top5HighestMortalityRates %>%
  ggplot(aes(x = Location, y = ratio)) +
  geom_col(aes(color = Location, fill = Location)) +
  ggtitle("Highest 5 Mortality Rates of Covid-19") +
  xlab("Country") +
  ylab("Mortality Rate")
```

### Calculating Average Mortality Rate
To further our investigation of the mortality rate of Covid-19, we wanted to explore the average mortality rate. To find the mean across all countries, we used the mean() function. The mean of all of the ratio values is 0.02102764. After examining the scatterplot we created, we wanted to remove outliers, meaning countries with very few cases. For example, Yemen had the highest mortality rate but relatively few deaths. So, we filtered out countries that had 100 or fewer cases of Covid-19. Then, we used a for loop to go through the countries with more than 100 cases and add each ratio to a new variable, "avg". We also kept track of how many countries met this condition in a variable "index" so that we could calculate the average. After the for loop ran, we divided "avg" by "index" to get the average mortality rate for countries with more than 100 cases of Covid-19. The result was 0.02058139, which is lower than the original mean. The change was not as large as we expected, but there was still a difference of about .05 percentage points. Because the first data source is scraped live, these values change slightly each time the notebook is re-run. 
```{r}
#Finding mean mortality rate with all data included
mean(HighestMortalityRates$ratio)

#Finding mean mortality rate with only countries with more than 100 cases of Covid-19
CasesOver100 <-
  HighestMortalityRates %>%
    filter(Cases > 100)
avg = 0
index = 0
for (i in CasesOver100$ratio) {
    avg <- avg + i
    index <- index + 1
}
avg <- avg/index
avg
```
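
The loop above accumulates the ratios and divides by their count, which is what `mean()` does directly. The sketch below, on made-up counts, shows that equivalence and also a different quantity, the pooled rate (all deaths divided by all cases), which weights each country by its number of cases. The names `mean_of_ratios` and `pooled_rate` are ours.

```r
# Made-up counts (not real data)
toy <- data.frame(
  Location = c("A", "B", "C"),
  Cases    = c(1000, 200, 50),
  Deaths   = c(10, 8, 5)
)
toy$ratio <- toy$Deaths / toy$Cases

# Same result as accumulating in a for loop and dividing by the count
mean_of_ratios <- mean(toy$ratio)

# A different quantity: all deaths divided by all cases,
# which weights each country by its number of cases
pooled_rate <- sum(toy$Deaths) / sum(toy$Cases)

c(mean_of_ratios = mean_of_ratios, pooled_rate = pooled_rate)
```

The two numbers answer different questions: the first treats every country equally, while the second is closer to an overall rate for all reported cases combined.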



### Monthly Cases and Deaths
We decided to analyze the Covid-19 cases and deaths for the year 2020 in the countries with the most cases and deaths. To do this, we first selected the United States, India and Brazil because, based on our previous findings, they have the most cases. We grouped their data by month in order to see the changes over time. This let us see the monthly cases and deaths for the three countries.

```{r}
USAmonthly <-
  EuropaData %>%
  filter(countriesAndTerritories == "United_States_of_America") %>%
  group_by(month) %>%
  summarise(Monthlycases = sum(cases, na.rm = TRUE),
            Monthlydeaths = sum(deaths, na.rm = TRUE)) %>%
  mutate(Country = "USA")

Indiamonthly <-
  EuropaData %>%
  filter(countriesAndTerritories == "India") %>%
  group_by(month) %>%
  summarise(Monthlycases = sum(cases, na.rm = TRUE),
            Monthlydeaths = sum(deaths, na.rm = TRUE)) %>%
  mutate(Country = "India")

Brazilmonthly <-
  EuropaData %>%
  filter(countriesAndTerritories == "Brazil") %>%
  group_by(month) %>%
  summarise(Monthlycases = sum(cases, na.rm = TRUE),
            Monthlydeaths = sum(deaths, na.rm = TRUE)) %>%
  mutate(Country = "Brazil")

```
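
The three blocks above repeat the same steps for each country. As a sketch of an alternative, the same monthly totals can be computed in one pipeline by filtering with `%in%` and grouping by both country and month. The `toy_europa` table below is a hypothetical stand-in that only borrows the column names of EuropaData; its values are invented.

```{r}
library(dplyr)

# Hypothetical stand-in for EuropaData with the same column names
toy_europa <- tibble(
  countriesAndTerritories = c("India", "India", "Brazil", "Brazil", "Brazil", "France"),
  month  = c(3, 3, 3, 4, 4, 3),
  cases  = c(10, 5, 7, NA, 4, 100),
  deaths = c(1, 0, 2, 1, NA, 9)
)

top_countries <- c("India", "Brazil")

# One pass: keep the chosen countries, then total cases and deaths
# for every country-month combination
toy_monthly <- toy_europa %>%
  filter(countriesAndTerritories %in% top_countries) %>%
  group_by(Country = countriesAndTerritories, month) %>%
  summarise(
    Monthlycases  = sum(cases, na.rm = TRUE),
    Monthlydeaths = sum(deaths, na.rm = TRUE),
    .groups = "drop"
  )

toy_monthly
```

With the real data, the country label would still need to be shortened (for example from "United_States_of_America" to "USA") if the shorter name is wanted in the plot legend.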

### Merging tables
We merged the monthly tables for the top three countries' cases and deaths into a single table so we could graph them.

```{r}
Firstmerge<-
  merge(x=USAmonthly, y=Indiamonthly, all=TRUE)

Top3<-
  merge(x=Firstmerge, y=Brazilmonthly, all=TRUE)

```
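
Because the monthly tables all share the same columns, merging them with `all = TRUE` amounts to stacking their rows. A minimal sketch of the same idea with `bind_rows()` from dplyr is shown below; the tables `a` and `b` are hypothetical and only mimic the column layout of our monthly tables.

```{r}
library(dplyr)

# Hypothetical monthly tables with the same columns as USAmonthly, Indiamonthly, ...
a <- tibble(month = c(1, 2), Monthlycases = c(5, 8), Monthlydeaths = c(0, 1), Country = "A")
b <- tibble(month = c(1, 2), Monthlycases = c(3, 8), Monthlydeaths = c(0, 1), Country = "B")

# Stack the rows of both tables
stacked <- bind_rows(a, b)

# Full merge on all shared columns, as in the chunk above
merged <- merge(x = a, y = b, all = TRUE)

nrow(stacked)
nrow(merged)
```

`bind_rows()` accepts any number of tables in one call, so a second merge step would not be needed for the third country.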

### Graphing the top three countries
We graphed the monthly Covid-19 cases for the top three countries. The USA, Brazil and India are each represented by a line with its own color and line type. The cases are graphed over time in 2020. The United States had a spike in November and India had a spike in September. Overall, cases increased after the pandemic started, and the monthly totals dipped slightly at the end of the year, which may reflect the pandemic coming under better control.
```{r}
Top3%>%
  ggplot(aes(x=month, y=Monthlycases)) +
  geom_line(aes(linetype=Country, color=Country))+
  xlab("Month")+
  ylab("Cases")+
  ggtitle("Monthly Covid Cases for the Top Three Countries in 2020")+
  xlim(0,12)
```
We also graphed the monthly Covid-19 deaths for the top three countries. The USA, Brazil and India are represented by their respective lines here as well. The deaths are graphed over time in 2020. There was a spike in USA deaths in April, when the pandemic was new and unmanageable. Deaths increased for all of the countries in the summer as the pandemic picked up speed. Later in 2020, the hospitals regained control and the deaths decreased.

```{r}
Top3%>%
  ggplot(aes(x=month, y=Monthlydeaths)) +
  geom_line(aes(linetype=Country, color=Country))+
  xlab("Month")+
  ylab("Deaths")+
  ggtitle("Monthly Covid Deaths for the Top Three Countries in 2020")+
  xlim(0,12)
```



### Challenges we encountered
One of the main technical challenges we had to overcome came up when we were trying to rank the countries from the first data source by their mortality rate. This is in the section entitled "Mortality Rate". At first, after we used mutate() to add the "ratio" variable, we tried to use filter(rank(desc(ratio) >= 5)) in order to get the top 5 highest mortality rates. However, the resulting countries were not in order of descending ratio. We could not discover the root of this problem even after asking for help. It may have been because the ratios are all decimals, but we are still unsure. To overcome this issue, we had to change how we found the countries with the highest mortality rates. Instead of calling rank() inside filter(), we created a new variable in the HighestMortalityRates table called "rank". The values of this new variable are the rankings of the countries from lowest to highest mortality rate. We did this with HighestMortalityRates$rank <- rank(HighestMortalityRates$ratio). Once we had this new variable, we used the arrange() function to order the countries by their rankings. Finally, we were able to find the top 5 by filtering for the countries with ranks greater than or equal to 185, since there were a total of 189 countries and the ranking ran from lowest to highest mortality rate.
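
For reference, the chunk below sketches two ways to pull the five highest ratios. Judging only from the expression quoted above, one possible cause of the problem is the placement of the closing parenthesis: `rank(desc(ratio) >= 5)` ranks the result of a comparison instead of comparing the ranks with 5. In addition, `filter()` keeps rows in their existing order rather than sorting them. This is a reading of the quoted code, not a confirmed diagnosis, and the `toy` table is hypothetical.

```{r}
library(dplyr)

# Hypothetical ratios for seven made-up countries
toy <- tibble(
  Country = c("A", "B", "C", "D", "E", "F", "G"),
  ratio   = c(0.010, 0.190, 0.021, 0.092, 0.075, 0.003, 0.050)
)

# Option 1: compare the rank with 5 (rank 1 = highest ratio), then sort
top5_rank <- toy %>%
  filter(rank(desc(ratio)) <= 5) %>%
  arrange(desc(ratio))

# Option 2: slice_max() selects the top rows and orders them in one step
top5_slice <- toy %>%
  slice_max(ratio, n = 5)

top5_rank
top5_slice
```

Both options avoid hard-coding the number of countries, so they would keep working if the table gained or lost rows.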
Another challenge we had was with our first data source. We were running queries on the numbers in this data set and were not getting the correct results. We then realized that the variables Cases and Deaths were of type <chr> instead of <dbl>. To overcome this issue, we used as.numeric() to convert the data from character to numeric. The original numbers also contained commas in the character strings, so we first had to use the gsub() function, which replaces text that matches a regular expression, to remove the commas before converting the data types.
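
The conversion described above can be sketched on a few made-up strings. The values in `raw_cases` are hypothetical; they only imitate the kind of text that comes out of a scraped HTML table, including an entry that is not a number at all.

```{r}
# Hypothetical raw values like those scraped from an HTML table
raw_cases <- c("1,234,567", "89", "N/A")

# Remove every comma (fixed = TRUE treats "," as a literal string)
no_commas <- gsub(",", "", raw_cases, fixed = TRUE)

# Entries that are still non-numeric become NA; as.numeric() would normally
# warn about this, so the warning is silenced here on purpose
cases_num <- suppressWarnings(as.numeric(no_commas))

cases_num
```

The NA produced for the non-numeric entry is the same kind of value behind the "NAs introduced by coercion" warnings shown when we converted the first data source.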

### Significant Findings and Conclusion
In analyzing the Covid-19 data sets, we were able to draw conclusions based on our findings. Our first step was finding the countries with the most Covid-19 cases. These countries, in descending order, were the United States, India, Brazil, the United Kingdom and Russia. One reason these countries could have the most cases is their large populations. When looking at the countries with the most deaths, the results were similar. The countries with the top 5 Covid-19 death counts are the United States, India, Brazil, Russia and Mexico. The two lists are very similar, with the exception that Mexico replaces the United Kingdom on the top deaths list. To get a more accurate picture of which countries struggle most with losing people to Covid-19, we calculated the mortality rates. The countries with the highest mortality rates are Yemen, Vanuatu, Peru, Mexico and Sudan. One possible reason these countries have high mortality rates is that they may lack the resources or healthcare systems to keep up with the pandemic. Further analysis of the countries' healthcare and economic systems, as well as their standards of living, would be needed to understand why those countries have higher mortality rates. We also calculated the average Covid-19 mortality rate across countries with more than 100 cases. The result was about 2.06%, which is relatively low. With modern medicine, most people survived Covid-19. Additionally, we wanted to graph the countries with the most cases and deaths in 2020 to see trends in the pandemic. We chose the top three countries by cases and deaths because they were heavily impacted by the pandemic. In the cases graph, the number of cases in all three countries increased steadily over the summer. Brazil's cases began to come under control in August and India's cases started to decline in September.
The United States had large spikes in July and November as it struggled to keep the number of cases under control despite enforcing masking requirements. In the deaths graph, the United States had a very large spike in April. This is when the country was on complete lockdown because the hospitals were overcrowded and it was difficult to keep the patients alive. The deaths for India and Brazil steadily increased when the pandemic started. Over the summer, the death counts leveled off as the hospitals were able to manage the patients. At the end of the year, death counts declined. Death counts have continued to decline since the vaccine was released to the public, saving many from Covid-19.
