Assignment 6: Coronavirus Cases by Country

Introduction

This document explores coronavirus data by country as of May 7, 2020. The dataset, scraped from Worldometers.info, contains per-country counts of cases, deaths, recoveries, critical cases, tests administered, and more.

Packages

Several packages will be critical for our analysis.

Tidyverse: The tidyverse is a collection of R packages that share a common design philosophy, grammar, and data structures, which makes for a more seamless data science workflow.

Dplyr: Dplyr, part of the tidyverse, provides verbs such as filter, group by, and summarise that are comparable to SQL operations and help users manipulate datasets easily.

## -- Attaching packages ---------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Import Data

## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Country.Other = col_character(),
##   TotalCases = col_number(),
##   NewCases = col_character(),
##   TotalDeaths = col_number(),
##   NewDeaths = col_character(),
##   TotalRecovered = col_character(),
##   ActiveCases = col_number(),
##   Serious.Critical = col_number(),
##   TotÂ.Cases.1M.pop = col_number(),
##   Deaths.1M.pop = col_number(),
##   TotalTests = col_number(),
##   Tests.1M.pop = col_number(),
##   Continent = col_character(),
##   Total.Cases.1M.pop = col_number()
## )

Question 1

Question: How does the number of total cases differ by continent? Based on media coverage, I would assume Asia, Europe, and North America have the most cases.

Process: Using the data frame scraped from the Worldometers site, I will group the countries by continent and then sum the number of cases for each.

Data Wrangling: This execution requires using the dplyr package to group by and summarise. I used ggplot to create the data visualization bar graph.
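
The grouping and summing described above might look roughly like the following sketch. The data frame name `covid` and its toy rows are assumptions made for illustration; only the column names `Continent` and `TotalCases` and the output name `No_Cases` come from the output shown in this document.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the scraped data (hypothetical values, not the real dataset).
covid <- data.frame(
  Country.Other = c("A", "B", "C", "D"),
  Continent     = c("Europe", "Europe", "Asia", NA),
  TotalCases    = c(100, 50, 30, 7),
  stringsAsFactors = FALSE
)

# Group countries by continent, then sum the cases within each group.
# Rows with a missing continent form their own <NA> group.
cases_by_continent <- covid %>%
  group_by(Continent) %>%
  summarise(No_Cases = sum(TotalCases, na.rm = TRUE))

# Bar graph of total cases per continent (built here, drawn when printed).
cases_plot <- ggplot(cases_by_continent, aes(x = Continent, y = No_Cases)) +
  geom_col()
```

Whether the original code used `na.rm = TRUE` is not known; it is included here so that a single missing case count does not turn a whole continent's total into `NA`.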

Analysis:

## # A tibble: 7 x 2
##   Continent         No_Cases
##   <chr>                <dbl>
## 1 Africa               55263
## 2 Asia                618504
## 3 Australia/Oceania     8489
## 4 Europe             1553699
## 5 North America      1407916
## 6 South America       266840
## 7 <NA>                   721

Results: As the table and graph demonstrate, Europe (1,553,699) and North America (1,407,916) have the greatest number of cases, followed by Asia (618,504). Australia/Oceania has only 8,489 cases, and Antarctica, which does not appear in the table, has none. A further 721 cases have no continent assigned. However, some believe that China has underreported, so the Asian total could be higher.

Statistical Method: The numbers in the table are raw case counts that can be compared directly to each other or used in other tests. For further study, it would be interesting to see how this changes week to week. For example, Asia would have dominated the case counts in January, when the majority of patients were in China.

Question 2

Question: How many more tests have been administered (per 1M population) in countries with a high quality of life compared to countries with a low quality of life? I would assume there is a large disparity between these two numbers, considering that health and technology are huge influences on quality of life.

Process: I found the top 5 and bottom 5 countries by quality of life from https://www.numbeo.com/quality-of-life/rankings_by_country.jsp. Note that the ranking only has the 80 most populated/industrialized countries. In the first chunk of code, I will filter for the top 5 countries and find their average number of tests per 1M population. Then I will repeat that process with the bottom 5 countries.

Data Wrangling: This execution requires using the dplyr package to filter and summarise.
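
A minimal sketch of the filter-then-average step is shown below. The data frame `covid`, the placeholder country names, and the two-country groups are assumptions for illustration (the actual top 5 and bottom 5 countries are not listed in this document); the column names and the summary names match the output shown here.

```r
library(dplyr)

# Toy stand-in for the scraped data; country names and values are placeholders.
covid <- data.frame(
  Country.Other = c("High1", "High2", "Low1", "Low2", "Other"),
  Tests.1M.pop  = c(30000, 20000, 1000, NA, 5000),
  stringsAsFactors = FALSE
)

# Hypothetical groups; in the real analysis these would be the five
# highest-ranked and five lowest-ranked countries from the Numbeo list.
top_countries    <- c("High1", "High2")
bottom_countries <- c("Low1", "Low2")

# Keep only the countries in each group, then average their tests per 1M.
top_avg <- covid %>%
  filter(Country.Other %in% top_countries) %>%
  summarise(Top_5_Avg_Tests_Per_1M = mean(Tests.1M.pop, na.rm = TRUE))

bottom_avg <- covid %>%
  filter(Country.Other %in% bottom_countries) %>%
  summarise(Bottom_5_Avg_Tests_Per_1M = mean(Tests.1M.pop, na.rm = TRUE))
```

With `na.rm = TRUE`, a country that has not reported its test count is left out of the average rather than making the result `NA`; whether the original code did this is an assumption.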

Analysis:

## # A tibble: 1 x 1
##   Top_5_Avg_Tests_Per_1M
##                    <dbl>
## 1                 29160.
## # A tibble: 1 x 1
##   Bottom_5_Avg_Tests_Per_1M
##                       <dbl>
## 1                     1828.

Results: The countries with a high quality of life averaged 29,160 tests per 1M population. The countries with a low quality of life averaged 1,828 tests per 1M population. Therefore, the high quality of life countries administered roughly 16 times as many tests per capita. However, there are many confounding variables, like the countries’ locations (the pattern of coronavirus spreading) or the total number of cases per country. Even with those considerations, the result is consistent with the idea that countries with a high quality of life have better access to healthcare and technology.

Statistical Method: I could use a two-sample t-test to see if there is a statistically significant difference between the two means, keeping in mind that each group contains only five countries.
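
As a rough illustration of that test, base R's `t.test()` compares two group means directly. The two vectors below are made-up tests-per-1M values, not the figures for the actual countries.

```r
# Hypothetical tests-per-1M values for five countries in each group.
top_tests    <- c(30000, 25000, 35000, 28000, 32000)
bottom_tests <- c(1000, 2000, 1500, 2500, 3000)

# t.test() runs Welch's two-sample t-test by default (var.equal = FALSE),
# which does not assume the two groups have equal variances.
result <- t.test(top_tests, bottom_tests)

p_value <- result$p.value
```

A small `p_value` would suggest the difference between the group means is unlikely to be due to chance alone, though with five countries per group the test has limited power.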

Question 3

Question: What is the average mortality rate? How do the mortality rates of the countries with the most media attention, like China, Italy, and the USA, compare?

Process: I found the average mortality rate of the entire world by dividing the total number of deaths by the total number of cases. Then I added a column to the data table with each country's mortality rate (deaths divided by cases, as a percentage) and filtered the data to report only the three countries of interest.

Data Wrangling: This execution requires using the dplyr package to filter.
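
The calculation described above can be sketched as follows. The data frame `covid` and its values are invented for illustration; the column names `TotalCases`, `TotalDeaths`, `Country.Other`, and `Mortality_Rate` are the ones that appear in this document's output.

```r
library(dplyr)

# Toy stand-in for the scraped data (hypothetical values).
covid <- data.frame(
  Country.Other = c("USA", "Italy", "China", "Other"),
  TotalCases    = c(1000, 500, 800, 200),
  TotalDeaths   = c(60, 70, 40, NA),
  stringsAsFactors = FALSE
)

# Worldwide rate: total deaths divided by total cases, as a percentage.
world_rate <- 100 * sum(covid$TotalDeaths, na.rm = TRUE) /
  sum(covid$TotalCases, na.rm = TRUE)

# Per-country rate, reported only for the three countries of interest.
rates <- covid %>%
  mutate(Mortality_Rate = 100 * TotalDeaths / TotalCases) %>%
  filter(Country.Other %in% c("USA", "Italy", "China")) %>%
  select(Country.Other, Mortality_Rate)
```

Note that the worldwide figure here is a ratio of totals, which weights large outbreaks more heavily than a simple average of the per-country rates would.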

Analysis:

## # A tibble: 3 x 2
##   Country.Other Mortality_Rate
##   <chr>                  <dbl>
## 1 USA                     5.96
## 2 Italy                  13.9 
## 3 China                   5.59

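Results: The average mortality rate is ~7%, which I am a tad suspicious of (it seems too high). One likely reason is that the denominator counts only confirmed cases, so mild or untested infections are missed and the rate is overstated. The table shows that the USA and China each have about a 6% mortality rate, compared to 14% in Italy, whose healthcare system was greatly overwhelmed.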

Statistical Method: The results are pretty straightforward and can be compared to each other in their current state. If I wanted to compare the mean mortality rates of several groups of countries, I would use an ANOVA.

Question 4

Question: The data set shows the number of cases, deaths, and tests per 1M people in the country. Is there a correlation between the number of deaths and tests per 1M people? Is there a difference in the correlation if the country has an above versus below average number of cases? If there is a strong correlation, which countries are the outliers?

Process: First, I found the average number of cases per 1M people across countries to be about 961. Then, I created a column in the data set to categorize each country as above average or below average. This demarcation is seen in the colors on the graph. Each country is represented as a dot, with tests per 1M on the y-axis and deaths per 1M on the x-axis.

Data Wrangling: The only data wrangling needed was adding the above/below-average column described in the process.
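
One way the averaging, categorizing, and plotting steps could be written is sketched below. The data frame `covid`, the new column name `Case_Level`, and the toy values are assumptions; the choice of `Total.Cases.1M.pop` as the cases column is also a guess, since the column specification lists two similarly named columns.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the scraped data (hypothetical values).
covid <- data.frame(
  Country.Other      = c("A", "B", "C", "D"),
  Total.Cases.1M.pop = c(100, 2000, 500, NA),
  Deaths.1M.pop      = c(2, 150, 10, NA),
  Tests.1M.pop       = c(1000, 40000, 8000, 500),
  stringsAsFactors = FALSE
)

# Average cases per 1M across countries, ignoring missing values.
avg_cases <- mean(covid$Total.Cases.1M.pop, na.rm = TRUE)

# Label each country as above or below that average.
# Countries with a missing value stay NA.
covid <- covid %>%
  mutate(Case_Level = if_else(Total.Cases.1M.pop > avg_cases,
                              "Above average", "Below average"))

# Scatter plot: deaths per 1M on the x-axis, tests per 1M on the y-axis,
# colored by the above/below-average label.
scatter <- ggplot(covid, aes(x = Deaths.1M.pop, y = Tests.1M.pop,
                             colour = Case_Level)) +
  geom_point()
```

When such a plot is drawn, ggplot2 drops rows with missing x or y values and reports how many were removed, which is the kind of warning shown in the analysis output.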

Analysis:

## [1] 961.3146
## Warning: Removed 61 rows containing missing values (geom_point).

Results: The average across countries is about 961 cases per 1M population. As expected, there is a large concentration of countries with only a few tests, deaths, and cases. The outliers with very high tests per 1M are countries with tiny populations like the Faeroe Islands and Iceland. Likewise, San Marino is the outlier with a high number of deaths per 1M.

Statistical Method: A correlation test can find the correlation coefficient of the distribution, and a linear regression can describe how tests per 1M change with deaths per 1M.
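
A small sketch of both approaches in base R follows, using invented per-1M values rather than the real data. It also shows how the two are related: in a regression with a single predictor, R-squared equals the square of the Pearson correlation coefficient.

```r
# Hypothetical deaths-per-1M and tests-per-1M values for six countries.
deaths_per_1m <- c(2, 150, 10, 40, 300, 5)
tests_per_1m  <- c(1000, 40000, 8000, 15000, 30000, 2000)

# Pearson correlation coefficient with a significance test.
pearson <- cor.test(deaths_per_1m, tests_per_1m)
r <- unname(pearson$estimate)

# Simple linear regression of tests on deaths.
fit <- lm(tests_per_1m ~ deaths_per_1m)
r_squared <- summary(fit)$r.squared
```

Running the same two calls separately on the above-average and below-average groups would address whether the correlation differs between them.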

Question 5

Question: Which countries have the largest share of all cases worldwide?

Process: I found the total number of cases worldwide and created a new column dividing each country’s cases by the worldwide cases. Then I kept the 10 countries with the most cases and arranged them by case count.

Data Wrangling: This execution requires using the dplyr package to select and arrange.
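
The share calculation might be written along these lines. The data frame `covid` and its four toy rows are assumptions, and the sketch keeps the top 3 rather than the top 10 to stay short; the column names match the output in this document.

```r
library(dplyr)

# Toy stand-in for the scraped data (hypothetical values).
covid <- data.frame(
  Country.Other = c("A", "B", "C", "D"),
  TotalCases    = c(500, 300, 150, 50),
  stringsAsFactors = FALSE
)

# Worldwide total across all countries.
world_total <- sum(covid$TotalCases, na.rm = TRUE)

# Each country's share of the worldwide total, as a percentage,
# sorted from most to fewest cases and cut to the top rows.
top_shares <- covid %>%
  mutate(Percent_of_Total_Cases = 100 * TotalCases / world_total) %>%
  select(Country.Other, TotalCases, Percent_of_Total_Cases) %>%
  arrange(desc(TotalCases)) %>%
  head(3)
```

This sketch sorts in descending order so the largest share comes first; dropping `desc()` after selecting the top rows would give an ascending table instead.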

Analysis:

## # A tibble: 10 x 3
##    Country.Other TotalCases Percent_of_Total_Cases
##    <chr>              <dbl>                  <dbl>
##  1 Iran              103135                   2.64
##  2 Turkey            133721                   3.42
##  3 Brazil            135106                   3.45
##  4 Germany           169430                   4.33
##  5 France            174791                   4.47
##  6 Russia            177160                   4.53
##  7 UK                206715                   5.28
##  8 Italy             215858                   5.52
##  9 Spain             256855                   6.57
## 10 USA              1291222                  33.0

Results: The table is sorted in ascending order, so the United States, in the last row, tops the list with 33% of all cases worldwide. This is not surprising considering the huge population of the USA relative to the smaller European nations. Spain, Italy, and the UK each have between 5% and 7% of cases. Surprisingly, China did not make the top ten.

Statistical Method: The results are pretty straightforward and can be compared to each other in their current state. Further study could group the countries by location, quality of life, or population and compare means (ANOVA), or examine how the percentages change over time.