For my final project, I intend to explore worldwide COVID-19 confirmed cases, deaths, and recoveries in order to answer the following questions:
Which countries have the highest number of confirmed COVID-19 cases?
Which countries have the highest number of COVID-19 deaths?
Which countries have the highest number of COVID-19 recoveries?
The data source I intend to use for this project is the Novel Corona Virus 2019 Dataset, which is available on Kaggle (kaggle.com). The dataset was compiled from the Johns Hopkins University COVID-19 Data Repository.
The dataset consists of 6 different CSV files:
covid_19_data
The main dataset that will be used for this project. This file contains data on worldwide COVID-19 cases,
deaths, and recoveries broken down by Country/Region, and by Province/State where applicable.
time_series_covid_19_confirmed
Time series data on the number of confirmed cases worldwide.
The file contains longitude and latitude data that can be used to plot cases on a map.
time_series_covid_19_confirmed_US
Time series data on the number of confirmed cases in the US at the county level.
This file will not be used in this project.
time_series_covid_19_deaths
Time series data on the number of COVID deaths worldwide.
time_series_covid_19_deaths_US
Time series data on the number of COVID deaths in the US at the county level.
This file will not be used in this project.
time_series_covid_19_recovered
Time series data on the number of COVID recoveries worldwide.
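As a rough illustration of how the three questions above could be answered from covid_19_data, the sketch below ranks countries by their latest totals using base R only. It makes several assumptions that should be checked against the real file: the column names (ObservationDate, Province.State, Country.Region, Confirmed, Deaths, Recovered), the date format, and that the counts are cumulative. The rows are made-up placeholder values, not real figures.

```r
# Toy rows standing in for covid_19_data.csv.
# Column names and values are assumptions for illustration only.
covid <- data.frame(
  ObservationDate = rep(c("05/01/2020", "05/02/2020"), each = 4),
  Province.State  = rep(c("P1", "P2", NA, NA), times = 2),
  Country.Region  = rep(c("A", "A", "B", "C"), times = 2),
  Confirmed       = c(10, 20, 50, 40, 12, 25, 60, 45),
  Deaths          = c(1, 2, 5, 3, 1, 3, 6, 4),
  Recovered       = c(4, 8, 20, 30, 5, 10, 25, 35),
  stringsAsFactors = FALSE
)
covid$ObservationDate <- as.Date(covid$ObservationDate, format = "%m/%d/%Y")

# Counts are assumed to be cumulative, so keep only the most recent date.
latest <- covid[covid$ObservationDate == max(covid$ObservationDate), ]

# Sum the Province/State rows up to the country level.
by_country <- aggregate(
  cbind(Confirmed, Deaths, Recovered) ~ Country.Region,
  data = latest,
  FUN = sum
)

# Rank countries for each of the three questions.
top_confirmed <- by_country[order(-by_country$Confirmed), ]
top_deaths    <- by_country[order(-by_country$Deaths), ]
top_recovered <- by_country[order(-by_country$Recovered), ]

print(head(top_confirmed, 10))
```

With the real file, the data frame would instead come from read.csv(), and head() would limit each ranking to the top ten countries.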
To visualize the data, I will use two main static visualizations:
A horizontal bar chart to show a side-by-side comparison of per-country COVID-19 statistics. This will allow the user to gain insight into the data at a glance.
A world map containing per-country COVID-19 data points. The size of each point on the map will reflect the number of COVID-19 cases in the country it is plotted on (the larger the point, the larger the number of cases).
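The two visualizations could be sketched with ggplot2 roughly as follows. This is only one possible approach: the totals data frame, its column names, and its values are hypothetical, and the country outlines rely on the maps package, which is not in this proposal's package list (the sketch falls back to points only if it is not installed).

```r
library(ggplot2)

# Hypothetical per-country totals with one coordinate per country.
# All values are made up for illustration.
totals <- data.frame(
  country   = c("A", "B", "C"),
  confirmed = c(37, 60, 45),
  lat       = c(10, 45, -20),
  long      = c(20, -100, 130),
  stringsAsFactors = FALSE
)

# Horizontal bar chart: reorder() sorts the bars by value,
# coord_flip() turns the vertical bars sideways.
bar_chart <- ggplot(totals, aes(x = reorder(country, confirmed), y = confirmed)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Confirmed cases")

# World map: draw country outlines when the 'maps' package is available.
bubble_map <- ggplot()
if (requireNamespace("maps", quietly = TRUE)) {
  bubble_map <- bubble_map +
    geom_polygon(
      data = map_data("world"),
      aes(x = long, y = lat, group = group),
      fill = "grey90", colour = "grey70"
    )
}

# Point area scales with the number of confirmed cases.
bubble_map <- bubble_map +
  geom_point(
    data = totals,
    aes(x = long, y = lat, size = confirmed),
    colour = "red", alpha = 0.6
  ) +
  scale_size_area(max_size = 12) +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude", size = "Confirmed cases")
```

Printing bar_chart or bubble_map renders the plot. scale_size_area() is used so that a point with twice the cases has twice the area, which tends to be easier to read than scaling the radius.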
The following R packages will be utilized in this project:
tidyr
This package will be used to shape the data into a usable format (e.g., converting data from a wide format to a long format).
stringr
This package will be used to replace NA values with meaningful values.
lubridate
This package will be used to format dates.
ggplot2
This package will be used to create the horizontal bar charts and world maps.
kableExtra
This package will be used to create data tables.
Hmisc
This package will be used to gain insight into the data via its summary tools (for example, describe() and the package's summary() methods).
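To make the roles of tidyr, stringr, and lubridate more concrete, here is a minimal sketch of the reshaping step on a tiny stand-in for one of the time series files. The layout (identifier columns followed by one column per date in month/day/year form) and the "Unknown" placeholder label are assumptions for illustration, and the values are made up.

```r
library(tidyr)
library(stringr)
library(lubridate)

# Tiny stand-in for a wide time series file: one column per date.
# check.names = FALSE keeps names such as "1/22/20" intact.
wide <- data.frame(
  `Province/State` = c(NA, "P1"),
  `Country/Region` = c("A", "B"),
  Lat              = c(10, 45),
  Long             = c(20, -100),
  `1/22/20`        = c(1, 0),
  `1/23/20`        = c(3, 2),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# tidyr: wide to long, one row per location and date.
long <- pivot_longer(
  wide,
  cols      = -c(`Province/State`, `Country/Region`, Lat, Long),
  names_to  = "date",
  values_to = "confirmed"
)

# lubridate: parse month/day/year strings into Date values.
long$date <- mdy(long$date)

# stringr: replace missing Province/State entries with a placeholder label.
long$`Province/State` <- str_replace_na(long$`Province/State`, replacement = "Unknown")

print(long)
```

The long format has one observation per row, which is the shape ggplot2 expects when mapping date and case count to aesthetics.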