Coursera - Developing Data Products - Project 2 - Plotly

In this project I use R Markdown and Plotly to create a bar chart of the COVID-19 outbreak using data from April 18, 2020. The data comes from the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository maintained by Johns Hopkins CSSE on GitHub. I examine only confirmed cases and deaths among states within the United States.

Project Prompt: Create a web page presentation using R Markdown that features a plot created with Plotly. Host your webpage on either GitHub Pages, RPubs, or NeoCities. Your webpage must contain the date that you created the document, and it must contain a plot created with Plotly.

Sys.info()
##           sysname           release           version          nodename 
##         "Windows"          "10 x64"     "build 18363" "DESKTOP-DP7KPRO" 
##           machine             login              user    effective_user 
##          "x86-64"           "Derek"           "Derek"           "Derek"
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Downloading and reading in the data set.

url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/04-18-2020.csv"
file <- "04-18-2020.csv"

download.file(url = url, destfile = file, method = "curl")

mapdata <- read.csv(file)

head(mapdata)
##   ï..FIPS    Admin2 Province_State Country_Region         Last_Update      Lat
## 1   45001 Abbeville South Carolina             US 2020-04-18 22:32:47 34.22333
## 2   22001    Acadia      Louisiana             US 2020-04-18 22:32:47 30.29506
## 3   51001  Accomack       Virginia             US 2020-04-18 22:32:47 37.76707
## 4   16001       Ada          Idaho             US 2020-04-18 22:32:47 43.45266
## 5   19001     Adair           Iowa             US 2020-04-18 22:32:47 41.33076
## 6   21001     Adair       Kentucky             US 2020-04-18 22:32:47 37.10460
##        Long_ Confirmed Deaths Recovered Active                  Combined_Key
## 1  -82.46171        15      0         0     15 Abbeville, South Carolina, US
## 2  -92.41420       110      7         0    103         Acadia, Louisiana, US
## 3  -75.63235        33      0         0     33        Accomack, Virginia, US
## 4 -116.24155       593      9         0    584                Ada, Idaho, US
## 5  -94.47106         1      0         0      1               Adair, Iowa, US
## 6  -85.28130        47      3         0     44           Adair, Kentucky, US
str(mapdata)
## 'data.frame':    3053 obs. of  12 variables:
##  $ ï..FIPS       : int  45001 22001 51001 16001 19001 21001 29001 40001 8001 16003 ...
##  $ Admin2        : Factor w/ 1636 levels "","Abbeville",..: 2 3 4 5 6 6 6 6 7 7 ...
##  $ Province_State: Factor w/ 138 levels "","Alabama","Alaska",..: 116 62 129 50 54 60 73 94 20 50 ...
##  $ Country_Region: Factor w/ 185 levels "Afghanistan",..: 177 177 177 177 177 177 177 177 177 177 ...
##  $ Last_Update   : Factor w/ 33 levels "2020-02-23 11:19:02",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ Lat           : num  34.2 30.3 37.8 43.5 41.3 ...
##  $ Long_         : num  -82.5 -92.4 -75.6 -116.2 -94.5 ...
##  $ Confirmed     : int  15 110 33 593 1 47 12 29 860 1 ...
##  $ Deaths        : int  0 7 0 9 0 3 0 3 31 0 ...
##  $ Recovered     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Active        : int  15 103 33 584 1 44 12 26 829 1 ...
##  $ Combined_Key  : Factor w/ 3053 levels "Abbeville, South Carolina, US",..: 1 2 3 4 5 6 7 8 9 10 ...
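
The first column name printed above, `ï..FIPS`, is most likely the file's UTF-8 byte-order mark (BOM) being read as part of the header rather than a real column name. A minimal sketch of one way to avoid it, assuming the file does start with a BOM: pass `fileEncoding = "UTF-8-BOM"` to `read.csv`. The temporary file and its two columns below are made up for illustration and are not the real data set.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (toy data).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("FIPS,Admin2\n45001,Abbeville\n")), tmp)

# "UTF-8-BOM" tells R to strip the mark before parsing the header.
fixed <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(fixed)
```

With the real file, the equivalent call would be `read.csv(file, fileEncoding = "UTF-8-BOM")`; the rest of the analysis does not use the FIPS column, so the results are unaffected either way.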

Now I will filter the data set to exclude all rows that do not pertain to the United States.

mapdata_US <- mapdata[mapdata$Country_Region == "US",]

str(mapdata_US)
## 'data.frame':    2791 obs. of  12 variables:
##  $ ï..FIPS       : int  45001 22001 51001 16001 19001 21001 29001 40001 8001 16003 ...
##  $ Admin2        : Factor w/ 1636 levels "","Abbeville",..: 2 3 4 5 6 6 6 6 7 7 ...
##  $ Province_State: Factor w/ 138 levels "","Alabama","Alaska",..: 116 62 129 50 54 60 73 94 20 50 ...
##  $ Country_Region: Factor w/ 185 levels "Afghanistan",..: 177 177 177 177 177 177 177 177 177 177 ...
##  $ Last_Update   : Factor w/ 33 levels "2020-02-23 11:19:02",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ Lat           : num  34.2 30.3 37.8 43.5 41.3 ...
##  $ Long_         : num  -82.5 -92.4 -75.6 -116.2 -94.5 ...
##  $ Confirmed     : int  15 110 33 593 1 47 12 29 860 1 ...
##  $ Deaths        : int  0 7 0 9 0 3 0 3 31 0 ...
##  $ Recovered     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Active        : int  15 103 33 584 1 44 12 26 829 1 ...
##  $ Combined_Key  : Factor w/ 3053 levels "Abbeville, South Carolina, US",..: 1 2 3 4 5 6 7 8 9 10 ...
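
One property of the bracket filter used above is worth knowing: if the column being compared contains `NA`, the comparison yields `NA` and bracket indexing returns an all-`NA` row for it, whereas `subset()` (or `dplyr::filter()`) drops such rows. The April 18 file may well have no missing `Country_Region` values, so this is only a sketch on made-up data, not a correction to the result.

```r
# Toy data with one missing country (values are invented for illustration).
toy <- data.frame(
  Country_Region = c("US", "Canada", NA, "US"),
  Deaths = c(7, 1, 2, 9)
)

# Bracket indexing keeps a row of NAs where the comparison is NA.
by_bracket <- toy[toy$Country_Region == "US", ]

# subset() treats NA as "not a match" and drops it.
by_subset <- subset(toy, Country_Region == "US")

nrow(by_bracket)
nrow(by_subset)
```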

Now I will build a grouped bar chart of confirmed COVID-19 cases and deaths by state.

State     <- mapdata_US$Province_State
Confirmed <- mapdata_US$Confirmed
Death     <- mapdata_US$Deaths
PlotData  <- data.frame(State, Confirmed, Death)

As it stands, PlotData contains many rows per state, one for each county-level entry. We first need to aggregate these into a single row per state by summing the Confirmed and Death columns. This can be done with dplyr.

PlotData_Clean <- PlotData %>%
                    group_by(State) %>%
                    summarise_all(sum)
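
To make the aggregation concrete, here is a small sketch on made-up numbers (the state names "A" and "B" and their counts are hypothetical). It uses `summarise(across(...))`, which dplyr 1.0.0 and later recommend in place of `summarise_all()`; on this toy input both forms should give the same totals.

```r
library(dplyr)

# Three county-level rows collapsing into two state-level rows (toy data).
toy <- data.frame(
  State = c("A", "A", "B"),
  Confirmed = c(10, 5, 7),
  Death = c(1, 0, 2)
)

per_state <- toy %>%
  group_by(State) %>%
  summarise(across(c(Confirmed, Death), sum))

per_state
```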

summary(PlotData_Clean)
##         State      Confirmed          Death         
##  Alabama   : 1   Min.   :     0   Min.   :    0.00  
##  Alaska    : 1   1st Qu.:  1151   1st Qu.:   26.75  
##  Arizona   : 1   Median :  2812   Median :  128.50  
##  Arkansas  : 1   Mean   : 12624   Mean   :  666.62  
##  California: 1   3rd Qu.: 10536   3rd Qu.:  443.50  
##  Colorado  : 1   Max.   :241712   Max.   :17671.00  
##  (Other)   :52

Examining the State column, we see several entries that are not actually states (cruise ships, territories, and a "Recovered" placeholder) and need to be removed. I will do this now.

remove <- c("Diamond Princess", "Grand Princess", "Guam", "Northern Mariana Islands", "Puerto Rico", "Recovered", "Virgin Islands")

PlotData_Clean <- PlotData_Clean[which(!PlotData_Clean$State %in% remove),]

R still holds on to the unused factor levels in the State variable, even though we no longer use them. Left in place, they would show up as empty categories in the plot. The droplevels function removes them.

PlotData_Clean <- droplevels(PlotData_Clean)
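
The effect of `droplevels` is easy to see on a small factor. This sketch uses three names taken from the data above, but the vector itself is made up for illustration.

```r
# A factor with three levels, one of which we filter out.
state <- factor(c("Alabama", "Guam", "Alaska"))
kept <- state[state != "Guam"]

# The value is gone, but the level is still recorded.
levels(kept)

# droplevels() discards levels that no longer occur.
trimmed <- droplevels(kept)
levels(trimmed)
```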

Now the data is ready for plotting.

plot <- plot_ly(PlotData_Clean, x = ~State, y = ~Confirmed, type = 'bar', name = 'Confirmed Cases')
plot <- plot %>% add_trace(y = ~Death, name = 'Deaths')
plot <- plot %>% layout(yaxis = list(title = 'Count'), barmode = 'group')

plot
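
With around fifty bars in alphabetical order, the largest states can be hard to pick out. As an optional variation, and not part of the original chart, the sketch below sorts the categories by their combined bar height using Plotly's `categoryorder` axis setting. The data frame `toy` and its values are hypothetical; with the real data you would pass `PlotData_Clean` instead.

```r
library(plotly)

# Made-up counts for three placeholder states.
toy <- data.frame(
  State = c("A", "B", "C"),
  Confirmed = c(120, 900, 45),
  Death = c(4, 60, 1)
)

fig <- plot_ly(toy, x = ~State, y = ~Confirmed, type = "bar",
               name = "Confirmed Cases") %>%
  add_trace(y = ~Death, name = "Deaths") %>%
  layout(
    barmode = "group",
    # Order the x categories by the sum of their bars, largest first.
    xaxis = list(categoryorder = "total descending"),
    yaxis = list(title = "Count")
  )

fig
```

Naming the object `fig` rather than `plot` also avoids masking base R's `plot()` function in the session.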