Covid-19 Data Analysis

1. Objective

The objective of this project is to analyze the Covid-19 data available to us. The data contains counts of detected, confirmed, and cured patients for countries all over the world. To achieve this we used the following R packages:

*   tidyverse
*   lubridate
*   ggplot2
*   tidyr
*   dplyr

The analysis is divided into a few sections: Preliminary analysis, Detailed analysis, Outcomes, and Results and Discussions.

# List of packages loaded
library(tidyverse)

## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(tidyr)
library(dplyr)

2. Preliminary analysis

Here we load the dataset to be analyzed. At this preliminary stage we have only the raw data, which is split into three categories: 1. Confirmed 2. Deaths 3. Recovered. Throughout this project the corresponding data sets are named rdata_confirmed, rdata_deaths, and rdata_recovered.

# The three global time-series files (confirmed, deaths, recovered)
filenames <- c('time_series_covid19_confirmed_global.csv',
               'time_series_covid19_deaths_global.csv',
               'time_series_covid19_recovered_global.csv')
url.path <- paste0('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/',
                   'master/csse_covid_19_data/csse_covid_19_time_series/')
# Download one file into the working directory
download <- function(filename) {
  # url.path already ends in "/", so paste0() avoids a doubled slash
  url <- paste0(url.path, filename)
  dest <- file.path('.', filename)
  download.file(url, dest)
}
bin <- lapply(filenames, download)

## load data into R
rdata_confirmed <- read.csv('time_series_covid19_confirmed_global.csv')
rdata_deaths <- read.csv('time_series_covid19_deaths_global.csv')
rdata_recovered <- read.csv('time_series_covid19_recovered_global.csv')
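
One detail worth checking after loading: `read.csv` rewrites column names that are not valid R names, so the date headers in these files (for example `1/22/20`) arrive as `X1.22.20`, and `Country/Region` becomes `Country.Region`. The sketch below reproduces this on a small inline CSV. The names `csv_text`, `toy`, and `toy_dates` are illustrative, and the single data row is made up.

```r
# Tiny stand-in for the downloaded files (made-up row, same header style)
csv_text <- "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,India,20.6,78.9,0,1"

toy <- read.csv(text = csv_text)

# Invalid names are repaired: "/" becomes "." and a leading digit gets an "X"
print(names(toy))

# Dropping the "X" lets the header text be parsed back into a Date
toy_dates <- as.Date(sub("X", "", names(toy)[5:6]), "%m.%d.%y")
print(toy_dates)
```

This is why the later cleaning step strips an `X` from the `date` column before converting it.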

3. Detailed analysis

The detailed analysis summarizes cases by country: for each country it gives the number of confirmed cases, deaths, and recovered cases. The cumulative number of confirmed cases is also calculated per country.

The analysis also covers India, whose data is compared with neighbouring countries such as Pakistan, Afghanistan, Bangladesh, and Nepal.

# Confirmed Cases
rdata_confirmed

# Death cases
rdata_deaths

# Recovered Cases
rdata_recovered

Let's clean the data to create country-level and global combined data.

# DATA CLEANING: To create country level and global combined data
# Convert each data set from wide to long AND aggregate at country level
options(dplyr.summarise.inform=FALSE)

confirmed.case <- rdata_confirmed %>% gather(key="date", value="confirmed", -c(Country.Region, Province.State, Lat, Long)) %>% group_by(Country.Region, date) %>% summarize(confirmed=sum(confirmed))

deaths.case <- rdata_deaths %>% gather(key="date", value="deaths", -c(Country.Region, Province.State, Lat, Long)) %>% group_by(Country.Region, date) %>% summarize(deaths=sum(deaths))

recovered.case <- rdata_recovered %>% gather(key="date", value="recovered", -c(Country.Region, Province.State, Lat, Long)) %>% group_by(Country.Region, date) %>% summarize(recovered=sum(recovered))
summary(confirmed.case)

##  Country.Region         date             confirmed       
##  Length:208638      Length:208638      Min.   :       0  
##  Class :character   Class :character   1st Qu.:    2623  
##  Mode  :character   Mode  :character   Median :   41796  
##                                        Mean   : 1184737  
##                                        3rd Qu.:  411046  
##                                        Max.   :98538245
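
`gather()` still works, but tidyr's documentation marks it as superseded by `pivot_longer()`. As a minimal sketch of the same wide-to-long reshaping and country-level aggregation, here it is on a made-up three-row table. `toy_wide` and `toy_long` are hypothetical names, and the column layout is assumed to match the loaded files.

```r
library(dplyr)
library(tidyr)

# Made-up wide table: one row per province, one column per date
toy_wide <- data.frame(
  Province.State = c("A", "B", ""),
  Country.Region = c("Xland", "Xland", "Yland"),
  Lat = c(0, 1, 2),
  Long = c(0, 1, 2),
  X1.22.20 = c(1, 2, 5),
  X1.23.20 = c(3, 4, 6)
)

toy_long <- toy_wide %>%
  # Every column except the identifiers is a date column
  pivot_longer(cols = -c(Province.State, Country.Region, Lat, Long),
               names_to = "date", values_to = "confirmed") %>%
  # Sum the provinces of each country for each date
  group_by(Country.Region, date) %>%
  summarize(confirmed = sum(confirmed), .groups = "drop")

print(toy_long)
```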

# Final data: combine all three data sets
country <- full_join(confirmed.case,deaths.case) %>% full_join(recovered.case)

## Joining, by = c("Country.Region", "date")
## Joining, by = c("Country.Region", "date")
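
The "Joining, by" messages appear because `full_join` had to guess the key columns. Passing `by` explicitly documents the keys and silences the message. A small sketch on made-up tables (`a`, `b`, and `joined` are illustrative names):

```r
library(dplyr)

a <- data.frame(Country.Region = c("Xland", "Yland"),
                date = c("X1.22.20", "X1.22.20"),
                confirmed = c(3, 5))
b <- data.frame(Country.Region = c("Xland", "Zland"),
                date = c("X1.22.20", "X1.22.20"),
                deaths = c(1, 2))

# Keys are stated explicitly; rows present in only one table get NA elsewhere
joined <- full_join(a, b, by = c("Country.Region", "date"))
print(joined)
```

A full join keeps unmatched rows from both sides, which is one way `NA` values can enter the combined table.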

The numbers of confirmed, death, and recovered cases by day are presented below, grouped by region. The leading X in the date column is removed before the column is converted to a date.

# Date variable
# Fix date variable and convert from character to date
#str(country) # check date character
country$date <- country$date %>% sub("X", "", .) %>% as.Date("%m.%d.%y")
#str(country) # check date Date
# Create new variable: number of days
country <- country %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(confirmed), days = date - first(date) + 1)

## Warning in mask$eval_all_mutate(quo): integer overflow in 'cumsum'; use
## 'cumsum(as.numeric(.))'
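
This warning arises because the counts are stored as R integers, which top out at `.Machine$integer.max` (2147483647). Once a running sum passes that limit, `cumsum` returns `NA` from that point on. A small sketch of the effect, and of the fix the warning itself suggests, using made-up values:

```r
# Two integer values whose sum exceeds the integer range
x <- c(2000000000L, 2000000000L)

# Integer cumsum overflows: the second element becomes NA (with a warning)
overflowed <- suppressWarnings(cumsum(x))
print(overflowed)

# Converting to double first avoids the overflow
safe <- cumsum(as.numeric(x))
print(safe)
```

In the pipeline above, the equivalent change would be `cumsum(as.numeric(confirmed))`.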

country
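
One thing to watch in the step above: `summarize` appears to leave the rows ordered by `date` while it is still a character column (so `X1.1.21` sorts before `X1.22.20`), and `first(date)` and `cumsum` then follow that order rather than calendar order. A hedged sketch of sorting by the converted date before computing per-country values, on made-up data with the hypothetical names `toy` and `toy_sorted`:

```r
library(dplyr)

# Made-up rows, deliberately in character sort order of the raw labels
toy <- data.frame(
  Country.Region = "Xland",
  date = c("X1.1.21", "X1.22.20", "X1.23.20"),
  confirmed = c(30, 1, 2)
)
toy$date <- as.Date(sub("X", "", toy$date), "%m.%d.%y")

toy_sorted <- toy %>%
  group_by(Country.Region) %>%
  # Put each country's rows in calendar order first
  arrange(date, .by_group = TRUE) %>%
  # Day 1 is now the earliest date in the group
  mutate(days = as.numeric(date - first(date)) + 1) %>%
  ungroup()

print(toy_sorted)
```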

Worldwide cases by date. The counts are aggregated per day, so day-to-day increases or decreases are visible.

# Aggregate at world level
world <- country %>% group_by(date) %>% summarize(confirmed=sum(confirmed), cumconfirmed=sum(cumconfirmed), deaths=sum(deaths), recovered=sum(recovered)) %>% mutate(days = date - first(date) + 1)
# Display the world-level data
world

Number of cases in India only, by date.

india <- country %>% filter(Country.Region=="India") %>% arrange(date)
india

Worldwide summary of cases for all countries: confirmed, cumulative confirmed, deaths, and recovered.

# SUMMARY STATISTICS
summary(country)

##  Country.Region          date              confirmed            deaths       
##  Length:208638      Min.   :2020-01-22   Min.   :       0   Min.   :      0  
##  Class :character   1st Qu.:2020-10-07   1st Qu.:    2623   1st Qu.:     35  
##  Mode  :character   Median :2021-06-23   Median :   41796   Median :    672  
##                     Mean   :2021-06-23   Mean   : 1184737   Mean   :  17781  
##                     3rd Qu.:2022-03-10   3rd Qu.:  411046   3rd Qu.:   6418  
##                     Max.   :2022-11-24   Max.   :98538245   Max.   :1079052  
##                                                                              
##    recovered         cumconfirmed           days         
##  Min.   :      -1   Min.   :0.000e+00   Length:208638    
##  1st Qu.:       0   1st Qu.:4.953e+06   Class :difftime  
##  Median :       0   Median :3.866e+07   Mode  :numeric   
##  Mean   :  112594   Mean   :2.137e+08                    
##  3rd Qu.:    6080   3rd Qu.:2.275e+08                    
##  Max.   :30974748   Max.   :2.147e+09                    
##                     NA's   :12097

#by(country$confirmed, country$Country.Region, summary)
#by(country$cumconfirmed, country$Country.Region, summary)
#by(country$deaths, country$Country.Region, summary)
#by(country$recovered, country$Country.Region, summary)

World summary

summary(world)

##       date              confirmed          cumconfirmed           deaths       
##  Min.   :2020-01-22   Min.   :      557   Min.   :8.436e+07   Min.   :     17  
##  1st Qu.:2020-10-07   1st Qu.: 36306535   1st Qu.:2.916e+09   1st Qu.:1124106  
##  Median :2021-06-23   Median :180300522   Median :5.824e+09   Median :3925632  
##  Mean   :2021-06-23   Mean   :238132221   Mean   :5.418e+09   Mean   :3573964  
##  3rd Qu.:2022-03-09   3rd Qu.:453179623   3rd Qu.:7.830e+09   3rd Qu.:6058397  
##  Max.   :2022-11-24   Max.   :640360567   Max.   :1.000e+10   Max.   :6627835  
##                                           NA's   :982                          
##    recovered             days         
##  Min.   :       -1   Length:1038      
##  1st Qu.:        0   Class :difftime  
##  Median :    49698   Mode  :numeric   
##  Mean   : 22631460                    
##  3rd Qu.: 36086810                    
##  Max.   :130899061                    
##

India Summary

summary(india)

##  Country.Region          date              confirmed            deaths      
##  Length:1038        Min.   :2020-01-22   Min.   :       0   Min.   :     0  
##  Class :character   1st Qu.:2020-10-07   1st Qu.: 6853279   1st Qu.:105767  
##  Mode  :character   Median :2021-06-23   Median :30108612   Median :392646  
##                     Mean   :2021-06-23   Mean   :23544833   Mean   :297877  
##                     3rd Qu.:2022-03-09   3rd Qu.:42983212   3rd Qu.:515650  
##                     Max.   :2022-11-24   Max.   :44672048   Max.   :530604  
##                                                                             
##    recovered         cumconfirmed           days         
##  Min.   :       0   Min.   :1.031e+07   Length:1038      
##  1st Qu.:       0   1st Qu.:5.710e+08   Class :difftime  
##  Median :       3   Median :9.788e+08   Mode  :numeric   
##  Mean   : 4681491   Mean   :1.027e+09                    
##  3rd Qu.: 8371479   3rd Qu.:1.478e+09                    
##  Max.   :30974748   Max.   :2.137e+09                    
##                     NA's   :943

Graphical representation of cases over time using bar charts. These are comparative reports of confirmed cases.

ggplot(world, aes(x=date, y=confirmed)) + geom_bar(stat="identity", width=0.1) +
  theme_classic() +
  labs(title = "Covid-19 Global Confirmed Cases", x= "Date", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5))
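
The JHU CSSE time series report running totals per day, so, assuming `confirmed` holds such cumulative counts, day-over-day new cases can be derived by differencing. A minimal sketch using the first four world values that appear in the `str(world)` output in this report. `toy_world`, `toy_daily`, and `new_confirmed` are illustrative names.

```r
library(dplyr)

toy_world <- data.frame(
  date = as.Date("2020-01-22") + 0:3,
  confirmed = c(557, 657, 944, 1437)
)

toy_daily <- toy_world %>%
  arrange(date) %>%
  # New cases = today's running total minus yesterday's (0 before day one)
  mutate(new_confirmed = confirmed - lag(confirmed, default = 0))

print(toy_daily)
```

Plotting `new_confirmed` instead of `confirmed` would be one way to show per-day counts.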

Confirmed cases in India, shown as a bar chart.

# India confirmed
ggplot(india, aes(x=date, y=confirmed)) + geom_bar(stat="identity", width=0.1) +
  theme_classic() +
  labs(title = "Covid-19 Confirmed Cases in India", x= "Date", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.4))

Graphical representation of cases: world confirmed, deaths and recovered.

#str(world)
world %>% gather("Type", "Cases", -c(date, days)) %>%
ggplot(aes(x=date, y=Cases, colour=Type)) + geom_bar(stat="identity", width=0.2, fill="white") +
  theme_classic() +
  labs(title = "Covid-19 Global Cases", x= "Date", y= "Daily cases") +
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 982 rows containing missing values (position_stack).

# Line graph of cases over time
# World confirmed
ggplot(world, aes(x=days, y=confirmed)) + geom_line() +
  theme_classic() +
  labs(title = "Covid-19 Global Confirmed Cases", x= "Days", y= "Daily confirmed cases") +
  theme(plot.title = element_text(hjust = 0.5))

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

# Ignore warning
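
The message comes from `days` being a `difftime` (the result of subtracting two dates). If a plain numeric axis is preferred, one option is to convert it first. A small sketch with made-up dates:

```r
d <- as.Date(c("2020-01-22", "2020-01-23", "2020-01-24"))

# Subtracting dates yields a difftime, as in the `days` column
days <- d - d[1] + 1
print(class(days))

# Convert to an ordinary numeric vector, stating the unit explicitly
days_num <- as.numeric(days, units = "days")
print(days_num)
```

Mapping `x = as.numeric(days)` inside `aes()` would be one way to apply this to the plots here.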

# World confirmed with counts in log10 scale
ggplot(world, aes(x=days, y=confirmed)) + geom_line() +
  theme_classic() +
  labs(title = "Covid-19 Global Confirmed Cases", x= "Days", y= "Daily confirmed cases  (log scale)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(trans="log10")

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

# World confirmed, deaths and recovered
str(world)

## tibble [1,038 × 6] (S3: tbl_df/tbl/data.frame)
##  $ date        : Date[1:1038], format: "2020-01-22" "2020-01-23" ...
##  $ confirmed   : int [1:1038] 557 657 944 1437 2120 2929 5580 6169 8237 9927 ...
##  $ cumconfirmed: num [1:1038] 5.82e+09 6.27e+09 6.72e+09 7.18e+09 7.64e+09 ...
##  $ deaths      : int [1:1038] 17 18 26 42 56 82 131 133 172 214 ...
##  $ recovered   : int [1:1038] 30 32 39 42 56 65 108 127 145 225 ...
##  $ days        : 'difftime' num [1:1038] 1 2 3 4 ...
##   ..- attr(*, "units")= chr "days"

world %>% gather("Type", "Cases", -c(date, days)) %>%
ggplot(aes(x=days, y=Cases, colour=Type)) + geom_line() +
  theme_classic() +
  labs(title = "Covid-19 Global Cases", x= "Days", y= "Daily cases") +
  theme(plot.title = element_text(hjust = 0.5))

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

## Warning: Removed 299 row(s) containing missing values (geom_path).

# Confirmed by country for select countries with counts in log10 scale
countryselection <- country %>% filter(Country.Region==c("India", "Pakistan", "China", "Bangladesh", "Nepal", "Srilanka"))
ggplot(countryselection, aes(x=days, y=confirmed, colour=Country.Region)) + geom_line(size=1) +
  theme_classic() +
  labs(title = "Covid-19 Confirmed Cases by Country", x= "Days", y= "Daily confirmed cases (log scale)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(trans="log10")

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

## Warning: Transformation introduced infinite values in continuous y-axis
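
The "infinite values" warning is what a log scale produces for zero counts, since `log10(0)` is `-Inf`. One possible way to avoid it, sketched on made-up data, is to keep only positive counts before plotting (`toy` and `toy_pos` are illustrative names):

```r
library(dplyr)

toy <- data.frame(days = 1:4, confirmed = c(0, 0, 3, 10))

# Zero counts have no finite logarithm
print(log10(toy$confirmed))

# Keep only rows that can be drawn on a log10 axis
toy_pos <- toy %>% filter(confirmed > 0)
print(toy_pos)
```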

# Matrix of line graphs of confirmed, deaths and recovered for select countries in log10 scale
str(countryselection)

## grouped_df [865 × 7] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ Country.Region: chr [1:865] "Bangladesh" "Bangladesh" "Bangladesh" "Bangladesh" ...
##  $ date          : Date[1:865], format: "2022-01-10" "2022-01-13" ...
##  $ confirmed     : int [1:865] 1595931 1604664 1617711 1642294 1664616 1685136 1715997 1747331 1773149 0 ...
##  $ deaths        : int [1:865] 28105 28123 28144 28176 28192 28223 28256 28288 28329 0 ...
##  $ recovered     : int [1:865] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cumconfirmed  : int [1:865] 4218793 10595383 17013896 23498363 29977769 34399351 38879512 43424727 48029007 52452268 ...
##  $ days          : 'difftime' num [1:865] 375 378 381 384 ...
##   ..- attr(*, "units")= chr "days"
##  - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Country.Region: chr [1:5] "Bangladesh" "China" "India" "Nepal" ...
##   ..$ .rows         : list<int> [1:5] 
##   .. ..$ : int [1:173] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:173] 174 175 176 177 178 179 180 181 182 183 ...
##   .. ..$ : int [1:173] 347 348 349 350 351 352 353 354 355 356 ...
##   .. ..$ : int [1:173] 520 521 522 523 524 525 526 527 528 529 ...
##   .. ..$ : int [1:173] 693 694 695 696 697 698 699 700 701 702 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
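
The structure above shows 173 rows per country rather than the 1,038 dates in the data. That is consistent with `==` recycling the six-element vector along the column, so each name is only compared against every sixth row. The groups attribute also lists only five countries, which suggests the spelling "Srilanka" matched no rows. The sketch below contrasts `==` with `%in%` on made-up data. `toy`, `wanted`, `by_equals`, and `by_in` are hypothetical names.

```r
library(dplyr)

toy <- data.frame(
  Country.Region = rep(c("India", "Nepal", "Italy"), each = 4),
  confirmed = 1:12
)
wanted <- c("India", "Nepal")

# `==` recycles `wanted` as India, Nepal, India, Nepal, ... down the rows,
# so only rows that happen to line up are kept
by_equals <- toy %>% filter(Country.Region == wanted)

# `%in%` tests each row against the whole set
by_in <- toy %>% filter(Country.Region %in% wanted)

print(nrow(by_equals))
print(nrow(by_in))
```

With `%in%`, every date for each selected country is retained.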

countryselection %>% gather("Type", "Cases", -c(date, days, Country.Region)) %>%
ggplot(aes(x=days, y=Cases, colour=Country.Region)) + geom_line(size=1) +
  theme_classic() +
  labs(title = "Covid-19 Cases by Country", x= "Days", y= "Daily cases (log scale)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(trans="log10") +
  facet_grid(rows=vars(Type))

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

## Warning: Transformation introduced infinite values in continuous y-axis

countrytotal <- country %>% group_by(Country.Region) %>% summarize(cumconfirmed=sum(confirmed), cumdeaths=sum(deaths), cumrecovered=sum(recovered))

countrytotal
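
If `confirmed`, `deaths`, and `recovered` are running totals (as the JHU CSSE time series are), summing them over all dates counts early cases many times over. An alternative sketch, not the method used above, takes each country's value on its latest date. The data and the names `toy` and `latest` are made up.

```r
library(dplyr)

toy <- data.frame(
  Country.Region = rep(c("Xland", "Yland"), each = 3),
  date = rep(as.Date("2020-01-22") + 0:2, times = 2),
  confirmed = c(1, 3, 6, 2, 2, 5)
)

latest <- toy %>%
  group_by(Country.Region) %>%
  # Keep the row with the most recent date in each country
  slice_max(date, n = 1) %>%
  ungroup()

print(latest)
```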

4. Outcomes

The analysis gives us a clear understanding of the growth rate of cases all over the world.

Country-specific data shows which countries have more:
- Confirmed cases and cumulative confirmed cases
- Recovered cases
- Death cases

The data shows the daily growth rate by country:
- Case status in India
- Case growth rate in India
- Case growth rate compared with neighbouring countries
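
The report does not show a growth-rate calculation explicitly. One common definition, offered here only as a sketch under that assumption, is the relative day-over-day change in the cumulative count. `toy`, `toy_growth`, and `growth_rate` are illustrative names and the numbers are made up.

```r
library(dplyr)

toy <- data.frame(
  date = as.Date("2020-01-22") + 0:3,
  confirmed = c(100, 110, 121, 121)
)

toy_growth <- toy %>%
  arrange(date) %>%
  # (today - yesterday) / yesterday; the first day has no previous value
  mutate(growth_rate = (confirmed - lag(confirmed)) / lag(confirmed))

print(toy_growth)
```

Applied per country (after `group_by(Country.Region)`), this would give comparable rates for India and its neighbours.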

5. Results and Discussions

A large dataset on people suffering from coronavirus gives us better ways of fighting the pandemic. The data helps us think and prepare in a better way by identifying:

- Countries where the virus is spreading quickly.
- Where to stop the inflow and outflow of people.
- Countries where authorities can take necessary actions such as:
  - Asking people to wear a face mask in public indoor spaces.
  - Asking people to maintain at least six feet of distance from others.
  - Asking people to avoid large gatherings.
  - Asking people to socialize outdoors.
  - Providing vaccination and asking people to get vaccinated and boosted as soon as they are eligible.

2022MCS120007 and Swaranchi Parida

10/29/2022