1. Objective
The objective of this project is to analyze the Covid-19 data available to us. The data contains counts of confirmed cases, deaths and recovered (cured) patients for countries all over the world. To achieve this we have used the R packages below:
* tidyverse
* lubridate
* ggplot2
* tidyr
* dplyr
The analysis is divided into a few sections: Preliminary analysis, Detailed analysis, Outcomes, and Results and Discussions.
# List of packages loaded
library(tidyverse)
## Warning in system("timedatectl", intern = TRUE): running command 'timedatectl'
## had status 1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(tidyr)
library(dplyr)
2. Preliminary analysis
Here we load the data sets that will be analyzed. For the preliminary analysis we have all the raw data, which is split into three categories: 1. Confirmed 2. Deaths 3. Recovered. Throughout this project the data sets are named rdata_confirmed, rdata_deaths and rdata_recovered.
filenames <- c('time_series_covid19_confirmed_global.csv',
'time_series_covid19_deaths_global.csv',
'time_series_covid19_recovered_global.csv')
url.path <- paste0('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/',
'master/csse_covid_19_data/csse_covid_19_time_series/')
download <- function(filename) {
  # url.path already ends in "/", so join with paste0() to avoid a doubled slash
  url <- paste0(url.path, filename)
  dest <- file.path('.', filename)
  download.file(url, dest)
}
bin <- lapply(filenames, download)
## load data into R
rdata_confirmed <- read.csv('time_series_covid19_confirmed_global.csv')
rdata_deaths <- read.csv('time_series_covid19_deaths_global.csv')
rdata_recovered <- read.csv('time_series_covid19_recovered_global.csv')
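As a small aside on the download step, the sketch below shows how a base URL that already ends in a slash can be joined to a file name, and wraps the download in a hypothetical helper, download_if_missing (an assumption, not part of the original code), that skips files already on disk. The helper is only defined here, not called, so nothing is fetched.

```r
url.path <- paste0('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/',
                   'master/csse_covid_19_data/csse_covid_19_time_series/')
filename <- 'time_series_covid19_confirmed_global.csv'

# paste0() joins the pieces exactly as given, so the single trailing "/" is kept.
url_joined <- paste0(url.path, filename)
print(url_joined)

# file.path() inserts its own "/" separator between the pieces,
# which suits a directory name that has no trailing slash.
local_path <- file.path('data', filename)
print(local_path)

# Hypothetical helper (an assumption, not from the original code):
# download a file only when it is not already present locally.
download_if_missing <- function(filename, dest_dir = '.') {
  dest <- file.path(dest_dir, filename)
  if (!file.exists(dest)) {
    download.file(paste0(url.path, filename), dest)
  }
  invisible(dest)
}
```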
3. Detailed analysis
The detailed analysis summarizes cases at country level: for each country, the number of confirmed cases, deaths and recovered cases. A cumulative count of confirmed cases is calculated per country.
The analysis also looks at India, comparing its data with neighbouring countries such as Pakistan, Afghanistan, Bangladesh and Nepal.
# Confirmed Cases
rdata_confirmed
# Death cases
rdata_deaths
# Recovered Cases
rdata_recovered
Let's clean the data to create country-level and global combined data.
# DATA CLEANING: To create country level and global combined data
# Convert each data set from wide to long AND aggregate at country level
options(dplyr.summarise.inform=FALSE)
confirmed.case <- rdata_confirmed %>% gather(key="date", value="confirmed", -c(Country.Region, Province.State, Lat, Long)) %>% group_by(Country.Region, date) %>% summarize(confirmed=sum(confirmed))
deaths.case <- rdata_deaths %>% gather(key="date", value="deaths", -c(Country.Region, Province.State, Lat, Long)) %>% group_by(Country.Region, date) %>% summarize(deaths=sum(deaths))
recovered.case <- rdata_recovered %>% gather(key="date", value="recovered", -c(Country.Region, Province.State, Lat, Long)) %>% group_by(Country.Region, date) %>% summarize(recovered=sum(recovered))
summary(confirmed.case)
## Country.Region date confirmed
## Length:208638 Length:208638 Min. : 0
## Class :character Class :character 1st Qu.: 2623
## Mode :character Mode :character Median : 41796
## Mean : 1184737
## 3rd Qu.: 411046
## Max. :98538245
# Final data: combine all three
country <- full_join(confirmed.case,deaths.case) %>% full_join(recovered.case)
## Joining, by = c("Country.Region", "date")
## Joining, by = c("Country.Region", "date")
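The reshaping and joining steps can be tried on a tiny data set. The sketch below uses made-up numbers and hypothetical names (toy_confirmed, toy_deaths, Xland, Yland) in the same wide layout, and uses tidyr's pivot_longer(), the newer equivalent of gather(). It is an illustration under those assumptions, not the code that produced the results in this report.

```r
library(dplyr)
library(tidyr)

# Made-up data: one row per province, one column per date
# (read.csv prefixes the date headers with "X").
toy_confirmed <- data.frame(
  Province.State = c("A", "B", ""),
  Country.Region = c("Xland", "Xland", "Yland"),
  Lat = c(0, 1, 2),
  Long = c(0, 1, 2),
  X1.22.20 = c(1L, 2L, 5L),
  X1.23.20 = c(3L, 4L, 6L)
)
toy_deaths <- toy_confirmed
toy_deaths$X1.22.20 <- c(0L, 0L, 1L)
toy_deaths$X1.23.20 <- c(0L, 1L, 1L)

id_cols <- c("Province.State", "Country.Region", "Lat", "Long")

# Wide -> long, then add up the provinces within each country and date.
long_confirmed <- toy_confirmed %>%
  pivot_longer(cols = -all_of(id_cols), names_to = "date", values_to = "confirmed") %>%
  group_by(Country.Region, date) %>%
  summarise(confirmed = sum(confirmed), .groups = "drop")

long_deaths <- toy_deaths %>%
  pivot_longer(cols = -all_of(id_cols), names_to = "date", values_to = "deaths") %>%
  group_by(Country.Region, date) %>%
  summarise(deaths = sum(deaths), .groups = "drop")

# Naming the join keys explicitly avoids the "Joining, by = ..." message.
toy_country <- full_join(long_confirmed, long_deaths,
                         by = c("Country.Region", "date"))
print(toy_country)
```

Passing the keys through `by` also guards against an accidental join on any other column the two tables happen to share.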
The number of confirmed cases, deaths and recoveries by day is presented below. The date column is fixed by removing the leading X and converting it to a date, and a day counter is added for each country.
# Date variable
# Fix date variable and convert from character to date
#str(country) # check date character
country$date <- country$date %>% sub("X", "", .) %>% as.Date("%m.%d.%y")
#str(country) # check date Date
# Create new variable: number of days
# Sort chronologically first: until now the rows were ordered by the character date labels
# as.numeric() keeps cumsum() from overflowing the integer range
country <- country %>% arrange(Country.Region, date) %>% group_by(Country.Region) %>% mutate(cumconfirmed=cumsum(as.numeric(confirmed)), days = date - first(date) + 1)
country
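The date conversion can be checked in isolation. The sketch below uses three made-up column labels in the X-prefixed form and also shows why rows should be ordered by the parsed dates, not by the character labels, before a per-country running total or first(date) is taken.

```r
# Column labels as read.csv produces them from headers such as "1/22/20".
labels <- c("X1.1.21", "X1.22.20", "X12.31.20")

# Drop the leading "X" and parse month.day.two-digit-year.
dates <- as.Date(sub("^X", "", labels), format = "%m.%d.%y")
print(dates)

# Ordering by the parsed dates gives the chronological sequence;
# sorting the labels as text would not.
chronological <- labels[order(dates)]
print(chronological)
```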
Worldwide cases by date, showing how the number of cases changes from day to day.
# Aggregate at world level
world <- country %>% group_by(date) %>% summarize(confirmed=sum(confirmed), cumconfirmed=sum(cumconfirmed), deaths=sum(deaths), recovered=sum(recovered)) %>% mutate(days = date - first(date) + 1)
# Inspect the world-level data
world
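The JHU CSSE time series files record running totals, not per-day counts, so day-to-day increases have to be derived from them. A minimal sketch with made-up numbers follows; new_confirmed is a hypothetical column name, and treating the first day's total as new cases is an assumption of this example.

```r
library(dplyr)

# Made-up cumulative counts for two countries (not real data).
toy <- data.frame(
  Country.Region = rep(c("Xland", "Yland"), each = 4),
  date = rep(as.Date("2020-01-22") + 0:3, times = 2),
  confirmed = c(1, 3, 6, 6, 10, 15, 15, 21)
)

# New cases per day = today's running total minus yesterday's,
# computed separately within each country.
daily <- toy %>%
  arrange(Country.Region, date) %>%
  group_by(Country.Region) %>%
  mutate(new_confirmed = confirmed - lag(confirmed, default = 0)) %>%
  ungroup()
print(daily)
```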
Cases in India only, ordered by date.
india <- country %>% filter(Country.Region=="India") %>% arrange(date)
india
Worldwide summary of cases for all countries by confirmed, cumconfirmed, deaths and recovered.
# SUMMARY STATISTICS
summary(country)
## Country.Region date confirmed deaths
## Length:208638 Min. :2020-01-22 Min. : 0 Min. : 0
## Class :character 1st Qu.:2020-10-07 1st Qu.: 2623 1st Qu.: 35
## Mode :character Median :2021-06-23 Median : 41796 Median : 672
## Mean :2021-06-23 Mean : 1184737 Mean : 17781
## 3rd Qu.:2022-03-10 3rd Qu.: 411046 3rd Qu.: 6418
## Max. :2022-11-24 Max. :98538245 Max. :1079052
##
## recovered cumconfirmed days
## Min. : -1 Min. :0.000e+00 Length:208638
## 1st Qu.: 0 1st Qu.:4.953e+06 Class :difftime
## Median : 0 Median :3.866e+07 Mode :numeric
## Mean : 112594 Mean :2.137e+08
## 3rd Qu.: 6080 3rd Qu.:2.275e+08
## Max. :30974748 Max. :2.147e+09
## NA's :12097
#by(country$confirmed, country$Country.Region, summary)
#by(country$cumconfirmed, country$Country.Region, summary)
#by(country$deaths, country$Country.Region, summary)
#by(country$recovered, country$Country.Region, summary)
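The cumconfirmed maximum of 2.147e+09 in the summary above sits at the limit of R's 32-bit integers, and cumsum() returns NA once an integer running total passes that limit. A minimal sketch of the effect with made-up values, and of the as.numeric() conversion that avoids it:

```r
# Largest value an R integer can hold.
print(.Machine$integer.max)

# Two made-up integer counts whose running total exceeds that limit.
counts <- c(2000000000L, 2000000000L)

# Integer cumsum() overflows: the second element becomes NA (with a warning,
# silenced here so the example runs quietly).
total_int <- suppressWarnings(cumsum(counts))
print(total_int)

# Converting to double first keeps the running total intact.
total_num <- cumsum(as.numeric(counts))
print(total_num)
```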
World summary
summary(world)
## date confirmed cumconfirmed deaths
## Min. :2020-01-22 Min. : 557 Min. :8.436e+07 Min. : 17
## 1st Qu.:2020-10-07 1st Qu.: 36306535 1st Qu.:2.916e+09 1st Qu.:1124106
## Median :2021-06-23 Median :180300522 Median :5.824e+09 Median :3925632
## Mean :2021-06-23 Mean :238132221 Mean :5.418e+09 Mean :3573964
## 3rd Qu.:2022-03-09 3rd Qu.:453179623 3rd Qu.:7.830e+09 3rd Qu.:6058397
## Max. :2022-11-24 Max. :640360567 Max. :1.000e+10 Max. :6627835
## NA's :982
## recovered days
## Min. : -1 Length:1038
## 1st Qu.: 0 Class :difftime
## Median : 49698 Mode :numeric
## Mean : 22631460
## 3rd Qu.: 36086810
## Max. :130899061
##
India Summary
summary(india)
## Country.Region date confirmed deaths
## Length:1038 Min. :2020-01-22 Min. : 0 Min. : 0
## Class :character 1st Qu.:2020-10-07 1st Qu.: 6853279 1st Qu.:105767
## Mode :character Median :2021-06-23 Median :30108612 Median :392646
## Mean :2021-06-23 Mean :23544833 Mean :297877
## 3rd Qu.:2022-03-09 3rd Qu.:42983212 3rd Qu.:515650
## Max. :2022-11-24 Max. :44672048 Max. :530604
##
## recovered cumconfirmed days
## Min. : 0 Min. :1.031e+07 Length:1038
## 1st Qu.: 0 1st Qu.:5.710e+08 Class :difftime
## Median : 3 Median :9.788e+08 Mode :numeric
## Mean : 4681491 Mean :1.027e+09
## 3rd Qu.: 8371479 3rd Qu.:1.478e+09
## Max. :30974748 Max. :2.137e+09
## NA's :943
Bar charts of confirmed cases over time, giving a comparative view of confirmed cases.
ggplot(world, aes(x=date, y=confirmed)) + geom_bar(stat="identity", width=0.1) +
theme_classic() +
labs(title = "Covid-19 Global Confirmed Cases", x= "Date", y= "Cumulative confirmed cases") +
theme(plot.title = element_text(hjust = 0.5))
Confirmed cases in India, shown as a bar chart.
# India confirmed
ggplot(india, aes(x=date, y=confirmed)) + geom_bar(stat="identity", width=0.1) +
theme_classic() +
labs(title = "Covid-19 Confirmed Cases in India", x= "Date", y= "Cumulative confirmed cases") +
theme(plot.title = element_text(hjust = 0.4))
Graphical representation of cases: world confirmed, deaths and recovered.
#str(world)
world %>% gather("Type", "Cases", -c(date, days)) %>%
ggplot(aes(x=date, y=Cases, colour=Type)) + geom_bar(stat="identity", width=0.2, fill="white") +
theme_classic() +
labs(title = "Covid-19 Global Cases", x= "Date", y= "Cumulative cases") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 982 rows containing missing values (position_stack).
# Line graph of cases over time
# World confirmed
ggplot(world, aes(x=days, y=confirmed)) + geom_line() +
theme_classic() +
labs(title = "Covid-19 Global Confirmed Cases", x= "Days", y= "Cumulative confirmed cases") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
# Ignore warning
# World confirmed with counts in log10 scale
ggplot(world, aes(x=days, y=confirmed)) + geom_line() +
theme_classic() +
labs(title = "Covid-19 Global Confirmed Cases", x= "Days", y= "Cumulative confirmed cases (log scale)") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(trans="log10")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
# World confirmed, deaths and recovered
str(world)
## tibble [1,038 × 6] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:1038], format: "2020-01-22" "2020-01-23" ...
## $ confirmed : int [1:1038] 557 657 944 1437 2120 2929 5580 6169 8237 9927 ...
## $ cumconfirmed: num [1:1038] 5.82e+09 6.27e+09 6.72e+09 7.18e+09 7.64e+09 ...
## $ deaths : int [1:1038] 17 18 26 42 56 82 131 133 172 214 ...
## $ recovered : int [1:1038] 30 32 39 42 56 65 108 127 145 225 ...
## $ days : 'difftime' num [1:1038] 1 2 3 4 ...
## ..- attr(*, "units")= chr "days"
world %>% gather("Type", "Cases", -c(date, days)) %>%
ggplot(aes(x=days, y=Cases, colour=Type)) + geom_line() +
theme_classic() +
labs(title = "Covid-19 Global Cases", x= "Days", y= "Cumulative cases") +
theme(plot.title = element_text(hjust = 0.5))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 299 row(s) containing missing values (geom_path).
# Confirmed by country for select countries with counts in log10 scale
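# %in% keeps every row whose country is in the list; == would recycle the vector and drop most rows
countryselection <- country %>% filter(Country.Region %in% c("India", "Pakistan", "China", "Bangladesh", "Nepal", "Sri Lanka"))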
ggplot(countryselection, aes(x=days, y=confirmed, colour=Country.Region)) + geom_line(size=1) +
theme_classic() +
labs(title = "Covid-19 Confirmed Cases by Country", x= "Days", y= "Cumulative confirmed cases (log scale)") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(trans="log10")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Transformation introduced infinite values in continuous y-axis
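When a filter should keep rows for several countries, the choice between `==` and `%in%` matters. The sketch below uses a made-up vector of country names to show how the two operators differ; the names countries and wanted are hypothetical.

```r
countries <- c("India", "Nepal", "India", "Nepal", "China", "India")
wanted <- c("India", "Nepal")

# `==` recycles `wanted` along `countries`, so each element is compared with
# only one of the names (element 1 with "India", element 2 with "Nepal", ...).
by_equals <- countries == wanted
print(by_equals)

# `%in%` asks, for every element, whether it appears anywhere in `wanted`.
by_in <- countries %in% wanted
print(by_in)
```

In this example the last "India" is missed by `==` because it happens to be lined up against "Nepal", which is why a filter written with `==` silently keeps only a fraction of the matching rows.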
# Matrix of line graphs of confirmed, deaths and recovered for select countries in log10 scale
str(countryselection)
## grouped_df [865 × 7] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ Country.Region: chr [1:865] "Bangladesh" "Bangladesh" "Bangladesh" "Bangladesh" ...
## $ date : Date[1:865], format: "2022-01-10" "2022-01-13" ...
## $ confirmed : int [1:865] 1595931 1604664 1617711 1642294 1664616 1685136 1715997 1747331 1773149 0 ...
## $ deaths : int [1:865] 28105 28123 28144 28176 28192 28223 28256 28288 28329 0 ...
## $ recovered : int [1:865] 0 0 0 0 0 0 0 0 0 0 ...
## $ cumconfirmed : int [1:865] 4218793 10595383 17013896 23498363 29977769 34399351 38879512 43424727 48029007 52452268 ...
## $ days : 'difftime' num [1:865] 375 378 381 384 ...
## ..- attr(*, "units")= chr "days"
## - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Country.Region: chr [1:5] "Bangladesh" "China" "India" "Nepal" ...
## ..$ .rows : list<int> [1:5]
## .. ..$ : int [1:173] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:173] 174 175 176 177 178 179 180 181 182 183 ...
## .. ..$ : int [1:173] 347 348 349 350 351 352 353 354 355 356 ...
## .. ..$ : int [1:173] 520 521 522 523 524 525 526 527 528 529 ...
## .. ..$ : int [1:173] 693 694 695 696 697 698 699 700 701 702 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
countryselection %>% gather("Type", "Cases", -c(date, days, Country.Region)) %>%
ggplot(aes(x=days, y=Cases, colour=Country.Region)) + geom_line(size=1) +
theme_classic() +
labs(title = "Covid-19 Cases by Country", x= "Days", y= "Cumulative cases (log scale)") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(trans="log10") +
facet_grid(rows=vars(Type))
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Transformation introduced infinite values in continuous y-axis
countrytotal <- country %>% group_by(Country.Region) %>% summarize(cumconfirmed=sum(confirmed), cumdeaths=sum(deaths), cumrecovered=sum(recovered))
countrytotal
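Because the source counts are running totals, the total for a country is the value on its latest date; adding up every day's value counts earlier cases repeatedly. The sketch below contrasts the two on made-up numbers (toy, latest and summed are hypothetical names).

```r
library(dplyr)

# Made-up cumulative counts for two countries (not real data).
toy <- data.frame(
  Country.Region = rep(c("Xland", "Yland"), each = 3),
  date = rep(as.Date("2020-01-22") + 0:2, times = 2),
  confirmed = c(1, 3, 6, 10, 15, 21)
)

# Total to date: keep the row with the latest date in each country.
latest <- toy %>%
  group_by(Country.Region) %>%
  filter(date == max(date)) %>%
  ungroup()
print(latest)

# Summing the running totals instead inflates the figure.
summed <- toy %>%
  group_by(Country.Region) %>%
  summarise(total = sum(confirmed), .groups = "drop")
print(summed)
```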
4. Outcomes
The analysis gives us a clear understanding of the growth rate of cases all over the world.
The data shows the daily growth rate for each country:
* Case status in India
* Case growth rate in India
* Case growth rate compared to neighbouring countries
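One common way to express a daily growth rate is the day's new cases divided by the previous day's cumulative total. A minimal sketch on made-up cumulative counts follows; this definition is an assumption, since the report does not state which formula was used.

```r
# Made-up cumulative confirmed counts on five consecutive days.
confirmed <- c(100, 110, 121, 121, 150)

# New cases on each day after the first.
new_cases <- diff(confirmed)

# Growth rate: new cases relative to the previous day's total.
growth_rate <- new_cases / head(confirmed, -1)
print(round(growth_rate, 3))
```

A rate of 0 marks a day with no new cases, and the first day has no rate because there is no previous day to compare with.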
5. Results and Discussions
A large data set on people affected by the coronavirus gives us better ways of fighting the pandemic. The data helps us to think and prepare in a better way: