Objective

This project aims to analyze the covid-19 data and compare the cases with various socio-economic factors.The project is also aims to find out insights from data and model the correlation between factors to predict the impact of secondary wave over top affected countries

 

Preliminary Analysis

Importing necessary packages

library(dplyr)
library(gapminder)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyverse)
library(gapminder)

Importing data

raw.data.confirmed <- read.csv('time_series_covid19_confirmed_global.csv')
raw.data.deaths <- read.csv('time_series_covid19_deaths_global.csv')
raw.data.recovered <- read.csv('time_series_covid19_recovered_global.csv')

 

Dimensions of the tables

dim(raw.data.confirmed)
## [1] 274 450
dim(raw.data.deaths)
## [1] 274 450
dim(raw.data.recovered)
## [1] 259 450

 

inspecting sample subset of the table

  • Covid confirmed data
raw.data.confirmed[1:5,1:7]
  • last column of the dataset
colnames(raw.data.confirmed)[450]
## [1] "X4.11.21"

 

 

  • Covid death data
raw.data.deaths[1:5,1:7]
  • Last column of deaths data
colnames(raw.data.deaths)[450]
## [1] "X4.11.21"

 

 

  • Covid recovered data
raw.data.recovered[1:5,1:7]
  • Last column of recovered data
colnames(raw.data.recovered)[450]
## [1] "X4.11.21"

Number of countries in the data

  • Confirmed data
raw.data.confirmed$Country.Region = factor(raw.data.confirmed$Country.Region)
nlevels(raw.data.confirmed$Country.Region)
## [1] 192

 

  • Deaths dataset
raw.data.deaths$Country.Region = factor(raw.data.deaths$Country.Region)
nlevels(raw.data.deaths$Country.Region)
## [1] 192

 

  • Recovered dataset
raw.data.recovered$Country.Region = factor(raw.data.recovered$Country.Region)
nlevels(raw.data.recovered$Country.Region)
## [1] 192

 

All the datasets contains same number of countries , same number of dates but dimensions of recovered data is not same as others

 

Inspecting for NA values

sum(is.na(raw.data.confirmed))
## [1] 2
sum(is.na(raw.data.deaths))
## [1] 2
sum(is.na(raw.data.recovered))
## [1] 0

 

 

Dataset confirmed and deaths contains 2 NA values ** Index of NA values in confirmed **

 

which(is.na(raw.data.confirmed)==T)
## [1] 601 875

Let’s look the two rows that contains NA values

raw.data.confirmed[601,1:10]
raw.data.confirmed[875,1:10]

 

since both rows are empty we can drop the rows

covid_confirmed <- raw.data.confirmed %>% na.omit()

 

  • index of NA values in covid deaths data
which(is.na(raw.data.deaths)==T)
## [1] 601 875

Let’s look the two rows that contains NA values

raw.data.deaths[601,1:10]
raw.data.deaths[875,1:10]

since both rows are empty, so we can drop them

covid_deaths <- raw.data.deaths %>% na.omit()

 

 

From the above basic data explorations, the Dataframe contains daily confirmed covid cases in 192 countries over the period of 22/01/2020 to 11/04/2021 . The dataframe has missing values in Province.State. and there is also two complete NA rows.The date fields in the datasets are spread in columns we can create a column for confirmed cases and arrange date into rows

 

Restructuring the messy date columns

confirmed_long <- covid_confirmed %>% gather(date,confirmed,X1.22.20:X4.11.21,na.rm = TRUE)
confirmed_long %>% head(10)

 

confirmed_long %>% tail(10)

 

Replacing X with 0 in date column

confirmed_long$date[1:69069] <- confirmed_long$date[1:69069] %>% str_replace("X","0")
confirmed_long$date[94186:121758] <- confirmed_long$date[94186:121758] %>% str_replace("X","0")
confirmed_long$date[69070:94185] <- confirmed_long$date[69070:94185] %>% str_replace("X","")
confirmed_long[114919,]

 

 

Converting date column string data into date using lubridate

confirmed_long$date <- mdy(confirmed_long$date)
confirmed_long$date %>% class()
## [1] "Date"

 

Similarly we can convert date fields in deaths and recovered datasets

deaths_long <- covid_deaths %>% gather(date,deaths,X1.22.20:X4.11.21,na.rm = TRUE)

 

deaths_long$date[1:69069] <- deaths_long$date[1:69069] %>% str_replace("X","0")
deaths_long$date[94186:121758] <- deaths_long$date[94186:121758] %>% str_replace("X","0")
deaths_long$date[69070:94185] <- deaths_long$date[69070:94185] %>% str_replace("X","")
deaths_long[114919,]

 

deaths_long$date <- mdy(deaths_long$date)
deaths_long$date %>% class()
## [1] "Date"
deaths_long %>% head(10)
deaths_long %>% summary()
##  Province.State            Country.Region       Lat               Long        
##  Length:121758      China         :14718   Min.   :-51.796   Min.   :-178.12  
##  Class :character   Canada        : 6690   1st Qu.:  5.152   1st Qu.: -19.02  
##  Mode  :character   France        : 5352   Median : 21.694   Median :  20.94  
##                     United Kingdom: 5352   Mean   : 20.535   Mean   :  23.03  
##                     Australia     : 3568   3rd Qu.: 41.113   3rd Qu.:  84.25  
##                     Netherlands   : 2230   Max.   : 71.707   Max.   : 178.06  
##                     (Other)       :83848                                      
##       date                deaths      
##  Min.   :2020-01-22   Min.   :     0  
##  1st Qu.:2020-05-12   1st Qu.:     0  
##  Median :2020-08-31   Median :    16  
##  Mean   :2020-08-31   Mean   :  3846  
##  3rd Qu.:2020-12-21   3rd Qu.:   366  
##  Max.   :2021-04-11   Max.   :562066  
## 

 

colnames(raw.data.recovered)[5:257] <- colnames(raw.data.recovered)[5:257] %>% str_replace("X","0")
colnames(raw.data.recovered)[258:349] <- colnames(raw.data.recovered)[258:349] %>% str_replace("X","")
colnames(raw.data.recovered)[350:450] <- colnames(raw.data.recovered)[350:450] %>% str_replace("X","0")

 

recovered_long <- gather(raw.data.recovered,date,recovered,"01.22.20":"04.11.21",na.rm = TRUE)

 

recovered_long %>% head()

 

recovered_long %>% tail()

 

recovered_long$date <- mdy(recovered_long$date)
recovered_long$date %>% class()
## [1] "Date"
recovered_long %>% head(20)
recovered_long %>% summary()
##  Province.State            Country.Region       Lat               Long        
##  Length:115514      China         :14718   Min.   :-51.796   Min.   :-178.12  
##  Class :character   France        : 5352   1st Qu.:  4.571   1st Qu.:  -9.43  
##  Mode  :character   United Kingdom: 5352   Median : 19.856   Median :  24.60  
##                     Australia     : 3568   Mean   : 19.103   Mean   :  28.50  
##                     Netherlands   : 2230   3rd Qu.: 39.074   3rd Qu.:  90.43  
##                     Denmark       : 1338   Max.   : 71.707   Max.   : 178.06  
##                     (Other)       :82956                                      
##       date              recovered       
##  Min.   :2020-01-22   Min.   :       0  
##  1st Qu.:2020-05-12   1st Qu.:      24  
##  Median :2020-08-31   Median :     803  
##  Mean   :2020-08-31   Mean   :   95422  
##  3rd Qu.:2020-12-21   3rd Qu.:   12878  
##  Max.   :2021-04-11   Max.   :12156529  
## 

 

Detailed analysis

1. Cumulative covid cases over the world as of 2021-04-11

cumulative_confirmed_cases <- confirmed_long %>% group_by(Country.Region) %>% mutate(total_confirmed = sum(confirmed)) %>% filter(date=="2020-12-13 ") %>% arrange(desc(confirmed))

 

cumulative_confirmed_cases %>% head(10)
cumulative_confirmed_cases %>% head(10)%>% ggplot(aes(x = Country.Region,y=total_confirmed,color=Country.Region))+geom_col()

 

leaflet(cumulative_confirmed_cases) %>% 
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>% 
  addCircleMarkers(lng=~Long,lat=~Lat,
                   label = ~paste0(Country.Region,"-",Province.State),
                   popup = ~paste0(Country.Region, "<br/>", confirmed),
                   radius= ~confirmed*0.0000015,color = "red")

 

 

2. Cumulative deaths till 11 April 2021

cumulative_death_cases <- deaths_long %>% group_by(Country.Region) %>% mutate(total_deaths = sum(deaths)) %>% filter(date=="2020-12-13 ") %>% arrange(desc(deaths))
cumulative_death_cases %>% head(10)
cumulative_death_cases %>% head(10) %>% ggplot(aes(x = Country.Region,y=deaths,color=Country.Region))+geom_col()

 

leaflet(cumulative_death_cases) %>% 
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>% 
  addCircleMarkers(lng=~Long,lat=~Lat,
                   label = ~paste0(Country.Region,"-",Province.State),
                   popup = ~paste0(Country.Region, "<br/>", deaths),
                   radius= ~deaths*0.0000425,color = "silver")

 

 

3. Cumulative recovered

cumulative_recovered_cases <- recovered_long %>% group_by(Country.Region) %>% mutate(total_recovered = sum(recovered)) %>% filter(date=="2020-12-13 ") %>% arrange(desc(recovered))
cumulative_recovered_cases %>% head(10)
cumulative_recovered_cases %>% head(10)%>% ggplot(aes(x = Country.Region,y=recovered,color=Country.Region))+geom_col()

 

leaflet(cumulative_recovered_cases) %>% 
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>% 
  addCircleMarkers(lng=~Long,lat=~Lat,
                   label = ~paste0(Country.Region,"-",Province.State),
                   popup = ~paste0(Country.Region, "<br/>", recovered),
                   radius= ~recovered*0.0000035,color = "#00FF00")

 

confirmed_long  %>%  group_by(date) %>% 
  summarize(daily_confirm=sum(confirmed)) %>% 
  ggplot(aes(x=date,y=daily_confirm))+geom_line()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

 

 

Cumulative covid cases in India

confirmed_long[confirmed_long$Country.Region=="India",] %>% 
  group_by(date) %>% 
  summarize(average_daily_confirm=sum(confirmed)) %>% 
  ggplot(aes(x=date,y=average_daily_confirm))+geom_line()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

 

Extracting covid cases in India

confirmed_india <- confirmed_long%>% filter(Country.Region=="India")
daily_confirmed_india <- confirmed_india %>% group_by(date) %>% 
  summarize(cumulative_confirm=sum(confirmed))

deaths_india <- deaths_long%>% filter(Country.Region=="India")
daily_deaths_india <- deaths_india %>% group_by(date) %>% 
  summarize(cumulative_deaths=sum(deaths))


recovered_india <- recovered_long %>% filter(Country.Region=="India")
daily_recovered_india <- recovered_india %>% group_by(date) %>% 
  summarize(cumulative_recovered=sum(recovered))

 

daily_cases_confirmed <- daily_confirmed_india$cumulative_confirm[2:dim(daily_confirmed_india)[1]] - daily_confirmed_india$cumulative_confirm[1:(dim(daily_confirmed_india)[1]-1)]
daily_confirmed_india$daily_cases_confirmed <- c(0,daily_cases_confirmed)

daily_cases_recovered <- daily_recovered_india$cumulative_recovered[2:dim(daily_recovered_india)[1]] - daily_recovered_india$cumulative_recovered[1:(dim(daily_recovered_india)[1]-1)]
daily_recovered_india$daily_cases_recovered <- c(0,daily_cases_recovered)

daily_cases_death <- daily_deaths_india$cumulative_deaths[2:dim(daily_deaths_india)[1]] - daily_deaths_india$cumulative_deaths[1:(dim(daily_deaths_india)[1]-1)]
daily_deaths_india$daily_cases_death <- c(0,daily_cases_death)

 

combined_daily_india <- merge(daily_confirmed_india,daily_deaths_india ,by.x = "date",by.y = "date")
combined_daily_india <- merge(combined_daily_india,daily_recovered_india ,by.x = "date",by.y = "date")
combined_daily_india %>% tail(20)

 

 

Analysing the time difference between waves

Top affected countries

cumulative_confirmed_cases %>% head(10) %>% ggplot(aes(x = Country.Region,y=confirmed,color=Country.Region))+geom_col()

 

US , India ,Brazil are the top affected countries by covid - 19 confirmed cases

 

Daily confirmed cases in india

combined_daily_india %>%  ggplot(aes(x=date,))+
  geom_line(aes(y=daily_cases_confirmed),color="red")+
  geom_line(aes(y=daily_cases_death))+
  geom_line(aes(y=daily_cases_recovered),color="blue")

 

combined_daily_india %>%  ggplot(aes(x=date,))+
  geom_smooth(aes(y=daily_cases_confirmed), span = 0.3,color ="red")+
  geom_smooth(aes(y=daily_cases_confirmed),method = lm,color = "gray")+
  geom_smooth(aes(y=daily_cases_death), span = 0.3,color="black")+
  geom_smooth(aes(y=daily_cases_recovered), span = 0.3,color ="blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

 

From the above plot we can infer that India is facing a second wave

 

 

Daily confirmed cases in US

daily_confirmed_US <- confirmed_long %>% filter(Country.Region=="US") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_US$daily_confirm[2:dim(daily_confirmed_US)[1]] - daily_confirmed_US$daily_confirm[1:(dim(daily_confirmed_US)[1]-1)]
daily_confirmed_US$daily_cases <- c(0,daily_cases)
daily_confirmed_US %>%  ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

 

 

Daily confirmed cases in Brazil

daily_confirmed_Brazil <- confirmed_long %>% filter(Country.Region=="Brazil") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_Brazil$daily_confirm[2:dim(daily_confirmed_Brazil)[1]] - daily_confirmed_Brazil$daily_confirm[1:(dim(daily_confirmed_Brazil)[1]-1)]
daily_confirmed_Brazil$daily_cases <- c(0,daily_cases)
daily_confirmed_Brazil %>%  ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Daily confirmed cases in Italy

daily_confirmed_Italy <- confirmed_long %>% filter(Country.Region=="Italy") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_Italy$daily_confirm[2:dim(daily_confirmed_Italy)[1]] - daily_confirmed_Italy$daily_confirm[1:(dim(daily_confirmed_Italy)[1]-1)]
daily_confirmed_Italy$daily_cases <- c(0,daily_cases)
daily_confirmed_Italy %>%  ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Daily confirmed cases in Mexico

daily_confirmed_Mexico <- confirmed_long %>% filter(Country.Region=="Mexico") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_Mexico$daily_confirm[2:dim(daily_confirmed_Mexico)[1]] - daily_confirmed_Mexico$daily_confirm[1:(dim(daily_confirmed_Mexico)[1]-1)]
daily_confirmed_Mexico$daily_cases <- c(0,daily_cases)
daily_confirmed_Mexico %>%  ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

combined_data <- merge(cumulative_recovered_cases,cumulative_confirmed_cases,by.x = "Country.Region" ,by.y = "Country.Region")
combined_data <- merge(combined_data,cumulative_death_cases,by.x = "Country.Region" ,by.y = "Country.Region")
combined_data$percent_recovery <- (combined_data$total_recovered/combined_data$total_confirmed)*100
combined_data$percent_deaths <- (combined_data$total_deaths/combined_data$total_confirmed)*100
combined_data <- combined_data %>% arrange(percent_recovery)
combined_data%>% head(5)

 

Removing sweden from the data

combined_data <- combined_data[2:dim(combined_data)[1],]
combined_data%>% head(5)

 

Importing gapminder dataset to compare the socio-economic factors with covid cases

lifeexp <- gapminder %>% filter(year==2007)
lifeexp %>% arrange(desc(lifeExp))

 

merging gapminder dataset with covid cases by country

combined_data <- merge(combined_data,lifeexp,by.x = "Country.Region",by.y = "country")

 

combined_data_percentile_recovered <- combined_data  %>% group_by(Country.Region,Lat,Long) %>% summarise(recovery_percentile = max(percent_recovery))%>% arrange(desc(recovery_percentile))
## `summarise()` has grouped output by 'Country.Region', 'Lat'. You can override using the `.groups` argument.
combined_data_percentile_recovered %>% head(5)

 

The table shows Iceland has more recovery percentile

 

 

** point plot with smoothened curve

combined_data %>% ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

 

combined_data$confirmed_vs_pop <- combined_data$total_confirmed/combined_data$pop
combined_data %>% arrange(percent_recovery) %>% head(5)

 

We can check whether the confirmed_vs_pop follows any distribution using boxplot

combined_data %>% ggplot(aes(x=confirmed_vs_pop))+geom_boxplot()

cor(x=combined_data$pop,combined_data$total_confirmed)
## [1] -0.8673437

 

From the box plot we can see that total confirmed_vs_population factor is not following normal distribution , this is due to 2 reasons 1. There countries where confirmed cases > population (2007) 2. For countries having relatively higher population , the total confirmed cases have lesser correlation with population

 

We can filter countries based on population based on the geom_smooth plot and filter based on the assumption that confirmed_vs_pop always < 1 (max confirmed cases = pop)

 

combined_data_cleaned <- combined_data %>% filter(confirmed_vs_pop <1) %>% filter(pop<1000000000)
combined_data_cleaned %>% ggplot(aes(x=confirmed_vs_pop))+geom_boxplot()

** Correlation between population and total confirmed **

cor(x=combined_data_cleaned$pop,combined_data_cleaned$total_confirmed)
## [1] 0.9012901
combined_data_cleaned %>%  ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

 

Again experimenting with confirmed_vs_pop filter condition we will get an approximate normal distribution at confirmed_vs_pop = 0.3

combined_data_cleaned <- combined_data %>% filter(confirmed_vs_pop <0.3) %>% filter(pop<1000000000)
combined_data_cleaned %>% ggplot(aes(x=confirmed_vs_pop))+geom_boxplot()

** Correlation between population and total confirmed **

combined_data_cleaned %>%  ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

 

combined_data_cleaned %>%  ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

 

cor(x=combined_data_cleaned$pop,combined_data_cleaned$total_confirmed)
## [1] 0.7717487

 

** Correlation between lefe expectance and percent deaths **

cor(x=combined_data_cleaned$lifeExp,combined_data_cleaned$percent_deaths)
## [1] -0.1060197

 

** Correlation between gdpPercap and percent deaths **

cor(x=combined_data_cleaned$gdpPercap,y=combined_data_cleaned$percent_deaths)
## [1] -0.1379226

 

** Correlation between population and total percent recovery **

cor(x=combined_data_cleaned$pop,y=combined_data_cleaned$percent_recovery)
## [1] 0.03567923

 

combined_data_cleaned
cumulative_combined_data_cleaned <- combined_data_cleaned %>%group_by(Country.Region,Long,Lat) %>% summarise(cumulative_confirmed = sum(total_confirmed))
## `summarise()` has grouped output by 'Country.Region', 'Long'. You can override using the `.groups` argument.
leaflet(cumulative_combined_data_cleaned) %>% 
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>% 
  addCircleMarkers(lng=~Long,lat=~Lat,
                   popup = ~paste0(Country.Region, "<br/>", cumulative_confirmed),
                   radius= ~cumulative_confirmed*0.0000015,color = "red")

African countries have higher correlation between confirmed cases and population

Outcomes

  1. US,India,Brazil are the top covid affected countries
  2. Iceland has higher recovery percentile
  3. All the top affected countries are in the face of secondary wave.
  4. When comparing India with US, Brazil , France , Italy (top affected countries) , second wave is started after January 2021
  5. Life expectancy and covid death rate are not correlated
  6. gdpPercapita and covid death rate are not correlated
  7. Population and covid confirmed cases are correlated . which is more in countries where population < 1000000000 and confirmed vs population ratio < 1

Results and Discussions

The spread of covid-19 pandemic in different countries shows differences in rate of spread , time period , recovery percentile (total recovered/total confirmed), death percentile (total death/total confirmed).The examined countries also shows signals of secondary wave. Examined the influences of various socio-economic factors with covid-19 data to find any relations that cause the differences among countries. Population has direct influence in total confirmed cases which is assumable and this influence has a correlation of 0.90 , also correlated GDP per capita and Life Expectancy with recovery percentile , death percentile but the correlations were less than 0.3 which means country wise quality of health and economic conditions have no effect in covid resistance as per the provided data.