This project aims to analyze the covid-19 data and compare the cases with various socio-economic factors.The project is also aims to find out insights from data and model the correlation between factors to predict the impact of secondary wave over top affected countries
library(dplyr)
library(gapminder)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyverse)
library(gapminder)
raw.data.confirmed <- read.csv('time_series_covid19_confirmed_global.csv')
raw.data.deaths <- read.csv('time_series_covid19_deaths_global.csv')
raw.data.recovered <- read.csv('time_series_covid19_recovered_global.csv')
dim(raw.data.confirmed)
## [1] 274 450
dim(raw.data.deaths)
## [1] 274 450
dim(raw.data.recovered)
## [1] 259 450
raw.data.confirmed[1:5,1:7]
colnames(raw.data.confirmed)[450]
## [1] "X4.11.21"
raw.data.deaths[1:5,1:7]
colnames(raw.data.deaths)[450]
## [1] "X4.11.21"
raw.data.recovered[1:5,1:7]
colnames(raw.data.recovered)[450]
## [1] "X4.11.21"
raw.data.confirmed$Country.Region = factor(raw.data.confirmed$Country.Region)
nlevels(raw.data.confirmed$Country.Region)
## [1] 192
raw.data.deaths$Country.Region = factor(raw.data.deaths$Country.Region)
nlevels(raw.data.deaths$Country.Region)
## [1] 192
raw.data.recovered$Country.Region = factor(raw.data.recovered$Country.Region)
nlevels(raw.data.recovered$Country.Region)
## [1] 192
All the datasets contains same number of countries , same number of dates but dimensions of recovered data is not same as others
sum(is.na(raw.data.confirmed))
## [1] 2
sum(is.na(raw.data.deaths))
## [1] 2
sum(is.na(raw.data.recovered))
## [1] 0
Dataset confirmed and deaths contains 2 NA values ** Index of NA values in confirmed **
which(is.na(raw.data.confirmed)==T)
## [1] 601 875
Let’s look the two rows that contains NA values
raw.data.confirmed[601,1:10]
raw.data.confirmed[875,1:10]
since both rows are empty we can drop the rows
covid_confirmed <- raw.data.confirmed %>% na.omit()
which(is.na(raw.data.deaths)==T)
## [1] 601 875
Let’s look the two rows that contains NA values
raw.data.deaths[601,1:10]
raw.data.deaths[875,1:10]
since both rows are empty, so we can drop them
covid_deaths <- raw.data.deaths %>% na.omit()
From the above basic data explorations, the Dataframe contains daily confirmed covid cases in 192 countries over the period of 22/01/2020 to 11/04/2021 . The dataframe has missing values in Province.State. and there is also two complete NA rows.The date fields in the datasets are spread in columns we can create a column for confirmed cases and arrange date into rows
confirmed_long <- covid_confirmed %>% gather(date,confirmed,X1.22.20:X4.11.21,na.rm = TRUE)
confirmed_long %>% head(10)
confirmed_long %>% tail(10)
Replacing X with 0 in date column
confirmed_long$date[1:69069] <- confirmed_long$date[1:69069] %>% str_replace("X","0")
confirmed_long$date[94186:121758] <- confirmed_long$date[94186:121758] %>% str_replace("X","0")
confirmed_long$date[69070:94185] <- confirmed_long$date[69070:94185] %>% str_replace("X","")
confirmed_long[114919,]
confirmed_long$date <- mdy(confirmed_long$date)
confirmed_long$date %>% class()
## [1] "Date"
deaths_long <- covid_deaths %>% gather(date,deaths,X1.22.20:X4.11.21,na.rm = TRUE)
deaths_long$date[1:69069] <- deaths_long$date[1:69069] %>% str_replace("X","0")
deaths_long$date[94186:121758] <- deaths_long$date[94186:121758] %>% str_replace("X","0")
deaths_long$date[69070:94185] <- deaths_long$date[69070:94185] %>% str_replace("X","")
deaths_long[114919,]
deaths_long$date <- mdy(deaths_long$date)
deaths_long$date %>% class()
## [1] "Date"
deaths_long %>% head(10)
deaths_long %>% summary()
## Province.State Country.Region Lat Long
## Length:121758 China :14718 Min. :-51.796 Min. :-178.12
## Class :character Canada : 6690 1st Qu.: 5.152 1st Qu.: -19.02
## Mode :character France : 5352 Median : 21.694 Median : 20.94
## United Kingdom: 5352 Mean : 20.535 Mean : 23.03
## Australia : 3568 3rd Qu.: 41.113 3rd Qu.: 84.25
## Netherlands : 2230 Max. : 71.707 Max. : 178.06
## (Other) :83848
## date deaths
## Min. :2020-01-22 Min. : 0
## 1st Qu.:2020-05-12 1st Qu.: 0
## Median :2020-08-31 Median : 16
## Mean :2020-08-31 Mean : 3846
## 3rd Qu.:2020-12-21 3rd Qu.: 366
## Max. :2021-04-11 Max. :562066
##
colnames(raw.data.recovered)[5:257] <- colnames(raw.data.recovered)[5:257] %>% str_replace("X","0")
colnames(raw.data.recovered)[258:349] <- colnames(raw.data.recovered)[258:349] %>% str_replace("X","")
colnames(raw.data.recovered)[350:450] <- colnames(raw.data.recovered)[350:450] %>% str_replace("X","0")
recovered_long <- gather(raw.data.recovered,date,recovered,"01.22.20":"04.11.21",na.rm = TRUE)
recovered_long %>% head()
recovered_long %>% tail()
recovered_long$date <- mdy(recovered_long$date)
recovered_long$date %>% class()
## [1] "Date"
recovered_long %>% head(20)
recovered_long %>% summary()
## Province.State Country.Region Lat Long
## Length:115514 China :14718 Min. :-51.796 Min. :-178.12
## Class :character France : 5352 1st Qu.: 4.571 1st Qu.: -9.43
## Mode :character United Kingdom: 5352 Median : 19.856 Median : 24.60
## Australia : 3568 Mean : 19.103 Mean : 28.50
## Netherlands : 2230 3rd Qu.: 39.074 3rd Qu.: 90.43
## Denmark : 1338 Max. : 71.707 Max. : 178.06
## (Other) :82956
## date recovered
## Min. :2020-01-22 Min. : 0
## 1st Qu.:2020-05-12 1st Qu.: 24
## Median :2020-08-31 Median : 803
## Mean :2020-08-31 Mean : 95422
## 3rd Qu.:2020-12-21 3rd Qu.: 12878
## Max. :2021-04-11 Max. :12156529
##
cumulative_confirmed_cases <- confirmed_long %>% group_by(Country.Region) %>% mutate(total_confirmed = sum(confirmed)) %>% filter(date=="2020-12-13 ") %>% arrange(desc(confirmed))
cumulative_confirmed_cases %>% head(10)
cumulative_confirmed_cases %>% head(10)%>% ggplot(aes(x = Country.Region,y=total_confirmed,color=Country.Region))+geom_col()
leaflet(cumulative_confirmed_cases) %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addCircleMarkers(lng=~Long,lat=~Lat,
label = ~paste0(Country.Region,"-",Province.State),
popup = ~paste0(Country.Region, "<br/>", confirmed),
radius= ~confirmed*0.0000015,color = "red")
cumulative_death_cases <- deaths_long %>% group_by(Country.Region) %>% mutate(total_deaths = sum(deaths)) %>% filter(date=="2020-12-13 ") %>% arrange(desc(deaths))
cumulative_death_cases %>% head(10)
cumulative_death_cases %>% head(10) %>% ggplot(aes(x = Country.Region,y=deaths,color=Country.Region))+geom_col()
leaflet(cumulative_death_cases) %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addCircleMarkers(lng=~Long,lat=~Lat,
label = ~paste0(Country.Region,"-",Province.State),
popup = ~paste0(Country.Region, "<br/>", deaths),
radius= ~deaths*0.0000425,color = "silver")
cumulative_recovered_cases <- recovered_long %>% group_by(Country.Region) %>% mutate(total_recovered = sum(recovered)) %>% filter(date=="2020-12-13 ") %>% arrange(desc(recovered))
cumulative_recovered_cases %>% head(10)
cumulative_recovered_cases %>% head(10)%>% ggplot(aes(x = Country.Region,y=recovered,color=Country.Region))+geom_col()
leaflet(cumulative_recovered_cases) %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addCircleMarkers(lng=~Long,lat=~Lat,
label = ~paste0(Country.Region,"-",Province.State),
popup = ~paste0(Country.Region, "<br/>", recovered),
radius= ~recovered*0.0000035,color = "#00FF00")
confirmed_long %>% group_by(date) %>%
summarize(daily_confirm=sum(confirmed)) %>%
ggplot(aes(x=date,y=daily_confirm))+geom_line()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
confirmed_long[confirmed_long$Country.Region=="India",] %>%
group_by(date) %>%
summarize(average_daily_confirm=sum(confirmed)) %>%
ggplot(aes(x=date,y=average_daily_confirm))+geom_line()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
confirmed_india <- confirmed_long%>% filter(Country.Region=="India")
daily_confirmed_india <- confirmed_india %>% group_by(date) %>%
summarize(cumulative_confirm=sum(confirmed))
deaths_india <- deaths_long%>% filter(Country.Region=="India")
daily_deaths_india <- deaths_india %>% group_by(date) %>%
summarize(cumulative_deaths=sum(deaths))
recovered_india <- recovered_long %>% filter(Country.Region=="India")
daily_recovered_india <- recovered_india %>% group_by(date) %>%
summarize(cumulative_recovered=sum(recovered))
daily_cases_confirmed <- daily_confirmed_india$cumulative_confirm[2:dim(daily_confirmed_india)[1]] - daily_confirmed_india$cumulative_confirm[1:(dim(daily_confirmed_india)[1]-1)]
daily_confirmed_india$daily_cases_confirmed <- c(0,daily_cases_confirmed)
daily_cases_recovered <- daily_recovered_india$cumulative_recovered[2:dim(daily_recovered_india)[1]] - daily_recovered_india$cumulative_recovered[1:(dim(daily_recovered_india)[1]-1)]
daily_recovered_india$daily_cases_recovered <- c(0,daily_cases_recovered)
daily_cases_death <- daily_deaths_india$cumulative_deaths[2:dim(daily_deaths_india)[1]] - daily_deaths_india$cumulative_deaths[1:(dim(daily_deaths_india)[1]-1)]
daily_deaths_india$daily_cases_death <- c(0,daily_cases_death)
combined_daily_india <- merge(daily_confirmed_india,daily_deaths_india ,by.x = "date",by.y = "date")
combined_daily_india <- merge(combined_daily_india,daily_recovered_india ,by.x = "date",by.y = "date")
combined_daily_india %>% tail(20)
cumulative_confirmed_cases %>% head(10) %>% ggplot(aes(x = Country.Region,y=confirmed,color=Country.Region))+geom_col()
US , India ,Brazil are the top affected countries by covid - 19 confirmed cases
combined_daily_india %>% ggplot(aes(x=date,))+
geom_line(aes(y=daily_cases_confirmed),color="red")+
geom_line(aes(y=daily_cases_death))+
geom_line(aes(y=daily_cases_recovered),color="blue")
combined_daily_india %>% ggplot(aes(x=date,))+
geom_smooth(aes(y=daily_cases_confirmed), span = 0.3,color ="red")+
geom_smooth(aes(y=daily_cases_confirmed),method = lm,color = "gray")+
geom_smooth(aes(y=daily_cases_death), span = 0.3,color="black")+
geom_smooth(aes(y=daily_cases_recovered), span = 0.3,color ="blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the above plot we can infer that India is facing a second wave
daily_confirmed_US <- confirmed_long %>% filter(Country.Region=="US") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_US$daily_confirm[2:dim(daily_confirmed_US)[1]] - daily_confirmed_US$daily_confirm[1:(dim(daily_confirmed_US)[1]-1)]
daily_confirmed_US$daily_cases <- c(0,daily_cases)
daily_confirmed_US %>% ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
daily_confirmed_Brazil <- confirmed_long %>% filter(Country.Region=="Brazil") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_Brazil$daily_confirm[2:dim(daily_confirmed_Brazil)[1]] - daily_confirmed_Brazil$daily_confirm[1:(dim(daily_confirmed_Brazil)[1]-1)]
daily_confirmed_Brazil$daily_cases <- c(0,daily_cases)
daily_confirmed_Brazil %>% ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
daily_confirmed_Italy <- confirmed_long %>% filter(Country.Region=="Italy") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_Italy$daily_confirm[2:dim(daily_confirmed_Italy)[1]] - daily_confirmed_Italy$daily_confirm[1:(dim(daily_confirmed_Italy)[1]-1)]
daily_confirmed_Italy$daily_cases <- c(0,daily_cases)
daily_confirmed_Italy %>% ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Daily confirmed cases in Mexico
daily_confirmed_Mexico <- confirmed_long %>% filter(Country.Region=="Mexico") %>% group_by(date) %>% summarize(daily_confirm=sum(confirmed))
daily_cases <- daily_confirmed_Mexico$daily_confirm[2:dim(daily_confirmed_Mexico)[1]] - daily_confirmed_Mexico$daily_confirm[1:(dim(daily_confirmed_Mexico)[1]-1)]
daily_confirmed_Mexico$daily_cases <- c(0,daily_cases)
daily_confirmed_Mexico %>% ggplot(aes(x=date,y=daily_cases))+geom_line()+geom_smooth(span = 0.3,color ="red")+expand_limits(y=0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
combined_data <- merge(cumulative_recovered_cases,cumulative_confirmed_cases,by.x = "Country.Region" ,by.y = "Country.Region")
combined_data <- merge(combined_data,cumulative_death_cases,by.x = "Country.Region" ,by.y = "Country.Region")
combined_data$percent_recovery <- (combined_data$total_recovered/combined_data$total_confirmed)*100
combined_data$percent_deaths <- (combined_data$total_deaths/combined_data$total_confirmed)*100
combined_data <- combined_data %>% arrange(percent_recovery)
combined_data%>% head(5)
Removing sweden from the data
combined_data <- combined_data[2:dim(combined_data)[1],]
combined_data%>% head(5)
Importing gapminder dataset to compare the socio-economic factors with covid cases
lifeexp <- gapminder %>% filter(year==2007)
lifeexp %>% arrange(desc(lifeExp))
merging gapminder dataset with covid cases by country
combined_data <- merge(combined_data,lifeexp,by.x = "Country.Region",by.y = "country")
combined_data_percentile_recovered <- combined_data %>% group_by(Country.Region,Lat,Long) %>% summarise(recovery_percentile = max(percent_recovery))%>% arrange(desc(recovery_percentile))
## `summarise()` has grouped output by 'Country.Region', 'Lat'. You can override using the `.groups` argument.
combined_data_percentile_recovered %>% head(5)
The table shows Iceland has more recovery percentile
** point plot with smoothened curve
combined_data %>% ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
combined_data$confirmed_vs_pop <- combined_data$total_confirmed/combined_data$pop
combined_data %>% arrange(percent_recovery) %>% head(5)
We can check whether the confirmed_vs_pop follows any distribution using boxplot
combined_data %>% ggplot(aes(x=confirmed_vs_pop))+geom_boxplot()
cor(x=combined_data$pop,combined_data$total_confirmed)
## [1] -0.8673437
From the box plot we can see that total confirmed_vs_population factor is not following normal distribution , this is due to 2 reasons 1. There countries where confirmed cases > population (2007) 2. For countries having relatively higher population , the total confirmed cases have lesser correlation with population
We can filter countries based on population based on the geom_smooth plot and filter based on the assumption that confirmed_vs_pop always < 1 (max confirmed cases = pop)
combined_data_cleaned <- combined_data %>% filter(confirmed_vs_pop <1) %>% filter(pop<1000000000)
combined_data_cleaned %>% ggplot(aes(x=confirmed_vs_pop))+geom_boxplot()
** Correlation between population and total confirmed **
cor(x=combined_data_cleaned$pop,combined_data_cleaned$total_confirmed)
## [1] 0.9012901
combined_data_cleaned %>% ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
Again experimenting with confirmed_vs_pop filter condition we will get an approximate normal distribution at confirmed_vs_pop = 0.3
combined_data_cleaned <- combined_data %>% filter(confirmed_vs_pop <0.3) %>% filter(pop<1000000000)
combined_data_cleaned %>% ggplot(aes(x=confirmed_vs_pop))+geom_boxplot()
** Correlation between population and total confirmed **
combined_data_cleaned %>% ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
combined_data_cleaned %>% ggplot(aes(x=pop,y=total_confirmed))+geom_point()+geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
cor(x=combined_data_cleaned$pop,combined_data_cleaned$total_confirmed)
## [1] 0.7717487
** Correlation between lefe expectance and percent deaths **
cor(x=combined_data_cleaned$lifeExp,combined_data_cleaned$percent_deaths)
## [1] -0.1060197
** Correlation between gdpPercap and percent deaths **
cor(x=combined_data_cleaned$gdpPercap,y=combined_data_cleaned$percent_deaths)
## [1] -0.1379226
** Correlation between population and total percent recovery **
cor(x=combined_data_cleaned$pop,y=combined_data_cleaned$percent_recovery)
## [1] 0.03567923
combined_data_cleaned
cumulative_combined_data_cleaned <- combined_data_cleaned %>%group_by(Country.Region,Long,Lat) %>% summarise(cumulative_confirmed = sum(total_confirmed))
## `summarise()` has grouped output by 'Country.Region', 'Long'. You can override using the `.groups` argument.
leaflet(cumulative_combined_data_cleaned) %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addCircleMarkers(lng=~Long,lat=~Lat,
popup = ~paste0(Country.Region, "<br/>", cumulative_confirmed),
radius= ~cumulative_confirmed*0.0000015,color = "red")
African countries have higher correlation between confirmed cases and population
The spread of covid-19 pandemic in different countries shows differences in rate of spread , time period , recovery percentile (total recovered/total confirmed), death percentile (total death/total confirmed).The examined countries also shows signals of secondary wave. Examined the influences of various socio-economic factors with covid-19 data to find any relations that cause the differences among countries. Population has direct influence in total confirmed cases which is assumable and this influence has a correlation of 0.90 , also correlated GDP per capita and Life Expectancy with recovery percentile , death percentile but the correlations were less than 0.3 which means country wise quality of health and economic conditions have no effect in covid resistance as per the provided data.