Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes coronavirus disease (COVID-19), emerged from Wuhan, Hubei province, China in late 2019, leading to a pandemic that had infected over 700k people worldwide as of 30/03/2020.
But it was not until March 2020 that the full effect of COVID-19 was felt, with a surge of new infections and deaths: the number of confirmed cases moved from 88.4k on 3/1/2020 to 720k by 3/29/2020 (Johns Hopkins University data on COVID-19 cases).
There have been many studies on this new infectious disease every day since the outbreak, but with a vaccine or cure yet to be available, much is left to be desired. China and other Asian countries have managed to flatten the curve of infections thanks to measures such as lockdowns, mass testing, contact tracing and social distancing, which included bans on public gatherings. But the chickens have come home to roost across Europe and America, which are now the epicenter of the pandemic, while Africa is catching up slowly with over 5k confirmed cases as of 31/03/2020.
The spread is slowing in China while accelerating in other parts of the world, especially Europe and America, yet the trend is different in Africa. Much of this may be attributable to climatic factors, and that is what this case study sets out to explore: how geographical location and the climatic factors temperature and air density are correlated with cases of COVID-19.
Correlation does not imply causation
Let's load our data and look at its structure.
covid19<-read.csv('covid19_air.csv')
head(covid19,n=13)
##              State    Month Temp Humidity. Confirmed Air_Density
##  1        New York    March    8        58     59746       1.249
##  2      New Jersey    March    8        59     13386       1.249
##  3   Washington DC    March   11        64      4896       1.235
##  4        Lombarda    March    9        70     41007       1.243
##  5 Emillia_Romagna    March    9        69     13119       1.243
##  6          Madrid    March   13        63     22677       1.225
##  7      Birmingham    March    7        76       513       1.253
##  8         Gauteng    March   19        68       533       1.198
##  9    Western Cape    March   22        67       271       1.184
## 10       Liverpool    March    8        76       157       1.248
## 11         Nairobi    March   22        71        37       1.184
## 12           Wuhan February   10        74     46454       1.239
## 13           Wuhan    March   14        71      1449       1.220
The data frame consists of 13 observations of 6 variables: State, Month, average temperature, average humidity, confirmed cases and air density per state.
Let's create a plot to see how cases are distributed across the various states against temperature.
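library(ggplot2)
library(ggrepel)
# Confirmed cases against average temperature, one labelled point per state;
# group = 1 belongs inside aes() so that geom_line() joins all points as one series
ggplot(covid19,aes(x=Temp,y=Confirmed,group=1))+geom_point(color='red')+geom_text_repel(aes(label=State),size=3)+geom_line()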
Wuhan is included twice, for February and March, to show the effect of temperature and air density over time.
From the graph above, states that recorded low temperatures have a higher number of cases than those with high temperatures. Next we compute the correlation between confirmed cases and temperature/air density.
First, let's visualise our data with a scatter plot using the ggpubr package.
library(ggpubr)
## Loading required package: magrittr
ggscatter(covid19,x='Air_Density',y='Confirmed',add = 'reg.line',conf.int = TRUE,cor.coef = TRUE,cor.method = 'pearson',xlab = 'Air_Density',ylab = 'Confirmed Covid19 Cases')+geom_smooth()+geom_text_repel(aes(label=State))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the correlation graph above, the number of COVID-19 infections tends to increase as air density increases, and vice versa.
Wuhan's average temperature was 5 degrees Celsius in January, 10 in February and 14 in March 2020.
As of 3rd February 2020 there were 2106 confirmed COVID-19 cases in Hubei province, home to Wuhan, where the disease is believed to have originated; by 28th February there were 46454 confirmed cases in Wuhan alone.
New cases in Wuhan dropped dramatically in March with only 1449 confirmed as of 29th of March 2020.
The first case of COVID-19 was reported in China in late 2019. January followed with an average temperature of 5 degrees Celsius and an air density of 1.262 kg/m3, higher than in February and March, with March recording the lowest density at 1.22 kg/m3. Cold air weighs more, so it sinks and drives warm air up. Studies show that the virus can survive in droplets for up to 3 hours after being coughed into the air, and cold air is stiller and calmer. It is therefore possible that more people kept passing the virus to other members of the community through inhaled droplets without showing any symptoms during the incubation period, which is approximated at 14 days. This could further explain why the number of confirmed cases peaked from mid-February, rising from only 2106 cases in Hubei by 3rd February to 46454 cases in Wuhan.
Thanks to government measures and a favourable turn in the weather, with air density dropping to 1.22 kg/m3 as the temperature rose from 5 and 10 degrees Celsius in the previous two months to 14 in March, the number of new cases was low at 1449 as of 29th March 2020.
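The link between temperature and air density described above follows from the ideal gas law, rho = p / (R * T). The sketch below is an illustration only and is not how the Air_Density column was produced: it assumes dry air at standard sea-level pressure (101325 Pa) and ignores humidity and altitude, so its values differ slightly from those in the dataset. The function name `dry_air_density` is hypothetical.

```r
# Dry-air density from the ideal gas law: rho = p / (R_specific * T).
# Assumptions: dry air, sea-level pressure; humidity and altitude are ignored.
dry_air_density <- function(temp_c, pressure_pa = 101325) {
  r_specific <- 287.05        # specific gas constant of dry air, J/(kg*K)
  temp_k <- temp_c + 273.15   # convert Celsius to Kelvin
  pressure_pa / (r_specific * temp_k)
}

# Wuhan's monthly average temperatures quoted in the text (January to March)
wuhan_temp <- c(January = 5, February = 10, March = 14)
round(dry_air_density(wuhan_temp), 3)
```

At a fixed pressure the density falls as the temperature rises, which is the direction of change the text describes for Wuhan between January and March.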
Hence the high infection rate of COVID-19 across Europe and the USA may be linked to the cold season, with most places experiencing temperatures below 10 degrees Celsius, while cases increase only gradually in warmer Africa.
## Let's load the monthly average temperatures for the USA
USA_Temp<-read.csv('USA WEATHER BY MONTH.csv')
head(USA_Temp,n=13)
##        Month Temp
##  1   January  4.5
##  2  February  5.0
##  3     March  8.1
##  4     April 13.0
##  5       May 17.4
##  6      June 21.1
##  7      July 25.9
##  8    August 26.6
##  9 September 22.9
## 10   October 17.0
## 11 Novermber 12.0
## 12  December  7.0
January is the coldest month in the USA. Temperatures rise gradually until April, when they start to go above 10 degrees Celsius, and climax in August at 26.6 degrees Celsius. Given the monthly averages, we can try to predict when cases are likely to go down using a linear regression model, provided we can establish a statistically significant linear relationship with temperature.
##Scatter Plot
library(ggpubr)
scatter.smooth(x=covid19$Temp,y=covid19$Confirmed,main='Confirmed ~ Temperature')
Let's find the correlation coefficient between temperature and confirmed cases.
cor(covid19$Temp,covid19$Confirmed)
## [1] -0.4399795
The scatter plot and its smoothing line above suggest a decreasing linear relationship between confirmed COVID-19 cases and temperature, with a moderate negative correlation of -0.4399795 (as one increases, the other tends to decrease).
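A correlation coefficient on its own does not say whether the relationship could have arisen by chance. As a minimal sketch, `cor.test()` from base R reports a p-value and a 95% confidence interval alongside the estimate. The data frame below is re-typed from the table printed earlier so that the snippet runs on its own; the object names `covid19_tbl` and `ct` are illustrative.

```r
# Temperature and confirmed cases, copied from the 13 rows printed earlier
covid19_tbl <- data.frame(
  Temp      = c(8, 8, 11, 9, 9, 13, 7, 19, 22, 8, 22, 10, 14),
  Confirmed = c(59746, 13386, 4896, 41007, 13119, 22677, 513,
                533, 271, 157, 37, 46454, 1449)
)

# Pearson correlation with a significance test and a confidence interval
ct <- cor.test(covid19_tbl$Temp, covid19_tbl$Confirmed, method = "pearson")
ct$estimate   # the correlation coefficient
ct$p.value    # chance of a correlation at least this strong with no true association
ct$conf.int   # 95% confidence interval for the coefficient
```

With only 13 observations the confidence interval is wide, so the estimate should be read cautiously.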
library(caret)
## Loading required package: lattice
set.seed(100)
train<-sample(1:nrow(covid19),0.8*nrow(covid19))
traindata<-covid19[train,]
testdata<-covid19[-train,]
lmmod<-lm(Confirmed~Temp,data = traindata)
## our model is ready, now let's look at its summary
summary(lmmod)
##
## Call:
## lm(formula = Confirmed ~ Temp, data = traindata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25570 -13849 -4832 15520 35389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38166 19913 1.917 0.0916 .
## Temp -1726 1678 -1.029 0.3336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22310 on 8 degrees of freedom
## Multiple R-squared: 0.1169, Adjusted R-squared: 0.006495
## F-statistic: 1.059 on 1 and 8 DF, p-value: 0.3336
From the model summary above, we are interested in the R-squared and the p-value to see whether the model fits this data well.
The greater the R-squared, the more of the variation in confirmed cases the model explains; if the p-value is less than the significance level (usually 0.05), the linear relationship is statistically significant.
In this case R-squared is small at 0.1169, meaning temperature explains about 12% of the variation in confirmed cases, and the p-value of 0.3336 is greater than 0.05, so the relationship is not statistically significant, which we can attribute to the limited data used.
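The two quantities discussed above can also be pulled out of a fitted model programmatically instead of being read off the printed summary. The sketch below refits the regression on all 13 rows re-typed from the table printed earlier, not on the 80% training sample (whose membership depends on the random seed and the R version), so its numbers will differ from the summary above. The names `covid19_tbl`, `fit_all`, `r_squared` and `slope_pval` are illustrative.

```r
# Temperature and confirmed cases, copied from the 13 rows printed earlier
covid19_tbl <- data.frame(
  Temp      = c(8, 8, 11, 9, 9, 13, 7, 19, 22, 8, 22, 10, 14),
  Confirmed = c(59746, 13386, 4896, 41007, 13119, 22677, 513,
                533, 271, 157, 37, 46454, 1449)
)

fit_all <- lm(Confirmed ~ Temp, data = covid19_tbl)
s <- summary(fit_all)

r_squared  <- s$r.squared                   # share of variance explained by Temp
slope_pval <- coef(s)["Temp", "Pr(>|t|)"]   # p-value of the Temp slope

# TRUE only if the slope is significant at the 5% level
slope_pval < 0.05
```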
fit<-lm(Confirmed~Temp,data = covid19)
covid19$predicted<-predict(fit)
covid19$residual<-residuals(fit)
ggplot(covid19,aes(x=Temp,y=Confirmed))+geom_smooth(method = "lm",se=FALSE,color="lightgrey")+geom_segment(aes(xend=Temp,yend=predicted),alpha=.2)+geom_point(aes(color=residual))+scale_color_gradient2(low="blue",mid = "white",high = "red")+geom_point(aes(y=predicted),shape=1)+theme_bw()
## `geom_smooth()` using formula 'y ~ x'
A residual is the vertical distance between a data point and the regression line. Each data point has one residual. Residuals are positive for points above the regression line and negative for points below it. If the regression line passes through a point, the residual at that point is zero. In other words, a residual is what is left when you subtract the predicted value from the observed value.
The colours show the sign and size of each residual. In this case there are three red data points where the actual values are greater than predicted, and two blue points below the regression line where the actual values are less than predicted. Hence our data follow the fitted line only loosely, with low linearity.
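The definition of a residual as observed minus predicted can be checked directly. This is a small sketch on the 13 rows re-typed from the table printed earlier; `covid19_tbl`, `fit_all` and `manual_resid` are illustrative names.

```r
# Temperature and confirmed cases, copied from the 13 rows printed earlier
covid19_tbl <- data.frame(
  Temp      = c(8, 8, 11, 9, 9, 13, 7, 19, 22, 8, 22, 10, 14),
  Confirmed = c(59746, 13386, 4896, 41007, 13119, 22677, 513,
                533, 271, 157, 37, 46454, 1449)
)

fit_all <- lm(Confirmed ~ Temp, data = covid19_tbl)

# Residual = observed value minus the value predicted by the regression line
manual_resid <- covid19_tbl$Confirmed - predict(fit_all)

# The hand-computed residuals should match what residuals() returns
all.equal(unname(manual_resid), unname(residuals(fit_all)))

# Count the points below (-1) and above (1) the regression line
table(sign(residuals(fit_all)))
```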
Now we can use the model to predict the number of cases in the USA per month, given the average temperature for each month, as follows.
Pred_USA<-predict(lmmod,newdata = USA_Temp)
## let's print the prediction for each month
print(Pred_USA)
## 1 2 3 4 5 6 7 8
## 30398.410 29535.310 24184.095 15725.723 8130.450 1743.516 -6542.237 -7750.575
## 9 10 11 12
## -1363.641 8820.929 17451.921 26082.914
tempred_cases<-cbind(USA_Temp,Pred_USA)
## now let's view the average and predicted values as one data frame
head(tempred_cases,n=12)
##        Month Temp  Pred_USA
##  1   January  4.5 30398.410
##  2  February  5.0 29535.310
##  3     March  8.1 24184.095
##  4     April 13.0 15725.723
##  5       May 17.4  8130.450
##  6      June 21.1  1743.516
##  7      July 25.9 -6542.237
##  8    August 26.6 -7750.575
##  9 September 22.9 -1363.641
## 10   October 17.0  8820.929
## 11 Novermber 12.0 17451.921
## 12  December  7.0 26082.914
From the table above, predicted COVID-19 cases decrease as temperature increases, most markedly from May until September, the months with the highest temperatures. Predicted cases for July to September are negative; since you cannot have negative infections, we read this as an indication of when the pandemic may most likely be contained. Given the weak fit of the model, this is a rough indication only.
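A straight line fitted to raw counts will eventually cross zero when extrapolated to warmer months, which is where the negative predictions come from. One common workaround, shown here as a sketch and not as the method used in this study, is to model the logarithm of the counts so that back-transformed predictions stay positive. The data are re-typed from the tables above, the fit uses all 13 rows, and `covid19_tbl`, `usa_temp`, `log_fit` and `pred_log` are illustrative names.

```r
# Temperature and confirmed cases, copied from the 13 rows printed earlier
covid19_tbl <- data.frame(
  Temp      = c(8, 8, 11, 9, 9, 13, 7, 19, 22, 8, 22, 10, 14),
  Confirmed = c(59746, 13386, 4896, 41007, 13119, 22677, 513,
                533, 271, 157, 37, 46454, 1449)
)

# Monthly average temperatures for the USA, copied from the table above
usa_temp <- data.frame(
  Month = month.name,
  Temp  = c(4.5, 5.0, 8.1, 13.0, 17.4, 21.1, 25.9, 26.6, 22.9, 17.0, 12.0, 7.0)
)

# Fit on the log scale, then back-transform; exp() can never return a negative value
log_fit  <- lm(log(Confirmed) ~ Temp, data = covid19_tbl)
pred_log <- exp(predict(log_fit, newdata = usa_temp))

data.frame(usa_temp, Predicted = round(pred_log))
```

The back-transformed values are still extrapolations from 13 data points and carry the same caveats as the original model.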
library(ggrepel)
ggplot(tempred_cases,aes(x=Temp,y=Pred_USA,color=Pred_USA))+geom_point()+geom_text_repel(aes(label=Month))
The data used here is limited to only a few places, since no public data is available that combines COVID-19 cases with climatic factors such as temperature and air density. If you know of any, please share.
1. Johns Hopkins GitHub (https://github.com/CSSEGISandData/COVID-19)
2. timeanddate.com/weather (https://www.timeanddate.com/weather/)
3. climate-data.org (https://en.climate-data.org/)