Background

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes coronavirus disease (COVID-19), emerged in Wuhan, Hubei province, China, in late 2019, leading to a pandemic that had infected over 700k people worldwide as of 30/03/2020.

But it was not until March 2020 that the full effect of COVID-19 was felt, with a surge of new infections and deaths: the number of confirmed cases moved from 88.4k on 3/1/2020 to 720k by 3/29/2020 (Johns Hopkins University data on COVID-19 cases).

There have been many studies on this new infectious disease each day since the outbreak, but with a vaccine or cure yet to be available, much is left to be desired. China and other Asian countries have managed to flatten the curve of infections thanks to measures such as lockdowns, mass testing, contact tracing and social distancing, which included bans on public gatherings. The chickens have come home to roost across Europe and America, which are now the epicenter of the pandemic, while Africa is catching up slowly, with over 5k confirmed cases as of 31/03/2020.

The spread is slowing in China while accelerating in other parts of the world, especially Europe and America, but the trend is different in Africa. Much of this may be attributable to climatic factors, and that is what this case study tries to find out: how geographical location and the climatic factors of temperature and air density are correlated with cases of COVID-19.

NB

Correlation does not imply causation

Data Set

Let us load our data and look at its structure.

covid19<-read.csv('covid19_air.csv')
head(covid19,n=13)
##              State    Month Temp Humidity. Confirmed Air_Density
## 1         New York    March    8        58     59746       1.249
## 2       New Jersey    March    8        59     13386       1.249
## 3    Washington DC    March   11        64      4896       1.235
## 4         Lombarda    March    9        70     41007       1.243
## 5  Emillia_Romagna    March    9        69     13119       1.243
## 6           Madrid    March   13        63     22677       1.225
## 7       Birmingham    March    7        76       513       1.253
## 8          Gauteng    March   19        68       533       1.198
## 9     Western Cape    March   22        67       271       1.184
## 10       Liverpool    March    8        76       157       1.248
## 11         Nairobi    March   22        71        37       1.184
## 12           Wuhan February   10        74     46454       1.239
## 13           Wuhan    March   14        71      1449       1.220

The data frame consists of 13 observations of 6 variables: State, Month, average temperature, average humidity, confirmed cases per state and air density.

Let us create a plot to view how cases are distributed across the various states against temperature.

library(ggplot2)
library(ggrepel)
ggplot(covid19,aes(x=Temp,y=Confirmed,group=1))+geom_point(color='red')+geom_text_repel(aes(label=State),size=3)+geom_line()

Wuhan is included twice to show the effect of temperature and air density over time, between February and March. Next we compute the correlation between confirmed cases and temperature/air density.

In the graph above, states that recorded low temperatures have a high number of cases compared to those with high temperatures.

Compute the correlation between COVID-19 confirmed cases and Air_Density

As per Wikipedia, air density is the mass per unit volume of Earth’s atmosphere. Air density, like air pressure, decreases with increasing altitude. It also changes with variation in atmospheric pressure, temperature and humidity.
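As a rough illustration of the temperature dependence in that definition, the sketch below computes the density of dry air from the ideal gas law, rho = p / (R_specific * T). The function name `dry_air_density`, the sea-level pressure of 101325 Pa and the dry-air gas constant of 287.058 J/(kg K) are assumptions made for illustration and are not taken from the data set. Humidity and altitude are ignored, so the results will not match the Air_Density column exactly.

```r
# Dry-air density from the ideal gas law (assumed: sea level, no humidity)
dry_air_density <- function(temp_c, pressure_pa = 101325) {
  r_specific <- 287.058          # J/(kg*K), specific gas constant of dry air
  temp_k <- temp_c + 273.15      # convert Celsius to Kelvin
  pressure_pa / (r_specific * temp_k)
}

# Density falls as temperature rises
dry_air_density(c(5, 10, 14, 22))
```

Passing a different `pressure_pa` gives a first approximation for places well above sea level.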

First let us visualise our data with a scatter plot, using the ggpubr package.

library(ggpubr)
## Loading required package: magrittr
ggscatter(covid19,x='Air_Density',y='Confirmed',add = 'reg.line',conf.int = TRUE,cor.coef = TRUE,cor.method = 'pearson',xlab = 'Air_Density',ylab = 'Confirmed Covid19 Cases')+geom_smooth()+geom_text_repel(aes(label=State))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The scatter plot above shows a moderate positive linear relationship between confirmed COVID-19 cases and air density, as indicated by R=0.44 (when one goes up, so does the other).

In this sample, the number of COVID-19 infections tends to increase as air density increases, and vice versa.
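With only 13 observations, it may help to look at the uncertainty around the coefficient as well as its value. The sketch below, offered as one possible check rather than part of the original analysis, runs `cor.test()` on the values copied from the table printed in the Data Set section. The vector names `air_density` and `confirmed` are placeholders.

```r
# Rows copied from the table printed in the Data Set section
air_density <- c(1.249, 1.249, 1.235, 1.243, 1.243, 1.225, 1.253,
                 1.198, 1.184, 1.248, 1.184, 1.239, 1.220)
confirmed <- c(59746, 13386, 4896, 41007, 13119, 22677, 513,
               533, 271, 157, 37, 46454, 1449)

# Pearson correlation test: estimate, 95% confidence interval and p-value
ct <- cor.test(air_density, confirmed, method = "pearson")
ct$estimate
ct$conf.int
ct$p.value
```

A confidence interval that includes zero would mean the sample cannot rule out the absence of a linear relationship.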

Wuhan case study

Wuhan's average temperature was 5 degrees Celsius in January, 10 in February and 14 in March 2020.

As of 3rd February 2020 there were 2106 confirmed COVID-19 cases in Hubei province, home to Wuhan, where the disease is believed to have originated. By 28th February there were 46454 confirmed cases in Wuhan alone.

New cases in Wuhan dropped dramatically in March with only 1449 confirmed as of 29th of March 2020.

Conclusion

The first case of COVID-19 was reported in China in late 2019. January followed with an average temperature of 5 degrees Celsius and an air density of 1.262 kg/m3, higher than in February and March; March recorded the lowest density at 1.22 kg/m3. Cold air weighs more, so it sinks and drives warm air up. Studies show that the virus can survive in droplets for up to 3 hours after being coughed out into the air, and cold air is stiller and calmer. It is therefore possible that more people kept passing the virus to other members of the community through inhaled droplets, without showing any symptoms during the incubation period, which is estimated at 14 days. This could further explain why the number of confirmed cases peaked from mid-February, rising from only 2106 cases in Hubei on 3rd February to 46454 cases in Wuhan alone.

Thanks to government measures and a natural advantage, with air density dropping to 1.22 kg/m3 as the temperature rose from 5 and 10 degrees Celsius in the previous two months to 14 in March, the number of new cases was low at 1449 as of 29th March 2020.

Hence the high infection rate of COVID-19 across Europe and the USA can be linked to the cold weather season, with most places experiencing temperatures below 10 degrees Celsius, while cases are increasing only gradually in warmer Africa.

The USA currently leads the world in COVID-19 cases, with over 170k as of 31/3/2020, so let us look at how temperature is distributed across the USA through the year, using the average for each month.

Load the data

USA_Temp<-read.csv('USA WEATHER BY MONTH.csv')
head(USA_Temp,n=13)
##        Month Temp
## 1    January  4.5
## 2   February  5.0
## 3      March  8.1
## 4      April 13.0
## 5        May 17.4
## 6       June 21.1
## 7       July 25.9
## 8     August 26.6
## 9  September 22.9
## 10   October 17.0
## 11 Novermber 12.0
## 12  December  7.0

January is the coldest month in the USA. Temperatures rise gradually until April, when they start to go above 10 degrees Celsius, and peak in August at 26.6 degrees Celsius. Given these monthly averages, we can try to predict when cases are likely to go down, using a linear regression model that looks for a statistically significant linear relationship with temperature.

Scatter Plot

library(ggpubr)
scatter.smooth(x=covid19$Temp,y=covid19$Confirmed,main='Confirmed ~ Temperature')

Let us find the correlation coefficient between temperature and confirmed cases.

cor(covid19$Temp,covid19$Confirmed)
## [1] -0.4399795

The scatter plot and its smoothing line suggest a decreasing linear relationship between confirmed COVID-19 cases and temperature, and the correlation is a moderate negative -0.4399795 (when one increases, the other tends to decrease).

Build Linear Model

library(caret)
## Loading required package: lattice
set.seed(100)
train<-sample(1:nrow(covid19),0.8*nrow(covid19))
traindata<-covid19[train,]
testdata<-covid19[-train,]
lmmod<-lm(Confirmed~Temp,data = traindata)
## our model is ready; now let us look at its summary

summary(lmmod)
## 
## Call:
## lm(formula = Confirmed ~ Temp, data = traindata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25570 -13849  -4832  15520  35389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    38166      19913   1.917   0.0916 .
## Temp           -1726       1678  -1.029   0.3336  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22310 on 8 degrees of freedom
## Multiple R-squared:  0.1169, Adjusted R-squared:  0.006495 
## F-statistic: 1.059 on 1 and 8 DF,  p-value: 0.3336

From the model summary above we are interested in R-squared and the p-value, to see whether the model is a good fit for this data.

R-squared is the share of the variation in confirmed cases that the model explains, so the greater it is, the better the fit. If the p-value is less than the significance level (usually 0.05), the linear relationship is statistically significant.

In this case R-squared is small at 0.1169, meaning temperature explains about 12% of the variation in confirmed cases. The p-value of 0.3336 is greater than 0.05, so the relationship is not statistically significant, which we can attribute to the limited data used.
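The same two quantities can also be pulled out of the summary object directly instead of being read off the printed output. The sketch below is an illustration only: it refits the model on all 13 rows copied from the table in the Data Set section (not on the 80% training split used above, so its numbers will differ from the summary), and the names `fit_all`, `r_squared` and `p_value` are placeholders.

```r
# Values copied from the table printed in the Data Set section
temp <- c(8, 8, 11, 9, 9, 13, 7, 19, 22, 8, 22, 10, 14)
confirmed <- c(59746, 13386, 4896, 41007, 13119, 22677, 513,
               533, 271, 157, 37, 46454, 1449)

# Fit on all rows (assumption: no train/test split in this sketch)
fit_all <- lm(confirmed ~ temp)
fit_summary <- summary(fit_all)

# R-squared and the p-value of the temperature slope
r_squared <- fit_summary$r.squared
p_value <- coef(fit_summary)["temp", "Pr(>|t|)"]
c(r_squared = r_squared, p_value = p_value)
```

For a model with a single predictor, R-squared equals the square of the Pearson correlation, which gives a quick consistency check against `cor()`.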

Fitting the Regression Line and its Residual

fit<-lm(Confirmed~Temp,data = covid19)
covid19$predicted<-predict(fit)
covid19$residual<-residuals(fit)
ggplot(covid19,aes(x=Temp,y=Confirmed))+geom_smooth(method = "lm",se=FALSE,color="lightgrey")+geom_segment(aes(xend=Temp,yend=predicted),alpha=.2)+geom_point(aes(color=residual))+scale_color_gradient2(low="blue",mid = "white",high = "red")+geom_point(aes(y=predicted),shape=1)+theme_bw()
## `geom_smooth()` using formula 'y ~ x'

A residual is the vertical distance between a data point and the regression line. Each data point has one residual. Residuals are positive for points above the regression line and negative for points below it. If the regression line passes through a point, the residual at that point is zero. In other words, a residual is what is left when you subtract the predicted value from the observed value.
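The definition can be checked by hand. The sketch below, using the values copied from the table in the Data Set section and the placeholder names `fit_all` and `manual_residual`, subtracts the fitted values from the observed ones and compares the result with what `residuals()` returns.

```r
# Values copied from the table printed in the Data Set section
temp <- c(8, 8, 11, 9, 9, 13, 7, 19, 22, 8, 22, 10, 14)
confirmed <- c(59746, 13386, 4896, 41007, 13119, 22677, 513,
               533, 271, 157, 37, 46454, 1449)

fit_all <- lm(confirmed ~ temp)

# Residual = observed value minus predicted (fitted) value
manual_residual <- confirmed - fitted(fit_all)

# Should agree with the residuals reported by the model object
all.equal(unname(manual_residual), unname(residuals(fit_all)))

# With an intercept in the model, residuals sum to (numerically) zero
sum(residuals(fit_all))
```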

The colours show the sign and size of each residual. In this case there are three red data points, where the actual values are greater than predicted, and two blue points below the regression line, where the actual values are less than predicted. Hence the data fit the model, but with low linearity.

Now we can use the model to predict the number of cases in the USA per month, given the average temperature for each month, as follows.

Pred_USA<-predict(lmmod,newdata = USA_Temp)
## let us print the prediction for each month
print(Pred_USA)
##         1         2         3         4         5         6         7         8 
## 30398.410 29535.310 24184.095 15725.723  8130.450  1743.516 -6542.237 -7750.575 
##         9        10        11        12 
## -1363.641  8820.929 17451.921 26082.914

Now let us combine the two data frames: average temperature per month and predicted values.

tempred_cases<-cbind(USA_Temp,Pred_USA)
## now let us view average and predicted values as one data frame
head(tempred_cases,n=12)
##        Month Temp  Pred_USA
## 1    January  4.5 30398.410
## 2   February  5.0 29535.310
## 3      March  8.1 24184.095
## 4      April 13.0 15725.723
## 5        May 17.4  8130.450
## 6       June 21.1  1743.516
## 7       July 25.9 -6542.237
## 8     August 26.6 -7750.575
## 9  September 22.9 -1363.641
## 10   October 17.0  8820.929
## 11 Novermber 12.0 17451.921
## 12  December  7.0 26082.914

From the table above we can conclude that predicted COVID-19 cases decrease as temperature increases, most markedly from May until September, the months with the highest temperatures. Predicted cases for July to September are negative; since you cannot have negative infections, this most likely indicates when the pandemic may be contained.
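Negative values arise because a straight line keeps falling past zero. One simple way to handle them, sketched below as an assumption rather than as part of the original analysis, is to clamp the predictions at zero and to compute the temperature at which the fitted line crosses zero. The intercept and slope are the rounded coefficients from the model summary above, and `usa_temp`, `raw_pred`, `clamped_pred` and `zero_temp` are placeholder names.

```r
# Rounded coefficients from the model summary above
intercept <- 38166
slope <- -1726

# Monthly average temperatures from the USA_Temp table
usa_temp <- c(January = 4.5, February = 5.0, March = 8.1, April = 13.0,
              May = 17.4, June = 21.1, July = 25.9, August = 26.6,
              September = 22.9, October = 17.0, November = 12.0,
              December = 7.0)

raw_pred <- intercept + slope * usa_temp   # can fall below zero
clamped_pred <- pmax(raw_pred, 0)          # case counts cannot be negative
zero_temp <- -intercept / slope            # temperature where the line hits zero

names(raw_pred)[raw_pred < 0]              # months with negative predictions
round(zero_temp, 2)
```

Clamping only tidies the output. It does not make the linear model any more reliable at temperatures outside the range of the training data.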

library(ggrepel)
ggplot(tempred_cases,aes(x=Temp,y=Pred_USA,color=Pred_USA))+geom_point()+geom_text_repel(aes(label=Month))

As per our model, predicted cases are expected to go down over time, from May to September.

In conclusion, though the sample size in this case study is limited, the data show a correlation between the number of COVID-19 infections and both temperature and air density.

Countries should continue to put in place prevention measures to break the chain of infection, as this is the only way to stop the spread of coronavirus.

Key note

The data used here is limited to only a few places, since no public data set combining COVID-19 cases with climatic factors such as temperature and air density is available. If you know of one, please share.

References

  1. Johns Hopkins University GitHub (https://github.com/CSSEGISandData/COVID-19)

  2. timeanddate.com/weather (https://www.timeanddate.com/weather/)

  3. climate-data.org (https://en.climate-data.org/)

  4. National Health Commission of the PRC (http://en.nhc.gov.cn/)