Final Project

Introduction

The problem that countries all over the world are facing is COVID-19 and the struggle to contain it. The question on everyone’s minds is when will life begin to return to normal, if ever? There is a theory that the coronavirus may spread less quickly in high temperatures, similar to the influenza virus, where infections peak in cold, winter temperatures. This idea gives hope that the coming summer temperatures in areas like the Northeast United States could help to continue to bring the curve of new infections down. Whether or not the coronavirus could become a seasonal virus is another issue, but any chance to bring new infections down in this current crisis would be a huge help towards life returning to some type of “new normal.” Another possibility to explore is whether the virus spreads differently according to geographic location (Latitude and Longitude). We can investigate this by exploring if countries with higher average temperatures and different geographic locations had any significant difference in growth rate of infections.

pacman::p_load(dplyr, tidyr, magrittr, stringr, ggplot2, caTools, randomForest)

Tidying/Preparing For Analysis

This is a dataset scraped from a wikipedia page using Pandas read_html function. It contains a table of average temperatures yearly by country.

country_temps <- read.csv('avg_temps.csv')

tail(country_temps)

colnames(country_temps) <- c('Country', 'Avg_Yearly_TempF')

Those characters were supposed to be a negative sign, so I will replace them with a “-” using the str_replace_all function from stringr. I also converted this column to numeric and converted from Celsius to Fahrenheit.

country_temps$Avg_Yearly_TempF <- str_replace_all(country_temps$Avg_Yearly_TempF, "â\\ˆ\\’", '-')  
country_temps$Avg_Yearly_TempF <- as.numeric(country_temps$Avg_Yearly_TempF)
country_temps$Avg_Yearly_TempF <- country_temps$Avg_Yearly_TempF * (9/5) + 32
tail(country_temps)

This is a complete dataset of a total running count of confirmed coronavirus cases for each country, starting on January 2nd until May 9th.

covid19 <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

#This dataset is updated daily and will not work with the below code after May 9th unless the below code is adjusted.

head(covid19)

covid19 %<>% 
  rename('Country' = 'Country.Region')

To tidy the data, I will create a vector of numbers between 1 to 108, representing the number of days that the dataset covers and will use this to replace the date columns. This will help in transforming the dataset from wide to long.

days <- 1:109
cols <- c('Province', 'Country', 'Lat', 'Long', days)
colnames(covid19) <- cols

covid19 <- gather(covid19, 'Day', 'Confirmed_Cases', 5:113)
head(covid19)

The original dataset has multiple rows for countries with provinces. I will take the sum of the infections for all of the provinces of these countries and combine it into one row, that way there is one row for each country.

bycountry_cases <- covid19 %>%
  mutate(Day = as.integer(Day)) %>%
  group_by(Country, Day) %>%
  arrange(Day) %>%
  select(Confirmed_Cases) %>%
  summarise_each(sum)

## Adding missing grouping variables: `Country`, `Day`

I will also take the mean Latitude and Longitude of all the provinces for each country, and use that to estimate the country’s Latitude and Longitude.

bycountry_loc <- covid19 %>%
  group_by(Country) %>%
  select(Lat, Long) %>%
  summarise_each(mean)

## Adding missing grouping variables: `Country`

We can now merge the two dataframes with one row per country, then merge that dataframe with the average temperature in Fahrenheit from the wikipedia temperature dataset.

covid19 <- merge(bycountry_cases, bycountry_loc, by = 'Country')
covid19 <- merge(country_temps, covid19, by = 'Country')

covid19 %<>% arrange(Day)
head(covid19)

We now have a dataset with each row representing a day for a country, showing their average yearly temperature, their cumulative number of confirmed cases, and their geographic location measured by Latitude and Longitude. We can now begin to analyze the data.

Analysis

ten <- covid19 %>%
  filter(Confirmed_Cases >= 10 & Confirmed_Cases <=20) %>%
  arrange(Country, Day) %>%
  group_by(Country) %>%
  mutate(Start_Cases = min(Confirmed_Cases)) %>%
  mutate(Start_Day = min(Day)) %>%
  mutate(Days = 108 - Start_Day) %>%
  distinct(Country, .keep_all = TRUE)
head(ten)

In this dataframe, I located the first day that each country reached a number of confirmed cases between 10 and 20. This is a good starting point for the analysis, because that is a sufficient number of cases for the virus to start to spread more rapidly, and in this way all countries will be starting at nearly the same number of cases. We also mark exactly how many cases they had on that day. The ‘Days’ column marks the number of days that have passed since the country reached 10-20 cases. We need this column because we want to compare the growth rates across roughly the same period of time.

current <- covid19 %>%
  filter(Day == 108) %>%
  select(Country, Confirmed_Cases) %>%
  rename('Current_Cases' = 'Confirmed_Cases') %>%
  mutate(Current_Day = 108)

Here, I am making a column of the number of total cases as of now, day 108. We can then merge these two dataframes and have data that is much more useful for finding the growth rate. We have the day the country reached roughly 10 cases, the day they reached their current number of cases, and the current number of cases. We no longer need the original columns ‘Day’ and ‘Confirmed_Cases.’

covid19 <- merge(ten, current, by = 'Country')
covid19 %<>%
  select(-c(Day, Confirmed_Cases))
head(covid19)

table(covid19$Days)

## 
##  -1   6   8  16  19  22  25  26  27  28  29  30  31  32  33  34  35  36  37  38 
##   1   1   1   1   2   1   1   2   1   1   2   1   1   2   1   3   3   1   4   4 
##  39  40  42  43  44  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   4   2   4   3   1   5   3   6   3   1   5   4   3   8   6   5   7   5   1 
##  61  63  64  65  66  68  69  72  73  74  76  77  90  92  94  96  97  99 101 
##   8   6   1   4   4   4   4   1   1   2   1   2   1   1   1   1   1   2   1

There are quite a few countries that have only had 10-20 cases forr less than 35 days. I will start by using 35 days as a cut off and only compare growth rates for countries that have had over 10 cases for at least 35 days so that it will give a more fair comparison because growth rates change as the number of cases increase.

covid19 %<>%
  filter(Days >= 35)
table(covid19$Current_Cases)

## 
##     10     11     16     18     19     21     25     39     42     47     64 
##      1      1      2      1      1      1      1      1      1      1      1 
##     82     83     92     94     95    101    141    145    146    167    193 
##      1      1      1      1      1      1      1      1      1      1      1 
##    194    242    273    288    324    332    388    439    489    490    508 
##      1      1      1      1      1      1      1      1      1      1      1 
##    509    563    594    620    621    623    668    742    744    773    795 
##      1      1      1      1      1      2      1      1      2      1      1 
##    796    835    850    891    900    906    928   1030   1111   1135   1436 
##      1      1      1      1      1      1      1      1      1      1      1 
##   1450   1455   1492   1551   1586   1725   1741   1771   1801   2009   2070 
##      1      1      1      1      1      1      1      1      1      1      1 
##   2161   2266   2267   2279   2325   2603   3000   3029   3112   3178   3778 
##      1      1      1      1      1      1      1      1      1      1      1 
##   3871   3912   4012   4728   4834   5369   5611   5711   5738   6535   6918 
##      1      1      1      1      1      1      1      1      1      1      1 
##   7208   8070   8476   8895   9376   9943  10051  10416  10463  13112  13134 
##      1      2      1      1      1      1      1      1      1      1      1 
##  14195  14811  15366  15575  15774  16436  16793  20201  21101  21707  22541 
##      1      1      1      1      1      1      1      1      1      1      1 
##  25265  25972  26435  27268  28818  30207  31522  35432  42292  52011  61847 
##      1      1      1      1      1      1      1      1      1      1      1 
##  67674 104691 135569 146894 170588 176202 187859 212629 217185 222857 
##      1      1      1      1      1      1      1      1      1      1

There are still too many countries in the data set with a very low number of total confirmed cases. I will increase the ‘Days’ filter to 55 and above.

covid19 %<>%
  filter(Days >= 55)
table(covid19$Current_Cases)

## 
##    141    288    489    623    744    773    796    835    850    891    928 
##      1      1      1      2      1      1      1      1      1      1      1 
##   1030   1450   1455   1551   1586   1725   1801   2070   2161   2266   2279 
##      1      1      1      1      1      1      1      1      1      1      1 
##   2603   3000   3029   3112   3178   3778   3871   4728   5369   5611   5711 
##      1      1      1      1      1      1      1      1      1      1      1 
##   5738   6535   6918   7208   8070   8476   8895   9376   9943  10051  10416 
##      1      1      1      1      2      1      1      1      1      1      1 
##  10463  13112  14811  15366  15575  15774  16436  16793  20201  21101  21707 
##      1      1      1      1      1      1      1      1      1      1      1 
##  22541  25265  25972  26435  27268  28818  30207  31522  35432  42292  52011 
##      1      1      1      1      1      1      1      1      1      1      1 
##  61847  67674 104691 146894 170588 176202 187859 212629 217185 222857 
##      1      1      1      1      1      1      1      1      1      1

There are now 78 countries left in the dataset. Although I do not want to have to remove a significant amount of data from from the dataset, I should also remove countries that had a very early ‘Start_Day’ of reaching at least 10 confirmed cases, for the same reason that we should be comparing over a similar time frame.

table(covid19$Start_Day)

## 
##  7  9 11 12 14 16 18 31 32 34 35 36 39 40 42 43 44 45 47 48 49 50 51 52 53 
##  1  2  1  1  1  1  1  2  1  2  1  1  4  4  4  4  1  6  8  1  5  7  5  6  8

I will remove the countries with a ‘Start_Day’ below 19.

covid19 %<>%
  filter(Start_Day > 18) %>%
  select(-c(Current_Day, Start_Day))

I will now make a new column ‘Avg_Growth_Rate,’ The average percent increase in cases per day over the course of this time frame.

covid19 %<>%
  mutate(Avg_Growth_Rate = round(((Current_Cases - Start_Cases) / Start_Cases / Days) * 100, digits = 2))

Avg_Growth_Rate is our target variable. We can now use our features ‘Lat,’ ‘Long,’ and ‘Avg_Yearly_TempF’ to explore if there is any correlation to the ‘Avg_Growth_Rate.’

ggplot(covid19, aes(x=Lat, y=Avg_Growth_Rate)) +
  geom_point()+
  geom_smooth(method=lm, color='Orange', se=FALSE) +
  xlab('Latitude')+
  ylab('Average Growth Rate')

## `geom_smooth()` using formula 'y ~ x'

ggplot(covid19, aes(x=Long, y=Avg_Growth_Rate)) +
  geom_point()+
  geom_smooth(method=lm, color='Blue', se=FALSE) +
  xlab('Longitude')+
  ylab('Average Growth Rate')

## `geom_smooth()` using formula 'y ~ x'

ggplot(covid19, aes(x=Avg_Yearly_TempF, y=Avg_Growth_Rate)) +
  geom_point()+
  geom_smooth(method=lm, color='Red', se=FALSE) +
  xlab('Average Yearly Temperature')+
  ylab('Average Growth Rate')

## `geom_smooth()` using formula 'y ~ x'

At first glance, there does not appear to be any correlation between the individual features and the Average Growth Rate, with Latitude having almost zero correlation.

lmodel <- lm(Avg_Growth_Rate ~ Lat + Long + Avg_Yearly_TempF, data = covid19)
summary(lmodel)

## 
## Call:
## lm(formula = Avg_Growth_Rate ~ Lat + Long + Avg_Yearly_TempF, 
##     data = covid19)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5813.5 -2873.2 -1578.3   480.3 20795.8 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      10022.29    4174.43   2.401   0.0192 *
## Lat                -40.37      37.93  -1.064   0.2910  
## Long               -11.12      13.79  -0.806   0.4232  
## Avg_Yearly_TempF   -97.24      56.41  -1.724   0.0894 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5246 on 66 degrees of freedom
## Multiple R-squared:  0.072,  Adjusted R-squared:  0.02981 
## F-statistic: 1.707 on 3 and 66 DF,  p-value: 0.1741

As the graphs described, none of the features are useful predictors of Avg_Growth_Rate. However, the predictor with the lowest P value (best predictor) was average yearly temperature. Higher yearly temperatures were slightly correlated with lower average growth rates.

Modeling

It is unlikely that any model using these features will produce an accurate model, but we can see if a Random Forest model will produce something useful. We start by splitting the data into train and test sets.

X <- covid19 %>%
  select(c('Lat', 'Long', 'Avg_Yearly_TempF'))
y <- covid19$Avg_Growth_Rate

set.seed(122)
spl = sample.split(y, 0.7)
train = subset(X, spl == TRUE)
test = subset(X, spl == FALSE)
ytrain <- subset(y, spl == TRUE)
ytest <- subset(y, spl == FALSE)

We can now train the Random Forest model using our selected features.

rf <- randomForest(ytrain ~ ., data=train, ntree=100)

rss <- sum((predict(rf, test) - ytest) ^ 2)  ## residual sum of squares
tss <- sum((ytest - mean(ytest)) ^ 2)  ## total sum of squares
r2 <- 1 - rss/tss

pred_frame <- data.frame(x = ytest, y = predict(rf, test))

ggplot(pred_frame, aes(x=x, y=y)) +
  geom_point() +
  geom_abline(slope=1, intercept = 0, color="red") +
    ggtitle(paste("RandomForest Regression in R    r^2=", r2, sep="")) +
  xlab('Actual Avg Growth Rates') +
  ylab('Predicted Avg Growth Rates')

Data points near the red line represent predictions that were close to the actual average growth rate for that country. From this graph and the r-squared value we can see that the model performed very poorly. There is not much that can be done to improve the model, the problem is the features we have are simply not good predictors of the average growth rate of coronavirus cases.

Conclusion

While the model performed very poorly at predicting the average growth rate of coronavirus cases, it is not a failure. It confirmed that Latitude, Longitude, and Average Yearly Temperature do not have much of an impact on the spread of the virus at all. There are definitely limitations to this model, because not all countries are doing the same things. Some are practicing much more strict social distancing measures, some have better testing capacities, and some have a larger elderly population. Those are all features that this model cannot account for, and those are likely much better predictors of average growth rate. If we look at the impact of average yearly temperature alone, there is a slight negative correlation, but you can look at Brazil and see that it has the 12th highest average yearly temperature and a growth rate of 17934.19 % over the last 63 days, leaving them with 146,894 cases currently. It seems safe to say that higher temperatures in the summer months will not provide relief, improving testing capacities and continuing to social distance are much more likely to.

References

https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

https://en.wikipedia.org/wiki/List_of_countries_by_average_yearly_temperature