Coronavirus cases based on latitude

Let’s look at whether the number of Coronavirus cases in each US state responds at all to the state’s latitude. There have been claims that warmer weather suppresses the virus; if this is true, states at lower latitudes, which are warmer on average, may show fewer cases per capita.

To start, let’s pull two sets of data: the latest number of confirmed Coronavirus cases from the Johns Hopkins GitHub repository and the latest population estimates from the US Census website.

Corona <- read.csv("Coronavirus_USA.csv")
State_Pop <- read.csv("State_Census_data.csv")

Now let’s clean up these datasets and combine them into one table containing the confirmed number of Coronavirus cases, the latitude (calculated as the average of the latitudes of the regions identified within the original dataset), and the latest population estimate for each state. We also calculate the number of cases per capita for each state:

library(plyr)

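# Drop rows with no usable location (latitude recorded as 0)
Corona <- Corona[which(Corona$Lat != 0),]
# Total confirmed cases per state, summed from the X4.12.20 column
State_Cases <- aggregate(X4.12.20~Province_State,Corona,sum)
# Average latitude of the regions reported within each state
State_Lat <- aggregate(Lat~Province_State,Corona,mean)
State_Corona <- merge(State_Cases,State_Lat)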

# Remove territories outside the continental US (everything at or below 25 degrees latitude)
State_Corona <- State_Corona[which(State_Corona$Lat > 25),]

keepvars <- c("NAME","POPESTIMATE2019")
States <- State_Pop[keepvars]
States <- rename(States,c("NAME"="Province_State"))
State_Corona <- merge(State_Corona,States)
names(State_Corona) <- c("Name","Cases","Latitude","Population")
State_Corona$Case_per_Capita <- State_Corona$Cases/State_Corona$Population
# Calculate the average number of cases per capita within the US as a reference
US_per_Capita <- sum(State_Corona$Cases) / sum(State_Corona$Population)
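
The aggregate-then-merge steps above can be illustrated on a tiny made-up table. This is only a sketch: the data frames `toy_counties` and `toy_pop` and every value in them are hypothetical and are not taken from the Johns Hopkins or Census files.

```r
# Hypothetical region-level rows; all values are invented for illustration.
toy_counties <- data.frame(
  Province_State = c("A", "A", "B"),
  Lat            = c(30, 32, 40),
  Cases          = c(10, 30, 5)
)

# Hypothetical state populations.
toy_pop <- data.frame(
  Province_State = c("A", "B"),
  Population     = c(1000, 500)
)

# Sum the cases and average the latitudes per state,
# then join everything on the shared Province_State column.
toy_cases <- aggregate(Cases ~ Province_State, toy_counties, sum)
toy_lat   <- aggregate(Lat ~ Province_State, toy_counties, mean)
toy_state <- merge(merge(toy_cases, toy_lat), toy_pop)

# Cases per capita for each state.
toy_state$Case_per_Capita <- toy_state$Cases / toy_state$Population
print(toy_state)
```

In this toy table, state A ends up with 40 cases at an average latitude of 31, and state B with 5 cases at latitude 40. Because `merge` joins on the column names the tables share, rows that appear in only one table are dropped from the result.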

Let’s put together a linear regression based on this data and see if a linear model seems appropriate (and if it agrees with our hypothesis that states at lower latitudes would see fewer cases per capita due to higher temperatures on average).

Corona_lat_lm <- lm(State_Corona$Case_per_Capita ~ State_Corona$Latitude)
plot(State_Corona$Latitude,State_Corona$Case_per_Capita, main="Coronavirus cases -to- Latitude", xlab = "Latitude", ylab = "Coronavirus Cases per Capita")
abline(Corona_lat_lm)

qqnorm(resid(Corona_lat_lm))
qqline(resid(Corona_lat_lm))

summary(Corona_lat_lm)
## 
## Call:
## lm(formula = State_Corona$Case_per_Capita ~ State_Corona$Latitude)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0010287 -0.0008430 -0.0006044 -0.0000317  0.0084040 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           1.284e-03  1.847e-03   0.695    0.490
## State_Corona$Latitude 6.937e-07  4.589e-05   0.015    0.988
## 
## Residual standard error: 0.001725 on 48 degrees of freedom
## Multiple R-squared:  4.762e-06,  Adjusted R-squared:  -0.02083 
## F-statistic: 0.0002286 on 1 and 48 DF,  p-value: 0.988

It comes as no surprise that the residuals skew strongly positive: the overall numbers are fairly low and there is a hard floor at zero cases per capita, so any state with a higher caseload obviously skews high.

An analysis of the summary shows the following:

Residuals:

  1. The median falls fairly close to 0, but the data consists of very small numbers in general, so this tells us little.

  2. The 1Q and 3Q residuals are more than an order of magnitude apart in size, and both are negative - not a good sign.

  3. The min and max residuals are closer in magnitude than the 1Q/3Q values, but still not very close (the max is about 8 times the size of the min).
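
As a rough check of these symmetry comparisons, the ratios can be recomputed from the residual summary printed above. The vector `res` below is simply those five printed values typed in by hand; the variable names are assumptions made for this sketch.

```r
# Five-number summary of the residuals, copied from the summary() output.
res <- c(Min = -0.0010287, Q1 = -0.0008430, Median = -0.0006044,
         Q3 = -0.0000317, Max = 0.0084040)

# For residuals that are symmetric around zero, both ratios would be near 1.
quartile_ratio <- abs(res[["Q1"]] / res[["Q3"]])    # about 27
extreme_ratio  <- abs(res[["Max"]] / res[["Min"]])  # about 8

print(c(quartile = quartile_ratio, extreme = extreme_ratio))
```

With the fitted model in memory, `quantile(resid(Corona_lat_lm))` should return the same five numbers without retyping them.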

Coefficients:

  1. The ratio of the standard error to the intercept is about 1.4, with a p-value of 0.49, meaning the intercept is not statistically distinguishable from zero.

  2. The ratio of the standard error to the latitude coefficient is about 66, with a p-value of 0.988, meaning the estimated slope is statistically indistinguishable from zero.
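
The standard-error ratios and p-values can be recomputed from the coefficient table printed above. This is a sketch that types the printed estimates in by hand; the names `est`, `se`, `se_ratio`, `t_value` and `p_value` are introduced here for illustration only.

```r
# Estimates and standard errors copied from the coefficient table.
est <- c(Intercept = 1.284e-03, Latitude = 6.937e-07)
se  <- c(Intercept = 1.847e-03, Latitude = 4.589e-05)

# Standard error relative to the estimate: about 1.44 and about 66.
se_ratio <- se / est

# The t value is the inverse of that ratio, and the two-sided p-value
# comes from a t distribution with 48 residual degrees of freedom.
t_value <- est / se
p_value <- 2 * pt(-abs(t_value), df = 48)

print(round(rbind(se_ratio, t_value, p_value), 3))
```

A standard error many times larger than the estimate it belongs to is another way of saying that the t value is close to zero.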

Quality of Fit:

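  1. The residual standard error divided by the average magnitude of the 1Q and 3Q residuals above is 3.94, which is nowhere near the value of about 1.5 expected for normally distributed residuals.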

  2. 48 degrees of freedom means every datapoint was used (50 datapoints - 2 coefficients)

  3. Multiple R-squared: at 4.762e-06, this model describes nearly none of the data’s variation, so we can conclude that:

Latitude shows no measurable linear relationship with the number of Coronavirus cases per capita within the continental United States.
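
The quality-of-fit figures can be reproduced from the printed summary. The sketch below types those values in by hand and assumes the 1Q/3Q comparison uses the average of the two absolute quartile values; all variable names are introduced here for illustration.

```r
# Values copied from the summary() output.
rse       <- 0.001725
q1        <- -0.0008430
q3        <- -0.0000317
r_squared <- 4.762e-06
n_obs     <- 50   # states remaining after filtering and merging
n_coef    <- 2    # intercept and slope

# Residual standard error relative to the average quartile magnitude.
quartile_check <- rse / mean(abs(c(q1, q3)))   # about 3.94

# For normally distributed residuals the quartiles sit at about
# +/- 0.674 standard deviations, which gives the rule-of-thumb 1.5.
normal_reference <- 1 / qnorm(0.75)            # about 1.48

# Residual degrees of freedom and the F-statistic implied by R-squared.
df_resid <- n_obs - n_coef                              # 48
f_stat   <- (r_squared / 1) / ((1 - r_squared) / df_resid)

print(c(quartile_check = quartile_check,
        normal_reference = normal_reference,
        df_resid = df_resid, f_stat = f_stat))
```

The recomputed F-statistic comes out at roughly 0.00023, in line with the value reported in the summary.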

Whether or not that disproves the warm weather hypothesis is a question that would require a much more involved analysis.