COVID-19 has rapidly spread worldwide in 2020 since it was first reported in Wuhan,China and has affected the lives of many people and nations. Understanding the spread of this virus based on spatio-temporal patterns, will enable agencies to better plan and utilise resources. As such this study aims to do this by analysing the COVID-19 situation in Mexico, specifically in Central Mexico (i.e. Mexico City, Mexico State and Morelos State).
The COVID-19 pandemic reached Mexico in February 2020, and as of 14th Oct 2020 there are 825k cases and 84,420 deaths in Mexico from it. Central-Mexico is the country’s historic core and is made out of Mexico City, Mexico State and Morelos State. It is also a popular tourist destination and it will also be the study area for this analysis.
Through this analysis we aim to:
The data for this study was provided by Prof. Kam Tin Seong.
The following code chunk loads the relevant packages to be utilised in this study.
packages = c('rgdal', 'sf', 'spdep', 'tmap', 'tidyverse','magick','plotly','tmaptools','gridExtra')
for (p in packages){
library(p,character.only = T)
}
We will import in the Mexico municipalities GIS data. We will import in the ESRI shapefile format in and save it in a simple feature dataframe format.
mexico <-st_read(dsn="data/geospatial",
layer = "municipalities_COVID")
## Reading layer `municipalities_COVID' from data source `E:\Y4S1\gis\in_class_ex\Take-Home_Ex02\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 2465 features and 198 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 911292 ymin: 319149.1 xmax: 4082997 ymax: 2349615
## projected CRS: MEXICO_ITRF_2008_LCC
The COVID-19 cases are aggregated based on epidemiological week (e-week) from 1st Jan to 6th Aug (32 Weeks) and municipality levels as can be seen in this preview.
head(mexico, n=3)
We can also have quick look at the map itself and all the municipalities.
qtm(mexico)
Lets do a quick check to ensure that the data imported has no NA values.
mexico[rowSums(is.na(mexico))!=0,]
There are no NA values detected. Additionally we will also conduct a geometry check to ensure that the data is not corrupted or invalid when imported into R. The following code chunk does this check
test <- st_is_valid(mexico,reason=TRUE)
length(which(test!= "Valid Geometry"))
## [1] 0
All geometries are valid, we can continue on to extracting and plotting the study area.
For the purpose of the study we will be focusing on Central Mexico. Which consists of Mexico City, Mexico State and Morelos State. Their CVE_ENT codes are 9, 15 and 17 respectively. Lets extract and plot this study area.
mexico_city <- mexico[mexico$CVE_ENT == "09",]
mexico_state <- mexico[mexico$CVE_ENT == "15",]
morelos_state<- mexico[mexico$CVE_ENT == "17",]
#Plot individual municipalities
tmap_arrange(qtm(mexico_city, title="Mexico City"),
qtm(mexico_state, title="Mexico State"),
qtm(morelos_state, title = "Morelos State"))
central_mexico <- mexico %>%
filter(mexico$CVE_ENT == "09"
|mexico$CVE_ENT == "15"
|mexico$CVE_ENT == "17")
#Plot Central Mexico
qtm(central_mexico, title="Central Mexico")
To have an understanding of the COVID-19 rate lets calculate the number of new cases per 10000 population per week, from e-week 13 until e-week 32.
central_mexico <- central_mexico %>%
select(!num_range("cumul",1:12))%>%
mutate_at(vars(starts_with("cumul")),funs(covid_rate=./Pop2020*10000))
Lets take a quick look at how the rates grow for each week based on CVE_ENT (e.g. Mexico City, Mexico State and Morelos State).
data_long <- central_mexico%>%
gather(week,value,cumul13_covid_rate:cumul32_covid_rate)%>%
mutate(week = str_replace_all(week, "cumul", "wk"))
covid_analysis<-ggplot(data_long, aes(x = CVE_ENT, y= value), xlab="") +
geom_bar(stat="identity", width=.5, position = "dodge") +
labs(title="Covid-19 Rates By Week") +
theme(axis.text.x = element_text(angle = 25, hjust = 1)) +
scale_x_discrete(breaks=c("09","15","17"),
labels=c("Mexico City", "Mexico State", "Morelos State"))+
facet_wrap(~week)
ggplotly(covid_analysis)
Based on the trellis plot of Cumulative COVID-19 Rates for each CVE_ENT we can see that over time the COVID-19 rates start to grow with all 3 regions registering significant covid-19 cases by e-week 22. This indicates that there is indeed a growth of covid-19 rates. Howerever we will need to utilise more localised geospatial anlysis to understand the temporal relationships.
As an initial exploratory data analysis, let take a look at the distribution of COVID-19 rates at municipality level. To do this we can utilse a choropleth map to visualise these desriptions over the weeks. To be able to get an even comparison over the weaks we will first identify the quantile values for the e-week 32 and utilise it as our breaks in our chloropleth map.
The following code chunk creates the choropleth for each week and saves it as .png image that will be utilised later to create a gif to visualise the spread over the weeks.
create_choro <- function(title, eweek){
title <- paste("Covid Rate for ", title)
m <- tm_shape(central_mexico)+
tm_polygons(col = eweek, breaks=c(0,8.5,23.3,48.3,86.7,170.4),style="fixed", palette="Reds",title = title) +
tm_text("NOMGEO",size="AREA") +
tm_borders(alpha = 0.5)+
tmap_style("white")
filename = paste(eweek,".png")
tmap_save(m, filename, asp=0)
return(m)
}
The following code chunk stitches the choropleth for the various weeks together and creates a gif file that we can visualise.
imglayers <- sapply(13:32, function(wk) {
image_read(paste('cumul', wk, '_covid_rate .png', sep=''))
})
# Generate an animated GIF with the individual maps and write to a file
imganim <- image_animate(image_join(imglayers), fps = 1, dispose = "previous")
image_write(imganim, 'choropleth-anim.gif')
As we can see from the animated choropleth map, the virus starts out prominently in the municipalities of Milpa Alta, Iztacalco and Miguel Hidalg (municipalities in Mexico City). It then spreads to the rest of Mexico City and progresses to other municpalities in central mexico. Already infected municipalities continued to grow interms of cases as well.
We can take a closer a look at the spread with the trellis plot below of a choropleth map of COVID-19 rates over the e-weeks 13 to 32. We can similarly observe the spread of COVID-19 from one municipality to another.
tmap_central_mexico <- tm_shape(data_long) +
tm_fill("value",
style="jenks",
palette="Reds") +
tm_borders() +
tm_facets("week", nrow=5, ncol=4)+
tm_layout(title.position = c("right","bottom"))
tmap_central_mexico
While neighbouring municipalities show similar levels of covid-19 rates over the same periods, it might be indicative that there is some form of spread or diffusion of covid-19 from one municipality to another. However based on this visualisation alone we cannot tell if the diffusion of covid-19 has occured.
Based on out inital EDA and thematic mappings we where able to understand that there are indications of of some form of spread or diffusion of COVID-19, Now we will utilise ESDA methods to attempt to establish whether the data is infact spatially autocorrelated.
Before we are able to understand the location related tendency of the COVID-19 rates we will need to codify the neighbourhood structure of central mexico to be utilised in our analysis., to do this we will need to select and appropriate spatial weight matrix.
There are various methods in determining the contiguity or adjacency matrix. We will be utilising the queen and rook method. The following code chunk below computes it.
# Convert to spatialpointdataframe
spdf_central_mexico<-as_Spatial(central_mexico)
# compute Rook contiguity weight matrix
wm_r <- poly2nb(spdf_central_mexico, queen=FALSE)
# compute Queen contiguity weight matrix
wm_q <- poly2nb(spdf_central_mexico, queen=TRUE)
# save coordinates to variable
coords <- coordinates(spdf_central_mexico)
Lets plot both the queen and rook contiguity maps side by side.
par(mfrow=c(1,2))
plot(spdf_central_mexico, border="lightgrey")
plot(wm_q, coords, pch = 19, cex = 0.6, add = TRUE, col= "red", main="Queen Contiguity")
title(main = "Queen Contiguity")
plot(spdf_central_mexico, border="lightgrey")
plot(wm_r, coords, pch = 19, cex = 0.6, add = TRUE, col = "red", main="Rook Contiguity")
title(main = "Rook Contiguity")
Here is the summary of the queen matrice
summary(wm_q)
## Neighbour list object:
## Number of regions: 176
## Number of nonzero links: 962
## Percentage nonzero weights: 3.10563
## Average number of links: 5.465909
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 14
## 3 3 18 41 31 31 22 13 9 3 1 1
## 3 least connected regions:
## 77 95 121 with 1 link
## 1 most connected region:
## 117 with 14 links
The summary above shows based on the queen contiguity matrix that there are 176 area units in central mexico and the most connected area unit has 14 neighbours. There are 3 area units with only one neighbours.
Here is the summary of the rook matrice
summary(wm_r)
## Neighbour list object:
## Number of regions: 176
## Number of nonzero links: 946
## Percentage nonzero weights: 3.053977
## Average number of links: 5.375
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 14
## 3 3 18 47 27 31 22 14 6 4 1
## 3 least connected regions:
## 77 95 121 with 1 link
## 1 most connected region:
## 117 with 14 links
The summary above shows based on the rook contiguity matrix that there are 176 area units in central mexico and the most connected area unit has 14 neighbours. There are 3 area units with only one neighbours.
An issue that is presented by the contiguity based weight matrices is that the number of neighbors across the dataset is not consistent, this could be due to the variations in the sizes of municipalities in Central Mexico. As such a more appropriate weight matrix might be a Distance based weight matrices.
There are two main distance based matrices (fixed & adaptive). An appropriate matrices for us will be to utilise an adaptive distance based matrices as it will adjust itself based on the density of data. This might be relevant especially as different municipalities in central mexico may vary in terms of more densely settled areas to less densely settled areas.
We will be utilising the K-nearest neighbour as our adaptive distance method. Where we first preset a fixed number of neighbours desired per area, and the algorithm selects the number of “adjacent” neighbours based on a proximity measure of the calculation of distance between each region.
knb <- knn2nb(knearneigh(coords, k=8, longlat = FALSE))
knb_lw <- nb2listw(knb, style = 'B')
summary(knb_lw)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 176
## Number of nonzero links: 1408
## Percentage nonzero weights: 4.545455
## Average number of links: 8
## Non-symmetric neighbours list
## Link number distribution:
##
## 8
## 176
## 176 least connected regions:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 with 8 links
## 176 most connected regions:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 with 8 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 176 30976 1408 2526 46182
Lets visualise the weight matrix onto the map
plot(spdf_central_mexico, border="lightgrey")
plot(knb, coords, pch = 19, cex = 0.6, add = TRUE, col = "red")
title(main = "Adaptive Distance Based")
We will be utilising this weight matrix in our analysis moving forward for the rest of our analysis.
Based on these tests conditions:
We will be testing the hypothesis through calculating the global Moran’s I statistics. To establish the p-value of the statistics we will run the Monte Carlo simulations for each e-week from the period e-week 13 to e-week 32. The code chunk below does this
for(i in 13:32){
weekcol<- paste("cumul",i, "_covid_rate",sep="")
moranmc <- paste("moranmc_week",i,sep="")
assign(moranmc, moran.mc(spdf_central_mexico[[weekcol]], listw=knb_lw, nsim=39, zero.policy = TRUE, na.action=na.omit))
}
moranmctable <-data.frame(matrix(vector(),ncol=3))
names(moranmctable) <-c("week","p-value","statistics")
for(i in 13:32){
var_names<- paste("moranmc_week",i,sep="")
moranmc<-get(var_names, envir = globalenv())
moranmctable <- rbind(moranmctable,data.frame(i,moranmc$p.value,moranmc$statistic))
}
Let visualise the global Moran I statistics over the e-weeks 13 to 32.
moranmcscore <- ggplot(moranmctable, aes(x=i,y=moranmc.statistic))+
geom_line()+xlab("Weeks (13-32)")+ylab("Moran I Statistics")+ggtitle("Moran I Statistics over e-weeks 13 to 32")
ggplotly(moranmcscore)
Lets similarly visualise the p-value for each e-week.
moranmcp <- ggplot(moranmctable, aes(x=i,y=moranmc.p.value))+
geom_line()+xlab("Weeks (13-32)")+ylab("p-value")+ggtitle("Moran I p-value over e-weeks 13 to 32")
moranmcp
Through all the e-weeks the P-value is 0.025 which is smaller than alpha value of 0.05 as such we fail to accept the null hypothesis. We can infer the alternative hypothesis, At its lowest the Moran I statistics is 0.2435 and throughout all the weeks it is > 0 as such there is positive auto-correlation. In conclusion the spatial pattern resemble cluster patterns.
Given that we have established that there is spatial clustering of covid-19 rates in central mexico from the e-weeks 13-32, we can proceed to detect these clusters and outliers and also discover hotspots of covid-19 rates.
To do this we will be conducting Localised Geospatial Analysis by utilising local Moran’s I analysis and local Getis-Ord Gi* analysis.
We will continue to utilise the adaptive distance based spatial weight matrix that we derived in the previous section.
Lets first compute the local Moran I statistics for each e-week from the period e-week 13 to e-week 32. The code chunk below does this.
fips <- order(spdf_central_mexico$CVEGEO)
for(i in 13:32){
weekcol<- paste("cumul",i, "_covid_rate",sep="")
localMI <- paste("lmoran_week",i,sep="")
cm_localMI <- paste("cm_localMI_week", i, sep="")
assign(localMI, localmoran(spdf_central_mexico[[weekcol]], knb_lw))
assign(cm_localMI,
cbind(spdf_central_mexico,eval(as.name(localMI))))
}
Let’s plot the local Moran’s I values map and its corresponding p-values map for e-week 13 and e-week 32.
localMI13.map <- tm_shape(cm_localMI_week13) +
tm_fill(col = "Ii",
style = "pretty",
title = "e-week 13 local moran statistics") +
tm_borders(alpha = 0.5)
pvalue13.map <- tm_shape(cm_localMI_week13) +
tm_fill(col = "Pr.z...0.",
breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
palette="-Blues",
title = "e-week 13 local Moran's I p-values") +
tm_borders(alpha = 0.5)
localMI32.map <- tm_shape(cm_localMI_week13) +
tm_fill(col = "Ii",
style = "pretty",
title = "e-week 32 local moran statistics") +
tm_borders(alpha = 0.5)
pvalue32.map <- tm_shape(cm_localMI_week13) +
tm_fill(col = "Pr.z...0.",
breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
palette="-Blues",
title = "e-week 32 local Moran's I p-values") +
tm_borders(alpha = 0.5)
tmap_arrange(localMI13.map, pvalue13.map,localMI32.map, pvalue32.map, ncol=2,nrow=2)
From these plots we can get a better sense of the local Moran’s I scores. We can observe where the areas with positive and negative spatial auto-correlation are distributed respectively. When studied in union with the Moran sacatterplot we can have a better understanding of the municipalities that are statistically significant and their auto-correlation.
Based on this inital analysis of the first and last e-week we can see that there is indeed a clustering pattern in terms of covid-19 rates and it suggests that these municipalities are potential hotspots for COVID-19 cases.
We can additionally observe the direction and magnitude of global autocorrelation through a moran scatterplot, the slope of the linear regression of the lagged values of covid-19 rates vs the original frequencies of covid-19 is equivalent to the Moran’s I score.
scatterMC_wk13 <- moran.plot(central_mexico$cumul13_covid_rate,
knb_lw,
labels=as.character(central_mexico$NOMGEO),
xlab="COVID-19 rate in Central Mexico e-week 13",
ylab="Spatially Lag COVID-19 in Central Mexico")
scatterMC_wk32 <- moran.plot(central_mexico$cumul32_covid_rate,
knb_lw,
labels=as.character(central_mexico$NOMGEO),
xlab="COVID-19 rate in Central Mexico e-week 32",
ylab="Spatially Lag COVID-19 in Central Mexico")
Through these scatter plots we can see the negative and positive auto-correlations of each municipality to the rate of covid-19. But we will need to conduct further anlysis to understand their statistical significance.
However it might not always be easy to pin point the statistically significant relationships from these plots, as such utilisng a LISA cluster map might assist in easily visulaising all this in one map.
The following code chunk creates the LISA cluster map for the e-weeks 13 to 32.
colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("insignificant", "low-low", "low-high", "high-low", "high-high")
for(i in 13:32){
week <- paste("e-week",i)
weekcol<- paste("cumul",i, "_covid_rate",sep="")
localMI <- paste("lmoran_week",i,sep="")
quadrant <- vector(mode="numeric",length=nrow(eval(as.name(localMI))))
DV <- spdf_central_mexico[[weekcol]] - mean(spdf_central_mexico[[weekcol]])
C_mI <- eval(as.name(localMI))[,1]- mean(eval(as.name(localMI))[,1])
signif <- 0.05
quadrant[DV >0 & C_mI>0] <- 4
quadrant[DV <0 & C_mI<0] <- 1
quadrant[DV <0 & C_mI>0] <- 2
quadrant[DV >0 & C_mI<0] <- 3
quadrant[eval(as.name(localMI))[,5]>signif] <- 0
cm_localMI <- paste("cm_localMI_week", i, sep="")
quadrant_df <- data.frame("quadrant" = quadrant)
assign(cm_localMI, cbind(eval(as.name(cm_localMI)), quadrant_df))
LisaPlotName<- paste("lplot_week", i, sep="")
LisaPlot <- tm_shape(eval(as.name(cm_localMI))) +
tm_fill(col = "quadrant", style = "cat", palette = colors[c(sort(unique(quadrant)))+1], labels = clusters[c(sort(unique(quadrant)))+1]) +
tm_borders(alpha=0.5) +
tm_view(set.zoom.limits = c(11,17))+
tm_layout(week,
legend.outside=TRUE)
assign(LisaPlotName, LisaPlot)
}
Lets visualise out the LISA cluster maps for the e-weeks 13 to 32.
From the lisa cluster plots, we can visualise the auto correlation and whether or not something is a cluster or outlier. The colors are based on their type of autocorrelations:
We can see that during weeks 13-16 there was a High-high cluster in the centre of central mexico, this is inline with our inital EDA where we found that the covid cases started in municipalities in mexico city (Milpa Alta, Iztacalco and Miguel Hidalg etc.) before spreading outwards.
Interestingly a Low-Low cluster startes to occur from week 17 in the south-west of central-mexico, the cluster seems to grow up to week 21, indicating that the covid-rates in these municipalites where low and their neighbours also had low numbers. However by week 30 these clusters start to diminish, indicative that the covid-19 rates may be rising across central mexico.
The High-High cluster first observed in the centre of central mexico, at mexico-city,continued to grow outwards. Indicating that there are a high rates of ccovid-19 cases in these municipalites and at their neighbours, however the cluster stops growing from week 30. This could indicate that there is an improvemnet in the COVID-19 pandemic situation that is allowing the rates to stabilize.
Beyond just utiliing local Moran I analysis, it will be interesting to utilise Getis-Ord Gi* Analysis as well. This will enable us to identify areas of hot spots and cold spots in-terms of covid-19 rates.
We will continue to utilise the adaptive distance based weight matrix that we derived in the earlier sections. The code chunk below calculates the Getis-Ord Gi* statistics for the e-week 13 to 32.
for(i in 13:32){
week <- paste("e-week",i)
weekcol<- paste("cumul",i, "_covid_rate",sep="")
localGI <- paste("gi_week",i,sep="")
assign(localGI, cbind(spdf_central_mexico, data.frame("gstat_adaptive" = as.matrix(localG(spdf_central_mexico[[weekcol]], knb_lw)))))
gi <- paste("gimap_week", i, sep="")
giplot <- tm_shape(eval(as.name(localGI ))) +
tm_fill(col = "gstat_adaptive",
style = "pretty",
palette = "-RdBu",
title = "local Gi") +
tm_borders(alpha=0.5) +
tm_view(set.zoom.limits = c(11,17))+
tm_layout(week,
legend.outside=TRUE)
assign(gi, giplot)
}
Lets visualise out the Gi values derived by creating a hot/cold spot map for the e-weeks 13 to 32.
The map plots above showcases the hot and cold spots of covid-19 case rate in central mexico. The municipalities in red are the hot spots areas and the municipalities in blue are the cold spot areas. The darkeness of the colors will represent the intensity of the Gi values
As such from the plot we can see that similar to our earlier analysis there is a hot spot right at mexico city, from e-week 13. This hot spot area will grow and contract over the weeks from e-week 13 to e-week 32. This indicates that this municipalities in the area is associated with relatively high values of covid-19 case rates in the surrounding locations.
The cold spots on the other hand are in the North, South-East and South of mexico city. This indicates that the municipalties in this location are associated with relatively low values in surrounding locations.
It hold true to our initial analysis that the COVID-19 cases seems to slowly spread spatially from mexico city and the outskirts of central mexico are relatively safer and less impacted from the virus.
In conclusion, we where able to derive much insights on the Spatiotemporal Patterns of COVID-19 in Central Mexico. We where able to identify the spatio-temporal spread of the virus starting from mexico city. In addition, we where able to identify that there are signs of spatial clusters interms of covid-19 rates.
For which we where able to pinpoint where these clusters are, and where there are high levels of covid rates and the neighbours also had high levels of covid rates. This will enable agencies to focus their resources in these localized regions in a more equitable and evidence based manner.
Finally,through this analysis we where able to statistically test and identify the hypothesis that there is a location related tendency in the covid-19 rates attribute in central mexico.
The guiding research from which the analysis for this project was built out from:
Codes in this document have been adapted from the following useful guides:
A work by Rajiv Abraham Xavier