Covid-19 is an infectious disease that has affected many parts of the world. According to the WHO, the COVID-19 virus spreads primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes. With as many as 726,431 confirmed cases and 76,243 reported deaths in Mexico, it is worthwhile to analyze how covid-19 spreads in the spatial dimension.
Being the most densely populated location in Mexico, Mexico City will be the most affected location when this infectious disease starts to spread within Mexico. Hence, we will focus the spatial-temporal analysis on Mexico City and its surrounding locations, Mexico State and Morelos State.
First we will need to install and import the necessary packages for this analysis. These packages help us to read, visualize and analyze the geospatial and aspatial files provided. spdplyr is a package that allows us to perform data wrangling on sp objects.
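The set-up chunk is not shown; a minimal sketch, assuming the packages used in the rest of this post (sf, spdep, tmap, tidyverse, spdplyr and gridExtra) make up the full list:
packages <- c('sf', 'spdep', 'tmap', 'tidyverse', 'spdplyr', 'gridExtra')
for (p in packages){
  # Install the package if it is missing, then load it
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}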
Next we will use the st_read() function to read the data and save it as an sf dataframe.
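The read call itself does not appear in the output below; it is presumably along these lines, with the data source path and layer name taken from the output:
covid19 <- st_read(dsn = "data/geospatial",
                   layer = "municipalities_COVID")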
## Reading layer `municipalities_COVID' from data source `C:\IS415\in_class_ex\Take_home_ex2_temp\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 2465 features and 198 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 911292 ymin: 319149.1 xmax: 4082997 ymax: 2349615
## projected CRS: MEXICO_ITRF_2008_LCC
As the data covers the whole of Mexico, we will need to filter it down to the 3 main areas we will be analyzing. The dataframe consists of many different variables such as new cases, deaths and cumulative cases. As we would like to see how the virus spreads across the weeks, we will be extracting only:
* Cumulative number of cases
* Data from e-week 13 to e-week 32
We will also be creating a new variable "covid_rate" with the formula cumulative cases / Pop2020 * 10000. This allows us to see the number of cumulative cases per 10,000 population. With this metric, we can produce a less biased analysis (if there are more people, there will be more cases; the metric removes this bias from the analysis).
covid19_clean <- covid19 %>%
  filter(`CVE_ENT` %in% c('09','15','17')) %>%
  select(CVE_ENT, contains("cumul"), Pop2020, CVEGEO, NOMGEO) %>%
  select(1, 14:36) %>%
  # across() replaces the deprecated mutate_at()/funs() idiom;
  # .names keeps the cumulXX_covid_rate column names seen below
  mutate(across(contains("cumul"), ~ .x / Pop2020 * 10000,
                .names = "{.col}_covid_rate"))
glimpse(covid19_clean)
## Rows: 176
## Columns: 45
## $ CVE_ENT <fct> 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, ...
## $ cumul13 <dbl> 19, 25, 66, 28, 11, 41, 15, 0, 56, 3, 41, 11, 23...
## $ cumul14 <dbl> 46, 44, 75, 96, 27, 100, 27, 1, 90, 11, 70, 32, ...
## $ cumul15 <dbl> 87, 107, 93, 193, 65, 225, 45, 23, 146, 30, 153,...
## $ cumul16 <dbl> 156, 213, 122, 397, 188, 526, 83, 59, 270, 79, 3...
## $ cumul17 <dbl> 254, 375, 157, 734, 343, 1061, 169, 137, 412, 17...
## $ cumul18 <dbl> 391, 554, 202, 1107, 552, 1754, 220, 207, 609, 3...
## $ cumul19 <dbl> 615, 831, 260, 1653, 806, 2617, 294, 301, 847, 5...
## $ cumul20 <dbl> 931, 1150, 323, 2330, 1125, 3583, 427, 420, 1169...
## $ cumul21 <dbl> 1267, 1492, 414, 3108, 1430, 4659, 571, 615, 155...
## $ cumul22 <dbl> 1590, 1866, 573, 3843, 1706, 5532, 713, 778, 194...
## $ cumul23 <dbl> 1939, 2255, 714, 4674, 1989, 6416, 839, 961, 232...
## $ cumul24 <dbl> 2249, 2554, 851, 5366, 2209, 7234, 961, 1184, 27...
## $ cumul25 <dbl> 2594, 2936, 991, 5977, 2454, 7981, 1142, 1354, 3...
## $ cumul26 <dbl> 2922, 3208, 1117, 6502, 2660, 8646, 1321, 1491, ...
## $ cumul27 <dbl> 3238, 3590, 1285, 7075, 2825, 9408, 1529, 1667, ...
## $ cumul28 <dbl> 3539, 3903, 1444, 7587, 2997, 10078, 1758, 1818,...
## $ cumul29 <dbl> 3839, 4393, 1615, 8212, 3183, 10822, 2103, 2022,...
## $ cumul30 <dbl> 4171, 4820, 1835, 8875, 3416, 11550, 2350, 2219,...
## $ cumul31 <dbl> 4442, 5216, 1949, 9380, 3612, 12170, 2627, 2364,...
## $ cumul32 <dbl> 4466, 5246, 1956, 9439, 3674, 12305, 2639, 2375,...
## $ Pop2020 <int> 408441, 621952, 199809, 1176967, 393821, 1815551...
## $ CVEGEO <fct> 09002, 09003, 09004, 09005, 09006, 09007, 09008,...
## $ NOMGEO <fct> Azcapotzalco, Coyoacán, Cuajimalpa de Morelos, G...
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((2794860 837..., MUL...
## $ cumul13_covid_rate <dbl> 0.46518347, 0.40196028, 3.30315451, 0.23789962, ...
## $ cumul14_covid_rate <dbl> 1.12623365, 0.70745009, 3.75358467, 0.81565583, ...
## $ cumul15_covid_rate <dbl> 2.1300506, 1.7203900, 4.6544450, 1.6398081, 1.65...
## $ cumul16_covid_rate <dbl> 3.8194011, 3.4247016, 6.1058311, 3.3730767, 4.77...
## $ cumul17_covid_rate <dbl> 6.2187684, 6.0294042, 7.8575039, 6.2363686, 8.70...
## $ cumul18_covid_rate <dbl> 9.5729861, 8.9074398, 10.1096547, 9.4055313, 14....
## $ cumul19_covid_rate <dbl> 15.0572543, 13.3611597, 13.0124269, 14.0445739, ...
## $ cumul20_covid_rate <dbl> 22.7939898, 18.4901729, 16.1654380, 19.7966468, ...
## $ cumul21_covid_rate <dbl> 31.020392, 23.988990, 20.719787, 26.406858, 36.3...
## $ cumul22_covid_rate <dbl> 38.928511, 30.002315, 28.677387, 32.651723, 43.3...
## $ cumul23_covid_rate <dbl> 47.473197, 36.256817, 35.734126, 39.712243, 50.5...
## $ cumul24_covid_rate <dbl> 55.063032, 41.064262, 42.590674, 45.591763, 56.0...
## $ cumul25_covid_rate <dbl> 63.509785, 47.206215, 49.597365, 50.783072, 62.3...
## $ cumul26_covid_rate <dbl> 71.540320, 51.579543, 55.903388, 55.243690, 67.5...
## $ cumul27_covid_rate <dbl> 79.277056, 57.721496, 64.311417, 60.112136, 71.7...
## $ cumul28_covid_rate <dbl> 86.646541, 62.754039, 72.269017, 64.462300, 76.1...
## $ cumul29_covid_rate <dbl> 93.991543, 70.632460, 80.827190, 69.772559, 80.8...
## $ cumul30_covid_rate <dbl> 102.120012, 77.497942, 91.837705, 75.405683, 86....
## $ cumul31_covid_rate <dbl> 108.754998, 83.864993, 97.543154, 79.696372, 91....
## $ cumul32_covid_rate <dbl> 109.342598, 84.347345, 97.893488, 80.197661, 93....
In order to view the plot as an animation, we need to restructure the dataframe into long format for time-series visualization. Hence, we use gather() so that we can plot the chart along the weeks.
covid19_clean_plot <- covid19 %>%
  filter(`CVE_ENT` %in% c('09','15','17')) %>%
  select(CVE_ENT, contains("cumul"), Pop2020, CVEGEO, NOMGEO) %>%
  select(1, 14:36) %>%
  gather(key = "week", value = "cumul", 2:21) %>%
  # week_title must be derived before week is overwritten,
  # otherwise gsub() finds no "cumul" left to replace
  mutate(week_title = gsub("cumul", "Cumulative Cases in Week ", week),
         week = gsub("cumul", "Week ", week),
         covid_rate = (cumul/Pop2020)*10000) %>%
  arrange(week)
glimpse(covid19_clean_plot)
## Rows: 3,520
## Columns: 9
## $ CVE_ENT <fct> 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, 09, ...
## $ Pop2020 <int> 408441, 621952, 199809, 1176967, 393821, 1815551, 245147...
## $ CVEGEO <fct> 09002, 09003, 09004, 09005, 09006, 09007, 09008, 09009, ...
## $ NOMGEO <fct> Azcapotzalco, Coyoacán, Cuajimalpa de Morelos, Gustavo A...
## $ week <chr> "Week 13", "Week 13", "Week 13", "Week 13", "Week 13", "...
## $ cumul <dbl> 19, 25, 66, 28, 11, 41, 15, 0, 56, 3, 41, 11, 23, 33, 11...
## $ week_title <chr> "Cumulative Cases in Week 13", "Cumulative Cases in Week...
## $ covid_rate <dbl> 0.46518347, 0.40196028, 3.30315451, 0.23789962, 0.279314...
## $ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((2794860 837..., MULTIPOLYGO...
Below are histograms showing the distribution of the covid rate at the start and end of the study period. As you can see from the 2 charts, both have more observations concentrated on the left side of the plot. Although both show that the distribution is skewed, we can see that in week 32 the distribution is less skewed, which signifies that the covid-19 virus is spreading to other municipalities.
case_13 <- ggplot(covid19_clean, aes(x= cumul13_covid_rate)) +
geom_histogram(bins = 10)
case_32 <- ggplot(covid19_clean, aes(x= cumul32_covid_rate)) +
geom_histogram(bins = 10)
grid.arrange(case_13, case_32, ncol=2)
Before we proceed with any kind of analysis, it is important for us to understand the distribution of population among the study area. The choropleth map below shows which places have the most concentrated population. As this is an interactive map, you may click on each polygon to see some details: location, population, and cumulative cases of week 13 and week 32 (the start and end of the period of analysis). The titles represent Mexico City (9), Mexico State (15) and Morelos State (17).
## tmap mode set to interactive viewing
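The mapping code is not shown above; a sketch of what it might look like (the popup fields and the faceting by CVE_ENT are assumptions based on the description) is:
tm_shape(covid19_clean) +
  tm_fill("Pop2020",
          style = "quantile",
          palette = "Blues",
          popup.vars = c("NOMGEO", "Pop2020", "cumul13", "cumul32")) +  # assumed popup fields
  tm_borders(lwd = 0.8) +
  tm_facets(by = "CVE_ENT")  # one panel per state: 09, 15, 17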
Next we need to create a gif to demonstrate the covid-19 rate across the municipalities and animate it across the time dimension. In other words, we will visualize how the covid-19 rate changes in the choropleth map as the weeks pass. It is important to fix the breakpoints to ensure that the diagram we are visualizing is not distorted. The breakpoints are fixed at intervals of 25 with 200 as the maximum, as no municipality exceeds a covid rate of 200 per 10,000 population.
## tmap mode set to plotting
covid_map <- tm_shape(covid19_clean_plot) +
  tm_fill("covid_rate", style = "fixed", breaks = c(0,25,50,75,100,125,150,175,200)) +
  tm_borders(lwd = 0.8, alpha = 1) +
  tm_facets(along = "week_title")
The next step is to create the gif file and display it. It shows the cumulative covid_rate distribution across the municipalities from week 13 to week 32.
The choropleth map shows the covid-19 rate from week 13 to week 32. For the first few weeks, all parts of the study area fall in the 0-25 cases per 10,000 people band. As time passes, the numbers in the middle start to rise. The spread actually starts around Mexico City and Mexico State, then slowly branches out to the neighboring municipalities. Referring to the interactive map above for locations, Milpa Alta is the municipality whose rate increases fastest. In week 32 it has the highest covid rate, and you can see that its surroundings are also relatively high compared to the rest of the study area. Hence, we can say that the virus is spreading spatially.
However, to confirm our hypothesis we should perform further cluster analysis.
Construct a Queen contiguity matrix. This is a matrix that shows, for each municipality, all of its neighboring municipalities. Queen contiguity is chosen as we want to capture all neighbors regardless of direction.
Below is a summary of the queen contiguity matrix. We can see that the least connected regions have only 1 neighbor each, while the most connected region has 14.
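The construction code is not shown; a sketch using spdep, assuming the sf dataframe is first converted to the sp object covid19_clean_sp used in the later chunks:
covid19_clean_sp <- as(covid19_clean, "Spatial")
wm_q <- poly2nb(covid19_clean_sp, queen = TRUE)  # queen contiguity neighbour list
summary(wm_q)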
## Neighbour list object:
## Number of regions: 176
## Number of nonzero links: 962
## Percentage nonzero weights: 3.10563
## Average number of links: 5.465909
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 14
## 3 3 18 41 31 31 22 13 9 3 1 1
## 3 least connected regions:
## 77 95 121 with 1 link
## 1 most connected region:
## 117 with 14 links
To visualize the queen matrix, we can plot out the links between the municipalities. We can see that the nodes near the border of the study area have fewer links as they do not have as many neighbors, and vice versa.
plot(covid19_clean_sp, border="lightgrey")
plot(wm_q, coordinates(covid19_clean_sp), pch = 19, cex = 0.6, add = TRUE, col= "red")
In order to avoid the analysis being affected by the number of neighbors, row standardisation is applied to the weight matrix. It creates proportional weights by dividing each neighbor weight by the sum of all neighbor weights for a feature. Style "W" is chosen, which distributes the weight evenly among a feature's neighbors.
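The weighting chunk is not shown; a sketch (zero.policy is our own precaution in case a region has no neighbors) is:
rswm_q <- nb2listw(wm_q, style = "W", zero.policy = TRUE)  # row-standardised weights
summary(rswm_q)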
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 176
## Number of nonzero links: 962
## Percentage nonzero weights: 3.10563
## Average number of links: 5.465909
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 14
## 3 3 18 41 31 31 22 13 9 3 1 1
## 3 least connected regions:
## 77 95 121 with 1 link
## 1 most connected region:
## 117 with 14 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 176 30976 176 71.10695 731.679
Local Moran's I can help us to identify spatial clusters of covid rates with high or low values. It returns a matrix of values containing the local statistic value, its expectation and variance, a z-score and a pseudo p-value. In this analysis we will focus on the local statistic value and its p-value, which shows its statistical significance.
As we will be analyzing the data across 20 weeks, it will be more efficient to create a loop rather than duplicating the code 19 times. You will notice that the first 2 lines are similar to the code in the for loop; this is to initialize the variables. In the code we will be:
* Computing the local Moran's I for each week
* Using rbind() to merge the local Moran's I results across the weeks
* Merging the local Moran's I results with the dataframe using cbind()
# Initialize with week 13, then append the remaining weeks
data <- covid19_clean_plot %>% filter(week == paste("Week", 13, sep=" "))
localMI <- localmoran(data$covid_rate, rswm_q)
for (i in 14:32){
  data <- covid19_clean_plot %>% filter(week == paste("Week", i, sep=" "))
  temp_mi <- localmoran(data$covid_rate, rswm_q)
  localMI <- rbind(localMI, temp_mi)  # stack each week's results row-wise
}
covid19_clean_plot.localMI <- cbind(covid19_clean_plot, localMI)
localMI.map <- tm_shape(covid19_clean_plot.localMI) +
  tm_fill(col = "Ii",
          palette = "RdBu",
          breaks = c(-Inf, -4, -2, 0, 2, 4, 6, 8, 10, Inf),
          title = "local moran statistics") +
  tm_borders(alpha = 0.5) +
  tm_facets(along = "week")
pvalue.map <- tm_shape(covid19_clean_plot.localMI) +
  tm_fill(col = "Pr.z...0.",
          breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette = "-Blues",
          title = "local Moran's I p-values") +
  tm_borders(alpha = 0.5) +
  tm_facets(along = "week")
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Merge the tmaps across the 20 weeks and create the gifs.
## tmap_animation(localMI.map, filename = "localMI.gif", delay = 100, restart.delay = 100)
## tmap_animation(pvalue.map, filename = "pvalue.gif", delay = 100, restart.delay = 100)
Arrange the gifs to show the local statistic values alongside their statistical significance.
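One minimal way to display the two gifs in the rendered document (an assumption, as the display code is not shown) is:
knitr::include_graphics(c("localMI.gif", "pvalue.gif"))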
moran_map <- function(data)
{
  # Compute local Moran's I for the given rate vector
  localMI <- localmoran(data, rswm_q)
  covid19_clean.localMI <- cbind(covid19_clean, localMI)
  # Map of the local statistic values
  localMI.map <- tm_shape(covid19_clean.localMI) +
    tm_fill(col = "Ii",
            palette = "RdBu",
            style = "pretty",
            title = "local moran statistics") +
    tm_borders(alpha = 0.5)
  # Map of the corresponding p-values
  pvalue.map <- tm_shape(covid19_clean.localMI) +
    tm_fill(col = "Pr.z...0.",
            breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
            palette = "-Blues",
            title = "local Moran's I p-values") +
    tm_borders(alpha = 0.5)
  # Show the two maps side by side
  tmap_arrange(localMI.map, pvalue.map, asp=1, ncol=2)
}
This is the plot for Week 13.
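The exact call is not shown; based on how GiMap is invoked later in the post, it is presumably:
moran_map(covid19_clean$cumul13_covid_rate)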
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
This is the plot for Week 32
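Presumably produced by the analogous call:
moran_map(covid19_clean$cumul32_covid_rate)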
## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Confidence level: 95%; critical value: 0.05.
There are 2 plots. The local Moran statistics plot shows us the distribution of the covid-19 rate in terms of the local Moran's I statistic, while the p-value plot shows whether a particular polygon is statistically significant. Hence, an area with a high local Moran's I value and a low (significant) p-value is a significant positive covid-19 cluster, while areas that are not significant, like the edges of the study area, do not form meaningful clusters.
The areas with the strongest significance are always in the middle, and the areas with high Moran's I values are also in the middle. From this plot we can see that the cluster of positive values is in the middle of the plot, while there is no significant cluster of negative values anywhere in the plot. We can also see that from week 13 to week 32 the positive cluster grew in size, which is a sign of the virus spreading.
From the plot there seems to be no strong cluster across the 20 weeks except for the few municipalities in the middle of the study area. Also, the number of places with negative Moran's I statistics decreases as the weeks pass. This shows that the spread has started to reach these places.
Getis and Ord's G-statistics can be used to detect spatial anomalies. They can help us to detect hot and cold spots spatially. A hot spot refers to a place with high values and a cold spot to a place with low values.
As you can see from the EDA, covid-19 rates are very skewed, hence using an adaptive proximity matrix will be the better choice.
coords <- coordinates(covid19_clean_sp)
knb <- knn2nb(knearneigh(coords, k=8, longlat = FALSE))  # 8 nearest neighbours
knb_lw <- nb2listw(knb, style = 'B')  # binary weights
Display the links between the neighbors:
plot(covid19_clean_sp, border="lightgrey")
plot(knb, coordinates(covid19_clean_sp), pch = 19, cex = 0.6, add = TRUE, col = "red")
As we will be analyzing the data across 20 weeks, it will be more efficient to create a loop rather than duplicating the code 19 times. You will notice that the first 2 lines are similar to the code in the for loop; this is to initialize the variables. In the code we will be:
* Computing the Getis and Ord's G-statistics for each week
* Combining the results with c(), since unlike local Moran's I, localG() returns a vector
* Merging the Gi results with the dataframe using cbind()
covid19_clean_plot_sp <- as(covid19_clean_plot, "Spatial")
# Initialize with week 13, then append the remaining weeks
data <- covid19_clean_plot_sp %>% filter(week == paste("Week", 13, sep=" "))
gi.adaptive <- localG(data$covid_rate, knb_lw)
for (i in 14:32){
  data <- covid19_clean_plot_sp %>% filter(week == paste("Week", i, sep=" "))
  gi.adaptive <- c(gi.adaptive, localG(data$covid_rate, knb_lw))  # localG() returns a vector
}
covid19_clean_plot_sp.gi <- cbind(covid19_clean_plot_sp, as.matrix(gi.adaptive))
names(covid19_clean_plot_sp.gi)[9] <- "gstat_adaptive"
gi_map <- tm_shape(covid19_clean_plot_sp.gi) +
  tm_fill(col = "gstat_adaptive",
          style = "pretty",
          palette = "-RdBu",
          title = "local Gi") +
  tm_borders(alpha = 0.5) +
  tm_facets(along = "week")
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Create a function to plot the start and end weeks for easier analysis.
GiMap <- function(data)
{
  # Compute local Gi* for the given rate vector and join it to the sp dataframe
  gi.adaptive <- localG(data, knb_lw)
  spcovid.gi <- cbind(covid19_clean_sp, as.matrix(gi.adaptive))
  names(spcovid.gi)[45] <- "gstat_adaptive"
  tm_shape(spcovid.gi) +
    tm_fill(col = "gstat_adaptive",
            style = "pretty",
            palette = "-RdBu",
            title = "local Gi") +
    tm_borders(alpha = 0.5)
}
This is the Gi plot for week 13 (left) and week 32 (right).
plot1 <- GiMap(covid19_clean_sp$cumul13_covid_rate)
plot2 <- GiMap(covid19_clean_sp$cumul32_covid_rate)
tmap_arrange(plot1, plot2, asp=1, ncol=2)
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Variable(s) "gstat_adaptive" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
The red regions represent the hot spots and the blue regions represent the cold spots. The darkness of the color reflects the magnitude of the Gi statistic. The plot on the left is the Gi plot for week 13 and the one on the right is the Gi plot for week 32.
From the map we can see that at the beginning, the places with high positive values are concentrated in the middle of the study area, with very low Gi values in the rest of the area. Towards the end of the period, the hottest spot is still in the middle, but it has branched out such that the area around the hotspot is not as blue as before.
In conclusion, we can see that the spread of the virus in Mexico always starts from the center and moves towards the outer areas. However, as the study period may be insufficient, we cannot determine whether this spreading trend will continue to the rest of the study area and even the rest of Mexico.