Child mortality is one of those factors which, due to the issue of malnutrition, have been discussed quite extensively on a global level since the last decade or so. In the list of countries doing quite well in reducing the childhood mortality rates, India is placed very low. Even though the continuous monitoring and the policy making has helped in reducing the rates in the last decade, the issue still persists quite alarmingly. As a result, the focus always remain on analyzing and finding ways to reduce the impact of factors which lead to the under 5 year children deaths in India.
This study tries to analyze the relationship between various factors including availability of clean water, and smoking among others and the mortality rates. What this research also tries to do is to check if there is any spatial dependence and spatial correlations among Indian provinces when it comes to the under-5 child death rates. The study includes the use of various tools like Moran’s test, and the Join count test before building and fitting simple linear regression and spatial models to compare and check the impact of different variables involved.
library(ggplot2)
library(sf)
library(maps)
library(tidyverse)
library(readxl)
library(rgdal)
library(maptools)
library(sp)
library(RColorBrewer)
library(GISTools)
library(splm)
library(car)
library(lmtest)
library(Formula)
library(plm)
library(spdep)
library(spatialreg)
library(texreg)
The data has been collected from the National Family Health Survey (NFHS-5), 2019-21 report. This is a household based survey conducted with the support of International Institute for Population Sciences that aimed for providing data on province level indicators involving health, nutrition, and other socio-economic and demographic ones.
The data contains observations about 9 factors across the total 36 provinces and Union territories of India. A Union territory is an administrative division which, unlike the provinces, is governed completely or partially, by the Union Government of India. The 9 explanatory variables are as follows -
data <- read_excel("provinces.xlsx")
head(data)
## # A tibble: 6 x 11
## ID_1 NAME_1 u5mortality antenatal_care fully_vaccinated violence smoker
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Andaman and~ 24.5 65.9 79.9 18.3 20.1
## 2 2 Andhra Prad~ 35.2 46.8 73.2 33.4 29.1
## 3 3 Arunachal P~ 18.8 14.4 64.9 26.6 70.6
## 4 4 Assam 39.1 26.9 66.7 34.3 57.2
## 5 5 Bihar 56.4 7.6 71 42.5 41
## 6 6 Chandigarh 19.7 61.6 80.9 11.8 15.5
## # ... with 4 more variables: teenage_preg <dbl>, water <dbl>,
## # toilet_facility <dbl>, schooling_f <dbl>
data1 <- data[-c(1,19),] # dropping two island provinces
The two provinces named Andaman and Nicobar and Lakshwadeep are islands located in the Indian Ocean. The same have been dropped from the data as they do not share any land boundaries with other provinces, and this means including them will lead to unreliable results.
pro.sf <- st_read("IND_adm1.shp")
## Reading layer `IND_adm1' from data source `D:\UW\Summer semester 2021\Spatial\1\Project\IND_adm1.shp' using driver `ESRI Shapefile'
## Simple feature collection with 36 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 68.18625 ymin: 6.754256 xmax: 97.41516 ymax: 35.50133
## Geodetic CRS: WGS 84
head(pro.sf)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 76.69063 ymin: 6.754256 xmax: 97.41516 ymax: 30.79904
## Geodetic CRS: WGS 84
## ID_0 ISO NAME_0 ID_1 NAME_1 TYPE_1 ENGTYPE_1
## 1 105 IND India 1 Andaman and Nicobar Union Territor Union Territory
## 2 105 IND India 2 Andhra Pradesh State State
## 3 105 IND India 3 Arunachal Pradesh State State
## 4 105 IND India 4 Assam State State
## 5 105 IND India 5 Bihar State State
## 6 105 IND India 6 Chandigarh Union Territor Union Territory
## NL_NAME_1
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## VARNAME_1
## 1 Andaman & Nicobar Islands|Andaman et Nicobar|Iihas de Andama e Nicobar|Inseln Andamanen und Nikobare
## 2 <NA>
## 3 Agence de la Frontière du Nord-Est(French-obsolete)|North East Frontier Agency
## 4 <NA>
## 5 <NA>
## 6 <NA>
## geometry
## 1 MULTIPOLYGON (((93.78773 6....
## 2 MULTIPOLYGON (((80.27458 13...
## 3 MULTIPOLYGON (((96.15778 29...
## 4 MULTIPOLYGON (((89.87145 25...
## 5 MULTIPOLYGON (((88.10548 26...
## 6 MULTIPOLYGON (((76.80293 30...
pro.sf.final <- pro.sf[-c(1,19),]
Here again, the same observations have been removed for further analysis. As it can be seen, the data has been ordered as per the shapefile format.
plot1 <- ggplot() + geom_sf(pro.sf.final, mapping=aes(geometry=geometry, fill=NAME_1))+
labs(title= "Indian administrative units at the provincial level")
plot1
# Filtering out the desired data to visualize spatial distribution of Childhood mortality rate across provinces
distdata <- data1[, c("ID_1", "u5mortality")]
pro.sf.final <- merge(pro.sf.final, distdata, by= "ID_1")
# Plotting the distribution
plot2 <- ggplot() + geom_sf(pro.sf.final, mapping=aes(geometry=geometry, fill=u5mortality)) +
scale_fill_gradient(low = "#d3e7e7", high = "#300bdd") +
labs(title= "Spatial Distribution of Under 5 mortality rate in Indian provinces 2019-21")
plot2
As visible from the above plot, the provinces of Uttar Pradesh and Bihar are found to have maximum child mortality rates in the country, while Goa and Kerala with the lowest rates.
par(mfrow = c(3,3))
hist(data1$u5mortality)
hist(data1$antenatal_care)
hist(data$fully_vaccinated)
hist(data$violence)
hist(data$smoker)
hist(data$teenage_preg)
hist(data$water)
hist(data$toilet_facility)
hist(data$schooling_f)
More variables seemingly tend toward normality which is a good thing but for this analysis, we are not going to be focusing too much on this aspect.
Before calculating the contiguity matrix, the data must be converted into a Spatial object.
# Converting into spatial object
support.sp <- as(pro.sf.final, "Spatial")
# Generating the matrix
cont.nb <- poly2nb(as(support.sp, "SpatialPolygons"))
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
cont.listw <- nb2listw(cont.nb, style="W")
cont.listw
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 34
## Number of nonzero links: 132
## Percentage nonzero weights: 11.41869
## Average number of links: 3.882353
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 34 1156 34 20.30275 152.8701
As per the above output, the average number of links, which means average number of provinces a particular province shares border with, is roughly 3.88.
india.mt <- nb2mat(cont.nb)
india.mt[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## 1 0 0.0000000 0.0 0 0
## 2 0 0.0000000 0.5 0 0
## 3 0 0.1428571 0.0 0 0
## 4 0 0.0000000 0.0 0 0
## 5 0 0.0000000 0.0 0 0
The above matrix contains the assigned weights to every province depending on the neighborhood structure.
lm.model <- lm(u5mortality~antenatal_care+fully_vaccinated+violence+
smoker+teenage_preg+water+toilet_facility+schooling_f, data=data1)
summary(lm.model)
##
## Call:
## lm(formula = u5mortality ~ antenatal_care + fully_vaccinated +
## violence + smoker + teenage_preg + water + toilet_facility +
## schooling_f, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1427 -4.0200 -0.7727 5.8895 14.1200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.638365 39.999826 2.816 0.00935 **
## antenatal_care -0.059518 0.154334 -0.386 0.70302
## fully_vaccinated -0.485757 0.276702 -1.756 0.09142 .
## violence -0.208686 0.189276 -1.103 0.28072
## smoker -0.006264 0.120097 -0.052 0.95882
## teenage_preg -0.656152 0.454916 -1.442 0.16161
## water 0.440932 0.357495 1.233 0.22890
## toilet_facility -0.704111 0.253731 -2.775 0.01029 *
## schooling_f -0.647720 0.323979 -1.999 0.05656 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.477 on 25 degrees of freedom
## Multiple R-squared: 0.7011, Adjusted R-squared: 0.6054
## F-statistic: 7.329 on 8 and 25 DF, p-value: 5.336e-05
From the output of the model above, it can be seen that only one variable, i.e., toilet_facility is found to be statistically significant at the level of 5%. However, the reason behind building this model is not to check the significant variables but to check if there is a need for spatial models for the data we are trying to analyze or not.
res <- lm.model$residuals
lm.morantest(lm.model, cont.listw) # Moran test
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = u5mortality ~ antenatal_care + fully_vaccinated +
## violence + smoker + teenage_preg + water + toilet_facility +
## schooling_f, data = data1)
## weights: cont.listw
##
## Moran I statistic standard deviate = 3.0391, p-value = 0.001186
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.22648928 -0.09650817 0.01129536
moran.test(res, cont.listw) # Justification for spatial models
##
## Moran I test under randomisation
##
## data: res
## weights: cont.listw
##
## Moran I statistic standard deviate = 2.0719, p-value = 0.01914
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.22648928 -0.03030303 0.01536126
The test used above to check for the existence of spatial correlation is the Moran’s test. The null hypothesis (H0) for the test states that there is spatial randomness (no correlation), while the alternative hypothesis justifies existence of spatial correlation.
As per the results of the test in this case, p-value (~0.019) is very less than the 5% level of significance, which implies we reject the null hypothesis. This means social correlation exists and there is a need for spatial analysis for the problem.
x <- data1$u5mortality # variable selection
zx <- as.data.frame(scale(x)) #standardization of variable
moran.plot(zx$V1, cont.listw, pch=19, labels=as.character(data1$NAME_1))
Plotting the moran’s graph makes it clear that the provinces with high Under-5 mortality rates are situated next to each other and the low ones with low. The figure also clearly depicts the result from plot2, i.e, Bihar and Uttar pradesh, two neighbouring provinces have the maximum mortality rates.
wzx <- lag.listw(cont.listw, zx$V1) # spatial lag of x
morlm <- lm(wzx~zx$V1) # linear regression
summary(morlm)
##
## Call:
## lm(formula = wzx ~ zx$V1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91229 -0.25113 0.06669 0.32540 1.13523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17344 0.07820 2.218 0.0338 *
## zx$V1 0.38173 0.07938 4.809 3.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.456 on 32 degrees of freedom
## Multiple R-squared: 0.4195, Adjusted R-squared: 0.4014
## F-statistic: 23.13 on 1 and 32 DF, p-value: 3.458e-05
slope <- morlm$coefficients[2] # coefficient from regression
intercept <- morlm$coefficients[1] # constant term in regression
# Let's make the conditions for the distribution
cond1 <- ifelse(zx>=0 & wzx>=0, 1,0) # I quarter
cond2 <- ifelse(zx>=0 & wzx<0, 2,0) # II quarter
cond3 <- ifelse(zx<0 & wzx<0, 3,0) # III quarter
cond4 <- ifelse(zx<0 & wzx>=0, 4,0) # IV quarter
cond.all <- cond1+cond2+cond3+cond4 # combining all quarters in one
cond <- as.data.frame(cond.all)
cond
## V1
## 1 2
## 2 4
## 3 2
## 4 1
## 5 4
## 6 1
## 7 1
## 8 1
## 9 4
## 10 3
## 11 1
## 12 1
## 13 4
## 14 3
## 15 1
## 16 3
## 17 3
## 18 1
## 19 4
## 20 3
## 21 1
## 22 4
## 23 2
## 24 1
## 25 3
## 26 2
## 27 1
## 28 3
## 29 3
## 30 4
## 31 2
## 32 1
## 33 1
## 34 4
brks <- c(1,2,3,4)
cols <- c("#003D30", "#006650", "#00F5C0", "#00B890")
plot(support.sp, col=cols[findInterval(cond$V1, brks)])
legend("bottomleft", legend=c("HH – high surrounded by high",
"LH - low surrounded by high",
"LL - low surrounded by low",
"HL - high surrounded by low"),
fill=cols, bty="n", cex=0.80)
title(main="Mapping of Moranscatterplot results
Child mortality rate in 2019-21")
The aim to perform this test is to categorize the u5mortality variable into different groups (clusters) as per the severeness of mortality rates, and afterwards plot the map of India with the assigned colors.
# Defining range to assign clusters
fctr <- factor(cut(data1$u5mortality, breaks=c(0,20,40,60),
labels=c("low", "medium", "high")))
# Graphical parameters
brks1<-c(0,20,40,60)
cols<-c("green", "blue", "red")
# Plotting the scatter plot
plot(data1$u5mortality, bg=cols[findInterval(data1$u5mortality, brks1)], pch=21, ylab="Child Mortality Rate")
abline(h=c(20,40,60), lty=3)
The above chart shows the distribution of the variable “u5mortality” assigned to 3 different colors as per the range they belong to.
# Spatial distribution
plot(support.sp, col=cols[findInterval(data1$u5mortality, brks1)])
title(main="Child mortality in 2019-21")
legend("bottomleft", legend=c("low", "medium", "high"), leglabs(brks1), fill=cols, bty="n")
# Join-count test (based on colors)
joincount.test(fctr, cont.listw)
##
## Join count test under nonfree sampling
##
## data: fctr
## weights: cont.listw
##
## Std. deviate for low = -0.80333, p-value = 0.7891
## alternative hypothesis: greater
## sample estimates:
## Same colour statistic Expectation Variance
## 0.3333333 0.6363636 0.1422936
##
##
## Join count test under nonfree sampling
##
## data: fctr
## weights: cont.listw
##
## Std. deviate for medium = 1.5111, p-value = 0.06538
## alternative hypothesis: greater
## sample estimates:
## Same colour statistic Expectation Variance
## 6.3761905 5.1818182 0.6247368
##
##
## Join count test under nonfree sampling
##
## data: fctr
## weights: cont.listw
##
## Std. deviate for high = 2.6843, p-value = 0.003634
## alternative hypothesis: greater
## sample estimates:
## Same colour statistic Expectation Variance
## 1.9944444 0.8484848 0.1822492
The output of the join count test clearly shows that there is a significant association among provinces with high child mortality rates, i.e., similar colors locate next to each other more often than randomly. The result also depicts that there is no serious autocorrelation for the low and medium mortality rate provinces.
# Storing the formula in a variable for further use
fm <- u5mortality~antenatal_care+fully_vaccinated+violence+smoker+teenage_preg+
water+toilet_facility+schooling_f
GNS <- sacsarlm(fm, data=data1, listw=cont.listw, type="sacmixed", method="LU")
summary(GNS)
##
## Call:
## sacsarlm(formula = fm, data = data1, listw = cont.listw, type = "sacmixed",
## method = "LU")
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.56639 -3.67619 0.12385 2.89738 11.05194
##
## Type: sacmixed
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.781788 60.205173 1.2089 0.2267028
## antenatal_care 0.059609 0.113310 0.5261 0.5988398
## fully_vaccinated -0.097437 0.213602 -0.4562 0.6482724
## violence 0.037910 0.150154 0.2525 0.8006751
## smoker -0.129135 0.101353 -1.2741 0.2026241
## teenage_preg -0.603746 0.317851 -1.8995 0.0575038
## water -0.019448 0.241549 -0.0805 0.9358270
## toilet_facility -0.317909 0.223998 -1.4193 0.1558251
## schooling_f -1.101855 0.282367 -3.9022 9.532e-05
## lag.antenatal_care 0.216314 0.216797 0.9978 0.3183890
## lag.fully_vaccinated -0.181182 0.461054 -0.3930 0.6943386
## lag.violence 0.582396 0.309235 1.8833 0.0596537
## lag.smoker 0.666563 0.177634 3.7525 0.0001751
## lag.teenage_preg -1.771946 0.660936 -2.6810 0.0073410
## lag.water 0.037342 0.688233 0.0543 0.9567300
## lag.toilet_facility -0.710452 0.575705 -1.2341 0.2171824
## lag.schooling_f 1.381878 0.683373 2.0221 0.0431615
##
## Rho: 0.39459
## Approximate (numerical Hessian) standard error: 0.29832
## z-value: 1.3227, p-value: 0.18593
## Lambda: -0.12965
## Approximate (numerical Hessian) standard error: 0.43323
## z-value: -0.29927, p-value: 0.76474
##
## LR test value: 27.301, p-value: 0.0023331
##
## Log likelihood: -102.0364 for sacmixed model
## ML residual variance (sigma squared): 22.612, (sigma: 4.7552)
## Number of observations: 34
## Number of parameters estimated: 20
## AIC: 244.07, (AIC for lm: 251.37)
As we can see, the list of variables is quite long, however, the number of significant variables is only 4 (including 3 variables with lags). What is more noteworthy is that the AIC value for the model (244.07) is lower than the AIC value for simple regression (251.37). Clearly, there is a need to build some other models to compare which one would be the best.
SAC <- sacsarlm(fm, data=data1, listw=cont.listw)
summary(SAC)
##
## Call:sacsarlm(formula = fm, data = data1, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.92930 -4.36219 0.48047 3.56741 14.75252
##
## Type: sac
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 124.100751 31.881864 3.8925 9.921e-05
## antenatal_care 0.012265 0.117044 0.1048 0.91654
## fully_vaccinated -0.404578 0.203340 -1.9897 0.04663
## violence -0.169598 0.155588 -1.0900 0.27569
## smoker -0.184347 0.090260 -2.0424 0.04111
## teenage_preg -0.305141 0.329804 -0.9252 0.35485
## water 0.030243 0.255520 0.1184 0.90578
## toilet_facility -0.273600 0.220361 -1.2416 0.21438
## schooling_f -1.164280 0.289722 -4.0186 5.854e-05
##
## Rho: 0.15411
## Asymptotic standard error: 0.22978
## z-value: 0.67068, p-value: 0.50243
## Lambda: 0.63364
## Asymptotic standard error: 0.19174
## z-value: 3.3047, p-value: 0.00095065
##
## LR test value: 9.1395, p-value: 0.01036
##
## Log likelihood: -111.1173 for sac model
## ML residual variance (sigma squared): 35.481, (sigma: 5.9566)
## Number of observations: 34
## Number of parameters estimated: 12
## AIC: 246.23, (AIC for lm: 251.37)
Clearly, the results have improved as out of the 8 variables now, 3 are found to be statistically significant at the 5% level. Also, the AIC value, even though slightly, has increased to 246.23, however still remains lower than that of OLS model.
SDEM <- errorsarlm(fm, data=data1, listw=cont.listw, etype="emixed") # with spatial lags of X
summary(SDEM)
##
## Call:
## errorsarlm(formula = fm, data = data1, listw = cont.listw, etype = "emixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.61232 -3.71294 0.04674 2.70920 11.49521
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 96.336411 60.678420 1.5877 0.1123643
## antenatal_care 0.035512 0.112210 0.3165 0.7516409
## fully_vaccinated -0.105656 0.217092 -0.4867 0.6264810
## violence 0.021886 0.139462 0.1569 0.8753011
## smoker -0.087851 0.097598 -0.9001 0.3680542
## teenage_preg -0.687226 0.315581 -2.1777 0.0294317
## water 0.045328 0.243390 0.1862 0.8522592
## toilet_facility -0.366735 0.222687 -1.6469 0.0995858
## schooling_f -1.036141 0.280821 -3.6897 0.0002245
## lag.antenatal_care 0.160547 0.221742 0.7240 0.4690510
## lag.fully_vaccinated -0.312470 0.438620 -0.7124 0.4762219
## lag.violence 0.546807 0.313831 1.7424 0.0814446
## lag.smoker 0.608880 0.157338 3.8699 0.0001089
## lag.teenage_preg -1.979808 0.633488 -3.1253 0.0017765
## lag.water 0.276374 0.659731 0.4189 0.6752751
## lag.toilet_facility -0.804444 0.546004 -1.4733 0.1406623
## lag.schooling_f 0.956287 0.564725 1.6934 0.0903855
##
## Lambda: 0.24118, LR test value: 0.59478, p-value: 0.44058
## Asymptotic standard error: 0.20216
## z-value: 1.193, p-value: 0.23285
## Wald statistic: 1.4233, p-value: 0.23285
##
## Log likelihood: -102.7502 for error model
## ML residual variance (sigma squared): 24.324, (sigma: 4.932)
## Number of observations: 34
## Number of parameters estimated: 19
## AIC: 243.5, (AIC for lm: 242.1)
As it can be seen from the output, SDEM model performed almost at par with the Manski model with same number of variables being significant. However, a major difference in both these models is the AIC value as the AIC value here in case of SDEM is higher than that of OLS.
SEM <- errorsarlm(fm, data=data1, listw=cont.listw) # no spat-lags of X
summary(SEM)
##
## Call:errorsarlm(formula = fm, data = data1, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.14558 -4.00544 0.48186 2.84130 14.82028
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 130.344610 29.957400 4.3510 1.355e-05
## antenatal_care 0.031948 0.114021 0.2802 0.77933
## fully_vaccinated -0.420508 0.200618 -2.0961 0.03608
## violence -0.181425 0.154466 -1.1745 0.24018
## smoker -0.171460 0.088772 -1.9315 0.05342
## teenage_preg -0.308322 0.325313 -0.9478 0.34325
## water 0.041718 0.253890 0.1643 0.86948
## toilet_facility -0.288047 0.221945 -1.2978 0.19435
## schooling_f -1.193533 0.286870 -4.1605 3.175e-05
##
## Lambda: 0.69151, LR test value: 8.5755, p-value: 0.0034071
## Asymptotic standard error: 0.11808
## z-value: 5.8565, p-value: 4.7266e-09
## Wald statistic: 34.299, p-value: 4.7266e-09
##
## Log likelihood: -111.3993 for error model
## ML residual variance (sigma squared): 35.191, (sigma: 5.9322)
## Number of observations: 34
## Number of parameters estimated: 11
## AIC: 244.8, (AIC for lm: 251.37)
Comparing the results of first three models with the output of SEM model, it can be said that the performance of SEM model lies somewhere between that of SDEM and SARAR, with SARAR being the best one so far.
SDM <- lagsarlm(fm, data=data1, listw=cont.listw, type="mixed") # with spatial lags of X
summary(SDM)
##
## Call:lagsarlm(formula = fm, data = data1, listw = cont.listw, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.67417 -3.64573 0.28149 2.89565 11.31768
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 77.684177 59.955121 1.2957 0.195077
## antenatal_care 0.051327 0.109850 0.4672 0.640326
## fully_vaccinated -0.101675 0.211951 -0.4797 0.631435
## violence 0.026531 0.141176 0.1879 0.850932
## smoker -0.120552 0.096178 -1.2534 0.210050
## teenage_preg -0.617841 0.311698 -1.9822 0.047459
## water -0.006221 0.240792 -0.0258 0.979389
## toilet_facility -0.331651 0.220116 -1.5067 0.131885
## schooling_f -1.086144 0.277955 -3.9076 9.321e-05
## lag.antenatal_care 0.209680 0.214101 0.9794 0.327407
## lag.fully_vaccinated -0.198294 0.443905 -0.4467 0.655089
## lag.violence 0.573016 0.310340 1.8464 0.064832
## lag.smoker 0.643151 0.156912 4.0988 4.153e-05
## lag.teenage_preg -1.773865 0.623774 -2.8438 0.004458
## lag.water 0.061340 0.655918 0.0935 0.925492
## lag.toilet_facility -0.699743 0.556663 -1.2570 0.208742
## lag.schooling_f 1.279310 0.600272 2.1312 0.033071
##
## Rho: 0.33165, LR test value: 1.933, p-value: 0.16443
## Asymptotic standard error: 0.18723
## z-value: 1.7714, p-value: 0.0765
## Wald statistic: 3.1377, p-value: 0.0765
##
## Log likelihood: -102.0811 for mixed model
## ML residual variance (sigma squared): 23.06, (sigma: 4.8021)
## Number of observations: 34
## Number of parameters estimated: 19
## AIC: 242.16, (AIC for lm: 242.1)
## LM test for residual autocorrelation
## test value: 0.37861, p-value: 0.53835
The results here are a bit interesting because the number of significant variables has increased to 5 and also the AIC value is also very slightly higher than that of OLS. Moreover, the result of LM test shows that there is no serial correlation between residuals, which is a good thing. Surely, it seems that SDM is one of the better models.
SAR <- lagsarlm(fm, data=data1, listw=cont.listw) # no spatial lags of X
summary(SAR)
##
## Call:lagsarlm(formula = fm, data = data1, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.606197 -4.465981 -0.030338 5.021473 12.333193
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 89.826870 33.908199 2.6491 0.00807
## antenatal_care -0.086166 0.122574 -0.7030 0.48208
## fully_vaccinated -0.348909 0.223514 -1.5610 0.11852
## violence -0.106386 0.153649 -0.6924 0.48869
## smoker -0.095516 0.095325 -1.0020 0.31635
## teenage_preg -0.438030 0.361254 -1.2125 0.22531
## water 0.217527 0.282401 0.7703 0.44114
## toilet_facility -0.422860 0.209928 -2.0143 0.04398
## schooling_f -0.672705 0.261893 -2.5686 0.01021
##
## Rho: 0.40502, LR test value: 4.5127, p-value: 0.033644
## Asymptotic standard error: 0.14686
## z-value: 2.7578, p-value: 0.005819
## Wald statistic: 7.6055, p-value: 0.005819
##
## Log likelihood: -113.4307 for lag model
## ML residual variance (sigma squared): 44.268, (sigma: 6.6534)
## Number of observations: 34
## Number of parameters estimated: 11
## AIC: 248.86, (AIC for lm: 251.37)
## LM test for residual autocorrelation
## test value: 3.1098, p-value: 0.077825
The results seems to be alright but not as good as the results of SDM and SARAR models. So, let’s move onto the last spatial model for this analysis which is the Spatial Lag of X Model (SLX).
SLX <- lmSLX(fm, data=data1, listw=cont.listw)
summary(SLX)
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1619 -3.5336 -0.0098 2.9391 11.0956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.80913 86.72357 0.978 0.3418
## antenatal_care 0.05057 0.16212 0.312 0.7589
## fully_vaccinated -0.08678 0.31254 -0.278 0.7846
## violence 0.04832 0.20838 0.232 0.8194
## smoker -0.08980 0.14029 -0.640 0.5306
## teenage_preg -0.68040 0.45745 -1.487 0.1552
## water 0.04208 0.35397 0.119 0.9068
## toilet_facility -0.34256 0.32467 -1.055 0.3061
## schooling_f -1.06076 0.40998 -2.587 0.0192 *
## lag.antenatal_care 0.13510 0.31278 0.432 0.6712
## lag.fully_vaccinated -0.31839 0.64856 -0.491 0.6298
## lag.violence 0.57073 0.45387 1.257 0.2256
## lag.smoker 0.66177 0.23017 2.875 0.0105 *
## lag.teenage_preg -2.10728 0.91323 -2.308 0.0339 *
## lag.water 0.37830 0.95008 0.398 0.6955
## lag.toilet_facility -0.89070 0.81427 -1.094 0.2893
## lag.schooling_f 1.09749 0.85120 1.289 0.2145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.088 on 17 degrees of freedom
## Multiple R-squared: 0.8579, Adjusted R-squared: 0.7241
## F-statistic: 6.413 on 16 and 17 DF, p-value: 0.0002101
The output above clearly shows that the SLX model has performed fairly enough.
However, as there is not a single standout model, we will have to finalize the model on the basis of various things like AIC, variables and error significance.
screenreg(list(GNS, SAC, SDEM, SEM, SDM, SAR, SLX), custom.model.names=c("GNS", "SAC", "SDEM", "SEM", "SDM", "SAR", "SLX"))
##
## ============================================================================================================
## GNS SAC SDEM SEM SDM SAR SLX
## ------------------------------------------------------------------------------------------------------------
## (Intercept) 72.78 124.10 *** 96.34 130.34 *** 77.68 89.83 ** 84.81
## (60.21) (31.88) (60.68) (29.96) (59.96) (33.91) (86.72)
## antenatal_care 0.06 0.01 0.04 0.03 0.05 -0.09 0.05
## (0.11) (0.12) (0.11) (0.11) (0.11) (0.12) (0.16)
## fully_vaccinated -0.10 -0.40 * -0.11 -0.42 * -0.10 -0.35 -0.09
## (0.21) (0.20) (0.22) (0.20) (0.21) (0.22) (0.31)
## violence 0.04 -0.17 0.02 -0.18 0.03 -0.11 0.05
## (0.15) (0.16) (0.14) (0.15) (0.14) (0.15) (0.21)
## smoker -0.13 -0.18 * -0.09 -0.17 -0.12 -0.10 -0.09
## (0.10) (0.09) (0.10) (0.09) (0.10) (0.10) (0.14)
## teenage_preg -0.60 -0.31 -0.69 * -0.31 -0.62 * -0.44 -0.68
## (0.32) (0.33) (0.32) (0.33) (0.31) (0.36) (0.46)
## water -0.02 0.03 0.05 0.04 -0.01 0.22 0.04
## (0.24) (0.26) (0.24) (0.25) (0.24) (0.28) (0.35)
## toilet_facility -0.32 -0.27 -0.37 -0.29 -0.33 -0.42 * -0.34
## (0.22) (0.22) (0.22) (0.22) (0.22) (0.21) (0.32)
## schooling_f -1.10 *** -1.16 *** -1.04 *** -1.19 *** -1.09 *** -0.67 * -1.06 *
## (0.28) (0.29) (0.28) (0.29) (0.28) (0.26) (0.41)
## lag.antenatal_care 0.22 0.16 0.21 0.14
## (0.22) (0.22) (0.21) (0.31)
## lag.fully_vaccinated -0.18 -0.31 -0.20 -0.32
## (0.46) (0.44) (0.44) (0.65)
## lag.violence 0.58 0.55 0.57 0.57
## (0.31) (0.31) (0.31) (0.45)
## lag.smoker 0.67 *** 0.61 *** 0.64 *** 0.66 *
## (0.18) (0.16) (0.16) (0.23)
## lag.teenage_preg -1.77 ** -1.98 ** -1.77 ** -2.11 *
## (0.66) (0.63) (0.62) (0.91)
## lag.water 0.04 0.28 0.06 0.38
## (0.69) (0.66) (0.66) (0.95)
## lag.toilet_facility -0.71 -0.80 -0.70 -0.89
## (0.58) (0.55) (0.56) (0.81)
## lag.schooling_f 1.38 * 0.96 1.28 * 1.10
## (0.68) (0.56) (0.60) (0.85)
## rho 0.39 0.15 0.33 0.41 **
## (0.30) (0.23) (0.19) (0.15)
## lambda -0.13 0.63 *** 0.24 0.69 ***
## (0.43) (0.19) (0.20) (0.12)
## ------------------------------------------------------------------------------------------------------------
## Num. obs. 34 34 34 34 34 34
## Parameters 20 12 19 11 19 11
## Log Likelihood -102.04 -111.12 -102.75 -111.40 -102.08 -113.43 -103.05
## AIC (Linear model) 251.37 251.37 242.10 251.37 242.10 251.37
## AIC (Spatial model) 244.07 246.23 243.50 244.80 242.16 248.86
## LR test: statistic 27.30 9.14 0.59 8.58 1.93 4.51
## LR test: p-value 0.00 0.01 0.44 0.00 0.16 0.03
## R^2 0.86
## Adj. R^2 0.72
## Sigma 7.09
## Statistic 6.41
## P Value 0.00
## DF 16.00
## AIC 242.10
## BIC 269.57
## Deviance 854.15
## DF Resid. 17
## nobs 34
## ============================================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Looking at the output and comparing the results, it can be said that some of the spatially lagged variables are significant. This means it would be better if the lagged variables are taken into account for the spatial analysis of the problem. As per the above results, SDM and SDEM are seemingly the best models.
The objective of the research was to analyze the problem regarding childhood mortality and check if there is a scope of using spatial analysis methods. As we found out by the Moran’s test that there was indeed existence of spatial correlation, which meant spatial models would be better for more reliable and comprehensive analysis.
Moreover, the provinces on the central-eastern part of the country, especially Uttar Pradesh and Bihar, were found to be the most affected ones with the highest childhood mortality rates. As per the two best selected models, out of all the variables, teenage pregnancy rates of the provinces, schooling levels for females, lagged variable of smoker, among others were found to be statistically significant.