This section describes procedures that yield stable small area estimators for each of \(D\) areas over \(T\) subsequent time instants. Area populations, samples and data may change between time periods. Accordingly, we denote by \(U_t\) the overall population at time \(t\), which is partitioned into \(D\) areas \(U_{1t}, \ldots, U_{Dt}\) of respective population sizes \(N_{1t}, \ldots, N_{Dt}\).
An overview of the MFH model estimation process is as follows:
Step 1: Compute the selected direct area estimators for each target area \(d = 1, ..., D\) for each time \(t = 1, ..., T\) and estimators of their corresponding sampling variances and covariances.
Step 2: Prepare variables (from administrative data or other sources of data representative at the target area level) at the level of the target area for each time instant in the MFH model. We present a simple approach which performs model selection in a pooled linear regression model without time effects.
Step 3: Fit the MFH models and test whether the area-time effects \((\mu_{d1}, \ldots, \mu_{dT})\) are homoskedastic. If we reject homoskedasticity of the variances, implement the MFH3 model; otherwise, proceed with the MFH2 model. For the purposes of this training, we assume a homoskedastic model for simplicity.
Step 4: Check the selected model assumptions, including linearity, normality of the predicted area effects and of the standardized model residuals, and the presence of outlying areas.
Step 5: In case of systematic model departures, or isolated departures due to outlying areas, some adjustments may need to be implemented before returning to Step 2 and recomputing the MFH model.
Step 6: If the model assumptions hold, compute the MFH estimators \(\hat{\delta}_{dt}^{MFH}\), \(d = 1, \ldots, D\), \(t = 1, \ldots, T\), and their corresponding estimated MSEs, using the above direct estimates, the estimated sampling variances and covariances, and the selected auxiliary variables.
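For orientation, a schematic form of the multivariate FH model implied by the steps above (following the area-level models on which the msae functions are based; see the package documentation for the exact covariance structures) is

\[
\hat{\delta}_{dt}^{DIR} = \delta_{dt} + e_{dt}, \qquad \delta_{dt} = \mathbf{x}_{dt}^{\top}\boldsymbol{\beta}_{t} + \mu_{dt}, \qquad d = 1, \ldots, D, \; t = 1, \ldots, T,
\]

where \(\hat{\delta}_{dt}^{DIR}\) is the direct estimator from Step 1, the vectors of sampling errors \(\mathbf{e}_{d} = (e_{d1}, \ldots, e_{dT})^{\top}\) are independent with known (estimated) covariance matrices \(\mathbf{V}_{ed}\), and the area-time effects \(\mu_{dt}\) are autocorrelated over \(t\), with a common variance under MFH2 (homoskedastic case) and time-specific variances under MFH3 (heteroskedastic case).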
We show below how to use the eblupMFH2() and eblupMFH3() functions from the R package msae [@permatasari2022package] to compute the EBLUPs and their MSE estimates under MFH models 2 and 3, respectively. The calls to these functions are:
eblupMFH2(formula, vardir, MAXITER = 100, PRECISION = 1e-04, data)
eblupMFH3(formula, vardir, MAXITER = 100, PRECISION = 1e-04, data)
In this example, we use a synthetic data set adapted from the incomedata data set in the R package sae. The original data
contains information for \(n = 17,119\)
fictitious individuals residing across \(D =
52\) Spanish provinces. The variables include the name of the
province of residence (provlab), province code
(prov), as well as several correlates of income. We have
added two additional income vectors corresponding to two additional
years of data.
We will show how to estimate a poverty map for each year using the multivariate Fay-Herriot (MFH) modelling approach. This approach allows us to take advantage of the temporal correlation between poverty rates; for example, an individual's income in year \(t\) is likely correlated with their income in year \(t+1\).
The rest of this tutorial shows how to prepare MFH models using a
random 10% sample of the incomedata to estimate the poverty
rates.
if (sum(installed.packages()[,1] %in% "pacman") != 1){
install.packages("pacman")
}
pacman::p_load(sf, data.table, tidyverse, car, msae,
sae, survey, spdep, knitr, MASS, caret,
purrr, here)
source(here::here("scripts/Function_pbmcpeMFH2_test.R"))
income_dt <- readRDS(here::here("data/incomedata_survey.RDS"))
prov_dt <- readRDS(here::here("data/shapes/simadmin.RDS"))
shp_dt <- readRDS(here::here("data/shapes/spainshape.RDS"))
glimpse(income_dt)
## Rows: 4,298
## Columns: 9
## Groups: provlab [52]
## $ provlab <fct> Alava, Alava, Alava, Alava, Alava, Alava, Alava, Alava, Al…
## $ prov <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ income2012 <dbl> 26125.271, 27404.099, 13117.838, 27920.737, 8705.592, 6340…
## $ income2013 <dbl> 27500.673, 28174.985, 13581.315, 28508.978, 9182.841, 6272…
## $ income2014 <dbl> 34739.733, 32137.846, 15980.581, 31512.153, 11699.903, 593…
## $ povline2012 <dbl> 6477.484, 6477.484, 6477.484, 6477.484, 6477.484, 6477.484…
## $ povline2013 <dbl> 6515.865, 6515.865, 6515.865, 6515.865, 6515.865, 6515.865…
## $ povline2014 <dbl> 6515.865, 6515.865, 6515.865, 6515.865, 6515.865, 6515.865…
## $ weight <dbl> 7424.759, 23577.245, 41976.969, 19471.566, 12738.217, 2463…
We will use direct Horvitz-Thompson (HT) estimators based on the survey weights in the weight variable. First, we compute the sample size for each province and load the province/target area population sizes from the sizeprov data set. For those working with household- or individual-level survey data, the population sizes may be obtained from the sum of the household or individual weights, as appropriate.
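For reference, the direct head-count (poverty rate) estimator for area \(d\) at time \(t\) is the weighted proportion of sampled individuals with income below the poverty line \(z_t\),

\[
\hat{\delta}_{dt}^{DIR} = \frac{\sum_{j \in s_{dt}} w_{jt}\, I\!\left(y_{jt} < z_{t}\right)}{\sum_{j \in s_{dt}} w_{jt}},
\]

where \(s_{dt}\) denotes the sample in area \(d\) at time \(t\), \(w_{jt}\) the survey weights and \(y_{jt}\) the incomes. This is the quantity computed (together with its estimated variance) by the emdi::direct() call below.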
data(sizeprov)
## quickly compute sample size for each province
sampsize_dt <-
income_dt |>
group_by(prov) |>
summarize(N = n())
## the poverty line for each year is already included within the data.
## Let's compute the direct estimates for each year of data and create a list of data.frames (equal in length to the number of years)
## containing the direct estimate, the standard errors and the coefficient of variation.
direct_list <-
mapply(FUN = function(y, threshold){
z <- emdi::direct(y = y,
smp_data = income_dt %>% as.data.table(),
smp_domains = "prov",
weights = "weight",
threshold = unique(income_dt[[threshold]]),
var = TRUE)
z <-
z$ind |>
dplyr::select(Domain, Head_Count) |>
rename(Direct = "Head_Count") |>
merge(z$MSE |>
dplyr::select(Domain, Head_Count) |>
rename(MSE = "Head_Count"),
by = "Domain") |>
mutate(SD = sqrt(MSE)) |>
mutate(CV = SD / Direct) |>
merge(sampsize_dt |>
mutate(prov = as.factor(prov)),
by.x = "Domain",
by.y = "prov") |>
mutate(var_SRS = Direct * (1 - Direct) / N) |>
mutate(deff = MSE / var_SRS) |>
mutate(n_eff = N / deff)
return(z)
}, SIMPLIFY = FALSE,
y = c("income2012", "income2013", "income2014"),
threshold = c("povline2012", "povline2013", "povline2014"))
direct_list %>%
lapply(X = .,
FUN = function(x){
y <- head(x)
y %>% kable()
})
## $income2012
##
##
## |Domain | Direct| MSE| SD| CV| N| var_SRS| deff| n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|--------:|
## |1 | 0.2439406| 0.0110510| 0.1051237| 0.4309396| 24| 0.0076847| 1.4380443| 16.68933|
## |10 | 0.2571874| 0.0031061| 0.0557327| 0.2167007| 70| 0.0027292| 1.1381234| 61.50476|
## |11 | 0.1009752| 0.0010541| 0.0324664| 0.3215285| 100| 0.0009078| 1.1611329| 86.12278|
## |12 | 0.1862899| 0.0051402| 0.0716952| 0.3848579| 30| 0.0050529| 1.0172832| 29.49031|
## |13 | 0.1538687| 0.0020267| 0.0450187| 0.2925786| 62| 0.0020999| 0.9651377| 64.23953|
## |14 | 0.2485756| 0.0040143| 0.0633589| 0.2548876| 56| 0.0033355| 1.2035352| 46.52959|
##
## $income2013
##
##
## |Domain | Direct| MSE| SD| CV| N| var_SRS| deff| n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|---------:|
## |1 | 0.3117905| 0.0151210| 0.1229675| 0.3943914| 24| 0.0089407| 1.6912517| 14.19067|
## |10 | 0.2735087| 0.0025398| 0.0503964| 0.1842587| 70| 0.0028386| 0.8947362| 78.23535|
## |11 | 0.1174984| 0.0009567| 0.0309310| 0.2632464| 100| 0.0010369| 0.9226596| 108.38233|
## |12 | 0.1862899| 0.0051402| 0.0716952| 0.3848579| 30| 0.0050529| 1.0172832| 29.49031|
## |13 | 0.1617958| 0.0020810| 0.0456177| 0.2819463| 62| 0.0021874| 0.9513534| 65.17032|
## |14 | 0.2353446| 0.0035808| 0.0598402| 0.2542663| 56| 0.0032135| 1.1143047| 50.25555|
##
## $income2014
##
##
## |Domain | Direct| MSE| SD| CV| N| var_SRS| deff| n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|--------:|
## |1 | 0.4142299| 0.0141892| 0.1191185| 0.2875661| 24| 0.0101101| 1.4034626| 17.10056|
## |10 | 0.4206141| 0.0032788| 0.0572612| 0.1361372| 70| 0.0034814| 0.9418188| 74.32428|
## |11 | 0.2112138| 0.0025982| 0.0509725| 0.2413312| 100| 0.0016660| 1.5595162| 64.12245|
## |12 | 0.4128282| 0.0122581| 0.1107162| 0.2681896| 30| 0.0080800| 1.5170830| 19.77479|
## |13 | 0.2411189| 0.0036303| 0.0602519| 0.2498847| 62| 0.0029513| 1.2300663| 50.40379|
## |14 | 0.3264915| 0.0044502| 0.0667098| 0.2043232| 56| 0.0039267| 1.1333194| 49.41237|
Next, we estimate the sampling variances and covariances of the direct estimators using the survey R package as follows:
## quickly creating the poverty indicator
income_dt <-
income_dt |>
mutate(poor2012 = ifelse(income2012 < povline2012, 1, 0),
poor2013 = ifelse(income2013 < povline2013, 1, 0),
poor2014 = ifelse(income2014 < povline2014, 1, 0))
### a simple worker function for computing the variance covariance matrix of the sampling error of the direct estimate
compute_vcov <- function(dt,
domain,
ids,
weights = NULL,
fpc = NULL,
strata = NULL,
probs = NULL,
yvars,
deff = FALSE,
...){
## a little worker function to convert expressions that are character to the right format for the survey package
convert_expr <- function(x){
y <- if (is.character(x)){
rlang::expr(~!!sym(x))
} else {
rlang::expr(~ !!x)
}
y <- eval(y)
return(y)
}
### run it on the proper arguments
ids <- convert_expr(x = ids)
if (is.null(weights)){
dt$weights <- 1
weights <- "weights" ## point to the constant weight column just created
}
weights <- convert_expr(x = weights)
if (!is.null(fpc)){
fpc <- convert_expr(x = fpc)
}
if (!is.null(probs)){
probs <- convert_expr(x = probs)
}
if (!is.null(strata)){
strata <- convert_expr(x = strata)
}
### create a survey design object from the survey package
dom_list <- unique(dt[[domain]])
surv_vcov <- function(x){
design_obj <- svydesign(ids = ids,
probs = probs,
strata = strata,
fpc = fpc,
data = dt |>
dplyr::filter(!!sym(domain) == x),
weights = weights)
### prepare the y formula for the svymean function
yform <- reformulate(yvars)
var_dt <- svymean(yform, design_obj, ...) ## use the svymean object to compute variance
var_dt <- vcov(var_dt) ### compute variance covariance matrix
var_dt <- as.numeric(c(diag(var_dt), var_dt[lower.tri(var_dt, diag = FALSE)]))
pair_list <- c(lapply(yvars, rep, 2),
combn(yvars, 2, simplify = FALSE))
pair_list <- lapply(pair_list,
function(x){
zz <- paste0("v_", paste(x, collapse = ""))
return(zz)
})
pair_list <- unlist(pair_list)
names(var_dt) <- pair_list
return(var_dt)
}
vcov_list <- lapply(X = dom_list,
FUN = surv_vcov)
var_dt <- Reduce(f = "rbind",
x = vcov_list) %>%
as_tibble() %>%
mutate(domain = dom_list)
return(var_dt)
}
### running the compute_vcov function to prepare the variance covariance matrix
var_dt <-
compute_vcov(dt = income_dt,
domain = "prov",
ids = 1,
weights = "weight",
yvars = c("poor2012", "poor2013", "poor2014"))
### sample size
var_dt <-
var_dt |>
merge(sampsize_dt, by.x = "domain", by.y = "prov")
A quick inspection of the preceding results shows that some provinces have small sample sizes, which sometimes produce extreme poverty rates (at or near 0 or 1) and hence estimated variances of zero; a minimal check is sketched below. To avoid this, we apply the variance smoothing method suggested by [@you2023application]. Please see the code and Roxygen comments that follow, explaining the use of the varsmoothie_king() function, which computes smoothed variances.
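A minimal sketch of such a check, using the var_dt object created above (its v_ columns and the per-province sample sizes N):
## flag provinces whose estimated variances or covariances collapse to (near) zero
var_dt |>
dplyr::filter(if_any(starts_with("v_"), ~ abs(.x) <= 1e-4)) |>
dplyr::select(domain, N, starts_with("v_"))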
#' A function to perform variance smoothing
#'
#' The variance smoothing function applies the methodology of Hidiroglou and You (2023),
#' which uses a simple log-linear regression to smooth the direct variance estimates of the
#' sample poverty rates; this is useful for replacing unstable (or zero) variance estimates
#' in areas with small sample sizes.
#'
#' @param domain a vector of unique domain/target areas
#' @param direct_var the raw variances estimated from the sample data
#' @param sampsize the sample size for each domain
#'
#' @export
varsmoothie_king <- function(domain,
direct_var,
sampsize){
dt <- data.table(domain = domain,
var = direct_var,
n = sampsize)
dt$log_n <- log(dt$n)
dt$log_var <- log(dt$var)
lm_model <- lm(formula = log_var ~ log_n,
data = dt[!(abs(dt$log_var) == Inf),])
dt$pred_var <- predict(lm_model, newdata = dt)
residual_var <- summary(lm_model)$sigma^2
dt$var_smooth <- exp(dt$pred_var) * exp(residual_var/2)
return(dt[, c("domain", "var_smooth"), with = F])
}
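In notation, varsmoothie_king() fits the log-linear smoothing model

\[
\log \hat{V}_{d}^{DIR} = \alpha_{0} + \alpha_{1} \log n_{d} + \varepsilon_{d}, \qquad \varepsilon_{d} \sim N(0, \sigma_{\varepsilon}^{2}),
\]

over the domains with finite \(\log \hat{V}_{d}^{DIR}\), and returns the back-transformed smoothed variances

\[
\tilde{V}_{d} = \exp\!\left(\hat{\alpha}_{0} + \hat{\alpha}_{1} \log n_{d}\right)\exp\!\left(\hat{\sigma}_{\varepsilon}^{2}/2\right),
\]

where the factor \(\exp(\hat{\sigma}_{\varepsilon}^{2}/2)\) corrects for the bias incurred when exponentiating a prediction made on the log scale.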
The goal now is to use the varsmoothie_king() function defined above to add columns of smoothed variances to our var_dt object.
vardir <- grep("^v_", names(var_dt), value = TRUE)
var_dt <-
lapply(X = vardir,
FUN = function(x){
z <- varsmoothie_king(domain = var_dt[["domain"]],
direct_var = var_dt[[x]],
sampsize = var_dt[["N"]]) |>
as.data.table() |>
setnames(old = "var_smooth", new = paste0("vs", x)) |>
as_tibble()
return(z)
}) %>%
Reduce(f = "merge",
x = .) %>%
merge(x = var_dt,
y = .,
by = "domain") |>
as_tibble()
## Warning in log(dt$var): NaNs produced
## Warning in log(dt$var): NaNs produced
More (and probably better) correlates of income and poverty can be created to improve the predictive power of the model. The next step is to assemble the variables at the target area level. The MFH model is a model of poverty rates at the target area level, hence the data set required for this exercise has the province as its unit of observation. Its essential columns are the domain identifiers (prov and provlab), the direct estimates for each year, the sampling variances and covariances, and the candidate auxiliary variables.
Next, we apply a simple variable selection process which employs the stepwise regression algorithm with the AIC selection criterion, as described by [@yamashita2007stepwise]. The function step_wrapper() implemented below is a wrapper around stepAIC() that carries out the preliminary cleaning needed before stepAIC() can be used. This includes dropping columns that are entirely missing (NA), keeping only complete cases/observations, and removing perfectly or nearly collinear variables and combinations (via aliasing checks, detection of linear combinations, and a pairwise correlation threshold).
#' A function to perform stepwise variable selection with AIC selection criteria
#'
#' @param dt data.frame, dataset containing the set of outcome and independent variables
#' @param xvars character vector, the set of x variables
#' @param y chr, the name of the y variable
#' @param cor_thresh numeric, the pairwise correlation threshold above which one of a pair of highly correlated variables is dropped
#'
#' @import data.table
#' @importFrom caret findLinearCombos
#' @importFrom MASS stepAIC
step_wrapper <- function(dt, xvars, y, cor_thresh = 0.95) {
dt <- as.data.table(dt)
# Drop columns that are entirely NA
dt <- dt[, which(unlist(lapply(dt, function(x) !all(is.na(x))))), with = FALSE]
xvars <- xvars[xvars %in% colnames(dt)]
# Keep only complete cases
dt <- na.omit(dt[, c(y, xvars), with = FALSE])
# Step 1: Remove aliased (perfectly collinear) variables
model_formula <- as.formula(paste(y, "~", paste(xvars, collapse = " + ")))
lm_model <- lm(model_formula, data = dt)
aliased <- is.na(coef(lm_model))
if (any(aliased)) {
xvars <- names(aliased)[!aliased & names(aliased) != "(Intercept)"]
}
# Step 2: Remove near-linear combinations
xmat <- as.matrix(dt[, ..xvars])
combo_check <- tryCatch(findLinearCombos(xmat), error = function(e) NULL)
if (!is.null(combo_check) && length(combo_check$remove) > 0) {
xvars <- xvars[-combo_check$remove]
xmat <- as.matrix(dt[, ..xvars])
}
# Step 3: Drop highly correlated variables
cor_mat <- abs(cor(xmat))
diag(cor_mat) <- 0
while (any(cor_mat > cor_thresh, na.rm = TRUE)) {
cor_pairs <- which(cor_mat == max(cor_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]
var1 <- colnames(cor_mat)[cor_pairs[1]]
var2 <- colnames(cor_mat)[cor_pairs[2]]
# Drop the variable with higher mean correlation
drop_var <- if (mean(cor_mat[var1, ]) > mean(cor_mat[var2, ])) var1 else var2
xvars <- setdiff(xvars, drop_var)
xmat <- as.matrix(dt[, ..xvars])
cor_mat <- abs(cor(xmat))
diag(cor_mat) <- 0
}
# Step 4: Warn if still ill-conditioned
cond_number <- kappa(xmat, exact = TRUE)
if (cond_number > 1e10) {
warning("Design matrix is ill-conditioned (condition number > 1e10). Consider reviewing variable selection.")
}
# Final model fit
model_formula <- as.formula(paste(y, "~", paste(xvars, collapse = " + ")))
full_model <- lm(model_formula, data = dt)
# Stepwise selection
stepwise_model <- stepAIC(full_model, direction = "both", trace = 0)
return(stepwise_model)
}
We apply the variable selection process to the first time period (poor2012) and reuse the selected covariates in each time period's model. See the application below:
### the set of candidate covariates (all province-level variables except the identifiers)
candidate_vars <- colnames(prov_dt)[!colnames(prov_dt) %in% c("prov", "provlab")]
### creating the province level data for variable selection and poverty mapping
prov_dt <-
map2(.x = list("poor2012", "poor2013", "poor2014"), ### first we create a direct estimate dataset from direct_list
.y = direct_list,
~ .y |>
rename(!!sym(.x) := Direct) |>
dplyr::select(Domain, all_of(.x))) |>
Reduce(f = "merge") |> ### merge the selected variables from each dataframe of the list
merge(prov_dt, ### merging the prepared simulated administrative data variables
by.x = "Domain", by.y = "prov", all = TRUE) |>
merge(var_dt,
by.x = "Domain", by.y = "domain", all = TRUE)
fh_step <-
step_wrapper(dt = prov_dt,
xvars = candidate_vars,
y = "poor2012",
cor_thresh = 0.7)
fh_step <- names(fh_step$coefficients)[!grepl("(Intercept)", names(fh_step$coefficients))]
fh_step <- as.formula(paste0("poor2012 ~ ", paste(fh_step, collapse = " + ")))
rhs <- paste(deparse(fh_step[[3]]), collapse = " ")
outcome_list <- list("poor2012", "poor2013", "poor2014")
# Then generate the list of formulas
mfh_formula <-
lapply(outcome_list,
function(x) {
as.formula(paste(x, "~", rhs))
})
names(mfh_formula) <- outcome_list
Let's run linear regressions with the set of selected variables to get a sense of the predictive power of each year's model independently.
lmcheck_obj <-
lapply(X = mfh_formula,
FUN = lm,
data = prov_dt)
lapply(X = lmcheck_obj,
FUN = summary)
## $poor2012
##
## Call:
## FUN(formula = X[[i]], data = ..1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12244 -0.03693 0.01055 0.03833 0.09114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.406566 0.190009 2.140 0.038233 *
## gen -0.194001 0.121391 -1.598 0.117508
## age5 0.378420 0.131694 2.873 0.006343 **
## labor1 0.231915 0.120111 1.931 0.060272 .
## labor2 0.532883 0.289103 1.843 0.072360 .
## aec -0.012975 0.009338 -1.389 0.172025
## mkt 0.032275 0.008521 3.788 0.000478 ***
## age3_X_gen -0.213792 0.078529 -2.722 0.009395 **
## age2_X_educ3 1.626525 0.885467 1.837 0.073306 .
## age3_X_educ3 0.354384 0.165834 2.137 0.038468 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05718 on 42 degrees of freedom
## Multiple R-squared: 0.6163, Adjusted R-squared: 0.5341
## F-statistic: 7.495 on 9 and 42 DF, p-value: 1.94e-06
##
##
## $poor2013
##
## Call:
## FUN(formula = X[[i]], data = ..1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12999 -0.04099 0.01257 0.04400 0.09036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.47048 0.21049 2.235 0.0308 *
## gen -0.21470 0.13448 -1.597 0.1179
## age5 0.32666 0.14589 2.239 0.0305 *
## labor1 0.14109 0.13306 1.060 0.2950
## labor2 0.35688 0.32027 1.114 0.2715
## aec -0.01335 0.01034 -1.290 0.2040
## mkt 0.03632 0.00944 3.847 0.0004 ***
## age3_X_gen -0.17262 0.08699 -1.984 0.0538 .
## age2_X_educ3 1.60622 0.98091 1.637 0.1090
## age3_X_educ3 0.34192 0.18371 1.861 0.0697 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06334 on 42 degrees of freedom
## Multiple R-squared: 0.5664, Adjusted R-squared: 0.4734
## F-statistic: 6.095 on 9 and 42 DF, p-value: 1.928e-05
##
##
## $poor2014
##
## Call:
## FUN(formula = X[[i]], data = ..1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18919 -0.03379 -0.00547 0.04263 0.15614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.57338 0.25506 2.248 0.0299 *
## gen -0.07116 0.16295 -0.437 0.6646
## age5 -0.11779 0.17678 -0.666 0.5089
## labor1 -0.28669 0.16123 -1.778 0.0826 .
## labor2 0.56014 0.38808 1.443 0.1563
## aec -0.02156 0.01254 -1.720 0.0928 .
## mkt 0.02276 0.01144 1.990 0.0531 .
## age3_X_gen -0.14322 0.10541 -1.359 0.1815
## age2_X_educ3 -0.41151 1.18861 -0.346 0.7309
## age3_X_educ3 0.43581 0.22261 1.958 0.0569 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07675 on 42 degrees of freedom
## Multiple R-squared: 0.4368, Adjusted R-squared: 0.3161
## F-statistic: 3.619 on 9 and 42 DF, p-value: 0.002054
Next, we show how to use the msae R package to estimate the Empirical Best Linear Unbiased Predictor (EBLUP) for the poverty map using eblupMFH2(), which allows for time-series Fay-Herriot estimation under the homoskedasticity assumption. For comparison, we also fit the univariate Fay-Herriot models with the eblupUFH() function, as well as the multivariate Fay-Herriot model 1 with eblupMFH1().
### the sampling variances and covariances were already merged into prov_dt above; if needed, they could be added as follows:
# prov_dt <-
# prov_dt |>
# merge(var_dt, by.x = "prov", by.y = "domain")
### variance-covariance matrix columns
varcols <- colnames(var_dt)[grepl(pattern = "^v_", x = colnames(var_dt))]
## replace the variances-covariances that are zero with their smoothed counterparts
prov_dt <-
prov_dt |>
mutate(across(
starts_with("v_"),
~ if_else(abs(.x) <= 1e-4, get(paste0("vsv", str_remove(cur_column(), "^v"))), .x),
.names = "{.col}"
))
#### now we estimate the models, including the multivariate Fay-Herriot models (the MFH3 fit is left commented out since we proceed with the homoskedastic MFH2)
model0_obj <- eblupUFH(mfh_formula, vardir = varcols, data = prov_dt)
model1_obj <- eblupMFH1(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
model2_obj <- eblupMFH2(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
# model3_obj <- eblupMFH3(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
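After fitting, it is useful to inspect the convergence status, the estimated regression coefficients and the estimated random effect variances. A quick check, assuming the fit element returned by the msae functions has its usual structure (see help(eblupMFH2)):
model2_obj$fit$convergence ## did the estimation algorithm converge?
model2_obj$fit$estcoef ## estimated regression coefficients with standard errors
model2_obj$fit$refvar ## estimated random effect variance(s)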
With an MFH3 fit, we can check whether the variance-covariance structure points to heteroskedastic random effects. We would do so as follows (left commented out here, since we proceed with the homoskedastic MFH2 model):
# model3_obj$fit$refvarTest
This tests the null hypothesis that the variances \(\sigma_t^2\) at each pair of time instants \(t\) and \(s\) are equal against the alternative that they are not. If, at a significance level of 0.05, we rejected the equality of variances for some pair of years (for example, between \(t = 2013\) and \(t = 2014\)), we would use the MFH3 model (the eblupMFH3() function). If the tests supported the equality of variances, we would instead use eblupMFH2() for estimation, as we do in the remainder of this tutorial.
We now verify the assumptions of the MFH2 model. This includes assessing linearity, the normality of the predicted area effects and of the standardized residuals, as well as checking for the presence of outlying areas.
We first check the linearity assumption. This can be assessed by regressing the model residuals against the predicted values (EBLUPs).
resids_3 <- cbind(prov_dt$poor2012 - model2_obj$eblup$poor2012,
prov_dt$poor2013 - model2_obj$eblup$poor2013,
prov_dt$poor2014 - model2_obj$eblup$poor2014)
layout(matrix(1:3, nrow = 1, byrow = TRUE))
plot(model2_obj$eblup$poor2012, resids_3[,1], pch = 19, xlab = "EBLUPs t = 1", ylab = "Residuals t = 1")
plot(model2_obj$eblup$poor2013, resids_3[,2], pch = 19, xlab = "EBLUPs t = 2", ylab = "Residuals t = 2")
plot(model2_obj$eblup$poor2014, resids_3[,3], pch = 19, xlab = "EBLUPs t = 3", ylab = "Residuals t = 3")
fits_3 <- model2_obj$eblup # EBLUPs (predicted values)
# Run regression of residuals on fitted values for each time period
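# Note: without wrapping the quadratic terms in I(), R's formula interface absorbs
# poor20xx^2 into the main effect, so these fits effectively contain only the
# linear term (as the summaries below confirm).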
lm_resid_fit_t1 <- lm(resids_3[,1] ~ model2_obj$eblup$poor2012 + model2_obj$eblup$poor2012^2)
lm_resid_fit_t2 <- lm(resids_3[,2] ~ model2_obj$eblup$poor2013 + model2_obj$eblup$poor2013^2)
lm_resid_fit_t3 <- lm(resids_3[,3] ~ model2_obj$eblup$poor2014 + model2_obj$eblup$poor2014^2)
# View summaries
summary(lm_resid_fit_t1)
##
## Call:
## lm(formula = resids_3[, 1] ~ model2_obj$eblup$poor2012 + model2_obj$eblup$poor2012^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19940 -0.03052 0.00796 0.03862 0.08945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03395 0.02327 1.459 0.151
## model2_obj$eblup$poor2012 -0.10025 0.11095 -0.904 0.371
##
## Residual standard error: 0.0556 on 50 degrees of freedom
## Multiple R-squared: 0.01607, Adjusted R-squared: -0.003614
## F-statistic: 0.8164 on 1 and 50 DF, p-value: 0.3706
summary(lm_resid_fit_t2)
##
## Call:
## lm(formula = resids_3[, 2] ~ model2_obj$eblup$poor2013 + model2_obj$eblup$poor2013^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.196228 -0.040498 0.005571 0.040744 0.104039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03226 0.02437 1.323 0.192
## model2_obj$eblup$poor2013 -0.08303 0.11556 -0.719 0.476
##
## Residual standard error: 0.05865 on 50 degrees of freedom
## Multiple R-squared: 0.01022, Adjusted R-squared: -0.009576
## F-statistic: 0.5163 on 1 and 50 DF, p-value: 0.4758
summary(lm_resid_fit_t3)
##
## Call:
## lm(formula = resids_3[, 3] ~ model2_obj$eblup$poor2014 + model2_obj$eblup$poor2014^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34029 -0.03226 0.01047 0.04736 0.16709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03868 0.05732 0.675 0.503
## model2_obj$eblup$poor2014 -0.09438 0.19996 -0.472 0.639
##
## Residual standard error: 0.07893 on 50 degrees of freedom
## Multiple R-squared: 0.004436, Adjusted R-squared: -0.01548
## F-statistic: 0.2228 on 1 and 50 DF, p-value: 0.639
We use the Shapiro-Wilk test of normality via the shapiro.test() function in base R. The Shapiro-Wilk test assesses whether a sample of data is drawn from a normally distributed population. It does so by comparing the order statistics (i.e., the sorted values) of the sample to their expected values under a normal distribution. Specifically, the test statistic \(W\) is closely related to the squared correlation between the observed order statistics and the corresponding expected normal quantiles.
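In symbols, for an ordered sample \(x_{(1)} \le \cdots \le x_{(n)}\),

\[
W = \frac{\left(\sum_{i=1}^{n} a_{i}\, x_{(i)}\right)^{2}}{\sum_{i=1}^{n} \left(x_{i} - \bar{x}\right)^{2}},
\]

where the coefficients \(a_{i}\) are derived from the expected values and covariance matrix of standard normal order statistics; values of \(W\) close to 1 are consistent with normality.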
First, we perform the Shapiro-Wilk normality test on the model errors, \(\varepsilon\). We show both the residual histograms and the Q-Q plots below:
### evaluating the normality assumption
resid_dt <- prov_dt[,c("poor2012", "poor2013", "poor2014")] - model2_obj$eblup
### perform the shapiro test
shapiro_obj <- apply(resid_dt, 2, shapiro.test)
summary_dt <-
data.frame(Time = names(shapiro_obj),
W = lapply(X = shapiro_obj,
FUN = function(x){
return(x$statistic[[1]])
}) %>%
as.numeric(),
p_value = lapply(X = shapiro_obj,
FUN = function(x){
return(x$p.value)
}) %>%
as.numeric())
### plot the results
summary_dt <-
summary_dt %>%
mutate(label = paste0("W = ", round(W, 3), "\n", "p = ", signif(p_value, 3)))
resid_dt %>%
pivot_longer(cols = everything(),
names_to = "Time",
values_to = "Residual") %>%
ggplot(aes(x = Residual)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white") +
geom_text(data = summary_dt, aes(x = -Inf, y = Inf, label = label),
hjust = -0.1, vjust = 1.2, inherit.aes = FALSE, size = 3.5) +
facet_wrap(~Time, scales = "free") +
theme_minimal() +
labs(title = "Residual Histograms by Time Period")
### here's how to create qqplots
resid_dt %>%
pivot_longer(cols = everything(),
names_to = "Time",
values_to = "Residual") %>%
ggplot(aes(sample = Residual)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~Time, scales = "free") +
theme_minimal() +
labs(title = "QQ Plots of Residuals by Time Period")
Likewise, we test the normality of the predicted random effects.
#### For the random effects
raneff_dt <- as.data.frame(model2_obj$randomEffect)
### lets run the shapiro wilks tests again
shapiro_obj <- apply(raneff_dt, 2, shapiro.test)
summary_dt <-
data.frame(Time = names(shapiro_obj),
W = lapply(X = shapiro_obj,
FUN = function(x){
return(x$statistic[[1]])
}) %>%
as.numeric(),
p_value = lapply(X = shapiro_obj,
FUN = function(x){
return(x$p.value)
}) %>%
as.numeric())
### plot the results
summary_dt <-
summary_dt %>%
mutate(label = paste0("W = ", round(W, 3), "\n", "p = ", signif(p_value, 3)))
raneff_dt %>%
pivot_longer(cols = everything(),
names_to = "Time",
values_to = "RandEff") %>%
ggplot(aes(x = RandEff)) +
geom_histogram(bins = 10, fill = "darkorange", color = "white") +
geom_text(data = summary_dt, aes(x = -Inf, y = Inf, label = label),
hjust = -0.1, vjust = 1.2, inherit.aes = FALSE, size = 3.5) +
facet_wrap(~Time, scales = "free") +
theme_minimal() +
labs(title = "Randon Effects Histograms by Time Period")
In both cases, we compare the p-value to the 0.05 level of significance. The results suggest we cannot reject the null hypothesis of normally distributed model errors and random effects.
In some cases, the assumptions of the MFH2 model are not met. Steps 5 and 6 recommend re-estimating the models after creating additional variables via transformations or variable interactions, as sketched below.
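A minimal sketch of such an adjustment, using the existing aec and mkt covariates in prov_dt (the interaction term here is purely illustrative and should be adapted to the application at hand):
## create an illustrative interaction term at the province level and re-run the
## variable selection step on the extended candidate set
prov_dt_ext <-
prov_dt |>
mutate(aec_X_mkt = aec * mkt)
fh_step_ext <- step_wrapper(dt = prov_dt_ext,
xvars = c(candidate_vars, "aec_X_mkt"),
y = "poor2012",
cor_thresh = 0.7)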
#### first let us compare the gains in precision
direct_list <-
mapply(x = direct_list,
y = c(2012, 2013, 2014),
FUN = function(x, y){
x$year <- y
return(x)
}, SIMPLIFY = FALSE)
direct_dt <- Reduce("rbind", direct_list)
model2mse_dt <-
model2_obj$MSE |>
mutate(Domain = 1:n()) |>
pivot_longer(
cols = starts_with("poor"), # columns to pivot
names_to = "year", # new column for the years
values_to = "modelMSE"
) |>
mutate(year = as.integer(substr(year, 5, 8)))
model2pov_dt <-
model2_obj$eblup |>
mutate(Domain = 1:n()) |>
pivot_longer(
cols = starts_with("poor"), # columns to pivot
names_to = "year", # new column for the years
values_to = "modelpov"
) |>
mutate(year = as.integer(substr(year, 5, 8)))
model2pov_dt <- merge(model2mse_dt, model2pov_dt, by = c("Domain", "year"))
model2pov_dt <-
model2pov_dt |>
mutate(modelCV = sqrt(modelMSE) / modelpov)
cv_dt <- merge(model2pov_dt, direct_dt, by = c("Domain", "year"))
cv_dt |>
dplyr::select(Domain, year, CV, modelCV) |>
pivot_longer(cols = c(CV, modelCV), names_to = "Type", values_to = "CV_value") |>
ggplot(aes(x = factor(year), y = CV_value, color = Type, group = Type)) +
geom_line(linewidth = 1) +
facet_wrap(~ Domain, scales = "free_y") +
labs(title = "Comparison of Direct vs Model-based CVs",
y = "Coefficient of Variation (CV)",
x = "Year",
color = "Estimation Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Next, we call the pbmcpeMFH2() function, which returns the EBLUPs, as well as the MSEs of the EBLUPs for each time point and the MCPEs for each pair of time points, based on the MFH model 2, as follows:
mcpemfh2_obj <-
pbmcpeMFH2(formula = mfh_formula,
vardir = varcols,
nB = 50,
data = prov_dt)
## (progress counter printed during the nB = 50 bootstrap replicates omitted)
#' A function to compare and plot differences from the mcpe object
#'
compare_mfh2 <- function(period_list = c(1,2),
mcpe_obj = mcpemfh2_obj,
alpha = 0.05){
col_chr <- paste0("(", period_list[1], ",", period_list[2], ")")
df <-
tibble(diff = mcpe_obj$eblup[,period_list[2]] - mcpe_obj$eblup[,period_list[1]],
mse = mcpe_obj$mse[,period_list[1]] + mcpe_obj$mse[,period_list[2]] - 2*mcpe_obj$mcpe[,col_chr],
alpha = rep(alpha, nrow(mcpe_obj$eblup)),
zq = qnorm(alpha / 2, lower.tail = F)) |>
mutate(lb = diff - zq * sqrt(mse),
ub = diff + zq * sqrt(mse)) |>
mutate(significant = ifelse(lb > 0 | ub < 0,
"Significant",
"Not Significant")) |>
mutate(index = row_number())
p1 <-
ggplot(df, aes(x = index, y = diff)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_errorbar(aes(ymin = lb, ymax = ub, color = significant), width = 0.2, linewidth = 1) +
geom_point(color = "black", size = 2) +
scale_color_manual(values = c("Significant" = "red", "Not Significant" = "gray40")) +
labs(
x = "Area (Index)",
y = "Difference between time periods",
color = "Significance",
title = paste0("Change in Poverty Rates based on MCPE between period ",
period_list[1],
" and ",
period_list[2])
) +
theme_minimal()
return(list(df = df,
plot = p1))
}
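In notation, for area \(d\) and periods \(t_{1} < t_{2}\), the function above computes the estimated change \(\hat{\Delta}_{d} = \hat{\delta}_{dt_{2}}^{MFH} - \hat{\delta}_{dt_{1}}^{MFH}\), its estimated error

\[
\widehat{\mathrm{MSE}}\!\left(\hat{\Delta}_{d}\right) = \widehat{\mathrm{MSE}}\!\left(\hat{\delta}_{dt_{1}}^{MFH}\right) + \widehat{\mathrm{MSE}}\!\left(\hat{\delta}_{dt_{2}}^{MFH}\right) - 2\,\widehat{\mathrm{MCPE}}\!\left(\hat{\delta}_{dt_{1}}^{MFH}, \hat{\delta}_{dt_{2}}^{MFH}\right),
\]

and the approximate \((1-\alpha)\) confidence interval \(\hat{\Delta}_{d} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{MSE}}(\hat{\Delta}_{d})}\). A change is flagged as significant when this interval excludes zero.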
The above function takes the difference between any two time periods and prepares a table of differences for each area, including the MCPE-based estimated error as well as the lower and upper bounds at the specified confidence level. See the function applied below to compare each pair of periods, starting with periods 1 and 2 (years 2012 and 2013):
comp12_obj <- compare_mfh2()
comp23_obj <- compare_mfh2(period_list = c(2, 3))
comp13_obj <- compare_mfh2(period_list = c(1, 3))
See the plots below
comp12_obj$plot
comp23_obj$plot
comp13_obj$plot
See the data created
comp12_obj$df %>%
head() %>%
kable()
| diff | mse | alpha | zq | lb | ub | significant | index |
|---|---|---|---|---|---|---|---|
| 0.01331 | 0.0002437 | 0.05 | 1.959964 | -0.0172891 | 0.0439091 | Not Significant | 1 |
| 0.00545 | 0.0000628 | 0.05 | 1.959964 | -0.0100798 | 0.0209798 | Not Significant | 2 |
| -0.00109 | 0.0000187 | 0.05 | 1.959964 | -0.0095636 | 0.0073836 | Not Significant | 3 |
| 0.00000 | 0.0000000 | 0.05 | 1.959964 | -0.0000408 | 0.0000408 | Not Significant | 4 |
| 0.00313 | 0.0000353 | 0.05 | 1.959964 | -0.0085208 | 0.0147808 | Not Significant | 5 |
| 0.00235 | 0.0000563 | 0.05 | 1.959964 | -0.0123526 | 0.0170526 | Not Significant | 6 |
comp23_obj$df %>%
head() %>%
kable()
| diff | mse | alpha | zq | lb | ub | significant | index |
|---|---|---|---|---|---|---|---|
| 0.07862 | 0.0013890 | 0.05 | 1.959964 | 0.0055744 | 0.1516656 | Significant | 1 |
| 0.10423 | 0.0004644 | 0.05 | 1.959964 | 0.0619940 | 0.1464660 | Significant | 2 |
| 0.12974 | 0.0003893 | 0.05 | 1.959964 | 0.0910680 | 0.1684120 | Significant | 3 |
| 0.09968 | 0.0010776 | 0.05 | 1.959964 | 0.0353400 | 0.1640200 | Significant | 4 |
| 0.11912 | 0.0008807 | 0.05 | 1.959964 | 0.0609557 | 0.1772843 | Significant | 5 |
| 0.10658 | 0.0013460 | 0.05 | 1.959964 | 0.0346738 | 0.1784862 | Significant | 6 |
comp13_obj$df %>%
head() %>%
kable()
| diff | mse | alpha | zq | lb | ub | significant | index |
|---|---|---|---|---|---|---|---|
| 0.09193 | 0.0014206 | 0.05 | 1.959964 | 0.0180584 | 0.1658016 | Significant | 1 |
| 0.10968 | 0.0005839 | 0.05 | 1.959964 | 0.0623211 | 0.1570389 | Significant | 2 |
| 0.12865 | 0.0004065 | 0.05 | 1.959964 | 0.0891316 | 0.1681684 | Significant | 3 |
| 0.09968 | 0.0010775 | 0.05 | 1.959964 | 0.0353433 | 0.1640167 | Significant | 4 |
| 0.12225 | 0.0008663 | 0.05 | 1.959964 | 0.0645627 | 0.1799373 | Significant | 5 |
| 0.10893 | 0.0012002 | 0.05 | 1.959964 | 0.0410304 | 0.1768296 | Significant | 6 |