Stable FH Estimators over T Time Periods: The Multivariate Fay-Herriot Modelling Approach

This section describes procedures that yield stable small area estimators for each of \(D\) areas over \(T\) subsequent time instants. Area populations, samples and data might change between time periods. Accordingly, we denote by \(U_t\) the overall population at time \(t\), which is partitioned into \(D\) areas \(U_{1t}, \ldots, U_{Dt}\) of respective population sizes \(N_{1t}, \ldots, N_{Dt}\).
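For orientation, here is a compact statement of the area-level model underlying this section, written as a sketch in generic notation:

\[
\hat{\theta}_{dt}^{\,\mathrm{dir}} = \theta_{dt} + e_{dt}, \qquad
\theta_{dt} = \mathbf{x}_{dt}^{\top}\boldsymbol{\beta}_t + u_{dt},
\qquad d = 1, \ldots, D, \quad t = 1, \ldots, T,
\]

where \(\hat{\theta}_{dt}^{\,\mathrm{dir}}\) is the direct estimator of the target parameter \(\theta_{dt}\), \(e_{dt}\) is the sampling error with an estimated variance-covariance matrix across the \(T\) periods, and \(u_{dt}\) is the area-time random effect. The MFH models fitted below differ in the covariance structure assumed for \(u_{dt}\): model 2 assumes homoskedastic variances across time, while model 3 allows heteroskedastic random effects, as discussed later in this section.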

An overview of the MFH model estimation process is given below.

We will show how to use the eblupMFH2() and eblupMFH3() functions from the R package msae [@permatasari2022package] to compute the EBLUPs and their MSE estimates under MFH models 2 and 3, respectively. The calls to these functions are:

eblupMFH2(formula, vardir, MAXITER = 100, PRECISION = 1e-04, data)

eblupMFH3(formula, vardir, MAXITER = 100, PRECISION = 1e-04, data)

MFH Estimation of Poverty Rates for T Time Periods

In this example, we use a synthetic data set adapted from the incomedata data set in the R package sae. The original data contain information on \(n = 17,119\) fictitious individuals residing across \(D = 52\) Spanish provinces. The variables include the name of the province of residence (provlab), the province code (prov), as well as several correlates of income. We have added two additional income vectors corresponding to two additional years of data.

We will show how to estimate a poverty map for each year using the multivariate Fay-Herriot modelling approach. This approach allows us to take advantage of the temporal correlation between poverty rates, i.e. an individual's income in year \(t\) is likely correlated with their income in year \(t+1\).

The rest of this tutorial shows how to prepare MFH models to estimate the poverty rates, using a random 10% sample of the incomedata records.

if (sum(installed.packages()[,1] %in% "pacman") != 1){
  
  install.packages("pacman")
  
}

pacman::p_load(sf, data.table, tidyverse, car, msae, 
               sae, survey, spdep, knitr, MASS, caret,
               purrr, here)

source(here::here("scripts/Function_pbmcpeMFH2_test.R"))

income_dt <- readRDS(here::here("data/incomedata_survey.RDS"))

prov_dt <- readRDS(here::here("data/shapes/simadmin.RDS"))

shp_dt <- readRDS(here::here("data/shapes/spainshape.RDS"))

glimpse(income_dt)
## Rows: 4,298
## Columns: 9
## Groups: provlab [52]
## $ provlab     <fct> Alava, Alava, Alava, Alava, Alava, Alava, Alava, Alava, Al…
## $ prov        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ income2012  <dbl> 26125.271, 27404.099, 13117.838, 27920.737, 8705.592, 6340…
## $ income2013  <dbl> 27500.673, 28174.985, 13581.315, 28508.978, 9182.841, 6272…
## $ income2014  <dbl> 34739.733, 32137.846, 15980.581, 31512.153, 11699.903, 593…
## $ povline2012 <dbl> 6477.484, 6477.484, 6477.484, 6477.484, 6477.484, 6477.484…
## $ povline2013 <dbl> 6515.865, 6515.865, 6515.865, 6515.865, 6515.865, 6515.865…
## $ povline2014 <dbl> 6515.865, 6515.865, 6515.865, 6515.865, 6515.865, 6515.865…
## $ weight      <dbl> 7424.759, 23577.245, 41976.969, 19471.566, 12738.217, 2463…

Step 1: Direct Estimation

We will use the direct Horvitz-Thompson (HT) estimators based on the survey weights in the weight variable. First, we calculate the total sample size, the number of provinces, and the sample sizes for each province, and extract the population sizes for each province/target area from the sizeprov file. For those using household/individual level survey data, the population sizes may be obtained as the sum of the household or individual weights, as appropriate.
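For reference, the weighted (Hájek-type) direct headcount estimator computed in this step for province \(d\) and year \(t\) takes the standard form

\[
\hat{\theta}_{dt}^{\,\mathrm{dir}} =
\frac{\sum_{i \in s_{dt}} w_{i}\, \mathbf{1}\{y_{it} < z_t\}}{\sum_{i \in s_{dt}} w_{i}},
\]

where \(s_{dt}\) is the sample in province \(d\) at time \(t\), \(w_i\) are the survey weights, \(y_{it}\) is income and \(z_t\) is the year-specific poverty line.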

data(sizeprov)

## quickly compute sample size for each province
sampsize_dt <- 
income_dt |>
  group_by(prov) |>
  summarize(N = n())

## the poverty line for each year is already included within the data.
## Let's compute the direct estimates for each year of data and create a list of data.frames 
## (equal in length to the number of years) containing the direct estimate, the standard errors 
## and the coefficient of variation.

direct_list <- 
  mapply(FUN = function(y, threshold){
    
     z <- emdi::direct(y = y,
                       smp_data = income_dt %>% as.data.table(),
                       smp_domains = "prov",
                       weights = "weight",
                       threshold = unique(income_dt[[threshold]]),
                       var = TRUE)
     
     z <- 
       z$ind |>
       dplyr::select(Domain, Head_Count) |>
       rename(Direct = "Head_Count") |>
       merge(z$MSE |>
               dplyr::select(Domain, Head_Count) |>
               rename(MSE = "Head_Count"),
             by = "Domain") |>
       mutate(SD = sqrt(MSE)) |>
       mutate(CV = SD / Direct) |>
       merge(sampsize_dt |> 
               mutate(prov = as.factor(prov)), 
             by.x = "Domain", 
             by.y = "prov") |>
       mutate(var_SRS = Direct * (1 - Direct) / N) |>
       mutate(deff = MSE / var_SRS) |>
       mutate(n_eff = N / deff)
       
      
     return(z)

  }, SIMPLIFY = FALSE,
  y = c("income2012", "income2013", "income2014"),
  threshold = c("povline2012", "povline2013", "povline2014"))
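The last three columns created above follow the usual definitions, restated here for clarity: the variance under simple random sampling, the design effect, and the effective sample size,

\[
\widehat{\mathrm{var}}_{\mathrm{SRS}} =
\frac{\hat{\theta}^{\,\mathrm{dir}}\,(1 - \hat{\theta}^{\,\mathrm{dir}})}{n_d},
\qquad
\mathrm{deff} = \frac{\widehat{\mathrm{MSE}}\big(\hat{\theta}^{\,\mathrm{dir}}\big)}{\widehat{\mathrm{var}}_{\mathrm{SRS}}},
\qquad
n_{\mathrm{eff}} = \frac{n_d}{\mathrm{deff}},
\]

where \(n_d\) is the province sample size (the N column in the output below).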
direct_list %>%
  lapply(X = .,
         FUN = function(x){
           
           y <- head(x)
           
           y %>% kable()
           
         })
## $income2012
## 
## 
## |Domain |    Direct|       MSE|        SD|        CV|   N|   var_SRS|      deff|    n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|--------:|
## |1      | 0.2439406| 0.0110510| 0.1051237| 0.4309396|  24| 0.0076847| 1.4380443| 16.68933|
## |10     | 0.2571874| 0.0031061| 0.0557327| 0.2167007|  70| 0.0027292| 1.1381234| 61.50476|
## |11     | 0.1009752| 0.0010541| 0.0324664| 0.3215285| 100| 0.0009078| 1.1611329| 86.12278|
## |12     | 0.1862899| 0.0051402| 0.0716952| 0.3848579|  30| 0.0050529| 1.0172832| 29.49031|
## |13     | 0.1538687| 0.0020267| 0.0450187| 0.2925786|  62| 0.0020999| 0.9651377| 64.23953|
## |14     | 0.2485756| 0.0040143| 0.0633589| 0.2548876|  56| 0.0033355| 1.2035352| 46.52959|
## 
## $income2013
## 
## 
## |Domain |    Direct|       MSE|        SD|        CV|   N|   var_SRS|      deff|     n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|---------:|
## |1      | 0.3117905| 0.0151210| 0.1229675| 0.3943914|  24| 0.0089407| 1.6912517|  14.19067|
## |10     | 0.2735087| 0.0025398| 0.0503964| 0.1842587|  70| 0.0028386| 0.8947362|  78.23535|
## |11     | 0.1174984| 0.0009567| 0.0309310| 0.2632464| 100| 0.0010369| 0.9226596| 108.38233|
## |12     | 0.1862899| 0.0051402| 0.0716952| 0.3848579|  30| 0.0050529| 1.0172832|  29.49031|
## |13     | 0.1617958| 0.0020810| 0.0456177| 0.2819463|  62| 0.0021874| 0.9513534|  65.17032|
## |14     | 0.2353446| 0.0035808| 0.0598402| 0.2542663|  56| 0.0032135| 1.1143047|  50.25555|
## 
## $income2014
## 
## 
## |Domain |    Direct|       MSE|        SD|        CV|   N|   var_SRS|      deff|    n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|--------:|
## |1      | 0.4142299| 0.0141892| 0.1191185| 0.2875661|  24| 0.0101101| 1.4034626| 17.10056|
## |10     | 0.4206141| 0.0032788| 0.0572612| 0.1361372|  70| 0.0034814| 0.9418188| 74.32428|
## |11     | 0.2112138| 0.0025982| 0.0509725| 0.2413312| 100| 0.0016660| 1.5595162| 64.12245|
## |12     | 0.4128282| 0.0122581| 0.1107162| 0.2681896|  30| 0.0080800| 1.5170830| 19.77479|
## |13     | 0.2411189| 0.0036303| 0.0602519| 0.2498847|  62| 0.0029513| 1.2300663| 50.40379|
## |14     | 0.3264915| 0.0044502| 0.0667098| 0.2043232|  56| 0.0039267| 1.1333194| 49.41237|

Estimating the Variance-Covariance Matrix of the Sampling Error Distribution

Next, we estimate the sampling variances and covariances of the direct estimators using the survey R package as follows:

## quickly creating the poverty indicator
income_dt <- 
  income_dt |>
  mutate(poor2012 = ifelse(income2012 < povline2012, 1, 0),
         poor2013 = ifelse(income2013 < povline2013, 1, 0),
         poor2014 = ifelse(income2014 < povline2014, 1, 0))
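
Before wrapping everything in a general function, here is a minimal sketch of the core survey package calls for a single province (province code 1 is used purely for illustration); the compute_vcov() function below simply loops this logic over all provinces and tidies the output.

## minimal sketch: sampling variance-covariance matrix of the three poverty
## indicators for one province (the survey package is already loaded via pacman)
des_1 <- svydesign(ids = ~1,
                   weights = ~weight,
                   data = subset(income_dt, prov == 1))

est_1 <- svymean(~poor2012 + poor2013 + poor2014, des_1)

vcov(est_1)  ## 3 x 3 sampling variance-covariance matrix for province 1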


### a simple worker function for computing the variance covariance matrix of the sampling error of the direct estimate

compute_vcov <- function(dt,
                         domain,
                         ids,
                         weights = NULL,
                         fpc = NULL,
                         strata = NULL,
                         probs = NULL,
                         yvars,
                         deff = FALSE,
                         ...){
  
  ## a little worker function to convert expressions that are character to the right format for the survey package
  convert_expr <- function(x){
    
    y <- if (is.character(x)){
      
      rlang::expr(~!!sym(x))
      
    } else {
      
      rlang::expr(~ !!x)
      
    }
    
    y <- eval(y)
    
    return(y)
    
  }
  
  ### run it on the proper arguments
  ids <- convert_expr(x = ids)
  
  if (is.null(weights)){
    
    ## fall back to equal weights when no weights variable is supplied
    dt$weights <- 1
    weights <- "weights"
    
  }
  
  weights <- convert_expr(x = weights)
  
  if (!is.null(fpc)){
    
    fpc <- convert_expr(x = fpc)

  }
    
  if (!is.null(probs)){
    
    probs <- convert_expr(x = probs)

  }

  if (!is.null(strata)){
    
    strata <- convert_expr(x = strata)

  }

  ### create a survey design object from the survey package 
  
  dom_list <- unique(dt[[domain]])
  
  surv_vcov <- function(x){
    
    design_obj <- svydesign(ids = ids,
                            probs = probs,
                            strata = strata,
                            fpc = fpc,
                            data = dt |>
                              dplyr::filter(!!sym(domain) == x),
                            weights = weights)
    
    ### prepare the y formula for the svymean function
    yform <- reformulate(yvars)
    
    var_dt <- svymean(yform, design_obj, ...) ## use the svymean object to compute variance
    
    var_dt <- vcov(var_dt) ### compute variance covariance matrix
    
    var_dt <- as.numeric(c(diag(var_dt), var_dt[lower.tri(var_dt, diag = FALSE)]))
    
    pair_list <- c(lapply(yvars, rep, 2),
                   combn(yvars, 2, simplify = FALSE))
  
    
    pair_list <- lapply(pair_list,
                        function(x){
                          
                          zz <- paste0("v_", paste(x, collapse = ""))
                          
                          return(zz)
                          
                        })
    
    pair_list <- unlist(pair_list)
    
    names(var_dt) <- pair_list
    
    return(var_dt)

  }
  
  vcov_list <- lapply(X = dom_list,
                      FUN = surv_vcov)
  
  var_dt <- Reduce(f = "rbind",
                   x = vcov_list) %>%
    as_tibble() %>%
    mutate(domain = dom_list)
  
  return(var_dt)  
  
}

### running the compute_vcov function to prepare the variance covariance matrix 
var_dt <- 
compute_vcov(dt = income_dt, 
             domain = "prov",
             ids = 1, 
             weights = "weight", 
             yvars = c("poor2012", "poor2013", "poor2014"))

### sample size
var_dt <- 
  var_dt |> 
  merge(sampsize_dt, by.x = "domain", by.y = "prov")

Handling Low Sample Size: Variance Smoothing

A quick inspection of the preceding results shows that some provinces have small sample sizes, which can produce extreme poverty rate estimates and hence zero estimated variances. To avoid this, we apply the variance smoothing method suggested by [@you2023application]. Please see the code and the Roxygen comments below explaining the use of the varsmoothie_king() function, which computes smoothed variances.
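In brief, the smoothing model fitted inside varsmoothie_king() is a log-linear regression of the direct variances on the log sample sizes, with a lognormal bias correction applied when transforming back to the original scale (this simply restates what the code below does):

\[
\log\big(\widehat{\mathrm{var}}_d\big) = \beta_0 + \beta_1 \log(n_d) + \varepsilon_d,
\qquad
\widetilde{\mathrm{var}}_d =
\exp\big(\hat{\beta}_0 + \hat{\beta}_1 \log(n_d)\big)\,
\exp\big(\hat{\sigma}_{\varepsilon}^{2} / 2\big),
\]

where \(\hat{\sigma}_{\varepsilon}^{2}\) is the residual variance of the fitted regression.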

#' A function to perform variance smoothing
#' 
#' The variance smoothing function applies the methodology of Hidiroglou and You (2023),
#' which uses a simple log-linear regression to smooth the direct variance estimates of
#' sample poverty rates; this is useful for replacing unstable (or zero) direct variances
#' in areas with small samples.
#' 
#' @param domain a vector of unique domain/target area identifiers
#' @param direct_var the raw variances estimated from the sample data
#' @param sampsize the sample size for each domain
#' @param y (optional, currently unused) unit-level poverty indicator, i.e. poor = 1, non-poor = 0
#' 
#' @export

varsmoothie_king <- function(domain,
                             direct_var,
                             sampsize,
                             y){

  dt <- data.table(domain = domain,
                   var = direct_var,
                   n = sampsize)

  dt$log_n <- log(dt$n)
  dt$log_var <- log(dt$var)

  lm_model <- lm(formula = log_var ~ log_n,
                 data = dt[!(abs(dt$log_var) == Inf),])

  dt$pred_var <- predict(lm_model, newdata = dt)

  residual_var <-  summary(lm_model)$sigma^2
  
  dt$var_smooth <- exp(dt$pred_var) * exp(residual_var/2)

  return(dt[, c("domain", "var_smooth"), with = F])

}

OK, the goal now is to use the above varsmoothie_king() function to add additional columns of smoothed variances to our var_dt object.

vardir <- grep("^v_", names(var_dt), value = TRUE)

var_dt <-
  lapply(X = vardir,
         FUN = function(x){

           z <- varsmoothie_king(domain = var_dt[["domain"]],
                                 direct_var = var_dt[[x]],
                                 sampsize = var_dt[["N"]]) |>
             as.data.table() |>
             setnames(old = "var_smooth", new = paste0("vs", x)) |>
             as_tibble()
           

           return(z)

         }) %>%
  Reduce(f = "merge",
         x = .) %>%
  merge(x = var_dt,
        y = .,
        by = "domain") |>
  as_tibble()
## Warning in log(dt$var): NaNs produced
## Warning in log(dt$var): NaNs produced
Now, you can replace the zero or near-zero sampling variances (which arise in areas with very small samples) with their smoothed counterparts; we do this just before model estimation in Step 3 below.

Step 2: Model Selection

More (and probably better) correlates of income and poverty can be created to improve the predictive power of the model. The next step is to assemble the target-area-level variables. The MFH model is a model of poverty rates at the target area level, hence the dataset required for this exercise has the province as its unit of observation. This format has a few essential columns:

    1. Variables containing the poverty rates for each time period
    2. The set of candidate variables from which the best predictors of the poverty rates will be selected
    3. The target area identifiers (in this case the province variables prov and provlab)

Model Selection

Next, we apply a simple variable selection process which employs the stepwise regression algorithm with the AIC selection criterion, as described by [@yamashita2007stepwise]. The function step_wrapper() implemented below is a wrapper around MASS::stepAIC() that carries out the perfunctory cleaning necessary to use stepAIC(). This includes dropping columns that are entirely missing (NA), keeping only complete cases/observations, and removing perfectly or nearly collinear variables and combinations using aliased coefficients, caret::findLinearCombos() and a pairwise correlation threshold.

#' A function to perform stepwise variable selection with the AIC selection criterion
#' 
#' @param dt data.frame, dataset containing the outcome and independent variables
#' @param xvars character vector, the set of candidate x variables
#' @param y character, the name of the y variable
#' @param cor_thresh numeric, pairwise correlation threshold above which one of each highly correlated pair is dropped
#' 
#' @import data.table
#' @importFrom caret findLinearCombos
#' @importFrom MASS stepAIC

step_wrapper <- function(dt, xvars, y, cor_thresh = 0.95) {
  
  dt <- as.data.table(dt)
  
  # Drop columns that are entirely NA
  dt <- dt[, which(unlist(lapply(dt, function(x) !all(is.na(x))))), with = FALSE]
  
  xvars <- xvars[xvars %in% colnames(dt)]
  
  # Keep only complete cases
  dt <- na.omit(dt[, c(y, xvars), with = FALSE])
  
  # Step 1: Remove aliased (perfectly collinear) variables
  model_formula <- as.formula(paste(y, "~", paste(xvars, collapse = " + ")))
  lm_model <- lm(model_formula, data = dt)
  aliased <- is.na(coef(lm_model))
  if (any(aliased)) {
    xvars <- names(aliased)[!aliased & names(aliased) != "(Intercept)"]
  }
  
  # Step 2: Remove near-linear combinations
  xmat <- as.matrix(dt[, ..xvars])
  combo_check <- tryCatch(findLinearCombos(xmat), error = function(e) NULL)
  if (!is.null(combo_check) && length(combo_check$remove) > 0) {
    xvars <- xvars[-combo_check$remove]
    xmat <- as.matrix(dt[, ..xvars])
  }
  
  # Step 3: Drop highly correlated variables
  cor_mat <- abs(cor(xmat))
  diag(cor_mat) <- 0
  while (any(cor_mat > cor_thresh, na.rm = TRUE)) {
    cor_pairs <- which(cor_mat == max(cor_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]
    var1 <- colnames(cor_mat)[cor_pairs[1]]
    var2 <- colnames(cor_mat)[cor_pairs[2]]
    # Drop the variable with higher mean correlation
    drop_var <- if (mean(cor_mat[var1, ]) > mean(cor_mat[var2, ])) var1 else var2
    xvars <- setdiff(xvars, drop_var)
    xmat <- as.matrix(dt[, ..xvars])
    cor_mat <- abs(cor(xmat))
    diag(cor_mat) <- 0
  }
  
  # Step 4: Warn if still ill-conditioned
  cond_number <- kappa(xmat, exact = TRUE)
  if (cond_number > 1e10) {
    warning("Design matrix is ill-conditioned (condition number > 1e10). Consider reviewing variable selection.")
  }
  
  # Final model fit
  model_formula <- as.formula(paste(y, "~", paste(xvars, collapse = " + ")))
  full_model <- lm(model_formula, data = dt)
  
  # Stepwise selection
  stepwise_model <- stepAIC(full_model, direction = "both", trace = 0)
  
  return(stepwise_model)
  
}

We apply the variable selection process using the first time period (poor2012) and reuse the selected covariates in each year's model. See the application below:

### the set of candidate covariates (everything except the province identifiers)
candidate_vars <- colnames(prov_dt)[!colnames(prov_dt) %in% c("prov", "provlab")]


### creating the province level data for variable selection and poverty mapping
prov_dt <- 
  map2(.x = list("poor2012", "poor2013", "poor2014"), ### first we create a direct estimate dataset from direct_list
       .y = direct_list,
       ~ .y |> 
         rename(!!sym(.x) := Direct) |>
         dplyr::select(Domain, all_of(.x))) |>
  Reduce(f = "merge") |> ### merge the selected variables from each dataframe of the list
  merge(prov_dt, ### merging the prepared simulated administrative data variables
        by.x = "Domain", by.y = "prov", all = TRUE) |>
  merge(var_dt,
        by.x = "Domain", by.y = "domain", all = TRUE)
  
fh_step <- 
step_wrapper(dt = prov_dt,
             xvars = candidate_vars,
             y = "poor2012",
             cor_thresh = 0.7)

fh_step <- names(fh_step$coefficients)[!grepl("(Intercept)", names(fh_step$coefficients))]

fh_step <- as.formula(paste0("poor2012 ~ ",  paste(fh_step, collapse = " + ")))

rhs <- paste(deparse(fh_step[[3]]), collapse = " ")


outcome_list <- list("poor2012", "poor2013", "poor2014")
# Then generate the list of formulas
mfh_formula <- 
  lapply(outcome_list, 
         function(x) {
           
            as.formula(paste(x, "~", rhs)) 
           
           })

names(mfh_formula) <- outcome_list

Let's run some linear regressions with the set of selected variables to get a sense of the predictive power of each year's model independently.

lmcheck_obj <- 
lapply(X = mfh_formula,
       FUN = lm,
       data = prov_dt)


lapply(X = lmcheck_obj,
       FUN = summary)
## $poor2012
## 
## Call:
## FUN(formula = X[[i]], data = ..1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12244 -0.03693  0.01055  0.03833  0.09114 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.406566   0.190009   2.140 0.038233 *  
## gen          -0.194001   0.121391  -1.598 0.117508    
## age5          0.378420   0.131694   2.873 0.006343 ** 
## labor1        0.231915   0.120111   1.931 0.060272 .  
## labor2        0.532883   0.289103   1.843 0.072360 .  
## aec          -0.012975   0.009338  -1.389 0.172025    
## mkt           0.032275   0.008521   3.788 0.000478 ***
## age3_X_gen   -0.213792   0.078529  -2.722 0.009395 ** 
## age2_X_educ3  1.626525   0.885467   1.837 0.073306 .  
## age3_X_educ3  0.354384   0.165834   2.137 0.038468 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05718 on 42 degrees of freedom
## Multiple R-squared:  0.6163, Adjusted R-squared:  0.5341 
## F-statistic: 7.495 on 9 and 42 DF,  p-value: 1.94e-06
## 
## 
## $poor2013
## 
## Call:
## FUN(formula = X[[i]], data = ..1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12999 -0.04099  0.01257  0.04400  0.09036 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.47048    0.21049   2.235   0.0308 *  
## gen          -0.21470    0.13448  -1.597   0.1179    
## age5          0.32666    0.14589   2.239   0.0305 *  
## labor1        0.14109    0.13306   1.060   0.2950    
## labor2        0.35688    0.32027   1.114   0.2715    
## aec          -0.01335    0.01034  -1.290   0.2040    
## mkt           0.03632    0.00944   3.847   0.0004 ***
## age3_X_gen   -0.17262    0.08699  -1.984   0.0538 .  
## age2_X_educ3  1.60622    0.98091   1.637   0.1090    
## age3_X_educ3  0.34192    0.18371   1.861   0.0697 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06334 on 42 degrees of freedom
## Multiple R-squared:  0.5664, Adjusted R-squared:  0.4734 
## F-statistic: 6.095 on 9 and 42 DF,  p-value: 1.928e-05
## 
## 
## $poor2014
## 
## Call:
## FUN(formula = X[[i]], data = ..1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18919 -0.03379 -0.00547  0.04263  0.15614 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.57338    0.25506   2.248   0.0299 *
## gen          -0.07116    0.16295  -0.437   0.6646  
## age5         -0.11779    0.17678  -0.666   0.5089  
## labor1       -0.28669    0.16123  -1.778   0.0826 .
## labor2        0.56014    0.38808   1.443   0.1563  
## aec          -0.02156    0.01254  -1.720   0.0928 .
## mkt           0.02276    0.01144   1.990   0.0531 .
## age3_X_gen   -0.14322    0.10541  -1.359   0.1815  
## age2_X_educ3 -0.41151    1.18861  -0.346   0.7309  
## age3_X_educ3  0.43581    0.22261   1.958   0.0569 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07675 on 42 degrees of freedom
## Multiple R-squared:  0.4368, Adjusted R-squared:  0.3161 
## F-statistic: 3.619 on 9 and 42 DF,  p-value: 0.002054

Step 3: Model Estimation

Next, we show how to use the msae R package to estimate the Empirical Best Linear Unbiased Predictors (EBLUPs) for the poverty map using eblupMFH2(), which allows for time-series Fay-Herriot estimation under homoskedastic random effect assumptions. For comparison, we also fit the univariate Fay-Herriot models with the eblupUFH() function, as well as MFH model 1 with eblupMFH1().

### the sampling variance-covariance columns were already merged into prov_dt above;
### if they had not been, you could merge them as follows:

# prov_dt <- 
#   prov_dt |>
#   merge(var_dt, by.x = "prov", by.y = "domain")

### variance-covariance matrix columns
varcols <- colnames(var_dt)[grepl(pattern = "^v_", x = colnames(var_dt))] 

## replace the variances-covariances that are zero with their smoothed counterparts
prov_dt <- 
  prov_dt |>
  mutate(across(
    starts_with("v_"),
    ~ if_else(abs(.x) <= 1e-4, get(paste0("vsv", str_remove(cur_column(), "^v"))), .x),
    .names = "{.col}"
  ))
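
To see how many variance-covariance cells the rule above actually affects, a quick count over the raw columns can be run (a sketch using the varcols vector defined above; the 1e-4 cutoff mirrors the replacement rule):

## number of raw variance-covariance cells below the cutoff, per column
sapply(varcols, function(x) sum(abs(var_dt[[x]]) <= 1e-4))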



#### now we estimate the competing models, including the multivariate Fay-Herriot models
#### (the MFH3 fit is shown commented out)

model0_obj <- eblupUFH(mfh_formula, vardir = varcols, data = prov_dt)
model1_obj <- eblupMFH1(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
model2_obj <- eblupMFH2(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
# model3_obj <- eblupMFH3(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)

With the MFH3 estimation, we can check if the variance-covariance matrix structure points to heteroskedastic random effects. We do so as follows:

# model3_obj$fit$refvarTest

This tests the null hypothesis that the random effect variances \(\sigma_t^2\) at each pair of time instants \(t\) and \(s\) are equal against the alternative that they are not. In this case, at a significance level of 0.05, we reject the equality of variances between \(t = 2013\) and \(t = 2014\); hence we can use the MFH3 model (the eblupMFH3() function). If these tests had supported the equality of variances, we would instead use the eblupMFH2() function for estimation.
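In symbols, for each pair of time instants \(t \neq s\), refvarTest evaluates

\[
H_0: \sigma_t^2 = \sigma_s^2
\qquad \text{against} \qquad
H_1: \sigma_t^2 \neq \sigma_s^2 .
\]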

Step 4: Post Estimation Diagnostics: Model Assumption Checks for Linearity, Normality and Outliers

We now verify the assumptions of the fitted MFH model (the diagnostics below use the MFH2 fit, model2_obj). This includes assessing linearity, the normality of the predicted random effects and of the residuals, as well as checking for the presence of outlying areas.

Checking the Linearity Assumption

We first check the linearity assumption. This can be assessed by regressing the model residuals against the predicted values (EBLUPs).

resids_3 <- cbind(prov_dt$poor2012 - model2_obj$eblup$poor2012,
                  prov_dt$poor2013 - model2_obj$eblup$poor2013,
                  prov_dt$poor2014 - model2_obj$eblup$poor2014)

layout(matrix(1:3, nrow = 1, byrow = TRUE))

plot(model2_obj$eblup$poor2012, resids_3[,1], pch = 19, xlab = "EBLUPs t = 1", ylab = "Residuals t = 1")
plot(model2_obj$eblup$poor2013, resids_3[,2], pch = 19, xlab = "EBLUPs t = 2", ylab = "Residuals t = 2")
plot(model2_obj$eblup$poor2014, resids_3[,3], pch = 19, xlab = "EBLUPs t = 3", ylab = "Residuals t = 3")

fits_3 <- model2_obj$eblup  # EBLUPs (predicted values)

# Run a regression of residuals on fitted values for each time period.
# Note: inside an R formula, a term like poor2012^2 without I() collapses to the
# linear term, so these are simple linear regressions of the residuals on the
# EBLUPs; wrap the term in I() to add a genuine quadratic term.
lm_resid_fit_t1 <- lm(resids_3[,1] ~ model2_obj$eblup$poor2012 + model2_obj$eblup$poor2012^2)
lm_resid_fit_t2 <- lm(resids_3[,2] ~ model2_obj$eblup$poor2013 + model2_obj$eblup$poor2013^2)
lm_resid_fit_t3 <- lm(resids_3[,3] ~ model2_obj$eblup$poor2014 + model2_obj$eblup$poor2014^2)

# View summaries
summary(lm_resid_fit_t1)
## 
## Call:
## lm(formula = resids_3[, 1] ~ model2_obj$eblup$poor2012 + model2_obj$eblup$poor2012^2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19940 -0.03052  0.00796  0.03862  0.08945 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.03395    0.02327   1.459    0.151
## model2_obj$eblup$poor2012 -0.10025    0.11095  -0.904    0.371
## 
## Residual standard error: 0.0556 on 50 degrees of freedom
## Multiple R-squared:  0.01607,    Adjusted R-squared:  -0.003614 
## F-statistic: 0.8164 on 1 and 50 DF,  p-value: 0.3706
summary(lm_resid_fit_t2)
## 
## Call:
## lm(formula = resids_3[, 2] ~ model2_obj$eblup$poor2013 + model2_obj$eblup$poor2013^2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.196228 -0.040498  0.005571  0.040744  0.104039 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.03226    0.02437   1.323    0.192
## model2_obj$eblup$poor2013 -0.08303    0.11556  -0.719    0.476
## 
## Residual standard error: 0.05865 on 50 degrees of freedom
## Multiple R-squared:  0.01022,    Adjusted R-squared:  -0.009576 
## F-statistic: 0.5163 on 1 and 50 DF,  p-value: 0.4758
summary(lm_resid_fit_t3)
## 
## Call:
## lm(formula = resids_3[, 3] ~ model2_obj$eblup$poor2014 + model2_obj$eblup$poor2014^2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34029 -0.03226  0.01047  0.04736  0.16709 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.03868    0.05732   0.675    0.503
## model2_obj$eblup$poor2014 -0.09438    0.19996  -0.472    0.639
## 
## Residual standard error: 0.07893 on 50 degrees of freedom
## Multiple R-squared:  0.004436,   Adjusted R-squared:  -0.01548 
## F-statistic: 0.2228 on 1 and 50 DF,  p-value: 0.639

Evaluating the Normality Assumption

The Shapiro-Wilk Test

We use the Shapiro-Wilk test of normality via the shapiro.test() function in base R. The Shapiro-Wilk test assesses whether a sample of data is drawn from a normally distributed population. It does so by comparing the order statistics (i.e., the sorted values) of the sample to their expected values under a normal distribution. The test statistic \(W\) can be read as the squared correlation between the ordered sample values and the corresponding expected normal quantiles.
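For reference, the standard form of the statistic is

\[
W = \frac{\Big(\sum_{i=1}^{n} a_i\, x_{(i)}\Big)^{2}}{\sum_{i=1}^{n} \big(x_i - \bar{x}\big)^{2}},
\]

where \(x_{(i)}\) are the ordered sample values and the coefficients \(a_i\) are derived from the expected normal order statistics; values of \(W\) close to 1 are consistent with normality.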

First, we perform the Shapiro-Wilk normality test on the model errors, \(\varepsilon\) (approximated here by the direct estimates minus the EBLUPs). We show both the residual histograms and the QQ plots below:

### evaluating the normality assumption
resid_dt <- prov_dt[,c("poor2012", "poor2013", "poor2014")] - model2_obj$eblup

### perform the shapiro test

shapiro_obj <- apply(resid_dt, 2, shapiro.test)

summary_dt <- 
  data.frame(Time = names(shapiro_obj),
             W = lapply(X = shapiro_obj,
                        FUN = function(x){
                          
                          return(x$statistic[[1]])
                          
                        }) %>%
               as.numeric(),
             p_value = lapply(X = shapiro_obj,
                              FUN = function(x){
                                
                                return(x$p.value)
                                
                              }) %>%
               as.numeric())

### plot the results
summary_dt <- 
  summary_dt %>%
  mutate(label = paste0("W = ", round(W, 3), "\n", "p = ", signif(p_value, 3)))

resid_dt %>%
  pivot_longer(cols = everything(), 
               names_to = "Time", 
               values_to = "Residual") %>%
  ggplot(aes(x = Residual)) + 
  geom_histogram(bins = 10, fill = "steelblue", color = "white") + 
  geom_text(data = summary_dt, aes(x = -Inf, y = Inf, label = label),
            hjust = -0.1, vjust = 1.2, inherit.aes = FALSE, size = 3.5) +
  facet_wrap(~Time, scales = "free") + 
  theme_minimal() + 
  labs(title = "Residual Histograms by Time Period")

### here's how to create qqplots
resid_dt %>%
  pivot_longer(cols = everything(),
               names_to = "Time",
               values_to = "Residual") %>%
  ggplot(aes(sample = Residual)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Time, scales = "free") +
  theme_minimal() +
  labs(title = "QQ Plots of Residuals by Time Period")

Likewise, we test the normality of the predicted random effects.

#### For the random effects
raneff_dt <- as.data.frame(model2_obj$randomEffect)

### lets run the shapiro wilks tests again
shapiro_obj <- apply(raneff_dt, 2, shapiro.test)


summary_dt <- 
  data.frame(Time = names(shapiro_obj),
             W = lapply(X = shapiro_obj,
                        FUN = function(x){
                          
                          return(x$statistic[[1]])
                          
                        }) %>%
               as.numeric(),
             p_value = lapply(X = shapiro_obj,
                              FUN = function(x){
                                
                                return(x$p.value)
                                
                              }) %>%
               as.numeric())

### plot the results
summary_dt <- 
  summary_dt %>%
  mutate(label = paste0("W = ", round(W, 3), "\n", "p = ", signif(p_value, 3)))

raneff_dt %>%
  pivot_longer(cols = everything(), 
               names_to = "Time", 
               values_to = "RandEff") %>%
  ggplot(aes(x = RandEff)) + 
  geom_histogram(bins = 10, fill = "darkorange", color = "white") + 
  geom_text(data = summary_dt, aes(x = -Inf, y = Inf, label = label),
            hjust = -0.1, vjust = 1.2, inherit.aes = FALSE, size = 3.5) +
  facet_wrap(~Time, scales = "free") + 
  theme_minimal() + 
  labs(title = "Randon Effects Histograms by Time Period")

In both cases, we compare the p-values to the 0.05 level of significance. The results suggest that we cannot reject the null hypothesis of normally distributed model errors and random effects.

In some cases, the assumptions underlying the MFH2 model are not met. Steps 5 and 6 recommend re-estimating the models after creating additional variables via transformations or variable interactions.

Comparing Direct Estimation to Multivariate Model Outputs

#### first let us compare the gains in precision

direct_list <- 
  mapply(x = direct_list,
         y = c(2012, 2013, 2014),
         FUN = function(x, y){
           
           x$year <- y
           
           return(x)
           
         }, SIMPLIFY = FALSE)

direct_dt <- Reduce("rbind", direct_list)


model2mse_dt <- 
  model2_obj$MSE |>
  mutate(Domain = 1:n()) |>
  pivot_longer(
    cols = starts_with("poor"),         # columns to pivot
    names_to = "year",                  # new column for the years
    values_to = "modelMSE"
  ) |>
  mutate(year = as.integer(substr(year, 5, 8)))

model2pov_dt <- 
  model2_obj$eblup |>
  mutate(Domain = 1:n()) |>
  pivot_longer(
    cols = starts_with("poor"),         # columns to pivot
    names_to = "year",                  # new column for the years
    values_to = "modelpov"
  ) |>
  mutate(year = as.integer(substr(year, 5, 8)))

model2pov_dt <- merge(model2mse_dt, model2pov_dt, by = c("Domain", "year"))


model2pov_dt <- 
  model2pov_dt |>
  mutate(modelCV = sqrt(modelMSE) / modelpov)


cv_dt <- merge(model2pov_dt, direct_dt, by = c("Domain", "year"))

cv_dt |>
  dplyr::select(Domain, year, CV, modelCV) |>
  pivot_longer(cols = c(CV, modelCV), names_to = "Type", values_to = "CV_value") |>
  ggplot(aes(x = factor(year), y = CV_value, color = Type, group = Type)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ Domain, scales = "free_y") +
  labs(title = "Comparison of Direct vs Model-based CVs",
       y = "Coefficient of Variation (CV)",
       x = "Year",
       color = "Estimation Type") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Did the poverty rates change over time?

Next, we call the pbmcpeMFH2() function (sourced at the top of this tutorial), which returns the EBLUPs, the MSEs of the EBLUPs for each time point, and the MCPEs (mean crossed product errors) for each pair of time points under MFH model 2, as follows:
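The MCPEs matter for comparing periods because of the identity applied later in compare_mfh2(): the MSE of an estimated change between periods \(t\) and \(s\) combines the period-specific MSEs with the corresponding MCPE, namely

\[
\mathrm{MSE}\big(\hat{\theta}_{dt} - \hat{\theta}_{ds}\big)
= \mathrm{MSE}\big(\hat{\theta}_{dt}\big) + \mathrm{MSE}\big(\hat{\theta}_{ds}\big)
- 2\,\mathrm{MCPE}\big(\hat{\theta}_{dt}, \hat{\theta}_{ds}\big).
\]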

mcpemfh2_obj <- 
pbmcpeMFH2(formula = mfh_formula,
           vardir = varcols,
           nB = 50,
           data = prov_dt)
## (bootstrap progress counter output, one line per replicate, omitted)

#' A function to compare and plot the change in EBLUPs between two time periods using the mcpe object
#' 
#' @param period_list integer vector of length 2, the time period indices to compare
#' @param mcpe_obj the object returned by pbmcpeMFH2(), containing the eblup, mse and mcpe elements
#' @param alpha numeric, the significance level used for the confidence bounds
#' 

compare_mfh2 <- function(period_list = c(1,2),
                         mcpe_obj = mcpemfh2_obj,
                         alpha = 0.05){
  
  col_chr <- paste0("(", period_list[1], ",", period_list[2], ")")

  df <- 
    tibble(diff = mcpe_obj$eblup[,period_list[2]] - mcpe_obj$eblup[,period_list[1]],
           mse = mcpe_obj$mse[,period_list[1]] + mcpe_obj$mse[,period_list[2]] - 2*mcpe_obj$mcpe[,col_chr],
           alpha = rep(alpha, nrow(mcpe_obj$eblup)),
           zq = qnorm(alpha / 2, lower.tail = F)) |>
    mutate(lb = diff - zq * sqrt(mse),
           ub = diff + zq * sqrt(mse)) |>
    mutate(significant = ifelse(lb > 0 | ub < 0, 
                                "Significant", 
                                "Not Significant")) |>
    mutate(index = row_number())
  
  p1 <- 
  ggplot(df, aes(x = index, y = diff)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbar(aes(ymin = lb, ymax = ub, color = significant), width = 0.2, linewidth = 1) +
  geom_point(color = "black", size = 2) +
  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "gray40")) +
  labs(
    x = "Area (Index)",
    y = "Difference between time periods",
    color = "Significance",
    title = paste0("Change in Poverty Rates based on MCPE between period ", 
                   period_list[1], 
                   " and ", 
                   period_list[2])
  ) +
  theme_minimal()
    
  
  return(list(df = df,
              plot = p1))
  
}

The above function takes the difference between any two time periods and prepares a table of differences for each area, including the MCPE-based MSEs of the differences as well as lower and upper bounds at the specified confidence level. See the function applied below to compare periods 1 and 2 (years 2012 and 2013), periods 2 and 3, and periods 1 and 3.

comp12_obj <- compare_mfh2()

comp23_obj <- compare_mfh2(period_list = c(2, 3))

comp13_obj <- compare_mfh2(period_list = c(1, 3))

See the plots below:

comp12_obj$plot

comp23_obj$plot

comp13_obj$plot

See the data created:

comp12_obj$df %>%
  head() %>%
  kable()
|     diff|       mse| alpha|       zq|         lb|        ub| significant     | index|
|--------:|---------:|-----:|--------:|----------:|---------:|:----------------|-----:|
|  0.01331| 0.0002437|  0.05| 1.959964| -0.0172891| 0.0439091| Not Significant |     1|
|  0.00545| 0.0000628|  0.05| 1.959964| -0.0100798| 0.0209798| Not Significant |     2|
| -0.00109| 0.0000187|  0.05| 1.959964| -0.0095636| 0.0073836| Not Significant |     3|
|  0.00000| 0.0000000|  0.05| 1.959964| -0.0000408| 0.0000408| Not Significant |     4|
|  0.00313| 0.0000353|  0.05| 1.959964| -0.0085208| 0.0147808| Not Significant |     5|
|  0.00235| 0.0000563|  0.05| 1.959964| -0.0123526| 0.0170526| Not Significant |     6|
comp23_obj$df %>%
  head() %>%
  kable()
|     diff|       mse| alpha|       zq|        lb|        ub| significant | index|
|--------:|---------:|-----:|--------:|---------:|---------:|:------------|-----:|
|  0.07862| 0.0013890|  0.05| 1.959964| 0.0055744| 0.1516656| Significant |     1|
|  0.10423| 0.0004644|  0.05| 1.959964| 0.0619940| 0.1464660| Significant |     2|
|  0.12974| 0.0003893|  0.05| 1.959964| 0.0910680| 0.1684120| Significant |     3|
|  0.09968| 0.0010776|  0.05| 1.959964| 0.0353400| 0.1640200| Significant |     4|
|  0.11912| 0.0008807|  0.05| 1.959964| 0.0609557| 0.1772843| Significant |     5|
|  0.10658| 0.0013460|  0.05| 1.959964| 0.0346738| 0.1784862| Significant |     6|
comp13_obj$df %>%
  head() %>%
  kable()
|     diff|       mse| alpha|       zq|        lb|        ub| significant | index|
|--------:|---------:|-----:|--------:|---------:|---------:|:------------|-----:|
|  0.09193| 0.0014206|  0.05| 1.959964| 0.0180584| 0.1658016| Significant |     1|
|  0.10968| 0.0005839|  0.05| 1.959964| 0.0623211| 0.1570389| Significant |     2|
|  0.12865| 0.0004065|  0.05| 1.959964| 0.0891316| 0.1681684| Significant |     3|
|  0.09968| 0.0010775|  0.05| 1.959964| 0.0353433| 0.1640167| Significant |     4|
|  0.12225| 0.0008663|  0.05| 1.959964| 0.0645627| 0.1799373| Significant |     5|
|  0.10893| 0.0012002|  0.05| 1.959964| 0.0410304| 0.1768296| Significant |     6|