Stable FH Estimators over T Time Periods: The Multivariate Fay-Herriot Modelling Approach

This section describes procedures that yield stable small area estimators for each of \(D\) areas over \(T\) subsequent time instants. Area populations, samples and data might change between time periods. Accordingly, we denote by \(U_t\) the overall population at time \(t\), which is partitioned into \(D\) areas \(U_{1t}, \ldots, U_{Dt}\) of respective population sizes \(N_{1t}, \ldots, N_{Dt}\).
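For orientation, here is a compact statement of the area-level model underlying this section, written as a sketch in generic notation:

\[
\hat{\theta}_{dt}^{\,\mathrm{dir}} = \theta_{dt} + e_{dt}, \qquad
\theta_{dt} = \mathbf{x}_{dt}^{\top}\boldsymbol{\beta}_t + u_{dt},
\qquad d = 1, \ldots, D, \quad t = 1, \ldots, T,
\]

where \(\hat{\theta}_{dt}^{\,\mathrm{dir}}\) is the direct estimator of the target parameter \(\theta_{dt}\), \(e_{dt}\) is the sampling error with an estimated variance-covariance matrix across the \(T\) periods, and \(u_{dt}\) is the area-time random effect. The MFH models fitted below differ in the covariance structure assumed for \(u_{dt}\): model 2 assumes homoskedastic variances across time, while model 3 allows heteroskedastic random effects, as discussed later in this section.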

An overview of the MFH model estimation process is given below.

We will show how to use the eblupMFH2() and eblupMFH3() functions from the R package msae [@permatasari2022package] to compute the EBLUPs and their MSE estimates under MFH models 2 and 3, respectively. The calls to these functions are:

eblupMFH2(formula, vardir, MAXITER = 100, PRECISION = 1e-04, data)

eblupMFH3(formula, vardir, MAXITER = 100, PRECISION = 1e-04, data)

MFH Estimation of Poverty Rates for T Time Periods

In this example, we use a synthetic data set adapted from the incomedata data set in the R package sae. The original data contain information on \(n = 17,119\) fictitious individuals residing across \(D = 52\) Spanish provinces. The variables include the name of the province of residence (provlab), the province code (prov), as well as several correlates of income. We have added two additional income vectors corresponding to two additional years of data.

We will show how to estimate a poverty map for each year using the multivariate Fay-Herriot modelling approach. This approach allows us to take advantage of the temporal correlation between poverty rates, i.e. an individual's income in year \(t\) is likely correlated with their income in year \(t+1\).

The rest of this tutorial shows how to prepare MFH models to estimate the poverty rates, using a random 10% sample of the incomedata records.

if (sum(installed.packages()[,1] %in% "pacman") != 1){
  
  install.packages("pacman")
  
}

pacman::p_load(sf, data.table, tidyverse, car, msae, 
               sae, survey, spdep, knitr, MASS, caret,
               purrr, here)

source(here::here("scripts/Function_pbmcpeMFH2_test.R"))

income_dt <- readRDS(here::here("data/incomedata_survey.RDS"))

prov_dt <- readRDS(here::here("data/shapes/simadmin.RDS"))

shp_dt <- readRDS(here::here("data/shapes/spainshape.RDS"))

glimpse(income_dt)
## Rows: 4,298
## Columns: 9
## Groups: provlab [52]
## $ provlab     <fct> Alava, Alava, Alava, Alava, Alava, Alava, Alava, Alava, Al…
## $ prov        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ income2012  <dbl> 26125.271, 27404.099, 13117.838, 27920.737, 8705.592, 6340…
## $ income2013  <dbl> 27500.673, 28174.985, 13581.315, 28508.978, 9182.841, 6272…
## $ income2014  <dbl> 34739.733, 32137.846, 15980.581, 31512.153, 11699.903, 593…
## $ povline2012 <dbl> 6477.484, 6477.484, 6477.484, 6477.484, 6477.484, 6477.484…
## $ povline2013 <dbl> 6515.865, 6515.865, 6515.865, 6515.865, 6515.865, 6515.865…
## $ povline2014 <dbl> 6515.865, 6515.865, 6515.865, 6515.865, 6515.865, 6515.865…
## $ weight      <dbl> 7424.759, 23577.245, 41976.969, 19471.566, 12738.217, 2463…

Step 1: Direct Estimation

We will use the direct Horvitz-Thompson (HT) estimators based on the survey weights in the weight variable. First, we calculate the total sample size, the number of provinces, and the sample sizes for each province, and extract the population sizes for each province/target area from the sizeprov file. For those using household/individual level survey data, the population sizes may be obtained as the sum of the household or individual weights, as appropriate.
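For reference, the weighted (Hájek-type) direct headcount estimator computed in this step for province \(d\) and year \(t\) takes the standard form

\[
\hat{\theta}_{dt}^{\,\mathrm{dir}} =
\frac{\sum_{i \in s_{dt}} w_{i}\, \mathbf{1}\{y_{it} < z_t\}}{\sum_{i \in s_{dt}} w_{i}},
\]

where \(s_{dt}\) is the sample in province \(d\) at time \(t\), \(w_i\) are the survey weights, \(y_{it}\) is income and \(z_t\) is the year-specific poverty line.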

data(sizeprov)

## quickly compute sample size for each province
sampsize_dt <- 
income_dt |>
  group_by(prov) |>
  summarize(N = n())

## the poverty line for each year is already included within the data.
## Let's compute the direct estimates for each year of data and create a list of data.frames 
## (equal in length to the number of years) containing the direct estimate, the standard errors 
## and the coefficient of variation.

direct_list <- 
  mapply(FUN = function(y, threshold){
    
     z <- emdi::direct(y = y,
                       smp_data = income_dt %>% as.data.table(),
                       smp_domains = "prov",
                       weights = "weight",
                       threshold = unique(income_dt[[threshold]]),
                       var = TRUE)
     
     z <- 
       z$ind |>
       dplyr::select(Domain, Head_Count) |>
       rename(Direct = "Head_Count") |>
       merge(z$MSE |>
               dplyr::select(Domain, Head_Count) |>
               rename(MSE = "Head_Count"),
             by = "Domain") |>
       mutate(SD = sqrt(MSE)) |>
       mutate(CV = SD / Direct) |>
       merge(sampsize_dt |> 
               mutate(prov = as.factor(prov)), 
             by.x = "Domain", 
             by.y = "prov") |>
       mutate(var_SRS = Direct * (1 - Direct) / N) |>
       mutate(deff = MSE / var_SRS) |>
       mutate(n_eff = N / deff)
       
      
     return(z)

  }, SIMPLIFY = FALSE,
  y = c("income2012", "income2013", "income2014"),
  threshold = c("povline2012", "povline2013", "povline2014"))
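The last three columns created above follow the usual definitions, restated here for clarity: the variance under simple random sampling, the design effect, and the effective sample size,

\[
\widehat{\mathrm{var}}_{\mathrm{SRS}} =
\frac{\hat{\theta}^{\,\mathrm{dir}}\,(1 - \hat{\theta}^{\,\mathrm{dir}})}{n_d},
\qquad
\mathrm{deff} = \frac{\widehat{\mathrm{MSE}}\big(\hat{\theta}^{\,\mathrm{dir}}\big)}{\widehat{\mathrm{var}}_{\mathrm{SRS}}},
\qquad
n_{\mathrm{eff}} = \frac{n_d}{\mathrm{deff}},
\]

where \(n_d\) is the province sample size (the N column in the output below).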
direct_list %>%
  lapply(X = .,
         FUN = function(x){
           
           y <- head(x)
           
           y %>% kable()
           
         })
## $income2012
## 
## 
## |Domain |    Direct|       MSE|        SD|        CV|   N|   var_SRS|      deff|    n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|--------:|
## |1      | 0.2439406| 0.0110510| 0.1051237| 0.4309396|  24| 0.0076847| 1.4380443| 16.68933|
## |10     | 0.2571874| 0.0031061| 0.0557327| 0.2167007|  70| 0.0027292| 1.1381234| 61.50476|
## |11     | 0.1009752| 0.0010541| 0.0324664| 0.3215285| 100| 0.0009078| 1.1611329| 86.12278|
## |12     | 0.1862899| 0.0051402| 0.0716952| 0.3848579|  30| 0.0050529| 1.0172832| 29.49031|
## |13     | 0.1538687| 0.0020267| 0.0450187| 0.2925786|  62| 0.0020999| 0.9651377| 64.23953|
## |14     | 0.2485756| 0.0040143| 0.0633589| 0.2548876|  56| 0.0033355| 1.2035352| 46.52959|
## 
## $income2013
## 
## 
## |Domain |    Direct|       MSE|        SD|        CV|   N|   var_SRS|      deff|     n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|---------:|
## |1      | 0.3117905| 0.0151210| 0.1229675| 0.3943914|  24| 0.0089407| 1.6912517|  14.19067|
## |10     | 0.2735087| 0.0025398| 0.0503964| 0.1842587|  70| 0.0028386| 0.8947362|  78.23535|
## |11     | 0.1174984| 0.0009567| 0.0309310| 0.2632464| 100| 0.0010369| 0.9226596| 108.38233|
## |12     | 0.1862899| 0.0051402| 0.0716952| 0.3848579|  30| 0.0050529| 1.0172832|  29.49031|
## |13     | 0.1617958| 0.0020810| 0.0456177| 0.2819463|  62| 0.0021874| 0.9513534|  65.17032|
## |14     | 0.2353446| 0.0035808| 0.0598402| 0.2542663|  56| 0.0032135| 1.1143047|  50.25555|
## 
## $income2014
## 
## 
## |Domain |    Direct|       MSE|        SD|        CV|   N|   var_SRS|      deff|    n_eff|
## |:------|---------:|---------:|---------:|---------:|---:|---------:|---------:|--------:|
## |1      | 0.4142299| 0.0141892| 0.1191185| 0.2875661|  24| 0.0101101| 1.4034626| 17.10056|
## |10     | 0.4206141| 0.0032788| 0.0572612| 0.1361372|  70| 0.0034814| 0.9418188| 74.32428|
## |11     | 0.2112138| 0.0025982| 0.0509725| 0.2413312| 100| 0.0016660| 1.5595162| 64.12245|
## |12     | 0.4128282| 0.0122581| 0.1107162| 0.2681896|  30| 0.0080800| 1.5170830| 19.77479|
## |13     | 0.2411189| 0.0036303| 0.0602519| 0.2498847|  62| 0.0029513| 1.2300663| 50.40379|
## |14     | 0.3264915| 0.0044502| 0.0667098| 0.2043232|  56| 0.0039267| 1.1333194| 49.41237|

Estimating the Variance-Covariance Matrix of the Sampling Error Distribution

Next, we estimate the sampling variances and covariances of the direct estimators using the survey R package as follows:

## quickly creating the poverty indicator
income_dt <- 
  income_dt |>
  mutate(poor2012 = ifelse(income2012 < povline2012, 1, 0),
         poor2013 = ifelse(income2013 < povline2013, 1, 0),
         poor2014 = ifelse(income2014 < povline2014, 1, 0))
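
Before wrapping everything in a general function, here is a minimal sketch of the core survey package calls for a single province (province code 1 is used purely for illustration); the compute_vcov() function below simply loops this logic over all provinces and tidies the output.

## minimal sketch: sampling variance-covariance matrix of the three poverty
## indicators for one province (the survey package is already loaded via pacman)
des_1 <- svydesign(ids = ~1,
                   weights = ~weight,
                   data = subset(income_dt, prov == 1))

est_1 <- svymean(~poor2012 + poor2013 + poor2014, des_1)

vcov(est_1)  ## 3 x 3 sampling variance-covariance matrix for province 1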


### a simple worker function for computing the variance covariance matrix of the sampling error of the direct estimate

compute_vcov <- function(dt,
                         domain,
                         ids,
                         weights = NULL,
                         fpc = NULL,
                         strata = NULL,
                         probs = NULL,
                         yvars,
                         deff = FALSE,
                         ...){
  
  ## a little worker function to convert expressions that are character to the right format for the survey package
  convert_expr <- function(x){
    
    y <- if (is.character(x)){
      
      rlang::expr(~!!sym(x))
      
    } else {
      
      rlang::expr(~ !!x)
      
    }
    
    y <- eval(y)
    
    return(y)
    
  }
  
  ### run it on the proper arguments
  ids <- convert_expr(x = ids)
  
  if (is.null(weights)){
    
    ## fall back to equal weights when no weights variable is supplied
    dt$weights <- 1
    weights <- "weights"
    
  }
  
  weights <- convert_expr(x = weights)
  
  if (!is.null(fpc)){
    
    fpc <- convert_expr(x = fpc)

  }
    
  if (!is.null(probs)){
    
    probs <- convert_expr(x = probs)

  }

  if (!is.null(strata)){
    
    strata <- convert_expr(x = strata)

  }

  ### create a survey design object from the survey package 
  
  dom_list <- unique(dt[[domain]])
  
  surv_vcov <- function(x){
    
    design_obj <- svydesign(ids = ids,
                            probs = probs,
                            strata = strata,
                            fpc = fpc,
                            data = dt |>
                              dplyr::filter(!!sym(domain) == x),
                            weights = weights)
    
    ### prepare the y formula for the svymean function
    yform <- reformulate(yvars)
    
    var_dt <- svymean(yform, design_obj, ...) ## use the svymean object to compute variance
    
    var_dt <- vcov(var_dt) ### compute variance covariance matrix
    
    var_dt <- as.numeric(c(diag(var_dt), var_dt[lower.tri(var_dt, diag = FALSE)]))
    
    pair_list <- c(lapply(yvars, rep, 2),
                   combn(yvars, 2, simplify = FALSE))
  
    
    pair_list <- lapply(pair_list,
                        function(x){
                          
                          zz <- paste0("v_", paste(x, collapse = ""))
                          
                          return(zz)
                          
                        })
    
    pair_list <- unlist(pair_list)
    
    names(var_dt) <- pair_list
    
    return(var_dt)

  }
  
  vcov_list <- lapply(X = dom_list,
                      FUN = surv_vcov)
  
  var_dt <- Reduce(f = "rbind",
                   x = vcov_list) %>%
    as_tibble() %>%
    mutate(domain = dom_list)
  
  return(var_dt)  
  
}

### running the compute_vcov function to prepare the variance covariance matrix 
var_dt <- 
compute_vcov(dt = income_dt, 
             domain = "prov",
             ids = 1, 
             weights = "weight", 
             yvars = c("poor2012", "poor2013", "poor2014"))

### sample size
var_dt <- 
  var_dt |> 
  merge(sampsize_dt, by.x = "domain", by.y = "prov")

Handling Low Sample Size: Variance Smoothing

A quick inspection of the preceding results shows that some provinces have small sample sizes, which can produce extreme poverty rate estimates and hence zero estimated variances. To avoid this, we apply the variance smoothing method suggested by [@you2023application]. Please see the code and the Roxygen comments below explaining the use of the varsmoothie_king() function, which computes smoothed variances.
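In brief, the smoothing model fitted inside varsmoothie_king() is a log-linear regression of the direct variances on the log sample sizes, with a lognormal bias correction applied when transforming back to the original scale (this simply restates what the code below does):

\[
\log\big(\widehat{\mathrm{var}}_d\big) = \beta_0 + \beta_1 \log(n_d) + \varepsilon_d,
\qquad
\widetilde{\mathrm{var}}_d =
\exp\big(\hat{\beta}_0 + \hat{\beta}_1 \log(n_d)\big)\,
\exp\big(\hat{\sigma}_{\varepsilon}^{2} / 2\big),
\]

where \(\hat{\sigma}_{\varepsilon}^{2}\) is the residual variance of the fitted regression.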

#' A function to perform variance smoothing
#' 
#' The variance smoothing function applies the methodology of Hidiroglou and You (2023),
#' which uses a simple log-linear regression to smooth the direct variance estimates of
#' sample poverty rates; this is useful for replacing unstable (or zero) direct variances
#' in areas with small samples.
#' 
#' @param domain a vector of unique domain/target area identifiers
#' @param direct_var the raw variances estimated from the sample data
#' @param sampsize the sample size for each domain
#' @param y (optional, currently unused) unit-level poverty indicator, i.e. poor = 1, non-poor = 0
#' 
#' @export

varsmoothie_king <- function(domain,
                             direct_var,
                             sampsize,
                             y){

  dt <- data.table(domain = domain,
                   var = direct_var,
                   n = sampsize)

  dt$log_n <- log(dt$n)
  dt$log_var <- log(dt$var)

  lm_model <- lm(formula = log_var ~ log_n,
                 data = dt[!(abs(dt$log_var) == Inf),])

  dt$pred_var <- predict(lm_model, newdata = dt)

  residual_var <-  summary(lm_model)$sigma^2
  
  dt$var_smooth <- exp(dt$pred_var) * exp(residual_var/2)

  return(dt[, c("domain", "var_smooth"), with = F])

}

OK, the goal now is to use the above varsmoothie_king() function to add additional columns of smoothed variances to our var_dt object.

vardir <- grep("^v_", names(var_dt), value = TRUE)

var_dt <-
  lapply(X = vardir,
         FUN = function(x){

           z <- varsmoothie_king(domain = var_dt[["domain"]],
                                 direct_var = var_dt[[x]],
                                 sampsize = var_dt[["N"]]) |>
             as.data.table() |>
             setnames(old = "var_smooth", new = paste0("vs", x)) |>
             as_tibble()
           

           return(z)

         }) %>%
  Reduce(f = "merge",
         x = .) %>%
  merge(x = var_dt,
        y = .,
        by = "domain") |>
  as_tibble()
## Warning in log(dt$var): NaNs produced
## Warning in log(dt$var): NaNs produced
Now, you can replace the zero or near-zero sampling variances (which arise in areas with very small samples) with their smoothed counterparts; we do this just before model estimation in Step 3 below.

Step 2: Model Selection

More (and probably better) correlates of income and poverty can be created to improve the predictive power of the model. The next step is to assemble the target-area-level variables. The MFH model is a model of poverty rates at the target area level, hence the dataset required for this exercise has the province as its unit of observation. This format has a few essential columns:

    1. Variables containing the poverty rates for each time period
    2. The set of candidate variables from which the best predictors of the poverty rates will be selected
    3. The target area identifiers (in this case the province variables prov and provlab)

Model Selection

Next, we apply a simple variable selection process which employs the stepwise regression algorithm with the AIC selection criterion, as described by [@yamashita2007stepwise]. The function step_wrapper() implemented below is a wrapper around MASS::stepAIC() that carries out the perfunctory cleaning necessary to use stepAIC(). This includes dropping columns that are entirely missing (NA), keeping only complete cases/observations, and removing perfectly or nearly collinear variables and combinations using aliased coefficients, caret::findLinearCombos() and a pairwise correlation threshold.

#' A function to perform stepwise variable selection with the AIC selection criterion
#' 
#' @param dt data.frame, dataset containing the outcome and independent variables
#' @param xvars character vector, the set of candidate x variables
#' @param y character, the name of the y variable
#' @param cor_thresh numeric, pairwise correlation threshold above which one of each highly correlated pair is dropped
#' 
#' @import data.table
#' @importFrom caret findLinearCombos
#' @importFrom MASS stepAIC

step_wrapper <- function(dt, xvars, y, cor_thresh = 0.95) {
  
  dt <- as.data.table(dt)
  
  # Drop columns that are entirely NA
  dt <- dt[, which(unlist(lapply(dt, function(x) !all(is.na(x))))), with = FALSE]
  
  xvars <- xvars[xvars %in% colnames(dt)]
  
  # Keep only complete cases
  dt <- na.omit(dt[, c(y, xvars), with = FALSE])
  
  # Step 1: Remove aliased (perfectly collinear) variables
  model_formula <- as.formula(paste(y, "~", paste(xvars, collapse = " + ")))
  lm_model <- lm(model_formula, data = dt)
  aliased <- is.na(coef(lm_model))
  if (any(aliased)) {
    xvars <- names(aliased)[!aliased & names(aliased) != "(Intercept)"]
  }
  
  # Step 2: Remove near-linear combinations
  xmat <- as.matrix(dt[, ..xvars])
  combo_check <- tryCatch(findLinearCombos(xmat), error = function(e) NULL)
  if (!is.null(combo_check) && length(combo_check$remove) > 0) {
    xvars <- xvars[-combo_check$remove]
    xmat <- as.matrix(dt[, ..xvars])
  }
  
  # Step 3: Drop highly correlated variables
  cor_mat <- abs(cor(xmat))
  diag(cor_mat) <- 0
  while (any(cor_mat > cor_thresh, na.rm = TRUE)) {
    cor_pairs <- which(cor_mat == max(cor_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]
    var1 <- colnames(cor_mat)[cor_pairs[1]]
    var2 <- colnames(cor_mat)[cor_pairs[2]]
    # Drop the variable with higher mean correlation
    drop_var <- if (mean(cor_mat[var1, ]) > mean(cor_mat[var2, ])) var1 else var2
    xvars <- setdiff(xvars, drop_var)
    xmat <- as.matrix(dt[, ..xvars])
    cor_mat <- abs(cor(xmat))
    diag(cor_mat) <- 0
  }
  
  # Step 4: Warn if still ill-conditioned
  cond_number <- kappa(xmat, exact = TRUE)
  if (cond_number > 1e10) {
    warning("Design matrix is ill-conditioned (condition number > 1e10). Consider reviewing variable selection.")
  }
  
  # Final model fit
  model_formula <- as.formula(paste(y, "~", paste(xvars, collapse = " + ")))
  full_model <- lm(model_formula, data = dt)
  
  # Stepwise selection
  stepwise_model <- stepAIC(full_model, direction = "both", trace = 0)
  
  return(stepwise_model)
  
}

We apply the variable selection process using the first time period (poor2012) and reuse the selected covariates in each year's model. See the application below:

### the set of candidate covariates (everything except the province identifiers)
candidate_vars <- colnames(prov_dt)[!colnames(prov_dt) %in% c("prov", "provlab")]


### creating the province level data for variable selection and poverty mapping
prov_dt <- 
  map2(.x = list("poor2012", "poor2013", "poor2014"), ### first we create a direct estimate dataset from direct_list
       .y = direct_list,
       ~ .y |> 
         rename(!!sym(.x) := Direct) |>
         dplyr::select(Domain, all_of(.x))) |>
  Reduce(f = "merge") |> ### merge the selected variables from each dataframe of the list
  merge(prov_dt, ### merging the prepared simulated administrative data variables
        by.x = "Domain", by.y = "prov", all = TRUE) |>
  merge(var_dt,
        by.x = "Domain", by.y = "domain", all = TRUE)
  
fh_step <- 
step_wrapper(dt = prov_dt,
             xvars = candidate_vars,
             y = "poor2012",
             cor_thresh = 0.7)

fh_step <- names(fh_step$coefficients)[!grepl("(Intercept)", names(fh_step$coefficients))]

fh_step <- as.formula(paste0("poor2012 ~ ",  paste(fh_step, collapse = " + ")))

rhs <- paste(deparse(fh_step[[3]]), collapse = " ")


outcome_list <- list("poor2012", "poor2013", "poor2014")
# Then generate the list of formulas
mfh_formula <- 
  lapply(outcome_list, 
         function(x) {
           
            as.formula(paste(x, "~", rhs)) 
           
           })

names(mfh_formula) <- outcome_list

Let's run some linear regressions with the set of selected variables to get a sense of the predictive power of each year's model independently.

lmcheck_obj <- 
lapply(X = mfh_formula,
       FUN = lm,
       data = prov_dt)


lapply(X = lmcheck_obj,
       FUN = summary)
## $poor2012
## 
## Call:
## FUN(formula = X[[i]], data = ..1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12244 -0.03693  0.01055  0.03833  0.09114 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.406566   0.190009   2.140 0.038233 *  
## gen          -0.194001   0.121391  -1.598 0.117508    
## age5          0.378420   0.131694   2.873 0.006343 ** 
## labor1        0.231915   0.120111   1.931 0.060272 .  
## labor2        0.532883   0.289103   1.843 0.072360 .  
## aec          -0.012975   0.009338  -1.389 0.172025    
## mkt           0.032275   0.008521   3.788 0.000478 ***
## age3_X_gen   -0.213792   0.078529  -2.722 0.009395 ** 
## age2_X_educ3  1.626525   0.885467   1.837 0.073306 .  
## age3_X_educ3  0.354384   0.165834   2.137 0.038468 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05718 on 42 degrees of freedom
## Multiple R-squared:  0.6163, Adjusted R-squared:  0.5341 
## F-statistic: 7.495 on 9 and 42 DF,  p-value: 1.94e-06
## 
## 
## $poor2013
## 
## Call:
## FUN(formula = X[[i]], data = ..1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12999 -0.04099  0.01257  0.04400  0.09036 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.47048    0.21049   2.235   0.0308 *  
## gen          -0.21470    0.13448  -1.597   0.1179    
## age5          0.32666    0.14589   2.239   0.0305 *  
## labor1        0.14109    0.13306   1.060   0.2950    
## labor2        0.35688    0.32027   1.114   0.2715    
## aec          -0.01335    0.01034  -1.290   0.2040    
## mkt           0.03632    0.00944   3.847   0.0004 ***
## age3_X_gen   -0.17262    0.08699  -1.984   0.0538 .  
## age2_X_educ3  1.60622    0.98091   1.637   0.1090    
## age3_X_educ3  0.34192    0.18371   1.861   0.0697 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06334 on 42 degrees of freedom
## Multiple R-squared:  0.5664, Adjusted R-squared:  0.4734 
## F-statistic: 6.095 on 9 and 42 DF,  p-value: 1.928e-05
## 
## 
## $poor2014
## 
## Call:
## FUN(formula = X[[i]], data = ..1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18919 -0.03379 -0.00547  0.04263  0.15614 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.57338    0.25506   2.248   0.0299 *
## gen          -0.07116    0.16295  -0.437   0.6646  
## age5         -0.11779    0.17678  -0.666   0.5089  
## labor1       -0.28669    0.16123  -1.778   0.0826 .
## labor2        0.56014    0.38808   1.443   0.1563  
## aec          -0.02156    0.01254  -1.720   0.0928 .
## mkt           0.02276    0.01144   1.990   0.0531 .
## age3_X_gen   -0.14322    0.10541  -1.359   0.1815  
## age2_X_educ3 -0.41151    1.18861  -0.346   0.7309  
## age3_X_educ3  0.43581    0.22261   1.958   0.0569 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07675 on 42 degrees of freedom
## Multiple R-squared:  0.4368, Adjusted R-squared:  0.3161 
## F-statistic: 3.619 on 9 and 42 DF,  p-value: 0.002054

Step 3: Model Estimation

Next, we show how to use the msae R package to estimate the Empirical Best Linear Unbiased Predictors (EBLUPs) for the poverty map using eblupMFH2(), which allows for time-series Fay-Herriot estimation under homoskedastic random effect assumptions. For comparison, we also fit the univariate Fay-Herriot models with the eblupUFH() function, as well as MFH model 1 with eblupMFH1().

### the sampling variance-covariance columns were already merged into prov_dt above;
### if they had not been, you could merge them as follows:

# prov_dt <- 
#   prov_dt |>
#   merge(var_dt, by.x = "prov", by.y = "domain")

### variance-covariance matrix columns
varcols <- colnames(var_dt)[grepl(pattern = "^v_", x = colnames(var_dt))] 

## replace the variances-covariances that are zero with their smoothed counterparts
prov_dt <- 
  prov_dt |>
  mutate(across(
    starts_with("v_"),
    ~ if_else(abs(.x) <= 1e-4, get(paste0("vsv", str_remove(cur_column(), "^v"))), .x),
    .names = "{.col}"
  ))
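
To see how many variance-covariance cells the rule above actually affects, a quick count over the raw columns can be run (a sketch using the varcols vector defined above; the 1e-4 cutoff mirrors the replacement rule):

## number of raw variance-covariance cells below the cutoff, per column
sapply(varcols, function(x) sum(abs(var_dt[[x]]) <= 1e-4))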



#### now we estimate the competing models, including the multivariate Fay-Herriot models
#### (the MFH3 fit is shown commented out)

model0_obj <- eblupUFH(mfh_formula, vardir = varcols, data = prov_dt)
model1_obj <- eblupMFH1(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
model2_obj <- eblupMFH2(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)
# model3_obj <- eblupMFH3(mfh_formula, vardir = varcols, data = prov_dt, MAXITER = 1e10, PRECISION = 1e-2)

With the MFH3 estimation, we can check if the variance-covariance matrix structure points to heteroskedastic random effects. We do so as follows:

# model3_obj$fit$refvarTest

This tests the null hypothesis that the random effect variances \(\sigma_t^2\) at each pair of time instants \(t\) and \(s\) are equal against the alternative that they are not. In this case, at a significance level of 0.05, we reject the equality of variances between \(t = 2013\) and \(t = 2014\); hence we can use the MFH3 model (the eblupMFH3() function). If these tests had supported the equality of variances, we would instead use the eblupMFH2() function for estimation.
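In symbols, for each pair of time instants \(t \neq s\), refvarTest evaluates

\[
H_0: \sigma_t^2 = \sigma_s^2
\qquad \text{against} \qquad
H_1: \sigma_t^2 \neq \sigma_s^2 .
\]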

Step 4: Post Estimation Diagnostics: Model Assumption Checks for Linearity, Normality and Outliers

We now verify the assumptions of the fitted MFH model (the diagnostics below use the MFH2 fit, model2_obj). This includes assessing linearity, the normality of the predicted random effects and of the residuals, as well as checking for the presence of outlying areas.

Checking the Linearity Assumption

We first check the linearity assumption. This can be assessed by regressing the model residuals against the predicted values (EBLUPs).

resids_3 <- cbind(prov_dt$poor2012 - model2_obj$eblup$poor2012,
                  prov_dt$poor2013 - model2_obj$eblup$poor2013,
                  prov_dt$poor2014 - model2_obj$eblup$poor2014)

layout(matrix(1:3, nrow = 1, byrow = TRUE))

plot(model2_obj$eblup$poor2012, resids_3[,1], pch = 19, xlab = "EBLUPs t = 1", ylab = "Residuals t = 1")
plot(model2_obj$eblup$poor2013, resids_3[,2], pch = 19, xlab = "EBLUPs t = 2", ylab = "Residuals t = 2")
plot(model2_obj$eblup$poor2014, resids_3[,3], pch = 19, xlab = "EBLUPs t = 3", ylab = "Residuals t = 3")

fits_3 <- model2_obj$eblup  # EBLUPs (predicted values)

# Run a regression of residuals on fitted values for each time period.
# Note: inside an R formula, a term like poor2012^2 without I() collapses to the
# linear term, so these are simple linear regressions of the residuals on the
# EBLUPs; wrap the term in I() to add a genuine quadratic term.
lm_resid_fit_t1 <- lm(resids_3[,1] ~ model2_obj$eblup$poor2012 + model2_obj$eblup$poor2012^2)
lm_resid_fit_t2 <- lm(resids_3[,2] ~ model2_obj$eblup$poor2013 + model2_obj$eblup$poor2013^2)
lm_resid_fit_t3 <- lm(resids_3[,3] ~ model2_obj$eblup$poor2014 + model2_obj$eblup$poor2014^2)

# View summaries
summary(lm_resid_fit_t1)
## 
## Call:
## lm(formula = resids_3[, 1] ~ model2_obj$eblup$poor2012 + model2_obj$eblup$poor2012^2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19940 -0.03052  0.00796  0.03862  0.08945 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.03395    0.02327   1.459    0.151
## model2_obj$eblup$poor2012 -0.10025    0.11095  -0.904    0.371
## 
## Residual standard error: 0.0556 on 50 degrees of freedom
## Multiple R-squared:  0.01607,    Adjusted R-squared:  -0.003614 
## F-statistic: 0.8164 on 1 and 50 DF,  p-value: 0.3706
summary(lm_resid_fit_t2)
## 
## Call:
## lm(formula = resids_3[, 2] ~ model2_obj$eblup$poor2013 + model2_obj$eblup$poor2013^2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.196228 -0.040498  0.005571  0.040744  0.104039 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.03226    0.02437   1.323    0.192
## model2_obj$eblup$poor2013 -0.08303    0.11556  -0.719    0.476
## 
## Residual standard error: 0.05865 on 50 degrees of freedom
## Multiple R-squared:  0.01022,    Adjusted R-squared:  -0.009576 
## F-statistic: 0.5163 on 1 and 50 DF,  p-value: 0.4758
summary(lm_resid_fit_t3)
## 
## Call:
## lm(formula = resids_3[, 3] ~ model2_obj$eblup$poor2014 + model2_obj$eblup$poor2014^2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34029 -0.03226  0.01047  0.04736  0.16709 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                0.03868    0.05732   0.675    0.503
## model2_obj$eblup$poor2014 -0.09438    0.19996  -0.472    0.639
## 
## Residual standard error: 0.07893 on 50 degrees of freedom
## Multiple R-squared:  0.004436,   Adjusted R-squared:  -0.01548 
## F-statistic: 0.2228 on 1 and 50 DF,  p-value: 0.639

Evaluating the Normality Assumption

The Shapiro-Wilk Test

We use the Shapiro-Wilk test of normality via the shapiro.test() function in base R. The Shapiro-Wilk test assesses whether a sample of data is drawn from a normally distributed population. It does so by comparing the order statistics (i.e., the sorted values) of the sample to their expected values under a normal distribution. The test statistic \(W\) can be read as the squared correlation between the ordered sample values and the corresponding expected normal quantiles.
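For reference, the standard form of the statistic is

\[
W = \frac{\Big(\sum_{i=1}^{n} a_i\, x_{(i)}\Big)^{2}}{\sum_{i=1}^{n} \big(x_i - \bar{x}\big)^{2}},
\]

where \(x_{(i)}\) are the ordered sample values and the coefficients \(a_i\) are derived from the expected normal order statistics; values of \(W\) close to 1 are consistent with normality.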

First, we perform the Shapiro-Wilk normality test on the model errors, \(\varepsilon\) (approximated here by the direct estimates minus the EBLUPs). We show both the residual histograms and the QQ plots below:

### evaluating the normality assumption
resid_dt <- prov_dt[,c("poor2012", "poor2013", "poor2014")] - model2_obj$eblup

### perform the shapiro test

shapiro_obj <- apply(resid_dt, 2, shapiro.test)

summary_dt <- 
  data.frame(Time = names(shapiro_obj),
             W = lapply(X = shapiro_obj,
                        FUN = function(x){
                          
                          return(x$statistic[[1]])
                          
                        }) %>%
               as.numeric(),
             p_value = lapply(X = shapiro_obj,
                              FUN = function(x){
                                
                                return(x$p.value)
                                
                              }) %>%
               as.numeric())

### plot the results
summary_dt <- 
  summary_dt %>%
  mutate(label = paste0("W = ", round(W, 3), "\n", "p = ", signif(p_value, 3)))

resid_dt %>%
  pivot_longer(cols = everything(), 
               names_to = "Time", 
               values_to = "Residual") %>%
  ggplot(aes(x = Residual)) + 
  geom_histogram(bins = 10, fill = "steelblue", color = "white") + 
  geom_text(data = summary_dt, aes(x = -Inf, y = Inf, label = label),
            hjust = -0.1, vjust = 1.2, inherit.aes = FALSE, size = 3.5) +
  facet_wrap(~Time, scales = "free") + 
  theme_minimal() + 
  labs(title = "Residual Histograms by Time Period")

### here's how to create qqplots
resid_dt %>%
  pivot_longer(cols = everything(),
               names_to = "Time",
               values_to = "Residual") %>%
  ggplot(aes(sample = Residual)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Time, scales = "free") +
  theme_minimal() +
  labs(title = "QQ Plots of Residuals by Time Period")

Likewise, we test the normality of the predicted random effects.

#### For the random effects
raneff_dt <- as.data.frame(model2_obj$randomEffect)

### lets run the shapiro wilks tests again
shapiro_obj <- apply(raneff_dt, 2, shapiro.test)


summary_dt <- 
  data.frame(Time = names(shapiro_obj),
             W = lapply(X = shapiro_obj,
                        FUN = function(x){
                          
                          return(x$statistic[[1]])
                          
                        }) %>%
               as.numeric(),
             p_value = lapply(X = shapiro_obj,
                              FUN = function(x){
                                
                                return(x$p.value)
                                
                              }) %>%
               as.numeric())

### plot the results
summary_dt <- 
  summary_dt %>%
  mutate(label = paste0("W = ", round(W, 3), "\n", "p = ", signif(p_value, 3)))

raneff_dt %>%
  pivot_longer(cols = everything(), 
               names_to = "Time", 
               values_to = "RandEff") %>%
  ggplot(aes(x = RandEff)) + 
  geom_histogram(bins = 10, fill = "darkorange", color = "white") + 
  geom_text(data = summary_dt, aes(x = -Inf, y = Inf, label = label),
            hjust = -0.1, vjust = 1.2, inherit.aes = FALSE, size = 3.5) +
  facet_wrap(~Time, scales = "free") + 
  theme_minimal() + 
  labs(title = "Randon Effects Histograms by Time Period")

In both cases, we compare the p-values to the 0.05 level of significance. The results suggest that we cannot reject the null hypothesis of normally distributed model errors and random effects.

In some cases, the assumptions underlying the MFH2 model are not met. Steps 5 and 6 recommend re-estimating the models after creating additional variables via transformations or variable interactions.

Comparing Direct Estimation to Multivariate Model Outputs

#### first let us compare the gains in precision

direct_list <- 
  mapply(x = direct_list,
         y = c(2012, 2013, 2014),
         FUN = function(x, y){
           
           x$year <- y
           
           return(x)
           
         }, SIMPLIFY = FALSE)

direct_dt <- Reduce("rbind", direct_list)


model2mse_dt <- 
  model2_obj$MSE |>
  mutate(Domain = 1:n()) |>
  pivot_longer(
    cols = starts_with("poor"),         # columns to pivot
    names_to = "year",                  # new column for the years
    values_to = "modelMSE"
  ) |>
  mutate(year = as.integer(substr(year, 5, 8)))

model2pov_dt <- 
  model2_obj$eblup |>
  mutate(Domain = 1:n()) |>
  pivot_longer(
    cols = starts_with("poor"),         # columns to pivot
    names_to = "year",                  # new column for the years
    values_to = "modelpov"
  ) |>
  mutate(year = as.integer(substr(year, 5, 8)))

model2pov_dt <- merge(model2mse_dt, model2pov_dt, by = c("Domain", "year"))


model2pov_dt <- 
  model2pov_dt |>
  mutate(modelCV = sqrt(modelMSE) / modelpov)


cv_dt <- merge(model2pov_dt, direct_dt, by = c("Domain", "year"))

cv_dt |>
  dplyr::select(Domain, year, CV, modelCV) |>
  pivot_longer(cols = c(CV, modelCV), names_to = "Type", values_to = "CV_value") |>
  ggplot(aes(x = factor(year), y = CV_value, color = Type, group = Type)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ Domain, scales = "free_y") +
  labs(title = "Comparison of Direct vs Model-based CVs",
       y = "Coefficient of Variation (CV)",
       x = "Year",
       color = "Estimation Type") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Did the poverty rates change over time?

Next, we call the pbmcpeMFH2() function (sourced at the top of this tutorial), which returns the EBLUPs, the MSEs of the EBLUPs for each time point, and the MCPEs (mean crossed product errors) for each pair of time points under MFH model 2, as follows:
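The MCPEs matter for comparing periods because of the identity applied later in compare_mfh2(): the MSE of an estimated change between periods \(t\) and \(s\) combines the period-specific MSEs with the corresponding MCPE, namely

\[
\mathrm{MSE}\big(\hat{\theta}_{dt} - \hat{\theta}_{ds}\big)
= \mathrm{MSE}\big(\hat{\theta}_{dt}\big) + \mathrm{MSE}\big(\hat{\theta}_{ds}\big)
- 2\,\mathrm{MCPE}\big(\hat{\theta}_{dt}, \hat{\theta}_{ds}\big).
\]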

mcpemfh2_obj <- 
pbmcpeMFH2(formula = mfh_formula,
           vardir = varcols,
           nB = 50,
           data = prov_dt)
## (bootstrap progress counter output, one line per replicate, omitted)

#' A function to compare and plot the change in EBLUPs between two time periods using the mcpe object
#' 
#' @param period_list integer vector of length 2, the time period indices to compare
#' @param mcpe_obj the object returned by pbmcpeMFH2(), containing the eblup, mse and mcpe elements
#' @param alpha numeric, the significance level used for the confidence bounds
#' 

compare_mfh2 <- function(period_list = c(1,2),
                         mcpe_obj = mcpemfh2_obj,
                         alpha = 0.05){
  
  col_chr <- paste0("(", period_list[1], ",", period_list[2], ")")

  df <- 
    tibble(diff = mcpe_obj$eblup[,period_list[2]] - mcpe_obj$eblup[,period_list[1]],
           mse = mcpe_obj$mse[,period_list[1]] + mcpe_obj$mse[,period_list[2]] - 2*mcpe_obj$mcpe[,col_chr],
           alpha = rep(alpha, nrow(mcpe_obj$eblup)),
           zq = qnorm(alpha / 2, lower.tail = F)) |>
    mutate(lb = diff - zq * sqrt(mse),
           ub = diff + zq * sqrt(mse)) |>
    mutate(significant = ifelse(lb > 0 | ub < 0, 
                                "Significant", 
                                "Not Significant")) |>
    mutate(index = row_number())
  
  p1 <- 
  ggplot(df, aes(x = index, y = diff)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbar(aes(ymin = lb, ymax = ub, color = significant), width = 0.2, linewidth = 1) +
  geom_point(color = "black", size = 2) +
  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "gray40")) +
  labs(
    x = "Area (Index)",
    y = "Difference between time periods",
    color = "Significance",
    title = paste0("Change in Poverty Rates based on MCPE between period ", 
                   period_list[1], 
                   " and ", 
                   period_list[2])
  ) +
  theme_minimal()
    
  
  return(list(df = df,
              plot = p1))
  
}

The above function takes the difference between any two time periods and prepares a table of differences for each area, including the MCPE-based MSEs of the differences as well as lower and upper bounds at the specified confidence level. See the function applied below to compare periods 1 and 2 (years 2012 and 2013), periods 2 and 3, and periods 1 and 3.

comp12_obj <- compare_mfh2()

comp23_obj <- compare_mfh2(period_list = c(2, 3))

comp13_obj <- compare_mfh2(period_list = c(1, 3))

See the plots below:

comp12_obj$plot

comp23_obj$plot

comp13_obj$plot

See the data created:

comp12_obj$df %>%
  head() %>%
  kable()
|     diff|       mse| alpha|       zq|         lb|        ub| significant     | index|
|--------:|---------:|-----:|--------:|----------:|---------:|:----------------|-----:|
|  0.01331| 0.0002437|  0.05| 1.959964| -0.0172891| 0.0439091| Not Significant |     1|
|  0.00545| 0.0000628|  0.05| 1.959964| -0.0100798| 0.0209798| Not Significant |     2|
| -0.00109| 0.0000187|  0.05| 1.959964| -0.0095636| 0.0073836| Not Significant |     3|
|  0.00000| 0.0000000|  0.05| 1.959964| -0.0000408| 0.0000408| Not Significant |     4|
|  0.00313| 0.0000353|  0.05| 1.959964| -0.0085208| 0.0147808| Not Significant |     5|
|  0.00235| 0.0000563|  0.05| 1.959964| -0.0123526| 0.0170526| Not Significant |     6|
comp23_obj$df %>%
  head() %>%
  kable()
|     diff|       mse| alpha|       zq|        lb|        ub| significant | index|
|--------:|---------:|-----:|--------:|---------:|---------:|:------------|-----:|
|  0.07862| 0.0013890|  0.05| 1.959964| 0.0055744| 0.1516656| Significant |     1|
|  0.10423| 0.0004644|  0.05| 1.959964| 0.0619940| 0.1464660| Significant |     2|
|  0.12974| 0.0003893|  0.05| 1.959964| 0.0910680| 0.1684120| Significant |     3|
|  0.09968| 0.0010776|  0.05| 1.959964| 0.0353400| 0.1640200| Significant |     4|
|  0.11912| 0.0008807|  0.05| 1.959964| 0.0609557| 0.1772843| Significant |     5|
|  0.10658| 0.0013460|  0.05| 1.959964| 0.0346738| 0.1784862| Significant |     6|
comp13_obj$df %>%
  head() %>%
  kable()
|     diff|       mse| alpha|       zq|        lb|        ub| significant | index|
|--------:|---------:|-----:|--------:|---------:|---------:|:------------|-----:|
|  0.09193| 0.0014206|  0.05| 1.959964| 0.0180584| 0.1658016| Significant |     1|
|  0.10968| 0.0005839|  0.05| 1.959964| 0.0623211| 0.1570389| Significant |     2|
|  0.12865| 0.0004065|  0.05| 1.959964| 0.0891316| 0.1681684| Significant |     3|
|  0.09968| 0.0010775|  0.05| 1.959964| 0.0353433| 0.1640167| Significant |     4|
|  0.12225| 0.0008663|  0.05| 1.959964| 0.0645627| 0.1799373| Significant |     5|
|  0.10893| 0.0012002|  0.05| 1.959964| 0.0410304| 0.1768296| Significant |     6|