Economic activity estimates in Mexico City Metro Area

Objective

The goal at this step is to estimate the number of jobs and its type (by income decile) at a lowest possible geographic level (block) to aggregate them later into different spatial scales, e.g. post code, TAZ, uniform grid, etc. This information represents the ‘attractiveness’ of each zone in accessibility measures (e.g. cumulative, gravity, etc.).

Sources

For this purpose, the main sources of information are the following:

Economic Census 2014

The information provided by the Census contains general figures, such as:

Number of economic units (in most of the cases this refers to a fix establishment that offers a service or produces goods. To a lesser extent, it also includes non-profit organizations.);
Occupied personnel (not necessarily formal employees but includes voluntaries, family members working in business, etc.);
Total salary paid; among others.

The information provided by the Census is collected at the individual address level. However, this information is aggregated at different levels due to confidentiality. In the metropolitan area, about 90% of the occupied personnel is referenced at the block level, the rest is aggregated at larger units namely census tract, locality or municipality.

National Directory of Economic Units

The Denue contains disaggregated information at the point level for all economic units. The information included for each unit essentially contains: (1) location (X-Y coordinates); (2) number of occupied personnel by category (e.g. 0-5, 10-30, … and ‘more than 251’); and (3) industry, e.g. construction, manufacturing, etc., following the North American Classification System, NAICS).

Cartographic data

This includes the geographic definition for the different aggregation areas, i.e. municipality, locality, census tract (ageb), and block. It should be noted that it distinguishes urban from rural geometries. When a locality is larger than 2.5K inhabitants it is considered as urban and it includes the geometries at the block level (this is the case for the vast majority in of areas in Mexico City Metro Area). Rural localities are represented by points and do not contain block geometries. For this reason, I buffered a square shape around each of these points and created one spatial data frame including urban and rural geometries since the Census collects information from both types of areas.

Method

As mentioned in the past revision, there are various possible approaches to estimate the missing figures at the block level, e.g. OLS, iterative proportional fitting, geographic centroid, etc.

I opted to use an approach that was transparent and relatively straightforward. In a nutshell, it first allocates as much data as possible from the Census referenced at the block level. Then, it allocates the aggregated figures to each of the blocks by proportionally adjusting these numbers following the Denue structure. The general steps are:

Summarize point data from the Denue at the block level;
Join the Census and Denue data to the block geometries;
Identify the blocks that that did not match the Census but had a match with the Denue. These are assumed to be the locations for the aggregated data and are the ones that will be estimated.
Proportionally assign the aggregated figures from the Census to blocks based on the number and size reported in the Denue.
Re-aggregate the information and estimates from the block level to various other geographic units, i.e. uniform grids and post code areas using a proportional overlay function.

The approach is detailed in the subsequent sections.

Organize and merge data at block level (Stepts 1 to 3)

The following chunk first organizes the point data level from the Denue by block. Then, it joins the Denue and Census data to the block geometries.

#  Identify the block that contains the Denue points

    # Check Denue points that match block's id geometries
    # otherwise, spatially join nearest block
    denue_join <- denue_2015_sf %>% 
      filter(!cvegeo %in% blocks_ur_rur$cvegeo) %>% 
      st_join(blocks_ur_rur[ , "cvegeo"], join = st_nearest_feature) %>%
      dplyr:: select(id, cvegeo.y)
    
    # remove geometry
    denue_join <- st_set_geometry(denue_join, NULL)
    
    # create new column with joined blocks
    denue_2015 <- denue_2015_sf %>% 
    st_set_geometry(NULL) %>% 
    left_join(denue_join, by = "id", "id") %>% 
    mutate(cvegeo.z = ifelse(is.na(cvegeo.y), cvegeo, cvegeo.y))

# Summarize DENUE in blocks by size of economic units

    # Aggregate number of economic units by block and size
    denue_size_block <- denue_2015 %>%
      group_by(cvegeo.z, descripcion_estrato_personal) %>% # Group by block and division 
      dplyr::summarise(un_ec = n()) %>% 
      pivot_wider(names_from = descripcion_estrato_personal,
                  values_from = un_ec,
                  names_prefix = "un_ec_") %>% 
      mutate_at(vars(2: ncol(.)), replace_na, 0) %>% # replace NAs with 0s
      dplyr::rename_all( funs(gsub(" ", "_", .))) %>% # replace space by _ in names
      dplyr::rename(cvegeo = cvegeo.z) %>% 
      mutate(un_ec_tot = rowSums(.[2:ncol(.)]))  # create a column with total
    
    # Percentage of economic units by size per block
    denue_size_block_pct <- denue_size_block %>% 
      mutate_at(vars(un_ec_0_a_5_personas:un_ec_101_a_250_personas),  
                funs(round( . / un_ec_tot, 6))) %>% # compute proportion
      mutate(un_ec_tot = rowSums(.[2:8])) %>% # sum of pct
      dplyr::rename_at(vars(un_ec_0_a_5_personas:un_ec_tot), funs(paste0(., "_pct")))
    
# Join Census and Denue to block geometries

    # join Census data only in urban geometries
        
        blocks_data <- blocks_ur_rur %>% 
        st_set_geometry(NULL) %>% 
        filter(geografico != "lrur") %>% # omit rural blocks
        dplyr::select(-fechaact) %>%
        left_join(census_by_block[ , c(1, 11:12, 18, 39, 42)], # census data
                  by = "cvegeo", "cvegeo") 
        
    # join rural blocks
      
      # create census summary of rural localities at ageb level
      census_by_loc_rural_loc <- 
        census_by_block[!census_by_block$cvegeo %in% 
                          blocks_data$cvegeo & 
                            !is.na(census_by_block$cvegeo), 
                        c(3, 11:12, 18, 39, 42)] %>%
        group_by(cve_loc) %>% 
        dplyr::summarise(unidades_economicas = sum(unidades_economicas),
                         personal_ocupado_total = sum(personal_ocupado_total),
                         personal_remunerado_total = sum(personal_remunerado_total),
                         total_remuneraciones = sum(total_remuneraciones),
                         produccion_bruta_total = sum(produccion_bruta_total))
      
      blocks_data_loc <- blocks_ur_rur %>% 
        st_set_geometry(NULL) %>% 
        filter(geografico == "lrur") %>% 
        dplyr::select(-fechaact) %>%
        left_join(census_by_loc_rural_loc, 
                  by = "cve_loc", "cve_loc")  # census data
      
    # Bind the rural and urban geometries in one data set
    blocks_data_all <- bind_rows(blocks_data, blocks_data_loc)
        
    # join denue data
    blocks_data_all <- blocks_data_all %>% 
      left_join(denue_size_block, by = "cvegeo", "cvegeo")
    
    # remove block geometries that do not contain economic units
    blocks_data_all <- blocks_data_all %>%
      # keep if at least one of these contain data
      filter(!is.na(unidades_economicas) | !is.na(un_ec_tot)) 
    
    # Rename by adding prefix of data source
    blocks_data_all <- blocks_data_all %>% 
    dplyr::rename_at(vars(unidades_economicas:produccion_bruta_total), 
                   ~ paste0("cens_", .)) %>% 
    dplyr::rename_at(vars(un_ec_0_a_5_personas:un_ec_tot),
                   ~ paste0("den_", .))
        
    # Clean Global Env.
    gc(reset = T)

The general figures allocated at the block level so far look as follows:

# Total number of economic units in Census
sum(census_by_block$unidades_economicas)

## [1] 817973

# Versus number of economic units allocated in blocks...That is 88%
sum(blocks_data_all$cens_unidades_economicas, na.rm = TRUE)

## [1] 723601

# Total number of economic units in Census
sum(census_by_block$personal_ocupado_total)

## [1] 5083414

# Total number of occupied personal allocated in blocks...That is 92%
sum(blocks_data_all$cens_personal_ocupado_total, na.rm = TRUE)

## [1] 4667532

# Number of economic units in Denue
nrow(denue_2015)

## [1] 905192

#Number of economic units from Denue allocated in blocks
sum(blocks_data_all$den_un_ec_tot, na.rm = TRUE)

## [1] 905192

Proportional assignment (Step 4)

As mentioned above, the aggregated data from the Census is grouped at various levels. This is an overview:

# Overview aggregated data for occupied personnel in Census

  # All oc. pe. that is aggregated
  filter(census_by_block, grepl("C$", manzana)) %>% 
  {sum(.$personal_ocupado_total)}

## [1] 415882

  # Aggregated at municipality level
  filter(census_by_block, grepl("CCC$", manzana)) %>% 
    {sum(.$personal_ocupado_total)}

## [1] 125286

  # Aggregated at locality level
  filter(census_by_block, 
         grepl("CC$", manzana) & 
           !grepl("C$", loc) & !is.na(loc)) %>%
    {sum(.$personal_ocupado_total)}

## [1] 5776

  # aggregated at ageb level
  filter(census_by_block, grepl("C$", manzana) 
         & !grepl("C$", ageb)) %>% 
    {sum(.$personal_ocupado_total)}

## [1] 284820

As shown, about 70% of the aggregated data is referenced at the census tract level, 1% at the locality, and 30% at the municipal.

Next, I created three different datasets for the aggregated data in the Census at their respective level.

## Create datasets with aggregated data at different levels

    # Data aggregated at municipal level
    census_mun_agg <- census_by_block %>% 
    filter(grepl("CCC$", manzana)) %>%
    dplyr::select(cve_ent_mun, unidades_economicas, personal_ocupado_total, total_remuneraciones, produccion_bruta_total)
    sum(census_mun_agg$personal_ocupado_total)

## [1] 125286

    # Data aggregated at locality level
    census_loc_agg <- census_by_block %>% 
    filter(grepl("CC$", manzana) & !grepl("C$", loc) & !is.na(loc)) %>%
    mutate(cve_loc = paste0(entidad, mun, loc)) %>% 
    dplyr::select(cve_loc, unidades_economicas, personal_ocupado_total, total_remuneraciones, produccion_bruta_total)
    sum(census_loc_agg$personal_ocupado_total)

## [1] 5776

    # Data aggregated at ageb level
    census_ageb_agg <- census_by_block %>% 
    filter(grepl("C$", manzana) & !grepl("C$", ageb)) %>%
    mutate(cve_ageb = paste0(entidad, mun, loc, ageb)) %>% 
    dplyr::select(cve_ageb, unidades_economicas, personal_ocupado_total, total_remuneraciones, produccion_bruta_total)
    sum(census_ageb_agg$personal_ocupado_total)

## [1] 284820

    # Validate subsets
    sum(census_by_block$personal_ocupado_total) ==
    filter(census_by_block, !grepl("C$", manzana)) %>% 
      {sum(.$personal_ocupado_total)} + # disaggregated
    sum(census_mun_agg$personal_ocupado_total) + # agg. at mun level
    sum(census_loc_agg$personal_ocupado_total) + # agg. at loc level
    sum(census_ageb_agg$personal_ocupado_total) # agg. at ageb level

## [1] TRUE

Identify level of aggregation of each block

To assign the aggregated data it is necessary to identify the level of aggregation of the ‘empty’ geometries. This can be at: (1) the block level, (2) census tract, (3) locality, or (4) municipal.

To do this, it is inferred that the blocks that are not known at the X level, are aggregated at the next geographic hierarchy, e.g. if it is not known at the block nor the census tract, it is aggregated at the locality, and so on.

# Identify known data by geographic level

    # Localities known (this is known only at locality level)
    known_loc <- census_by_block %>% 
      filter(!grepl("C$", loc) & grepl("C$", ageb) & !is.na(loc)) %>% 
      mutate(cve_loc = paste0(entidad, mun, loc)) %>%
      distinct(cve_loc) %>% pull()
    
    # Agebs known (this is known at ageb level)
    known_ageb <- census_by_block %>% 
      filter(!grepl("C$", ageb) & grepl("C$", manzana)) %>% 
      mutate(cve_ageb = paste0(entidad, mun, loc, ageb)) %>%
      distinct(cve_ageb) %>% pull()
    
    # blocks known (this is known at block level)
    known_block <-  census_by_block %>% 
      filter(!grepl("C$", manzana)) %>%
      distinct(cvegeo) %>% pull()
  
# Identify unknown data by level
      
    # Create column indicating level of aggregation
    blocks_data_all <- blocks_data_all %>% 
    mutate(agg_level = case_when(
    # The data is not known at the locality level, ageb or block level. 
    # Therefore, it is agg. at municipal level
    !cve_loc %in% known_loc & !cve_ageb %in% known_ageb & !cvegeo %in% known_block ~ "mun",
    # The data is known at locality level only, and not at the ageb or block level. 
    # Therefore, it is agg. at the locality level.
    cve_loc %in% known_loc & !cve_ageb %in% known_ageb & !cvegeo %in% known_block ~ "loc",
    # The data is known at the ageb level only, and not at the block level. 
    # Therefore, it is agg. at the ageb level
    cve_ageb %in% known_ageb & !cvegeo %in% known_block ~ "ageb",
    # The data is known at the block level. 
    # Therefore, it is agg. at the block level.
    cvegeo %in% known_block ~ "block"
    ))
    
    # The aggregation levels of data are:
    table(blocks_data_all$agg_level)

## 
##  ageb block   loc   mun 
## 38397 63271   817  8066

Function to distribute the figures of the census aggregated at various levels in their respective blocks

In essence the function:

Adjusts the number of economic units summarized from the Denue to the Census aggregates for each zone \(_{j}\) by size category \(_{s}\) \(Adj \;ec \;un_{si} = \frac{ Denue \;ec \;un_{si}} { \sum\ Denue \;ec \;unit_{i}} * \sum\ \;Cens \;ec \;un_{i}\);
Assigns a weigh to the number of economic units by their respective size category;
Distributes the aggregated figures according to the number of economic units and their weight \(Est \;oc \;pe = \frac{Adj \:ec \:un _{si} \;* \;Weight_{si}} {\sum\ Weight_{si}} * \sum\ \;Cens \;oc \;pe_{i}\);
Readjusts when the average size of the unit is larger than their respective size category. To do this, the excess of occupied personnel are assigned to the units in the same area that are “larger than 251”;
To estimate salaries paid by block, it distributes the aggregated sum by the number of occupied personnel estimated for each block. This assumes that the occupied personnel in the same geographic area are paid equally.

# Load proportional assignment function
        
prop_assignment <- function(agg_data, level, agg_unit, by_agg_unit) {

agg_unit <- enquo(agg_unit)

# Join aggregated data at the respective level
  blocks_agg_data <- blocks_data_all %>% 
    filter(agg_level == level & is.na(cens_unidades_economicas)) %>%
    left_join(agg_data, 
              by = setNames(by_agg_unit, by_agg_unit))

# Adjust number of economic units in each block according to the census
  blocks_agg_data <- blocks_agg_data %>%
    # Add sum of ec. un. by aggregation unit
    left_join(group_by(blocks_agg_data, !! agg_unit) %>%
                dplyr::summarise(sum_ec_un = sum(den_un_ec_tot)),
              by = setNames(by_agg_unit, by_agg_unit)) %>%
    # Arrange data by size group of ec. un. in each block
    pivot_longer(cols = den_un_ec_0_a_5_personas: den_un_ec_101_a_250_personas,
                 values_to = "den_un_ec", names_to = "size") %>%
    # Re-scale number of ec. un. according to denue
    mutate(adj_ec_un = (den_un_ec / sum_ec_un) * unidades_economicas)

# Assign occupied personnel based on adjusted ec. un. and weight 
  blocks_agg_data <- blocks_agg_data %>%
      mutate(size = sub("den_un_ec_", "", size)) %>%
      mutate(theo_oc = case_when(
        size == "0_a_5_personas"  ~ adj_ec_un * 2.5,
        size == "6_a_10_personas"  ~ adj_ec_un * 8,
        size == "11_a_30_personas"  ~ adj_ec_un * 20.5,
        size == "31_a_50_personas"  ~ adj_ec_un * 40.5,
        size == "51_a_100_personas"  ~ adj_ec_un * 75.5,
        size == "101_a_250_personas"  ~ adj_ec_un * 175.5,
        size == "251_y_más_personas"  ~ adj_ec_un * 375.5,
        TRUE ~ 0))

# Estimate oc. pe. based on Denue and adjusted to Census data
  blocks_agg_est <- blocks_agg_data %>%
    # Add sum of theoretical occupied personnel (weights) by aggregation unit
    left_join(group_by(blocks_agg_data, !! agg_unit) %>%
                dplyr::summarise(sum_theo_oc = sum(theo_oc)),
              by = setNames(by_agg_unit, by_agg_unit)) %>%
    # Remove rows that do not have any information from the economic census
    filter(!is.na(personal_ocupado_total)) %>%
    # Re-scale occupied personnel based on census
    mutate(est_pe_oc = (theo_oc / sum_theo_oc) *  personal_ocupado_total,
    # Compute average size of economic units
    avg_est_size = case_when(
       adj_ec_un < 1 ~ est_pe_oc,
       est_pe_oc > 0 ~ est_pe_oc / adj_ec_un,
       TRUE ~ 0)
      ) %>%
    
# Adjust when large economic units bias the smaller units,
# causing over estimating the total occupied personnel
    
      # define limit of economic units size
      mutate(limit_unit_size = case_when(
        size == "0_a_5_personas"  ~ 5,
        size == "6_a_10_personas"  ~ 10,
        size == "11_a_30_personas"  ~ 30,
        size == "31_a_50_personas"  ~ 50,
        size == "51_a_100_personas"  ~ 100,
        size == "101_a_250_personas"  ~ 250,
        size == "251_y_más_personas"  ~ Inf,
      )) %>%
      #Add the number of economic units larger than 251 within the geographic unit
      left_join(filter(., size == "251_y_más_personas") %>%
                  group_by(., !! agg_unit) %>%
                  dplyr::summarise(n_ec_un_larger_251 = sum(adj_ec_un)),
                by = setNames(by_agg_unit, by_agg_unit)) %>%
    # Compute estimated pe oc conditioned to size only if
    # the aggregation unit contains large economic units
    mutate(est_pe_oc_adjusted = ifelse(avg_est_size > limit_unit_size & n_ec_un_larger_251 > 0,
                                       theo_oc, est_pe_oc),
           # Indicate whether there were changes in the area
           adjusted_by_size = est_pe_oc != est_pe_oc_adjusted) %>%
    # If the sum of the estimated pe oc conditioned by size is smaller than the agg. number of oc. pe. in census
    # Compute the difference, and divide it by the number of ec. units larger than 251
        # Include the sum of the new estimate of oc. pe.
        left_join(filter(., size != "251_y_más_personas") %>% # Deducting large units
                    group_by(., !! agg_unit) %>%
                      dplyr::summarise(sum_est_pe_oc_adjusted = sum(est_pe_oc_adjusted, na.rm = T),
                                       sum_adjusted_by_size = sum(adjusted_by_size)),
                  by = setNames(by_agg_unit, by_agg_unit)) %>%
        # Compute the diff. between census and sum of the estimates if
        # there were changes in the area due to constrain in size
        mutate(diff = personal_ocupado_total - sum_est_pe_oc_adjusted) %>%
        # Divide the difference by the n. of ec un larger than 251
        # And assign it to units larger than 251 if...
        mutate(pe_oc_adjusted_2 = case_when(
                                      adj_ec_un > 0 &
                                      sum_adjusted_by_size > 0 &
                                      diff > 0 &
                                      size == "251_y_más_personas"  ~
                                    (diff / n_ec_un_larger_251) * adj_ec_un,
                                  TRUE ~ est_pe_oc_adjusted
                            )) %>%
    # Add total sum by municipality (for checks)
    left_join(group_by(., !! agg_unit) %>% dplyr::summarise(sum_pe_oc_adjusted_2 = sum(pe_oc_adjusted_2))) %>%
      mutate(valid_census_vs_est = personal_ocupado_total == round(sum_pe_oc_adjusted_2, 0)) %>%
    # Distribute paid salaries according to n. of oc. pe. (assuming all oc. pe. in the area have same salary)
    mutate(est_remuner = (total_remuneraciones / personal_ocupado_total) * pe_oc_adjusted_2)

    # Re-structure data by block
    blocks_agg_est <- blocks_agg_est %>%
    group_by(cvegeo) %>%
    dplyr::summarise(ec_un = sum(adj_ec_un, na.rm = T),
                     oc_pe = sum(pe_oc_adjusted_2, na.rm = T),
                     remun_tot = sum(est_remuner, na.rm = T)
                     ) %>%
    # Create aggregation codes
    mutate(cve_ent_mun = str_sub(cvegeo, 1, 5),
           cve_loc = str_sub(cvegeo, 1, 9),
           cve_ageb = str_sub(cvegeo, 1, 13)) %>%
    # Organize variables
    dplyr::select(cvegeo, cve_ent_mun:cve_ageb, everything())

# Result
return(blocks_agg_est)
}

Later, the function is applied for each of the datasets for the different levels of aggregation.

# Apply function

    est_pe_oc_mun <- prop_assignment(census_mun_agg, "mun", cve_ent_mun, "cve_ent_mun")
    est_pe_oc_loc <- prop_assignment(census_loc_agg, "loc", cve_loc, "cve_loc")
    est_pe_oc_ageb <- prop_assignment(census_ageb_agg, "ageb", cve_ageb, "cve_ageb")

# Three agebs not assigned because these contained data at the block level already.
# These are actually rural localities. Thus, the values can be added directly.
    
    not_joined_agebs <- census_ageb_agg[ , c(1, 3, 4)] %>% 
      left_join(
        group_by(est_pe_oc_ageb, cve_ageb) %>% 
          dplyr::summarise(oc_pe = sum(oc_pe, na.rm = T),
                           remun_tot = sum(remun_tot, na.rm = T))) %>% 
      mutate(diff = round(oc_pe - personal_ocupado_total, 2),
             remun_diff = round(total_remuneraciones - remun_tot , 2)) %>% 
      bind_rows(summarise_all(., funs(if(is.numeric(.)) sum(., na.rm = T) else "Total"))) %>%
      filter(is.na(oc_pe)) %>% pull(cve_ageb)

    census_ageb_agg <- as.data.frame(census_ageb_agg)
    
    # Add the aggregated values to the rural locality
    for( x in not_joined_agebs) {
      # Economic units
      blocks_data_all[blocks_data_all$cve_ageb == x, "cens_unidades_economicas"] <- 
        blocks_data_all[blocks_data_all$cve_ageb == x, "cens_unidades_economicas"] +
        census_ageb_agg[census_ageb_agg$cve_ageb == x, "unidades_economicas"]
      # Occupied personnel
      blocks_data_all[blocks_data_all$cve_ageb == x, "cens_personal_ocupado_total"] <- 
        blocks_data_all[blocks_data_all$cve_ageb == x, "cens_personal_ocupado_total"] +
        census_ageb_agg[census_ageb_agg$cve_ageb == x, "personal_ocupado_total"]
      # Paid salaries
      blocks_data_all[blocks_data_all$cve_ageb == x, "cens_total_remuneraciones"] <- 
        blocks_data_all[blocks_data_all$cve_ageb == x, "cens_total_remuneraciones"] +
        census_ageb_agg[census_ageb_agg$cve_ageb == x, "total_remuneraciones"]
    }

# Bind all estimates in one data frame
    est_pe_oc_all <- bind_rows(est_pe_oc_mun, est_pe_oc_loc, est_pe_oc_ageb) %>% 
    dplyr::select(-cve_ent_mun:- cve_ageb)

# Join estimates to geometries containing the Census figures
    blocks_data_all <- blocks_data_all %>% 
      left_join(est_pe_oc_all, by = "cvegeo", "cvegeo") %>% 
      mutate(est_ec_un = coalesce(cens_unidades_economicas, ec_un),
             est_pe_oc = coalesce(cens_personal_ocupado_total, oc_pe),
             est_remun_tot = coalesce(cens_total_remuneraciones, remun_tot))

Finally, the estimates are validated by summing the values at the block level and comparing them with the aggregates reported in the Census. Also, I estimated the average income for each block by computing the ratio of total paid salary and the total occupied personnel. This estimate includes all occupied personnel (not only remunerated personnel).

## Validate estimates

# Check results

    # Global sums
    
    # Total economic units
    sum(blocks_data_all$est_ec_un, na.rm = T)

## [1] 817973

    # vs
    sum(census_by_block$unidades_economicas)

## [1] 817973

    # Total occupied personnel
    sum(blocks_data_all$est_pe_oc, na.rm = T)

## [1] 5083414

    # vs
    sum(census_by_block$personal_ocupado_total)

## [1] 5083414

    # paid salaries
    sum(blocks_data_all$est_remun_tot, na.rm = T)

## [1] 449577497

    # vs
    sum(census_by_block$total_remuneraciones)

## [1] 449577497

# All data is now allocated at block level!!
  
# Compute average income by block 
    
    # This estimate includes all occupied personnel (not only remunerated personnel)
    # I consider this is more appropriate for our purposes
    blocks_data_all <- blocks_data_all %>% 
      mutate(avg_income = case_when(
        est_remun_tot == 0 ~ 0,
        est_pe_oc == 0 ~ NA_real_,
        TRUE ~ est_remun_tot / est_pe_oc
      ))

# Round estimate decimals to 6
blocks_data_all <- blocks_data_all %>% 
  mutate_at(vars(est_ec_un:avg_income), round, 6)

Data to grids (Step 5)

Once the data from the Census is allocated in block geometries, these figures are re-aggregated at different geographic unit scales, namely hexagonal grid of 500 m, 1 km, 2 km, 4 km and post code areas. To do so, I employed a proportional overlay function. This function assigns integer values by the proportion of the overlapping areas. For instance, if one block overlaps 50% of its area in the geometries, that same proportion is allocated in the receiver area. This assumes that the data contained in the blocks is uniform.

Since the uniform grids cover 100% of the territory, all the data is captured. By contrast, the post codes do not cover all the territory. Therefore, some official values are left out.

Load proportional overlay function.

# Load proportional overlay function
apportion <- function(xx, yy, 
    # Smallest total overlap of each xx object from which it would transfer data in sq metres,
    low_lim_sqm2 = 60,
    # Smallest total overlap of xx object from which it would transfer data in percentage
    low_lim_percent = 0.05, 
    # Smallest total overlap of xx object that would be rescaled to 1, 
    # to transfer all data from xx to yy  
    low_2rescale = 0.70) {

# Transform to metric CRS

    # Set CRS
    myCRSutm <- "+proj=utm +zone=14 +datum=WGS84 +units=m +no_defs"
    myCRSlonglat <- "+proj=longlat +zone=14 +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
    
    print("Transforming CRS...")
    xx <- st_transform(xx, myCRSutm)
    yy <- st_transform(yy, myCRSutm)

# Operations to transfer data from xx (giver area) to yy (receiver area)

    # Calculate xx areas
    xx$xx_area_m2 <- st_area(xx) %>% as.numeric()
    
    print("Processing spatial intersection...")
    # Intersect xx and yy (takes time)
    intersection <- st_intersection(xx, yy)
    # Calculate area of xx in yy
    intersection$inters_area_m2 <- st_area(intersection) %>% as.numeric()
    # Calculate the overlap percentage of each intersection over the original (complete) area of xx
    intersection$per_overlpa <- intersection$inters_area_m2 / intersection$xx_area_m2 %>% as.numeric()
    # Transform intersection to non-spatial object
    intersection_nonspatial <- st_set_geometry(intersection, NULL)
    print("Processing after intersection...")
    
    # Generate summaries of the intersection by xx object
    intersection_sums_by_xx <- intersection_nonspatial %>% 
    dplyr::group_by(cvegeo) %>%
    dplyr::summarise(sum_overlap_m2 = sum(inters_area_m2), sum_per_overlap = sum(per_overlpa))

# First condition. Exclude complete xx gemetries from the analysis if the sum of the xx intersection on yy is less than <low_lim_sqm2> AND if the sum of the intersection is lower than <low_lim_percent> 

    # Create key to identify xx areas to be excluded
    xx_exclude <- intersection_sums_by_xx %>%
    mutate(cond_1 = if_else(sum_per_overlap < low_lim_percent & 
                              sum_overlap_m2 < low_lim_sqm2, TRUE, FALSE)) %>%
    filter(cond_1 == TRUE) %>% pull(cvegeo)
    
    # Create subset excluding intersections under the first condition
    intersection_nonspatial <- subset(intersection_nonspatial, 
                                      !(intersection_nonspatial$cvegeo %in% xx_exclude))

# Second condition. If the total percentage of overlap is more than low_2rescale, then, re-scale to 1. This is to transfer the total values of XX.
# If it is not, transfer the data according to the simple overlap proportion 

  # Calculate the rescaling factor for objects under second condition
  # To obtain the total overlapping area of xx objects in each intersection
  intersection_nonspatial <- left_join(intersection_nonspatial, intersection_sums_by_xx) %>%
  # Calculate the rescaling proportion if xx objects are larger than xx_2rescale_key: 
  # each intersection / sum of the xx overlap
  mutate(assign_fac = if_else(sum_per_overlap > low_2rescale, per_overlpa / sum_per_overlap, per_overlpa)) 
 
  # Transfer data according to the assignment factor
  data_in_yy <- intersection_nonspatial %>%
  mutate_at(vars(est_ec_un, est_pe_oc, est_remun_tot), 
            .funs =  list( ~ . * intersection_nonspatial$assign_fac))
  
  data_in_yy <- data_in_yy %>%
  dplyr:: group_by(id) %>%
  dplyr::summarise_at(vars(est_ec_un, est_pe_oc, est_remun_tot), sum, na.rm = T)
  
  # Estimate average income level weighted by number of oc pe
  avg_income <- intersection_nonspatial %>%
  mutate(pe_oc_est = est_pe_oc * assign_fac) %>% # To estimate the pe oc in each intersection
  dplyr:: group_by(id) %>%
  # Median of average income weighted by n of pe oc
  dplyr:: summarise(avg_income = median(rep(avg_income, times = pe_oc_est), na.rm = T)) 

# Results

    # Join average income
    data_in_yy <- left_join(data_in_yy, avg_income)
    
    # Round decimals
    data_in_yy <- mutate_at(data_in_yy, vars(est_pe_oc:avg_income), round, 6)
    
    # Join the data to the spatial objects in yy
    data_in_yy_spat <- left_join(yy, data_in_yy)
    
    # Calculate yy areas
    data_in_yy_spat$area_m2 <- st_area(data_in_yy_spat) %>% as.numeric()
    # Estimate oc pe density
    data_in_yy_spat <- mutate(data_in_yy_spat, 
                            oc_pe_den_sq_km = ifelse(!is.na(est_pe_oc),
                                             round(est_pe_oc /  (area_m2/1000000), 4),
                                             NA_real_
                                             )) %>% 
    dplyr::select(-area_m2)
    
    print("Transforming back to  WGS")
    # Transform back to WGS
    data_in_yy_spat <- st_transform(data_in_yy_spat, myCRSlonglat)
    
    return(data_in_yy_spat)
}

Apply function to aggregate the data from blocks to grids.

grids_data <- lapply(names(grids), 
                     function(x) apportion(blocks_census_data, grids[[x]]))
names(grids_data) <- names(grids)

Validate results

As shown next, the sum of the figures in the receiver areas are the same as the source for the integer values. For average estimates, the medium value is slightly higher than the source.

# Source
apply(as.data.frame(blocks_census_data)[ ,c(4:6)], 2, sum, na.rm = T)

##      est_ec_un      est_pe_oc  est_remun_tot 
##    817973.0006   5083414.0001 449577497.0002

# Estimates
sapply(1:length(grids_data), function(x) 
  apply(as.data.frame(grids_data[[x]])[ ,c(2:4)], 2, sum, na.rm = T))

##                         [,1]           [,2]           [,3]           [,4]
## est_ec_un        817973.0006    817973.0006    817973.0006    817972.4184
## est_pe_oc       5083414.0001   5083414.0001   5083414.0001   5083413.1705
## est_remun_tot 449577497.0002 449577497.0002 449577497.0002 449577496.4582
##                         [,5]
## est_ec_un        815083.0566
## est_pe_oc       5072677.9594
## est_remun_tot 449113057.9936

# Median of the average income
# Source
median(rep(blocks_census_data$avg_income, times = blocks_census_data$est_pe_oc), na.rm = T)

## [1] 46.981481

# Estimates
sapply(1:length(grids_data), function(x){
  grid_data <- filter(grids_data[[x]], !is.na(est_pe_oc))
  median(rep(grid_data$avg_income, times = grid_data$est_pe_oc), na.rm = T)})

## [1] 49.454558 49.516191 49.292275 48.015491 51.312910

Visualization of estimates

The following interactive map is an example of one of the grids (2 km) which shows the income by decile.

The following map grid shows the occupied personnel density for each of the different aggregation areas.

The last map grid shows the estimated income decile for each of the various aggregation units.

Final thoughts

Limitations

Mismatch in the data collection period. We do not have Denue data for the same year of the Census (2014 vs. 2015). The Census was collected in early 2014, whilst the Denue was collected about 1 year later. Therefore we could have allocated data in areas where at the time of the data collection for the Census did not have any activity. Conversely, I could have omitted data that had activity at the time of the Census but stopped at the time of the data collection for the Denue. However, this is expected to produce a low biases, since there was not a considerable shock that could have substantially altered the economic dynamics from one area to another;
The methodology to collect the Census data usually accounts the number of occupied personal at the level of the establishment for the vast majority of economic units, that is where the people work. However, for certain economic activities (e.g. construction, energy production, etc.) where personnel do not have a fix working location, the figures are appended to the geographic location of the headquarters. Therefore, this could overestimate the attractiveness of a location. Unfortunately, we do not have access to the industry type at this level of aggregation and we cannot systematically exclude those potential cases;
Income estimates fall in the Ecological Fallacy, which assumes that salaries are homogeneous in each of the aggregation units;
The transfer of data from blocks to other geometries is proportional to the overlapping area and assumes uniform distribution of activities across the whole block. Given the relatively small size of the large majority of blocks that contain data, this approach is still appropriate for our purposes. In the case of average values, a weighted median value is computed as a function of the total occupied personnel.

Strenghts

The Economic Census Data is aggregated at a very low level (block), and this type of location accounts for more than 90% of the occupied personnel. Therefore, transfer to other geometries can be done more confidently;
The rest of the data (5%) is known at the census tract. Therefore, the spatial error can be expected to be low;
The Denue data is disaggregated at the point level and it contains information about the size of each economic unit, which allows to produce finer estimates.