Using Satellite Imagery and Computer Vision to Estimate the Income Levels of Mexican Households

Andrea Vallebueno

October 2019

Executive Summary

This project uses satellite imagery of the Mexican territory, together with computer vision and convolutional neural networks, to estimate the income levels of Mexican households. In addition to its immediate benefit of providing an efficient way to understand the geographical distribution of income levels, the success of this project could lead to applications in measuring urban life and quantifying the impact of public policy over time. This first project allows us to explore the data available for Mexico and to evaluate how accurately such a model can predict economic data.

Literature

The use of satellite imagery and street view images to predict economic variables such as income or physical characteristics of an urban area has been explored by several authors.

Glaeser et al. (2015) describe how new urban data sources can be used to improve the study of cities. Specifically, they use a corpus of Google Street View images to measure the physical characteristics of a neighborhood and to estimate income levels in New York City blocks, emphasizing the power of this tool to map wealth and poverty. Moreover, the authors underscore the use of these models to predict income and map poverty in developing countries, where there is a surplus of image data and a lack of reliable information on income and other variables. To perform this analysis, the authors queried the Google Street View Image API by specifying locations (using latitude and longitude) and camera viewpoints. Latitude and longitude values were randomly selected from a uniform grid, and 40 images were sampled per square mile to obtain 12,200 images of New York City. The authors also used images from Boston as a test set to evaluate the performance of their model, given that it could be prone to overfitting. For their income data, Glaeser et al. used information on median family income, which is reported at the census block level in the American Community Survey. For the actual prediction, the authors used ν-Support Vector Regression on extracted features (color information histograms, GIST feature descriptors and textons). Their computer vision model proves to be a good fit, with an $R^2$ of 0.85, and generalizes well to images of other cities.

Naik et al. (2016) explore the use of computer vision to quantify the physical aspects of urban areas using street-level imagery. Additionally, the authors use this methodology to develop an algorithm that predicts the perceived safety or 'Streetscore' of locations across cities in the United States. Using street imagery coupled with perceived safety scores, the authors merge features extracted from three methods (textons, color histograms and GIST features) and train a ν-Support Vector Regression with a linear kernel. Their model achieves an $R^2$ of 0.57. Their image corpus consists of one million Google Street View images from 19 cities. Furthermore, they use socioeconomic characteristics of census tracts sourced from the American Community Survey and relate urban appearance to demographic variables such as population, median income and Gini index. They find a strong positive correlation between urban appearance and median income and population density, and between income inequality and the variance of neighborhood appearance.

Naik et al. (2017) dive deeper into using urban appearance to predict economic and demographic data by measuring changes in the physical characteristics of neighborhoods over time and using these changes to evaluate neighborhood improvement. Using streetscapes from five cities in the United States sourced from the Google API, captured in 2007 and 2014, and 'Streetscores' for each image, they perform semantic segmentation of the images, create feature vectors using GIST features and texton maps, and use features of streets and buildings to predict the Streetscore with a Support Vector Regression. With this information, they calculate the change in Streetscore between the two years to obtain information on the change in physical appearance. The authors find a positive correlation between the physical improvement of neighborhoods and the population of college-educated adults, the neighborhood's initial appearance, and the neighborhood's proximity to the central business district.

For Mexico, Babenko et al. (2017) use Convolutional Neural Networks and satellite images from Planet and Digital Globe to estimate poverty. Poverty data stems from Mexico's 2015 Intercensal Survey and from its 2014 National Survey of Household Income and Expenditure (ENIGH). After testing several neural network architectures, the authors use a GoogleNet CNN as a final model, which explains approximately 57% of the variation in poverty in the validation set.

Data Overview and Discussion

Household data

Mexico's National Institute of Statistics and Geography (INEGI) conducts several censuses and surveys that measure economic and demographic variables of Mexican households and individuals. Given the confidential nature of the data, this microdata can only be fully accessed via INEGI's remote processing services or in person at INEGI's laboratories. However, full access to these services is restricted to graduate students, researchers affiliated with recognized institutions such as universities, and government officials. As a result, we will use a sample of the 2015 Intercensal Survey data that has been kindly provided by the Integrated Public Use Microdata Series (IPUMS), an institution which provides census and survey data from various countries. We also wish to acknowledge the statistical office that provided the underlying data making this research possible: the National Institute of Statistics, Geography, and Informatics, Mexico.

Let us begin by loading the necessary libraries and taking a closer look at our data.

In [92]:
import requests
import os
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
%matplotlib inline
In [94]:
households = pd.read_csv(r'datasets/Econ_data/Censo_poblacion_vivienda/IPUMS/data.csv')
In [96]:
households.head()
Out[96]:
COUNTRY YEAR SAMPLE SERIAL HHWT URBAN GEOLEV1 GEOLEV2 GEO2_MX GEO2_MX1995 ... EDUCMX EMPSTAT EMPSTATD INDGEN INCTOT INCEARN INCWEL INCRET HLTHFAC HLTHCOV
0 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 223 1.0 110.0 80.0 1714.0 1714.0 0.0 0.0 NaN NaN
1 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 106 1.0 110.0 60.0 1286.0 1286.0 0.0 0.0 NaN NaN
2 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 321 3.0 330.0 0.0 9999999.0 0.0 0.0 0.0 NaN NaN
3 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 222 3.0 330.0 0.0 9999999.0 0.0 0.0 0.0 NaN NaN
4 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 105 3.0 330.0 0.0 9999999.0 0.0 0.0 0.0 NaN NaN

5 rows × 59 columns

In [108]:
households.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21960976 entries, 0 to 21960975
Data columns (total 59 columns):
COUNTRY        int64
YEAR           int64
SAMPLE         int64
SERIAL         float64
HHWT           int64
URBAN          int64
GEOLEV1        int64
GEOLEV2        int64
GEO2_MX        int64
GEO2_MX1995    float64
GEO2_MX2005    float64
GEO2_MX2015    float64
SIZEMX         int64
REMITT         float64
ELECTRIC       int64
WATSUP         int64
SEWAGE         int64
FUELCOOK       float64
PHONE          float64
CELL           float64
INTERNET       float64
TRASH          float64
AUTOS          float64
HOTWATER       float64
AIRCON         float64
COMPUTER       float64
WASHER         float64
REFRIG         float64
TV             float64
RADIO          float64
ROOMS          int64
BEDROOMS       int64
KITCHEN        float64
TOILET         int64
BATH           float64
FLOOR          int64
WALL           float64
ROOF           float64
HHTYPE         float64
PERNUM         int64
PERWT          int64
INDIG          float64
SPEAKIND       float64
LANGMX         float64
SCHOOL         int64
LIT            int64
EDATTAIN       int64
EDATTAIND      int64
YRSCHOOL       int64
EDUCMX         int64
EMPSTAT        float64
EMPSTATD       float64
INDGEN         float64
INCTOT         float64
INCEARN        float64
INCWEL         float64
INCRET         float64
HLTHFAC        float64
HLTHCOV        float64
dtypes: float64(35), int64(24)
memory usage: 9.7 GB

The dataset includes 21,960,976 data points and 59 variables covering the 1995, 2005 and 2015 intercensal surveys. As per IPUMS' variable availability key, variables include:

  • COUNTRY: Country from which the sample was drawn
  • YEAR: Year in which the census or survey was taken
  • SAMPLE: Identifies the IPUMS sample from which the case is drawn.
  • SERIAL: An identifying number unique to each household in a given sample
  • HHWT: Indicates the number of households in the population represented by the household in the sample
  • URBAN: Indicates whether the household was located in a place designated as urban or as rural
  • GEOLEV1: Indicates the major administrative unit in which the household was enumerated
  • GEOLEV2: Indicates the second major administrative unit in which the household was enumerated
  • GEO2_MX: Identifies the household's province within Mexico in all sample years
  • GEO2_MX1995: Identifies the household's municipality within Mexico in 1995
  • GEO2_MX2005: Identifies the household's municipality within Mexico in 2005
  • GEO2_MX2015: Identifies the household's municipality within Mexico in 2015
  • SIZEMX: Identifies the population of the locality within Mexico in all sample years
  • REMITT: Indicates whether the household received remittances from persons living elsewhere
  • ELECTRIC: Indicates whether the household had access to electricity
  • WATSUP: Describes the physical means by which the housing unit receives its water
  • SEWAGE: Indicates whether the household has access to a sewage system or septic tank
  • FUELCOOK: Indicates the predominant type of fuel or energy used for cooking
  • PHONE: Indicates the availability of a telephone in the dwelling
  • CELL: Indicates the availability of a cellular phone in the household
  • INTERNET: Indicates whether or not the household had an internet connection
  • TRASH: Indicates whether the household's waste or garbage is collected by a sanitation service
  • AUTOS: Records whether a member of the household owned or had use of a vehicle and, in many samples, the number
  • HOTWATER: Indicates whether the housing unit had a water heater
  • AIRCON: Indicates whether the household had air conditioning
  • COMPUTER: Indicates whether the household had a personal computer
  • WASHER: Indicates whether the household had a clothes washing machine
  • REFRIG: Indicates whether the household had a refrigerator
  • TV: Indicates whether the household had a television
  • RADIO: Indicates whether the household had a radio
  • ROOMS: Indicates the number of rooms occupied by the housing unit
  • BEDROOMS: Indicates the number of rooms available to members of the household for sleeping
  • KITCHEN: Indicates whether the household had a kitchen, cooking facilities, or room dedicated to food preparation
  • TOILET: Indicates whether the household had access to a toilet and, in most cases, whether it was a flush toilet
  • BATH: Indicates whether the household had access to bathing facilities and, in most cases, whether it had exclusive access
  • FLOOR: Indicates the dwelling's predominant flooring material
  • WALL: Indicates the primary material used in the construction of the dwelling
  • ROOF: Indicates the dwelling's predominant roofing material
  • HHTYPE: Describes the composition of households
  • PERNUM: Numbers all persons within each household consecutively
  • PERWT: Indicates the number of persons in the actual population represented by the person in the sample
  • INDIG: Indicates whether the person belonged to an indigenous group
  • SPEAKIND: Indicates whether a person speaks an indigenous language
  • LANGMX: Indicates the indigenous language, if any, spoken by the person
  • SCHOOL: Indicates whether or not the person attended school at the time of the census or within some specified period
  • LIT: Indicates whether or not the respondent could read and write in any language
  • EDATTAIN: Records the person's educational attainment in terms of the level of schooling completed
  • EDATTAIND: Records the person's educational attainment in terms of the level of schooling completed
  • YRSCHOOL: Indicates the highest grade/level of schooling the person had completed, in years
  • EDUCMX: Indicates the person's educational attainment in Mexico in terms of the highest level and year or grade completed
  • EMPSTAT: Indicates whether or not the respondent was part of the labor force -- working or seeking work
  • EMPSTATD: Indicates whether or not the respondent was part of the labor force -- working or seeking work
  • INDGEN: Recodes the industrial classifications of the various samples into twelve groups
  • INCTOT: Reports the person's total personal income from all sources in the previous month or year
  • INCEARN: Reports the person's total income from their labor (from wages, a business, or a farm) in the previous month or year
  • INCWEL: Reports the monthly or annual income the respondent received from anti-poverty or welfare programs
  • INCRET: Reports the amount of income the respondent received from a retirement program or account, or from a pension
  • HLTHFAC: Identifies the type of hospital, clinic, or healthcare facility the person normally attended
  • HLTHCOV: Reports whether the person had medical coverage (or a right to medical services) from a public or private institution

The most significant variables will be those related to the geographic location of the household and those that describe the family's income. It is important to note, however, that for future research the remaining variables related to education, dwelling characteristics and access to health services could prove very useful in developing a wellbeing index for each household, which we could pair with the satellite images in order to predict welfare.

Let us conduct some exploratory data analysis on the rows related to the 2015 Intercensal Survey.

In [110]:
households15 = households.copy()
households15 = households15[households15['YEAR'] == 2015]
households15.describe()
Out[110]:
COUNTRY YEAR SAMPLE SERIAL HHWT URBAN GEOLEV1 GEOLEV2 GEO2_MX GEO2_MX1995 ... EDUCMX EMPSTAT EMPSTATD INDGEN INCTOT INCEARN INCWEL INCRET HLTHFAC HLTHCOV
count 11344365.0 11344365.0 11344365.0 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 0.0 ... 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 0.0 1.134436e+07 0.0 0.0 1.134436e+07 1.134436e+07
mean 484.0 2015.0 484201501.0 1.449551e+09 1.053932e+01 1.550229e+00 4.840177e+05 4.840178e+08 4.840178e+08 NaN ... 2.380601e+02 1.642034e+00 1.799075e+02 2.296046e+01 NaN 7.065413e+07 NaN NaN 4.410009e+00 4.276989e+01
std 0.0 0.0 0.0 8.441825e+08 1.615997e+01 4.974707e-01 7.946213e+00 7.963143e+03 7.963143e+03 NaN ... 2.446028e+02 1.332595e+00 1.471947e+02 7.578232e+01 NaN 4.553311e+07 NaN NaN 1.858148e+00 2.441384e+01
min 484.0 2015.0 484201501.0 1.000000e+03 2.000000e+00 1.000000e+00 4.840010e+05 4.840010e+08 4.840010e+08 NaN ... 1.000000e+01 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.000000e+00 NaN NaN 1.000000e+00 1.000000e+01
25% 484.0 2015.0 484201501.0 7.171520e+08 4.000000e+00 1.000000e+00 4.840120e+05 4.840120e+08 4.840120e+08 NaN ... 1.030000e+02 1.000000e+00 1.100000e+02 0.000000e+00 NaN 7.714000e+03 NaN NaN 5.000000e+00 4.000000e+01
50% 484.0 2015.0 484201501.0 1.439303e+09 6.000000e+00 2.000000e+00 4.840160e+05 4.840161e+08 4.840161e+08 NaN ... 1.060000e+02 1.000000e+00 1.100000e+02 0.000000e+00 NaN 1.000000e+08 NaN NaN 5.000000e+00 4.000000e+01
75% 484.0 2015.0 484201501.0 2.171946e+09 1.000000e+01 2.000000e+00 4.840240e+05 4.840240e+08 4.840240e+08 NaN ... 3.210000e+02 3.000000e+00 3.100000e+02 1.000000e+01 NaN 1.000000e+08 NaN NaN 5.000000e+00 4.000000e+01
max 484.0 2015.0 484201501.0 2.927196e+09 9.340000e+02 2.000000e+00 4.840320e+05 4.840321e+08 4.840321e+08 NaN ... 9.990000e+02 9.000000e+00 9.990000e+02 9.990000e+02 NaN 1.000000e+08 NaN NaN 9.000000e+00 9.900000e+01

8 rows × 59 columns

In [112]:
print('Null values per column in the Households 2015 dataset')
households15.isnull().sum()
Null values per column in the Households 2015 dataset
Out[112]:
COUNTRY               0
YEAR                  0
SAMPLE                0
SERIAL                0
HHWT                  0
URBAN                 0
GEOLEV1               0
GEOLEV2               0
GEO2_MX               0
GEO2_MX1995    11344365
GEO2_MX2005    11344365
GEO2_MX2015           0
SIZEMX                0
REMITT                0
ELECTRIC              0
WATSUP                0
SEWAGE                0
FUELCOOK              0
PHONE                 0
CELL                  0
INTERNET              0
TRASH                 0
AUTOS                 0
HOTWATER              0
AIRCON                0
COMPUTER              0
WASHER                0
REFRIG                0
TV                    0
RADIO                 0
ROOMS                 0
BEDROOMS              0
KITCHEN               0
TOILET                0
BATH                  0
FLOOR                 0
WALL                  0
ROOF                  0
HHTYPE                0
PERNUM                0
PERWT                 0
INDIG                 0
SPEAKIND              0
LANGMX                0
SCHOOL                0
LIT                   0
EDATTAIN              0
EDATTAIND             0
YRSCHOOL              0
EDUCMX                0
EMPSTAT               0
EMPSTATD              0
INDGEN                0
INCTOT         11344365
INCEARN               0
INCWEL         11344365
INCRET         11344365
HLTHFAC               0
HLTHCOV               0
dtype: int64

From the above, we can observe that the 2015 dataset includes 11,344,365 observations. Moreover, we note that there are null values for the geographic location variables of the years 1995 and 2005, which we will drop. We also note that the variables INCTOT, INCWEL and INCRET are not part of the 2015 survey and will drop those columns as well. Although the rest of the variables do not appear to have missing values, we need to be careful, as the IPUMS variable description key reports the particular codes that denote a null value for each variable. Null values can also stem from the fact that an individual may not be in the variable's 'universe'; for instance, the INCEARN variable describing income earned is only reported for individuals older than 12 years of age. Let us then drop the null values for this variable, which is of the utmost importance for our analysis.

In [132]:
households15_earners = households15.copy()
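# IPUMS encodes missing values for INCEARN with sentinel codes: 99999998 (unknown)
# and 99999999 (not in universe, e.g. persons under 12), so we filter both out.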
households15_earners = households15_earners[(households15_earners['INCEARN'] != 99999998)  & (households15_earners['INCEARN'] != 99999999)]
In [133]:
households15_earners.shape
Out[133]:
(3329264, 59)

We see that by dropping null values for the INCEARN variable, our dataset has been reduced to 3,329,264 entries. As per Babenko et al. (2017), the fact that we only have access to labor income rather than total income is not a severe issue, as the two are strongly linearly correlated. Therefore, INCEARN will be our target variable for this project. We note that this is monthly income in pesos.

Images

The images we will be using to train our convolutional neural network will be sourced from the Google Maps API. The API allows you to access an image of a certain location identified by its name, address or latitude and longitude. Moreover, it allows you to specify certain map parameters such as size, scale, format, maptype and region. Below is an example, capturing images of historic places in Mexico City.

Note: The following code to download the images from Google's API has been adapted from Adrian Rosebrock's Deep Learning for Computer Vision with Python book. Please review the Sources section below for more details.

In [9]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'

# initialize the URL list
places = ['Palacio+de+Bellas+Artes+Avenida+Juarez+Historic+Center', 'Zocalo+Historic+Center+of+Mexico+City', 
           'Museo+Nacional+de+Antropologia+Mexico']
URLS = []

for p in places:
    url = 'https://maps.googleapis.com/maps/api/staticmap?center={}&zoom=18&size=227x227&maptype=satellite&key=MYKEY'.format(p)
    URLS.append(url)

for (i, url) in enumerate(URLS):
    # skip this image if the request fails, rather than writing stale content
    try:
        r = requests.get(url, timeout = 100)
    except requests.exceptions.RequestException:
        print('[INFO] error downloading image...')
        continue
    
    p = os.path.sep.join([output_images, '{}.png'.format(str(i).zfill(2))])
    f = open(p,'wb')
    f.write(r.content)
    f.close()
    
    title = places[i].replace('+', ' ')
    print(title)
    image = plt.imread(p)
    plt.figure()
    plt.imshow(image)
    plt.show()
    
    time.sleep(0.1)
Palacio de Bellas Artes Avenida Juarez Historic Center
Zocalo Historic Center of Mexico City
Museo Nacional de Antropologia Mexico

Pairing the datasets and understanding the INEGI's geostatistical maps

The next step is to pair the household dataset with our satellite imagery; that is, to determine precisely how we will query Google's Maps Static API to match households with the correct image. Let us begin by taking a look at the geographic variables that are included in the household dataset:

  • URBAN: Indicates whether the household was located in a place designated as urban or as rural
  • GEOLEV1: Indicates the major administrative unit in which the household was enumerated
  • GEOLEV2: Indicates the second major administrative unit in which the household was enumerated
  • GEO2_MX: Identifies the household's province within Mexico in all sample years

To better understand these variables, we will look into the INEGI's Geostatistical Cartography Manual, which details the levels at which it divides the Mexican territory:

  • Level 1: State geostatistical areas (AGEE)
  • Level 2: Municipal geostatistical areas (AGEM)
  • Level 3: Basic geostatistical areas (AGEB): these can be either rural or urban
  • Level 4: Locality
  • Level 5: Block
In [10]:
print("Overview of the INEGI's geostatistical areas")
plt.figure(figsize = (6, 18))
inegi_areas = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\INEGI_manual_areas.png')
plt.imshow(inegi_areas)
plt.show()
print("Image sourced from the INEGI's 'Manual de cartografia estadística'")
Overview of the INEGI's geostatistical areas
Image sourced from the INEGI's 'Manual de cartografia estadística'

From IPUMS' variable description key, we can see that the variables GEOLEV1 and GEOLEV2 are harmonized variables that identify the state and second-level administrative unit to which the household belongs. GEO2_MX matches the AGEM geostatistical level, indicating each household's municipality. Thus, we see that the dataset does not provide a granular enough location to match each household to an individual image.

We have two options to deal with this problem. First, we can calculate average income levels per province and use a much higher-level satellite image of each province to train our network. This strategy is similar to that pursued by Babenko et al. (2017), who trained their model on the fraction of households living in poverty per administrative unit.

A second strategy is to use sampled data from the 2010 Census provided on the INEGI's webpage (see Sources below), which contains geographic information for each household at the block level; however, there are two important caveats:

  • This dataset (at least the public version) does not provide information on income levels, only demographic data and data on the household's dwelling. A strategy could be to construct the previously mentioned welfare index using dwelling information such as access to electricity, type of roof and floor, and other indicators of welfare (a sketch of such an index follows this list)
  • Even though location is disaggregated to the block level, we can only match latitudes and longitudes to the locality level (level 4)
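
To make the welfare-index idea concrete, below is a minimal sketch of such an asset-based index, assuming it is built from the IPUMS dwelling variables already loaded above. The code-to-meaning mappings in has_asset_code are placeholders only and would need to be checked against the IPUMS codebook before use.

# Hypothetical sketch of a simple asset-based welfare index. The value that
# means "has the asset" differs by variable, so the codes below are
# placeholders to be verified against the IPUMS codebook.
has_asset_code = {'ELECTRIC': 1, 'REFRIG': 2, 'WASHER': 2,
                  'COMPUTER': 2, 'INTERNET': 2, 'HOTWATER': 2}

def welfare_index(df, codes=has_asset_code):
    # one 0/1 indicator per asset, then an equal-weight average in [0, 1]
    indicators = pd.DataFrame({col: (df[col] == code).astype(int)
                               for col, code in codes.items()})
    return indicators.mean(axis=1)

# e.g. households15['welfare'] = welfare_index(households15)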

Strategy 1: Aggregating household data at the municipal level

We will pursue the first strategy. Our tasks include aggregating the data, determining the number of observation points available to train our convolutional neural network, matching our municipality data to a latitude and longitude, and identifying the exact query specifications to download the satellite imagery. We can then begin training our neural network.

Preparing the household dataset

In [134]:
final_households = households15_earners.copy()
final_households = final_households[['GEO2_MX', 'INCEARN']]
final_households.head()
Out[134]:
GEO2_MX INCEARN
10616611 484001001 3857.0
10616615 484001001 4286.0
10616617 484001001 4286.0
10616619 484001001 1714.0
10616621 484001001 4286.0

We will now aggregate the data above by the location variable, GEO2_MX.

In [135]:
grouped_households = final_households.groupby('GEO2_MX').agg(np.mean)
In [136]:
grouped_households.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2321 entries, 484001001 to 484032053
Data columns (total 1 columns):
INCEARN    2321 non-null float64
dtypes: float64(1)
memory usage: 36.3 KB
In [137]:
grouped_households.describe()
Out[137]:
INCEARN
count 2321.000000
mean 3802.515556
std 1797.218417
min 49.715686
25% 2597.702624
50% 3697.135258
75% 4803.338739
max 20124.487083
In [138]:
grouped_households.hist()
plt.title('Histogram of Average Monthly Income Earned per Municipality')
plt.show()

Having aggregated the data, we note that our number of observation points has been reduced to 2,321. This has significant consequences for our neural network, given that it will be prone to overfitting. From the above plot we can observe the distribution of this monthly average income per municipality, and note that it is largely concentrated between 2,500 and 5,000 pesos per month. This roughly translates to 125-250 USD per month (at an exchange rate of about 20 pesos per USD).

Our next step involves mapping the municipality data to a latitude and longitude. To complete this task, we will use the INEGI's Unique Catalogue of State, Municipal, and Locality Geostatistical Area Codes, which can be found here.

In [139]:
data_types = {'mapa': int, 'Cve_Ent': int, 'Nom_Ent': object, 'Nom_Abr': object, 'Cve_Mun': int, 
              'Nom_Mun': object, 'Cve_Loc': int, 'Nom_Loc': object, 'Latitud': object, 'Longitud': object, 
             'Lat_Decimal': float, 'Lon_Decimal': float, 'Altitud': object, 'Cve_Carta': object, 
             'Pob_total' : int, 'Pob_Masculina': int}
loc_catalogue = pd.read_csv('datasets/Location_catalogue/AGEEML_201910301717378.csv', dtype = data_types)
In [140]:
loc_catalogue.head()
Out[140]:
Mapa Cve_Ent Nom_Ent Nom_Abr Cve_Mun Nom_Mun Cve_Loc Nom_Loc Ámbito Latitud Longitud Lat_Decimal Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina Pob_Femenina Total De Viviendas Habitadas
0 10010001 1 Aguascalientes Ags. 1 Aguascalientes 1 Aguascalientes U 21°52´47.362N" 102°17´45.768W" 21.879823 -102.296047 1878 F13D19 722250 348722 373528 185120
1 10010094 1 Aguascalientes Ags. 1 Aguascalientes 94 Granja Adelita R 21°52´18.749N" 102°22´24.710W" 21.871875 -102.373531 1902 F13D18 14 0 0 2
2 10010096 1 Aguascalientes Ags. 1 Aguascalientes 96 Agua Azul R 21°53´01.522N" 102°21´25.639W" 21.883756 -102.357122 1861 F13D18 37 21 16 11
3 10010100 1 Aguascalientes Ags. 1 Aguascalientes 100 Rancho Alegre R 21°51´16.556N" 102°22´21.884W" 21.854599 -102.372746 1879 F13D18 10 0 0 1
4 10010102 1 Aguascalientes Ags. 1 Aguascalientes 102 Los Arbolitos [Rancho] R 21°46´48.650N" 102°21´26.261W" 21.780181 -102.357295 1861 F13D18 7 0 0 1

Now, we observe that the location catalogue above provides data on the first four geostatistical levels. If we take a closer look at the variable Mapa, we can see that this is an 8- or 9-digit key. The first one or two digits encode the state (1 to 32), the next three the municipality, and the last four the locality. We also notice that for each municipality, the row with locality code '0001' represents the municipality itself rather than a locality within it. Thus, we need to match on the state and municipality digits for the rows ending in '01', as these carry the overall data on the municipality, not the localities within it.
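
To make this layout concrete, below is a small, purely illustrative parse of the Mapa key; the parse_mapa helper is hypothetical and not part of the pipeline.

def parse_mapa(mapa):
    # state = 1 or 2 leading digits, municipality = next 3, locality = last 4
    s = str(mapa)
    return s[:-7], s[-7:-4], s[-4:]

print(parse_mapa(10010001))   # ('1', '001', '0001'): Aguascalientes, municipal seat
print(parse_mapa(100010094))  # ('10', '001', '0094'): a 9-digit key for a two-digit state

Let us now recall the information we had in our households database: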

In [141]:
grouped_households.head()
Out[141]:
INCEARN
GEO2_MX
484001001 6460.614828
484001002 4266.121608
484001003 3960.400450
484001004 4244.469072
484001005 6412.689766

We notice that the GEO2_MX variable consists of Mexico's numeric country code '484' followed by six digits representing the state and municipality, exactly as above. Therefore, we will now extract the relevant location codes from each dataset and merge them using this key.

In [142]:
grouped_households.reset_index(inplace = True)
grouped_households['location'] = grouped_households['GEO2_MX'].astype(str).str.split('484').str[1]
In [143]:
loc_catalogue_final = loc_catalogue.copy()
loc_catalogue_final = loc_catalogue_final[['Mapa', 'Lat_Decimal', 'Lon_Decimal']]
bool_loc = loc_catalogue_final['Mapa'].astype(str).str[-2:] == '01'
loc_catalogue_final = loc_catalogue_final[bool_loc]
In [144]:
loc_catalogue_final['length'] = loc_catalogue_final['Mapa'].astype(str).str.len()
loc_catalogue_final['state'] = loc_catalogue_final['Mapa'].astype(str).str[0]
loc_catalogue_final.loc[loc_catalogue_final['length'] == 9, 'state'] = loc_catalogue_final['Mapa'].astype(str).str[:2]

loc_catalogue_final['municipality'] = loc_catalogue_final['Mapa'].astype(str).str[1:4]
loc_catalogue_final.loc[loc_catalogue_final['length'] == 9, 'municipality'] = loc_catalogue_final['Mapa'].astype(str).str[2:5]
loc_catalogue_final['location'] = '0' + loc_catalogue_final['state'].astype(str) + loc_catalogue_final['municipality'].astype(str)
In [145]:
grouped_households.drop('GEO2_MX', inplace = True, axis = 1)
grouped_households.head()
Out[145]:
INCEARN location
0 6460.614828 001001
1 4266.121608 001002
2 3960.400450 001003
3 4244.469072 001004
4 6412.689766 001005
In [146]:
grouped_households['location'].value_counts()
Out[146]:
030164    1
021156    1
014089    1
030133    1
030062    1
         ..
002002    1
030125    1
024050    1
010037    1
031028    1
Name: location, Length: 2321, dtype: int64
In [147]:
loc_catalogue_final = loc_catalogue_final.groupby('location').agg(np.mean)
loc_catalogue_final.reset_index(inplace = True)
loc_catalogue_final.head()
Out[147]:
location Mapa Lat_Decimal Lon_Decimal length
0 010001 1.000102e+08 24.507653 -104.742524 9.0
1 010002 1.000201e+08 25.121212 -106.590988 9.0
2 010003 1.000300e+08 24.979125 -104.768917 9.0
3 010004 1.000402e+08 24.784158 -103.770661 9.0
4 010005 1.000509e+08 24.079907 -104.572918 9.0

We observe that we now have a location variable with the same format in both datasets, and we are now ready to merge them.

In [148]:
locations = loc_catalogue_final.copy()
locations = locations[['location', 'Lat_Decimal', 'Lon_Decimal']]
locations['location'] = locations['location'].str.zfill(6)
locations.head()
Out[148]:
location Lat_Decimal Lon_Decimal
0 010001 24.507653 -104.742524
1 010002 25.121212 -106.590988
2 010003 24.979125 -104.768917
3 010004 24.784158 -103.770661
4 010005 24.079907 -104.572918
In [149]:
combined = grouped_households.merge(locations, how = 'left', left_on = 'location', right_on = 'location')
combined.head()
Out[149]:
INCEARN location Lat_Decimal Lon_Decimal
0 6460.614828 001001 21.836513 -102.286666
1 4266.121608 001002 22.156567 -102.039549
2 3960.400450 001003 21.870873 -102.726988
3 4244.469072 001004 22.366641 -102.302860
4 6412.689766 001005 21.962373 -102.338731
In [150]:
combined.to_csv(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\households\household_data_combined.csv')

We now have a dataset which will allow us to map average monthly income levels to satellite images.

Now, we have two options:

  1. We can train a classification model using a convolutional neural network that predicts an income bucket
  2. We can use a convolutional neural network as a feature extractor to train a support vector machine that predicts the income level

We will proceed with the second option and leave the first option for future research.

Selecting the right query for the Google Maps Static API

We shall now proceed to selecting the right query to download our images from the Google Maps Static API. As previously mentioned, map parameters and location include:

  • Size: image dimensions
  • Scale: controls the number of pixels for a fixed coverage area and level of detail
  • Format: image format
  • Maptype: type of map (roadmap, satellite, hybrid, terrain)
  • Region: controls the borders to display
  • Center: defines the center of the map if markers are not provided
  • Zoom: zoom level of the map

Considering our input data and our needs (satellite images), the parameters we need to define are the size, scale and zoom level. We will download 500x500 images, considering that any rescaling can be performed at the pre-processing stage of the convolutional neural network. Now, let us take a look at sample images to determine what best captures our region of interest.

We will now take a look at images generated for the municipality of Aguascalientes and adjust the parameters so that the image captures it adequately.

In [28]:
print("Municipality of Aguascalientes")
plt.figure(figsize = (15, 18))
aguascalientes_m = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality.png')
aguascalientes_ms = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality_s.png')
plt.subplot(121)
plt.imshow(aguascalientes_m)
plt.subplot(122)
plt.imshow(aguascalientes_ms)
plt.show()
print("Images sourced from Google Maps")
Municipality of Aguascalientes
Images sourced from Google Maps

Now, let us replicate this using Google's Maps Static API and the coordinates for this Municipality in our combined dataset. We will use a larger size for the moment in order to better gauge the adequacy of the image.

In [29]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'

aguas_lat = combined.loc[0, 'Lat_Decimal']
aguas_lon = combined.loc[0, 'Lon_Decimal']

parameters = {'zoom' : 12, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
url = base_text.format(aguas_lat, aguas_lon, parameters['zoom'], parameters['scale'], parameters['size'], 
                       parameters['format'], parameters['maptype'])

# stop here if the request fails, rather than writing stale content below
try:
    r = requests.get(url, timeout = 100)
except requests.exceptions.RequestException:
    print('[INFO] error downloading image...')
    raise

p = os.path.sep.join([output_images, 'aguascalientes.png'])
f = open(p,'wb')
f.write(r.content)
f.close()
    
title = 'Municipality of Aguascalientes'
print(title)
image = plt.imread(p)
plt.figure(figsize = (10, 18))
plt.imshow(image)
plt.show()
Municipality of Aguascalientes

Comparing this image to our satellite image above, we see that a zoom level of 12 is still too close, covering too small an area. We will adjust below.

In [30]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'

parameters = {'zoom' : 10, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
url = base_text.format(aguas_lat, aguas_lon, parameters['zoom'], parameters['scale'], parameters['size'], 
                       parameters['format'], parameters['maptype'])

# stop here if the request fails, rather than writing stale content below
try:
    r = requests.get(url, timeout = 100)
except requests.exceptions.RequestException:
    print('[INFO] error downloading image...')
    raise

p = os.path.sep.join([output_images, 'aguascalientes.jpg'])
f = open(p,'wb')
f.write(r.content)
f.close()
    
image = plt.imread(p)

aguascalientes_ms2 = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality_s2.png')
print()
plt.figure(figsize = (15, 7))
plt.subplot(121)
plt.imshow(image)
plt.title('Image sourced from API')
plt.subplot(122)
plt.imshow(aguascalientes_ms2)
plt.title('Manually crafted image')
plt.suptitle('Comparison of image from API and manually crafted image')
plt.show()
print("Images sourced from Google Maps")

Images sourced from Google Maps

This seems to be the best fit and so we will now use those parameters to download the images for the rest of the data.

In [151]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images'

parameters = {'zoom' : 10, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'

for (i, loc) in enumerate(combined['location']):
    bool_file = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images\{}.jpg'.format(loc)
    if not os.path.exists(bool_file):
        lat = combined.loc[i, 'Lat_Decimal']
        lon = combined.loc[i, 'Lon_Decimal']
        url = base_text.format(lat, lon, parameters['zoom'], parameters['scale'], parameters['size'], 
                           parameters['format'], parameters['maptype'])

        # skip this location if the request fails, rather than writing stale content
        try:
            r = requests.get(url, timeout = 100)
        except requests.exceptions.RequestException:
            print('[INFO] error downloading image...')
            continue

        p = os.path.sep.join([output_images, '{}.jpg'.format(str(loc))])
        f = open(p,'wb')
        f.write(r.content)
        f.close()
        if i % 100 == 0:
            print('Successfully downloaded image {}/{}'.format(i, len(combined['location'])))

        time.sleep(0.1)  

Training the Model

Now that we have our full dataset, we are ready to begin training our neural network. We note that for this section, we will be using tools and adapting code from the Pyimagesearch module developed by Adrian Rosebrock in his book, Deep Learning for Computer Vision with Python, which can be found here.

In [152]:
from keras.applications import VGG16
from keras.applications import imagenet_utils
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
from Model.pyimagesearch.io import HDF5DatasetWriter
from imutils import paths
import progressbar
import random
In [153]:
dataset = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images'
output = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\Model\output\hdf5_features4.hdf5'
bs = 32
buffer_size = 1000

Let us begin by accessing the images that we previously downloaded from Google's API and labeling them.

In [154]:
imagePaths = list(paths.list_images(dataset))
random.shuffle(imagePaths)

label_codes = [path.split(os.path.sep)[-1][:-4] for path in imagePaths]
In [155]:
combined['label'] = combined.index
In [156]:
labels = [combined.loc[combined['location'] == loc, 'label'] for loc in label_codes]
labels = [l.iloc[0] for l in labels]

We next instantiate a VGG16 convolutional neural network. By passing include_top=False, we do not forward propagate through the entire model but rather stop before the fully connected layers. This allows us to extract feature representations of each image instead of a set of classification probabilities: for 224x224 inputs, VGG16's final pooling layer outputs a 7x7x512 volume, which is why each image is flattened into a 512 * 7 * 7 = 25,088-dimensional feature vector below. We also instantiate an HDF5 dataset writer, which will store the extracted features from the network and pair them with each image's label.

In [157]:
if os.path.exists(output) == False:
    model = VGG16(weights="imagenet", include_top=False)
    dataset = HDF5DatasetWriter((len(imagePaths), 512 * 7 * 7), output, dataKey="features", bufSize=buffer_size)

Next, we initialize our progress bar and begin looping over the images in batches. Within each batch, we loop over each image and pre-process it, adjusting the size to 224x224. After pre-processing the batch, it gets forward propagated through the network. At this stage, we extract the features by flattening the model's predictions; these vectors are then added to our HDF5 dataset coupled with the images' labels.

In [158]:
if os.path.exists(output) == False:
    widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
        progressbar.Bar(), " ", progressbar.ETA()]
    pbar = progressbar.ProgressBar(maxval=len(imagePaths), widgets=widgets).start()

    for i in np.arange(0, len(imagePaths), bs):
        batchPaths = imagePaths[i:i + bs]
        batchLabels = labels[i:i + bs]
        batchImages = []

        for (j, imagePath) in enumerate(batchPaths):
            image = load_img(imagePath, target_size=(224, 224))
            image = img_to_array(image)

            image = np.expand_dims(image, axis=0)
            image = imagenet_utils.preprocess_input(image)

            batchImages.append(image)

        batchImages = np.vstack(batchImages)
        features = model.predict(batchImages, batch_size=bs)

        features = features.reshape((features.shape[0], 512 * 7 * 7))

        dataset.add(features, batchLabels)
        pbar.update(i)

    dataset.close()
    pbar.finish()

With our features at hand, we are now ready to train our support vector regression model.

In [159]:
import h5py
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

We prepare our data by loading the features from the HDF5 dataset and labeling each observation with the average monthly income earned from our combined dataset. We then perform our train-test split; note that test_size = 0.95 reserves 95% of the municipalities for testing, leaving only about 116 observations for training.

In [162]:
hdf5_data = h5py.File(output, "r")
features = hdf5_data['features'][()]
codes = hdf5_data['labels'][()]
labels = [combined.loc[combined['label'] == code, 'INCEARN'] for code in codes]
labels = [l.iloc[0] for l in labels]
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 42, test_size = 0.95)

As a next step, we instantiate our support vector regression model and use five-fold cross validation to select the best value of the regularization parameter C. In line with Naik et al. (2016), we fit a Support Vector Regression with a linear kernel.

In [163]:
parameters = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}

model = svm.SVR(gamma = 'scale', kernel = 'linear')
clf = GridSearchCV(model, parameters, cv = 5, iid = True)
clf.fit(X_train, y_train)
Out[163]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='scale', kernel='linear',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid=True, n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0,
                               10000.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [164]:
print('R-squared score (training data): {:.3f}'.format(r2_score(y_train, clf.predict(X_train,))))
hdf5_data.close()
R-squared score (training data): 0.489

We can observe that with the best parameter from five-fold cross validation, our model achieves an $R^2$ of 0.489 on the training data.
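
A natural follow-up, sketched below but not executed here, would be to score the held-out split as well; with so few training municipalities, a large gap between training and test $R^2$ would signal overfitting.

# suggested check (not run here): evaluate generalization on the held-out split
print('R-squared score (test data): {:.3f}'.format(r2_score(y_test, clf.predict(X_test))))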

Visual Exploration and Mapping Outcomes

As a final step, we will conduct some visual exploration. First, we will perform principal component analysis to visualize the relationship between our extracted features and average monthly income per municipality.

In [165]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_train_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components = 1)  # lowercase name avoids shadowing the PCA class
principal_c = pca.fit_transform(X_train_scaled)
principal_df = pd.DataFrame(data = principal_c, columns = ['X_pca'])
In [166]:
principal_df['y_pca'] = y_train
In [167]:
principal_df.plot.scatter(x = 'X_pca', y = 'y_pca')
plt.title('Scatter plot of Principal Component 1 and average monthly income earned per municipality')
plt.show()

Lastly, we will map our target variable to visualize its geographic distribution.

In [168]:
os.environ['PROJ_LIB'] = r'C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share'
from mpl_toolkits.basemap import Basemap
In [169]:
fig = plt.figure(figsize = (15, 6))
m = Basemap(projection = 'lcc', resolution=None, width=6E6, height=3E6, lat_0=25, lon_0=-100)
x, y = m(combined['Lon_Decimal'].tolist(), combined['Lat_Decimal'].tolist())
m.shadedrelief()
plt.scatter(x, y, c = combined['INCEARN'].tolist(), cmap = 'Reds', alpha  = 0.5, marker = 'o')
plt.title('Geographic Distribution of Monthly Average Income per Municipality')
plt.colorbar(label='Average Monthly Income per Municipality (pesos)')
plt.show()

As we can see above, average monthly income tends to be higher in the Northeastern and Central parts of Mexico, which are known for their high level of industrial and economic activity. The poorest parts are clearly visible in the figure and are largely in Southern Mexico (Oaxaca, Guerrero) and the Yucatán Peninsula.

Conclusions

In this project, we have fit a support vector regression model with a linear kernel to estimate the monthly average income (in pesos) of a municipality using a satellite image of its location. For this purpose, we have cleaned data from the Mexican 2015 Intercensal Survey (kindly provided by IPUMS) and matched it to satellite images by manipulating municipalities' geographic coordinates and income information. Due to the lack of granularity in the data, we performed these estimates at the municipality level. Extracting features from these images using a VGG16 CNN, our model achieved an $R^2$ of 0.489 on the training data.

Many strategies could be followed to try to improve the explanatory power of our model, which we will pursue in future research. These include sourcing the data directly from INEGI to acquire more granular information on locality location, and performing a classification instead of a regression by dividing income levels into buckets (a sketch of the latter follows).
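
As a rough illustration of the classification route, the sketch below buckets average municipal income into quartiles and fits a linear SVM classifier on the same VGG16 features; the bucket count and C value are illustrative choices, not tuned settings.

# Hypothetical sketch: quartile income buckets + linear SVM classifier
income_buckets = pd.qcut(pd.Series(labels), q=4, labels=False)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    features, income_buckets, random_state = 42, test_size = 0.95)

bucket_clf = svm.SVC(kernel = 'linear', C = 1.0)
bucket_clf.fit(Xb_train, yb_train)
print('Classification accuracy (training data): {:.3f}'.format(
    bucket_clf.score(Xb_train, yb_train)))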

Sources

Babenko, Boris, Jonathan Hersh, David Newhouse, Anusha Ramakrishnan, and Tom Swartz. "Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico." Presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) Workshop on Machine Learning for the Developing World (December 2017).

Dubey, Abhimanyu, Nikhil Naik, Devi Parikh, Ramesh Raskar and Cesar Hidalgo. "Deep Learning the City: Quantifying Urban Perception at a Global Scale." European Conference on Computer Vision (September 2016): 569-576.

Glaeser, Edward L., Scott Duke Kominers, Michael Luca, and Nikhil Naik. "Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life." Economic Inquiry 56, no. 1 (January 2018): 114–137. (Originally Harvard Business School Working Paper, No. 16-065, November 2015.)

Google Maps Platform. Maps Static API.

Instituto Nacional de Estadística y Geografía. "Catálogo Único de Claves de Áreas Geoestadísticas Estatales, Municipales y Localidades." Link. Accessed October 30, 2019.

Instituto Nacional de Estadística y Geografía. "Censo de Población y Vivienda 2010." Link. Accessed October 30, 2019.

Instituto Nacional de Estadística y Geografía. "Manual de cartografia estadística." Censo de Población y Vivienda (2010).

Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [Mexico Intercensal Survey 2015 - sample ID 484201501]. Minneapolis, MN: IPUMS, 2019. Link

Naik, Nikhil, Ramesh Raskar, and Cesar A. Hidalgo. "Cities are Physical Too: Using Computer Vision to Measure the Quality and Impact of Urban Appearance." American Economic Review, Vol. 106 (May 2016): 128-132.

Naik, Nikhil, Scott Duke Kominers, Ramesh Raskar, Edward L. Glaeser, and Cesar Hidalgo. "Computer vision uncovers predictors of physical urban change." PNAS, Vol. 114, no. 29 (July 2017): 7571-7576.

Rosebrock, Adrian. "Deep Learning for Computer Vision with Python." PyImageSearch. Link. Accessed September 29, 2019.