Using Satellite Imagery and Computer Vision to Estimate the Income Levels of Mexican Households

Andrea Vallebueno

October 2019

Executive Summary

This project uses satellite imagery of the Mexican territory, together with computer vision and convolutional neural networks, to estimate the income levels of Mexican households. In addition to its immediate benefit of providing an efficient way to understand the geographical distribution of income levels, the success of this project could lead to applications in measuring urban life and quantifying the impact of public policy over time. This first project allows us to explore the data available for Mexico and to evaluate how accurately such a model can predict economic data.

Literature

The use of satellite imagery and street view images to predict economic variables such as income or physical characteristics of an urban area has been explored by several authors.

Glaeser et al. (2015) describe how new urban data sources can be used to improve the study of cities. Specifically, they use a corpus of Google Street View images to measure the physical characteristics of a neighborhood and to estimate income levels in New York City blocks, emphasizing the power of this tool to map wealth and poverty. Moreover, the authors underscore the use of these models to predict income and map poverty in developing countries, where there is a surplus of image data and a lack of reliable information on income and other variables. To perform this analysis, the authors queried the Google Street View Image API by specifying locations (using latitude and longitude) and camera viewpoints. Latitude and longitude values were randomly selected from a uniform grid, and 40 images were sampled per square mile to obtain 12,200 images of New York City. The authors also used images from Boston as a test set to evaluate the performance of their model, given that it could be prone to overfitting. For their income data, Glaeser et al. used information on median family income, which is reported at the census block level in the American Community Survey. For the actual prediction, the authors used ν-Support Vector Regression on extracted features (color information histograms, GIST feature descriptors and textons). Their computer vision model proves to be a good fit, with an $R^2$ of 0.85, and generalizes well to images of other cities.

Naik et al. (2016) explore the use of computer vision to quantify the physical aspects of urban areas using street-level imagery. Additionally, the authors use this methodology to develop an algorithm that predicts the perceived safety or 'Streetscore' of locations across cities in the United States. Using street imagery coupled with perceived safety scores, the authors merge features extracted from three methods (textons, color histograms and GIST features) and train a ν-Support Vector Regression with a linear kernel. Their model achieves an $R^2$ of 0.57. Their image corpus consists of one million Google Street View images from 19 cities. Furthermore, they use socioeconomic characteristics of census tracts sourced from the American Community Survey and relate urban appearance to demographic variables such as population, median income and Gini index. They find a strong positive correlation between urban appearance and median income and population density, and between income inequality and the variance of neighborhood appearance.

Naik et al. (2017) dive deeper into using urban appearance to predict economic and demographic data by measuring changes in the physical characteristics of neighborhoods over time and using these changes to evaluate neighborhood improvement. Using streetscapes from five cities in the United States sourced from the Google API, captured in 2007 and 2014, and 'Streetscores' for each image, they perform semantic segmentation of the images, create feature vectors using GIST features and texton maps, and use features of streets and buildings to predict the Streetscore with a Support Vector Regression. With this information, they calculate the change in Streetscore between the two years to obtain information on the change in physical appearance. The authors find a positive correlation between the physical improvement of neighborhoods and the population of college-educated adults, the neighborhood's initial appearance, and the neighborhood's proximity to the central business district.

For Mexico, Babenko et al. (2017) use Convolutional Neural Networks and satellite images from Planet and Digital Globe to estimate poverty. Poverty data stems from Mexico's 2015 Intercensal Survey and from its 2014 National Survey of Household Income and Expenditure (ENIGH). After testing several neural network architectures, the authors use a GoogleNet CNN as a final model, which explains approximately 57% of the variation in poverty in the validation set.

Data Overview and Discussion

Household data

Mexico's National Institute of Statistics and Geography (INEGI) conducts several censuses and surveys that measure economic and demographic variables of Mexican households and individuals. Given the confidential nature of the data, this microdata can only be fully accessed via INEGI's remote processing services or in person at INEGI's laboratories. However, full access to these services is restricted to graduate students, researchers affiliated with recognized institutions such as universities, and government officials. As a result, we will use a sample of the 2015 Intercensal Survey data that has been kindly provided by the Integrated Public Use Microdata Series (IPUMS), an institution which provides census and survey data from various countries. We also wish to acknowledge the statistical office that provided the underlying data making this research possible: the National Institute of Statistics, Geography, and Informatics, Mexico.

Let us begin by loading the necessary libraries and taking a closer look at our data.

In [92]:
import requests
import os
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
%matplotlib inline
In [94]:
households = pd.read_csv(r'datasets/Econ_data/Censo_poblacion_vivienda/IPUMS/data.csv')
In [96]:
households.head()
Out[96]:
COUNTRY YEAR SAMPLE SERIAL HHWT URBAN GEOLEV1 GEOLEV2 GEO2_MX GEO2_MX1995 ... EDUCMX EMPSTAT EMPSTATD INDGEN INCTOT INCEARN INCWEL INCRET HLTHFAC HLTHCOV
0 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 223 1.0 110.0 80.0 1714.0 1714.0 0.0 0.0 NaN NaN
1 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 106 1.0 110.0 60.0 1286.0 1286.0 0.0 0.0 NaN NaN
2 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 321 3.0 330.0 0.0 9999999.0 0.0 0.0 0.0 NaN NaN
3 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 222 3.0 330.0 0.0 9999999.0 0.0 0.0 0.0 NaN NaN
4 484 1995 484199501 1001.0 103 2 484001 484001001 484001001 1001.0 ... 105 3.0 330.0 0.0 9999999.0 0.0 0.0 0.0 NaN NaN

5 rows × 59 columns

In [108]:
households.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21960976 entries, 0 to 21960975
Data columns (total 59 columns):
COUNTRY        int64
YEAR           int64
SAMPLE         int64
SERIAL         float64
HHWT           int64
URBAN          int64
GEOLEV1        int64
GEOLEV2        int64
GEO2_MX        int64
GEO2_MX1995    float64
GEO2_MX2005    float64
GEO2_MX2015    float64
SIZEMX         int64
REMITT         float64
ELECTRIC       int64
WATSUP         int64
SEWAGE         int64
FUELCOOK       float64
PHONE          float64
CELL           float64
INTERNET       float64
TRASH          float64
AUTOS          float64
HOTWATER       float64
AIRCON         float64
COMPUTER       float64
WASHER         float64
REFRIG         float64
TV             float64
RADIO          float64
ROOMS          int64
BEDROOMS       int64
KITCHEN        float64
TOILET         int64
BATH           float64
FLOOR          int64
WALL           float64
ROOF           float64
HHTYPE         float64
PERNUM         int64
PERWT          int64
INDIG          float64
SPEAKIND       float64
LANGMX         float64
SCHOOL         int64
LIT            int64
EDATTAIN       int64
EDATTAIND      int64
YRSCHOOL       int64
EDUCMX         int64
EMPSTAT        float64
EMPSTATD       float64
INDGEN         float64
INCTOT         float64
INCEARN        float64
INCWEL         float64
INCRET         float64
HLTHFAC        float64
HLTHCOV        float64
dtypes: float64(35), int64(24)
memory usage: 9.7 GB

The dataset includes 21,960,976 data points and 59 variables covering the 1995, 2005 and 2015 intercensal surveys. As per IPUMS' variable availability key, variables include:

  • COUNTRY: Country from which the sample was drawn
  • YEAR: Year in which the census or survey was taken
  • SAMPLE: Identifies the IPUMS sample from which the case is drawn.
  • SERIAL: An identifying number unique to each household in a given sample
  • HHWT: Indicates the number of households in the population represented by the household in the sample
  • URBAN: Indicates whether the household was located in a place designated as urban or as rural
  • GEOLEV1: Indicates the major administrative unit in which the household was enumerated
  • GEOLEV2: Indicates the second major administrative unit in which the household was enumerated
  • GEO2_MX: Identifies the household's province within Mexico in all sample years
  • GEO2_MX1995: Identifies the household's municipality within Mexico in 1995
  • GEO2_MX2005: Identifies the household's municipality within Mexico in 2005
  • GEO2_MX2015: Identifies the household's municipality within Mexico in 2015
  • SIZEMX: Identifies the population of the locality within Mexico in all sample years
  • REMITT: Indicates whether the household received remittances from persons living elsewhere
  • ELECTRIC: Indicates whether the household had access to electricity
  • WATSUP: Describes the physical means by which the housing unit receives its water
  • SEWAGE: Indicates whether the household has access to a sewage system or septic tank
  • FUELCOOK: Indicates the predominant type of fuel or energy used for cooking
  • PHONE: Indicates the availability of a telephone in the dwelling
  • CELL: Indicates the availability of a cellular phone in the household
  • INTERNET: Indicates whether or not the household had an internet connection
  • TRASH: Indicates whether the household's waste or garbage is collected by a sanitation service
  • AUTOS: Records whether a member of the household owned or had use of a vehicle and, in many samples, the number
  • HOTWATER: Indicates whether the housing unit had a water heater
  • AIRCON: Indicates whether the household had air conditioning
  • COMPUTER: Indicates whether the household had a personal computer
  • WASHER: Indicates whether the household had a clothes washing machine
  • REFRIG: Indicates whether the household had a refrigerator
  • TV: Indicates whether the household had a television
  • RADIO: Indicates whether the household had a radio
  • ROOMS: Indicates the number of rooms occupied by the housing unit
  • BEDROOMS: Indicates the number of rooms available to members of the household for sleeping
  • KITCHEN: Indicates whether the household had a kitchen, cooking facilities, or room dedicated to food preparation
  • TOILET: Indicates whether the household had access to a toilet and, in most cases, whether it was a flush toilet
  • BATH: Indicates whether the household had access to bathing facilities and, in most cases, whether it had exclusive access
  • FLOOR: Indicates the dwelling's predominant flooring material
  • WALL: Indicates the primary material used in the construction of the dwelling
  • ROOF: Indicates the dwelling's predominant roofing material
  • HHTYPE: Describes the composition of households
  • PERNUM: Numbers all persons within each household consecutively
  • PERWT: Indicates the number of persons in the actual population represented by the person in the sample
  • INDIG: Indicates whether the person belonged to an indigenous group
  • SPEAKIND: Indicates whether a person speaks an indigenous language
  • LANGMX: Indicates the indigenous language, if any, spoken by the person
  • SCHOOL: Indicates whether or not the person attended school at the time of the census or within some specified period
  • LIT: Indicates whether or not the respondent could read and write in any language
  • EDATTAIN: Records the person's educational attainment in terms of the level of schooling completed
  • EDATTAIND: Records the person's educational attainment in terms of the level of schooling completed
  • YRSCHOOL: Indicates the highest grade/level of schooling the person had completed, in years
  • EDUCMX: Indicates the person's educational attainment in Mexico in terms of the highest level and year or grade completed
  • EMPSTAT: Indicates whether or not the respondent was part of the labor force -- working or seeking work
  • EMPSTATD: Indicates whether or not the respondent was part of the labor force -- working or seeking work
  • INDGEN: Recodes the industrial classifications of the various samples into twelve groups
  • INCTOT: Reports the person's total personal income from all sources in the previous month or year
  • INCEARN: Reports the person's total income from their labor (from wages, a business, or a farm) in the previous month or year
  • INCWEL: Reports the monthly or annual income the respondent received from anti-poverty or welfare programs
  • INCRET: Reports the amount of income the respondent received from a retirement program or account, or from a pension
  • HLTHFAC: Identifies the type of hospital, clinic, or healthcare facility the person normally attended
  • HLTHCOV: Reports whether the person had medical coverage (or a right to medical services) from a public or private institution

The most significant variables will be those related to the geographic location of the household and those that describe the family's income. It is important to note, however, that for future research the remaining variables related to education, dwelling characteristics and access to health services could prove very useful in developing a wellbeing index for each household, which we could pair with the satellite images in order to predict welfare.

Let us conduct some exploratory data analysis on the rows related to the 2015 Intercensal Survey.

In [110]:
households15 = households.copy()
households15 = households15[households15['YEAR'] == 2015]
households15.describe()
Out[110]:
COUNTRY YEAR SAMPLE SERIAL HHWT URBAN GEOLEV1 GEOLEV2 GEO2_MX GEO2_MX1995 ... EDUCMX EMPSTAT EMPSTATD INDGEN INCTOT INCEARN INCWEL INCRET HLTHFAC HLTHCOV
count 11344365.0 11344365.0 11344365.0 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 0.0 ... 1.134436e+07 1.134436e+07 1.134436e+07 1.134436e+07 0.0 1.134436e+07 0.0 0.0 1.134436e+07 1.134436e+07
mean 484.0 2015.0 484201501.0 1.449551e+09 1.053932e+01 1.550229e+00 4.840177e+05 4.840178e+08 4.840178e+08 NaN ... 2.380601e+02 1.642034e+00 1.799075e+02 2.296046e+01 NaN 7.065413e+07 NaN NaN 4.410009e+00 4.276989e+01
std 0.0 0.0 0.0 8.441825e+08 1.615997e+01 4.974707e-01 7.946213e+00 7.963143e+03 7.963143e+03 NaN ... 2.446028e+02 1.332595e+00 1.471947e+02 7.578232e+01 NaN 4.553311e+07 NaN NaN 1.858148e+00 2.441384e+01
min 484.0 2015.0 484201501.0 1.000000e+03 2.000000e+00 1.000000e+00 4.840010e+05 4.840010e+08 4.840010e+08 NaN ... 1.000000e+01 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.000000e+00 NaN NaN 1.000000e+00 1.000000e+01
25% 484.0 2015.0 484201501.0 7.171520e+08 4.000000e+00 1.000000e+00 4.840120e+05 4.840120e+08 4.840120e+08 NaN ... 1.030000e+02 1.000000e+00 1.100000e+02 0.000000e+00 NaN 7.714000e+03 NaN NaN 5.000000e+00 4.000000e+01
50% 484.0 2015.0 484201501.0 1.439303e+09 6.000000e+00 2.000000e+00 4.840160e+05 4.840161e+08 4.840161e+08 NaN ... 1.060000e+02 1.000000e+00 1.100000e+02 0.000000e+00 NaN 1.000000e+08 NaN NaN 5.000000e+00 4.000000e+01
75% 484.0 2015.0 484201501.0 2.171946e+09 1.000000e+01 2.000000e+00 4.840240e+05 4.840240e+08 4.840240e+08 NaN ... 3.210000e+02 3.000000e+00 3.100000e+02 1.000000e+01 NaN 1.000000e+08 NaN NaN 5.000000e+00 4.000000e+01
max 484.0 2015.0 484201501.0 2.927196e+09 9.340000e+02 2.000000e+00 4.840320e+05 4.840321e+08 4.840321e+08 NaN ... 9.990000e+02 9.000000e+00 9.990000e+02 9.990000e+02 NaN 1.000000e+08 NaN NaN 9.000000e+00 9.900000e+01

8 rows × 59 columns

In [112]:
print('Null values per column in the Households 2015 dataset')
households15.isnull().sum()
Null values per column in the Households 2015 dataset
Out[112]:
COUNTRY               0
YEAR                  0
SAMPLE                0
SERIAL                0
HHWT                  0
URBAN                 0
GEOLEV1               0
GEOLEV2               0
GEO2_MX               0
GEO2_MX1995    11344365
GEO2_MX2005    11344365
GEO2_MX2015           0
SIZEMX                0
REMITT                0
ELECTRIC              0
WATSUP                0
SEWAGE                0
FUELCOOK              0
PHONE                 0
CELL                  0
INTERNET              0
TRASH                 0
AUTOS                 0
HOTWATER              0
AIRCON                0
COMPUTER              0
WASHER                0
REFRIG                0
TV                    0
RADIO                 0
ROOMS                 0
BEDROOMS              0
KITCHEN               0
TOILET                0
BATH                  0
FLOOR                 0
WALL                  0
ROOF                  0
HHTYPE                0
PERNUM                0
PERWT                 0
INDIG                 0
SPEAKIND              0
LANGMX                0
SCHOOL                0
LIT                   0
EDATTAIN              0
EDATTAIND             0
YRSCHOOL              0
EDUCMX                0
EMPSTAT               0
EMPSTATD              0
INDGEN                0
INCTOT         11344365
INCEARN               0
INCWEL         11344365
INCRET         11344365
HLTHFAC               0
HLTHCOV               0
dtype: int64

From the above, we can observe that the 2015 dataset includes 11,344,365 observations. Moreover, we note that there are null values for the geographic location variables of the years 1995 and 2005, which we will drop. We also note that the variables INCTOT, INCWEL and INCRET are not part of the 2015 survey and will drop those columns as well. Although the rest of the variables do not appear to have missing values, we need to be careful, as the IPUMS variable description key reports the particular codes that denote a null value for each variable. Null values can also stem from the fact that an individual may not be in the variable's 'universe'; for instance, the INCEARN variable describing income earned is only reported for individuals older than 12 years of age. Let us then drop the null values for this variable, which is of the utmost importance for our analysis.

In [132]:
households15_earners = households15.copy()
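# IPUMS encodes missing values for INCEARN with sentinel codes: 99999998 (unknown)
# and 99999999 (not in universe, e.g. persons under 12), so we filter both out.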
households15_earners = households15_earners[(households15_earners['INCEARN'] != 99999998)  & (households15_earners['INCEARN'] != 99999999)]
In [133]:
households15_earners.shape
Out[133]:
(3329264, 59)

We see that by dropping null values for the INCEARN variable, our dataset has been reduced to 3,329,264 entries. As per Babenko et al. (2017), the fact that we only have access to labor income rather than total income is not a severe issue, as the two are strongly linearly correlated. Therefore, INCEARN will be our target variable for this project. We note that this is monthly income in pesos.

Images

The images we will be using to train our convolutional neural network will be sourced from the Google Maps API. The API allows you to access an image of a certain location identified by its name, address or latitude and longitude. Moreover, it allows you to specify certain map parameters such as size, scale, format, maptype and region. Below is an example, capturing images of historic places in Mexico City.

Note: The following code to download the images from Google's API has been adapted from Adrian Rosebrock's Deep Learning for Computer Vision with Python book. Please review the Sources section below for more details.

In [9]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'

# initialize the URL list
places = ['Palacio+de+Bellas+Artes+Avenida+Juarez+Historic+Center', 'Zocalo+Historic+Center+of+Mexico+City', 
           'Museo+Nacional+de+Antropologia+Mexico']
URLS = []

for p in places:
    url = 'https://maps.googleapis.com/maps/api/staticmap?center={}&zoom=18&size=227x227&maptype=satellite&key=MYKEY'.format(p)
    URLS.append(url)

for (i, url) in enumerate(URLS):
    # skip this image if the request fails, rather than writing stale content
    try:
        r = requests.get(url, timeout = 100)
    except requests.exceptions.RequestException:
        print('[INFO] error downloading image...')
        continue
    
    p = os.path.sep.join([output_images, '{}.png'.format(str(i).zfill(2))])
    f = open(p,'wb')
    f.write(r.content)
    f.close()
    
    title = places[i].replace('+', ' ')
    print(title)
    image = plt.imread(p)
    plt.figure()
    plt.imshow(image)
    plt.show()
    
    time.sleep(0.1)
Palacio de Bellas Artes Avenida Juarez Historic Center
Zocalo Historic Center of Mexico City
Museo Nacional de Antropologia Mexico

Pairing the datasets and understanding the INEGI's geostatistical maps

The next step is to pair the household dataset with our satellite imagery; that is, to determine precisely how we will query Google's Maps Static API to match households with the correct image. Let us begin by taking a look at the geographic variables that are included in the household dataset:

  • URBAN: Indicates whether the household was located in a place designated as urban or as rural
  • GEOLEV1: Indicates the major administrative unit in which the household was enumerated
  • GEOLEV2: Indicates the second major administrative unit in which the household was enumerated
  • GEO2_MX: Identifies the household's province within Mexico in all sample years

To better understand these variables, we will look into the INEGI's Geostatistical Cartography Manual, which details the levels at which it divides the Mexican territory:

  • Level 1: State geostatistical areas (AGEE)
  • Level 2: Municipal geostatistical areas (AGEM)
  • Level 3: Basic geostatistical areas (AGEB): these can be either rural or urban
  • Level 4: Locality
  • Level 5: Block
In [10]:
print("Overview of the INEGI's geostatistical areas")
plt.figure(figsize = (6, 18))
inegi_areas = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\INEGI_manual_areas.png')
plt.imshow(inegi_areas)
plt.show()
print("Image sourced from the INEGI's 'Manual de cartografia estadística'")
Overview of the INEGI's geostatistical areas
Image sourced from the INEGI's 'Manual de cartografia estadística'

From IPUMS' variable description key, we can see that the variables GEOLEV1 and GEOLEV2 are harmonized variables that identify the state and second-level administrative unit to which the household belongs. GEO2_MX matches the AGEM geostatistical level, indicating each household's municipality. Thus, we see that the dataset does not provide a granular enough location to match each household to an individual image.

We have two options to deal with this problem. First, we can calculate average income levels per province and use a much higher-level satellite image of each province to train our network. This strategy is similar to that pursued by Babenko et al. (2017), who trained their model on the fraction of households living in poverty per administrative unit.

A second strategy is to use sampled data from the 2010 Census provided on the INEGI's webpage (see Sources below), which contains geographic information for each household at the block level; however, there are two important caveats:

  • This dataset (at least the public version) does not provide information on income levels, only demographic data and data on the household's dwelling. A strategy could be to construct the previously mentioned welfare index using dwelling information such as access to electricity, type of roof and floor, and other indicators of welfare (a sketch of such an index follows this list)
  • Even though location is disaggregated to the block level, we can only match latitudes and longitudes to the locality level (level 4)
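
To make the welfare-index idea concrete, below is a minimal sketch of such an asset-based index, assuming it is built from the IPUMS dwelling variables already loaded above. The code-to-meaning mappings in has_asset_code are placeholders only and would need to be checked against the IPUMS codebook before use.

# Hypothetical sketch of a simple asset-based welfare index. The value that
# means "has the asset" differs by variable, so the codes below are
# placeholders to be verified against the IPUMS codebook.
has_asset_code = {'ELECTRIC': 1, 'REFRIG': 2, 'WASHER': 2,
                  'COMPUTER': 2, 'INTERNET': 2, 'HOTWATER': 2}

def welfare_index(df, codes=has_asset_code):
    # one 0/1 indicator per asset, then an equal-weight average in [0, 1]
    indicators = pd.DataFrame({col: (df[col] == code).astype(int)
                               for col, code in codes.items()})
    return indicators.mean(axis=1)

# e.g. households15['welfare'] = welfare_index(households15)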

Strategy 1: Aggregating household data at the municipal level

We will pursue the first strategy. Our tasks include aggregating the data, determining the number of observation points available to train our convolutional neural network, matching our municipality data to a latitude and longitude, and identifying the exact query specifications to download the satellite imagery. We can then begin training our neural network.

Preparing the household dataset

In [134]:
final_households = households15_earners.copy()
final_households = final_households[['GEO2_MX', 'INCEARN']]
final_households.head()
Out[134]:
GEO2_MX INCEARN
10616611 484001001 3857.0
10616615 484001001 4286.0
10616617 484001001 4286.0
10616619 484001001 1714.0
10616621 484001001 4286.0

We will now aggregate the data above by the location variable, GEO2_MX.

In [135]:
grouped_households = final_households.groupby('GEO2_MX').agg(np.mean)
In [136]:
grouped_households.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2321 entries, 484001001 to 484032053
Data columns (total 1 columns):
INCEARN    2321 non-null float64
dtypes: float64(1)
memory usage: 36.3 KB
In [137]:
grouped_households.describe()
Out[137]:
INCEARN
count 2321.000000
mean 3802.515556
std 1797.218417
min 49.715686
25% 2597.702624
50% 3697.135258
75% 4803.338739
max 20124.487083
In [138]:
grouped_households.hist()
plt.title('Histogram of Average Monthly Income Earned per Municipality')
plt.show()

Having aggregated the data, we note that our number of observation points has been reduced to 2,321. This has significant consequences for our neural network, given that it will be prone to overfitting. From the above plot we can observe the distribution of this monthly average income per municipality, and note that it is largely concentrated between 2,500 and 5,000 pesos per month. This roughly translates to 125-250 USD per month (at an exchange rate of about 20 pesos per USD).

Our next step involves mapping the municipality data to a latitude and longitude. To complete this task, we will use the INEGI's Unique Catalogue of State, Municipal, and Locality Geostatistical Area Codes, which can be found here.

In [139]:
data_types = {'mapa': int, 'Cve_Ent': int, 'Nom_Ent': object, 'Nom_Abr': object, 'Cve_Mun': int, 
              'Nom_Mun': object, 'Cve_Loc': int, 'Nom_Loc': object, 'Latitud': object, 'Longitud': object, 
             'Lat_Decimal': float, 'Lon_Decimal': float, 'Altitud': object, 'Cve_Carta': object, 
             'Pob_total' : int, 'Pob_Masculina': int}
loc_catalogue = pd.read_csv('datasets/Location_catalogue/AGEEML_201910301717378.csv', dtype = data_types)
In [140]:
loc_catalogue.head()
Out[140]:
Mapa Cve_Ent Nom_Ent Nom_Abr Cve_Mun Nom_Mun Cve_Loc Nom_Loc Ámbito Latitud Longitud Lat_Decimal Lon_Decimal Altitud Cve_Carta Pob_Total Pob_Masculina Pob_Femenina Total De Viviendas Habitadas
0 10010001 1 Aguascalientes Ags. 1 Aguascalientes 1 Aguascalientes U 21°52´47.362N" 102°17´45.768W" 21.879823 -102.296047 1878 F13D19 722250 348722 373528 185120
1 10010094 1 Aguascalientes Ags. 1 Aguascalientes 94 Granja Adelita R 21°52´18.749N" 102°22´24.710W" 21.871875 -102.373531 1902 F13D18 14 0 0 2
2 10010096 1 Aguascalientes Ags. 1 Aguascalientes 96 Agua Azul R 21°53´01.522N" 102°21´25.639W" 21.883756 -102.357122 1861 F13D18 37 21 16 11
3 10010100 1 Aguascalientes Ags. 1 Aguascalientes 100 Rancho Alegre R 21°51´16.556N" 102°22´21.884W" 21.854599 -102.372746 1879 F13D18 10 0 0 1
4 10010102 1 Aguascalientes Ags. 1 Aguascalientes 102 Los Arbolitos [Rancho] R 21°46´48.650N" 102°21´26.261W" 21.780181 -102.357295 1861 F13D18 7 0 0 1

Now, we observe that the location catalogue above provides data on the first four geostatistical levels. If we take a closer look at the variable Mapa, we can see that this is an 8- or 9-digit key. The first one or two digits encode the state (1 to 32), the next three the municipality, and the last four the locality. We also notice that for each municipality, the row with locality code '0001' represents the municipality itself rather than a locality within it. Thus, we need to match on the state and municipality digits for the rows ending in '01', as these carry the overall data on the municipality, not the localities within it.
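
To make this layout concrete, below is a small, purely illustrative parse of the Mapa key; the parse_mapa helper is hypothetical and not part of the pipeline.

def parse_mapa(mapa):
    # state = 1 or 2 leading digits, municipality = next 3, locality = last 4
    s = str(mapa)
    return s[:-7], s[-7:-4], s[-4:]

print(parse_mapa(10010001))   # ('1', '001', '0001'): Aguascalientes, municipal seat
print(parse_mapa(100010094))  # ('10', '001', '0094'): a 9-digit key for a two-digit state

Let us now recall the information we had in our households database: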

In [141]:
grouped_households.head()
Out[141]:
INCEARN
GEO2_MX
484001001 6460.614828
484001002 4266.121608
484001003 3960.400450
484001004 4244.469072
484001005 6412.689766

We notice that the GEO2_MX variable consists of Mexico's numeric country code '484' followed by six digits representing the state and municipality, exactly as above. Therefore, we will now extract the relevant location codes from each dataset and merge them using this key.

In [142]:
grouped_households.reset_index(inplace = True)
grouped_households['location'] = grouped_households['GEO2_MX'].astype(str).str.split('484').str[1]
In [143]:
loc_catalogue_final = loc_catalogue.copy()
loc_catalogue_final = loc_catalogue_final[['Mapa', 'Lat_Decimal', 'Lon_Decimal']]
bool_loc = loc_catalogue_final['Mapa'].astype(str).str[-2:] == '01'
loc_catalogue_final = loc_catalogue_final[bool_loc]
In [144]:
loc_catalogue_final['length'] = loc_catalogue_final['Mapa'].astype(str).str.len()
loc_catalogue_final['state'] = loc_catalogue_final['Mapa'].astype(str).str[0]
loc_catalogue_final.loc[loc_catalogue_final['length'] == 9, 'state'] = loc_catalogue_final['Mapa'].astype(str).str[:2]

loc_catalogue_final['municipality'] = loc_catalogue_final['Mapa'].astype(str).str[1:4]
loc_catalogue_final.loc[loc_catalogue_final['length'] == 9, 'municipality'] = loc_catalogue_final['Mapa'].astype(str).str[2:5]
loc_catalogue_final['location'] = '0' + loc_catalogue_final['state'].astype(str) + loc_catalogue_final['municipality'].astype(str)
In [145]:
grouped_households.drop('GEO2_MX', inplace = True, axis = 1)
grouped_households.head()
Out[145]:
INCEARN location
0 6460.614828 001001
1 4266.121608 001002
2 3960.400450 001003
3 4244.469072 001004
4 6412.689766 001005
In [146]:
grouped_households['location'].value_counts()
Out[146]:
030164    1
021156    1
014089    1
030133    1
030062    1
         ..
002002    1
030125    1
024050    1
010037    1
031028    1
Name: location, Length: 2321, dtype: int64
In [147]:
loc_catalogue_final = loc_catalogue_final.groupby('location').agg(np.mean)
loc_catalogue_final.reset_index(inplace = True)
loc_catalogue_final.head()
Out[147]:
location Mapa Lat_Decimal Lon_Decimal length
0 010001 1.000102e+08 24.507653 -104.742524 9.0
1 010002 1.000201e+08 25.121212 -106.590988 9.0
2 010003 1.000300e+08 24.979125 -104.768917 9.0
3 010004 1.000402e+08 24.784158 -103.770661 9.0
4 010005 1.000509e+08 24.079907 -104.572918 9.0

We observe that we now have a location variable with the same format in both datasets, and we are now ready to merge them.

In [148]:
locations = loc_catalogue_final.copy()
locations = locations[['location', 'Lat_Decimal', 'Lon_Decimal']]
locations['location'] = locations['location'].str.zfill(6)
locations.head()
Out[148]:
location Lat_Decimal Lon_Decimal
0 010001 24.507653 -104.742524
1 010002 25.121212 -106.590988
2 010003 24.979125 -104.768917
3 010004 24.784158 -103.770661
4 010005 24.079907 -104.572918
In [149]:
combined = grouped_households.merge(locations, how = 'left', left_on = 'location', right_on = 'location')
combined.head()
Out[149]:
INCEARN location Lat_Decimal Lon_Decimal
0 6460.614828 001001 21.836513 -102.286666
1 4266.121608 001002 22.156567 -102.039549
2 3960.400450 001003 21.870873 -102.726988
3 4244.469072 001004 22.366641 -102.302860
4 6412.689766 001005 21.962373 -102.338731
In [150]:
combined.to_csv(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\households\household_data_combined.csv')

We now have a dataset which will allow us to map average monthly income levels to satellite images.

Now, we have two options:

  1. We can train a classification model using a convolutional neural network that predicts an income bucket
  2. We can use a convolutional neural network as a feature extractor to train a support vector machine that predicts the income level

We will proceed with the second option and leave the first option for future research.

Selecting the right query for the Google Maps Static API

We shall now proceed to selecting the right query to download our images from the Google Maps Static API. As previously mentioned, map parameters and location include:

  • Size: image dimensions
  • Scale: controls the number of pixels for a fixed coverage area and level of detail
  • Format: image format
  • Maptype: type of map (roadmap, satellite, hybrid, terrain)
  • Region: controls the borders to display
  • Center: defines the center of the map if markers are not provided
  • Zoom: zoom level of the map

Considering our input data and our needs (satellite images), the parameters we need to define are the size, scale and zoom level. We will download 500x500 images, considering that any rescaling can be performed at the pre-processing stage of the convolutional neural network. Now, let us take a look at sample images to determine what best captures our region of interest.

We will now take a look at images generated for the municipality of Aguascalientes and adjust the parameters so that the image captures it adequately.

In [28]:
print("Municipality of Aguascalientes")
plt.figure(figsize = (15, 18))
aguascalientes_m = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality.png')
aguascalientes_ms = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality_s.png')
plt.subplot(121)
plt.imshow(aguascalientes_m)
plt.subplot(122)
plt.imshow(aguascalientes_ms)
plt.show()
print("Images sourced from Google Maps")
Municipality of Aguascalientes
Images sourced from Google Maps

Now, let us replicate this using Google's Maps Static API and the coordinates for this Municipality in our combined dataset. We will use a larger size for the moment in order to better gauge the adequacy of the image.

In [29]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'

aguas_lat = combined.loc[0, 'Lat_Decimal']
aguas_lon = combined.loc[0, 'Lon_Decimal']

parameters = {'zoom' : 12, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
url = base_text.format(aguas_lat, aguas_lon, parameters['zoom'], parameters['scale'], parameters['size'], 
                       parameters['format'], parameters['maptype'])

# stop here if the request fails, rather than writing stale content below
try:
    r = requests.get(url, timeout = 100)
except requests.exceptions.RequestException:
    print('[INFO] error downloading image...')
    raise

p = os.path.sep.join([output_images, 'aguascalientes.png'])
f = open(p,'wb')
f.write(r.content)
f.close()
    
title = 'Municipality of Aguascalientes'
print(title)
image = plt.imread(p)
plt.figure(figsize = (10, 18))
plt.imshow(image)
plt.show()
Municipality of Aguascalientes

Comparing this image to our satellite image above, we see that a zoom level of 12 is still too close, covering too small an area. We will adjust below.

In [30]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'

parameters = {'zoom' : 10, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
url = base_text.format(aguas_lat, aguas_lon, parameters['zoom'], parameters['scale'], parameters['size'], 
                       parameters['format'], parameters['maptype'])

# stop here if the request fails, rather than writing stale content below
try:
    r = requests.get(url, timeout = 100)
except requests.exceptions.RequestException:
    print('[INFO] error downloading image...')
    raise

p = os.path.sep.join([output_images, 'aguascalientes.jpg'])
f = open(p,'wb')
f.write(r.content)
f.close()
    
image = plt.imread(p)

aguascalientes_ms2 = plt.imread(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality_s2.png')
print()
plt.figure(figsize = (15, 7))
plt.subplot(121)
plt.imshow(image)
plt.title('Image sourced from API')
plt.subplot(122)
plt.imshow(aguascalientes_ms2)
plt.title('Manually crafted image')
plt.suptitle('Comparison of image from API and manually crafted image')
plt.show()
print("Images sourced from Google Maps")

Images sourced from Google Maps

This seems to be the best fit and so we will now use those parameters to download the images for the rest of the data.

In [151]:
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images'

parameters = {'zoom' : 10, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'

for (i, loc) in enumerate(combined['location']):
    bool_file = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images\{}.jpg'.format(loc)
    if not os.path.exists(bool_file):
        lat = combined.loc[i, 'Lat_Decimal']
        lon = combined.loc[i, 'Lon_Decimal']
        url = base_text.format(lat, lon, parameters['zoom'], parameters['scale'], parameters['size'], 
                           parameters['format'], parameters['maptype'])

        # skip this location if the request fails, rather than writing stale content
        try:
            r = requests.get(url, timeout = 100)
        except requests.exceptions.RequestException:
            print('[INFO] error downloading image...')
            continue

        p = os.path.sep.join([output_images, '{}.jpg'.format(str(loc))])
        f = open(p,'wb')
        f.write(r.content)
        f.close()
        if i % 100 == 0:
            print('Successfully downloaded image {}/{}'.format(i, len(combined['location'])))

        time.sleep(0.1)  

Training the Model

Now that we have our full dataset, we are ready to begin training our neural network. We note that for this section, we will be using tools and adapting code from the Pyimagesearch module developed by Adrian Rosebrock in his book, Deep Learning for Computer Vision with Python, which can be found here.

In [152]:
from keras.applications import VGG16
from keras.applications import imagenet_utils
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
from Model.pyimagesearch.io import HDF5DatasetWriter
from imutils import paths
import progressbar
import random
In [153]:
dataset = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images'
output = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\Model\output\hdf5_features4.hdf5'
bs = 32
buffer_size = 1000

Let us begin by accessing the images that we previously downloaded from Google's API and labeling them.

In [154]:
imagePaths = list(paths.list_images(dataset))
random.shuffle(imagePaths)

label_codes = [path.split(os.path.sep)[-1][:-4] for path in imagePaths]
In [155]:
combined['label'] = combined.index
In [156]:
labels = [combined.loc[combined['location'] == loc, 'label'] for loc in label_codes]
labels = [l.iloc[0] for l in labels]

We next instantiate a VGG16 convolutional neural network. By passing include_top=False, we do not forward propagate through the entire model but rather stop before the fully connected layers. This allows us to extract feature representations of each image instead of a set of classification probabilities: for 224x224 inputs, VGG16's final pooling layer outputs a 7x7x512 volume, which is why each image is flattened into a 512 * 7 * 7 = 25,088-dimensional feature vector below. We also instantiate an HDF5 dataset writer, which will store the extracted features from the network and pair them with each image's label.

In [157]:
if os.path.exists(output) == False:
    model = VGG16(weights="imagenet", include_top=False)
    dataset = HDF5DatasetWriter((len(imagePaths), 512 * 7 * 7), output, dataKey="features", bufSize=buffer_size)

Next, we initialize our progress bar and begin looping over the images in batches. Within each batch, we loop over each image and pre-process it, adjusting the size to 224x224. After pre-processing the batch, it gets forward propagated through the network. At this stage, we extract the features by flattening the model's predictions; these vectors are then added to our HDF5 dataset coupled with the images' labels.

In [158]:
if os.path.exists(output) == False:
    widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
        progressbar.Bar(), " ", progressbar.ETA()]
    pbar = progressbar.ProgressBar(maxval=len(imagePaths), widgets=widgets).start()

    for i in np.arange(0, len(imagePaths), bs):
        batchPaths = imagePaths[i:i + bs]
        batchLabels = labels[i:i + bs]
        batchImages = []

        for (j, imagePath) in enumerate(batchPaths):
            image = load_img(imagePath, target_size=(224, 224))
            image = img_to_array(image)

            image = np.expand_dims(image, axis=0)
            image = imagenet_utils.preprocess_input(image)

            batchImages.append(image)

        batchImages = np.vstack(batchImages)
        features = model.predict(batchImages, batch_size=bs)

        features = features.reshape((features.shape[0], 512 * 7 * 7))

        dataset.add(features, batchLabels)
        pbar.update(i)

    dataset.close()
    pbar.finish()

With our features at hand, we are now ready to train our support vector regression model.

In [159]:
import h5py
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

We prepare our data by loading the features from the HDF5 dataset and labeling each observation with the average monthly income earned from our combined dataset. We then perform our train-test split; note that test_size = 0.95 reserves 95% of the municipalities for testing, leaving only about 116 observations for training.

In [162]:
hdf5_data = h5py.File(output, "r")
features = hdf5_data['features'][()]
codes = hdf5_data['labels'][()]
labels = [combined.loc[combined['label'] == code, 'INCEARN'] for code in codes]
labels = [l.iloc[0] for l in labels]
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 42, test_size = 0.95)

As a next step, we instantiate our support vector regression model and use five-fold cross validation to select the best value of the regularization parameter C. In line with Naik et al. (2016), we fit a Support Vector Regression with a linear kernel.

In [163]:
parameters = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}

model = svm.SVR(gamma = 'scale', kernel = 'linear')
clf = GridSearchCV(model, parameters, cv = 5, iid = True)
clf.fit(X_train, y_train)
Out[163]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='scale', kernel='linear',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid=True, n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0,
                               10000.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [164]:
print('R-squared score (training data): {:.3f}'.format(r2_score(y_train, clf.predict(X_train,))))
hdf5_data.close()
R-squared score (training data): 0.489

We can observe that with the best parameter from five-fold cross validation, our model achieves an $R^2$ of 0.489 on the training data.
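
A natural follow-up, sketched below but not executed here, would be to score the held-out split as well; with so few training municipalities, a large gap between training and test $R^2$ would signal overfitting.

# suggested check (not run here): evaluate generalization on the held-out split
print('R-squared score (test data): {:.3f}'.format(r2_score(y_test, clf.predict(X_test))))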

Visual Exploration and Mapping Outcomes

As a final step, we will conduct some visual exploration. First, we will perform principal component analysis to visualize the relationship between our extracted features and average monthly income per municipality.

In [165]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_train_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components = 1)  # lowercase name avoids shadowing the PCA class
principal_c = pca.fit_transform(X_train_scaled)
principal_df = pd.DataFrame(data = principal_c, columns = ['X_pca'])
In [166]:
principal_df['y_pca'] = y_train
In [167]:
principal_df.plot.scatter(x = 'X_pca', y = 'y_pca')
plt.title('Scatter plot of Principal Component 1 and average monthly income earned per municipality')
plt.show()

Lastly, we will map our target variable to visualize its geographic distribution.

In [168]:
os.environ['PROJ_LIB'] = r'C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share'
from mpl_toolkits.basemap import Basemap
In [169]:
fig = plt.figure(figsize = (15, 6))
m = Basemap(projection = 'lcc', resolution=None, width=6E6, height=3E6, lat_0=25, lon_0=-100)
x, y = m(combined['Lon_Decimal'].tolist(), combined['Lat_Decimal'].tolist())
m.shadedrelief()
plt.scatter(x, y, c = combined['INCEARN'].tolist(), cmap = 'Reds', alpha  = 0.5, marker = 'o')
plt.title('Geographic Distribution of Monthly Average Income per Municipality')
plt.colorbar(label='Average Monthly Income per Municipality (pesos)')
plt.show()

As we can see above, average monthly income tends to be higher in the Northeastern and Central parts of Mexico, which are known for their high level of industrial and economic activity. The poorest parts are clearly visible in the figure and are largely in Southern Mexico (Oaxaca, Guerrero) and the Yucatán Peninsula.

Conclusions

In this project, we have fit a support vector regression model with a linear kernel to estimate the monthly average income (in pesos) of a municipality using a satellite image of its location. For this purpose, we have cleaned data from the Mexican 2015 Intercensal Survey (kindly provided by IPUMS) and matched it to satellite images by manipulating municipalities' geographic coordinates and income information. Due to the lack of granularity in the data, we performed these estimates at the municipality level. Extracting features from these images using a VGG16 CNN, our model achieved an $R^2$ of 0.489 on the training data.

Many strategies could be followed to try to improve the explanatory power of our model, which we will pursue in future research. These include sourcing the data directly from INEGI to acquire more granular information on locality location, and performing a classification instead of a regression by dividing income levels into buckets (a sketch of the latter follows).
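
As a rough illustration of the classification route, the sketch below buckets average municipal income into quartiles and fits a linear SVM classifier on the same VGG16 features; the bucket count and C value are illustrative choices, not tuned settings.

# Hypothetical sketch: quartile income buckets + linear SVM classifier
income_buckets = pd.qcut(pd.Series(labels), q=4, labels=False)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    features, income_buckets, random_state = 42, test_size = 0.95)

bucket_clf = svm.SVC(kernel = 'linear', C = 1.0)
bucket_clf.fit(Xb_train, yb_train)
print('Classification accuracy (training data): {:.3f}'.format(
    bucket_clf.score(Xb_train, yb_train)))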

Sources

Babenko, Boris, Jonathan Hersh, David Newhouse, Anusha Ramakrishnan, and Tom Swartz. "Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico." Presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) Workshop on Machine Learning for the Developing World (December 2017).

Dubey, Abhimanyu, Nikhil Naik, Devi Parikh, Ramesh Raskar and Cesar Hidalgo. "Deep Learning the City: Quantifying Urban Perception at a Global Scale." European Conference on Computer Vision (September 2016): 569-576.

Glaeser, Edward L., Scott Duke Kominers, Michael Luca, and Nikhil Naik. "Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life." Economic Inquiry 56, no. 1 (January 2018): 114–137. (Originally Harvard Business School Working Paper, No. 16-065, November 2015.)

Google Maps Platform. Maps Static API.

Instituto Nacional de Estadística y Geografía. "Catálogo Único de Claves de Áreas Geoestadísticas Estatales, Municipales y Localidades." Link. Accessed October 30, 2019.

Instituto Nacional de Estadística y Geografía. "Censo de Población y Vivienda 2010." Link. Accessed October 30, 2019.

Instituto Nacional de Estadística y Geografía. "Manual de cartografia estadística." Censo de Población y Vivienda (2010).

Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [Mexico Intercensal Survey 2015 - sample ID 484201501]. Minneapolis, MN: IPUMS, 2019. Link

Naik, Nikhil, Ramesh Raskar, and Cesar A. Hidalgo. "Cities are Physical Too: Using Computer Vision to Measure the Quality and Impact of Urban Appearance." American Economic Review, Vol. 106 (May 2016): 128-132.

Naik, Nikhil, Scott Duke Kominers, Ramesh Raskar, Edward L. Glaeser, and Cesar Hidalgo. "Computer vision uncovers predictors of physical urban change." PNAS, Vol. 114, no. 29 (July 2017): 7571-7576.

Rosebrock, Adrian. "Deep Learning for Computer Vision with Python." PyImageSearch. Link. Accessed September 29, 2019.