This project uses satellite imagery of the Mexican territory, together with computer vision and convolutional neural networks, to estimate the income levels of Mexican households. Beyond providing an efficient way to understand the geographical distribution of income, a successful model could be applied to improving the measurement of urban life and to quantifying the impact of public policy over time. This first project allows us to explore the data available for Mexico and to evaluate how accurately such a model can predict economic data.
The use of satellite imagery and street view images to predict economic variables such as income or physical characteristics of an urban area has been explored by several authors.
Glaeser et al. (2015) describe how new urban data sources can be used to improve the study of cities. Specifically, they use a corpus of Google Street View images to measure the physical characteristics of a neighborhood and to estimate income levels in New York City blocks, emphasizing the power of this tool to map wealth and poverty. Moreover, the authors underscore the use of these models to predict income and map poverty in developing countries, where there is a surplus of image data and a lack of reliable information on income and other variables. To perform this analysis, the authors queried the Google Street View Image API by specifying locations (using latitude and longitude) and camera viewpoints. Latitude and longitude values were randomly selected from a uniform grid, and 40 images were sampled per square mile to obtain 12,200 images of New York City. The authors also used images from Boston as a test set to evaluate the performance of their model, given that it could be prone to overfitting. For their income data, Glaeser et al. used information on median family income, which is reported at the census block level in the American Community Survey. For the actual prediction, the authors used v-Support Vector Regression on extracted features (color information histograms, GIST feature descriptors and textons). Their computer vision model proves to be a good fit, with an $R^2$ of 0.85, and generalizes well to images of other cities.
Naik et al. (2016) explore the use of computer vision to quantify the physical aspects of urban areas using street-level imagery. Additionally, the authors use this methodology to develop an algorithm that predicts the perceived safety or 'Streetscore' of locations across cities in the United States. Using street imagery coupled with perceived safety scores, the authors merge features extracted from three methods (textons, color histograms and GIST features) and train a v-Support Vector Regression with a linear kernel. Their model achieves an $R^2$ of 0.57. Their image corpus consists of one million Google Street View images from 19 cities. Furthermore, they use socioeconomic characteristics of census tracts sourced from the American Community Survey and relate urban appearance to demographic variables such as population, median income and Gini index. They find a strong positive correlation between urban appearance and median income and population density, and between income inequality and the variance of neighborhood appearance.
Naik et al. (2017) dive deeper into using urban appearance to predict economic and demographic data by measuring changes in the physical characteristics of neighborhoods over time and using these changes to evaluate neighborhood improvement. Using streetscapes from five cities in the United States sourced from the Google API, captured in 2007 and 2014, and 'Streetscores' for each image, they perform semantic segmentation of the images, create feature vectors using GIST features and texton maps, and use features of streets and buildings to predict the Streetscore with a Support Vector Regression. With this information, they calculate the change in Streetscore between the two years to measure the change in physical appearance. The authors find a positive correlation between the physical improvement of neighborhoods and the population of college-educated adults, the neighborhood's initial appearance, and the neighborhood's proximity to the central business district.
For Mexico, Babenko et al. (2017) use Convolutional Neural Networks and satellite images from Planet and Digital Globe to estimate poverty. Poverty data stems from Mexico's 2015 Intercensal Survey and from its 2014 National Survey of Household Income and Expenditure (ENIGH). After testing several neural network architectures, the authors use a GoogleNet CNN as a final model, which explains approximately 57% of the variation in poverty in the validation set.
Mexico's National Institute of Statistics and Geography (INEGI) conducts several censuses and surveys that measure economic and demographic variables of Mexican households and individuals. Given the confidential nature of the data, this microdata can only be fully accessed via INEGI's remote processing services or in person at INEGI's laboratories. However, full access to these services is restricted to graduate students, researchers affiliated with recognized institutions such as universities, and government officials. As a result, we will use a sample of the 2015 Intercensal Survey data that has been kindly provided by the Integrated Public Use Microdata Series (IPUMS), an institution which provides census and survey data from various countries. We also wish to acknowledge the statistical office that provided the underlying data making this research possible: the National Institute of Statistics, Geography, and Informatics, Mexico.
Let us begin by loading the necessary libraries and taking a closer look at our data.
import requests
import os
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
%matplotlib inline
households = pd.read_csv(r'datasets/Econ_data/Censo_poblacion_vivienda/IPUMS/data.csv')
households.head()
households.info()
The dataset includes 21,960,976 data points and 59 variables for the intercensal survey of 1995, 2005 and 2015. As per IPUMS' variable availability key, variables include:
COUNTRY: Country from which the sample was drawn
YEAR: Year in which the census or survey was taken
SAMPLE: Identifies the IPUMS sample from which the case is drawn
SERIAL: An identifying number unique to each household in a given sample
HHWT: Indicates the number of households in the population represented by the household in the sample
URBAN: Indicates whether the household was located in a place designated as urban or as rural
GEOLEV1: Indicates the major administrative unit in which the household was enumerated
GEOLEV2: Indicates the second major administrative unit in which the household was enumerated
GEO2_MX: Identifies the household's province within Mexico in all sample years
GEO2_MX1995: Identifies the household's municipality within Mexico in 1995
GEO2_MX2005: Identifies the household's municipality within Mexico in 2005
GEO2_MX2015: Identifies the household's municipality within Mexico in 2015
SIZEMX: Identifies the population of the locality within Mexico in all sample years
REMITT: Indicates whether the household received remittances from persons living elsewhere
ELECTRIC: Indicates whether the household had access to electricity
WATSUP: Describes the physical means by which the housing unit receives its water
SEWAGE: Indicates whether the household has access to a sewage system or septic tank
FUELCOOK: Indicates the predominant type of fuel or energy used for cooking
PHONE: Indicates the availability of a telephone in the dwelling
CELL: Indicates the availability of a cellular phone in the household
INTERNET: Indicates whether or not the household had an internet connection
TRASH: Indicates whether the household's waste or garbage is collected by a sanitation service
AUTOS: Records whether a member of the household owned or had use of a vehicle and, in many samples, the number
HOTWATER: Indicates whether the housing unit had a water heater
AIRCON: Indicates whether the household had air conditioning
COMPUTER: Indicates whether the household had a personal computer
WASHER: Indicates whether the household had a clothes washing machine
REFRIG: Indicates whether the household had a refrigerator
TV: Indicates whether the household had a television
RADIO: Indicates whether the household had a radio
ROOMS: Indicates the number of rooms occupied by the housing unit
BEDROOMS: Indicates the number of rooms available to members of the household for sleeping
KITCHEN: Indicates whether the household had a kitchen, cooking facilities, or room dedicated to food preparation
TOILET: Indicates whether the household had access to a toilet and, in most cases, whether it was a flush toilet
BATH: Indicates whether the household had access to bathing facilities and, in most cases, whether it had exclusive access
FLOOR: Indicates the dwelling's predominant flooring material
WALL: Indicates the primary material used in the construction of the dwelling
ROOF: Indicates the dwelling's predominant roofing material
HHTYPE: Describes the composition of households
PERNUM: Numbers all persons within each household consecutively
PERWT: Indicates the number of persons in the actual population represented by the person in the sample
INDIG: Indicates whether the person belonged to an indigenous group
SPEAKIND: Indicates whether a person speaks an indigenous language
LANGMX: Indicates the indigenous language, if any, spoken by the person
SCHOOL: Indicates whether or not the person attended school at the time of the census or within some specified period
LIT: Indicates whether or not the respondent could read and write in any language
EDATTAIN: Records the person's educational attainment in terms of the level of schooling completed
EDATTAIND: Records the person's educational attainment in terms of the level of schooling completed
YRSCHOOL: Indicates the highest grade/level of schooling the person had completed, in years
EDUCMX: Indicates the person's educational attainment in Mexico in terms of the highest level and year or grade completed
EMPSTAT: Indicates whether or not the respondent was part of the labor force -- working or seeking work
EMPSTATD: Indicates whether or not the respondent was part of the labor force -- working or seeking work
INDGEN: Recodes the industrial classifications of the various samples into twelve groups
INCTOT: Reports the person's total personal income from all sources in the previous month or year
INCEARN: Reports the person's total income from their labor (from wages, a business, or a farm) in the previous month or year
INCWEL: Reports the monthly or annual income the respondent received from anti-poverty or welfare programs
INCRET: Reports the amount of income the respondent received from a retirement program or account, or from a pension
HLTHFAC: Identifies the type of hospital, clinic, or healthcare facility the person normally attended
HLTHCOV: Reports whether the person had medical coverage (or a right to medical services) from a public or private institution

The most significant variables will be those related to the geographic location of the household and those that describe the family's income. It is important to note, however, that for future research the remaining variables related to education, dwelling characteristics and access to health services could prove very useful in developing a wellbeing index for each household, which we could pair with the satellite images in order to predict welfare.
Let us conduct some exploratory data analysis on the rows related to the 2015 Intercensal Survey.
households15 = households.copy()
households15 = households15[households15['YEAR'] == 2015]
households15.describe()
print('Null values per column in the Households 2015 dataset')
households15.isnull().sum()
From the above, we can observe that the 2015 dataset includes 11,344,365 observations. Moreover, we note that there are null values for the geographic location variables of the years 1995 and 2005, which we will drop. We also note that the variables INCTOT, INCWEL and INCRET are not part of the 2015 survey, so we will drop those columns as well. Although the rest of the variables do not appear to have missing values, we need to be careful, as IPUMS' variable description key reports the particular codes that denote a null value for each variable. Null values can also stem from the fact that an individual may not be in the variable's 'universe'; for instance, the INCEARN variable describing income earned is only reported for individuals older than 12 years of age. Let us then drop the null values for this variable, which is of the utmost importance for our analysis.
households15_earners = households15.copy()
households15_earners = households15_earners[(households15_earners['INCEARN'] != 99999998) & (households15_earners['INCEARN'] != 99999999)]
households15_earners.shape
We see that by dropping null values for the INCEARN variable, our dataset has been reduced to 3,329,264 entries. As per Babenko et al. (2017), the fact that we only have access to labor income rather than total income is not a severe issue, as the two are strongly linearly correlated. Therefore, INCEARN will be our target variable for this project. We note that this is monthly income in pesos.
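As a quick check before moving on, we can inspect the distribution of the target variable (a minimal sketch using the dataframe defined above):
# Quick look at the distribution of monthly labor income (in pesos)
households15_earners['INCEARN'].describe()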
The images we will be using to train our convolutional neural network will be sourced from the Google Maps API. The API allows you to access an image of a certain location identified by its name, address or latitude and longitude. Moreover, it allows you to specify certain map parameters such as size, scale, format, maptype and region. Below is an example, capturing images of historic places in Mexico City.
Note: The following code to download the images from Google's API has been adapted from Adrian Rosebrock's Deep Learning for Computer Vision with Python book. Please review the Sources section below for more details.
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'
# initialize the list of places and build one Static Maps API URL per place
places = ['Palacio+de+Bellas+Artes+Avenida+Juarez+Historic+Center', 'Zocalo+Historic+Center+of+Mexico+City',
          'Museo+Nacional+de+Antropologia+Mexico']
URLS = []
for p in places:
    url = 'https://maps.googleapis.com/maps/api/staticmap?center={}&zoom=18&size=227x227&maptype=satellite&key=MYKEY'.format(p)
    URLS.append(url)
# download each image, save it to disk and display it
for (i, url) in enumerate(URLS):
    try:
        r = requests.get(url, timeout = 100)
    except:
        print('[INFO] error downloading image...')
        continue
    p = os.path.sep.join([output_images, '{}.png'.format(str(i).zfill(2))])
    with open(p, 'wb') as f:
        f.write(r.content)
    title = places[i].replace('+', ' ')
    print(title)
    image = plt.imread(p)
    plt.figure()
    plt.imshow(image)
    plt.show()
    time.sleep(0.1)
The next step is to pair the household dataset with our satellite imagery; that is, to determine precisely how we will query Google's Maps Static API to match households with the correct image. Let us begin by taking a look at the geographic variables that are included in the household dataset:
URBAN: Indicates whether the household was located in a place designated as urban or as rural
GEOLEV1: Indicates the major administrative unit in which the household was enumerated
GEOLEV2: Indicates the second major administrative unit in which the household was enumerated
GEO2_MX: Identifies the household's province within Mexico in all sample years

To better understand these variables, we will look into the INEGI's Geostatistical Cartography Manual, which details the levels at which it divides the Mexican territory:
print("Overview of the INEGI's geostatistical areas")
plt.figure(figsize = (6, 18))
inegi_areas = plt.imread(r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\output\main\INEGI_manual_areas.png')
plt.imshow(inegi_areas)
plt.show()
print("Image sourced from the INEGI's 'Manual de cartografia estadística'")
From IPUMS' variable description key, we can see that the variables GEOLEV1 and GEOLEV2 are harmonized variables identifying the state and the second-level administrative unit to which the household belongs. GEO2_MX matches the AGEM geostatistical level, identifying each household's municipality. Thus, the dataset does not provide a granular enough location to match each household to its own image.
We have two options to deal with this problem. First, we can calculate average income levels per municipality and use a single, wider-area satellite image of each municipality to train our network. This strategy is similar to that pursued by Babenko et al. (2017), who trained their model on the fraction of households living in poverty per administrative unit.
A second strategy is to use sampled data from the 2010 Census provided on the INEGI's webpage (see Sources below), which contains geographic information for each household at the block level; however, there are two important caveats:
We will pursue the first strategy. Our tasks include aggregating the data, determining the number of observation points available to train our convolutional neural network, matching our municipality data to a latitude and longitude, and identifying the exact query specifications to download the satellite imagery. We can then begin training our neural network.
final_households = households15_earners.copy()
final_households = final_households[['GEO2_MX', 'INCEARN']]
final_households.head()
We will now aggregate the data above by the location variable, GEO2_MX.
grouped_households = final_households.groupby('GEO2_MX').agg(np.mean)
grouped_households.info()
grouped_households.describe()
grouped_households.hist()
plt.title('Histogram of Average Monthly Income Earned per Municipality')
plt.show()
Having aggregated the data, we note that our number of observation points has been reduced to 2,321. This has significant consequences for our neural network, given that it will be prone to overfitting. From the above plot we can observe the distribution of this monthly average income per municipality, and note that it is largely concentrated between 2,500 to 5,000 pesos per month. This roughly translates to 125-250 USD per month.
Our next step involves mapping the municipality data to a latitude and longitude. To complete this task, we will use the INEGI's Unique Catalogue of State, Municipal, and Locality Geostatistical Area Codes (see Sources below).
data_types = {'mapa': int, 'Cve_Ent': int, 'Nom_Ent': object, 'Nom_Abr': object, 'Cve_Mun': int,
'Nom_Mun': object, 'Cve_Loc': int, 'Nom_Loc': object, 'Latitud': object, 'Longitud': object,
'Lat_Decimal': float, 'Lon_Decimal': float, 'Altitud': object, 'Cve_Carta': object,
'Pob_total' : int, 'Pob_Masculina': int}
loc_catalogue = pd.read_csv('datasets/Location_catalogue/AGEEML_201910301717378.csv', dtype = data_types)
loc_catalogue.head()
Now, we observe that the location catalogue above provides data on the first four geostatistical levels. If we take a closer look at the variable Mapa, we can see that it is an 8- or 9-digit code: the first one or two digits encode the state (1 to 32), the next three the municipality, and the remaining digits the locality. We also notice that for each municipality, a data point with locality '01' is added, which represents the municipality as a whole. Thus, we need to match on the state and municipality digits for all instances that end in '01', as these carry the overall data for the municipality rather than for the localities within it. Let us recall the information we had in our households database:
grouped_households.head()
We notice that the GEO2_MX variable consists of '484' followed by six digits representing the state and municipality, exactly as above. Therefore, we will now extract the relevant location codes from each dataset and merge them using this key.
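Before writing the full transformation, here is a minimal worked example of the matching logic; the Aguascalientes codes below are assumed for illustration only:
# Worked example (codes assumed for illustration): municipality of Aguascalientes.
geo2_mx = '484001001'    # IPUMS GEO2_MX: '484' + state + municipality
mapa = '10010001'        # INEGI Mapa code (leading zero lost when read as an integer)
household_key = geo2_mx.split('484')[1]                     # '001001'
state = mapa[:1] if len(mapa) == 8 else mapa[:2]            # '1'
municipality = mapa[1:4] if len(mapa) == 8 else mapa[2:5]   # '001'
catalogue_key = ('0' + state + municipality).zfill(6)       # '001001'
print(household_key == catalogue_key)                       # True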
grouped_households.reset_index(inplace = True)
# strip the leading country code '484' from GEO2_MX to obtain the state-municipality key
grouped_households['location'] = grouped_households['GEO2_MX'].astype(str).str.split('484').str[1]
loc_catalogue_final = loc_catalogue.copy()
loc_catalogue_final = loc_catalogue_final[['Mapa', 'Lat_Decimal', 'Lon_Decimal']]
# keep only the catalogue entries whose code ends in '01', i.e. the municipality-level records
bool_loc = loc_catalogue_final['Mapa'].astype(str).str[-2:] == '01'
loc_catalogue_final = loc_catalogue_final[bool_loc]
# split the Mapa code into its state and municipality components (8- vs 9-digit codes)
loc_catalogue_final['length'] = loc_catalogue_final['Mapa'].astype(str).str.len()
loc_catalogue_final['state'] = loc_catalogue_final['Mapa'].astype(str).str[0]
loc_catalogue_final.loc[loc_catalogue_final['length'] == 9, 'state'] = loc_catalogue_final['Mapa'].astype(str).str[:2]
loc_catalogue_final['municipality'] = loc_catalogue_final['Mapa'].astype(str).str[1:4]
loc_catalogue_final.loc[loc_catalogue_final['length'] == 9, 'municipality'] = loc_catalogue_final['Mapa'].astype(str).str[2:5]
# rebuild a zero-padded state-municipality key to match the households' 'location' column
loc_catalogue_final['location'] = '0' + loc_catalogue_final['state'].astype(str) + loc_catalogue_final['municipality'].astype(str)
grouped_households.drop('GEO2_MX', inplace = True, axis = 1)
grouped_households.head()
grouped_households['location'].value_counts()
loc_catalogue_final = loc_catalogue_final.groupby('location').agg(np.mean)
loc_catalogue_final.reset_index(inplace = True)
loc_catalogue_final.head()
We observe that we now have a location variable with the same format in both datasets, and we are now ready to merge them.
locations = loc_catalogue_final.copy()
locations = locations[['location', 'Lat_Decimal', 'Lon_Decimal']]
locations['location'] = locations['location'].str.zfill(6)
locations.head()
combined = grouped_households.merge(locations, how = 'left', left_on = 'location', right_on = 'location')
combined.head()
combined.to_csv(r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\households\household_data_combined.csv')
We now have a dataset which will allow us to map average monthly income levels to satellite images.
Now, we have two options:
We will proceed with the second option and leave the first option for future research.
We shall now proceed to selecting the right query to download our images from the Google Maps Static API. As previously mentioned, map parameters and location include:
Considering our input data and our needs (satellite images), the parameters we need to define are the size, scale and zoom level. We will download 500x500 images, since any rescaling can be performed at the pre-processing stage of the convolutional neural network. Now, let us take a look at sample images to determine what best captures our region of interest.
We will now take a look at images generated for the municipality of Aguascalientes and tune the parameters so that the image captures it adequately.
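To get a rough sense of how much ground a given zoom level covers, we can use the standard Web Mercator ground-resolution approximation; the latitude below is an assumed approximate value for Aguascalientes, for illustration only:
# Approximate ground coverage of a Static Maps image (Web Mercator approximation):
# meters per pixel ~ 156543.03 * cos(latitude) / 2**zoom
lat = 21.9   # assumed approximate latitude of Aguascalientes
for zoom in (10, 12):
    m_per_px = 156543.03 * np.cos(np.radians(lat)) / 2 ** zoom
    print('zoom {}: a 500-pixel-wide image spans roughly {:.0f} km'.format(zoom, 500 * m_per_px / 1000))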
print("Municipality of Aguascalientes")
plt.figure(figsize = (15, 18))
aguascalientes_m = plt.imread(r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality.png')
aguascalientes_ms = plt.imread(r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality_s.png')
plt.subplot(121)
plt.imshow(aguascalientes_m)
plt.subplot(122)
plt.imshow(aguascalientes_ms)
plt.show()
print("Images sourced from Google Maps")
Now, let us replicate this using Google's Maps Static API and the coordinates for this Municipality in our combined dataset. We will use a larger size for the moment in order to better gauge the adequacy of the image.
output_images = r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\test\test_download_google\output'
aguas_lat = combined.loc[0, 'Lat_Decimal']
aguas_lon = combined.loc[0, 'Lon_Decimal']
parameters = {'zoom' : 12, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
url = base_text.format(aguas_lat, aguas_lon, parameters['zoom'], parameters['scale'], parameters['size'],
parameters['format'], parameters['maptype'])
try:
    r = requests.get(url, timeout = 100)
except:
    print('[INFO] error downloading image...')
# save the raw response bytes to disk
p = os.path.sep.join([output_images, 'aguascalientes.png'])
with open(p, 'wb') as f:
    f.write(r.content)
title = 'Municipality of Aguascalientes'
print(title)
image = plt.imread(p)
plt.figure(figsize = (10, 18))
plt.imshow(image)
plt.show()
Comparing this image to our satellite image above, we see that a zoom level of 12 is too high: the image is too zoomed in to capture the whole municipality. We will adjust it below.
output_images = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\test\test_download_google\output'
parameters = {'zoom' : 10, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
url = base_text.format(aguas_lat, aguas_lon, parameters['zoom'], parameters['scale'], parameters['size'],
parameters['format'], parameters['maptype'])
try:
    r = requests.get(url, timeout = 100)
except:
    print('[INFO] error downloading image...')
# save the raw response bytes to disk
p = os.path.sep.join([output_images, 'aguascalientes.jpg'])
with open(p, 'wb') as f:
    f.write(r.content)
image = plt.imread(p)
aguascalientes_ms2 = plt.imread('C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\output\main\Aguascalientes_municipality_s2.png')
print()
plt.figure(figsize = (15, 7))
plt.subplot(121)
plt.imshow(image)
plt.title('Image sourced from API')
plt.subplot(122)
plt.imshow(aguascalientes_ms2)
plt.title('Manually crafted image')
plt.suptitle('Comparison of image from API and manually crafted image')
plt.show()
print("Images sourced from Google Maps")
This seems to be the best fit and so we will now use those parameters to download the images for the rest of the data.
output_images = r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\datasets\images'
parameters = {'zoom' : 10, 'scale': 2, 'size': '500x500', 'format': 'JPEG', 'maptype': 'satellite'}
base_text = 'https://maps.googleapis.com/maps/api/staticmap?center={},{}&zoom={}&scale={}&size={}&format={}&maptype={}&key=MYKEY'
for (i, loc) in enumerate(combined['location']):
    # skip municipalities whose image has already been downloaded
    bool_file = r'C:\Users\andre\Documents\Data_Science_Projects\Mexico_Satellite_Imagery\datasets\images\{}.jpg'.format(loc)
    if os.path.exists(bool_file) == False:
        lat = combined.loc[i, 'Lat_Decimal']
        lon = combined.loc[i, 'Lon_Decimal']
        url = base_text.format(lat, lon, parameters['zoom'], parameters['scale'], parameters['size'],
                               parameters['format'], parameters['maptype'])
        try:
            r = requests.get(url, timeout = 100)
        except:
            print('[INFO] error downloading image...')
            continue
        p = os.path.sep.join([output_images, '{}.jpg'.format(str(loc))])
        with open(p, 'wb') as f:
            f.write(r.content)
        if i % 100 == 0:
            print('Successfully downloaded image {}/{}'.format(i, len(combined['location'])))
        time.sleep(0.1)
Now that we have our full dataset, we are ready to begin training our neural network. For this section, we will be using tools and adapting code from the pyimagesearch module developed by Adrian Rosebrock in his book, Deep Learning for Computer Vision with Python (see Sources below).
from keras.applications import VGG16
from keras.applications import imagenet_utils
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
from Model.pyimagesearch.io import HDF5DatasetWriter
from imutils import paths
import progressbar
import random
dataset = r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\datasets\images'
output = r'C:\\Users\\andre\\Documents\\Data_Science_Projects\\Mexico_Satellite_Imagery\Model\output\hdf5_features4.hdf5'
bs = 32
buffer_size = 1000
Let us begin by accessing the images that we previously downloaded from Google's API and labeling them.
imagePaths = list(paths.list_images(dataset))
random.shuffle(imagePaths)
# the filename (without extension) is the six-digit location code of the municipality
label_codes = [path.split(os.path.sep)[-1][:-4] for path in imagePaths]
# use each municipality's row index in the combined dataframe as its integer label
combined['label'] = combined.index
labels = [combined.loc[combined['location'] == loc, 'label'] for loc in label_codes]
labels = [l.iloc[0] for l in labels]
We next instantiate a VGG16 convolutional neural network. By setting include_top=False, we do not forward propagate through the entire model but rather stop before the fully connected layers. This allows us to extract feature representations of each image instead of a set of classification probabilities. We also instantiate an HDF5 dataset writer, which will store the features extracted by the network and pair them with each image's label.
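As a quick sanity check (a sketch, separate from the pipeline below), the 512 * 7 * 7 figure used in the dataset writer corresponds to VGG16's final convolutional feature map for a 224x224 input:
# Sketch: confirm the flattened feature length for 224x224 inputs
probe = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
print(probe.output_shape)   # (None, 7, 7, 512)
print(7 * 7 * 512)          # 25088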
if os.path.exists(output) == False:
    model = VGG16(weights="imagenet", include_top=False)
    dataset = HDF5DatasetWriter((len(imagePaths), 512 * 7 * 7), output, dataKey="features", bufSize=buffer_size)
Next, we initialize our progress bar and begin looping over the images in batches. Within each batch, we loop over each image and pre-process it, resizing it to 224x224. After pre-processing, the batch is forward propagated through the network. At this stage, we extract the features by flattening the model's output; these vectors are then added to our HDF5 dataset, coupled with the images' labels.
if os.path.exists(output) == False:
    widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
               progressbar.Bar(), " ", progressbar.ETA()]
    pbar = progressbar.ProgressBar(maxval=len(imagePaths), widgets=widgets).start()
    # loop over the images in batches
    for i in np.arange(0, len(imagePaths), bs):
        batchPaths = imagePaths[i:i + bs]
        batchLabels = labels[i:i + bs]
        batchImages = []
        # load and pre-process each image in the batch
        for (j, imagePath) in enumerate(batchPaths):
            image = load_img(imagePath, target_size=(224, 224))
            image = img_to_array(image)
            image = np.expand_dims(image, axis=0)
            image = imagenet_utils.preprocess_input(image)
            batchImages.append(image)
        # forward propagate the batch through VGG16 and flatten the resulting feature maps
        batchImages = np.vstack(batchImages)
        features = model.predict(batchImages, batch_size=bs)
        features = features.reshape((features.shape[0], 512 * 7 * 7))
        dataset.add(features, batchLabels)
        pbar.update(i)
    dataset.close()
    pbar.finish()
With our features at hand, we are now ready to train our regression model.
import h5py
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
We begin by preparing our data: we load our features from the HDF5 dataset and label each observation with the average monthly income earned from our combined dataset. We then perform our train-test split.
hdf5_data = h5py.File(output, "r")
features = hdf5_data['features'][()]
codes = hdf5_data['labels'][()]
labels = [combined.loc[combined['label'] == code, 'INCEARN'] for code in codes]
labels = [l.iloc[0] for l in labels]
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 42, test_size = 0.95)
As a next step, we instantiate our support vector regression model and use five-fold cross-validation to select the best hyperparameters. In line with Naik et al. (2016), we fit a Support Vector Regression with a linear kernel.
parameters = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
model = svm.SVR(gamma = 'scale', kernel = 'linear')
clf = GridSearchCV(model, parameters, cv = 5, iid = True)
clf.fit(X_train, y_train)
print('R-squared score (training data): {:.3f}'.format(r2_score(y_train, clf.predict(X_train,))))
hdf5_data.close()
We can observe that, with the best parameter from five-fold cross-validation, our model achieves an $R^2$ of 0.489 on the training data.
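Since the train-test split above also produced a held-out set, a natural follow-up (a sketch, not run as part of the original analysis) is to score it as well:
# Sketch: evaluate the fitted regressor on the held-out split
print('R-squared score (test data): {:.3f}'.format(r2_score(y_test, clf.predict(X_test))))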
As a final step, we will conduct some visual exploration. First, we will perform principal component analysis to visualize the relationship between our extracted features and average monthly income per municipality.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X_train_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components = 1)
principal_c = pca.fit_transform(X_train_scaled)
principal_df = pd.DataFrame(data = principal_c, columns = ['X_pca'])
principal_df['y_pca'] = y_train
principal_df.plot.scatter(x = 'X_pca', y = 'y_pca')
plt.title('Scatter plot of Principal Component 1 and average monthly income earned per municipality')
plt.show()
Lastly, we will map our target variable to visualize its geographic distribution.
os.environ['PROJ_LIB'] = r'C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share'
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize = (15, 6))
m = Basemap(projection = 'lcc', resolution=None, width=6E6, height=3E6, lat_0=25, lon_0=-100)
x, y = m(combined['Lon_Decimal'].tolist(), combined['Lat_Decimal'].tolist())
m.shadedrelief()
plt.scatter(x, y, c = combined['INCEARN'].tolist(), cmap = 'Reds', alpha = 0.5, marker = 'o')
plt.title('Geographic Distribution of Monthly Average Income per Municipality')
plt.colorbar(label='Average Monthly Income per Municipality (pesos)')
plt.show()
As we can see above, average monthly income tends to be higher in the Northeastern and Central parts of Mexico, which are known for their high level of industrial and economic activity. The poorest areas are clearly visible in the figure and are largely in Southern Mexico (Oaxaca, Guerrero) and the Yucatán Peninsula.
In this project, we have fit a support vector regression model with a linear kernel to estimate the monthly average income (in pesos) of a municipality using a satellite image of its location. For this purpose, we cleaned data from the Mexican 2015 Intercensal Survey (kindly provided by IPUMS) and matched it to satellite images by manipulating municipalities' geographic coordinates and income information. Due to the lack of granularity in the data, we performed these estimates at the municipality level. Extracting features from these images using a VGG16 CNN, our model achieved an $R^2$ of 0.489 on the training data.
Many strategies could be pursued in future research to improve the explanatory power of our model. These include sourcing the data directly from the Mexican statistics institute to acquire more granular, locality-level location information, and performing classification instead of regression by dividing income levels into buckets.
Babenko, Boris, Jonathan Hersh, David Newhouse, Anusha Ramakrishnan, and Tom Swartz. "Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico." Presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) Workshop on Machine Learning for the Developing World (December 2017).
Dubey, Abhimanyu, Nikhil Naik, Devi Parikh, Ramesh Raskar and Cesar Hidalgo. "Deep Learning the City: Quantifying Urban Perception at a Global Scale." European Conference on Computer Vision (September 2016): 569-576.
Glaeser, Edward L., Scott Duke Kominers, Michael Luca, and Nikhil Naik. "Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life." Economic Inquiry 56, no. 1 (January 2018): 114–137. (Originally Harvard Business School Working Paper, No. 16-065, November 2015.)
Google Maps Platform. Maps Static API.
Instituto Nacional de Estadística y Geografía. "Catálogo Único de Claves de Áreas Geoestadísticas Estatales, Municipales y Localidades." Link. Accessed October 30, 2019.
Instituto Nacional de Estadística y Geografía. "Censo de Población y Vivienda 2010." Link. Accessed October 30, 2019.
Instituto Nacional de Estadística y Geografía. "Manual de cartografia estadística." Censo de Población y Vivienda (2010).
Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.2 [Mexico Intercensal Survey 2015 - sample ID 484201501]. Minneapolis, MN: IPUMS, 2019. Link
Naik, Nikhil, Ramesh Raskar, and Cesar A. Hidalgo. "Cities are Physical Too: Using Computer Vision to Measure the Quality and Impact of Urban Appearance." American Economic Review, Vol. 106 (May 2016): 128-132.
Naik, Nikhil, Scott Duke Kominers, Ramesh Raskar, Edward L. Glaeser, and Cesar Hidalgo. "Computer vision uncovers predictors of physical urban change." PNAS, Vol. 114, no. 29 (July 2017): 7571-7576.
Rosebrock, Adrian. "Deep Learning for Computer Vision with Python." PyImageSearch. Link. Accessed September 29, 2019.