Loading data from the “descarga masiva”“ de INEGI into PostGIS

The first step is to download all the tsv files from here. http://www3.inegi.org.mx/sistemas/descarga/default.aspx?c=28088 Choose the option to download by federal entity (states and muncipalities) in tsv format. Open up the zip files and place the results in a single directory. The following lines will read all the files into a list in R.


setwd("/home/duncan/descarga")  ## Change to whereever you have the data
a <- dir(pattern = "*Valor*")
d <- lapply(a, function(x) read.table(x, sep = "\t", fill = T, head = T))

Connect to the data base

Now connect to your data base in order to run queries. If you have not set up PostGIS with an ODBC connector you need to do this before running this code. Substitute my connection for your own.


library(RODBC)
con <- odbcConnect("gisdb")

Look at the structure of the data

x <- d[[1]]
str(x)

## 'data.frame':    8208 obs. of  73 variables:
##  $ Cve_Entidad   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Desc_Entidad  : Factor w/ 1 level "Aguascalientes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Cve_Municipio : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Desc_Municipio: Factor w/ 14 levels "Aguascalientes",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Tema_nivel_1  : Factor w/ 4 levels "Economía","Medio ambiente",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tema_nivel_2  : Factor w/ 21 levels "Actividades primarias",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Tema_nivel_3  : Factor w/ 58 levels "Actividades gubernamentales",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ Id_Indicador  : num  5000000001 5000000002 5000000003 5000000004 5000000018 ...
##  $ Indicador     : Factor w/ 890 levels "Accidentes de tránsito en zona suburbana",..: 752 307 753 308 751 306 538 489 23 741 ...
##  $ X1895         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1900         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1910         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1921         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1930         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1940         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1950         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1952         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1960         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1970         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1971         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1978         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1979         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1980         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1981         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1982         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1983         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1984         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1985         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1986         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1987         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1988         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1989         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1990         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1991         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1992         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1993         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1994         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1995         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1996         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1997         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X1998         : num  39990 226457 1 1.2 1.1 ...
##  $ X1999         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2000         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2001         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2002         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2003         : num  43522 290996 1 1.3 1.1 ...
##  $ X2004         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2005         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2006         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2007         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2008         : num  54186 330380 1.1 1.2 1.1 ...
##  $ X2009         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2010         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2010.1       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2010.2       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2010.3       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2010.4       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2011         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2011.1       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2011.2       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2011.3       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2011.4       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2012         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2012.1       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2012.2       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2012.3       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2012.4       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2013         : logi  NA NA NA NA NA NA ...
##  $ X2013.1       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2013.2       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X2013.3       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UnidadMedida  : Factor w/ 51 levels "","Automóviles",..: 49 43 45 45 45 45 45 45 45 49 ...
##  $ X             : logi  NA NA NA NA NA NA ...

Cleaning things up

This is clearly quite challenging data. When R read in the files it added an "X” to the names of the columns representing the years. Many of these columns are simply packed with missing values. We'll use reshape to sort out some of this mess by “melting” the data into a consistent form. The identifiers for the columns will be the unique ID for the state (“Cve_Entidad”), municipio (“Cve_Municipio”) and the index.

library(reshape)

## Loading required package: plyr
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any

id <- c("Cve_Entidad", "Cve_Municipio", "Id_Indicador")

Now we need to define the columns in which the measurements are held. These are all headed by an X.

mes <- grep("X", names(x))
dd <- melt(x, id = id, m = mes, na.rm = T)
head(dd)

##   Cve_Entidad Cve_Municipio Id_Indicador variable    value
## 1           1             0   3104001001    X1895      8.5
## 2           1             0   3103001005    X1895     70.0
## 3           1             0   3105001001    X1900     18.3
## 4           1             0   3104001001    X1900      9.5
## 5           1             0   3103001005    X1900     68.7
## 6           1             0   1002000001    X1910 120511.0

This is looking better. Now if we just remove the X from the variable column and coerce the character to a number we will have the year from which the non missing data has been derived.

dd$variable <- as.numeric(gsub("X", "", dd$variable))
head(dd)

##   Cve_Entidad Cve_Municipio Id_Indicador variable    value
## 1           1             0   3104001001     1895      8.5
## 2           1             0   3103001005     1895     70.0
## 3           1             0   3105001001     1900     18.3
## 4           1             0   3104001001     1900      9.5
## 5           1             0   3103001005     1900     68.7
## 6           1             0   1002000001     1910 120511.0

Now write out the results to a temporary file.

fl <- "/home/duncan/temp.csv"
write.table(dd, file = fl, row.names = F, sep = ",", col.names = FALSE)

We are going to load the results into a table in our PostGIS data base.

I first created a schema called municipios. Then ran this to form a table.

query <- "create schema inegi;"
odbcQuery(con, query)

query <- "CREATE TABLE inegi.data(est_id bigint,mun_id bigint,var_id bigint,yr double precision, val double precision);"
odbcQuery(con, query)

Load the data

query <- paste("COPY inegi.data FROM '", fl, "' DELIMITERS ',' CSV;", sep = "")
print(query)

## [1] "COPY inegi.data FROM '/home/duncan/temp.csv' DELIMITERS ',' CSV;"

Now if we want to load the first states worth of data this would do it.

odbcQuery(con, query)

But, we want to load all the states at once. So either roll the code into a function and lapply it, or run it in a loop like this

Load all the data


for (i in 1:32) {
    x <- d[[i]]
    id <- c("Cve_Entidad", "Cve_Municipio", "Id_Indicador")
    mes <- grep("X", names(x))
    dd <- melt(x, id = id, m = mes, na.rm = T)
    dd$variable <- as.numeric(gsub("X", "", dd$variable))
    fl <- "/home/duncan/temp.csv"
    write.table(dd, file = fl, row.names = F, sep = ",", col.names = FALSE)
    query <- paste("COPY inegi.data FROM '", fl, "' DELIMITERS ',' CSV;", sep = "")
    print(query)
    print(odbcQuery(con, query))
}

Adding a table holding the variable names

So far so good. However all that we are left with are the values alongside some meaningless numbers representing the variables. We need to add a table that we can use to cross reference these numbers.

We can do this by taking the first file again and extracting all the unique values that it holds.

x <- d[[1]]
x <- data.frame(id = x$Id_Indicador, tema1 = x$Tema_nivel_1, tema2 = x$Tema_nivel_2, 
    tema3 = x$Tema_nivel_3, var = x$Indicador)
x <- x[unique(x$var), ]
head(x)

##             id               tema1                         tema2
## 752 3103004004 Sociedad y Gobierno Empleo y relaciones laborales
## 307 5300000069            Economía        Actividades terciarias
## 753 3103004005 Sociedad y Gobierno Empleo y relaciones laborales
## 308 5300000079            Economía        Actividades terciarias
## 751 3103004003 Sociedad y Gobierno Empleo y relaciones laborales
## 306 5300000059            Economía        Actividades terciarias
##                                          tema3
## 752 Características del empleo de la población
## 307      Transportes, correos y almacenamiento
## 753 Características del empleo de la población
## 308      Transportes, correos y almacenamiento
## 751 Características del empleo de la población
## 306      Transportes, correos y almacenamiento
##                                                                                 var
## 752                            Porcentaje de hombres ocupados en el sector primario
## 307 Total de personal no dependiente de la razón social. Sector 48-49. Transportes.
## 753                          Porcentaje de hombres ocupados en el sector secundario
## 308                                 Personal remunerado. Sector 48-49. Transportes.
## 751                    Porcentaje de población total ocupada en el sector terciario
## 306    Total de personal dependiente de la razón social. Sector 48-49. Transportes.

Load the variable ids into the data base.

We need another table in the municipios schema.

query <- "CREATE TABLE inegi.ids(var_id bigint,tema1 text,tema2 text,tema3 text,variable text);"
odbcQuery(con, query)

Then run a query to load all the data.


fl <- "/home/duncan/temp.csv"
write.table(x, file = fl, row.names = F, sep = ",", col.names = FALSE)
query <- paste("COPY inegi.ids FROM '", fl, "' DELIMITERS ',' CSV;", sep = "")
print(query)
print(odbcQuery(con, query))

So, that wasn't so hard (the second time around!)

Now the challenge is making sense of it all and finding some interesting variables in the pile.


query <- "select * from inegi.ids"
d <- sqlQuery(con, query)

d[grep("homicid", d$variable), ]

##         var_id               tema1                        tema2
## 83  1006000018 Sociedad y Gobierno Seguridad pública y Justicia
## 106 1006000006 Sociedad y Gobierno Seguridad pública y Justicia
## 504 1006000051 Sociedad y Gobierno Seguridad pública y Justicia
##                                     tema3
## 83  Procuración e Impartición de justicia
## 106 Procuración e Impartición de justicia
## 504 Procuración e Impartición de justicia
##                                                                                                         variable
## 83            Sentenciados por homicidio registrados en los juzgados del fuero común (Catálogo Único de Delitos)
## 106 Presuntos delincuentes por homicidio registrados en los juzgados del fuero común (Catálogo Único de Delitos)
## 504                                                   Delitos por homicidio registrados en el MP del fuero común

So this will pull down all the data for presumed homicides.

query <- "select * from inegi.data where var_id=1006000006"
d <- sqlQuery(con, query)

There are a few additional things you need to know in order to do anything useful with the data.

In order to visualise the tables you need to load some shapefiles of Mexican administrative districts into the database.

https://dl.dropboxusercontent.com/u/2703650/shapefilesINEGI.zip

These can be loaded into the schema using any of the usual ways.

Once in the schema you can now relate the codes for the municipios and states with their actual names. Any row that has a zero for muncipio ID represents state totals or means.

So now once the attributes of the shapefile are available for joining on, to get homicides by states requires a fairly simple query, which can be easily adapted for any other variable.

query <- "select nombre,yr,val,pob from(select yr,est_id,val  from inegi.data where var_id=1006000051 and mun_id=0) a, inegi.estados2010 m where m.est_id=a.est_id"

d <- sqlQuery(con, query)
head(d)

##    nombre   yr val     pob
## 1 Yucatán 2001 163 1936370
## 2 Yucatán 2002 267 1936370
## 3 Yucatán 2003 292 1936370
## 4 Yucatán 2004 236 1936370
## 5 Yucatán 2005 244 1936370
## 6 Yucatán 2006 232 1936370

A similar query can be used for municipal districts.

query <- "select a.est_id,nom_mun,yr,val,pob from (select *\n from inegi.data where var_id=1006000051 and mun_id>0) a, inegi.municipios2010 m where m.est_id=a.est_id and m.mun_id=a.mun_id;"

d <- sqlQuery(con, query)
head(d)

##   est_id  nom_mun   yr val   pob
## 1     31    Abalá 2004   0  6356
## 2     31  Acanceh 2004   0 15337
## 3     31     Akil 2004   0 10362
## 4     31     Baca 2004   0  5701
## 5     31   Bokobá 2004   0  2053
## 6     31 Buctzotz 2004   0  8637