The first step is to download all the tsv files from here. http://www3.inegi.org.mx/sistemas/descarga/default.aspx?c=28088 Choose the option to download by federal entity (states and muncipalities) in tsv format. Open up the zip files and place the results in a single directory. The following lines will read all the files into a list in R.
setwd("/home/duncan/descarga") ## Change to whereever you have the data
a <- dir(pattern = "*Valor*")
d <- lapply(a, function(x) read.table(x, sep = "\t", fill = T, head = T))
Now connect to your data base in order to run queries. If you have not set up PostGIS with an ODBC connector you need to do this before running this code. Substitute my connection for your own.
library(RODBC)
con <- odbcConnect("gisdb")
x <- d[[1]]
str(x)
## 'data.frame': 8208 obs. of 73 variables:
## $ Cve_Entidad : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Desc_Entidad : Factor w/ 1 level "Aguascalientes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Cve_Municipio : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Desc_Municipio: Factor w/ 14 levels "Aguascalientes",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Tema_nivel_1 : Factor w/ 4 levels "Economía","Medio ambiente",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Tema_nivel_2 : Factor w/ 21 levels "Actividades primarias",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Tema_nivel_3 : Factor w/ 58 levels "Actividades gubernamentales",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ Id_Indicador : num 5000000001 5000000002 5000000003 5000000004 5000000018 ...
## $ Indicador : Factor w/ 890 levels "Accidentes de tránsito en zona suburbana",..: 752 307 753 308 751 306 538 489 23 741 ...
## $ X1895 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1900 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1910 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1921 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1930 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1940 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1950 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1952 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1960 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1970 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1971 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1978 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1979 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1980 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1981 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1982 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1983 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1984 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1985 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1986 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1987 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1988 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1989 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1990 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1991 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1992 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1993 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1994 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1995 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1996 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1997 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X1998 : num 39990 226457 1 1.2 1.1 ...
## $ X1999 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2000 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2001 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2002 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2003 : num 43522 290996 1 1.3 1.1 ...
## $ X2004 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2005 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2006 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2007 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2008 : num 54186 330380 1.1 1.2 1.1 ...
## $ X2009 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2010 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2010.1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2010.2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2010.3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2010.4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2011 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2011.1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2011.2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2011.3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2011.4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2012 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2012.1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2012.2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2012.3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2012.4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2013 : logi NA NA NA NA NA NA ...
## $ X2013.1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2013.2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2013.3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ UnidadMedida : Factor w/ 51 levels "","Automóviles",..: 49 43 45 45 45 45 45 45 45 49 ...
## $ X : logi NA NA NA NA NA NA ...
This is clearly quite challenging data. When R read in the files it added an "X” to the names of the columns representing the years. Many of these columns are simply packed with missing values. We'll use reshape to sort out some of this mess by “melting” the data into a consistent form. The identifiers for the columns will be the unique ID for the state (“Cve_Entidad”), municipio (“Cve_Municipio”) and the index.
library(reshape)
## Loading required package: plyr
##
## Attaching package: 'reshape'
##
## The following objects are masked from 'package:plyr':
##
## rename, round_any
id <- c("Cve_Entidad", "Cve_Municipio", "Id_Indicador")
Now we need to define the columns in which the measurements are held. These are all headed by an X.
mes <- grep("X", names(x))
dd <- melt(x, id = id, m = mes, na.rm = T)
head(dd)
## Cve_Entidad Cve_Municipio Id_Indicador variable value
## 1 1 0 3104001001 X1895 8.5
## 2 1 0 3103001005 X1895 70.0
## 3 1 0 3105001001 X1900 18.3
## 4 1 0 3104001001 X1900 9.5
## 5 1 0 3103001005 X1900 68.7
## 6 1 0 1002000001 X1910 120511.0
This is looking better. Now if we just remove the X from the variable column and coerce the character to a number we will have the year from which the non missing data has been derived.
dd$variable <- as.numeric(gsub("X", "", dd$variable))
head(dd)
## Cve_Entidad Cve_Municipio Id_Indicador variable value
## 1 1 0 3104001001 1895 8.5
## 2 1 0 3103001005 1895 70.0
## 3 1 0 3105001001 1900 18.3
## 4 1 0 3104001001 1900 9.5
## 5 1 0 3103001005 1900 68.7
## 6 1 0 1002000001 1910 120511.0
Now write out the results to a temporary file.
fl <- "/home/duncan/temp.csv"
write.table(dd, file = fl, row.names = F, sep = ",", col.names = FALSE)
We are going to load the results into a table in our PostGIS data base.
I first created a schema called municipios. Then ran this to form a table.
query <- "create schema inegi;"
odbcQuery(con, query)
query <- "CREATE TABLE inegi.data(est_id bigint,mun_id bigint,var_id bigint,yr double precision, val double precision);"
odbcQuery(con, query)
query <- paste("COPY inegi.data FROM '", fl, "' DELIMITERS ',' CSV;", sep = "")
print(query)
## [1] "COPY inegi.data FROM '/home/duncan/temp.csv' DELIMITERS ',' CSV;"
Now if we want to load the first states worth of data this would do it.
odbcQuery(con, query)
But, we want to load all the states at once. So either roll the code into a function and lapply it, or run it in a loop like this
for (i in 1:32) {
x <- d[[i]]
id <- c("Cve_Entidad", "Cve_Municipio", "Id_Indicador")
mes <- grep("X", names(x))
dd <- melt(x, id = id, m = mes, na.rm = T)
dd$variable <- as.numeric(gsub("X", "", dd$variable))
fl <- "/home/duncan/temp.csv"
write.table(dd, file = fl, row.names = F, sep = ",", col.names = FALSE)
query <- paste("COPY inegi.data FROM '", fl, "' DELIMITERS ',' CSV;", sep = "")
print(query)
print(odbcQuery(con, query))
}
So far so good. However all that we are left with are the values alongside some meaningless numbers representing the variables. We need to add a table that we can use to cross reference these numbers.
We can do this by taking the first file again and extracting all the unique values that it holds.
x <- d[[1]]
x <- data.frame(id = x$Id_Indicador, tema1 = x$Tema_nivel_1, tema2 = x$Tema_nivel_2,
tema3 = x$Tema_nivel_3, var = x$Indicador)
x <- x[unique(x$var), ]
head(x)
## id tema1 tema2
## 752 3103004004 Sociedad y Gobierno Empleo y relaciones laborales
## 307 5300000069 Economía Actividades terciarias
## 753 3103004005 Sociedad y Gobierno Empleo y relaciones laborales
## 308 5300000079 Economía Actividades terciarias
## 751 3103004003 Sociedad y Gobierno Empleo y relaciones laborales
## 306 5300000059 Economía Actividades terciarias
## tema3
## 752 Características del empleo de la población
## 307 Transportes, correos y almacenamiento
## 753 Características del empleo de la población
## 308 Transportes, correos y almacenamiento
## 751 Características del empleo de la población
## 306 Transportes, correos y almacenamiento
## var
## 752 Porcentaje de hombres ocupados en el sector primario
## 307 Total de personal no dependiente de la razón social. Sector 48-49. Transportes.
## 753 Porcentaje de hombres ocupados en el sector secundario
## 308 Personal remunerado. Sector 48-49. Transportes.
## 751 Porcentaje de población total ocupada en el sector terciario
## 306 Total de personal dependiente de la razón social. Sector 48-49. Transportes.
We need another table in the municipios schema.
query <- "CREATE TABLE inegi.ids(var_id bigint,tema1 text,tema2 text,tema3 text,variable text);"
odbcQuery(con, query)
Then run a query to load all the data.
fl <- "/home/duncan/temp.csv"
write.table(x, file = fl, row.names = F, sep = ",", col.names = FALSE)
query <- paste("COPY inegi.ids FROM '", fl, "' DELIMITERS ',' CSV;", sep = "")
print(query)
print(odbcQuery(con, query))
So, that wasn't so hard (the second time around!)
Now the challenge is making sense of it all and finding some interesting variables in the pile.
query <- "select * from inegi.ids"
d <- sqlQuery(con, query)
d[grep("homicid", d$variable), ]
## var_id tema1 tema2
## 83 1006000018 Sociedad y Gobierno Seguridad pública y Justicia
## 106 1006000006 Sociedad y Gobierno Seguridad pública y Justicia
## 504 1006000051 Sociedad y Gobierno Seguridad pública y Justicia
## tema3
## 83 Procuración e Impartición de justicia
## 106 Procuración e Impartición de justicia
## 504 Procuración e Impartición de justicia
## variable
## 83 Sentenciados por homicidio registrados en los juzgados del fuero común (Catálogo Único de Delitos)
## 106 Presuntos delincuentes por homicidio registrados en los juzgados del fuero común (Catálogo Único de Delitos)
## 504 Delitos por homicidio registrados en el MP del fuero común
So this will pull down all the data for presumed homicides.
query <- "select * from inegi.data where var_id=1006000006"
d <- sqlQuery(con, query)
There are a few additional things you need to know in order to do anything useful with the data.
In order to visualise the tables you need to load some shapefiles of Mexican administrative districts into the database.
https://dl.dropboxusercontent.com/u/2703650/shapefilesINEGI.zip
These can be loaded into the schema using any of the usual ways.
Once in the schema you can now relate the codes for the municipios and states with their actual names. Any row that has a zero for muncipio ID represents state totals or means.
So now once the attributes of the shapefile are available for joining on, to get homicides by states requires a fairly simple query, which can be easily adapted for any other variable.
query <- "select nombre,yr,val,pob from(select yr,est_id,val from inegi.data where var_id=1006000051 and mun_id=0) a, inegi.estados2010 m where m.est_id=a.est_id"
d <- sqlQuery(con, query)
head(d)
## nombre yr val pob
## 1 Yucatán 2001 163 1936370
## 2 Yucatán 2002 267 1936370
## 3 Yucatán 2003 292 1936370
## 4 Yucatán 2004 236 1936370
## 5 Yucatán 2005 244 1936370
## 6 Yucatán 2006 232 1936370
A similar query can be used for municipal districts.
query <- "select a.est_id,nom_mun,yr,val,pob from (select *\n from inegi.data where var_id=1006000051 and mun_id>0) a, inegi.municipios2010 m where m.est_id=a.est_id and m.mun_id=a.mun_id;"
d <- sqlQuery(con, query)
head(d)
## est_id nom_mun yr val pob
## 1 31 Abalá 2004 0 6356
## 2 31 Acanceh 2004 0 15337
## 3 31 Akil 2004 0 10362
## 4 31 Baca 2004 0 5701
## 5 31 Bokobá 2004 0 2053
## 6 31 Buctzotz 2004 0 8637