Loading the dataset:
library(readr)
dataset <- read_csv("D:/Dropbox/MsC UABC/TESIS/Evarista/micronucleos/stats pack/2 dataset modifications/Datos Completos con LatLong.csv")
This produces a dataset with blanks in rows 52:79 and in columns x34 and x39:x68. They need to be removed:
dataset <- dataset[-52:-79,-34]
dataset <- dataset[,-38:-39]
Now we have a dataset with the following dimensions: 51 rows and 37 columns.
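The hard-coded row and column indices could also be derived from the data itself. The following is a minimal sketch on a toy data frame, assuming the blanks were read in as NA; `raw`, `keep_rows`, `keep_cols` and `clean` are hypothetical names and the values are made up.

```r
# Toy data frame standing in for the raw import (hypothetical values).
raw <- data.frame(
  a = c(1, 2, NA, NA),
  b = c("x", "y", NA, NA),
  c = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only rows and columns that hold at least one non-missing value.
keep_rows <- rowSums(!is.na(raw)) > 0
keep_cols <- colSums(!is.na(raw)) > 0
clean <- raw[keep_rows, keep_cols, drop = FALSE]

# Rows and columns remaining after the cleanup.
dim(clean)
```

Applied to the real dataset, `dim()` is also a quick way to confirm the 51 rows and 37 columns reported above.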
Getting to know the data
[1] "Exposicion" "X" "Y" "# de Muestra"
[5] "Colonia" "Genero" "Edad" "Lugar de procedencia"
[9] "Estado civil" "Años viviendo en la colonia" "Distancia a CEMEX m" "Log (10) Distancia"
[13] "Escolaridad" "Trabajo" "Lugar de Trabajo Antes /Ahora" "Usa o Uso Qimicos"
[17] "Seguro Medico" "Estado de Salud" "Estilo de Vida" "Alimentación"
[21] "Condición de la vivienda" "Condición de Exteriores" "Peso en Kilos" "Estatura (m)"
[25] "Diametro Cintura (cm)" "Glucosa" "Micronucleo" "Nucleo lobulado"
[29] "Binucleadas" "Pignosis" "Cromatina Condensada" "Cariolisis"
[33] "Cariorrexis" "Estatura (m)_1" "Peso en Kilos_1" "Talla2 (m)"
[37] "IMC"
Boolean variables
- Exposicion
- Género
- Trabajo
- Usa o usó químicos
- Seguro médico
- Cariorrexis
Variables with nominal values
- Colonia
- Lugar de procedencia
- Estado Civil
- No. de muestra
- Lugar de trabajo antes/ahora (not useful)
- Estado de salud
- Estilo de vida
- Alimentación
- Condición de la vivienda
- Condición de exteriores
Variables with ordinal values
- Años viviendo en la colonia
- Escolaridad
Variables with discrete values
- x
- y
- Edad
- Peso en \(kg\)
- Diámetro de cintura en cm
- Micronucleo
- Nucleo lobulado
- Binucleadas
- Pignosis
- Cromatina condensada
- Cariolisis
Variables with continuous values
- Distancia a CEMEX m
- Log(10) distancia
- Estatura en m
- Glucosa (\(\mu g/dL\))
- Talla\(^2\) en \(m\)
- IMC
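The classification above can be carried over into R column types. This is only a sketch on made-up rows: the column names come from the list above, but the values, the `toy` data frame and the level order of Escolaridad are assumptions.

```r
# Toy rows using three of the column names listed above (values are made up).
toy <- data.frame(
  Colonia     = c("Industrial", "Jalisco", "Villa Bonita"),
  Escolaridad = c("Primaria", "Secundaria", "Primaria"),
  Edad        = c(34, 51, 28),
  stringsAsFactors = FALSE
)

# Nominal variable: unordered factor.
toy$Colonia <- factor(toy$Colonia)

# Ordinal variable: ordered factor (the level order is an assumption).
toy$Escolaridad <- factor(toy$Escolaridad,
                          levels = c("Primaria", "Secundaria"),
                          ordered = TRUE)

# Discrete variable: integer.
toy$Edad <- as.integer(toy$Edad)

# Inspect the resulting column types.
str(toy)
```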
Descriptive Statistics


Correlation Matrices
The dataset comes from a study on the influence of contamination generated by a cement factory on a local population, evaluated through cytotoxic damage.
The target variables are therefore:
- Micronucleo
- Nucleo lobulado
- Binucleadas
- Pignosis
- Cromatina condensada
- Cariolisis
Calculating correlations for all 6 target variables, together with Cariorrexis, Distancia a CEMEX m, Glucosa and IMC:
Computing the correlation matrix:
                      Micronucleo  Nucleo lobulado  Binucleadas  Pignosis  Cromatina Condensada  Cariolisis  Cariorrexis  Distancia a CEMEX m  Glucosa    IMC
Micronucleo                  1.00             0.28         0.46      0.21                  0.48        0.26         0.03                -0.42     0.42   0.22
Nucleo lobulado              0.28             1.00         0.64      0.25                 -0.13        0.32         0.15                 0.02     0.43   0.29
Binucleadas                  0.46             0.64         1.00      0.27                  0.14        0.42         0.22                -0.07     0.19   0.35
Pignosis                     0.21             0.25         0.27      1.00                  0.15        0.25        -0.11                -0.35     0.41   0.29
Cromatina Condensada         0.48            -0.13         0.14      0.15                  1.00        0.06        -0.03                -0.35     0.13   0.12
Cariolisis                   0.26             0.32         0.42      0.25                  0.06        1.00         0.45                -0.14     0.55   0.38
Cariorrexis                  0.03             0.15         0.22     -0.11                 -0.03        0.45         1.00                 0.08     0.11   0.42
Distancia a CEMEX m         -0.42             0.02        -0.07     -0.35                 -0.35       -0.14         0.08                 1.00    -0.30   0.04
Glucosa                      0.42             0.43         0.19      0.41                  0.13        0.55         0.11                -0.30     1.00   0.22
IMC                          0.22             0.29         0.35      0.29                  0.12        0.38         0.42                 0.04     0.22   1.00
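A matrix of this shape could be produced along the following lines. This is a sketch on simulated data, not the original computation: the source does not state the correlation method, so Pearson (the `cor()` default) is assumed, and the `toy` values are random. The name `res` matches the object used in the `corrplot()` call.

```r
# Simulated stand-in for three of the numeric columns (random values).
set.seed(1)
toy <- data.frame(
  Micronucleo = rpois(20, 3),
  Glucosa     = rnorm(20, 100, 15),
  IMC         = rnorm(20, 27, 4)
)

# Pairwise Pearson correlations, rounded to two decimals as in the table above.
res <- round(cor(toy, use = "pairwise.complete.obs"), 2)
res
```

The `use = "pairwise.complete.obs"` argument keeps every pair of observations that is complete for the two variables being compared, which matters if some measurements are missing.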
Computing significance levels and visualizing the results in a formatted table:
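The code for this step is not shown in the report. One way to obtain a matrix of p-values with base R is to call `cor.test()` on every pair of columns; the sketch below does this on simulated data, and `toy`, `vars` and `p` are hypothetical names.

```r
# Simulated stand-in for three of the numeric columns (random values).
set.seed(1)
toy <- data.frame(
  Micronucleo = rpois(20, 3),
  Glucosa     = rnorm(20, 100, 15),
  IMC         = rnorm(20, 27, 4)
)

# Empty p-value matrix labelled with the variable names.
vars <- names(toy)
p <- matrix(NA_real_, length(vars), length(vars),
            dimnames = list(vars, vars))

# Pearson correlation test for each pair; the diagonal is left as NA.
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      p[i, j] <- cor.test(toy[[i]], toy[[j]])$p.value
    }
  }
}

round(p, 3)
```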
Generating a correlogram:
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)

Using INEGI data to get an idea of the magnitude of the sample compared to the population of these colonias.
The query was made through this link; after selecting the state of Baja California, the data were downloaded in SPSS format. Since the open data provided by INEGI do not include the names of colonias, I had to manually consult the Inventario Nacional de Viviendas 2016, an interactive map with layers of information, where I was able to spot the names of some of the colonias mentioned in the report I am analyzing. Once I had located the colonias, I extracted the AGEB number manually, so that the database could be manipulated:
- Industrial: 0200100017930
- Jalisco: 0200100017930
- Villa Bonita: 020010001845A
I have not been able to locate the following colonias:
- Costa Bella II
- Este
- Norte
- Oeste
- Sur
Loading SPSS data into this R project:
library(foreign)
db <- file.choose()
inegi <- read.spss(db, to.data.frame = TRUE)
Subsetting the dataframe to retain only the AGEBs of interest
- Subsetting AGEB 0200100017930 (Industrial and Jalisco):
inegi7930 <- subset(inegi, AGEB == "7930",
select=c(POBTOT, AGEB))
head(inegi7930)
This produces a table with 2 columns and 44 rows, and a total population of 4193 people living in this area.
- Subsetting AGEB 020010001845A (Villa Bonita):
inegi845A <- subset(inegi, AGEB == "845A",
select=c(POBTOT, AGEB))
head(inegi845A)
This produces a table with 2 columns and 32 rows, and a total population of 2254 people living in this area.
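The row counts and population totals quoted for each AGEB can be read off a subset with `nrow()` and `sum()`. The sketch below uses a toy table with made-up numbers, not real census values; `inegi_toy` and `sub7930` are hypothetical names, and POBTOT is assumed to be numeric.

```r
# Toy stand-in for the INEGI table (hypothetical rows, not census values).
inegi_toy <- data.frame(
  AGEB   = c("7930", "7930", "845A", "845A", "845A"),
  POBTOT = c(120, 80, 50, 70, 30),
  stringsAsFactors = FALSE
)

# Same subsetting pattern as above, applied to the toy table.
sub7930 <- subset(inegi_toy, AGEB == "7930", select = c(POBTOT, AGEB))

# Number of rows in the subset.
nrow(sub7930)

# Total population across those rows.
sum(sub7930$POBTOT, na.rm = TRUE)
```

If `read.spss()` returns POBTOT as a factor rather than a number, it would need converting with `as.numeric(as.character(...))` before summing.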
Considering the data acquired from INEGI and, according to the law of large numbers, we can make the following statements:
- For Industrial and Jalisco, AGEB 7930:
  - The percentage of the population tested (including controls) was 1.049368%.
- For Villa Bonita, AGEB 845A:
  - The percentage of the population tested (including controls) was 1.4196983%.
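The arithmetic behind these percentages can be written out as below. The population totals are the ones reported for each AGEB. The counts 44 and 32 are the values that reproduce the reported figures; they equal the row counts of the two INEGI subsets, and the source does not state explicitly that they are the number of people tested, so that reading is an assumption. `pct` is a hypothetical helper.

```r
# Population totals reported for each AGEB.
pop_7930 <- 4193
pop_845A <- 2254

# Counts that reproduce the reported percentages (assumed to be people tested).
n_7930 <- 44
n_845A <- 32

# Percentage of the population represented by a count.
pct <- function(n, pop) 100 * n / pop

pct(n_7930, pop_7930)
pct(n_845A, pop_845A)
```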
Chart of Correlation Matrix

