According to Biggs et al. (2021), an essential step in building socio-ecological models is surveying the attributes that will make up their analytical structure. These authors hold that each attribute must be identified and analyzed so that the data profile and its possible applications, whether descriptive or for information mapping, can be understood.
Along the same lines, Ellis (2020) argued that population attributes associated with anthromes, mainly demographic density, are fundamental for modeling anthropogenic biomes at both global and local scales. Following this guidance on the demographic aspects of anthromes, in this work we carried out an exploratory analysis of census data produced by the Brazilian Institute of Geography and Statistics (IBGE).
The exploratory analysis aimed to identify the attributes that made up the data from the census operation carried out by IBGE in 2010 (IBGE, 2013a). Furthermore, we tried to recognize special characteristics that would allow the integration of the tabular data provided by this institution, intending to expand the data set for modeling anthromes locally.
In addition, we sought to evaluate the possibility of plotting these data, that is, of spatially distributing census information in local maps. This operation was performed in the R® software, following the investigative guidelines presented by Lovelace et al. (2019) and Anderson (2021) for exploratory analysis and for the creation of maps and plots of geographic information. These authors adopt a critical-analytical format in their works, demonstrating the logic involved in achieving the objectives just stated for this work.
Throughout the exploratory analysis, we present detailed summaries of the functions used, which were drawn from the two works just cited. Through this research, we surveyed the characteristics of census data, in vector and raster formats, that would allow their use in structuring the decision tree for classifying anthromes locally. We emphasize that this was an essential step in building the model of anthropogenic biomes in the R® language. In it, we recognized attributes of the demographic data that align with those identified by Gauthier (2021) and Ellis, Beusen and Goldewijk (2020) as fundamental for mapping anthromes in R®.
We highlight that this manuscript does not follow the conventional structure in which “Introduction”, “Methodology/Materials and Methods”, “Results and Discussion” and “Conclusions” are detailed separately. Here, we report in a programming-oriented sequence: we first describe what was carried out in the R software (Methodology), then present the R code (Methodology and Results) and, as the results are generated by the program, discuss them immediately afterwards (Discussion). This format was adopted to facilitate analytical understanding and to link the analyses carried out.
In the first stage of the exploratory analysis of demographic data, the tabular files provided by the Brazilian Institute of Geography and Statistics (IBGE) were downloaded from the institution’s digital platform. According to IBGE, the lowest level of data disaggregation is the micro data from the 2010 Census, that is, this data contains information for each of the cities investigated by IBGE during the demographic census. These data show the distribution of the municipal population in urban and rural areas and also in different urban systems, such as municipal headquarters or outside the municipal headquarters. The web page where this data is available is:
We point out that we used data referring to the State of São Paulo (Brazil) as an experimental model for mapping anthromes locally, as this Federation Unit (UF) encompasses different territorial typologies (land uses and covers) and has significant representation in the national economy, politics and management, as well as in symbolic distribution and population size.
The files downloaded from the IBGE platform were placed in a folder associated with the work (directory) for later use in the R® software. This folder contains the guidance documentation provided by the Brazilian Institute, the microdata and the tables referring to the population of the State of São Paulo recorded in the 2010 Census. The tabular data are in the “.xls” format (Microsoft Excel) for import into R®.
We point out that some adjustments to the content of the tables were necessary, as their original layout prevented the files from being read correctly in the software. Therefore, the tabular files were opened in Microsoft Access 365 to remove titles and additional information, such as subtitles, captions and bibliographic references, which appeared in the original data. Thus, the edited data retained only the names of the attributes (first line of each column) and the census data for each attribute necessary for analysis.
Furthermore, we emphasize that the numeric tabular data contained spacing between digit groups and a hyphen (-) in null values, characteristics that prevented the data from being read as numeric values, so that R interpreted them as “characters”. Therefore, we edited the tables made available by IBGE, removing spaces and replacing hyphens with zeros (0) in the sets analyzed in this work. Please note that these edits were made in Microsoft Excel 365.
The operations performed in Access 365 and Excel 365 are not reported in the text. However, the edited tabular files, in “.xls” format, are provided as supplementary files for this work on the EcoMetrologia Project’s GitHub repository (https://github.com/maximilianogobbo/landuseplanning.git) and can be accessed remotely. Furthermore, all documents that make up the demographic data portfolio, including the R and R Markdown scripts, were saved in a single directory to streamline the handling and analysis of the data in the software. The getwd() function shows the current working directory, the location where all the documents for this investigation are stored:
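The edits above were made in Excel. For readers who prefer to keep this cleaning step inside R, a minimal sketch is given below. It assumes that spaces only separate digit groups and that a lone hyphen marks a null value. The helper `clean_census_number` and the example values are hypothetical and are not taken from the IBGE tables.

```r
# Hypothetical helper: convert census strings such as "11 253 503" or "-"
# into numeric values (assumption: spaces only separate digit groups).
clean_census_number <- function(x) {
  x <- trimws(x)                   # drop leading/trailing blanks
  x[x == "-"] <- "0"               # a lone hyphen marks a null value
  x <- gsub("[[:space:]]", "", x)  # remove spacing between digit groups
  as.numeric(x)
}

raw_values <- c("11 253 503", "805", "-", " 3 865 ")  # made-up examples
cleaned <- clean_census_number(raw_values)
print(cleaned)
```

Doing the conversion in code rather than by hand keeps the step reproducible, although it does not replace the removal of titles and captions described above.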
## [1] "C:/ARQUIVOS COMPUTADOR/DOUTORADO/DOUTORADO TESE/03 DADOS GEOESPACIAIS/02.1 DEMOGRAPHIC"
Of all the documents downloaded from the IBGE platform, only three were used in the first phase of the exploratory analysis, as only these contained information about the geographic location of the municipalities in the State of São Paulo, the population in each of the subdivisions established in the census, and the area and/or demographic density of each municipality.
Below, the loading of each table into R® is presented separately, using the read_excel() function. For this function to operate, the file name and the directory where the tables were saved were indicated, as illustrated in Script 1 below.
Furthermore, in this preliminary phase, two other functions were used subsequent to loading. The names() function, to identify the name of the data set attributes (first line of tabular data), and the summary() function, which offers a synthesis of the data analyzed by it, whether in qualitative terms (characters) or in quantitative terms (numerical and statistical).
The first table loaded into the software was “population01.xls”, using the read_excel() function. We then stored the table as an object (data frame) named population01. Using the names() function, we checked the names of the attributes in this data set. Subsequently, we used the summary() function to obtain a qualitative and quantitative synthesis of the population01 data frame. Script 1 (code) illustrates this preliminary procedure in the R® language.
Script: Loading and Preliminary Analysis of population01
names(population01)
## [1] "city" "Área Urbanizada"
## [3] "Área não Urbanizada" "Área Urbana Isolada"
## [5] "Área Rural (Exceto Aglomerado)" "Aglomerado Rural de Extensão Urbana"
## [7] "Aglomerado Rural Povoado" "Aglomerado Rural Núcleo"
## [9] "Outros Aglomerados Rurais Raros" "Código da Unidade Geográfica"
summary(population01)
## city Área Urbanizada Área não Urbanizada Área Urbana Isolada
## Length:645 Min. : 627 Min. : 0 Min. : 0.0
## Class :character 1st Qu.: 3753 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Median : 9485 Median : 0 Median : 0.0
## Mean : 59817 Mean : 1048 Mean : 508.3
## 3rd Qu.: 33907 3rd Qu.: 15 3rd Qu.: 79.0
## Max. :11065838 Max. :65912 Max. :41236.0
## Área Rural (Exceto Aglomerado) Aglomerado Rural de Extensão Urbana
## Min. : 0 Min. : 0.0
## 1st Qu.: 591 1st Qu.: 0.0
## Median : 1218 Median : 0.0
## Mean : 2244 Mean : 246.9
## 3rd Qu.: 2780 3rd Qu.: 0.0
## Max. :45899 Max. :54903.0
## Aglomerado Rural Povoado Aglomerado Rural Núcleo
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000
## Mean : 56.94 Mean : 8.567
## 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :6185.00 Max. :813.000
## Outros Aglomerados Rurais Raros Código da Unidade Geográfica
## Min. : 0.00 Min. :3500105
## 1st Qu.: 0.00 1st Qu.:3514601
## Median : 0.00 Median :3528700
## Mean : 40.07 Mean :3528698
## 3rd Qu.: 0.00 3rd Qu.:3543204
## Max. :2889.00 Max. :3557303
Source: the authors (2023). Caption: preliminary analysis of the population data set using the functions: names () and summary () in the R software.
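The loading step described for Script 1 is not shown in the output above. A minimal sketch of how it might look is given below, assuming the file sits in the working directory under the name used in the text and that the readxl package is installed. The guard around the call is an addition for illustration, so that the sketch does nothing when the file or the package is absent.

```r
# Assumed location: "population01.xls" in the current working directory.
file_path <- file.path(getwd(), "population01.xls")

if (requireNamespace("readxl", quietly = TRUE) && file.exists(file_path)) {
  population01 <- readxl::read_excel(file_path)  # returns a tibble
  print(names(population01))    # attribute names taken from the first row
  print(summary(population01))  # per-column synthesis
} else {
  message("readxl or the file is not available; nothing was loaded.")
}
```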
The preliminary analysis of population01 revealed some important aspects of the set. The first concerns the names() function, which returns the names of the attributes that make up the data frame. These attributes are the typologies of census sectors (land use), and each one holds the number of inhabitants registered in that typology for each municipality in São Paulo.
As seen in the summary() results, there are 645 rows (length), which represent the municipalities of the State. Each row gives the number of inhabitants registered in each of the territorial typologies for the city named in the first column of the data set.
Another aspect to be highlighted in these results is the attribute Código da Unidade Geográfica. Although it is stored as numbers, it is a numerical descriptor, that is, a sequence of digits assigned to designate the reference area. In software engineering and database modeling, such a code is understood as an identifying attribute: its values are not repeated within the data set and each is assigned exclusively to one entity, which in the population01 data frame is a São Paulo municipality.
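Whether a column really behaves as an identifying attribute can be checked directly. The sketch below uses a small made-up table in place of population01; the column name `code` is a shortened stand-in for Código da Unidade Geográfica, and the three values are used only as examples.

```r
# Made-up rows standing in for population01; only the code column matters here.
toy_population <- data.frame(
  city = c("A", "B", "C"),
  code = c(3500105, 3500204, 3500303)
)

# An identifying attribute should have no missing and no repeated values.
is_identifier <- !anyNA(toy_population$code) &&
  anyDuplicated(toy_population$code) == 0
print(is_identifier)
```

With the real data, the same two checks would be applied to the code column of each data frame.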
That said, we carried out the same analytical procedure with the “population02.xls” table in the directory, from which the population02 data frame was created and which is described in Script 2.
Script: Loading and Preliminary Analysis of population02
names(population02)
## [1] "city" "Total"
## [3] "Urbana" "Na sede municipal"
## [5] "Rural" "Área\ntotal\n(km²)"
## [7] "Densidade demográfica (hab/km²)" "Código da Unidade Geográfica"
summary(population02)
## city Total Urbana Na sede municipal
## Length:645 Min. : 805 Min. : 627 Min. : 627
## Class :character 1st Qu.: 5151 1st Qu.: 3865 1st Qu.: 3681
## Mode :character Median : 12737 Median : 10352 Median : 9563
## Mean : 63972 Mean : 61372 Mean : 56890
## 3rd Qu.: 37910 3rd Qu.: 34748 3rd Qu.: 32676
## Max. :11253503 Max. :11152344 Max. :11111108
## Rural Área\ntotal\n(km²) Densidade demográfica (hab/km²)
## Min. : 0 Min. : 5.4 Min. : 3.73
## 1st Qu.: 628 1st Qu.: 157.9 1st Qu.: 19.69
## Median : 1286 Median : 281.1 Median : 38.87
## Mean : 2600 Mean : 384.8 Mean : 302.13
## 3rd Qu.: 2971 3rd Qu.: 508.5 3rd Qu.: 109.81
## Max. :101159 Max. :1977.4 Max. :12519.10
## Código da Unidade Geográfica
## Min. :3500105
## 1st Qu.:3514601
## Median :3528700
## Mean :3528698
## 3rd Qu.:3543204
## Max. :3557303
Source: the authors (2023). Caption: preliminary analysis of the population data set using the functions: names () and summary () in the R software.
We observed, through the results of the names() function, that the population02 set has some attributes with the same names as those in the population01 data frame and others that differ. We highlight the attributes “demographic density” and “total area”, which give the number of inhabitants per unit of area and the total area of each municipality, respectively. Again, the attribute Código da Unidade Geográfica is interpreted as a numeric attribute, incurring the same problem identified previously for population01.
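The comparison of attribute names between two data frames can also be made programmatically rather than by eye. The sketch below uses a few of the names listed in Scripts 1 and 2; with the real objects, the two vectors would come from names(population01) and names(population02).

```r
# A subset of the attribute names reported above, typed in for illustration.
names01 <- c("city", "Área Urbanizada", "Código da Unidade Geográfica")
names02 <- c("city", "Total", "Urbana", "Código da Unidade Geográfica")

shared  <- intersect(names01, names02)  # attributes present in both sets
only_02 <- setdiff(names02, names01)    # attributes exclusive to the second set

print(shared)
print(only_02)
```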
Next, we performed the same procedure with the file “population03.xls”, which gave rise to the data frame population03, as shown in Script 3.
Script: Loading and Preliminary Analysis of population03
names(population03)
## [1] "city"
## [2] "População residente Absoluta"
## [3] "População residente absoluta total urbana"
## [4] "População residente absoluta total na sede municipal urbana\n"
## [5] "Total Relativa (%)...5"
## [6] "Total Relativa (%)...6"
## [7] "Na sede municipal Relativa (%)\n"
## [8] "Área\ntotal\n(km²)\n"
## [9] "Densidade demográfica (hab/km²)"
## [10] "Código da Unidade Geográfica"
summary(population03)
## city População residente Absoluta
## Length:645 Min. : 805
## Class :character 1st Qu.: 5151
## Mode :character Median : 12737
## Mean : 63972
## 3rd Qu.: 37910
## Max. :11253503
## População residente absoluta total urbana
## Min. : 627
## 1st Qu.: 3865
## Median : 10352
## Mean : 61372
## 3rd Qu.: 34748
## Max. :11152344
## População residente absoluta total na sede municipal urbana\n
## Min. : 627
## 1st Qu.: 3681
## Median : 9563
## Mean : 56890
## 3rd Qu.: 32676
## Max. :11111108
## Total Relativa (%)...5 Total Relativa (%)...6 Na sede municipal Relativa (%)\n
## Min. :100 Min. : 24.90 Min. : 12.60
## 1st Qu.:100 1st Qu.: 78.70 1st Qu.: 71.50
## Median :100 Median : 88.40 Median : 84.30
## Mean :100 Mean : 84.32 Mean : 79.78
## 3rd Qu.:100 3rd Qu.: 94.90 3rd Qu.: 92.10
## Max. :100 Max. :100.00 Max. :100.00
## Área\ntotal\n(km²)\n Densidade demográfica (hab/km²)
## Min. : 5.4 Min. : 3.73
## 1st Qu.: 157.9 1st Qu.: 19.69
## Median : 281.1 Median : 38.87
## Mean : 384.8 Mean : 302.13
## 3rd Qu.: 508.5 3rd Qu.: 109.81
## Max. :1977.4 Max. :12519.10
## Código da Unidade Geográfica
## Min. :3500105
## 1st Qu.:3514601
## Median :3528700
## Mean :3528698
## 3rd Qu.:3543204
## Max. :3557303
Source: the authors (2023). Caption: preliminary analysis of the population data set using the functions: names () and summary () in the R software.
In Script 3, in addition to presenting the results of the two analytical functions, we also report the first 10 lines of the data frame population03, presented right after loading the data using the read_excel () function.
We observed that the set population03 contains other numerical attributes, namely Relative Total (%) (in Portuguese, Total Relativa) and Relative municipal headquarters (%) (in Portuguese, Na sede municipal Relativa). These attributes, however, represent proportions of the population in each city in São Paulo and do not directly describe population dimensions such as concentration or demographic density.
Furthermore, we reiterate that the identifying attribute Geographic Unit Code (in Portuguese, Código da Unidade Geográfica) is also part of this data frame and is the only identifier present in the three sets analyzed up to this point. However, none of the three sets contains information that spatializes the data, such as the longitude, latitude and altitude of points or polygons referring to census sectors.
Despite the obstacles just described, we expanded the preliminary analysis with two other functions: first, the class() function, to recognize the structural format of the three sets; then, the dim() function, which gives the number of rows and columns of each data set. Script 4 below shows the results obtained.
Script: Application of the functions class () and dim () in the Preliminary Analysis
class(population01)
## [1] "tbl_df" "tbl" "data.frame"
class(population02)
## [1] "tbl_df" "tbl" "data.frame"
class(population03)
## [1] "tbl_df" "tbl" "data.frame"
dim(population01)
## [1] 645 10
dim(population02)
## [1] 645 8
dim(population03)
## [1] 645 10
Source: the authors (2023). Caption: Preliminary analysis of the data sets using the functions: class () and dim () in the R software.
We observed, in the results of the class() function, that the three data sets (population01, population02 and population03) are of the data frame type, that is, their information is arranged in rows and columns (tbl, an abbreviation of “table”). The columns of the data frames hold the information that characterizes the municipalities of the State of São Paulo, that is, the value of each attribute named in the first line of the data frames. The dim() function, in turn, reported that population01 is composed of 645 rows and 10 columns, population02 of 645 rows and 8 columns, and population03 of 645 rows and 10 columns.
Given this information, we conclude that the rows of population01 correspond to those of population02 and population03, that is, all cities are present in the three data frames. However, the number of columns differs between the data sets, a fact we had already observed when applying the names() and summary() functions. This occurs because some attributes are present in one set but not in the others, and vice versa, which changes the number of columns in each of them.
Returning to the attribute Geographic Unit Code (in Portuguese, Código da Unidade Geográfica), present in the three data frames, we carried out a specific analysis to identify how the information for each city is read by the software. We used the summary() function again and selected the column of this attribute with square brackets [], as demonstrated in Script 5.
Script: Analysis of the “Geographic Unit Code (in Portuguese, Código da Unidade Geográfica)” attribute
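Equal row counts alone do not guarantee that the same municipalities appear in every table, so the correspondence can be confirmed by comparing the identifier codes as sets. The sketch below uses three made-up code vectors; with the real objects, each vector would be the Código da Unidade Geográfica column of one data frame.

```r
# Made-up code vectors standing in for the identifier column of each data frame.
codes01 <- c(3500105, 3500204, 3500303)
codes02 <- c(3500303, 3500105, 3500204)  # same codes, different row order
codes03 <- c(3500204, 3500303, 3500105)

# setequal() ignores order and checks that both sets hold the same values.
same_cities <- setequal(codes01, codes02) && setequal(codes01, codes03)
print(same_cities)
```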
summary(population01["Código da Unidade Geográfica"])
## Código da Unidade Geográfica
## Min. :3500105
## 1st Qu.:3514601
## Median :3528700
## Mean :3528698
## 3rd Qu.:3543204
## Max. :3557303
summary(population02["Código da Unidade Geográfica"])
## Código da Unidade Geográfica
## Min. :3500105
## 1st Qu.:3514601
## Median :3528700
## Mean :3528698
## 3rd Qu.:3543204
## Max. :3557303
summary(population03["Código da Unidade Geográfica"])
## Código da Unidade Geográfica
## Min. :3500105
## 1st Qu.:3514601
## Median :3528700
## Mean :3528698
## 3rd Qu.:3543204
## Max. :3557303
Source: the authors (2023). Caption: analysis of the “Geographic Unit Code (in Portuguese, Código da Unidade Geográfica)” attribute using the summary() function and selecting the attribute using square brackets [].
In line with what we presented previously, the summary() function returned statistical information about this attribute: the minimum (Min.), first quartile (1st Qu.), median, mean, third quartile (3rd Qu.) and maximum (Max.) values of the data set. Therefore, we confirm that the software interprets this identifying attribute not as an area identifier code but as a numerical value. This prevents the direct plotting of the data on a map, which requires other geographic information.
From this perspective, we carried out a new search on the IBGE platform to find the files referring to the identifying attribute Geographic Unit Code (in Portuguese, Código da Unidade Geográfica). The files related to this attribute were downloaded and indexed in the same working directory as the exploratory analysis and are available at the following link, accessible remotely.
In this search, we obtained the shapefiles made available by IBGE, which contain the set of geographic information (code geometry: longitude, latitude and altitude) that represents the Geographic Unit Code (in Portuguese, Código da Unidade Geográfica) of the three data frames (population01, population02 and population03). With these shapefiles, we intended to connect the census data from the three sets to the spatial structures of their locations (census sector).
To this end, the first step was to load the shapefile into R using the shapefile() function (provided, for example, by the raster package), which reads vector data into the software. The loaded data set was stored as an object named br_locations_2010 for exploratory analysis. After loading it, we carried out the same analytical procedures demonstrated so far: we used the names() function to identify the attributes that make up br_locations_2010, and the summary() function to recognize the qualitative and quantitative structure of this object (Script 7).
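One way to make R treat such a code as a label rather than a quantity is to store it as character. The sketch below is an illustration on made-up values and is not a step reported in this analysis.

```r
# Made-up codes stored as numbers, as read from the spreadsheet.
code_numeric <- c(3500105, 3500204, 3500303)
print(mean(code_numeric))  # computable, but meaningless for an identifier

# Stored as character, the code is summarized by length and class only.
code_character <- as.character(code_numeric)
print(summary(code_character))
```

Storing the code as character also helps when it is later used as a key to match records between tables.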
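The intended connection between census tables and point locations amounts to a join on the municipality code. The sketch below illustrates the idea with made-up tables in base R. It assumes that the CD_GEOCODM attribute of the shapefile holds the municipality code, which its name suggests but the text does not state; the population values and coordinates are invented.

```r
# Made-up census table keyed by municipality code.
census <- data.frame(
  code  = c("3500105", "3500204"),
  Total = c(100, 200)
)

# Made-up point table; several points may share one municipality.
points <- data.frame(
  CD_GEOCODM = c("3500105", "3500105", "3500204"),
  LONG       = c(-51.0, -51.1, -49.6),
  LAT        = c(-21.6, -21.7, -21.2)
)

# Left join: every point keeps its row and receives the census attributes.
joined <- merge(points, census, by.x = "CD_GEOCODM", by.y = "code", all.x = TRUE)
print(joined)
```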
Script: Preliminary analysis of br_locations_2010
names(br_locations_2010)
## [1] "ID" "CD_GEOCODI" "TIPO" "CD_GEOCODB" "NM_BAIRRO"
## [6] "CD_GEOCODS" "NM_SUBDIST" "CD_GEOCODD" "NM_DISTRIT" "CD_GEOCODM"
## [11] "city" "NM_MICRO" "NM_MESO" "state" "CD_NIVEL"
## [16] "CD_CATEGOR" "NM_CATEGOR" "NM_LOCALID" "LONG" "LAT"
## [21] "ALT" "GMRotation"
summary(br_locations_2010)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x -73.49761 -32.435186
## y -33.73754 5.220071
## Is projected: FALSE
## proj4string :
## [+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs]
## Number of points: 21886
## Data attributes:
## ID CD_GEOCODI TIPO CD_GEOCODB
## Min. : 1 Length:21886 Length:21886 Length:21886
## 1st Qu.: 5472 Class :character Class :character Class :character
## Median :10944 Mode :character Mode :character Mode :character
## Mean :10944
## 3rd Qu.:16415
## Max. :21886
##
## NM_BAIRRO CD_GEOCODS NM_SUBDIST CD_GEOCODD
## Length:21886 Length:21886 Length:21886 Length:21886
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## NM_DISTRIT CD_GEOCODM city NM_MICRO
## Length:21886 Length:21886 Length:21886 Length:21886
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## NM_MESO state CD_NIVEL CD_CATEGOR
## Length:21886 Length:21886 Length:21886 Length:21886
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## NM_CATEGOR NM_LOCALID LONG LAT
## Length:21886 Length:21886 Min. :-73.50 Min. :-33.738
## Class :character Class :character 1st Qu.:-49.89 1st Qu.:-21.588
## Mode :character Mode :character Median :-44.62 Median :-12.619
## Mean :-45.54 Mean :-14.067
## 3rd Qu.:-40.15 3rd Qu.: -6.603
## Max. :-32.44 Max. : 5.220
##
## ALT GMRotation
## Min. : 0.0 Min. :0
## 1st Qu.: 111.1 1st Qu.:0
## Median : 329.3 Median :0
## Mean : 372.4 Mean :0
## 3rd Qu.: 582.4 3rd Qu.:0
## Max. :1639.2 Max. :0
## NA's :1
crs(br_locations_2010)
## Coordinate Reference System:
## Deprecated Proj.4 representation:
## +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## WKT2 2019 representation:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["Unknown based on GRS 1980 ellipsoid using towgs84=0,0,0,0,0,0,0",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1],
## ID["EPSG",7019]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Position Vector transformation (geog2D domain)",
## ID["EPSG",9606]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]],
## PARAMETER["X-axis rotation",0,
## ID["EPSG",8608]],
## PARAMETER["Y-axis rotation",0,
## ID["EPSG",8609]],
## PARAMETER["Z-axis rotation",0,
## ID["EPSG",8610]],
## PARAMETER["Scale difference",1,
## ID["EPSG",8611]]]]
Source: the authors (2023).
The results show that br_locations_2010 is composed of 21,886 rows and 22 attributes described in columns. The geometry type of the data set is point, structured in XY dimensions, with the values of xmin, ymin, xmax and ymax stored within the structure. Furthermore, the coordinate reference system (CRS) of the data set is geographic (longitude and latitude, unprojected) and based on the GRS 1980 ellipsoid, which is consistent with the “SIRGAS 2000” system registered in the EPSG (European Petroleum Survey Group) database.
We also highlight that the 22 attributes that make up the data set are: ID, CD_GEOCODI, TIPO, CD_GEOCODB, NM_BAIRRO, CD_GEOCODS, NM_SUBDIST, CD_GEOCODD, NM_DISTRIT, CD_GEOCODM, city, NM_MICRO, NM_MESO, state, CD_NIVEL, CD_CATEGOR, NM_CATEGOR, NM_LOCALID, LONG, LAT, ALT and GMRotation. Among these, we confirmed the presence of attributes associated with the geographic positioning of information, such as longitude (LONG), latitude (LAT), altitude (ALT) and geometry. The data set also describes territorial subdivisions: the columns run from the lowest level of aggregation, the neighborhood (NM_BAIRRO), up to the State. These results, therefore, gave us clues to understand the structure of the data and to identify relevant attributes for data mining in R®.
In order to test the spatialization of the information contained in the br_locations_2010 data set, we applied the plot () function in the analytical sequence to visualize the distribution of points described by the data frame, as can be seen in Figure 1.
Figure: Plotting of data from br_locations_2010
Source: the authors (2023). Caption: figure produced with the plot() function, using the data set br_locations_2010. The red outline represents the Brazilian territorial polygon, inserted to show the distribution of points.
The plot shows each of the 21,886 points that make up the data set, positioned by the geographic coordinates of each Brazilian locality. The blank spaces, where no points are marked, represent areas where no population was recorded (areas without human occupation/demographic voids) and/or where the populations sampled were counted within census sectors close to the area where they are established, as suggested in the reference document for the Brazilian census operation (IBGE, 2013).
Based on the attribute names of br_locations_2010, we verified that the column NM_UF, which holds the name of the Federation Unit, allows the data related to the State of São Paulo to be filtered, making it possible to subset the data set to the experimental area of this research. We also verified that this data frame is composed of geographically distributed points, given the presence of the attributes longitude (LONG), latitude (LAT) and altitude (ALT). In addition, the data frame has structural characteristics that help in the spatialization of geographic information, which are presented in Script 7 as coords.x1 and coords.x2.
To isolate the data from the State of São Paulo, we returned to Microsoft Access 365 (the initial format of the data set made available by IBGE) to filter the data from localities_br. We filtered the data set using the NM_UF attribute, which stores the names of the Federation Units, selecting only the rows whose value in the NM_UF column was São Paulo. The selected data were copied to an Excel 365 spreadsheet and saved in the “.xls” format with the name localsp.xls. We reiterate that both the original file (Access 365) and the produced file (Excel 365) are in the GitHub digital collection and can be accessed remotely.
After producing the localsp.xls file, we returned to R, where the file was located in the working directory and loaded into the software using the read_excel() function. On loading, we created the localsp object, as illustrated in the Script. After loading the data, we renamed some attributes of the data set through the names() function, namely: a) NM_MUNICIPIO, replaced by city; b) NM_UF, by state; c) LONG, by longitude; d) LAT, by latitude; e) ALT, by altitude. Finally, we confirmed the creation of the data set and the changes to the attribute names, actions that are described in the following script.
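The same selection could also be made inside R instead of Access. The sketch below filters a small made-up table by the NM_UF column; the exact spelling of the state name in the IBGE data (here "SÃO PAULO") is an assumption, and the object name `localsp_sketch` is hypothetical.

```r
# Made-up rows standing in for the national localities table.
localities <- data.frame(
  NM_UF         = c("SÃO PAULO", "RIO DE JANEIRO", "SÃO PAULO"),
  NM_LOCALIDADE = c("X", "Y", "Z")
)

# Keep only the rows whose Federation Unit is São Paulo.
localsp_sketch <- localities[localities$NM_UF == "SÃO PAULO", ]
print(nrow(localsp_sketch))
```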
Script: Loading data and creating the localsp object.
localsp
## # A tibble: 2,142 × 21
## ID CD_GEOCODIGO TIPO CD_GEOCODBA NM_BAIRRO CD_GEOCODSD CD_GEOCODDS
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 15316 3.50e14 URBANO NA <NA> 35001050500 350010505
## 2 15317 3.50e14 URBANO NA <NA> 35001050500 350010505
## 3 15318 3.50e14 URBANO NA <NA> 35001050500 350010505
## 4 15319 3.50e14 URBANO NA <NA> 35001050500 350010505
## 5 15320 3.50e14 URBANO NA <NA> 35001050500 350010505
## 6 15321 3.50e14 URBANO NA <NA> 35001050500 350010505
## 7 15322 3.50e14 URBANO NA <NA> 35002040500 350020405
## 8 15323 3.50e14 URBANO NA <NA> 35002040500 350020405
## 9 15324 3.50e14 URBANO NA <NA> 35002040500 350020405
## 10 15325 3.50e14 URBANO NA <NA> 35003030500 350030305
## # ℹ 2,132 more rows
## # ℹ 14 more variables: NM_DISTRITO <chr>, CD_GEOCODMU <dbl>, city <chr>,
## # NM_MICRO <chr>, NM_MESO <chr>, state <chr>, CD_NIVEL <dbl>,
## # CD_CATEGORIA <dbl>, NM_CATEGORIA <chr>, NM_LOCALIDADE <chr>,
## # longitude <dbl>, latitude <dbl>, altitude <dbl>, GM_PONTO_sk <chr>
Source: the authors (2023). Caption: loading data from localsp.xls and creating the localsp object. In the script, the format of the data set and the attributes (variables) that constitute it are highlighted.
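The renaming described for localsp can be sketched with the names() function and a lookup vector. The example below uses a one-row made-up table, and the values in it are invented. Only the five substitutions listed in the text are applied, and any other column keeps its name.

```r
# One made-up row with the original attribute names.
toy_local <- data.frame(
  ID = 1, NM_MUNICIPIO = "A", NM_UF = "SÃO PAULO",
  LONG = -48.1, LAT = -22.7, ALT = 500
)

# Old name -> new name, following the substitutions listed in the text.
new_names <- c(NM_MUNICIPIO = "city", NM_UF = "state",
               LONG = "longitude", LAT = "latitude", ALT = "altitude")

# Replace only the names that appear in the lookup vector.
to_rename <- names(toy_local) %in% names(new_names)
names(toy_local)[to_rename] <- unname(new_names[names(toy_local)[to_rename]])
print(names(toy_local))
```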
Checking the created object allowed us to extract some relevant information about the localsp set. The results in Script 10 indicated that the object is structured as a data frame (a tibble), composed of 2,142 rows and 21 variables distributed in columns, which are the attributes of this data set. The names of the attributes (variables) are highlighted in the Script above, and we point out that they are the same attributes that make up localities_br, except for the geometry attribute (GEOMETRY), which was not carried over by the filtering in Access 365 and, therefore, is not part of the localsp set.
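Since localsp keeps the longitude and latitude columns, a point geometry could in principle be rebuilt from them. The sketch below is an illustration under assumptions and is not a step reported here. It relies on the sf package (skipped if not installed), uses EPSG:4674 as the code for SIRGAS 2000 geographic coordinates, and works on made-up rows. Rows with missing coordinates are set aside first, since they cannot become points.

```r
# Made-up rows standing in for localsp; the third row lacks coordinates.
toy_points <- data.frame(
  city      = c("A", "B", "C"),
  longitude = c(-48.1, -46.9, NA),
  latitude  = c(-22.7, -23.3, NA)
)

# Keep only the rows that have both coordinates.
toy_complete <- toy_points[!is.na(toy_points$longitude) &
                           !is.na(toy_points$latitude), ]

if (requireNamespace("sf", quietly = TRUE)) {
  # Build point geometries from the coordinate columns (EPSG:4674, SIRGAS 2000).
  toy_sf <- sf::st_as_sf(toy_complete,
                         coords = c("longitude", "latitude"),
                         crs = 4674)
  print(toy_sf)
}
```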
Following the same functions previously used in the preliminary analysis (Table 10), we explored the localsp data set, aiming to identify relevant characteristics for the exploratory analysis, as demonstrated in Script 11.
Script 11: Preliminary analysis of localsp
names(localsp)
## [1] "ID" "CD_GEOCODIGO" "TIPO" "CD_GEOCODBA"
## [5] "NM_BAIRRO" "CD_GEOCODSD" "CD_GEOCODDS" "NM_DISTRITO"
## [9] "CD_GEOCODMU" "city" "NM_MICRO" "NM_MESO"
## [13] "state" "CD_NIVEL" "CD_CATEGORIA" "NM_CATEGORIA"
## [17] "NM_LOCALIDADE" "longitude" "latitude" "altitude"
## [21] "GM_PONTO_sk"
summary(localsp)
## ID CD_GEOCODIGO TIPO CD_GEOCODBA
## Min. :15316 Min. :3.500e+14 Length:2142 Min. :3.502e+11
## 1st Qu.:15851 1st Qu.:3.514e+14 Class :character 1st Qu.:3.514e+11
## Median :16386 Median :3.529e+14 Mode :character Median :3.533e+11
## Mean :16386 Mean :3.528e+14 Mean :3.531e+11
## 3rd Qu.:16921 3rd Qu.:3.542e+14 3rd Qu.:3.549e+11
## Max. :17456 Max. :3.557e+14 Max. :3.555e+11
## NA's :1 NA's :1 NA's :1997
## NM_BAIRRO CD_GEOCODSD CD_GEOCODDS NM_DISTRITO
## Length:2142 Min. :3.500e+10 Min. :350010505 Length:2142
## Class :character 1st Qu.:3.514e+10 1st Qu.:351410605 Class :character
## Mode :character Median :3.529e+10 Median :352850205 Mode :character
## Mean :3.528e+10 Mean :352799828
## 3rd Qu.:3.542e+10 3rd Qu.:354165305
## Max. :3.557e+10 Max. :355730305
## NA's :1 NA's :1
## CD_GEOCODMU city NM_MICRO NM_MESO
## Min. :3500105 Length:2142 Length:2142 Length:2142
## 1st Qu.:3514106 Class :character Class :character Class :character
## Median :3528502 Mode :character Mode :character Mode :character
## Mean :3527998
## 3rd Qu.:3541653
## Max. :3557303
## NA's :1
## state CD_NIVEL CD_CATEGORIA NM_CATEGORIA
## Length:2142 Min. :1.000 Min. : 1.000 Length:2142
## Class :character 1st Qu.:1.000 1st Qu.: 3.000 Class :character
## Mode :character Median :3.000 Median : 5.000 Mode :character
## Mean :3.655 Mean : 8.192
## 3rd Qu.:6.000 3rd Qu.:10.000
## Max. :6.000 Max. :70.000
## NA's :1 NA's :1
## NM_LOCALIDADE longitude latitude altitude
## Length:2142 Min. :-53.06 Min. :-25.22 Min. : 1.363
## Class :character 1st Qu.:-49.48 1st Qu.:-23.31 1st Qu.: 465.505
## Mode :character Median :-48.08 Median :-22.67 Median : 575.228
## Mean :-48.29 Mean :-22.43 Mean : 580.182
## 3rd Qu.:-46.95 3rd Qu.:-21.57 3rd Qu.: 712.080
## Max. :-44.20 Max. :-19.87 Max. :1639.155
## NA's :1 NA's :1 NA's :1
## GM_PONTO_sk
## Length:2142
## Class :character
## Mode :character
##
##
##
##
dim(localsp)
## [1] 2142 21
class(localsp)
## [1] "tbl_df" "tbl" "data.frame"
Source: the authors (2023). Caption: Preliminary analysis of the localsp set. The script presents the results obtained by applying the names(), summary(), dim() and class() functions.
Using the names() function, we confirmed the names of the attributes associated with the data set, that is, the names in the first line of the object. We also found that localsp is composed of spatial attributes, such as longitude, latitude and altitude, and identifying attributes, such as ID and CD_GEOCODIGO. We should highlight that the ID attribute, despite identifying a given piece of geographic information, takes a different value in each line of the data set, as can be seen in the results generated by Script 10. We note that the same municipality has several ID codes, which refer to different typologies of land use or urban agglomerations. This is confirmed by the attributes CD_NIVEL and CD_CATEGORIA (numeric codes that refer to the different typologies) and NM_CATEGORIA (the name that specifies the typology).
On the other hand, when examining the CD_GEOCODIGO attribute, we observed numerical repetition across the different profiles of land use and land cover and typologies of human settlements, that is, the same identifying code was repeated for the municipalities. According to the technical documentation of the 2010 Census (IBGE, 2013), CD_GEOCODIGO is the geographical area identifier code, established by IBGE to refer to a certain polygon (census sector) in the Brazilian territory, as we had previously discussed.
Additionally, the dim() function showed that the set is composed of 2,142 lines and 21 columns, which describe the data attributes. Furthermore, the class() function revealed that localsp is structured as a table (tbl_df, tbl) and constitutes a data frame (data.frame).
The summary() function, in turn, expanded the information about each of the attributes associated with the set. For attributes of the character type, the function returned the number of elements in the column (Length), the class of the information (Class) and its storage mode (Mode). For numeric attributes, it returned descriptive statistics: minimum, first quartile, median, mean, third quartile and maximum, in addition to the number of missing values (NA's).
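To make the behaviour of these four functions concrete, the following minimal sketch applies them to a small stand-in data frame. The object toy and its values are hypothetical and are not IBGE data; only the column names are borrowed from localsp.

```r
# Hypothetical stand-in for localsp (illustrative values, not IBGE data)
toy <- data.frame(
  ID       = c(1, 2, 3, 4),
  TIPO     = c("URBANO", "URBANO", "RURAL", NA),
  altitude = c(450.5, 575.2, NA, 712.0),
  stringsAsFactors = FALSE
)

names(toy)   # attribute (column) names
dim(toy)     # number of lines and number of columns
class(toy)   # "data.frame" here; a tibble would also report "tbl_df" and "tbl"

# Character columns report Length/Class/Mode; numeric columns report
# Min., 1st Qu., Median, Mean, 3rd Qu., Max. and, when present, NA's
summary(toy)

# Compact view of the type and the number of missing values per column
col_types <- sapply(toy, class)
na_counts <- colSums(is.na(toy))
```

The last two lines are one possible way to read off, per attribute, what summary() shows in printed form: which columns the software treats as numeric and how many values are missing.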
We point out that the statistical results, in the case of localsp, are not relevant for the exploratory analysis. However, these results demonstrate which attributes are interpreted by the software as numerical attributes. Thus, we identified that the attributes longitude, latitude and altitude have numerical data for each of the census sectors. This favors the use of this information for the spatialization of specific geographic information, as we identified in the sets population01, population02 and population03.
Comparatively, we verified, through the results generated by the dim() function, that the population data frames (population01, population02 and population03) are made up of 645 lines (Script 4), while localsp is made up of 2,142 lines (Script 11). This substantial difference in the number of lines reflects the fragmentation of population information: the localsp data frame subdivides the types of population clusters that make up the urban perimeter, as previously indicated for the attributes CD_NIVEL, CD_CATEGORIA and NM_CATEGORIA that make up this set.
As we found in the 2010 Demographic Census documentation (IBGE, 2013a), the sampling points that make up the localsp data detail the territorial mesh at a finer level, that is, they further divide the territorial portions, reaching the microdata of the Demographic Census. In population01, population02 and population03, on the other hand, these microdata are aggregated by municipality, which reduces the number of lines when compared to localsp. However, as observed in Scripts 1, 2 and 3, the typologies of population clusters are attributes of the three population sets (population01/02/03). Therefore, these observations about the components of the 4 sets (localsp and population01/02/03) indicate that we can proceed with joining these data frames.
Having indicated the relevant aspects of the four central sets of this exploratory analysis, we move on to mining the data that make up the sets in order to connect them. In the localsp dataset, we identified the presence of territorial subdivisions in the NM_CATEGORIA column, which reflect the typology of the census sector. The output below lists the distinct values present in this column of localsp. According to the technical documentation of the 2010 Census (IBGE, 2013a), the categories in the microdata portray the profile of population clusters present in the national territory and, consequently, in the State of São Paulo. Thus, it is understood that these categories reveal how human groups are inserted into the territory, with a strong relationship to land use and land cover.
## [1] "CIDADE" "AUI"
## [3] "VILA" "NÚCLEO"
## [5] "POVOADO" "LUGAREJO"
## [7] "PROJETO DE ASSENTAMENTO" "ALDEIA INDÍGENA"
## [9] NA
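A listing of this kind can be obtained with unique(); the following is a minimal sketch on a hypothetical vector of categories, not necessarily the call used to produce the output above. The object toy is an assumption introduced only for illustration.

```r
# Hypothetical stand-in for the NM_CATEGORIA column of localsp
toy <- data.frame(
  NM_CATEGORIA = c("CIDADE", "AUI", "CIDADE", "VILA", NA, "AUI"),
  stringsAsFactors = FALSE
)

# Distinct values in order of first appearance (NA is kept as a value)
categories <- unique(toy$NM_CATEGORIA)

# Number of lines per category, including the missing ones
counts <- table(toy$NM_CATEGORIA, useNA = "ifany")
```

Counting the lines per category at this stage also anticipates the size of each subset created from NM_CATEGORIA.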
Considering the categories, the IBGE documentation (2013a) indicates that they are subdivided into two groups: urban and rural. The urban group comprises cities (CIDADE), isolated urban areas (AUI) and urban villages (VILA); the rural group comprises rural cores (NÚCLEO), rural settlements (POVOADO), rural villages (LUGAREJO), settlement projects (PROJETO DE ASSENTAMENTO) and indigenous villages (ALDEIA INDÍGENA, treated here as traditional communities). Following the logic described by the Institute, we subdivided the localsp data set based on the NM_CATEGORIA attribute, generating one data set for each category of census sector and/or population cluster. The following script depicts the procedure for creating the sets sp_cities, sp_isolatedurbanareas, sp_urbanvillages, sp_traditionalcommunities, sp_ruralvillage, sp_ruralcore, sp_settlement and sp_settlementproject, which was performed using the filter() function.
Script: Subdivision of the localsp dataset based on the NM_CATEGORIA attribute.
#Urban Groups
#creating the city group
sp_cities = localsp %>% filter(NM_CATEGORIA == "CIDADE")
#creating isolated urban areas group
sp_isolatedurbanareas = localsp %>% filter(NM_CATEGORIA == "AUI")
#creating the group of villages (urban)
sp_urbanvillages = localsp %>% filter(NM_CATEGORIA == "VILA")
#Rural Groups
#creating the group of indigenous villages
sp_traditionalcommunities = localsp %>% filter(NM_CATEGORIA == "ALDEIA INDÍGENA")
#creating the group of rural villages
sp_ruralvillage = localsp %>% filter(NM_CATEGORIA == "LUGAREJO")
#creating the rural core group
sp_ruralcore = localsp %>% filter(NM_CATEGORIA == "NÚCLEO")
#creating the settlement group (rural)
sp_settlement = localsp %>% filter(NM_CATEGORIA == "POVOADO")
#creating the settlement project group (rural)
sp_settlementproject = localsp %>% filter(NM_CATEGORIA == "PROJETO DE ASSENTAMENTO")
Source: the authors (2023). Caption: Script describing the separation of the localsp dataset into 8 categories, which describe the typologies of population clusters from the 2010 Census, separated into two groups: urban and rural.
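As a complement to the eight filter() calls, the same partition can be sketched in a single step with base R's split(). This is an alternative shown for illustration under assumed toy data, not the procedure adopted in this work.

```r
# Hypothetical stand-in for localsp (illustrative values only)
toy <- data.frame(
  NM_LOCALIDADE = c("A", "B", "C", "D", "E"),
  NM_CATEGORIA  = c("CIDADE", "AUI", "AUI", "VILA", NA),
  stringsAsFactors = FALSE
)

# One data frame per category; lines whose category is NA are dropped,
# which is also what filter(NM_CATEGORIA == "...") does
by_category <- split(toy, toy$NM_CATEGORIA)

# Number of lines in each subset
rows_per_category <- sapply(by_category, nrow)
```

Each element can then be accessed by name, for example by_category[["AUI"]], which plays the role of sp_isolatedurbanareas in the script above.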
After creating the different data sets, based on the population grouping categories, we plotted each of them, using the longitude and latitude columns (18 and 19 respectively) to evaluate the spatialization capacity of these data.
Figure: Plot of data from localsp separated by categories of population clusters.
Source: the authors (2023). Caption: Figure generated from categorized data from localsp, where each of the graphs represents a category of population cluster. The number of points reflects the number of lines (observations) in each of the eight plotted sets.
When we turned to the population data sets (population01/02/03), we identified that the set population01 was the one that most completely discriminated the different categories of population clusters. We confirmed this by analyzing the attributes that make up the three sets, as shown in the following table, which was built with the names() function for comparison.
Table: Comparison of attributes that make up population groups.
## Population 01 Population 02
## 1 city city
## 2 Área Urbanizada Total
## 3 Área não Urbanizada Urbana
## 4 Área Urbana Isolada Na sede municipal
## 5 Área Rural (Exceto Aglomerado) Rural
## 6 Aglomerado Rural de Extensão Urbana Área\ntotal\n(km²)
## 7 Aglomerado Rural Povoado Densidade demográfica (hab/km²)
## 8 Aglomerado Rural Núcleo Código da Unidade Geográfica
## 9 Outros Aglomerados Rurais Raros <NA>
## 10 Código da Unidade Geográfica <NA>
## Population 03
## 1 city
## 2 População residente Absoluta
## 3 População residente absoluta total urbana
## 4 População residente absoluta total na sede municipal urbana\n
## 5 Total Relativa (%)...5
## 6 Total Relativa (%)...6
## 7 Na sede municipal Relativa (%)\n
## 8 Área\ntotal\n(km²)\n
## 9 Densidade demográfica (hab/km²)
## 10 Código da Unidade Geográfica
Source: the authors (2023). Legend: Table showing the attributes that make up the three population groups (population01, population02 and population03). The “NA” demonstrates that the population02 set has a smaller number of columns in its structure.
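A comparison table of this kind, including the NA padding for the set with fewer columns, can be sketched as follows. The data frames a and b and their column names are hypothetical; the padding relies on the fact that extending the length of a vector in R fills the new positions with NA.

```r
# Two hypothetical data frames with a different number of columns
a <- data.frame(city = "X", urban = 1, rural = 2, code = 10)
b <- data.frame(city = "X", total = 3, code = 10)

name_list <- list(A = names(a), B = names(b))
n_max <- max(lengths(name_list))

# Pad the shorter vector of names with NA so both fit in one table
comparison <- as.data.frame(
  lapply(name_list, function(x) { length(x) <- n_max; x }),
  stringsAsFactors = FALSE
)

# Attributes present in a but absent from b
only_in_a <- setdiff(names(a), names(b))
```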
As seen in the results generated, we confirm that the set population01 presents the breakdown of the categories of population clusters previously found in localsp, namely: municipality, urbanized area, non-urbanized area, isolated urban area, rural area (except agglomeration), urban extension rural agglomeration, populated rural agglomeration, core rural agglomeration and other rare rural agglomerations. Given this, we chose to focus on the set population01 to advance the correlation of data, considering its affinity with the localsp data (and its categories) and its complete coverage of the municipalities of São Paulo.
From this perspective, we fragmented population01 data into 8 groups, following the data frame construction logic. In each of the 8 sets created, we kept the first and last columns, respectively “city” and “CD_GEOCODIGO”, both of utmost importance for correlation with the data from localsp, as we treated previously. On the other hand, the other population01 columns were the variable objects for creating the 8 new sets, each integrating a new set.
Therefore, the following Script describes how the isolated sets of demographic information were created according to the category (typology) of the population cluster. It is worth highlighting that, for the partitioning and creation of the following 8 sets, we used as a reference the technical documentation of the 2010 Census (IBGE, 2013a), which explains to which categories each of the attributes described in columns 2 to 9 of the set population01 belongs; these are also the categories present in localsp (NM_CATEGORIA).
Script: Partitioning population01 data according to population cluster categories.
pop01_cities <- population01[,c(1,2,10)]
pop01_cities
## # A tibble: 645 × 3
## city `Área Urbanizada` `Código da Unidade Geográfica`
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 31713 3500105
## 2 ADOLFO 3155 3500204
## 3 AGUAÍ 27261 3500303
## 4 ÁGUAS DA PRATA 5513 3500402
## 5 ÁGUAS DE LINDÓIA 6886 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 3681 3500550
## 7 ÁGUAS DE SÃO PEDRO 2707 3500600
## 8 AGUDOS 32173 3500709
## 9 ALAMBARI 3036 3500758
## 10 ALFREDO MARCONDES 2690 3500808
## # ℹ 635 more rows
pop01_isolatedurbanareas <- population01[,c(1,4,10)]
pop01_isolatedurbanareas
## # A tibble: 645 × 3
## city `Área Urbana Isolada` `Código da Unidade Geográfica`
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 180 3500105
## 2 ADOLFO 45 3500204
## 3 AGUAÍ 1025 3500303
## 4 ÁGUAS DA PRATA 1258 3500402
## 5 ÁGUAS DE LINDÓIA 0 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 578 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 161 3500709
## 9 ALAMBARI 636 3500758
## 10 ALFREDO MARCONDES 0 3500808
## # ℹ 635 more rows
pop01_urbanvillages <- population01[,c(1,3,10)]
pop01_urbanvillages
## # A tibble: 645 × 3
## city `Área não Urbanizada` `Código da Unidade Geográfica`
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 55 3500105
## 2 ADOLFO 0 3500204
## 3 AGUAÍ 715 3500303
## 4 ÁGUAS DA PRATA 0 3500402
## 5 ÁGUAS DE LINDÓIA 10225 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 0 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 659 3500709
## 9 ALAMBARI 0 3500758
## 10 ALFREDO MARCONDES 565 3500808
## # ℹ 635 more rows
pop01_traditionalcommunities <- population01[,c(1,5,10)]
pop01_traditionalcommunities
## # A tibble: 645 × 3
## city Área Rural (Exceto Aglomerado…¹ Código da Unidade Ge…²
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 0 3500105
## 2 ADOLFO 0 3500204
## 3 AGUAÍ 3147 3500303
## 4 ÁGUAS DA PRATA 813 3500402
## 5 ÁGUAS DE LINDÓIA 155 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 1342 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 1531 3500709
## 9 ALAMBARI 1212 3500758
## 10 ALFREDO MARCONDES 636 3500808
## # ℹ 635 more rows
## # ℹ abbreviated names: ¹`Área Rural (Exceto Aglomerado)`,
## # ²`Código da Unidade Geográfica`
pop01_ruralvillage <- population01[,c(1,6,10)]
pop01_ruralvillage
## # A tibble: 645 × 3
## city Aglomerado Rural de Extensão …¹ Código da Unidade Ge…²
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 0 3500105
## 2 ADOLFO 0 3500204
## 3 AGUAÍ 0 3500303
## 4 ÁGUAS DA PRATA 0 3500402
## 5 ÁGUAS DE LINDÓIA 0 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 0 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 0 3500709
## 9 ALAMBARI 0 3500758
## 10 ALFREDO MARCONDES 0 3500808
## # ℹ 635 more rows
## # ℹ abbreviated names: ¹`Aglomerado Rural de Extensão Urbana`,
## # ²`Código da Unidade Geográfica`
pop01_ruralcore <- population01[,c(1,8,10)]
pop01_ruralcore
## # A tibble: 645 × 3
## city `Aglomerado Rural Núcleo` Código da Unidade Geográfi…¹
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 0 3500105
## 2 ADOLFO 0 3500204
## 3 AGUAÍ 0 3500303
## 4 ÁGUAS DA PRATA 0 3500402
## 5 ÁGUAS DE LINDÓIA 0 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 0 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 0 3500709
## 9 ALAMBARI 0 3500758
## 10 ALFREDO MARCONDES 0 3500808
## # ℹ 635 more rows
## # ℹ abbreviated name: ¹`Código da Unidade Geográfica`
pop01_settlement <- population01[,c(1,7,10)]
pop01_settlement
## # A tibble: 645 × 3
## city `Aglomerado Rural Povoado` Código da Unidade Geográf…¹
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 0 3500105
## 2 ADOLFO 0 3500204
## 3 AGUAÍ 0 3500303
## 4 ÁGUAS DA PRATA 0 3500402
## 5 ÁGUAS DE LINDÓIA 0 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 0 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 0 3500709
## 9 ALAMBARI 0 3500758
## 10 ALFREDO MARCONDES 0 3500808
## # ℹ 635 more rows
## # ℹ abbreviated name: ¹`Código da Unidade Geográfica`
pop01_settlementproject <- population01[,c(1,9,10)]
pop01_settlementproject
## # A tibble: 645 × 3
## city Outros Aglomerados Rurais Rar…¹ Código da Unidade Ge…²
## <chr> <dbl> <dbl>
## 1 ADAMANTINA 0 3500105
## 2 ADOLFO 0 3500204
## 3 AGUAÍ 0 3500303
## 4 ÁGUAS DA PRATA 0 3500402
## 5 ÁGUAS DE LINDÓIA 0 3500501
## 6 ÁGUAS DE SANTA BÁRBARA 0 3500550
## 7 ÁGUAS DE SÃO PEDRO 0 3500600
## 8 AGUDOS 0 3500709
## 9 ALAMBARI 0 3500758
## 10 ALFREDO MARCONDES 0 3500808
## # ℹ 635 more rows
## # ℹ abbreviated names: ¹`Outros Aglomerados Rurais Raros`,
## # ²`Código da Unidade Geográfica`
Source: the authors (2023). Caption: In the script, data from the population01 set is partitioned according to the category of the population cluster. Columns 2 to 9 of the population01 set are isolated and integrated with columns 1 (city) and 10 (CD_GEOCODIGO), structuring 8 new data sets, namely: pop01_cities, pop01_isolatedurbanareas, pop01_urbanvillages, pop01_traditionalcommunities, pop01_ruralvillage, pop01_ruralcore, pop01_settlement and pop01_settlementproject.
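The partitioning above selects columns by position. A minimal sketch of the equivalent selection by name follows, which may be easier to audit if the column order of the source spreadsheet changes. The object toy reproduces only the first two lines and a subset of the columns shown in the outputs above; it is an illustration, not the full population01 set.

```r
# Reduced stand-in for population01 (first two municipalities only)
toy <- data.frame(
  city = c("ADAMANTINA", "ADOLFO"),
  `Área Urbanizada` = c(31713, 3155),
  `Área Urbana Isolada` = c(180, 45),
  `Código da Unidade Geográfica` = c(3500105, 3500204),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Selection by position, as in the script above
by_position <- toy[, c(1, 2, 4)]

# Same selection by column name
by_name <- toy[, c("city", "Área Urbanizada", "Código da Unidade Geográfica")]

same_result <- identical(by_position, by_name)
```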
Comparing the 8 sets from localsp and the 8 from population01, we noticed that they diverge in the number of lines, which calls for two notes on the following table. First, not all microdata categories (NM_CATEGORIA) are present in all municipalities of São Paulo and, therefore, the sets do not correspond line by line. Second, a municipality may have more than one point for a given category, as is the case of isolated urban areas (985 lines).
Table: Comparison between the number of lines of the 8 sets formed from population01 and the 8 derived from localsp
## Data.coming.from.population01 Data.coming.from.localsp
## 1 645 645
## 2 645 985
## 3 645 295
## 4 645 12
## 5 645 104
## 6 645 30
## 7 645 61
## 8 645 9
Source: the authors (2023). Legend: Table comparing the number of lines in the sets from population01 (first column) and localsp (second column).
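A table of this kind can be assembled by applying nrow() to named lists of data frames, which also keeps the category label next to each count. The sketch below uses hypothetical lists (sets_population and sets_localsp) with made-up sizes, not the real sets.

```r
# Hypothetical named lists standing in for the two families of sets
sets_population <- list(cities = data.frame(x = 1:3), aui = data.frame(x = 1:3))
sets_localsp    <- list(cities = data.frame(x = 1:3), aui = data.frame(x = 1:5))

comparison <- data.frame(
  category        = names(sets_population),
  from_population = sapply(sets_population, nrow),
  from_localsp    = sapply(sets_localsp, nrow),
  row.names       = NULL
)

# Flag the categories with more points than municipality lines
comparison$more_points <- comparison$from_localsp > comparison$from_population
```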
Despite the divergence between the number of lines, we emphasize that this does not invalidate data mining; on the contrary, this confirms the presence and/or absence of different categories in the territory of São Paulo and allows us to recognize the demography in each of the municipalities that make up the micro data.
Having established the 8 sets from localsp and the 8 derived from population01, we set out to join these sets, as a way of spatializing the demographic data. Therefore, we are guided by data from localsp, to which we add the respective populations (number of people). As we highlighted previously, there is a numerical divergence between the data from the two sets (Table) and therefore we must consider two aspects.
The first concerns the sets with 645 lines or fewer, for which there is a correspondence between Column 1 (data from population01) and Column 2 (data from localsp) of the Table. In this case, when a municipality has no sampling point (Column 2), no population is associated with that typology of census sector. Conversely, when all lines in Column 2 correspond to lines in Column 1, all sampling points had their populations integrated.
In the second aspect, we address the cases in which Column 2 has more than 645 points, exceeding the number of lines in Column 1, that is, not all geographic points coming from localsp have a one-to-one correspondence in population01. The only such case found during data mining was that of isolated urban areas (AUI), portrayed in the set sp_isolatedurbanareas. According to the IBGE technical documentation (2023), isolated urban areas are “an area defined by law and separated from the district headquarters [municipality] by rural area or by another legal limit”; therefore, the same municipality may have several AUIs recorded during the census sampling, as can be seen in Column 2 of the Table (line 2). However, as can be seen in Column 1 of the Table, the population data are aggregated, that is, all isolated urban areas in a municipality have their populations reported in a single line. Therefore, for the sake of analysis, and acknowledging this sampling limitation, we assigned the total AUI population of the municipality to each point, that is, the value for a municipality (Column 1) is repeated for all points associated with that municipality (Column 2). This overestimation will be dealt with during data mapping by differentiating colors on the map and, jointly, in the legend referring to data from isolated urban areas.
That said, let’s move on to data joining. Following the same logic just used, we divided the join into two moments, the first for the sets in Column 2 with points equal to or less than 645 lines and the second for the number of points greater than 645.
Script: Joining data from population01 and localsp.
#Joining data from the Cities typology
cities <- sp_cities%>%left_join(pop01_cities)
#Joining data from the Urban Villages typology
urbanvillages <- sp_urbanvillages%>%inner_join(pop01_urbanvillages)
#Joining data from the Traditional Communities typology
traditionalcommunities <- sp_traditionalcommunities%>%inner_join(pop01_traditionalcommunities)
#Joining data from the Rural Village typology
ruralvillage <- sp_ruralvillage%>%inner_join(pop01_ruralvillage)
#Joining data from the Rural Core typology
ruralcore <- sp_ruralcore%>%inner_join(pop01_ruralcore)
#Joining data from the Rural Settlement typology
ruralsettlement <- sp_settlement%>%inner_join(pop01_settlement)
#Joining data from the Settlement Projects typology
settlementproject <- sp_settlementproject%>%inner_join(pop01_settlementproject)
#Joining data from the Isolated Urban Areas typology
isolatedurbanareas <- sp_isolatedurbanareas%>%right_join(pop01_isolatedurbanareas)
isolatedurbanareas <- isolatedurbanareas[complete.cases(isolatedurbanareas$altitude, isolatedurbanareas$latitude),]
Source: the authors (2023). Caption: in the script, information from population01 is associated with the locations of localsp, through the join() function and its variants.
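The repetition of municipal totals across several points, discussed above for isolated urban areas, can be reproduced on a small example. The sketch assumes the dplyr package; the data frames points and pop and their values are hypothetical, and the join key is stated explicitly with by = "city", whereas the script above lets the join infer it from the common column names.

```r
library(dplyr)

# Hypothetical points: municipality "A" has two isolated urban areas
points <- data.frame(
  city          = c("A", "A", "B"),
  NM_LOCALIDADE = c("A1", "A2", "B1"),
  longitude     = c(-48.1, -48.2, -47.5),
  latitude      = c(-22.1, -22.2, -21.9),
  stringsAsFactors = FALSE
)

# Hypothetical population table: one line per municipality
pop <- data.frame(
  city       = c("A", "B", "C"),
  population = c(180, 45, 0),
  stringsAsFactors = FALSE
)

# Each point receives the total of its municipality, so the value for
# "A" appears twice after the join
joined <- inner_join(points, pop, by = "city")

# Number of points per municipality: values above 1 mark repeated totals
points_per_city <- joined %>% count(city, name = "n_points")

# Municipalities in the population table without any point
unmatched <- anti_join(pop, points, by = "city")
```

Counting the points per municipality after the join gives a simple way to identify where the population value is repeated and therefore overestimated if summed.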
According to the Script, we were able to connect the data from population01 and localsp, adding the respective populations to the different census typologies. In this way, we structure new sets, where the population information and geolocations of each of the points structured for them are found, namely: cities, urbanvillages, isolatedurbanareas, traditionalcommunities, ruralvillage, ruralcore, ruralsettlement and settlementproject.
Once the data joining stage is complete, we move on to plotting this data. Firstly, we must highlight that the choice of colors for the plot followed the guidelines established by Ellis and Ramankutty (2008), where shades of red represent urban populations and their nuances; while rural populations are associated with earthy tones (orange and brown). Furthermore, we use shades of blue to distinguish traditional populations, taking into account their sociocultural uniqueness, both in terms of their relationship with nature and in relation to their relevance for the maintenance and preservation of the identity of these groups. The Table presents the tones used (RGB code) for each territorial typology described by the data.
It is worth highlighting here that, unlike what was proposed for anthromes, we consider the features identified by IBGE (2023) in the spatial continuum proposal. According to this document, we notice an expansion of concepts and, consequently, of urban-rural approaches, which proves to be extremely relevant for structuring public policies in the country. Therefore, the continuum project proposed by the Brazilian Institute expands the guidelines proposed by Ellis (2020) regarding anthromes and, therefore, we consider this as an improvement in the delineation of Brazilian anthromes.
We emphasize from the outset that the IBGE document, published in 2023, largely uses data from the 2010 Demographic Census (IBGE, 2013) in its analytical structuring and modeling, a fact that aligns our project with the technical-scientific developments of the Brazilian Institute and does not invalidate such data as a source for current scientific production. It is also worth considering that both the IBGE document (2023) and this research precede the publication of the complete data from the 2022 Demographic Census, that is, this limitation is present in both products, which must be updated after the publication of the complete IBGE data. However, we reiterate, this does not invalidate the development of this work, as the code structure can be adapted to different sources of information, including a subsequent update with data from the 2022 Brazilian Demographic Census.
That said, we return to the attributes longitude and latitude that are part of the data sets just produced. Since these two attributes are fundamental for spatialization in the plot and, subsequently, in the mapping of population information, they are the ones used to construct the plot of the data below. Therefore, we select these two pieces of information using the “$” operator in each of the sets. Additionally, we chose the format for plotting the points using the “pch” descriptor, using the number “15” to plot as squares filled in the same color, which was chosen using the “col” descriptor. As we said above, the colors vary according to the typology of the census sector (Table). Below is the Script that encodes the separation of data, its plotting and the respective coloring of each one.
Script: Plotting the 8 data sets separately.
Source: the authors (2023). Legend: plot of population data from different territorial typologies with their respective colors.
As can be seen in the plots just produced, the data for each of the territorial typologies were spatialized according to the two geographic attributes (latitude and longitude). The “cex” descriptor set the size of each plotted square according to the size of the reference population (the population attribute of each data set). It is worth noting that we added a new column to the 8 data sets, named category (categoria in the code), in which the territorial typology of each set was recorded, so that we could plot the 8 sets in a single plot. Additionally, we created the colors set, which determines the colors for the single plot of populated anthromes. The following script presents these processes.
Script: Coded structure for plotting the 8 data sets.
#the 8 sets: cities, urbanvillages, isolatedurbanareas, traditionalcommunities, ruralvillage, ruralcore, ruralsettlement and settlementproject
colors <- c("Cities" = "#FF0000", "Isolated Urban Areas" = "#FF4747", "Urban Villages"= "#F66969", "Rural Village"= "#ED833B", "Rural Core"="#DF9B6D", "Rural Settlement"="#FFD966", "Settlement Project"="#968551", "Traditional Communities"="#9CC2E5")
legend_populatedanthromes <- data.frame(Categorias = unique(populated_anthromes$categoria), Cores = unique(colors))
legend_populatedanthromes
## Categorias Cores
## 1 Cities #FF0000
## 2 Isolated Urban Areas #FF4747
## 3 Urban Villages #F66969
## 4 Rural Village #ED833B
## 5 Rural Core #DF9B6D
## 6 Rural Settlement #FFD966
## 7 Settlement Project #968551
## 8 Traditional Communities #9CC2E5
Source: the authors (2023). Legend: the script presents the coded structures for plotting the 8 data sets, which are guided by the populated_anthromes and colors data sets.
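The legend table above pairs categories and colors by position, which is correct as long as the categories appear in the data in the same order as in the colors vector. A minimal sketch of a pairing by name, which does not depend on that order, is given below. The vector categoria is hypothetical; the hexadecimal codes are those of the colors set.

```r
# Subset of the colors set defined above
colors <- c("Cities" = "#FF0000",
            "Rural Core" = "#DF9B6D",
            "Traditional Communities" = "#9CC2E5")

# Hypothetical category column in which "Rural Core" happens to come first
categoria <- c("Rural Core", "Cities", "Cities", "Traditional Communities")

# Look each category up by name instead of relying on matching order
present <- unique(categoria)
legend_tbl <- data.frame(
  Categorias = present,
  Cores      = unname(colors[present]),
  stringsAsFactors = FALSE
)
```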
Once these operations were carried out, we proceeded to plot the data from the 8 sets simultaneously. We follow the color pattern established by the colors set and the respective typologies of the population sectors described by category. We reiterate that the spatialization of the data was based on latitude and longitude information.
Figure: Plot of data from the 8 sets of population anthromes.
Source: the authors (2023). Caption: plot of data referring to populated anthromes in the State of São Paulo, divided between the 8 categories created based on IBGE data.
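The steps described for the combined plot (category column, stacking of the sets, color lookup and symbol size by population) can be sketched in base R as follows. The coordinates and populations are made up, only two of the eight sets are represented, and the linear rescaling of cex between 0.5 and 3 is an assumption, since the exact scaling is not stated in the text.

```r
# Hypothetical reduced versions of two of the eight sets
cities <- data.frame(longitude = c(-48.1, -47.5),
                     latitude  = c(-22.1, -21.9),
                     population = c(31713, 3155))
ruralcore <- data.frame(longitude = -49.0, latitude = -23.0, population = 120)

# Typology column, then a single stacked data frame
cities$categoria    <- "Cities"
ruralcore$categoria <- "Rural Core"
populated <- rbind(cities, ruralcore)

colors <- c("Cities" = "#FF0000", "Rural Core" = "#DF9B6D")

# Assumed rescaling: symbol size grows linearly with population
rng <- range(populated$population)
populated$cex <- 0.5 + 2.5 * (populated$population - rng[1]) / (rng[2] - rng[1])

pdf(NULL)  # null device, so the sketch also runs without a display
plot(populated$longitude, populated$latitude,
     pch = 15,                                    # filled squares
     col = unname(colors[populated$categoria]),   # color by typology
     cex = populated$cex,
     xlab = "Longitude", ylab = "Latitude")
legend("topright", legend = names(colors), col = colors, pch = 15)
invisible(dev.off())
```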
The Figure reveals that the plotting of data from the 8 population groups (census typologies) occurred correctly, allowing the integration of the different typologies into a single figure. Furthermore, it is noted that the legend follows the coloring established for the different territorial categories. Therefore, through the figure, we can see the adequacy of the data distribution for territorial mapping, carried out in the analytical sequence.
After plotting the populated_anthromes data, we move on to the static mapping of this data set. The static mapping aimed to structure the distribution of points in the shape file of the municipalities (urban perimeters) of the State of São Paulo. To carry out this mapping, some adjustments to the sets were necessary, which are summarized below. The code for carrying out these actions was hidden in this document; however, it is found in the file available on GitHub associated with this work.
We take the opportunity to justify the use of the shapefile as the basis for the mapping. Our first option was to use the orbital images provided by Google Earth, using an API key to integrate the mapping with the Google LLC platform. However, during the construction of the mapping, we found that the images provided by the company are only available upon payment. Despite the technological advantages of using these Earth images, we chose not to use them, considering the cost of the operation at this stage of the research and the free services already provided by the Brazilian Institute of Geography and Statistics, as demonstrated by the shapefile used in the mapping. We therefore opted for IBGE files to maintain our technical-scientific alignment with free national data structures, allowing other researchers and users to access and build mappings such as the one presented below.
Once these operations have been carried out, we move on to mapping the data from the populated_anthromes set onto the shapefile cities_shape. To build the mapping we use the ggplot2 package and combine different functions associated with it. The main functions, in order of application, are ggplot(), geom_sf(), geom_point(), scale_color_manual(), labs(), xlab(), ylab() and theme_minimal().
That said, the following Script presents the code for constructing the mapping and, as a result, the mapping of populated anthromes.
Script: Static Mapping of Populated anthromes in the State of São Paulo.
map_anthromes <- ggplot() +
  #municipal boundaries from the IBGE shapefile
  geom_sf(data = cities_shape) +
  #one filled square (shape 15) per point, colored by category
  #(width and height are not geom_point() parameters and were dropped)
  geom_point(data = populated_anthromes, aes(x = longitude, y = latitude, color = categoria), shape = 15) +
  scale_color_manual(values = setNames(cores_categorias$cor, cores_categorias$categoria), breaks = ordem_categorias, labels = ordem_categorias) +
  labs(title = "Populated Anthromes", subtitle = "Study Area: State of São Paulo (Brazil)", color = "Populated Anthromes") +
  xlab("Longitude") +
  ylab("Latitude") +
  theme_minimal()
print(map_anthromes)
Source: the authors (2023). Caption: code showing the structure used to map data from the populated_anthromes set onto the shapefile cities_shape.
The mapping generated by the code above demonstrates that the populated_anthromes data was overlaid on the cities_shape shapefile as expected. The mapping followed the guidelines for distribution of sampling points according to the longitude and latitude described in the populated_anthromes set, as well as the color established for each category.
It is observed, however, that some squares (points referring to populated anthromes) extend beyond the areas of the shapefile cities_shape. This aspect was considered in the study of the uncertainty in the mapping of populated anthromes, which we present after the interactive mapping of populated anthromes.
In the next topic, we transpose static mapping to interactive mapping, in order to structure a map that can be integrated with technological services, such as websites and the GitHub collection.
Firstly, it was necessary to prepare some structures for the interactive mapping to be created next. The first of these was the structuring of the components used in the legend to be printed in the mapping. To this end, we created the set demographic_anthromes, which contains the names of the 8 typologies of populated anthromes that have been mapped up to this point. In addition, to define the categories and colors printed in the mapping, we used the data set described in the legend_populatedanthromes data frame, taking into account its previous use for plotting the data (performed in the previous item).
Having defined these aspects, we move on to the mapping itself. To create the interactive mapping, we used the leaflet package and the editable resources associated with it, of which we highlight the most relevant. Firstly, the addMarkers() function helped to demarcate the points where data were found in the demographic_anthromes set, using the latitude and longitude attributes for plotting.
In addition, the addRectangles() function helped to define the position and size of the squares used to demarcate the populated anthromes. At this point, we must highlight that we used a measurement of 0.03 degrees (positive and negative) to size the squares. The choice was based on the analysis of the literature and of the mapping itself since, during the tests, we observed that for smaller values the squares did not cover the surface of some cities, given the point-based nature of the demographic_anthromes data set. Therefore, we chose to use this value as a reference and to consider it later when analyzing the uncertainty of the generated mapping.
Thus, on this basis, the first interactive mapping of Brazilian anthromes is presented below, portraying the model area of this work, the State of São Paulo, and the populated anthromes present in it. It was structured on the basis of OpenStreetMap, a free, collaborative global mapping project that can be used by any user or researcher around the world. We reiterate that our choice is motivated by the dissemination of knowledge and the reproducibility of the research carried out here; the free nature and availability of these open maps therefore justify it.
Figure: Interactive Mapping of Populated anthromes in the State of São Paulo.
Source: the authors (2023). Caption: interactive mapping produced in R language (RStudio) presenting the populated anthromes of the State of São Paulo (Brazil), the reference area for the pilot study of Brazilian anthromes. The squares in the mapping describe the anthropogenic sectors and were drawn using 0.03 degrees (positive and negative) for latitude and longitude. The legend shows the colors used in the mapping and the typologies of the anthropogenic sectors to which they refer.
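The construction described above can be sketched as follows. This is a minimal illustration and not the code hidden in this document: the two-row data frame `pts`, the category names, the color values in `pal` and the name `half_side` are hypothetical, and we assume that 0.03 degrees is subtracted from and added to each coordinate to obtain the corners of a square.

```r
library(leaflet)

# Hypothetical stand-in for the populated anthromes points
pts <- data.frame(
  longitude = c(-46.63, -47.06),
  latitude  = c(-23.55, -22.91),
  categoria = c("Category A", "Category B"),
  stringsAsFactors = FALSE
)

# Hypothetical colors, one per category (the original legend data frame is not shown)
pal <- c("Category A" = "#A80000", "Category B" = "#FF7F00")

half_side <- 0.03  # assumption: offset in degrees applied on each side of a point

interactive_map <- leaflet(pts) %>%
  addTiles() %>%  # default OpenStreetMap tiles
  addRectangles(
    lng1 = ~longitude - half_side, lat1 = ~latitude - half_side,
    lng2 = ~longitude + half_side, lat2 = ~latitude + half_side,
    color = ~unname(pal[categoria]), weight = 1, fillOpacity = 0.6
  ) %>%
  addLegend(position = "bottomright", colors = unname(pal),
            labels = names(pal), title = "Populated Anthromes")
```

Printing `interactive_map` in an R session opens the widget; the same object can be embedded in an R Markdown document.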
The product has its limitations; however, it represents in a relevant way the populated anthropogenic sectors distributed in the territory of São Paulo. Returning to the question of the dimensions of the squares present in the mapping, we calculated the area described by each square. For this purpose, we used the averages of the latitude and longitude variables in the populated_anthromes set as a basis for the calculations, taking into account the high number of isolated points.
Furthermore, we consider the first two formulas presented below for calculating the width and height of the square. Subsequently, based on the results, we calculate the area of the square using the third formula in the sequence.
Formulas
\(\text{Width (longitude)} = 0.03 \times 111.32 \times \cos(\text{mean latitude})\)
\(\text{Height (latitude)} = 0.03 \times 111.32 \times \sin(\text{mean latitude})\)
\(\text{Area of the square} = \text{width} \times \text{height}\)
Once the mathematical expressions have been presented, the results obtained are reported below.
## square width (longitude) in kilometers: 3.020237 km
## square height (latitude) in kilometers: 1.425164 km
## Average square area in square kilometers: 4.304335 km²
We emphasize that up to this point no mapping uncertainty studies have been carried out, this being the subsequent stage of the work.
As established in the methodology of this work, we carried out certain procedures to evaluate and certify the quality of the regional mapping of anthromes. Following the investigative guidelines established by Lovelace, Nowosad and Muenchow (2019) and Wickham, Çetinkaya-Rundel and Grolemund (2023), we carried out studies to confirm the spatialization of the geospatial information (distribution of population data), to evaluate the uncertainty and error associated with the mapping from the perspective of Earth and Environmental Sciences, and to attest to the quality of the product generated by this study. As the aforementioned authors present, these investigations are part of mapping uncertainty and validation studies, which are reported below.
The initial stage of the quality analysis is the analysis of the overlap between the mapped points of the populated anthromes of the State of São Paulo and the data from the São Paulo localities, which portray the census sectors used during the 2010 Demographic Census (IBGE, 2013). According to the Brazilian legal apparatus (MAPA/INCRA, 2022; BRASIL, 2018; MMA, 2006; 2002), overlap analysis is a regulated instrument within the Federation for evaluating the overlap of polygons in areas registered in different government institutions. The objective is to identify whether rural and urban properties overlap spatially in property registers (rural and urban), which could generate territorial conflicts, tax defaults, among other judicial, civic and environmental problems.
From this perspective, analyzing the overlap of populated anthromes with the raw IBGE data aims to demonstrate the alignment of the product with the urban-rural mesh of the census sectors. As can be seen from the regulations just discussed, the non-overlapping of populated areas and locations in São Paulo would portray the inaccuracy of the mapping, potentially causing the aforementioned problems for territorial planning and, consequently, for the spheres of government. Thus, following the premises of Lovelace, Nowosad and Muenchow (2019) and Wickham, Çetinkaya-Rundel and Grolemund (2023), we investigated how the points referring to populated anthromes overlap with data from the São Paulo census sectors.
Thus, we converted the raw data from br_locations_2010 (shapefile) into a simple features (sf) object using the st_as_sf() function. After the conversion, we extracted the data referring to the State of São Paulo using the filter() function, structuring the saopaulo data set, which refers to the census sectors of the State of São Paulo. After structuring this set, here understood as the comparator, we determined the number of sample points, choosing all points (2143) as the sample for the overlap analysis.
Having determined the sampling points of the comparator (saopaulo), we set its coordinate reference system (CRS) to EPSG:4326, the same as that of the populated_anthromes data. The following Figure shows the number of sample points (IBGE raw data) established for the overlap analysis and their spatial distribution. Purple was chosen because this color is not used in any of the data sets worked on so far.
Figure: Sample points from raw IBGE data (localities_br)
Source: the authors (2023). Caption: figure showing the distribution (spatialization) of the points established for the overlap analysis with the points referring to populated anthromes.
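The preparation of the comparator described above can be sketched as follows. This is an illustration under stated assumptions: `br_locations_toy` is a hypothetical three-row stand-in for br_locations_2010 (which is read from a shapefile in the original work), and only the columns state, LONG and LAT, which appear in the IBGE data shown in this document, are reproduced.

```r
library(sf)
library(dplyr)

# Hypothetical stand-in for br_locations_2010; only the columns used here
br_locations_toy <- data.frame(
  state = c("SAO PAULO", "SAO PAULO", "MINAS GERAIS"),
  LONG  = c(-46.97105, -46.97078, -43.94),
  LAT   = c(-22.24241, -22.27151, -19.92)
)

# Convert to a simple features object, keeping LONG/LAT as attribute columns,
# and declare the coordinates as WGS 84 (EPSG:4326)
br_locations_sf <- st_as_sf(br_locations_toy, coords = c("LONG", "LAT"),
                            crs = 4326, remove = FALSE)

# Keep only the records of the State of São Paulo
saopaulo <- filter(br_locations_sf, state == "SAO PAULO")
```

If a layer already carries a different CRS, st_transform() reprojects the coordinates, whereas assigning a CRS only relabels them.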
After structuring the sampling points for overlap analysis, we retrieved the data from map_anthromes, that is, the mapped data of the populated anthromes. In order not to generate conflicts, the layer with the shapefile of the municipalities in the State of São Paulo was removed, leaving only the squares referring to the areas of populated anthromes.
Starting from the mapping, we built a simple features (sf) object, using the x and y dimensions as the longitude and latitude parameters and setting EPSG:4326 as the associated CRS, the same as that of the sample/comparator set and of the mappings produced in the previous item (mapping of anthromes).
Having established the two sets of geographic information, populated anthromes (anthromes_data_sf) and sample (amostra_sf), we proceeded to compare them. Firstly, we verified whether both had the same CRS associated with their structures, since only coordinates expressed in the same CRS can be compared directly to determine whether the geospatial information is distributed in the same area and overlaps within it. We performed this comparison using an if/else statement. The script reveals the structure of the statement and presents the result through the sentence that we set as the answer to the if/else question.
Script: Comparison of the CRS of the sets anthromes_data_sf and amostra_sf.
## [1] "the CRS are the same"
Source: the authors (2023). Caption: script presenting the construction of the if()/else() function for comparison between the two sets in relation to the geographic referencing system (CRS).
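A minimal sketch of such a comparison is given below. The two one-point objects are hypothetical stand-ins for anthromes_data_sf and amostra_sf, and the message of the else branch is an assumption, since only the affirmative message appears in the output above.

```r
library(sf)

# Hypothetical one-point versions of the two sets
anthromes_data_sf <- st_as_sf(data.frame(x = -46.97, y = -22.24),
                              coords = c("x", "y"), crs = 4326)
amostra_sf <- st_as_sf(data.frame(x = -46.97, y = -22.24),
                       coords = c("x", "y"), crs = 4326)

# st_crs() returns the CRS of each object; == compares the two definitions
if (st_crs(anthromes_data_sf) == st_crs(amostra_sf)) {
  crs_message <- "the CRS are the same"
} else {
  crs_message <- "the CRS are different"  # assumed wording
}
print(crs_message)
```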
Having confirmed that the CRS of the two sets are the same, we proceed to the overlap analysis. To do this, using the st_join() function, we combine the two sets (anthromes_data_sf and amostra_sf) into a single simple features (sf) object, which was named juncao_sp.
Starting from the join, we summarized the data using summarize(), counting the points grouped by coordinates (group_by(LAT&LONG), latitude and longitude). The product of this code indicates how many points overlap, using the point coordinates (latitude and longitude) as a reference. The following script shows the organization of the function that identifies the number of points, followed by the number of overlapping points.
Script: Count of overlapping points between sets.
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOINT
## Dimension: XY
## Bounding box: xmin: -53.05865 ymin: -25.21507 xmax: -44.19936 ymax: -19.87297
## Geodetic CRS: WGS 84
## # A tibble: 1 × 3
## `LAT & LONG` contagem_de_pontos geometry
## <lgl> <int> <MULTIPOINT [°]>
## 1 TRUE 2143 ((-53.05865 -22.58118), (-53.0027 -22.52495),…
Source: the authors (2023). Caption: script describing the function for analyzing the number of overlapping points between the two data sets (anthromes_data_sf and saopaulo), followed by the tabulated result of the comparison.
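In the result above, the grouping column `LAT & LONG` is logical and holds the single value TRUE. In R, `LAT & LONG` is a logical AND of the two columns, so every row with non-zero coordinates falls into one group, which suggests that the value 2143 is the number of rows in the joined set. The sketch below, built on hypothetical toy points, contrasts that grouping with a count per coordinate pair, assuming the intent was one group for each distinct latitude and longitude; the names `juncao`, `single_group` and `per_point` are illustrative only.

```r
library(sf)
library(dplyr)

# Hypothetical comparator points (three locations)
amostra_sf <- st_as_sf(
  data.frame(LONG = c(-46.97, -47.06, -48.50), LAT = c(-22.24, -22.91, -21.00)),
  coords = c("LONG", "LAT"), crs = 4326, remove = FALSE
)

# Hypothetical anthrome points (two of them coincide with the comparator)
anthromes_data_sf <- st_as_sf(
  data.frame(x = c(-46.97, -47.06), y = c(-22.24, -22.91), categoria = c("a", "b")),
  coords = c("x", "y"), crs = 4326
)

# Left spatial join: unmatched comparator points keep NA in 'categoria'
juncao <- st_join(amostra_sf, anthromes_data_sf)

# Grouping by the logical expression LAT & LONG yields a single group
single_group <- unique(juncao$LAT & juncao$LONG)

# Grouping by the two columns separately counts matches per coordinate pair
per_point <- juncao %>%
  st_drop_geometry() %>%
  group_by(LAT, LONG) %>%
  summarize(matches = sum(!is.na(categoria)), .groups = "drop")
```

In `per_point`, a value of 0 in `matches` flags a comparator point with no coinciding anthrome point.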
According to the results generated by the function, we observed that the 2143 sampled points overlapped with the points coming from populated anthromes (populated_anthromes). After verifying the number of overlapping points, the literature suggests that the results be visually evaluated, in order to verify the accuracy of the data overlap.
Considering this, we organize the sets to visualize the overlapping of points. First, we join the set tablea_contagem_sp with the data from saopaulo, producing the set map_results_sp, which represents the aggregated data.
Next, using the ggplot() function, we structured the mapping to visualize the overlapping of points. Data from map_results_sp were plotted in dark green (darkgreen) and data from populated anthromes (anthromes_data_sf) in red (red). In order to facilitate the visualization of the overlapping points, we chose to increase the size of the points from map_results_sp to 2 and reduce the points from anthromes_data_sf to 0.5, that is, the former were plotted significantly larger than the latter. Thus, the following script presents the code and, subsequently, the visualization of the overlapping of data from the two sets.
Script: Visualization of overlapping points between the two data sets.
Source: the authors (2023). Caption: script presenting the structure for mapping data compared by overlap analysis and generated mapping allowing the visualization of overlapping points in the territory of the State of São Paulo.
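The layering described above can be sketched as follows, using the colors and point sizes stated in the text. The two small point sets are hypothetical stand-ins for map_results_sp and anthromes_data_sf, and the municipal boundary layer is omitted.

```r
library(sf)
library(ggplot2)

# Hypothetical miniature versions of the two point sets
toy_points <- data.frame(x = c(-46.97, -47.06), y = c(-22.24, -22.91))
map_results_sp    <- st_as_sf(toy_points, coords = c("x", "y"), crs = 4326)
anthromes_data_sf <- st_as_sf(toy_points, coords = c("x", "y"), crs = 4326)

# Larger dark green points are drawn first; smaller red points are drawn on top,
# so a red dot centered on a green one indicates coinciding coordinates
overlap_plot <- ggplot() +
  geom_sf(data = map_results_sp, color = "darkgreen", size = 2) +
  geom_sf(data = anthromes_data_sf, color = "red", size = 0.5) +
  labs(x = "Longitude", y = "Latitude") +
  theme_minimal()
```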
By mapping the overlapping data, we observed that the anthromes_data_sf points (derived from populated_anthromes, in red) overlap with the IBGE gold standard (raw data from saopaulo, mapped in dark green). The visualization of the overlapping points made it possible to verify that the spatialization of data referring to populated anthromes occurs appropriately, following the same geographic coordinates as the saopaulo data and distributed throughout the territory of São Paulo, as seen on the cities_shape shapefile layer. Visual confirmation therefore provided a first indication in favor of the validation of the mapping.
In order to ensure data quality and accuracy in mapping the geospatial information of populated anthromes, we carried out examinations on the properties of overlapping data sets, i.e. the populated anthromes data (populated_anthromes, map_anthromes_data, anthromes_data_sf) and the gold standard based on IBGE data (saopaulo). As we highlighted previously, the data sets have the same geographic coordinate system (CRS) associated with their structures. Furthermore, they both have spatial dimensions of latitude and longitude, which allowed the insertion of points in the mapping.
Continuing the analysis, we verified the geographic limits (coordinates), in order to show that both sets represent the same territorial area of the mapping, that is, that the two sets have the same latitude and longitude extent in their geographic referencing structure. Using data from populated_anthromes and saopaulo, we performed the analysis with the range() function, which returns the minimum and maximum limits of the investigated parameters, in this case latitude/LAT and longitude/LONG (Table).
Table: Minimum and maximum latitude and longitude limits of the populated_anthromes and saopaulo sets.
## Maximum and minimum latitude of populated_anthromes: -25.21507 -19.87297
## Maximum and minimum latitude of São Paulo: -25.21507 -19.87297
## Maximum and minimum longitude of populated_anthromes: -53.05865 -44.19936
## Maximum and minimum longitude of saopaulo: -53.05865 -44.19936
Source: the authors (2023). Caption: table showing the minimum and maximum limits of latitude and longitude of the two sets of data analyzed using the range() function.
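A minimal sketch of this check is shown below. The two data frames are hypothetical three-row stand-ins whose extreme values are taken from the table above; the intermediate rows and the name `same_limits` are illustrative.

```r
# Hypothetical miniature versions of the two sets
populated_anthromes <- data.frame(
  latitude  = c(-25.21507, -22.50000, -19.87297),
  longitude = c(-53.05865, -48.00000, -44.19936)
)
saopaulo <- data.frame(
  LAT  = c(-25.21507, -23.10000, -19.87297),
  LONG = c(-53.05865, -47.20000, -44.19936)
)

# range() returns c(minimum, maximum) of a numeric vector
cat("Latitude limits of populated_anthromes:", range(populated_anthromes$latitude), "\n")
cat("Latitude limits of saopaulo:", range(saopaulo$LAT), "\n")

# TRUE only when both latitude and longitude limits coincide exactly
same_limits <- identical(range(populated_anthromes$latitude), range(saopaulo$LAT)) &&
  identical(range(populated_anthromes$longitude), range(saopaulo$LONG))
```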
The information from the range() function confirms that the minimum and maximum limits, both for longitude and latitude, are the same for both sets. Continuing with the verification, we use the summary() function to analyze extreme statistical values associated with the two sets (Script). Just as we performed the analysis using the range() function, here we consider the latitude and longitude information of the two sets.
Script: Application of the summary() function to analyze extreme values.
summary(populated_anthromes$latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25.22 -23.31 -22.67 -22.43 -21.57 -19.87
summary(saopaulo$LAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25.22 -23.31 -22.67 -22.43 -21.58 -19.87
summary(populated_anthromes$longitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -53.06 -49.48 -48.08 -48.29 -46.95 -44.20
summary(saopaulo$LONG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -53.06 -49.48 -48.07 -48.29 -46.95 -44.20
Source: the authors (2023). Caption: code for statistical summary of populated_anthromes and saopaulo data, presenting the minimum, first quartile, median, mean, third quartile and maximum values.
We verified that the results obtained are consistent, with small differences in the third quartile of latitude (0.01) and in the median of longitude (0.01). This difference is associated with the number of points described by the two sets: populated_anthromes is made up of 2141 points, while saopaulo is made up of 2143. The two saopaulo records absent from populated_anthromes are listed below.
## Simple feature collection with 2 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -46.97105 ymin: -22.27151 xmax: -46.97078 ymax: -22.24241
## Geodetic CRS: WGS 84
## ID CD_GEOCODI TIPO CD_GEOCODB NM_BAIRRO CD_GEOCODS NM_SUBDIST
## 1 17457 355730305000009 URBANO <NA> <NA> 35573030500 <NA>
## 2 17458 355730305000010 URBANO <NA> <NA> 35573030500 <NA>
## CD_GEOCODD NM_DISTRIT CD_GEOCODM city NM_MICRO NM_MESO state
## 1 355730305 ESTIVA GERBI 3557303 ESTIVA GERBI MOJI-MIRIM CAMPINAS SAO PAULO
## 2 355730305 ESTIVA GERBI 3557303 ESTIVA GERBI MOJI-MIRIM CAMPINAS SAO PAULO
## CD_NIVEL CD_CATEGOR NM_CATEGOR NM_LOCALID LONG LAT
## 1 6 001 AUI RANCHO NOVO -46.97105 -22.24241
## 2 6 002 AUI RECANTO DO ORIÇANGA -46.97078 -22.27151
## ALT GMRotation geometry
## 1 669.9272 0 POINT (-46.97105 -22.24241)
## 2 624.0211 0 POINT (-46.97078 -22.27151)
When analyzing the sets to identify the rows missing from populated_anthromes, we observed that the two rows from saopaulo refer to two isolated urban areas (AUI) in the municipality of Estiva Gerbi (microregion of Moji-Mirim). According to population data01, a set used to associate demographic density with populated areas (census sectors), AUI data are clustered and associated with a single point, which can describe more than one location. Therefore, the error (or spatial limitation) is due to this aspect of data clustering and is represented in the following Figure.
Figure: Mapped representation of missing points in populated_anthromes.
Source: the authors (2023). Caption: figure presenting the summary of the overlap analysis, in which the two points present in saopaulo and missing from populated_anthromes are indicated in red.
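One way to isolate such missing rows is an anti-join on the coordinate columns. The sketch below is an assumption about how this could be done rather than the procedure used in this work; the data frames are hypothetical, with the two Estiva Gerbi coordinates taken from the output above.

```r
library(dplyr)

# Hypothetical comparator table: the two Estiva Gerbi points plus one shared point
saopaulo_df <- data.frame(
  LONG = c(-46.97105, -46.97078, -47.06),
  LAT  = c(-22.24241, -22.27151, -22.91)
)

# Hypothetical anthromes table holding only the shared point
populated_df <- data.frame(longitude = -47.06, latitude = -22.91)

# Rows of saopaulo_df with no coordinate match in populated_df
missing_points <- anti_join(
  saopaulo_df, populated_df,
  by = c("LONG" = "longitude", "LAT" = "latitude")
)
```

For sf objects, the geometry column would first be dropped with st_drop_geometry(); exact matching on coordinates also assumes that both tables store them with the same precision.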
Deepening our statistical analyses of the mapping of São Paulo's anthromes, and considering data from the Brazilian Demographic Census (IBGE, 2013) as the gold standard, we move on to Summary Statistics. To carry them out, we established a statistical grid over the territory of São Paulo, using data from saopaulo (simple features object, sf, derived from br_locations_2010, our comparator or gold standard). This grid was built with 400 cells (20 by 20), taking the sphericity of the Earth into account, as we present below.
First, we convert the set saopaulo into a spatial object (class Spatial), using the function as(_, "Spatial"), giving rise to the set saopaulo_spatial. Starting from this, we establish the minimum and maximum X and Y values (longitude and latitude, respectively).
Having established the minimum and maximum values, we calculated the dimensions of the cells of the statistical grid (20 divisions along each axis). With this calculation we arrive at the values size_cell_x and size_cell_y, representing the width (longitude) and height (latitude) of the statistical grid square.
It is noteworthy that in these calculations on the size of the squares (cells) of the statistical grid, the sphericity of the Earth was considered, deriving the formulas used here from the Haversine Formula, which is commonly used to calculate the distance between two points. The global structure of this formula is represented as follows:
\[ a = \sin^2\left(\frac{\Delta\text{lat}}{2}\right) + \cos(\text{lat}_1) \cdot \cos(\text{lat}_2) \cdot \sin^2\left(\frac{\Delta\text{lon}}{2}\right) \]
\[ c = 2 \cdot \text{atan2}\left(\sqrt{a}, \sqrt{1-a}\right) \]
\[ d = R \cdot c \]
where:
- \(\Delta\text{lat}\) is the difference in latitude between the two points,
- \(\Delta\text{lon}\) is the difference in longitude between the two points,
- \(\text{lat}_1\) and \(\text{lat}_2\) are the latitudes of the two points in radians,
- \(R\) is the radius of the sphere (for example, the average radius of the Earth).
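The Haversine Formula above can be written as a short R function. This is a sketch under stated assumptions: the function name `haversine_km`, its argument names and the mean Earth radius of 6371.0088 km are our choices for illustration, and inputs are taken in decimal degrees and converted to radians inside the function.

```r
# Great-circle distance in kilometers between two points given in decimal degrees
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371.0088) {
  to_rad  <- pi / 180
  phi1    <- lat1 * to_rad            # lat_1 in radians
  phi2    <- lat2 * to_rad            # lat_2 in radians
  dphi    <- (lat2 - lat1) * to_rad   # difference in latitude
  dlambda <- (lon2 - lon1) * to_rad   # difference in longitude

  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  central_angle <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R * central_angle
}

# One degree of latitude along a meridian
one_degree_lat <- haversine_km(0, 0, 0, 1)
```

With this radius, one degree of latitude corresponds to roughly 111.2 km, and the width of a grid cell can be estimated by calling the function with two longitudes at a fixed latitude.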
In this way, we were able to size the distances between two points of longitude and latitude, arriving at the width and height dimensions of the grid cells and, therefore, we were able to size the approximate area of each of them. The dimensional values obtained from the calculations performed are presented below.
Table: Cell dimensions in a 20x20 statistical grid (400 cells).
## Cell size (quadrant) in kilometers (km):
## width (cell_size_x_km): 45.57963 km
## height(cell_size_y_km): 11.34617 km
## Area of each quadrant of the grid in km²: 517.1542 km²
## Average dimension of the sides of the grid square in km: 22.74102 km
Source: the authors (2023). Caption: table representing the values obtained during the dimensioning of the height, width and area of the cells of the statistical grid structured in 20x20 (400).
Furthermore, with these dimensions we structured the statistical grid with 400 cells using the raster() function, which was named quadrant_grid and was used in subsequent calculations. Additionally, it was necessary to structure the XY ordered pairs of the statistical grid so that we could visualize the cells in the plot, which was done using the as.data.frame() function.
After establishing the statistical grid with 400 cells, we started counting points in each cell. The objective was to identify the distribution behavior of anthromes mapping points compared to the gold standard (IBGE data). This count was carried out in two moments, the first for data from saopaulo and the second for populated_anthromes.
Using the rasterize() function, we count points per quadrant in saopaulo. The count allowed the construction of a data frame with 2 columns and 400 rows, in which the columns represent the cells (400) and the number of points in each of them. Thus, each of the lines refers to one of the cells in the statistical grid.
After counting saopaulo points, we performed a similar procedure with the data from populated_anthromes. Firstly, this set was converted into a spatial object (spatial), using the st_as_sf() function, structuring the populated_anthromes_spatial object. Sequentially, the rasterize() function led to the counting of points in the statistical grid for the populated anthromes, generating the count_by_quadrant_anthromes data frame. Again, the cell information and number of points per cell were isolated in a set referring to data from populated anthromes (anthromes_countpoints). With this data, we were able to map the distribution of points on the statistical grid of anthromes.
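The grid construction and point counting described above can be sketched with the raster package as follows. The extent comes from the bounding box reported earlier in this document, while the three points and the object names `xy`, `count_raster` and `count_by_quadrant` are hypothetical; the counting function is one possible choice and may differ from the one used in the original script.

```r
library(raster)

# 20 x 20 grid over the bounding box reported for the São Paulo data
quadrant_grid <- raster(nrows = 20, ncols = 20,
                        xmn = -53.05865, xmx = -44.19936,
                        ymn = -25.21507, ymx = -19.87297,
                        crs = "+proj=longlat +datum=WGS84")

# Hypothetical points (longitude, latitude): two in one cell, one elsewhere
xy <- cbind(c(-46.60, -46.61, -50.00),
            c(-23.50, -23.51, -21.00))

# Count the points falling in each cell; cells without points remain NA
count_raster <- rasterize(xy, quadrant_grid, fun = function(x, ...) length(x))

# One row per cell: cell-center coordinates (x, y) and the point count
count_by_quadrant <- as.data.frame(count_raster, xy = TRUE)
```

The resulting data frame has 400 rows, one per cell, which matches the structure used for the two count tables in the text.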
The following figure illustrates the Statistical Grid with the Point Count for saopaulo (gold standard) and for populated_anthromes. Comparatively, we observed that some of the points referring to the gold standard are counted in other cells in the populated anthromes. This difference is associated with the EPSG structure in which the two sets were found. As we mentioned previously, the CRS initially differed between the two sets (gold standard and populated anthromes) and adjustments were made throughout the analyses. This generated small distortions in the distribution of points, in a few cases, as we will demonstrate through other metrics about the mapping.
Figure: Statistical Grids with Point Counts of the Gold Standard and Populated anthromes of the State of São Paulo.
Source: the authors (2023). Caption: Figure representing the statistical grid with the point count of (a) saopaulo (gold standard) and (b) populated_anthromes, both referring to the State of São Paulo. The colored areas represent the color gradient according to the number of points in the grid cells. The gray areas represent the cells where there were no points, that is, areas that went beyond the analysis area of the perimeter of the State of São Paulo and, consequently, of the populated anthromes of the Federation Unit.
First, we combined the two data sets associated with point counting (saopaulo_countpoints and anthromes_countpoints). The purpose of combining the two sets was to structure the confusion matrix for statistical analyses. In it, we aligned the numbers of points in each cell of the statistical grid (20x20) of the two sets, in order to establish the following relationships:
To structure these relationships, we used the ifelse() function to determine, for each cell, whether the counts are equal and different from 0 (TP), both equal to 0 (TN), greater (FN) or smaller (FP) in saopaulo_countpoints when compared to anthromes_countpoints.
If the number of points were equal in both sets and different from 0, the TP column would receive the value of 1 and the TN, FP and FN columns would receive the value of 0. If the number of points were equal to 0 in both sets, the TN column would receive the value of 1 and the TP, FN and FP columns the value of 0. Additionally, if the number of points in the grid quadrant were greater in saopaulo_countpoints than in anthromes_countpoints, the FN column would receive the value of 1 and the TP, TN and FP columns the value of 0. On the other hand, if the number of points were smaller in saopaulo_countpoints than in anthromes_countpoints, the FP column would receive the value of 1 and the TP, TN and FN columns the value of 0.
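These rules can be sketched with ifelse() as follows. The first four rows reproduce cells 1, 5, 174 and 220 of the confusion matrix printed in this section; the fifth row, with missing counts, is hypothetical and is included only to show missing values being treated as 0 before classification. The short names `s` and `a` are illustrative.

```r
# Cells 1, 5, 174 and 220 from the confusion matrix, plus one hypothetical NA cell
combined_set <- data.frame(
  Celula                = c(1, 5, 174, 220, 400),
  saopaulo_countpoints  = c(0, 1, 10, 5, NA),
  anthromes_countpoints = c(0, 1, 8, 6, NA)
)

# Cells without data are treated as having zero points
combined_set[is.na(combined_set)] <- 0

s <- combined_set$saopaulo_countpoints
a <- combined_set$anthromes_countpoints

combined_set$TP <- ifelse(s == a & s != 0, 1, 0)  # equal and non-zero
combined_set$TN <- ifelse(s == 0 & a == 0, 1, 0)  # both zero
combined_set$FN <- ifelse(s > a, 1, 0)            # more points in the gold standard
combined_set$FP <- ifelse(s < a, 1, 0)            # more points in the anthromes
```

Because the four conditions are mutually exclusive and exhaustive, each cell receives exactly one flag.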
It should be noted that in the statistical grid there were areas that did not represent data from São Paulo (areas in gray in the Figure above), that is, rows where "NA" appeared. These values were replaced by 0 in the combined set using the expression combined_set[is.na(combined_set)] <- 0, in order to allow statistical calculations based on the confusion matrix. The following table illustrates the confusion matrix structured by this operation.
Table: Confusion Matrix
## Celula saopaulo_countpoints anthromes_countpoints TP TN FN FP
## 1 1 0 0 0 1 0 0
## 2 2 0 0 0 1 0 0
## 3 3 0 0 0 1 0 0
## 4 4 0 0 0 1 0 0
## 5 5 1 1 1 0 0 0
## 6 6 10 10 1 0 0 0
## 7 7 15 15 1 0 0 0
## 8 8 3 3 1 0 0 0
## 9 9 1 1 1 0 0 0
## 10 10 0 0 0 1 0 0
## 11 11 0 0 0 1 0 0
## 12 12 4 4 1 0 0 0
## 13 13 6 6 1 0 0 0
## 14 14 0 0 0 1 0 0
## 15 15 0 0 0 1 0 0
## 16 16 0 0 0 1 0 0
## 17 17 0 0 0 1 0 0
## 18 18 0 0 0 1 0 0
## 19 19 0 0 0 1 0 0
## 20 20 0 0 0 1 0 0
## 21 21 0 0 0 1 0 0
## 22 22 0 0 0 1 0 0
## 23 23 0 0 0 1 0 0
## 24 24 0 0 0 1 0 0
## 25 25 28 28 1 0 0 0
## 26 26 9 9 1 0 0 0
## 27 27 8 8 1 0 0 0
## 28 28 5 5 1 0 0 0
## 29 29 6 6 1 0 0 0
## 30 30 2 2 1 0 0 0
## 31 31 2 2 1 0 0 0
## 32 32 10 10 1 0 0 0
## 33 33 5 5 1 0 0 0
## 34 34 1 1 1 0 0 0
## 35 35 0 0 0 1 0 0
## 36 36 0 0 0 1 0 0
## 37 37 0 0 0 1 0 0
## 38 38 0 0 0 1 0 0
## 39 39 0 0 0 1 0 0
## 40 40 0 0 0 1 0 0
## 41 41 0 0 0 1 0 0
## 42 42 0 0 0 1 0 0
## 43 43 0 0 0 1 0 0
## 44 44 2 2 1 0 0 0
## 45 45 3 3 1 0 0 0
## 46 46 9 9 1 0 0 0
## 47 47 8 8 1 0 0 0
## 48 48 11 11 1 0 0 0
## 49 49 10 10 1 0 0 0
## 50 50 8 8 1 0 0 0
## 51 51 4 4 1 0 0 0
## 52 52 5 5 1 0 0 0
## 53 53 9 9 1 0 0 0
## 54 54 3 3 1 0 0 0
## 55 55 0 0 0 1 0 0
## 56 56 0 0 0 1 0 0
## 57 57 0 0 0 1 0 0
## 58 58 0 0 0 1 0 0
## 59 59 0 0 0 1 0 0
## 60 60 0 0 0 1 0 0
## 61 61 0 0 0 1 0 0
## 62 62 0 0 0 1 0 0
## 63 63 0 0 0 1 0 0
## 64 64 6 6 1 0 0 0
## 65 65 3 3 1 0 0 0
## 66 66 4 4 1 0 0 0
## 67 67 11 11 1 0 0 0
## 68 68 14 14 1 0 0 0
## 69 69 17 17 1 0 0 0
## 70 70 8 8 1 0 0 0
## 71 71 7 7 1 0 0 0
## 72 72 5 5 1 0 0 0
## 73 73 1 1 1 0 0 0
## 74 74 0 0 0 1 0 0
## 75 75 0 0 0 1 0 0
## 76 76 0 0 0 1 0 0
## 77 77 0 0 0 1 0 0
## 78 78 0 0 0 1 0 0
## 79 79 0 0 0 1 0 0
## 80 80 0 0 0 1 0 0
## 81 81 0 0 0 1 0 0
## 82 82 0 0 0 1 0 0
## 83 83 2 2 1 0 0 0
## 84 84 3 3 1 0 0 0
## 85 85 6 6 1 0 0 0
## 86 86 22 22 1 0 0 0
## 87 87 25 25 1 0 0 0
## 88 88 7 7 1 0 0 0
## 89 89 11 11 1 0 0 0
## 90 90 13 13 1 0 0 0
## 91 91 10 10 1 0 0 0
## 92 92 17 17 1 0 0 0
## 93 93 5 5 1 0 0 0
## 94 94 1 1 1 0 0 0
## 95 95 0 0 0 1 0 0
## 96 96 0 0 0 1 0 0
## 97 97 0 0 0 1 0 0
## 98 98 0 0 0 1 0 0
## 99 99 0 0 0 1 0 0
## 100 100 0 0 0 1 0 0
## 101 101 0 0 0 1 0 0
## 102 102 0 0 0 1 0 0
## 103 103 6 6 1 0 0 0
## 104 104 8 8 1 0 0 0
## 105 105 3 3 1 0 0 0
## 106 106 11 11 1 0 0 0
## 107 107 13 13 1 0 0 0
## 108 108 20 20 1 0 0 0
## 109 109 13 13 1 0 0 0
## 110 110 13 13 1 0 0 0
## 111 111 9 9 1 0 0 0
## 112 112 8 8 1 0 0 0
## 113 113 7 7 1 0 0 0
## 114 114 3 3 1 0 0 0
## 115 115 3 3 1 0 0 0
## 116 116 0 0 0 1 0 0
## 117 117 0 0 0 1 0 0
## 118 118 0 0 0 1 0 0
## 119 119 0 0 0 1 0 0
## 120 120 0 0 0 1 0 0
## 121 121 0 0 0 1 0 0
## 122 122 0 0 0 1 0 0
## 123 123 4 4 1 0 0 0
## 124 124 9 9 1 0 0 0
## 125 125 14 14 1 0 0 0
## 126 126 9 9 1 0 0 0
## 127 127 7 7 1 0 0 0
## 128 128 4 4 1 0 0 0
## 129 129 23 23 1 0 0 0
## 130 130 8 8 1 0 0 0
## 131 131 7 7 1 0 0 0
## 132 132 5 5 1 0 0 0
## 133 133 6 6 1 0 0 0
## 134 134 5 5 1 0 0 0
## 135 135 7 7 1 0 0 0
## 136 136 0 0 0 1 0 0
## 137 137 0 0 0 1 0 0
## 138 138 0 0 0 1 0 0
## 139 139 0 0 0 1 0 0
## 140 140 0 0 0 1 0 0
## 141 141 0 0 0 1 0 0
## 142 142 0 0 0 1 0 0
## 143 143 4 4 1 0 0 0
## 144 144 10 10 1 0 0 0
## 145 145 15 15 1 0 0 0
## 146 146 17 17 1 0 0 0
## 147 147 7 7 1 0 0 0
## 148 148 6 6 1 0 0 0
## 149 149 8 8 1 0 0 0
## 150 150 10 10 1 0 0 0
## 151 151 5 5 1 0 0 0
## 152 152 5 5 1 0 0 0
## 153 153 8 8 1 0 0 0
## 154 154 8 8 1 0 0 0
## 155 155 6 6 1 0 0 0
## 156 156 0 0 0 1 0 0
## 157 157 0 0 0 1 0 0
## 158 158 0 0 0 1 0 0
## 159 159 0 0 0 1 0 0
## 160 160 0 0 0 1 0 0
## 161 161 0 0 0 1 0 0
## 162 162 0 0 0 1 0 0
## 163 163 2 2 1 0 0 0
## 164 164 15 15 1 0 0 0
## 165 165 5 5 1 0 0 0
## 166 166 5 5 1 0 0 0
## 167 167 10 10 1 0 0 0
## 168 168 18 18 1 0 0 0
## 169 169 10 10 1 0 0 0
## 170 170 20 20 1 0 0 0
## 171 171 12 12 1 0 0 0
## 172 172 8 8 1 0 0 0
## 173 173 8 8 1 0 0 0
## 174 174 10 8 0 0 1 0
## 175 175 3 3 1 0 0 0
## 176 176 0 0 0 1 0 0
## 177 177 0 0 0 1 0 0
## 178 178 0 0 0 1 0 0
## 179 179 0 0 0 1 0 0
## 180 180 0 0 0 1 0 0
## 181 181 2 2 1 0 0 0
## 182 182 1 1 1 0 0 0
## 183 183 4 4 1 0 0 0
## 184 184 5 5 1 0 0 0
## 185 185 5 5 1 0 0 0
## 186 186 3 3 1 0 0 0
## 187 187 6 6 1 0 0 0
## 188 188 9 9 1 0 0 0
## 189 189 13 13 1 0 0 0
## 190 190 14 14 1 0 0 0
## 191 191 13 13 1 0 0 0
## 192 192 7 7 1 0 0 0
## 193 193 13 13 1 0 0 0
## 194 194 20 20 1 0 0 0
## 195 195 6 6 1 0 0 0
## 196 196 0 0 0 1 0 0
## 197 197 0 0 0 1 0 0
## 198 198 0 0 0 1 0 0
## 199 199 4 4 1 0 0 0
## 200 200 0 0 0 1 0 0
## 201 201 2 2 1 0 0 0
## 202 202 2 2 1 0 0 0
## 203 203 0 0 0 1 0 0
## 204 204 1 1 1 0 0 0
## 205 205 4 4 1 0 0 0
## 206 206 10 10 1 0 0 0
## 207 207 6 6 1 0 0 0
## 208 208 3 3 1 0 0 0
## 209 209 6 6 1 0 0 0
## 210 210 10 10 1 0 0 0
## 211 211 14 14 1 0 0 0
## 212 212 15 15 1 0 0 0
## 213 213 47 47 1 0 0 0
## 214 214 48 48 1 0 0 0
## 215 215 29 29 1 0 0 0
## 216 216 0 0 0 1 0 0
## 217 217 4 4 1 0 0 0
## 218 218 5 5 1 0 0 0
## 219 219 10 10 1 0 0 0
## 220 220 5 6 0 0 0 1
## 221 221 0 0 0 1 0 0
## 222 222 0 0 0 1 0 0
## 223 223 0 0 0 1 0 0
## 224 224 0 0 0 1 0 0
## 225 225 0 0 0 1 0 0
## 226 226 6 6 1 0 0 0
## 227 227 3 3 1 0 0 0
## 228 228 12 12 1 0 0 0
## 229 229 17 17 1 0 0 0
## 230 230 5 5 1 0 0 0
## 231 231 7 7 1 0 0 0
## 232 232 12 12 1 0 0 0
## 233 233 14 14 1 0 0 0
## 234 234 33 33 1 0 0 0
## 235 235 25 25 1 0 0 0
## 236 236 3 3 1 0 0 0
## 237 237 20 20 1 0 0 0
## 238 238 8 8 1 0 0 0
## 239 239 6 6 1 0 0 0
## 240 240 0 0 0 1 0 0
## 241 241 0 0 0 1 0 0
## 242 242 0 0 0 1 0 0
## 243 243 0 0 0 1 0 0
## 244 244 0 0 0 1 0 0
## 245 245 0 0 0 1 0 0
## 246 246 0 0 0 1 0 0
## 247 247 0 0 0 1 0 0
## 248 248 1 1 1 0 0 0
## 249 249 21 21 1 0 0 0
## 250 250 39 39 1 0 0 0
## 251 251 15 15 1 0 0 0
## 252 252 27 27 1 0 0 0
## 253 253 33 33 1 0 0 0
## 254 254 44 44 1 0 0 0
## 255 255 97 97 1 0 0 0
## 256 256 18 18 1 0 0 0
## 257 257 12 12 1 0 0 0
## 258 258 3 3 1 0 0 0
## 259 259 0 0 0 1 0 0
## 260 260 0 0 0 1 0 0
## 261 261 0 0 0 1 0 0
## 262 262 0 0 0 1 0 0
## 263 263 0 0 0 1 0 0
## 264 264 0 0 0 1 0 0
## 265 265 0 0 0 1 0 0
## 266 266 0 0 0 1 0 0
## 267 267 0 0 0 1 0 0
## 268 268 0 0 0 1 0 0
## 269 269 13 13 1 0 0 0
## 270 270 11 11 1 0 0 0
## 271 271 7 7 1 0 0 0
## 272 272 14 14 1 0 0 0
## 273 273 33 33 1 0 0 0
## 274 274 41 41 1 0 0 0
## 275 275 42 42 1 0 0 0
## 276 276 46 46 1 0 0 0
## 277 277 8 8 1 0 0 0
## 278 278 3 3 1 0 0 0
## 279 279 2 2 1 0 0 0
## 280 280 0 0 0 1 0 0
## 281 281 0 0 0 1 0 0
## 282 282 0 0 0 1 0 0
## 283 283 0 0 0 1 0 0
## 284 284 0 0 0 1 0 0
## 285 285 0 0 0 1 0 0
## 286 286 0 0 0 1 0 0
## 287 287 0 0 0 1 0 0
## 288 288 1 1 1 0 0 0
## 289 289 7 7 1 0 0 0
## 290 290 1 1 1 0 0 0
## 291 291 4 4 1 0 0 0
## 292 292 10 10 1 0 0 0
## 293 293 14 14 1 0 0 0
## 294 294 5 5 1 0 0 0
## 295 295 69 69 1 0 0 0
## 296 296 14 14 1 0 0 0
## 297 297 3 3 1 0 0 0
## 298 298 6 6 1 0 0 0
## 299 299 0 0 0 1 0 0
## 300 300 0 0 0 1 0 0
## 301 301 0 0 0 1 0 0
## 302 302 0 0 0 1 0 0
## 303 303 0 0 0 1 0 0
## 304 304 0 0 0 1 0 0
## 305 305 0 0 0 1 0 0
## 306 306 0 0 0 1 0 0
## 307 307 0 0 0 1 0 0
## 308 308 0 0 0 1 0 0
## 309 309 4 4 1 0 0 0
## 310 310 4 4 1 0 0 0
## 311 311 4 4 1 0 0 0
## 312 312 1 1 1 0 0 0
## 313 313 2 2 1 0 0 0
## 314 314 2 2 1 0 0 0
## 315 315 14 14 1 0 0 0
## 316 316 14 14 1 0 0 0
## 317 317 0 0 0 1 0 0
## 318 318 1 1 1 0 0 0
## 319 319 0 0 0 1 0 0
## 320 320 0 0 0 1 0 0
## 321 321 0 0 0 1 0 0
## 322 322 0 0 0 1 0 0
## 323 323 0 0 0 1 0 0
## 324 324 0 0 0 1 0 0
## 325 325 0 0 0 1 0 0
## 326 326 0 0 0 1 0 0
## 327 327 0 0 0 1 0 0
## 328 328 0 0 0 1 0 0
## 329 329 1 1 1 0 0 0
## 330 330 6 6 1 0 0 0
## 331 331 1 1 1 0 0 0
## 332 332 3 3 1 0 0 0
## 333 333 13 13 1 0 0 0
## 334 334 8 8 1 0 0 0
## 335 335 1 1 1 0 0 0
## 336 336 0 0 0 1 0 0
## 337 337 0 0 0 1 0 0
## 338 338 0 0 0 1 0 0
## 339 339 0 0 0 1 0 0
## 340 340 0 0 0 1 0 0
## 341 341 0 0 0 1 0 0
## 342 342 0 0 0 1 0 0
## 343 343 0 0 0 1 0 0
## 344 344 0 0 0 1 0 0
## 345 345 0 0 0 1 0 0
## 346 346 0 0 0 1 0 0
## 347 347 0 0 0 1 0 0
## 348 348 0 0 0 1 0 0
## 349 349 2 2 1 0 0 0
## 350 350 11 11 1 0 0 0
## 351 351 4 4 1 0 0 0
## 352 352 3 3 1 0 0 0
## 353 353 1 1 1 0 0 0
## 354 354 1 1 1 0 0 0
## 355 355 0 0 0 1 0 0
## 356 356 0 0 0 1 0 0
## 357 357 0 0 0 1 0 0
## 358 358 0 0 0 1 0 0
## 359 359 0 0 0 1 0 0
## 360 360 0 0 0 1 0 0
## 361 361 0 0 0 1 0 0
## 362 362 0 0 0 1 0 0
## 363 363 0 0 0 1 0 0
## 364 364 0 0 0 1 0 0
## 365 365 0 0 0 1 0 0
## 366 366 0 0 0 1 0 0
## 367 367 0 0 0 1 0 0
## 368 368 0 0 0 1 0 0
## 369 369 0 0 0 1 0 0
## 370 370 0 0 0 1 0 0
## 371 371 2 2 1 0 0 0
## 372 372 8 8 1 0 0 0
## 373 373 2 2 1 0 0 0
## 374 374 0 0 0 1 0 0
## 375 375 0 0 0 1 0 0
## 376 376 0 0 0 1 0 0
## 377 377 0 0 0 1 0 0
## 378 378 0 0 0 1 0 0
## 379 379 0 0 0 1 0 0
## 380 380 0 0 0 1 0 0
## 381 381 0 0 0 1 0 0
## 382 382 0 0 0 1 0 0
## 383 383 0 0 0 1 0 0
## 384 384 0 0 0 1 0 0
## 385 385 0 0 0 1 0 0
## 386 386 0 0 0 1 0 0
## 387 387 0 0 0 1 0 0
## 388 388 0 0 0 1 0 0
## 389 389 0 0 0 1 0 0
## 390 390 0 0 0 1 0 0
## 391 391 0 0 0 1 0 0
## 392 392 1 2 0 0 0 1
## 393 393 0 0 0 1 0 0
## 394 394 0 0 0 1 0 0
## 395 395 0 0 0 1 0 0
## 396 396 0 0 0 1 0 0
## 397 397 0 0 0 1 0 0
## 398 398 0 0 0 1 0 0
## 399 399 0 0 0 1 0 0
## 400 400 0 0 0 1 0 0
Source: the authors (2023). Legend: Confusion matrix structured from the alignment by cells of saopaulo_countpoints and anthromes_countpoints. The table displays all 400 rows of the set, one per statistical grid cell.
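The logic behind the four indicator columns can be sketched as follows. This is a minimal illustration, not the code used to build combined_set: the data frame combined_example, its column names and its values are hypothetical, and the classification rule is an assumption inferred from the rows printed above (equal non-zero counts as TP, zero in both sets as TN, more gold standard points as FN, more anthromes points as FP).

```r
# Hypothetical per-cell point counts (illustrative values, not the thesis data)
combined_example <- data.frame(
  cell                  = 1:6,
  saopaulo_countpoints  = c(0, 4, 10, 5, 0, 1),
  anthromes_countpoints = c(0, 4,  8, 6, 0, 2)
)

gold  <- combined_example$saopaulo_countpoints   # gold standard counts
model <- combined_example$anthromes_countpoints  # anthromes mapping counts

# Assumed classification rule, one category per grid cell
combined_example$TP <- as.integer(gold == model & gold > 0)  # same non-zero count
combined_example$TN <- as.integer(gold == 0 & model == 0)    # no points in either set
combined_example$FN <- as.integer(gold > model)              # gold standard has more points
combined_example$FP <- as.integer(model > gold)              # anthromes mapping has more points

# Each cell must fall into exactly one category
stopifnot(all(rowSums(combined_example[, c("TP", "TN", "FN", "FP")]) == 1))

print(combined_example)
```

Under this assumed rule, the column sums give the cell counts that feed the metrics calculated below.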
After structuring the confusion matrix, we analyzed the sensitivity of the mapping on a 20x20 statistical grid for the State of São Paulo. Sensitivity evaluates the model’s ability to identify true positives (TPs); that is, this metric indicates whether the mapping of anthromes in São Paulo efficiently identifies the points present in the territory of São Paulo (gold standard). The metric was estimated using the formula:
\[\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}\]
where TP represents true positives and FN represents false negatives.
# Calculating sensitivity
sensitivity <- sum(combined_set$TP) / (sum(combined_set$TP) + sum(combined_set$FN))
# Displaying the result
sensitivity
## [1] 0.9951691
Through the sensitivity calculation, we obtained the value of 0.9951691, that is, approximately 99.52% of the gold standard points are captured within the populated anthromes points. Using this metric, we confirm that the model used in mapping populated anthromes in the State of São Paulo is capable of identifying areas similarly represented by the gold standard, consequently pointing to the quality of the mapping and the sensitivity of the method.
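As a worked check of this value, the substitution can be written out explicitly. The cell counts used here (TP = 206, FN = 1) are not quoted from the original output; they are inferred from the reported estimates under the assumption of a 400-cell grid:

\[\text{Sensitivity (Recall)} = \frac{206}{206 + 1} = \frac{206}{207} \approx 0.9951691\]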
Continuing the statistical analyses regarding the mapping of populated anthromes in the State of São Paulo, we move on to the specificity of the mapping. According to the literature, this metric refers to the model’s ability to identify mapped points that are not part of the comparison standard (gold standard). From this perspective, the specificity analysis aimed to identify whether the model used in the mapping (populated_anthromes) is capable of pointing out which points are not included in the same cell of the statistical grid as the gold standard (saopaulo).
To this end, we returned to the combined_set and verified the proportion of cells identified as True Negatives (TNs) relative to the sum of TNs and False Positives (FPs). In other words, we compared the quadrants in which neither set has points (TNs) with the quadrants in which the number of anthromes points was greater than the number of gold standard points (FPs). Specificity was estimated using the formula:
\[\text{Specificity} = \frac{TN}{TN + FP}\]
where TN represents the number of quadrants identified as True Negatives and FP the number of False Positives.
# Calculating Specificity
specificity <- sum(combined_set$TN) / (sum(combined_set$TN) + sum(combined_set$FP))
# Displaying the result
print(specificity)
## [1] 0.9896373
The result obtained for the mapping specificity metric was 0.9896373; that is, in the quadrants of the statistical grid where there are no points plotted by the gold standard, the model used for mapping anthromes operates with 98.96% accuracy, correctly classifying the absence of points in these areas. This value also indicates that there are quadrants where the points mapped by anthromes_countpoints exceed the number of points in saopaulo_countpoints. This finding is consistent with the overlap analysis carried out previously, where we demonstrated that some points were displaced during the spatial distribution (latitude and longitude of the points on the map) and, once the statistical grid was established, fell into cells different from those of the gold standard.
Therefore, despite the limitations just discussed, we observed that this metric also contributes to inferring the quality of the model used in mapping, demonstrating its suitability for the intended use.
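The specificity value can be checked in the same way. The cell counts used here (TN = 191, FP = 2) are not quoted from the original output; they are inferred from the reported estimates under the assumption of a 400-cell grid:

\[\text{Specificity} = \frac{191}{191 + 2} = \frac{191}{193} \approx 0.9896373\]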
Advancing in the statistical analyses, we turn to the global accuracy of the mapping. According to the literature, this metric identifies the proportion of grid cells that were correctly classified by the anthromes mapping when compared to the gold standard. In other words, it relates the number of True Positives (TPs) and True Negatives (TNs) to the total number of cells in the grid. Global accuracy is calculated using the formula:
\[\text{Global Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
where the sum of TP and TN, that is, the grid cells in which the anthromes mapping and the gold standard agree (the same number of points, or no points in either set), is divided by the total number of cells of the statistical grid, that is, by the sum of the cells that agree (TPs and TNs) and the cells that differ (FPs and FNs) between the sets.
# Calculating Global Accuracy
global_accuracy <- (sum(combined_set$TP) + sum(combined_set$TN)) / (sum(combined_set$TP) + sum(combined_set$TN) + sum(combined_set$FP) + sum(combined_set$FN))
# Displaying the result
print(global_accuracy)
## [1] 0.9925
Based on the estimate of global accuracy, we obtained the result of 0.9925, that is, approximately 99.25% of the points mapped for anthromes are correctly mapped in the State of São Paulo when compared to the gold standard. This proportion demonstrates a high correspondence rate between the mapped data, pointing to the quality of the product generated for the anthromes in the Federation Unit.
We then calculated the global error, which, according to the literature, indicates the proportion of areas classified incorrectly by the model. In other words, this metric relates the sum of False Positives (FPs) and False Negatives (FNs) to the total number of cells in the statistical grid. The global error is given by the formula:
\[\text{Global Error} = \frac{FP + FN}{TP + TN + FP + FN}\]
where the sum of FPs and FNs, the cells with a divergent number of points between anthromes and the gold standard, is divided by the total number of cells in the statistical grid.
# Calculating the Global Error
global_error <- (sum(combined_set$FP) + sum(combined_set$FN)) / (sum(combined_set$TP) + sum(combined_set$TN) + sum(combined_set$FP) + sum(combined_set$FN))
# Displaying the result
print(global_error)
## [1] 0.0075
Through calculations of the global error we obtained an estimate of 0.0075, that is, only 0.75% of the areas mapped in populated anthromes were classified incorrectly when compared to the IBGE gold standard. This value indicates the model’s low error rate and points to the accuracy of the mapping of anthromes, reinforcing the notes on the quality of the mapping.
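Because the four metrics share the same confusion-matrix counts, they can also be computed together, which makes it easy to confirm that global accuracy and global error are complementary. The sketch below is illustrative only: classification_metrics is a hypothetical helper, and the counts passed to it (TP = 206, TN = 191, FP = 2, FN = 1) are inferred from the reported estimates under the assumption of a 400-cell grid, not read from combined_set.

```r
# Hypothetical helper: the four metrics from confusion-matrix cell counts
classification_metrics <- function(tp, tn, fp, fn) {
  total <- tp + tn + fp + fn
  c(
    sensitivity     = tp / (tp + fn),
    specificity     = tn / (tn + fp),
    global_accuracy = (tp + tn) / total,
    global_error    = (fp + fn) / total
  )
}

# Counts inferred from the reported estimates (assumption: 400 grid cells)
m <- classification_metrics(tp = 206, tn = 191, fp = 2, fn = 1)
print(round(m, 7))

# Global accuracy and global error sum to 1
stopifnot(isTRUE(all.equal(unname(m["global_accuracy"] + m["global_error"]), 1)))
```

With real data, the same helper could receive sum(combined_set$TP) and the other column sums as arguments.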
Throughout the analyses of the statistical metrics of the mapping, namely sensitivity, specificity, global accuracy and global error, we observed promising results for the model used in mapping the populated anthromes in the State of São Paulo, as evidenced by the following table.
Table: Results of the Statistical Metrics of the Model for Mapping Populated Anthromes.
## Metrics Estimation Percentage
## 1 Sensitivity 0.9951691 99.52%
## 2 Specificity 0.9896373 98.96%
## 3 Global Accuracy 0.9925000 99.25%
## 4 Global Error 0.0075000 0.75%
Source: the authors (2023). Legend: table with a summary of the results obtained for the four statistical metrics analyzed, namely: sensitivity, specificity, global accuracy and global error. The results are presented in two formats, the estimate and the percentage (with two decimal places).
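A summary table of this kind can be assembled with a few lines of base R. The sketch below is one possible construction, not necessarily the one used for the table above; the object name metrics_table is hypothetical and the estimates are typed in as reported in the text.

```r
# Estimates as reported in the text (entered by hand for illustration)
metrics_table <- data.frame(
  Metrics    = c("Sensitivity", "Specificity", "Global Accuracy", "Global Error"),
  Estimation = c(0.9951691, 0.9896373, 0.9925, 0.0075),
  stringsAsFactors = FALSE
)

# Percentage with two decimal places, stored as text
metrics_table$Percentage <- sprintf("%.2f%%", 100 * metrics_table$Estimation)

print(metrics_table)
```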
According to other works that use such metrics in mapping, whether for validation of the product (cartography) or for analysis of the model (map production structure), the estimates for the model of populated anthromes comply with the requirements for suitability for the intended use, reflecting the quality of the distribution of points and the efficiency of their representation.
Below we present the bar graph of the metrics presented in the table. As can be seen, the global error is the only metric close to zero, which according to the literature is a positive result, given the low distortion rate of the product (populated anthromes) compared to the gold standard (IBGE data).
On the other hand, sensitivity, specificity and global accuracy are close to 1. According to the literature, the closer these metrics are to 1, the better the model’s ability to represent the data in line with the comparator (gold standard). Therefore, the anthromes model meets these premises and satisfies the specifications of these metrics with relevant efficiency.
Graph: Statistical Metrics Results.
Source: the authors (2023). Legend: Bar graph summarizing the estimates obtained for the statistical metrics used in the analysis of mapping quality, namely: global accuracy (red), global error (purple), specificity (green) and sensitivity (blue).
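When colours are assigned manually to each metric, the names in the colour vector need to match the metric labels in the data exactly; otherwise the intended colours are not applied. The sketch below illustrates this with base R graphics so that it runs without extra packages; it is not necessarily the plotting approach used for the figure above. The object names metrics, metric_colours and bar_colours are hypothetical, and the colours follow the legend of the graph.

```r
# Estimates as reported in the text (entered by hand for illustration)
metrics <- data.frame(
  Metrics    = c("Global Accuracy", "Global Error", "Specificity", "Sensitivity"),
  Estimation = c(0.9925, 0.0075, 0.9896373, 0.9951691),
  stringsAsFactors = FALSE
)

# Named colour vector: the names must match the values in metrics$Metrics
metric_colours <- c(
  "Global Accuracy" = "red",
  "Global Error"    = "purple",
  "Specificity"     = "green",
  "Sensitivity"     = "blue"
)

# Guard against a name mismatch before plotting
stopifnot(all(metrics$Metrics %in% names(metric_colours)))

# Look up one colour per bar, in the order of the data
bar_colours <- unname(metric_colours[metrics$Metrics])

barplot(
  height    = metrics$Estimation,
  names.arg = metrics$Metrics,
  col       = bar_colours,
  ylim      = c(0, 1),
  ylab      = "Estimate",
  main      = "Statistical Metrics Results"
)
```

The explicit stopifnot() check makes a label mismatch fail loudly instead of silently producing bars without the intended colours.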
Another visualization that seemed relevant for presenting the estimates obtained for the analyzed metrics was the radar chart. In it, the four metrics are presented simultaneously on a target, in which the center represents the value 0 and the outermost circle the value 1. This format is commonly used in Metrology to analyze the measurement capacity of a given method or instrument. As Metrology is one of the Sciences that form part of our analytical core and consolidates our view of Environmental Sciences and Human Ecology, we adopted this graphic model in the analyses, reinforcing our effort to align these Sciences.
Chart: Radar Chart of Metrics Investigated.
Source: the authors (2023). Legend: radar graph of statistical metrics to validate the mapping and model analyzed. The graph reports the estimates obtained for sensitivity, specificity, global accuracy and global error.
As shown previously in the table, the values for sensitivity, specificity and global accuracy are close to one, so in the radar chart they appear to lie on the outermost circle (value 1). Conversely, the global error, being close to 0, lies at the center of the chart, which reiterates the statements made above and follows the interpretative premises of this graphic model in the literature.
With this, we conclude our analyses regarding the mapping of populated anthromes. Throughout this analysis, we processed and mined data from the 2010 Demographic Census (IBGE, 2013), classifying geospatial data into different types of populated anthromes, following the guidelines established by Ellis (2020) for classification and the IBGE metadata for alignment. Subsequently, we plotted the classified data and then produced static and interactive mappings of populated anthromes in the State of São Paulo.
Once the mapping construction stages were completed, we moved on to statistical analysis to validate the cartographic product. At this point, we carried out the overlap analysis, comparing the cartography of populated anthromes to the gold standard, which was stipulated based on IBGE data. We identified some distortions in the data set and, consequently, in the mapping. These do not invalidate the product of this Thesis, but they are a limitation of the product.
Sequentially, we analyzed the statistical metrics of sensitivity, specificity, global accuracy and global error. The estimates obtained for these metrics showed that the model has relevant suitability for the intended use of mapping demographic information, efficiently performing the distribution of points in the cartography and the mapping of census information, when compared to the gold standard. The small distortions identified at this stage also appear to be modeling limitations, but do not invalidate the modeling used to map populated anthromes.