1 Introduction

According to Biggs et al. (2021), one of the essential steps in building socio-ecological models is the survey of the attributes that will make up their analytical structure. According to these authors, each attribute must be identified and analyzed in order to understand the data profile and its possible applications, whether in descriptive terms or in terms of information mapping.

From this perspective, Ellis (2020) argued that population attributes associated with anthromes, mainly demographic density, are fundamental for modeling anthropogenic biomes on both global and local scales. Following these guidelines on the demographic aspects linked to anthromes, in this work we carried out an exploratory analysis of census data produced by the Brazilian Institute of Geography and Statistics (IBGE).

The exploratory analysis aimed to identify the attributes that make up the data from the census operation carried out by IBGE in 2010 (IBGE, 2013a). We also sought to recognize characteristics that would allow the tabular data provided by this institution to be integrated, with the aim of expanding the data set for modeling anthromes locally.

In addition, we sought to evaluate the possibility of plotting these data, that is, of spatially distributing census information in local mappings. This operation was performed in the R® software, following the investigative guidelines presented by Lovelace et al. (2019) and Anderson (2021) for exploratory analysis and for creating maps and plots of geographic information. These authors adopt a critical-analytical format in their works, demonstrating the logic involved in achieving the objectives just presented.

Throughout the exploratory analysis, we present detailed summaries of the functions used, which were drawn from the two works just discussed. Through this research, we surveyed the characteristics of census data, in vector and raster formats, that would allow their use in structuring the decision tree for classifying anthromes locally. We emphasize that this was an essential step in building the model of anthropogenic biomes in the R® language. In it, we recognized attributes of the demographic data that aligned with those identified by Gauthier (2021) and Ellis, Beusen and Goldewijk (2020) as fundamental for mapping anthromes in R®.

We highlight that this manuscript does not follow the conventional structure in which “Introduction”, “Methodology/Materials and Methods”, “Results and Discussion” and “Conclusions” are detailed separately. Instead, we report in a format that follows the logic of programming: we first describe what was carried out in the R software (Methodology), then present the R code (Methodology and Results) and, as the results are generated by the program, discuss them immediately afterwards (Discussion). This format was adopted to facilitate analytical understanding and to link together the analyses carried out.

2 Demographic Data: from loading to mapping

2.1 Exploratory Analysis of Demographic Data

In the first stage of the exploratory analysis of demographic data, the tabular files provided by the Brazilian Institute of Geography and Statistics (IBGE) were downloaded from the institution’s digital platform. According to IBGE, the lowest level of data disaggregation is the microdata from the 2010 Census, that is, these data contain information for each of the cities surveyed by IBGE during the demographic census. The data show the distribution of the municipal population between urban and rural areas and across different urban settings, such as within or outside the municipal seat. The web page where these data are available is:

https://www.ibge.gov.br/estatisticas/sociais/populacao/9662-censo-demografico-2010.html?=&t=microdados.

We point out that we used data referring to the State of São Paulo (Brazil) as an experimental model for mapping anthromes locally, as this Federation Unit (UF) encompasses different territorial typologies (land uses and covers) and has significant representation in the economy, in national politics and management, as well as symbolic distribution and population size.

The files downloaded from the IBGE platform were placed in a folder associated with the work (the working directory) for later use in the R® software. This folder contains the guidance documentation provided by IBGE, the microdata and the tables referring to the population of the State of São Paulo recorded in the 2010 Census. The tabular data are in the “.xls” format (Microsoft Excel) for import into R®.

We point out that some adjustments to the content of the tables were necessary, as the original layout prevented the files from being read correctly in the software. The tabular files were therefore opened in Microsoft Access 365 to remove titles and additional information, such as subtitles, captions and bibliographic references, which appeared in the original data. Thus, only the attribute names (first row of each column) and the census data for each attribute needed for the analysis remained in the edited data.

Furthermore, we emphasize that the numeric tabular data contained spaces between groups of digits and a hyphen (-) for null values, characteristics that prevented the data from being read as numeric values, so that R interpreted them as “characters”. Therefore, we edited the tables made available by IBGE, removing the spaces and replacing the hyphens with zeros (0) in the sets analyzed in this work. These edits were made in Microsoft Excel 365.
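The same cleaning could in principle be done inside R instead of in the spreadsheet. The lines below are a minimal sketch under stated assumptions, not the procedure used in this work: raw_values and clean_census_number() are hypothetical names, and the sample strings are invented for illustration.

```r
# Hypothetical raw values as they might arrive from the spreadsheet:
# digit groups separated by spaces and "-" standing for a null value.
raw_values <- c("11 253 503", "805", "-", "1 286")

# Hypothetical helper: turn such strings into numeric values.
clean_census_number <- function(x) {
  x <- gsub("[[:space:]]", "", x)  # drop the spaces between digit groups
  x[x == "-"] <- "0"               # a lone hyphen marks a null value
  as.numeric(x)
}

cleaned <- clean_census_number(raw_values)

class(cleaned)  # now numeric rather than character
cleaned
```

Keeping this step in a script would make the edit reproducible, at the cost of loading the columns as text first.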

The operations performed in the Access 365 and Excel 365 programs are not reported throughout the text. However, the edited tabular files, in “.xls” format, were made available as complementary files for this work on the EcoMetrologia Project’s GitHub repository (https://github.com/maximilianogobbo/landuseplanning.git) and can be accessed remotely. Furthermore, all documents that make up the demographic data portfolio, including the R and Rmarkdown scripts, were saved in a single directory, in order to facilitate and streamline the operation, manipulation and analysis of data in the software. The getwd() function shows the current working directory, the location where all the documents for this investigation are stored:

getwd()

## [1] "C:/ARQUIVOS COMPUTADOR/DOUTORADO/DOUTORADO TESE/03 DADOS GEOESPACIAIS/02.1 DEMOGRAPHIC"

Of all the documents downloaded from the IBGE platform, only three were used in the first phase of the exploratory analysis, as only these contained information about the geographic location of the municipalities in the State of São Paulo, the population in each of the subdivisions established in the census, and the area and/or demographic density of each municipality.

Below, the loading of each of the tables in R® is presented separately using the read_excel() function. For this function to operate, the file name and the directory where the tables were saved were indicated, as illustrated in Script 1 below.

Furthermore, in this preliminary phase, two other functions were used after loading: the names() function, to identify the names of the data set attributes (first row of the tabular data), and the summary() function, which offers a synthesis of the data, whether in qualitative terms (characters) or in quantitative terms (numerical and statistical).

The first table loaded into the software was “population01.xls”, using the read_excel() function. We stored the table as an object (data frame) named population01. Using the names() function, we checked the names of the attributes in this data set. Subsequently, we used the summary() function to obtain a qualitative and quantitative synthesis of the population01 data frame. Script 1 (code) illustrates this preliminary procedure in the R® language.

Script: Loading and Preliminary Analysis of population01

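library(readxl)  # provides read_excel()

# Load the edited table from the working directory
population01 <- read_excel("population01.xls")

names(population01)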

##  [1] "city"                                "Área Urbanizada"                    
##  [3] "Área não Urbanizada"                 "Área Urbana Isolada"                
##  [5] "Área Rural (Exceto Aglomerado)"      "Aglomerado Rural de Extensão Urbana"
##  [7] "Aglomerado Rural Povoado"            "Aglomerado Rural Núcleo"            
##  [9] "Outros Aglomerados Rurais Raros"     "Código da Unidade Geográfica"

summary(population01)

##      city           Área Urbanizada    Área não Urbanizada Área Urbana Isolada
##  Length:645         Min.   :     627   Min.   :    0       Min.   :    0.0    
##  Class :character   1st Qu.:    3753   1st Qu.:    0       1st Qu.:    0.0    
##  Mode  :character   Median :    9485   Median :    0       Median :    0.0    
##                     Mean   :   59817   Mean   : 1048       Mean   :  508.3    
##                     3rd Qu.:   33907   3rd Qu.:   15       3rd Qu.:   79.0    
##                     Max.   :11065838   Max.   :65912       Max.   :41236.0    
##  Área Rural (Exceto Aglomerado) Aglomerado Rural de Extensão Urbana
##  Min.   :    0                  Min.   :    0.0                    
##  1st Qu.:  591                  1st Qu.:    0.0                    
##  Median : 1218                  Median :    0.0                    
##  Mean   : 2244                  Mean   :  246.9                    
##  3rd Qu.: 2780                  3rd Qu.:    0.0                    
##  Max.   :45899                  Max.   :54903.0                    
##  Aglomerado Rural Povoado Aglomerado Rural Núcleo
##  Min.   :   0.00          Min.   :  0.000        
##  1st Qu.:   0.00          1st Qu.:  0.000        
##  Median :   0.00          Median :  0.000        
##  Mean   :  56.94          Mean   :  8.567        
##  3rd Qu.:   0.00          3rd Qu.:  0.000        
##  Max.   :6185.00          Max.   :813.000        
##  Outros Aglomerados Rurais Raros Código da Unidade Geográfica
##  Min.   :   0.00                 Min.   :3500105             
##  1st Qu.:   0.00                 1st Qu.:3514601             
##  Median :   0.00                 Median :3528700             
##  Mean   :  40.07                 Mean   :3528698             
##  3rd Qu.:   0.00                 3rd Qu.:3543204             
##  Max.   :2889.00                 Max.   :3557303

Source: the authors (2023). Caption: preliminary analysis of the population01 data set using the names() and summary() functions in the R software.

The preliminary analysis of population01 revealed some important aspects of the set. The first point to note concerns the names() function, which returns the names of the attributes that make up the data frame. These attributes correspond to the typologies of census sectors (land use) and give the number of inhabitants registered in each typology for each of the municipalities in São Paulo.

As seen in the summary() results, there are 645 rows (Length), each representing one municipality of the State. Each row gives the number of inhabitants registered in each of the territorial typologies of that city, and the typologies are named in the first row (header) of the data set.

Another aspect to be highlighted in the results of this function is the attribute Código da Unidade Geográfica: despite being stored as numbers, it is a numerical descriptor, that is, a sequence of digits assigned to designate the area of reference. In software engineering and database modeling, such a code is understood as an identifying attribute, one whose values are not repeated throughout the data set and are assigned exclusively to a single entity, which in the case of the population01 data frame are the municipalities of São Paulo.
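The two properties of an identifying attribute described above (it is a label rather than a quantity, and it does not repeat) can be checked with base R. This is an illustrative sketch only: toy is a hypothetical three-row data frame, its code column is a hypothetical stand-in for Código da Unidade Geográfica, the city labels are placeholders, and the three codes are the minimum, median and maximum reported by summary() above.

```r
# Hypothetical toy data frame standing in for population01.
toy <- data.frame(
  city = c("A", "B", "C"),
  code = c(3500105, 3528700, 3557303)
)

# Store the code as text so that it is treated as a label, not a quantity.
# Going through as.integer() avoids any scientific notation in the text.
toy$code <- as.character(as.integer(toy$code))

# An identifying attribute must not repeat: anyDuplicated() returns 0
# when every value is unique.
is_identifier <- anyDuplicated(toy$code) == 0
is_identifier
```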

That said, we carried out the same analytical procedure with the “population02.xls” table in the directory, from which the population02 data frame was created and which is described in Script 2.

Script: Loading and Preliminary Analysis of population02

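# Load the edited table from the working directory (read_excel() is in readxl)
population02 <- readxl::read_excel("population02.xls")

names(population02)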

## [1] "city"                            "Total"                          
## [3] "Urbana"                          "Na sede municipal"              
## [5] "Rural"                           "Área\ntotal\n(km²)"             
## [7] "Densidade demográfica (hab/km²)" "Código da Unidade Geográfica"

summary(population02)

##      city               Total              Urbana         Na sede municipal 
##  Length:645         Min.   :     805   Min.   :     627   Min.   :     627  
##  Class :character   1st Qu.:    5151   1st Qu.:    3865   1st Qu.:    3681  
##  Mode  :character   Median :   12737   Median :   10352   Median :    9563  
##                     Mean   :   63972   Mean   :   61372   Mean   :   56890  
##                     3rd Qu.:   37910   3rd Qu.:   34748   3rd Qu.:   32676  
##                     Max.   :11253503   Max.   :11152344   Max.   :11111108  
##      Rural        Área\ntotal\n(km²) Densidade demográfica (hab/km²)
##  Min.   :     0   Min.   :   5.4     Min.   :    3.73               
##  1st Qu.:   628   1st Qu.: 157.9     1st Qu.:   19.69               
##  Median :  1286   Median : 281.1     Median :   38.87               
##  Mean   :  2600   Mean   : 384.8     Mean   :  302.13               
##  3rd Qu.:  2971   3rd Qu.: 508.5     3rd Qu.:  109.81               
##  Max.   :101159   Max.   :1977.4     Max.   :12519.10               
##  Código da Unidade Geográfica
##  Min.   :3500105             
##  1st Qu.:3514601             
##  Median :3528700             
##  Mean   :3528698             
##  3rd Qu.:3543204             
##  Max.   :3557303

Source: the authors (2023). Caption: preliminary analysis of the population02 data set using the names() and summary() functions in the R software.

We observed, through the results obtained with the names() function, that the population02 set shares some attribute names with the population01 data frame and differs in others. We highlight the attributes “demographic density” and “total area”, which give the number of inhabitants per unit of area and the total area of each municipality, respectively. Again, the attribute Código da Unidade Geográfica is interpreted as a numeric attribute, incurring the same problem identified previously for population01.
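The relationship between the two highlighted attributes is simple arithmetic: demographic density is the total population divided by the total area. The sketch below uses invented values purely to show the calculation; total_population, total_area_km2 and density are hypothetical names, not columns of population02.

```r
# Hypothetical values for two municipalities (not census figures).
total_population <- c(12000, 250000)  # inhabitants
total_area_km2   <- c(300, 500)       # km²

# Demographic density in inhabitants per km².
density <- total_population / total_area_km2
density
```

Applied to the real table, the same division offers a quick consistency check between the Total, area and density columns.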

Next, we performed the same procedure with the file “population03.xls”, which gave rise to the data frame population03, as shown in Script 3.

Script: Loading and Preliminary Analysis of population03

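# Load the edited table from the working directory (read_excel() is in readxl)
population03 <- readxl::read_excel("population03.xls")

names(population03)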

##  [1] "city"                                                         
##  [2] "População residente Absoluta"                                 
##  [3] "População residente absoluta total urbana"                    
##  [4] "População residente absoluta total na sede municipal urbana\n"
##  [5] "Total Relativa (%)...5"                                       
##  [6] "Total Relativa (%)...6"                                       
##  [7] "Na sede municipal Relativa (%)\n"                             
##  [8] "Área\ntotal\n(km²)\n"                                         
##  [9] "Densidade demográfica (hab/km²)"                              
## [10] "Código da Unidade Geográfica"

summary(population03)

##      city           População residente Absoluta
##  Length:645         Min.   :     805            
##  Class :character   1st Qu.:    5151            
##  Mode  :character   Median :   12737            
##                     Mean   :   63972            
##                     3rd Qu.:   37910            
##                     Max.   :11253503            
##  População residente absoluta total urbana
##  Min.   :     627                         
##  1st Qu.:    3865                         
##  Median :   10352                         
##  Mean   :   61372                         
##  3rd Qu.:   34748                         
##  Max.   :11152344                         
##  População residente absoluta total na sede municipal urbana\n
##  Min.   :     627                                             
##  1st Qu.:    3681                                             
##  Median :    9563                                             
##  Mean   :   56890                                             
##  3rd Qu.:   32676                                             
##  Max.   :11111108                                             
##  Total Relativa (%)...5 Total Relativa (%)...6 Na sede municipal Relativa (%)\n
##  Min.   :100            Min.   : 24.90         Min.   : 12.60                  
##  1st Qu.:100            1st Qu.: 78.70         1st Qu.: 71.50                  
##  Median :100            Median : 88.40         Median : 84.30                  
##  Mean   :100            Mean   : 84.32         Mean   : 79.78                  
##  3rd Qu.:100            3rd Qu.: 94.90         3rd Qu.: 92.10                  
##  Max.   :100            Max.   :100.00         Max.   :100.00                  
##  Área\ntotal\n(km²)\n Densidade demográfica (hab/km²)
##  Min.   :   5.4       Min.   :    3.73               
##  1st Qu.: 157.9       1st Qu.:   19.69               
##  Median : 281.1       Median :   38.87               
##  Mean   : 384.8       Mean   :  302.13               
##  3rd Qu.: 508.5       3rd Qu.:  109.81               
##  Max.   :1977.4       Max.   :12519.10               
##  Código da Unidade Geográfica
##  Min.   :3500105             
##  1st Qu.:3514601             
##  Median :3528700             
##  Mean   :3528698             
##  3rd Qu.:3543204             
##  Max.   :3557303

Source: the authors (2023). Caption: preliminary analysis of the population03 data set using the names() and summary() functions in the R software.

In Script 3, in addition to presenting the results of the two analytical functions, we also report the first 10 rows of the population03 data frame, displayed right after loading the data with the read_excel() function.

We observed that the population03 set contains other numerical attributes, namely Relative Total (%) (in Portuguese, Total Relativa (%)) and Relative at the municipal seat (%) (in Portuguese, Na sede municipal Relativa (%)). The suffixes ...5 and ...6 seen in the attribute names were appended by read_excel() because two columns share the same name in the source table. These attributes, however, represent percentages of the population in each of the cities in São Paulo and do not describe population dimensions as such, for example concentration or demographic density.

Furthermore, we reiterate that the identifying attribute Geographic Unit Code (in Portuguese, Código da Unidade Geográfica) is also part of this data frame and is the only identifying code present in all three sets analyzed up to this point. However, none of the three sets directly provides information that spatializes the data, such as the longitude, latitude and altitude of points or polygons referring to census sectors.
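Because the code is shared by the three sets, it is the natural key for combining them into a single table. The following is a minimal sketch of such a join with base R's merge(), under stated assumptions: pop_a and pop_b are hypothetical two-column stand-ins for the census tables, code is a hypothetical short name for Código da Unidade Geográfica, and all values other than the codes are invented.

```r
# Hypothetical stand-ins for two of the census tables, keyed by the same code.
pop_a <- data.frame(
  code  = c("3500105", "3528700", "3557303"),
  urban = c(100, 200, 300)
)
pop_b <- data.frame(
  code     = c("3557303", "3500105", "3528700"),  # rows in a different order
  area_km2 = c(30, 10, 20)
)

# merge() matches rows on the key, regardless of their original order.
merged <- merge(pop_a, pop_b, by = "code")
merged
```

Joining on the code rather than on the city name avoids mismatches caused by spelling or accent differences between tables.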

Despite the obstacles identified so far, we expanded the preliminary analysis with two other functions: first, the class() function, to recognize the structural format of the three sets, and then the dim() function, which gives the number of rows and columns of the data sets. Script 4 below shows the results obtained.

Script: Application of the functions class() and dim() in the Preliminary Analysis

class(population01)

## [1] "tbl_df"     "tbl"        "data.frame"

class(population02)

## [1] "tbl_df"     "tbl"        "data.frame"

class(population03)

## [1] "tbl_df"     "tbl"        "data.frame"

dim(population01)

## [1] 645  10

dim(population02)

## [1] 645   8

dim(population03)

## [1] 645  10

Source: the authors (2023). Caption: preliminary analysis of the data sets using the class() and dim() functions in the R software.

We observed, in the results generated by the class() function, that the three data sets (population01, population02 and population03) are data frames, that is, they are structured in rows and columns; the additional classes tbl_df and tbl indicate that they are tibbles, the tabular format returned by read_excel(). The columns of the data frames hold the information that characterizes the municipalities of the State of São Paulo, that is, the values for each of the attributes named in the first row. In turn, the dim() function reported that population01 is composed of 645 rows and 10 columns, population02 of 645 rows and 8 columns, and population03 of 645 rows and 10 columns.

Given this information, we note that the three data frames have the same number of rows (645), which is consistent with all cities being present in each of them. However, the number of columns differs between the data sets, a fact that we had already observed when applying the names() and summary() functions. This occurs because some attributes are present in one data frame and absent from the others, and vice versa, which changes the number of columns in each of them.
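An equal row count does not by itself show that the same municipalities appear in every table; comparing the identifier codes does. The sketch below illustrates such a check with base R set functions. It is an assumption-laden example: codes_01, codes_02 and codes_03 are hypothetical vectors standing in for the Código da Unidade Geográfica column of each data frame.

```r
# Hypothetical code vectors standing in for the three data frames.
codes_01 <- c(3500105, 3528700, 3557303)
codes_02 <- c(3557303, 3500105, 3528700)  # same codes, different order
codes_03 <- c(3500105, 3528700, 3557303)

# TRUE only if each pair of vectors contains exactly the same codes.
same_municipalities <- setequal(codes_01, codes_02) &&
  setequal(codes_01, codes_03)

# Codes present in the first set but absent from the second (empty if none).
missing_in_02 <- setdiff(codes_01, codes_02)

same_municipalities
missing_in_02
```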

Returning to the attribute Geographic Unit Code (in Portuguese, Código da Unidade Geográfica), present in the three data frames, we carried out a specific analysis to identify how the software reads the information for each city. We used the summary() function again and selected the column of this attribute using square brackets [], as demonstrated in Script 5.

Script: Analysis of the “Geographic Unit Code (in Portuguese, Código da Unidade Geográfica)” attribute

summary(population01["Código da Unidade Geográfica"])

##  Código da Unidade Geográfica
##  Min.   :3500105             
##  1st Qu.:3514601             
##  Median :3528700             
##  Mean   :3528698             
##  3rd Qu.:3543204             
##  Max.   :3557303

summary(population02["Código da Unidade Geográfica"])

##  Código da Unidade Geográfica
##  Min.   :3500105             
##  1st Qu.:3514601             
##  Median :3528700             
##  Mean   :3528698             
##  3rd Qu.:3543204             
##  Max.   :3557303

summary(population03["Código da Unidade Geográfica"])

##  Código da Unidade Geográfica
##  Min.   :3500105             
##  1st Qu.:3514601             
##  Median :3528700             
##  Mean   :3528698             
##  3rd Qu.:3543204             
##  Max.   :3557303

Source: the authors (2023). Caption: analysis of the “Geographic Unit Code (in Portuguese, Código da Unidade Geográfica)” attribute using the summary() function and selecting the attribute using square brackets [].

In line with what we presented previously, the summary() function returned statistical information about this attribute: the minimum (Min.), first quartile (1st Qu.), median, mean, third quartile (3rd Qu.) and maximum (Max.) values. Therefore, we confirm that the software interprets this identifying attribute not as an area identifier code but as a numerical value. Moreover, the code alone does not allow the data to be plotted directly on a map; other geographic information is required to do so.
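The effect of the storage type on summary() can be seen on a small example. This is a sketch under stated assumptions: codes_numeric and codes_text are hypothetical vectors holding three of the codes reported above, first as numbers and then as text.

```r
# The same three codes stored as numbers and as text (hypothetical vectors).
codes_numeric <- c(3500105, 3528700, 3557303)
codes_text    <- as.character(as.integer(codes_numeric))

# For numbers, summary() reports quartiles and a mean,
# statistics that carry no meaning for an identifier.
numeric_summary <- summary(codes_numeric)
numeric_summary

# For text, summary() reports only length, class and mode.
text_summary <- summary(codes_text)
text_summary
```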

From this perspective, we carried out a new search on the IBGE platform to find the files referring to the identifying attribute Geographic Unit Code (in Portuguese, Código da Unidade Geográfica). The files related to this attribute were downloaded and stored in the same working directory as the exploratory analysis and are available at the following link, accessible remotely.

https://www.ibge.gov.br/geociencias/organizacao-do-territorio/estrutura-territorial/27385-localidades.html?=&t=acesso-ao-produto.

In this search, we obtained the shapefiles made available by IBGE, which contain the geographic information (code and geometry: longitude, latitude and altitude) associated with the Geographic Unit Code (in Portuguese, Código da Unidade Geográfica) of the three data frames (population01, population02 and population03). With these shapefiles, we aimed to connect the census data from the three sets to the spatial structures of their locations (census sectors).

To this end, the first step was to load the shapefile into R using the shapefile() function, which reads vector data into the software. The data set loaded by this function was named br_locations_2010 and stored as an object for exploratory analysis. After loading the vector data, we carried out the same analytical procedures demonstrated so far. We used the names() function to identify the names of the attributes that make up the br_locations_2010 data set, and the summary() function to recognize the qualitative and quantitative structure of this object. We also used the crs() function to inspect its coordinate reference system (Script 7).

Script: Preliminary analysis of br_locations_2010

names(br_locations_2010)

##  [1] "ID"         "CD_GEOCODI" "TIPO"       "CD_GEOCODB" "NM_BAIRRO" 
##  [6] "CD_GEOCODS" "NM_SUBDIST" "CD_GEOCODD" "NM_DISTRIT" "CD_GEOCODM"
## [11] "city"       "NM_MICRO"   "NM_MESO"    "state"      "CD_NIVEL"  
## [16] "CD_CATEGOR" "NM_CATEGOR" "NM_LOCALID" "LONG"       "LAT"       
## [21] "ALT"        "GMRotation"

summary(br_locations_2010)

## Object of class SpatialPointsDataFrame
## Coordinates:
##         min        max
## x -73.49761 -32.435186
## y -33.73754   5.220071
## Is projected: FALSE 
## proj4string :
## [+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs]
## Number of points: 21886
## Data attributes:
##        ID         CD_GEOCODI            TIPO            CD_GEOCODB       
##  Min.   :    1   Length:21886       Length:21886       Length:21886      
##  1st Qu.: 5472   Class :character   Class :character   Class :character  
##  Median :10944   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10944                                                           
##  3rd Qu.:16415                                                           
##  Max.   :21886                                                           
##                                                                          
##   NM_BAIRRO          CD_GEOCODS         NM_SUBDIST         CD_GEOCODD       
##  Length:21886       Length:21886       Length:21886       Length:21886      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   NM_DISTRIT         CD_GEOCODM            city             NM_MICRO        
##  Length:21886       Length:21886       Length:21886       Length:21886      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    NM_MESO             state             CD_NIVEL          CD_CATEGOR       
##  Length:21886       Length:21886       Length:21886       Length:21886      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   NM_CATEGOR         NM_LOCALID             LONG             LAT         
##  Length:21886       Length:21886       Min.   :-73.50   Min.   :-33.738  
##  Class :character   Class :character   1st Qu.:-49.89   1st Qu.:-21.588  
##  Mode  :character   Mode  :character   Median :-44.62   Median :-12.619  
##                                        Mean   :-45.54   Mean   :-14.067  
##                                        3rd Qu.:-40.15   3rd Qu.: -6.603  
##                                        Max.   :-32.44   Max.   :  5.220  
##                                                                          
##       ALT           GMRotation
##  Min.   :   0.0   Min.   :0   
##  1st Qu.: 111.1   1st Qu.:0   
##  Median : 329.3   Median :0   
##  Mean   : 372.4   Mean   :0   
##  3rd Qu.: 582.4   3rd Qu.:0   
##  Max.   :1639.2   Max.   :0   
##  NA's   :1

crs(br_locations_2010)

## Coordinate Reference System:
## Deprecated Proj.4 representation:
##  +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs 
## WKT2 2019 representation:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["Unknown based on GRS 1980 ellipsoid using towgs84=0,0,0,0,0,0,0",
##                 ELLIPSOID["GRS 1980",6378137,298.257222101,
##                     LENGTHUNIT["metre",1],
##                     ID["EPSG",7019]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Position Vector transformation (geog2D domain)",
##             ID["EPSG",9606]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]],
##         PARAMETER["X-axis rotation",0,
##             ID["EPSG",8608]],
##         PARAMETER["Y-axis rotation",0,
##             ID["EPSG",8609]],
##         PARAMETER["Z-axis rotation",0,
##             ID["EPSG",8610]],
##         PARAMETER["Scale difference",1,
##             ID["EPSG",8611]]]]

Source: the authors (2023).

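The results show that the br_locations_2010 set is composed of 21,886 rows (points) and 22 attributes described in columns. The geometry of the data set consists of points structured in XY dimensions, with the minimum and maximum x and y values (the bounding box) stored within the object. Furthermore, the coordinate reference system (CRS) of the data set is geographic (unprojected longitude and latitude) and based on the GRS 1980 ellipsoid, which is the ellipsoid of the “SIRGAS 2000” datum, although the output labels the CRS itself as “unknown”. The EPSG (European Petroleum Survey Group) identifiers that appear in the CRS description refer to its individual components.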

We also highlight that the 22 attributes that make up the data set are: ID, CD_GEOCODI, TIPO, CD_GEOCODB, NM_BAIRRO, CD_GEOCODS, NM_SUBDIST, CD_GEOCODD, NM_DISTRIT, CD_GEOCODM, city, NM_MICRO, NM_MESO, state, CD_NIVEL, CD_CATEGOR, NM_CATEGOR, NM_LOCALID, LONG, LAT, ALT and GMRotation. Of these, we confirmed the presence of attributes associated with the geographic positioning of the information, namely longitude (LONG), latitude (LAT) and altitude (ALT), in addition to the point coordinates stored in the object. Furthermore, the br_locations_2010 data set describes territorial subdivisions: the columns run sequentially from the lowest level of aggregation, the neighborhood (NM_BAIRRO), up to the State. These results, therefore, provided us with clues to understand the structure of the data and to identify relevant attributes for data mining in R®.

In order to test the spatialization of the information contained in the br_locations_2010 data set, we applied the plot() function in the analytical sequence to visualize the distribution of the points described by the data frame, as can be seen in Figure 1.

Figure: Plotting of data from br_locations_2010

Source: the authors (2023). Caption: figure produced with the plot() function, using the data set br_locations_2010. The red outline represents the Brazilian territorial polygon, inserted to show the distribution of the points.

The plot shows each of the 21,886 points that make up the data set, positioned by the geographic coordinates of each Brazilian locality. The blank spaces, where no points are marked, represent areas where no population was recorded (areas without human occupation, or demographic voids) and/or areas whose sampled populations were counted within census sectors close to where they are established, as suggested in the reference document for the Brazilian census operation (IBGE, 2013).

Based on the attribute names of br_locations_2010, we verified that the column NM_UF, which holds the name of the Federation Unit, allows the data for the State of São Paulo to be filtered, making it possible to cut the data set down to the experimental area of this research. We also verified that this data frame is composed of geographically distributed points, given the presence of the attributes longitude (LONG), latitude (LAT) and altitude (ALT). In addition, the data frame has structural characteristics that help in the spatialization of geographic information, presented in Script 7 as coords.x1 and coords.x2.

In order to isolate the data for the State of São Paulo, we returned to the Microsoft Access 365 program (the initial format of the data set made available by IBGE) to filter the data from localities_br. We filtered the data set on the NM_UF attribute, which holds the names of the Federation Units, selecting only the rows whose value in the NM_UF column was São Paulo. The selected data was copied to an Excel 365 spreadsheet and saved with the “.xls” extension under the name localsp.xls. Both the original file (Access 365) and the produced file (Excel 365) are available in the GitHub digital collection and can be accessed remotely.

After producing the localsp.xls file, we returned to R, where we located the file in the working directory and loaded it with the read_excel() function, creating the localsp object, as illustrated in the following script. After loading the data, we renamed some attributes of the data set with the names() function, namely: a) NM_MUNICIPIO, replaced by city; b) NM_UF, by state; c) LONG, by longitude; d) LAT, by latitude; e) ALT, by altitude. Finally, we confirmed the creation of the data set and the changes to the attribute names.

Script 10: Loading data and creating the localsp object.

localsp

## # A tibble: 2,142 × 21
##       ID CD_GEOCODIGO TIPO   CD_GEOCODBA NM_BAIRRO CD_GEOCODSD CD_GEOCODDS
##    <dbl>        <dbl> <chr>        <dbl> <chr>           <dbl>       <dbl>
##  1 15316      3.50e14 URBANO          NA <NA>      35001050500   350010505
##  2 15317      3.50e14 URBANO          NA <NA>      35001050500   350010505
##  3 15318      3.50e14 URBANO          NA <NA>      35001050500   350010505
##  4 15319      3.50e14 URBANO          NA <NA>      35001050500   350010505
##  5 15320      3.50e14 URBANO          NA <NA>      35001050500   350010505
##  6 15321      3.50e14 URBANO          NA <NA>      35001050500   350010505
##  7 15322      3.50e14 URBANO          NA <NA>      35002040500   350020405
##  8 15323      3.50e14 URBANO          NA <NA>      35002040500   350020405
##  9 15324      3.50e14 URBANO          NA <NA>      35002040500   350020405
## 10 15325      3.50e14 URBANO          NA <NA>      35003030500   350030305
## # ℹ 2,132 more rows
## # ℹ 14 more variables: NM_DISTRITO <chr>, CD_GEOCODMU <dbl>, city <chr>,
## #   NM_MICRO <chr>, NM_MESO <chr>, state <chr>, CD_NIVEL <dbl>,
## #   CD_CATEGORIA <dbl>, NM_CATEGORIA <chr>, NM_LOCALIDADE <chr>,
## #   longitude <dbl>, latitude <dbl>, altitude <dbl>, GM_PONTO_sk <chr>

Source: the authors (2023). Caption: loading data from localsp.xls and creating the localsp object. In the script, the format of the data set and the attributes (variables) that constitute it are highlighted.
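The script output shows only the resulting object, not the loading and renaming calls themselves. A minimal sketch of the renaming step on a toy data frame might look as follows. The toy values and the `renames` lookup vector are illustrative assumptions; in the actual workflow the object would come from read_excel("localsp.xls") (readxl package) rather than from data.frame().

```r
# Toy stand-in for the table that read_excel("localsp.xls") would return
# (values are illustrative, not taken from the IBGE file).
localsp_toy <- data.frame(
  NM_MUNICIPIO = c("ADAMANTINA", "ADOLFO"),
  NM_UF        = c("SAO PAULO", "SAO PAULO"),
  LONG         = c(-51.07, -49.64),
  LAT          = c(-21.68, -21.23),
  ALT          = c(450, 430),
  stringsAsFactors = FALSE
)

# Old name -> new name, following the renaming listed in the text
renames <- c(NM_MUNICIPIO = "city", NM_UF = "state",
             LONG = "longitude", LAT = "latitude", ALT = "altitude")

# Replace only the matching names and leave every other column untouched
hits <- names(localsp_toy) %in% names(renames)
names(localsp_toy)[hits] <- unname(renames[names(localsp_toy)[hits]])

names(localsp_toy)
```

Matching by name rather than by column position keeps the renaming valid even if the column order of the spreadsheet changes.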

Checking the created object allowed us to extract some relevant information about the localsp set. The results in Script 10 indicated that the object is structured as a tibble, a data frame model, composed of 2,142 rows and 21 variables distributed in columns, which represent the attributes of this data set. The names of the attributes (variables) are highlighted in the script above; they are the same attributes that make up localities_br, except for the geometry attribute (GEOMETRY), which was not included in the filtering in Access 365 and is therefore not part of the localsp set.

Following the same functions previously used in the preliminary analysis (Table 10), we explored the localsp data set, aiming to identify relevant characteristics for the exploratory analysis, as demonstrated in Script 11.

Script 11: Preliminary analysis of localsp

names(localsp)

##  [1] "ID"            "CD_GEOCODIGO"  "TIPO"          "CD_GEOCODBA"  
##  [5] "NM_BAIRRO"     "CD_GEOCODSD"   "CD_GEOCODDS"   "NM_DISTRITO"  
##  [9] "CD_GEOCODMU"   "city"          "NM_MICRO"      "NM_MESO"      
## [13] "state"         "CD_NIVEL"      "CD_CATEGORIA"  "NM_CATEGORIA" 
## [17] "NM_LOCALIDADE" "longitude"     "latitude"      "altitude"     
## [21] "GM_PONTO_sk"

summary(localsp)

##        ID         CD_GEOCODIGO           TIPO            CD_GEOCODBA       
##  Min.   :15316   Min.   :3.500e+14   Length:2142        Min.   :3.502e+11  
##  1st Qu.:15851   1st Qu.:3.514e+14   Class :character   1st Qu.:3.514e+11  
##  Median :16386   Median :3.529e+14   Mode  :character   Median :3.533e+11  
##  Mean   :16386   Mean   :3.528e+14                      Mean   :3.531e+11  
##  3rd Qu.:16921   3rd Qu.:3.542e+14                      3rd Qu.:3.549e+11  
##  Max.   :17456   Max.   :3.557e+14                      Max.   :3.555e+11  
##  NA's   :1       NA's   :1                              NA's   :1997       
##   NM_BAIRRO          CD_GEOCODSD         CD_GEOCODDS        NM_DISTRITO       
##  Length:2142        Min.   :3.500e+10   Min.   :350010505   Length:2142       
##  Class :character   1st Qu.:3.514e+10   1st Qu.:351410605   Class :character  
##  Mode  :character   Median :3.529e+10   Median :352850205   Mode  :character  
##                     Mean   :3.528e+10   Mean   :352799828                     
##                     3rd Qu.:3.542e+10   3rd Qu.:354165305                     
##                     Max.   :3.557e+10   Max.   :355730305                     
##                     NA's   :1           NA's   :1                             
##   CD_GEOCODMU          city             NM_MICRO           NM_MESO         
##  Min.   :3500105   Length:2142        Length:2142        Length:2142       
##  1st Qu.:3514106   Class :character   Class :character   Class :character  
##  Median :3528502   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3527998                                                           
##  3rd Qu.:3541653                                                           
##  Max.   :3557303                                                           
##  NA's   :1                                                                 
##     state              CD_NIVEL      CD_CATEGORIA    NM_CATEGORIA      
##  Length:2142        Min.   :1.000   Min.   : 1.000   Length:2142       
##  Class :character   1st Qu.:1.000   1st Qu.: 3.000   Class :character  
##  Mode  :character   Median :3.000   Median : 5.000   Mode  :character  
##                     Mean   :3.655   Mean   : 8.192                     
##                     3rd Qu.:6.000   3rd Qu.:10.000                     
##                     Max.   :6.000   Max.   :70.000                     
##                     NA's   :1       NA's   :1                          
##  NM_LOCALIDADE        longitude         latitude         altitude       
##  Length:2142        Min.   :-53.06   Min.   :-25.22   Min.   :   1.363  
##  Class :character   1st Qu.:-49.48   1st Qu.:-23.31   1st Qu.: 465.505  
##  Mode  :character   Median :-48.08   Median :-22.67   Median : 575.228  
##                     Mean   :-48.29   Mean   :-22.43   Mean   : 580.182  
##                     3rd Qu.:-46.95   3rd Qu.:-21.57   3rd Qu.: 712.080  
##                     Max.   :-44.20   Max.   :-19.87   Max.   :1639.155  
##                     NA's   :1        NA's   :1        NA's   :1         
##  GM_PONTO_sk       
##  Length:2142       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

dim(localsp)

## [1] 2142   21

class(localsp)

## [1] "tbl_df"     "tbl"        "data.frame"

Source: the authors (2023). Caption: Preliminary analysis of the localsp set. The script presents the results obtained by applying the names(), summary(), dim() and class() functions.

Using the names() function, we confirmed the names of the attributes associated with the data set, that is, its column headers. We also found that localsp contains spatial attributes, such as longitude, latitude and altitude, and identifying attributes, such as ID and CD_GEOCODIGO. The ID attribute, although it identifies a piece of geographic information, takes a different value in each row of the data set, as can be seen in the results generated by Script 10. The same municipality therefore has several ID codes, which refer to different typologies of land use or urban agglomerations; this is confirmed by the attributes CD_NIVEL and CD_CATEGORIA (numeric codes for the different typologies) and NM_CATEGORIA (the names that specify each typology).

On the other hand, when examining the CD_GEOCODIGO attribute, we observed that the same numerical code was repeated across the different profiles of land use and cover and typologies of human settlements, that is, the same identifying code was repeated within the municipalities. According to the technical documentation of the 2010 Census (IBGE, 2013), CD_GEOCODIGO is the geographical area identifier code, established by IBGE to refer to a given polygon (census sector) in the Brazilian territory, as discussed previously.

Additionally, the dim() function showed that the set is composed of 2,142 rows and 21 columns, which describe the data attributes. The class() function revealed that localsp is structured as a tibble (classes tbl_df and tbl) and is also a data frame (data.frame).

The summary() function expanded the information about each attribute in the set. For attributes of type character, the function returned the number of elements in the column (Length), the class of the information in the column (Class) and its storage mode (Mode). For non-character attributes, it returned descriptive statistics of the numerical values: minimum, first quartile, median, mean, third quartile and maximum, together with the number of missing values (NA's).
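The two behaviours of summary() can be checked on a toy data frame. The sketch below uses illustrative values (not IBGE data) and only inspects which fields the function returns for a character column and for a numeric column containing a missing value.

```r
# Toy data: one character column and one numeric column with a missing value
toy <- data.frame(
  NM_CATEGORIA = c("CIDADE", "VILA", "AUI"),
  altitude     = c(465.5, NA, 712.1),
  stringsAsFactors = FALSE
)

chr_summary <- summary(toy$NM_CATEGORIA)
num_summary <- summary(toy$altitude)

names(chr_summary)  # "Length" "Class" "Mode"
names(num_summary)  # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max." "NA's"
```

The "NA's" field appears only when the numeric column actually contains missing values, which is why it is present for almost every numeric attribute of localsp.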

We point out that the statistical results, in the case of localsp, are not relevant for the exploratory analysis. However, these results demonstrate which attributes are interpreted by the software as numerical attributes. Thus, we identified that the attributes longitude, latitude and altitude have numerical data for each of the census sectors. This favors the use of this information for the spatialization of specific geographic information, as we identified in the sets population01, population02 and population03.

By comparison, the results generated by the dim() function show that the population data frames (population01, population02 and population03) are made up of 645 rows (Script 4), while localsp is made up of 2,142 rows (Script 11). This substantial difference in the number of rows reflects the fragmentation of population information: the localsp data frame subdivides the types of population clusters that make up the urban perimeter, as previously indicated for the attributes CD_NIVEL, CD_CATEGORIA and NM_CATEGORIA of this set.

As we found in the 2010 Demographic Census documentation (IBGE, 2013a), the sampling points that make up the localsp data describe the territorial mesh in finer detail, that is, they further divide the territorial portions, reaching the micro data of the Demographic Census. In population01, population02 and population03, on the other hand, this micro data is aggregated by municipality, which reduces the number of rows compared to localsp. However, as observed in Scripts 1, 2 and 3, the typologies of population clusters are attributes of the three population sets (population01/02/03). These indications about the components of the four sets (localsp and population01/02/03) therefore support proceeding with the joining of these data frames.

2.2 Data Mining and Manipulation

Having indicated the relevant aspects of the four central sets of this exploratory analysis, we moved on to mining the data in these sets so that they could later be connected. In the localsp data set, we identified territorial subdivisions in the NM_CATEGORIA column, which reflects the typology of the census sector. The output below lists the distinct values present in this column of localsp. According to the technical documentation of the 2010 Census (IBGE, 2013a), the categories in the micro data portray the profile of the population clusters present in the national territory and, consequently, in the State of São Paulo. These categories therefore reveal how human groups are inserted into the territory, and we assume a strong relationship with land use and land cover.

## [1] "CIDADE"                  "AUI"                    
## [3] "VILA"                    "NÚCLEO"                 
## [5] "POVOADO"                 "LUGAREJO"               
## [7] "PROJETO DE ASSENTAMENTO" "ALDEIA INDÍGENA"        
## [9] NA
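The call that produced this listing is not shown. One plausible way to obtain it, sketched here on a toy column, is unique(); this is an assumption about the authors' code, not a reproduction of it, and the toy values are illustrative.

```r
# Toy column standing in for localsp$NM_CATEGORIA
toy <- data.frame(
  NM_CATEGORIA = c("CIDADE", "AUI", "CIDADE", "VILA", NA, "AUI"),
  stringsAsFactors = FALSE
)

# Distinct values in order of first appearance; NA is kept as its own entry
categories <- unique(toy$NM_CATEGORIA)
categories

# Number of rows per category, counting the missing value as well
table(toy$NM_CATEGORIA, useNA = "ifany")
```

The trailing NA in the listing above is consistent with the single row of missing values reported by summary() for localsp.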

According to the IBGE documentation (2013a), these categories are subdivided into two groups: urban and rural. The urban group comprises cities (CIDADE), isolated urban areas (AUI) and towns (VILA); the rural group comprises nuclei (NÚCLEO), villages (POVOADO), hamlets (LUGAREJO), settlement projects (PROJETO DE ASSENTAMENTO) and indigenous villages (ALDEIA INDÍGENA, traditional communities). Following the logic described by the Institute, we subdivided the localsp data set on the NM_CATEGORIA attribute, generating one data set for each category of census sector and/or population cluster. The following script shows how the sets sp_cities, sp_isolatedurbanareas, sp_urbanvillages, sp_traditionalcommunities, sp_ruralvillage, sp_ruralcore, sp_settlement and sp_settlementproject were created with the filter() function.

Script: Subdivision of the localsp dataset based on the NM_CATEGORIA attribute.

# dplyr provides filter() and the %>% pipe used below
library(dplyr)

# Urban groups
# creating the city group
sp_cities <- localsp %>% filter(NM_CATEGORIA == "CIDADE")

# creating the isolated urban areas group
sp_isolatedurbanareas <- localsp %>% filter(NM_CATEGORIA == "AUI")

# creating the group of villages (urban)
sp_urbanvillages <- localsp %>% filter(NM_CATEGORIA == "VILA")

# Rural groups
# creating the group of indigenous villages
sp_traditionalcommunities <- localsp %>% filter(NM_CATEGORIA == "ALDEIA INDÍGENA")

# creating the group of rural villages
sp_ruralvillage <- localsp %>% filter(NM_CATEGORIA == "LUGAREJO")

# creating the rural core group
sp_ruralcore <- localsp %>% filter(NM_CATEGORIA == "NÚCLEO")

# creating the settlement group (rural)
sp_settlement <- localsp %>% filter(NM_CATEGORIA == "POVOADO")

# creating the settlement project group (rural)
sp_settlementproject <- localsp %>% filter(NM_CATEGORIA == "PROJETO DE ASSENTAMENTO")

Source: the authors (2023). Caption: Script describing the separation of the localsp dataset into 8 categories, which describe the typologies of population clusters from the 2010 Census, separated into two groups: urban and rural.
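As an alternative view of the same subdivision, the sketch below uses base R split() on a toy data frame to build all category subsets in one step and count their rows. It is not the authors' procedure, which relies on eight filter() calls; the toy values are illustrative.

```r
# Toy stand-in for localsp with a missing category in the last row
toy <- data.frame(
  NM_CATEGORIA = c("CIDADE", "AUI", "AUI", "VILA", NA),
  longitude    = c(-51.07, -51.10, -51.02, -49.64, NA),
  latitude     = c(-21.68, -21.70, -21.66, -21.23, NA),
  stringsAsFactors = FALSE
)

# One data frame per category; rows with a missing category are dropped,
# which matches the behaviour of filter(NM_CATEGORIA == "...")
by_category <- split(toy, toy$NM_CATEGORIA)

# Number of rows (points) in each subset
rows_per_category <- sapply(by_category, nrow)
rows_per_category
```

Keeping the subsets in a single named list makes it easy to compare their sizes or loop over them later, at the cost of less explicit object names.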

After creating the different data sets, based on the population grouping categories, we plotted each of them, using the longitude and latitude columns (18 and 19 respectively) to evaluate the spatialization capacity of these data.

Figure: Plot of data from localsp separated by categories of population clusters.

Source: the authors (2023). Caption: Figure generated from categorized data from localsp, where each of the graphs represents a category of population cluster. The number of points reflects the number of rows in each of the eight plotted sets.
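A panel plot of this kind can be sketched with base graphics, one panel per category, using the longitude and latitude columns as x and y. The toy coordinates below are illustrative, and the layout is an assumption rather than the code behind the figure.

```r
# Toy points in two categories (illustrative coordinates)
toy <- data.frame(
  NM_CATEGORIA = c("CIDADE", "CIDADE", "AUI", "AUI", "AUI"),
  longitude    = c(-51.07, -49.64, -51.10, -51.02, -49.60),
  latitude     = c(-21.68, -21.23, -21.70, -21.66, -21.25),
  stringsAsFactors = FALSE
)
groups <- split(toy, toy$NM_CATEGORIA)

# One panel per category, side by side; restore the layout afterwards
old_par <- par(mfrow = c(1, length(groups)))
for (category in names(groups)) {
  plot(groups[[category]]$longitude, groups[[category]]$latitude,
       main = category, xlab = "longitude", ylab = "latitude", pch = 20)
}
par(old_par)
```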

When we turned to the population data sets (population01/02/03), we identified that the set population01 was the one that most completely discriminated the different categories of population clusters. This confirmation occurred through the analysis of the attributes that made up the three sets, as illustrated in the following script, which uses the names() function for comparison.

Table: Comparison of attributes that make up population groups.

##                          Population 01                   Population 02
## 1                                 city                            city
## 2                      Área Urbanizada                           Total
## 3                  Área não Urbanizada                          Urbana
## 4                  Área Urbana Isolada               Na sede municipal
## 5       Área Rural (Exceto Aglomerado)                           Rural
## 6  Aglomerado Rural de Extensão Urbana              Área\ntotal\n(km²)
## 7             Aglomerado Rural Povoado Densidade demográfica (hab/km²)
## 8              Aglomerado Rural Núcleo    Código da Unidade Geográfica
## 9      Outros Aglomerados Rurais Raros                            <NA>
## 10        Código da Unidade Geográfica                            <NA>
##                                                    Population 03
## 1                                                           city
## 2                                   População residente Absoluta
## 3                      População residente absoluta total urbana
## 4  População residente absoluta total na sede municipal urbana\n
## 5                                         Total Relativa (%)...5
## 6                                         Total Relativa (%)...6
## 7                               Na sede municipal Relativa (%)\n
## 8                                           Área\ntotal\n(km²)\n
## 9                                Densidade demográfica (hab/km²)
## 10                                  Código da Unidade Geográfica

Source: the authors (2023). Legend: Table showing the attributes that make up the three population groups (population01, population02 and population03). The “NA” demonstrates that the population02 set has a smaller number of columns in its structure.
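A comparison table of this shape can be built by padding the shorter name vectors with NA so that all columns have the same length. The sketch below does this for two toy data frames; the object names and columns are hypothetical and stand in for the population sets.

```r
# Toy data frames with different numbers of columns
population01_toy <- data.frame(city = "A", urban = 1, rural = 2, code = 3500105)
population02_toy <- data.frame(city = "A", total = 3, code = 3500105)

# Attribute names of each set, collected under the table headers
attrs <- list(
  "Population 01" = names(population01_toy),
  "Population 02" = names(population02_toy)
)

# Pad every vector with NA up to the length of the longest one
n_max  <- max(lengths(attrs))
padded <- lapply(attrs, function(x) c(x, rep(NA_character_, n_max - length(x))))

# check.names = FALSE keeps the spaces in the column headers
comparison <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
comparison
```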

As seen in the results, the set population01 presents the breakdown of the categories of population clusters previously found in localsp, namely: municipality, urbanized area, non-urbanized area, isolated urban area, rural area (except agglomeration), urban extension rural agglomeration, populated rural agglomeration, core rural agglomeration and other rare rural agglomerations. Given this, we chose to focus on population01 to advance the correlation of the data, considering its correspondence with the localsp data (and its categories) and its complete coverage of the municipalities of São Paulo.

From this perspective, we split the population01 data into 8 sets, following the same data frame construction logic. In each of the 8 sets, we kept the first and last columns, respectively “city” and “CD_GEOCODIGO” (the column Código da Unidade Geográfica), both of utmost importance for correlation with the data from localsp, as discussed previously. Each of the remaining population01 columns (2 to 9) supplied the variable of one of the 8 new sets.

Therefore, the following Script describes how the isolated sets of demographic information were created according to the category (typology) of the population cluster. For the partitioning and creation of the following 8 sets, we used as a reference the technical documentation of the 2010 Census (IBGE, 2013a), which explains to which category each of the attributes in columns 2 to 9 of population01 belongs; these are also the categories present in localsp (NM_CATEGORIA).

Script: Partitioning population01 data according to population cluster categories.

pop01_cities <- population01[,c(1,2,10)]
pop01_cities

## # A tibble: 645 × 3
##    city                   `Área Urbanizada` `Código da Unidade Geográfica`
##    <chr>                              <dbl>                          <dbl>
##  1 ADAMANTINA                         31713                        3500105
##  2 ADOLFO                              3155                        3500204
##  3 AGUAÍ                              27261                        3500303
##  4 ÁGUAS DA PRATA                      5513                        3500402
##  5 ÁGUAS DE LINDÓIA                    6886                        3500501
##  6 ÁGUAS DE SANTA BÁRBARA              3681                        3500550
##  7 ÁGUAS DE SÃO PEDRO                  2707                        3500600
##  8 AGUDOS                             32173                        3500709
##  9 ALAMBARI                            3036                        3500758
## 10 ALFREDO MARCONDES                   2690                        3500808
## # ℹ 635 more rows

pop01_isolatedurbanareas <- population01[,c(1,4,10)]
pop01_isolatedurbanareas

## # A tibble: 645 × 3
##    city                   `Área Urbana Isolada` `Código da Unidade Geográfica`
##    <chr>                                  <dbl>                          <dbl>
##  1 ADAMANTINA                               180                        3500105
##  2 ADOLFO                                    45                        3500204
##  3 AGUAÍ                                   1025                        3500303
##  4 ÁGUAS DA PRATA                          1258                        3500402
##  5 ÁGUAS DE LINDÓIA                           0                        3500501
##  6 ÁGUAS DE SANTA BÁRBARA                   578                        3500550
##  7 ÁGUAS DE SÃO PEDRO                         0                        3500600
##  8 AGUDOS                                   161                        3500709
##  9 ALAMBARI                                 636                        3500758
## 10 ALFREDO MARCONDES                          0                        3500808
## # ℹ 635 more rows

pop01_urbanvillages <- population01[,c(1,3,10)]
pop01_urbanvillages

## # A tibble: 645 × 3
##    city                   `Área não Urbanizada` `Código da Unidade Geográfica`
##    <chr>                                  <dbl>                          <dbl>
##  1 ADAMANTINA                                55                        3500105
##  2 ADOLFO                                     0                        3500204
##  3 AGUAÍ                                    715                        3500303
##  4 ÁGUAS DA PRATA                             0                        3500402
##  5 ÁGUAS DE LINDÓIA                       10225                        3500501
##  6 ÁGUAS DE SANTA BÁRBARA                     0                        3500550
##  7 ÁGUAS DE SÃO PEDRO                         0                        3500600
##  8 AGUDOS                                   659                        3500709
##  9 ALAMBARI                                   0                        3500758
## 10 ALFREDO MARCONDES                        565                        3500808
## # ℹ 635 more rows

pop01_traditionalcommunities <- population01[,c(1,5,10)]
pop01_traditionalcommunities

## # A tibble: 645 × 3
##    city                   Área Rural (Exceto Aglomerado…¹ Código da Unidade Ge…²
##    <chr>                                            <dbl>                  <dbl>
##  1 ADAMANTINA                                           0                3500105
##  2 ADOLFO                                               0                3500204
##  3 AGUAÍ                                             3147                3500303
##  4 ÁGUAS DA PRATA                                     813                3500402
##  5 ÁGUAS DE LINDÓIA                                   155                3500501
##  6 ÁGUAS DE SANTA BÁRBARA                            1342                3500550
##  7 ÁGUAS DE SÃO PEDRO                                   0                3500600
##  8 AGUDOS                                            1531                3500709
##  9 ALAMBARI                                          1212                3500758
## 10 ALFREDO MARCONDES                                  636                3500808
## # ℹ 635 more rows
## # ℹ abbreviated names: ¹`Área Rural (Exceto Aglomerado)`,
## #   ²`Código da Unidade Geográfica`

pop01_ruralvillage <- population01[,c(1,6,10)]
pop01_ruralvillage

## # A tibble: 645 × 3
##    city                   Aglomerado Rural de Extensão …¹ Código da Unidade Ge…²
##    <chr>                                            <dbl>                  <dbl>
##  1 ADAMANTINA                                           0                3500105
##  2 ADOLFO                                               0                3500204
##  3 AGUAÍ                                                0                3500303
##  4 ÁGUAS DA PRATA                                       0                3500402
##  5 ÁGUAS DE LINDÓIA                                     0                3500501
##  6 ÁGUAS DE SANTA BÁRBARA                               0                3500550
##  7 ÁGUAS DE SÃO PEDRO                                   0                3500600
##  8 AGUDOS                                               0                3500709
##  9 ALAMBARI                                             0                3500758
## 10 ALFREDO MARCONDES                                    0                3500808
## # ℹ 635 more rows
## # ℹ abbreviated names: ¹`Aglomerado Rural de Extensão Urbana`,
## #   ²`Código da Unidade Geográfica`

pop01_ruralcore <- population01[,c(1,8,10)]
pop01_ruralcore

## # A tibble: 645 × 3
##    city                   `Aglomerado Rural Núcleo` Código da Unidade Geográfi…¹
##    <chr>                                      <dbl>                        <dbl>
##  1 ADAMANTINA                                     0                      3500105
##  2 ADOLFO                                         0                      3500204
##  3 AGUAÍ                                          0                      3500303
##  4 ÁGUAS DA PRATA                                 0                      3500402
##  5 ÁGUAS DE LINDÓIA                               0                      3500501
##  6 ÁGUAS DE SANTA BÁRBARA                         0                      3500550
##  7 ÁGUAS DE SÃO PEDRO                             0                      3500600
##  8 AGUDOS                                         0                      3500709
##  9 ALAMBARI                                       0                      3500758
## 10 ALFREDO MARCONDES                              0                      3500808
## # ℹ 635 more rows
## # ℹ abbreviated name: ¹`Código da Unidade Geográfica`

pop01_settlement <- population01[,c(1,7,10)]
pop01_settlement

## # A tibble: 645 × 3
##    city                   `Aglomerado Rural Povoado` Código da Unidade Geográf…¹
##    <chr>                                       <dbl>                       <dbl>
##  1 ADAMANTINA                                      0                     3500105
##  2 ADOLFO                                          0                     3500204
##  3 AGUAÍ                                           0                     3500303
##  4 ÁGUAS DA PRATA                                  0                     3500402
##  5 ÁGUAS DE LINDÓIA                                0                     3500501
##  6 ÁGUAS DE SANTA BÁRBARA                          0                     3500550
##  7 ÁGUAS DE SÃO PEDRO                              0                     3500600
##  8 AGUDOS                                          0                     3500709
##  9 ALAMBARI                                        0                     3500758
## 10 ALFREDO MARCONDES                               0                     3500808
## # ℹ 635 more rows
## # ℹ abbreviated name: ¹`Código da Unidade Geográfica`

pop01_settlementproject <- population01[,c(1,9,10)]
pop01_settlementproject

## # A tibble: 645 × 3
##    city                   Outros Aglomerados Rurais Rar…¹ Código da Unidade Ge…²
##    <chr>                                            <dbl>                  <dbl>
##  1 ADAMANTINA                                           0                3500105
##  2 ADOLFO                                               0                3500204
##  3 AGUAÍ                                                0                3500303
##  4 ÁGUAS DA PRATA                                       0                3500402
##  5 ÁGUAS DE LINDÓIA                                     0                3500501
##  6 ÁGUAS DE SANTA BÁRBARA                               0                3500550
##  7 ÁGUAS DE SÃO PEDRO                                   0                3500600
##  8 AGUDOS                                               0                3500709
##  9 ALAMBARI                                             0                3500758
## 10 ALFREDO MARCONDES                                    0                3500808
## # ℹ 635 more rows
## # ℹ abbreviated names: ¹`Outros Aglomerados Rurais Raros`,
## #   ²`Código da Unidade Geográfica`

Source: the authors (2023). Caption: In the script, data from the population01 set is partitioned according to the category of the population cluster. Columns 2 to 9 of the population01 set are isolated and integrated with columns 1 (city) and 10 (CD_GEOCODIGO), structuring 8 new data sets, namely: pop01_cities, pop01_isolatedurbanareas, pop01_urbanvillages, pop01_traditionalcommunities, pop01_ruralvillage, pop01_ruralcore, pop01_settlement and pop01_settlementproject.
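The eight column selections follow one pattern: column 1, one value column, and column 10. A compact sketch of that pattern on a toy data frame is given below. The column indices are those used in the script above; the toy data frame, its column names and the list name pop01_sets are illustrative assumptions.

```r
# Toy stand-in for population01: city, eight value columns (X1..X8), code
population01_toy <- data.frame(
  city = c("A", "B"),
  matrix(1:16, nrow = 2),
  code = c(3500105, 3500204),
  stringsAsFactors = FALSE
)

# Value column used for each set, as in the script above
value_cols <- c(cities = 2, isolatedurbanareas = 4, urbanvillages = 3,
                traditionalcommunities = 5, ruralvillage = 6,
                ruralcore = 8, settlement = 7, settlementproject = 9)

# Keep the city (1), the category column (j) and the geographic code (10)
pop01_sets <- lapply(value_cols, function(j) population01_toy[, c(1, j, 10)])

names(pop01_sets$ruralcore)
```

Writing the index mapping once as a named vector makes the correspondence between population01 columns and localsp categories explicit and easier to audit against the IBGE documentation.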

Comparing the 8 sets from localsp and the 8 from population01, we noticed that they diverge in the number of rows, and two remarks on the following table are needed. First, not all micro data categories (NM_CATEGORIA) are present in every municipality of São Paulo, so the sets do not correspond one to one. Second, a given category may be described by more than one point, as is the case of isolated urban areas (985 rows).

Table: Comparison between the number of lines of the 8 sets formed from population01 and the 8 derived from localsp

##   Data.coming.from.population01 Data.coming.from.localsp
## 1                           645                      645
## 2                           645                      985
## 3                           645                      295
## 4                           645                       12
## 5                           645                      104
## 6                           645                       30
## 7                           645                       61
## 8                           645                        9

Source: the authors (2023). Legend: Table comparing the number of lines in the sets from population01 (first column) and localsp (second column).
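A comparison table of this kind can be assembled by counting the rows of each set. The sketch below is illustrative only: it uses small made-up data frames, and the names pop_sets, loc_sets and row_comparison are hypothetical stand-ins, not the real partitions of population01 and localsp.

```r
# Toy stand-ins for two of the partitions (not IBGE data).
pop_sets <- list(
  cities = data.frame(city = c("A", "B", "C"), population = c(10, 20, 30)),
  isolatedurbanareas = data.frame(city = c("A", "B", "C"), population = c(1, 0, 2))
)
loc_sets <- list(
  cities = data.frame(city = c("A", "B", "C")),
  isolatedurbanareas = data.frame(city = c("A", "A", "C", "C"))
)

# Count rows per set and place the two counts side by side.
row_comparison <- data.frame(
  from_population01 = sapply(pop_sets, nrow),
  from_localsp = sapply(loc_sets, nrow)
)
print(row_comparison)
```

In this toy case the second set has more points than municipalities, which mirrors the situation described for isolated urban areas.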

Despite the divergence between the number of lines, we emphasize that this does not invalidate data mining; on the contrary, this confirms the presence and/or absence of different categories in the territory of São Paulo and allows us to recognize the demography in each of the municipalities that make up the micro data.

2.3 Data Joining

Having established the 8 sets from localsp and the 8 derived from population01, we set out to join them as a way of spatializing the demographic data. We were guided by the data from localsp, to which we added the respective populations (number of people). As highlighted previously, there is a numerical divergence between the two groups of sets (Table), and therefore we must consider two aspects.

The first covers the sets with 645 lines or fewer, where there is a correspondence between Column 1 (data from population01) and Column 2 (data from localsp) of the Table. In this case, when no point is described for a municipality (Column 2), no population is described for that typology of census sector. Conversely, when all lines in Column 2 correspond to lines in Column 1, all sample points had their populations integrated.

In the second aspect, we cover the cases in which Column 2 has more than 645 points, exceeding the number of lines in Column 1; that is, not all geographic points coming from localsp have a one-to-one correspondence in population01. The only case found throughout data mining was that of isolated urban areas (AUI), portrayed in the set sp_isolatedurbanareas. According to the IBGE technical documentation (2023), an isolated urban area is “an area defined by law and separated from the district headquarters [municipality] by rural area or by another legal limit”; therefore, the same municipality may have several AUIs recorded during the census sampling, as can be seen in Column 2 of the Table (line 2). However, as Column 1 of the Table shows, the population data are aggregated, that is, all isolated urban areas in a municipality have their populations reported in a single line. For the purposes of the analysis, and acknowledging this sampling limitation, we assigned the total AUI population of the municipality to each point, that is, the value for a municipality (Column 1) is repeated for every point associated with that municipality (Column 2). This overestimation will be dealt with during data mapping by differentiating colors on the map and, jointly, in the legend referring to data from isolated urban areas.
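The repetition of a municipal total across several points can be illustrated with a small base R sketch. The data, the join key city and the column repeated_total are assumptions made for illustration; the sketch uses merge() rather than the join functions of the script that follows.

```r
# Hypothetical municipal totals for isolated urban areas (one row per municipality).
pop_aui <- data.frame(
  city = c("A", "B"),
  population = c(500, 120)
)
# Hypothetical mapped points: municipality "A" has two isolated urban areas.
points_aui <- data.frame(
  city = c("A", "A", "B"),
  longitude = c(-47.10, -47.20, -48.00),
  latitude = c(-22.10, -22.30, -21.50)
)

# Joining on the municipality copies the municipal total onto every point.
joined <- merge(points_aui, pop_aui, by = "city", all.x = TRUE)

# Flag points whose population is a repeated municipal total, so that they
# can be distinguished on the map and in the legend.
points_per_city <- table(joined$city)
joined$repeated_total <- as.vector(points_per_city[joined$city]) > 1
print(joined)
```

Summing the population column of such a joined set counts the total of municipality "A" twice, which is the overestimation discussed above.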

That said, let’s move on to data joining. Following the same logic just used, we divided the join into two moments, the first for the sets in Column 2 with points equal to or less than 645 lines and the second for the number of points greater than 645.

Script: Joining data from population01 and localsp.

#Joining data from the Cities typology
cities <- sp_cities%>%left_join(pop01_cities)

#Joining data from the Urban Villages typology
urbanvillages <- sp_urbanvillages%>%inner_join(pop01_urbanvillages)

#Joining data from the Traditional Communities typology
traditionalcommunities <- sp_traditionalcommunities%>%inner_join(pop01_traditionalcommunities)

#Joining data from the Rural Village typology
ruralvillage <- sp_ruralvillage%>%inner_join(pop01_ruralvillage)

#Joining data from the Rural Core typology
ruralcore <- sp_ruralcore%>%inner_join(pop01_ruralcore)

#Joining data from the Rural Settlement typology
ruralsettlement <- sp_settlement%>%inner_join(pop01_settlement)

#Joining data from the Settlement Projects typology
settlementproject <- sp_settlementproject%>%inner_join(pop01_settlementproject)

#Joining data from the Isolated Urban Areas typology
isolatedurbanareas <- sp_isolatedurbanareas%>%right_join(pop01_isolatedurbanareas)
isolatedurbanareas <- isolatedurbanareas[complete.cases(isolatedurbanareas$altitude, isolatedurbanareas$latitude),]

Source: the authors (2023). Caption: in the script, information from population01 is associated with the locations of localsp through the left_join(), inner_join() and right_join() functions.

According to the Script, we were able to connect the data from population01 and localsp, adding the respective populations to the different census typologies. In this way, we structure new sets, where the population information and geolocations of each of the points structured for them are found, namely: cities, urbanvillages, isolatedurbanareas, traditionalcommunities, ruralvillage, ruralcore, ruralsettlement and settlementproject.
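The difference between the join variants can be seen on a toy example. The sketch below is a base R approximation with merge(), not the dplyr calls of the Script; the data and the key column city are assumptions.

```r
# Toy location and population tables sharing the key column "city".
locations <- data.frame(city = c("A", "B", "C"), latitude = c(-22.1, -21.5, -23.0))
population <- data.frame(city = c("A", "B", "D"), population = c(500, 120, 80))

inner <- merge(locations, population, by = "city")                # comparable to inner_join()
left  <- merge(locations, population, by = "city", all.x = TRUE)  # comparable to left_join()
right <- merge(locations, population, by = "city", all.y = TRUE)  # comparable to right_join()

# After a right join, rows that received no coordinates can be dropped.
right_located <- right[complete.cases(right$latitude), ]

print(c(inner = nrow(inner), left = nrow(left), right = nrow(right),
        right_located = nrow(right_located)))
```

An inner join keeps only keys present in both tables, while left and right joins keep every row of one side and fill the unmatched columns with NA.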

2.4 Data plotting

Once the data joining stage is complete, we move on to plotting this data. Firstly, we must highlight that the choice of colors for the plot followed the guidelines established by Ellis and Ramankutty (2008), where shades of red represent urban populations and their nuances, while rural populations are associated with earthy tones (orange and brown). Furthermore, we use shades of blue to distinguish traditional populations, taking into account their sociocultural uniqueness, both in terms of their relationship with nature and in relation to their relevance for the maintenance and preservation of the identity of these groups. The Table presents the tones used (hexadecimal RGB code) for each territorial typology described by the data.

It is worth highlighting here that, unlike what was proposed for anthromes, we consider the features identified by IBGE (2023) in the spatial continuum proposal. According to this document, we notice an expansion of concepts and, consequently, of urban-rural approaches, which proves to be extremely relevant for structuring public policies in the country. Therefore, the continuum project proposed by the Brazilian Institute expands the guidelines proposed by Ellis (2020) regarding anthromes and, therefore, we consider this as an improvement in the delineation of Brazilian anthromes.

We emphasize from the outset that the IBGE document, published in 2023, relies to a large extent on data from the 2010 Demographic Census (IBGE, 2013) for its analytical structuring and modeling, a fact that aligns our project with the technical-scientific developments of the Brazilian Institute and does not invalidate such data as a source for current scientific production. It is also worth considering that both the IBGE document (2023) and this research precede the publication of the complete data from the 2022 Demographic Census, that is, this limitation is present in both products, which must be updated after the publication of the complete IBGE data. However, we reiterate that this does not invalidate the development of this work, as the coded structure is adaptable to different sources of information, including the subsequent update with data from the Brazilian Demographic Census carried out in 2022.

That said, we return to the attributes longitude and latitude that are part of the data sets just produced. Since these two attributes are fundamental for spatialization in the plot and, subsequently, in the mapping of population information, they are the ones used to construct the plot of the data below. Therefore, we select these two pieces of information using the “$” operator in each of the sets. Additionally, we chose the format for plotting the points using the “pch” descriptor, using the number “15” to plot as squares filled in the same color, which was chosen using the “col” descriptor. As we said above, the colors vary according to the typology of the census sector (Table). Below is the Script that encodes the separation of data, its plotting and the respective coloring of each one.
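A plot of this kind can be sketched in base R as follows. The coordinates and populations are made up, and the name toy_cities and the symbol-size scaling (from 0.5 to 2) are assumptions for illustration only.

```r
# Made-up coordinates and populations for one typology (not IBGE data).
toy_cities <- data.frame(
  longitude = c(-46.6, -47.1, -48.3),
  latitude = c(-23.5, -22.9, -21.8),
  population = c(12000, 3500, 800)
)

# Scale the symbol size with population; the 0.5 to 2 range is arbitrary.
pop_range <- range(toy_cities$population)
symbol_size <- 0.5 + 1.5 * (toy_cities$population - pop_range[1]) / diff(pop_range)

pdf(NULL)  # null device, so the sketch also runs without a display
plot(toy_cities$longitude, toy_cities$latitude,
     pch = 15,           # filled squares
     col = "#FF0000",    # red, the color assigned to cities
     cex = symbol_size,  # square size follows the population
     xlab = "Longitude", ylab = "Latitude", main = "Cities")
invisible(dev.off())
```

Removing the pdf(NULL) and dev.off() lines draws the plot on the active graphics device instead.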

Script: Plotting the 8 data sets separately.

Source: the authors (2023). Legend: plot of population data from different territorial typologies with their respective colors.

As can be seen in the plots just produced, the data for each territorial typology were spatialized according to the two pieces of geographic information (latitude and longitude). The “cex” descriptor set the size of each plotted square according to the size of the reference population (attribute population of each data set). It is worth noting that we added a new column to the 8 data sets, named category (categoria in the code), in which the territorial typology of each set was recorded, so that we could plot the 8 sets in a single plot. Additionally, we created the colors set, which determines the colors for the single plot of populated anthromes. The following script presents these processes.

Script: Coded structure for plotting the 8 data sets.

#Sets combined into populated_anthromes: cities, urbanvillages, isolatedurbanareas, traditionalcommunities, ruralvillage, ruralcore, ruralsettlement and settlementproject

colors <- c("Cities" = "#FF0000", "Isolated Urban Areas" = "#FF4747", "Urban Villages"= "#F66969", "Rural Village"= "#ED833B", "Rural Core"="#DF9B6D", "Rural Settlement"="#FFD966", "Settlement Project"="#968551", "Traditional Communities"="#9CC2E5")

legend_populatedanthromes <- data.frame(Categorias = unique(populated_anthromes$categoria), Cores = unique(colors))

legend_populatedanthromes

##                Categorias   Cores
## 1                  Cities #FF0000
## 2    Isolated Urban Areas #FF4747
## 3          Urban Villages #F66969
## 4           Rural Village #ED833B
## 5              Rural Core #DF9B6D
## 6        Rural Settlement #FFD966
## 7      Settlement Project #968551
## 8 Traditional Communities #9CC2E5

Source: the authors (2023). Legend: the script presents the coded structures for plotting the 8 data sets, which are guided by the populated_anthromes and colors data sets.
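The labelling and stacking of the sets can be sketched as below. This is a minimal illustration with two toy typologies; the names toy_cities, toy_ruralcore, populated_toy, toy_colors and legend_toy are hypothetical, and the way the real populated_anthromes set was assembled is an assumption.

```r
# Two toy typologies standing in for the 8 joined sets.
toy_cities <- data.frame(longitude = c(-46.6, -47.1), latitude = c(-23.5, -22.9))
toy_ruralcore <- data.frame(longitude = -48.3, latitude = -21.8)

# Label each set with its typology, then stack them into one data frame.
toy_cities$categoria <- "Cities"
toy_ruralcore$categoria <- "Rural Core"
populated_toy <- rbind(toy_cities, toy_ruralcore)

# Named vector: colors are looked up by category name, not by position.
toy_colors <- c("Cities" = "#FF0000", "Rural Core" = "#DF9B6D")
populated_toy$cor <- unname(toy_colors[populated_toy$categoria])

# Legend table built from the names and values of the same vector.
legend_toy <- data.frame(Categorias = names(toy_colors), Cores = unname(toy_colors))
print(legend_toy)
```

Building the legend from names() of the color vector keeps each category paired with its color even if the categories appear in a different order in the data.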

Once these operations were carried out, we proceeded to plot the data from the 8 sets simultaneously. We follow the color pattern established by the colors set and the respective typologies of the population sectors described by category. We reiterate that the spatialization of the data was based on latitude and longitude information.

Figure: Plot of data from the 8 sets of population anthromes.

Source: the authors (2023). Caption: plot of data referring to populated anthromes in the State of São Paulo, divided between the 8 categories created based on IBGE data.

The Figure reveals that the plotting of data from the 8 population groups (census typologies) occurred correctly, allowing the integration of the different typologies into a single figure. Furthermore, it is noted that the legend follows the coloring established for the different territorial categories. Therefore, through the figure, we can see the adequacy of the data distribution for territorial mapping, carried out in the analytical sequence.

2.5 Static mapping of populated anthromes

After plotting the populated_anthromes data, we move on to the static mapping of this data set. The static mapping aimed to structure the distribution of points in the shape file of the municipalities (urban perimeters) of the State of São Paulo. To carry out this mapping, some adjustments to the sets were necessary, which are summarized below. The code for carrying out these actions was hidden in this document; however, it is found in the file available on GitHub associated with this work.

Organization and determination of the color categories used for each typology of populated anthromes, determining how colorimetric information and anthropogenic types should be combined.
Loading of the shapefile sp_municipios.shp made available by IBGE. This file refers only to municipalities in the State of São Paulo; therefore, only the polygons referring to cities in São Paulo appear on the generated map.
Creation of the shapefile set cities_shape, where the attributes NM_MUNICIP and geometry from the set sp_municipios.shp were selected, isolating them for the construction of the mapping of São Paulo cities.

We take the opportunity to justify the use of the shapefile as the means to construct the mapping. Our first option was to use the orbital images provided by Google Earth, using an API key to integrate the mapping with the Google LLC platform. However, during the construction of the mapping, we found that the images provided by the company are only available upon payment. Despite the technological advantages of using these images, we chose not to use them, considering the cost of operation at this stage of the research and the free services already provided by the Brazilian Institute of Geography and Statistics, as the shapefile used in the mapping demonstrates. We therefore opted for IBGE files to maintain our technical-scientific alignment with free national data structures, allowing other researchers and users to access and build mappings such as the one presented below.

Once these operations have been carried out, we move on to mapping the data from the populated_anthromes set onto the shapefile cities_shape. To build the mapping we use the ggplot2 package, starting from the ggplot() function and combining different functions associated with it. We highlight the main functions below, following the order of application:

geom_sf(): loading of the shapefile cities_shape;
geom_point(): determination of the mapped points of populated_anthromes;
scale_color_manual(): determination of colors and order for mapping populated anthromes;
labs() and variables: determination of graphic characteristics of the legend and mapping.

That said, the following Script presents the code for constructing the mapping and, as a result, the mapping of populated anthromes.

Script: Static Mapping of Populated anthromes in the State of São Paulo.

library(ggplot2)

#Base layer: municipal polygons; points: populated anthromes drawn as filled squares (shape 15)
map_anthromes <- ggplot()+
  geom_sf(data = cities_shape)+
  geom_point(data = populated_anthromes, aes(x = longitude, y = latitude, color = categoria), shape = 15)+
  #Colors and legend order for each typology
  scale_color_manual(values = setNames(cores_categorias$cor, cores_categorias$categoria), breaks = ordem_categorias, labels = ordem_categorias)+
  #Titles, axis labels and legend title
  labs(title = "Populated Anthromes", subtitle = "Study Area: State of São Paulo (Brazil)", x = "Longitude", y = "Latitude", color = "Populated Anthromes")+
  theme_minimal()

print(map_anthromes)

Source: the authors (2023). Caption: code showing the structure used to map data from the populated_anthromes set onto the shapefile cities_shape.

The mapping generated by the code above demonstrates that the populated_anthromes data was overlaid on the cities_shape shapefile as expected. The mapping followed the guidelines for distribution of sampling points according to the longitude and latitude described in the populated_anthromes set, as well as the color established for each category.

It is observed, however, that some squares (points referring to populated anthromes) go beyond the areas of the shapefile cities_shape. This aspect was considered in the study of the uncertainty in the mapping of populated anthromes, which we present after the interactive mapping of populated anthromes.

In the next topic, we transpose static mapping to interactive mapping, in order to structure a map that can be integrated with technological services, such as websites and the GitHub collection.

2.6 Interactive mapping of populated anthromes

Firstly, it was necessary to prepare some structures for the interactive mapping to be created next. The first of these was the structuring of the components used in the legend to be printed in the mapping. To this end, we created the set demographic_anthromes, which contains the names of the 8 typologies of populated anthromes that have been mapped up to this point. In addition, to define the categories and colors printed in the mapping, we used the data set described in the legend_populatedanthromes data frame, taking into account its previous use for plotting the data (performed in the previous item).

Having defined these aspects, we move on to the mapping itself. To create the interactive mapping, we used the leaflet package and the editable resources associated with it, of which we highlight the most relevant. Firstly, the addMarkers() function helped to demarcate the points where data were found in the demographic_anthromes set, using the latitude and longitude attributes for plotting.

In addition, the addRectangles() function helped to define the position and size of the squares used to demarcate the populated anthromes. At this point, we must highlight that we used an offset of 0.03 degrees (positive and negative) to size the squares. The choice was based on the analysis of the literature and of the mapping itself: during the tests, we observed that with smaller offsets the squares did not cover the surface of some cities, given the point-based nature of the demographic_anthromes data set. Therefore, we chose to use this value as a reference and to consider it later when analyzing the uncertainty of the generated mapping.
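The corner coordinates of such squares can be derived from the point coordinates as sketched below. The points are made up, the names half_side, toy_points and rect_bounds are hypothetical, and the sketch assumes that the 0.03 offset is applied on both sides of each point.

```r
half_side <- 0.03  # degrees subtracted from and added to each coordinate

# Made-up point coordinates (not IBGE data).
toy_points <- data.frame(longitude = c(-46.60, -47.10), latitude = c(-23.50, -22.90))

rect_bounds <- data.frame(
  lng1 = toy_points$longitude - half_side,  # western edge
  lat1 = toy_points$latitude - half_side,   # southern edge
  lng2 = toy_points$longitude + half_side,  # eastern edge
  lat2 = toy_points$latitude + half_side    # northern edge
)
print(rect_bounds)
```

The four columns correspond to the lng1, lat1, lng2 and lat2 arguments of leaflet's addRectangles(). Under this assumption each square spans 0.06 degrees per side.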

Thus, the first interactive mapping of Brazilian anthromes is presented below, portraying the model area of this work, the State of São Paulo, and the populated anthromes present in it. It was built on OpenStreetMap, a free, collaborative global mapping project that can be used by any user or researcher around the world. We reiterate that our choice is motivated by the dissemination of knowledge and the reproducibility of the research carried out here; the free nature and availability of these open maps therefore justify it.

Figure: Interactive Mapping of Populated anthromes in the State of São Paulo.

Source: the authors (2023). Caption: interactive mapping produced in R language (RStudio) presenting the populated anthromes in the State of São Paulo (Brazil), the reference area for the pilot study of Brazilian anthromes. The squares describe the anthropogenic sectors, each demarcated with an offset of 0.03 degrees (positive and negative) in latitude and longitude. The legend lists the colors visible in the mapping and the typologies of the anthropogenic sectors to which they refer.

Admittedly, the product has its limitations; however, it represents in a relevant way the populated anthropogenic sectors distributed in the territory of São Paulo. Returning to the question of the dimensions of the squares present in the mapping, we calculated the area described by each square. For this purpose, we used the means of the variables latitude and longitude from the populated_anthromes set as the basis for the calculations, taking into account the high number of isolated points.

Furthermore, we consider the first two formulas presented below for calculating the width and height of the square. Subsequently, based on the results, we calculate the area of the square using the third formula in the sequence.

Formulas

$\text{Width (longitude)} = 0.03 \times 111.32 \times \cos(\text{mean latitude})$

$\text{Height (latitude)} = 0.03 \times 111.32 \times \sin(\text{mean latitude})$

$\text{Area of the square} = \text{width} \times \text{height}$

Once the mathematical expressions are presented, the results obtained are reported below.

## square width (longitude) in kilometers: 3.020237 km

## square height (latitude) in kilometers: 1.425164 km

## Average square area in square kilometers: 4.304335 km²
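As a hedged cross-check, the sketch below evaluates the size of a 0.03 degree offset in R under a commonly used approximation. It assumes a mean latitude of about -22.43 (the mean reported by summary() later in this section), converts degrees to radians before calling cos(), since R's trigonometric functions expect radians, and takes one degree of latitude as roughly 111.32 km at any latitude. These assumptions are ours, so the output need not reproduce the values reported above.

```r
mean_lat_deg <- -22.43   # assumed mean latitude, in degrees
offset_deg <- 0.03       # offset in degrees used for the squares
km_per_degree <- 111.32  # approximate length of one degree at the equator

mean_lat_rad <- mean_lat_deg * pi / 180  # cos() expects radians

# East-west extent shrinks with the cosine of the latitude.
width_km <- offset_deg * km_per_degree * cos(mean_lat_rad)
# North-south extent under this approximation does not depend on latitude.
height_km <- offset_deg * km_per_degree
area_km2 <- width_km * height_km

cat("width (km):", width_km, "\n")
cat("height (km):", height_km, "\n")
cat("area (km2):", area_km2, "\n")
```

If the offset is applied on both sides of each point, the side of the square is twice the offset and the area is four times the value computed here.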

We emphasize that up to this point no mapping uncertainty studies have been carried out, this being the subsequent stage of the work.

2.7 Mapping validation and uncertainty studies

As established in the methodology of this work, we carried out procedures to evaluate and certify the quality of the regional mapping of anthromes. Following the investigative guidelines established by Lovelace, Nowosad and Muenchow (2019) and Wickham, Çetinkaya-Rundel and Grolemund (2023), we list studies to confirm the spatialization of geospatial information (distribution of population data), to evaluate the uncertainty and error associated with the mapping from the standpoint of Earth and Environmental Sciences, and to attest to the quality of the product generated by this study. As the aforementioned authors present, these investigations are part of mapping uncertainty and validation studies, which are reported below.

2.7.1 Overlap Analysis

The initial stage of the mapping quality analysis is the analysis of the overlap between the mapped points of the populated anthromes of the State of São Paulo and the data from the São Paulo locations, which portray the census sectors used during the 2010 Demographic Census (IBGE, 2013). According to the Brazilian legal apparatus (MAPA/INCRA, 2022; BRASIL, 2018; MMA, 2006; 2002), overlap analysis is a regulated instrument within the Federation for evaluating the overlap of polygons in areas registered with different government institutions. The objective is to identify whether rural and urban properties overlap spatially in property registers (rural and urban), which could generate territorial conflicts, tax defaults, and other judicial, civic and environmental problems.

From this perspective, analyzing the overlap of populated anthromes with the raw IBGE data aims to demonstrate the alignment of the product with the urban-rural mesh of the census sectors. As can be seen from the regulations just discussed, the non-overlapping of populated areas and locations in São Paulo would portray the inaccuracy of the mapping, potentially causing the aforementioned problems for territorial planning and, consequently, for the spheres of government. Thus, following the premises of Lovelace, Nowosad and Muenchow (2019) and Wickham, Çetinkaya-Rundel and Grolemund (2023), we investigated how the points referring to populated anthromes overlap with data from the São Paulo census sectors.

Thus, we converted the raw data from br_locations_2010 (shapefile) into a simple features object (sf) using the st_as_sf() function. After the conversion, we extracted the data referring to the State of São Paulo using the filter() function, structuring the saopaulo data set, which refers to the census sectors of the State of São Paulo. After structuring this set, here understood as the comparator, we determined the number of sample points, choosing all points (2143) as the sample for overlap analysis.

Having determined the sampling points of the comparator (saopaulo), we set its coordinate reference system (CRS) to EPSG:4326, the same as that of the populated_anthromes data. The following Figure shows the sample points (IBGE raw data) established for overlap analysis and their spatial distribution. The choice of purple comes from the fact that this color is not used in any of the data sets worked on so far.

Figure: Sample points from raw IBGE data (localities_br)

Source: the authors (2023). Caption: figure showing the distribution (spatialization) of the points established for the overlap analysis with the points referring to populated anthromes.

After structuring the sampling points for overlap analysis, we retrieved the data from map_anthromes, that is, the mapped data of the populated anthromes. In order not to generate conflicts, the layer with the shapefile of the municipalities in the State of São Paulo was removed, leaving only the squares referring to the areas of populated anthromes.

Starting from the mapping, we built a simple features object (sf), using the dimensions x and y as the longitude and latitude parameters and setting EPSG:4326 as the associated CRS, the same as that of the sample/comparator set and of the mappings produced in the previous item (mapping of anthromes).

Having established the two sets of geographic information, populated anthromes (anthromes_data_sf) and sample (amostra_sf), we proceeded to compare them. Firstly, we verified whether both had the same CRS associated with their structures, since only coordinates expressed in the same CRS can be compared directly to determine whether they overlap during spatialization. We performed this check with an if/else construct. The script shows the structure of the check and presents the result through the sentence that we set as the answer to the if/else question.

Script: Comparison of the CRS of the sets anthromes_data_sf and amostra_sf.

## [1] "the CRS are the same"

Source: the authors (2023). Caption: script presenting the construction of the if()/else() function for comparison between the two sets in relation to the geographic referencing system (CRS).
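A check of this kind can be sketched with the sf package as follows. The two point sets are made up and the names set_a, set_b and crs_message are hypothetical; only the structure of the comparison is illustrated.

```r
library(sf)

# Toy point sets, both declared in EPSG:4326 (WGS 84).
set_a <- st_as_sf(data.frame(x = c(-46.6, -47.1), y = c(-23.5, -22.9)),
                  coords = c("x", "y"), crs = 4326)
set_b <- st_as_sf(data.frame(x = -46.6, y = -23.5),
                  coords = c("x", "y"), crs = 4326)

# Compare the coordinate reference systems and report the outcome.
if (st_crs(set_a) == st_crs(set_b)) {
  crs_message <- "the CRS are the same"
} else {
  crs_message <- "the CRS are different"
}
print(crs_message)
```

When the two systems differ, st_transform() can reproject one set to the CRS of the other before any overlap test.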

Having confirmed that the CRS of the two sets are the same, we proceed to the overlap analysis. To do this, using the st_join() function, we combine the two sets (anthromes_data_sf and amostra_sf) into a single simple features object (sf), which was named juncao_sp.

Starting from the joined set, we summarized the data using summarize(), counting the points grouped by coordinates (group_by() on LAT & LONG, latitude and longitude). The product of this code indicates how many points overlap, using the point coordinates (latitude and longitude) as a reference. The following script shows the organization of the function that identifies the number of points, followed by the number of overlapping points.

Script: Count of overlapping points between sets.

## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -53.05865 ymin: -25.21507 xmax: -44.19936 ymax: -19.87297
## Geodetic CRS:  WGS 84
## # A tibble: 1 × 3
##   `LAT & LONG` contagem_de_pontos                                       geometry
##   <lgl>                     <int>                               <MULTIPOINT [°]>
## 1 TRUE                       2143 ((-53.05865 -22.58118), (-53.0027 -22.52495),…

Source: the authors (2023). Caption: script describing the function for analyzing the number of overlapping points between the two data sets (anthromes_data_sf and saopaulo), followed by the tabulated result of the comparison.
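Because the output above groups all rows under a single logical value, a per-point match can be a useful complement. The sketch below is a base R alternative, not the st_join() pipeline described above: it counts reference points whose coordinate pair also occurs in the mapped set. The data, the helper coord_key() and the rounding to 5 decimal places are assumptions.

```r
# Reference points (stand-in for saopaulo) and mapped points
# (stand-in for the populated anthromes data); all values are made up.
reference <- data.frame(LONG = c(-46.6, -47.1, -48.3), LAT = c(-23.5, -22.9, -21.8))
mapped <- data.frame(longitude = c(-46.6, -47.1), latitude = c(-23.5, -22.9))

# Build a text key per coordinate pair so that matches can be tested with %in%.
coord_key <- function(lon, lat) paste(round(lon, 5), round(lat, 5))

reference$overlaps <- coord_key(reference$LONG, reference$LAT) %in%
  coord_key(mapped$longitude, mapped$latitude)

overlap_count <- sum(reference$overlaps)
print(reference)
```

The rows with overlaps equal to FALSE are the reference points that have no counterpart in the mapped set.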

According to the results generated by the function, we observed that the 2143 sampled points overlapped with the points coming from populated anthromes (populated_anthromes). After verifying the number of overlapping points, the literature suggests that the results be visually evaluated, in order to verify the accuracy of the data overlap.

Considering this, we organize the sets to visualize the overlapping of points. First, we join the set tablea_contagem_sp with the data from saopaulo, producing the set map_results_sp, which represents the aggregated data.

Next, using the ggplot() function, we structured the mapping to visualize the overlapping of points. Data from map_results_sp were plotted in dark green (darkgreen) and data from populated anthromes (anthromes_data_sf) in red (red). In order to facilitate the visualization of the overlapping points, we chose to increase the size of the points from map_results_sp to 2 and reduce those from anthromes_data_sf to 0.5, that is, the former were plotted considerably larger than the latter. Thus, the following script presents the code and, subsequently, the visualization of the overlapping of data from the two sets.

Script: Visualization of overlapping points between the two data sets.

Source: the authors (2023). Caption: script presenting the structure for mapping data compared by overlap analysis and generated mapping allowing the visualization of overlapping points in the territory of the State of São Paulo.

By mapping the overlapping data, we observed that the anthromes_data_sf points (derived from populated_anthromes, in red) overlap with the IBGE gold standard (raw data from saopaulo, mapped in dark green). The visualization of the overlapping points thus made it possible to verify that the spatialization of data referring to populated anthromes occurs appropriately, following the same geographic coordinates as the saopaulo data and distributed throughout the territory of São Paulo, as seen on the cities_shape shapefile layer. This visual confirmation provides a first indication in favor of validating the mapping.

2.7.2 Examination of data properties

In order to ensure data quality and accuracy in mapping the geospatial information of populated anthromes, we carried out examinations on the properties of overlapping data sets, i.e. the populated anthromes data (populated_anthromes, map_anthromes_data, anthromes_data_sf) and the gold standard based on IBGE data (saopaulo). As we highlighted previously, the data sets have the same geographic coordinate system (CRS) associated with their structures. Furthermore, they both have spatial dimensions of latitude and longitude, which allowed the insertion of points in the mapping.

Continuing the analysis, we verified the geographic limits (coordinates), in order to show that both sets represent the same territorial area, that is, that the two sets span the same latitude and longitude bounds in their geographic referencing structure. Using data from populated_anthromes and saopaulo, we performed the analysis with the range() function, which returns the minimum and maximum of the investigated parameters, in this case latitude/LAT and longitude/LONG (Table).

Table: Minimum and maximum latitude and longitude limits of the populated_anthromes and saopaulo sets.

## Maximum and minimum latitude of populated_anthromes: -25.21507 -19.87297

## Maximum and minimum latitude of São Paulo: -25.21507 -19.87297

## Maximum and minimum longitude of populated_anthromes: -53.05865 -44.19936

## Maximum and minimum longitude of saopaulo: -53.05865 -44.19936

Source: the authors (2023). Caption: table showing the minimum and maximum limits of latitude and longitude of the two sets of data analyzed using the range() function.
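The bounds comparison can be sketched as below. The extreme values are those reported in the table above; the intermediate values and the names mapped_lat, reference_lat and same_lat_bounds are made up for illustration.

```r
# Latitudes of two toy sets sharing the same extremes.
mapped_lat <- c(-25.21507, -22.67, -19.87297)
reference_lat <- c(-25.21507, -23.31, -21.58, -19.87297)

# range() returns c(minimum, maximum) for each vector.
print(range(mapped_lat))
print(range(reference_lat))

# all.equal() compares the two pairs of bounds with a numeric tolerance.
same_lat_bounds <- isTRUE(all.equal(range(mapped_lat), range(reference_lat)))
print(same_lat_bounds)
```

Equal bounds only show that the two sets cover the same extent; they do not show that every point of one set appears in the other.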

The information from the range() function confirms that the minimum and maximum limits, both for longitude and latitude, are the same for both sets. Continuing with the verification, we use the summary() function to analyze extreme statistical values associated with the two sets (Script). Just as we performed the analysis using the range() function, here we consider the latitude and longitude information of the two sets.

Script: Application of the summary() function to analyze extreme values.

summary(populated_anthromes$latitude)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -25.22  -23.31  -22.67  -22.43  -21.57  -19.87

summary(saopaulo$LAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -25.22  -23.31  -22.67  -22.43  -21.58  -19.87

summary(populated_anthromes$longitude)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -53.06  -49.48  -48.08  -48.29  -46.95  -44.20

summary(saopaulo$LONG)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -53.06  -49.48  -48.07  -48.29  -46.95  -44.20

Source: the authors (2023). Caption: code for statistical summary of populated_anthromes and saopaulo data, presenting the minimum, first quartile, median, mean, third quartile and maximum values.

We verified that the results obtained are consistent, with small differences in the third quartile of latitude (0.01) and the median of longitude (0.01). These differences are associated with the number of points described by the two sets, as populated_anthromes is made up of 2141 points, while saopaulo is made up of 2143.

## Simple feature collection with 2 features and 22 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -46.97105 ymin: -22.27151 xmax: -46.97078 ymax: -22.24241
## Geodetic CRS:  WGS 84
##      ID      CD_GEOCODI   TIPO CD_GEOCODB NM_BAIRRO  CD_GEOCODS NM_SUBDIST
## 1 17457 355730305000009 URBANO       <NA>      <NA> 35573030500       <NA>
## 2 17458 355730305000010 URBANO       <NA>      <NA> 35573030500       <NA>
##   CD_GEOCODD   NM_DISTRIT CD_GEOCODM         city   NM_MICRO  NM_MESO     state
## 1  355730305 ESTIVA GERBI    3557303 ESTIVA GERBI MOJI-MIRIM CAMPINAS SAO PAULO
## 2  355730305 ESTIVA GERBI    3557303 ESTIVA GERBI MOJI-MIRIM CAMPINAS SAO PAULO
##   CD_NIVEL CD_CATEGOR NM_CATEGOR             NM_LOCALID      LONG       LAT
## 1        6        001        AUI            RANCHO NOVO -46.97105 -22.24241
## 2        6        002        AUI    RECANTO DO ORIÇANGA -46.97078 -22.27151
##        ALT GMRotation                    geometry
## 1 669.9272          0 POINT (-46.97105 -22.24241)
## 2 624.0211          0 POINT (-46.97078 -22.27151)

Analyzing the two sets to identify the rows missing from populated_anthromes, we observed that the two extra rows in saopaulo refer to two isolated urban areas (AUI) in the municipality of Estiva Gerbi (microregion of Moji-Mirim). According to population data01, a set used to associate demographic density with populated areas (census sectors), AUI data are clustered and associated with a single point, which may describe more than one location. Therefore, the error (or spatial limitation) is due to this clustering of the data and is represented in the following Figure.

Figure: Mapped representation of missing points in populated_anthromes.

Source: the authors (2023). Caption: figure presenting the summary of the overlap analysis, where the two points present in saopaulo and missing from populated_anthromes are indicated in red.
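The identification of the missing points discussed above can be sketched in base R. This is only an illustrative sketch: the toy data frames, the helper `coord_key()` and the choice of matching rows by rounded coordinates are assumptions made for the example, not the procedure actually used in the analysis. Only the column names (LONG, LAT, longitude, latitude) follow the sets described in the text.

```r
# Toy stand-ins for the two sets (illustrative coordinates, not the real data)
saopaulo_toy <- data.frame(
  ID   = 1:4,
  LONG = c(-46.9, -47.1, -48.0, -49.5),
  LAT  = c(-22.2, -22.3, -23.0, -21.0)
)
populated_anthromes_toy <- data.frame(
  longitude = c(-46.9, -48.0),
  latitude  = c(-22.2, -23.0)
)

# Hypothetical helper: one text key per coordinate pair,
# rounded to 5 decimals to absorb floating-point noise
coord_key <- function(lon, lat) paste(round(lon, 5), round(lat, 5))

# Rows of the reference set whose coordinates do not occur in the other set
in_anthromes <- coord_key(saopaulo_toy$LONG, saopaulo_toy$LAT) %in%
  coord_key(populated_anthromes_toy$longitude, populated_anthromes_toy$latitude)
missing_rows <- saopaulo_toy[!in_anthromes, ]
print(missing_rows)
```

Matching on coordinates only works when both sets share the same coordinate reference system; otherwise a common identifier column would be a safer key.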

2.8 Mapping summary statistics

To deepen our statistical analyses of the mapping of São Paulo’s anthromes, taking data from the Brazilian Demographic Census (IBGE, 2013) as the gold standard, we moved on to summary statistics. To compute them, we established a statistical grid over the territory of São Paulo, using data from saopaulo (a simple features, sf, data set derived from br_locations_2010, our comparator or gold standard). This grid was built with 400 cells (20 by 20), taking the sphericity of the Earth into account, as presented below.

First, we converted the saopaulo set into a Spatial object using the function *as(saopaulo, "Spatial")*, giving rise to the set saopaulo_spatial. From this object, we established the minimum and maximum X and Y values (longitude and latitude, respectively).

Having established the minimum and maximum values of X and Y, we calculated the size of the cells in each direction (considering the 20 columns and 20 rows of the statistical grid). With this calculation we arrived at the values size_cell_x and size_cell_y, representing the width (longitude) and height (latitude) of each cell of the statistical grid.

It is worth noting that the calculations of the size of the cells of the statistical grid took the sphericity of the Earth into account. The formulas used here were derived from the Haversine formula, which is commonly used to calculate the distance between two points on a sphere. Its general form is:

\[ a = \sin^2\left(\frac{\Delta\text{lat}}{2}\right) + \cos(\text{lat}_1) \cdot \cos(\text{lat}_2) \cdot \sin^2\left(\frac{\Delta\text{lon}}{2}\right) \]

\[ c = 2 \cdot \text{atan2}\left(\sqrt{a}, \sqrt{1-a}\right) \]

\[ d = R \cdot c \]

where:

- $\Delta\text{lat}$ is the difference in latitude between the two points, in radians;
- $\Delta\text{lon}$ is the difference in longitude between the two points, in radians;
- $\text{lat}_1$ and $\text{lat}_2$ are the latitudes of the two points, in radians;
- $R$ is the radius of the sphere (for example, the average radius of the Earth);
- $d$ is the resulting distance between the two points, in the same unit as $R$.
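The formula above can be written as a short base R function. This is a minimal sketch rather than the code used in the analysis: the function name `haversine_km()` and the default radius of 6371 km (a commonly used mean Earth radius) are assumptions introduced for the example.

```r
# Haversine great-circle distance in kilometers.
# Inputs are in decimal degrees; earth_radius_km = 6371 is an assumed mean radius.
haversine_km <- function(lat1, lon1, lat2, lon2, earth_radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  # Term "a" of the formula
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # Term "c": central angle between the two points, in radians
  central_angle <- 2 * atan2(sqrt(a), sqrt(1 - a))
  # Term "d": distance along the sphere
  earth_radius_km * central_angle
}

# Usage: one degree of longitude measured along the equator
one_degree_equator <- haversine_km(0, 0, 0, 1)
print(one_degree_equator)
```

Because the cosine terms shrink away from the equator, the same difference in longitude corresponds to a shorter distance at the latitudes of São Paulo than at the equator.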

In this way, we were able to compute the distances between two points of longitude and of latitude, arriving at the width and height of the grid cells and, from them, at the approximate area of each cell. The values obtained from these calculations are presented below.

Table: Cell dimensions in a 20x20 statistical grid (400 cells).

## Cell size (quadrant) in kilometers (km):

## width (cell_size_x_km): 45.57963 km

## height (cell_size_y_km): 11.34617 km

## Area of each quadrant of the grid in km²: 517.1542 km²

## Average dimension of the sides of the grid square in km: 22.74102 km

Source: the authors. Caption: table presenting the values obtained when dimensioning the height, width and area of the cells of the 20x20 statistical grid (400 cells).
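The relationship between the values in the table can be checked with a few lines of base R. The width and height are copied from the table; reading the "average dimension" as the side of a square with the same area (the square root of the area) is our assumption, made because it reproduces the reported value to within rounding.

```r
# Cell dimensions as reported in the table (km)
cell_size_x_km <- 45.57963
cell_size_y_km <- 11.34617

# Area of one cell: width times height
cell_area_km2 <- cell_size_x_km * cell_size_y_km

# Assumption: "average dimension" = side of a square with the same area
mean_side_km <- sqrt(cell_area_km2)

cat("Area of each cell (km2):", cell_area_km2, "\n")
cat("Side of the equivalent square (km):", mean_side_km, "\n")
```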

With these dimensions, we structured the statistical grid of 400 cells using the raster() function. The resulting grid was named quadrant_grid and was used in the subsequent calculations. Additionally, it was necessary to extract the XY ordered pairs of the statistical grid so that the cells could be visualized in the plot, which was done using the as.data.frame() function.

After establishing the statistical grid with 400 cells, we counted the points in each cell. The objective was to compare the distribution of the anthromes mapping points with that of the gold standard (IBGE data). This count was carried out in two stages, the first for data from saopaulo and the second for populated_anthromes.

Using the rasterize() function, we counted the points per quadrant in saopaulo. The count allowed the construction of a data frame with 2 columns and 400 rows, in which the columns hold the cell identifier and the number of points in that cell. Thus, each row refers to one of the 400 cells of the statistical grid.
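The counting step can be illustrated with a base R analogue. The sketch below does not use raster() or rasterize(); it is a simplified stand-in that assigns each point to a cell of a regular grid and tabulates the result. The function name `count_points_grid()`, its arguments and the toy coordinates are all assumptions made for the example, and every point is assumed to lie inside the bounding box.

```r
# Hypothetical helper: count points per cell of an nx-by-ny grid over a bounding box.
# Cells are numbered row by row, starting at the top-left (north-west) corner.
count_points_grid <- function(lon, lat, xmin, xmax, ymin, ymax, nx = 20, ny = 20) {
  # Column index, west to east; points on the eastern edge fall in the last column
  col <- pmin(floor((lon - xmin) / (xmax - xmin) * nx) + 1, nx)
  # Row index, north to south; points on the southern edge fall in the last row
  row <- pmin(floor((ymax - lat) / (ymax - ymin) * ny) + 1, ny)
  cell <- (row - 1) * nx + col
  # One row per cell; empty cells get 0 here
  data.frame(cell = seq_len(nx * ny),
             n_points = tabulate(cell, nbins = nx * ny))
}

# Usage with a toy 2x2 grid over the unit square
toy_counts <- count_points_grid(
  lon = c(0.1, 0.1, 0.9), lat = c(0.9, 0.9, 0.1),
  xmin = 0, xmax = 1, ymin = 0, ymax = 1, nx = 2, ny = 2
)
print(toy_counts)
```

Unlike this sketch, a raster-based count may leave empty cells as NA rather than 0, so the two conventions need to be reconciled before the counts are compared.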

After counting the saopaulo points, we performed a similar procedure with the data from populated_anthromes. First, this set was converted into a spatial (sf) object using the st_as_sf() function, producing the populated_anthromes_spatial object. Subsequently, the rasterize() function counted the points of the populated anthromes in the statistical grid, generating the count_by_quadrant_anthromes data frame. Again, the cell identifier and the number of points per cell were isolated in a set referring to the populated anthromes (anthromes_countpoints). With this data, we were able to map the distribution of the anthromes points on the statistical grid.

The following figure illustrates the statistical grid with the point count for saopaulo (gold standard) and for populated_anthromes. Comparing the two, we observed that some of the gold standard points are counted in different cells in the populated anthromes. Notably, this difference is associated with the coordinate reference systems (EPSG codes) of the two sets. As mentioned previously, the CRS differed between the two sets (gold standard and populated anthromes) and adjustments were made throughout the analyses. In a few cases this generated small distortions in the distribution of points, as we will demonstrate through other metrics of the mapping.

Figure: Statistical Grids with Point Counts of the Gold Standard and Populated anthromes of the State of São Paulo.

Source: the authors (2023). Caption: figure representing the statistical grid with the point count of (a) saopaulo (gold standard) and (b) populated_anthromes, both referring to the State of São Paulo. The colored cells follow a color gradient according to the number of points they contain. The gray cells contain no points, that is, they lie outside the perimeter of the State of São Paulo and, consequently, outside the populated anthromes of this federative unit.

2.8.1 Structuring the Confusion Matrix for Statistical Analysis

First, we combined the two data sets associated with point counting (saopaulo_countpoints and anthromes_countpoints). The purpose of combining the two sets was to structure the confusion matrix for statistical analyses. In it, we aligned the numbers of points in each cell of the statistical grid (20x20) of the two sets, in order to establish the following relationships:

True Positives (TPs): quadrants of the statistical grid where the number of points is equal and different from 0 in saopaulo_countpoints and anthromes_countpoints;
True Negatives (TNs): quadrants of the statistical grid where there are no points in either saopaulo_countpoints or anthromes_countpoints;
False Negatives (FNs): quadrants of the statistical grid where the number of points in saopaulo_countpoints is greater than the number of points in anthromes_countpoints;
False Positives (FPs): quadrants of the statistical grid where the number of points in saopaulo_countpoints is smaller than the number of points in anthromes_countpoints.

To structure these relationships, we used the ifelse() function to determine whether the values in saopaulo_countpoints were equal and different from 0 (TP), both equal to 0 (TN), greater (FN) or smaller (FP) when compared to anthromes_countpoints.

If the number of points was equal in both sets and different from 0, the TP column received the value 1 and the TN, FP and FN columns received the value 0. If the number of points was equal to 0 in both sets, the TN column received the value 1 and the TP, FN and FP columns the value 0. If the number of points in the grid quadrant was greater in saopaulo_countpoints than in anthromes_countpoints, the FN column received the value 1 and the TP, TN and FP columns the value 0. Finally, if the number of points was smaller in saopaulo_countpoints than in anthromes_countpoints, the FP column received the value 1 and the TP, TN and FN columns the value 0.
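The rules above can be sketched with ifelse() in base R. The column names follow the table in this section, but the five rows of toy data and the short aliases `sp` and `an` are assumptions introduced for illustration. Missing counts are set to 0 first, since a comparison involving NA would itself return NA.

```r
# Toy combined set (illustrative values only)
combined_set <- data.frame(
  Celula                = 1:5,
  saopaulo_countpoints  = c(NA, 0, 10, 10, 5),
  anthromes_countpoints = c(NA, 0, 10, 8, 6)
)

# Cells without data count as 0 points
combined_set[is.na(combined_set)] <- 0

sp <- combined_set$saopaulo_countpoints   # gold standard counts
an <- combined_set$anthromes_countpoints  # populated anthromes counts

combined_set$TP <- ifelse(sp == an & sp != 0, 1, 0)  # equal and non-zero
combined_set$TN <- ifelse(sp == 0 & an == 0, 1, 0)   # empty in both sets
combined_set$FN <- ifelse(sp > an, 1, 0)             # fewer points in anthromes
combined_set$FP <- ifelse(sp < an, 1, 0)             # more points in anthromes

# Totals per category
category_totals <- colSums(combined_set[, c("TP", "TN", "FN", "FP")])
print(category_totals)
```

Since the four conditions are mutually exclusive and cover every case, each row receives exactly one value of 1.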

It should be noted that the statistical grid included areas that did not contain data from São Paulo (areas in gray in the Figure above), that is, rows where “NA” appeared. These values were replaced by 0 in the combined set using the expression combined_set[is.na(combined_set)] <- 0, in order to allow statistical calculations based on the confusion matrix. The following table illustrates the confusion matrix structured by this operation.

Table: Confusion Matrix

##     Celula saopaulo_countpoints anthromes_countpoints TP TN FN FP
## 1        1                    0                     0  0  1  0  0
## 2        2                    0                     0  0  1  0  0
## 3        3                    0                     0  0  1  0  0
## 4        4                    0                     0  0  1  0  0
## 5        5                    1                     1  1  0  0  0
## 6        6                   10                    10  1  0  0  0
## 7        7                   15                    15  1  0  0  0
## 8        8                    3                     3  1  0  0  0
## 9        9                    1                     1  1  0  0  0
## 10      10                    0                     0  0  1  0  0
## 11      11                    0                     0  0  1  0  0
## 12      12                    4                     4  1  0  0  0
## 13      13                    6                     6  1  0  0  0
## 14      14                    0                     0  0  1  0  0
## 15      15                    0                     0  0  1  0  0
## 16      16                    0                     0  0  1  0  0
## 17      17                    0                     0  0  1  0  0
## 18      18                    0                     0  0  1  0  0
## 19      19                    0                     0  0  1  0  0
## 20      20                    0                     0  0  1  0  0
## 21      21                    0                     0  0  1  0  0
## 22      22                    0                     0  0  1  0  0
## 23      23                    0                     0  0  1  0  0
## 24      24                    0                     0  0  1  0  0
## 25      25                   28                    28  1  0  0  0
## 26      26                    9                     9  1  0  0  0
## 27      27                    8                     8  1  0  0  0
## 28      28                    5                     5  1  0  0  0
## 29      29                    6                     6  1  0  0  0
## 30      30                    2                     2  1  0  0  0
## 31      31                    2                     2  1  0  0  0
## 32      32                   10                    10  1  0  0  0
## 33      33                    5                     5  1  0  0  0
## 34      34                    1                     1  1  0  0  0
## 35      35                    0                     0  0  1  0  0
## 36      36                    0                     0  0  1  0  0
## 37      37                    0                     0  0  1  0  0
## 38      38                    0                     0  0  1  0  0
## 39      39                    0                     0  0  1  0  0
## 40      40                    0                     0  0  1  0  0
## 41      41                    0                     0  0  1  0  0
## 42      42                    0                     0  0  1  0  0
## 43      43                    0                     0  0  1  0  0
## 44      44                    2                     2  1  0  0  0
## 45      45                    3                     3  1  0  0  0
## 46      46                    9                     9  1  0  0  0
## 47      47                    8                     8  1  0  0  0
## 48      48                   11                    11  1  0  0  0
## 49      49                   10                    10  1  0  0  0
## 50      50                    8                     8  1  0  0  0
## 51      51                    4                     4  1  0  0  0
## 52      52                    5                     5  1  0  0  0
## 53      53                    9                     9  1  0  0  0
## 54      54                    3                     3  1  0  0  0
## 55      55                    0                     0  0  1  0  0
## 56      56                    0                     0  0  1  0  0
## 57      57                    0                     0  0  1  0  0
## 58      58                    0                     0  0  1  0  0
## 59      59                    0                     0  0  1  0  0
## 60      60                    0                     0  0  1  0  0
## 61      61                    0                     0  0  1  0  0
## 62      62                    0                     0  0  1  0  0
## 63      63                    0                     0  0  1  0  0
## 64      64                    6                     6  1  0  0  0
## 65      65                    3                     3  1  0  0  0
## 66      66                    4                     4  1  0  0  0
## 67      67                   11                    11  1  0  0  0
## 68      68                   14                    14  1  0  0  0
## 69      69                   17                    17  1  0  0  0
## 70      70                    8                     8  1  0  0  0
## 71      71                    7                     7  1  0  0  0
## 72      72                    5                     5  1  0  0  0
## 73      73                    1                     1  1  0  0  0
## 74      74                    0                     0  0  1  0  0
## 75      75                    0                     0  0  1  0  0
## 76      76                    0                     0  0  1  0  0
## 77      77                    0                     0  0  1  0  0
## 78      78                    0                     0  0  1  0  0
## 79      79                    0                     0  0  1  0  0
## 80      80                    0                     0  0  1  0  0
## 81      81                    0                     0  0  1  0  0
## 82      82                    0                     0  0  1  0  0
## 83      83                    2                     2  1  0  0  0
## 84      84                    3                     3  1  0  0  0
## 85      85                    6                     6  1  0  0  0
## 86      86                   22                    22  1  0  0  0
## 87      87                   25                    25  1  0  0  0
## 88      88                    7                     7  1  0  0  0
## 89      89                   11                    11  1  0  0  0
## 90      90                   13                    13  1  0  0  0
## 91      91                   10                    10  1  0  0  0
## 92      92                   17                    17  1  0  0  0
## 93      93                    5                     5  1  0  0  0
## 94      94                    1                     1  1  0  0  0
## 95      95                    0                     0  0  1  0  0
## 96      96                    0                     0  0  1  0  0
## 97      97                    0                     0  0  1  0  0
## 98      98                    0                     0  0  1  0  0
## 99      99                    0                     0  0  1  0  0
## 100    100                    0                     0  0  1  0  0
## 101    101                    0                     0  0  1  0  0
## 102    102                    0                     0  0  1  0  0
## 103    103                    6                     6  1  0  0  0
## 104    104                    8                     8  1  0  0  0
## 105    105                    3                     3  1  0  0  0
## 106    106                   11                    11  1  0  0  0
## 107    107                   13                    13  1  0  0  0
## 108    108                   20                    20  1  0  0  0
## 109    109                   13                    13  1  0  0  0
## 110    110                   13                    13  1  0  0  0
## 111    111                    9                     9  1  0  0  0
## 112    112                    8                     8  1  0  0  0
## 113    113                    7                     7  1  0  0  0
## 114    114                    3                     3  1  0  0  0
## 115    115                    3                     3  1  0  0  0
## 116    116                    0                     0  0  1  0  0
## 117    117                    0                     0  0  1  0  0
## 118    118                    0                     0  0  1  0  0
## 119    119                    0                     0  0  1  0  0
## 120    120                    0                     0  0  1  0  0
## 121    121                    0                     0  0  1  0  0
## 122    122                    0                     0  0  1  0  0
## 123    123                    4                     4  1  0  0  0
## 124    124                    9                     9  1  0  0  0
## 125    125                   14                    14  1  0  0  0
## 126    126                    9                     9  1  0  0  0
## 127    127                    7                     7  1  0  0  0
## 128    128                    4                     4  1  0  0  0
## 129    129                   23                    23  1  0  0  0
## 130    130                    8                     8  1  0  0  0
## 131    131                    7                     7  1  0  0  0
## 132    132                    5                     5  1  0  0  0
## 133    133                    6                     6  1  0  0  0
## 134    134                    5                     5  1  0  0  0
## 135    135                    7                     7  1  0  0  0
## 136    136                    0                     0  0  1  0  0
## 137    137                    0                     0  0  1  0  0
## 138    138                    0                     0  0  1  0  0
## 139    139                    0                     0  0  1  0  0
## 140    140                    0                     0  0  1  0  0
## 141    141                    0                     0  0  1  0  0
## 142    142                    0                     0  0  1  0  0
## 143    143                    4                     4  1  0  0  0
## 144    144                   10                    10  1  0  0  0
## 145    145                   15                    15  1  0  0  0
## 146    146                   17                    17  1  0  0  0
## 147    147                    7                     7  1  0  0  0
## 148    148                    6                     6  1  0  0  0
## 149    149                    8                     8  1  0  0  0
## 150    150                   10                    10  1  0  0  0
## 151    151                    5                     5  1  0  0  0
## 152    152                    5                     5  1  0  0  0
## 153    153                    8                     8  1  0  0  0
## 154    154                    8                     8  1  0  0  0
## 155    155                    6                     6  1  0  0  0
## 156    156                    0                     0  0  1  0  0
## 157    157                    0                     0  0  1  0  0
## 158    158                    0                     0  0  1  0  0
## 159    159                    0                     0  0  1  0  0
## 160    160                    0                     0  0  1  0  0
## 161    161                    0                     0  0  1  0  0
## 162    162                    0                     0  0  1  0  0
## 163    163                    2                     2  1  0  0  0
## 164    164                   15                    15  1  0  0  0
## 165    165                    5                     5  1  0  0  0
## 166    166                    5                     5  1  0  0  0
## 167    167                   10                    10  1  0  0  0
## 168    168                   18                    18  1  0  0  0
## 169    169                   10                    10  1  0  0  0
## 170    170                   20                    20  1  0  0  0
## 171    171                   12                    12  1  0  0  0
## 172    172                    8                     8  1  0  0  0
## 173    173                    8                     8  1  0  0  0
## 174    174                   10                     8  0  0  1  0
## 175    175                    3                     3  1  0  0  0
## 176    176                    0                     0  0  1  0  0
## 177    177                    0                     0  0  1  0  0
## 178    178                    0                     0  0  1  0  0
## 179    179                    0                     0  0  1  0  0
## 180    180                    0                     0  0  1  0  0
## 181    181                    2                     2  1  0  0  0
## 182    182                    1                     1  1  0  0  0
## 183    183                    4                     4  1  0  0  0
## 184    184                    5                     5  1  0  0  0
## 185    185                    5                     5  1  0  0  0
## 186    186                    3                     3  1  0  0  0
## 187    187                    6                     6  1  0  0  0
## 188    188                    9                     9  1  0  0  0
## 189    189                   13                    13  1  0  0  0
## 190    190                   14                    14  1  0  0  0
## 191    191                   13                    13  1  0  0  0
## 192    192                    7                     7  1  0  0  0
## 193    193                   13                    13  1  0  0  0
## 194    194                   20                    20  1  0  0  0
## 195    195                    6                     6  1  0  0  0
## 196    196                    0                     0  0  1  0  0
## 197    197                    0                     0  0  1  0  0
## 198    198                    0                     0  0  1  0  0
## 199    199                    4                     4  1  0  0  0
## 200    200                    0                     0  0  1  0  0
## 201    201                    2                     2  1  0  0  0
## 202    202                    2                     2  1  0  0  0
## 203    203                    0                     0  0  1  0  0
## 204    204                    1                     1  1  0  0  0
## 205    205                    4                     4  1  0  0  0
## 206    206                   10                    10  1  0  0  0
## 207    207                    6                     6  1  0  0  0
## 208    208                    3                     3  1  0  0  0
## 209    209                    6                     6  1  0  0  0
## 210    210                   10                    10  1  0  0  0
## 211    211                   14                    14  1  0  0  0
## 212    212                   15                    15  1  0  0  0
## 213    213                   47                    47  1  0  0  0
## 214    214                   48                    48  1  0  0  0
## 215    215                   29                    29  1  0  0  0
## 216    216                    0                     0  0  1  0  0
## 217    217                    4                     4  1  0  0  0
## 218    218                    5                     5  1  0  0  0
## 219    219                   10                    10  1  0  0  0
## 220    220                    5                     6  0  0  0  1
## 221    221                    0                     0  0  1  0  0
## 222    222                    0                     0  0  1  0  0
## 223    223                    0                     0  0  1  0  0
## 224    224                    0                     0  0  1  0  0
## 225    225                    0                     0  0  1  0  0
## 226    226                    6                     6  1  0  0  0
## 227    227                    3                     3  1  0  0  0
## 228    228                   12                    12  1  0  0  0
## 229    229                   17                    17  1  0  0  0
## 230    230                    5                     5  1  0  0  0
## 231    231                    7                     7  1  0  0  0
## 232    232                   12                    12  1  0  0  0
## 233    233                   14                    14  1  0  0  0
## 234    234                   33                    33  1  0  0  0
## 235    235                   25                    25  1  0  0  0
## 236    236                    3                     3  1  0  0  0
## 237    237                   20                    20  1  0  0  0
## 238    238                    8                     8  1  0  0  0
## 239    239                    6                     6  1  0  0  0
## 240    240                    0                     0  0  1  0  0
## 241    241                    0                     0  0  1  0  0
## 242    242                    0                     0  0  1  0  0
## 243    243                    0                     0  0  1  0  0
## 244    244                    0                     0  0  1  0  0
## 245    245                    0                     0  0  1  0  0
## 246    246                    0                     0  0  1  0  0
## 247    247                    0                     0  0  1  0  0
## 248    248                    1                     1  1  0  0  0
## 249    249                   21                    21  1  0  0  0
## 250    250                   39                    39  1  0  0  0
## 251    251                   15                    15  1  0  0  0
## 252    252                   27                    27  1  0  0  0
## 253    253                   33                    33  1  0  0  0
## 254    254                   44                    44  1  0  0  0
## 255    255                   97                    97  1  0  0  0
## 256    256                   18                    18  1  0  0  0
## 257    257                   12                    12  1  0  0  0
## 258    258                    3                     3  1  0  0  0
## 259    259                    0                     0  0  1  0  0
## 260    260                    0                     0  0  1  0  0
## 261    261                    0                     0  0  1  0  0
## 262    262                    0                     0  0  1  0  0
## 263    263                    0                     0  0  1  0  0
## 264    264                    0                     0  0  1  0  0
## 265    265                    0                     0  0  1  0  0
## 266    266                    0                     0  0  1  0  0
## 267    267                    0                     0  0  1  0  0
## 268    268                    0                     0  0  1  0  0
## 269    269                   13                    13  1  0  0  0
## 270    270                   11                    11  1  0  0  0
## 271    271                    7                     7  1  0  0  0
## 272    272                   14                    14  1  0  0  0
## 273    273                   33                    33  1  0  0  0
## 274    274                   41                    41  1  0  0  0
## 275    275                   42                    42  1  0  0  0
## 276    276                   46                    46  1  0  0  0
## 277    277                    8                     8  1  0  0  0
## 278    278                    3                     3  1  0  0  0
## 279    279                    2                     2  1  0  0  0
## 280    280                    0                     0  0  1  0  0
## 281    281                    0                     0  0  1  0  0
## 282    282                    0                     0  0  1  0  0
## 283    283                    0                     0  0  1  0  0
## 284    284                    0                     0  0  1  0  0
## 285    285                    0                     0  0  1  0  0
## 286    286                    0                     0  0  1  0  0
## 287    287                    0                     0  0  1  0  0
## 288    288                    1                     1  1  0  0  0
## 289    289                    7                     7  1  0  0  0
## 290    290                    1                     1  1  0  0  0
## 291    291                    4                     4  1  0  0  0
## 292    292                   10                    10  1  0  0  0
## 293    293                   14                    14  1  0  0  0
## 294    294                    5                     5  1  0  0  0
## 295    295                   69                    69  1  0  0  0
## 296    296                   14                    14  1  0  0  0
## 297    297                    3                     3  1  0  0  0
## 298    298                    6                     6  1  0  0  0
## 299    299                    0                     0  0  1  0  0
## 300    300                    0                     0  0  1  0  0
## 301    301                    0                     0  0  1  0  0
## 302    302                    0                     0  0  1  0  0
## 303    303                    0                     0  0  1  0  0
## 304    304                    0                     0  0  1  0  0
## 305    305                    0                     0  0  1  0  0
## 306    306                    0                     0  0  1  0  0
## 307    307                    0                     0  0  1  0  0
## 308    308                    0                     0  0  1  0  0
## 309    309                    4                     4  1  0  0  0
## 310    310                    4                     4  1  0  0  0
## 311    311                    4                     4  1  0  0  0
## 312    312                    1                     1  1  0  0  0
## 313    313                    2                     2  1  0  0  0
## 314    314                    2                     2  1  0  0  0
## 315    315                   14                    14  1  0  0  0
## 316    316                   14                    14  1  0  0  0
## 317    317                    0                     0  0  1  0  0
## 318    318                    1                     1  1  0  0  0
## 319    319                    0                     0  0  1  0  0
## 320    320                    0                     0  0  1  0  0
## 321    321                    0                     0  0  1  0  0
## 322    322                    0                     0  0  1  0  0
## 323    323                    0                     0  0  1  0  0
## 324    324                    0                     0  0  1  0  0
## 325    325                    0                     0  0  1  0  0
## 326    326                    0                     0  0  1  0  0
## 327    327                    0                     0  0  1  0  0
## 328    328                    0                     0  0  1  0  0
## 329    329                    1                     1  1  0  0  0
## 330    330                    6                     6  1  0  0  0
## 331    331                    1                     1  1  0  0  0
## 332    332                    3                     3  1  0  0  0
## 333    333                   13                    13  1  0  0  0
## 334    334                    8                     8  1  0  0  0
## 335    335                    1                     1  1  0  0  0
## 336    336                    0                     0  0  1  0  0
## 337    337                    0                     0  0  1  0  0
## 338    338                    0                     0  0  1  0  0
## 339    339                    0                     0  0  1  0  0
## 340    340                    0                     0  0  1  0  0
## 341    341                    0                     0  0  1  0  0
## 342    342                    0                     0  0  1  0  0
## 343    343                    0                     0  0  1  0  0
## 344    344                    0                     0  0  1  0  0
## 345    345                    0                     0  0  1  0  0
## 346    346                    0                     0  0  1  0  0
## 347    347                    0                     0  0  1  0  0
## 348    348                    0                     0  0  1  0  0
## 349    349                    2                     2  1  0  0  0
## 350    350                   11                    11  1  0  0  0
## 351    351                    4                     4  1  0  0  0
## 352    352                    3                     3  1  0  0  0
## 353    353                    1                     1  1  0  0  0
## 354    354                    1                     1  1  0  0  0
## 355    355                    0                     0  0  1  0  0
## 356    356                    0                     0  0  1  0  0
## 357    357                    0                     0  0  1  0  0
## 358    358                    0                     0  0  1  0  0
## 359    359                    0                     0  0  1  0  0
## 360    360                    0                     0  0  1  0  0
## 361    361                    0                     0  0  1  0  0
## 362    362                    0                     0  0  1  0  0
## 363    363                    0                     0  0  1  0  0
## 364    364                    0                     0  0  1  0  0
## 365    365                    0                     0  0  1  0  0
## 366    366                    0                     0  0  1  0  0
## 367    367                    0                     0  0  1  0  0
## 368    368                    0                     0  0  1  0  0
## 369    369                    0                     0  0  1  0  0
## 370    370                    0                     0  0  1  0  0
## 371    371                    2                     2  1  0  0  0
## 372    372                    8                     8  1  0  0  0
## 373    373                    2                     2  1  0  0  0
## 374    374                    0                     0  0  1  0  0
## 375    375                    0                     0  0  1  0  0
## 376    376                    0                     0  0  1  0  0
## 377    377                    0                     0  0  1  0  0
## 378    378                    0                     0  0  1  0  0
## 379    379                    0                     0  0  1  0  0
## 380    380                    0                     0  0  1  0  0
## 381    381                    0                     0  0  1  0  0
## 382    382                    0                     0  0  1  0  0
## 383    383                    0                     0  0  1  0  0
## 384    384                    0                     0  0  1  0  0
## 385    385                    0                     0  0  1  0  0
## 386    386                    0                     0  0  1  0  0
## 387    387                    0                     0  0  1  0  0
## 388    388                    0                     0  0  1  0  0
## 389    389                    0                     0  0  1  0  0
## 390    390                    0                     0  0  1  0  0
## 391    391                    0                     0  0  1  0  0
## 392    392                    1                     2  0  0  0  1
## 393    393                    0                     0  0  1  0  0
## 394    394                    0                     0  0  1  0  0
## 395    395                    0                     0  0  1  0  0
## 396    396                    0                     0  0  1  0  0
## 397    397                    0                     0  0  1  0  0
## 398    398                    0                     0  0  1  0  0
## 399    399                    0                     0  0  1  0  0
## 400    400                    0                     0  0  1  0  0

Source: the authors (2023). Legend: Confusion matrix structured from the cell-by-cell alignment of saopaulo_countpoints and anthromes_countpoints. The table displays the 400 rows of the set (statistical grid cells).

2.8.2 Mapping Sensitivity

After structuring the confusion matrix, we analyzed the sensitivity of the mapping on a 20x20 statistical grid for the State of São Paulo. Sensitivity evaluates the model’s ability to identify true positives (TPs); that is, it indicates whether the mapping of anthromes in São Paulo efficiently identifies the points present in the territory of São Paulo (gold standard). The metric was estimated using the formula:

\[\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}\]

where TP represents true positives and FN represents false negatives.

# Calculating sensitivity
sensitivity <- sum(combined_set$TP) / (sum(combined_set$TP) + sum(combined_set$FN))

# Displaying the result
sensitivity

## [1] 0.9951691

Through the sensitivity calculation, we obtained the value of 0.9951691, that is, approximately 99.52% of the gold standard points are captured within the populated anthromes points. Using this metric, we confirm that the model used in mapping populated anthromes in the State of São Paulo is capable of identifying areas similarly represented by the gold standard, consequently pointing to the quality of the mapping and the sensitivity of the method.
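As an illustration of how the per-cell indicator columns aggregate into this metric, the sketch below applies the same formula to a small invented data frame. The object toy_set and its values are hypothetical; only the column names TP, TN, FP and FN follow combined_set, and we assume that each cell carries exactly one indicator equal to 1.

```r
# Hypothetical toy stand-in for combined_set: one row per grid cell,
# with exactly one of the four indicator columns set to 1 (assumption).
toy_set <- data.frame(
  TP = c(1, 1, 1, 0, 0, 0),
  TN = c(0, 0, 0, 1, 1, 0),
  FP = c(0, 0, 0, 0, 0, 0),
  FN = c(0, 0, 0, 0, 0, 1)
)

# Summing each indicator column gives the number of cells per outcome.
counts <- colSums(toy_set)

# Sensitivity = TP / (TP + FN), here 3 / (3 + 1).
toy_sensitivity <- counts[["TP"]] / (counts[["TP"]] + counts[["FN"]])
toy_sensitivity
```

In this toy case, three of the four cells that should be detected are detected, so the sensitivity is 0.75.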

2.8.3 Mapping Specificity

Continuing the statistical analyses of the mapping of populated anthromes in the State of São Paulo, we move on to the specificity of the mapping. According to the literature, this metric refers to the model’s ability to identify mapped points that are not part of the comparison standard (gold standard). From this perspective, the specificity analysis aimed to determine whether the model used in the mapping (populated_anthromes) can indicate which points do not fall in the same cell of the statistical grid as the gold standard (saopaulo).

To this end, we returned to combined_set, where we verified the proportion of cells identified as True Negatives (TNs) relative to the sum of True Negatives and False Positives (FPs). In other words, we analyzed the proportion between the quadrants in which the number of gold standard points was greater and the quadrants where the number of anthromes points was greater (FPs). Specificity is estimated using the formula:

\[\text{Specificity} = \frac{TN}{TN + FP}\]

where TN represents the number of quadrants identified as True Negatives and FP the number of False Positives.

# Calculating Specificity
specificity <- sum(combined_set$TN) / (sum(combined_set$TN) + sum(combined_set$FP))

# Displaying the result
print(specificity)

## [1] 0.9896373

The result obtained for the mapping specificity metric was 0.9896373; that is, in the quadrants of the statistical grid where the gold standard plots no points, the model used for mapping anthromes correctly classifies the absence of points in approximately 98.96% of cases. This value also indicates that there are quadrants where the points mapped in anthromes_countpoints exceed the number of points in saopaulo_countpoints. This refers back to the overlap analysis carried out previously, where we showed that some points were displaced during the spatial distribution (latitude and longitude of the points on the map) and, once the statistical grid was established, fell into cells different from those of the gold standard.

Therefore, despite the limitations just discussed, we observed that this metric also contributes to inferring the quality of the model used in mapping, demonstrating its suitability for the intended use.
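The cells responsible for these divergences can be inspected directly by filtering on the indicator columns. The sketch below is a minimal illustration on invented data: toy_set, cell_id, gold_count and anthromes_count are hypothetical names, and only TP, TN, FP and FN follow the columns of combined_set.

```r
# Hypothetical toy stand-in for combined_set; the identifier and count
# column names are assumptions made for this sketch.
toy_set <- data.frame(
  cell_id         = 1:5,
  gold_count      = c(0, 3, 1, 0, 2),
  anthromes_count = c(0, 3, 2, 1, 2),
  TP = c(0, 1, 0, 0, 1),
  TN = c(1, 0, 0, 0, 0),
  FP = c(0, 0, 1, 1, 0),
  FN = c(0, 0, 0, 0, 0)
)

# Keep only the cells flagged as FP or FN, i.e. where the two sets disagree.
disagreements <- toy_set[toy_set$FP == 1 | toy_set$FN == 1, ]
disagreements

# Difference in point counts for each disagreeing cell.
count_difference <- disagreements$anthromes_count - disagreements$gold_count
count_difference
```

Applied to the real data, a filter of this kind would return only the few cells in which the two sets differ, which makes it easier to relate them to the displaced points discussed in the overlap analysis.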

2.8.4 Global Mapping Accuracy and Error

Advancing in the statistical analyses, we turn to the global accuracy of the mapping. According to the literature, this metric identifies the proportion of cells that were correctly classified by the anthromes mapping when compared to the gold standard. In other words, it compares the number of correctly classified cells, True Positives (TPs) plus True Negatives (TNs), to the total number of cells in the grid. Global accuracy is calculated using the formula:

\[\text{Global Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

where the sum of TPs and TNs, the grid cells in which the anthromes mapping and the gold standard agree, is divided by the total number of cells in the statistical grid, that is, by the sum of the cells in agreement (TPs and TNs) and in disagreement (FPs and FNs) between the sets.

# Calculating Global Accuracy
global_accuracy <- (sum(combined_set$TP) + sum(combined_set$TN)) / (sum(combined_set$TP) + sum(combined_set$TN) + sum(combined_set$FP) + sum(combined_set$FN))

# Displaying the result
print(global_accuracy)

## [1] 0.9925

Based on the estimate of global accuracy, we obtained the result of 0.9925; that is, 99.25% of the grid cells mapped for anthromes agree with the gold standard in the State of São Paulo. This proportion demonstrates a high correspondence rate between the mapped data, pointing to the quality of the product generated for the anthromes in the Federation Unit.

Having established this, we proceed to calculate the global error, which, according to the literature, indicates the proportion of areas classified incorrectly by the model. In other words, this metric compares the sum of False Positives (FPs) and False Negatives (FNs) to the total number of cells in the statistical grid. The global error is given by the formula:

\[\text{Global Error} = \frac{FP + FN}{TP + TN + FP + FN}\]

where the sum of FPs and FNs, the cells with a divergent number of points between anthromes and the gold standard, is divided by the total number of cells in the statistical grid.

# Calculating the Global Error
global_error <- (sum(combined_set$FP) + sum(combined_set$FN)) / (sum(combined_set$TP) + sum(combined_set$TN) + sum(combined_set$FP) + sum(combined_set$FN))

# Displaying the result
print(global_error)

## [1] 0.0075

Through calculations of the global error we obtained an estimate of 0.0075, that is, only 0.75% of the areas mapped in populated anthromes were classified incorrectly when compared to the IBGE gold standard. This value indicates the model’s low error rate and points to the accuracy of the mapping of anthromes, reinforcing the notes on the quality of the mapping.
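As a consistency check, global accuracy and global error share the same denominator and are therefore complementary. The count of cells below assumes that each of the 400 grid cells falls into exactly one of the four categories; this is an assumption consistent with the reported outputs, not a figure stated in the analysis itself.

\[\text{Global Accuracy} + \text{Global Error} = \frac{TP + TN}{TP + TN + FP + FN} + \frac{FP + FN}{TP + TN + FP + FN} = 1\]

\[0.9925 + 0.0075 = 1, \qquad 0.0075 \times 400 = 3 \text{ cells in disagreement}\]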

2.9 Summary of Mapping Statistics and Visualization of Results

Across the analyses of the statistical metrics of the mapping, namely sensitivity, specificity, global accuracy and global error, we observed promising results for the model used in mapping the populated anthromes in the State of São Paulo, as evidenced by the following table.

Table: Results of the Statistical Metrics of the Model for Mapping Populated Anthromes.

##           Metrics Estimation Percentage
## 1     Sensitivity  0.9951691     99.52%
## 2     Specificity  0.9896373     98.96%
## 3 Global Accuracy  0.9925000     99.25%
## 4    Global Error  0.0075000      0.75%

Source: the authors (2023). Legend: table with a summary of the results obtained for the four statistical metrics analyzed, namely: sensitivity, specificity, global accuracy and global error. The results are presented in two formats, the estimate and the percentage (with two decimal places).
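A table of this kind can be assembled from the four estimates with base R alone. The sketch below is one possible construction, not necessarily the one used to produce the table above; the object names estimates and metrics_table are hypothetical, and the values are copied from the reported results rather than recomputed from combined_set.

```r
# Estimates as reported in the text; in the analysis these would come from
# the objects sensitivity, specificity, global_accuracy and global_error.
estimates <- c(
  "Sensitivity"     = 0.9951691,
  "Specificity"     = 0.9896373,
  "Global Accuracy" = 0.9925,
  "Global Error"    = 0.0075
)

# One row per metric, with the percentage formatted to two decimal places.
metrics_table <- data.frame(
  Metrics    = names(estimates),
  Estimation = unname(estimates),
  Percentage = sprintf("%.2f%%", 100 * unname(estimates))
)
metrics_table
```

Formatting the percentage with sprintf keeps the two-decimal presentation independent of the number of digits printed for the raw estimates.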

In line with other works that use such metrics in mapping, whether for validation of the product (cartography) or for analysis of the model (map production structure), the estimates for the model of populated anthromes meet the requirements of suitability for the intended use, reflecting the quality of the distribution of points and the efficiency of their representation.

Below we present the bar graph of the metrics presented in the table. As can be seen, the global error is the only metric close to zero, which according to the literature is a favorable result, given the low distortion rate of the product (populated anthromes) compared to the gold standard (IBGE data).

On the other hand, sensitivity, specificity and global accuracy are close to 1. According to the literature, the closer these metrics are to 1, the better the model’s ability to represent the data relative to the comparator (gold standard). Therefore, the anthromes model meets these premises and satisfies the specifications of these metrics efficiently.

Graph: Statistical Metrics Results.

## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## No shared levels found between `names(values)` of the manual scale and the
## data's fill values.

Source: the authors (2023). Legend: Bar graph summarizing the estimates obtained for the statistical metrics used in the analysis of mapping quality, namely: global accuracy (red), global error (purple), specificity (green) and sensitivity (blue).
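The warning printed with the graph is typically raised by ggplot2 when the names given to a manual fill scale do not match the values mapped to fill, in which case the chosen colours are not applied. The sketch below shows one way to keep the two aligned; it assumes the ggplot2 package, and the objects metrics, fill_colours and p are hypothetical names with values copied from the reported results. It is not a reconstruction of the original plotting code.

```r
library(ggplot2)

# Estimates as reported in the text.
metrics <- data.frame(
  Metric   = c("Global Accuracy", "Global Error", "Specificity", "Sensitivity"),
  Estimate = c(0.9925, 0.0075, 0.9896373, 0.9951691)
)

# The names of the manual scale must match the values mapped to `fill`;
# otherwise ggplot2 cannot assign the colours and warns about shared levels.
fill_colours <- c(
  "Global Accuracy" = "red",
  "Global Error"    = "purple",
  "Specificity"     = "green",
  "Sensitivity"     = "blue"
)

p <- ggplot(metrics, aes(x = Metric, y = Estimate, fill = Metric)) +
  geom_col() +
  scale_fill_manual(values = fill_colours) +
  ylim(0, 1) +
  labs(title = "Statistical Metrics Results", x = NULL, y = "Estimate")
p
```

Because every name in fill_colours corresponds to a value of Metric, the colours described in the legend are the ones actually drawn.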

Another visualization that seemed relevant for presenting the estimates obtained for the analyzed metrics was the radar chart. In it, the four metrics are presented simultaneously on a target, in which the center represents the value 0 and the outermost circle the value 1. This type of chart is commonly used in Metrology to analyze the measurement capacity of a given method or instrument. As Metrology is one of the Sciences that is part of our analytical core and that consolidates our view of Environmental Sciences and Human Ecology, we adopted this graphic model in the analyses, reinforcing our effort to align these Sciences.

Chart: Radar Chart of Metrics Investigated.

## `geom_path()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Source: the authors (2023). Legend: radar graph of statistical metrics to validate the mapping and model analyzed. The graph reports the estimates obtained for sensitivity, specificity, global accuracy and global error.
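The message printed with the chart usually appears in ggplot2 when a line geometry is drawn over a discrete axis without a grouping, so that each category is treated as its own one-point group. The sketch below shows an approximate radar chart in which group = 1 joins the four metrics into a single outline. It assumes the ggplot2 package and uses coord_polar(), which draws curved rather than straight edges; the objects metrics and radar are hypothetical, and this is not a reconstruction of the original plotting code.

```r
library(ggplot2)

# Estimates as reported in the text.
metrics <- data.frame(
  Metric   = c("Sensitivity", "Specificity", "Global Accuracy", "Global Error"),
  Estimate = c(0.9951691, 0.9896373, 0.9925, 0.0075)
)

# group = 1 tells ggplot2 that all four rows belong to a single outline.
radar <- ggplot(metrics, aes(x = Metric, y = Estimate, group = 1)) +
  geom_polygon(fill = NA, colour = "steelblue") +
  geom_point(colour = "steelblue") +
  coord_polar() +
  ylim(0, 1)
radar
```

With the radial axis fixed between 0 and 1, the three metrics close to 1 sit near the outer circle and the global error sits near the center.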

As shown previously in the table, the values for sensitivity, specificity and global accuracy are close to one, giving the impression in the radar chart that they sit at 1 (the outermost circle). Conversely, the global error, being close to 0, sits at the center of the chart, which reiterates the statements made above and follows the interpretative premises of this graphic model in the literature.

With this, we conclude our analyses regarding the mapping of populated anthromes. Throughout this analysis, we processed and mined data from the 2010 Demographic Census (IBGE, 2013), classifying geospatial data into different types of populated anthromes, following the guidelines established by Ellis (2020) for classification and the IBGE metadata for alignment. Subsequently, we plotted the classified data and then produced static and interactive mappings of populated anthromes in the State of São Paulo.

Once the mapping construction stages were completed, we moved on to statistical analysis to validate the cartographic product. At this point, we carried out the overlap analysis, comparing the cartography of populated anthromes to the gold standard, which was defined based on IBGE data. We identified some distortions in the data set and, consequently, in the mapping; these constitute a limitation of the product of this Thesis but do not invalidate it.

Sequentially, we analyzed the statistical metrics of sensitivity, specificity, global accuracy and global error. The estimates obtained for these metrics showed that the model has relevant suitability for the intended use of mapping demographic information, efficiently performing the distribution of points in the cartography and the mapping of census information, when compared to the gold standard. The small distortions identified at this stage also appear to be modeling limitations, but do not invalidate the modeling used to map populated anthromes.

3 Conclusions

Throughout this analysis, we processed and mined data from the 2010 Demographic Census (IBGE, 2013), classifying geospatial data into different types of populated anthromes, following the guidelines established by Ellis (2020) for classification and the IBGE metadata for alignment. Subsequently, we plotted the classified data and then produced static and interactive mappings of populated anthromes in the State of São Paulo.

Sequentially, we analyzed the statistical metrics of sensitivity, specificity, global accuracy and global error. The estimates obtained for these metrics showed that the model has relevant suitability for the intended use of mapping demographic information, efficiently performing the distribution of points in the cartography and the mapping of census information, when compared to the gold standard. The small distortions identified at this stage also appear to be limitations of the modeling, but they do not invalidate the modeling used to map the populated anthromes.

Populated anthromes: from exploratory analysis of demographic data to mapping

Maximiliano Gobbo

2023-05-24
