1 Introduction

The survey of attributes associated with a set of geospatial data is essential to identify points for correlation between different types of geographic information, as presented by Machlis (2017). The recognition of these attributes allows social and environmental information to be connected within the mapping programming, generating cartography for different areas of knowledge, as discussed by Anderson (2021). From this perspective, Moreno and Basile (2021) showed that through this integration it is possible, in the universe of the R® language, to produce maps with sufficient quality to compose territorial, spatial, socio-environmental, environmental and cultural analyses, thus serving the most diverse scientific fields. . Considering this, Gauthier (2021) and Ellis, Beusen and Goldewijk (2020) demonstrated, through the global mapping of anthropogenic biomes (anthromes), that the correlation between different types of geospatial information, in addition to being possible, proves to be a tool modern approach to socio-ecological studies. These authors carried out the indicated mapping using computational language, R® language itself, and were able to generate a relevant mapping on a 96 km² hexagonal grid of the DGG.

According to the information reported in the structural bases on anthromes, Ellis and Ramankutty (2008) pointed out that demographic information, such as demographic density, and land use and coverage, such as coverage typologies, are fundamental for the identification, classification and mapping of different anthropogenic classes. Both in the cartographic product reported by these two authors, and in that published by Gauthier (2021) and Ellis, Beusen and Goldewijk (2021), it is noticeable that the territorial sections of land uses (structured polygons) represent significant information for the classification and for mapping anthromes.

Assuming the cartographic products described above and the indications on database attributes, and intending to produce local mapping of Brazilian anthromes in the future, we carried out an exploratory analysis of data on land use and cover. These data are produced by the Brazilian Institute of Geography and Statistics (IBGE), following the technical description documented in the Technical Manual for Land Use and Cover (IBGE, 2013b).

The central objectives of this investigation were to identify the attributes that made up the data made available by the institution, recognize the spatialization of geographic information on land use and land cover and verify whether there were key attributes for correlating these geospatial data with demographic data, analyzed by us in another work. To achieve these objectives, we based ourselves on the logical and analytical formatting presented in the works of Lovelace et al. (2019) and Anderson (2021), where the functions for exploratory data analysis in the R® software are presented.

2 Presentation of data and methodological guidelines

The data on land use and coverage used in this research were those available on the Brazilian Institute of Geography and Statistics (IBGE) platform. This data was downloaded and saved in the workbook (directory) associated with this stage of the investigation. We emphasize that it is part of Brazilian national open data policies (IBGE, 2019) and is available for remote access at the link:

https://www.ibge.gov.br/geociencias/informacoes-ambientais/cobertura-e-uso-da-terra/15831-cobertura-e-uso-da-terra-do-brasil.html?=&t=o-que-e.

We highlight that land use and land cover data were produced by IBGE following the methodological guidelines expressed in the Technical Manual on Land Use and Cover (IBGE, 2013b). Previously, we analyzed the quality of the data made available by the Brazilian institute, checking their compliance with the intended use for the regionalization of Brazilian anthromes (GOBBO, 2020).

Continuing our PhD research, in this exploratory analysis we investigated attributes that were part of the data collection made available by IBGE. The central objective was to identify which attributes associated with this geographic information would allow the correlation between land use and cover data and demographic data, as well as the plotting of this information in the local mapping of anthromes, as suggested by Guathier (2021) and Ellis , Beusen and Goldewijk (2020).

To this end, the exploratory analysis of land use and land cover data was carried out using the R® software, aligning our investigation with those carried out by the aforementioned authors on anthromes. The analytical guidelines used in the R® language were taken from the works of Lovelace et al. (2019) and Anderson (2021), where the authors follow an investigative path to identify the constituent attributes of the data and the possibility of integrating information into mappings.

As we previously presented in the Thesis Methodology, our focus was on the State of São Paulo, justifying ourselves in the representativeness of the Federation Unit for national economy, politics and management and by the territorial divisions and sociocultural diversity of the populations that make up the São Paulo mosaic. From this perspective, data relating to São Paulo’s land cover for the period between 2010 and 2018 were downloaded from the IBGE digital collection.

Following the methodological recommendations of Gauthier (2021) for classifying and mapping anthromes, which indicates that use and coverage and census data must have the same dating to optimize correlation and reduce information distortions, we chose to use raster data referring to to 2010, the last year in which data from the census operation was fully available on the institute’s platform (IBGE, 2013a).

3 From Exploratory Analysis to Mapping of Land Uses and Covers associated with São Paulo Anthromes

3.1 Exploratory Analysis

To present and confirm the working directory for the exploratory analysis of data on land use and cover, we use the getwd() function, which informs the name of the folder in which the files to be analyzed were saved, as shown below.

getwd()
## [1] "C:/ARQUIVOS COMPUTADOR/DOUTORADO/DOUTORADO TESE/03 DADOS GEOESPACIAIS/03.1 LAND USE AND COVER"

We highlight that this Brazilian data is of the raster type and, therefore, follows a different loading logic than the vector data in the R® software. Therefore, we use the shapefile() function to load this data into the program. Between the parentheses of the function, we specify the name of the file in the directory, whose name was coverage_sp.shp. We emphasize that this name reflected a set of files made available by IBGE with different extensions, which should be included in the work folder so that the shapefile() function could interpret them together.

Script 12 expresses in the first line how the data was loaded. After loading the raster data set, we create an object (data frame) from it, using the indicator (arrow) preceded by the given name of coverage_sp, as demonstrated in the second line of the code. Next, we confirm the creation of the object and its composition using only the name of the coverage_sp set (highlighted in the script).

Script 01: Loading the file coverage_sp.shp.

coverage_sp <- st_read("coverage_sp.shp")
## Reading layer `coverage_sp' from data source 
##   `C:\ARQUIVOS COMPUTADOR\DOUTORADO\DOUTORADO TESE\03 DADOS GEOESPACIAIS\03.1 LAND USE AND COVER\coverage_sp.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 248217 features and 8 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -53.10706 ymin: -25.30115 xmax: -44.16211 ymax: -19.78065
## Geodetic CRS:  SIRGAS 2000

Source: the authors (2023). Caption: script describing the loading of data from “coverage_sp.shp” through the st_read() function and the creation of the coverage_sp data frame. The script presents different information that constitutes the file and the loaded data.

The results presented in Script 12 provided relevant information about the composition of the coverage_sp set. Firstly, we observed that the class of the object is of the data frame type of spatial polygons, that is, the data described a set of polygons that represented the areas that made up the territorial mosaic of the State of São Paulo. Additionally, we verified that the object was composed of 248,217 lines (features) and 8 variables. These variables had the following names: INDICE_GRE, USO2000, USO2010, USO2012, USO2014, USO2016, USO2018 and ID_GRE.

In minimum and maximum values (respectively, min values and max values) the minimum and maximum values for each of the variables that made up the data set were presented. On the other hand, in extend, we are faced with the indication of the minimum and maximum extension of the X and Y values that structured the set of polygons, that is, the values of the ordered pairs for plotting the data. Additionally, through the CRS indicated in Script 12, we recognized that there was a geographic projection and positioning system linked to the structure of the data set.

In order to broaden our perception of the coverage_sp set and confirm the above statements, we use the summary () function to identify basic characteristics of the polygonal data frame. Subsequently, we use the class () function in the analysis to confirm the class that the set belongs to. Sequentially, we used the names () function to identify the names expressed in the first line of the data frame, that is, the attributes that comprised it. To confirm the number of lines that make up the data frame, we apply the dim () function. The functions and their results are reported in Script 13 below.

Script 02: Preliminary coverage_sp analysis

class(coverage_sp)
## [1] "sf"         "data.frame"
names(coverage_sp)
## [1] "INDICE_GRE" "USO2000"    "USO2010"    "USO2012"    "USO2014"   
## [6] "USO2016"    "USO2018"    "ID_GRE"     "geometry"
dim(coverage_sp)
## [1] 248217      9
summary(coverage_sp)
##   INDICE_GRE           USO2000          USO2010          USO2012      
##  Length:248217      Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  Class :character   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Mode  :character   Median : 3.000   Median : 3.000   Median : 3.000  
##                     Mean   : 3.848   Mean   : 3.697   Mean   : 3.643  
##                     3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000  
##                     Max.   :13.000   Max.   :13.000   Max.   :13.000  
##     USO2014          USO2016          USO2018           ID_GRE         
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   :1.124e+13  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:1.124e+13  
##  Median : 3.000   Median : 3.000   Median : 3.000   Median :1.124e+13  
##  Mean   : 3.624   Mean   : 3.605   Mean   : 3.597   Mean   :1.124e+13  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.:1.124e+13  
##  Max.   :13.000   Max.   :13.000   Max.   :13.000   Max.   :1.124e+13  
##           geometry     
##  POLYGON      :248217  
##  epsg:4674    :     0  
##  +proj=long...:     0  
##                        
##                        
## 

Source: the authors (2023). Caption: script with preliminary coverage analysis, where the class(), names(), dim() and summary() functions were used.

According to the results presented above (Script 13), we confirmed through the class () function that the object belonged to the spatial polygon data frame class. In the structure of the object, we also verified that the object had a geographic coordinate system (CRS) integrated into it, which was represented in the Script by the minimum and maximum values of X and Y and by the expression “proj4string”. This expression revealed the geographic projection capacity of the data thanks to the coordinates indicated in the object structure (XY ordered pairs).

Furthermore, we proved that coverage_sp had the spatial characteristic integrated into its structure, revealed by the expression “sp”, which comes from the English term spatial. Another important aspect was the indication of “attr”, an expression that demonstrated that the object was composed of a set of attributes (attr), which were stored in package format.

Furthermore, the results from the names () function reaffirmed the names of the attributes of the coverage_sp set, as well as their distribution in 8 columns. Regarding the dim () function, the results showed that the object consisted of 248,217 lines and 8 columns. We thus verified that each of the lines referred to different polygonal structures in the territory of São Paulo, that is, they did not portray the locations and typologies of use and coverage that make up the mosaic of the State of São Paulo.

On the other hand, considering the results obtained by the names() function and the typology of results for each of the attributes, indicated by the summary() function, we recognize that the attributes “USO2000”, “USO2010”, “USO2012”, “USO2014” , “USO2016” and “USO2018” portrayed the different years of Earth use and coverage monitoring, an operation carried out by IBGE every two years (IBGE, 2017; 2018; 2020).

From this perspective, as we were interested in the year 2010 for structuring the decision tree for classifying Brazilian anthromes, we filtered the data referring to the year in the plot() function, using the name of the attribute “USO2010” between the square brackets [] in the function. Figure 01 reports how the coverage data was plotted for the year 2010.

Figure 01: Plot of coverage data for the year 2010.

Source: the authors (2023). Caption: Figure produced by plotting the coverage data, using the plot (_) function for production and the brackets [] to filter the data for the year 2010.

Figure 01, produced by plotting the data set, demonstrated that the polygonal structure of the State of São Paulo was visibly structured within the coverage_sp data frame. However, the complete filling of the territorial area did not allow the differentiation of land uses and covers that made up the mosaic. However, from the results obtained previously, we were able to identify that there are numbers in the USO2010 column and, as suggested by the IBGE metadata, we know that these numbers refer to different types of land use and cover. This fact can be demonstrated by applying the head() function, whose application deals with demonstrating the first lines of the set. The following code illustrates the use of the function.

Script: Application of the head() function in coverage_sp.

head(coverage_sp)
## Simple feature collection with 6 features and 8 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -49.89744 ymin: -23.05592 xmax: -49.88686 ymax: -23.00143
## Geodetic CRS:  SIRGAS 2000
##      INDICE_GRE USO2000 USO2010 USO2012 USO2014 USO2016 USO2018       ID_GRE
## 1 1KME5422N8762       4       4       4       4       4       4 1.123535e+13
## 2 1KME5422N8763       2       2       2       2       2       2 1.123535e+13
## 3 1KME5422N8764       2       2       2       2       2       2 1.123535e+13
## 4 1KME5422N8765       2       2       2       2       2       2 1.123535e+13
## 5 1KME5422N8766       4       4       4       4       4       4 1.123535e+13
## 6 1KME5422N8767       4       4       4       4       4       4 1.123535e+13
##                         geometry
## 1 POLYGON ((-49.88686 -23.055...
## 2 POLYGON ((-49.88701 -23.046...
## 3 POLYGON ((-49.88715 -23.037...
## 4 POLYGON ((-49.88729 -23.028...
## 5 POLYGON ((-49.88744 -23.019...
## 6 POLYGON ((-49.88758 -23.010...

Source: the authors (2024). Caption: demonstration of the head() function to find the numbers in the first six lines of the set coverage_sp.

As seen in the results presented by the head() function, we notice that in the USO2010 column there are different numbers for each line in the set. Synergistically, we notice that in the figure generated previously a scale is presented that goes from 0 to 14, which reveals which numbers are present in this set. Given these findings, we moved forward with our analyzes for data mining and manipulation.

3.2 Data Mining and Manipulation

Considering our focus on 2010 data, given the demographic data previously analyzed, our mining began by isolating the columns from the coverage_sp set that would be useful for subsequent analyses. Therefore, using the select() function, we created a new set with the columns USO2010 and geometry, which was named coverage_sp_2010, as illustrated in the following code.

Script: creation of the coverage_sp_2010 set.

coverage_sp_2010 <- coverage_sp[c("USO2010", "geometry")]
coverage_sp_2010
## Simple feature collection with 248217 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -53.10706 ymin: -25.30115 xmax: -44.16211 ymax: -19.78065
## Geodetic CRS:  SIRGAS 2000
## First 10 features:
##    USO2010                       geometry
## 1        4 POLYGON ((-49.88686 -23.055...
## 2        2 POLYGON ((-49.88701 -23.046...
## 3        2 POLYGON ((-49.88715 -23.037...
## 4        2 POLYGON ((-49.88729 -23.028...
## 5        4 POLYGON ((-49.88744 -23.019...
## 6        4 POLYGON ((-49.88758 -23.010...
## 7        4 POLYGON ((-49.88772 -23.001...
## 8        4 POLYGON ((-49.88787 -22.992...
## 9        4 POLYGON ((-49.88801 -22.983...
## 10       4 POLYGON ((-49.93661 -22.983...

Source: the authors (2024). Caption: application of the select() function to create a new data set for the year 2010. In the code, the INDICE_GRE, USO2010, ID_GRE and geometry columns from the coverage_sp set were isolated.

As can be seen in the results, the set coverage_sp_2010 follows the same structure as cobertura_sp, however only with the relevant data for subsequent analyzes referring to the year 2010. Based on this, we carried out a survey in the column USO2010 to confirm what was previously presented about the values present in this column. To carry out such polling, we use the unique() function, which reports the values, in this case numeric, present in the column. Below is the presentation of the code and results.

Script: application of the unique() function in coverage_sp_2010.

unique(coverage_sp_2010$USO2010)
##  [1]  4  2  1  3 12  5  6 11 10 13  9

Source: the authors (2024). Legend: application of the unique() function in the coverage_sp_2010 data set in the USO2010 column to identify the numeric values present in the column.

According to the results reported by the unique() function, we note the numeric identifiers: 1,2,3,4,5,6,9,10,11,12 and 13. We call them numerical identifiers because, According to the IBGE technical documentation, these values are assigned to different land uses and covers. Considering this, we must provide a synthesis, based on the document, of the respective uses and coverage in order to advance in our analyses. In the metadata made available by IBGE, we find the file coverage_sp2018.xlsx, which contains the respective uses and coverage and a brief description for each of the numbers identified above. This information is reported in full below.

1.Artificial area: Areas where non-agricultural anthropogenic surfaces predominate. These are those structured by buildings and road systems, which include metropolises, cities, towns, indigenous villages and quilombola communities, areas occupied by industrial and commercial complexes and buildings that may, in some cases, be located in peri-urban areas. . Areas where exploration or extraction of mineral substances occurs, through mining or mining, also belong to this class.

  1. Agricultural Area: Area characterized by temporary, semi-perennial and permanent crops, irrigated or not, with the land used for the production of food, fiber, fuel and other raw materials. It follows the parameters adopted in IBGE agricultural research and includes all cultivated areas, including those that are fallow or located on flood able land. It can be represented by heterogeneous agricultural zones or extensive areas of plantations. Includes aquaculture tanks.

  2. Managed Pasture: Areas intended for grazing cattle and other animals, with cultivated herbaceous vegetation (brachiaria, ryegrass, etc.) or field vegetation (natural), both presenting high-intensity anthropogenic interference. These interferences may include planting; cleaning the land (destopping and destonering); elimination of weeds mechanically or chemically (application of herbicides); harrowing; liming; fertilizing; among others that distort the natural coverage.

  3. Mosaic of Occupations in Forest Area: Area characterized by mixed occupation of agricultural, pasture and/or forestry areas associated or not with forest remnants, in which individualization of its components is not possible. It also includes areas with natural and anthropogenic disturbances, mechanical or non-mechanical, that make it difficult to characterize the area.

  4. Silviculture: Area characterized by forest plantations of exotic or native species as monocultures. It follows the parameters adopted in IBGE’s plant extraction and forestry research.

  5. Forest Vegetation: Area occupied by forests. Tree formations exceeding 5 meters in height are considered forestry, including areas of Dense Rainforest, Open Rainforest, Seasonal Forest, as well as Mixed Rainforest. It includes other features due to their size exceeding 5 m in height, such as the Forested Savanna, Forested Campinarana, Forested Steppe Savanna, Mangroves and Buritizais, According to the Technical Manual for Land Use (IBGE, 2013).

  6. Wet Area: Area characterized by natural herbaceous or shrubby vegetation (coverage of 10% or more), permanently or periodically flooded by fresh or brackish water. Includes ponds, swamps, wet fields, estuaries, among others. The flooding period must be at least 2 months per year. Shrub or tree vegetation may occur, as long as they occupy less than 10% of the total area.

  7. Countryside Vegetation: Area characterized by grassland formations. Countryside is understood as the different categories of vegetation that are physiognomically very different from forestry, that is, those that are characterized by a predominantly shrubby stratum, sparsely distributed over a grassy-woody stratum. This category includes Savannas, Steppes, Steppe Savannas, Pioneer Formations and Ecological Refuges. They are spread across different phytogeographic regions, comprising different primary typologies: plateau steppes, rocky fields in coastal mountains and coastal hydro-sandy fields (restinga), According to the Technical Manual for Land Use (IBGE, 2013). These areas may be subject to grazing and other low-intensity anthropogenic interference, such as the unmanaged pasture areas of Rio Grande do Sul and Pantanal.

  8. Mosaic of Occupations in Countryside Areas: Area characterized by mixed occupation of agricultural, pasture and/or forestry areas associated or not with countryside remnants, in which individualization of its components is not possible. It also includes areas with natural and anthropogenic disturbances, mechanical or non-mechanical, that make it difficult to characterize the area.

  9. Continental body of water: Includes all inland waters, such as rivers, streams, canals and other linear bodies of water. It also encompasses naturally closed bodies of water (natural lakes) and artificial reservoirs (artificial water dams built for irrigation, flood control, water supply and electricity generation). Does not include aquaculture tanks.

13: Coastal body of water: Includes waters within 12 nautical miles, According to Law No. 8,617, of January 4, 1993.

Based on this extract from the IBGE document, we resumed our table of anthropogenic classes to establish a parallel between the categories of land use and cover and the anthropogenic classes. The following table represents the categories and colors referring to each anthrome to be classified.

Table: Categories and colors of anthromes.

##                                           categories  colors
## 1  Urban (urban center and dense population cluster) #FF0000
## 2                      Mixed urban (urban periphery) #FF4747
## 3                              Agricultural villages #ED833B
## 4                                 Livestock villages #FFD966
## 5                            Traditional Communities #9CC2E5
## 6                                    Mining villages #8C8C8C
## 7                                  Agricultural land #F4B488
## 8                                     Livestock land #FFEEB7
## 9                                       Mining lands #B6B6B6
## 10                                          Forestry #E2EFD9
## 11                  Forest areas for sustainable use #A8D08D
## 12               Preservation and conservation areas #538135

Source: the authors (2024). Caption: Categories of anthromes and their respective color codes (RGB).

Considering this table and the classes previously presented for land uses and covers, we must make some notes to help understand the next analytical steps. These notes cover the adjustments made to establish the parallel between the classes of anthromes and the classes of land use and cover present in the coverage_sp_2010 data. We will list the adjustments below.

  1. The class of artificial area brings together different types of population clusters, which makes their distinction impossible at first. Therefore, we established the nomenclature and color for this class referring to the anthropogenic category of mixed urban, considering the range of anthropogenic classes that these data can group together.

  2. For the typology agricultural area (use and coverage) we established the nomenclature and coloring of agricultural land, considering the description provided by IBGE for this category of land use and coverage. The same was applied to the category of use and coverage referring to managed pasture, whose reference for anthromes was livestock land. Furthermore, since silviculture is a common category between the use and cover classes and anthromes, the same coloration was established.

  3. For the land use and cover categories countryside vegetation, humid area and forest vegetation, we established the nomenclature and coloring of preservation and conservation areas, merging them as in a single set. Our choice was made based on the description of the uses and coverage that these areas describe. According to IBGE, the three types of use and coverage refer to natural areas that have not been disturbed. Notably, we disregarded the phytophysiological and biogeographical differences associated with the three typologies, however, this does not preclude further analysis and can be considered as a limitation of our analyses.

  4. The category of use and coverage mosaic of occupations in forest areas was understood by us as forest areas of sustainable use in the categories of anthromes. Therefore, we established this nomenclature and coloring for such areas. On the other hand, we consider the category mosaic of occupations in rural areas as synergistic with agricultural/livestock villages; thus, we established the nomenclature and coloring of agricultural villages (anthrome category) for this category of use and coverage. It should be noted that, as in artificial areas, it was not possible to distinguish between agriculture and livestock; therefore, the distinction was made a posteriori with the integration of demographic data with land use and land cover

  5. For the categories continental water body and coastal water body there were no representative categories in the anthromes. In order not to remove this data set from our analyses, we established the color “#000066” (dark blue) to include the data in subsequent analyzes and defined the typology as waters. The maintenance of this set aimed to portray the relevance of water courses for the maintenance and supply of communities, as well as for territorial planning policies.

After these words, we structure, using the if/else() function, the code so that the data is interpreted as mentioned above. Therefore, we determined in coverage_sp_2010 what each of the values in the USO2010 column represents as a typology of land use and coverage and the color defined for each of the typologies. The following code illustrates the procedure for inserting two new columns into the coverage_sp_2010 set: typology and color.

Script: Determination of land use and coverage and the color of the typology based on the numeric attribute in the USO2010 column.

# Add usage and coverage typology column with if/else
coverage_sp_2010$tipology <- ifelse(coverage_sp_2010$USO2010 == 1, "mixed urban",
ifelse(coverage_sp_2010$USO2010 == 2, "agricultural lands",
ifelse(coverage_sp_2010$USO2010 == 3, "livestock lands",
ifelse(coverage_sp_2010$USO2010 == 4, "sustainable use forest areas",
ifelse(coverage_sp_2010$USO2010 == 5, "silviculture",
ifelse(coverage_sp_2010$USO2010 == 6, "preservation and conservation areas",
ifelse(coverage_sp_2010$USO2010 == 9, "preservation and conservation areas",
ifelse(coverage_sp_2010$USO2010 == 10, "preservation and conservation areas",
ifelse(coverage_sp_2010$USO2010 == 11, "agricultural villages",
ifelse(coverage_sp_2010$USO2010 == 12, "waters",
ifelse(coverage_sp_2010$USO2010 == 13, "waters", NA)))))))))))
# Add color column specified with if/else
coverage_sp_2010$colors <- ifelse(coverage_sp_2010$USO2010 == 1, "#FF4747",
ifelse(coverage_sp_2010$USO2010 == 2, "#F4B488",
ifelse(coverage_sp_2010$USO2010 == 3, "#FFEEB7",
ifelse(coverage_sp_2010$USO2010 == 4, "#A8D08D",
ifelse(coverage_sp_2010$USO2010 == 5, "#E2EFD9",
ifelse(coverage_sp_2010$USO2010 == 6, "#538135",
ifelse(coverage_sp_2010$USO2010 == 9, "#538135",
ifelse(coverage_sp_2010$USO2010 == 10, "#538135",
ifelse(coverage_sp_2010$USO2010 == 11, "#ED833B",
ifelse(coverage_sp_2010$USO2010 == 12, "#000066",
ifelse(coverage_sp_2010$USO2010 == 13, "#000066", NA)))))))))))

Source: the authors (2024). Caption: application of the if/else() function to construct the typology (land use and coverage) and color columns, using as a factor the numeric attribute of the USO2010 column of coverage_sp_2010.

After inserting these two pieces of information into the coverage_sp_2010 set, we fragmented it into the different groups of land use and coverage represented by the codes: 1,2,3,4,5,6,9,10,11 .12 and 13. The following script illustrates the process for constructing anthrome classes based on soil use and land cover attributes.

Script: fragmentation of the coverage_sp_2010 set into land use and land cover groups.

# mixed urban (1)
ucs_mixed_urban = coverage_sp_2010 %>% filter(USO2010 == 1)

# agricultural lands (2)
ucs_agricultural_lands = coverage_sp_2010 %>% filter(USO2010==2)

# livestock lands (3)
ucs_livestock_lands = coverage_sp_2010 %>% filter(USO2010 == 3)

# sustainable use forest areas (4)
ucs_sustainable_use_forest = coverage_sp_2010 %>% filter(USO2010 == 4)

# silviculture (5)
ucs_silviculture = coverage_sp_2010 %>% filter(USO2010 == 5)

# preservation and conservation areas - combination of Forest Vegetation (6), Wet Area (9) and Countryside Vegetation (10)
ucs_preservation_conservation = coverage_sp_2010 %>% filter(USO2010 %in% c(6,9,10))

# Mosaic of occupations in rural areas (11)
ucs_agricultural_villages = coverage_sp_2010 %>% filter(USO2010 == 11)

# Continental (12) and coastal water bodies (13)
ucs_waters = coverage_sp_2010 %>% filter(USO2010 %in% c(12,13))

Source: the authors (2024). Caption: creation of different classes of anthromes based on the attribute of soil use and land cover (from coverage_sp_2010).

In the script above, we fragmented the **coverage_sp_2010* set using the codes for the different types of land use and coverage. We highlight that the areas were aggregated in accolorsdance with the previously postulated. Coming from this fragmentation, also understood as data mining, we created the following sets (8 in total):

  1. ucs_mixed_urban
  2. ucs_agricultural_villages
  3. ucs_agricultural_lands
  4. ucs_livestock_lands
  5. ucs_forestry
  6. ucs_sustainable_use_forest
  7. ucs_preservation_conservation
  8. ucs_waters

These sets represent the types of land use and cover present in the State of São Paulo and also the anthromes classified based on the uses and covers present in the Federation Unit. It should be noted that, as we previously reported, some typologies aggregate data from different territorial contexts. An example is the artificial area which clusters areas such as cities, towns, indigenous villages and quilombola communities. From this perspective, we had a warning for future analyses, considering that these areas describe different populated anthromes. Therefore, at this stage we consider them only as artificial areas, treating them together. However, when combining land use and coverage data with population data, the distinction between the typology of the anthropogenic biome was referred to based on data from populated anthromes, with a view to better differentiating the occupations and uses of anthropogenic biomes.

3.3 Data Plotting

After data mining and manipulation was carried out, creating sets of land use and land cover anthromes present in the State of São Paulo, we moved on to viewing these sets individually. In the following code, we establish, through st_geometry() the mechanism for operating the plot() function, indicating that it uses the geometry of the polygons of each type of use and coverage as a form for plotting. Furthermore, we used the col factor so that the plot() function assigned the respective colors intended for the anthropogenic typology to the polygons. Furthermore, we remove the edges of the polygons, allowing only the plotting of the colored shape (border = NA). Below are represented the different anthromes based on land uses and covers.

Figure: coded structure for producing images referring to anthrome’s typologies based on the attribute land use and cover.

# Plot mixed urban (1)
plot(st_geometry(ucs_mixed_urban), col = ucs_mixed_urban$colors, main = "mixed urban", border = NA)

# Plot agricultural lands (2)
plot(st_geometry(ucs_agricultural_lands), col = ucs_agricultural_lands$colors, main = "agricultural lands", border = NA)

# Plot livestock lands
plot(st_geometry(ucs_livestock_lands), col = ucs_livestock_lands$colors, main = "livestock lands", border = NA)

# Plot sustainable use forest areas (4)
plot(st_geometry(ucs_sustainable_use_forest), col = ucs_sustainable_use_forest$colors, main = "sustainable use forest areas", border = NA)

# Plot silviculture (5)
plot(st_geometry(ucs_silviculture), col = ucs_silviculture$colors, main = "silviculture", border = NA)

# Plot conservation and preservation areas (6)
plot(st_geometry(ucs_preservation_conservation), col = ucs_preservation_conservation$colors, main = "conservation and preservation areas", border = NA)

# Plot agricultural villages (11)
plot(st_geometry(ucs_agricultural_villages), col = ucs_agricultural_villages$colors, main = "agricultural villages", border = NA)

# Plot bodies of water
plot(st_geometry(ucs_waters), col = ucs_waters$colors, main = "bodies of water", border = NA)

Source: the authors (2024). Caption: São Paulo anthromes isolated based on the typology of land use and cover.

The images generated by plotting the sets revealed how São Paulo’s anthromes are distributed within the Federation Unit separately. In the figure artificial area we observe the distribution of population clusters, mainly those associated with municipalities, towns, districts and traditional communities. It is worth noting that in the images mosaic of occupations in forest areas and mosaic of occupations in rural areas there is also the presence of a population, however, they are represented here in a generic way. As we pointed out previously, at the junction of land use and land cover anthromes with populated anthromes, we will be able to distinguish where the populations are actually inserted in these two sets.

Regarding the images agricultural area and managed pasture, we have agricultural areas and livestock areas represented separately. Here we have reference to the planting and breeding areas, with no directly associated population. Likewise, native forests, conservation and preservation represent anthromes where there is no human presence and associated human activity. In contrast, the silviculture image represents areas destined for the production of tree species or areas of use, but they are also areas where human presence is not very representative.

Finally, the image bodies of water represents the magnitude of the rivers, lakes, lagoons and watercourses, including coastal ones, that run through the State. Although it does not exactly represent an anthromes, we consider the relevance of rivers both for maintaining the quality of life and the environment, as well as for their significant importance for territorial planning, as recommended by national standards and regulations.

Continuing with our analyses, we proceeded to join these sets to jointly plot the anthromes of land use and cover, in order to obtain a general overview of the distribution of these anthromes in the territory of São Paulo. To this end, it was necessary to create a legend that represented the typologies and colors associated with each of the soil use and cover anthromes. To do so, we extracted this information (typology and color) from the coverage_sp_2010 set to construct the legend. The following script illustrates the process of creating anthromes_ucs_legend.

Script: creation of the legend for land use and land cover anthromes (anthromes_ucs_legend).

# Set the caption manually
anthromes_ucs_legend <- data.frame(tipology = c("mixed urban", "agricultural lands", "livestock lands", "sustainable use forest", "silviculture", "conservation and preservation areas", "agricultural villages", "waters"),colors = c("#FF4747", "#F4B488", "#FFEEB7", "#A8D08D", "#E2EFD9", "#538135", "#ED833B", "#000066"))

Source: the authors (2024). Legend: script demonstrating the mechanism for creating the legend to be plotted on the image of land use and land cover anthromes.

Once these procedures were carried out to construct the image of soil use and land cover anthromes, we proceeded to plotting the sets together. Using the operator add = TRUE we place the different sets of land use and land cover anthromes on the same plot. The following image represents the anthromes of use and coverage present in the State of São Paulo.

Figure: anthromes of land use and cover in the State of São Paulo.

# Plot all geometries with their corresponding colors
plot(st_geometry(ucs_mixed_urban), col = ucs_mixed_urban$colors, border = NA, main = "land use and cover anthromes")
plot(st_geometry(ucs_agricultural_lands), col = ucs_agricultural_lands$colors, border = NA, add = TRUE)
plot(st_geometry(ucs_livestock_lands), col = ucs_livestock_lands$colors, border = NA, add = TRUE)
plot(st_geometry(ucs_sustainable_use_forest), col = ucs_sustainable_use_forest$colors, border = NA, add = TRUE)
plot(st_geometry(ucs_silviculture), col = ucs_silviculture$colors, border = NA, add = TRUE)
plot(st_geometry(ucs_preservation_conservation), col = ucs_preservation_conservation$colors, border = NA, add = TRUE)
plot(st_geometry(ucs_agricultural_villages), col = ucs_agricultural_villages$colors, border = NA, add = TRUE)
plot(st_geometry(ucs_waters), col = ucs_waters$colors, border = NA, add = TRUE)

# Add caption
legend("topright", legend = anthromes_ucs_legend$tipology, fill = anthromes_ucs_legend$colors, title = "Tipology", cex = 0.8)

Source: the authors (2024). Caption: figure showing the different soil use and cover anthromes present in the State of São Paulo. The legend in the figure represents the colors associated with each of the anthromes typologies.

The generated Figure represents the different anthromes of land use and cover. In this plot we have the 8 classes listed previously colored as stipulated for the classes. It should be noted that the plot, despite considering the geographic referencing system, does not represent the mapping in the coordinate plane, being an ordered extract of the data. In the next steps we will join the data for plotting using the ggplot() function, where the cartographic plane is visible.

3.4 Static Mapping of anthromes of Use and Coverage in São Paulo

Continuing the analyses, in this stage we carried out the construction of the static mapping using the ggplot() function. However, to accomplish this, it was first necessary to combine the eight classes of soil use and land cover anthromes. Next, the code for joining the data into a single data frame (anthromes_ucs_sp) is presented. Furthermore, it was necessary to create a set containing the colors associated with each of the types of use and coverage (anthromes_ucs_sp_colors); the code for this procedure is also represented in the sequence.

Script: Joining the anthromes classes of land use and land cover into a single data frame(anthromes_ucs_sp) and joining colors ((anthromes_ucs_sp_colors))

# Create a data frame with corresponding geometries and colors
anthromes_ucs_sp <- rbind(ucs_mixed_urban, ucs_agricultural_villages, ucs_agricultural_lands, ucs_livestock_lands, ucs_silviculture, ucs_sustainable_use_forest, ucs_preservation_conservation, ucs_waters)

# Get the unique levels from the "color" column
anthromes_ucs_sp_colors <- unique(anthromes_ucs_sp$colors)

Source: the authors (2024). Legend: script presenting the model for building the anthromes_ucs_sp set, which groups together the 8 classes of land use and cover anthromes, and the anthromes_ucs_sp_colors set, which integrates the colors associated with the different land use and cover typologies.

According to what was observed in the code above, the anthromes_ucs_sp set was constructed correctly, aggregating the different classes of land use and land cover anthromes into a single data frame. Starting from this set, we proceeded to create the static mapping of soil use and cover anthromes. The following code demonstrates how the ggplot() function was structured to map data on land use and land cover anthromes, followed by the image referring to the Static Mapping of Land Use and Cover anthromes.

Script: Static Mapping of Land Use and Cover anthromes

# Plot with ggplot2 and add the legend directly
map_anthromes_ucs <- ggplot(anthromes_ucs_sp, aes(fill = factor(colors, levels = unique(anthromes_ucs_legend$colors)))) +
  geom_sf(color = NA) +
  scale_fill_manual(values = anthromes_ucs_legend$colors, 
                    labels = anthromes_ucs_legend$tipology,
                    name = "Tipology") +  
  labs(title = "Land Use and Cover Anthromes", 
       subtitle = "Study Area: State of São Paulo", 
       x = "Longitude", 
       y = "Latitude") +
  theme_minimal()

print(map_anthromes_ucs)

Source: the authors (2024). Caption: the image generated by the script represents the anthromes of land use and cover in the State of São Paulo. In detail, in ggplot() we inserted the name of the set (anthromes_ucs_sp) and indicated that the polygons were plotted with the fills (fill) present in the set, following and pairing with the colors (colors) present in the set anthromes_ucs_sp_colors. In geom_sf we determined that the edges of the polygons were not represented (color = NA), so as not to interfere with the colors represented by the fill.

Additionally, in scale_fill_manual() we determine the format for constructing the legend present in the mapping; in values = caption_anthromes_ucs\(colors* we indicate that the values would be the colors present in the set **anthromes_ucs_legend**, while in *labels = caption_anthromes_ucs\)typology we indicate the categories that should be present in the legend. Finally, in name = “typology” we determine the name written on the caption.

In labs() we determine graphic aspects for the plot, such as title (title), subtitle (subtitle), graphic dimensions x (longitude) and y (latitude). Finally, in theme_minimal() we have the graphical representation format of the mapping, such as lines and visual distributions, not requiring adjustments for this.

With this, we were able to carry out static mapping of soil use and land cover anthromes in the State of São Paulo. Analysis to validate the mapping will be carried out later. From this point, we started building the interactive mapping, using the Open Street Maps (OSM) database.

3.5 Interactive Mapping of anthromes of Use and Coverage in São Paulo

Once the static mapping of land use and land cover anthromes in the State of São Paulo was completed, we moved on to interactive mapping. The interactive mapping aimed to spatialize the classified results in a dynamic layout, which could be integrated with other desktop, mobile or web platforms. To this end, the procedure for creating the interactive mapping is represented below. It is noteworthy that in attempts not reported in this work, the anthromes_ucs_sp data set required a large amount of processing time, compromising the computer’s memory. With this in mind, we started using the 8 subsets that made up the larger set, that is, those 8 sets that dealt with the different typologies of land use and land cover anthromes, namely:

  1. ucs_mixed_urban
  2. ucs_agricultural_villages
  3. ucs_agricultural_lands
  4. ucs_livestock_lands
  5. ucs_forestry
  6. ucs_sustainable_use_forest
  7. ucs_preservation_conservation
  8. ucs_waters

That said, the first step in building the interactive mapping was to determine that the 8 subsets had the same geometric projection base or coordinate system (crs) as Open Street Maps (OSM). To do this, we use the st_transform() function on the 8 subsets, selecting the crs +proj=longlat +datum=WGS84 in st_crs(). At this stage, the names of the subsets were kept the same, as the change indicated by the functions only protects the data projection structure.

Once the crs transformation of the subsets was carried out, we proceeded to merge the polygons that made up the subsets. As we presented previously, the attempt to create interactive mapping with fragmented data required a lot of processing time and did not achieve the expected result. Therefore, we fused the polygons using the st_union() function, which allowed the construction of multipolygon layers. For each of the 8 subsets, layers were constructed to represent them, namely:

  1. layer_ucs_mixed_urban
  2. layer_ucs_agricultural_lands
  3. layer_ucs_livestock_lands
  4. layer_ucs_sustainable_use_forest
  5. layer_ucs_silviculture
  6. layer_ucs_preservation_conservation
  7. layer_ucs_agricultural_villages
  8. layer_ucs_waters

Once the 8 layers were structured for the subsets of land use and land cover, we carried out the layer simplification procedure. The objective of this procedure was also to reduce processing time and memory demand for building the interactive mapping. According to the literature, this is a common procedure when the amount of data to be mapped is significant (more than 200 thousand polygons in the State of São Paulo, for example). Therefore, following the authors’ suggestions, we applied a tolerance index (tolerancia_deg) of 0.001, which is a value that would not harm the resolution of the interactive mapping.

After determining the tolerance index for simplification, we built the function to simplify the multipolygons together. The function established the structure for simplification to occur, in each of the 8 layers, in a unique way, using the basis of the tolerance index for simplification. The use of this function culminated in the creation of 8 simplified layers referring to the different types of use and coverage, namely:

  1. layer_ucs_mixed_urban_simplified
  2. layer_ucs_agricultural_lands_simplified
  3. layer_ucs_livestock_lands_simplified
  4. layer_ucs_sustainable_use_forest_simplified
  5. layer_ucs_silviculture_simplified
  6. layer_ucs_preservation_conservation_simplified
  7. layer_ucs_agricultural_villages_simplified
  8. layer_ucs_waters_simplified

Once the simplification stage is complete, we move on to organizing the layout of the layers to be plotted in the interactive mapping. This organization was due to the structure of how the data was organized and distributed in the XY dimensions. In order for the layers to be processed in a single way (reducing processing time), the data that made up the layers needed to be read in a single way. Therefore, we assigned two other keys “[]” to the list, so that the structure could be read individually, as illustrated in the following code.

Script: Organization for reading the layer in a unique way.

# Simplified Artificial Area Layer
mixed_urban_multipolygons <- layer_ucs_mixed_urban_simplified[[1]]  
agricultural_lands_multipolygons <- layer_ucs_agricultural_lands_simplified[[1]]
livestock_lands_multipolygons <- layer_ucs_livestock_lands_simplified[[1]]
sustainable_use_forest_multipolygons <- layer_ucs_sustainable_use_forest_simplified[[1]]
silviculture_multipolygons <- layer_ucs_silviculture_simplified[[1]]
preservation_conservation_multipolygons <- layer_ucs_preservation_conservation_simplified[[1]]
agricultural_villages_multipolygons <- layer_ucs_agricultural_villages_simplified[[1]]
waters_multipolygons <- layer_ucs_waters_simplified[[1]]

Source: the authors (2024). Legend: code demonstrating the format for establishing a single reading per layer of multipolygons, in order to reduce data processing time during information mapping.

As noted in the code above, 8 layers were structured in the model for representation in interactive mapping. Each of them represents a typology of land use and land cover anthromes to be included in the mapping. The constructed layers are:

  1. Mixed urban: urban_mixed_multipolygons
  2. Agricultural land: agricultural_lands_multipolygons
  3. Livestock lands: livestock_lands_multipolygons
  4. Sustainable use forests: sustainable_use_forest_multipolygons
  5. Silvicuture: silviculture_multipolygons
  6. Conservation and preservation areas: preservation_conservation_multipolygons
  7. Agricultural villages: agricultural_villages_multipolygons
  8. Waters: waters_multipolygons

With these 8 structured layers, we started building the interactive mapping using the leaflet() function. The basis (background) for mapping the data was the same used in the construction of the interactive mapping of populated anthromes, Open Street Maps (OSM). We reiterate that our choice was made due to the free nature of OSM services and the collaboration in the construction of mappings that this mapping source allows. Regarding the mapping structure, we direct the visualization to the region of the State of São Paulo through the setView() function, using the information lng = -47.9292, lat = -23.5505, zoom = 7 (longitude, latitude and zoom, respectively) for such.

The addPolygons() function was responsible for integrating the 8 layers into the mapping. In it, we direct the data set to be plotted (date), the color of the edges (color), the filling color of the polygons (fillColor), the opacity of the filling (fillOpacity) and the name indicator for that layer (popup) . We must point out that the colors used in the fillColor mapping attribute were the same as those established in anthromes_ucs_legend, in order to maintain the colorimetric standard used throughout the work. Additionally, we removed the edges in the layer representation (color = NA), so that they would not overlap other polygons, affecting the visualization of other layers.

Finally, we added layer by layer to the interactive mapping so that we could observe whether the processing time was optimized and that the computer would be able to generate the interactive mapping of the land use and land cover anthromes in the State of São Paulo. Therefore, the following code represents the model for building interactive layer mapping.

Script: Construction of interactive mapping of soil use and land cover anthromes in the State of São Paulo by layers.

# Create a Leaflet map based on OpenStreetMap and add the artificial area layer
anthromes_ucs_interactive <- leaflet() %>%
  addTiles() %>%  
  setView(lng = -47.9292, lat = -23.5505, zoom = 7) %>%  
  addPolygons(data = mixed_urban_multipolygons,
              color = NA,    
              fillColor = "#FF4747",  
              fillOpacity = 1,  
              popup = "Mixed Urban")  
  
# Add agricultural lands layer
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = agricultural_lands_multipolygons,
                             color = NA,
                             fillColor = "#F4B488",
                             fillOpacity = 1,
                             popup = "Agricultural Lands")

# Add pasture layer with management
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = livestock_lands_multipolygons,
                             color = NA,
                             fillColor = "#FFEEB7",
                             fillOpacity = 1,
                             popup = "Livestock Lands")

# Add forest occupation layer
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = sustainable_use_forest_multipolygons,
                             color = NA,
                             fillColor = "#A8D08D",
                             fillOpacity = 1,
                             popup = "Sustainable Use Forest")

# Add silviculture layer
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = silviculture_multipolygons,
                             color = NA,
                             fillColor = "#E2EFD9",
                             fillOpacity = 1,
                             popup = "Silviculture")

# Add preservation and conservation layer
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = preservation_conservation_multipolygons,
                             color = NA,
                             fillColor = "#538135",
                             fillOpacity = 1,
                             popup = " Preservation and Conservation")

#Add agricultural villages layer
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = agricultural_villages_multipolygons,
                             color = NA,
                             fillColor = "#ED833B",
                             fillOpacity = 1,
                             popup = "Agricultural Villages")

# Add waters layer
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% addPolygons(data = waters_multipolygons,
                             color = NA,
                             fillColor = "#000066",
                             fillOpacity = 1,
                             popup = "Waters")

# Add the legend to the map
anthromes_ucs_interactive <- anthromes_ucs_interactive %>% 
  addLegend(position = "bottomright",  
            colors = anthromes_ucs_legend$colors,  
            labels = anthromes_ucs_legend$tipology,  
            title = "Tipology")  

Source: the authors (2024). Caption: code illustrating the construction of the interactive mapping of anthropogenic biomes of land use and land cover in the State of São Paulo through the leaflet() function in the Open Street Maps database. Each command line represents the insertion of one of the layers that represent the different anthromes in the State.

Once the mapping was structured, we moved on to visualizing the interactive mapping of land use and land cover anthromes in the State of São Paulo.

Figure: Interactive mapping of land use and land cover anthromes in the State of São Paulo.

anthromes_ucs_interactive

Source: the authors (2024). Caption: interactive mapping of land use and land cover anthromes in the State of São Paulo, created using the leaflet() function and based on land use and land cover data produced and made available by the Brazilian Institute of Geography and Statistics ( IBGE). The data used in the mapping refers to the year 2010, the same year as the demographic data treated in the authors’ previous research.

The interactive mapping revealed that the spatialization of the data (layer distribution) occurred significantly. We can observe that the different layers were integrated into the territorial mosaic of the State, providing a representative image of the distribution of anthromes in the Federation Unit. Furthermore, we found that both the popups and the legend significantly help in understanding the cartographic plan.

A limitation we found in the mapping is associated with the representation of river basins and/or continental watercourses that were not represented in the interactive mapping. However, this limitation is not associated with the data manipulated, treated and mapped in this research. As we assessed in the data set made available by IBGE and in the metadata relating to it, we identified that river basins and watercourses with lower flow volumes are not represented in the data set. From this perspective, some of the polygons of other uses and coverage (other than water) overlap these regions, covering continental waters and leading to the absence of this information in the interactive mapping.

However, despite these limitations, the interactive mapping of land use and land cover significantly represents the distribution of these anthropogenic sectors in the State and serves as a guide for the construction of more accurate models, using, for example, regionalized data or orbital images. to improve information. In the next stage, we will carry out analyzes to validate and error the mapping of land use and land cover anthromes focusing on the product generated by the ggplot() function, that is, the static mapping of land use and land cover anthromes of the State of São Paulo.

3.6 Mapping Validation and Uncertainty Studies

Once the previous steps have been carried out, we move on to validating the mapping and studying the uncertainty associated with it. In this topic we use as a comparator the raw data from IBGE, those referring to soil uses and coverage filtered for the year 2010. On the other hand, the set analyzed was that structured with the different typologies of land use and coverage of anthromes and the their respective colors. Therefore, the sets we discussed in this topic were:

  • Comparator: coverage_sp_2010
  • Analyzed set: anthromes_ucs_sp

The following steps are the same as those carried out for validation and uncertainty in the mapping of populated anthromes. The choice for the same analytical sequence is justified by the synergy between the two models, so that the results could be joined later in the fusion of demographic data with land use and land cover data.

It should be noted that before carrying out the subsequent steps, we adjusted the coordinate system of the two sets (crs), establishing the crs “+proj=longlat +datum=WGS84” for both. To confirm that both sets were adjusted to the same coordinate system, we employ the if/else() function. The following table illustrates the crs adjustment and indicates whether or not both have the same system associated with the data.

Table: Adjustment of the coordinate system and checking.

## [1] "the CRS are the same"

Source: the authors (2024). Legend: confirmation of the adjustment of the coordinate system and verification of the similarity between the two sets.

The Script revealed that the adjustment of the set’s coordinate system was carried out correctly and that both sets are on the same crs. Given this inference, we continued the validation process and study of mapping uncertainty.

3.6.1 Overlap Analysis

As presented in the previous chapter, the overlap analysis assumes as a premise the identification of polygons of demarcated areas (urban and rural) that overlap in different legal registries, such as records of urban and rural properties in the bodies of the Federation, States and/or of municipalities (MAPA/INCRA, 2022; BRASIL, 2018; MMA, 2006; 2002). Therefore, at this stage, we carried out a comparison between the two sets mentioned above to assess whether the mapped data on land use and land cover anthromes overlapped exactly with the polygons determined for the areas identified (and demarcated) by IBGE.

Since the two sets, coverage_sp_2010 and anthromes_ucs_sp, are structured in the same format (simple feature collection), we perform the comparison using the st_intersection()* function, the which automatically checks the intersection between the two sets. In other words, the function performs a probe on the sets evaluating whether all the polygons (geometries) present in coverage_sp_2010 were represented in the set anthromes_ucs_sp. To facilitate the presentation of the results, we used the function if/else(), so that, at the end of the analysis using st_intersection(), the sentence none/some/all covering geometries_sp_2010 are present was presented in anthromes_ucs_sp (Table).

Table: Overlap analysis using st_intersection() function.

## All coverage_sp_2010 geometries are present in anthromes_ucs_sp.

Source: the authors (2024). Legend: table showing the schematic for the overlapping analysis of the sets, taking coverage_sp_2010 as the comparator (gold standard) and anthromes_ucs_sp as the analyzed set.

The result obtained demonstrated that all the geometries that appeared in the raw IBGE data (coverage_sp_2010) are present in the data referring to soil use and land cover anthromes. In other words, all the geometries that represent different land uses and covers in the data from the Brazilian Institute find geometric counterparts in the use and cover data of the São Paulo anthromes. It should be noted here that the comparison criteria used by the st_intersection() function are the geometry columns present in the two sets and, therefore, the other columns of the sets are disregarded in the analysis performed by the function.

3.6.2 Examination of Data Properties

In the previous chapter on populated anthromes, we examined the properties of the data, that is, an investigation of the similarity between the gold standard and the data generated in the research. Here, our procedure investigated the structural synergy between data from coverage_sp_2010 and data from anthromes_ucs_sp. Firstly, we checked whether the number of lines between the two sets was identical or different, using the if/else() function again (Table).

Table: Checking the number of lines in the sets.

## [1] "the number of lines is the same"

Source: the authors (2024). Legend: table showing the result of checking the number of lines in the two sets, carried out using the if/else() function.

As seen in the result generated by the function, we notice that the number of lines between the two sets is the same. Thus, we found that all the lines that make up the gold standard are present in the data set generated by the research (anthromes_ucs_sp). Sequentially, we investigate whether the minimum and maximum limits of latitude and longitude are the same between the two sets.

To carry out this analysis, it was necessary to extract the limits (minimum and maximum) of latitude and longitude of the two sets. This was done through the st_box() function, which returns a set with these limits for the gold standard and another for the anthrome data. With these sets of limits (limits_coverage_sp_2010 and limits_anthromes_ucs_sp), we use the functions min() and max() to extract, respectively, the minimum and maximum latitude and longitude values from the sets to comparison. The following table presents the results of the comparison.

Table: Comparison of the minimum and maximum latitude and longitude limits of the sets.

## Minimum latitude coverage_sp_2010: -25.30115
## Minimum latitude of anthromes_ucs_sp: -25.30115
## Maximum latitude of coverage_sp_2010: -19.78065
## Maximum latitude of anthromes_ucs_sp: -19.78065
## Minimum length of coverage_sp_2010: -53.10706
## Minimum longitude of anthromes_ucs_sp: -53.10706
## Maximum length of coverage_sp_2010: -44.16211
## Maximum longitude of anthromes_ucs_sp: -44.16211

Source: the authors (2024). Legend: table showing the minimum and maximum latitude and longitude values of coverage_sp_2010 and anthromes_ucs_sp.

According to the Table, we confirm that the minimum and maximum limits of latitude and longitude of the investigated sets are the same. Therefore, we have that the representation of both occurs within the same mapping quadrant.

Therefore, we found that there are no missing lines in anthromes_ucs_sp, that is, all lines of the gold standard coverage_sp_2010 are represented in the generated set. Furthermore, we identified that the coordinate systems (crs) and latitude and longitude limits of the two sets are the same. These inferences contributed significantly to the preliminary confirmation of the quality of the mapping generated throughout this investigation.

3.6.3 Mapping Summary Statistics

Intuiting to deepen our knowledge about the sets, in order to guarantee the quality of the generated mapping, we move on to the Mapping Summary Statistics. In this case, we investigated the two sets on a statistical grid, similar to the one constructed in the chapter on populated anthromes. The analytical script follows the same format as the demographic data, therefore, we only represent the results generated from the analysis here.

The first stage of work was to extract from the coverage_sp_2010 and anthromes_ucs_sp polygons the points that structured their vertices. This operation was performed to obtain the number of points within each quadrant of the statistical analysis grid. We justify this operation in dynamicity to calculate the precision of the points framed there, since the same polygon could be distributed between two quadrants of the grid, making statistical analysis unfeasible.

Therefore, using the st_coordinates() function, we extract the vertices (coordinate points) of the two aforementioned sets. Sequentially, we determine the creation of a data frame composed of the X and Y points of the points, longitude and latitude respectively. The table below presents the first 6 lines of the sets generated for the gold standard, coordinates_points_coverage_sp_2010_df, and the data on land use and land cover anthromes, coordinates_points_anthromes_df.

Table: first lines of the sets of coordinates (points) of coordinates_points_coverage_sp_2010_df and coordinates_points_anthromes_df.

# Verificar a estrutura do novo data frame
head(coordinates_points_coverage_sp_2010_df)
##           x         y
## 1 -49.88686 -23.05579
## 2 -49.89659 -23.05592
## 3 -49.89673 -23.04686
## 4 -49.88701 -23.04673
## 5 -49.88686 -23.05579
## 6 -49.88701 -23.04673
# Verificar a estrutura do novo data frame
head(coordinates_points_anthromes_df)
##           x         y
## 1 -49.87772 -23.01941
## 2 -49.88744 -23.01955
## 3 -49.88758 -23.01049
## 4 -49.87786 -23.01035
## 5 -49.87772 -23.01941
## 6 -49.87786 -23.01035

Source: the authors (2024). Legend: first lines of the sets coming from the coordinates of points extracted from coverage_sp_2010 (gold standard) and from anthromes_ucs_sp (anthromes of land use and cover).

After extracting the points and their respective coordinates, we proceed to structuring the statistical grid for analyzing the distribution of points. The first step was to extract the longitude and latitude limits from the gold standard, which was done using the st_bbox() function. With these limits, we determined a grid of 20 by 20 (height and width), that is, a statistical grid of 400 cells for counting points.

It is worth highlighting that we considered the Haversine Formula in this construction, so that the sphericity of the Earth was assumed as a premise for its construction. Below we present the global structure of this formula:

\[ a = \sin^2\left(\frac{\Delta\text{lat}}{2}\right) + \cos(\text{lat}_1) \cdot \cos(\text{lat}_2) \cdot \sin^2\left(\frac{\Delta\text{lon}}{2}\right) \]

\[ c = 2 \cdot \text{atan2}\left(\sqrt{a}, \sqrt{1-a}\right) \]

\[ d = R \cdot c \]

where:

  • \(\Delta\text{lat}\) is the difference in latitude between the two points,
  • \(\Delta\text{lon}\) is the difference in longitude between the two points,
  • \(\text{lat}_1\) e \(\text{lat}_2\) are the latitudes of the two points in radians,
  • \(R\) is the radius of the sphere (for example, the average radius of the Earth).

Therefore, after structuring the grid, we multiply by the conversion factor to integrate the sphericity of the planet into the statistical grid projection. When we finish this procedure, we calculate the width, height, area of the quadrant in km² and the average dimension of the sides of the square. The following table presents the results of this procedure.

Table: Dimensions of the quadrants of the statistical grid.

## Width (cell_size_x_km): 45.98414 km
## Height (cell_size_y_km): 11.77902 km
## Area of each quadrant of the grid in km²: 541.6479 km²
## Average dimension of the sides of the grid square in km: 23.27333 km

Source: the authors (2024). Legend: table showing the dimensions of the quadrants of the statistical grid.

Once the dimensions were determined, we proceeded to construct the grid for integration with data from the points of the two sets analyzed. For this we use the maximum and minimum latitude and longitude information, coming from st_bbox(). We use the X and Y dimensions to frame the grid and induce the creation of the grid, with the information presented above, through the st_make_grid() function.

After this final structuring of the grid, we begin integrating the points in each of the quadrants. This procedure was carried out using the st_join() function, where points_coverage_sp_2010 and points_anthromes were joined to the polygons of the statistical grid (grid_quadrants_pol). After joining, we counted points per quadrant, using a group_by(id) function so that the point data was integrated with the polygons for counting.

# Perform the spatial junction between coverage points and quadrant polygons
join_coverage_sp_2010_grid <- st_join(points_coverage_sp_2010, grid_quadrants_pol)

# Check the join result
print(join_coverage_sp_2010_grid)
## Simple feature collection with 1241085 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -53.10706 ymin: -25.30115 xmax: -44.16211 ymax: -19.78065
## Geodetic CRS:  +proj=longlat +datum=WGS84
## First 10 features:
##     id                    geometry
## 1  168 POINT (-49.88686 -23.05579)
## 2  168 POINT (-49.89659 -23.05592)
## 3  168 POINT (-49.89673 -23.04686)
## 4  168 POINT (-49.88701 -23.04673)
## 5  168 POINT (-49.88686 -23.05579)
## 6  168 POINT (-49.88701 -23.04673)
## 7  168 POINT (-49.89673 -23.04686)
## 8  168  POINT (-49.89687 -23.0378)
## 9  168 POINT (-49.88715 -23.03767)
## 10 168 POINT (-49.88701 -23.04673)
# Count the number of points in each quadrant
count_by_quadrant_coverage_sp_2010 <- join_coverage_sp_2010_grid %>%
  group_by(id) %>%
  summarize(num_points_coverage_sp_2010 = n())

# Join with the complete grid to fill the quadrants without points
count_by_quadrant_coverage_sp_2010 <- st_join(grid_quadrants_pol, count_by_quadrant_coverage_sp_2010, by = "id")

# Fill NA values in point count with zero
count_by_quadrant_coverage_sp_2010$num_points_coverage_sp_2010[is.na(count_by_quadrant_coverage_sp_2010$num_points_coverage_sp_2010)] <- 0

The following script presents summaries of the joins performed and the count itself. It is noteworthy that the results of the codes summarize the first lines of the sets count_by_quadrant_coverage_sp_2010 and count_by_quadrant_anthromes.

Script: synthesis of point count per quadrant of the statistical grid.

# Check the count result for coverage_sp_2010
print(count_by_quadrant_coverage_sp_2010)
## Simple feature collection with 400 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -53.10706 ymin: -25.30115 xmax: -44.16211 ymax: -19.78065
## Geodetic CRS:  +proj=longlat +datum=WGS84
## First 10 features:
##    id.x id.y num_points_coverage_sp_2010                              x
## 1     1   NA                           0 POLYGON ((-53.10706 -25.301...
## 2     2   NA                           0 POLYGON ((-52.65982 -25.301...
## 3     3   NA                           0 POLYGON ((-52.21257 -25.301...
## 4     4   NA                           0 POLYGON ((-51.76532 -25.301...
## 5     5   NA                           0 POLYGON ((-51.31807 -25.301...
## 6     6   NA                           0 POLYGON ((-50.87082 -25.301...
## 7     7   NA                           0 POLYGON ((-50.42358 -25.301...
## 8     8   NA                           0 POLYGON ((-49.97633 -25.301...
## 9     9   NA                           0 POLYGON ((-49.52908 -25.301...
## 10   10   NA                           0 POLYGON ((-49.08183 -25.301...
# Check the count result for anthromes
print(count_by_quadrant_anthromes)
## Simple feature collection with 400 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -53.10706 ymin: -25.30115 xmax: -44.16211 ymax: -19.78065
## Geodetic CRS:  +proj=longlat +datum=WGS84
## First 10 features:
##    id.x id.y num_points_anthromes                              x
## 1     1   NA                    0 POLYGON ((-53.10706 -25.301...
## 2     2   NA                    0 POLYGON ((-52.65982 -25.301...
## 3     3   NA                    0 POLYGON ((-52.21257 -25.301...
## 4     4   NA                    0 POLYGON ((-51.76532 -25.301...
## 5     5   NA                    0 POLYGON ((-51.31807 -25.301...
## 6     6   NA                    0 POLYGON ((-50.87082 -25.301...
## 7     7   NA                    0 POLYGON ((-50.42358 -25.301...
## 8     8   NA                    0 POLYGON ((-49.97633 -25.301...
## 9     9   NA                    0 POLYGON ((-49.52908 -25.301...
## 10   10   NA                    0 POLYGON ((-49.08183 -25.301...

Source: the authors (2024). Legend: summary of the point counting procedure, which contains only the first lines of the sets and the respective descriptions of each of them (structure of the data sets).

After counting points in each quadrant of the statistical grid, we move on to visualizing this data. The mapping of the statistical grid information was carried out using the ggplot() function, in order to improve the visualization of the data in the grid. In the following two mappings (Figures), we present the number of points in each of the quadrants of the 20 by 20 (400) cell statistical grid. The color was randomly established, with no association with the colorimetric standard used in the previous stages of our analyses.

Figure: Mapping of the statistical grid for coverage_sp_2010 data.

Source: the authors (2024). Legend: statistical grid referring to count_by_quadrant_anthromes data, where the number of points in each of the quadrants of the statistical grid are presented.

Figure: Mapping of the statistical grid for data from **anthromes_ucs_sp*

Source: the authors (2024). Legend: statistical grid referring to count_by_quadrant_anthromes data, where the number of points in each of the quadrants of the statistical grid are presented.

As a way of attesting the distribution of this data in the statistical grid and comparing the distribution of points between the two sets, we structured a data frame with the number of points in the sets. Therefore, we isolated the information on the number of points from count_by_quadrant_coverage_sp_2010 and count_by_quadrant_anthromes and created a new data set, named comparative_data.

In it, we also insert a column called difference, in which the values of the columns with the number of points from num_points_coverage_sp_2010 are subtracted by num_points_anthromes. Once these structural procedures were carried out, we used the if/else() function to evaluate whether there was any line in the comparative_data set in which differences were presented between the two compared sets. The following table shows the result of the analysis.

Table: Comparison between the number of points in the quadrants of the coverage_sp_2010 and anthromes_ucs_sp sets.

## Sets have the same number of points in each quadrant.

Source: the authors (2024). Caption: the table presents the result of the comparison between the number of points in the sets from coverage_sp_2010 and anthromes_ucs_sp. It indicates that there is no difference between the distribution of points in the two statistical grids structured throughout the analysis.

The result presented by the code above demonstrates that all points present in the set coverage_sp_2010 are equally represented by the set anthromes_ucs_sp. This demonstrates that the data were spatialized correctly and guarantees the quality of the distribution of polygons, both in the statistical grid and in the mapping generated throughout this analysis.

3.6.3.1 Structuring the Confusion Matrix for Statistical Analysis

With the results generated by counting points in a statistical grid, we started to produce the confusion matrix. The sets used to structure the confusion matrix were structured in the previous step, namely:

  • Gold standard (IBGE), coming from coverage_sp_2010 data: coordinates_points_coverage_sp_2010_df
  • Analyzed data on anthromes of use and coverage, from anthromes_ucs_sp: coordinates_points_anthromes_df

As we presented in the previous chapter, where we analyzed data from populated anthromes, the confusion matrix was used so that we could carry out analyzes of: sensitivity, specificity and accuracy and global error. It is worth highlighting that this matrix also aims to identify True Positives (TPs), True Negatives (TNs), False Positives (FPs) and False Negatives (FNs), which represent respectively in this analysis:

  1. True Positives (TPs): quadrants of the statistical grid where the number of points is equal between coordinates_points_coverage_sp_2010_df and coordinates_points_anthromes_df;
  2. True Negatives (TNs): quadrants of the statistical grid where there are no points represented in both coordinates_points_coverage_sp_2010_df and coordinates_points_anthromes_df;
  3. False Negatives (FNs): quadrants of the statistical grid where the number of points in coordinates_points_coverage_sp_2010_df is greater than the number of points in coordinates_points_anthromes_df.
  4. False Positives (FPs): quadrants of the statistical grid where the number of points in coordinates_points_coverage_sp_2010_df is smaller than the number of points in coordinates_points_anthromes_df.

Therefore, we started combining the sets into a single data frame, establishing columns with the number of points in the respective quadrants. To do this, we used the data.frame() function to create the ucs_combined_set which contained the points (vertices of the polygons of the two sets) divided by quadrants (statistical grid of 400 cells).

After this procedure, we use the if/else() function to identify the TPs, TNs, FPs and FNs present in the ucs_combined_set. The rules established for this function are the same 4 presented above, but in code formatting 0 and 1. Since the ucs_combined_set is a data frame of 400 lines, we have structured the following Table as a summary of the number of TPs, TNs, FPs and FNs identified by the analysis.

Table: Analytical summary of the number of TPs, TNs, FPs and FNs identified in the analysis.

##    TP  TN FP FN
## 1 222 178  0  0

Source: the authors (2024). Legend: table showing the number of True Positives (TPs), True Negatives (TNs), False Positives (FPs) and False Negatives (FNs) identified during the comparison between the gold standard (IBGE) and data on usage and coverage anthromes from soil.

Based on the results presented in the Table, we found that the vertices of the set of populated anthromes are strongly linked to the gold standard data (IBGE). According to the literature, the absence of FPs and FNs demonstrates that there was no significant error in framing the points of the two sets. Furthermore, the TPs and TNs demonstrate the efficiency of the model in distributing the points referring to the vertices, as presented in the literature, reiterating the quality of the data spatialization. Therefore, we understand that the mapping of land use and land cover anthromes sufficiently represents the polygonal distribution of IBGE data and guarantees the quality of geospatial information in the State of São Paulo.

With these results, we move forward with the sensitivity, specificity and accuracy and global error metrics.

3.6.3.2 Mapping Sensitivity and Specificity

When it comes to mapping sensitivity, we use the same formula presented in the previous chapter for the analysis, namely:

\[\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}\]

where TP represents true positives and FN represents false positives.

The objective of this analysis was to identify whether the mapping model used for the anthromes of use and coverage was able to correctly identify the points (polygon vertices) present in the gold standard. The following Script represents the function and the result obtained for this metric.

Script: Model Sensitivity.

#Calculating Sensitivity
sensitivity_ucs <- sum(ucs_combined_set$TP) / (sum(ucs_combined_set$TP) + sum(ucs_combined_set$FN))

#Displaying the result
sensitivity_ucs
## [1] 1

Source: the authors (2024). Legend: script representing the coded format for calculating the sensitivity of the model used in mapping soil use and land cover anthromes and the result obtained.

According to the result presented above, we obtained a value of 1 for the sensitivity metric. This value reveals that 100% of the vertices present in coverage_sp_2010 are equally represented in the set anthromes_ucs_sp. According to the literature, this value represents the quality of the model in spatializing geospatial information, indicating its efficiency in spatially reproducing this geographic information.

Another metric that contributed to the evaluation of the model for mapping soil use and land cover anthromes was specificity. According to the literature, this metric aims to identify points present in anthromes_ucs_sp that were not present in coverage_sp_2010. In this horizon, the specificity concerns the analysis of True Negatives between the sets and its formula is presented below:

\[\text{Specificity} = \frac{TN}{TN + FP}\]

where TN represents the number of quadrants identified as True Negatives and FN the number of False Positives.

In the following Script, we present the calculation of the model specificity and the generated result.

Script: Model specificity.

#Calculating Specificity
specificity_ucs <- sum(ucs_combined_set$TN) / (sum(ucs_combined_set$TN) + sum(ucs_combined_set$FP))

#Displaying the result
print(specificity_ucs)
## [1] 1

Source: the authors (2024). Legend: script representing the coded format for calculating the specificity of the model used in mapping soil use and land cover anthromes and the result obtained.

The specificity calculation returned a value of 1, that is, there was a representation of 100% accuracy of the anthromes model compared to the gold standard. From this perspective, we note that the model was able to correctly represent all the points present in coverage_sp_2010 in the mapping of anthromes_ucs_sp data. Thus, assuming what is postulated in the literature, we were able to identify another metric that points to the quality of the mapping model for anthromes, taking into account their specificity.

3.6.3.3 Global Mapping Accuracy and Error

In addition to the analyzes carried out above, we focused on global mapping accuracy and error. According to the literature, the accuracy calculation aims to identify the number of points (polygon vertices) of anthromes_ucs_sp that were correctly mapped when compared to coverage_sp_2010 (gold standard). Therefore, this metric also deals with the number of True Positives and its formula is given by:

\[\text{Global Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

where TP, which represents the grid cells in which the number of points is equal in anthromes mapping and in the gold standard, is divided by the total cells of the statistical grid, that is, by the sum of cells with the same number of points (TPs ) and different (FPs and FNs) between the sets.

The following Script represents the formula used in R to calculate global accuracy and the generated result.

Script: Global accuracy of the model.

#Calculating Global Accuracy
global_acuracy_ucs <- (sum(ucs_combined_set$TP) + sum(ucs_combined_set$TN)) / (sum(ucs_combined_set$TP) + sum(ucs_combined_set$TN) + sum(ucs_combined_set$FP) + sum(ucs_combined_set$FN))

#Displaying the result
print(global_acuracy_ucs)
## [1] 1

Source: the authors (2024). Caption: script representing the coded format for calculating the global accuracy of the model used in mapping soil use and land cover anthromes and the result obtained.

The estimate (value) obtained by calculating the global accuracy was 1, that is, 100% of the vertices of anthromes_ucs_sp were correctly mapped against the vertices of the gold standard. According to the literature, this high correspondence rate reinforces the quality of the model used in the analysis, pointing to the quality of the information generated, the mapping.

On the other hand, the calculation of the global error deals with erroneously mapped geospatial information, as the literature points out. Thus, this metric refers to the proportion of False Negatives and False Positives present in the comparison between the sets. Its formula is given by:

\[\text{Global Error} = \frac{FP + FN}{TP + TN + FP + FN}\]

where we have the sum of FPs and FNs, areas with a divergent number of points between anthromes and the gold standard, divided by the total number of cells with points.

In the following script, the formula is represented using R coding and then the result is generated.

Script: Global model error.

#Calculating the Global Error
global_error_ucs <- (sum(ucs_combined_set$FP) + sum(ucs_combined_set$FN)) / (sum(ucs_combined_set$TP) + sum(ucs_combined_set$TN) + sum(ucs_combined_set$FP) + sum(ucs_combined_set$FN))

#Displaying the result
print(global_error_ucs)
## [1] 0

Source: the authors (2024). Legend: script representing the coded format for calculating the global error of the model used in mapping soil use and land cover anthromes and the result obtained.

The estimate (value) obtained for the global error was 0, that is, there was no error in the mapping model in the spatialization of geospatial information. In other words, we observed complete correspondence between the vertices of the polygons of anthromes_ucs_sp when compared to coverage_sp_2010. Given this result, we obtained other indicators that point to the quality of the anthromes model and, consequently, the mapping generated by it.

3.6.4 Summary of Mapping Statistics and Visualization of Results

Moving towards the end of this chapter, we present here a summary of the results obtained throughout our investigations. Below, we have brought a table where the estimates (values and percentage) for the metrics of sensitivity, specificity, accuracy and global error are presented.

Table: Summary of the metrics used in the analysis of mapping quality.

##            Metric Estimates Percentage
## 1     Sensitivity         1       100%
## 2     Specificity         1       100%
## 3 Global Accuracy         1       100%
## 4    Global Error         0         0%

Source: the authors (2024). Legend: table representing numerical and percentage values for the metrics investigated: sensitivity, specificity, global accuracy and global error.

As we presented previously, the estimates for the 4 metrics reported in the Table reflect the quality of the model used in mapping land use and land cover anthromes. We noticed that the model was able to efficiently spatialize the geospatial information present in anthromes_ucs_sp when compared to the spatialization of the gold standard data (coverage_sp_2010). The metrics presented above reinforce these statements and the literature demonstrates that these estimates are significantly positive for the operability of the mapping model.

In order to visually represent these metrics, we built a histogram, where the contribution of each of the metrics to the model evaluation was clearly visible. The following graph represents the histogram of the metrics applied in the previous analyses.

Graph: Histogram of the metrics investigated.

Source: the authors (2024). Legend: histogram representing the contribution of the metrics sensitivity, specificity, global accuracy and global error in the analysis of the quality of the mapping model of land use and land cover anthromes.

In view of the graph, we visibly notice that the specificity, sensitivity and global accuracy metrics reach the value of 1, that is, 100% utilization for the quality of the mapping. Linearly, the global error being 0 also indicates that the model used does not have representative errors in the spatialization of information, contributing significantly to the quality of the model used.

Given our alignment with Metrology, we use the radar chart format to visualize the results. This model, characteristic of Metrological Sciences, represents the level of distribution of estimates obtained on a target, as illustrated in the following graph.

Graph: Radar graph of the metrics investigated.

## `geom_path()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Source: the authors (2024). Legend: radar graph representing the estimates obtained for the metrics investigated. The center of the graph represents the value of 0, while the outermost line represents the value of 1.

Visually, the radar graph makes clear the distribution of estimates obtained for the metrics that evaluated the model. Again, the points of sensitivity, specificity and global accuracy are observed at a margin of 1 (100% success rate), while the global error is in the center of the graph, pointing to a value of 0. With this, we conclude the synthesis of the results and we present the distribution of estimates that guaranteed the quality of the model used in this research.

4 Conclusions

Throughout this stage of research, we were able to identify the attributes necessary for the construction of the mapping model of soil use and land cover anthromes in the State of São Paulo. We built a codified structure that allowed the mining and manipulation of raw IBGE data to fit them into the territorial subdivision profile of anthromes. Based on this data partitioning, we established a specific format for mapping land use and land cover anthromes in the Federation Unit, using the same guidelines applied to the mapping of populated anthromes.

We generated isolated mappings for each of the classes of land use and land cover anthromes and, subsequently, produced static and interactive mappings for anthromes in São Paulo. We note that both were built on a solid foundation, considering validation metrics and studying mapping uncertainty. From this perspective, the statistical analyzes carried out throughout the research indicated the quality of the model and the product generated by it, whether in terms of the spatialization of information or in the cartography generated.

Given the promising results, we assumed as an investigative horizon the further reproducibility of the model for other Federation Units using the same data source, IBGE. Another perspective for the analysis concerns the use of local data, in order to reduce polygonal dimensions and refine the mapping constructed here.