This case study was conducted to see how the data obtained through the last clinic availability survey reflects on a map.
The data downloaded corresponds to the latest survey, conducted in 2019. It was obtained from the open data resource offered by the Paraguayan government through their website.
The data was formatted as a CSV delimited by semicolons.
#Loading packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Loading Data-set
clinics_raw_data <- read.csv(
"Datos/health_care_personel.csv", sep = ";")
After loading the data, I inspected the table with glimpse to see data types and variable (column) names:
glimpse(clinics_raw_data)
## Rows: 38
## Columns: 10
## $ Ano.y.Region.Sanitaria <chr> " T…
## $ Total <dbl> 35.519,…
## $ Categoria.de.Ocupacion....Medicos <dbl> 10.936,…
## $ Categoria.de.Ocupacion....Quimicos.y.Bioquimicos <dbl> 1.666, …
## $ Categoria.de.Ocupacion....Odontologos <dbl> 1.153, …
## $ Categoria.de.Ocupacion....Licenciados.Enfermeria...Obstetricia <dbl> 11.714,…
## $ Categoria.de.Ocupacion....Otros.profesionales.universitarios <chr> "652", …
## $ Categoria.de.Ocupacion....Auxiliares.de.blanco <dbl> 4.013, …
## $ Categoria.de.Ocupacion....Tecnicos <dbl> 4.816, …
## $ Categoria.de.Ocupacion....Personal.de.apoyo <int> 569, 20…
The data appears to be the result of some sort of pivoting, with year and region in the same column (Ano.y.Region (Year.and.Region)). The rest of the columns contain numeric type of values, but had been introduced with “.” as thousand separators, which leads R to interpret them as double data type.
Another probable result of the pivoting is the spaces before the names in the Ano.y.Region variable.
As the output shows, the variable names are long and describe too many attributes of the observations. To be able to perform analysis in a ordered manner and improve code readability, I changed the variable names to shorter ones, using snake case where needed.
clinics_raw_data <- clinics_raw_data %>% rename(
region = Ano.y.Region.Sanitaria,
clinics = Categoria.de.Ocupacion....Medicos,
chemestry_prof = Categoria.de.Ocupacion....Quimicos.y.Bioquimicos,
dentists = Categoria.de.Ocupacion....Odontologos,
nurses = Categoria.de.Ocupacion....Licenciados.Enfermeria...Obstetricia,
interns = Categoria.de.Ocupacion....Otros.profesionales.universitarios,
asistants = Categoria.de.Ocupacion....Auxiliares.de.blanco,
technicians = Categoria.de.Ocupacion....Tecnicos,
support = Categoria.de.Ocupacion....Personal.de.apoyo
)
The resulting table:
head(clinics_raw_data)
## region Total clinics chemestry_prof dentists nurses
## 1 Total Pais Ano 2018 35.519 10.936 1.666 1.153 11.714
## 2 Asuncion 9.117 2.878 638.000 346.000 2.785
## 3 Concepcion 1.002 299.000 30.000 20.000 420.000
## 4 San Pedro 1.804 402.000 48.000 29.000 742.000
## 5 Cordillera 1.275 328.000 39.000 44.000 414.000
## 6 Guaira 1.191 363.000 59.000 31.000 399.000
## interns asistants technicians support
## 1 652 4.013 4.816 569
## 2 321 749.000 1.200 200
## 3 16 144.000 57.000 16
## 4 10 283.000 269.000 21
## 5 13 176.000 255.000 6
## 6 8 174.000 151.000 6
To address the spaces at the beginning of the names in the “region” variable, I used the function str_squish. I also standarized the casing of the region to ease merging.
clinics_raw_data <- clinics_raw_data %>%
mutate(region = str_squish(region)) %>%
mutate(region = tolower(region))
Resulting table:
head(clinics_raw_data)
## region Total clinics chemestry_prof dentists nurses interns
## 1 total pais ano 2018 35.519 10.936 1.666 1.153 11.714 652
## 2 asuncion 9.117 2.878 638.000 346.000 2.785 321
## 3 concepcion 1.002 299.000 30.000 20.000 420.000 16
## 4 san pedro 1.804 402.000 48.000 29.000 742.000 10
## 5 cordillera 1.275 328.000 39.000 44.000 414.000 13
## 6 guaira 1.191 363.000 59.000 31.000 399.000 8
## asistants technicians support
## 1 4.013 4.816 569
## 2 749.000 1.200 200
## 3 144.000 57.000 16
## 4 283.000 269.000 21
## 5 176.000 255.000 6
## 6 174.000 151.000 6
Now, since the “clinics” variable actually contains integers and not doubles, I will convert them to string, remove the period as a separator and then convert them to integers:
clinics_raw_data$clinics <- sapply(
clinics_raw_data$clinics, function(v) {
as.integer(gsub("\\.","", as.character(v)))
})
clinics_raw_data$nurses <- sapply(
clinics_raw_data$nurses, function(v) {
as.integer(gsub("\\.","", as.character(v)))
})
glimpse(clinics_raw_data)
## Rows: 38
## Columns: 10
## $ region <chr> "total pais ano 2018", "asuncion", "concepcion", "san p…
## $ Total <dbl> 35.519, 9.117, 1.002, 1.804, 1.275, 1.191, 1.599, 846.0…
## $ clinics <int> 10936, 2878, 299, 402, 328, 363, 414, 215, 555, 261, 27…
## $ chemestry_prof <dbl> 1.666, 638.000, 30.000, 48.000, 39.000, 59.000, 45.000,…
## $ dentists <dbl> 1.153, 346.000, 20.000, 29.000, 44.000, 31.000, 28.000,…
## $ nurses <int> 11714, 2785, 420, 742, 414, 399, 657, 259, 375, 372, 40…
## $ interns <chr> "652", "321", "16", "10", "13", "8", "20", "2", "16", "…
## $ asistants <dbl> 4.013, 749.000, 144.000, 283.000, 176.000, 174.000, 213…
## $ technicians <dbl> 4.816, 1.200, 57.000, 269.000, 255.000, 151.000, 191.00…
## $ support <int> 569, 200, 16, 21, 6, 6, 31, 8, 14, 5, 9, 28, 126, 12, 4…
The table contains other data regarding other professionals, for the purpose of this case study, this data can be discarded. Also, since this follows a pivot table like format, both region and year are located within the same column. To address this I will separate them in two columns, and reduce the scope of the data to only the year 2019.
#Separate year from region
clinics_raw_data <- add_column(
clinics_raw_data, "year" = 2019, .after = "region"
)
#Reduce data to 2019
clinics_clean_data <- clinics_raw_data[-(1:20),]
#Reduce columns to relevant ones
clinics_clean_data <- subset(
clinics_clean_data, select = -c(
Total,
interns,
asistants,
technicians,
support))
The final table, with all the conditions to be used is the following:
glimpse(clinics_clean_data)
## Rows: 18
## Columns: 6
## $ region <chr> "asuncion", "concepcion", "san pedro", "cordillera", "g…
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ clinics <int> 3153, 328, 455, 366, 396, 453, 248, 614, 332, 310, 580,…
## $ chemestry_prof <dbl> 656, 32, 49, 41, 56, 49, 29, 58, 38, 42, 61, 527, 14, 2…
## $ dentists <dbl> 350, 19, 29, 48, 32, 31, 30, 51, 24, 53, 36, 401, 16, 1…
## $ nurses <int> 2954, 443, 783, 399, 473, 669, 262, 447, 399, 418, 630,…
To link the tabular data to spatial points, I loaded this shape file containing points representing state capitals.
#Loading Geospatial data
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
capital_points <- read_sf("Datos/shp/cap_dep_puntos.shp")
The points are referenced by an “id” number standardized in Paraguay. To perform a join between these two data sets (clinics_clean_data and capital_points), I loaded this reference table.
points_ref <-
read.csv("C:/Users/pmedi/OneDrive/Documentos/departments.csv")
head(points_ref)
## id department
## 1 17 alto paraguay
## 2 10 alto parana
## 3 13 amambay
## 4 0 asuncion
## 5 5 caaguazu
## 6 6 caazapa
First, merging the data from the reference table to get the id’s.
clinics_merged_data <- merge(
x = clinics_clean_data,
y = points_ref,
by.x = "region",
by.y = "department",
all = TRUE)
glimpse(clinics_merged_data)
## Rows: 18
## Columns: 7
## $ region <chr> "alto paraguay", "alto parana", "amambay", "asuncion", …
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ clinics <int> 56, 580, 220, 3153, 102, 453, 248, 247, 385, 328, 366, …
## $ chemestry_prof <dbl> 4, 61, 26, 656, 5, 49, 29, 18, 527, 32, 41, 56, 58, 38,…
## $ dentists <dbl> 2, 36, 15, 350, 4, 31, 30, 20, 401, 19, 48, 32, 51, 24,…
## $ nurses <int> 74, 630, 224, 2954, 75, 669, 262, 248, 3488, 443, 399, …
## $ id <int> 17, 10, 13, 0, 16, 5, 6, 14, 11, 1, 3, 4, 7, 8, 12, 9, …
#Merging spatial data by id
clinics_geo_data <- merge(
x = capital_points,
y = clinics_merged_data,
by.x = "Id",
by.y = "id",
all = TRUE
)
glimpse(clinics_geo_data)
## Rows: 18
## Columns: 16
## $ Id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ AREA <dbl> 104580000, 15942773, 2506053, 2698472, 0, 18259027, 219…
## $ PERIMETER <dbl> 68809.055, 26160.950, 10296.553, 8570.629, 0.000, 22481…
## $ CIUDADES_ <dbl> 214, 11, 21, 80, 0, 91, 155, 215, 0, 0, 95, 0, 0, 6, 18…
## $ CIUDADES_I <dbl> 1000, 10, 20, 125, 0, 139, 203, 1001, 0, 0, 143, 0, 0, …
## $ NOMBRE <chr> "ASUNCION", "Concepción", "San Pedro", "Caacupe", "Vill…
## $ TIPO <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0
## $ Hectares <dbl> 10457.9984, 1594.2773, 250.6053, 269.8472, 0.0000, 1825…
## $ clinics.x <dbl> 2878, 299, 402, 328, 363, 414, 215, 555, 261, 279, 532,…
## $ region <chr> "asuncion", "concepcion", "san pedro", "cordillera", "g…
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ clinics.y <int> 3153, 328, 455, 366, 396, 453, 248, 614, 332, 310, 580,…
## $ chemestry_prof <dbl> 656, 32, 49, 41, 56, 49, 29, 58, 38, 42, 61, 527, 14, 2…
## $ dentists <dbl> 350, 19, 29, 48, 32, 31, 30, 51, 24, 53, 36, 401, 16, 1…
## $ nurses <int> 2954, 443, 783, 399, 473, 669, 262, 447, 399, 418, 630,…
## $ geometry <POINT [m]> POINT (439963 7203105), POINT (455417.2 7411569), POINT…
Displaying the spatial data:
#Displaying spatial data
library(mapview)
clinics_geo_data <- clinics_geo_data %>% as_Spatial
mapview(clinics_geo_data)
First, I created a grid covering the paraguayan territory with the points where to extrapolate the data.
#Loading grid
library(sp)
grid_py <- read_sf("Datos/shp/py_grid.shp")
grid_py <- grid_py %>% as_Spatial
grid <- as.data.frame(spsample(grid_py, "regular", n=5000000))
names(grid) <- c("x", "y")
coordinates(grid) <- c("x", "y")
gridded(grid) <- TRUE
fullgrid(grid) <- TRUE
#Equalizing the projections
proj4string(clinics_geo_data) <- proj4string(clinics_geo_data)
proj4string(grid) <- proj4string(clinics_geo_data)
#IDW Interpolation
library(gstat)
library(raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
##
## select
pi <- idw(clinics.y ~ 1, clinics_geo_data, newdata = grid, idp = 1)
## [inverse distance weighted interpolation]
r <- raster(pi)
plot(r)