Case study: Clinics available per state

This case study was conducted to see how the data obtained through the last clinic availability survey reflects on a map.

The data

Exploration

The data downloaded corresponds to the latest survey, conducted in 2019. It was obtained from the open data resource offered by the Paraguayan government through their website.

The data was formatted as a CSV delimited by semicolons.

#Loading packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Loading Data-set

clinics_raw_data <- read.csv(
  "Datos/health_care_personel.csv", sep = ";")

After loading the data, I inspected the table with glimpse to see data types and variable (column) names:

glimpse(clinics_raw_data)
## Rows: 38
## Columns: 10
## $ Ano.y.Region.Sanitaria                                         <chr> "     T…
## $ Total                                                          <dbl> 35.519,…
## $ Categoria.de.Ocupacion....Medicos                              <dbl> 10.936,…
## $ Categoria.de.Ocupacion....Quimicos.y.Bioquimicos               <dbl> 1.666, …
## $ Categoria.de.Ocupacion....Odontologos                          <dbl> 1.153, …
## $ Categoria.de.Ocupacion....Licenciados.Enfermeria...Obstetricia <dbl> 11.714,…
## $ Categoria.de.Ocupacion....Otros.profesionales.universitarios   <chr> "652", …
## $ Categoria.de.Ocupacion....Auxiliares.de.blanco                 <dbl> 4.013, …
## $ Categoria.de.Ocupacion....Tecnicos                             <dbl> 4.816, …
## $ Categoria.de.Ocupacion....Personal.de.apoyo                    <int> 569, 20…

The data appears to be the result of some sort of pivoting, with year and region in the same column (Ano.y.Region (Year.and.Region)). The rest of the columns contain numeric type of values, but had been introduced with “.” as thousand separators, which leads R to interpret them as double data type.

Another probable result of the pivoting is the spaces before the names in the Ano.y.Region variable.

Data Cleaning

Changing variable names

As the output shows, the variable names are long and describe too many attributes of the observations. To be able to perform analysis in a ordered manner and improve code readability, I changed the variable names to shorter ones, using snake case where needed.

clinics_raw_data <- clinics_raw_data %>% rename(
  region = Ano.y.Region.Sanitaria,
  clinics = Categoria.de.Ocupacion....Medicos,
  chemestry_prof = Categoria.de.Ocupacion....Quimicos.y.Bioquimicos,
  dentists = Categoria.de.Ocupacion....Odontologos,
  nurses = Categoria.de.Ocupacion....Licenciados.Enfermeria...Obstetricia,
  interns = Categoria.de.Ocupacion....Otros.profesionales.universitarios,
  asistants = Categoria.de.Ocupacion....Auxiliares.de.blanco,
  technicians = Categoria.de.Ocupacion....Tecnicos,
  support = Categoria.de.Ocupacion....Personal.de.apoyo
)

The resulting table:

head(clinics_raw_data)
##                     region  Total clinics chemestry_prof dentists  nurses
## 1      Total Pais Ano 2018 35.519  10.936          1.666    1.153  11.714
## 2                 Asuncion  9.117   2.878        638.000  346.000   2.785
## 3               Concepcion  1.002 299.000         30.000   20.000 420.000
## 4                San Pedro  1.804 402.000         48.000   29.000 742.000
## 5               Cordillera  1.275 328.000         39.000   44.000 414.000
## 6                   Guaira  1.191 363.000         59.000   31.000 399.000
##   interns asistants technicians support
## 1     652     4.013       4.816     569
## 2     321   749.000       1.200     200
## 3      16   144.000      57.000      16
## 4      10   283.000     269.000      21
## 5      13   176.000     255.000       6
## 6       8   174.000     151.000       6

Trimming spaces

To address the spaces at the beginning of the names in the “region” variable, I used the function str_squish. I also standarized the casing of the region to ease merging.

clinics_raw_data <- clinics_raw_data %>% 
  mutate(region = str_squish(region)) %>% 
  mutate(region = tolower(region))

Resulting table:

head(clinics_raw_data)
##                region  Total clinics chemestry_prof dentists  nurses interns
## 1 total pais ano 2018 35.519  10.936          1.666    1.153  11.714     652
## 2            asuncion  9.117   2.878        638.000  346.000   2.785     321
## 3          concepcion  1.002 299.000         30.000   20.000 420.000      16
## 4           san pedro  1.804 402.000         48.000   29.000 742.000      10
## 5          cordillera  1.275 328.000         39.000   44.000 414.000      13
## 6              guaira  1.191 363.000         59.000   31.000 399.000       8
##   asistants technicians support
## 1     4.013       4.816     569
## 2   749.000       1.200     200
## 3   144.000      57.000      16
## 4   283.000     269.000      21
## 5   176.000     255.000       6
## 6   174.000     151.000       6

Changing data types

Now, since the “clinics” variable actually contains integers and not doubles, I will convert them to string, remove the period as a separator and then convert them to integers:

clinics_raw_data$clinics <- sapply(
  clinics_raw_data$clinics, function(v) {
    as.integer(gsub("\\.","", as.character(v)))
    })

clinics_raw_data$nurses <- sapply(
  clinics_raw_data$nurses, function(v) {
    as.integer(gsub("\\.","", as.character(v)))
    })

glimpse(clinics_raw_data)
## Rows: 38
## Columns: 10
## $ region         <chr> "total pais ano 2018", "asuncion", "concepcion", "san p…
## $ Total          <dbl> 35.519, 9.117, 1.002, 1.804, 1.275, 1.191, 1.599, 846.0…
## $ clinics        <int> 10936, 2878, 299, 402, 328, 363, 414, 215, 555, 261, 27…
## $ chemestry_prof <dbl> 1.666, 638.000, 30.000, 48.000, 39.000, 59.000, 45.000,…
## $ dentists       <dbl> 1.153, 346.000, 20.000, 29.000, 44.000, 31.000, 28.000,…
## $ nurses         <int> 11714, 2785, 420, 742, 414, 399, 657, 259, 375, 372, 40…
## $ interns        <chr> "652", "321", "16", "10", "13", "8", "20", "2", "16", "…
## $ asistants      <dbl> 4.013, 749.000, 144.000, 283.000, 176.000, 174.000, 213…
## $ technicians    <dbl> 4.816, 1.200, 57.000, 269.000, 255.000, 151.000, 191.00…
## $ support        <int> 569, 200, 16, 21, 6, 6, 31, 8, 14, 5, 9, 28, 126, 12, 4…

Reducing scope

The table contains other data regarding other professionals, for the purpose of this case study, this data can be discarded. Also, since this follows a pivot table like format, both region and year are located within the same column. To address this I will separate them in two columns, and reduce the scope of the data to only the year 2019.

#Separate year from region

clinics_raw_data <- add_column(
  clinics_raw_data, "year" = 2019, .after = "region"
  )

#Reduce data to 2019

clinics_clean_data <- clinics_raw_data[-(1:20),]

#Reduce columns to relevant ones

clinics_clean_data <- subset(
  clinics_clean_data, select = -c(
    Total,
    interns,
    asistants,
    technicians,
    support))

The final table, with all the conditions to be used is the following:

glimpse(clinics_clean_data)
## Rows: 18
## Columns: 6
## $ region         <chr> "asuncion", "concepcion", "san pedro", "cordillera", "g…
## $ year           <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ clinics        <int> 3153, 328, 455, 366, 396, 453, 248, 614, 332, 310, 580,…
## $ chemestry_prof <dbl> 656, 32, 49, 41, 56, 49, 29, 58, 38, 42, 61, 527, 14, 2…
## $ dentists       <dbl> 350, 19, 29, 48, 32, 31, 30, 51, 24, 53, 36, 401, 16, 1…
## $ nurses         <int> 2954, 443, 783, 399, 473, 669, 262, 447, 399, 418, 630,…

Geospatial data

To link the tabular data to spatial points, I loaded this shape file containing points representing state capitals.

#Loading Geospatial data

library(sf)
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
capital_points <- read_sf("Datos/shp/cap_dep_puntos.shp")

The points are referenced by an “id” number standardized in Paraguay. To perform a join between these two data sets (clinics_clean_data and capital_points), I loaded this reference table.

points_ref <- 
  read.csv("C:/Users/pmedi/OneDrive/Documentos/departments.csv")

head(points_ref)
##   id    department
## 1 17 alto paraguay
## 2 10   alto parana
## 3 13       amambay
## 4  0      asuncion
## 5  5      caaguazu
## 6  6       caazapa

Merging

Getting the feature id’s

First, merging the data from the reference table to get the id’s.

clinics_merged_data <- merge(
  x = clinics_clean_data, 
  y = points_ref, 
  by.x = "region", 
  by.y = "department", 
  all = TRUE)

glimpse(clinics_merged_data)
## Rows: 18
## Columns: 7
## $ region         <chr> "alto paraguay", "alto parana", "amambay", "asuncion", …
## $ year           <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ clinics        <int> 56, 580, 220, 3153, 102, 453, 248, 247, 385, 328, 366, …
## $ chemestry_prof <dbl> 4, 61, 26, 656, 5, 49, 29, 18, 527, 32, 41, 56, 58, 38,…
## $ dentists       <dbl> 2, 36, 15, 350, 4, 31, 30, 20, 401, 19, 48, 32, 51, 24,…
## $ nurses         <int> 74, 630, 224, 2954, 75, 669, 262, 248, 3488, 443, 399, …
## $ id             <int> 17, 10, 13, 0, 16, 5, 6, 14, 11, 1, 3, 4, 7, 8, 12, 9, …

Merging spatial data with shapefile

#Merging spatial data by id

clinics_geo_data <- merge(
  x = capital_points, 
  y = clinics_merged_data, 
  by.x = "Id", 
  by.y = "id", 
  all = TRUE
  )

glimpse(clinics_geo_data)
## Rows: 18
## Columns: 16
## $ Id             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ AREA           <dbl> 104580000, 15942773, 2506053, 2698472, 0, 18259027, 219…
## $ PERIMETER      <dbl> 68809.055, 26160.950, 10296.553, 8570.629, 0.000, 22481…
## $ CIUDADES_      <dbl> 214, 11, 21, 80, 0, 91, 155, 215, 0, 0, 95, 0, 0, 6, 18…
## $ CIUDADES_I     <dbl> 1000, 10, 20, 125, 0, 139, 203, 1001, 0, 0, 143, 0, 0, …
## $ NOMBRE         <chr> "ASUNCION", "Concepción", "San Pedro", "Caacupe", "Vill…
## $ TIPO           <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0
## $ Hectares       <dbl> 10457.9984, 1594.2773, 250.6053, 269.8472, 0.0000, 1825…
## $ clinics.x      <dbl> 2878, 299, 402, 328, 363, 414, 215, 555, 261, 279, 532,…
## $ region         <chr> "asuncion", "concepcion", "san pedro", "cordillera", "g…
## $ year           <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ clinics.y      <int> 3153, 328, 455, 366, 396, 453, 248, 614, 332, 310, 580,…
## $ chemestry_prof <dbl> 656, 32, 49, 41, 56, 49, 29, 58, 38, 42, 61, 527, 14, 2…
## $ dentists       <dbl> 350, 19, 29, 48, 32, 31, 30, 51, 24, 53, 36, 401, 16, 1…
## $ nurses         <int> 2954, 443, 783, 399, 473, 669, 262, 447, 399, 418, 630,…
## $ geometry       <POINT [m]> POINT (439963 7203105), POINT (455417.2 7411569), POINT…

Displaying the spatial data:

#Displaying spatial data

library(mapview)

clinics_geo_data <- clinics_geo_data %>%  as_Spatial

mapview(clinics_geo_data)

Preparing data for interpolation

First, I created a grid covering the paraguayan territory with the points where to extrapolate the data.

#Loading grid

library(sp)

grid_py <- read_sf("Datos/shp/py_grid.shp")

grid_py <- grid_py %>% as_Spatial

grid <- as.data.frame(spsample(grid_py, "regular", n=5000000))

names(grid) <- c("x", "y")

coordinates(grid) <- c("x", "y")

gridded(grid) <- TRUE
fullgrid(grid) <- TRUE

#Equalizing the projections

proj4string(clinics_geo_data) <- proj4string(clinics_geo_data)
proj4string(grid) <- proj4string(clinics_geo_data)

Performing IDW

#IDW Interpolation

library(gstat)
library(raster)
## 
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
## 
##     select
pi <- idw(clinics.y ~ 1, clinics_geo_data, newdata = grid, idp = 1)
## [inverse distance weighted interpolation]
r <- raster(pi)
plot(r)