For a homework assignment in a demography course, we were asked to look up several estimates of variables in the 2017 World Population Data Sheet. For instance, we had to find the country or countries with the lowest total fertility rate (TFR). As usual, I found myself shaking my head while scanning the rows and columns of the tables in the PDF. After a bit of searching on Google, I found the site https://www.pdftoexcel.com/, which converts tables in a PDF into a spreadsheet. I uploaded the PDF there, let the site process it for under two minutes, and downloaded the resulting spreadsheet. Afterwards, I did some manual copying, pasting, and categorizing of the rows. The spreadsheet file is available for download here.
Now I only need to do a bit more data manipulation in R to make the file useful for my purposes.
# load the required packages
library(tidyverse)
library(data.table)
library(DT)
# import the file
wpds <- fread("wpds.csv", na.strings = c("-", "", "NA"))
# do some fixes
wpds <- wpds %>% mutate(
  # remove the comma in each numeric entry
  `GNI per Capita PPPc 2016` = as.numeric(gsub(",", "", `GNI per Capita PPPc 2016`)),
  `Population mid-2017 (millions)` = as.numeric(gsub(",", "", `Population mid-2017 (millions)`)),
  `Population mid-2030 (millions)` = as.numeric(gsub(",", "", `Population mid-2030 (millions)`)),
  `Population mid-2050 (millions)` = as.numeric(gsub(",", "", `Population mid-2050 (millions)`)),
  `Population per Square Kilometer of Arable Land (thousands)` = as.numeric(gsub(",", "", `Population per Square Kilometer of Arable Land (thousands)`)),
  `Population Ages 15-24 mid-2017 (millions)` = as.numeric(gsub(",", "", `Population Ages 15-24 mid-2017 (millions)`)),
  `Population Ages 15-24 mid-2050 (millions)` = as.numeric(gsub(",", "", `Population Ages 15-24 mid-2050 (millions)`)),
  # the blanks correspond to a country
  Category = case_when(is.na(Category) ~ "Country", TRUE ~ as.character(Category))
)
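The repeated comma-stripping lines could probably be collapsed into a single step. Here is a minimal sketch, assuming dplyr 1.0 or later (for across()) and using a small made-up data frame with hypothetical column names (pop, gni) instead of the actual file:

```r
library(dplyr)

# Toy stand-in for the imported data; names and values are made up
toy <- data.frame(
  Country = c("A", "B"),
  pop = c("1,234.5", "67.8"),
  gni = c("12,340", NA),
  stringsAsFactors = FALSE
)

# Columns that were read as text because of the thousands separators
comma_cols <- c("pop", "gni")

# Strip the commas and convert every listed column to numeric in one pass
toy_clean <- toy %>%
  mutate(across(all_of(comma_cols), ~ as.numeric(gsub(",", "", .x))))

str(toy_clean)
```

With the real data, comma_cols would hold the seven backticked column names used above. Missing entries stay NA because gsub() passes NA through unchanged.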
Now it is only a matter of passing the data to DT::datatable, which renders it as an interactive table that can be sorted and searched.
wpds %>% datatable()
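The homework question itself (which country or countries have the lowest TFR) can also be answered directly in code rather than by sorting the table by hand. The sketch below is only an illustration: it uses a made-up data frame, and the column names Name and tfr are hypothetical, since the actual column names in the spreadsheet may differ.

```r
library(dplyr)

# Toy data; the names and values are placeholders, not real estimates
toy <- data.frame(
  Name = c("World", "Country A", "Country B", "Country C"),
  Category = c("World", "Country", "Country", "Country"),
  tfr = c(2.5, 1.2, 1.2, 3.4),
  stringsAsFactors = FALSE
)

# Keep only country rows, then keep every row tied for the minimum TFR
lowest_tfr <- toy %>%
  filter(Category == "Country") %>%
  filter(tfr == min(tfr, na.rm = TRUE))

print(lowest_tfr)
```

Filtering on equality with the minimum, instead of taking the first row after sorting, keeps ties, which matters because the question asks for the "country or countries". The Category filter is what the earlier recoding of blanks to "Country" makes possible.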