Getting data off websites: a case study

Kiva microloans

library(dplyr)
library(tidyr)
library(magrittr)
library(rvest)

The first step is deciding what data you want to pull from a page. For Kiva microloans, that might be:

  1. person’s name
  2. country of origin
  3. how much they’d like to borrow
  4. what percentage they’ve raised
  5. whether their loan has been filled already
  6. how much they’ve paid back
  7. info on the person: repayment term, repayment schedule, date the loan was pre-disbursed, date listed, currency exchange loss

To do this, we need to write functions that extract each of these from the page. We start by saving a sample webpage:

## read_html() is the current name for what older rvest versions called html()
site <- read_html("http://www.kiva.org/lend/774331")

Individual’s name

Looks like the name is the page header:

kiva_name <- function(.site){
  .site %>%
    html_nodes("#pageHeader h2") %>%
    html_text
  }

kiva_name(site)
## [1] "Resineros De San José De Cañas Group"

Country of origin

The element with class “country” is badly named; it also holds the city and other location details:

kiva_place <- function(.site){
  .site %>%
    html_nodes("#pageHeader .country") %>%
    html_text
  }

kiva_place(site)
## [1] "San José de Cañas, Mexico"

How much they’d like to borrow

Life is pain. This number only appears in the middle of a sentence.

kiva_amt <- function(.site){
  .site %>%
    html_nodes(".loanExcerpt") %>%
    html_text %>%
    gsub("[^0-9.]+", "", .) %>%   ## keep only digits and periods
    gsub("\\.*$", "", .) %>%      ## remove trailing .
    gsub("^\\.*", "", .) %>%      ## remove leading .
    as.numeric
}

kiva_amt(site)
## [1] 29050
# We first strip out everything that is not a digit or a period,
# then remove any periods left at the start or end.
# NOTE: this assumes the summary never contains another number.
# For example, "these 2 people want 30$" would produce 230, not 30.
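
The caveat in the note above can be reproduced offline. The sketch below uses two made-up sentences (not real Kiva text) and compares the strip-everything approach with a hypothetical alternative, `first_dollar_amount()`, which assumes the amount is always written with a leading "$". That assumption may not hold on the real pages.

```r
# Same cleaning steps as kiva_amt(), applied to a plain string.
strip_to_number <- function(x) {
  x <- gsub("[^0-9.]+", "", x)     # keep only digits and periods
  x <- gsub("^\\.*|\\.*$", "", x)  # drop leading and trailing periods
  as.numeric(x)
}

# Hypothetical alternative: take only the first number that follows a "$".
first_dollar_amount <- function(x) {
  m <- regmatches(x, regexpr("\\$[0-9][0-9,]*(\\.[0-9]+)?", x))
  as.numeric(gsub("[$,]", "", m))
}

# Made-up sentences for illustration only.
ok  <- "A loan of $29,050 helps a group buy tools."
bad <- "These 2 people want a loan of $30."

strip_to_number(ok)       # 29050
strip_to_number(bad)      # 230, because the stray "2" is glued onto the "30"
first_dollar_amount(ok)   # 29050
first_dollar_amount(bad)  # 30
```

Anchoring on the currency symbol trades one assumption (no other digits) for another (the amount always follows a "$"), so it is worth checking against a few real pages first.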

What percentage they’ve raised

This is another weird one. The “percentage” means the amount raised so far (if less than the total) or the amount paid back (once the total is reached). Oh well, we might as well get it anyway:

kiva_percent <- function(.site){
  .site %>%
    html_nodes("#loanSummary .number") %>%
    html_text %>%
    gsub("[^0-9.]+", "", .) %>%
    as.numeric
}

kiva_percent(site)
## [1] 37

Whether their loan has been filled already

This is an interesting one. There is an element on the page (“.fullyFundedNotice”) that doesn’t even exist if your loan is not fully funded! So we can simply check that it exists at all:

kiva_funded <- function(.site){
  .site %>%
    html_nodes(".fullyFundedNotice") %>%
    html_text %>%
    identical(., character(0)) %>%
    not
}

kiva_funded(site)
## [1] FALSE
#kiva_funded(read_html("http://www.kiva.org/lend/774321"))
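
The same existence check can be tried without touching the network. This is a minimal sketch on two made-up HTML fragments; only the class name `.fullyFundedNotice` comes from the text above, and `has_node()` is a hypothetical helper. It counts matching nodes directly instead of comparing extracted text with `character(0)`.

```r
library(rvest)

# Made-up fragments: one contains the notice element, the other does not.
funded_page   <- read_html('<div><p class="fullyFundedNotice">Fully funded!</p></div>')
unfunded_page <- read_html('<div><p class="other">Still raising</p></div>')

# Hypothetical helper: TRUE if at least one node matches the CSS selector.
has_node <- function(page, css) {
  length(html_nodes(page, css)) > 0
}

has_node(funded_page, ".fullyFundedNotice")    # TRUE
has_node(unfunded_page, ".fullyFundedNotice")  # FALSE
```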

Info on the person

Let’s grab that little sidebar of info on the person, because it looks useful.

## site is already a parsed page, so it can be piped directly
loansum <- site %>%
  html_nodes("#loanSummary dl")

By looking at loansum, we can see that it is a definition list, not a table. That means we can’t use html_table() to extract the values:

loansum %>%
  html_table
## Error: html_tag(x) == "table" is not TRUE

OK, so we have two options. One would be to force the whole structure of the list to a text string that we could clean with regular expressions:

loansum %>%
  html_text
## [1] "Repayment Term:\n\t\t\t\t\t\t120 months (more info)\n\t\n\t\t\t\t\t\tRepayment Schedule:\n\t\t\t\t\t\tIrregularly\n\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tPre-Disbursed:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tAug 25, 2014\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tListed\n\t\t\t\t\t\t\tOct 21, 2014\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tCurrency Exchange Loss:\n\t\t\t\t\t\tN/A \n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t"

Slightly more elegant would be to use the structure of the list itself. Every “definition list” contains a <dt> tag for the term which is defined, and a <dd> tag for the definition:

loansum %>%
  html_nodes("dt") %>%
  html_text
## [1] "Repayment Term:"         "Repayment Schedule:"    
## [3] "Pre-Disbursed:"          "Listed"                 
## [5] "Currency Exchange Loss:"
loansum %>%
  html_nodes("dd") %>%
  html_text
## [1] "120 months (more info)" "Irregularly"           
## [3] "Aug 25, 2014"           "Oct 21, 2014"          
## [5] "N/A "

We could extract this into a function:

deflist_to_df <- function(.site){
  require(rvest)
  require(dplyr)
  
  deflist_xml <- .site %>%
    html_nodes("#loanSummary dl")
  
  terms <- deflist_xml %>%
    html_nodes("dt") %>%
    html_text
  
  ## use the list extracted from .site, not the global loansum object
  defs <- deflist_xml %>%
    html_nodes("dd") %>%
    html_text
  
  names(defs) <- terms
  
  data.frame(t(defs))
}

deflist_to_df(site)
##          Repayment.Term. Repayment.Schedule. Pre.Disbursed.       Listed
## 1 120 months (more info)         Irregularly   Aug 25, 2014 Oct 21, 2014
##   Currency.Exchange.Loss.
## 1                    N/A
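
The column names and values in this data frame are still fairly raw (trailing colons turned into dots, a trailing space after "N/A", a number buried in "120 months (more info)"). One possible cleanup, sketched below on the terms and definitions copied from the output above, is to tidy both before building the data frame. The naming scheme and the choice to turn "N/A" into `NA` are assumptions, not part of the original function.

```r
# Terms and definitions copied from the output above.
terms <- c("Repayment Term:", "Repayment Schedule:", "Pre-Disbursed:",
           "Listed", "Currency Exchange Loss:")
defs  <- c("120 months (more info)", "Irregularly", "Aug 25, 2014",
           "Oct 21, 2014", "N/A ")

# Drop the trailing colon, then collapse non-letters into underscores.
clean_names <- tolower(gsub("[^A-Za-z]+", "_", sub(":$", "", terms)))

# Trim whitespace and treat "N/A" as a missing value.
clean_defs <- trimws(defs)
clean_defs[clean_defs == "N/A"] <- NA

# One row, one column per term.
tidy <- as.data.frame(as.list(setNames(clean_defs, clean_names)),
                      stringsAsFactors = FALSE)

# Pull the number of months out of the repayment term.
tidy$repayment_months <- as.numeric(sub(" months.*$", "", tidy$repayment_term))

names(tidy)
tidy$repayment_months  # 120
```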

Put it all together

We can use some random numbers to sample Kiva profiles, taking advantage of the URL structure, which ends in a number:

numvec2 <- c(786671,785489)

set.seed(5)
numvec <- sample(5000:7914, size = 10)+780000

download <- data.frame(startnum = numvec) %>%
  mutate(url = paste0("http://www.kiva.org/lend/", startnum)) %>%
  group_by(url) %>%
  ## store NULL instead of stopping when a page can't be read
  do(site = tryCatch(read_html(.$url), error = function(e) NULL))
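
The download step has to tolerate pages that cannot be read, since a randomly chosen number may not correspond to a loan. A minimal offline sketch of that pattern in base R follows. `safely_or_null()` and `fake_fetch()` are hypothetical names; `fake_fetch()` stands in for the real page reader so the example runs without a network connection.

```r
# Hypothetical helper: wrap a function so that errors become NULL.
safely_or_null <- function(f) {
  function(...) tryCatch(f(...), error = function(e) NULL)
}

# Stand-in for the real page reader: fails for odd ids,
# returns a dummy string for even ones.
fake_fetch <- function(url) {
  id <- as.integer(sub(".*/", "", url))
  if (id %% 2 == 1) stop("page not found")
  paste("page", id)
}

urls  <- paste0("http://www.kiva.org/lend/", c(785304, 785305))
pages <- lapply(urls, safely_or_null(fake_fetch))
names(pages) <- urls

# Keep only the downloads that succeeded.
pages <- Filter(Negate(is.null), pages)
length(pages)  # 1
```

Dropping the failures right after the download means the extraction functions never see a `NULL` page.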

Now we can use the functions we just wrote to extract all the info from the downloaded pages:

clean_download <- download %>%
  mutate(test = try(kiva_name(site))) %>%
  filter(!grepl("Error", x = test))
output <- clean_download %>%
  group_by(url) %>% 
  mutate(name = kiva_name(site[[1]]),
         funded = kiva_funded(site[[1]]),
         percent = kiva_percent(site[[1]]),
         amount = kiva_amt(site[[1]]),
         place = kiva_place(site[[1]])) %>%
  #separate(place, c("city", "country"), sep = ", ") %>%
  do(data.frame(., deflist_to_df(.[["site"]][[1]]))) %>%
  select(-site)
library(knitr)
kable(as.data.frame(output))
| url | test | name | funded | percent | amount | place | Repayment.Term. | Repayment.Schedule. | Pre.Disbursed. | Listed | Currency.Exchange.Loss. |
|:----|:-----|:-----|:-------|--------:|-------:|:------|:----------------|:--------------------|:---------------|:-------|:------------------------|
| http://www.kiva.org/lend/785304 | Manjurani | Manjurani | TRUE | 0 | 175 | Maynaguri, India | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/785320 | Janet | Janet | TRUE | 0 | 225 | Kericho, Kenya | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/785583 | San Valentin Group | San Valentin Group | TRUE | 0 | 3450 | Asunción, Paraguay | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/785828 | Djiguiya Group | Djiguiya Group | FALSE | 6 | 1475 | M’Pessoba, Mali | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/786535 | Hanifan | Hanifan | TRUE | 0 | 450 | Liaqat Pur, Pakistan | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/786996 | Duom | Duom | TRUE | 0 | 600 | Thai Binh, Vietnam | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/787040 | Savoeun’s Group | Savoeun’s Group | TRUE | 0 | 150 | Battambang, Cambodia | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/787349 | Goutami | Goutami | TRUE | 0 | 250 | Maynaguri, India | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/787780 | Zenie | Zenie | TRUE | 0 | 125 | Calamba - Baliangao, Misamis Occidental, Philippines | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
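
The pipeline leaves the `separate(place, ...)` step commented out. One likely complication is that some places contain more than one comma, so splitting on every ", " does not yield exactly two pieces. A possible workaround, sketched here in base R on two place strings copied from the results, is to split on the last comma only. Treating everything after the last comma as the country is an assumption.

```r
# Place strings copied from the results above.
place <- c("Maynaguri, India",
           "Calamba - Baliangao, Misamis Occidental, Philippines")

# The greedy ".*" consumes up to the last ", ", leaving only the country.
country  <- sub("^.*, ", "", place)

# Remove the last ", <something>" to keep everything before it.
locality <- sub(", [^,]*$", "", place)

data.frame(locality, country)
```

Here `country` is "India" and "Philippines", and the second `locality` keeps its inner comma: "Calamba - Baliangao, Misamis Occidental".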