Kiva: round 2

Getting data off websites: a case study

Kiva microloans

library(dplyr)
library(tidyr)
library(magrittr)
library(rvest)

The first step is deciding which data you want to pull from a page. For Kiva microloans, that might be:

person’s name
country of origin
how much they’d like to borrow
what percentage they’ve raised
If their loan has been filled already
how much they’ve paid back
info on the person: repayment term, schedule, date loan was pre-disbursed, date listed, currency exchange

To do this, we need to write functions that extract each of these from the page. We start by saving a sample web page:

site <- html("http://www.kiva.org/lend/774331")

Individual’s name.

Looks like the name is the page header:

kiva_name <- function(.site){
  .site %>%
    html_nodes("#pageHeader h2") %>%
    html_text
  }

kiva_name(site)

## [1] "Resineros De San José De Cañas Group"

country of origin

The “country” element is badly named: it holds the city as well as the country:

kiva_place <- function(.site){
  .site %>%
    html_nodes("#pageHeader .country") %>%
    html_text
  }

kiva_place(site)

## [1] "San José de Cañas, Mexico"

how much they’d like to borrow

Life is pain. This number only appears in the middle of a sentence.

kiva_amt <- function(.site){
  .site %>%
    html_nodes(".loanExcerpt") %>%
    html_text %>%
    gsub("[^0-9.]+", "", .) %>%   ## keep only digits and periods
    gsub("\\.*$", "", .) %>%      ## remove trailing .
    gsub("^\\.*", "", .) %>%      ## remove leading .
    as.numeric
}

kiva_amt(site)

## [1] 29050

# We first drop every character that is not a digit or a period,
# then remove any periods left at the start or end.
# NOTE: this assumes the excerpt never contains another digit. For example, "these 2 people want 30$" will produce 230, not 30.
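
The caveat above can be made concrete. Here is a minimal sketch in base R, assuming the excerpt writes the amount with a leading dollar sign. The `excerpt` string and the helper names `strip_all()` and `dollar_amount()` are made up for illustration; they are not taken from the Kiva page. It compares the strip-everything approach with a more targeted match:

```r
# Hypothetical excerpt text; the real page wording may differ
excerpt <- "A loan of $29,050 helps these 2 people buy resin."

# The approach used in kiva_amt(): keep digits and periods, trim stray periods
strip_all <- function(x) {
  x <- gsub("[^0-9.]+", "", x)
  x <- gsub("^\\.+|\\.+$", "", x)
  as.numeric(x)
}

# A narrower alternative: take only the first number that follows a "$"
dollar_amount <- function(x) {
  m <- regmatches(x, regexpr("\\$[0-9][0-9,]*(\\.[0-9]+)?", x))
  as.numeric(gsub("[$,]", "", m))
}

strip_all(excerpt)      # the stray "2" is glued onto the amount
dollar_amount(excerpt)  # only the dollar figure is returned
```

If the text has no dollar amount at all, `dollar_amount()` returns `numeric(0)` instead of a wrong number, which is easier to catch later.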

what percentage they’ve raised

This is another weird one. The “percentage” is the amount raised so far (while the loan is not yet fully funded) or the amount paid back (once the total has been reached). Oh well, we might as well get it anyway:

kiva_percent <- function(.site){
  .site %>%
    html_nodes("#loanSummary .number") %>%
    html_text %>%
    gsub("[^0-9.]+", "", .) %>%
    as.numeric
}

kiva_percent(site)

## [1] 37

If their loan has been filled already

This is an interesting one. There is an element on the page (“.fullyFundedNotice”) that doesn’t even exist if your loan is not fully funded! So we can simply check that it exists at all:

kiva_funded <- function(.site){
  .site %>%
    html_nodes(".fullyFundedNotice") %>%
    html_text %>%
    identical(., character(0)) %>%
    not
}

kiva_funded(site)

## [1] FALSE

#kiva_funded(html("http://www.kiva.org/lend/774321"))
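
The existence check can be tried offline on two tiny hand-written pages. This is only a sketch: the HTML strings and the helper name `has_node()` are invented for illustration, and it uses `read_html()`, the name current versions of rvest give to the parser called `html()` in this post. Instead of comparing the extracted text with `character(0)`, it counts the matched nodes:

```r
library(rvest)

# Two hypothetical pages: one with the notice, one without
funded_page <- read_html(
  "<html><body><div class='fullyFundedNotice'>Fully funded!</div></body></html>"
)
open_page <- read_html(
  "<html><body><div id='loanSummary'>37% raised</div></body></html>"
)

# TRUE when at least one node matches the CSS selector
has_node <- function(page, css) {
  length(html_nodes(page, css)) > 0
}

has_node(funded_page, ".fullyFundedNotice")
has_node(open_page, ".fullyFundedNotice")
```

Counting nodes reads a little more directly than the `identical()`/`not` pair, but both rely on the same fact: a selector that matches nothing gives back an empty result rather than an error.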

info on the person

Let’s grab that little sidebar of info on the person, because it looks useful.

loansum <- site %>%   ## site is already parsed, so there is no need to call html() again
  html_nodes("#loanSummary dl")

Looking at loansum, we can see that it is a definition list, not a table. That means we can’t use html_table() to extract the values:

loansum %>%
  html_table

## Error: html_tag(x) == "table" is not TRUE

OK, so we have two options. One would be to force the whole structure of the list to a text string that we could clean with regular expressions:

loansum %>%
  html_text

## [1] "Repayment Term:\n\t\t\t\t\t\t120 months (more info)\n\t\n\t\t\t\t\t\tRepayment Schedule:\n\t\t\t\t\t\tIrregularly\n\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tPre-Disbursed:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tAug 25, 2014\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tListed\n\t\t\t\t\t\t\tOct 21, 2014\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tCurrency Exchange Loss:\n\t\t\t\t\t\tN/A \n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t"

Slightly more elegant would be to use the structure of the list itself. Every “definition list” contains a <dt> tag for the term which is defined, and a <dd> tag for the definition:

loansum %>%
  html_nodes("dt") %>%
  html_text

## [1] "Repayment Term:"         "Repayment Schedule:"    
## [3] "Pre-Disbursed:"          "Listed"                 
## [5] "Currency Exchange Loss:"

loansum %>%
  html_nodes("dd") %>%
  html_text

## [1] "120 months (more info)" "Irregularly"           
## [3] "Aug 25, 2014"           "Oct 21, 2014"          
## [5] "N/A "

We could extract this into a function:

deflist_to_df <- function(.site){
  require(rvest)
  require(dplyr)
  
  deflist_xml <- .site %>%
    html_nodes("#loanSummary dl")
  
  terms <- deflist_xml %>%
    html_nodes("dt") %>%
    html_text
  
  ## read from deflist_xml, not the global loansum, so each page gives its own values
  defs <- deflist_xml %>%
    html_nodes("dd") %>%
    html_text
  
  names(defs) <- terms
  
  data.frame(t(defs))
}

deflist_to_df(site)

##          Repayment.Term. Repayment.Schedule. Pre.Disbursed.       Listed
## 1 120 months (more info)         Irregularly   Aug 25, 2014 Oct 21, 2014
##   Currency.Exchange.Loss.
## 1                    N/A
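
Notice that data.frame() has mangled the terms into names like Repayment.Term. and that some values keep stray whitespace. As a rough sketch of one way to tidy this, here is the same dt/dd pairing on a small hand-written definition list. The HTML is made up to resemble the sidebar, `tidy` is just an illustrative name, and `read_html()` is the current rvest name for the parser called `html()` in this post:

```r
library(rvest)

# Hypothetical definition list, shaped like the Kiva sidebar
page <- read_html(
  "<dl>
     <dt>Repayment Term:</dt><dd> 120 months </dd>
     <dt>Listed</dt><dd>Oct 21, 2014</dd>
   </dl>"
)

terms <- html_text(html_nodes(page, "dt"))
defs  <- html_text(html_nodes(page, "dd"))

# Trim whitespace and drop the trailing colon from each term
terms <- sub(":$", "", trimws(terms))
defs  <- trimws(defs)

# check.names = FALSE stops data.frame() from rewriting the column names
tidy <- data.frame(as.list(setNames(defs, terms)),
                   check.names = FALSE, stringsAsFactors = FALSE)
tidy
```

Names with spaces are nicer to read in a printed table, though they need backticks or `[[ ]]` when used in later code, so whether this is worth doing depends on what happens to the data next.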

Put it all together

The URL of every Kiva profile ends in a number, so we can take advantage of that structure and sample profiles by drawing random numbers:

numvec2 <- c(786671,785489)

set.seed(5)
numvec <- sample(5000:7914, size = 10)+780000

download <- data.frame(startnum = numvec) %>%
  mutate(url = paste0("http://www.kiva.org/lend/", startnum)) %>%
  group_by(url) %>%
  do(site = failwith(NULL, html)(.$url))
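
Not every number corresponds to a live page, which is why the download is wrapped in failwith(NULL, html): a failed request gives back NULL instead of stopping the whole pipeline. The same idea can be sketched in base R with tryCatch(). Here `or_null()` and `fake_reader()` are hypothetical names, and the fake reader stands in for a real download so that the example runs offline:

```r
# Wrap any function so that an error becomes NULL
or_null <- function(f) {
  function(...) tryCatch(f(...), error = function(e) NULL)
}

# Build URLs from loan numbers, as in the pipeline above
ids  <- c(785304, 785320)
urls <- paste0("http://www.kiva.org/lend/", ids)

# Stand-in for a downloader: pretends the second page does not exist
fake_reader <- function(url) {
  if (grepl("785320$", url)) stop("HTTP 404")
  paste("page for", url)
}

pages <- lapply(urls, or_null(fake_reader))

# Which downloads succeeded?
ok <- !vapply(pages, is.null, logical(1))
urls[ok]
```

Keeping the failures as NULL, rather than dropping them straight away, makes it easy to see afterwards which URLs did not resolve.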

Now we can use the functions we just wrote to extract all the info from the downloaded pages:

clean_download <- download %>%
  mutate(test = try(kiva_name(site))) %>%
  filter(!grepl("Error", x = test))

output <- clean_download %>%
  group_by(url) %>% 
  mutate(name = kiva_name(site[[1]]),
         funded = kiva_funded(site[[1]]),
         percent = kiva_percent(site[[1]]),
         amount = kiva_amt(site[[1]]),
         place = kiva_place(site[[1]])) %>%
  #separate(place, c("city", "country"), sep = ", ") %>%
  do(data.frame(., deflist_to_df(.[["site"]][[1]]))) %>%
  select(-site)

library(knitr)
kable(as.data.frame(output))

url	test	name	funded	percent	amount	place	Repayment.Term.	Repayment.Schedule.	Pre.Disbursed.	Listed	Currency.Exchange.Loss.
http://www.kiva.org/lend/785304	Manjurani	Manjurani	TRUE	0	175	Maynaguri, India	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/785320	Janet	Janet	TRUE	0	225	Kericho, Kenya	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/785583	San Valentin Group	San Valentin Group	TRUE	0	3450	Asunción, Paraguay	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/785828	Djiguiya Group	Djiguiya Group	FALSE	6	1475	M’Pessoba, Mali	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/786535	Hanifan	Hanifan	TRUE	0	450	Liaqat Pur, Pakistan	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/786996	Duom	Duom	TRUE	0	600	Thai Binh, Vietnam	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/787040	Savoeun’s Group	Savoeun’s Group	TRUE	0	150	Battambang, Cambodia	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/787349	Goutami	Goutami	TRUE	0	250	Maynaguri, India	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A
http://www.kiva.org/lend/787780	Zenie	Zenie	TRUE	0	125	Calamba - Baliangao, Misamis Occidental, Philippines	120 months (more info)	Irregularly	Aug 25, 2014	Oct 21, 2014	N/A

Note that the last five columns repeat the values of the sample page (loan 774331) in every row: in the run that produced this table, deflist_to_df() took its <dd> values from the global loansum object rather than from each downloaded page.