library(dplyr)
library(tidyr)
library(magrittr)
library(rvest)
The first step is deciding what data you want to pull from a page. For Kiva microloans, the data might be the borrower's name, their location, the loan amount, the percentage funded, and whether the loan is fully funded.
To do this, we need to write a function that extracts each of these from the page. We start by saving a sample webpage:
site <- html("http://www.kiva.org/lend/774331")
Looks like the name is the page header:
kiva_name <- function(.site){
.site %>%
html_nodes("#pageHeader h2") %>%
html_text
}
kiva_name(site)
## [1] "Resineros De San José De Cañas Group"
The element with class “country” is badly named; it holds the city as well as the country:
kiva_place <- function(.site){
.site %>%
html_nodes("#pageHeader .country") %>%
html_text
}
kiva_place(site)
## [1] "San José de Cañas, Mexico"
Life is pain. The loan amount only appears in the middle of a sentence.
kiva_amt <- function(.site){
.site %>%
html_nodes(".loanExcerpt") %>%
html_text %>%
gsub("[^0-9.]+", "", .) %>%
gsub("\\.*$", "", .) %>% ## remove trailing .
gsub("^\\.*", "", .) %>% ## remove leading .
as.numeric
}
kiva_amt(site)
## [1] 29050
# We first keep only the digits and periods.
# Then we remove any periods from the start or end.
# NOTE: this assumes that the summary will never have another digit in it. For example, "these 2 people want 30$" will produce 230, not 30.
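The caveat in the note above can be softened by anchoring the pattern to the currency symbol instead of keeping every digit. The sketch below assumes the amount is written with a leading dollar sign (for example "$29,050"); that format and the helper name `extract_amount` are assumptions for illustration, not something confirmed by the Kiva page.

```r
# Hypothetical helper: pull the first "$"-prefixed number out of a sentence.
# Assumes the amount looks like "$1,234" or "$1,234.50".
extract_amount <- function(text) {
  hit <- regmatches(text, regexpr("\\$[0-9][0-9,]*(\\.[0-9]+)?", text))
  if (length(hit) == 0) return(NA_real_)   # no dollar amount found
  as.numeric(gsub("[$,]", "", hit))        # drop "$" and thousands separators
}

amt_plain  <- extract_amount("A loan of $29,050 helps a group.")
amt_digits <- extract_amount("these 2 people want $30")
amt_none   <- extract_amount("no amount here")
```

With this approach a stray digit elsewhere in the sentence is ignored, at the cost of returning `NA` when the excerpt uses a different currency format.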
This is another weird one. The “percentage” means the amount raised so far (if less than the total) or the amount paid back (once the total is reached). Oh well. Might as well get it anyway:
kiva_percent <- function(.site){
.site %>%
html_nodes("#loanSummary .number") %>%
html_text %>%
gsub("[^0-9.]+", "", .) %>%
as.numeric
}
kiva_percent(site)
## [1] 37
This is an interesting one. There is an element on the page (“.fullyFundedNotice”) that doesn’t even exist if your loan is not fully funded! So we can simply check that it exists at all:
kiva_funded <- function(.site){
.site %>%
html_nodes(".fullyFundedNotice") %>%
html_text %>%
identical(., character(0)) %>%
not
}
kiva_funded(site)
## [1] FALSE
#kiva_funded(html("http://www.kiva.org/lend/774321"))
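The same existence check can be written by counting the matched nodes directly, which avoids comparing against `character(0)`. This is only a sketch: the two inline pages are made-up stand-ins rather than real Kiva markup, the helper name `has_node` is hypothetical, and it assumes a version of rvest that provides `read_html()`.

```r
library(rvest)

# Two tiny stand-in pages so the example runs without network access.
funded_page <- read_html("<html><body><div class='fullyFundedNotice'>Fully funded</div></body></html>")
open_page   <- read_html("<html><body><div class='loanExcerpt'>Still raising</div></body></html>")

# TRUE when at least one node matches the CSS selector.
has_node <- function(page, css) {
  length(html_nodes(page, css)) > 0
}

is_funded_a <- has_node(funded_page, ".fullyFundedNotice")
is_funded_b <- has_node(open_page, ".fullyFundedNotice")
```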
Let’s grab that little sidebar of info on the person, because it looks useful.
loansum <- site %>%
html_nodes("#loanSummary dl")
By looking at loansum, we can see that it is a definition list object, not a table. That means that we can’t use html_table() to extract the numbers:
loansum %>%
html_table
## Error: html_tag(x) == "table" is not TRUE
OK, so we have two options. One would be to force the whole structure of the list to a text string that we could clean with regular expressions:
loansum %>%
html_text
## [1] "Repayment Term:\n\t\t\t\t\t\t120 months (more info)\n\t\n\t\t\t\t\t\tRepayment Schedule:\n\t\t\t\t\t\tIrregularly\n\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tPre-Disbursed:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tAug 25, 2014\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tListed\n\t\t\t\t\t\t\tOct 21, 2014\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tCurrency Exchange Loss:\n\t\t\t\t\t\tN/A \n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t"
Slightly more elegant would be to use the structure of the list itself. Every “definition list” contains a <dt> tag for the term which is defined, and a <dd> tag for the definition:
loansum %>%
html_nodes("dt") %>%
html_text
## [1] "Repayment Term:" "Repayment Schedule:"
## [3] "Pre-Disbursed:" "Listed"
## [5] "Currency Exchange Loss:"
loansum %>%
html_nodes("dd") %>%
html_text
## [1] "120 months (more info)" "Irregularly"
## [3] "Aug 25, 2014" "Oct 21, 2014"
## [5] "N/A "
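Because the terms and definitions come back as two parallel vectors, they can be paired into a single named vector. The sketch below uses a small made-up definition list (the values are invented for illustration) and assumes a version of rvest that provides `read_html()`.

```r
library(rvest)

# Made-up definition list standing in for the sidebar.
snippet <- read_html(paste0(
  "<dl>",
  "<dt>Repayment Term:</dt><dd> 8 months </dd>",
  "<dt>Repayment Schedule:</dt><dd>Monthly</dd>",
  "</dl>"
))

terms <- snippet %>% html_nodes("dt") %>% html_text()
defs  <- snippet %>% html_nodes("dd") %>% html_text()

# Pairing by position only makes sense if every term has exactly one definition.
stopifnot(length(terms) == length(defs))

# Trim stray whitespace and label each definition with its term.
paired <- setNames(trimws(defs), terms)
```

The length check matters because HTML allows a `<dt>` to be followed by several `<dd>` tags, in which case pairing by position would silently misalign the values.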
We could extract this into a function:
deflist_to_df <- function(.site){
require(rvest)
require(dplyr)
deflist_xml <- .site %>%
html_nodes("#loanSummary dl")
terms <- deflist_xml %>%
html_nodes("dt") %>%
html_text
defs <- deflist_xml %>%
html_nodes("dd") %>%
html_text
names(defs) <- terms
data.frame(t(defs))
}
deflist_to_df(site)
## Repayment.Term. Repayment.Schedule. Pre.Disbursed. Listed
## 1 120 months (more info) Irregularly Aug 25, 2014 Oct 21, 2014
## Currency.Exchange.Loss.
## 1 N/A
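The column names above end in periods because data.frame() converts the spaces and colons in the terms into syntactically valid names. One possible way to get tidier names is to clean the terms before assigning them; the helper `clean_terms` below is a hypothetical sketch, not part of the functions defined so far.

```r
# Hypothetical helper: turn "Repayment Term:" into "repayment_term".
clean_terms <- function(terms) {
  terms <- trimws(terms)                       # drop surrounding whitespace
  terms <- sub(":$", "", terms)                # drop the trailing colon
  tolower(gsub("[^A-Za-z0-9]+", "_", terms))   # other punctuation/space -> "_"
}

cleaned <- clean_terms(c("Repayment Term:", "Pre-Disbursed:", "Listed",
                         "Currency Exchange Loss:"))
```

Inside deflist_to_df() this could be applied as `names(defs) <- clean_terms(terms)`.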
We can use some random numbers to sample the Kiva profiles, taking advantage of the fact that each profile URL ends in a loan number:
numvec2 <- c(786671,785489)
set.seed(5)
numvec <- sample(5000:7914, size = 10)+780000
download <- data.frame(startnum = numvec) %>%
mutate(url = paste0("http://www.kiva.org/lend/", startnum)) %>%
group_by(url) %>%
do(site = failwith(NULL, html)(.$url))
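The download step relies on wrapping the reader so that a missing page yields NULL instead of stopping the whole pipeline. A minimal base R sketch of that idea is shown below; `try_or_null` and `fake_reader` are hypothetical names, and the reader is a stand-in that fails on purpose so the example runs without network access.

```r
# Hypothetical wrapper: returns a version of f that gives NULL on error.
try_or_null <- function(f) {
  function(...) tryCatch(f(...), error = function(e) NULL)
}

# Stand-in for a real page reader; it fails for one ID to mimic a dead link.
fake_reader <- function(url) {
  if (grepl("785001$", url)) stop("page not found")
  paste("page for", url)
}

ids   <- c(785000L, 785001L, 785002L)
urls  <- paste0("http://www.kiva.org/lend/", ids)
pages <- lapply(urls, try_or_null(fake_reader))

# Keep track of which downloads succeeded.
ok <- !vapply(pages, is.null, logical(1))
good_urls <- urls[ok]
```

Filtering on `ok` right after the download is an alternative to detecting failures later by searching for the text "Error" in a result column.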
Now we can use the functions we just wrote to extract all the info from these downloaded pages:
clean_download <- download %>%
mutate(test = try(kiva_name(site))) %>%
filter(!grepl("Error", x = test))
output <- clean_download %>%
group_by(url) %>%
mutate(name = kiva_name(site[[1]]),
funded = kiva_funded(site[[1]]),
percent = kiva_percent(site[[1]]),
amount = kiva_amt(site[[1]]),
place = kiva_place(site[[1]])) %>%
#separate(place, c("city", "country"), sep = ", ") %>%
do(data.frame(., deflist_to_df(.[["site"]][[1]]))) %>%
select(-site)
library(knitr)
kable(as.data.frame(output))
| url | test | name | funded | percent | amount | place | Repayment.Term. | Repayment.Schedule. | Pre.Disbursed. | Listed | Currency.Exchange.Loss. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| http://www.kiva.org/lend/785304 | Manjurani | Manjurani | TRUE | 0 | 175 | Maynaguri, India | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/785320 | Janet | Janet | TRUE | 0 | 225 | Kericho, Kenya | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/785583 | San Valentin Group | San Valentin Group | TRUE | 0 | 3450 | Asunción, Paraguay | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/785828 | Djiguiya Group | Djiguiya Group | FALSE | 6 | 1475 | M’Pessoba, Mali | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/786535 | Hanifan | Hanifan | TRUE | 0 | 450 | Liaqat Pur, Pakistan | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/786996 | Duom | Duom | TRUE | 0 | 600 | Thai Binh, Vietnam | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/787040 | Savoeun’s Group | Savoeun’s Group | TRUE | 0 | 150 | Battambang, Cambodia | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/787349 | Goutami | Goutami | TRUE | 0 | 250 | Maynaguri, India | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
| http://www.kiva.org/lend/787780 | Zenie | Zenie | TRUE | 0 | 125 | Calamba - Baliangao, Misamis Occidental, Philippines | 120 months (more info) | Irregularly | Aug 25, 2014 | Oct 21, 2014 | N/A |
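The place column still mixes city and country, and the last row shows why splitting on every ", " is fragile: some places contain more than one comma. One possible approach is to split only on the final comma. The helper `split_place` below is a hypothetical sketch that assumes the country is always the last comma-separated piece.

```r
# Hypothetical helper: split "City, Region, Country" on the LAST comma only.
split_place <- function(place) {
  country <- sub("^.*, *", "", place)                # text after the last comma
  city <- ifelse(grepl(",", place),
                 sub(", *[^,]*$", "", place),        # text before the last comma
                 NA_character_)                      # no comma: city unknown
  data.frame(city = city, country = country, stringsAsFactors = FALSE)
}

places <- split_place(c("Maynaguri, India",
                        "Calamba - Baliangao, Misamis Occidental, Philippines"))
```

Here the second place keeps "Calamba - Baliangao, Misamis Occidental" as the city and "Philippines" as the country.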