Kiva loans are very small loans, called microloans, made to entrepreneurs who need seed capital to start their businesses. The loans aim to improve communities one entrepreneur at a time. The dataset used in this vignette consists of Kiva loans made around the globe in calendar year 2016. For this vignette, the loans data was pared down to keep the file size under 25 MB.
kiva <- read.csv("https://raw.githubusercontent.com/douglasbarley/FALL2020TIDYVERSE/TidyverseVignette/kiva_loans.csv")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
glimpse(kiva)
## Rows: 197,236
## Columns: 14
## $ id <int> 1002924, 1002908, 1002897, 1002916, 1002891, 100...
## $ funded_amount <int> 500, 500, 500, 500, 575, 1600, 500, 300, 300, 62...
## $ loan_amount <int> 500, 500, 500, 500, 575, 1600, 500, 300, 300, 62...
## $ activity <fct> Rickshaw, Rickshaw, Fruits & Vegetables, Clothin...
## $ sector <fct> Transportation, Transportation, Food, Clothing, ...
## $ country_code <fct> PK, PK, PK, PK, PK, KG, PK, PK, PK, UA, PK, MG, ...
## $ country <fct> Pakistan, Pakistan, Pakistan, Pakistan, Pakistan...
## $ region <fct> "Multan", "Lahore", "Multan", "Lahore", "Lahore"...
## $ partner_id <int> 247, 247, 247, 247, 247, 171, 247, 247, 247, 26,...
## $ term_in_months <int> 14, 11, 15, 11, 11, 14, 11, 11, 12, 26, 14, 6, 1...
## $ lender_count <int> 1, 1, 17, 16, 21, 64, 20, 1, 12, 23, 12, 3, 18, ...
## $ borrower_genders <fct> female, female, female, female, female, female, ...
## $ repayment_interval <fct> monthly, irregular, monthly, irregular, irregula...
## $ date <fct> 1/1/2016, 1/1/2016, 1/1/2016, 1/1/2016, 1/1/2016...
The 2016 data includes 197,236 observations of 14 variables.
The Tidyverse contains many R packages that are useful for cleaning and exploring data. With a fairly long dataset, such as the Kiva set in this example, it is useful to count the rows that share each distinct value of a column. The group_by() function in dplyr, combined with summarize() and n(), does just that and gives a quick picture of what the data contains.
For example, it could be useful to know which countries received the most loans.
countries <- kiva %>%
  group_by(country) %>%
  summarize(count_loans = n())
head(countries)
## # A tibble: 6 x 2
## country count_loans
## <fct> <int>
## 1 Afghanistan 1
## 2 Albania 476
## 3 Armenia 2987
## 4 Azerbaijan 303
## 5 Belize 23
## 6 Bolivia 2488
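The group-then-count pattern above is common enough that dplyr also provides count(), which does both steps in one call. The sketch below is a minimal illustration on a made-up data frame (the name toy and its values are assumptions, not part of the Kiva data) and shows that both forms give the same counts.

```r
library(dplyr)

# Hypothetical toy data standing in for the kiva data frame
toy <- data.frame(
  country = c("Kenya", "Peru", "Kenya", "Kenya", "Peru", "Uganda"),
  stringsAsFactors = FALSE
)

# Long form: group the rows, then count the rows in each group
by_group <- toy %>%
  group_by(country) %>%
  summarize(count_loans = n())

# Short form: count() groups and tallies in one step;
# `name` sets the name of the count column
by_count <- toy %>%
  count(country, name = "count_loans")

by_count
```

Either form should work on the kiva data frame in place of toy; count() simply saves a line when a row count is the only summary needed.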
Once we have a concise count of loans by country, it helps to visualize all of the results in a single graphic. The ggplot() function from the ggplot2 package, also part of the Tidyverse, handles the visualization.
ggplot(data = countries) +
  geom_col(aes(x = country, y = count_loans)) +
  ggtitle("Loans Disbursed by Country") +
  coord_flip() +
  ylab("Loan Count") +
  xlab("Country")
There are so many countries where loans were disbursed that it is difficult to read each country’s name. In order to simplify the listing and visualizations, let’s identify the top 10 countries that received loans.
countries_top10 <- countries %>%
  arrange(desc(count_loans)) %>%
  head(n = 10)
countries_top10
## # A tibble: 10 x 2
## country count_loans
## <fct> <int>
## 1 Philippines 48317
## 2 Kenya 20604
## 3 Cambodia 11590
## 4 El Salvador 9454
## 5 Pakistan 8777
## 6 Colombia 7170
## 7 Tajikistan 6318
## 8 Peru 6215
## 9 Ecuador 5038
## 10 Uganda 4524
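Sorting and then taking the first rows is one way to get a top-N list. As an alternative sketch, dplyr 1.0.0 and later provide slice_max(), which selects the rows with the largest values of a column directly. The data frame toy_counts below is made up for illustration.

```r
library(dplyr)

# Hypothetical counts standing in for the countries summary table
toy_counts <- data.frame(
  country = c("A", "B", "C", "D"),
  count_loans = c(10L, 40L, 25L, 5L),
  stringsAsFactors = FALSE
)

# Sort-then-head approach, as used in the vignette
top2_head <- head(arrange(toy_counts, desc(count_loans)), n = 2)

# slice_max() keeps the n rows with the largest count_loans.
# Note: with_ties defaults to TRUE, so tied values can return more than n rows.
top2_slice <- toy_counts %>%
  slice_max(order_by = count_loans, n = 2)

top2_slice
```

Both approaches return countries B and C here. They can differ when several rows tie at the cutoff, because head() always stops at exactly n rows.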
Now we can graph the top 10 countries that received loans.
ggplot(data = countries_top10) +
  geom_col(aes(x = reorder(country, count_loans), y = count_loans)) +
  ggtitle("Loans Disbursed by Country") +
  coord_flip() +
  ylab("Loan Count") +
  xlab("Country")
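The bars in the top 10 chart are sorted because the axis variable is wrapped in reorder(), a base R function that reorders the levels of a factor by a second variable. The sketch below uses a made-up three-row data frame (toy_counts and its values are assumptions) to show what happens to the factor levels, which is what ggplot2 uses to order a discrete axis.

```r
# Hypothetical counts; the factor levels start out alphabetical (A, B, C)
toy_counts <- data.frame(
  country = factor(c("A", "B", "C")),
  count_loans = c(30, 10, 20)
)

# reorder() sorts the levels by count_loans, smallest first
ascending <- reorder(toy_counts$country, toy_counts$count_loans)
levels(ascending)   # "B" "C" "A"

# Negating the values reverses the order, largest first
descending <- reorder(toy_counts$country, -toy_counts$count_loans)
levels(descending)  # "A" "C" "B"
```

With coord_flip(), the first level is drawn at the bottom of the chart, so the ascending order puts the largest bar at the top.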
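That’s much more legible! Now we can see that the Philippines received the most loans of any country in this 2016 dataset, more than twice as many as second-place Kenya.
The same approach works for any categorical column. Next, we count loans by sector and by activity.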
sector <- kiva %>%
  group_by(sector) %>%
  summarize(count_loans = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(sector)
## # A tibble: 6 x 2
## sector count_loans
## <fct> <int>
## 1 Agriculture 52647
## 2 Arts 3909
## 3 Clothing 8957
## 4 Construction 1648
## 5 Education 9959
## 6 Entertainment 245
activity <- kiva %>%
  group_by(activity) %>%
  summarize(count_loans = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(activity)
## # A tibble: 6 x 2
## activity count_loans
## <fct> <int>
## 1 Agriculture 6033
## 2 Air Conditioning 12
## 3 Animal Sales 2578
## 4 Arts 344
## 5 Auto Repair 383
## 6 Bakery 1066
Once we have counts of loans by sector and by activity, we can visualize them the same way with ggplot(). Wrapping the axis variable in reorder() sorts the bars by loan count.
ggplot(data = sector) +
  geom_col(aes(x = reorder(sector, count_loans), y = count_loans)) +
  ggtitle("Loans Disbursed by Sector") +
  coord_flip() +
  ylab("Loan Count") +
  xlab("Sector")
ggplot(data = activity) +
  geom_col(aes(x = reorder(activity, count_loans), y = count_loans)) +
  ggtitle("Loans Disbursed by Activity") +
  coord_flip() +
  ylab("Loan Count") +
  xlab("Activity")
As with the full country chart, the full activity chart has too many bars to read comfortably, so we again keep only the top 10 activities.
activity_top10 <- activity %>%
  arrange(desc(count_loans)) %>%
  head(n = 10)
activity_top10
## # A tibble: 10 x 2
## activity count_loans
## <fct> <int>
## 1 Farming 21967
## 2 General Store 19206
## 3 Pigs 8915
## 4 Personal Housing Expenses 8896
## 5 Home Appliances 8531
## 6 Food Production/Sales 8422
## 7 Clothing Sales 6484
## 8 Agriculture 6033
## 9 Retail 5938
## 10 Higher education costs 5726
ggplot(data = activity_top10) +
  geom_col(aes(x = reorder(activity, count_loans), y = count_loans)) +
  ggtitle("Loans Disbursed by Activity") +
  coord_flip() +
  ylab("Loan Count") +
  xlab("Activity")
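The country, sector, and activity sections all repeat the same steps: count the rows per value, sort by count, and keep the top rows. One possible way to avoid the repetition is a small helper function. The function top_counts() below is hypothetical and not part of the original vignette; it relies on the `{{ }}` operator supported by recent dplyr versions to pass a column name through to count(). The data frame toy is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: count rows per value of `column`,
# sort by count (largest first), and keep the first n rows
top_counts <- function(data, column, n = 10) {
  data %>%
    count({{ column }}, name = "count_loans", sort = TRUE) %>%
    head(n)
}

# Hypothetical toy data standing in for the kiva data frame
toy <- data.frame(
  sector = c("Food", "Retail", "Food", "Arts", "Food", "Retail"),
  stringsAsFactors = FALSE
)

top_sectors <- top_counts(toy, sector, n = 2)
top_sectors
```

Under these assumptions, a call such as top_counts(kiva, country) should give the same kind of top 10 table built step by step earlier, and the result can be passed to ggplot() in the same way.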