DATA 607 Data Project

Kiva loans

Part 1 - Introduction and the big questions

Kiva.org is a crowdfunding organization that helps people around the world by providing loans for a variety of purposes ranging across areas such as education, health, women, the arts, technology and business startups. The loans are disbursed directly to the borrowers, and many of the loans are targeted specifically to help people living in poverty.

Kiva’s lenders fund projects in many, but not all, of the countries around the world. Most of the countries where loans are made, which includes the US, include areas where there is greater relative poverty, but not all countries receive Kiva loans. This leads to the primary research question for this project, which is:

Research question

Are a country’s total population and poverty proportion predictive of the volumes and amounts of Kiva’s microloans disbursed in that country?

In this question the dependent, or response, variables are the frequency and amounts of loans Kiva made in a given country.

The numerical independent variables are the total population and poverty proportion for each country, and a qualitative independent variable is the country’s name where the loans were disbursed.

…and the research question leads to a further question that we may want to try to answer:

Corollary question

If a country’s total population and poverty proportion are predictive of the volumes and amounts of Kiva’s microloans for a given country, can we predict the volumes and amounts of loans that might be made in countries where loans are not currently being made based on a specific country’s total population and poverty proportion?

Part 2 - Data sources and acquisition methods

World Bank csv import of total population per country

In order to answer the research question we first need to know the population for each country in the world. Population data is collected by the World Bank and is made available under its CC-BY 4.0 license as open data, given proper attribution. It is available on the World Bank website, and for this project it was extracted by downloading a .csv file from the World Bank, and storing that .csv file on my GitHub repo.

Each case in the World Bank population data represents the annual total population of a given country per year from 1960 to 2019. For this project we are only interested in the years 2014 through 2017, which aligns with the dates in the available data on Kiva’s loans.

The World Bank population data from 1960 to 2019 was read into RStudio from the .csv file in GitHub as shown below, and it was then transformed to only show populations for the years 2014 through 2017 inclusive.

WBpopURL <- "https://raw.githubusercontent.com/douglasbarley/Fall2020_FinalProject/main/PopulationByCountry.csv"
WBmetaURL <- "https://raw.githubusercontent.com/douglasbarley/Fall2020_FinalProject/main/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_1495124.csv"

WB_Pop <- read.csv(WBpopURL) %>%
  subset(select = c(CountryName, CountryCode, X2014, X2015, X2016, X2017)) %>%
    pivot_longer(`X2014`:`X2017`, names_to = "PopYear")

names(WB_Pop)[4] <- "Population"

WB_Pop$PopYear <- type.convert(gsub('^.', '', WB_Pop$PopYear)) # replace the first character "X" with a null string

head(WB_Pop)

## # A tibble: 6 x 4
##   CountryName CountryCode PopYear Population
##   <chr>       <chr>         <int>      <dbl>
## 1 Aruba       ABW            2014     103774
## 2 Aruba       ABW            2015     104341
## 3 Aruba       ABW            2016     104872
## 4 Aruba       ABW            2017     105366
## 5 Afghanistan AFG            2014   33370794
## 6 Afghanistan AFG            2015   34413603

The World Bank also provided a metadata file about the countries that includes the world region and overall income group as determined by the World Bank.

WBmetaURL <- "https://raw.githubusercontent.com/douglasbarley/Fall2020_FinalProject/main/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_1495124.csv"

WB_Meta <- read.csv(WBmetaURL)
names(WB_Meta)[1] <- "CountryCode"
names(WB_Meta)[5] <- "CountryName"

WB_Meta <- subset(WB_Meta, WB_Meta$IncomeGroup != "", select = c(CountryCode, Region, IncomeGroup))
  
head(WB_Meta)

##   CountryCode                     Region         IncomeGroup
## 1         ABW  Latin America & Caribbean         High income
## 2         AFG                 South Asia          Low income
## 3         AGO         Sub-Saharan Africa Lower middle income
## 4         ALB      Europe & Central Asia Upper middle income
## 5         AND      Europe & Central Asia         High income
## 7         ARE Middle East & North Africa         High income

With both files imported we can combine the data into a single set.

WB_data <- inner_join(WB_Pop,WB_Meta, by = c("CountryCode" = "CountryCode"), copy = FALSE, keep = FALSE)
head(WB_data, 10)

## # A tibble: 10 x 6
##    CountryName CountryCode PopYear Population Region             IncomeGroup    
##    <chr>       <chr>         <int>      <dbl> <chr>              <chr>          
##  1 Aruba       ABW            2014     103774 Latin America & C~ High income    
##  2 Aruba       ABW            2015     104341 Latin America & C~ High income    
##  3 Aruba       ABW            2016     104872 Latin America & C~ High income    
##  4 Aruba       ABW            2017     105366 Latin America & C~ High income    
##  5 Afghanistan AFG            2014   33370794 South Asia         Low income     
##  6 Afghanistan AFG            2015   34413603 South Asia         Low income     
##  7 Afghanistan AFG            2016   35383128 South Asia         Low income     
##  8 Afghanistan AFG            2017   36296400 South Asia         Low income     
##  9 Angola      AGO            2014   26941779 Sub-Saharan Africa Lower middle i~
## 10 Angola      AGO            2015   27884381 Sub-Saharan Africa Lower middle i~

World Health Organization (WHO) API import of poverty rates per country

With the population of each country in the world in hand, we next need to know that poverty rates for each country. Poverty data is collected by the WHO as part of their Global Health Observatory site. The data is openly available World Health Organization via their API, and for this project it was extracted by accessing the API on the WHO the website.

Each case represents the proportion of the total population that lived below the international poverty line of US$1.90/day for a given year, for each of the years 2014 through 2017 inclusive. There are 1137 observations in the given data set.

Two files were required from the API, because one file only had a 3-digit country code, whereas we wanted the full country name for each country.

# Link to OData API for the World Health Organization (WHO) indicators on poverty
WHO_PovertyAPI <- GET("https://ghoapi.azureedge.net/api/SI_POV_DAY1", verbose())
http_status(WHO_PovertyAPI)

## $category
## [1] "Success"
## 
## $reason
## [1] "OK"
## 
## $message
## [1] "Success: (200) OK"

# Link to OData API for the World Health Organization (WHO) values for the Country dimension
WHO_CountryAPI <- GET("https://ghoapi.azureedge.net/api/DIMENSION/COUNTRY/DimensionValues", verbose())
http_status(WHO_CountryAPI)

## $category
## [1] "Success"
## 
## $reason
## [1] "OK"
## 
## $message
## [1] "Success: (200) OK"

With a Success code of 200 the OData file was successfully read into a dataframe, but it was still difficult to read/interpret in plain language due to the structure of the file. As such I extracted the contents from the poverty data URL call as raw text, and then inserted the poverty data text into an R dataframe.

WHO_PovertyList <- content(WHO_PovertyAPI, "text", encoding='UTF-8')
WHO_PovertyListText <- fromJSON(WHO_PovertyList, flatten = TRUE)
WHO_PovertyList_df <- as.data.frame(WHO_PovertyListText)
str(WHO_PovertyList_df)

## 'data.frame':    1440 obs. of  24 variables:
##  $ X.odata.context         : chr  "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" ...
##  $ value.Id                : int  21319082 21319083 21319084 21319085 21319086 21319087 21319088 21319089 21319090 21319091 ...
##  $ value.IndicatorCode     : chr  "SI_POV_DAY1" "SI_POV_DAY1" "SI_POV_DAY1" "SI_POV_DAY1" ...
##  $ value.SpatialDimType    : chr  "COUNTRY" "COUNTRY" "COUNTRY" "COUNTRY" ...
##  $ value.SpatialDim        : chr  "AGO" "AGO" "ALB" "ALB" ...
##  $ value.TimeDimType       : chr  "YEAR" "YEAR" "YEAR" "YEAR" ...
##  $ value.TimeDim           : int  2000 2008 2002 2005 2008 2012 2000 2001 2002 2003 ...
##  $ value.Dim1Type          : logi  NA NA NA NA NA NA ...
##  $ value.Dim1              : logi  NA NA NA NA NA NA ...
##  $ value.Dim2Type          : logi  NA NA NA NA NA NA ...
##  $ value.Dim2              : logi  NA NA NA NA NA NA ...
##  $ value.Dim3Type          : logi  NA NA NA NA NA NA ...
##  $ value.Dim3              : logi  NA NA NA NA NA NA ...
##  $ value.DataSourceDimType : logi  NA NA NA NA NA NA ...
##  $ value.DataSourceDim     : logi  NA NA NA NA NA NA ...
##  $ value.Value             : chr  "32.3" "30.1" "2.0" "1.1" ...
##  $ value.NumericValue      : num  32.3 30.1 2 1.1 0.4 1.1 5.7 9.4 14 7 ...
##  $ value.Low               : logi  NA NA NA NA NA NA ...
##  $ value.High              : logi  NA NA NA NA NA NA ...
##  $ value.Comments          : chr  "World Development Indicators database, World Bank: Estimated from unit-record consumption data." "World Development Indicators database, World Bank: Estimated from unit-record consumption data." "World Development Indicators database, World Bank: Estimated from unit-record consumption data." "World Development Indicators database, World Bank: Estimated from unit-record consumption data." ...
##  $ value.Date              : chr  "2019-08-20T09:05:26.32+02:00" "2019-08-20T09:05:26.373+02:00" "2019-08-20T09:05:26.403+02:00" "2019-08-20T09:05:26.42+02:00" ...
##  $ value.TimeDimensionValue: chr  "2000" "2008" "2002" "2005" ...
##  $ value.TimeDimensionBegin: chr  "2000-01-01T00:00:00+01:00" "2008-01-01T00:00:00+01:00" "2002-01-01T00:00:00+01:00" "2005-01-01T00:00:00+01:00" ...
##  $ value.TimeDimensionEnd  : chr  "2000-12-31T00:00:00+01:00" "2008-12-31T00:00:00+01:00" "2002-12-31T00:00:00+01:00" "2005-12-31T00:00:00+01:00" ...

I then repeated the process for the country dimension data

WHO_CountryList <- content(WHO_CountryAPI, "text", encoding='UTF-8')
WHO_CountryListText <- fromJSON(WHO_CountryList, flatten = TRUE)
WHO_CountryList_df <- as.data.frame(WHO_CountryListText)
str(WHO_CountryList_df)

## 'data.frame':    216 obs. of  7 variables:
##  $ X.odata.context      : chr  "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" ...
##  $ value.Code           : chr  "AGO" "BDI" "BEN" "BFA" ...
##  $ value.Title          : chr  "Angola" "Burundi" "Benin" "Burkina Faso" ...
##  $ value.ParentDimension: chr  "REGION" "REGION" "REGION" "REGION" ...
##  $ value.Dimension      : chr  "COUNTRY" "COUNTRY" "COUNTRY" "COUNTRY" ...
##  $ value.ParentCode     : chr  "AFR" "AFR" "AFR" "AFR" ...
##  $ value.ParentTitle    : chr  "Africa" "Africa" "Africa" "Africa" ...

The 24 variables in the poverty dataframe and the 7 variables in the country dataframe were not all necessary for the project. So I decided to look at only the core information needed from the data, which I considered to be the country code, year and numerical percent of the population living in poverty from the poverty data set.

WHO_PovertyList_df2 <- subset(WHO_PovertyList_df, select = c(value.SpatialDim, value.TimeDim, value.NumericValue))
names(WHO_PovertyList_df2) <- c("CountryCode","Year","PctPoverty")
head(WHO_PovertyList_df2)

##   CountryCode Year PctPoverty
## 1         AGO 2000       32.3
## 2         AGO 2008       30.1
## 3         ALB 2002        2.0
## 4         ALB 2005        1.1
## 5         ALB 2008        0.4
## 6         ALB 2012        1.1

And I only needed the country code, country name and parent region from the country dimension set.

# get 3 variables from the country dimensions dataset
WHO_CountryList_df2 <- subset(WHO_CountryList_df, select = c(value.Code, value.Title, value.ParentTitle))
names(WHO_CountryList_df2) <- c("CountryCode","CountryName","WorldRegion")
head(WHO_CountryList_df2)

##   CountryCode              CountryName WorldRegion
## 1         AGO                   Angola      Africa
## 2         BDI                  Burundi      Africa
## 3         BEN                    Benin      Africa
## 4         BFA             Burkina Faso      Africa
## 5         BWA                 Botswana      Africa
## 6         CAF Central African Republic      Africa

Once I extracted the data from the poverty and country data sets, I merged them into a single WHO_poverty data frame, retaining only observations in the years 2014 through 2017 in order to align with the Kiva loan data.

WHO_poverty <- 
  inner_join(WHO_PovertyList_df2,WHO_CountryList_df2, by = c("CountryCode" = "CountryCode"), copy = FALSE, keep = FALSE) %>%
    filter(Year >= 2014 & Year < 2018)

WHO_poverty <- WHO_poverty[c("CountryName","CountryCode","WorldRegion","Year","PctPoverty")]

head(WHO_poverty,7)

##   CountryName CountryCode WorldRegion Year PctPoverty
## 1   Argentina         ARG    Americas 2014        0.7
## 2   Argentina         ARG    Americas 2016        0.6
## 3   Argentina         ARG    Americas 2017        0.4
## 4     Armenia         ARM      Europe 2014        2.3
## 5     Armenia         ARM      Europe 2015        1.9
## 6     Armenia         ARM      Europe 2016        1.8
## 7     Armenia         ARM      Europe 2017        1.4

Kiva.org import from Azure database of loans made per country

Kiva loan data from 2014 through 2017 was available as part of a Kaggle challenge. The data was available under a public domain license from the Kaggle website, but the .csv download was 187 MB, which I found was too large to upload to GitHub.

I created an Azure account and established an Azure database, to which I uploaded the .csv file, but when I tried to read the table with 671,205 observations spanning the period from 2014 through mid-2017 into RStudio I found it took a long time. So I created 4 views in the Azure database using SQL code. Each view was for each year of the data, and I was able to import each view from Azure into its own dataframe in RStudio. Then I used the rbind function to reassemble all of the data into a single dataframe inside RStudio.

Each case in the data represents a loan made to an economically disadvantaged person in the currency of the recipient’s residency.

I created a user account inside the Azure database named Rconnection which was used to establish the connection from this document to the Azure data source:

# connect to the server
my_connection <- dbConnect(drv = odbc::odbc(),
         Driver = "SQL Server",
         server = "cuny.database.windows.net",
         database = "CUNY",
         uid = "Rconnection",
         pwd = "RS606Proj#1")

kiva_2014 <- dbGetQuery(my_connection,'
  SELECT [id]
        ,[funded_amount]
        ,[loan_amount]
        ,[activity]
        ,[sector]
        ,[country_code]
        ,[country]
        ,[currency]
        ,[partner_id]
        ,[posted_time]
        ,[disbursed_time]
        ,[funded_time]
        ,[term_in_months]
        ,[lender_count]
        ,[repayment_interval]
        ,[date]
    FROM [dbo].[vw_kiva_loans_2014]
')

kiva_2015 <- dbGetQuery(my_connection,'
  SELECT [id]
        ,[funded_amount]
        ,[loan_amount]
        ,[activity]
        ,[sector]
        ,[country_code]
        ,[country]
        ,[currency]
        ,[partner_id]
        ,[posted_time]
        ,[disbursed_time]
        ,[funded_time]
        ,[term_in_months]
        ,[lender_count]
        ,[repayment_interval]
        ,[date]
    FROM [dbo].[vw_kiva_loans_2015]
')

kiva_2016 <- dbGetQuery(my_connection,'
  SELECT [id]
        ,[funded_amount]
        ,[loan_amount]
        ,[activity]
        ,[sector]
        ,[country_code]
        ,[country]
        ,[currency]
        ,[partner_id]
        ,[posted_time]
        ,[disbursed_time]
        ,[funded_time]
        ,[term_in_months]
        ,[lender_count]
        ,[repayment_interval]
        ,[date]
    FROM [dbo].[vw_kiva_loans_2016]
')

kiva_2017 <- dbGetQuery(my_connection,'
  SELECT [id]
        ,[funded_amount]
        ,[loan_amount]
        ,[activity]
        ,[sector]
        ,[country_code]
        ,[country]
        ,[currency]
        ,[partner_id]
        ,[posted_time]
        ,[disbursed_time]
        ,[funded_time]
        ,[term_in_months]
        ,[lender_count]
        ,[repayment_interval]
        ,[date]
    FROM [dbo].[vw_kiva_loans_2017]
')

Combine all years of data into a single dataframe and take a look at the structure of the data.

kiva <- rbind(kiva_2014,kiva_2015,kiva_2016,kiva_2017)

kiva <- mutate(kiva, Year = as.integer(substr(kiva$date,1,4)))

str(kiva)

## 'data.frame':    671205 obs. of  17 variables:
##  $ id                : chr  "694226" "694450" "694601" "694292" ...
##  $ funded_amount     : chr  "225.0" "1950.0" "600.0" "75.0" ...
##  $ loan_amount       : num  225 1950 600 75 500 ...
##  $ activity          : chr  "Transportation" "Cattle" "Construction Supplies" "General Store" ...
##  $ sector            : chr  "Transportation" "Agriculture" "Construction" "Retail" ...
##  $ country_code      : chr  "PH" "KG" "NI" "PH" ...
##  $ country           : chr  "Philippines" "Kyrgyzstan" "Nicaragua" "Philippines" ...
##  $ currency          : chr  "PHP" "KGS" "NIO" "PHP" ...
##  $ partner_id        : chr  "145.0" "171.0" "98.0" "136.0" ...
##  $ posted_time       : chr  "2014-04-07" "2014-04-07" "2014-04-07" "2014-04-07" ...
##  $ disbursed_time    : chr  "2014-03-12" "2014-03-27" "2014-03-29" "2014-03-28" ...
##  $ funded_time       : POSIXct, format: "2014-04-14 19:34:57" "2014-05-09 19:19:36" ...
##  $ term_in_months    : num  8 26 15 8 8 8 15 14 12 17 ...
##  $ lender_count      : chr  "6" "67" "20" "1" ...
##  $ repayment_interval: chr  "irregular" "monthly" "monthly" "irregular" ...
##  $ date              : chr  "2014-04-07" "2014-04-07" "2014-04-07" "2014-04-07" ...
##  $ Year              : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...

Data integration for use in a multivariate logistic regression

With all three discrete datasets imported, I wanted to combine the information in a way that I could use it to run a multiple regression model. I started with combining all the WB_data with the WHO_poverty percent of the total population in poverty for each country using a left_join() function.

world_poverty <- left_join(WB_data,WHO_poverty, by = c("CountryCode" = "CountryCode", "PopYear" = "Year"), copy = FALSE, keep = FALSE)
names(world_poverty)[1] <- "CountryName"
world_poverty <- world_poverty[c("CountryName", "CountryCode", "PopYear", "Population", "Region", "IncomeGroup", "PctPoverty")]
head(world_poverty,10)

## # A tibble: 10 x 7
##    CountryName CountryCode PopYear Population Region     IncomeGroup  PctPoverty
##    <chr>       <chr>         <int>      <dbl> <chr>      <chr>             <dbl>
##  1 Aruba       ABW            2014     103774 Latin Ame~ High income          NA
##  2 Aruba       ABW            2015     104341 Latin Ame~ High income          NA
##  3 Aruba       ABW            2016     104872 Latin Ame~ High income          NA
##  4 Aruba       ABW            2017     105366 Latin Ame~ High income          NA
##  5 Afghanistan AFG            2014   33370794 South Asia Low income           NA
##  6 Afghanistan AFG            2015   34413603 South Asia Low income           NA
##  7 Afghanistan AFG            2016   35383128 South Asia Low income           NA
##  8 Afghanistan AFG            2017   36296400 South Asia Low income           NA
##  9 Angola      AGO            2014   26941779 Sub-Sahar~ Lower middl~         NA
## 10 Angola      AGO            2015   27884381 Sub-Sahar~ Lower middl~         NA

Then I wanted to aggregate the number and amounts of loans for each year in each country from the kiva dataframe…

kiva_loans <- kiva %>%
  group_by(country,Year,sector) %>%
    summarise(loan_count = n(), loan_totals = sum(loan_amount)) %>%
      mutate(avg_loan_amount = round(loan_totals / loan_count,2))

head(kiva_loans)

## # A tibble: 6 x 6
## # Groups:   country, Year [3]
##   country      Year sector       loan_count loan_totals avg_loan_amount
##   <chr>       <int> <chr>             <int>       <dbl>           <dbl>
## 1 Afghanistan  2014 Arts                  1        6000           6000 
## 2 Afghanistan  2016 Arts                  1        8000           8000 
## 3 Albania      2014 Agriculture         242      379075           1566.
## 4 Albania      2014 Arts                  2        3875           1938.
## 5 Albania      2014 Clothing             28       45500           1625 
## 6 Albania      2014 Construction         13       19150           1473.

…which compressed 671,205 observations of 17 variables down to 3,162 observations of 6 variables.

Then I joined the Kiva loans set with the world_poverty table to arrive at the core data I wanted to analyze for the project.

world_loans <- left_join(world_poverty,kiva_loans, by = c("CountryName" = "country", "PopYear" = "Year"), copy = FALSE, keep = FALSE)
names(world_loans)[3] <- "Year"
head(world_loans,10)

## # A tibble: 10 x 11
##    CountryName CountryCode  Year Population Region IncomeGroup PctPoverty sector
##    <chr>       <chr>       <int>      <dbl> <chr>  <chr>            <dbl> <chr> 
##  1 Aruba       ABW          2014     103774 Latin~ High income         NA <NA>  
##  2 Aruba       ABW          2015     104341 Latin~ High income         NA <NA>  
##  3 Aruba       ABW          2016     104872 Latin~ High income         NA <NA>  
##  4 Aruba       ABW          2017     105366 Latin~ High income         NA <NA>  
##  5 Afghanistan AFG          2014   33370794 South~ Low income          NA Arts  
##  6 Afghanistan AFG          2015   34413603 South~ Low income          NA <NA>  
##  7 Afghanistan AFG          2016   35383128 South~ Low income          NA Arts  
##  8 Afghanistan AFG          2017   36296400 South~ Low income          NA <NA>  
##  9 Angola      AGO          2014   26941779 Sub-S~ Lower midd~         NA <NA>  
## 10 Angola      AGO          2015   27884381 Sub-S~ Lower midd~         NA <NA>  
## # ... with 3 more variables: loan_count <int>, loan_totals <dbl>,
## #   avg_loan_amount <dbl>

Exploring the proposed model data set

Using the world_loans dataframe we can create a training set to include only the countries that have received loans and a test set that includes only the countries that have not received loans.

Here is the training set that includes a sector, total loan count, total loan amount and average loan amount for each country that received loans:

world_loan_tr <- world_loans %>%
    group_by(CountryName, Region, IncomeGroup) %>%
      summarise(avg_Pop = mean(Population), avg_pctPoverty = mean(PctPoverty, na.rm = TRUE)) %>%
        replace_na(list(avg_pctPoverty = 0))

world_loan_train <- left_join(world_loan_tr, kiva_loans, by = c("CountryName" = "country"), copy = FALSE, keep = FALSE) %>%
    group_by(CountryName, Region, IncomeGroup,avg_Pop,avg_pctPoverty,sector) %>%
      summarise(total_loan_count = sum(loan_count), total_loan_amount = sum(loan_totals)) %>%
        mutate(avg_loan_amount = round(total_loan_amount / total_loan_count,2)) %>%
          filter(total_loan_count > 0)

world_loan_train

## # A tibble: 886 x 9
## # Groups:   CountryName, Region, IncomeGroup, avg_Pop, avg_pctPoverty [76]
##    CountryName Region IncomeGroup avg_Pop avg_pctPoverty sector total_loan_count
##    <chr>       <chr>  <chr>         <dbl>          <dbl> <chr>             <int>
##  1 Afghanistan South~ Low income   3.49e7              0 Arts                  2
##  2 Albania     Europ~ Upper midd~  2.88e6              0 Agric~              719
##  3 Albania     Europ~ Upper midd~  2.88e6              0 Arts                  5
##  4 Albania     Europ~ Upper midd~  2.88e6              0 Cloth~               87
##  5 Albania     Europ~ Upper midd~  2.88e6              0 Const~               30
##  6 Albania     Europ~ Upper midd~  2.88e6              0 Educa~               93
##  7 Albania     Europ~ Upper midd~  2.88e6              0 Enter~                5
##  8 Albania     Europ~ Upper midd~  2.88e6              0 Food                 79
##  9 Albania     Europ~ Upper midd~  2.88e6              0 Health              271
## 10 Albania     Europ~ Upper midd~  2.88e6              0 Housi~              378
## # ... with 876 more rows, and 2 more variables: total_loan_amount <dbl>,
## #   avg_loan_amount <dbl>

And here is the test set that includes no sector, loan count, total loan amount or average loan amount since there were no loans made to these countries:

world_loan_test <- left_join(world_loan_tr, kiva_loans, by = c("CountryName" = "country"), copy = FALSE, keep = FALSE) %>%
    group_by(CountryName, Region, IncomeGroup,avg_Pop,avg_pctPoverty,sector) %>%
      summarise(total_loan_count = sum(loan_count), total_loan_amount = sum(loan_totals)) %>%
        mutate(avg_loan_amount = round(total_loan_amount / total_loan_count,2)) %>%
          filter(is.na(total_loan_count))

world_loan_test

## # A tibble: 141 x 9
## # Groups:   CountryName, Region, IncomeGroup, avg_Pop, avg_pctPoverty [141]
##    CountryName Region IncomeGroup avg_Pop avg_pctPoverty sector total_loan_count
##    <chr>       <chr>  <chr>         <dbl>          <dbl> <chr>             <int>
##  1 Algeria     Middl~ Lower midd~  4.01e7          0     <NA>                 NA
##  2 American S~ East ~ Upper midd~  5.57e4          0     <NA>                 NA
##  3 Andorra     Europ~ High income  7.79e4          0     <NA>                 NA
##  4 Angola      Sub-S~ Lower midd~  2.84e7          0     <NA>                 NA
##  5 Antigua an~ Latin~ High income  9.40e4          0     <NA>                 NA
##  6 Argentina   Latin~ Upper midd~  4.34e7          0.567 <NA>                 NA
##  7 Aruba       Latin~ High income  1.05e5          0     <NA>                 NA
##  8 Australia   East ~ High income  2.40e7          0.7   <NA>                 NA
##  9 Austria     Europ~ High income  8.68e6          0.450 <NA>                 NA
## 10 Bahamas, T~ Latin~ High income  3.76e5          0     <NA>                 NA
## # ... with 131 more rows, and 2 more variables: total_loan_amount <dbl>,
## #   avg_loan_amount <dbl>

Having arrived at this point in the project, we can see what the data has provided for the analysis.

train_countries <- n_distinct(world_loan_train$CountryName)
test_countries <- n_distinct(world_loan_test$CountryName)

cat("This gives us a training set with", train_countries,"countries and a test set with",test_countries,"countries.")

## This gives us a training set with 76 countries and a test set with 141 countries.

Part 3 - Exploratory data analysis

Exploring loan recipients by country

With the data in hand the top of mind question is: where is Kiva making loans, and where are they not making loans?

Quick map view of the countries in the training dataset (i.e countries with loans).

data("World") # get world geography from tmap dataframe

train_countries_names <- distinct(world_loan_train, CountryName)

world_loan_train_map <- World %>%
  filter(name %in% train_countries_names$CountryName)

tmap_mode("view") # interactive map view

tm_shape(world_loan_train_map) +
  tm_polygons("income_grp") +
  tm_tiles("Stamen.TonerLabels")

Quick map view of the countries in the test dataset (i.e. countries not receiving loans).

data("World") # get world geography from tmap dataframe

test_countries_names <- distinct(world_loan_test, CountryName)

world_loan_test_map <- World %>%
  filter(name %in% test_countries_names$CountryName)

tmap_mode("view") # interactive map view

tm_shape(world_loan_test_map) +
  tm_polygons("income_grp") +
  tm_tiles("Stamen.TonerLabels")

Digging into the training data (i.e. the set with all the historical Kiva loans)

Let’s quantify the number and amounts of loans, as well as the average loan by world region.

loans_by_region <- world_loans %>%
    group_by(Region) %>%
      summarise(loan_count = sum(loan_count, na.rm = TRUE), loan_totals = sum(loan_totals, na.rm = TRUE)) %>%
        mutate(avg_loan_amount = round(loan_totals / loan_count,2))

loans_by_region[order(loans_by_region$avg_loan_amount, decreasing = TRUE),]

## # A tibble: 7 x 4
##   Region                     loan_count loan_totals avg_loan_amount
##   <chr>                           <int>       <dbl>           <dbl>
## 1 North America                    6093    31496375           5169.
## 2 Middle East & North Africa      14142    20179450           1427.
## 3 Latin America & Caribbean      156300   183310550           1173.
## 4 Europe & Central Asia           38932    42135625           1082.
## 5 Sub-Saharan Africa             168197   121521850            722.
## 6 South Asia                      38815    20363600            525.
## 7 East Asia & Pacific            224257   106646850            476.

loans_by_region$Region <- factor(loans_by_region$Region, levels = loans_by_region$Region[order(loans_by_region$loan_count)])

loans_by_region_count <- ggplot(data=loans_by_region, aes(x=Region,y=loan_count)) + 
  geom_col(fill="lightblue") + 
  geom_text(label=(c(loans_by_region$loan_count))) +
  ggtitle("Number of loans by region") +
  xlab('Region') +
  ylab('Count of loans') +
  coord_flip()

loans_by_region$Region <- factor(loans_by_region$Region, levels = loans_by_region$Region[order(loans_by_region$loan_count)])

loans_by_region_total_amount <- ggplot(data=loans_by_region, aes(x=Region,y=loan_totals)) + 
  geom_col(fill="lightblue") + 
  geom_text(label=(c(loans_by_region$loan_totals))) +
  ggtitle("Total $ amount of loans by region") +
  xlab('Region') +
  ylab('Total $ amount of loans') +
  coord_flip()

loans_by_region$Region <- factor(loans_by_region$Region, levels = loans_by_region$Region[order(loans_by_region$loan_count)])

loans_by_region_avg_amount <- ggplot(data=loans_by_region, aes(x=Region,y=avg_loan_amount)) + 
  geom_col(fill="lightblue") + 
  geom_text(label=(c(loans_by_region$avg_loan_amount))) +
  ggtitle("Avg $ amount of loans by region") +
  xlab('Region') +
  ylab('Avg $ amount of loans') +
  coord_flip()

grid.arrange(loans_by_region_count, loans_by_region_total_amount, loans_by_region_avg_amount, nrow = 3)

It appears that the highest total dollars are going to Latin America & the Caribbean region with $183M in total loan dollars, but the greatest number of loans are going to East Asia & the Pacific. Interestingly East Asia & the Pacific region have the lowest average dollars in each loan with a mean of only $475.56 per loan. Also noteworthy is that North America receives the fewest number of loans, but the average loan dollars is the highest in the world at a mean of $5,169.27 per loan.

We could also view the data as a scatterplot with regresssion lines.

ggplot(world_loan_train, aes(x = avg_loan_amount, y = total_loan_count)) + geom_jitter(aes(color = Region)) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + coord_flip() + facet_wrap(~ Region)

## `geom_smooth()` using formula 'y ~ x'

This confirms the same trend in lending by region just with a different view of the data.

Moving on, instead of grouping by geographic region, let’s also look at the number and amounts of loans by country-level income groups regardless of geographic region.

loans_by_affluence <- world_loans %>%
    group_by(IncomeGroup) %>%
      summarise(loan_count = sum(loan_count, na.rm = TRUE), loan_totals = sum(loan_totals, na.rm = TRUE)) %>%
        mutate(avg_loan_amount = round(loan_totals / loan_count,2))

loans_by_affluence[order(loans_by_affluence$avg_loan_amount, decreasing = TRUE),]

## # A tibble: 4 x 4
##   IncomeGroup         loan_count loan_totals avg_loan_amount
##   <chr>                    <int>       <dbl>           <dbl>
## 1 High income               6555    32987450           5032.
## 2 Upper middle income     131695   173429425           1317.
## 3 Low income               84220    76704900            911.
## 4 Lower middle income     424266   242532525            572.

loans_by_affluence$IncomeGroup <- factor(loans_by_affluence$IncomeGroup, levels = loans_by_affluence$IncomeGroup[order(loans_by_affluence$loan_count)])

loans_by_income_count <- ggplot(data=loans_by_affluence, aes(x=IncomeGroup,y=loan_count)) + 
  geom_col(fill="lightblue") + 
  geom_text(label=(c(loans_by_affluence$loan_count))) +
  ggtitle("Number of loans by income group") +
  xlab('Income group') +
  ylab('Total $ amount of loans') +
  coord_flip()

loans_by_affluence$Region <- factor(loans_by_affluence$IncomeGroup, levels = loans_by_affluence$IncomeGroup[order(loans_by_affluence$loan_count)])

loans_by_income_total_amount <- ggplot(data=loans_by_affluence, aes(x=IncomeGroup,y=loan_totals)) + 
  geom_col(fill="lightblue") + 
  geom_text(label=(c(loans_by_affluence$loan_totals))) +
  ggtitle("Total $ amount of loans by income group") +
  xlab('Income group') +
  ylab('Total $ amount of loans') +
  coord_flip()

loans_by_affluence$Region <- factor(loans_by_affluence$IncomeGroup, levels = loans_by_affluence$IncomeGroup[order(loans_by_affluence$loan_count)])

loans_by_income_avg_amount <- ggplot(data=loans_by_affluence, aes(x=IncomeGroup,y=avg_loan_amount)) + 
  geom_col(fill="lightblue") + 
  geom_text(label=(c(loans_by_affluence$avg_loan_amount))) +
  ggtitle("Avg $ amount of loans by income group") +
  xlab('Income group') +
  ylab('Total $ amount of loans') +
  coord_flip()

grid.arrange(loans_by_income_count, loans_by_income_total_amount, loans_by_income_avg_amount, nrow = 3)

It is curious that the high income countries receive the least total dollars but the highest average dollars per loan, while the lower middle income countries receive the greatest number of loans yet the lowest average dollars per loan. That means the lower middle income countries receive almost 9 times less on average per loan than the high income countries. Could this approach to making loans possibly lend credence to the saying "the rich get richer and the poor get poorer?

Let’s take yet another view and look at total loan amounts by total loan counts by sector.

ggplot(world_loan_train, aes(x = total_loan_amount, y = total_loan_count)) + geom_jitter(aes(color = Region)) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + coord_flip() + facet_wrap(~ sector)

## `geom_smooth()` using formula 'y ~ x'

Scanning across the loan sectors, it appears that loans for agriculture, clothing, food, and retail tend to receive more dollars than the other sectors.

We can use violin plots to get a sense of the overall distribution of loans by sector.

ggplot(world_loan_train, aes(y = total_loan_count, x = avg_loan_amount, color = sector)) + geom_violin() + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + coord_flip() + facet_wrap(~ sector)

Again agriculture, clothing, food and retail appear to have bigger bulges toward the top end of their loan counts in terms of avg amounts.

Part 4 - Inference

Simple Linear Regression

First let’s look a little more at the training data in terms of predicting number of loans purely based on avg percent of poverty in a country.

We see that the avg loans are quite right skewed, suggesting that the majority of loans made average smaller amounts.

hist(world_loan_train$avg_loan_amount, breaks = 50, xlab = "Avg Loan", col = "lightblue", main = "Avg Loans")

If we fit a linear model called m_loans to predict the amount of loans made in a country by its average poverty percentage, we see that the equation for the linear model turns out to be \[\widehat{total\_loans} = 695.732 + 3.477 * avg\_Pct\_Poverty\].

m_loans_poverty <- lm(world_loan_train$total_loan_count ~ world_loan_train$avg_pctPoverty, data = world_loan_train)
summary(m_loans_poverty)

## 
## Call:
## lm(formula = world_loan_train$total_loan_count ~ world_loan_train$avg_pctPoverty, 
##     data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -939   -694   -649   -314  52858 
## 
## Coefficients:
##                                 Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                      695.732    121.292   5.736 0.0000000133 ***
## world_loan_train$avg_pctPoverty    3.477      6.029   0.577        0.564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3149 on 884 degrees of freedom
## Multiple R-squared:  0.0003762,  Adjusted R-squared:  -0.0007546 
## F-statistic: 0.3327 on 1 and 884 DF,  p-value: 0.5642

We also see that there is a linear relationship between the avg Pct of poverty and the total loans received.

ggplot(world_loan_train, aes(x = avg_pctPoverty, y = total_loan_count)) + geom_jitter(aes(color = Region)) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')

Let’s pause to verify that the conditions for this model are reasonable using the standard diagnostic plots.

ggplot(data = m_loans_poverty, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals") + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')

The variability of points around the least squares line remains roughly constant with a few extreme outliers, meaning that the relationship between the avg pct of poverty and the total amount of loans is roughly linear.

ggplot(data = m_loans_poverty, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

The residuals show a considerable right skew with several far outliers.

ggplot(data = m_loans_poverty, aes(sample = .resid)) +
  stat_qq()

The qq-plot also shows a highly right skewed curve with several high end outliers.

Running logistic regressions to find a predictor model

We will run the full model first against the training set. So let’s create the full model.

m_loans <- lm(total_loan_count ~ CountryName + Region + IncomeGroup + sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)

## 
## Call:
## lm(formula = total_loan_count ~ CountryName + Region + IncomeGroup + 
##     sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9813   -512    116    437  41638 
## 
## Coefficients: (11 not defined because of singularities)
##                                             Estimate          Std. Error
## (Intercept)                       2397.2986176292529  2796.1628819864382
## CountryNameAlbania                -303.6996369267350  2864.3827865305916
## CountryNameArmenia                 142.7670297398832  2864.3827865303933
## CountryNameAzerbaijan             -358.5630701023196  2870.5357722683607
## CountryNameBelize                -2272.2986176294403  3924.2681789746880
## CountryNameBenin                  -882.1790623331689  2961.5415192947094
## CountryNameBhutan                   -0.0000000001159  3893.9321714464168
## CountryNameBolivia                 154.4336964066028  2864.3827865304493
## CountryNameBrazil                 -526.8777975259709  2895.6262119445364
## CountryNameBurkina Faso           -390.8530298003951  2895.6213224707385
## CountryNameBurundi                -620.7674411527719  2921.4020314724162
## CountryNameCambodia               1889.7670297399504  2864.3827865304675
## CountryNameCameroon               -282.9177385629222  2877.6695864552671
## CountryNameChile                 -1034.6889811743922  2991.3486426639715
## CountryNameChina                  -480.8256323469025  2870.5499555882966
## CountryNameColombia               1033.7003630733109  2864.3827865305234
## CountryNameCosta Rica             -328.5663035933894  2864.3827865304747
## CountryNameDominican Republic     -840.0606533839924  2944.5280215731127
## CountryNameEcuador                 468.7670297399083  2864.3827865303751
## CountryNameEl Salvador            2225.7003630733134  2864.3827865305461
## CountryNameGeorgia                -318.3256323468976  2870.5499555883821
## CountryNameGhana                  -141.0329702600820  2864.3827865305079
## CountryNameGuam                  -1678.5722013162665  3924.4402762937907
## CountryNameGuatemala                54.7003630732498  2864.3827865305270
## CountryNameHaiti                  -232.0832867926781  2877.6335519656118
## CountryNameHonduras                  4.5003630732590  2864.3827865304929
## CountryNameIndia                   316.5003630732660  2864.3827865304802
## CountryNameIndonesia               -18.3663035933984  2864.3827865305229
## CountryNameIraq                   -452.5958270376615  2877.6732187750663
## CountryNameIsrael                 -600.2668203751462  2895.7288634791739
## CountryNameJordan                 -142.3353071981244  2870.5532425794154
## CountryNameKenya                  4622.3670297397884  2864.3827865303283
## CountryNameKosovo                 -338.0329702600695  2864.3827865304629
## CountryNameLebanon                 153.5003630732604  2864.3827865304988
## CountryNameLesotho                 118.3367439371703  3929.1581524224412
## CountryNameLiberia                -203.1924999955195  2899.6799514870522
## CountryNameMadagascar             -162.5436614140387  2877.6635088290154
## CountryNameMalawi                 -446.0338853335510  2885.9096966286938
## CountryNameMali                     50.8494752012601  2895.6262119443377
## CountryNameMauritania             -504.9495077255829  3924.5251711018486
## CountryNameMexico                  -49.8996369267380  2864.3827865304306
## CountryNameMoldova                -465.5399180611796  2870.5499555883516
## CountryNameMongolia               -428.6344986737290  2870.5357722682552
## CountryNameMozambique             -248.7059272451345  2870.5357722683084
## CountryNameNamibia                -647.7539511805671  3400.9417561738746
## CountryNameNepal                  -459.5912786388498  2885.9025930642770
## CountryNameNicaragua               352.7670297399196  2864.3827865304465
## CountryNameNigeria                 469.2863837825488  2967.8561632666979
## CountryNamePakistan               1357.8336964066032  2864.3827865304529
## CountryNamePanama                 -600.5866211020823  2885.9318119470531
## CountryNameParaguay                360.9003630732673  2864.3827865304474
## CountryNamePeru                   1049.5670297399013  2864.3827865303806
## CountryNamePhilippines           10263.4336964066570  2864.3827865304429
## CountryNamePuerto Rico            -632.1536535420112  2907.3333344450980
## CountryNameRwanda                   16.3670297399098  2864.3827865304079
## CountryNameSamoa                    60.5295195186374  2870.5920381913415
## CountryNameSenegal                -263.9916415308540  2870.5357722682338
## CountryNameSierra Leone            -71.6329702600864  2864.3827865304347
## CountryNameSolomon Islands        -575.3757805537406  2895.7408429632487
## CountryNameSomalia                -630.3770024781575  2899.5945029435238
## CountryNameSouth Africa           -483.3245942591770  2877.6282402726602
## CountryNameSouth Sudan            -626.2246293597588  2895.6295437765925
## CountryNameSuriname               -977.2998945258795  2967.9385682086217
## CountryNameTajikistan              872.7003630732477  2864.3827865304293
## CountryNameTanzania                -84.6996369267567  2864.3827865304174
## CountryNameThailand               -766.1502604601651  3093.2972881473070
## CountryNameTimor-Leste            -297.8381228302801  2877.6448159451793
## CountryNameTogo                    -30.0596157078293  2895.6262119443563
## CountryNameTurkey                 -549.1725216078013  2907.2961271462918
## CountryNameUganda                  940.7670297399101  2864.3827865304224
## CountryNameUkraine                -479.0099401646601  2889.5770629993290
## CountryNameUnited States           -26.4329702600880  2864.3827865304233
## CountryNameVanuatu                -299.6632560628339  3929.1581524223825
## CountryNameVietnam                 290.2336964065781  2864.3827865304243
## CountryNameZambia                 -600.6523920058125  2911.6816938341426
## CountryNameZimbabwe               -200.0832867926773  2877.6335519657118
## RegionEurope & Central Asia                       NA                  NA
## RegionLatin America & Caribbean                   NA                  NA
## RegionMiddle East & North Africa                  NA                  NA
## RegionNorth America                               NA                  NA
## RegionSouth Asia                                  NA                  NA
## RegionSub-Saharan Africa                          NA                  NA
## IncomeGroupLow income                             NA                  NA
## IncomeGroupLower middle income                    NA                  NA
## IncomeGroupUpper middle income                    NA                  NA
## sectorArts                       -2395.2986176289783   487.0040910352631
## sectorClothing                   -2057.9457011436989   476.2791050097732
## sectorConstruction               -2543.5529107498342   489.7040459229187
## sectorEducation                  -2145.5349181733486   505.0007306266344
## sectorEntertainment              -2872.6870451600362   546.3277453961758
## sectorFood                        -538.2962931765775   474.4062957141285
## sectorHealth                     -2456.3907904670623   489.6258243159040
## sectorHousing                    -2067.4983645019802   520.1590233244574
## sectorManufacturing              -2508.2207489063289   487.3986250027600
## sectorPersonal Use               -2093.6353615664248   510.7690227874791
## sectorRetail                      -717.7264163129320   473.5312914529525
## sectorServices                   -1891.3491099036526   474.2600996711696
## sectorTransportation             -2408.4855162622484   502.3727012746329
## sectorWholesale                  -2773.3629165844509   521.9436184460473
## avg_Pop                                           NA                  NA
## avg_pctPoverty                                    NA                  NA
##                                  t value    Pr(>|t|)    
## (Intercept)                        0.857     0.39151    
## CountryNameAlbania                -0.106     0.91559    
## CountryNameArmenia                 0.050     0.96026    
## CountryNameAzerbaijan             -0.125     0.90063    
## CountryNameBelize                 -0.579     0.56273    
## CountryNameBenin                  -0.298     0.76587    
## CountryNameBhutan                  0.000     1.00000    
## CountryNameBolivia                 0.054     0.95702    
## CountryNameBrazil                 -0.182     0.85566    
## CountryNameBurkina Faso           -0.135     0.89266    
## CountryNameBurundi                -0.212     0.83178    
## CountryNameCambodia                0.660     0.50961    
## CountryNameCameroon               -0.098     0.92171    
## CountryNameChile                  -0.346     0.72951    
## CountryNameChina                  -0.168     0.86702    
## CountryNameColombia                0.361     0.71828    
## CountryNameCosta Rica             -0.115     0.90871    
## CountryNameDominican Republic     -0.285     0.77549    
## CountryNameEcuador                 0.164     0.87005    
## CountryNameEl Salvador             0.777     0.43737    
## CountryNameGeorgia                -0.111     0.91173    
## CountryNameGhana                  -0.049     0.96074    
## CountryNameGuam                   -0.428     0.66897    
## CountryNameGuatemala               0.019     0.98477    
## CountryNameHaiti                  -0.081     0.93574    
## CountryNameHonduras                0.002     0.99875    
## CountryNameIndia                   0.110     0.91204    
## CountryNameIndonesia              -0.006     0.99489    
## CountryNameIraq                   -0.157     0.87507    
## CountryNameIsrael                 -0.207     0.83583    
## CountryNameJordan                 -0.050     0.96047    
## CountryNameKenya                   1.614     0.10698    
## CountryNameKosovo                 -0.118     0.90609    
## CountryNameLebanon                 0.054     0.95728    
## CountryNameLesotho                 0.030     0.97598    
## CountryNameLiberia                -0.070     0.94415    
## CountryNameMadagascar             -0.056     0.95497    
## CountryNameMalawi                 -0.155     0.87721    
## CountryNameMali                    0.018     0.98599    
## CountryNameMauritania             -0.129     0.89766    
## CountryNameMexico                 -0.017     0.98611    
## CountryNameMoldova                -0.162     0.87121    
## CountryNameMongolia               -0.149     0.88134    
## CountryNameMozambique             -0.087     0.93098    
## CountryNameNamibia                -0.190     0.84899    
## CountryNameNepal                  -0.159     0.87351    
## CountryNameNicaragua               0.123     0.90201    
## CountryNameNigeria                 0.158     0.87440    
## CountryNamePakistan                0.474     0.63560    
## CountryNamePanama                 -0.208     0.83520    
## CountryNameParaguay                0.126     0.89977    
## CountryNamePeru                    0.366     0.71415    
## CountryNamePhilippines             3.583     0.00036 ***
## CountryNamePuerto Rico            -0.217     0.82793    
## CountryNameRwanda                  0.006     0.99544    
## CountryNameSamoa                   0.021     0.98318    
## CountryNameSenegal                -0.092     0.92675    
## CountryNameSierra Leone           -0.025     0.98005    
## CountryNameSolomon Islands        -0.199     0.84255    
## CountryNameSomalia                -0.217     0.82795    
## CountryNameSouth Africa           -0.168     0.86666    
## CountryNameSouth Sudan            -0.216     0.82884    
## CountryNameSuriname               -0.329     0.74203    
## CountryNameTajikistan              0.305     0.76069    
## CountryNameTanzania               -0.030     0.97642    
## CountryNameThailand               -0.248     0.80445    
## CountryNameTimor-Leste            -0.104     0.91759    
## CountryNameTogo                   -0.010     0.99172    
## CountryNameTurkey                 -0.189     0.85022    
## CountryNameUganda                  0.328     0.74267    
## CountryNameUkraine                -0.166     0.86838    
## CountryNameUnited States          -0.009     0.99264    
## CountryNameVanuatu                -0.076     0.93923    
## CountryNameVietnam                 0.101     0.91932    
## CountryNameZambia                 -0.206     0.83662    
## CountryNameZimbabwe               -0.070     0.94458    
## RegionEurope & Central Asia           NA          NA    
## RegionLatin America & Caribbean       NA          NA    
## RegionMiddle East & North Africa      NA          NA    
## RegionNorth America                   NA          NA    
## RegionSouth Asia                      NA          NA    
## RegionSub-Saharan Africa              NA          NA    
## IncomeGroupLow income                 NA          NA    
## IncomeGroupLower middle income        NA          NA    
## IncomeGroupUpper middle income        NA          NA    
## sectorArts                        -4.918 0.000001060 ***
## sectorClothing                    -4.321 0.000017507 ***
## sectorConstruction                -5.194 0.000000262 ***
## sectorEducation                   -4.249 0.000024058 ***
## sectorEntertainment               -5.258 0.000000187 ***
## sectorFood                        -1.135     0.25685    
## sectorHealth                      -5.017 0.000000648 ***
## sectorHousing                     -3.975 0.000076859 ***
## sectorManufacturing               -5.146 0.000000335 ***
## sectorPersonal Use                -4.099 0.000045754 ***
## sectorRetail                      -1.516     0.12999    
## sectorServices                    -3.988 0.000072768 ***
## sectorTransportation              -4.794 0.000001949 ***
## sectorWholesale                   -5.314 0.000000140 ***
## avg_Pop                               NA          NA    
## avg_pctPoverty                        NA          NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2753 on 796 degrees of freedom
## Multiple R-squared:  0.3118, Adjusted R-squared:  0.2349 
## F-statistic: 4.053 on 89 and 796 DF,  p-value: < 0.00000000000000022

Let’s check the conditions for the regression.

ggplot(data = m_loans, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

The model is somewhat normal but with several extreme outliers.

ggplot(data = m_loans, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals") + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')

The variance of the residuals is not constant with several outliers.

ggplot(data = m_loans, aes(sample = .resid)) +
  stat_qq()

The qqplot also appears quite right skewed with outliers on both tails.

So the conditions for mutliple or logistic regression are not met.

In the full model CountryName is a unique value to each country and cannot be “predicted” as a value for another country. Moreover, it appears that by including the specific countries that are already receiving loans we are overfitting the model, so we will remove the country names and see what happens.

m_loans <- lm(total_loan_count ~ Region + IncomeGroup + sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)

## 
## Call:
## lm(formula = total_loan_count ~ Region + IncomeGroup + sector + 
##     avg_Pop + avg_pctPoverty, data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3865   -838   -140    338  50384 
## 
## Coefficients:
##                                          Estimate       Std. Error t value
## (Intercept)                       2285.3682871161   690.4883375156   3.310
## RegionEurope & Central Asia      -1055.6598003278   412.0111163657  -2.562
## RegionLatin America & Caribbean   -687.5082090687   363.5408652633  -1.891
## RegionMiddle East & North Africa  -731.6948147972   548.3654691616  -1.334
## RegionNorth America                 48.5845001486   981.1512130967   0.050
## RegionSouth Asia                  -975.2602118319   570.7603718902  -1.709
## RegionSub-Saharan Africa         -1319.6631762407   385.9877928602  -3.419
## IncomeGroupLow income             1001.9556786361   616.5994382708   1.625
## IncomeGroupLower middle income    1569.6642596866   562.6741555606   2.790
## IncomeGroupUpper middle income     638.9017427702   525.7833812106   1.215
## sectorArts                       -2313.1509271068   528.5294882114  -4.377
## sectorClothing                   -2019.3735317885   521.8020578040  -3.870
## sectorConstruction               -2433.1446273661   535.4330401440  -4.544
## sectorEducation                  -1969.7765768139   550.8528323355  -3.576
## sectorEntertainment              -2530.9729651644   594.0519730943  -4.261
## sectorFood                        -487.7652637183   519.8003023799  -0.938
## sectorHealth                     -2319.1532550516   535.1921523670  -4.333
## sectorHousing                    -1898.6813779493   567.0492638203  -3.348
## sectorManufacturing              -2401.0088662018   532.8345160668  -4.506
## sectorPersonal Use               -1963.1417777867   551.2701963350  -3.561
## sectorRetail                      -690.7831507630   516.0159644330  -1.339
## sectorServices                   -1866.9159294259   517.9016033632  -3.605
## sectorTransportation             -2269.7751482054   548.0698547509  -4.141
## sectorWholesale                  -2549.3859864936   566.9806158746  -4.496
## avg_Pop                             -0.0000002787     0.0000005088  -0.548
## avg_pctPoverty                       7.8721979833     7.1933757042   1.094
##                                    Pr(>|t|)    
## (Intercept)                        0.000973 ***
## RegionEurope & Central Asia        0.010570 *  
## RegionLatin America & Caribbean    0.058941 .  
## RegionMiddle East & North Africa   0.182452    
## RegionNorth America                0.960518    
## RegionSouth Asia                   0.087867 .  
## RegionSub-Saharan Africa           0.000658 ***
## IncomeGroupLow income              0.104535    
## IncomeGroupLower middle income     0.005393 ** 
## IncomeGroupUpper middle income     0.224645    
## sectorArts                       0.00001354 ***
## sectorClothing                     0.000117 ***
## sectorConstruction               0.00000630 ***
## sectorEducation                    0.000369 ***
## sectorEntertainment              0.00002265 ***
## sectorFood                         0.348318    
## sectorHealth                     0.00001642 ***
## sectorHousing                      0.000848 ***
## sectorManufacturing              0.00000752 ***
## sectorPersonal Use                 0.000390 ***
## sectorRetail                       0.181027    
## sectorServices                     0.000330 ***
## sectorTransportation             0.00003792 ***
## sectorWholesale                  0.00000786 ***
## avg_Pop                            0.583958    
## avg_pctPoverty                     0.274100    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3030 on 860 degrees of freedom
## Multiple R-squared:  0.09942,    Adjusted R-squared:  0.07324 
## F-statistic: 3.797 on 25 and 860 DF,  p-value: 0.000000001792

The relative strength of the model without country names dropped to \[R^2_{adj} = 0.07324\].

It is interesting that with the country names included the dimensions of world region and income group were all “NA”, but without the countries they begin to assume some significance within certain values such as Region = Europe & Central Asia and Income Group = Lower middle income.

For the next round we will remove Region since it has the highest p-value in North America.

m_loans <- lm(total_loan_count ~ IncomeGroup + sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)

## 
## Call:
## lm(formula = total_loan_count ~ IncomeGroup + sector + avg_Pop + 
##     avg_pctPoverty, data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3244   -805   -166    266  51187 
## 
## Coefficients:
##                                          Estimate         Std. Error t value
## (Intercept)                     1777.457928715200   539.739965506075   3.293
## IncomeGroupLow income            356.463813132056   495.030537051976   0.720
## IncomeGroupLower middle income  1291.879415052892   448.686167227039   2.879
## IncomeGroupUpper middle income   408.322308767657   445.834864987234   0.916
## sectorArts                     -2294.877210070119   530.590110182443  -4.325
## sectorClothing                 -2012.063562258315   524.228341341961  -3.838
## sectorConstruction             -2421.217749999545   537.823552176587  -4.502
## sectorEducation                -1968.769702210150   553.342820886069  -3.558
## sectorEntertainment            -2494.089433513088   596.255967790020  -4.183
## sectorFood                      -499.266400554085   522.227199381874  -0.956
## sectorHealth                   -2341.189442695978   537.535878213656  -4.355
## sectorHousing                  -1854.735939505492   569.223793140106  -3.258
## sectorManufacturing            -2412.068343052929   535.239702703112  -4.507
## sectorPersonal Use             -1949.510495593590   553.534616954894  -3.522
## sectorRetail                    -702.186695557686   518.398794674743  -1.355
## sectorServices                 -1872.509792145529   520.302911058629  -3.599
## sectorTransportation           -2246.574759455291   550.528836786731  -4.081
## sectorWholesale                -2537.639450665607   569.295364362355  -4.458
## avg_Pop                           -0.000000005909     0.000000429850  -0.014
## avg_pctPoverty                     3.567894539515     6.665972917777   0.535
##                                  Pr(>|t|)    
## (Intercept)                      0.001031 ** 
## IncomeGroupLow income            0.471667    
## IncomeGroupLower middle income   0.004084 ** 
## IncomeGroupUpper middle income   0.359995    
## sectorArts                     0.00001702 ***
## sectorClothing                   0.000133 ***
## sectorConstruction             0.00000766 ***
## sectorEducation                  0.000394 ***
## sectorEntertainment            0.00003171 ***
## sectorFood                       0.339322    
## sectorHealth                   0.00001487 ***
## sectorHousing                    0.001164 ** 
## sectorManufacturing            0.00000749 ***
## sectorPersonal Use               0.000451 ***
## sectorRetail                     0.175921    
## sectorServices                   0.000338 ***
## sectorTransportation           0.00004904 ***
## sectorWholesale                0.00000938 ***
## avg_Pop                          0.989035    
## avg_pctPoverty                   0.592621    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3045 on 866 degrees of freedom
## Multiple R-squared:  0.08457,    Adjusted R-squared:  0.06449 
## F-statistic: 4.211 on 19 and 866 DF,  p-value: 0.000000005472

The relative strength of the model without Region dropped to \[R^2_{adj} = 0.06449\].

The adjusted-R-squared decreased by removing region so we will add it back in, and now the largest p-value is in avg_Pop, so we will try removing that dimension next.

m_loans <- lm(total_loan_count ~ Region + IncomeGroup + sector + avg_pctPoverty, data = world_loan_train)
summary(m_loans)

## 
## Call:
## lm(formula = total_loan_count ~ Region + IncomeGroup + sector + 
##     avg_pctPoverty, data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3827   -839   -167    340  50397 
## 
## Coefficients:
##                                   Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)                       2222.644    680.652   3.265   0.001136 ** 
## RegionEurope & Central Asia       -990.684    394.409  -2.512   0.012193 *  
## RegionLatin America & Caribbean   -628.882    347.291  -1.811   0.070517 .  
## RegionMiddle East & North Africa  -658.444    531.597  -1.239   0.215826    
## RegionNorth America                 22.311    979.580   0.023   0.981834    
## RegionSouth Asia                 -1078.494    538.535  -2.003   0.045529 *  
## RegionSub-Saharan Africa         -1282.374    379.784  -3.377   0.000767 ***
## IncomeGroupLow income             1017.196    615.721   1.652   0.098891 .  
## IncomeGroupLower middle income    1589.577    561.271   2.832   0.004732 ** 
## IncomeGroupUpper middle income     621.510    524.611   1.185   0.236460    
## sectorArts                       -2308.336    528.242  -4.370 0.00001395 ***
## sectorClothing                   -2019.327    521.590  -3.871   0.000116 ***
## sectorConstruction               -2434.387    535.211  -4.548 0.00000618 ***
## sectorEducation                  -1971.350    550.621  -3.580   0.000363 ***
## sectorEntertainment              -2533.711    593.789  -4.267 0.00002201 ***
## sectorFood                        -488.994    519.584  -0.941   0.346905    
## sectorHealth                     -2322.442    534.941  -4.341 0.00001583 ***
## sectorHousing                    -1901.263    566.799  -3.354   0.000830 ***
## sectorManufacturing              -2402.728    532.609  -4.511 0.00000734 ***
## sectorPersonal Use               -1967.378    550.992  -3.571   0.000376 ***
## sectorRetail                      -690.483    515.806  -1.339   0.181038    
## sectorServices                   -1867.163    517.691  -3.607   0.000328 ***
## sectorTransportation             -2270.674    547.845  -4.145 0.00003738 ***
## sectorWholesale                  -2547.299    566.737  -4.495 0.00000792 ***
## avg_pctPoverty                       8.012      7.186   1.115   0.265206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3029 on 861 degrees of freedom
## Multiple R-squared:  0.0991, Adjusted R-squared:  0.07399 
## F-statistic: 3.946 on 24 and 861 DF,  p-value: 0.0000000009874

The relative strength of the model without avg_Pop increased to \[R^2_{adj} = 0.07399\]. It appears that avg_Pop is not a predictor variable for this model.

The adjusted-R-squared increased slightly by removing the avg_Pop, and next we can try removing the avg_pctPoverty since it has the next highest p-value.

m_loans <- lm(total_loan_count ~ Region + IncomeGroup + sector, data = world_loan_train)
summary(m_loans)

## 
## Call:
## lm(formula = total_loan_count ~ Region + IncomeGroup + sector, 
##     data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3823   -819   -196    311  50415 
## 
## Coefficients:
##                                   Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)                       2244.083    680.477   3.298   0.001014 ** 
## RegionEurope & Central Asia      -1025.379    393.235  -2.608   0.009277 ** 
## RegionLatin America & Caribbean   -637.407    347.256  -1.836   0.066768 .  
## RegionMiddle East & North Africa  -689.792    530.928  -1.299   0.194216    
## RegionNorth America                  9.848    979.654   0.010   0.991982    
## RegionSouth Asia                 -1113.351    537.703  -2.071   0.038696 *  
## RegionSub-Saharan Africa         -1167.162    365.506  -3.193   0.001458 ** 
## IncomeGroupLow income             1101.965    611.095   1.803   0.071696 .  
## IncomeGroupLower middle income    1612.015    560.989   2.874   0.004159 ** 
## IncomeGroupUpper middle income     634.187    524.561   1.209   0.227000    
## sectorArts                       -2308.579    528.316  -4.370 0.00001396 ***
## sectorClothing                   -2016.141    521.656  -3.865   0.000119 ***
## sectorConstruction               -2441.927    535.243  -4.562 0.00000579 ***
## sectorEducation                  -1968.879    550.695  -3.575   0.000369 ***
## sectorEntertainment              -2529.446    593.861  -4.259 0.00002276 ***
## sectorFood                        -488.102    519.657  -0.939   0.347851    
## sectorHealth                     -2320.920    535.015  -4.338 0.00001608 ***
## sectorHousing                    -1897.561    566.869  -3.347   0.000851 ***
## sectorManufacturing              -2397.970    532.667  -4.502 0.00000766 ***
## sectorPersonal Use               -1970.810    551.061  -3.576   0.000368 ***
## sectorRetail                      -690.537    515.879  -1.339   0.181065    
## sectorServices                   -1868.359    517.763  -3.609   0.000326 ***
## sectorTransportation             -2275.908    547.902  -4.154 0.00003595 ***
## sectorWholesale                  -2540.825    566.787  -4.483 0.00000836 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3030 on 862 degrees of freedom
## Multiple R-squared:  0.0978, Adjusted R-squared:  0.07373 
## F-statistic: 4.063 on 23 and 862 DF,  p-value: 0.0000000007722

The relative strength of the model without Avg_PctPoverty decreased to \[R^2_{adj} = 0.07373\].

Removing Avg_PctPoverty decreased the strength of the model, so it should be added back again. Last we can remove sector and see what that does to the overall model.

m_loans <- lm(total_loan_count ~ IncomeGroup + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)

## 
## Call:
## lm(formula = total_loan_count ~ IncomeGroup + avg_Pop + avg_pctPoverty, 
##     data = world_loan_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1505   -905   -360   -116  52278 
## 
## Coefficients:
##                                        Estimate       Std. Error t value
## (Intercept)                     118.98711815875  423.35435149197   0.281
## IncomeGroupLow income           229.08451532504  506.77245741986   0.452
## IncomeGroupLower middle income 1156.95050324500  459.20112384950   2.519
## IncomeGroupUpper middle income  286.21013197798  456.35146798703   0.627
## avg_Pop                          -0.00000004095    0.00000044096  -0.093
## avg_pctPoverty                    4.01731518191    6.83507402487   0.588
##                                Pr(>|t|)  
## (Intercept)                      0.7787  
## IncomeGroupLow income            0.6513  
## IncomeGroupLower middle income   0.0119 *
## IncomeGroupUpper middle income   0.5307  
## avg_Pop                          0.9260  
## avg_pctPoverty                   0.5569  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3124 on 880 degrees of freedom
## Multiple R-squared:  0.02051,    Adjusted R-squared:  0.01495 
## F-statistic: 3.686 on 5 and 880 DF,  p-value: 0.002626

The relative strength of the model without sector decreased to \[R^2_{adj} = 0.01495\]. It seems that sector adds great value to the model.

Since removing sector is catastrophic to the model, and since the test set has no sector values in it to predict upon, this seems to be a failing venture. So we will have to abandon the prospect at this time of creating a predictive model based on the training and testing data that was derived from the current available datasets.

Part 5 - Conclusion

Examing the lending practices of essentially Americans to the rest of the world in which we live is an important thing to examine, because it shows our preferences for helping and supporting different parts of the world. It may even possibly hint at underlying attitudes about the regions of the world we are being so philanthropic towards.

The above analysis roughly succeeded in the simple linear regression of using poverty levels to predict a rough number of loans made, as we see in this faceted version of a prior graphic above:

ggplot(world_loan_train, aes(x = avg_pctPoverty, y = total_loan_count)) + geom_jitter(aes(color = Region)) + facet_wrap(~ Region) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')

There is a positive linear relationship in several regions of the world between the poverty in the area and the number of loans received from Kiva.

However, we utterly failed to create an ideal multivariate model by which to predict how many loans might be made to each country that did not receive loans based on characteristics of the given country.

It may be possible that there are other factors underlying the lending tendencies of Americans toward the rest of the world, which could be confounding to the current analysis if those factors are not present in it. Perhaps if we were to include other characteristics of the countries that received loans, such as predominant race, ethnicity, political stance and religious prevalence in each country, we might be better equipped to distill a better model to predict where Americans overall might be willing to invest their money to lift up those who need help.

References

Kiva, November 3, 2020, https://www.kiva.org/.

“Population, total.” The World Bank, November 3, 2020. https://data.worldbank.org/indicator/SP.POP.TOTL?end=2017&start=2014.

“GHO OData API.” The World Health Data Platform, November 3, 2020. https://www.who.int/data/gho/info/gho-odata-api.

“Data Science for Good: Kiva Crowdfunding.” Kaggle, November 3. 2020, https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding.

“tmap: get started!” November 20, 2020, https://cran.r-project.org/web/packages/tmap/vignettes/tmap-getstarted.html.