Kiva.org is a crowdfunding organization that helps people around the world by providing loans for a variety of purposes ranging across areas such as education, health, women, the arts, technology and business startups. The loans are disbursed directly to the borrowers, and many of the loans are targeted specifically to help people living in poverty.
Kiva’s lenders fund projects in many, but not all, of the countries around the world. Most of the countries where loans are made, which includes the US, include areas where there is greater relative poverty, but not all countries receive Kiva loans. This leads to the primary research question for this project, which is:
Are a country’s total population and poverty proportion predictive of the volumes and amounts of Kiva’s microloans disbursed in that country?
In this question the dependent, or response, variables are the frequency and amounts of loans Kiva made in a given country.
The numerical independent variables are the total population and poverty proportion for each country, and a qualitative independent variable is the country’s name where the loans were disbursed.
…and the research question leads to a further question that we may want to try to answer:
If a country’s total population and poverty proportion are predictive of the volumes and amounts of Kiva’s microloans for a given country, can we predict the volumes and amounts of loans that might be made in countries where loans are not currently being made based on a specific country’s total population and poverty proportion?
In order to answer the research question we first need to know the population for each country in the world. Population data is collected by the World Bank and is made available under its CC-BY 4.0 license as open data, given proper attribution. It is available on the World Bank website, and for this project it was extracted by downloading a .csv file from the World Bank, and storing that .csv file on my GitHub repo.
Each case in the World Bank population data represents the annual total population of a given country per year from 1960 to 2019. For this project we are only interested in the years 2014 through 2017, which aligns with the dates in the available data on Kiva’s loans.
The World Bank population data from 1960 to 2019 was read into RStudio from the .csv file in GitHub as shown below, and it was then transformed to only show populations for the years 2014 through 2017 inclusive.
WBpopURL <- "https://raw.githubusercontent.com/douglasbarley/Fall2020_FinalProject/main/PopulationByCountry.csv"
WBmetaURL <- "https://raw.githubusercontent.com/douglasbarley/Fall2020_FinalProject/main/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_1495124.csv"
WB_Pop <- read.csv(WBpopURL) %>%
subset(select = c(CountryName, CountryCode, X2014, X2015, X2016, X2017)) %>%
pivot_longer(`X2014`:`X2017`, names_to = "PopYear")
names(WB_Pop)[4] <- "Population"
WB_Pop$PopYear <- type.convert(gsub('^.', '', WB_Pop$PopYear)) # replace the first character "X" with a null string
head(WB_Pop)
## # A tibble: 6 x 4
## CountryName CountryCode PopYear Population
## <chr> <chr> <int> <dbl>
## 1 Aruba ABW 2014 103774
## 2 Aruba ABW 2015 104341
## 3 Aruba ABW 2016 104872
## 4 Aruba ABW 2017 105366
## 5 Afghanistan AFG 2014 33370794
## 6 Afghanistan AFG 2015 34413603
The World Bank also provided a metadata file about the countries that includes the world region and overall income group as determined by the World Bank.
WBmetaURL <- "https://raw.githubusercontent.com/douglasbarley/Fall2020_FinalProject/main/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_1495124.csv"
WB_Meta <- read.csv(WBmetaURL)
names(WB_Meta)[1] <- "CountryCode"
names(WB_Meta)[5] <- "CountryName"
WB_Meta <- subset(WB_Meta, WB_Meta$IncomeGroup != "", select = c(CountryCode, Region, IncomeGroup))
head(WB_Meta)
## CountryCode Region IncomeGroup
## 1 ABW Latin America & Caribbean High income
## 2 AFG South Asia Low income
## 3 AGO Sub-Saharan Africa Lower middle income
## 4 ALB Europe & Central Asia Upper middle income
## 5 AND Europe & Central Asia High income
## 7 ARE Middle East & North Africa High income
With both files imported we can combine the data into a single set.
WB_data <- inner_join(WB_Pop,WB_Meta, by = c("CountryCode" = "CountryCode"), copy = FALSE, keep = FALSE)
head(WB_data, 10)
## # A tibble: 10 x 6
## CountryName CountryCode PopYear Population Region IncomeGroup
## <chr> <chr> <int> <dbl> <chr> <chr>
## 1 Aruba ABW 2014 103774 Latin America & C~ High income
## 2 Aruba ABW 2015 104341 Latin America & C~ High income
## 3 Aruba ABW 2016 104872 Latin America & C~ High income
## 4 Aruba ABW 2017 105366 Latin America & C~ High income
## 5 Afghanistan AFG 2014 33370794 South Asia Low income
## 6 Afghanistan AFG 2015 34413603 South Asia Low income
## 7 Afghanistan AFG 2016 35383128 South Asia Low income
## 8 Afghanistan AFG 2017 36296400 South Asia Low income
## 9 Angola AGO 2014 26941779 Sub-Saharan Africa Lower middle i~
## 10 Angola AGO 2015 27884381 Sub-Saharan Africa Lower middle i~
With the population of each country in the world in hand, we next need to know that poverty rates for each country. Poverty data is collected by the WHO as part of their Global Health Observatory site. The data is openly available World Health Organization via their API, and for this project it was extracted by accessing the API on the WHO the website.
Each case represents the proportion of the total population that lived below the international poverty line of US$1.90/day for a given year, for each of the years 2014 through 2017 inclusive. There are 1137 observations in the given data set.
Two files were required from the API, because one file only had a 3-digit country code, whereas we wanted the full country name for each country.
# Link to OData API for the World Health Organization (WHO) indicators on poverty
WHO_PovertyAPI <- GET("https://ghoapi.azureedge.net/api/SI_POV_DAY1", verbose())
http_status(WHO_PovertyAPI)
## $category
## [1] "Success"
##
## $reason
## [1] "OK"
##
## $message
## [1] "Success: (200) OK"
# Link to OData API for the World Health Organization (WHO) values for the Country dimension
WHO_CountryAPI <- GET("https://ghoapi.azureedge.net/api/DIMENSION/COUNTRY/DimensionValues", verbose())
http_status(WHO_CountryAPI)
## $category
## [1] "Success"
##
## $reason
## [1] "OK"
##
## $message
## [1] "Success: (200) OK"
With a Success code of 200 the OData file was successfully read into a dataframe, but it was still difficult to read/interpret in plain language due to the structure of the file. As such I extracted the contents from the poverty data URL call as raw text, and then inserted the poverty data text into an R dataframe.
WHO_PovertyList <- content(WHO_PovertyAPI, "text", encoding='UTF-8')
WHO_PovertyListText <- fromJSON(WHO_PovertyList, flatten = TRUE)
WHO_PovertyList_df <- as.data.frame(WHO_PovertyListText)
str(WHO_PovertyList_df)
## 'data.frame': 1440 obs. of 24 variables:
## $ X.odata.context : chr "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" "https://ghoapi.azureedge.net/api/$metadata#SI_POV_DAY1" ...
## $ value.Id : int 21319082 21319083 21319084 21319085 21319086 21319087 21319088 21319089 21319090 21319091 ...
## $ value.IndicatorCode : chr "SI_POV_DAY1" "SI_POV_DAY1" "SI_POV_DAY1" "SI_POV_DAY1" ...
## $ value.SpatialDimType : chr "COUNTRY" "COUNTRY" "COUNTRY" "COUNTRY" ...
## $ value.SpatialDim : chr "AGO" "AGO" "ALB" "ALB" ...
## $ value.TimeDimType : chr "YEAR" "YEAR" "YEAR" "YEAR" ...
## $ value.TimeDim : int 2000 2008 2002 2005 2008 2012 2000 2001 2002 2003 ...
## $ value.Dim1Type : logi NA NA NA NA NA NA ...
## $ value.Dim1 : logi NA NA NA NA NA NA ...
## $ value.Dim2Type : logi NA NA NA NA NA NA ...
## $ value.Dim2 : logi NA NA NA NA NA NA ...
## $ value.Dim3Type : logi NA NA NA NA NA NA ...
## $ value.Dim3 : logi NA NA NA NA NA NA ...
## $ value.DataSourceDimType : logi NA NA NA NA NA NA ...
## $ value.DataSourceDim : logi NA NA NA NA NA NA ...
## $ value.Value : chr "32.3" "30.1" "2.0" "1.1" ...
## $ value.NumericValue : num 32.3 30.1 2 1.1 0.4 1.1 5.7 9.4 14 7 ...
## $ value.Low : logi NA NA NA NA NA NA ...
## $ value.High : logi NA NA NA NA NA NA ...
## $ value.Comments : chr "World Development Indicators database, World Bank: Estimated from unit-record consumption data." "World Development Indicators database, World Bank: Estimated from unit-record consumption data." "World Development Indicators database, World Bank: Estimated from unit-record consumption data." "World Development Indicators database, World Bank: Estimated from unit-record consumption data." ...
## $ value.Date : chr "2019-08-20T09:05:26.32+02:00" "2019-08-20T09:05:26.373+02:00" "2019-08-20T09:05:26.403+02:00" "2019-08-20T09:05:26.42+02:00" ...
## $ value.TimeDimensionValue: chr "2000" "2008" "2002" "2005" ...
## $ value.TimeDimensionBegin: chr "2000-01-01T00:00:00+01:00" "2008-01-01T00:00:00+01:00" "2002-01-01T00:00:00+01:00" "2005-01-01T00:00:00+01:00" ...
## $ value.TimeDimensionEnd : chr "2000-12-31T00:00:00+01:00" "2008-12-31T00:00:00+01:00" "2002-12-31T00:00:00+01:00" "2005-12-31T00:00:00+01:00" ...
I then repeated the process for the country dimension data
WHO_CountryList <- content(WHO_CountryAPI, "text", encoding='UTF-8')
WHO_CountryListText <- fromJSON(WHO_CountryList, flatten = TRUE)
WHO_CountryList_df <- as.data.frame(WHO_CountryListText)
str(WHO_CountryList_df)
## 'data.frame': 216 obs. of 7 variables:
## $ X.odata.context : chr "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" "https://ghoapi.azureedge.net/api/$metadata#Collection(Default.DIMENSION_VALUE)" ...
## $ value.Code : chr "AGO" "BDI" "BEN" "BFA" ...
## $ value.Title : chr "Angola" "Burundi" "Benin" "Burkina Faso" ...
## $ value.ParentDimension: chr "REGION" "REGION" "REGION" "REGION" ...
## $ value.Dimension : chr "COUNTRY" "COUNTRY" "COUNTRY" "COUNTRY" ...
## $ value.ParentCode : chr "AFR" "AFR" "AFR" "AFR" ...
## $ value.ParentTitle : chr "Africa" "Africa" "Africa" "Africa" ...
The 24 variables in the poverty dataframe and the 7 variables in the country dataframe were not all necessary for the project. So I decided to look at only the core information needed from the data, which I considered to be the country code, year and numerical percent of the population living in poverty from the poverty data set.
WHO_PovertyList_df2 <- subset(WHO_PovertyList_df, select = c(value.SpatialDim, value.TimeDim, value.NumericValue))
names(WHO_PovertyList_df2) <- c("CountryCode","Year","PctPoverty")
head(WHO_PovertyList_df2)
## CountryCode Year PctPoverty
## 1 AGO 2000 32.3
## 2 AGO 2008 30.1
## 3 ALB 2002 2.0
## 4 ALB 2005 1.1
## 5 ALB 2008 0.4
## 6 ALB 2012 1.1
And I only needed the country code, country name and parent region from the country dimension set.
# get 3 variables from the country dimensions dataset
WHO_CountryList_df2 <- subset(WHO_CountryList_df, select = c(value.Code, value.Title, value.ParentTitle))
names(WHO_CountryList_df2) <- c("CountryCode","CountryName","WorldRegion")
head(WHO_CountryList_df2)
## CountryCode CountryName WorldRegion
## 1 AGO Angola Africa
## 2 BDI Burundi Africa
## 3 BEN Benin Africa
## 4 BFA Burkina Faso Africa
## 5 BWA Botswana Africa
## 6 CAF Central African Republic Africa
Once I extracted the data from the poverty and country data sets, I merged them into a single WHO_poverty data frame, retaining only observations in the years 2014 through 2017 in order to align with the Kiva loan data.
WHO_poverty <-
inner_join(WHO_PovertyList_df2,WHO_CountryList_df2, by = c("CountryCode" = "CountryCode"), copy = FALSE, keep = FALSE) %>%
filter(Year >= 2014 & Year < 2018)
WHO_poverty <- WHO_poverty[c("CountryName","CountryCode","WorldRegion","Year","PctPoverty")]
head(WHO_poverty,7)
## CountryName CountryCode WorldRegion Year PctPoverty
## 1 Argentina ARG Americas 2014 0.7
## 2 Argentina ARG Americas 2016 0.6
## 3 Argentina ARG Americas 2017 0.4
## 4 Armenia ARM Europe 2014 2.3
## 5 Armenia ARM Europe 2015 1.9
## 6 Armenia ARM Europe 2016 1.8
## 7 Armenia ARM Europe 2017 1.4
Kiva loan data from 2014 through 2017 was available as part of a Kaggle challenge. The data was available under a public domain license from the Kaggle website, but the .csv download was 187 MB, which I found was too large to upload to GitHub.
I created an Azure account and established an Azure database, to which I uploaded the .csv file, but when I tried to read the table with 671,205 observations spanning the period from 2014 through mid-2017 into RStudio I found it took a long time. So I created 4 views in the Azure database using SQL code. Each view was for each year of the data, and I was able to import each view from Azure into its own dataframe in RStudio. Then I used the rbind function to reassemble all of the data into a single dataframe inside RStudio.
Each case in the data represents a loan made to an economically disadvantaged person in the currency of the recipient’s residency.
I created a user account inside the Azure database named Rconnection which was used to establish the connection from this document to the Azure data source:
# connect to the server
my_connection <- dbConnect(drv = odbc::odbc(),
Driver = "SQL Server",
server = "cuny.database.windows.net",
database = "CUNY",
uid = "Rconnection",
pwd = "RS606Proj#1")
kiva_2014 <- dbGetQuery(my_connection,'
SELECT [id]
,[funded_amount]
,[loan_amount]
,[activity]
,[sector]
,[country_code]
,[country]
,[currency]
,[partner_id]
,[posted_time]
,[disbursed_time]
,[funded_time]
,[term_in_months]
,[lender_count]
,[repayment_interval]
,[date]
FROM [dbo].[vw_kiva_loans_2014]
')
kiva_2015 <- dbGetQuery(my_connection,'
SELECT [id]
,[funded_amount]
,[loan_amount]
,[activity]
,[sector]
,[country_code]
,[country]
,[currency]
,[partner_id]
,[posted_time]
,[disbursed_time]
,[funded_time]
,[term_in_months]
,[lender_count]
,[repayment_interval]
,[date]
FROM [dbo].[vw_kiva_loans_2015]
')
kiva_2016 <- dbGetQuery(my_connection,'
SELECT [id]
,[funded_amount]
,[loan_amount]
,[activity]
,[sector]
,[country_code]
,[country]
,[currency]
,[partner_id]
,[posted_time]
,[disbursed_time]
,[funded_time]
,[term_in_months]
,[lender_count]
,[repayment_interval]
,[date]
FROM [dbo].[vw_kiva_loans_2016]
')
kiva_2017 <- dbGetQuery(my_connection,'
SELECT [id]
,[funded_amount]
,[loan_amount]
,[activity]
,[sector]
,[country_code]
,[country]
,[currency]
,[partner_id]
,[posted_time]
,[disbursed_time]
,[funded_time]
,[term_in_months]
,[lender_count]
,[repayment_interval]
,[date]
FROM [dbo].[vw_kiva_loans_2017]
')
Combine all years of data into a single dataframe and take a look at the structure of the data.
kiva <- rbind(kiva_2014,kiva_2015,kiva_2016,kiva_2017)
kiva <- mutate(kiva, Year = as.integer(substr(kiva$date,1,4)))
str(kiva)
## 'data.frame': 671205 obs. of 17 variables:
## $ id : chr "694226" "694450" "694601" "694292" ...
## $ funded_amount : chr "225.0" "1950.0" "600.0" "75.0" ...
## $ loan_amount : num 225 1950 600 75 500 ...
## $ activity : chr "Transportation" "Cattle" "Construction Supplies" "General Store" ...
## $ sector : chr "Transportation" "Agriculture" "Construction" "Retail" ...
## $ country_code : chr "PH" "KG" "NI" "PH" ...
## $ country : chr "Philippines" "Kyrgyzstan" "Nicaragua" "Philippines" ...
## $ currency : chr "PHP" "KGS" "NIO" "PHP" ...
## $ partner_id : chr "145.0" "171.0" "98.0" "136.0" ...
## $ posted_time : chr "2014-04-07" "2014-04-07" "2014-04-07" "2014-04-07" ...
## $ disbursed_time : chr "2014-03-12" "2014-03-27" "2014-03-29" "2014-03-28" ...
## $ funded_time : POSIXct, format: "2014-04-14 19:34:57" "2014-05-09 19:19:36" ...
## $ term_in_months : num 8 26 15 8 8 8 15 14 12 17 ...
## $ lender_count : chr "6" "67" "20" "1" ...
## $ repayment_interval: chr "irregular" "monthly" "monthly" "irregular" ...
## $ date : chr "2014-04-07" "2014-04-07" "2014-04-07" "2014-04-07" ...
## $ Year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
With all three discrete datasets imported, I wanted to combine the information in a way that I could use it to run a multiple regression model. I started with combining all the WB_data with the WHO_poverty percent of the total population in poverty for each country using a left_join() function.
world_poverty <- left_join(WB_data,WHO_poverty, by = c("CountryCode" = "CountryCode", "PopYear" = "Year"), copy = FALSE, keep = FALSE)
names(world_poverty)[1] <- "CountryName"
world_poverty <- world_poverty[c("CountryName", "CountryCode", "PopYear", "Population", "Region", "IncomeGroup", "PctPoverty")]
head(world_poverty,10)
## # A tibble: 10 x 7
## CountryName CountryCode PopYear Population Region IncomeGroup PctPoverty
## <chr> <chr> <int> <dbl> <chr> <chr> <dbl>
## 1 Aruba ABW 2014 103774 Latin Ame~ High income NA
## 2 Aruba ABW 2015 104341 Latin Ame~ High income NA
## 3 Aruba ABW 2016 104872 Latin Ame~ High income NA
## 4 Aruba ABW 2017 105366 Latin Ame~ High income NA
## 5 Afghanistan AFG 2014 33370794 South Asia Low income NA
## 6 Afghanistan AFG 2015 34413603 South Asia Low income NA
## 7 Afghanistan AFG 2016 35383128 South Asia Low income NA
## 8 Afghanistan AFG 2017 36296400 South Asia Low income NA
## 9 Angola AGO 2014 26941779 Sub-Sahar~ Lower middl~ NA
## 10 Angola AGO 2015 27884381 Sub-Sahar~ Lower middl~ NA
Then I wanted to aggregate the number and amounts of loans for each year in each country from the kiva dataframe…
kiva_loans <- kiva %>%
group_by(country,Year,sector) %>%
summarise(loan_count = n(), loan_totals = sum(loan_amount)) %>%
mutate(avg_loan_amount = round(loan_totals / loan_count,2))
head(kiva_loans)
## # A tibble: 6 x 6
## # Groups: country, Year [3]
## country Year sector loan_count loan_totals avg_loan_amount
## <chr> <int> <chr> <int> <dbl> <dbl>
## 1 Afghanistan 2014 Arts 1 6000 6000
## 2 Afghanistan 2016 Arts 1 8000 8000
## 3 Albania 2014 Agriculture 242 379075 1566.
## 4 Albania 2014 Arts 2 3875 1938.
## 5 Albania 2014 Clothing 28 45500 1625
## 6 Albania 2014 Construction 13 19150 1473.
…which compressed 671,205 observations of 17 variables down to 3,162 observations of 6 variables.
Then I joined the Kiva loans set with the world_poverty table to arrive at the core data I wanted to analyze for the project.
world_loans <- left_join(world_poverty,kiva_loans, by = c("CountryName" = "country", "PopYear" = "Year"), copy = FALSE, keep = FALSE)
names(world_loans)[3] <- "Year"
head(world_loans,10)
## # A tibble: 10 x 11
## CountryName CountryCode Year Population Region IncomeGroup PctPoverty sector
## <chr> <chr> <int> <dbl> <chr> <chr> <dbl> <chr>
## 1 Aruba ABW 2014 103774 Latin~ High income NA <NA>
## 2 Aruba ABW 2015 104341 Latin~ High income NA <NA>
## 3 Aruba ABW 2016 104872 Latin~ High income NA <NA>
## 4 Aruba ABW 2017 105366 Latin~ High income NA <NA>
## 5 Afghanistan AFG 2014 33370794 South~ Low income NA Arts
## 6 Afghanistan AFG 2015 34413603 South~ Low income NA <NA>
## 7 Afghanistan AFG 2016 35383128 South~ Low income NA Arts
## 8 Afghanistan AFG 2017 36296400 South~ Low income NA <NA>
## 9 Angola AGO 2014 26941779 Sub-S~ Lower midd~ NA <NA>
## 10 Angola AGO 2015 27884381 Sub-S~ Lower midd~ NA <NA>
## # ... with 3 more variables: loan_count <int>, loan_totals <dbl>,
## # avg_loan_amount <dbl>
Using the world_loans dataframe we can create a training set to include only the countries that have received loans and a test set that includes only the countries that have not received loans.
Here is the training set that includes a sector, total loan count, total loan amount and average loan amount for each country that received loans:
world_loan_tr <- world_loans %>%
group_by(CountryName, Region, IncomeGroup) %>%
summarise(avg_Pop = mean(Population), avg_pctPoverty = mean(PctPoverty, na.rm = TRUE)) %>%
replace_na(list(avg_pctPoverty = 0))
world_loan_train <- left_join(world_loan_tr, kiva_loans, by = c("CountryName" = "country"), copy = FALSE, keep = FALSE) %>%
group_by(CountryName, Region, IncomeGroup,avg_Pop,avg_pctPoverty,sector) %>%
summarise(total_loan_count = sum(loan_count), total_loan_amount = sum(loan_totals)) %>%
mutate(avg_loan_amount = round(total_loan_amount / total_loan_count,2)) %>%
filter(total_loan_count > 0)
world_loan_train
## # A tibble: 886 x 9
## # Groups: CountryName, Region, IncomeGroup, avg_Pop, avg_pctPoverty [76]
## CountryName Region IncomeGroup avg_Pop avg_pctPoverty sector total_loan_count
## <chr> <chr> <chr> <dbl> <dbl> <chr> <int>
## 1 Afghanistan South~ Low income 3.49e7 0 Arts 2
## 2 Albania Europ~ Upper midd~ 2.88e6 0 Agric~ 719
## 3 Albania Europ~ Upper midd~ 2.88e6 0 Arts 5
## 4 Albania Europ~ Upper midd~ 2.88e6 0 Cloth~ 87
## 5 Albania Europ~ Upper midd~ 2.88e6 0 Const~ 30
## 6 Albania Europ~ Upper midd~ 2.88e6 0 Educa~ 93
## 7 Albania Europ~ Upper midd~ 2.88e6 0 Enter~ 5
## 8 Albania Europ~ Upper midd~ 2.88e6 0 Food 79
## 9 Albania Europ~ Upper midd~ 2.88e6 0 Health 271
## 10 Albania Europ~ Upper midd~ 2.88e6 0 Housi~ 378
## # ... with 876 more rows, and 2 more variables: total_loan_amount <dbl>,
## # avg_loan_amount <dbl>
And here is the test set that includes no sector, loan count, total loan amount or average loan amount since there were no loans made to these countries:
world_loan_test <- left_join(world_loan_tr, kiva_loans, by = c("CountryName" = "country"), copy = FALSE, keep = FALSE) %>%
group_by(CountryName, Region, IncomeGroup,avg_Pop,avg_pctPoverty,sector) %>%
summarise(total_loan_count = sum(loan_count), total_loan_amount = sum(loan_totals)) %>%
mutate(avg_loan_amount = round(total_loan_amount / total_loan_count,2)) %>%
filter(is.na(total_loan_count))
world_loan_test
## # A tibble: 141 x 9
## # Groups: CountryName, Region, IncomeGroup, avg_Pop, avg_pctPoverty [141]
## CountryName Region IncomeGroup avg_Pop avg_pctPoverty sector total_loan_count
## <chr> <chr> <chr> <dbl> <dbl> <chr> <int>
## 1 Algeria Middl~ Lower midd~ 4.01e7 0 <NA> NA
## 2 American S~ East ~ Upper midd~ 5.57e4 0 <NA> NA
## 3 Andorra Europ~ High income 7.79e4 0 <NA> NA
## 4 Angola Sub-S~ Lower midd~ 2.84e7 0 <NA> NA
## 5 Antigua an~ Latin~ High income 9.40e4 0 <NA> NA
## 6 Argentina Latin~ Upper midd~ 4.34e7 0.567 <NA> NA
## 7 Aruba Latin~ High income 1.05e5 0 <NA> NA
## 8 Australia East ~ High income 2.40e7 0.7 <NA> NA
## 9 Austria Europ~ High income 8.68e6 0.450 <NA> NA
## 10 Bahamas, T~ Latin~ High income 3.76e5 0 <NA> NA
## # ... with 131 more rows, and 2 more variables: total_loan_amount <dbl>,
## # avg_loan_amount <dbl>
Having arrived at this point in the project, we can see what the data has provided for the analysis.
train_countries <- n_distinct(world_loan_train$CountryName)
test_countries <- n_distinct(world_loan_test$CountryName)
cat("This gives us a training set with", train_countries,"countries and a test set with",test_countries,"countries.")
## This gives us a training set with 76 countries and a test set with 141 countries.
With the data in hand the top of mind question is: where is Kiva making loans, and where are they not making loans?
Quick map view of the countries in the training dataset (i.e countries with loans).
data("World") # get world geography from tmap dataframe
train_countries_names <- distinct(world_loan_train, CountryName)
world_loan_train_map <- World %>%
filter(name %in% train_countries_names$CountryName)
tmap_mode("view") # interactive map view
tm_shape(world_loan_train_map) +
tm_polygons("income_grp") +
tm_tiles("Stamen.TonerLabels")
Quick map view of the countries in the test dataset (i.e. countries not receiving loans).
data("World") # get world geography from tmap dataframe
test_countries_names <- distinct(world_loan_test, CountryName)
world_loan_test_map <- World %>%
filter(name %in% test_countries_names$CountryName)
tmap_mode("view") # interactive map view
tm_shape(world_loan_test_map) +
tm_polygons("income_grp") +
tm_tiles("Stamen.TonerLabels")
Let’s quantify the number and amounts of loans, as well as the average loan by world region.
loans_by_region <- world_loans %>%
group_by(Region) %>%
summarise(loan_count = sum(loan_count, na.rm = TRUE), loan_totals = sum(loan_totals, na.rm = TRUE)) %>%
mutate(avg_loan_amount = round(loan_totals / loan_count,2))
loans_by_region[order(loans_by_region$avg_loan_amount, decreasing = TRUE),]
## # A tibble: 7 x 4
## Region loan_count loan_totals avg_loan_amount
## <chr> <int> <dbl> <dbl>
## 1 North America 6093 31496375 5169.
## 2 Middle East & North Africa 14142 20179450 1427.
## 3 Latin America & Caribbean 156300 183310550 1173.
## 4 Europe & Central Asia 38932 42135625 1082.
## 5 Sub-Saharan Africa 168197 121521850 722.
## 6 South Asia 38815 20363600 525.
## 7 East Asia & Pacific 224257 106646850 476.
loans_by_region$Region <- factor(loans_by_region$Region, levels = loans_by_region$Region[order(loans_by_region$loan_count)])
loans_by_region_count <- ggplot(data=loans_by_region, aes(x=Region,y=loan_count)) +
geom_col(fill="lightblue") +
geom_text(label=(c(loans_by_region$loan_count))) +
ggtitle("Number of loans by region") +
xlab('Region') +
ylab('Count of loans') +
coord_flip()
loans_by_region$Region <- factor(loans_by_region$Region, levels = loans_by_region$Region[order(loans_by_region$loan_count)])
loans_by_region_total_amount <- ggplot(data=loans_by_region, aes(x=Region,y=loan_totals)) +
geom_col(fill="lightblue") +
geom_text(label=(c(loans_by_region$loan_totals))) +
ggtitle("Total $ amount of loans by region") +
xlab('Region') +
ylab('Total $ amount of loans') +
coord_flip()
loans_by_region$Region <- factor(loans_by_region$Region, levels = loans_by_region$Region[order(loans_by_region$loan_count)])
loans_by_region_avg_amount <- ggplot(data=loans_by_region, aes(x=Region,y=avg_loan_amount)) +
geom_col(fill="lightblue") +
geom_text(label=(c(loans_by_region$avg_loan_amount))) +
ggtitle("Avg $ amount of loans by region") +
xlab('Region') +
ylab('Avg $ amount of loans') +
coord_flip()
grid.arrange(loans_by_region_count, loans_by_region_total_amount, loans_by_region_avg_amount, nrow = 3)
It appears that the highest total dollars are going to Latin America & the Caribbean region with $183M in total loan dollars, but the greatest number of loans are going to East Asia & the Pacific. Interestingly East Asia & the Pacific region have the lowest average dollars in each loan with a mean of only $475.56 per loan. Also noteworthy is that North America receives the fewest number of loans, but the average loan dollars is the highest in the world at a mean of $5,169.27 per loan.
We could also view the data as a scatterplot with regresssion lines.
ggplot(world_loan_train, aes(x = avg_loan_amount, y = total_loan_count)) + geom_jitter(aes(color = Region)) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + coord_flip() + facet_wrap(~ Region)
## `geom_smooth()` using formula 'y ~ x'
This confirms the same trend in lending by region just with a different view of the data.
Moving on, instead of grouping by geographic region, let’s also look at the number and amounts of loans by country-level income groups regardless of geographic region.
loans_by_affluence <- world_loans %>%
group_by(IncomeGroup) %>%
summarise(loan_count = sum(loan_count, na.rm = TRUE), loan_totals = sum(loan_totals, na.rm = TRUE)) %>%
mutate(avg_loan_amount = round(loan_totals / loan_count,2))
loans_by_affluence[order(loans_by_affluence$avg_loan_amount, decreasing = TRUE),]
## # A tibble: 4 x 4
## IncomeGroup loan_count loan_totals avg_loan_amount
## <chr> <int> <dbl> <dbl>
## 1 High income 6555 32987450 5032.
## 2 Upper middle income 131695 173429425 1317.
## 3 Low income 84220 76704900 911.
## 4 Lower middle income 424266 242532525 572.
loans_by_affluence$IncomeGroup <- factor(loans_by_affluence$IncomeGroup, levels = loans_by_affluence$IncomeGroup[order(loans_by_affluence$loan_count)])
loans_by_income_count <- ggplot(data=loans_by_affluence, aes(x=IncomeGroup,y=loan_count)) +
geom_col(fill="lightblue") +
geom_text(label=(c(loans_by_affluence$loan_count))) +
ggtitle("Number of loans by income group") +
xlab('Income group') +
ylab('Total $ amount of loans') +
coord_flip()
loans_by_affluence$Region <- factor(loans_by_affluence$IncomeGroup, levels = loans_by_affluence$IncomeGroup[order(loans_by_affluence$loan_count)])
loans_by_income_total_amount <- ggplot(data=loans_by_affluence, aes(x=IncomeGroup,y=loan_totals)) +
geom_col(fill="lightblue") +
geom_text(label=(c(loans_by_affluence$loan_totals))) +
ggtitle("Total $ amount of loans by income group") +
xlab('Income group') +
ylab('Total $ amount of loans') +
coord_flip()
loans_by_affluence$Region <- factor(loans_by_affluence$IncomeGroup, levels = loans_by_affluence$IncomeGroup[order(loans_by_affluence$loan_count)])
loans_by_income_avg_amount <- ggplot(data=loans_by_affluence, aes(x=IncomeGroup,y=avg_loan_amount)) +
geom_col(fill="lightblue") +
geom_text(label=(c(loans_by_affluence$avg_loan_amount))) +
ggtitle("Avg $ amount of loans by income group") +
xlab('Income group') +
ylab('Total $ amount of loans') +
coord_flip()
grid.arrange(loans_by_income_count, loans_by_income_total_amount, loans_by_income_avg_amount, nrow = 3)
It is curious that the high income countries receive the least total dollars but the highest average dollars per loan, while the lower middle income countries receive the greatest number of loans yet the lowest average dollars per loan. That means the lower middle income countries receive almost 9 times less on average per loan than the high income countries. Could this approach to making loans possibly lend credence to the saying "the rich get richer and the poor get poorer?
Let’s take yet another view and look at total loan amounts by total loan counts by sector.
ggplot(world_loan_train, aes(x = total_loan_amount, y = total_loan_count)) + geom_jitter(aes(color = Region)) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + coord_flip() + facet_wrap(~ sector)
## `geom_smooth()` using formula 'y ~ x'
Scanning across the loan sectors, it appears that loans for agriculture, clothing, food, and retail tend to receive more dollars than the other sectors.
We can use violin plots to get a sense of the overall distribution of loans by sector.
ggplot(world_loan_train, aes(y = total_loan_count, x = avg_loan_amount, color = sector)) + geom_violin() + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10') + coord_flip() + facet_wrap(~ sector)
Again agriculture, clothing, food and retail appear to have bigger bulges toward the top end of their loan counts in terms of avg amounts.
First let’s look a little more at the training data in terms of predicting number of loans purely based on avg percent of poverty in a country.
We see that the avg loans are quite right skewed, suggesting that the majority of loans made average smaller amounts.
hist(world_loan_train$avg_loan_amount, breaks = 50, xlab = "Avg Loan", col = "lightblue", main = "Avg Loans")
If we fit a linear model called m_loans to predict the amount of loans made in a country by its average poverty percentage, we see that the equation for the linear model turns out to be \[\widehat{total\_loans} = 695.732 + 3.477 * avg\_Pct\_Poverty\].
m_loans_poverty <- lm(world_loan_train$total_loan_count ~ world_loan_train$avg_pctPoverty, data = world_loan_train)
summary(m_loans_poverty)
##
## Call:
## lm(formula = world_loan_train$total_loan_count ~ world_loan_train$avg_pctPoverty,
## data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -939 -694 -649 -314 52858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 695.732 121.292 5.736 0.0000000133 ***
## world_loan_train$avg_pctPoverty 3.477 6.029 0.577 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3149 on 884 degrees of freedom
## Multiple R-squared: 0.0003762, Adjusted R-squared: -0.0007546
## F-statistic: 0.3327 on 1 and 884 DF, p-value: 0.5642
We also see that there is a linear relationship between the avg Pct of poverty and the total loans received.
ggplot(world_loan_train, aes(x = avg_pctPoverty, y = total_loan_count)) + geom_jitter(aes(color = Region)) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
Let’s pause to verify that the conditions for this model are reasonable using the standard diagnostic plots.
ggplot(data = m_loans_poverty, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
The variability of points around the least squares line remains roughly constant with a few extreme outliers, meaning that the relationship between the avg pct of poverty and the total amount of loans is roughly linear.
ggplot(data = m_loans_poverty, aes(x = .resid)) +
geom_histogram() +
xlab("Residuals")
The residuals show a considerable right skew with several far outliers.
ggplot(data = m_loans_poverty, aes(sample = .resid)) +
stat_qq()
The qq-plot also shows a highly right skewed curve with several high end outliers.
We will run the full model first against the training set. So let’s create the full model.
m_loans <- lm(total_loan_count ~ CountryName + Region + IncomeGroup + sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)
##
## Call:
## lm(formula = total_loan_count ~ CountryName + Region + IncomeGroup +
## sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9813 -512 116 437 41638
##
## Coefficients: (11 not defined because of singularities)
## Estimate Std. Error
## (Intercept) 2397.2986176292529 2796.1628819864382
## CountryNameAlbania -303.6996369267350 2864.3827865305916
## CountryNameArmenia 142.7670297398832 2864.3827865303933
## CountryNameAzerbaijan -358.5630701023196 2870.5357722683607
## CountryNameBelize -2272.2986176294403 3924.2681789746880
## CountryNameBenin -882.1790623331689 2961.5415192947094
## CountryNameBhutan -0.0000000001159 3893.9321714464168
## CountryNameBolivia 154.4336964066028 2864.3827865304493
## CountryNameBrazil -526.8777975259709 2895.6262119445364
## CountryNameBurkina Faso -390.8530298003951 2895.6213224707385
## CountryNameBurundi -620.7674411527719 2921.4020314724162
## CountryNameCambodia 1889.7670297399504 2864.3827865304675
## CountryNameCameroon -282.9177385629222 2877.6695864552671
## CountryNameChile -1034.6889811743922 2991.3486426639715
## CountryNameChina -480.8256323469025 2870.5499555882966
## CountryNameColombia 1033.7003630733109 2864.3827865305234
## CountryNameCosta Rica -328.5663035933894 2864.3827865304747
## CountryNameDominican Republic -840.0606533839924 2944.5280215731127
## CountryNameEcuador 468.7670297399083 2864.3827865303751
## CountryNameEl Salvador 2225.7003630733134 2864.3827865305461
## CountryNameGeorgia -318.3256323468976 2870.5499555883821
## CountryNameGhana -141.0329702600820 2864.3827865305079
## CountryNameGuam -1678.5722013162665 3924.4402762937907
## CountryNameGuatemala 54.7003630732498 2864.3827865305270
## CountryNameHaiti -232.0832867926781 2877.6335519656118
## CountryNameHonduras 4.5003630732590 2864.3827865304929
## CountryNameIndia 316.5003630732660 2864.3827865304802
## CountryNameIndonesia -18.3663035933984 2864.3827865305229
## CountryNameIraq -452.5958270376615 2877.6732187750663
## CountryNameIsrael -600.2668203751462 2895.7288634791739
## CountryNameJordan -142.3353071981244 2870.5532425794154
## CountryNameKenya 4622.3670297397884 2864.3827865303283
## CountryNameKosovo -338.0329702600695 2864.3827865304629
## CountryNameLebanon 153.5003630732604 2864.3827865304988
## CountryNameLesotho 118.3367439371703 3929.1581524224412
## CountryNameLiberia -203.1924999955195 2899.6799514870522
## CountryNameMadagascar -162.5436614140387 2877.6635088290154
## CountryNameMalawi -446.0338853335510 2885.9096966286938
## CountryNameMali 50.8494752012601 2895.6262119443377
## CountryNameMauritania -504.9495077255829 3924.5251711018486
## CountryNameMexico -49.8996369267380 2864.3827865304306
## CountryNameMoldova -465.5399180611796 2870.5499555883516
## CountryNameMongolia -428.6344986737290 2870.5357722682552
## CountryNameMozambique -248.7059272451345 2870.5357722683084
## CountryNameNamibia -647.7539511805671 3400.9417561738746
## CountryNameNepal -459.5912786388498 2885.9025930642770
## CountryNameNicaragua 352.7670297399196 2864.3827865304465
## CountryNameNigeria 469.2863837825488 2967.8561632666979
## CountryNamePakistan 1357.8336964066032 2864.3827865304529
## CountryNamePanama -600.5866211020823 2885.9318119470531
## CountryNameParaguay 360.9003630732673 2864.3827865304474
## CountryNamePeru 1049.5670297399013 2864.3827865303806
## CountryNamePhilippines 10263.4336964066570 2864.3827865304429
## CountryNamePuerto Rico -632.1536535420112 2907.3333344450980
## CountryNameRwanda 16.3670297399098 2864.3827865304079
## CountryNameSamoa 60.5295195186374 2870.5920381913415
## CountryNameSenegal -263.9916415308540 2870.5357722682338
## CountryNameSierra Leone -71.6329702600864 2864.3827865304347
## CountryNameSolomon Islands -575.3757805537406 2895.7408429632487
## CountryNameSomalia -630.3770024781575 2899.5945029435238
## CountryNameSouth Africa -483.3245942591770 2877.6282402726602
## CountryNameSouth Sudan -626.2246293597588 2895.6295437765925
## CountryNameSuriname -977.2998945258795 2967.9385682086217
## CountryNameTajikistan 872.7003630732477 2864.3827865304293
## CountryNameTanzania -84.6996369267567 2864.3827865304174
## CountryNameThailand -766.1502604601651 3093.2972881473070
## CountryNameTimor-Leste -297.8381228302801 2877.6448159451793
## CountryNameTogo -30.0596157078293 2895.6262119443563
## CountryNameTurkey -549.1725216078013 2907.2961271462918
## CountryNameUganda 940.7670297399101 2864.3827865304224
## CountryNameUkraine -479.0099401646601 2889.5770629993290
## CountryNameUnited States -26.4329702600880 2864.3827865304233
## CountryNameVanuatu -299.6632560628339 3929.1581524223825
## CountryNameVietnam 290.2336964065781 2864.3827865304243
## CountryNameZambia -600.6523920058125 2911.6816938341426
## CountryNameZimbabwe -200.0832867926773 2877.6335519657118
## RegionEurope & Central Asia NA NA
## RegionLatin America & Caribbean NA NA
## RegionMiddle East & North Africa NA NA
## RegionNorth America NA NA
## RegionSouth Asia NA NA
## RegionSub-Saharan Africa NA NA
## IncomeGroupLow income NA NA
## IncomeGroupLower middle income NA NA
## IncomeGroupUpper middle income NA NA
## sectorArts -2395.2986176289783 487.0040910352631
## sectorClothing -2057.9457011436989 476.2791050097732
## sectorConstruction -2543.5529107498342 489.7040459229187
## sectorEducation -2145.5349181733486 505.0007306266344
## sectorEntertainment -2872.6870451600362 546.3277453961758
## sectorFood -538.2962931765775 474.4062957141285
## sectorHealth -2456.3907904670623 489.6258243159040
## sectorHousing -2067.4983645019802 520.1590233244574
## sectorManufacturing -2508.2207489063289 487.3986250027600
## sectorPersonal Use -2093.6353615664248 510.7690227874791
## sectorRetail -717.7264163129320 473.5312914529525
## sectorServices -1891.3491099036526 474.2600996711696
## sectorTransportation -2408.4855162622484 502.3727012746329
## sectorWholesale -2773.3629165844509 521.9436184460473
## avg_Pop NA NA
## avg_pctPoverty NA NA
## t value Pr(>|t|)
## (Intercept) 0.857 0.39151
## CountryNameAlbania -0.106 0.91559
## CountryNameArmenia 0.050 0.96026
## CountryNameAzerbaijan -0.125 0.90063
## CountryNameBelize -0.579 0.56273
## CountryNameBenin -0.298 0.76587
## CountryNameBhutan 0.000 1.00000
## CountryNameBolivia 0.054 0.95702
## CountryNameBrazil -0.182 0.85566
## CountryNameBurkina Faso -0.135 0.89266
## CountryNameBurundi -0.212 0.83178
## CountryNameCambodia 0.660 0.50961
## CountryNameCameroon -0.098 0.92171
## CountryNameChile -0.346 0.72951
## CountryNameChina -0.168 0.86702
## CountryNameColombia 0.361 0.71828
## CountryNameCosta Rica -0.115 0.90871
## CountryNameDominican Republic -0.285 0.77549
## CountryNameEcuador 0.164 0.87005
## CountryNameEl Salvador 0.777 0.43737
## CountryNameGeorgia -0.111 0.91173
## CountryNameGhana -0.049 0.96074
## CountryNameGuam -0.428 0.66897
## CountryNameGuatemala 0.019 0.98477
## CountryNameHaiti -0.081 0.93574
## CountryNameHonduras 0.002 0.99875
## CountryNameIndia 0.110 0.91204
## CountryNameIndonesia -0.006 0.99489
## CountryNameIraq -0.157 0.87507
## CountryNameIsrael -0.207 0.83583
## CountryNameJordan -0.050 0.96047
## CountryNameKenya 1.614 0.10698
## CountryNameKosovo -0.118 0.90609
## CountryNameLebanon 0.054 0.95728
## CountryNameLesotho 0.030 0.97598
## CountryNameLiberia -0.070 0.94415
## CountryNameMadagascar -0.056 0.95497
## CountryNameMalawi -0.155 0.87721
## CountryNameMali 0.018 0.98599
## CountryNameMauritania -0.129 0.89766
## CountryNameMexico -0.017 0.98611
## CountryNameMoldova -0.162 0.87121
## CountryNameMongolia -0.149 0.88134
## CountryNameMozambique -0.087 0.93098
## CountryNameNamibia -0.190 0.84899
## CountryNameNepal -0.159 0.87351
## CountryNameNicaragua 0.123 0.90201
## CountryNameNigeria 0.158 0.87440
## CountryNamePakistan 0.474 0.63560
## CountryNamePanama -0.208 0.83520
## CountryNameParaguay 0.126 0.89977
## CountryNamePeru 0.366 0.71415
## CountryNamePhilippines 3.583 0.00036 ***
## CountryNamePuerto Rico -0.217 0.82793
## CountryNameRwanda 0.006 0.99544
## CountryNameSamoa 0.021 0.98318
## CountryNameSenegal -0.092 0.92675
## CountryNameSierra Leone -0.025 0.98005
## CountryNameSolomon Islands -0.199 0.84255
## CountryNameSomalia -0.217 0.82795
## CountryNameSouth Africa -0.168 0.86666
## CountryNameSouth Sudan -0.216 0.82884
## CountryNameSuriname -0.329 0.74203
## CountryNameTajikistan 0.305 0.76069
## CountryNameTanzania -0.030 0.97642
## CountryNameThailand -0.248 0.80445
## CountryNameTimor-Leste -0.104 0.91759
## CountryNameTogo -0.010 0.99172
## CountryNameTurkey -0.189 0.85022
## CountryNameUganda 0.328 0.74267
## CountryNameUkraine -0.166 0.86838
## CountryNameUnited States -0.009 0.99264
## CountryNameVanuatu -0.076 0.93923
## CountryNameVietnam 0.101 0.91932
## CountryNameZambia -0.206 0.83662
## CountryNameZimbabwe -0.070 0.94458
## RegionEurope & Central Asia NA NA
## RegionLatin America & Caribbean NA NA
## RegionMiddle East & North Africa NA NA
## RegionNorth America NA NA
## RegionSouth Asia NA NA
## RegionSub-Saharan Africa NA NA
## IncomeGroupLow income NA NA
## IncomeGroupLower middle income NA NA
## IncomeGroupUpper middle income NA NA
## sectorArts -4.918 0.000001060 ***
## sectorClothing -4.321 0.000017507 ***
## sectorConstruction -5.194 0.000000262 ***
## sectorEducation -4.249 0.000024058 ***
## sectorEntertainment -5.258 0.000000187 ***
## sectorFood -1.135 0.25685
## sectorHealth -5.017 0.000000648 ***
## sectorHousing -3.975 0.000076859 ***
## sectorManufacturing -5.146 0.000000335 ***
## sectorPersonal Use -4.099 0.000045754 ***
## sectorRetail -1.516 0.12999
## sectorServices -3.988 0.000072768 ***
## sectorTransportation -4.794 0.000001949 ***
## sectorWholesale -5.314 0.000000140 ***
## avg_Pop NA NA
## avg_pctPoverty NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2753 on 796 degrees of freedom
## Multiple R-squared: 0.3118, Adjusted R-squared: 0.2349
## F-statistic: 4.053 on 89 and 796 DF, p-value: < 0.00000000000000022
Let’s check the conditions for the regression.
ggplot(data = m_loans, aes(x = .resid)) +
geom_histogram() +
xlab("Residuals")
The model is somewhat normal but with several extreme outliers.
ggplot(data = m_loans, aes(x = .fitted, y = .resid)) +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
The variance of the residuals is not constant with several outliers.
ggplot(data = m_loans, aes(sample = .resid)) +
stat_qq()
The qqplot also appears quite right skewed with outliers on both tails.
So the conditions for mutliple or logistic regression are not met.
In the full model CountryName is a unique value to each country and cannot be “predicted” as a value for another country. Moreover, it appears that by including the specific countries that are already receiving loans we are overfitting the model, so we will remove the country names and see what happens.
m_loans <- lm(total_loan_count ~ Region + IncomeGroup + sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)
##
## Call:
## lm(formula = total_loan_count ~ Region + IncomeGroup + sector +
## avg_Pop + avg_pctPoverty, data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3865 -838 -140 338 50384
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2285.3682871161 690.4883375156 3.310
## RegionEurope & Central Asia -1055.6598003278 412.0111163657 -2.562
## RegionLatin America & Caribbean -687.5082090687 363.5408652633 -1.891
## RegionMiddle East & North Africa -731.6948147972 548.3654691616 -1.334
## RegionNorth America 48.5845001486 981.1512130967 0.050
## RegionSouth Asia -975.2602118319 570.7603718902 -1.709
## RegionSub-Saharan Africa -1319.6631762407 385.9877928602 -3.419
## IncomeGroupLow income 1001.9556786361 616.5994382708 1.625
## IncomeGroupLower middle income 1569.6642596866 562.6741555606 2.790
## IncomeGroupUpper middle income 638.9017427702 525.7833812106 1.215
## sectorArts -2313.1509271068 528.5294882114 -4.377
## sectorClothing -2019.3735317885 521.8020578040 -3.870
## sectorConstruction -2433.1446273661 535.4330401440 -4.544
## sectorEducation -1969.7765768139 550.8528323355 -3.576
## sectorEntertainment -2530.9729651644 594.0519730943 -4.261
## sectorFood -487.7652637183 519.8003023799 -0.938
## sectorHealth -2319.1532550516 535.1921523670 -4.333
## sectorHousing -1898.6813779493 567.0492638203 -3.348
## sectorManufacturing -2401.0088662018 532.8345160668 -4.506
## sectorPersonal Use -1963.1417777867 551.2701963350 -3.561
## sectorRetail -690.7831507630 516.0159644330 -1.339
## sectorServices -1866.9159294259 517.9016033632 -3.605
## sectorTransportation -2269.7751482054 548.0698547509 -4.141
## sectorWholesale -2549.3859864936 566.9806158746 -4.496
## avg_Pop -0.0000002787 0.0000005088 -0.548
## avg_pctPoverty 7.8721979833 7.1933757042 1.094
## Pr(>|t|)
## (Intercept) 0.000973 ***
## RegionEurope & Central Asia 0.010570 *
## RegionLatin America & Caribbean 0.058941 .
## RegionMiddle East & North Africa 0.182452
## RegionNorth America 0.960518
## RegionSouth Asia 0.087867 .
## RegionSub-Saharan Africa 0.000658 ***
## IncomeGroupLow income 0.104535
## IncomeGroupLower middle income 0.005393 **
## IncomeGroupUpper middle income 0.224645
## sectorArts 0.00001354 ***
## sectorClothing 0.000117 ***
## sectorConstruction 0.00000630 ***
## sectorEducation 0.000369 ***
## sectorEntertainment 0.00002265 ***
## sectorFood 0.348318
## sectorHealth 0.00001642 ***
## sectorHousing 0.000848 ***
## sectorManufacturing 0.00000752 ***
## sectorPersonal Use 0.000390 ***
## sectorRetail 0.181027
## sectorServices 0.000330 ***
## sectorTransportation 0.00003792 ***
## sectorWholesale 0.00000786 ***
## avg_Pop 0.583958
## avg_pctPoverty 0.274100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3030 on 860 degrees of freedom
## Multiple R-squared: 0.09942, Adjusted R-squared: 0.07324
## F-statistic: 3.797 on 25 and 860 DF, p-value: 0.000000001792
The relative strength of the model without country names dropped to \[R^2_{adj} = 0.07324\].
It is interesting that with the country names included the dimensions of world region and income group were all “NA”, but without the countries they begin to assume some significance within certain values such as Region = Europe & Central Asia and Income Group = Lower middle income.
For the next round we will remove Region since it has the highest p-value in North America.
m_loans <- lm(total_loan_count ~ IncomeGroup + sector + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)
##
## Call:
## lm(formula = total_loan_count ~ IncomeGroup + sector + avg_Pop +
## avg_pctPoverty, data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3244 -805 -166 266 51187
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1777.457928715200 539.739965506075 3.293
## IncomeGroupLow income 356.463813132056 495.030537051976 0.720
## IncomeGroupLower middle income 1291.879415052892 448.686167227039 2.879
## IncomeGroupUpper middle income 408.322308767657 445.834864987234 0.916
## sectorArts -2294.877210070119 530.590110182443 -4.325
## sectorClothing -2012.063562258315 524.228341341961 -3.838
## sectorConstruction -2421.217749999545 537.823552176587 -4.502
## sectorEducation -1968.769702210150 553.342820886069 -3.558
## sectorEntertainment -2494.089433513088 596.255967790020 -4.183
## sectorFood -499.266400554085 522.227199381874 -0.956
## sectorHealth -2341.189442695978 537.535878213656 -4.355
## sectorHousing -1854.735939505492 569.223793140106 -3.258
## sectorManufacturing -2412.068343052929 535.239702703112 -4.507
## sectorPersonal Use -1949.510495593590 553.534616954894 -3.522
## sectorRetail -702.186695557686 518.398794674743 -1.355
## sectorServices -1872.509792145529 520.302911058629 -3.599
## sectorTransportation -2246.574759455291 550.528836786731 -4.081
## sectorWholesale -2537.639450665607 569.295364362355 -4.458
## avg_Pop -0.000000005909 0.000000429850 -0.014
## avg_pctPoverty 3.567894539515 6.665972917777 0.535
## Pr(>|t|)
## (Intercept) 0.001031 **
## IncomeGroupLow income 0.471667
## IncomeGroupLower middle income 0.004084 **
## IncomeGroupUpper middle income 0.359995
## sectorArts 0.00001702 ***
## sectorClothing 0.000133 ***
## sectorConstruction 0.00000766 ***
## sectorEducation 0.000394 ***
## sectorEntertainment 0.00003171 ***
## sectorFood 0.339322
## sectorHealth 0.00001487 ***
## sectorHousing 0.001164 **
## sectorManufacturing 0.00000749 ***
## sectorPersonal Use 0.000451 ***
## sectorRetail 0.175921
## sectorServices 0.000338 ***
## sectorTransportation 0.00004904 ***
## sectorWholesale 0.00000938 ***
## avg_Pop 0.989035
## avg_pctPoverty 0.592621
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3045 on 866 degrees of freedom
## Multiple R-squared: 0.08457, Adjusted R-squared: 0.06449
## F-statistic: 4.211 on 19 and 866 DF, p-value: 0.000000005472
The relative strength of the model without Region dropped to \[R^2_{adj} = 0.06449\].
The adjusted-R-squared decreased by removing region so we will add it back in, and now the largest p-value is in avg_Pop, so we will try removing that dimension next.
m_loans <- lm(total_loan_count ~ Region + IncomeGroup + sector + avg_pctPoverty, data = world_loan_train)
summary(m_loans)
##
## Call:
## lm(formula = total_loan_count ~ Region + IncomeGroup + sector +
## avg_pctPoverty, data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3827 -839 -167 340 50397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2222.644 680.652 3.265 0.001136 **
## RegionEurope & Central Asia -990.684 394.409 -2.512 0.012193 *
## RegionLatin America & Caribbean -628.882 347.291 -1.811 0.070517 .
## RegionMiddle East & North Africa -658.444 531.597 -1.239 0.215826
## RegionNorth America 22.311 979.580 0.023 0.981834
## RegionSouth Asia -1078.494 538.535 -2.003 0.045529 *
## RegionSub-Saharan Africa -1282.374 379.784 -3.377 0.000767 ***
## IncomeGroupLow income 1017.196 615.721 1.652 0.098891 .
## IncomeGroupLower middle income 1589.577 561.271 2.832 0.004732 **
## IncomeGroupUpper middle income 621.510 524.611 1.185 0.236460
## sectorArts -2308.336 528.242 -4.370 0.00001395 ***
## sectorClothing -2019.327 521.590 -3.871 0.000116 ***
## sectorConstruction -2434.387 535.211 -4.548 0.00000618 ***
## sectorEducation -1971.350 550.621 -3.580 0.000363 ***
## sectorEntertainment -2533.711 593.789 -4.267 0.00002201 ***
## sectorFood -488.994 519.584 -0.941 0.346905
## sectorHealth -2322.442 534.941 -4.341 0.00001583 ***
## sectorHousing -1901.263 566.799 -3.354 0.000830 ***
## sectorManufacturing -2402.728 532.609 -4.511 0.00000734 ***
## sectorPersonal Use -1967.378 550.992 -3.571 0.000376 ***
## sectorRetail -690.483 515.806 -1.339 0.181038
## sectorServices -1867.163 517.691 -3.607 0.000328 ***
## sectorTransportation -2270.674 547.845 -4.145 0.00003738 ***
## sectorWholesale -2547.299 566.737 -4.495 0.00000792 ***
## avg_pctPoverty 8.012 7.186 1.115 0.265206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3029 on 861 degrees of freedom
## Multiple R-squared: 0.0991, Adjusted R-squared: 0.07399
## F-statistic: 3.946 on 24 and 861 DF, p-value: 0.0000000009874
The relative strength of the model without avg_Pop increased to \[R^2_{adj} = 0.07399\]. It appears that avg_Pop is not a predictor variable for this model.
The adjusted-R-squared increased slightly by removing the avg_Pop, and next we can try removing the avg_pctPoverty since it has the next highest p-value.
m_loans <- lm(total_loan_count ~ Region + IncomeGroup + sector, data = world_loan_train)
summary(m_loans)
##
## Call:
## lm(formula = total_loan_count ~ Region + IncomeGroup + sector,
## data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3823 -819 -196 311 50415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2244.083 680.477 3.298 0.001014 **
## RegionEurope & Central Asia -1025.379 393.235 -2.608 0.009277 **
## RegionLatin America & Caribbean -637.407 347.256 -1.836 0.066768 .
## RegionMiddle East & North Africa -689.792 530.928 -1.299 0.194216
## RegionNorth America 9.848 979.654 0.010 0.991982
## RegionSouth Asia -1113.351 537.703 -2.071 0.038696 *
## RegionSub-Saharan Africa -1167.162 365.506 -3.193 0.001458 **
## IncomeGroupLow income 1101.965 611.095 1.803 0.071696 .
## IncomeGroupLower middle income 1612.015 560.989 2.874 0.004159 **
## IncomeGroupUpper middle income 634.187 524.561 1.209 0.227000
## sectorArts -2308.579 528.316 -4.370 0.00001396 ***
## sectorClothing -2016.141 521.656 -3.865 0.000119 ***
## sectorConstruction -2441.927 535.243 -4.562 0.00000579 ***
## sectorEducation -1968.879 550.695 -3.575 0.000369 ***
## sectorEntertainment -2529.446 593.861 -4.259 0.00002276 ***
## sectorFood -488.102 519.657 -0.939 0.347851
## sectorHealth -2320.920 535.015 -4.338 0.00001608 ***
## sectorHousing -1897.561 566.869 -3.347 0.000851 ***
## sectorManufacturing -2397.970 532.667 -4.502 0.00000766 ***
## sectorPersonal Use -1970.810 551.061 -3.576 0.000368 ***
## sectorRetail -690.537 515.879 -1.339 0.181065
## sectorServices -1868.359 517.763 -3.609 0.000326 ***
## sectorTransportation -2275.908 547.902 -4.154 0.00003595 ***
## sectorWholesale -2540.825 566.787 -4.483 0.00000836 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3030 on 862 degrees of freedom
## Multiple R-squared: 0.0978, Adjusted R-squared: 0.07373
## F-statistic: 4.063 on 23 and 862 DF, p-value: 0.0000000007722
The relative strength of the model without Avg_PctPoverty decreased to \[R^2_{adj} = 0.07373\].
Removing Avg_PctPoverty decreased the strength of the model, so it should be added back again. Last we can remove sector and see what that does to the overall model.
m_loans <- lm(total_loan_count ~ IncomeGroup + avg_Pop + avg_pctPoverty, data = world_loan_train)
summary(m_loans)
##
## Call:
## lm(formula = total_loan_count ~ IncomeGroup + avg_Pop + avg_pctPoverty,
## data = world_loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1505 -905 -360 -116 52278
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 118.98711815875 423.35435149197 0.281
## IncomeGroupLow income 229.08451532504 506.77245741986 0.452
## IncomeGroupLower middle income 1156.95050324500 459.20112384950 2.519
## IncomeGroupUpper middle income 286.21013197798 456.35146798703 0.627
## avg_Pop -0.00000004095 0.00000044096 -0.093
## avg_pctPoverty 4.01731518191 6.83507402487 0.588
## Pr(>|t|)
## (Intercept) 0.7787
## IncomeGroupLow income 0.6513
## IncomeGroupLower middle income 0.0119 *
## IncomeGroupUpper middle income 0.5307
## avg_Pop 0.9260
## avg_pctPoverty 0.5569
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3124 on 880 degrees of freedom
## Multiple R-squared: 0.02051, Adjusted R-squared: 0.01495
## F-statistic: 3.686 on 5 and 880 DF, p-value: 0.002626
The relative strength of the model without sector decreased to \[R^2_{adj} = 0.01495\]. It seems that sector adds great value to the model.
Since removing sector is catastrophic to the model, and since the test set has no sector values in it to predict upon, this seems to be a failing venture. So we will have to abandon the prospect at this time of creating a predictive model based on the training and testing data that was derived from the current available datasets.
Examing the lending practices of essentially Americans to the rest of the world in which we live is an important thing to examine, because it shows our preferences for helping and supporting different parts of the world. It may even possibly hint at underlying attitudes about the regions of the world we are being so philanthropic towards.
The above analysis roughly succeeded in the simple linear regression of using poverty levels to predict a rough number of loans made, as we see in this faceted version of a prior graphic above:
ggplot(world_loan_train, aes(x = avg_pctPoverty, y = total_loan_count)) + geom_jitter(aes(color = Region)) + facet_wrap(~ Region) + stat_smooth(method ="lm",se = TRUE) + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
There is a positive linear relationship in several regions of the world between the poverty in the area and the number of loans received from Kiva.
However, we utterly failed to create an ideal multivariate model by which to predict how many loans might be made to each country that did not receive loans based on characteristics of the given country.
It may be possible that there are other factors underlying the lending tendencies of Americans toward the rest of the world, which could be confounding to the current analysis if those factors are not present in it. Perhaps if we were to include other characteristics of the countries that received loans, such as predominant race, ethnicity, political stance and religious prevalence in each country, we might be better equipped to distill a better model to predict where Americans overall might be willing to invest their money to lift up those who need help.
Kiva, November 3, 2020, https://www.kiva.org/.
“Population, total.” The World Bank, November 3, 2020. https://data.worldbank.org/indicator/SP.POP.TOTL?end=2017&start=2014.
“GHO OData API.” The World Health Data Platform, November 3, 2020. https://www.who.int/data/gho/info/gho-odata-api.
“Data Science for Good: Kiva Crowdfunding.” Kaggle, November 3. 2020, https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding.
“tmap: get started!” November 20, 2020, https://cran.r-project.org/web/packages/tmap/vignettes/tmap-getstarted.html.