Analyzing Global Findex Data Using ANOVA And GLM

Introduction

The goal of this project is to demonstrate statistical analysis of survey data using Analysis of Variance(ANOVA) and Generalized Linear Model(GLM). The Global Financial Inclusion, Global Findex database, 2014 edition will be used to solve following questions.

Is there a difference in financial account ownership between India, China, and G7 Nations by age group?
India has a good portion of the population, that does not have a financial account. Is there any difference in reasons for not having financial account across various groups?
Using India data, develop a predictive model to identify whether a respondent has a financial account.

Method and Software usage

The principles of tidy data provided by Hadley Wickham are followed throughout the process of cleaning and preparing the data for analysis. The software tool used for the project is R. Most of the statistical functions are used from stats package. graphs are shown using ggplot2 package.

Data Dictionary

Information about the variables can be obtained from Global-Findex-Data-Dictionary.csv. Actual survey results in its raw form are present in micro_world.csv. Variables may contain the recorded results of a direct question asked, or be derived in some way.

Libraries used.

if (!require('plyr')) install.packages('plyr')                #Data frame and table functions
if (!require('dplyr')) install.packages('dplyr')              #Data frame and table functions
if (!require('stringr')) install.packages('stringr')          #String manuplilation functions
if (!require('ggplot2')) install.packages('ggplot2')          #Graphics display
if (!require('tidyr')) install.packages('tidyr')              #Tidy data using spread() and gather() functions
if (!require('gridExtra')) install.packages('gridExtra')      #Display graphs side by side
if (!require('knitr')) install.packages('knitr')              #Report display, table format
if (!require('stats')) install.packages('stats')
if (!require('caret')) install.packages('caret')              #For confusion matrix

Data Preparation

The survey was carried out over the 2014 calendar year by Gallup, Inc. in more than 160 countries and over 140 languages, using randomly selected, nationally representative samples. The target population is the entire civilian population of age 15 years and above. This is an observational data. Most of the variables are stored as numerical values, and they are converted into character based categorical variables. Survey question surrounding income is based on household income quintile(1 - Poorest 20%, 2 - Second 20%, 3 - Middle 20%, 4 - Fourth 20%, 5 - Richest 20%). .

agef <- function(x){
  if (is.na(x)){
    age <- NA
    return(age)
  }
  if (x < 26){
    age <- "Below 26"
  } else if(x > 25 & x < 36){
    age <- "26 - 35"
  } else if(x > 35 & x < 46){
    age <- "36 - 45"
  } else if(x > 45 & x < 56){
    age <- "46 - 55"
  } else if(x > 55 & x < 66){
    age <- "56 - 65"
  } else if (x > 65){
    age <- "Above 65"
  } else{
    age <- NA
  }
  return(age)
}

incomef <- function(x){
  if (is.na(x)){
    inc <- NA
    return(inc)
  }
  if (x == 1){
    inc <- "Poorest 20%"
  } else if(x == 2){
    inc <- "Second 20%"
  } else if(x == 3){
    inc <- "Middle 20%"
  } else if(x == 4){
    inc <- "Fourth 20%"
  } else if(x == 5){
    inc <- "Richest 20%"
  } else{
    inc <- NA
  }
  return(inc)
}

eduf <- function(x){
  if (is.na(x)){
    col <- NA
    return(col)
  }
  if (x == 1){
    col <- "HS or Less"
  } else if(x == 2){
    col <- "Some College"
  } else if(x == 3){
    col <- "College Grad"
  } else{
    col <- NA
  }
  return(col)
}

age.order <- data.frame(age<-"Below 26", order<-1, stringsAsFactors = F)
age.order <- rbind(age.order, c("26 - 35", 2))
age.order <- rbind(age.order, c("36 - 45", 3))
age.order <- rbind(age.order, c("46 - 55", 4))
age.order <- rbind(age.order, c("56 - 65", 5))
age.order <- rbind(age.order, c("Above 65", 6))
colnames(age.order) <- c("age","order")
age.order$order <- as.integer(age.order$order)

income.order <- data.frame(age<-"Poorest 20%", order<-1, stringsAsFactors = F)
income.order <- rbind(income.order, c("Second 20%", 2))
income.order <- rbind(income.order, c("Middle 20%", 3))
income.order <- rbind(income.order, c("Fourth 20%", 4))
income.order <- rbind(income.order, c("Richest 20%", 5))
colnames(income.order) <- c("income","order")
income.order$order <- as.integer(income.order$order)

edu.order <- data.frame(edu<-"HS or Less", order<-1, stringsAsFactors = F)
edu.order <- rbind(edu.order, c("Some College", 2))
edu.order <- rbind(edu.order, c("College Grad", 3))
colnames(edu.order) <- c("edu","order")
edu.order$order <- as.integer(edu.order$order)

attOrder <- age.order %>% select(attName=age, order)
attOrder <- rbind(attOrder, as.data.frame(income.order %>% mutate(order = order + 20) %>% select(attName=income, order)))
attOrder <- rbind(attOrder, as.data.frame(edu.order %>% mutate(order = order + 30) %>% select(attName=edu, order)))
attOrder <- rbind(attOrder, data.frame(attName="Male",order = 0, stringsAsFactors = F))
attOrder <- rbind(attOrder, data.frame(attName="Female",order = -1, stringsAsFactors = F))
attOrder <- arrange(attOrder, order)

reasons <- data.frame("Cannot get one", 1, stringsAsFactors = F)
reasons <- rbind(reasons, c("Family member already has one", 2))
reasons <- rbind(reasons, c("Financial institution too far away", 3))
reasons <- rbind(reasons, c("Lack of documentation", 4))
reasons <- rbind(reasons, c("Lack of money", 5))
reasons <- rbind(reasons, c("Lack of trust", 6))
reasons <- rbind(reasons, c("No need for financial services", 7))
reasons <- rbind(reasons, c("Religious reasons", 8))
reasons <- rbind(reasons, c("Too expensive to have any account", 9))
colnames(reasons) <- c("reason","order")

#Change path
fileDir <- "D:/CUNY/606/Project-1/FinalProject"

#Load survey data.
world.findex.data <- read.csv(file.path(fileDir, "micro_world.csv"), stringsAsFactors=FALSE)

Question 1: Is there a difference in financial account ownership between India, China and G7 Nations by age group?

I will be using ANOVA statistical method to identify if there is a difference in financial account ownership. Analysis of Variance (ANOVA) is a statistical method used for testing differences between two or more means. Data has to meet following conditions before applying ANOVA. If any of the conditions are not satisfied, the results from the use of ANOVA techniques may be unreliable.

Each sample is an independent random sample

As data is collected by Gallup, Inc from each country on a random basis. Survey data meets the condition of independence.

The distribution of the response variable follows a normal distribution

For the scope of the project, I will be using account attribute as the response variable. This variable has two possible outcomes yes if the respondent has an account with financial institution otherwise no. Observations with missing values are discarded. This condition can be checked with the use of graphs; I will be using box plot and side-by-side dot plot to identify the normal distribution of the response variable.

The variability across the groups is about equal.

This can be evaluated by using rule of thumb if the largest sample standard deviation divided by the smallest sample standard deviation is not greater than two, then assume that the population variances are equal. I will be using box plot to identify variability across the groups.

Following graph shows a total number of respondents having an account with the financial institution. Account ownership percentage is higher among G7 Nations compared to India and China.

#Get India, China and G7 Nations data
in_cn_g7.data <- world.findex.data %>% 
  filter(economy %in% c("India", "China", "Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States")) %>% 
  select(economy, female, age, educ, inc_q, account)

in_cn_g7.data$hasAccount <- ifelse(in_cn_g7.data$account == 1, "Yes", ifelse(in_cn_g7.data$account == 2, "No", NA))
in_cn_g7.data$gender <- ifelse(in_cn_g7.data$female == 1, "Female", ifelse(in_cn_g7.data$female == 2, "Male",NA))
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$age, agef))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "ageGroup"
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$inc_q, incomef))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "incomeGroup"
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$educ, eduf))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "eduGroup"

#Get percentage of population that has accounts.
in_cn_g7.data1 <- in_cn_g7.data %>%
  filter(age>0) %>%
  group_by(economy) %>% 
  mutate(hasAcc = n())

in_cn_g7.data1 <- in_cn_g7.data1 %>% 
  filter(hasAccount=="Yes") %>% 
  group_by(economy, ageGroup, hasAcc) %>% 
  select (economy, ageGroup, hasAcc) %>% 
  mutate(ageGroupTotal = n()) %>% 
  distinct (ageGroupTotal)

in_cn_g7.data1 <- in_cn_g7.data1 %>% 
  mutate(accPer = round(ageGroupTotal*100/hasAcc, 3))

#Graphical display
in_cn_g7.data1 %>%
  group_by(economy) %>% 
  mutate(value = sum(accPer)) %>%
ggplot(aes(x=economy, y=value, fill = value)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Country", y="Percentage Account Holders", title = "Population Sample With Financial Accounts", subtitle = "Comparision between countries") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_text(aes(label=value), vjust=0, color="black") + labs(fill = "Ownership")

Following data shows comparision of financial account ownership by age group accross countries.

#Group data by age group
in_cn_g7.data <- in_cn_g7.data %>%
  filter(hasAccount=="Yes") %>% 
  filter(age>0) %>%
  group_by(economy) %>% 
  mutate(hasAcc = n())

in_cn_g7.data <- in_cn_g7.data %>% 
  group_by(economy, ageGroup, hasAcc) %>% 
  select (economy, ageGroup, hasAcc) %>% 
  mutate(ageGroupTotal = n()) %>% 
  distinct (ageGroupTotal)

in_cn_g7.data <- in_cn_g7.data %>% 
  mutate(accPer = round(ageGroupTotal*100/hasAcc, 3))

age.in_cn_g7.data <- in_cn_g7.data %>%
  inner_join(age.order, by = c("ageGroup" = "age")) %>% 
  arrange(economy, order) 

age.in_cn_g7.data <- transform(age.in_cn_g7.data, ageGroup=factor(ageGroup,levels=unique(ageGroup)))
colnames(age.in_cn_g7.data)[colnames(age.in_cn_g7.data)=="ageGroup"] <- "Age Group"
colnames(age.in_cn_g7.data)[colnames(age.in_cn_g7.data)=="accPer"] <- "value"

age.in_cn_g7.data$ageGroupTotal <- NULL

#Display data in table format
age.in_cn_g7.data %>% 
  select(economy, `Age Group`, value) %>% 
  spread(economy, value) %>%
  kable(format='pandoc', caption = "Population Sample With Financial Accounts - Countries by Age Group")

Population Sample With Financial Accounts - Countries by Age Group
Age Group	Canada	China	France	Germany	India	Italy	Japan	United Kingdom	United States
Below 26	10.010	12.433	6.900	7.121	21.765	6.494	3.030	4.044	10.677
26 - 35	11.249	18.190	11.123	9.766	25.641	12.446	5.152	11.426	10.254
36 - 45	14.551	20.816	22.760	12.208	24.210	22.511	17.576	19.919	12.474
46 - 55	20.537	20.911	24.923	21.872	14.490	23.918	17.374	26.491	14.905
56 - 65	20.640	15.407	20.906	19.939	10.137	20.130	25.960	22.548	22.199
Above 65	23.013	12.243	13.388	29.095	3.757	14.502	30.909	15.571	29.493

Following graph show that data, doesn’t have the ideal bell-shaped appearance, and it suggests there are some outliers in the data. Financial account ownership differ between age groups in each country.

#Graphical display
ggplot(age.in_cn_g7.data, aes(x=economy, y=value, fill = `Age Group`)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Country", y="Accounts", title = "Percentage of Population With Financial Accounts", subtitle = "Comparision between countries by age group") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(fill = "Ownership") + scale_fill_brewer(palette="Paired")

Side-by-side dot plot show data within each group are very volatile. Graph also suggest, there are outliers in Japan data. Box plot suggest United States differ from rest of the countries. Since data is collected randomly in each country and graphs does not display anthing extreme we can assume distribution of the response variable follows a normal distribution and variability across the groups exists. This meets second and third conditions.

qplot(x = economy, y = value, data = age.in_cn_g7.data, color = value) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_smooth(method = "lm") +
  labs(x="Country", y="% of Accounts Holders By Age Group", title = "Financial Account Ownership By Country And Age Group", subtitle = "Side-by-side dot plot", color = "Ownership")

fill <- "#4271AE"
line <- "#1F3552"

age.in_cn_g7.data %>% 
  ggplot(aes(x = economy, y = value)) +
        geom_boxplot(fill = fill, colour = line) +
        scale_y_continuous(name = "Account") +
        scale_x_discrete(name = "Country") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Average(Mean) Account Ownership By Country")

Standard deviations between countries range between 3.915 and 11.037. Even though quotient is greater than two when maximum standard deviation is divided by minimum standard deviation(2.819), we can assume age group variances are equal across countries as data is collected randomly in each country.

age.in_cn_g7.data %>% 
  select(economy, `Age Group`, value) %>% 
  group_by(economy) %>% 
  mutate(ecoSD = round(sd(value),3)) %>% 
  select(economy, ecoSD) %>% 
  distinct(ecoSD) %>% 
  spread(economy, ecoSD) %>%
  kable(format='pandoc', caption = "Standard Deviation by Countries Based on Age Group")

Standard Deviation by Countries Based on Age Group
Canada	China	France	Germany	India	Italy	Japan	United Kingdom	United States
5.462	3.915	7.213	8.377	8.689	6.704	11.037	8.12	7.652

Hypothesis test.

Null hypothesis: There is no difference in account ownership percentage between age groups across countries.

\(H_0: {\mu}_{can} = {\mu}_{chn} = {\mu}_{fra} = {\mu}_{deu} = {\mu}_{ind} = {\mu}_{ita} = {\mu}_{jpn} = {\mu}_{uk} = {\mu}_{usa}\)

Alternative hypothesis: There is difference in account ownership percentage between age groups across countries.

\(H_A: {\mu}_{can} \ne {\mu}_{chn} \ne {\mu}_{fra} \ne {\mu}_{deu} \ne {\mu}_{ind} \ne {\mu}_{ita} \ne {\mu}_{jpn} \ne {\mu}_{uk} \ne {\mu}_{usa}\)

options("scipen"=100, "digits"=4)
#Apply ANOVA
age.anova <- aov(value ~ `Age Group` + economy, data = age.in_cn_g7.data)
summary(age.anova)

##             Df Sum Sq Mean Sq F value Pr(>F)   
## `Age Group`  5    952     190    4.43 0.0026 **
## economy      8      0       0    0.00 1.0000   
## Residuals   40   1718      43                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Draw residuals plot to test is data is normally distributed 
ggplot(data=as.data.frame(qqnorm( age.anova$residuals , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point() + geom_smooth(method="lm", se=FALSE)  +
        scale_y_continuous(name = "Sample Quantiles") +
        scale_x_discrete(name = "Theoretical Quantiles") +
        ggtitle("Normal Q-Q Plot")

Conclusion

- Data meets fairly all three conditions of ANOVA.

- The normal plot of the residuals shows data points lie pretty close to the line. Some deviation is noticed near the ends. This suggests data is nearly normal.

- The p-value of the test is 0.0026, less than the default significance level of \(\alpha = 0.05\). We reject null hypothesis \(H_0\). In other words, there is difference in account ownership percentage between age groups across countries.

- As data was collected from each country based on size of the population, weighted sampling, aggregates are converted to percentages.

Question 2: India has a good portion of the population, that does not have a financial account. Is there any difference in reasons for not having financial account across various groups?

I will be using ANOVA technique to answer the question and use TukeyHSD function to identify if any reason pair stands out. Following graph shows a total number of respondents that does not have an account with the financial institution for various reasons.

#Get India data
india.data <- world.findex.data %>% 
  filter(economy == "India") %>% 
  select(economy, female, age, educ, inc_q, account, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i)

#Convert numerical values to categorical values
india.data$hasAccount <- ifelse(india.data$account == 1, "Yes", ifelse(india.data$account == 2, "No", NA))
india.data$gender <- ifelse(india.data$female == 1, "Female", ifelse(india.data$female == 2, "Male",NA))
india.data = cbind(india.data, ldply(india.data$age, agef))
colnames(india.data)[ncol(india.data)] <- "ageGroup"
india.data = cbind(india.data, ldply(india.data$inc_q, incomef))
colnames(india.data)[ncol(india.data)] <- "incomeGroup"
india.data = cbind(india.data, ldply(india.data$educ, eduf))
colnames(india.data)[ncol(india.data)] <- "eduGroup"

#Subset data by respondents who do not have accounts
no.account <- india.data %>% filter(account==2)
no.account <- no.account %>% filter(!is.na(age))
no.account.tidy <- no.account %>% 
  select (gender, ageGroup, incomeGroup, eduGroup, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i) %>% 
  gather(noAccountReason, value, -gender, -ageGroup, -incomeGroup, -eduGroup) %>% filter(value==1)

#Expand the questions
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8a"] <- "Financial institution too far away"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8b"] <- "Too expensive to have any account"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8c"] <- "Lack of documentation"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8d"] <- "Lack of trust"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8e"] <- "Religious reasons"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8f"] <- "Lack of money"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8g"] <- "Family member already has one"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8h"] <- "Cannot get one"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8i"] <- "No need for financial services"

#Get data to display on the graph
no.account.reason <- no.account.tidy %>% 
  group_by(noAccountReason) %>% 
  summarise(value = n()) %>% 
  inner_join(reasons, by = c("noAccountReason"="reason")) %>% 
  arrange(desc(order))

#Convert columns to factors to preserve data order  
no.account.reason <- transform(no.account.reason, noAccountReason=factor(noAccountReason,levels=unique(noAccountReason)))

#Graphical display
no.account.reason %>%
ggplot(aes(x=noAccountReason, y=value, fill = value)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Reasons", y="Respondents", title = "Population Sample With No Financial Accounts - India", subtitle = "Comparision between ressons") + labs(fill = "Counts") + coord_flip()

Above graph suggests that Lack of money is a primary reason for not having an account with the financial institution. Is Lack of money only reason across various Gender, Age groups, Education level and Income groups for not having an account with a financial institution? Following graphs show the breakdown by various groups.

Reasons for not having an account based on Age Groups. The graph shows age group below 26 years age top reason for not having an account with a financial institution is different from other age groups.

#Graphical display
no.account.age <- no.account.tidy %>% 
  group_by(ageGroup, noAccountReason) %>% 
  summarise(ageReasonvalue = n()) %>%
  inner_join(age.order, by = c("ageGroup"="age")) %>% 
  arrange(order,noAccountReason)

no.account.age <- no.account.age %>% group_by(ageGroup) %>% mutate(agevalue = sum(ageReasonvalue))

no.account.age <- no.account.age %>% mutate(value = round(ageReasonvalue * 100/agevalue,3))
#Convert data to factors
no.account.age <- transform(no.account.age, ageGroup=factor(ageGroup,levels=unique(ageGroup)))

ggplot(no.account.age, aes(x=ageGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Age Group", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Age Groups") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Reasons for not having an account based on Education Level. The graph shows reasons vary by Education Level.

#Graphical display
no.account.edu <- no.account.tidy %>% 
  group_by(eduGroup, noAccountReason) %>% 
  summarise(eduReasonvalue = n()) %>%
  inner_join(edu.order, by = c("eduGroup"="edu")) %>% 
  arrange(order,noAccountReason)

no.account.edu <- no.account.edu %>% group_by(eduGroup) %>% mutate(eduvalue = sum(eduReasonvalue))

no.account.edu <- no.account.edu %>% mutate(value = round(eduReasonvalue * 100/eduvalue,3))

#Convert data to factors
no.account.edu <- transform(no.account.edu, eduGroup=factor(eduGroup,levels=unique(eduGroup)))

ggplot(no.account.edu, aes(x=eduGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Education Level", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Education Level") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Reasons for not having an account based on household income. It shows reasons vary by household income. For first three groups Lack of money is top reason.

#Graphical display
no.account.income <- no.account.tidy %>% 
  group_by(incomeGroup, noAccountReason) %>% 
  summarise(incomeReasonvalue = n()) %>%
  inner_join(income.order, by = c("incomeGroup"="income")) %>% 
  arrange(order,noAccountReason)

no.account.income <- no.account.income %>% group_by(incomeGroup) %>% mutate(incomevalue = sum(incomeReasonvalue))

no.account.income <- no.account.income %>% mutate(value = round(incomeReasonvalue * 100/incomevalue,3))

#Convert data to factors
no.account.income <- transform(no.account.income, incomeGroup=factor(incomeGroup,levels=unique(incomeGroup)))

ggplot(no.account.income, aes(x=incomeGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Income Group", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Household Income") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Reasons for not having an account based on gender. Male and Female population has different reasons for not have financial accounts.

#Graphical display
no.account.gender <- no.account.tidy %>% 
  group_by(gender, noAccountReason) %>% 
  summarise(genderReasonvalue = n()) %>%
  arrange(gender,noAccountReason)

no.account.gender <- no.account.gender %>% group_by(gender) %>% mutate(gendervalue = sum(genderReasonvalue))

no.account.gender <- no.account.gender %>% mutate(value = round(genderReasonvalue * 100/gendervalue,3))

#Convert data to factors

ggplot(no.account.gender, aes(x=gender, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Gender", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Gender") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Comparing details of the graphs gender, age groups, education level and income groups against overall graph, it stands out Lack of money is not common among all the groups. Using ANOVA technique lets determine if there is a difference in reasons for not having an account.

#Get complete data and summarise it by reasons
no.account <- india.data %>% 
  filter(age>0) %>% 
  filter(account == 2) %>%
  select (account, gender, ageGroup, incomeGroup, eduGroup, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i) %>% 
  gather(noAccountReason, value, -gender, -ageGroup, -incomeGroup, -eduGroup) %>% 
  filter(value==1) %>% 
  select(noAccountReason, gender, ageGroup, incomeGroup, eduGroup) %>% 
  gather(resp, value, -noAccountReason) %>% 
  select(noAccountReason, respAtt = value) %>% 
  group_by(noAccountReason, respAtt) %>% 
  summarise(respCount = n())

no.account$noAccountReason[no.account$noAccountReason == "q8a"] <- "Financial institution too far away"
no.account$noAccountReason[no.account$noAccountReason == "q8b"] <- "Too expensive have any account"
no.account$noAccountReason[no.account$noAccountReason == "q8c"] <- "Lack of documentation"
no.account$noAccountReason[no.account$noAccountReason == "q8d"] <- "Lack of trust"
no.account$noAccountReason[no.account$noAccountReason == "q8e"] <- "Religious reasons"
no.account$noAccountReason[no.account$noAccountReason == "q8f"] <- "Lack of money"
no.account$noAccountReason[no.account$noAccountReason == "q8g"] <- "Family member already has one"
no.account$noAccountReason[no.account$noAccountReason == "q8h"] <- "Cannot get one"
no.account$noAccountReason[no.account$noAccountReason == "q8i"] <- "No need for financial services"

no.account <- no.account %>% group_by(respAtt) %>% mutate(reasonvalue = sum(respCount))
no.account <- no.account %>% mutate(respPer = round(respCount * 100/reasonvalue,3))


no.account.tidy <- no.account %>% 
  select(-respCount, -reasonvalue) %>% 
  spread(noAccountReason, respPer) %>% 
  inner_join(attOrder, by = c("respAtt" = "attName"))

no.account.tidy <- arrange(no.account.tidy, order)

colnames(no.account.tidy)[colnames(no.account.tidy)=="respAtt"] <- "Respondent Attributes"

no.account.tidy$order <- NULL

no.account.tidy %>% 
  kable(format='pandoc', caption = "Reasons For Not Having Financial Account")

Reasons For Not Having Financial Account
Respondent Attributes	Cannot get one	Family member already has one	Financial institution too far away	Lack of documentation	Lack of money	Lack of trust	No need for financial services	Religious reasons	Too expensive have any account
Female	4.791	17.33	10.279	11.150	25.78	4.443	11.672	2.265	12.282
Male	4.224	23.43	9.901	10.165	22.18	3.564	13.399	1.914	11.221
Below 26	3.837	26.41	8.916	12.415	21.90	2.257	12.980	1.919	9.368
26 - 35	4.889	16.59	10.370	11.852	24.74	4.148	11.556	2.370	13.481
36 - 45	4.585	17.47	10.044	9.389	26.20	5.240	12.009	1.528	13.537
46 - 55	5.128	21.25	9.890	8.059	23.08	4.762	13.919	2.564	11.355
56 - 65	4.029	19.41	12.821	7.692	23.81	5.861	12.821	1.465	12.088
Above 65	6.122	17.35	11.224	6.122	23.47	4.082	16.327	4.082	11.224
Poorest 20%	3.632	19.86	11.622	11.622	29.54	3.874	8.232	1.695	9.927
Second 20%	2.799	17.42	9.798	11.820	25.82	3.577	12.286	2.022	14.463
Middle 20%	5.110	19.20	11.188	10.635	22.93	4.558	12.845	2.210	11.326
Fourth 20%	4.706	24.71	9.020	9.216	22.16	3.922	12.745	1.765	11.765
Richest 20%	6.702	25.47	8.043	9.115	17.43	3.485	17.694	2.681	9.383
HS or Less	4.392	18.44	10.521	10.266	24.62	4.290	12.666	1.839	12.972
Some College	4.711	26.90	8.967	11.094	21.73	2.888	12.614	2.736	8.359
College Grad	4.255	34.04	6.383	17.021	14.89	4.255	12.766	2.128	4.255

The side-by-side plot shows responses are normally distributed across reasons not having a financial account. This satisfies second condition of ANOVA.

qplot(x = respAtt, y = respPer, data = no.account, color = respPer) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_smooth(method = "lm") +
  labs(x="Response Group", y="Reason Percentage", title = "Reasons For Not Having Financial Account", subtitle = "Side-by-side dot plot Response Group Vs. Reason Percentage", color = "Percentage")

Box plot showing responses, distribution across groups for not having a financial account. There are outliers in some of the groups. Average(mean) of each group varies but is reasonable.

#Calculate mean
fill <- "#4271AE"
line <- "#1F3552"

#Display Boxplot
no.account %>% 
  ggplot(aes(x = respAtt, y = respPer)) +
        geom_boxplot(fill = fill, colour = line) +
        scale_y_continuous(name = "Reason Percentage") +
        scale_x_discrete(name = "Response Group") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Average(Mean) Of Reasons For Not Having Financial Account")

Side-by-side dot plot and box plot show data within each group are very volatile. It suggests distribution of the response variable follows a normal distribution and variability across the groups exists. This meets second and third conditions.

Standard deviations between reasons range between 6.769 and 10.140. Quotient is less than two when maximum standard deviation is divided by minimum standard deviation(1.498), we can assume reasons variances are equal across each response group for not having account.

no.account %>% 
  select(noAccountReason, respAtt, respPer) %>% 
  group_by(respAtt) %>% 
  mutate(resonSD = round(sd(respPer),3)) %>% 
  select(respAtt, resonSD) %>% 
  distinct(resonSD) %>% 
  spread(respAtt, resonSD) %>%
  kable(format='pandoc', caption = "Standard Deviation by Response Group Based on Reasons")

Standard Deviation by Response Group Based on Reasons
26 - 35	36 - 45	46 - 55	56 - 65	Above 65	Below 26	College Grad	Female	Fourth 20%	HS or Less	Male	Middle 20%	Poorest 20%	Richest 20%	Second 20%	Some College
6.949	7.482	7.192	7.241	6.777	8.509	10.14	7.238	7.889	7.251	7.672	6.769	8.827	7.537	7.74	8.331

As data meets conditions of ANOVA,

Null hypothesis: There is no difference in reasons for not having an account with financial institution.

\(H_0: {\mu}_{fa} = {\mu}_{exp} = {\mu}_{ld} = {\mu}_{lt} = {\mu}_{rr} = {\mu}_{lm} = {\mu}_{fm} = {\mu}_{co} = {\mu}_{nf}\)

Alternative hypothesis: There is difference in reasons for not having an account with financial institution.

\(H_A: {\mu}_{fa} \ne {\mu}_{exp} \ne {\mu}_{ld} \ne {\mu}_{lt} \ne {\mu}_{rr} \ne {\mu}_{lm} \ne {\mu}_{fm} \ne {\mu}_{co} \ne {\mu}_{nf}\)

options("scipen"=100, "digits"=16)
#Calculate ANOVA
no.account.anova <- aov(respPer ~ noAccountReason, data = no.account)
summary(no.account.anova)

##                  Df         Sum Sq        Mean Sq   F value
## noAccountReason   8 6886.429627722 860.8037034653 138.85689
## Residuals       135  836.894010437   6.1992148921          
##                                 Pr(>F)    
## noAccountReason < 0.000000000000000222 ***
## Residuals                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Get Tukey output
tukey.output<-TukeyHSD(no.account.anova)
tukey.output<-data.frame(tukey.output$noAccountReason, stringsAsFactors = F)
tukey.output$pairs <- rownames(tukey.output)
rownames(tukey.output) <- NULL
tukey.output <- tukey.output %>% filter((upr-lwr)>0) %>% mutate(rangeDiff = (upr-lwr))
tukey.output <- tukey.output %>% arrange((`p.adj`), desc(diff))
tukey.output %>% kable(format='pandoc', caption = "TukeyHSD Results")

TukeyHSD Results
diff	lwr	upr	p.adj	pairs	rangeDiff
18.5221875000000011	15.7468584937078635	21.2975165062921405	0.0000000000000211	Lack of money-Cannot get one	5.550658012584277
16.9598749999999967	14.1845459937078591	19.7352040062921361	0.0000000000000211	Family member already has one-Cannot get one	5.550658012584277
13.2050000000000018	10.4296709937078642	15.9803290062921395	0.0000000000000211	Lack of money-Financial institution too far away	5.550658012584275
12.6646250000000027	9.8892959937078651	15.4399540062921403	0.0000000000000211	Lack of money-Lack of documentation	5.550658012584275
-12.0788125000000033	-14.8541415062921409	-9.3034834937078656	0.0000000000000211	Too expensive have any account-Lack of money	5.550658012584275
-17.5039999999999978	-20.2793290062921372	-14.7286709937078601	0.0000000000000211	Lack of trust-Family member already has one	5.550658012584277
-19.0663125000000022	-21.8416415062921416	-16.2909834937078628	0.0000000000000211	Lack of trust-Lack of money	5.550658012584279
-19.3804374999999993	-22.1557665062921387	-16.6051084937078599	0.0000000000000211	Religious reasons-Family member already has one	5.550658012584279
-20.9427500000000038	-23.7180790062921432	-18.1674209937078643	0.0000000000000211	Religious reasons-Lack of money	5.550658012584279
-11.6426874999999974	-14.4180165062921350	-8.8673584937078598	0.0000000000000214	Financial institution too far away-Family member already has one	5.550658012584275
-11.1023124999999983	-13.8776415062921359	-8.3269834937078606	0.0000000000000232	Lack of documentation-Family member already has one	5.550658012584275
-10.7092499999999990	-13.4845790062921367	-7.9339209937078614	0.0000000000000284	Religious reasons-No need for financial services	5.550658012584275
-10.5164999999999988	-13.2918290062921365	-7.7411709937078612	0.0000000000000322	Too expensive have any account-Family member already has one	5.550658012584275
-10.2335000000000029	-13.0088290062921406	-7.4581709937078653	0.0000000000000412	No need for financial services-Lack of money	5.550658012584275
8.8639374999999987	6.0886084937078611	11.6392665062921363	0.0000000000001027	Too expensive have any account-Religious reasons	5.550658012584275
8.8328124999999993	6.0574834937078617	11.6081415062921369	0.0000000000001055	No need for financial services-Lack of trust	5.550658012584275
-8.6711874999999985	-11.4465165062921361	-5.8958584937078609	0.0000000000001060	No need for financial services-Family member already has one	5.550658012584275
8.2886875000000000	5.5133584937078624	11.0640165062921376	0.0000000000001119	No need for financial services-Cannot get one	5.550658012584275
-8.2781249999999993	-11.0534540062921369	-5.5027959937078617	0.0000000000001135	Religious reasons-Lack of documentation	5.550658012584275
-7.7377500000000001	-10.5130790062921378	-4.9624209937078625	0.0000000000003277	Religious reasons-Financial institution too far away	5.550658012584275
6.9874999999999998	4.2121709937078613	9.7628290062921383	0.0000000000255275	Too expensive have any account-Lack of trust	5.550658012584277
6.4433749999999996	3.6680459937078616	9.2187040062921373	0.0000000007224596	Too expensive have any account-Cannot get one	5.550658012584275
-6.4016875000000004	-9.1770165062921389	-3.6263584937078623	0.0000000009295720	Lack of trust-Lack of documentation	5.550658012584277
-5.8613125000000013	-8.6366415062921398	-3.0859834937078632	0.0000000228705361	Lack of trust-Financial institution too far away	5.550658012584277
5.8575625000000002	3.0822334937078621	8.6328915062921379	0.0000000233740098	Lack of documentation-Cannot get one	5.550658012584275
5.3171875000000011	2.5418584937078630	8.0925165062921387	0.0000005002820614	Financial institution too far away-Cannot get one	5.550658012584275
2.9714999999999989	0.1961709937078608	5.7468290062921366	0.0260689327027537	No need for financial services-Financial institution too far away	5.550658012584275
2.4311249999999998	-0.3442040062921383	5.2064540062921374	0.1368886029981524	No need for financial services-Lack of documentation	5.550658012584275
-2.4205624999999991	-5.1958915062921367	0.3547665062921390	0.1407710678736802	Religious reasons-Cannot get one	5.550658012584275
-1.8764374999999989	-4.6517665062921374	0.8988915062921392	0.4565292446957468	Religious reasons-Lack of trust	5.550658012584277
-1.8453125000000004	-4.6206415062921380	0.9300165062921378	0.4802685403155512	Too expensive have any account-No need for financial services	5.550658012584275
1.5623125000000044	-1.2130165062921336	4.3376415062921421	0.6985482909477798	Lack of money-Family member already has one	5.550658012584275
1.1261874999999986	-1.6491415062921395	3.9015165062921366	0.9357003437230564	Too expensive have any account-Financial institution too far away	5.550658012584276
0.5858124999999994	-2.1895165062921387	3.3611415062921375	0.9991117666211534	Too expensive have any account-Lack of documentation	5.550658012584276
-0.5441250000000002	-3.3194540062921383	2.2312040062921379	0.9994823019455810	Lack of trust-Cannot get one	5.550658012584276
0.5403749999999992	-2.2349540062921389	3.3157040062921372	0.9995080101246836	Lack of documentation-Financial institution too far away	5.550658012584276

#Plot residuals
ggplot(data=as.data.frame(qqnorm( no.account.anova$residuals , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point() + geom_smooth(method="lm", se=FALSE)  +
        scale_y_continuous(name = "Sample Quantiles") +
        scale_x_discrete(name = "Theoretical Quantiles") +
        ggtitle("Normal Q-Q Plot")

Conclusion

- Data meets all three conditions of ANOVA.

- The normal plot of the residuals shows data points lie pretty close to the line. Some deviation is noticed near the ends. This suggests data is nearly normal.

- As p-value of the test is equal to 0.0000000000000002, indicating the evidence is strong enough to reject the null hypothesis, \(H_0\) at a significance level of \(\alpha = 0.05\). That is, the data provides strong evidence that there is difference in reasons for not having an account with financial institution across groups.

Output of Tukey Honest Significant Differences(TukeyHSD) function suggests the difference between reason pairs Lack of money-Cannot get one has a p-value of 0.0000000000000211 and is statistically significant. At 95% confidence interval, reason pair difference is between 15.75% and 21.30% with Lack of money averaging 18.52% higher. That means, there is 18.52% higher chance, for a response group not having an account with financial institution is because of Lack of money.

Question 3: Using India data, develop a predictive model to identify whether a respondent has a financial account.

For developing predictive model, - 70% of the dataset will be used as training data and 30% as test data. - As output will be bi-variate, Generalized Linear Model(GLM) is used.

#Get complete data
india.account <- india.data %>% 
  filter(!is.na(age)) %>% 
  select (account, gender, ageGroup, incomeGroup, eduGroup)

#There are only two possible outcomes person can have account or not
india.account$account <- ifelse(india.account$account == 1, 1, 0)

set.seed(2005) #So results can be repeated
#Create sample data
samIdx <- sample(2,nrow(india.account), replace = T, prob = c(0.70,0.30))

#Training data
india.train <- india.account[samIdx == 1,]

#Test data
india.test <- india.account[samIdx == 2,]

#Run GLM on training data
india.glm <- glm(account ~ gender + ageGroup + incomeGroup + eduGroup, data = india.account, family = binomial(link = "logit"))
  
#Make prediction using test data
india.predict <- predict(india.glm, india.test, type = "response")

summary(india.glm)

## 
## Call:
## glm(formula = account ~ gender + ageGroup + incomeGroup + eduGroup, 
##     family = binomial(link = "logit"), data = india.account)
## 
## Deviance Residuals: 
##             Min               1Q           Median               3Q  
## -2.448372543129  -1.116906853565   0.547292268250   1.020832477542  
##             Max  
##  1.747078967946  
## 
## Coefficients:
##                                 Estimate        Std. Error  z value
## (Intercept)             2.22725433507954  0.23536558027249  9.46296
## genderMale             -0.57425698102466  0.07920453566401 -7.25030
## ageGroup36 - 45         0.43238854578185  0.11731065075817  3.68584
## ageGroup46 - 55         0.39147864570844  0.13627812831866  2.87264
## ageGroup56 - 65         0.16310283488844  0.14539385801954  1.12180
## ageGroupAbove 65       -0.04206682128494  0.20924925554456 -0.20104
## ageGroupBelow 26       -0.69299605585285  0.11269337452207 -6.14939
## incomeGroupMiddle 20%  -0.27859272917531  0.11822816213385 -2.35640
## incomeGroupPoorest 20% -0.29025963641623  0.13618298824400 -2.13139
## incomeGroupRichest 20%  0.28640843483957  0.12342299954692  2.32054
## incomeGroupSecond 20%  -0.28674532984093  0.12195233837860 -2.35129
## eduGroupHS or Less     -1.95078559495004  0.21651664866703 -9.00986
## eduGroupSome College   -1.08226620863156  0.21664080767702 -4.99567
##                                      Pr(>|z|)    
## (Intercept)            < 0.000000000000000222 ***
## genderMale                0.00000000000041584 ***
## ageGroup36 - 45                    0.00022795 ***
## ageGroup46 - 55                    0.00407052 ** 
## ageGroup56 - 65                    0.26194747    
## ageGroupAbove 65                   0.84066972    
## ageGroupBelow 26          0.00000000077779134 ***
## incomeGroupMiddle 20%              0.01845308 *  
## incomeGroupPoorest 20%             0.03305668 *  
## incomeGroupRichest 20%             0.02031150 *  
## incomeGroupSecond 20%              0.01870844 *  
## eduGroupHS or Less     < 0.000000000000000222 ***
## eduGroupSome College      0.00000058631532198 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4115.8501258355  on 2998  degrees of freedom
## Residual deviance: 3721.2967784569  on 2986  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3747.2967784569
## 
## Number of Fisher Scoring iterations: 4

The summary shows all the variables are significant.

#Generate prediction output
india.predict.info <- rep(0,length(which(samIdx == 2)))
india.predict.info[india.predict >=0.5] <- 1
  
cm <- confusionMatrix(india.predict.info, india.test$account)

cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 224 143
##          1 184 363
##                                                             
##                Accuracy : 0.6422319474836                   
##                  95% CI : (0.6101899657424, 0.6733537552102)
##     No Information Rate : 0.5536105032823                   
##     P-Value [Acc > NIR] : 0.00000003258372921716            
##                                                             
##                   Kappa : 0.2690289034871                   
##  Mcnemar's Test P-Value : 0.02696631569637                  
##                                                             
##             Sensitivity : 0.5490196078431                   
##             Specificity : 0.7173913043478                   
##          Pos Pred Value : 0.6103542234332                   
##          Neg Pred Value : 0.6636197440585                   
##              Prevalence : 0.4463894967177                   
##          Detection Rate : 0.2450765864333                   
##    Detection Prevalence : 0.4015317286652                   
##       Balanced Accuracy : 0.6332054560955                   
##                                                             
##        'Positive' Class : 0                                 
##

fourfoldplot(cm$table, color = c("#CC6666", "#99CC99"),conf.level = 0, margin = 1, main = "Mis-classification")

Conclusion

- As output variable can have two possible outcomes, respondent can have an account or not. I have used Generalized Linear Model(GLM) binomial family and logit link.

- Summary of the model suggests all the variables contribute significantly. Gender, age, and education contribute more than income group.

- The accuracy of the model is 64%.

Analyzing Global Findex Data Using ANOVA And GLM

Pavan Akula

May 14, 2017

Introduction

Method and Software usage

Data Dictionary

Libraries used.

Data Preparation

Question 1: Is there a difference in financial account ownership between India, China and G7 Nations by age group?

Conclusion

Question 2: India has a good portion of the population, that does not have a financial account. Is there any difference in reasons for not having financial account across various groups?

Conclusion

Question 3: Using India data, develop a predictive model to identify whether a respondent has a financial account.

Conclusion

References