Introduction

The goal of this project is to demonstrate statistical analysis of survey data using Analysis of Variance(ANOVA) and Generalized Linear Model(GLM). The Global Financial Inclusion, Global Findex database, 2014 edition will be used to solve following questions.

Method and Software usage

The principles of tidy data provided by Hadley Wickham are followed throughout the process of cleaning and preparing the data for analysis. The software tool used for the project is R. Most of the statistical functions are used from stats package. graphs are shown using ggplot2 package.

Data Dictionary

Information about the variables can be obtained from Global-Findex-Data-Dictionary.csv. Actual survey results in its raw form are present in micro_world.csv. Variables may contain the recorded results of a direct question asked, or be derived in some way.

Libraries used.

if (!require('plyr')) install.packages('plyr')                #Data frame and table functions
if (!require('dplyr')) install.packages('dplyr')              #Data frame and table functions
if (!require('stringr')) install.packages('stringr')          #String manuplilation functions
if (!require('ggplot2')) install.packages('ggplot2')          #Graphics display
if (!require('tidyr')) install.packages('tidyr')              #Tidy data using spread() and gather() functions
if (!require('gridExtra')) install.packages('gridExtra')      #Display graphs side by side
if (!require('knitr')) install.packages('knitr')              #Report display, table format
if (!require('stats')) install.packages('stats')
if (!require('caret')) install.packages('caret')              #For confusion matrix

Data Preparation

The survey was carried out over the 2014 calendar year by Gallup, Inc. in more than 160 countries and over 140 languages, using randomly selected, nationally representative samples. The target population is the entire civilian population of age 15 years and above. This is an observational data. Most of the variables are stored as numerical values, and they are converted into character based categorical variables. Survey question surrounding income is based on household income quintile(1 - Poorest 20%, 2 - Second 20%, 3 - Middle 20%, 4 - Fourth 20%, 5 - Richest 20%). .

agef <- function(x){
  if (is.na(x)){
    age <- NA
    return(age)
  }
  if (x < 26){
    age <- "Below 26"
  } else if(x > 25 & x < 36){
    age <- "26 - 35"
  } else if(x > 35 & x < 46){
    age <- "36 - 45"
  } else if(x > 45 & x < 56){
    age <- "46 - 55"
  } else if(x > 55 & x < 66){
    age <- "56 - 65"
  } else if (x > 65){
    age <- "Above 65"
  } else{
    age <- NA
  }
  return(age)
}

incomef <- function(x){
  if (is.na(x)){
    inc <- NA
    return(inc)
  }
  if (x == 1){
    inc <- "Poorest 20%"
  } else if(x == 2){
    inc <- "Second 20%"
  } else if(x == 3){
    inc <- "Middle 20%"
  } else if(x == 4){
    inc <- "Fourth 20%"
  } else if(x == 5){
    inc <- "Richest 20%"
  } else{
    inc <- NA
  }
  return(inc)
}

eduf <- function(x){
  if (is.na(x)){
    col <- NA
    return(col)
  }
  if (x == 1){
    col <- "HS or Less"
  } else if(x == 2){
    col <- "Some College"
  } else if(x == 3){
    col <- "College Grad"
  } else{
    col <- NA
  }
  return(col)
}

age.order <- data.frame(age<-"Below 26", order<-1, stringsAsFactors = F)
age.order <- rbind(age.order, c("26 - 35", 2))
age.order <- rbind(age.order, c("36 - 45", 3))
age.order <- rbind(age.order, c("46 - 55", 4))
age.order <- rbind(age.order, c("56 - 65", 5))
age.order <- rbind(age.order, c("Above 65", 6))
colnames(age.order) <- c("age","order")
age.order$order <- as.integer(age.order$order)

income.order <- data.frame(age<-"Poorest 20%", order<-1, stringsAsFactors = F)
income.order <- rbind(income.order, c("Second 20%", 2))
income.order <- rbind(income.order, c("Middle 20%", 3))
income.order <- rbind(income.order, c("Fourth 20%", 4))
income.order <- rbind(income.order, c("Richest 20%", 5))
colnames(income.order) <- c("income","order")
income.order$order <- as.integer(income.order$order)

edu.order <- data.frame(edu<-"HS or Less", order<-1, stringsAsFactors = F)
edu.order <- rbind(edu.order, c("Some College", 2))
edu.order <- rbind(edu.order, c("College Grad", 3))
colnames(edu.order) <- c("edu","order")
edu.order$order <- as.integer(edu.order$order)

attOrder <- age.order %>% select(attName=age, order)
attOrder <- rbind(attOrder, as.data.frame(income.order %>% mutate(order = order + 20) %>% select(attName=income, order)))
attOrder <- rbind(attOrder, as.data.frame(edu.order %>% mutate(order = order + 30) %>% select(attName=edu, order)))
attOrder <- rbind(attOrder, data.frame(attName="Male",order = 0, stringsAsFactors = F))
attOrder <- rbind(attOrder, data.frame(attName="Female",order = -1, stringsAsFactors = F))
attOrder <- arrange(attOrder, order)

reasons <- data.frame("Cannot get one", 1, stringsAsFactors = F)
reasons <- rbind(reasons, c("Family member already has one", 2))
reasons <- rbind(reasons, c("Financial institution too far away", 3))
reasons <- rbind(reasons, c("Lack of documentation", 4))
reasons <- rbind(reasons, c("Lack of money", 5))
reasons <- rbind(reasons, c("Lack of trust", 6))
reasons <- rbind(reasons, c("No need for financial services", 7))
reasons <- rbind(reasons, c("Religious reasons", 8))
reasons <- rbind(reasons, c("Too expensive to have any account", 9))
colnames(reasons) <- c("reason","order")

#Change path
fileDir <- "D:/CUNY/606/Project-1/FinalProject"

#Load survey data.
world.findex.data <- read.csv(file.path(fileDir, "micro_world.csv"), stringsAsFactors=FALSE) 

Question 1: Is there a difference in financial account ownership between India, China and G7 Nations by age group?

I will be using ANOVA statistical method to identify if there is a difference in financial account ownership. Analysis of Variance (ANOVA) is a statistical method used for testing differences between two or more means. Data has to meet following conditions before applying ANOVA. If any of the conditions are not satisfied, the results from the use of ANOVA techniques may be unreliable.

As data is collected by Gallup, Inc from each country on a random basis. Survey data meets the condition of independence.

For the scope of the project, I will be using account attribute as the response variable. This variable has two possible outcomes yes if the respondent has an account with financial institution otherwise no. Observations with missing values are discarded. This condition can be checked with the use of graphs; I will be using box plot and side-by-side dot plot to identify the normal distribution of the response variable.

This can be evaluated by using rule of thumb if the largest sample standard deviation divided by the smallest sample standard deviation is not greater than two, then assume that the population variances are equal. I will be using box plot to identify variability across the groups.

Following graph shows a total number of respondents having an account with the financial institution. Account ownership percentage is higher among G7 Nations compared to India and China.

#Get India, China and G7 Nations data
in_cn_g7.data <- world.findex.data %>% 
  filter(economy %in% c("India", "China", "Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States")) %>% 
  select(economy, female, age, educ, inc_q, account)

in_cn_g7.data$hasAccount <- ifelse(in_cn_g7.data$account == 1, "Yes", ifelse(in_cn_g7.data$account == 2, "No", NA))
in_cn_g7.data$gender <- ifelse(in_cn_g7.data$female == 1, "Female", ifelse(in_cn_g7.data$female == 2, "Male",NA))
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$age, agef))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "ageGroup"
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$inc_q, incomef))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "incomeGroup"
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$educ, eduf))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "eduGroup"

#Get percentage of population that has accounts.
in_cn_g7.data1 <- in_cn_g7.data %>%
  filter(age>0) %>%
  group_by(economy) %>% 
  mutate(hasAcc = n())

in_cn_g7.data1 <- in_cn_g7.data1 %>% 
  filter(hasAccount=="Yes") %>% 
  group_by(economy, ageGroup, hasAcc) %>% 
  select (economy, ageGroup, hasAcc) %>% 
  mutate(ageGroupTotal = n()) %>% 
  distinct (ageGroupTotal)

in_cn_g7.data1 <- in_cn_g7.data1 %>% 
  mutate(accPer = round(ageGroupTotal*100/hasAcc, 3))

#Graphical display
in_cn_g7.data1 %>%
  group_by(economy) %>% 
  mutate(value = sum(accPer)) %>%
ggplot(aes(x=economy, y=value, fill = value)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Country", y="Percentage Account Holders", title = "Population Sample With Financial Accounts", subtitle = "Comparision between countries") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_text(aes(label=value), vjust=0, color="black") + labs(fill = "Ownership")

Following data shows comparision of financial account ownership by age group accross countries.

#Group data by age group
in_cn_g7.data <- in_cn_g7.data %>%
  filter(hasAccount=="Yes") %>% 
  filter(age>0) %>%
  group_by(economy) %>% 
  mutate(hasAcc = n())

in_cn_g7.data <- in_cn_g7.data %>% 
  group_by(economy, ageGroup, hasAcc) %>% 
  select (economy, ageGroup, hasAcc) %>% 
  mutate(ageGroupTotal = n()) %>% 
  distinct (ageGroupTotal)

in_cn_g7.data <- in_cn_g7.data %>% 
  mutate(accPer = round(ageGroupTotal*100/hasAcc, 3))

age.in_cn_g7.data <- in_cn_g7.data %>%
  inner_join(age.order, by = c("ageGroup" = "age")) %>% 
  arrange(economy, order) 

age.in_cn_g7.data <- transform(age.in_cn_g7.data, ageGroup=factor(ageGroup,levels=unique(ageGroup)))
colnames(age.in_cn_g7.data)[colnames(age.in_cn_g7.data)=="ageGroup"] <- "Age Group"
colnames(age.in_cn_g7.data)[colnames(age.in_cn_g7.data)=="accPer"] <- "value"

age.in_cn_g7.data$ageGroupTotal <- NULL

#Display data in table format
age.in_cn_g7.data %>% 
  select(economy, `Age Group`, value) %>% 
  spread(economy, value) %>%
  kable(format='pandoc', caption = "Population Sample With Financial Accounts - Countries by Age Group")
Population Sample With Financial Accounts - Countries by Age Group
Age Group Canada China France Germany India Italy Japan United Kingdom United States
Below 26 10.010 12.433 6.900 7.121 21.765 6.494 3.030 4.044 10.677
26 - 35 11.249 18.190 11.123 9.766 25.641 12.446 5.152 11.426 10.254
36 - 45 14.551 20.816 22.760 12.208 24.210 22.511 17.576 19.919 12.474
46 - 55 20.537 20.911 24.923 21.872 14.490 23.918 17.374 26.491 14.905
56 - 65 20.640 15.407 20.906 19.939 10.137 20.130 25.960 22.548 22.199
Above 65 23.013 12.243 13.388 29.095 3.757 14.502 30.909 15.571 29.493

Following graph show that data, doesn’t have the ideal bell-shaped appearance, and it suggests there are some outliers in the data. Financial account ownership differ between age groups in each country.

#Graphical display
ggplot(age.in_cn_g7.data, aes(x=economy, y=value, fill = `Age Group`)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Country", y="Accounts", title = "Percentage of Population With Financial Accounts", subtitle = "Comparision between countries by age group") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(fill = "Ownership") + scale_fill_brewer(palette="Paired")

Side-by-side dot plot show data within each group are very volatile. Graph also suggest, there are outliers in Japan data. Box plot suggest United States differ from rest of the countries. Since data is collected randomly in each country and graphs does not display anthing extreme we can assume distribution of the response variable follows a normal distribution and variability across the groups exists. This meets second and third conditions.

qplot(x = economy, y = value, data = age.in_cn_g7.data, color = value) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_smooth(method = "lm") +
  labs(x="Country", y="% of Accounts Holders By Age Group", title = "Financial Account Ownership By Country And Age Group", subtitle = "Side-by-side dot plot", color = "Ownership") 

fill <- "#4271AE"
line <- "#1F3552"

age.in_cn_g7.data %>% 
  ggplot(aes(x = economy, y = value)) +
        geom_boxplot(fill = fill, colour = line) +
        scale_y_continuous(name = "Account") +
        scale_x_discrete(name = "Country") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Average(Mean) Account Ownership By Country")

Standard deviations between countries range between 3.915 and 11.037. Even though quotient is greater than two when maximum standard deviation is divided by minimum standard deviation(2.819), we can assume age group variances are equal across countries as data is collected randomly in each country.

age.in_cn_g7.data %>% 
  select(economy, `Age Group`, value) %>% 
  group_by(economy) %>% 
  mutate(ecoSD = round(sd(value),3)) %>% 
  select(economy, ecoSD) %>% 
  distinct(ecoSD) %>% 
  spread(economy, ecoSD) %>%
  kable(format='pandoc', caption = "Standard Deviation by Countries Based on Age Group")
Standard Deviation by Countries Based on Age Group
Canada China France Germany India Italy Japan United Kingdom United States
5.462 3.915 7.213 8.377 8.689 6.704 11.037 8.12 7.652

Hypothesis test.

Null hypothesis: There is no difference in account ownership percentage between age groups across countries.

\(H_0: {\mu}_{can} = {\mu}_{chn} = {\mu}_{fra} = {\mu}_{deu} = {\mu}_{ind} = {\mu}_{ita} = {\mu}_{jpn} = {\mu}_{uk} = {\mu}_{usa}\)

Alternative hypothesis: There is difference in account ownership percentage between age groups across countries.

\(H_A: {\mu}_{can} \ne {\mu}_{chn} \ne {\mu}_{fra} \ne {\mu}_{deu} \ne {\mu}_{ind} \ne {\mu}_{ita} \ne {\mu}_{jpn} \ne {\mu}_{uk} \ne {\mu}_{usa}\)

options("scipen"=100, "digits"=4)
#Apply ANOVA
age.anova <- aov(value ~ `Age Group` + economy, data = age.in_cn_g7.data)
summary(age.anova)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## `Age Group`  5    952     190    4.43 0.0026 **
## economy      8      0       0    0.00 1.0000   
## Residuals   40   1718      43                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Draw residuals plot to test is data is normally distributed 
ggplot(data=as.data.frame(qqnorm( age.anova$residuals , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point() + geom_smooth(method="lm", se=FALSE)  +
        scale_y_continuous(name = "Sample Quantiles") +
        scale_x_discrete(name = "Theoretical Quantiles") +
        ggtitle("Normal Q-Q Plot")

Conclusion

- Data meets fairly all three conditions of ANOVA.

- The normal plot of the residuals shows data points lie pretty close to the line. Some deviation is noticed near the ends. This suggests data is nearly normal.

- The p-value of the test is 0.0026, less than the default significance level of \(\alpha = 0.05\). We reject null hypothesis \(H_0\). In other words, there is difference in account ownership percentage between age groups across countries.

- As data was collected from each country based on size of the population, weighted sampling, aggregates are converted to percentages.

Question 2: India has a good portion of the population, that does not have a financial account. Is there any difference in reasons for not having financial account across various groups?

I will be using ANOVA technique to answer the question and use TukeyHSD function to identify if any reason pair stands out. Following graph shows a total number of respondents that does not have an account with the financial institution for various reasons.

#Get India data
india.data <- world.findex.data %>% 
  filter(economy == "India") %>% 
  select(economy, female, age, educ, inc_q, account, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i)

#Convert numerical values to categorical values
india.data$hasAccount <- ifelse(india.data$account == 1, "Yes", ifelse(india.data$account == 2, "No", NA))
india.data$gender <- ifelse(india.data$female == 1, "Female", ifelse(india.data$female == 2, "Male",NA))
india.data = cbind(india.data, ldply(india.data$age, agef))
colnames(india.data)[ncol(india.data)] <- "ageGroup"
india.data = cbind(india.data, ldply(india.data$inc_q, incomef))
colnames(india.data)[ncol(india.data)] <- "incomeGroup"
india.data = cbind(india.data, ldply(india.data$educ, eduf))
colnames(india.data)[ncol(india.data)] <- "eduGroup"

#Subset data by respondents who do not have accounts
no.account <- india.data %>% filter(account==2)
no.account <- no.account %>% filter(!is.na(age))
no.account.tidy <- no.account %>% 
  select (gender, ageGroup, incomeGroup, eduGroup, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i) %>% 
  gather(noAccountReason, value, -gender, -ageGroup, -incomeGroup, -eduGroup) %>% filter(value==1)

#Expand the questions
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8a"] <- "Financial institution too far away"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8b"] <- "Too expensive to have any account"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8c"] <- "Lack of documentation"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8d"] <- "Lack of trust"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8e"] <- "Religious reasons"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8f"] <- "Lack of money"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8g"] <- "Family member already has one"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8h"] <- "Cannot get one"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8i"] <- "No need for financial services"

#Get data to display on the graph
no.account.reason <- no.account.tidy %>% 
  group_by(noAccountReason) %>% 
  summarise(value = n()) %>% 
  inner_join(reasons, by = c("noAccountReason"="reason")) %>% 
  arrange(desc(order))

#Convert columns to factors to preserve data order  
no.account.reason <- transform(no.account.reason, noAccountReason=factor(noAccountReason,levels=unique(noAccountReason)))

#Graphical display
no.account.reason %>%
ggplot(aes(x=noAccountReason, y=value, fill = value)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Reasons", y="Respondents", title = "Population Sample With No Financial Accounts - India", subtitle = "Comparision between ressons") + labs(fill = "Counts") + coord_flip()

Above graph suggests that Lack of money is a primary reason for not having an account with the financial institution. Is Lack of money only reason across various Gender, Age groups, Education level and Income groups for not having an account with a financial institution? Following graphs show the breakdown by various groups.

Reasons for not having an account based on Age Groups. The graph shows age group below 26 years age top reason for not having an account with a financial institution is different from other age groups.

#Graphical display
no.account.age <- no.account.tidy %>% 
  group_by(ageGroup, noAccountReason) %>% 
  summarise(ageReasonvalue = n()) %>%
  inner_join(age.order, by = c("ageGroup"="age")) %>% 
  arrange(order,noAccountReason)

no.account.age <- no.account.age %>% group_by(ageGroup) %>% mutate(agevalue = sum(ageReasonvalue))

no.account.age <- no.account.age %>% mutate(value = round(ageReasonvalue * 100/agevalue,3))
#Convert data to factors
no.account.age <- transform(no.account.age, ageGroup=factor(ageGroup,levels=unique(ageGroup)))

ggplot(no.account.age, aes(x=ageGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Age Group", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Age Groups") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Reasons for not having an account based on Education Level. The graph shows reasons vary by Education Level.

#Graphical display
no.account.edu <- no.account.tidy %>% 
  group_by(eduGroup, noAccountReason) %>% 
  summarise(eduReasonvalue = n()) %>%
  inner_join(edu.order, by = c("eduGroup"="edu")) %>% 
  arrange(order,noAccountReason)

no.account.edu <- no.account.edu %>% group_by(eduGroup) %>% mutate(eduvalue = sum(eduReasonvalue))

no.account.edu <- no.account.edu %>% mutate(value = round(eduReasonvalue * 100/eduvalue,3))

#Convert data to factors
no.account.edu <- transform(no.account.edu, eduGroup=factor(eduGroup,levels=unique(eduGroup)))

ggplot(no.account.edu, aes(x=eduGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Education Level", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Education Level") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Reasons for not having an account based on household income. It shows reasons vary by household income. For first three groups Lack of money is top reason.

#Graphical display
no.account.income <- no.account.tidy %>% 
  group_by(incomeGroup, noAccountReason) %>% 
  summarise(incomeReasonvalue = n()) %>%
  inner_join(income.order, by = c("incomeGroup"="income")) %>% 
  arrange(order,noAccountReason)

no.account.income <- no.account.income %>% group_by(incomeGroup) %>% mutate(incomevalue = sum(incomeReasonvalue))

no.account.income <- no.account.income %>% mutate(value = round(incomeReasonvalue * 100/incomevalue,3))

#Convert data to factors
no.account.income <- transform(no.account.income, incomeGroup=factor(incomeGroup,levels=unique(incomeGroup)))

ggplot(no.account.income, aes(x=incomeGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Income Group", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Household Income") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Reasons for not having an account based on gender. Male and Female population has different reasons for not have financial accounts.

#Graphical display
no.account.gender <- no.account.tidy %>% 
  group_by(gender, noAccountReason) %>% 
  summarise(genderReasonvalue = n()) %>%
  arrange(gender,noAccountReason)

no.account.gender <- no.account.gender %>% group_by(gender) %>% mutate(gendervalue = sum(genderReasonvalue))

no.account.gender <- no.account.gender %>% mutate(value = round(genderReasonvalue * 100/gendervalue,3))

#Convert data to factors

ggplot(no.account.gender, aes(x=gender, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Gender", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Gender") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()

Comparing details of the graphs gender, age groups, education level and income groups against overall graph, it stands out Lack of money is not common among all the groups. Using ANOVA technique lets determine if there is a difference in reasons for not having an account.

#Get complete data and summarise it by reasons
no.account <- india.data %>% 
  filter(age>0) %>% 
  filter(account == 2) %>%
  select (account, gender, ageGroup, incomeGroup, eduGroup, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i) %>% 
  gather(noAccountReason, value, -gender, -ageGroup, -incomeGroup, -eduGroup) %>% 
  filter(value==1) %>% 
  select(noAccountReason, gender, ageGroup, incomeGroup, eduGroup) %>% 
  gather(resp, value, -noAccountReason) %>% 
  select(noAccountReason, respAtt = value) %>% 
  group_by(noAccountReason, respAtt) %>% 
  summarise(respCount = n())

no.account$noAccountReason[no.account$noAccountReason == "q8a"] <- "Financial institution too far away"
no.account$noAccountReason[no.account$noAccountReason == "q8b"] <- "Too expensive have any account"
no.account$noAccountReason[no.account$noAccountReason == "q8c"] <- "Lack of documentation"
no.account$noAccountReason[no.account$noAccountReason == "q8d"] <- "Lack of trust"
no.account$noAccountReason[no.account$noAccountReason == "q8e"] <- "Religious reasons"
no.account$noAccountReason[no.account$noAccountReason == "q8f"] <- "Lack of money"
no.account$noAccountReason[no.account$noAccountReason == "q8g"] <- "Family member already has one"
no.account$noAccountReason[no.account$noAccountReason == "q8h"] <- "Cannot get one"
no.account$noAccountReason[no.account$noAccountReason == "q8i"] <- "No need for financial services"

no.account <- no.account %>% group_by(respAtt) %>% mutate(reasonvalue = sum(respCount))
no.account <- no.account %>% mutate(respPer = round(respCount * 100/reasonvalue,3))


no.account.tidy <- no.account %>% 
  select(-respCount, -reasonvalue) %>% 
  spread(noAccountReason, respPer) %>% 
  inner_join(attOrder, by = c("respAtt" = "attName"))

no.account.tidy <- arrange(no.account.tidy, order)

colnames(no.account.tidy)[colnames(no.account.tidy)=="respAtt"] <- "Respondent Attributes"

no.account.tidy$order <- NULL

no.account.tidy %>% 
  kable(format='pandoc', caption = "Reasons For Not Having Financial Account")
Reasons For Not Having Financial Account
Respondent Attributes Cannot get one Family member already has one Financial institution too far away Lack of documentation Lack of money Lack of trust No need for financial services Religious reasons Too expensive have any account
Female 4.791 17.33 10.279 11.150 25.78 4.443 11.672 2.265 12.282
Male 4.224 23.43 9.901 10.165 22.18 3.564 13.399 1.914 11.221
Below 26 3.837 26.41 8.916 12.415 21.90 2.257 12.980 1.919 9.368
26 - 35 4.889 16.59 10.370 11.852 24.74 4.148 11.556 2.370 13.481
36 - 45 4.585 17.47 10.044 9.389 26.20 5.240 12.009 1.528 13.537
46 - 55 5.128 21.25 9.890 8.059 23.08 4.762 13.919 2.564 11.355
56 - 65 4.029 19.41 12.821 7.692 23.81 5.861 12.821 1.465 12.088
Above 65 6.122 17.35 11.224 6.122 23.47 4.082 16.327 4.082 11.224
Poorest 20% 3.632 19.86 11.622 11.622 29.54 3.874 8.232 1.695 9.927
Second 20% 2.799 17.42 9.798 11.820 25.82 3.577 12.286 2.022 14.463
Middle 20% 5.110 19.20 11.188 10.635 22.93 4.558 12.845 2.210 11.326
Fourth 20% 4.706 24.71 9.020 9.216 22.16 3.922 12.745 1.765 11.765
Richest 20% 6.702 25.47 8.043 9.115 17.43 3.485 17.694 2.681 9.383
HS or Less 4.392 18.44 10.521 10.266 24.62 4.290 12.666 1.839 12.972
Some College 4.711 26.90 8.967 11.094 21.73 2.888 12.614 2.736 8.359
College Grad 4.255 34.04 6.383 17.021 14.89 4.255 12.766 2.128 4.255

The side-by-side plot shows responses are normally distributed across reasons not having a financial account. This satisfies second condition of ANOVA.

qplot(x = respAtt, y = respPer, data = no.account, color = respPer) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_smooth(method = "lm") +
  labs(x="Response Group", y="Reason Percentage", title = "Reasons For Not Having Financial Account", subtitle = "Side-by-side dot plot Response Group Vs. Reason Percentage", color = "Percentage") 

Box plot showing responses, distribution across groups for not having a financial account. There are outliers in some of the groups. Average(mean) of each group varies but is reasonable.

#Calculate mean
fill <- "#4271AE"
line <- "#1F3552"

#Display Boxplot
no.account %>% 
  ggplot(aes(x = respAtt, y = respPer)) +
        geom_boxplot(fill = fill, colour = line) +
        scale_y_continuous(name = "Reason Percentage") +
        scale_x_discrete(name = "Response Group") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Average(Mean) Of Reasons For Not Having Financial Account")

Side-by-side dot plot and box plot show data within each group are very volatile. It suggests distribution of the response variable follows a normal distribution and variability across the groups exists. This meets second and third conditions.

Standard deviations between reasons range between 6.769 and 10.140. Quotient is less than two when maximum standard deviation is divided by minimum standard deviation(1.498), we can assume reasons variances are equal across each response group for not having account.

no.account %>% 
  select(noAccountReason, respAtt, respPer) %>% 
  group_by(respAtt) %>% 
  mutate(resonSD = round(sd(respPer),3)) %>% 
  select(respAtt, resonSD) %>% 
  distinct(resonSD) %>% 
  spread(respAtt, resonSD) %>%
  kable(format='pandoc', caption = "Standard Deviation by Response Group Based on Reasons")
Standard Deviation by Response Group Based on Reasons
26 - 35 36 - 45 46 - 55 56 - 65 Above 65 Below 26 College Grad Female Fourth 20% HS or Less Male Middle 20% Poorest 20% Richest 20% Second 20% Some College
6.949 7.482 7.192 7.241 6.777 8.509 10.14 7.238 7.889 7.251 7.672 6.769 8.827 7.537 7.74 8.331

As data meets conditions of ANOVA,

Null hypothesis: There is no difference in reasons for not having an account with financial institution.

\(H_0: {\mu}_{fa} = {\mu}_{exp} = {\mu}_{ld} = {\mu}_{lt} = {\mu}_{rr} = {\mu}_{lm} = {\mu}_{fm} = {\mu}_{co} = {\mu}_{nf}\)

Alternative hypothesis: There is difference in reasons for not having an account with financial institution.

\(H_A: {\mu}_{fa} \ne {\mu}_{exp} \ne {\mu}_{ld} \ne {\mu}_{lt} \ne {\mu}_{rr} \ne {\mu}_{lm} \ne {\mu}_{fm} \ne {\mu}_{co} \ne {\mu}_{nf}\)

options("scipen"=100, "digits"=16)
#Calculate ANOVA
no.account.anova <- aov(respPer ~ noAccountReason, data = no.account)
summary(no.account.anova)
##                  Df         Sum Sq        Mean Sq   F value
## noAccountReason   8 6886.429627722 860.8037034653 138.85689
## Residuals       135  836.894010437   6.1992148921          
##                                 Pr(>F)    
## noAccountReason < 0.000000000000000222 ***
## Residuals                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Get Tukey output
tukey.output<-TukeyHSD(no.account.anova)
tukey.output<-data.frame(tukey.output$noAccountReason, stringsAsFactors = F)
tukey.output$pairs <- rownames(tukey.output)
rownames(tukey.output) <- NULL
tukey.output <- tukey.output %>% filter((upr-lwr)>0) %>% mutate(rangeDiff = (upr-lwr))
tukey.output <- tukey.output %>% arrange((`p.adj`), desc(diff))
tukey.output %>% kable(format='pandoc', caption = "TukeyHSD Results")
TukeyHSD Results
diff lwr upr p.adj pairs rangeDiff
18.5221875000000011 15.7468584937078635 21.2975165062921405 0.0000000000000211 Lack of money-Cannot get one 5.550658012584277
16.9598749999999967 14.1845459937078591 19.7352040062921361 0.0000000000000211 Family member already has one-Cannot get one 5.550658012584277
13.2050000000000018 10.4296709937078642 15.9803290062921395 0.0000000000000211 Lack of money-Financial institution too far away 5.550658012584275
12.6646250000000027 9.8892959937078651 15.4399540062921403 0.0000000000000211 Lack of money-Lack of documentation 5.550658012584275
-12.0788125000000033 -14.8541415062921409 -9.3034834937078656 0.0000000000000211 Too expensive have any account-Lack of money 5.550658012584275
-17.5039999999999978 -20.2793290062921372 -14.7286709937078601 0.0000000000000211 Lack of trust-Family member already has one 5.550658012584277
-19.0663125000000022 -21.8416415062921416 -16.2909834937078628 0.0000000000000211 Lack of trust-Lack of money 5.550658012584279
-19.3804374999999993 -22.1557665062921387 -16.6051084937078599 0.0000000000000211 Religious reasons-Family member already has one 5.550658012584279
-20.9427500000000038 -23.7180790062921432 -18.1674209937078643 0.0000000000000211 Religious reasons-Lack of money 5.550658012584279
-11.6426874999999974 -14.4180165062921350 -8.8673584937078598 0.0000000000000214 Financial institution too far away-Family member already has one 5.550658012584275
-11.1023124999999983 -13.8776415062921359 -8.3269834937078606 0.0000000000000232 Lack of documentation-Family member already has one 5.550658012584275
-10.7092499999999990 -13.4845790062921367 -7.9339209937078614 0.0000000000000284 Religious reasons-No need for financial services 5.550658012584275
-10.5164999999999988 -13.2918290062921365 -7.7411709937078612 0.0000000000000322 Too expensive have any account-Family member already has one 5.550658012584275
-10.2335000000000029 -13.0088290062921406 -7.4581709937078653 0.0000000000000412 No need for financial services-Lack of money 5.550658012584275
8.8639374999999987 6.0886084937078611 11.6392665062921363 0.0000000000001027 Too expensive have any account-Religious reasons 5.550658012584275
8.8328124999999993 6.0574834937078617 11.6081415062921369 0.0000000000001055 No need for financial services-Lack of trust 5.550658012584275
-8.6711874999999985 -11.4465165062921361 -5.8958584937078609 0.0000000000001060 No need for financial services-Family member already has one 5.550658012584275
8.2886875000000000 5.5133584937078624 11.0640165062921376 0.0000000000001119 No need for financial services-Cannot get one 5.550658012584275
-8.2781249999999993 -11.0534540062921369 -5.5027959937078617 0.0000000000001135 Religious reasons-Lack of documentation 5.550658012584275
-7.7377500000000001 -10.5130790062921378 -4.9624209937078625 0.0000000000003277 Religious reasons-Financial institution too far away 5.550658012584275
6.9874999999999998 4.2121709937078613 9.7628290062921383 0.0000000000255275 Too expensive have any account-Lack of trust 5.550658012584277
6.4433749999999996 3.6680459937078616 9.2187040062921373 0.0000000007224596 Too expensive have any account-Cannot get one 5.550658012584275
-6.4016875000000004 -9.1770165062921389 -3.6263584937078623 0.0000000009295720 Lack of trust-Lack of documentation 5.550658012584277
-5.8613125000000013 -8.6366415062921398 -3.0859834937078632 0.0000000228705361 Lack of trust-Financial institution too far away 5.550658012584277
5.8575625000000002 3.0822334937078621 8.6328915062921379 0.0000000233740098 Lack of documentation-Cannot get one 5.550658012584275
5.3171875000000011 2.5418584937078630 8.0925165062921387 0.0000005002820614 Financial institution too far away-Cannot get one 5.550658012584275
2.9714999999999989 0.1961709937078608 5.7468290062921366 0.0260689327027537 No need for financial services-Financial institution too far away 5.550658012584275
2.4311249999999998 -0.3442040062921383 5.2064540062921374 0.1368886029981524 No need for financial services-Lack of documentation 5.550658012584275
-2.4205624999999991 -5.1958915062921367 0.3547665062921390 0.1407710678736802 Religious reasons-Cannot get one 5.550658012584275
-1.8764374999999989 -4.6517665062921374 0.8988915062921392 0.4565292446957468 Religious reasons-Lack of trust 5.550658012584277
-1.8453125000000004 -4.6206415062921380 0.9300165062921378 0.4802685403155512 Too expensive have any account-No need for financial services 5.550658012584275
1.5623125000000044 -1.2130165062921336 4.3376415062921421 0.6985482909477798 Lack of money-Family member already has one 5.550658012584275
1.1261874999999986 -1.6491415062921395 3.9015165062921366 0.9357003437230564 Too expensive have any account-Financial institution too far away 5.550658012584276
0.5858124999999994 -2.1895165062921387 3.3611415062921375 0.9991117666211534 Too expensive have any account-Lack of documentation 5.550658012584276
-0.5441250000000002 -3.3194540062921383 2.2312040062921379 0.9994823019455810 Lack of trust-Cannot get one 5.550658012584276
0.5403749999999992 -2.2349540062921389 3.3157040062921372 0.9995080101246836 Lack of documentation-Financial institution too far away 5.550658012584276
#Plot residuals
ggplot(data=as.data.frame(qqnorm( no.account.anova$residuals , plot=F)), mapping=aes(x=x, y=y)) + 
    geom_point() + geom_smooth(method="lm", se=FALSE)  +
        scale_y_continuous(name = "Sample Quantiles") +
        scale_x_discrete(name = "Theoretical Quantiles") +
        ggtitle("Normal Q-Q Plot")

Conclusion

- Data meets all three conditions of ANOVA.

- The normal plot of the residuals shows data points lie pretty close to the line. Some deviation is noticed near the ends. This suggests data is nearly normal.

- As p-value of the test is equal to 0.0000000000000002, indicating the evidence is strong enough to reject the null hypothesis, \(H_0\) at a significance level of \(\alpha = 0.05\). That is, the data provides strong evidence that there is difference in reasons for not having an account with financial institution across groups.

Output of Tukey Honest Significant Differences(TukeyHSD) function suggests the difference between reason pairs Lack of money-Cannot get one has a p-value of 0.0000000000000211 and is statistically significant. At 95% confidence interval, reason pair difference is between 15.75% and 21.30% with Lack of money averaging 18.52% higher. That means, there is 18.52% higher chance, for a response group not having an account with financial institution is because of Lack of money.

Question 3: Using India data, develop a predictive model to identify whether a respondent has a financial account.

For developing predictive model, - 70% of the dataset will be used as training data and 30% as test data. - As output will be bi-variate, Generalized Linear Model(GLM) is used.

#Get complete data
india.account <- india.data %>% 
  filter(!is.na(age)) %>% 
  select (account, gender, ageGroup, incomeGroup, eduGroup)

#There are only two possible outcomes person can have account or not
india.account$account <- ifelse(india.account$account == 1, 1, 0)

set.seed(2005) #So results can be repeated
#Create sample data
samIdx <- sample(2,nrow(india.account), replace = T, prob = c(0.70,0.30))

#Training data
india.train <- india.account[samIdx == 1,]

#Test data
india.test <- india.account[samIdx == 2,]

#Run GLM on training data
india.glm <- glm(account ~ gender + ageGroup + incomeGroup + eduGroup, data = india.account, family = binomial(link = "logit"))
  
#Make prediction using test data
india.predict <- predict(india.glm, india.test, type = "response")

summary(india.glm)
## 
## Call:
## glm(formula = account ~ gender + ageGroup + incomeGroup + eduGroup, 
##     family = binomial(link = "logit"), data = india.account)
## 
## Deviance Residuals: 
##             Min               1Q           Median               3Q  
## -2.448372543129  -1.116906853565   0.547292268250   1.020832477542  
##             Max  
##  1.747078967946  
## 
## Coefficients:
##                                 Estimate        Std. Error  z value
## (Intercept)             2.22725433507954  0.23536558027249  9.46296
## genderMale             -0.57425698102466  0.07920453566401 -7.25030
## ageGroup36 - 45         0.43238854578185  0.11731065075817  3.68584
## ageGroup46 - 55         0.39147864570844  0.13627812831866  2.87264
## ageGroup56 - 65         0.16310283488844  0.14539385801954  1.12180
## ageGroupAbove 65       -0.04206682128494  0.20924925554456 -0.20104
## ageGroupBelow 26       -0.69299605585285  0.11269337452207 -6.14939
## incomeGroupMiddle 20%  -0.27859272917531  0.11822816213385 -2.35640
## incomeGroupPoorest 20% -0.29025963641623  0.13618298824400 -2.13139
## incomeGroupRichest 20%  0.28640843483957  0.12342299954692  2.32054
## incomeGroupSecond 20%  -0.28674532984093  0.12195233837860 -2.35129
## eduGroupHS or Less     -1.95078559495004  0.21651664866703 -9.00986
## eduGroupSome College   -1.08226620863156  0.21664080767702 -4.99567
##                                      Pr(>|z|)    
## (Intercept)            < 0.000000000000000222 ***
## genderMale                0.00000000000041584 ***
## ageGroup36 - 45                    0.00022795 ***
## ageGroup46 - 55                    0.00407052 ** 
## ageGroup56 - 65                    0.26194747    
## ageGroupAbove 65                   0.84066972    
## ageGroupBelow 26          0.00000000077779134 ***
## incomeGroupMiddle 20%              0.01845308 *  
## incomeGroupPoorest 20%             0.03305668 *  
## incomeGroupRichest 20%             0.02031150 *  
## incomeGroupSecond 20%              0.01870844 *  
## eduGroupHS or Less     < 0.000000000000000222 ***
## eduGroupSome College      0.00000058631532198 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4115.8501258355  on 2998  degrees of freedom
## Residual deviance: 3721.2967784569  on 2986  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3747.2967784569
## 
## Number of Fisher Scoring iterations: 4

The summary shows all the variables are significant.

#Generate prediction output
india.predict.info <- rep(0,length(which(samIdx == 2)))
india.predict.info[india.predict >=0.5] <- 1
  
cm <- confusionMatrix(india.predict.info, india.test$account)

cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 224 143
##          1 184 363
##                                                             
##                Accuracy : 0.6422319474836                   
##                  95% CI : (0.6101899657424, 0.6733537552102)
##     No Information Rate : 0.5536105032823                   
##     P-Value [Acc > NIR] : 0.00000003258372921716            
##                                                             
##                   Kappa : 0.2690289034871                   
##  Mcnemar's Test P-Value : 0.02696631569637                  
##                                                             
##             Sensitivity : 0.5490196078431                   
##             Specificity : 0.7173913043478                   
##          Pos Pred Value : 0.6103542234332                   
##          Neg Pred Value : 0.6636197440585                   
##              Prevalence : 0.4463894967177                   
##          Detection Rate : 0.2450765864333                   
##    Detection Prevalence : 0.4015317286652                   
##       Balanced Accuracy : 0.6332054560955                   
##                                                             
##        'Positive' Class : 0                                 
## 
fourfoldplot(cm$table, color = c("#CC6666", "#99CC99"),conf.level = 0, margin = 1, main = "Mis-classification")

Conclusion

- As output variable can have two possible outcomes, respondent can have an account or not. I have used Generalized Linear Model(GLM) binomial family and logit link.

- Summary of the model suggests all the variables contribute significantly. Gender, age, and education contribute more than income group.

- The accuracy of the model is 64%.

References