The goal of this project is to demonstrate statistical analysis of survey data using Analysis of Variance(ANOVA) and Generalized Linear Model(GLM). The Global Financial Inclusion, Global Findex database, 2014 edition will be used to solve following questions.
Is there a difference in financial account ownership between India, China, and G7 Nations by age group?
India has a good portion of the population, that does not have a financial account. Is there any difference in reasons for not having financial account across various groups?
Using India data, develop a predictive model to identify whether a respondent has a financial account.
The principles of tidy data provided by Hadley Wickham are followed throughout the process of cleaning and preparing the data for analysis. The software tool used for the project is R. Most of the statistical functions are used from stats package. graphs are shown using ggplot2 package.
Information about the variables can be obtained from Global-Findex-Data-Dictionary.csv. Actual survey results in its raw form are present in micro_world.csv. Variables may contain the recorded results of a direct question asked, or be derived in some way.
if (!require('plyr')) install.packages('plyr') #Data frame and table functions
if (!require('dplyr')) install.packages('dplyr') #Data frame and table functions
if (!require('stringr')) install.packages('stringr') #String manuplilation functions
if (!require('ggplot2')) install.packages('ggplot2') #Graphics display
if (!require('tidyr')) install.packages('tidyr') #Tidy data using spread() and gather() functions
if (!require('gridExtra')) install.packages('gridExtra') #Display graphs side by side
if (!require('knitr')) install.packages('knitr') #Report display, table format
if (!require('stats')) install.packages('stats')
if (!require('caret')) install.packages('caret') #For confusion matrix
The survey was carried out over the 2014 calendar year by Gallup, Inc. in more than 160 countries and over 140 languages, using randomly selected, nationally representative samples. The target population is the entire civilian population of age 15 years and above. This is an observational data. Most of the variables are stored as numerical values, and they are converted into character based categorical variables. Survey question surrounding income is based on household income quintile(1 - Poorest 20%, 2 - Second 20%, 3 - Middle 20%, 4 - Fourth 20%, 5 - Richest 20%). .
agef <- function(x){
if (is.na(x)){
age <- NA
return(age)
}
if (x < 26){
age <- "Below 26"
} else if(x > 25 & x < 36){
age <- "26 - 35"
} else if(x > 35 & x < 46){
age <- "36 - 45"
} else if(x > 45 & x < 56){
age <- "46 - 55"
} else if(x > 55 & x < 66){
age <- "56 - 65"
} else if (x > 65){
age <- "Above 65"
} else{
age <- NA
}
return(age)
}
incomef <- function(x){
if (is.na(x)){
inc <- NA
return(inc)
}
if (x == 1){
inc <- "Poorest 20%"
} else if(x == 2){
inc <- "Second 20%"
} else if(x == 3){
inc <- "Middle 20%"
} else if(x == 4){
inc <- "Fourth 20%"
} else if(x == 5){
inc <- "Richest 20%"
} else{
inc <- NA
}
return(inc)
}
eduf <- function(x){
if (is.na(x)){
col <- NA
return(col)
}
if (x == 1){
col <- "HS or Less"
} else if(x == 2){
col <- "Some College"
} else if(x == 3){
col <- "College Grad"
} else{
col <- NA
}
return(col)
}
age.order <- data.frame(age<-"Below 26", order<-1, stringsAsFactors = F)
age.order <- rbind(age.order, c("26 - 35", 2))
age.order <- rbind(age.order, c("36 - 45", 3))
age.order <- rbind(age.order, c("46 - 55", 4))
age.order <- rbind(age.order, c("56 - 65", 5))
age.order <- rbind(age.order, c("Above 65", 6))
colnames(age.order) <- c("age","order")
age.order$order <- as.integer(age.order$order)
income.order <- data.frame(age<-"Poorest 20%", order<-1, stringsAsFactors = F)
income.order <- rbind(income.order, c("Second 20%", 2))
income.order <- rbind(income.order, c("Middle 20%", 3))
income.order <- rbind(income.order, c("Fourth 20%", 4))
income.order <- rbind(income.order, c("Richest 20%", 5))
colnames(income.order) <- c("income","order")
income.order$order <- as.integer(income.order$order)
edu.order <- data.frame(edu<-"HS or Less", order<-1, stringsAsFactors = F)
edu.order <- rbind(edu.order, c("Some College", 2))
edu.order <- rbind(edu.order, c("College Grad", 3))
colnames(edu.order) <- c("edu","order")
edu.order$order <- as.integer(edu.order$order)
attOrder <- age.order %>% select(attName=age, order)
attOrder <- rbind(attOrder, as.data.frame(income.order %>% mutate(order = order + 20) %>% select(attName=income, order)))
attOrder <- rbind(attOrder, as.data.frame(edu.order %>% mutate(order = order + 30) %>% select(attName=edu, order)))
attOrder <- rbind(attOrder, data.frame(attName="Male",order = 0, stringsAsFactors = F))
attOrder <- rbind(attOrder, data.frame(attName="Female",order = -1, stringsAsFactors = F))
attOrder <- arrange(attOrder, order)
reasons <- data.frame("Cannot get one", 1, stringsAsFactors = F)
reasons <- rbind(reasons, c("Family member already has one", 2))
reasons <- rbind(reasons, c("Financial institution too far away", 3))
reasons <- rbind(reasons, c("Lack of documentation", 4))
reasons <- rbind(reasons, c("Lack of money", 5))
reasons <- rbind(reasons, c("Lack of trust", 6))
reasons <- rbind(reasons, c("No need for financial services", 7))
reasons <- rbind(reasons, c("Religious reasons", 8))
reasons <- rbind(reasons, c("Too expensive to have any account", 9))
colnames(reasons) <- c("reason","order")
#Change path
fileDir <- "D:/CUNY/606/Project-1/FinalProject"
#Load survey data.
world.findex.data <- read.csv(file.path(fileDir, "micro_world.csv"), stringsAsFactors=FALSE)
I will be using ANOVA statistical method to identify if there is a difference in financial account ownership. Analysis of Variance (ANOVA) is a statistical method used for testing differences between two or more means. Data has to meet following conditions before applying ANOVA. If any of the conditions are not satisfied, the results from the use of ANOVA techniques may be unreliable.
As data is collected by Gallup, Inc from each country on a random basis. Survey data meets the condition of independence.
For the scope of the project, I will be using account attribute as the response variable. This variable has two possible outcomes yes if the respondent has an account with financial institution otherwise no. Observations with missing values are discarded. This condition can be checked with the use of graphs; I will be using box plot and side-by-side dot plot to identify the normal distribution of the response variable.
This can be evaluated by using rule of thumb if the largest sample standard deviation divided by the smallest sample standard deviation is not greater than two, then assume that the population variances are equal. I will be using box plot to identify variability across the groups.
Following graph shows a total number of respondents having an account with the financial institution. Account ownership percentage is higher among G7 Nations compared to India and China.
#Get India, China and G7 Nations data
in_cn_g7.data <- world.findex.data %>%
filter(economy %in% c("India", "China", "Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States")) %>%
select(economy, female, age, educ, inc_q, account)
in_cn_g7.data$hasAccount <- ifelse(in_cn_g7.data$account == 1, "Yes", ifelse(in_cn_g7.data$account == 2, "No", NA))
in_cn_g7.data$gender <- ifelse(in_cn_g7.data$female == 1, "Female", ifelse(in_cn_g7.data$female == 2, "Male",NA))
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$age, agef))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "ageGroup"
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$inc_q, incomef))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "incomeGroup"
in_cn_g7.data = cbind(in_cn_g7.data, ldply(in_cn_g7.data$educ, eduf))
colnames(in_cn_g7.data)[ncol(in_cn_g7.data)] <- "eduGroup"
#Get percentage of population that has accounts.
in_cn_g7.data1 <- in_cn_g7.data %>%
filter(age>0) %>%
group_by(economy) %>%
mutate(hasAcc = n())
in_cn_g7.data1 <- in_cn_g7.data1 %>%
filter(hasAccount=="Yes") %>%
group_by(economy, ageGroup, hasAcc) %>%
select (economy, ageGroup, hasAcc) %>%
mutate(ageGroupTotal = n()) %>%
distinct (ageGroupTotal)
in_cn_g7.data1 <- in_cn_g7.data1 %>%
mutate(accPer = round(ageGroupTotal*100/hasAcc, 3))
#Graphical display
in_cn_g7.data1 %>%
group_by(economy) %>%
mutate(value = sum(accPer)) %>%
ggplot(aes(x=economy, y=value, fill = value)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Country", y="Percentage Account Holders", title = "Population Sample With Financial Accounts", subtitle = "Comparision between countries") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_text(aes(label=value), vjust=0, color="black") + labs(fill = "Ownership")
Following data shows comparision of financial account ownership by age group accross countries.
#Group data by age group
in_cn_g7.data <- in_cn_g7.data %>%
filter(hasAccount=="Yes") %>%
filter(age>0) %>%
group_by(economy) %>%
mutate(hasAcc = n())
in_cn_g7.data <- in_cn_g7.data %>%
group_by(economy, ageGroup, hasAcc) %>%
select (economy, ageGroup, hasAcc) %>%
mutate(ageGroupTotal = n()) %>%
distinct (ageGroupTotal)
in_cn_g7.data <- in_cn_g7.data %>%
mutate(accPer = round(ageGroupTotal*100/hasAcc, 3))
age.in_cn_g7.data <- in_cn_g7.data %>%
inner_join(age.order, by = c("ageGroup" = "age")) %>%
arrange(economy, order)
age.in_cn_g7.data <- transform(age.in_cn_g7.data, ageGroup=factor(ageGroup,levels=unique(ageGroup)))
colnames(age.in_cn_g7.data)[colnames(age.in_cn_g7.data)=="ageGroup"] <- "Age Group"
colnames(age.in_cn_g7.data)[colnames(age.in_cn_g7.data)=="accPer"] <- "value"
age.in_cn_g7.data$ageGroupTotal <- NULL
#Display data in table format
age.in_cn_g7.data %>%
select(economy, `Age Group`, value) %>%
spread(economy, value) %>%
kable(format='pandoc', caption = "Population Sample With Financial Accounts - Countries by Age Group")
| Age Group | Canada | China | France | Germany | India | Italy | Japan | United Kingdom | United States |
|---|---|---|---|---|---|---|---|---|---|
| Below 26 | 10.010 | 12.433 | 6.900 | 7.121 | 21.765 | 6.494 | 3.030 | 4.044 | 10.677 |
| 26 - 35 | 11.249 | 18.190 | 11.123 | 9.766 | 25.641 | 12.446 | 5.152 | 11.426 | 10.254 |
| 36 - 45 | 14.551 | 20.816 | 22.760 | 12.208 | 24.210 | 22.511 | 17.576 | 19.919 | 12.474 |
| 46 - 55 | 20.537 | 20.911 | 24.923 | 21.872 | 14.490 | 23.918 | 17.374 | 26.491 | 14.905 |
| 56 - 65 | 20.640 | 15.407 | 20.906 | 19.939 | 10.137 | 20.130 | 25.960 | 22.548 | 22.199 |
| Above 65 | 23.013 | 12.243 | 13.388 | 29.095 | 3.757 | 14.502 | 30.909 | 15.571 | 29.493 |
Following graph show that data, doesn’t have the ideal bell-shaped appearance, and it suggests there are some outliers in the data. Financial account ownership differ between age groups in each country.
#Graphical display
ggplot(age.in_cn_g7.data, aes(x=economy, y=value, fill = `Age Group`)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Country", y="Accounts", title = "Percentage of Population With Financial Accounts", subtitle = "Comparision between countries by age group") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(fill = "Ownership") + scale_fill_brewer(palette="Paired")
Side-by-side dot plot show data within each group are very volatile. Graph also suggest, there are outliers in Japan data. Box plot suggest United States differ from rest of the countries. Since data is collected randomly in each country and graphs does not display anthing extreme we can assume distribution of the response variable follows a normal distribution and variability across the groups exists. This meets second and third conditions.
qplot(x = economy, y = value, data = age.in_cn_g7.data, color = value) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_smooth(method = "lm") +
labs(x="Country", y="% of Accounts Holders By Age Group", title = "Financial Account Ownership By Country And Age Group", subtitle = "Side-by-side dot plot", color = "Ownership")
fill <- "#4271AE"
line <- "#1F3552"
age.in_cn_g7.data %>%
ggplot(aes(x = economy, y = value)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "Account") +
scale_x_discrete(name = "Country") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Average(Mean) Account Ownership By Country")
Standard deviations between countries range between 3.915 and 11.037. Even though quotient is greater than two when maximum standard deviation is divided by minimum standard deviation(2.819), we can assume age group variances are equal across countries as data is collected randomly in each country.
age.in_cn_g7.data %>%
select(economy, `Age Group`, value) %>%
group_by(economy) %>%
mutate(ecoSD = round(sd(value),3)) %>%
select(economy, ecoSD) %>%
distinct(ecoSD) %>%
spread(economy, ecoSD) %>%
kable(format='pandoc', caption = "Standard Deviation by Countries Based on Age Group")
| Canada | China | France | Germany | India | Italy | Japan | United Kingdom | United States |
|---|---|---|---|---|---|---|---|---|
| 5.462 | 3.915 | 7.213 | 8.377 | 8.689 | 6.704 | 11.037 | 8.12 | 7.652 |
Hypothesis test.
Null hypothesis: There is no difference in account ownership percentage between age groups across countries.
\(H_0: {\mu}_{can} = {\mu}_{chn} = {\mu}_{fra} = {\mu}_{deu} = {\mu}_{ind} = {\mu}_{ita} = {\mu}_{jpn} = {\mu}_{uk} = {\mu}_{usa}\)
Alternative hypothesis: There is difference in account ownership percentage between age groups across countries.
\(H_A: {\mu}_{can} \ne {\mu}_{chn} \ne {\mu}_{fra} \ne {\mu}_{deu} \ne {\mu}_{ind} \ne {\mu}_{ita} \ne {\mu}_{jpn} \ne {\mu}_{uk} \ne {\mu}_{usa}\)
options("scipen"=100, "digits"=4)
#Apply ANOVA
age.anova <- aov(value ~ `Age Group` + economy, data = age.in_cn_g7.data)
summary(age.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Age Group` 5 952 190 4.43 0.0026 **
## economy 8 0 0 0.00 1.0000
## Residuals 40 1718 43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Draw residuals plot to test is data is normally distributed
ggplot(data=as.data.frame(qqnorm( age.anova$residuals , plot=F)), mapping=aes(x=x, y=y)) +
geom_point() + geom_smooth(method="lm", se=FALSE) +
scale_y_continuous(name = "Sample Quantiles") +
scale_x_discrete(name = "Theoretical Quantiles") +
ggtitle("Normal Q-Q Plot")
- Data meets fairly all three conditions of ANOVA.
- The normal plot of the residuals shows data points lie pretty close to the line. Some deviation is noticed near the ends. This suggests data is nearly normal.
- The p-value of the test is 0.0026, less than the default significance level of \(\alpha = 0.05\). We reject null hypothesis \(H_0\). In other words, there is difference in account ownership percentage between age groups across countries.
- As data was collected from each country based on size of the population, weighted sampling, aggregates are converted to percentages.
I will be using ANOVA technique to answer the question and use TukeyHSD function to identify if any reason pair stands out. Following graph shows a total number of respondents that does not have an account with the financial institution for various reasons.
#Get India data
india.data <- world.findex.data %>%
filter(economy == "India") %>%
select(economy, female, age, educ, inc_q, account, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i)
#Convert numerical values to categorical values
india.data$hasAccount <- ifelse(india.data$account == 1, "Yes", ifelse(india.data$account == 2, "No", NA))
india.data$gender <- ifelse(india.data$female == 1, "Female", ifelse(india.data$female == 2, "Male",NA))
india.data = cbind(india.data, ldply(india.data$age, agef))
colnames(india.data)[ncol(india.data)] <- "ageGroup"
india.data = cbind(india.data, ldply(india.data$inc_q, incomef))
colnames(india.data)[ncol(india.data)] <- "incomeGroup"
india.data = cbind(india.data, ldply(india.data$educ, eduf))
colnames(india.data)[ncol(india.data)] <- "eduGroup"
#Subset data by respondents who do not have accounts
no.account <- india.data %>% filter(account==2)
no.account <- no.account %>% filter(!is.na(age))
no.account.tidy <- no.account %>%
select (gender, ageGroup, incomeGroup, eduGroup, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i) %>%
gather(noAccountReason, value, -gender, -ageGroup, -incomeGroup, -eduGroup) %>% filter(value==1)
#Expand the questions
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8a"] <- "Financial institution too far away"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8b"] <- "Too expensive to have any account"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8c"] <- "Lack of documentation"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8d"] <- "Lack of trust"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8e"] <- "Religious reasons"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8f"] <- "Lack of money"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8g"] <- "Family member already has one"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8h"] <- "Cannot get one"
no.account.tidy$noAccountReason[no.account.tidy$noAccountReason == "q8i"] <- "No need for financial services"
#Get data to display on the graph
no.account.reason <- no.account.tidy %>%
group_by(noAccountReason) %>%
summarise(value = n()) %>%
inner_join(reasons, by = c("noAccountReason"="reason")) %>%
arrange(desc(order))
#Convert columns to factors to preserve data order
no.account.reason <- transform(no.account.reason, noAccountReason=factor(noAccountReason,levels=unique(noAccountReason)))
#Graphical display
no.account.reason %>%
ggplot(aes(x=noAccountReason, y=value, fill = value)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Reasons", y="Respondents", title = "Population Sample With No Financial Accounts - India", subtitle = "Comparision between ressons") + labs(fill = "Counts") + coord_flip()
Above graph suggests that Lack of money is a primary reason for not having an account with the financial institution. Is Lack of money only reason across various Gender, Age groups, Education level and Income groups for not having an account with a financial institution? Following graphs show the breakdown by various groups.
Reasons for not having an account based on Age Groups. The graph shows age group below 26 years age top reason for not having an account with a financial institution is different from other age groups.
#Graphical display
no.account.age <- no.account.tidy %>%
group_by(ageGroup, noAccountReason) %>%
summarise(ageReasonvalue = n()) %>%
inner_join(age.order, by = c("ageGroup"="age")) %>%
arrange(order,noAccountReason)
no.account.age <- no.account.age %>% group_by(ageGroup) %>% mutate(agevalue = sum(ageReasonvalue))
no.account.age <- no.account.age %>% mutate(value = round(ageReasonvalue * 100/agevalue,3))
#Convert data to factors
no.account.age <- transform(no.account.age, ageGroup=factor(ageGroup,levels=unique(ageGroup)))
ggplot(no.account.age, aes(x=ageGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Age Group", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Age Groups") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()
Reasons for not having an account based on Education Level. The graph shows reasons vary by Education Level.
#Graphical display
no.account.edu <- no.account.tidy %>%
group_by(eduGroup, noAccountReason) %>%
summarise(eduReasonvalue = n()) %>%
inner_join(edu.order, by = c("eduGroup"="edu")) %>%
arrange(order,noAccountReason)
no.account.edu <- no.account.edu %>% group_by(eduGroup) %>% mutate(eduvalue = sum(eduReasonvalue))
no.account.edu <- no.account.edu %>% mutate(value = round(eduReasonvalue * 100/eduvalue,3))
#Convert data to factors
no.account.edu <- transform(no.account.edu, eduGroup=factor(eduGroup,levels=unique(eduGroup)))
ggplot(no.account.edu, aes(x=eduGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Education Level", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Education Level") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()
Reasons for not having an account based on household income. It shows reasons vary by household income. For first three groups Lack of money is top reason.
#Graphical display
no.account.income <- no.account.tidy %>%
group_by(incomeGroup, noAccountReason) %>%
summarise(incomeReasonvalue = n()) %>%
inner_join(income.order, by = c("incomeGroup"="income")) %>%
arrange(order,noAccountReason)
no.account.income <- no.account.income %>% group_by(incomeGroup) %>% mutate(incomevalue = sum(incomeReasonvalue))
no.account.income <- no.account.income %>% mutate(value = round(incomeReasonvalue * 100/incomevalue,3))
#Convert data to factors
no.account.income <- transform(no.account.income, incomeGroup=factor(incomeGroup,levels=unique(incomeGroup)))
ggplot(no.account.income, aes(x=incomeGroup, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Income Group", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Household Income") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()
Reasons for not having an account based on gender. Male and Female population has different reasons for not have financial accounts.
#Graphical display
no.account.gender <- no.account.tidy %>%
group_by(gender, noAccountReason) %>%
summarise(genderReasonvalue = n()) %>%
arrange(gender,noAccountReason)
no.account.gender <- no.account.gender %>% group_by(gender) %>% mutate(gendervalue = sum(genderReasonvalue))
no.account.gender <- no.account.gender %>% mutate(value = round(genderReasonvalue * 100/gendervalue,3))
#Convert data to factors
ggplot(no.account.gender, aes(x=gender, y=value, fill = noAccountReason)) + geom_bar(width=.9,stat="identity",position = "dodge") + labs(x="Gender", y="Respondents %", title = "Reasons For Not Having Financial Account", subtitle = "Percentage comparision between Gender") + labs(fill = "Reasons") + scale_fill_brewer(palette="Paired") + coord_flip()
Comparing details of the graphs gender, age groups, education level and income groups against overall graph, it stands out Lack of money is not common among all the groups. Using ANOVA technique lets determine if there is a difference in reasons for not having an account.
#Get complete data and summarise it by reasons
no.account <- india.data %>%
filter(age>0) %>%
filter(account == 2) %>%
select (account, gender, ageGroup, incomeGroup, eduGroup, q8a, q8b, q8c, q8d, q8e, q8f, q8g, q8h, q8i) %>%
gather(noAccountReason, value, -gender, -ageGroup, -incomeGroup, -eduGroup) %>%
filter(value==1) %>%
select(noAccountReason, gender, ageGroup, incomeGroup, eduGroup) %>%
gather(resp, value, -noAccountReason) %>%
select(noAccountReason, respAtt = value) %>%
group_by(noAccountReason, respAtt) %>%
summarise(respCount = n())
no.account$noAccountReason[no.account$noAccountReason == "q8a"] <- "Financial institution too far away"
no.account$noAccountReason[no.account$noAccountReason == "q8b"] <- "Too expensive have any account"
no.account$noAccountReason[no.account$noAccountReason == "q8c"] <- "Lack of documentation"
no.account$noAccountReason[no.account$noAccountReason == "q8d"] <- "Lack of trust"
no.account$noAccountReason[no.account$noAccountReason == "q8e"] <- "Religious reasons"
no.account$noAccountReason[no.account$noAccountReason == "q8f"] <- "Lack of money"
no.account$noAccountReason[no.account$noAccountReason == "q8g"] <- "Family member already has one"
no.account$noAccountReason[no.account$noAccountReason == "q8h"] <- "Cannot get one"
no.account$noAccountReason[no.account$noAccountReason == "q8i"] <- "No need for financial services"
no.account <- no.account %>% group_by(respAtt) %>% mutate(reasonvalue = sum(respCount))
no.account <- no.account %>% mutate(respPer = round(respCount * 100/reasonvalue,3))
no.account.tidy <- no.account %>%
select(-respCount, -reasonvalue) %>%
spread(noAccountReason, respPer) %>%
inner_join(attOrder, by = c("respAtt" = "attName"))
no.account.tidy <- arrange(no.account.tidy, order)
colnames(no.account.tidy)[colnames(no.account.tidy)=="respAtt"] <- "Respondent Attributes"
no.account.tidy$order <- NULL
no.account.tidy %>%
kable(format='pandoc', caption = "Reasons For Not Having Financial Account")
| Respondent Attributes | Cannot get one | Family member already has one | Financial institution too far away | Lack of documentation | Lack of money | Lack of trust | No need for financial services | Religious reasons | Too expensive have any account |
|---|---|---|---|---|---|---|---|---|---|
| Female | 4.791 | 17.33 | 10.279 | 11.150 | 25.78 | 4.443 | 11.672 | 2.265 | 12.282 |
| Male | 4.224 | 23.43 | 9.901 | 10.165 | 22.18 | 3.564 | 13.399 | 1.914 | 11.221 |
| Below 26 | 3.837 | 26.41 | 8.916 | 12.415 | 21.90 | 2.257 | 12.980 | 1.919 | 9.368 |
| 26 - 35 | 4.889 | 16.59 | 10.370 | 11.852 | 24.74 | 4.148 | 11.556 | 2.370 | 13.481 |
| 36 - 45 | 4.585 | 17.47 | 10.044 | 9.389 | 26.20 | 5.240 | 12.009 | 1.528 | 13.537 |
| 46 - 55 | 5.128 | 21.25 | 9.890 | 8.059 | 23.08 | 4.762 | 13.919 | 2.564 | 11.355 |
| 56 - 65 | 4.029 | 19.41 | 12.821 | 7.692 | 23.81 | 5.861 | 12.821 | 1.465 | 12.088 |
| Above 65 | 6.122 | 17.35 | 11.224 | 6.122 | 23.47 | 4.082 | 16.327 | 4.082 | 11.224 |
| Poorest 20% | 3.632 | 19.86 | 11.622 | 11.622 | 29.54 | 3.874 | 8.232 | 1.695 | 9.927 |
| Second 20% | 2.799 | 17.42 | 9.798 | 11.820 | 25.82 | 3.577 | 12.286 | 2.022 | 14.463 |
| Middle 20% | 5.110 | 19.20 | 11.188 | 10.635 | 22.93 | 4.558 | 12.845 | 2.210 | 11.326 |
| Fourth 20% | 4.706 | 24.71 | 9.020 | 9.216 | 22.16 | 3.922 | 12.745 | 1.765 | 11.765 |
| Richest 20% | 6.702 | 25.47 | 8.043 | 9.115 | 17.43 | 3.485 | 17.694 | 2.681 | 9.383 |
| HS or Less | 4.392 | 18.44 | 10.521 | 10.266 | 24.62 | 4.290 | 12.666 | 1.839 | 12.972 |
| Some College | 4.711 | 26.90 | 8.967 | 11.094 | 21.73 | 2.888 | 12.614 | 2.736 | 8.359 |
| College Grad | 4.255 | 34.04 | 6.383 | 17.021 | 14.89 | 4.255 | 12.766 | 2.128 | 4.255 |
The side-by-side plot shows responses are normally distributed across reasons not having a financial account. This satisfies second condition of ANOVA.
qplot(x = respAtt, y = respPer, data = no.account, color = respPer) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_smooth(method = "lm") +
labs(x="Response Group", y="Reason Percentage", title = "Reasons For Not Having Financial Account", subtitle = "Side-by-side dot plot Response Group Vs. Reason Percentage", color = "Percentage")
Box plot showing responses, distribution across groups for not having a financial account. There are outliers in some of the groups. Average(mean) of each group varies but is reasonable.
#Calculate mean
fill <- "#4271AE"
line <- "#1F3552"
#Display Boxplot
no.account %>%
ggplot(aes(x = respAtt, y = respPer)) +
geom_boxplot(fill = fill, colour = line) +
scale_y_continuous(name = "Reason Percentage") +
scale_x_discrete(name = "Response Group") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Average(Mean) Of Reasons For Not Having Financial Account")
Side-by-side dot plot and box plot show data within each group are very volatile. It suggests distribution of the response variable follows a normal distribution and variability across the groups exists. This meets second and third conditions.
Standard deviations between reasons range between 6.769 and 10.140. Quotient is less than two when maximum standard deviation is divided by minimum standard deviation(1.498), we can assume reasons variances are equal across each response group for not having account.
no.account %>%
select(noAccountReason, respAtt, respPer) %>%
group_by(respAtt) %>%
mutate(resonSD = round(sd(respPer),3)) %>%
select(respAtt, resonSD) %>%
distinct(resonSD) %>%
spread(respAtt, resonSD) %>%
kable(format='pandoc', caption = "Standard Deviation by Response Group Based on Reasons")
| 26 - 35 | 36 - 45 | 46 - 55 | 56 - 65 | Above 65 | Below 26 | College Grad | Female | Fourth 20% | HS or Less | Male | Middle 20% | Poorest 20% | Richest 20% | Second 20% | Some College |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6.949 | 7.482 | 7.192 | 7.241 | 6.777 | 8.509 | 10.14 | 7.238 | 7.889 | 7.251 | 7.672 | 6.769 | 8.827 | 7.537 | 7.74 | 8.331 |
As data meets conditions of ANOVA,
Null hypothesis: There is no difference in reasons for not having an account with financial institution.
\(H_0: {\mu}_{fa} = {\mu}_{exp} = {\mu}_{ld} = {\mu}_{lt} = {\mu}_{rr} = {\mu}_{lm} = {\mu}_{fm} = {\mu}_{co} = {\mu}_{nf}\)
Alternative hypothesis: There is difference in reasons for not having an account with financial institution.
\(H_A: {\mu}_{fa} \ne {\mu}_{exp} \ne {\mu}_{ld} \ne {\mu}_{lt} \ne {\mu}_{rr} \ne {\mu}_{lm} \ne {\mu}_{fm} \ne {\mu}_{co} \ne {\mu}_{nf}\)
options("scipen"=100, "digits"=16)
#Calculate ANOVA
no.account.anova <- aov(respPer ~ noAccountReason, data = no.account)
summary(no.account.anova)
## Df Sum Sq Mean Sq F value
## noAccountReason 8 6886.429627722 860.8037034653 138.85689
## Residuals 135 836.894010437 6.1992148921
## Pr(>F)
## noAccountReason < 0.000000000000000222 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Get Tukey output
tukey.output<-TukeyHSD(no.account.anova)
tukey.output<-data.frame(tukey.output$noAccountReason, stringsAsFactors = F)
tukey.output$pairs <- rownames(tukey.output)
rownames(tukey.output) <- NULL
tukey.output <- tukey.output %>% filter((upr-lwr)>0) %>% mutate(rangeDiff = (upr-lwr))
tukey.output <- tukey.output %>% arrange((`p.adj`), desc(diff))
tukey.output %>% kable(format='pandoc', caption = "TukeyHSD Results")
| diff | lwr | upr | p.adj | pairs | rangeDiff |
|---|---|---|---|---|---|
| 18.5221875000000011 | 15.7468584937078635 | 21.2975165062921405 | 0.0000000000000211 | Lack of money-Cannot get one | 5.550658012584277 |
| 16.9598749999999967 | 14.1845459937078591 | 19.7352040062921361 | 0.0000000000000211 | Family member already has one-Cannot get one | 5.550658012584277 |
| 13.2050000000000018 | 10.4296709937078642 | 15.9803290062921395 | 0.0000000000000211 | Lack of money-Financial institution too far away | 5.550658012584275 |
| 12.6646250000000027 | 9.8892959937078651 | 15.4399540062921403 | 0.0000000000000211 | Lack of money-Lack of documentation | 5.550658012584275 |
| -12.0788125000000033 | -14.8541415062921409 | -9.3034834937078656 | 0.0000000000000211 | Too expensive have any account-Lack of money | 5.550658012584275 |
| -17.5039999999999978 | -20.2793290062921372 | -14.7286709937078601 | 0.0000000000000211 | Lack of trust-Family member already has one | 5.550658012584277 |
| -19.0663125000000022 | -21.8416415062921416 | -16.2909834937078628 | 0.0000000000000211 | Lack of trust-Lack of money | 5.550658012584279 |
| -19.3804374999999993 | -22.1557665062921387 | -16.6051084937078599 | 0.0000000000000211 | Religious reasons-Family member already has one | 5.550658012584279 |
| -20.9427500000000038 | -23.7180790062921432 | -18.1674209937078643 | 0.0000000000000211 | Religious reasons-Lack of money | 5.550658012584279 |
| -11.6426874999999974 | -14.4180165062921350 | -8.8673584937078598 | 0.0000000000000214 | Financial institution too far away-Family member already has one | 5.550658012584275 |
| -11.1023124999999983 | -13.8776415062921359 | -8.3269834937078606 | 0.0000000000000232 | Lack of documentation-Family member already has one | 5.550658012584275 |
| -10.7092499999999990 | -13.4845790062921367 | -7.9339209937078614 | 0.0000000000000284 | Religious reasons-No need for financial services | 5.550658012584275 |
| -10.5164999999999988 | -13.2918290062921365 | -7.7411709937078612 | 0.0000000000000322 | Too expensive have any account-Family member already has one | 5.550658012584275 |
| -10.2335000000000029 | -13.0088290062921406 | -7.4581709937078653 | 0.0000000000000412 | No need for financial services-Lack of money | 5.550658012584275 |
| 8.8639374999999987 | 6.0886084937078611 | 11.6392665062921363 | 0.0000000000001027 | Too expensive have any account-Religious reasons | 5.550658012584275 |
| 8.8328124999999993 | 6.0574834937078617 | 11.6081415062921369 | 0.0000000000001055 | No need for financial services-Lack of trust | 5.550658012584275 |
| -8.6711874999999985 | -11.4465165062921361 | -5.8958584937078609 | 0.0000000000001060 | No need for financial services-Family member already has one | 5.550658012584275 |
| 8.2886875000000000 | 5.5133584937078624 | 11.0640165062921376 | 0.0000000000001119 | No need for financial services-Cannot get one | 5.550658012584275 |
| -8.2781249999999993 | -11.0534540062921369 | -5.5027959937078617 | 0.0000000000001135 | Religious reasons-Lack of documentation | 5.550658012584275 |
| -7.7377500000000001 | -10.5130790062921378 | -4.9624209937078625 | 0.0000000000003277 | Religious reasons-Financial institution too far away | 5.550658012584275 |
| 6.9874999999999998 | 4.2121709937078613 | 9.7628290062921383 | 0.0000000000255275 | Too expensive have any account-Lack of trust | 5.550658012584277 |
| 6.4433749999999996 | 3.6680459937078616 | 9.2187040062921373 | 0.0000000007224596 | Too expensive have any account-Cannot get one | 5.550658012584275 |
| -6.4016875000000004 | -9.1770165062921389 | -3.6263584937078623 | 0.0000000009295720 | Lack of trust-Lack of documentation | 5.550658012584277 |
| -5.8613125000000013 | -8.6366415062921398 | -3.0859834937078632 | 0.0000000228705361 | Lack of trust-Financial institution too far away | 5.550658012584277 |
| 5.8575625000000002 | 3.0822334937078621 | 8.6328915062921379 | 0.0000000233740098 | Lack of documentation-Cannot get one | 5.550658012584275 |
| 5.3171875000000011 | 2.5418584937078630 | 8.0925165062921387 | 0.0000005002820614 | Financial institution too far away-Cannot get one | 5.550658012584275 |
| 2.9714999999999989 | 0.1961709937078608 | 5.7468290062921366 | 0.0260689327027537 | No need for financial services-Financial institution too far away | 5.550658012584275 |
| 2.4311249999999998 | -0.3442040062921383 | 5.2064540062921374 | 0.1368886029981524 | No need for financial services-Lack of documentation | 5.550658012584275 |
| -2.4205624999999991 | -5.1958915062921367 | 0.3547665062921390 | 0.1407710678736802 | Religious reasons-Cannot get one | 5.550658012584275 |
| -1.8764374999999989 | -4.6517665062921374 | 0.8988915062921392 | 0.4565292446957468 | Religious reasons-Lack of trust | 5.550658012584277 |
| -1.8453125000000004 | -4.6206415062921380 | 0.9300165062921378 | 0.4802685403155512 | Too expensive have any account-No need for financial services | 5.550658012584275 |
| 1.5623125000000044 | -1.2130165062921336 | 4.3376415062921421 | 0.6985482909477798 | Lack of money-Family member already has one | 5.550658012584275 |
| 1.1261874999999986 | -1.6491415062921395 | 3.9015165062921366 | 0.9357003437230564 | Too expensive have any account-Financial institution too far away | 5.550658012584276 |
| 0.5858124999999994 | -2.1895165062921387 | 3.3611415062921375 | 0.9991117666211534 | Too expensive have any account-Lack of documentation | 5.550658012584276 |
| -0.5441250000000002 | -3.3194540062921383 | 2.2312040062921379 | 0.9994823019455810 | Lack of trust-Cannot get one | 5.550658012584276 |
| 0.5403749999999992 | -2.2349540062921389 | 3.3157040062921372 | 0.9995080101246836 | Lack of documentation-Financial institution too far away | 5.550658012584276 |
#Plot residuals
ggplot(data=as.data.frame(qqnorm( no.account.anova$residuals , plot=F)), mapping=aes(x=x, y=y)) +
geom_point() + geom_smooth(method="lm", se=FALSE) +
scale_y_continuous(name = "Sample Quantiles") +
scale_x_discrete(name = "Theoretical Quantiles") +
ggtitle("Normal Q-Q Plot")
- Data meets all three conditions of ANOVA.
- The normal plot of the residuals shows data points lie pretty close to the line. Some deviation is noticed near the ends. This suggests data is nearly normal.
- As p-value of the test is equal to 0.0000000000000002, indicating the evidence is strong enough to reject the null hypothesis, \(H_0\) at a significance level of \(\alpha = 0.05\). That is, the data provides strong evidence that there is difference in reasons for not having an account with financial institution across groups.
Output of Tukey Honest Significant Differences(TukeyHSD) function suggests the difference between reason pairs Lack of money-Cannot get one has a p-value of 0.0000000000000211 and is statistically significant. At 95% confidence interval, reason pair difference is between 15.75% and 21.30% with Lack of money averaging 18.52% higher. That means, there is 18.52% higher chance, for a response group not having an account with financial institution is because of Lack of money.
For developing predictive model, - 70% of the dataset will be used as training data and 30% as test data. - As output will be bi-variate, Generalized Linear Model(GLM) is used.
#Get complete data
india.account <- india.data %>%
filter(!is.na(age)) %>%
select (account, gender, ageGroup, incomeGroup, eduGroup)
#There are only two possible outcomes person can have account or not
india.account$account <- ifelse(india.account$account == 1, 1, 0)
set.seed(2005) #So results can be repeated
#Create sample data
samIdx <- sample(2,nrow(india.account), replace = T, prob = c(0.70,0.30))
#Training data
india.train <- india.account[samIdx == 1,]
#Test data
india.test <- india.account[samIdx == 2,]
#Run GLM on training data
india.glm <- glm(account ~ gender + ageGroup + incomeGroup + eduGroup, data = india.account, family = binomial(link = "logit"))
#Make prediction using test data
india.predict <- predict(india.glm, india.test, type = "response")
summary(india.glm)
##
## Call:
## glm(formula = account ~ gender + ageGroup + incomeGroup + eduGroup,
## family = binomial(link = "logit"), data = india.account)
##
## Deviance Residuals:
## Min 1Q Median 3Q
## -2.448372543129 -1.116906853565 0.547292268250 1.020832477542
## Max
## 1.747078967946
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 2.22725433507954 0.23536558027249 9.46296
## genderMale -0.57425698102466 0.07920453566401 -7.25030
## ageGroup36 - 45 0.43238854578185 0.11731065075817 3.68584
## ageGroup46 - 55 0.39147864570844 0.13627812831866 2.87264
## ageGroup56 - 65 0.16310283488844 0.14539385801954 1.12180
## ageGroupAbove 65 -0.04206682128494 0.20924925554456 -0.20104
## ageGroupBelow 26 -0.69299605585285 0.11269337452207 -6.14939
## incomeGroupMiddle 20% -0.27859272917531 0.11822816213385 -2.35640
## incomeGroupPoorest 20% -0.29025963641623 0.13618298824400 -2.13139
## incomeGroupRichest 20% 0.28640843483957 0.12342299954692 2.32054
## incomeGroupSecond 20% -0.28674532984093 0.12195233837860 -2.35129
## eduGroupHS or Less -1.95078559495004 0.21651664866703 -9.00986
## eduGroupSome College -1.08226620863156 0.21664080767702 -4.99567
## Pr(>|z|)
## (Intercept) < 0.000000000000000222 ***
## genderMale 0.00000000000041584 ***
## ageGroup36 - 45 0.00022795 ***
## ageGroup46 - 55 0.00407052 **
## ageGroup56 - 65 0.26194747
## ageGroupAbove 65 0.84066972
## ageGroupBelow 26 0.00000000077779134 ***
## incomeGroupMiddle 20% 0.01845308 *
## incomeGroupPoorest 20% 0.03305668 *
## incomeGroupRichest 20% 0.02031150 *
## incomeGroupSecond 20% 0.01870844 *
## eduGroupHS or Less < 0.000000000000000222 ***
## eduGroupSome College 0.00000058631532198 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4115.8501258355 on 2998 degrees of freedom
## Residual deviance: 3721.2967784569 on 2986 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 3747.2967784569
##
## Number of Fisher Scoring iterations: 4
The summary shows all the variables are significant.
#Generate prediction output
india.predict.info <- rep(0,length(which(samIdx == 2)))
india.predict.info[india.predict >=0.5] <- 1
cm <- confusionMatrix(india.predict.info, india.test$account)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 224 143
## 1 184 363
##
## Accuracy : 0.6422319474836
## 95% CI : (0.6101899657424, 0.6733537552102)
## No Information Rate : 0.5536105032823
## P-Value [Acc > NIR] : 0.00000003258372921716
##
## Kappa : 0.2690289034871
## Mcnemar's Test P-Value : 0.02696631569637
##
## Sensitivity : 0.5490196078431
## Specificity : 0.7173913043478
## Pos Pred Value : 0.6103542234332
## Neg Pred Value : 0.6636197440585
## Prevalence : 0.4463894967177
## Detection Rate : 0.2450765864333
## Detection Prevalence : 0.4015317286652
## Balanced Accuracy : 0.6332054560955
##
## 'Positive' Class : 0
##
fourfoldplot(cm$table, color = c("#CC6666", "#99CC99"),conf.level = 0, margin = 1, main = "Mis-classification")
- As output variable can have two possible outcomes, respondent can have an account or not. I have used Generalized Linear Model(GLM) binomial family and logit link.
- Summary of the model suggests all the variables contribute significantly. Gender, age, and education contribute more than income group.
- The accuracy of the model is 64%.