NLB_StatisticalAnalysis

#install.packages(ggplot2)
library(ggplot2)
#install.packages("dplyr")
library(dplyr)
#install.packages("Hmisc")
library(Hmisc)
#install.packages("factoextra")
library(factoextra)
#install.packages("cluster")
library(cluster)
#install.packages("magrittr")
library(magrittr)
#install.packages("NbClust")
library(NbClust)
#install.packages("tidyr")
library(tidyr)
#install.packages("rstatix")
library(rstatix)
#install.packages("corrplot")
library(corrplot)
#install.packages("car")
library(car)
#install.packages("GGally")
library(GGally)
#install.packages("factoextra")
library(factoextra)

We started by activating the libraries I might need for the analysis.

data <- read.table("./anketa_final_6.csv", header=TRUE, sep=",", dec=".")
head(data)

Import the data from the survey.

data_seg <- as.data.frame(data[c("ID","Q21","Q56", "Q58", "Q65", "Q7a_1")])
summary(data_seg[,-1])

##       Q21             Q56             Q58             Q65       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:5.000   1st Qu.:2.000  
##  Median :3.000   Median :5.000   Median :6.000   Median :5.000  
##  Mean   :3.585   Mean   :4.575   Mean   :5.594   Mean   :4.406  
##  3rd Qu.:5.750   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##      Q7a_1      
##  Min.   :1.000  
##  1st Qu.:3.000  
##  Median :6.000  
##  Mean   :4.972  
##  3rd Qu.:7.000  
##  Max.   :7.000

We created a new data frame that includes only the clustering variables that we chose.

Q21: I prefer paying with cash over a card. (Likert scale 1-7)
Q56: How concerned are you about fraud when using digital payments? (Likert scale 1-7)
Q58: How much do you trust banks to securely process your digital transactions? (Likert scale 1-7)
Q65: To what extent do you agree with the statement: ‘Digital payments will completely replace cash in the future’? (Likert scale 1-7)
Q7a_1: To what extent do you think the following payment methods allow for tracking expenses? Cash (1 - do not allow expense tracking at all, 7 - fully allow expense tracking)

ggpairs(data_seg[, -1])

The correlation between the selected variables is below 0.4 which was selected as the cutpoint to avoid multicolinearity.

data_seg_std <- as.data.frame(scale(data_seg[c("Q21","Q56", "Q58", "Q65", "Q7a_1")]))
head(data_seg_std)

Scaled the data, not necessary.

data_seg$Dissimilarity = sqrt(data_seg_std$Q21^2 + data_seg_std$Q56^2 + data_seg_std$Q58^2 + data_seg_std$Q65^2 + data_seg_std$Q7a_1^2)

head(data_seg[order(-data_seg$Dissimilarity), c("ID", "Dissimilarity")], 15)

data <- data %>%
  filter(!ID %in% c("2"))

data <- data %>%
  mutate(ID = row_number())

data_seg <- as.data.frame(data[c("ID","Q21","Q56", "Q58", "Q65", "Q7a_1")])

data_seg_std <- as.data.frame(scale(data_seg[c(2:6)]))
head(data_seg_std)

Checked for outliers with dissimilarity and decided to remove ID2, then reset the ID order.

get_clust_tendency(data_seg_std,
                   n = nrow(data_seg_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.5801586
## 
## $plot
## NULL

Hopkins statistics is above 0.5 (0.58 > 0.50), hence we can conclude that data is clusterable. If it would be greater, closer to 1, it would be even more appropriate. Now the next question is how many clusters to use, so we will check this.

How many clusters to use?

WARD <- data_seg_std %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 105

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the dendrogram, We would choose 2 clusters, as there is the biggest jump in vertical line.

Distance <- get_dist(data_seg_std, 
                     method = "euclidian")

fviz_dist(Distance, 
          gradient = list(low = "darkred", 
                          mid = "grey95", 
                          high = "white"))

From the matrix of distances displayed above, we can see that some groups of clusters are forming, and we see 3 or 4 groups.

fviz_nbclust(data_seg_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

With the elbow method, We can see the slope changes the most at 2 and 3, so both are possible options for number of clusters.

fviz_nbclust(data_seg_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

The highest value of the Silhouette analysis is at 2, so this is the suggested numbers of clusters.

Most of the methods suggested 2 as the number of clusters, but as it was not allowed to go on with that small number of clusters, We decided to check the dendogram for the second best option which was 3 clusters and we decided to proceed with 3 from now on.

K-Means Clustering

Clustering <- kmeans(data_seg_std, 
                     centers = 3,
                     nstart = 25)

Clustering

## K-means clustering with 3 clusters of sizes 35, 40, 30
## 
## Cluster means:
##          Q21         Q56        Q58        Q65      Q7a_1
## 1 -0.5836674 -0.47655561  0.2685342  0.1923019 -1.1618365
## 2 -0.2604055 -0.01505602  0.4929521  0.3381673  0.6697987
## 3  1.0281527  0.57605624 -0.9705594 -0.6752419  0.4624109
## 
## Clustering vector:
##   [1] 3 3 3 3 2 2 1 2 1 1 2 1 3 1 2 2 3 1 3 1 3 2 1 2 2 3 2 1 2 2 2 2 3 3 2 2 1
##  [38] 1 1 1 2 1 3 3 1 3 2 2 1 2 2 2 2 1 2 2 2 2 3 1 2 3 1 2 1 1 2 2 3 2 1 2 1 3
##  [75] 1 2 3 2 2 1 1 3 3 3 3 1 1 3 2 2 3 1 1 1 2 1 1 3 3 3 1 2 3 1 3
## 
## Within cluster sum of squares by cluster:
## [1]  97.44233 106.40500 120.23515
##  (between_SS / total_SS =  37.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Here we performed K-Means clustering. Biggest ratio of final leaders is 37.7%, which explains the total variability of clustering the 5 variables into 3 groups, and the other 62.3% is not explained.

fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = TRUE,
             ggtheme = theme_bw(),
             labelsize = 8,
             data = data_seg_std)

With the help of Principal Component Analysis around 56.7% (38.1+18.6) of information is showed when combining 3 variables into 2 dimensions.

After the first iteration of the cluster we decided to remove ID 26 as it looks as an outlier.

data <- data %>%
  filter(!ID %in% c("26"))

data <- data %>%
  mutate(ID = row_number())

data_seg <- as.data.frame(data[c("ID","Q21","Q56", "Q58", "Q65", "Q7a_1")])

data_seg_std <- as.data.frame(scale(data_seg[c(2:6)]))
head(data_seg_std)

Clustering <- kmeans(data_seg_std, 
                     centers = 3,
                     nstart = 25)

Clustering

## K-means clustering with 3 clusters of sizes 40, 29, 35
## 
## Cluster means:
##          Q21          Q56        Q58        Q65      Q7a_1
## 1 -0.2725869 -0.002123452  0.4891783  0.3501446  0.6603701
## 2  1.0959873  0.564728489 -0.9752117 -0.7291629  0.5018437
## 3 -0.5965759 -0.465491088  0.2489716  0.2039984 -1.1705220
## 
## Clustering vector:
##   [1] 2 2 2 2 1 1 3 1 3 3 1 3 2 3 1 1 2 3 2 3 2 1 3 1 1 1 3 1 1 1 1 2 2 1 1 3 3
##  [38] 3 3 1 3 2 2 3 2 1 1 3 1 1 1 1 3 1 1 1 1 2 3 1 2 3 1 3 3 1 1 2 1 3 1 3 2 3
##  [75] 1 2 1 1 3 3 2 2 2 2 3 3 2 1 1 2 3 3 3 1 3 3 2 2 2 3 1 2 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 108.2046 105.3258 100.5716
##  (between_SS / total_SS =  39.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Here I performed K-Means clustering again, without removed unit. As we can see the ratio between SS between and SS total in now a bit bigger. The biggest ratio of final leaders is now 39.0%, which explains the total variability of clustering the 5 variables into 3 groups, and the other 61.0% is not explained.

fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = TRUE,
             ggtheme = theme_bw(),
             labelsize = 8,
             data = data_seg_std)

We decided to keep this clusters.

Group profiles

Averages <- Clustering$centers
Averages

##          Q21          Q56        Q58        Q65      Q7a_1
## 1 -0.2725869 -0.002123452  0.4891783  0.3501446  0.6603701
## 2  1.0959873  0.564728489 -0.9752117 -0.7291629  0.5018437
## 3 -0.5965759 -0.465491088  0.2489716  0.2039984 -1.1705220

Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

Figure <- pivot_longer(Figure, cols = c("Q21","Q56", "Q58", "Q65", "Q7a_1"))

Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3), 
                       labels = c("Explorers", "Traditionalists", "Adopters"))

Figure$ImeF <- factor(Figure$name, 
              levels = c("Q21","Q56", "Q58", "Q65", "Q7a_1"), 
              labels = c("CashCardPref","FraudConcern", "TrustBanks", "FutureDigDom", "CashTrack"))


ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette="Set1") +
  ylim(-1.5, 1.5) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

The plot shows the averages of the clusters, this already gives us an idea of how these groups are composed, it also shows that there is differences between each other in all of the variables that were selected.

We will use this profiles of groups at the end for my conclusion.

data$Group <- Clustering$cluster

data_seg$Group <- Clustering$cluster

Assigned the groups to the observations in the two data frames that we will be using.

Appropriateness of cluster variables used

fit <- aov(cbind(Q21, Q56, Q58, Q65, Q7a_1) ~ as.factor(Group), 
             data = data)

summary(fit)

##  Response Q21 :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 225.15 112.576  48.131 2.082e-15 ***
## Residuals        101 236.23   2.339                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Q56 :
##                   Df  Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(Group)   2  55.223 27.6116  9.8651 0.000122 ***
## Residuals        101 282.690  2.7989                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Q58 :
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(Group)   2 52.885  26.442  31.184 2.84e-11 ***
## Residuals        101 85.644   0.848                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Q65 :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 103.02  51.511  13.541 6.165e-06 ***
## Residuals        101 384.20   3.804                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Q7a_1 :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 305.57 152.784  121.17 < 2.2e-16 ***
## Residuals        101 127.35   1.261                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response for Q21 (preference of paying with cash over card):

\(H0: μ(Q21, G1) = μ(Q21, G2) = μ(Q21, G3)\)

\(H1\): At least one μ(Q21, j) is different.

We can reject H0 at p < 0.001. We can reject H0 for all cluster variables at p < 0.001. Therefore we can assume that the groups are statistically different in the mean values of the cluster variables.

data %>%
  group_by(as.factor(data$Group)) %>%
  shapiro_test(Q50a)

\(H0\): Q50a is normally distributed in G1.

\(H1\): Q50a is not normally distributed in G1.

We can reject H0.

\(H0\):Q50a is normally distributed in G2.

\(H1\): Q50a is not normally distributed in G2.

We can reject H0.

\(H0\): Q50a is normally distributed in G3.

\(H1\): Q50a is not normally distributed in G3.

We can reject H0.

kruskal.test(Q50a ~ Group, 
             data = data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Q50a by Group
## Kruskal-Wallis chi-squared = 26.973, df = 2, p-value = 1.389e-06

Q50a: how often do you use cash (as a payment method)

\(H0\): Location distributions of Q50a are the same for all groups.

\(H1\): Location distributions of Q50a are not the same for all groups.

We reject H0 at p<0.001, therefore at least one location distribution is different from the other.

kruskal_effsize(Q50a ~ Group, 
                data = data)

We reject H0 at p < 0.001, therefore at least one location distribution is different from the other, meaning We trust that our clusters are meaningful. All in all, our result is validated (effect size is large).

data %>%
  group_by(Group) %>%
  shapiro_test(Q46)

Q46: What percentage of all your transactions is carried out through digital payments compared to cash?

\(H0\): Q46 is normally distributed in G1.

\(H1\): Q46 is not normally distributed in G1.

We can reject H0.

\(H0\):Q46 is normally distributed in G2.

\(H1\): Q46 is not normally distributed in G2.

We cannot reject H0.

\(H0\): Q46 is normally distributed in G3.

\(H1\): Q46 is not normally distributed in G3.

We can reject H0.

kruskal.test(Q46 ~ Group, 
             data = data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Q46 by Group
## Kruskal-Wallis chi-squared = 24.21, df = 2, p-value = 5.531e-06

\(H0\): Location distributions of Q46 are the same for all groups.

\(H1\): Location distributions of Q46 are not the same for all groups.

We reject H0 at p<0.001, therefore at least one location distribution is different from the other.

kruskal_effsize(Q46 ~ Group, 
                data = data)

We reject H0 at p < 0.001, therefore at least one location distribution is different from the other, meaning I trust that my clusters are meaningful. All in all, my result is validated (effect size is large).

I did validation with two different variables, one categorical and one numerical, as the numerical showed a violation in normality within one of the groups I also ran a non-parametrical test, both validations were succesful, this was the last step to prove that the clustering was successful.

data$Q39F <- factor(data$Q39,
                    levels = c(1, 2),
                    labels = c("Man", "Woman"))

chi_square <- chisq.test(data$Q39F, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q39F and as.factor(data$Group)
## X-squared = 2.6015, df = 2, p-value = 0.2723

Q39F: What is your gender?

\(H0\): There is no association between variables Man and Woman in 3 groups.

\(H1\): There is association between variables Man and Woman in 3 groups.

We cannot reject H0, hence there is no association between the two variables in our 3 groups.

data$Q40G <- ifelse(data$Q40 < 1951, "0", "1")


data$Q40F <- factor(data$Q40G,
                    levels = c(0, 1),
                    labels = c("Less_75", "More_75"))

chi_square <- chisq.test(data$Q40F, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q40F and as.factor(data$Group)
## X-squared = 5.0269, df = 2, p-value = 0.08099

\(H0\): There is no association between variables age less than 75 and age more than 75.

\(H1\): There is association between variables age less than 75 and age more than 75.

We cannot reject H0, hence there is no association between the two variables.

group_count_age <- table(data$Q40F, data$Group)
print(group_count_age)

##          
##            1  2  3
##   Less_75 17  7 18
##   More_75 23 22 17

We divided the group age in two, people that are 75 or older and the second group that is 74 and younger. The table above shows the number of people that are younger than 75 and the number of people older than 75 in different clusters. In cluster 1 for example, 17 people are younger than 75. And we can clearly see that in the group of people that are less accepting of digital payments there is a vast majority of people above the threshold of 75 years.

data$Q41G <- ifelse(data$Q41 < 5, "0", "1")


data$Q41F <- factor(data$Q41G,
                    levels = c(0, 1),
                    labels = c("Lower_ed", "Higher_ed"))

chi_square <- chisq.test(data$Q41F, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q41F and as.factor(data$Group)
## X-squared = 20.108, df = 2, p-value = 4.301e-05

Q41: What is your highest level of education attained?

\(H0\): There is no association between variables lower education and higher education.

\(H1\): There is association between variables lower education and higher education.

We can reject H0 (P<0.001), hence there is association between the two variables.

group_count_ed <- table(data$Q41F, data$Group)
print(group_count_ed)

##            
##              1  2  3
##   Lower_ed  10 18  4
##   Higher_ed 30 11 31

The table above shows the number of people that have lower education and number of people that have higher education.

data$Q44G <- ifelse(data$Q44 > 1, "0", "1")


data$Q44F <- factor(data$Q44G,
                    levels = c(0, 1),
                    labels = c("Outside", "City"))

chi_square <- chisq.test(data$Q44F, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q44F and as.factor(data$Group)
## X-squared = 3.1874, df = 2, p-value = 0.2032

Q44: Where do you currently reside?

\(H0\): There is no association between variables Outside and City.

\(H1\): There is association between variables Outside and City.

We cannot reject H0 (P>0.05), hence we cannot say there is association between the two variables.

group_count_region <- table(data$Q44F, data$Group)
print(group_count_region)

##          
##            1  2  3
##   Outside 19 20 19
##   City    21  9 16

The table above shows the number of people that live in the city and the number of people that live outside the city.

data$Q45G <- ifelse(data$Q45 < 2, "0", "1")


data$Q45F <- factor(data$Q45G,
                    levels = c(0, 1),
                    labels = c("NLB", "Other"))

chi_square <- chisq.test(data$Q45F, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q45F and as.factor(data$Group)
## X-squared = 0.30967, df = 2, p-value = 0.8566

Q45: Which bank do you primarily use?

\(H0\): There is no association between variables NLB and Other banks.

\(H1\): There is association between variables NLB and Other banks.

We cannot reject H0 (P>0.05), hence we cannot say there is association between the two variables.

data$Q51F <- factor(data$Q51,
                    levels = c(1, 2, 3, 4, 5),
                    labels = c("SeveralWeek", "OnceWeek", "TwoThreeMonth", "OnceMonth", "Rarely"))

chi_square <- chisq.test(data$Q51F, as.factor(data$Group))

## Warning in chisq.test(data$Q51F, as.factor(data$Group)): Chi-squared
## approximation may be incorrect

chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q51F and as.factor(data$Group)
## X-squared = 6.2798, df = 8, p-value = 0.6159

addmargins(round(chi_square$expected, 2))

##                as.factor(data$Group)
## data$Q51F           1     2     3 Sum
##   SeveralWeek    0.38  0.28  0.34   1
##   OnceWeek       5.00  3.62  4.38  13
##   TwoThreeMonth 17.31 12.55 15.14  45
##   OnceMonth     10.00  7.25  8.75  26
##   Rarely         7.31  5.30  6.39  19
##   Sum           40.00 29.00 35.00 104

Since more than allowed expected frequencies are under 5 (<5), We will try to group some variables.

data$Q51F <- factor(data$Q51,
                    levels = c(1, 2, 3, 4, 5),
                    labels = c("Several times a month", "Several times a month", "Several times a month", "Once a month or less", "Once a month or less"))

chi_square <- chisq.test(data$Q51F, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q51F and as.factor(data$Group)
## X-squared = 1.4601, df = 2, p-value = 0.4819

After trying different combinations of grouping We couldn’t find an association between the frequency of cash withdrawal and the clusters.

data$Q55cF <- factor(data$Q55c,
                    levels = c(0, 1),
                    labels = c("Not a concern", "Concern"))

chi_square <- chisq.test(data$Q55cF, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q55cF and as.factor(data$Group)
## X-squared = 5.4779, df = 2, p-value = 0.06464

Q55c: DO difficulties with using technologies concern you regarding the digital payments?

\(H0\): There is no association between variables Not a concern and Concern.

\(H1\): There is association between variables Not a concern and Concern.

We cannot reject H0 (P>0.05), hence we cannot say there is association between the two variables.

group_count_difficulties <- table(data$Q55cF, data$Group)
print(group_count_difficulties)

##                
##                  1  2  3
##   Not a concern 29 13 22
##   Concern       11 16 13

The table above shows the number of a group that does not have concern about difficulties using technologies and number of a group that has concern about difficulties using technologies.

After testing if there is an association between the concern that seniors have about the difficulties of using technology we found that there is an association as people from the Cash Conscious Group has more people concerned about this fact, while in the other two groups seniors don’t believe that the difficulty of using technology is a concern.

data$Q62F <- factor(data$Q62,
                    levels = c(1, 2),
                    labels = c("Yes", "No"))

chi_square <- chisq.test(data$Q62, as.factor(data$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  data$Q62 and as.factor(data$Group)
## X-squared = 0.62055, df = 2, p-value = 0.7332

Q62: Would additional benefits or improvements encourage you to use digital banking more?

\(H0\): There is no association between variables Yes and No.

\(H1\): There is association between variables Yes and No.

We cannot reject H0 (P>0.05), hence we cannot say there is association between the two variables.

group_count_incentives <- table(data$Q62F, data$Group)
print(group_count_incentives)

##      
##        1  2  3
##   Yes 20 12 15
##   No  20 17 20

The table above shows the number of people who would be encouraged by additional benefits and number of people that would not be encouraged by additional benefits to use digital banking more.

Extrinsic incentives don’t appear to have any association with the groups, this also shows that even people who are currently not into digital payments wouldn’t be encouraged by incentives.

kruskal.test(Q64 ~ Group, 
             data = data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Q64 by Group
## Kruskal-Wallis chi-squared = 6.7353, df = 2, p-value = 0.03447

Q64: To what extent are you interested in education about digital payments?

\(H0\): Location distributions of Q64 are the same for all groups.

\(H1\): Location distributions of Q64 are not the same for all groups.

We reject H0 at p<0.04, therefore at least one location distribution is different from the other.

kruskal_effsize(Q64 ~ Group, 
                data = data)

We reject H0 at p < 0.04, therefore at least one location distribution is different from the other, meaning I trust that my clusters are meaningful. All in all, my result is validated (effect size is small).

group_count_edu <- table(data$Q64, data$Group)
print(group_count_edu)

##    
##      1  2  3
##   1  2  5  2
##   2  4  3  2
##   3  3  7  3
##   4  2  3  3
##   5  9  3 13
##   6  8  3  8
##   7 12  5  4

The table above shows how interested in education about digital payments are the people in each group. We can see that people in group 1 are the most interested in education, and also people in group 3 are more interested in education than people in group 2.

library(ggplot2)
library(dplyr)

group_count_df <- as.data.frame(group_count_edu)
colnames(group_count_df) <- c("Q64", "Cluster", "Count")

ggplot(group_count_df, aes(x = Q64, y = Count, fill = as.factor(Cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Response to Q64", y = "Count", fill = "Cluster Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1))

avg_response_per_group <- data %>%
  group_by(Group) %>%
  summarise(Average_Q64 = mean(Q64))

print(avg_response_per_group)

## # A tibble: 3 × 2
##   Group Average_Q64
##   <int>       <dbl>
## 1     1        5.1 
## 2     2        3.86
## 3     3        4.8

The average interest between the three groups is statistically different, we can see that people from Group 1 are much less interested in average than the other two groups.

Conclusion

Cluster 1: Traditionalists (Cash is king)

This group consists of individuals who strongly prefer using cash over digital payments, and they make up approximately 35% of the total sample. Their average score on the “preference for cash over card” variable is 5.1 out of 7, so relatively high, indicating that a significant proportion of them still rely on traditional payment methods. Their trust in banks for processing digital transactions is relatively low, with 60% indicating a lack of confidence in financial institutions handling their digital transactions securely.

Fraud concerns are moderately high in this cluster, meaning that a substantial percentage of individuals worry about the risks associated with online transactions. This could explain why they are hesitant to transition to digital payments. Additionally, their belief that digital payments will fully replace cash in the future is quite low, (only 25%) reinforcing their resistance to the shift toward cashless economies. Their responses indicate that they are also less likely to track their expenses using digital payment methods, showing a preference for more private and untracked financial transactions. This cluster has the highest number of people over the age of 75, and 75% of them obtained a higher level of education (Bachelor’s level or PhD). Interestingly 72% expressed that they have no concerns using technology. Live outside the city.

Cluster 2: The Digital Explorers (Chill guy)

The second cluster, about 40% of the total sample, represents a middle ground between traditional cash users and fully digital adopters. Their preference for cash is lower than Cluster 1, with an average score of 3.86 out of 7, but they still exhibit some reluctance toward fully embracing digital payments. Roughly 50% of individuals in this cluster express moderate trust in banks, but about 40% still have concerns regarding fraud in online transactions, meaning that while they may use digital transactions, they do not fully rely on financial institutions without concern. Fraud concerns remain noticeable but not extreme, indicating that some individuals in this cluster still have reservations about the security of digital payments.

When asked about whether digital payments will completely replace cash in the future, their responses are mixed, with many neither fully agreeing nor disagreeing - 45% believe cash will eventually be replaced, while the remaining 55% are uncertain or disagree. Expense tracking behavior is more balanced in this group, meaning they may use digital methods occasionally but are not as committed to them as those in Cluster 3. This cluster could be composed of individuals in transition— almost 80% of the cluster is people over the age of 75 that primarily live outside of the city. They mainly have higher education levels completed. This group is definitely a potential target market, with their open minds and willingness to learn. Slightly over 50% of this group expressed that they have difficulties about using technology.

Cluster 3: The Digital Adopters (Tech-Savvy Boomers)

The third cluster, making up 25% of the total sample, consists of individuals who are the most open to digital payment methods. Their trust in banks to securely process digital transactions is the highest among all three clusters, with 80% expressing confidence in financial institutions to securely process digital transactions. Fraud concerns are significantly lower than in the other clusters, with only 20% of individuals in this group reporting substantial worries about online payment security, meaning that they feel relatively safe using online payment methods.

These individuals are also the most likely to believe that digital payments will eventually replace cash, with a substantial percentage of 75% agreeing with this statement, making them the most forward-thinking group in terms of financial technology. Additionally, they actively use digital tools to track their expenses, indicating a preference for financial transparency and control. Their responses suggest that they are highly engaged with digital financial systems and are likely to use modern banking and fintech services. This group is made up of people over or under 75 and is equally distributed in this group, but an overwhelming 89% of individuals have a higher level of education. The majority also expressed they have no concerns or difficulties using technology.

Validation and Statistical Significance:

Cluster differences were validated using ANOVA and Kruskal-Wallis tests, confirming that all five clustering variables significantly differentiate the groups. Additional variables (e.g., frequency of cash use, percentage of digital transactions, education level, and technology concerns) provided further insights into cluster characteristics. Demographic factors such as age and education showed associations with cluster membership, whereas gender, location, and primary bank choice did not significantly differentiate the groups. Effect size calculations confirmed that differences in key variables were meaningful.

Hypothesis testing

Hypothesis 1:

H0: Seniors don’t prefer cash over card.

H1: Seniors prefer cash over card.

describe(data$Q21)

## data$Q21 
##        n  missing distinct     Info     Mean      Gmd 
##      104        0        7    0.968    3.577    2.382 
##                                                     
## Value          1     2     3     4     5     6     7
## Frequency     18    27    15     7    11    10    16
## Proportion 0.173 0.260 0.144 0.067 0.106 0.096 0.154
## 
## For the frequency table, variable is rounded to the nearest 0

shapiro.test(data$Q21)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Q21
## W = 0.87285, p-value = 5.795e-08

boxplot(data$Q21, main = "Distribution of Q21 Responses", ylab = "Preference for Cash")

wilcox_test_h1 <- wilcox.test(data$Q21, mu = 3.99, alternative = "greater")
print(wilcox_test_h1)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$Q21
## V = 2535, p-value = 0.7378
## alternative hypothesis: true location is greater than 3.99

\(H0\): Location is not greater than 3.99

\(H1\): Location is greater than 3.99

Because we don’t reject the null hypothesis now we know that not the majority of seniors prefer the use of cash over cards. This differs between the previously created clusters.

Hypothesis 2:

H0: What seniors spend on will not influence their preferred payment option.

H1: What seniors spend on will influence their preferred payment option.

data$Groceries <- factor(data$Q52a, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

data$Household <- factor(data$Q52b, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

data$Pharmacy <- factor(data$Q52c, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

data$Entertainment <- factor(data$Q52d, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

data$Transportation <- factor(data$Q52e, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

data$PersonalCare <- factor(data$Q52f, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

data$Technology <- factor(data$Q52g, 
                          levels = c(1, 2, 3, 4), 
                          labels = c("Cash", "Card", "Online", "Online"))

As the intention is to run a chi-square test we decided to group together Mobile App and Online Banking under one factor.

contingency_table <- data %>%
  select(Groceries, Household, Pharmacy, Entertainment, Transportation, PersonalCare, Technology) %>%
  pivot_longer(cols = everything(), names_to = "SpendingCategory", values_to = "PaymentMethod") %>%
  table()

print(contingency_table)

##                 PaymentMethod
## SpendingCategory Cash Card Online
##   Entertainment    35   33     36
##   Groceries        26   72      6
##   Household         9   17     78
##   PersonalCare     65   35      4
##   Pharmacy         25   73      6
##   Technology       13   67     24
##   Transportation   37   57     10

chi_square_test <- chisq.test(contingency_table)
print(chi_square_test)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 310.98, df = 12, p-value < 2.2e-16

\(H0\):There is no association between spending categories and preferred payment methods.

\(H1\): There is association between spending categories and preferred payment methods.

We rejected the null hypothesis, therefore we can say that the hypothesis we posed was correct.

Hypothesis 3:

H0: Seniors are not afraid of suffering fraud with digital payments.

H1: Seniors are afraid of suffering fraud with digital payments.

describe(data$Q56)

## data$Q56 
##        n  missing distinct     Info     Mean      Gmd 
##      104        0        7    0.967    4.529    2.048 
##                                                     
## Value          1     2     3     4     5     6     7
## Frequency      6    12    17     7    25    22    15
## Proportion 0.058 0.115 0.163 0.067 0.240 0.212 0.144
## 
## For the frequency table, variable is rounded to the nearest 0

shapiro.test(data$Q56)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Q56
## W = 0.91403, p-value = 4.717e-06

boxplot(data$Q56, main = "Distribution of Q56 Responses", ylab = "Fraud concern")

wilcox_test_h3 <- wilcox.test(data$Q56, mu = 3.99, alternative = "greater")
print(wilcox_test_h3)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$Q56
## V = 4003, p-value = 1.713e-05
## alternative hypothesis: true location is greater than 3.99

\(H0\): Location is not greater than 3.99

\(H1\): Location is greater than 3.99

Since we can reject the null hypothesis we can say that the majority of seniors are concerned about fraud. As this variable was also used as a clustering variable we were able to observe that there is great variability among seniors, there is a group who is much more concerned than the other two.

Hypothesis 4:

H0: Seniors don’t trust themselves to do digital banking.

H1: Seniors don’t trust themselves to do digital banking.

describe(data$Q59)

## data$Q59 
##        n  missing distinct     Info     Mean      Gmd 
##      104        0        7     0.94    5.077    1.793 
##                                                     
## Value          1     2     3     4     5     6     7
## Frequency      4     9     8     5    24    36    18
## Proportion 0.038 0.087 0.077 0.048 0.231 0.346 0.173
## 
## For the frequency table, variable is rounded to the nearest 0

shapiro.test(data$Q59)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Q59
## W = 0.85584, p-value = 1.21e-08

boxplot(data$Q59, main = "Distribution of Q59 Responses", ylab = "Confidence to use digital banking")

wilcox_test_h4 <- wilcox.test(data$Q59, mu = 3.99, alternative = "greater")
print(wilcox_test_h4)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$Q59
## V = 4668, p-value = 1.226e-10
## alternative hypothesis: true location is greater than 3.99

\(H0\): Location is not greater than 3.99

\(H1\): Location is greater than 3.99

Since we can reject the null hypothesis we can say that the majority of seniors are confident enough to use digital banking, rejecting our hypothesis.

data %>%
  group_by(Group) %>%
  shapiro_test(Q59)

kruskal.test(Q59 ~ Group, 
             data = data)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Q59 by Group
## Kruskal-Wallis chi-squared = 18.937, df = 2, p-value = 7.724e-05

\(H0\): Location distributions of Q59 are the same for all groups.

\(H1\): Location distributions of Q59 are not the same for all groups.

We reject H0 at p<0.001, therefore at least one location distribution is different from the other.

kruskal_effsize(Q59 ~ Group, 
                data = data)

avg_Q59_group <- data %>%
  group_by(Group) %>%
  summarise(Average_Q59 = mean(Q59))

print(avg_Q59_group)

## # A tibble: 3 × 2
##   Group Average_Q59
##   <int>       <dbl>
## 1     1        5.5 
## 2     2        3.93
## 3     3        5.54

We checked to determine if there is a difference between how the clusters answered to this question, we found that there is a difference between the groups in how confident are they to use digital payments.

Hypothesis 5

H0: Seniors are not more inclined to use digital payments if given incentives.

H1: Seniors are more inclined to use digital payments if given incentives.

summary(data$Q62F)

## Yes  No 
##  47  57

We can see without any statistical test that there is no inclination towards incentives.

NLB_StatisticalAnalysis

Group 6

How many clusters to use?

K-Means Clustering

Group profiles

Appropriateness of cluster variables used

Conclusion

Hypothesis testing

Hypothesis 1:

H0: Seniors don’t prefer cash over card.

H1: Seniors prefer cash over card.

Hypothesis 2:

H0: What seniors spend on will not influence their preferred payment option.

H1: What seniors spend on will influence their preferred payment option.

Hypothesis 3:

H0: Seniors are not afraid of suffering fraud with digital payments.

H1: Seniors are afraid of suffering fraud with digital payments.

Hypothesis 4:

H0: Seniors don’t trust themselves to do digital banking.

H1: Seniors don’t trust themselves to do digital banking.

Hypothesis 5

H0: Seniors are not more inclined to use digital payments if given incentives.

H1: Seniors are more inclined to use digital payments if given incentives.