Analysis:

If you look at each of the 19 questions in the google form, you will find that each question has 5 options. Those are:

Extremely important.
Quite important.
Moderately important.
Slightly important.
Not important at all.

I code these options as

Extremely important : 5.
Quite important : 4.
Moderately important : 3.
Slightly important : 2.
Not important at all : 1.

Now I have a 71 \(\times\) 19 data matrix. The entries of this data matrix are postive integers between 1 to 5. Let’s take a look at this data matrix.

d = read.csv("summer_project.csv")
datatable(d, autoHideNavigation=TRUE, fillContainer=FALSE)

Let’s look at the data in a different way.

d1 = read.csv("summer_project_1.csv")
d1$Responses = d1$x
d1 = d1[,-2]
dd = d1 %>% group_by(Questions)
datatable(dd, autoHideNavigation = TRUE, fillContainer = FALSE)

Before considering the statistical nature of the data it is worth considering the potential for unseen biases in the scoring. If an online form presents the scale as a series of checkboxes to choose between, the response could depend on the order in which the choices are listed. If a slider is used on an online form it may provoke different responses than would be chosen using checkboxes. The wording in which the question is framed may lead to aquiescence bias or social desirability bias. Some forms of wording may deter respondents from choosing extreme responses while others may encourage them. These are all psychological aspects that should be taken into account when the survey is designed.

Assuming that these issues have been taken care of, the problem I address concerns the numerical part of the data analysis.

Here my main focus is to test the null hypothesis of no difference in the responses between the questions. The responses here are not on a simple linear scale. Rather these are on likert scale. While a mean can be calculated for any set of numbers, most of the advice found online regarding the likert scale quite rightly points out that a mean likert score is difficult to interpret. The nature of the likert scale prevents the calculation of a valid standard deviation. Classical parametric methods based on assumption of normality are very clearly not appropriate when analysing responses to any single question. If the responses to many individual questions are pooled it has been suggested that the central limit theorem kicks in and the resulting pooled scores can be treated as if they follow a Gaussian distribution. However this is not very good advice either, as the assumption is being made that there is independence in the responses. This is rarely justifiable in the case of dissatisfaction scores. Respondents who are dissatisfied regarding one aspect will also tend to be dissatisfied about other aspects, violating the assumption of independence.

Other suggestions for null hypothesis testing include using non-parametric procedures such as Kruskall Wallis tests. However this does not test a very meaningful hypothesis nor does it provide a basis for comparing effect sizes.

As a result of these problems the conventional approach to analysis that is widely used involves splitting the data an looking at the proportion of responses that fall above or below some cut off point. This leads to discussions of the proportion of students who are dissatisfied with the issue revealed by a question, sometimes without using any statistical analysis at all.

Let’s look at how the distributions of the responses vary among different questions.

g0 = ggplot(d1, aes(x = Responses))
g0 + geom_bar(fill ='green') + facet_wrap(~Questions) + xlab('Likert Scores')

It seems that the distributions of the responses do not remain the same among different questions.

Now I will simplify the data into binary classes and look at the number of responses falling into each class. The measure typically used is the proportion or percentage of students who are dissatisfied i.e. giving a score above 3.

d1$x1 = 1*(d1$Responses > 3) 
tb1 = table(d1$x1, d1$Questions) 
round(prop.table(tb1, margin = 2)*100, 1)

##    
##     Question_1 Question_10 Question_11 Question_12 Question_13 Question_14
##   0        2.8        40.8        32.4         1.4        15.5        14.1
##   1       97.2        59.2        67.6        98.6        84.5        85.9
##    
##     Question_15 Question_16 Question_17 Question_18 Question_19 Question_2
##   0        12.7        33.8        66.2        54.9        64.8        9.9
##   1        87.3        66.2        33.8        45.1        35.2       90.1
##    
##     Question_3 Question_4 Question_5 Question_6 Question_7 Question_8
##   0        7.0       56.3       42.3       14.1       22.5       21.1
##   1       93.0       43.7       57.7       85.9       77.5       78.9
##    
##     Question_9
##   0       21.1
##   1       78.9

The data can be quite validly analyzed using a chi-squared test, although there is a much better way that I will show later.

chisq.test(tb1)

## 
##  Pearson's Chi-squared test
## 
## data:  tb1
## X-squared = 273.23, df = 18, p-value < 2.2e-16

The p-value is quite less than 0.05. So at 5% level of significance, this test shows the presence of significant differences between the responses when scores over 3 are scored as ones and scores below are scored as zeros.

Chi-squared test could be used to look at stronger feelings towards a question by changing the splitting rule.

d1$x2 = 1*(d1$Responses > 4) 
tb2 = table(d1$x2, d1$Questions) 
round(prop.table(tb2, margin = 2)*100, 1)

##    
##     Question_1 Question_10 Question_11 Question_12 Question_13 Question_14
##   0       22.5        60.6        59.2        15.5        49.3        26.8
##   1       77.5        39.4        40.8        84.5        50.7        73.2
##    
##     Question_15 Question_16 Question_17 Question_18 Question_19 Question_2
##   0        36.6        60.6        88.7        87.3        83.1       26.8
##   1        63.4        39.4        11.3        12.7        16.9       73.2
##    
##     Question_3 Question_4 Question_5 Question_6 Question_7 Question_8
##   0       26.8       87.3       76.1       50.7       57.7       38.0
##   1       73.2       12.7       23.9       49.3       42.3       62.0
##    
##     Question_9
##   0       56.3
##   1       43.7

chisq.test(tb2)

## 
##  Pearson's Chi-squared test
## 
## data:  tb2
## X-squared = 289.92, df = 18, p-value < 2.2e-16

Again, the p-value is quite less than 0.05. So at 5% level of significance, this test shows the presence of significant differences between the responses when scores over 4 are scored as ones and scores below are scored as zeros.

So. the Chi-squared test confirms the presence of significant differences. However I haven’t measured the effect size yet, nor extracted confidence intervals.

One possibility is to use logistic regression.

mod = glm(data = d1, x1 ~ Questions, family = "binomial")
anova(mod, test = "Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: x1
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       1348     1602.2              
## Questions 18   287.71      1330     1314.5 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The overall p-value is almost identical to the ch-squared test, as it should be. The coefficients of this model could be back transformed and then combined to form estimates with confidence intervals. However there is an even simpler way.

Here I am going to use Binomial test to extract the confidence intervals. The Binomial test in R by default tests the null hypothesis that the true proportion is not equal to 0.5. This can be set to the overall proportion of dissatisfaction when all the questions are taken together to test whether the proportion of dissatisfaction revealed by any question is “significantly different” from the overall proportion of dissatisfaction taking all the questions together. The Binomial test also provides confidence intervals through the Clopper and Pearson procedure. This guarantees that the confidence level is at least the confidence level, but does not necessarily give the shortest-length confidence interval.

prop = sum(d1$x1)/length(d1$x1)
prop

## [1] 0.7190511

I can now use the Binomial test to build a function that takes the original vector of likert scores and returns the percentage dissatisfied, with upper and lower bounds for a 95% confidence interval along with a p-value for the significance of differences from the baseline value.

  dissatisfied_ci = function(x, score = 3, baseline = prop) {
  dissatisfied = (x>score)*1 
  n = length(x)
  p = sum(dissatisfied)
  mid = round(p/n*100, 0)
  b = binom.test(p,n,baseline)
  c(mid, round(b$conf.int*100, 0) , round(b$p.value, 3))
  }

The results can be tabulated for each group using dplyr.

d1 %>% group_by(Questions) %>% 
  summarise(lwr = dissatisfied_ci(Responses, 3, prop)[2],
            med = dissatisfied_ci(Responses, 3, prop)[1],
            upr = dissatisfied_ci(Responses, 3, prop)[3],
            n=n(),
            n_dissat = sum((Responses>3)*1),
            p_val = dissatisfied_ci(Responses, 3, prop)[4]
            ) -> dd
datatable(dd, autoHideNavigation = TRUE, fillContainer = FALSE)

Now we have added the 95% confidence intervals for each question plus a p-value which represents the probability of obtaining the data for that question if the null hypothesis were true. Except for the questions 7, 8, 9, 11 and 16, all the questions show significant p-values. It is more useful to look directly at the confidence intervals, as they show the range of possible scores that could be obtained by chance.

A quick and intuitive way of looking at the data is to plot the confidence intervals after ranking the scores.

dd = dd[order(dd$med),]
dd$Questions = factor(dd$Questions, levels=dd$Questions[order(dd$med)], ordered=TRUE)
g0 = ggplot(dd, aes(x = Questions)) 
g0 = g0 + geom_point(aes(y = med), colour = "red")
g0 = g0 + geom_hline(yintercept = prop*100, col = "green") + ylab("Percent dissatisfaction")
g1 = g0 + geom_errorbar(aes(ymin = lwr, ymax = upr)) + coord_flip()
g1

The following facts are revealed:

Question 12 is the most serious question. Question 12 assesses the customer dissatisfaction if the payment confirmation is missing. A challenge is to find a payment gateway that is smooth. Sometimes when the customers are directed to the payment page, their money is deducted and suddenly, the page shuts off without any notice to the consumer. And that’s when the customer is in a fix. Then chasing the company for a refund is a different challenge altogether.
Question 17 is the least serious question. Question 17 assesses the customer dissatisfaction if there are too many options in the website to choose from. The online world provides too many options and it can be overwhelming for the customer to make a choice. The absence of support that most customers are used to in the in-store experience is missing and this can chicken out them of a purchase decision.
Looking at the proportion of dissatisfied individuals and the confidence intervals for the questions, we can group the questions. We can place questions 6, 13, 14 and 15 in one group, questions 7, 8 and 9 in another group, questions 11 and 16 in another group, questions 5 and 10 in another group, questions 4 and 18 in another group, questions 17 and 19 in another group. Questions 1, 2, 3 and 12 need to be treated individually.

The analysis can be re-run using any cut off point in order to add depth in the analysis.

prop1 = sum(d1$x2)/length(d1$x2)
prop1

## [1] 0.4684952

d1 %>% group_by(Questions) %>% 
  summarise(lwr = dissatisfied_ci(Responses, 4, prop1)[2],
            med = dissatisfied_ci(Responses, 4, prop1)[1],
            upr = dissatisfied_ci(Responses, 4, prop1)[3],
            n=n(),
            n_dissat = sum((Responses>4)*1),
            p_val = dissatisfied_ci(Responses, 4, prop1)[4]
            ) -> dd
dd = dd[order(dd$med),]
dd$Questions = factor(dd$Questions, levels=dd$Questions[order(dd$med)], ordered=TRUE)
g0 = ggplot(dd, aes(x = Questions)) 
g0 = g0 + geom_point(aes(y = med), colour = "red")
g0 = g0 + geom_hline(yintercept = prop1*100, col = "green") + ylab("Percent highly dissatisfied")
g1 = g0 + geom_errorbar(aes(ymin = lwr, ymax = upr)) + coord_flip()
g1

Question 12 still remains the most serious question and question 17 still remains the least serious question. Note that question 14 accounts for a large percentage of highly dissatisfied individuals. It is the issue of uclear website policies for return and refund. Many shopping websites do not even have clear and concise website policies for return and refund. Consumers get confused due to vague stipulations about a refund and return. When the policies section is not defined properly, sellers reject a consumer’s claim to return or get a refund. This is among the biggest challenges that many customers face online. A lot of these websites have no clear outline on the warranty and guarantee of products. A buyer can take this to consumer court in case the demands are not met with.

Although parametric analysis of differences in mean scores are not valid, and the mean itself may be difficult to interpret it is possible to produce confidence intervals for the mean through bootstrapping. This involves resampling with replaecements from the data. If I repeat the resampling thousand of times and exclude the extreme values which occur very infrequently we can get a bootstrapped confidence interval for the mean by calculating it for all the random samples. This approach will occasionally break down for small samples (for example when all the values are identical) but in general it is quite robust and will never throw up values outside the bounds of the data.

boot_mean = function(x)
{ 
  n = length(x)
  x = replicate(1000, mean(sample(x, n, replace = TRUE)))
  round(quantile(x, c(0.025,0.5,0.975)), 2)
}

d1 %>% group_by(Questions)%>% 
  summarise(n = n(),
            mean = boot_mean(Responses)[2],
            lwr = boot_mean(Responses)[1],
            upr = boot_mean(Responses)[3],
            )->dd
datatable(dd, autoHideNavigation = TRUE,fillContainer = FALSE)

dd = dd[order(-dd$mean),]
dd$Questions = factor(dd$Questions, levels = dd$Questions[order(dd$mean)], ordered = TRUE)
g0 = ggplot(dd, aes(x = Questions)) 
g0 = g0 + geom_point(aes(y = mean), colour = "red")
g0 = g0 + geom_hline(yintercept = mean(d1$Responses), col = "green") +xlab("Mean Likert score with bootstrapped 95% confidence intervals")
g1 = g0 + geom_errorbar(aes(ymin = lwr, ymax = upr)) + coord_flip()
g1

This analysis generally picks up the same questions in the simulated data set with confidence intervals that do not overlap the baseline. This is useful, as it suggests that the conclusions do not rely too much on the assumptions, providing a statistically justifiable procedure is being used.

The sample covariance matrix of our data matrix, S is given by

d = read.csv("summer_project.csv")
X = as.matrix(d)
S = cov(X)
S

##               question_1   question_2   question_3  question_4
## question_1   0.249094567 -0.012072435  0.013078471  0.01207243
## question_2  -0.012072435  0.496177062  0.212474849 -0.06760563
## question_3   0.013078471  0.212474849  0.369818913  0.04466801
## question_4   0.012072435 -0.067605634  0.044668008  0.92474849
## question_5   0.002414487 -0.139235412 -0.053923541  0.41066398
## question_6  -0.002615694 -0.068209256  0.048893360  0.15392354
## question_7   0.046881288 -0.079678068 -0.085110664  0.19396378
## question_8   0.047082495  0.056338028  0.048490946  0.12937626
## question_9   0.021931590  0.097183099 -0.008853119  0.28853119
## question_10  0.020321932 -0.033802817 -0.134808853  0.40523139
## question_11 -0.029376258  0.070221328 -0.020120724  0.31549296
## question_12  0.027967807 -0.022334004 -0.043661972  0.00804829
## question_13  0.069014085 -0.075050302 -0.046076459  0.14647887
## question_14  0.105432596  0.029577465  0.035814889  0.22756539
## question_15  0.069617706  0.032997988 -0.016700201  0.06700201
## question_16  0.138832998  0.008249497  0.085110664  0.27746479
## question_17  0.024346076  0.086519115  0.108651911  0.68490946
## question_18  0.061569416  0.006639839  0.024949698  0.37907445
## question_19  0.018108652 -0.001408451 -0.018712274  0.40140845
##               question_5   question_6  question_7   question_8
## question_1   0.002414487 -0.002615694  0.04688129  0.047082495
## question_2  -0.139235412 -0.068209256 -0.07967807  0.056338028
## question_3  -0.053923541  0.048893360 -0.08511066  0.048490946
## question_4   0.410663984  0.153923541  0.19396378  0.129376258
## question_5   1.022132797  0.445070423  0.32736419  0.088732394
## question_6   0.445070423  0.855935614  0.23702213  0.193158954
## question_7   0.327364185  0.237022133  0.85513078  0.390744467
## question_8   0.088732394  0.193158954  0.39074447  1.637424547
## question_9   0.260563380  0.290342052  0.31046278  0.116700201
## question_10  0.298189135  0.184104628  0.29698189  0.486116700
## question_11  0.368812877  0.032595573  0.21468813 -0.258148893
## question_12  0.044466801 -0.039839034  0.02173038 -0.040040241
## question_13  0.163581489  0.140643863  0.14406439  0.068410463
## question_14  0.114084507  0.101408451  0.16156942  0.331790744
## question_15  0.076257545  0.137625755  0.19376258  0.108450704
## question_16  0.544064386  0.434406439  0.27344064  0.080684105
## question_17  0.339839034  0.221126761  0.18068410  0.005432596
## question_18  0.421529175  0.277867203  0.08350101  0.026961771
## question_19  0.323138833  0.180885312  0.09094567 -0.113078471
##               question_9 question_10 question_11 question_12 question_13
## question_1   0.021931590  0.02032193 -0.02937626  0.02796781  0.06901408
## question_2   0.097183099 -0.03380282  0.07022133 -0.02233400 -0.07505030
## question_3  -0.008853119 -0.13480885 -0.02012072 -0.04366197 -0.04607646
## question_4   0.288531187  0.40523139  0.31549296  0.00804829  0.14647887
## question_5   0.260563380  0.29818913  0.36881288  0.04446680  0.16358149
## question_6   0.290342052  0.18410463  0.03259557 -0.03983903  0.14064386
## question_7   0.310462777  0.29698189  0.21468813  0.02173038  0.14406439
## question_8   0.116700201  0.48611670 -0.25814889 -0.04004024  0.06841046
## question_9   0.951307847  0.40140845  0.36076459 -0.03299799  0.06800805
## question_10  0.401408451  1.75975855  0.55774648  0.14688129  0.37323944
## question_11  0.360764588  0.55774648  1.86277666  0.10422535  0.11690141
## question_12 -0.032997988  0.14688129  0.10422535  0.17102616  0.05553320
## question_13  0.068008048  0.37323944  0.11690141  0.05553320  0.65070423
## question_14  0.182696177  0.34949698  0.06197183  0.10362173  0.21448692
## question_15  0.086720322  0.01207243  0.14124748  0.08450704  0.13802817
## question_16  0.232394366  0.30301811  0.31388330  0.06398390  0.15593561
## question_17  0.383299799  0.58531187  0.51529175  0.05432596  0.37444668
## question_18  0.180080483  0.30382294  0.15472837  0.04104628  0.19275654
## question_19  0.189939638  0.57927565  0.20181087  0.02635815  0.34828974
##             question_14 question_15 question_16 question_17 question_18
## question_1   0.10543260  0.06961771 0.138832998 0.024346076 0.061569416
## question_2   0.02957746  0.03299799 0.008249497 0.086519115 0.006639839
## question_3   0.03581489 -0.01670020 0.085110664 0.108651911 0.024949698
## question_4   0.22756539  0.06700201 0.277464789 0.684909457 0.379074447
## question_5   0.11408451  0.07625755 0.544064386 0.339839034 0.421529175
## question_6   0.10140845  0.13762575 0.434406439 0.221126761 0.277867203
## question_7   0.16156942  0.19376258 0.273440644 0.180684105 0.083501006
## question_8   0.33179074  0.10845070 0.080684105 0.005432596 0.026961771
## question_9   0.18269618  0.08672032 0.232394366 0.383299799 0.180080483
## question_10  0.34949698  0.01207243 0.303018109 0.585311871 0.303822938
## question_11  0.06197183  0.14124748 0.313883300 0.515291751 0.154728370
## question_12  0.10362173  0.08450704 0.063983903 0.054325956 0.041046278
## question_13  0.21448692  0.13802817 0.155935614 0.374446680 0.192756539
## question_14  0.82454728  0.21086519 0.238430584 0.482494970 0.373440644
## question_15  0.21086519  0.56780684 0.391951710 0.262977867 0.194567404
## question_16  0.23843058  0.39195171 1.369416499 0.376458753 0.587927565
## question_17  0.48249497  0.26297787 0.376458753 1.751710262 0.801609658
## question_18  0.37344064  0.19456740 0.587927565 0.801609658 1.298993964
## question_19  0.33420523  0.09336016 0.494768612 0.741649899 0.997183099
##              question_19
## question_1   0.018108652
## question_2  -0.001408451
## question_3  -0.018712274
## question_4   0.401408451
## question_5   0.323138833
## question_6   0.180885312
## question_7   0.090945674
## question_8  -0.113078471
## question_9   0.189939638
## question_10  0.579275654
## question_11  0.201810865
## question_12  0.026358149
## question_13  0.348289738
## question_14  0.334205231
## question_15  0.093360161
## question_16  0.494768612
## question_17  0.741649899
## question_18  0.997183099
## question_19  1.609255533

Now let us find the eigen values of S in decreasing order of magnitude.

ev = eigen(S)
ev$values

##  [1] 5.5141408 2.2047256 2.0026940 1.6441261 1.2147320 1.0139448 0.8291445
##  [8] 0.7924877 0.7085762 0.5886663 0.5204820 0.4630024 0.3888965 0.3484514
## [15] 0.2947891 0.2720755 0.1851259 0.1404526 0.1012533

The screeplot will look like:

index = seq(1,19,1)
eigen_values = ev$values
e = data.frame(index,eigen_values)
g0 = ggplot(e,aes(x=index,y=eigen_values))+geom_line(colour='magenta')
g0

The elbow shape in the plot is coming somewhere in between 5th and 10th maximum eigenvalue.

prcomp = prcomp(x = S,retx = T)
summary(prcomp)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     0.7501 0.4753 0.4685 0.37371 0.27232 0.22302
## Proportion of Variance 0.3912 0.1571 0.1526 0.09709 0.05155 0.03458
## Cumulative Proportion  0.3912 0.5482 0.7008 0.79790 0.84945 0.88403
##                            PC7    PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.19477 0.1827 0.16654 0.13817 0.12266 0.09784
## Proportion of Variance 0.02637 0.0232 0.01928 0.01327 0.01046 0.00665
## Cumulative Proportion  0.91040 0.9336 0.95288 0.96615 0.97661 0.98327
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.08725 0.08194 0.06864 0.05402 0.03451 0.03043
## Proportion of Variance 0.00529 0.00467 0.00327 0.00203 0.00083 0.00064
## Cumulative Proportion  0.98856 0.99322 0.99650 0.99853 0.99936 1.00000
##                             PC19
## Standard deviation     4.181e-18
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

We see that 91%, which is a substantial amount, of the total variation in the dataset is explained by the first 7 principal components.

Now I will proceed for exploratory factor analysis. Basic assumption for use of factor analysis is the existence of sufficient correlations among data in the data matrix. To analyze these correlations, it is possible to use KMO test:

Henry Kaiser (1970) introduced an Measure of Sampling Adequacy (MSA) of factor analytic data matrices. Kaiser and Rice (1974) then modified it. This is just a function of the squared elements of the ‘image’ matrix compared to the squares of the original correlations. The overall MSA as well as estimates for each item are found. The index is known as the Kaiser-Meyer-Olkin (KMO) index. In his delightfully flamboyant style, Kaiser (1975) suggested that KMO > .9 were marvelous, in the .80s, mertitourious, in the .70s, middling, in the .60s, medicore, in the 50s, miserable, and less than .5, unacceptable.

KMO(d)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = d)
## Overall MSA =  0.67
## MSA for each item = 
##  question_1  question_2  question_3  question_4  question_5  question_6 
##        0.58        0.45        0.41        0.77        0.72        0.68 
##  question_7  question_8  question_9 question_10 question_11 question_12 
##        0.78        0.39        0.69        0.61        0.69        0.49 
## question_13 question_14 question_15 question_16 question_17 question_18 
##        0.74        0.81        0.56        0.74        0.77        0.72 
## question_19 
##        0.70

Although the test result is mediocre, we can still proceed for the exploratory factor analysis.

We assume the orthogonal factor model. We first factor analyze the sample covariance matrix S by the iterative principal component method with varimax rotation. Since 7 PCs are sufficient for summarising the data, we consider 7 factors here. The estimated loading matrix is shown below.

f = fac(r = S, nfactors = 7,rotate = 'varimax', scores = 'Thurstone', residuals = TRUE, fm = 'pa')
L = f$loadings
L

## 
## Loadings:
##             PA4    PA1    PA5    PA6    PA2    PA7    PA3   
## question_1                 0.355                            
## question_2  -0.108                       0.850  0.124       
## question_3                               0.603 -0.173       
## question_4   0.297  0.246         0.528         0.127       
## question_5   0.664  0.202         0.176 -0.179  0.111       
## question_6   0.624  0.148                              0.151
## question_7   0.462 -0.123  0.207  0.187 -0.147  0.164  0.257
## question_8   0.159         0.108                       0.703
## question_9   0.403                0.315  0.129  0.239       
## question_10  0.124  0.258         0.171 -0.117  0.812  0.273
## question_11  0.228                0.289         0.436 -0.363
## question_12 -0.119         0.482        -0.179  0.316 -0.157
## question_13         0.256  0.260  0.215 -0.170  0.181  0.107
## question_14         0.265  0.464  0.307         0.123  0.306
## question_15  0.226         0.662  0.116                     
## question_16  0.599  0.317  0.425         0.126        -0.104
## question_17  0.101  0.389  0.180  0.752  0.126  0.107       
## question_18  0.239  0.745  0.193  0.222                     
## question_19  0.116  0.774         0.179         0.153       
## 
##                  PA4   PA1   PA5   PA6   PA2   PA7   PA3
## SS loadings    1.911 1.764 1.397 1.370 1.281 1.205 0.962
## Proportion Var 0.101 0.093 0.074 0.072 0.067 0.063 0.051
## Cumulative Var 0.101 0.193 0.267 0.339 0.406 0.470 0.521

Now corresponding to a particular question I will look at the absolute values of the estimated loadings and I will place the issue revealed by that question under that factor under which it has the maximum estimated loading. For example if we look at question 7, the maximum absolute value of the loading is occuring under factor 4. So we place the issue revealed by question 7 under factor 4. In this way we calssify all the questions under 7 factors as follows:

Factor 1: Questions 18 and 19.
Factor 2: Questions 2 and 3.
Factor 3: Question 8.
Factor 4: Questions 4, 5, 6, 7, 9 and 16.
Factor 5: Questions 1, 12, 13, 14 and 15.
Factor 6: Question 17.
Factor 7: Questions 10 and 11.

Looking at the questions I name the factors as

Factor 1: Website design factor.
Factor 2: Package handling factor.
Factor 3: Delivery time factor.
Factor 4: Product tracking and return time factor.
Factor 5: Product quality, payment confirmation and customer care services factor.
Factor 6: Number of options factor.
Factor 7: Payment method factor.

Now I will proceed for Confirmatory Factor Analysis (CFA) to see whether the assumed data structure is correct or not.

model = 'Factor_1 =~ question_18 + question_19
         Factor_2 =~ question_2 + question_3
         Factor_3 =~ question_8
         Factor_4 =~ question_4 + question_5 + question_6 + question_7 + question_9 + question_16
         Factor_5 =~ question_1 + question_12 + question_13 + question_14 + question_15
         Factor_6 =~ question_17
         Factor_7 =~ question_10 + question_11'
fit = cfa(model = model, data = d)
summary(fit, fit.measures=TRUE)

## lavaan 0.6-3 ended normally after 101 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         57
## 
##   Number of observations                            71
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     164.420
##   Degrees of freedom                               133
##   P-value (Chi-square)                           0.033
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              434.189
##   Degrees of freedom                               171
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.881
##   Tucker-Lewis Index (TLI)                       0.847
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1657.316
##   Loglikelihood unrestricted model (H1)      -1575.106
## 
##   Number of free parameters                         57
##   Akaike (AIC)                                3428.632
##   Bayesian (BIC)                              3557.605
##   Sample-size adjusted Bayesian (BIC)         3378.039
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.058
##   90 Percent Confidence Interval          0.018  0.085
##   P-value RMSEA <= 0.05                          0.328
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.085
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Factor_1 =~                                         
##     question_18       1.000                           
##     question_19       0.925    0.162    5.706    0.000
##   Factor_2 =~                                         
##     question_2        1.000                           
##     question_3        2.027    1.640    1.236    0.216
##   Factor_3 =~                                         
##     question_8        1.000                           
##   Factor_4 =~                                         
##     question_4        1.000                           
##     question_5        1.345    0.341    3.946    0.000
##     question_6        0.988    0.284    3.485    0.000
##     question_7        0.819    0.268    3.058    0.002
##     question_9        0.863    0.282    3.056    0.002
##     question_16       1.327    0.367    3.619    0.000
##   Factor_5 =~                                         
##     question_1        1.000                           
##     question_12       1.024    0.639    1.602    0.109
##     question_13       2.898    1.596    1.817    0.069
##     question_14       4.728    2.449    1.931    0.054
##     question_15       2.537    1.421    1.785    0.074
##   Factor_6 =~                                         
##     question_17       1.000                           
##   Factor_7 =~                                         
##     question_10       1.000                           
##     question_11       0.435    0.221    1.971    0.049
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Factor_1 ~~                                         
##     Factor_2          0.008    0.040    0.189    0.850
##     Factor_3         -0.005    0.167   -0.031    0.976
##     Factor_4          0.303    0.106    2.854    0.004
##     Factor_5          0.072    0.042    1.709    0.088
##     Factor_6          0.790    0.199    3.967    0.000
##     Factor_7          0.369    0.179    2.064    0.039
##   Factor_2 ~~                                         
##     Factor_3          0.022    0.049    0.459    0.646
##     Factor_4          0.003    0.022    0.147    0.883
##     Factor_5         -0.001    0.006   -0.149    0.881
##     Factor_6          0.052    0.064    0.809    0.418
##     Factor_7         -0.065    0.072   -0.903    0.367
##   Factor_3 ~~                                         
##     Factor_4          0.137    0.097    1.412    0.158
##     Factor_5          0.049    0.035    1.385    0.166
##     Factor_6          0.005    0.198    0.027    0.978
##     Factor_7          0.423    0.204    2.076    0.038
##   Factor_4 ~~                                         
##     Factor_5          0.035    0.022    1.572    0.116
##     Factor_6          0.326    0.120    2.713    0.007
##     Factor_7          0.291    0.115    2.519    0.012
##   Factor_5 ~~                                         
##     Factor_6          0.099    0.056    1.775    0.076
##     Factor_7          0.074    0.045    1.640    0.101
##   Factor_6 ~~                                         
##     Factor_7          0.608    0.217    2.804    0.005
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .question_18       0.218    0.153    1.424    0.155
##    .question_19       0.677    0.170    3.975    0.000
##    .question_2        0.386    0.103    3.755    0.000
##    .question_3       -0.060    0.328   -0.183    0.855
##    .question_8        0.000                           
##    .question_4        0.631    0.120    5.244    0.000
##    .question_5        0.500    0.116    4.317    0.000
##    .question_6        0.570    0.110    5.185    0.000
##    .question_7        0.655    0.119    5.499    0.000
##    .question_9        0.729    0.132    5.500    0.000
##    .question_16       0.856    0.170    5.027    0.000
##    .question_1        0.227    0.039    5.776    0.000
##    .question_12       0.149    0.026    5.666    0.000
##    .question_13       0.485    0.093    5.223    0.000
##    .question_14       0.396    0.114    3.468    0.001
##    .question_15       0.440    0.082    5.341    0.000
##    .question_17       0.000                           
##    .question_10       0.471    0.561    0.839    0.401
##    .question_11       1.597    0.288    5.547    0.000
##     Factor_1          1.063    0.259    4.110    0.000
##     Factor_2          0.103    0.094    1.094    0.274
##     Factor_3          1.614    0.271    5.958    0.000
##     Factor_4          0.281    0.124    2.258    0.024
##     Factor_5          0.019    0.018    1.010    0.312
##     Factor_6          1.727    0.290    5.958    0.000
##     Factor_7          1.264    0.622    2.031    0.042

The high CFI and TLI values (0.881 and 0.847 respectively) are indicators of good fit. The SRMR value (0.085) between 0.05 and 0.10 indicates an acceptable fit. The RMSEA value (0.058) between 0.05 and 0.08 indicates a fit close to good. So we can say that our model fitting is acceptable.

Note that, factor 5 includes product quality factor, which I personally don’t think should go with payment confirmation and customer care services factors. I think this is happening just because of the data. So treating product quality as a separate factor I can list down the factors that can dissatisfy a customer while shopping online. These are

Product quality not up to the mark.
Bad way of handling the packages.
Delayed delivery.
Bad product tracking facilities and problems and delay in case of product return.
Missing payment confirmation and bad customer care services.
Missing of the desired payment methods.
Too many options to choose from.
Complicated website designs and boring interfaces.

Now I will address these factors one by one with possible solutions:

1. Product quality not up to the mark:

This is the most common problem faced by customers who shop online regularly. The quality of the product is often not up to the mark with what is presented in the pictures. With the competition growing in the e-commerce industry, as many websites become a marketplace for sellers to sell their products, and the issue of fraudulent sellers is increasing. The checks on registration are poor and selling poor-quality in the name of brands is becoming increasingly common. Even worse, quality-checks have become so rare with the magnitude of online sales which is soaring high.

Make your products undergo a usability test where they are evaluated for their usefulness and effectiveness.

2. Bad way of handling the packages:

This is another common issue faced in online shopping. Barring a few websites, delivery and logistics is a major issue. Websites are becoming so casual about the delivery quality of products. So many times, either the package is lost or damaged while in transit.

Make the delivery boys more responsible while transiting the product.

3. Delayed delivery:

Many of these companies do not follow the stipulated time limit, leaving consumers confused as the products come in too late. They will send the delivery guy when they seem fit, almost never according to the promised time. Sometimes customers receive the product after the need is over. It’s important for buyers to have realistic expectations and know when they can expect their orders. It is important for people to know when their product is arriving so that they can plan their day accordingly. Customers often complain that the delivery boys do not even call them before arriving to deliver the product. The best thing is to check with the website about the estimated time of arrival of a certain product before you place the order.

Since logistics has become very complex, it is mandatory for ecommerce business owners to keep tabs on it. An ecommerce platform with inventory management solutions can give an idea to the owner about the stock status and thus, product deliveries can be managed much more accurately. Send Message to the customer when the order is shipped and then when it is expected to be delivered so that the customer is prepared to receive it.

4. Bad product tracking facilities and problems and delay in case of product return:

The delay issue is the same with returns. You place an order for the returns to be picked up and there is no response. Customers are facing a troubled time with the tracking systems which do not accurately locate the product. Often customers choose same-day-delivery by paying a few extra bucks only to get their product delivered.

Right tracking facilities and right return policies need to be developed by the ecommerce platforms.

5. Missing payment confirmation and bad customer care services:

Another challenge is to find a payment gateway that is smooth. Sometimes when the customers are directed to the payment page, their money is deducted and suddenly, the page shuts off without any notice to the consumer. And that’s when the customer is in a fix. Then chasing the company for a refund is a different challenge altogether. Also sometimes the website asks too many secret questions or too much info before the customer can make the payment. This too can increase the perceived inconvenience during the purchase and leads to an abandoned cart. Many a times customers face various problems and they call customer care for assistance and many a times their calls are not picked up.

A quick fix is to email payment confirmation to the customer. If a customer gets an email confirming the order, they are not worried about the outcome. They know they are paying for an order that has been placed successfully. Also, keep the payment process simple and easy to execute without including too many stages. Customer care services should always be ready to assist the customers.

6. Missing of the desired payment methods:

This is another common problem. A lot of times, consumers do not know how to make the payment if the debit cards they use are not available as an option. More so, customers are often stuck with the payment options when Cash On Delivery is not available. With online frauds picking up steam, most customers prefer paying cash on delivery as they are skeptical about sharing their card details. This is a common complaint by many customers these days. They do not have many payment methods that they can trust.

An e-security seal on the website can help earn the trust of the consumer opting for e-payments on the website. Further, use of e-payments offers convenience to the buyers and hence leads to increase in sales.

7. Too many options to choose from:

The online world provides too many options and it can be overwhelming for the customer to make a choice. The absence of support that most customers are used to in the in-store experience is missing and this can chicken out them of a purchase decision.

Give proper product specs in the same format for all products so it is easy to compare them. Instead of overloading the customer with information, give minimal but useful information. A shopping comparison tool can help buyers simplify their purchases decisions. Also, a live chat option to offer queries always is comforting for buyers to make that final click.

8. Complicated website designs and boring interfaces:

Does your website resemble a maze of where the visitor feels lost? It’s tempting to opt for complex looking website structures and designs and they may also hook in many curious customers; but this may not be a great idea to build long-term customer base. Most visitors get frustrated with these complex monstrous website and bounce off. While overwhelming site structures can be a bummer, websites with insipid interfaces don’t get much done either. It’s hard to get attention of the customers, so make sure you get it right at the first go. Shoppers are an impatient lot. Get their attention with attractive website designs that allure visitors and give them an enjoyable experience both on desktops and other handheld devices like mobiles, tabs etc.

Keep the website structure simple and provide easy navigation tools to the customers.Do away with long forms. Also, reduce the number of clicks required to complete a purchase. Make sure your web pages don’t take too long to load. Give interactive product guides to keep visitors entertained and engaged at the same time.

A Statistical Analysis of The Problems Faced by The Customers in Online Shopping

Sayan Nath

June 2019

Introduction:

Objectives:

Survey procedure: