Machine Learning to segment the green consumer in Kenya

Introduction

The study uses a dataset from a survey conducted in 2015 with 1000 respondents. The analysis is based mainly on psychographic questions, that is, the attitude, opinion and interest (AIO) rating questions that measured respondents' views on environmental matters. Because there are many rating questions, the unobserved characteristics behind them are identified with two unsupervised machine learning methods:

Principal component analysis (PCA), to reduce the number of dimensions (variables) and to determine suitable names for the segments.
K-means cluster analysis, to assign individuals to the segments identified using PCA. Because PCA indicates the types of consumer segments present, it also gives an idea of how many groups (k) to expect.
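
The two-step idea above can be sketched on simulated data. This is only an illustration under stated assumptions: the ratings are random numbers rather than the survey, base R `prcomp` stands in for the rotated components used later in this document, and the choices of two components and k = 2 are arbitrary.

```r
# Minimal sketch on simulated data (not the survey): PCA scores feed k-means.
set.seed(42)

# Hypothetical ratings: 200 respondents x 6 items on a 1-5 scale
ratings <- matrix(sample(1:5, 200 * 6, replace = TRUE), nrow = 200,
                  dimnames = list(NULL, paste0("item", 1:6)))

# Step 1: PCA on standardized items; keep the first two component scores
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Step 2: k-means on the component scores (k = 2 is arbitrary here)
km <- kmeans(scores, centers = 2, nstart = 25)

# Each respondent now carries a cluster label
table(km$cluster)
```

Clustering on a handful of component scores rather than on all raw items keeps the distance calculation from being dominated by many correlated questions.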

After identifying the green consumer segments, we will build a classification model that can in future predict a consumer's segment from the underlying demographic variables. The performance of three widely used models will be evaluated and one selected.

Multinomial logistic regression, which works almost the same way as binary logistic regression but is more relevant when the outcome variable has more than two categories.
Support Vector Machine. Here we will use kernel-based SVMs, which handle classes that are not linearly separable by implicitly projecting the points into a higher-dimensional space and finding the optimal separating hyperplane there.
Random forest, which builds multiple decision trees and combines their votes to obtain a more accurate classification.
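
To make the kernel idea concrete, here is a small sketch of the radial (RBF) kernel, the kernel type requested later with `kernel = 'radial'`. The function name `rbf_kernel` and the example points are hypothetical, and the value of `gamma` is chosen only for illustration.

```r
# Radial (RBF) kernel: similarity decays with squared Euclidean distance.
rbf_kernel <- function(x, y, gamma = 1) {
  exp(-gamma * sum((x - y)^2))
}

a <- c(0, 0)
b <- c(1, 1)

rbf_kernel(a, a)               # identical points: similarity of 1
rbf_kernel(a, b, gamma = 0.5)  # squared distance 2, so exp(-0.5 * 2)
```

The SVM never builds the higher-dimensional coordinates explicitly; it only needs these pairwise similarities.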

Summary of clusters

Cluster analysis revealed five profiles of green consumer:

Champions: People dedicated to environmental matters. They are emotionally invested in the environment and concerned about pollution.
Hunters: Information seekers on environmental matters who, if called upon, would contribute to environmental conservation.
Remoaners: They feel guilty about environmental pollution but are not willing to participate in conservation.
Unwary: Uninformed on environmental matters; to them, a source of livelihood is important even if it leads to destruction of the environment.
Uncommitted: Not committed at all to environmental matters; they have other pressing issues.

Data preparation

library(haven)
library(dplyr)
library(factoextra)
library(psych)
library(expss)
library(kableExtra)
library(nnet)
library(caret)
library(e1071)
library(randomForest)
data <- read_sav("E:/PARS Folder/Technical stuff/Environment consumer/data2.sav")

data$gender=factor(data$gender,labels=c('Male','Female'), levels=c(1,2))
data$location=factor(data$location,labels=c('Nairobi','Central','Coast',
                                            'Eastern','North Eastern','Nyanza','Rift Valley',
                                            'Western'),levels=c(1,2,3,4,5,6,7,8))

data$age=factor(data$age,labels=c('Below 18 years','18-24','25-34','35-44','45-54',
                                  '55-65','>65'), levels=c(1,2,3,4,5,6,7))

data$marital=factor(data$marital,labels=c('Single','Married','Divorced/ Separated',
                                          'Widowed/ Widower','Other(specify)'),levels=c(1,2,3,4,5))
data$sec=factor(data$sec,labels=c('A','B','C1','C2','D','E'),levels=c(1,2,3,4,5,6))
data$education=factor(data$education,labels=c('No formal education','Some primary','Completed primary education','Some secondary','Completed secondary','University / Polytechnic incomplete','University / Polytechnic complete','Post-university incomplete','Post-university complete'), levels=c(1,2,3,4,5,6,7,8,9))

data$renewable=factor(data$renewable,labels=c('yes','No'), levels=c(1,0))

data$aware_govt_initiative=factor(data$aware_govt_initiative,labels=c('yes','No'), levels=c(1,0))

data$envt_information=factor(data$envt_information,labels=c('yes','No'), levels=c(1,0))

data$non_renewable=factor(data$non_renewable,labels=c('yes','No'), levels=c(1,0))

Next, build two indices: one measuring awareness of environmental matters, based on the awareness rating questions, and one measuring participation in environmental issues. Each index is the first principal component of its set of items.

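# Standardize the awareness items (columns 10-13) and participation items (columns 15-16)
data.scale_aware=scale(data[10:13])

data.scale_participate=scale(data[15:16])

# Awareness index: first principal component of the awareness items.
# (Passing data.scale_participate here would make both indices identical,
# which is what the index columns in the printed tables further down show.)
awareness_index<-principal(data.scale_aware, rotate="varimax", nfactors=1,covar=T, scores=TRUE)

# Participation index: first principal component of the participation items
participation_index<-principal(data.scale_participate, rotate="varimax", nfactors=1,covar=T, scores=TRUE)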

awareness_index=awareness_index$scores

participation_index=participation_index$scores

data2<-cbind(data,awareness_index,participation_index)
names(data2)[42]<-"awareness_index"
names(data2)[43]<-"participation_index"
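
As an illustration of what a first-principal-component index is, the sketch below builds one with base R `prcomp` on simulated items. The item names and the simulated latent variable are assumptions, and `prcomp` is used here in place of `principal`, so the scaling of the scores will differ from the indices above.

```r
# Minimal sketch: a single index from the first principal component.
set.seed(1)

# Hypothetical items driven by one latent trait plus noise
latent <- rnorm(100)
items <- data.frame(
  aware_a = latent + rnorm(100, sd = 0.5),
  aware_b = latent + rnorm(100, sd = 0.5),
  aware_c = latent + rnorm(100, sd = 0.5),
  aware_d = latent + rnorm(100, sd = 0.5)
)

pc <- prcomp(items, center = TRUE, scale. = TRUE)
index <- pc$x[, 1]  # score on the first component, one value per respondent

# The sign of a component is arbitrary; orient it so higher means "more aware"
if (cor(index, rowMeans(items)) < 0) index <- -index

summary(index)
```

Checking the orientation matters when the index is later interpreted or squared alongside other predictors.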

PCA

pca.plot<-prcomp(data[18:41],scale=T)
fviz_eig(pca.plot)
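
The numbers behind a scree plot can also be read directly from the `prcomp` object. The following is a sketch on simulated data (the matrix `x` is an assumption, not the survey items); the eigenvalue-greater-than-one count is a common rule of thumb, not necessarily the criterion used in this analysis.

```r
# Minimal sketch: variance explained per component, from prcomp output.
set.seed(7)
x <- matrix(rnorm(300 * 8), ncol = 8)
x[, 2] <- x[, 1] + rnorm(300, sd = 0.3)  # make two columns correlated

p <- prcomp(x, scale. = TRUE)

eig  <- p$sdev^2         # eigenvalues of the correlation matrix
prop <- eig / sum(eig)   # proportion of variance per component

round(cumsum(prop), 3)   # cumulative proportion, as read off a scree plot
sum(eig > 1)             # components with eigenvalue above 1
```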

data.scale=scale(data[18:41])


prn<-principal(data.scale, rotate="varimax", nfactors=5,covar=T, scores=TRUE)

output<-prn$loadings

newd=read.csv("output.csv")
newd %>%
  kable("html") %>%
  kable_styling(font_size=12) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Rating.Question	Champs	Hunters	Remoaners	Unwary	Uncommitted
I play a role in protecting the environment in Kenya	0.582	.	0.133	.	-0.116
I want to protect the planet	0.602	0.212	.	.	.
I love the environment and animals	0.664	0.227	0.131	.	.
I feel guilty when I carry items packed with plastic materials	0.152	.	0.678	.	.
I am African and I care about my community	0.714	.	0.182	.	.
I want to make a difference and leave a mark	0.682	0.138	.	.	.
In my daily life I try to find ways to conserve water or power	0.673	.	0.159	.	.
If I come across information about environment, I will tend to look at it	0.411	0.548	.	.	.
I would like to join and actively participate in an environmentalist group.	0.398	0.567	0.117	.	.
I’ld donate some money to an environmental organization.	0.153	0.65	0.188	0.123	.
I’ld certainly devote some of it to working for environmental causes	0.342	0.55	0.235	.	.
I am not the kind of person who makes efforts to conserve natural resources.	-0.17	.	.	.	0.712
I feel angry when I see waste been dumped in open sites	0.523	0.153	0.24	.	.
I feel angry when people cut forest trees for farming	0.219	0.148	0.666	.	-0.11
I’m proud of the government efforts in environmental conservation	0.181	0.24	0.204	.	0.372
I feel guilty when I throw plastic materials in the street	0.244	0.154	0.71	.	.
I’m sad when I hear news about pollution of our rivers by industries	0.55	.	0.371	.	.
I would feel proud after volunteering to an environmental activity	0.503	0.291	0.28	.	.
I have more pressing issues to worry about other than environment	-0.114	.	-0.107	.	0.723
I will give thanks and cherish “Mother Nature”	0.592	0.183	0.208	.	-0.133
There is nothing we can do about climate change as it is already too late	-0.255	0.206	.	0.69	.
The problems of the environment are not as bad as most people think	-0.112	0.178	.	0.755	.
It is right for humans to use nature as a resource for economic purposes.	0.243	-0.259	.	0.636	.
Protecting peoples’ source of livelihood is more important	.	-0.419	.	0.395	0.315

y=prn$scores %>%
  data.frame()


data2$Champs<-y$RC1
data2$Hunters<-y$RC3
data2$Remoaners<-y$RC5
data2$Unwary<-y$RC2
data2$Uncommitted<-y$RC4
data2<-data2[-c(18:41)]

head(data2) %>%
  kable("html") %>%
  kable_styling(font_size=12) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

SbjNum	location	age	gender	marital	renewable	aware_govt_initiative	envt_information	sec	aware_one	aware_two	aware_three	aware_four	education	participate_one	participate_two	non_renewable	awareness_index	participation_index	Champs	Hunters	Remoaners	Unwary	Uncommitted
11999193	Central	25-34	Male	Single	yes	No	No	A	0	5	5	5	University / Polytechnic incomplete	2	2	No	1.4224644	1.4224644	1.2181461	0.7810937	-1.4490291	1.4479224	-1.1009969
11999305	Central	35-44	Male	Single	yes	No	No	C2	0	5	3	5	University / Polytechnic incomplete	2	1	No	0.0771632	0.0771632	1.5745056	0.3204658	-3.2564672	0.1197839	0.1084916
11999984	Central	25-34	Male	Single	No	yes	No	C1	0	1	5	5	University / Polytechnic incomplete	2	1	yes	0.0771632	0.0771632	0.5531007	0.0361897	0.5727102	-0.3309916	-0.5028206
12000457	Central	55-65	Male	Divorced/ Separated	No	yes	No	C1	1	3	1	5	Some primary	2	1	No	0.0771632	0.0771632	-2.8555465	2.7320131	0.5336392	-1.8903485	-0.2272935
12001126	Central	25-34	Male	Single	yes	No	No	C1	0	5	5	5	Post-university complete	2	2	No	1.4224644	1.4224644	-1.5132486	0.6531634	-0.6242752	-0.3480355	-0.2504428
12001937	Central	45-54	Male	Married	yes	yes	No	C1	0	5	5	5	Completed secondary	2	1	No	0.0771632	0.0771632	0.1529545	1.6651352	0.5015768	-0.0147494	-1.3145899

K-means cluster analysis

cluster_data = data2[20:24]
set.seed(1234)
kmeans = kmeans(x = cluster_data,iter.max=1000, centers = 5)
kmeans$centers

##       Champs    Hunters   Remoaners      Unwary Uncommitted
## 1  0.4405515 -1.2089632  0.05739103 -0.08682849  -0.4912145
## 2  0.3231998  0.1983640  0.34432095 -0.62873858   1.1063522
## 3  0.3623228  0.6689641 -0.08868706 -0.40004118  -0.8257980
## 4 -1.5917301 -0.1109957 -0.36407797  0.14819016   0.1125356
## 5  0.4721589  0.5187240  0.07329285  1.65298345   0.4204302
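
Here k = 5 follows from the five components retained by the PCA. One hedged way to sanity-check a choice of k is to compare the total within-cluster sum of squares across several values. The sketch below does this on simulated data with three well-separated groups; the data and the range of k are assumptions for illustration only.

```r
# Minimal sketch: within-cluster sum of squares for several values of k.
set.seed(1234)

# Simulated 2-D data with three groups centred at 0, 4 and 8
sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))

# nstart repeats each fit from several random starts and keeps the best
wss <- sapply(1:6, function(k) {
  kmeans(sim, centers = k, nstart = 20)$tot.withinss
})

round(wss, 1)  # look for the k after which the drop flattens out
```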

Cluster membership

data2$Member<-kmeans$cluster
data2$Member=factor(data2$Member,labels=c('Champs','Reproachables','Hunters','Unpassionate','Unwary'), levels=c(1,2,3,4,5))
data2<-data2[-c(20:24)]
head(data2) %>%
  kable("html") %>%
  kable_styling(font_size=12) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

SbjNum	location	age	gender	marital	renewable	aware_govt_initiative	envt_information	sec	aware_one	aware_two	aware_three	aware_four	education	participate_one	participate_two	non_renewable	awareness_index	participation_index	Member
11999193	Central	25-34	Male	Single	yes	No	No	A	0	5	5	5	University / Polytechnic incomplete	2	2	No	1.4224644	1.4224644	Unwary
11999305	Central	35-44	Male	Single	yes	No	No	C2	0	5	3	5	University / Polytechnic incomplete	2	1	No	0.0771632	0.0771632	Hunters
11999984	Central	25-34	Male	Single	No	yes	No	C1	0	1	5	5	University / Polytechnic incomplete	2	1	yes	0.0771632	0.0771632	Hunters
12000457	Central	55-65	Male	Divorced/ Separated	No	yes	No	C1	1	3	1	5	Some primary	2	1	No	0.0771632	0.0771632	Unpassionate
12001126	Central	25-34	Male	Single	yes	No	No	C1	0	5	5	5	Post-university complete	2	2	No	1.4224644	1.4224644	Unpassionate
12001937	Central	45-54	Male	Married	yes	yes	No	C1	0	5	5	5	Completed secondary	2	1	No	0.0771632	0.0771632	Hunters

Anticipated squared and interaction terms, added before building the models and splitting into training and test data

interaction_1<-data2$awareness_index^2
interaction_2<-as.numeric(data2$sec)*as.numeric(data2$location)
interaction_3<-data2$participation_index^2
data2<-cbind(data2,interaction_1,interaction_2,interaction_3)

Predictive learning

We start by splitting the data into training and test sets.

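data<-na.omit(data2)
set.seed(1235)
data<-data[2:23]                  # drop the respondent ID (column 1)
data<- data[sample(nrow(data)),]  # shuffle the rows
split <- floor(nrow(data)/2)
data_train <- data[seq_len(split),]
# Note: the upper bound nrow(data)-1 leaves the last shuffled row out of the test set
data_test <- data[(split+1):(nrow(data)-1),]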
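
A split of this kind can also be wrapped in a small helper that keeps every row. This is a sketch only: `split_half` and the toy data frame are hypothetical names, not part of the original analysis.

```r
# Minimal sketch: shuffle a data frame and split it into two halves.
split_half <- function(df, seed = 1235) {
  set.seed(seed)
  shuffled <- df[sample(nrow(df)), , drop = FALSE]
  n_train <- floor(nrow(shuffled) / 2)
  list(
    train = shuffled[seq_len(n_train), , drop = FALSE],
    test  = shuffled[(n_train + 1):nrow(shuffled), , drop = FALSE]
  )
}

# Toy example with an odd number of rows
toy <- data.frame(id = 1:11, x = rnorm(11))
parts <- split_half(toy)

nrow(parts$train)  # 5
nrow(parts$test)   # 6
```

Using `seq_len(n_train)` and `nrow(shuffled)` as the bounds means no observation is silently left out of both sets.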

Multinomial Model

mlogit=multinom(Member~.,data=data_train,maxit=1000)

## # weights:  240 (188 variable)
## initial  value 730.684812 
## iter  10 value 651.636407
## iter  20 value 560.137281
## iter  30 value 535.205359
## iter  40 value 526.486360
## iter  50 value 522.040933
## iter  60 value 520.218390
## iter  70 value 519.861618
## iter  80 value 519.783404
## iter  90 value 519.673790
## iter 100 value 519.641983
## iter 110 value 519.631458
## final  value 519.627484 
## converged

predictedML <- predict(mlogit,data_test,na.action =na.pass, type="probs")
predicted_classML <- predict(mlogit,data_test)

confusionMatrix(as.factor(predicted_classML),as.factor(data_test$Member))

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Champs Reproachables Hunters Unpassionate Unwary
##   Champs            25            25      30           16      6
##   Reproachables     11            17      17            7      5
##   Hunters           28            28      42           18     19
##   Unpassionate      15            18      11           51      5
##   Unwary            14            15      18            1     11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3223          
##                  95% CI : (0.2794, 0.3675)
##     No Information Rate : 0.2605          
##     P-Value [Acc > NIR] : 0.001951        
##                                           
##                   Kappa : 0.141           
##                                           
##  Mcnemar's Test P-Value : 0.004229        
## 
## Statistics by Class:
## 
##                      Class: Champs Class: Reproachables Class: Hunters
## Sensitivity                0.26882              0.16505        0.35593
## Specificity                0.78611              0.88571        0.72239
## Pos Pred Value             0.24510              0.29825        0.31111
## Neg Pred Value             0.80627              0.78283        0.76101
## Prevalence                 0.20530              0.22737        0.26049
## Detection Rate             0.05519              0.03753        0.09272
## Detection Prevalence       0.22517              0.12583        0.29801
## Balanced Accuracy          0.52746              0.52538        0.53916
##                      Class: Unpassionate Class: Unwary
## Sensitivity                       0.5484       0.23913
## Specificity                       0.8639       0.88206
## Pos Pred Value                    0.5100       0.18644
## Neg Pred Value                    0.8810       0.91117
## Prevalence                        0.2053       0.10155
## Detection Rate                    0.1126       0.02428
## Detection Prevalence              0.2208       0.13024
## Balanced Accuracy                 0.7061       0.56060
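
The headline statistics can be reproduced by hand from the confusion matrix printed above. The sketch below re-enters that matrix and computes accuracy, the no-information rate and Cohen's kappa with base R; the variable names are illustrative.

```r
# Recompute the overall statistics from the multinomial confusion matrix.
classes <- c("Champs", "Reproachables", "Hunters", "Unpassionate", "Unwary")
cm <- matrix(c(25, 25, 30, 16,  6,
               11, 17, 17,  7,  5,
               28, 28, 42, 18, 19,
               15, 18, 11, 51,  5,
               14, 15, 18,  1, 11),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)                              # test-set size
accuracy <- sum(diag(cm)) / n             # share of correct predictions
nir <- max(colSums(cm)) / n               # always guessing the largest class
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # chance agreement
kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, kappa = kappa), 4)
```

Kappa rescales accuracy so that 0 means agreement no better than chance given the row and column totals, which is why it is much lower than the raw accuracy here.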

Support Vector Machine

model.svm = svm(formula = Member ~ .,
                 data = data_train,
                 type = 'C-classification',
                 kernel = 'radial')

# Predict on the test set without the outcome column (Member is column 19)
y_pred = predict(model.svm, newdata = data_test[-19])

confusionMatrix(as.factor(y_pred),as.factor(data_test$Member))

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Champs Reproachables Hunters Unpassionate Unwary
##   Champs            40            46      51           14     18
##   Reproachables      1             1       1            0      0
##   Hunters           35            34      59           12     22
##   Unpassionate      16            22       6           67      6
##   Unwary             1             0       1            0      0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3687          
##                  95% CI : (0.3241, 0.4149)
##     No Information Rate : 0.2605          
##     P-Value [Acc > NIR] : 2.698e-07       
##                                           
##                   Kappa : 0.1857          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Champs Class: Reproachables Class: Hunters
## Sensitivity                 0.4301             0.009709         0.5000
## Specificity                 0.6417             0.994286         0.6925
## Pos Pred Value              0.2367             0.333333         0.3642
## Neg Pred Value              0.8134             0.773333         0.7973
## Prevalence                  0.2053             0.227373         0.2605
## Detection Rate              0.0883             0.002208         0.1302
## Detection Prevalence        0.3731             0.006623         0.3576
## Balanced Accuracy           0.5359             0.501997         0.5963
##                      Class: Unpassionate Class: Unwary
## Sensitivity                       0.7204      0.000000
## Specificity                       0.8611      0.995086
## Pos Pred Value                    0.5726      0.000000
## Neg Pred Value                    0.9226      0.898004
## Prevalence                        0.2053      0.101545
## Detection Rate                    0.1479      0.000000
## Detection Prevalence              0.2583      0.004415
## Balanced Accuracy                 0.7908      0.497543

Random Forest

classifier = randomForest(x = data_train[-19],
                          y = data_train$Member,
                          ntree = 1000)
y_pred = predict(classifier, newdata = data_test[-19])
confusionMatrix(as.factor(y_pred),as.factor(data_test$Member))

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Champs Reproachables Hunters Unpassionate Unwary
##   Champs            27            27      32           10      7
##   Reproachables     16            12      16            7     10
##   Hunters           31            34      47           16     13
##   Unpassionate       9            19       9           59      5
##   Unwary            10            11      14            1     11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3444          
##                  95% CI : (0.3007, 0.3901)
##     No Information Rate : 0.2605          
##     P-Value [Acc > NIR] : 4.716e-05       
##                                           
##                   Kappa : 0.1651          
##                                           
##  Mcnemar's Test P-Value : 0.02794         
## 
## Statistics by Class:
## 
##                      Class: Champs Class: Reproachables Class: Hunters
## Sensitivity                 0.2903              0.11650         0.3983
## Specificity                 0.7889              0.86000         0.7194
## Pos Pred Value              0.2621              0.19672         0.3333
## Neg Pred Value              0.8114              0.76786         0.7724
## Prevalence                  0.2053              0.22737         0.2605
## Detection Rate              0.0596              0.02649         0.1038
## Detection Prevalence        0.2274              0.13466         0.3113
## Balanced Accuracy           0.5396              0.48825         0.5589
##                      Class: Unpassionate Class: Unwary
## Sensitivity                       0.6344       0.23913
## Specificity                       0.8833       0.91155
## Pos Pred Value                    0.5842       0.23404
## Neg Pred Value                    0.9034       0.91379
## Prevalence                        0.2053       0.10155
## Detection Rate                    0.1302       0.02428
## Detection Prevalence              0.2230       0.10375
## Balanced Accuracy                 0.7589       0.57534
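
To compare the three models side by side, the figures printed above can be collected into one table. The values below are copied from the outputs in this document; the column `min_sensitivity` (the lowest per-class sensitivity for each model) is an added summary, not something reported by `confusionMatrix` itself.

```r
# Collect the reported test-set statistics for the three models.
results <- data.frame(
  model           = c("Multinomial logit", "SVM (radial)", "Random forest"),
  accuracy        = c(0.3223, 0.3687, 0.3444),
  kappa           = c(0.1410, 0.1857, 0.1651),
  min_sensitivity = c(0.16505, 0.00000, 0.11650)
)

# Rank by accuracy, highest first
results[order(-results$accuracy), ]
```

On accuracy and kappa the radial SVM ranks first, but it almost never predicts the Reproachables or Unwary classes, so the choice between models depends on whether every segment needs to be recoverable.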