Overview

Executive Summary

Initial Findings

For my analysis, I first examined differences between our customers who bought organics and those who did not. Across store spending, region, and months as loyalty card holders, there was little to no discrepancy between our organic and non-organic customers. Outside of these three variables, however, we see some key differences between these two customer groups.

Firstly, those who bought more organics were generally wealthier. This is to be expected-those with more money have more disposable income to spend on healthier produce. Secondly, our organics customers tend to be younger. As younger generations have become more health conscious about how their groceries are sourced, this also is not surprising. In terms of gender, our primary shoppers are women. Over 70% of all our sales are to females. Women also have a higher tendency to buy organics, as can be seen by the mosaic plot in the Gender tab under EDA.
Across the pre-calculated cluster groups, we see that cluster group F has a higher tendency to buy organics. Cluster group A is the least likely to buy our organics. More information on how these cluster groups were generated can help in interpreting this output. Most importantly, however, I noted some differences across our loyalty status categories. The higher a customer’s loyalty status, the lower their tendency to purchase organics. This relationship holds true at every loyalty status level. Delving further into this issue might enable us as a company to boost our organics sales.

Predictive Modeling

From here I ran some predictive modeling to predict the probability that our customers will purchase organics. Backing out these models, I was able to determine which independent variables were most significantly impacting our customers’ choice to buy organics. Based on the independent variables, my final gradient boosting model can accurately predict which customers will and will not purchase our organics with only a misclassification rate of 19.35%. I analyzed the output of my model via a decision tree on its predictions - this showed that the variables most impacting the model, and therefore customers’ probabilities of purchasing organics, are age, affluence, and gender. This model will enable our marketing team to assess which of our customers on an individual basis are most likely to purchase our organics. As a grocer, we can use these probabilities to better spend our advertising dollars to boost organics sales.

Customer Clustering

After performing predictive modeling, I also chose to implement clustering techniques to split our customers into different segments. I ultimately generated twelve customer segment clusters, with one specific customer segment having a high tendency to purchase our organics. I then chose to delve further into the characteristics of this cluster.

This cluster was both the wealthiest and the second youngest. This falls in line with the initial exploratory analysis. The cluster was also mostly female, although it did contain some males as well. This cluster did not consist of our most loyal customers; most of the customers in this segment had either silver or tin loyalty status.

The most worrying statistic about this cluster is the amount that it spends annually in our stores. This cluster of customers had the third lowest average annual expenditures at our stores. Our high-spending customers, on the other hand, only buy organics at a rate of 15%. This means that we as a grocer need to reposition our brand if we really want to sell more organics. Those who shop at our store the most are not buying organics, indicating that organic produce is not something that people come to our store most often to buy.

Going Forward

We should send more targeted messaging to our high-spending customer cluster. Doing so, hopefully we can convince this cluster to buy more organics. If we can successfully convert this segment, then we as a brand position ourselves to sell more organics to our customers and improve their health. We need to seize this opportunity and convince those who shop with us the most frequently to buy more organics.

If we can effectively engage with these cluster and leverage our predictive model, then we can better spend our advertising dollars and disburse our coupon. This should boost our organic sales across the board, boosting our public image, and make us the healthy grocer of tomorrow.

Data Dictionary

data_dictionary <- read.csv("Data Dictionary.csv")
colnames(data_dictionary) <- c("Variable", "Description")
data_dictionary %>%
  datatable(rownames = FALSE)

Required Packages

library(plyr)            # For caret package
library(tidyverse)       # For data manipulation
library(missRanger)      # For missing value imputation
library(leaps)           # For gradient boosting
library(rpart)           # For decision tree
library(randomForest)    # For random forest model
library(e1071)           # For svm model
library(caret)           # For regression analysis
library(rattle)          # For ploting decision tree
library(factoextra)      # For cluster analysis
library(cluster)         # For cluster analysis
library(NbClust)         # For cluster analysis
library(DT)              # For datatable visualization

Import and Format

df <- read.csv("organics.csv")
df <- df %>%
  select(-DemCluster)
colnames(df)[1] <- "ID"

The initial dataset consisted of 11 variables and 22,223 observations. I dropped the DemCluster variable because of the high number of levels and since this variable had already been used to generate the DemClusterGroup variable.

The initial dataset also contained some “U” values and some “” values, indicating that these values were unknown. I therefore changed these values to NA.

df$DemClusterGroup[df$DemClusterGroup %in% c("U","")] <- NA
df$DemGender[df$DemGender %in% c("U","")] <- NA
df$DemReg[df$DemReg == ""] <- NA
df$DemTVReg[df$DemTVReg == ""] <- NA

I converted the PromTime variable to an ordered factor, assuming the ordering that “tin”<“silver”<“gold”<“platinum”. I also converted the binary target variable to a factor. After doing so, I dropped all unused factor levels from the data.

df$PromClass <- factor(df$PromClass, ordered = TRUE, 
                       levels = c("Tin", "Silver", "Gold", "Platinum"))
df$TargetBuy <- factor(df$TargetBuy)
df <- droplevels(df)

At this point there were still 8791 missing values in the dataset. Before imputing these values, I first performed some exploratory data analysis.

EDA

I chose to first determine if there were any noticeable differences between customers that purchase organics versus those who do not. I did this across each of the explanatory attributes.

Dependent Variables

Affluence

df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = DemAffl)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Affluence (1-30)") +
  ggtitle("Wealthier People Buy More Organics")

Age

df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = DemAge)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Age") +
  ggtitle("Younger People Buy More Organics")

Gender

table(df$DemGender, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Genders", xlab = "Gender", 
             ylab = "Whether They Purchase Organics")
   
             0          1
  F 0.44221777 0.23407927
  M 0.26992875 0.05377422

Cluster Group

table(df$DemClusterGroup, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Cluster Group", 
             xlab = "Cluster Group", ylab = "Whether They Purchase Organics")
   
             0          1
  A 0.06894627 0.01712026
  B 0.15189579 0.04089323
  C 0.15961852 0.05280298
  D 0.15231449 0.05136078
  E 0.09025355 0.03107699
  F 0.13035590 0.05336125

We see that among our cluster groups, the F group tends to buy more organics.

Region

table(df$DemReg, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Region", xlab = "Region",
             ylab = "Whether They Purchase Organics")
            
                       0           1
  Midlands   0.231363177 0.078453902
  North      0.152219873 0.046511628
  Scottish   0.047982351 0.014891075
  South East 0.296442688 0.100376873
  South West 0.024450777 0.007307657

We see that most of our shoppers are from the South East, followed by the Midlands and the North. Across all regions, we don’t see major discrepancies in the proportion of organics sold.

Loyalty Status

table(df$PromClass, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Loyalty Status", 
             xlab = "Loyalty Status", ylab = "Whether They Purchase Organics")
          
                     0           1
  Tin      0.200602979 0.091301804
  Silver   0.290689826 0.095036674
  Gold     0.228951987 0.055618053
  Platinum 0.032038879 0.005759798

Across loyalty status, we see that our loyal customers buy fewer organics.

Store Spending

df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = PromSpend)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Store Spending") +
  scale_y_continuous(labels = scales::dollar) +
  ggtitle("Wealthier People Buy More Organics")

We see that the distribution of grocery spending is highly skewed right based on these boxplots. We see that there are three people who are spending over $200,000 on groceries. Given that this seems fairly extreme, I decided to dig more into these observations.

df[c(2057, 3483, 11476),]
            ID DemAffl DemAge DemClusterGroup DemGender     DemReg  DemTVReg PromClass PromSpend
2057   4558448      15     76               C         F      North Yorkshire  Platinum  239542.1
3483   7932499      10     67               C         M South East    London  Platinum  296313.8
11476 29207630      15     71               A      <NA> South East    London  Platinum  201000.0
      PromTime TargetBuy TargetAmt
2057         4         1         1
3483         8         0         0
11476        7         0         0

Given that these three outliers have average affluence levels, I am operating under the assumption that their spending was not above $200,000–rather, they misrecorded by two decimal places. I am imputing them as such.

df[c(2057, 3483, 11476), "PromSpend"] <- df[c(2057, 3483, 11476), 
                                            "PromSpend"] / 100

Months as Loyalty Member

df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = PromTime)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Months as Loyalty Card Member") +
  ggtitle("Comparing Months as Member to Organics Purchases")

We don’t see a difference in the distribution of months as loyalty card members between those who bought organics and who didn’t.

Imputation

I operated under the assumption that the remaining 8791 missing values were missing at random. Therefore I chose to use the missRanger function to impute the missing values. This function imputes mixed-type data sets by chaining tree ensembles.

df[1:10] <- missRanger(df[1:10], seed = 42)

Missing value imputation by chained tree ensembles

iter 1: .......
iter 2: .......
iter 3: .......

Predictive Modeling

I first split my data into training and testing data. I dropped the TargetAmt variable because the purpose of my models was to predict whether someone purchased organics, not how many organics they bought. I also dropped the ID variable since this variable has no impact on my target variable.

set.seed(42)
df <- df %>%
  select(-TargetAmt, -ID)
sample <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, 
                 prob = c(0.7, 0.3))
train <- df[sample, ]
test <- df[!sample, ]

I used a logistic regression, decision tree, random forest, gradient boosting model, and support vector machine to determine which model did the best job predicting the TargetBuy variable.

Logistic Regression

logit_null <- glm(TargetBuy ~ 1, data = train, 
                  family = binomial(link = "logit"))
logit_full <- glm(TargetBuy ~ ., data = train, 
                  family = binomial(link = "logit"))
logit_fit <- step(logit_null,scope=formula(logit_full),
                    direction="both",k=2)
logit_predictions <- predict(logit_fit, newdata = test, type = "response")
logit_conf_matrix <- table(test$TargetBuy, logit_predictions > 0.5) %>%
  prop.table()

I used both forward and backward selection to determine my final logistic regression model. The final statistically significant predictor variables were DemAffl, DemAge, and DemGender, all of which were significant at the 99% confidence level. Customer affluence increased the probability of a customer purchasing organics, which increased age and being male both decreased the probability of a customer purchasing organics.

summary(logit_fit)

Call:
glm(formula = TargetBuy ~ DemAffl + DemAge + DemGender, family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0625  -0.7073  -0.4655  -0.1603   2.8973  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.397150   0.108143  -3.672  0.00024 ***
DemAffl      0.248395   0.006934  35.822  < 2e-16 ***
DemAge      -0.055316   0.001734 -31.896  < 2e-16 ***
DemGenderM  -0.713361   0.051673 -13.805  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 17451  on 15570  degrees of freedom
Residual deviance: 14182  on 15567  degrees of freedom
AIC: 14190

Number of Fisher Scoring iterations: 5

Decision Tree

tree_fit <- rpart(TargetBuy ~ ., data = train, method = "class")
tree_predictions <- predict(tree_fit, newdata = test, type = "class")
tree_conf_matrix <- table(test$TargetBuy, tree_predictions) %>% 
  prop.table()
fancyRpartPlot(tree_fit)

The output of my tree method showed that age was the single most important variable in determining whether or not someone purchased organics, with those age 45 or over not buying organic. The next most significant split was on affluence, with those having more money tending to purchase more organics.

Random Forest

forest_fit <- randomForest(TargetBuy ~ ., data = train, ntree = 500)
forest_predictions <- predict(forest_fit, newdata = test, type = "class")
forest_conf_matrix <- table(test$TargetBuy, forest_predictions) %>%
  prop.table()

Gradient Boosting

fit_control <- trainControl(method = "cv", number = 10)
tune_grid <- expand.grid(interaction.depth = 2, n.trees = 500, shrinkage = 0.1,
                         n.minobsinnode = 10)                            
gbm_fit <- train(TargetBuy ~ ., data = train, method = "gbm", 
                   trControl = fit_control, verbose = FALSE, 
                   tuneGrid = tune_grid)
gbm_predictions <- predict(gbm_fit, newdata = test, type = "prob")
gbm_conf_matrix <- table(test$TargetBuy, gbm_predictions[, 2] > 0.5) %>%
  prop.table()

SVM

svm_fit <- svm(TargetBuy ~ ., data = train)
svm_predictions <- predict(svm_fit, newdata = test, type = "prob")
svm_conf_matrix <- table(test$TargetBuy, svm_predictions) %>%
  prop.table()

Model Comparison

models <- list(
  logit = logit_conf_matrix,
  tree = tree_conf_matrix,
  forest = forest_conf_matrix,
  boosting = gbm_conf_matrix,
  svm = svm_conf_matrix
)
for (i in 1:length(models)){
  rownames(models[[i]]) <- c("Bought", "Didn't Buy")
  colnames(models[[i]]) <- c("Predicted No Buy", "Predicted Buy")
}
misclassification_rates <- list(
  logit = paste("Misclassification Rate =",
                mean(test$TargetBuy != round(logit_predictions, 0))),
  tree = paste("Misclassification Rate =", 
               mean(test$TargetBuy != tree_predictions)),
  forest = paste("Misclassification Rate =", 
                 mean(test$TargetBuy != forest_predictions)),
  gbm = paste("Misclassification Rate =",
                    mean(test$TargetBuy != round(gbm_predictions[, 2], 0))),
  svm = paste("Misclassification Rate =", 
              mean(test$TargetBuy != svm_predictions))
)
print(models)
$`logit`
            
             Predicted No Buy Predicted Buy
  Bought           0.71602526    0.03743235
  Didn't Buy       0.16551413    0.08102826

$tree
            tree_predictions
             Predicted No Buy Predicted Buy
  Bought           0.72113650    0.03232111
  Didn't Buy       0.16791942    0.07862297

$forest
            forest_predictions
             Predicted No Buy Predicted Buy
  Bought           0.69678292    0.05667468
  Didn't Buy       0.14762477    0.09891762

$boosting
            
             Predicted No Buy Predicted Buy
  Bought           0.70625376    0.04720385
  Didn't Buy       0.14657246    0.09996993

$svm
            svm_predictions
             Predicted No Buy Predicted Buy
  Bought           0.73677090    0.01668671
  Didn't Buy       0.18656043    0.05998196
print(misclassification_rates)
$`logit`
[1] "Misclassification Rate = 0.202946482260974"

$tree
[1] "Misclassification Rate = 0.200240529164161"

$forest
[1] "Misclassification Rate = 0.204299458809381"

$gbm
[1] "Misclassification Rate = 0.19377630787733"

$svm
[1] "Misclassification Rate = 0.203247143716176"

The final model with the lowest misclassification rate was the gradient boosting model. I am therefore choosing to proceed with the gradient boosting model for predicting whether or not a customer will purchase organics. I implemented a decision tree trying to predict my gradient boosting model predictions based on the x inputs to the model to discern which variables had the biggest impact on my gradient boosting model predictions.

train_predictions <- predict(gbm_fit, newdata = train, type = "prob")
train_df_w_predictions <- train %>%
  mutate(Prediction = round(train_predictions[, 2], 0)) %>%
  select(-TargetBuy)
model_assessment_tree <- rpart(Prediction ~ ., data = train_df_w_predictions,
                               method = "class")
fancyRpartPlot(model_assessment_tree)

Based on the output of the decision tree, the variables most impacting my final model’s predictions were age followed by affluence followed by gender. These three variables were the only attributes represented in my decision tree modeling of my gbm predictions.

Scored Set

new_data <- read.csv("New_organics.csv")
new_data <- new_data %>%
  select(-DemCluster)
new_data$PromSpend <- 0
new_data$DemClusterGroup[new_data$DemClusterGroup %in% c("U","")] <- NA
new_data$DemGender[new_data$DemGender %in% c("U","")] <- NA
new_data$DemReg[new_data$DemReg == ""] <- NA
new_data$DemTVReg[new_data$DemTVReg == ""] <- NA
new_data$PromClass <- factor(new_data$PromClass, ordered = TRUE, 
                       levels = c("Tin", "Silver", "Gold", "Platinum"))
new_data <- droplevels(new_data)
new_data <- missRanger(new_data, seed = 42)

Missing value imputation by chained tree ensembles

iter 1: .
iter 2: .
iter 3: .
## Creating gbm_fit on all data
gbm_fit <- train(TargetBuy ~ ., data = df, method = "gbm", 
                 trControl = fit_control, verbose = FALSE, 
                 tuneGrid = tune_grid)
gbm_predictions <- predict(gbm_fit, newdata = new_data, type = "prob")
scored_set <- cbind(new_data$ID,
                            predict(gbm_fit,
                                    newdata = new_data, type = "prob")[, 2]) %>%
  data.frame() %>%
  arrange(desc(X2))
colnames(scored_set) <- c("ID", "Prob of Purchasing")
print(scored_set)
     ID Prob of Purchasing
1  2014         0.87826575
2  2018         0.53859139
3  2006         0.35826681
4  2000         0.32818577
5  2004         0.32805737
6  2017         0.31163555
7  2011         0.28386556
8  2021         0.27353447
9  2009         0.24883738
10 2005         0.22013667
11 2020         0.18217325
12 2007         0.16587082
13 2012         0.14600402
14 2010         0.10950861
15 2008         0.09411691
16 2016         0.09113999
17 2003         0.08954738
18 2002         0.07898911
19 2013         0.06980488
20 2022         0.06952756
21 2015         0.06437703
22 2023         0.05295740
23 2019         0.04909787
24 2001         0.03230183

I applied the same data cleaning and imputation on the new set as I did on the old set. When creating my gbm model, however, I trained it on the full initial dataset rather than just the training partition. The final output shows the customer ID in the left column and the probability that that customer buys organics in the right column. This final model only predicted two of the ten people as likely to purchase organics.

Customer Clustering

I clustered the customers to generate customer segments. The intent of these clusters is so the grocery chain can better target its users with its marketing messages. For these clusters, I chose to leave out the target variable TargetBuy since the intent of these clusters is to eventually also target a customer segment with a high probability of purchasing our organics. I ran my clusters across two separate dataframes, where I either:

  1. Removed unordered multilevel factor variables DemClusterGroup, DemReg, and DemTVReg.

  2. Generated dummy variables for the multilevel factor variables.

Clustering on the dataframe containing the dummy variables failed to converge across Elbow, Silhouette, and Gap Statistics methods. Therefore I chose to continue with the first dataframe, where I left the unordered multilevel factor variables entirely out of my clustering analysis.

Preparation

clus_vars <- df %>%
  select(-TargetBuy)
clus_vars <- clus_vars %>%
  select(-c(DemClusterGroup, DemReg, DemTVReg))
clus_vars <- lapply(clus_vars, as.integer) %>%
  data.frame()

I scaled the dataframe to keep any one variable from having too large an impact on my final output clusters.

clus_vars_scaled <- scale(clus_vars) %>%
  data.frame()

Results

I first tried to determine the optimal cluster numbers through both the elbow and silhouette method.

Elbow Method

fviz_nbclust(clus_vars_scaled, kmeans, method = "wss")
Error: cannot allocate vector of size 1.8 Gb

The elbow method, while not completely definitive, suggested that optimum cluster sizes might be 2, 4, or 6.

Silhouette Method

fviz_nbclust(clus_vars_scaled, kmeans, method = "silhouette")

The average silhouette method demonstrates optimal cluster number at 4.

Seeing that both methods disagreed on ideal cluster size, I chose to use the NbClust function to determine which cluster count was optimal across the majority of the 26 non-computationally expensive methods embedded in NbClust. Note that I used argument index = “all” to iterate only on the non-computationally expensive metrics. Since this procedure is only being used to determine optimal number of clusters, not the final clusters for my model, I am choosing to subset my data to cut down on computational time. Therefore I am only running NbClust on 30% of my data.

NbClust

cluster_sample <- sample(c(TRUE, FALSE), nrow(clus_vars_scaled), replace = TRUE, 
                         prob = c(0.3, 0.7))
cluster_subset <- clus_vars_scaled[cluster_sample,]
nb <- NbClust(cluster_subset, distance = "euclidean", min.nc = 2,
        max.nc = 15, method = "complete", index ="all")
fviz_nbclust(nb)

The resuls of this clustering on my subsetted data demonstrated an ideal cluster count of 2. Given that this is not necessarily practical from a business perspective, I am choosing to operate with k = 12 clusters since this was the second-based cluster number.

k12 <- kmeans(clus_vars_scaled, centers = 12, nstart = 50, iter.max = 50)
k12

Each of these clusters was of a substantial size, the smallest of which being of size 141 68.3% of the total variability was accounted for by between-cluster variance.

fviz_cluster(k12, data = clus_vars_scaled)

Note that around 50% of the variability in the customer clusters is explained by Dim1 and Dim2. Also note that 52.2% of the variability in the data can be explained by the two eigenvectors Dim1 and Dim2.

I then chose to analyze which among these clusters had the highest proportion of organic purchases.

## Adding back target variable:
clus_vars <- cbind(clus_vars, TargetBuy = as.integer(df$TargetBuy),
                   Cluster = k12$cluster)

cluster_means <- clus_vars %>%
  group_by(Cluster) %>%
  summarize_all("mean") %>%
  select(TargetBuy, everything()) %>%
  arrange(desc(TargetBuy))

datatable(cluster_means)

Exploring Final Clusters

ggplot(cluster_means, aes(x = reorder(Cluster, -TargetBuy), y = TargetBuy - 1, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Organics Purchased") +
  ggtitle("One of Our Segments Purchases Organics More Often")

Note that one cluster by far has the highest probability of purchasing organics. For the following graphs, I will continue to have the bar fill represent organic purchases.

Explanatory Variable Plots

Affluence
ggplot(cluster_means, aes(x = reorder(Cluster, -DemAffl), y = DemAffl, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Affluence") +
  ggtitle("The Organics Cluster has more Money")
Age
ggplot(cluster_means, aes(x = reorder(Cluster, -DemAge), y = DemAge, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Age") +
  ggtitle("The Organics Cluster is Fairly Young")
Gender
ggplot(cluster_means, aes(x = reorder(Cluster, -DemGender), y = 1.5 - DemGender, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Gender") +
  scale_y_continuous(breaks = c(-0.5, 0, 0.5), 
                     labels = c("Male", "Even", "Female")) +
  ggtitle("The Organics Cluster is Mostly Female")
Loyalty Status
ggplot(cluster_means, aes(x = reorder(Cluster, -PromClass), y = PromClass, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(breaks = c(1,2,3,4), labels = c("tin", "silver", "gold",
                                                     "platinum")) +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  ylab("Loyalty Status") +
  xlab("Cluster") +
  ggtitle("Our Most Loyal Customers are not buying Organics")
Annual Store Spending
ggplot(cluster_means, aes(x = reorder(Cluster, -PromSpend), y = PromSpend, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  ylab("Total Store Spending") +
  xlab("Cluster") +
  ggtitle("Our Highest-Spending Customers are not buying Organics")
Time as Loyalty Card Member
ggplot(cluster_means, aes(x = reorder(Cluster, -PromTime), y = PromTime, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Months as Loyalty Card Member") +
  ggtitle("Our Oldest Loyalty Cards Customers are not buying Organics")

Analysis of Explanatory Variables

The primary takeaways from these clusters is that affluence is most tied to organic produce purchases. These clusters also indicate that those who mostly purchase organics tend to not spend much in the store, indicating that the store is not attracting organics-savvy consumers as much as it could be.

Outside of analyzing organics sales, we see that one cluster’s primary feature is those who have had loyalty cards longest. Another cluster spends the most and also consists mostly of platinum members. Based on these features, we know that this cluster is our highest value cluster. Therefore we should target this customer segment to purchase more organics.

---
title: "Organic Produce Analysis"
output: html_notebook
---

# Overview {.tabset .tabset-fade}

## Executive Summary

![](organics_photo.jpg)

### Initial Findings

For my analysis, I first examined differences between our customers who bought organics and those who did not.  Across store spending, region, and months as loyalty card holders, there was little to no discrepancy between our organic and non-organic customers.  Outside of these three variables, however, we see some key differences between these two customer groups.
	
Firstly, those who bought more organics were generally wealthier.  This is to be expected-those with more money have more disposable income to spend on healthier produce.  Secondly, our organics customers tend to be younger.  As younger generations have become more health conscious about how their groceries are sourced, this also is not surprising.  In terms of gender, our primary shoppers are women.  Over 70% of all our sales are to females.  Women also have a higher tendency to buy organics, as can be seen by the mosaic plot in the Gender tab under EDA.  
Across the pre-calculated cluster groups, we see that cluster group F has a higher tendency to buy organics.  Cluster group A is the least likely to buy our organics.  More information on how these cluster groups were generated can help in interpreting this output.
Most importantly, however, I noted some differences across our loyalty status categories.  The higher a customer's loyalty status, the lower their tendency to purchase organics.  This relationship holds true at every loyalty status level.  Delving further into this issue might enable us as a company to boost our organics sales.
	
### Predictive Modeling

From here I ran some predictive modeling to predict the probability that our customers will purchase organics.  Backing out these models, I was able to determine which independent variables were most significantly impacting our customers' choice to buy organics.  Based on the independent variables, my final gradient boosting model can accurately predict which customers will and will not purchase our organics with only a misclassification rate of 19.35%.  I analyzed the output of my model via a decision tree on its predictions - this showed that the variables most impacting the model, and therefore customers' probabilities of purchasing organics, are age, affluence, and gender.  This model will enable our marketing team to assess which of our customers on an individual basis are most likely to purchase our organics.  As a grocer, we can use these probabilities to better spend our advertising dollars to boost organics sales.


### Customer Clustering

After performing predictive modeling, I also chose to implement clustering techniques to split our customers into different segments.  I ultimately generated twelve customer segment clusters, with one specific customer segment having a high tendency to purchase our organics.  I then chose to delve further into the characteristics of this cluster.
	 
This cluster was both the wealthiest and the second youngest.  This falls in line with the initial exploratory analysis.  The cluster was also mostly female, although it did contain some males as well.  This cluster did not consist of our most loyal customers; most of the customers in this segment had either silver or tin loyalty status.
	
The most worrying statistic about this cluster is the amount that it spends annually in our stores.  This cluster of customers had the third lowest average annual expenditures at our stores.  Our high-spending customers, on the other hand, only buy organics at a rate of 15%.  This means that we as a grocer need to reposition our brand if we really want to sell more organics.  Those who shop at our store the most are not buying organics, indicating that organic produce is not something that people come to our store most often to buy. 

### Going Forward

We should send more targeted messaging to our high-spending customer cluster.  Doing so, hopefully we can convince this cluster to buy more organics.  If we can successfully convert this segment, then we as a brand position ourselves to sell more organics to our customers and improve their health.  We need to seize this opportunity and convince those who shop with us the most frequently to buy more organics.
	
If we can effectively engage with these cluster and leverage our predictive model, then we can better spend our advertising dollars and disburse our coupon.  This should boost our organic sales across the board, boosting our public image, and make us the healthy grocer of tomorrow.

## Data Dictionary

```{r}
data_dictionary <- read.csv("Data Dictionary.csv")
colnames(data_dictionary) <- c("Variable", "Description")
data_dictionary %>%
  datatable(rownames = FALSE)
```

## Required Packages

```{r, message = FALSE}
library(plyr)            # For caret package
library(tidyverse)       # For data manipulation
library(missRanger)      # For missing value imputation
library(leaps)           # For gradient boosting
library(rpart)           # For decision tree
library(randomForest)    # For random forest model
library(e1071)           # For svm model
library(caret)           # For regression analysis
library(rattle)          # For ploting decision tree
library(factoextra)      # For cluster analysis
library(cluster)         # For cluster analysis
library(NbClust)         # For cluster analysis
library(DT)              # For datatable visualization
```

## Import and Format

```{r}
df <- read.csv("organics.csv")

df <- df %>%
  select(-DemCluster)

colnames(df)[1] <- "ID"
```

The initial dataset consisted of 11 variables and 22,223 observations.  I dropped the <code>DemCluster</code> variable because of the high number of levels and since this variable had already been used to generate the <code>DemClusterGroup</code> variable.

The initial dataset also contained some "U" values and some "" values, indicating that these values were unknown.  I therefore changed these values to NA.

```{r}
df$DemClusterGroup[df$DemClusterGroup %in% c("U","")] <- NA
df$DemGender[df$DemGender %in% c("U","")] <- NA
df$DemReg[df$DemReg == ""] <- NA
df$DemTVReg[df$DemTVReg == ""] <- NA
```

I converted the <code>PromTime</code> variable to an ordered factor, assuming the ordering that "tin"<"silver"<"gold"<"platinum".  I also converted the binary target variable to a factor.  After doing so, I dropped all unused factor levels from the data.

```{r}
df$PromClass <- factor(df$PromClass, ordered = TRUE, 
                       levels = c("Tin", "Silver", "Gold", "Platinum"))

df$TargetBuy <- factor(df$TargetBuy)

df <- droplevels(df)
```

At this point there were still 8791 missing values in the dataset.  Before imputing these values, I first performed some exploratory data analysis.

## EDA

I chose to first determine if there were any noticeable differences between customers that purchase organics versus those who do not.  I did this across each of the explanatory attributes.

### Dependent Variables {.tabset .tabset-fade}

#### Affluence

```{r, warning = FALSE}
df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = DemAffl)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Affluence (1-30)") +
  ggtitle("Wealthier People Buy More Organics")
```


#### Age

```{r, warning = FALSE}
df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = DemAge)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Age") +
  ggtitle("Younger People Buy More Organics")
```


#### Gender

```{r}
table(df$DemGender, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Genders", xlab = "Gender", 
             ylab = "Whether They Purchase Organics")

```

#### Cluster Group

```{r}
table(df$DemClusterGroup, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Cluster Group", 
             xlab = "Cluster Group", ylab = "Whether They Purchase Organics")
```

 We see that among our cluster groups, the F group tends to buy more organics.

#### Region

```{r}
table(df$DemReg, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Region", xlab = "Region",
             ylab = "Whether They Purchase Organics")
```

We see that most of our shoppers are from the South East, followed by the Midlands and the North.  Across all regions, we don't see major discrepancies in the proportion of organics sold.

#### Loyalty Status

```{r}
table(df$PromClass, df$TargetBuy) %>%
  prop.table() %>%
  print() %>%
  mosaicplot(main = "Organic Purchases and Loyalty Status", 
             xlab = "Loyalty Status", ylab = "Whether They Purchase Organics")

```

Across loyalty status, we see that our loyal customers buy fewer organics.

#### Store Spending

```{r}
df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = PromSpend)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Store Spending") +
  scale_y_continuous(labels = scales::dollar) +
  ggtitle("Wealthier People Buy More Organics")
```

We see that the distribution of grocery spending is highly skewed right based on these boxplots.  We see that there are three people who are spending over $200,000 on groceries.  Given that this seems fairly extreme, I decided to dig more into these observations.

```{r}
df[c(2057, 3483, 11476),]
```

Given that these three outliers have average affluence levels, I am operating under the assumption that their spending was not above $200,000--rather, they misrecorded by two decimal places.  I am imputing them as such.

```{r}
df[c(2057, 3483, 11476), "PromSpend"] <- df[c(2057, 3483, 11476), 
                                            "PromSpend"] / 100
```


#### Months as Loyalty Member

```{r, warning = FALSE}
df %>%
  group_by(TargetBuy) %>%
  ggplot(aes(x = TargetBuy, y = PromTime)) +
  geom_boxplot() +
  xlab("Organics") +
  scale_x_discrete(labels = c("Didn't Buy", "Bought")) +
  ylab("Months as Loyalty Card Member") +
  ggtitle("Comparing Months as Member to Organics Purchases")
```

We don't see a difference in the distribution of months as loyalty card members between those who bought organics and who didn't.


## Imputation

I operated under the assumption that the remaining 8791 missing values were missing at random.  Therefore I chose to use the <code>missRanger</code> function to impute the missing values.  This function imputes mixed-type data sets by chaining tree ensembles.

```{r}
df[1:10] <- missRanger(df[1:10], seed = 42)
```

## Predictive Modeling {.tabset .tabset-fade}
I first split my data into training and testing data.  I dropped the <code>TargetAmt</code> variable because the purpose of my models was to predict whether someone purchased organics, not how many organics they bought.  I also dropped the <code>ID</code> variable since this variable has no impact on my target variable.

```{r}
set.seed(42)
df <- df %>%
  select(-TargetAmt, -ID)

sample <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, 
                 prob = c(0.7, 0.3))
train <- df[sample, ]
test <- df[!sample, ]
```

I used a logistic regression, decision tree, random forest, gradient boosting model, and support vector machine to determine which model did the best job predicting the <code>TargetBuy</code> variable.

### Logistic Regression

```{r, message = FALSE, results = 'hide'}
logit_null <- glm(TargetBuy ~ 1, data = train, 
                  family = binomial(link = "logit"))
logit_full <- glm(TargetBuy ~ ., data = train, 
                  family = binomial(link = "logit"))

logit_fit <- step(logit_null,scope=formula(logit_full),
                    direction="both",k=2)

logit_predictions <- predict(logit_fit, newdata = test, type = "response")

logit_conf_matrix <- table(test$TargetBuy, logit_predictions > 0.5) %>%
  prop.table()
```

I used both forward and backward selection to determine my final logistic regression model.  The final statistically significant predictor variables were <code>DemAffl</code>, <code>DemAge</code>, and <code>DemGender</code>, all of which were significant at the 99% confidence level.  Customer affluence increased the probability of a customer purchasing organics, which increased age and being male both decreased the probability of a customer purchasing organics.

```{r}
summary(logit_fit)
```


### Decision Tree

```{r}
tree_fit <- rpart(TargetBuy ~ ., data = train, method = "class")

tree_predictions <- predict(tree_fit, newdata = test, type = "class")

tree_conf_matrix <- table(test$TargetBuy, tree_predictions) %>% 
  prop.table()

fancyRpartPlot(tree_fit)
```
The output of my tree method showed that age was the single most important variable in determining whether or not someone purchased organics, with those age 45 or over not buying organic.  The next most significant split was on affluence, with those having more money tending to purchase more organics.

### Random Forest

```{r}
forest_fit <- randomForest(TargetBuy ~ ., data = train, ntree = 500)

forest_predictions <- predict(forest_fit, newdata = test, type = "class")

forest_conf_matrix <- table(test$TargetBuy, forest_predictions) %>%
  prop.table()
```

### Gradient Boosting

```{r, message = FALSE}
fit_control <- trainControl(method = "cv", number = 10)
tune_grid <- expand.grid(interaction.depth = 2, n.trees = 500, shrinkage = 0.1,
                         n.minobsinnode = 10)                            
gbm_fit <- train(TargetBuy ~ ., data = train, method = "gbm", 
                   trControl = fit_control, verbose = FALSE, 
                   tuneGrid = tune_grid)

gbm_predictions <- predict(gbm_fit, newdata = test, type = "prob")

gbm_conf_matrix <- table(test$TargetBuy, gbm_predictions[, 2] > 0.5) %>%
  prop.table()
```

### SVM

```{r}
svm_fit <- svm(TargetBuy ~ ., data = train)

svm_predictions <- predict(svm_fit, newdata = test, type = "prob")

svm_conf_matrix <- table(test$TargetBuy, svm_predictions) %>%
  prop.table()
```

## Model Comparison

```{r}
models <- list(
  logit = logit_conf_matrix,
  tree = tree_conf_matrix,
  forest = forest_conf_matrix,
  boosting = gbm_conf_matrix,
  svm = svm_conf_matrix
)

for (i in 1:length(models)){
  rownames(models[[i]]) <- c("Bought", "Didn't Buy")
  colnames(models[[i]]) <- c("Predicted No Buy", "Predicted Buy")
}

misclassification_rates <- list(
  logit = paste("Misclassification Rate =",
                mean(test$TargetBuy != round(logit_predictions, 0))),
  tree = paste("Misclassification Rate =", 
               mean(test$TargetBuy != tree_predictions)),
  forest = paste("Misclassification Rate =", 
                 mean(test$TargetBuy != forest_predictions)),
  gbm = paste("Misclassification Rate =",
                    mean(test$TargetBuy != round(gbm_predictions[, 2], 0))),
  svm = paste("Misclassification Rate =", 
              mean(test$TargetBuy != svm_predictions))
)

print(models)
print(misclassification_rates)
```

The final model with the lowest misclassification rate was the gradient boosting model.  I am therefore choosing to proceed with the gradient boosting model for predicting whether or not a customer will purchase organics.  I implemented a decision tree trying to predict my gradient boosting model predictions based on the x inputs to the model to discern which variables had the biggest impact on my gradient boosting model predictions.

```{r}
train_predictions <- predict(gbm_fit, newdata = train, type = "prob")

train_df_w_predictions <- train %>%
  mutate(Prediction = round(train_predictions[, 2], 0)) %>%
  select(-TargetBuy)

model_assessment_tree <- rpart(Prediction ~ ., data = train_df_w_predictions,
                               method = "class")

fancyRpartPlot(model_assessment_tree)
```

Based on the output of the decision tree, the variables most impacting my final model's predictions were age followed by affluence followed by gender.  These three variables were the only attributes represented in my decision tree modeling of my gbm predictions.

## Scored Set

```{r}
new_data <- read.csv("New_organics.csv")

new_data <- new_data %>%
  select(-DemCluster)

new_data$PromSpend <- 0

new_data$DemClusterGroup[new_data$DemClusterGroup %in% c("U","")] <- NA
new_data$DemGender[new_data$DemGender %in% c("U","")] <- NA
new_data$DemReg[new_data$DemReg == ""] <- NA
new_data$DemTVReg[new_data$DemTVReg == ""] <- NA

new_data$PromClass <- factor(new_data$PromClass, ordered = TRUE, 
                       levels = c("Tin", "Silver", "Gold", "Platinum"))

new_data <- droplevels(new_data)

new_data <- missRanger(new_data, seed = 42)

## Creating gbm_fit on all data

gbm_fit <- train(TargetBuy ~ ., data = df, method = "gbm", 
                 trControl = fit_control, verbose = FALSE, 
                 tuneGrid = tune_grid)

gbm_predictions <- predict(gbm_fit, newdata = new_data, type = "prob")

scored_set <- cbind(new_data$ID,
                            predict(gbm_fit,
                                    newdata = new_data, type = "prob")[, 2]) %>%
  data.frame() %>%
  arrange(desc(X2))

colnames(scored_set) <- c("ID", "Prob of Purchasing")
print(scored_set)
```

I applied the same data cleaning and imputation on the new set as I did on the old set.  When creating my gbm model, however, I trained it on the full initial dataset rather than just the training partition.  The final output shows the customer ID in the left column and the probability that that customer buys organics in the right column.  This final model only predicted two of the ten people as likely to purchase organics.

## Customer Clustering

I clustered the customers to generate customer segments.  The intent of these clusters is so the grocery chain can better target its users with its marketing messages.  For these clusters, I chose to leave out the target variable <code>TargetBuy</code> since the intent of these clusters is to eventually also target a customer segment with a high probability of purchasing our organics.  I ran my clusters across two separate dataframes, where I either:

1. Removed unordered multilevel factor variables <code>DemClusterGroup</code>, <code>DemReg</code>, and <code>DemTVReg</code>.

2. Generated dummy variables for the multilevel factor variables.

Clustering on the dataframe containing the dummy variables failed to converge across Elbow, Silhouette, and Gap Statistics methods.  Therefore I chose to continue with the first dataframe, where I left the unordered multilevel factor variables entirely out of my clustering analysis.

### Preparation

```{r}
clus_vars <- df %>%
  select(-TargetBuy)

clus_vars <- clus_vars %>%
  select(-c(DemClusterGroup, DemReg, DemTVReg))

clus_vars <- lapply(clus_vars, as.integer) %>%
  data.frame()
```

I scaled the dataframe to keep any one variable from having too large an impact on my final output clusters.

```{r}
clus_vars_scaled <- scale(clus_vars) %>%
  data.frame()
```

### Results

I first tried to determine the optimal cluster numbers through both the elbow and silhouette method.

#### Elbow Method

```{r}
fviz_nbclust(clus_vars_scaled, kmeans, method = "wss")
```


The elbow method, while not completely definitive, suggested that optimum cluster sizes might be 2, 4, or 6.

#### Silhouette Method

```{r}
fviz_nbclust(clus_vars_scaled, kmeans, method = "silhouette")
```

The average silhouette method demonstrates optimal cluster number at 4.

Seeing that both methods disagreed on ideal cluster size, I chose to use the <code>NbClust</code> function to determine which cluster count was optimal across the majority of the 26 non-computationally expensive methods embedded in <code>NbClust</code>.  Note that I used argument <code>index = "all"</code> to iterate only on the non-computationally expensive metrics.  Since this procedure is only being used to determine optimal number of clusters, not the final clusters for my model, I am choosing to subset my data to cut down on computational time.  Therefore I am only running <code>NbClust</code> on 30% of my data.

#### NbClust

```{r, results = "hide"}
cluster_sample <- sample(c(TRUE, FALSE), nrow(clus_vars_scaled), replace = TRUE, 
                         prob = c(0.3, 0.7))

cluster_subset <- clus_vars_scaled[cluster_sample,]

nb <- NbClust(cluster_subset, distance = "euclidean", min.nc = 2,
        max.nc = 15, method = "complete", index ="all")
```

```{r}
fviz_nbclust(nb)
```

The resuls of this clustering on my subsetted data demonstrated an ideal cluster count of 2.  Given that this is not necessarily practical from a business perspective, I am choosing to operate with <code>k = 12</code> clusters since this was the second-based cluster number.

```{r}
k12 <- kmeans(clus_vars_scaled, centers = 12, nstart = 50, iter.max = 50)
k12
```

Each of these clusters was of a substantial size, the smallest of which being of size 141  68.3% of the total variability was accounted for by between-cluster variance.

```{r}
fviz_cluster(k12, data = clus_vars_scaled)
```

Note that around 50% of the variability in the customer clusters is explained by <code>Dim1</code> and <code>Dim2</code>.  Also note that 52.2% of the variability in the data can be explained by the two eigenvectors <code>Dim1</code> and <code>Dim2</code>.

I then chose to analyze which among these clusters had the highest proportion of organic purchases.

```{r}
## Adding back target variable:
clus_vars <- cbind(clus_vars, TargetBuy = as.integer(df$TargetBuy),
                   Cluster = k12$cluster)

cluster_means <- clus_vars %>%
  group_by(Cluster) %>%
  summarize_all("mean") %>%
  select(TargetBuy, everything()) %>%
  arrange(desc(TargetBuy))

datatable(cluster_means)
```

## Exploring Final Clusters

```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -TargetBuy), y = TargetBuy - 1, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Organics Purchased") +
  ggtitle("One of Our Segments Purchases Organics More Often")
```

Note that one cluster by far has the highest probability of purchasing organics.  For the following graphs, I will continue to have the bar fill represent organic purchases.


#### Explanatory Variable Plots {.tabset .tabset-fade}
##### Affluence
```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -DemAffl), y = DemAffl, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Affluence") +
  ggtitle("The Organics Cluster has more Money")
```

##### Age
```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -DemAge), y = DemAge, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Age") +
  ggtitle("The Organics Cluster is Fairly Young")
```

##### Gender
```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -DemGender), y = 1.5 - DemGender, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Gender") +
  scale_y_continuous(breaks = c(-0.5, 0, 0.5), 
                     labels = c("Male", "Even", "Female")) +
  ggtitle("The Organics Cluster is Mostly Female")
```

##### Loyalty Status
```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -PromClass), y = PromClass, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(breaks = c(1,2,3,4), labels = c("tin", "silver", "gold",
                                                     "platinum")) +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  ylab("Loyalty Status") +
  xlab("Cluster") +
  ggtitle("Our Most Loyal Customers are not buying Organics")
```

##### Annual Store Spending
```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -PromSpend), y = PromSpend, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  ylab("Total Store Spending") +
  xlab("Cluster") +
  ggtitle("Our Highest-Spending Customers are not buying Organics")
```

##### Time as Loyalty Card Member
```{r}
ggplot(cluster_means, aes(x = reorder(Cluster, -PromTime), y = PromTime, 
                          fill = TargetBuy - 1)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(name = "Proportion \nPurchasing \nOrganics") +
  xlab("Cluster") +
  ylab("Months as Loyalty Card Member") +
  ggtitle("Our Oldest Loyalty Cards Customers are not buying Organics")
```

#### Analysis of Explanatory Variables

The primary takeaways from these clusters is that affluence is most tied to organic produce purchases.  These clusters also indicate that those who mostly purchase organics tend to not spend much in the store, indicating that the store is not attracting organics-savvy consumers as much as it could be.

Outside of analyzing organics sales, we see that one cluster's primary feature is those who have had loyalty cards longest. Another cluster spends the most and also consists mostly of platinum members.  Based on these features, we know that this cluster is our highest value cluster.  Therefore we should target this customer segment to purchase more organics.


