library(rio)         # dloading data

library(dplyr)       # data manipulation

library(ggplot2)     # visualization

library(factoextra)  # visualization

files <- dir("data", full = T)


if(length(files)==0 ){
warning("FIRST YOU MUST CREATE A NEW R PROJEC THEN COPY THE FILES DOWNLOADED FROM GIT HUB 
        AND PASTE ALL INTO THE NEW PROJECT FOLDER") }

stores <- import(files[1], which = 1)%>%janitor::clean_names()
cost-of-health

Introduction

WeWashYouSleep is a ralatively small company and they cannot compete with the big players in the major cities. The company’s strategy is to build a vast network in the smaller cityes.

WeWashYouSleep alredy had a strong presence in 140 locations and recently oppened stores in 10 new cityes. Additionally, the company has two separate sales regions.

You have been assigned for analytics task;

Indentify which of the 10 new locations have the best potential for the company to invest more funds into markenting.


The Data

The Dataset has been taken from superdatascience this dataset contains 7 attributes. It has a total number of 150 rows.

Attributes:

  • store_id: Unique identifier of the store

  • state: State of the established store

  • city: City of the established store

  • sales_region: Region Sales

  • new_expansion: Indicates whether the store is Old or New

  • marketing_spend: Amount invested in marketing per store

  • revenue: Amount of revenue per store

Data cleaning and pre-processing

There were no duplicated entries in the dataset and no missing values the functions below gives a summary of these:

paste0("There are ",sum(duplicated(stores)), " duplicated observations") 
## [1] "There are 0 duplicated observations"


#Checking for missing values 
visdat::vis_miss(stores)


Data Analysis

skimr::skim(stores)
Data summary
Name stores
Number of rows 150
Number of columns 7
_______________________
Column type frequency:
character 4
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 16 0 149 0
state 0 1 4 14 0 35 0
sales_region 0 1 8 8 0 2 0
new_expansion 0 1 3 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
store_id 0 1 75.50 43.45 1 38.25 75.5 112.75 150 ▇▇▇▇▇
marketing_spend 0 1 2893.15 367.86 1811 2662.25 2898.0 3111.50 3984 ▁▃▇▃▁
revenue 0 1 39301.43 15465.75 15562 21113.50 42993.0 51145.50 68828 ▇▁▆▇▂

ggplot(stores, aes(marketing_spend, revenue))+ 
   geom_point(size = 4, alpha= 0.6)+
   theme_minimal()  +
   scale_x_continuous(labels = scales::dollar) +
   scale_y_continuous(labels = scales::dollar) +
   ggtitle("Marketing Spend vs Revenue")+
   theme(
      axis.text.x=element_text( color="black"),
      plot.title = element_text(size = 17),
      strip.text.x = element_text(size = 14))
Figure 1.0

Figure 1.0

Straightforward We can verify the existence of two clusters in this image. The first cluster has a weak, positive, linear association between the two variables in which the market investment does not seem to raise the amount of revenue to the company. The other one has a moderately strong, positive, linear association between the two variables in which the market investment seems to raise the amount of revenue to the company.

Unsupervised Learning

Algorithm Description
K Means Clustering


Let’s apply the K Means algorithm and visualize it to see whether it will separate the clusters that we were able to identify in figure 1.0.

# Lets check if the algorithm kmeans can indentify these clusters 
k_means_data <- stores %>%
                   select(revenue, marketing_spend)

set.seed(111)

# Set k equal to the optimal number of clusters
num_clusters <- 2

# Run the k-means algorithm 
k_means <- kmeans(k_means_data, num_clusters, iter.max = 15, nstart = 25)


stores <- stores %>%
   mutate( k_means_cluster = k_means$cluster)  
ggplot(stores, aes(marketing_spend, revenue, color = as.factor(k_means_cluster)))+ 
   geom_point(size = 4, alpha= 0.6) +
   geom_smooth(method = "lm",se = F)+
   theme_minimal()+
   scale_x_continuous(labels = scales::dollar) +
   scale_y_continuous(labels = scales::dollar) +
   scale_color_manual(name="Clusters",values=c("#EC2C73","#29AEC7")) +
   ggtitle('Clusters by Marketing Spend vs Revenue')+
   ylab("Revenue") +
   xlab("Marketing Spend") 
Figure 2.0

Figure 2.0

The algorithm separated the dataset into two clusters very well. The only problem is that for some reason, it assigned some stores in cluster two that appear to belong to cluster one.


Adding Population to the Dataset

I decided to bring one more dataset for this analysis. The US-Cities-Population dataset contains the amount of population from these cities. As you can imagine. The volume of work that they get is directly proportionate to the number of people that they can potentially service.

For a laundry company, the more people you have exactly proportionately, the more you will have dirty laundry in that city. So the population of a city might play a huge role in the analysis that we are about to perform.

Let’s join these two datasets:

pop <- read.csv(files[2])%>%
                 janitor::clean_names()%>%
                  select(city,state,x2015_estimate)

paste0("There are ", stores%>% nrow, " rows in the datase before the join")
## [1] "There are 150 rows in the datase before the join"
company_pop <- stores %>% 
   inner_join(pop, by = c("city", "state"))

paste0("There are ", company_pop %>% nrow , " rows in the datase after the join")
## [1] "There are 150 rows in the datase after the join"

It is fundamental to check the number of rows after and before joining two datasets. We do not want to end up with fewer or duplicate observations in the dataset.

Finding the Number of Clusters

Great! Now that we have a more informative dataset, let’s begin our clustering analysis by determining how many clusters we should be modeling. The best number of clusters is not always a clear-cut answer, but fortunately there are several techniques to help us optimize. We’ll work with two different methods to compare recommendations:

  • Elbow Method

  • Silhouette Method

First up will be the Elbow Method. This method plots the percent variance against the number of clusters. The “elbow” bend of the curve indicates the optimal point at which adding more clusters will no longer explain a significant amount of the variance

k_means_data <- company_pop %>%
   select(revenue, marketing_spend,x2015_estimate)


fviz_nbclust(k_means_data, kmeans, method = "wss")
Figure 3.0

Figure 3.0

It appears that we have a third cluster now. Let’s see if we can confirm this result by checking the recommendation from the Silhouette method.


The silhouette method: The Silhouette Method will evaluate the quality of clusters by how well each point fits within a cluster, maximizing average “silhouette” width.

# Run the fviz_nbclust function with the method "silhouette" 
fviz_nbclust(k_means_data, kmeans,method = "silhouette")
Figure 4.0

Figure 4.0

Great, now we have the confirmation of the existence of three clusters.

Now, let’s run K means algorithm again and get to know the third cluster in our dataset

set.seed(111)

# Set k equal to the optimal number of clusters
num_clusters <- 3

# Run the k-means algorithm 
k_means <- kmeans(k_means_data, num_clusters, iter.max = 5, nstart = 25)


company_pop <- company_pop %>%
   mutate( k_means_cluster = k_means$cluster)  


ggplot(company_pop, aes(marketing_spend, revenue, color = as.factor(k_means_cluster) ))+ 
   geom_point(size = 4, alpha= 0.6) +
   geom_smooth(method = "lm",se = F)+
   theme_minimal()  +
   scale_x_continuous(labels = scales::dollar) +
   scale_y_continuous(labels = scales::dollar) +
   scale_color_manual(name="Clusters",values=c("#EC2C73","#29AEC7", 
                                               "#FFDD30")) +
   ggtitle('Clusters by Marketing Spend vs Revenue')+
   ylab("Revenue") +
   xlab("Marketing Spend") 
Figure 5.0

Figure 5.0

Here it is. The k means algorithm was able to identify this third cluster (1) that has the strongest relationship between the two variables. Therefore the new stores that are in these cities would be the good ones to invest in marketing.

company_pop <- company_pop %>%
   mutate(k_means_cluster = as.factor(
      ifelse(revenue >27000 & k_means_cluster == 2, 1,k_means_cluster )))



company_pop %>%
    group_by(k_means_cluster)%>%
    summarise(pop_mean = mean(x2015_estimate))
## # A tibble: 3 x 2
##   k_means_cluster pop_mean
##   <fct>              <dbl>
## 1 1                132028.
## 2 2                107172.
## 3 3                190854.

Here we can see the mean proportion of the population for each cluster.

lm( revenue ~ marketing_spend, company_pop %>% filter(k_means_cluster == 1))
## 
## Call:
## lm(formula = revenue ~ marketing_spend, data = company_pop %>% 
##     filter(k_means_cluster == 1))
## 
## Coefficients:
##     (Intercept)  marketing_spend  
##       22588.109            7.321

For the stores situated in the cities that make part of the cluster one, the least square regression predicts for each dollar invested in marketing it is predicted to have a return of $7.31

lm( revenue ~ marketing_spend, company_pop %>% filter(k_means_cluster == 2))
## 
## Call:
## lm(formula = revenue ~ marketing_spend, data = company_pop %>% 
##     filter(k_means_cluster == 2))
## 
## Coefficients:
##     (Intercept)  marketing_spend  
##       1.694e+04        9.439e-01

For the stores situated in the cities that make part of the cluster two, the least square regression predicts for each dollar invested in marketing it is predicted to have a return of $0.94

lm( revenue ~ marketing_spend, company_pop %>% filter(k_means_cluster == 3))
## 
## Call:
## lm(formula = revenue ~ marketing_spend, data = company_pop %>% 
##     filter(k_means_cluster == 3))
## 
## Coefficients:
##     (Intercept)  marketing_spend  
##       45983.858            3.172

For the stores situated in the cities that make part of the cluster three, the least square regression predicts for each dollar invested in marketing it is predicted to have a return of $3.17

Showing the Best Stores to Invest in Marketing

company_pop %>%
   filter(k_means_cluster == 1 & new_expansion == "New")%>%
   mutate(`Return per dollar invested` = round(revenue /marketing_spend,2))%>%
   select(city,`Return per dollar invested`)%>%
   ggplot(aes(city,`Return per dollar invested`))+
   scale_y_continuous(labels = scales::dollar) +
   geom_col() +
   theme_minimal()+
   ggtitle('Best Stores to invest in Marketing')+
   xlab("City")  
Figure 5.0

Figure 5.0

Those are the new stores in which the investment in marketing has generate more revenue for the company. Now it is time to present the finds to the CEO of the company. And let him with, his domain knowledge, choose three out of these four stores to invest in marketing.

You can find the code used in this article in the Github Repository. Right-click on the link and choose (open in a New Window).