library(rio) # dloading data
library(dplyr) # data manipulation
library(ggplot2) # visualization
library(factoextra) # visualization
files <- dir("data", full = T)
if(length(files)==0 ){
warning("FIRST YOU MUST CREATE A NEW R PROJEC THEN COPY THE FILES DOWNLOADED FROM GIT HUB
AND PASTE ALL INTO THE NEW PROJECT FOLDER") }
stores <- import(files[1], which = 1)%>%janitor::clean_names()
WeWashYouSleep is a ralatively small company and they cannot compete with the big players in the major cities. The company’s strategy is to build a vast network in the smaller cityes.
WeWashYouSleep alredy had a strong presence in 140 locations and recently oppened stores in 10 new cityes. Additionally, the company has two separate sales regions.
You have been assigned for analytics task;
Indentify which of the 10 new locations have the best potential for the company to invest more funds into markenting.
The Dataset has been taken from superdatascience this dataset contains 7 attributes. It has a total number of 150 rows.
store_id: Unique identifier of the store
state: State of the established store
city: City of the established store
sales_region: Region Sales
new_expansion: Indicates whether the store is Old or New
marketing_spend: Amount invested in marketing per store
revenue: Amount of revenue per store
There were no duplicated entries in the dataset and no missing values the functions below gives a summary of these:
paste0("There are ",sum(duplicated(stores)), " duplicated observations")
## [1] "There are 0 duplicated observations"
#Checking for missing values
visdat::vis_miss(stores)
skimr::skim(stores)
| Name | stores |
| Number of rows | 150 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city | 0 | 1 | 4 | 16 | 0 | 149 | 0 |
| state | 0 | 1 | 4 | 14 | 0 | 35 | 0 |
| sales_region | 0 | 1 | 8 | 8 | 0 | 2 | 0 |
| new_expansion | 0 | 1 | 3 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| store_id | 0 | 1 | 75.50 | 43.45 | 1 | 38.25 | 75.5 | 112.75 | 150 | ▇▇▇▇▇ |
| marketing_spend | 0 | 1 | 2893.15 | 367.86 | 1811 | 2662.25 | 2898.0 | 3111.50 | 3984 | ▁▃▇▃▁ |
| revenue | 0 | 1 | 39301.43 | 15465.75 | 15562 | 21113.50 | 42993.0 | 51145.50 | 68828 | ▇▁▆▇▂ |
ggplot(stores, aes(marketing_spend, revenue))+
geom_point(size = 4, alpha= 0.6)+
theme_minimal() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Marketing Spend vs Revenue")+
theme(
axis.text.x=element_text( color="black"),
plot.title = element_text(size = 17),
strip.text.x = element_text(size = 14))
Figure 1.0
Straightforward We can verify the existence of two clusters in this image. The first cluster has a weak, positive, linear association between the two variables in which the market investment does not seem to raise the amount of revenue to the company. The other one has a moderately strong, positive, linear association between the two variables in which the market investment seems to raise the amount of revenue to the company.
| Algorithm | Description |
|---|---|
| K Means | Clustering |
Let’s apply the K Means algorithm and visualize it to see whether it will separate the clusters that we were able to identify in figure 1.0.
# Lets check if the algorithm kmeans can indentify these clusters
k_means_data <- stores %>%
select(revenue, marketing_spend)
set.seed(111)
# Set k equal to the optimal number of clusters
num_clusters <- 2
# Run the k-means algorithm
k_means <- kmeans(k_means_data, num_clusters, iter.max = 15, nstart = 25)
stores <- stores %>%
mutate( k_means_cluster = k_means$cluster)
ggplot(stores, aes(marketing_spend, revenue, color = as.factor(k_means_cluster)))+
geom_point(size = 4, alpha= 0.6) +
geom_smooth(method = "lm",se = F)+
theme_minimal()+
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(name="Clusters",values=c("#EC2C73","#29AEC7")) +
ggtitle('Clusters by Marketing Spend vs Revenue')+
ylab("Revenue") +
xlab("Marketing Spend")
Figure 2.0
The algorithm separated the dataset into two clusters very well. The only problem is that for some reason, it assigned some stores in cluster two that appear to belong to cluster one.
I decided to bring one more dataset for this analysis. The US-Cities-Population dataset contains the amount of population from these cities. As you can imagine. The volume of work that they get is directly proportionate to the number of people that they can potentially service.
For a laundry company, the more people you have exactly proportionately, the more you will have dirty laundry in that city. So the population of a city might play a huge role in the analysis that we are about to perform.
Let’s join these two datasets:
pop <- read.csv(files[2])%>%
janitor::clean_names()%>%
select(city,state,x2015_estimate)
paste0("There are ", stores%>% nrow, " rows in the datase before the join")
## [1] "There are 150 rows in the datase before the join"
company_pop <- stores %>%
inner_join(pop, by = c("city", "state"))
paste0("There are ", company_pop %>% nrow , " rows in the datase after the join")
## [1] "There are 150 rows in the datase after the join"
It is fundamental to check the number of rows after and before joining two datasets. We do not want to end up with fewer or duplicate observations in the dataset.
Great! Now that we have a more informative dataset, let’s begin our clustering analysis by determining how many clusters we should be modeling. The best number of clusters is not always a clear-cut answer, but fortunately there are several techniques to help us optimize. We’ll work with two different methods to compare recommendations:
Elbow Method
Silhouette Method
First up will be the Elbow Method. This method plots the percent variance against the number of clusters. The “elbow” bend of the curve indicates the optimal point at which adding more clusters will no longer explain a significant amount of the variance
k_means_data <- company_pop %>%
select(revenue, marketing_spend,x2015_estimate)
fviz_nbclust(k_means_data, kmeans, method = "wss")
Figure 3.0
It appears that we have a third cluster now. Let’s see if we can confirm this result by checking the recommendation from the Silhouette method.
The silhouette method: The Silhouette Method will evaluate the quality of clusters by how well each point fits within a cluster, maximizing average “silhouette” width.
# Run the fviz_nbclust function with the method "silhouette"
fviz_nbclust(k_means_data, kmeans,method = "silhouette")
Figure 4.0
Great, now we have the confirmation of the existence of three clusters.
Now, let’s run K means algorithm again and get to know the third cluster in our dataset
set.seed(111)
# Set k equal to the optimal number of clusters
num_clusters <- 3
# Run the k-means algorithm
k_means <- kmeans(k_means_data, num_clusters, iter.max = 5, nstart = 25)
company_pop <- company_pop %>%
mutate( k_means_cluster = k_means$cluster)
ggplot(company_pop, aes(marketing_spend, revenue, color = as.factor(k_means_cluster) ))+
geom_point(size = 4, alpha= 0.6) +
geom_smooth(method = "lm",se = F)+
theme_minimal() +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(name="Clusters",values=c("#EC2C73","#29AEC7",
"#FFDD30")) +
ggtitle('Clusters by Marketing Spend vs Revenue')+
ylab("Revenue") +
xlab("Marketing Spend")
Figure 5.0
Here it is. The k means algorithm was able to identify this third cluster (1) that has the strongest relationship between the two variables. Therefore the new stores that are in these cities would be the good ones to invest in marketing.
company_pop <- company_pop %>%
mutate(k_means_cluster = as.factor(
ifelse(revenue >27000 & k_means_cluster == 2, 1,k_means_cluster )))
company_pop %>%
group_by(k_means_cluster)%>%
summarise(pop_mean = mean(x2015_estimate))
## # A tibble: 3 x 2
## k_means_cluster pop_mean
## <fct> <dbl>
## 1 1 132028.
## 2 2 107172.
## 3 3 190854.
Here we can see the mean proportion of the population for each cluster.
lm( revenue ~ marketing_spend, company_pop %>% filter(k_means_cluster == 1))
##
## Call:
## lm(formula = revenue ~ marketing_spend, data = company_pop %>%
## filter(k_means_cluster == 1))
##
## Coefficients:
## (Intercept) marketing_spend
## 22588.109 7.321
For the stores situated in the cities that make part of the cluster one, the least square regression predicts for each dollar invested in marketing it is predicted to have a return of $7.31
lm( revenue ~ marketing_spend, company_pop %>% filter(k_means_cluster == 2))
##
## Call:
## lm(formula = revenue ~ marketing_spend, data = company_pop %>%
## filter(k_means_cluster == 2))
##
## Coefficients:
## (Intercept) marketing_spend
## 1.694e+04 9.439e-01
For the stores situated in the cities that make part of the cluster two, the least square regression predicts for each dollar invested in marketing it is predicted to have a return of $0.94
lm( revenue ~ marketing_spend, company_pop %>% filter(k_means_cluster == 3))
##
## Call:
## lm(formula = revenue ~ marketing_spend, data = company_pop %>%
## filter(k_means_cluster == 3))
##
## Coefficients:
## (Intercept) marketing_spend
## 45983.858 3.172
For the stores situated in the cities that make part of the cluster three, the least square regression predicts for each dollar invested in marketing it is predicted to have a return of $3.17
company_pop %>%
filter(k_means_cluster == 1 & new_expansion == "New")%>%
mutate(`Return per dollar invested` = round(revenue /marketing_spend,2))%>%
select(city,`Return per dollar invested`)%>%
ggplot(aes(city,`Return per dollar invested`))+
scale_y_continuous(labels = scales::dollar) +
geom_col() +
theme_minimal()+
ggtitle('Best Stores to invest in Marketing')+
xlab("City")
Figure 5.0
Those are the new stores in which the investment in marketing has generate more revenue for the company. Now it is time to present the finds to the CEO of the company. And let him with, his domain knowledge, choose three out of these four stores to invest in marketing.
You can find the code used in this article in the Github Repository. Right-click on the link and choose (open in a New Window).