Caffeine Form is a company creating coffee cups from recycled material. Although they started selling the products on their website last year, the results were not as good as they expected. To better enter the local market, they decided to collaborate with local coffee shops to advertise and sell their coffee mugs.
The marketing team is trying to come up with the best criteria to choose possible collaborators by investigating the local market. They would like you to answer the following questions to help:
The dataset contains the information about coffee shops in this new market.
The dataset needs to be validated based on the description below:
| Column Name | Criteria |
|---|---|
| Region | Character, one of 10 possible regions (A to J) where the coffee shop is located. |
| Place name | Character, name of the shop. |
| Place type | Character, the type of coffee shop, one of “Coffee shop”, “Cafe”, “Espresso bar”, and “Others”. |
| Rating | Numeric, coffee shop rating (on a 5 point scale). |
| Reviews | Numeric, number of reviews provided for the shop. Remove the rows if the number of reviews is missing. |
| Price | Character, price category, one of “one dollar”, “two dollar”, “three dollar”. |
| Delivery option | Binary, describing whether there is a delivery option, either True or False. |
| Dine in option | Binary, describing whether there is a dine-in option, either True or False. Replace missing values with False. |
| Takeout option | Binary, describing whether there is a takeout option, either True or False. Replace missing values with False. |
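The criteria in the table can be checked programmatically before any cleaning. The following is a minimal sketch in base R, not part of the original analysis: the helper `validate_coffee()` and the data frame `toy` are hypothetical, the rows are made up, the price labels `$`, `$$`, `$$$` are an assumption based on how prices appear in the raw data, and the 0 to 5 rating range is an assumption about the 5 point scale.

```r
# Hypothetical helper: returns a named logical vector, one entry per rule.
validate_coffee <- function(df) {
  c(
    region_ok     = all(df$Region %in% LETTERS[1:10]),
    place_type_ok = all(df$`Place type` %in%
                          c("Coffee shop", "Cafe", "Espresso bar", "Others")),
    # Assumption: ratings lie between 0 and 5; missing ratings are tolerated.
    rating_ok     = all(is.na(df$Rating) | (df$Rating >= 0 & df$Rating <= 5)),
    # Rows with a missing review count must be removed, so none may remain.
    reviews_ok    = !anyNA(df$Reviews),
    # Assumption: the three price categories are encoded as $, $$, $$$.
    price_ok      = all(df$Price %in% c("$", "$$", "$$$")),
    delivery_ok   = is.logical(df$`Delivery option`) &&
                      !anyNA(df$`Delivery option`)
  )
}

# Made-up rows for illustration only (not taken from coffee.csv).
toy <- data.frame(
  Region = c("A", "C", "K"),
  `Place type` = c("Cafe", "Coffee shop", "Others"),
  Rating = c(4.5, NA, 4.9),
  Reviews = c(120, 30, NA),
  Price = c("$$", "$", "$$$"),
  `Delivery option` = c(TRUE, FALSE, FALSE),
  check.names = FALSE
)

checks <- validate_coffee(toy)
print(checks)
```

In this toy example the region rule fails because of the made-up region "K", and the reviews rule fails because one review count is missing.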
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2, forcats
library(skimr)      # skim()
coffe_df <- read_csv("coffee.csv", show_col_types = FALSE)
coffe_df %>%
  select(- `Place name`) %>%
  head()
## # A tibble: 6 x 8
## Region `Place type` Rating Reviews Price `Delivery option` `Dine in option`
## <chr> <chr> <dbl> <dbl> <chr> <lgl> <lgl>
## 1 C Others 4.6 206 $$ FALSE NA
## 2 C Cafe 5 24 $$ FALSE NA
## 3 C Coffee shop 5 11 $$ FALSE NA
## 4 C Coffee shop 4.4 331 $$ FALSE TRUE
## 5 C Coffee shop 5 12 $$ FALSE TRUE
## 6 C Espresso bar 4.6 367 $$ FALSE TRUE
## # ... with 1 more variable: `Takeout option` <lgl>
skim(coffe_df)
| Name | coffe_df |
| Number of rows | 200 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| logical | 3 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Region | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
| Place name | 0 | 1 | 4 | 60 | 0 | 187 | 0 |
| Place type | 0 | 1 | 4 | 12 | 0 | 4 | 0 |
| Price | 0 | 1 | 1 | 3 | 0 | 3 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| Delivery option | 0 | 1.00 | 0.17 | FAL: 165, TRU: 35 |
| Dine in option | 60 | 0.70 | 1.00 | TRU: 140 |
| Takeout option | 56 | 0.72 | 1.00 | TRU: 144 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Rating | 2 | 0.99 | 4.66 | 0.22 | 3.9 | 4.6 | 4.7 | 4.80 | 5 | ▁▁▃▇▆ |
| Reviews | 2 | 0.99 | 622.49 | 1400.90 | 3.0 | 47.5 | 271.5 | 786.25 | 17937 | ▇▁▁▁▁ |
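Observation
`Dine in option` has 60 missing values and `Takeout option` has 56; every non-missing entry in these two columns is TRUE. `Rating` and `Reviews` each have 2 missing values.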
## Replace missing values in `Dine in option` and `Takeout option` with FALSE,
## then drop the rows that still contain missing values (Rating or Reviews).
coffe_df <- coffe_df %>%
  mutate(`Dine in option` = ifelse(is.na(`Dine in option`), FALSE, `Dine in option`)) %>%
  mutate(`Takeout option` = ifelse(is.na(`Takeout option`), FALSE, `Takeout option`)) %>%
  drop_na()
skim(coffe_df)
| Name | coffe_df |
| Number of rows | 198 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| logical | 3 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Region | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
| Place name | 0 | 1 | 4 | 60 | 0 | 185 | 0 |
| Place type | 0 | 1 | 4 | 12 | 0 | 4 | 0 |
| Price | 0 | 1 | 1 | 3 | 0 | 3 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| Delivery option | 0 | 1 | 0.18 | FAL: 163, TRU: 35 |
| Dine in option | 0 | 1 | 0.71 | TRU: 140, FAL: 58 |
| Takeout option | 0 | 1 | 0.73 | TRU: 144, FAL: 54 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Rating | 0 | 1 | 4.66 | 0.22 | 3.9 | 4.6 | 4.7 | 4.80 | 5 | ▁▁▃▇▆ |
| Reviews | 0 | 1 | 622.49 | 1400.90 | 3.0 | 47.5 | 271.5 | 786.25 | 17937 | ▇▁▁▁▁ |
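Calling `drop_na()` without arguments removes a row when any column is missing, whereas the data description only asks for rows with a missing review count to be removed. A variant that follows the description more literally might look like the sketch below. It uses made-up rows in a hypothetical tibble `toy`, and relies on `tidyr::replace_na()` and `tidyr::drop_na()` with a column selection.

```r
library(dplyr)
library(tidyr)

# Made-up rows for illustration only (not taken from coffee.csv).
toy <- tibble(
  Rating = c(4.5, NA, 4.8, 5.0),
  Reviews = c(10, 25, NA, 40),
  `Dine in option` = c(TRUE, NA, TRUE, NA),
  `Takeout option` = c(NA, TRUE, TRUE, NA)
)

cleaned <- toy %>%
  # Missing option flags become FALSE, as the data description requires.
  replace_na(list(`Dine in option` = FALSE, `Takeout option` = FALSE)) %>%
  # Only a missing review count removes a row; a missing Rating is kept.
  drop_na(Reviews)

print(cleaned)
```

With the toy data, the row with a missing review count is dropped while the row with a missing rating survives, which is the difference from a bare `drop_na()`.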
Now that the data is clean, we perform exploratory data analysis (EDA) to answer the marketing team's questions.
coffe_df %>%
ggplot(aes(Rating)) +
geom_histogram(aes(y = after_stat(density)), fill = "#3279a8") +  # after_stat() replaces the deprecated ..density..
geom_density(color = "blue") +
theme_minimal()
Observation
coffe_df %>%
ggplot(aes(Reviews)) +
geom_boxplot(fill = "#3279a8") +
theme_minimal()
The Reviews column contains outliers, and one value (17,937) is far larger than all the others. Such a large review count is suspicious, so we look at it more closely later in the analysis.
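One common way to make "outlier" concrete is the 1.5 × IQR rule, which is also what a boxplot uses to decide which points to draw beyond the whiskers. The sketch below applies that rule to made-up review counts; the helper `iqr_outliers()` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Made-up review counts for illustration only (not taken from coffee.csv).
reviews <- c(12, 45, 80, 150, 270, 400, 790, 17937)
flagged <- iqr_outliers(reviews)

# Show the values the rule marks as outliers.
print(reviews[flagged])
```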
cor(coffe_df$Reviews, coffe_df$Rating)
## [1] -0.1040226
coffe_df %>%
ggplot(aes(Reviews, Rating)) +
geom_point(position = "jitter") +
geom_smooth(method = lm) +
theme_minimal()+
labs(title = "Rating vs. Reviews") +
theme(plot.title = element_text(hjust = .5))
## `geom_smooth()` using formula 'y ~ x'
Observation
The two numeric variables show almost no linear relationship: the correlation coefficient is about -0.10, a very weak negative value.
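Pearson's coefficient is sensitive to extreme values such as the very large review count seen earlier, so a rank-based coefficient can serve as a cross-check. The sketch below compares the two on made-up numbers; it only illustrates the `method` argument of `cor()` and says nothing about the real dataset.

```r
# Made-up values for illustration only (not taken from coffee.csv).
reviews <- c(10, 50, 120, 300, 800, 17937)
rating  <- c(5.0, 4.8, 4.7, 4.6, 4.6, 4.5)

# Pearson measures linear association and is pulled around by the extreme value.
pearson <- cor(reviews, rating, method = "pearson")

# Spearman works on ranks, so the size of the extreme value does not matter.
spearman <- cor(reviews, rating, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

If the two coefficients disagree strongly on the real data, that would suggest the Pearson value is driven by a few extreme shops.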
coffe_df %>%
group_by(`Place type`) %>%
summarize(cor = cor(Reviews, Rating))
## # A tibble: 4 x 2
## `Place type` cor
## <chr> <dbl>
## 1 Cafe -0.286
## 2 Coffee shop -0.0854
## 3 Espresso bar -0.356
## 4 Others -0.156
coffe_df %>%
ggplot(aes(x = Reviews, y = Rating)) +
geom_point() +
geom_smooth(method = lm) +
facet_wrap(~ `Place type`) +
theme_minimal()+
labs(title = "Reviews and Rating Relationship per Place Type") +
theme(plot.title = element_text(hjust = .5))
## `geom_smooth()` using formula 'y ~ x'
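Observation
Within every place type the correlation between Reviews and Rating is negative, ranging from about -0.09 (Coffee shop) to about -0.36 (Espresso bar), so the relationship remains weak.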
coffe_df %>%
group_by(`Place type`) %>%
summarise(count = n()) %>%
arrange(- count)
## # A tibble: 4 x 2
## `Place type` count
## <chr> <int>
## 1 Coffee shop 96
## 2 Cafe 57
## 3 Others 25
## 4 Espresso bar 20
coffe_df %>%
group_by(`Place type`) %>%
summarise(count = n()) %>%
mutate( `Place type` = fct_reorder(`Place type`, count, .desc = T )) %>%
ggplot(aes(x = `Place type`, y = count, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
geom_text(aes(label = count), vjust = -.5) +
theme_minimal()
Observation
Coffee shop is the most common place type (96 shops), while Espresso bar is the least common (20 shops).
coffe_df %>%
ggplot(aes(x = `Delivery option` , y = Reviews, fill = `Delivery option`)) +
geom_boxplot(show.legend = F ) +
theme_minimal()
Observation
coffe_df <- coffe_df %>%
filter(Reviews != 17937)
coffe_df %>%
ggplot(aes(x = `Delivery option` , y = Reviews, fill = `Delivery option`)) +
geom_boxplot(show.legend = F ) +
theme_minimal()
Observation
coffe_df %>%
group_by(`Place type`) %>%
summarise(AvgReviews = mean(Reviews)) %>%
arrange( - AvgReviews)
## # A tibble: 4 x 2
## `Place type` AvgReviews
## <chr> <dbl>
## 1 Coffee shop 557.
## 2 Cafe 533.
## 3 Espresso bar 526.
## 4 Others 462.
coffe_df %>%
group_by(`Place type`) %>%
summarise(AvgReviews = mean(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgReviews, .desc = T)) %>%
ggplot(aes(`Place type`, AvgReviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
geom_text(aes(label = round(AvgReviews, 2)), vjust = -.5) +
labs(
y = "Average Reviews"
) +
theme_minimal()
Observation
Coffee shops have the highest average number of reviews (about 557), while Others has the lowest (about 462).
coffe_df %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
arrange( - AvgRating)
## # A tibble: 4 x 2
## `Place type` AvgRating
## <chr> <dbl>
## 1 Others 4.72
## 2 Espresso bar 4.69
## 3 Coffee shop 4.68
## 4 Cafe 4.60
coffe_df %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(`Place type`, AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.5) +
labs(
y = "Average Rating"
) +
theme_minimal()
Observation
Others has the highest average rating (4.72), while Cafe has the lowest (4.60).
coffe_df %>%
filter(Region == "A") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region A",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "B") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region B",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "C") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region C",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "D") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region D",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "E") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region E",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "F") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region F",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "G") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region G",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "H") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region H",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "I") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region I",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
coffe_df %>%
filter(Region == "J") %>%
group_by(`Place type`) %>%
summarise(totalreviews = sum(Reviews)) %>%
mutate(`Place type` = fct_reorder(`Place type`, totalreviews, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = totalreviews), vjust = -.3) +
labs(
title = "Region J",
y = "Total Reviews"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
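The ten per-region charts above repeat the same code with only the region letter changed. A more compact alternative is to summarise by region and place type once and let `facet_wrap()` draw one panel per region. The sketch below uses made-up rows in a hypothetical tibble `toy`; the column names follow the dataset, everything else is illustrative.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only (not taken from coffee.csv).
toy <- tibble(
  Region = c("A", "A", "A", "B", "B", "B"),
  `Place type` = c("Cafe", "Coffee shop", "Cafe", "Others", "Cafe", "Others"),
  Reviews = c(100, 250, 50, 30, 80, 20)
)

# One summary table for all regions instead of one filter() per region.
region_totals <- toy %>%
  group_by(Region, `Place type`) %>%
  summarise(totalreviews = sum(Reviews), .groups = "drop")

# One panel per region; free_x drops place types absent from a region.
p <- ggplot(region_totals,
            aes(x = `Place type`, y = totalreviews, fill = `Place type`)) +
  geom_col(show.legend = FALSE, width = .7) +
  geom_text(aes(label = totalreviews), vjust = -.3) +
  scale_fill_viridis_d() +
  facet_wrap(~ Region, scales = "free_x") +
  labs(y = "Total Reviews") +
  theme_minimal()

print(region_totals)
```

Unlike the individual charts, this version does not reorder the bars by total within each panel, which is the trade-off for the shorter code.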
coffe_df %>%
filter(Region == "A") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region A",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region A; Others is the highest.
coffe_df %>%
filter(Region == "B") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region B",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region B; Others is the highest.
coffe_df %>%
filter(Region == "C") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region C",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region C; Coffee shop is the highest.
coffe_df %>%
filter(Region == "D") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region D",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region D; Espresso bar is the highest.
coffe_df %>%
filter(Region == "E") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region E",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region E; Cafe is the highest.
coffe_df %>%
filter(Region == "F") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region F",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region F; Cafe is the highest.
coffe_df %>%
filter(Region == "G") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region G",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region G; Others is the highest.
coffe_df %>%
filter(Region == "H") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region H",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region H; Others is the highest.
coffe_df %>%
filter(Region == "I") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region I",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region I; Coffee shop is the highest.
coffe_df %>%
filter(Region == "J") %>%
group_by(`Place type`) %>%
summarise(AvgRating = mean(Rating)) %>%
mutate(`Place type` = fct_reorder(`Place type`, AvgRating, .desc = T)) %>%
ggplot(aes(x = `Place type`, y = AvgRating, fill = `Place type`)) +
geom_col(show.legend = F, width = .7) +
scale_fill_viridis_d() +
geom_text(aes(label = round(AvgRating, 2)), vjust = -.3) +
labs(
title = "Region J",
y = "Average Rating"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5, face = "bold"))
Observation
Average ratings are almost the same across place types in Region J; Espresso bar is the highest.
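Each per-region observation above names the place type with the highest average rating. The same summary can be produced as a single table with `slice_max()`, which makes the regions easier to compare side by side. This is a minimal sketch on made-up rows in a hypothetical tibble `toy`, not the original author's method.

```r
library(dplyr)

# Made-up rows for illustration only (not taken from coffee.csv).
toy <- tibble(
  Region = c("A", "A", "A", "B", "B", "B"),
  `Place type` = c("Cafe", "Others", "Cafe",
                   "Coffee shop", "Cafe", "Coffee shop"),
  Rating = c(4.5, 4.9, 4.7, 4.8, 4.6, 4.6)
)

top_by_region <- toy %>%
  # Average rating for every region and place type combination.
  group_by(Region, `Place type`) %>%
  summarise(AvgRating = mean(Rating), .groups = "drop") %>%
  group_by(Region) %>%
  # Keep the place type with the highest average rating in each region.
  slice_max(AvgRating, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_by_region)
```

Setting `with_ties = FALSE` returns exactly one row per region; leaving it at the default would keep every place type tied for the top rating.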