options(scipen=999)
library(tidyverse)
library(socviz)
library(datasetsICR)
LAB 5
PART 0: Useful code ONLY for those assigned the customers {datasetsICR}
I am using the customers dataset, so I need to convert Channel and Region from integer variables into factors, using text labels for the levels.
## THIS CODE IS TO RENAME THE LEVELS FOR THE CHANNEL AND REGION VARIABLES IN THE CUSTOMER DATASET
library(datasetsICR)
data(customers)
head(customers)
  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1       2      3 12669 9656    7561    214             2674       1338
2       2      3  7057 9810    9568   1762             3293       1776
3       2      3  6353 8808    7684   2405             3516       7844
4       1      3 13265 1196    4221   6404              507       1788
5       2      3 22615 5410    7198   3915             1777       5185
6       2      3  9413 8259    5126    666             1795       1451
view(customers)
# CHANNEL VARIABLE
# create the factor variable
customers$channel <- as.factor(customers$Channel)
# rename the levels from 1 and 2 to descriptor
levels(customers$channel)[levels(customers$channel)=='1'] <- "Hotel-Restaurant"
levels(customers$channel)[levels(customers$channel)=='2'] <- "Retail"
# REGION VARIABLE
customers$region <- as.factor(customers$Region)
levels(customers$region)[levels(customers$region)=='1'] <- "Lisbon"
levels(customers$region)[levels(customers$region)=='2'] <- "Oporto"
levels(customers$region)[levels(customers$region)=='3'] <- "Other"
# DROP THE FIRST TWO COLUMNS TO AVOID CONFUSION
customers <- customers %>% select(3:10)
head(customers)
  Fresh Milk Grocery Frozen Detergents_Paper Delicassen          channel region
1 12669 9656    7561    214             2674       1338           Retail  Other
2  7057 9810    9568   1762             3293       1776           Retail  Other
3  6353 8808    7684   2405             3516       7844           Retail  Other
4 13265 1196    4221   6404              507       1788 Hotel-Restaurant  Other
5 22615 5410    7198   3915             1777       5185           Retail  Other
6  9413 8259    5126    666             1795       1451           Retail  Other
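The level-by-level renaming above can also be written in a single call. This is a minimal sketch, assuming the integer codes map to labels exactly as in the code above (1 = Hotel-Restaurant, 2 = Retail; 1 = Lisbon, 2 = Oporto, 3 = Other). The `toy` data frame is a hypothetical stand-in for `customers`, so the example runs without the package installed.

```r
# Hypothetical stand-in for the first six rows of the customers data
toy <- data.frame(Channel = c(2, 2, 2, 1, 2, 2),
                  Region  = c(3, 3, 3, 3, 3, 3))

# factor() with levels/labels converts and renames in one step
toy$channel <- factor(toy$Channel, levels = c(1, 2),
                      labels = c("Hotel-Restaurant", "Retail"))
toy$region  <- factor(toy$Region, levels = c(1, 2, 3),
                      labels = c("Lisbon", "Oporto", "Other"))

print(levels(toy$region))
print(as.character(toy$channel))
```

Listing every level explicitly keeps Lisbon and Oporto as levels of the factor even when, as in these six rows, they do not occur in the data.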
PART 1: PRACTICE USING PIPES (dplyr) TO SUMMARIZE DATA: TWO CATEGORICAL VARIABLES
I use the only two categorical variables in the dataset, Region and Channel.
library(datasetsICR)
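pip1 <- customers %>%
  group_by(channel, region) %>%
  # summarize() drops the last grouping level (region), leaving the result grouped by channel
  summarize(N = n(), .groups = "drop_last") %>%
  # freq and pct are therefore shares within each channel
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))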
# Remove missing values
pip1 <- subset(pip1, !is.na(channel) & !is.na(region))
pip1
# A tibble: 6 × 5
# Groups:   channel [2]
  channel          region     N   freq   pct
  <fct>            <fct>  <int>  <dbl> <dbl>
1 Hotel-Restaurant Lisbon    59 0.198     20
2 Hotel-Restaurant Oporto    28 0.0940     9
3 Hotel-Restaurant Other    211 0.708     71
4 Retail           Lisbon    18 0.127     13
5 Retail           Oporto    19 0.134     13
6 Retail           Other    105 0.739     74
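The freq and pct columns are shares within each channel, not shares of all customers, because the summary is still grouped by channel when `mutate()` runs. The arithmetic can be reproduced in base R as a rough check. The counts are copied from the table above; `counts` and `channel_total` are names introduced only for this sketch.

```r
# Counts copied from the pip1 table above
counts <- data.frame(
  channel = rep(c("Hotel-Restaurant", "Retail"), each = 3),
  region  = rep(c("Lisbon", "Oporto", "Other"), times = 2),
  N       = c(59, 28, 211, 18, 19, 105)
)

# Total customers per channel, repeated on every row of that channel
channel_total <- ave(counts$N, counts$channel, FUN = sum)

# Share of each region within its channel, as a proportion and a rounded percent
counts$freq <- counts$N / channel_total
counts$pct  <- round(counts$freq * 100, 0)
print(counts)
```

The channel totals are 298 (Hotel-Restaurant) and 142 (Retail), so the first row is 59 / 298, which rounds to 20%.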
PART 2: CREATE STACKED AND DODGED BAR CHARTS FROM 2 CATEGORICAL VARIABLES
To create the following charts, I used the two categorical variables, Region and Channel.
# Define custom colors
custom_colors <- c("lightblue", "orange")
p_title <- "Channels by Big Regions"
p_caption <- "customers dataset"
# AS STACKED BAR CHART
p <- ggplot(data = subset(pip1, !is.na(region) & !is.na(channel)),
            aes(x = region, y = pct, fill = channel))
p + geom_col(position = "stack") +
  labs(x = "Regions", y = "Percent", fill = "Channel",
       title = p_title, caption = p_caption,
       subtitle = "As a stacked bar chart") +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = custom_colors)
# AS DODGED BAR CHART
p + geom_col(position = "dodge2") +
  labs(x = "Major Region", y = "Percent", fill = "Channel",
       title = p_title, caption = p_caption,
       subtitle = "As a dodged bar chart") +
  geom_text(aes(label = pct), position = position_dodge(width = 0.9)) +
  scale_fill_manual(values = custom_colors)
# AS FACETED HORIZONTAL BAR CHART
p + geom_col(position = "dodge2") +
  labs(x = NULL, y = "Percent", fill = "Channel",
       title = p_title, caption = p_caption,
       subtitle = "As a faceted horizontal bar chart") +
  guides(fill = "none") +
  coord_flip() +
  facet_grid(~ region) +
  geom_text(aes(label = pct), position = position_dodge2(width = 1)) +
  scale_fill_manual(values = custom_colors)
These charts show how customers are distributed across regions and channels, which makes comparison easy. Because pct was computed within each channel in Part 1, each label is the share of that channel's customers located in the region. The stacked bar chart piles the two channels' shares on top of each other for each region, while the dodged and faceted horizontal bar charts place the channels side by side. In both channels, the Other region accounts for by far the largest share (71% of Hotel-Restaurant customers and 74% of Retail customers), well ahead of Lisbon and Oporto.
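If the goal were for each stacked bar to add up to 100%, the percentages would need to be computed within region rather than within channel. The sketch below shows that alternative with the counts from Part 1. It is an illustration only, not the calculation the charts above use, and `pct_in_region` is a name introduced for this example.

```r
# Counts copied from the pip1 table in Part 1
counts <- data.frame(
  channel = rep(c("Hotel-Restaurant", "Retail"), each = 3),
  region  = rep(c("Lisbon", "Oporto", "Other"), times = 2),
  N       = c(59, 28, 211, 18, 19, 105)
)

# Total customers per region, repeated on every row of that region
region_total <- ave(counts$N, counts$region, FUN = sum)

# Share of each channel within its region
counts$pct_in_region <- round(counts$N / region_total * 100, 0)
print(counts[order(counts$region), ])
```

With these shares, the two channel segments of each region's bar sum to 100, so a stacked chart would read as the channel mix inside each region.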
PART 3: PRACTICE USING PIPES (dplyr) TO SUMMARIZE DATA: TWO CONTINUOUS & ONE CATEGORICAL VARIABLE
In this section I used Region as my categorical variable because it works better in this context: Channel has only two levels (Hotel-Restaurant and Retail), which would make the graph look empty.
pip2 <- customers %>%
  group_by(region) %>%
  summarize(N = n(),
            milk_mean = mean(Milk, na.rm = TRUE),
            det_mean = mean(Detergents_Paper, na.rm = TRUE)) %>%
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))
pip2
# A tibble: 3 × 6
  region     N milk_mean det_mean  freq   pct
  <fct>  <int>     <dbl>    <dbl> <dbl> <dbl>
1 Lisbon    77     5486.    2651. 0.175    18
2 Oporto    47     5088.    3687. 0.107    11
3 Other    316     5977.    2818. 0.718    72
This summarized data provides the count, frequency, percentage, mean milk sales, and mean detergent sales for each region. The results appear reasonable, showing variations in the mean sales across regions, which could be influenced by factors such as population density, consumer preferences, and market conditions.
PART 4: SCATTERPLOT WITH A THIRD CATEGORICAL VARIABLE
This code does not remove NA – make sure you deal with that.
custom_colors <- c("red", "purple", "orange")
# na.rm is not an aesthetic, so it is passed to geom_point() instead of aes()
p <- ggplot(pip2, aes(x = milk_mean, y = det_mean, color = region))
p + geom_point(size = 5, na.rm = TRUE) +
  # annotation coordinates must lie inside the data range (roughly 5000-6000 by 2600-3700)
  annotate(geom = "text", x = 5300, y = 3300,
           label = "Two Continuous Variables and One Categorical", hjust = 0) +
  labs(x = "Average Milk Purchase", y = "Average Detergent Purchase",
       title = "Relationship between Milk and Detergent",
       subtitle = "Oporto has the highest detergent purchases on average",
       caption = "customers dataset {datasetsICR}") +
  scale_color_manual(values = custom_colors)
This scatterplot shows the relationship between mean milk purchases and mean detergent purchases, with one point per region, colored by region. Oporto has the highest average detergent purchases (about 3,687) but the lowest average milk purchases (about 5,088). The Other region has the highest average milk purchases (about 5,977). Lisbon has the lowest average detergent purchases (about 2,651), and its average milk purchases (about 5,486) fall between those of Oporto and Other.
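On the note about missing values at the start of this part, one way to check for and drop NA values before plotting is sketched below. The `toy` data frame and the NA in it are hypothetical, made up so there is something to remove; the other numbers are the rounded means from Part 3.

```r
# Hypothetical data: the NA is inserted on purpose for this example
toy <- data.frame(
  region    = c("Lisbon", "Oporto", "Other"),
  milk_mean = c(5486, NA, 5977),
  det_mean  = c(2651, 3687, 2818)
)

# Count missing values per column
print(colSums(is.na(toy)))

# Keep only rows with no missing value in the two plotted columns
clean <- toy[complete.cases(toy[, c("milk_mean", "det_mean")]), ]
print(clean)
```

In the actual summary, `mean(..., na.rm = TRUE)` already ignores missing purchase values, so a check like this matters mainly for the grouping variable.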
PART 5: LEGEND AND GUIDES
pip2$region.c <- as.character(pip2$region)
pip2 <- pip2[order(pip2$region.c),]
pip2
# A tibble: 3 × 7
  region     N milk_mean det_mean  freq   pct region.c
  <fct>  <int>     <dbl>    <dbl> <dbl> <dbl> <chr>
1 Lisbon    77     5486.    2651. 0.175    18 Lisbon
2 Oporto    47     5088.    3687. 0.107    11 Oporto
3 Other    316     5977.    2818. 0.718    72 Other
custom_colors <- c("red", "purple", "orange")
# na.rm is not an aesthetic, so it is passed to geom_point() instead of aes()
p <- ggplot(pip2, aes(x = milk_mean, y = det_mean, color = region.c))
p + geom_point(size = 5, na.rm = TRUE) +
  # annotation coordinates must lie inside the data range
  annotate(geom = "text", x = 5300, y = 3300,
           label = "Relationship between Milk and Detergent Purchases in regions", hjust = 0) +
  labs(x = "Average Milk Purchase", y = "Average Detergent Purchase",
       color = "Region",
       title = "Relationship between Milk and Detergent Purchases",
       subtitle = "Oporto has the highest detergent purchases on average",
       # labs() arguments are set with =, not <-
       caption = "customers dataset {datasetsICR}") +
  theme(legend.title = element_text(color = "gray50", size = 14, face = "bold"),
        legend.position = c(x = 0.1, y = 0.7)) +
  scale_color_manual(values = custom_colors)
In this plot, the legend order has been sorted alphabetically, with Lisbon first, followed by Oporto and Other. The legend title has been changed to "Region" in bold gray text. The legend has also been moved inside the plot area, toward the upper left (x = 0.1, y = 0.7).
PART 6: DATA LABELS VS LEGEND
custom_colors <- c("red", "purple", "orange")
# na.rm is not an aesthetic, so it is passed to geom_point() instead of aes()
p <- ggplot(pip2, aes(x = milk_mean, y = det_mean, color = region.c))
p + geom_point(size = 5, na.rm = TRUE) +
  geom_text(mapping = aes(label = region), hjust = 1.2, size = 3) +
  # annotation coordinates must lie inside the data range
  annotate(geom = "text", x = 5300, y = 3300,
           label = "Relationship between Milk and Detergent Purchases in regions", hjust = 0) +
  labs(x = "Average Milk Purchases", y = "Average Detergent Purchases",
       title = "Relationship between Milk and Detergent Purchases",
       color = "Region") +
  theme(legend.position = "none") +
  scale_color_manual(values = custom_colors)
In this plot, data labels have been added to directly display the region names instead of using a legend. The geom_text layer has been used to add text labels next to each data point, with a horizontal adjustment (hjust) to separate the labels from the points.
PART 7: INTERPRETATION
Create insights from the visualization.
The scatterplot in Part 4 visualizes the relationship between mean milk purchases and mean detergent purchases across regions. Each point represents one region and is colored to distinguish Lisbon, Oporto, and Other.
From the scatterplot, we can observe the following insights:
Regional variations: The scatterplot reveals distinct regional differences in mean milk and detergent purchases. The Other region has the highest value on the x-axis (mean milk purchases, about 5,977), while Oporto has the highest value on the y-axis (mean detergent purchases, about 3,687) but the lowest mean milk purchases (about 5,088). Lisbon, the capital region, has the lowest mean detergent purchases (about 2,651) and sits between the other two regions on milk.
Potential correlation: With only three regional averages, the scatterplot cannot establish a correlation between milk and detergent purchases. If anything, the three points lean negative: Oporto combines the lowest mean milk purchases with the highest mean detergent purchases, while Other has the highest mean milk purchases and relatively low detergent purchases. Examining the relationship properly would require plotting the individual customers rather than the regional means.
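The direction of the relationship among the regional means can be checked directly. This is a rough sketch that uses the rounded means printed in Part 3, so the result is approximate, and three points are far too few for any inference.

```r
# Regional means copied (rounded) from the pip2 table in Part 3
milk_mean <- c(Lisbon = 5486, Oporto = 5088, Other = 5977)
det_mean  <- c(Lisbon = 2651, Oporto = 3687, Other = 2818)

# Pearson correlation across the three regional means
r <- cor(milk_mean, det_mean)
print(round(r, 2))
```

The coefficient comes out negative (about -0.74), which reflects Oporto pairing the lowest milk mean with the highest detergent mean.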
Overall, the scatterplot visualizes the relationship between two continuous variables (milk and detergent purchases) while incorporating a categorical variable (region) through color coding. It allows regional variations to be identified and gives a first look at how the two product categories relate. This information can help businesses understand consumer demand patterns, tailor marketing strategies, and optimize product distribution across regions.