LAB 5

Author

Erick Xavier Maldonado

options(scipen=999) 
library(tidyverse)
library(socviz)
library(datasetsICR)

PART 0: Useful code ONLY for those assigned the customers {datasetsICR}

I am using the customers dataset, so I first need to convert the Channel and Region variables from integers into factors with text labels for the levels.

## THIS CODE IS TO RENAME THE LEVELS FOR THE CHANNEL AND REGION VARIABLES IN THE CUSTOMER DATASET
library(datasetsICR)
data(customers)
head(customers)

  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1       2      3 12669 9656    7561    214             2674       1338
2       2      3  7057 9810    9568   1762             3293       1776
3       2      3  6353 8808    7684   2405             3516       7844
4       1      3 13265 1196    4221   6404              507       1788
5       2      3 22615 5410    7198   3915             1777       5185
6       2      3  9413 8259    5126    666             1795       1451

view(customers)

# CHANNEL VARIABLE
#   create the factor variable
customers$channel <- as.factor(customers$Channel)  
#   rename the levels from 1 and 2 to descriptor
levels(customers$channel)[levels(customers$channel)=='1'] <- "Hotel-Restaurant"
levels(customers$channel)[levels(customers$channel)=='2'] <- "Retail"

# REGION VARIABLE
customers$region <- as.factor(customers$Region)
levels(customers$region)[levels(customers$region)=='1'] <- "Lisbon"
levels(customers$region)[levels(customers$region)=='2'] <- "Oporto"
levels(customers$region)[levels(customers$region)=='3'] <- "Other"

# DROP THE FIRST TWO COLUMNS TO AVOID CONFUSION
customers <- customers %>% select(3:10)
head(customers)

  Fresh Milk Grocery Frozen Detergents_Paper Delicassen          channel region
1 12669 9656    7561    214             2674       1338           Retail  Other
2  7057 9810    9568   1762             3293       1776           Retail  Other
3  6353 8808    7684   2405             3516       7844           Retail  Other
4 13265 1196    4221   6404              507       1788 Hotel-Restaurant  Other
5 22615 5410    7198   3915             1777       5185           Retail  Other
6  9413 8259    5126    666             1795       1451           Retail  Other
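
The same recoding can also be written in one step per variable with `factor()` and its `levels` and `labels` arguments. The sketch below is an alternative to the code above, not what produced the output shown. It runs on a small made-up data frame (the name `toy` and its values are illustrative only) rather than on the real customers data.

```r
# Toy stand-in for the integer-coded columns in customers (values are made up)
toy <- data.frame(Channel = c(2, 2, 1, 2),
                  Region  = c(3, 1, 2, 3))

# Map each integer code to its text label in a single call per variable
toy$channel <- factor(toy$Channel, levels = c(1, 2),
                      labels = c("Hotel-Restaurant", "Retail"))
toy$region  <- factor(toy$Region, levels = c(1, 2, 3),
                      labels = c("Lisbon", "Oporto", "Other"))

toy
```

Listing the levels explicitly also fixes their order, which later controls the order of bars and legend entries in the charts.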

PART 1: PRACTICE USING PIPES (dplyr) TO SUMMARIZE DATA: TWO CATEGORICAL VARIABLES

I use Region and Channel, the only two categorical variables in the dataset.

library(datasetsICR)
pip1 <- customers %>%         
  group_by(channel, region) %>%
  summarize(N = n()) %>% 
  mutate(freq = N/sum(N),
         pct = round((freq*100),0))
# Remove missing values
pip1 <- subset(pip1, !is.na(channel) & !is.na(region))

pip1

# A tibble: 6 × 5
# Groups:   channel [2]
  channel          region     N   freq   pct
  <fct>            <fct>  <int>  <dbl> <dbl>
1 Hotel-Restaurant Lisbon    59 0.198     20
2 Hotel-Restaurant Oporto    28 0.0940     9
3 Hotel-Restaurant Other    211 0.708     71
4 Retail           Lisbon    18 0.127     13
5 Retail           Oporto    19 0.134     13
6 Retail           Other    105 0.739     74
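
Because `summarize()` drops only the last grouping variable (region), the `mutate()` step still runs within each channel, so `freq` and `pct` are shares of each channel's customers rather than of all customers. As a cross-check, the sketch below rebuilds the same percentages in base R from the counts printed above. The object names `counts` and `row_pct` are illustrative.

```r
# Counts copied from the N column of pip1 (rows = channel, columns = region)
counts <- matrix(c(59, 28, 211,
                   18, 19, 105),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(channel = c("Hotel-Restaurant", "Retail"),
                                 region  = c("Lisbon", "Oporto", "Other")))

# Share of each channel's customers found in each region (each row sums to 100)
row_pct <- prop.table(counts, margin = 1) * 100
round(row_pct, 0)

# Share of all customers in each channel-region cell, for comparison
round(prop.table(counts) * 100, 0)
```

The first table should reproduce the `pct` column of `pip1`. The second shows what the percentages would look like if they were taken over the whole dataset instead.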

PART 2: CREATE STACKED AND DODGED BAR CHARTS FROM 2 CATEGORICAL VARIABLES

To create the following charts I used the two categorical variables, Region and Channel.

# Define custom colors
custom_colors <- c("lightblue", "orange")

p_title <- "Channels by Big Regions"
p_caption <- "customers dataset"

# AS STACKED BAR CHART
p <- ggplot(data = subset(pip1, !is.na(region) & !is.na(channel)),
            aes(x = region, y = pct, fill = channel))

p + geom_col(position = "stack") +
  labs(x = "Regions", y = "Percent", fill = "Channel",
       title = p_title, caption = p_caption,
       subtitle = "As a stacked bar chart") +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = custom_colors)

# AS DODGED BAR CHART
p + geom_col(position = "dodge2") +
  labs(x = "Major Region", y = "Percent", fill = "Channel",
       title = p_title, caption = p_caption,
       subtitle = "As a dodged bar chart") +
  geom_text(aes(label = pct), position = position_dodge(width = 0.9)) +
  scale_fill_manual(values = custom_colors)

# AS FACETED HORIZONTAL BAR CHART
p + geom_col(position = "dodge2") +
  labs(x = NULL, y = "Percent", fill = "Channel",
       title = p_title, caption = p_caption,
       subtitle = "As a faceted horizontal bar chart") +
  guides(fill = "none") +
  coord_flip() +
  facet_grid(~ region) +
  geom_text(aes(label = pct), position = position_dodge2(width = 1)) +
  scale_fill_manual(values = custom_colors)

These charts show how customers are distributed across regions and channels, which makes comparison easy. Because pct is computed within each channel, each bar shows the share of that channel's customers located in a region. The stacked bar chart places the two channels on top of each other within each region, while the dodged and faceted horizontal bar charts place them side by side. In both channels the Other region accounts for a much larger share (71% of Hotel-Restaurant and 74% of Retail customers) than Lisbon or Oporto.

PART 3: PRACTICE USING PIPES (dplyr) TO SUMMARIZE DATA: TWO CONTINUOUS & ONE CATEGORICAL VARIABLE

In this section I used Region as my categorical variable because it works better in this context: Channel has only two levels (Hotel-Restaurant and Retail), which would make the graph look empty.

pip2 <- customers %>%         
  group_by(region) %>%
  summarize(N = n(),
            milk_mean = mean(Milk, na.rm=TRUE), 
            det_mean = mean(Detergents_Paper, na.rm=TRUE)) %>% 
  mutate(freq = N/sum(N),
         pct = round((freq*100),0))
pip2

# A tibble: 3 × 6
  region     N milk_mean det_mean  freq   pct
  <fct>  <int>     <dbl>    <dbl> <dbl> <dbl>
1 Lisbon    77     5486.    2651. 0.175    18
2 Oporto    47     5088.    3687. 0.107    11
3 Other    316     5977.    2818. 0.718    72

This summarized data provides the count, frequency, percentage, mean milk sales, and mean detergent sales for each region. The results appear reasonable, showing variations in the mean sales across regions, which could be influenced by factors such as population density, consumer preferences, and market conditions.

PART 4: SCATTERPLOT WITH A THIRD CATEGORICAL VARIABLE

This code does not remove NA – make sure you deal with that.

custom_colors <- c("red","purple", "orange")
# na.rm is not an aesthetic, so it goes in geom_point() rather than aes()
p <- ggplot(pip2, aes(x=milk_mean, y=det_mean, color=region))

p + geom_point(size=5, na.rm = TRUE) +
  # place the note inside the data range so the axes are not stretched toward 0
  annotate(geom = "text", x = 5100, y = 3000,
           label = "Two Continuous Variables and One Categorical", hjust=0) +
  # x is milk_mean and y is det_mean, so the axis labels must match
  labs(x="Average Milk Purchase", y="Average Detergent Purchase",
       title="Relationship between Milk and Detergent",
       subtitle = "Oporto has the highest detergent purchases on average",
       caption = "customers dataset {datasetsICR}")+
  scale_color_manual(values = custom_colors)

This scatterplot shows the relationship between mean milk purchases and mean detergent purchases, with one point per region, colored by region. Oporto has the highest average detergent purchases (about 3,687) but the lowest average milk purchases (about 5,088). The Other region has the highest average milk purchases (about 5,977), and Lisbon has the lowest average detergent purchases (about 2,651), with milk purchases between those of Oporto and Other.

PART 5: LEGEND AND GUIDES

pip2$region.c <- as.character(pip2$region)
pip2 <- pip2[order(pip2$region.c),]
pip2

# A tibble: 3 × 7
  region     N milk_mean det_mean  freq   pct region.c
  <fct>  <int>     <dbl>    <dbl> <dbl> <dbl> <chr>   
1 Lisbon    77     5486.    2651. 0.175    18 Lisbon  
2 Oporto    47     5088.    3687. 0.107    11 Oporto  
3 Other    316     5977.    2818. 0.718    72 Other

custom_colors <- c("red","purple", "orange")
# na.rm is not an aesthetic, so it goes in geom_point() rather than aes()
p <- ggplot(pip2, aes(x=milk_mean, y=det_mean, color=region.c))
p + geom_point(size=5, na.rm = TRUE) +
    # place the note inside the data range so the axes are not stretched toward 0
    annotate(geom = "text", x = 5100, y = 3000, 
                     label = "Relationship between Milk and Detergent Purchases in regions", hjust=0) +
    # x is milk_mean and y is det_mean, so the axis labels must match
    labs(x="Average Milk Purchase", y="Average Detergent Purchase", 
         color = "Region",
         title="Relationship between Milk and Detergent Purchases", 
         subtitle = "Oporto has the highest detergent purchases on average",
         caption = "customers dataset {datasetsICR}") +
    theme(legend.title = element_text(color="gray50", size=14, face="bold"),
          legend.position = c(x=0.1, y=.7)) + 
    scale_color_manual(values = custom_colors)

In this plot, the legend order has been sorted alphabetically, with Lisbon first, followed by Oporto and Other. The legend title has been changed to “Region” in bold gray text. The legend has also been moved inside the plot area, toward the upper left.

PART 6: DATA LABELS VS LEGEND

custom_colors <- c("red","purple", "orange")
# na.rm is not an aesthetic, so it goes in geom_point() rather than aes()
p <- ggplot(pip2, aes(x=milk_mean, y=det_mean, color=region.c))
p + geom_point(size=5, na.rm = TRUE) +
    geom_text(mapping = aes(label=region), hjust=1.2, size=3) +
    # place the note inside the data range so the axes are not stretched toward 0
    annotate(geom = "text", x = 5100, y = 3000, 
                     label = "Relationship between Milk and Detergent Purchases in regions", hjust=0) +
    # x is milk_mean and y is det_mean, so the axis labels must match
    labs(x="Average Milk Purchases", y="Average Detergent Purchases", 
         title="Relationship between Milk and Detergent Purchases", 
         color = "Region") +
    theme(legend.position = "none") + 
    scale_color_manual(values = custom_colors)

In this plot, data labels have been added to directly display the region names instead of using a legend. The geom_text layer has been used to add text labels next to each data point, with a horizontal adjustment (hjust) to separate the labels from the points.

PART 7: INTERPRETATION

Create insights from the visualization.

The scatterplot in Part 4 visualizes the relationship between the mean purchases of milk and detergent products across regions. Each point on the scatterplot represents one region and is colored to distinguish between Lisbon, Oporto, and Other.

From the scatterplot, we can observe the following insights:

Regional variations: The scatterplot reveals distinct regional differences in the mean purchases of milk and detergent products. The Other region lies furthest right on the x-axis (highest mean milk purchases), while Oporto lies highest on the y-axis (highest mean detergent purchases). Lisbon, the capital region, has the lowest mean detergent purchases and falls between the other two on milk.
Potential correlation: With only three points, the scatterplot cannot establish a correlation. If anything, the pattern is negative rather than positive: Oporto, the region with the lowest mean milk purchases, has the highest mean detergent purchases. Regional differences like these could reflect factors such as population density, purchasing power, or consumer preferences.
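
With only three regional means any correlation is fragile, but its direction can be checked quickly. The sketch below uses the rounded means printed in Part 3, so the result is approximate. The names `milk`, `det`, and `r` are illustrative.

```r
# Rounded regional means copied from pip2 (order: Lisbon, Oporto, Other)
milk <- c(Lisbon = 5486, Oporto = 5088, Other = 5977)
det  <- c(Lisbon = 2651, Oporto = 3687, Other = 2818)

# Pearson correlation across the three regions
r <- cor(milk, det)
r

# Which region is highest on each measure
names(which.max(milk))
names(which.max(det))
```

On these values the coefficient comes out negative, meaning regions with higher mean milk purchases tend to have lower mean detergent purchases, though three points are too few to support a firm conclusion.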

Overall, the scatterplot effectively visualizes two continuous variables (mean milk and detergent purchases) while incorporating a categorical variable (region) through color coding. It allows regional variations to be identified and the association between the two product categories to be explored. This information can be valuable for businesses in understanding consumer demand patterns, tailoring marketing strategies, and optimizing product distribution across regions.