Question 1

(a)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)
library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.0     ✔ tibble    3.2.1
## ✔ purrr     1.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ purrr::discard()     masks scales::discard()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::lag()         masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(conflicted)
library(boxplotdbl)
# Reading data
pizza <- read_csv("26T1/pizza.csv", show_col_types = FALSE)

# Calculate the average sodium content of each brand
brand_sodium <- pizza %>%        
  group_by(brand) %>%
  summarise(mean_sodium = mean(sodium, na.rm = TRUE)) %>%
  arrange(mean_sodium) %>%                            # Sort in ascending order
  mutate(brand = factor(brand, levels = brand))      

# Mark the color categories of the first two.
brand_sodium <- brand_sodium %>%     
  mutate(
    category = case_when(
      row_number() <= 2 ~ "Lowest",
      TRUE ~ "Other"
    )
  )

# Obtain the data range of the lowest group
lowest_max <- max(brand_sodium$mean_sodium[brand_sodium$category == "Lowest"])

# Plot
p <- ggplot(brand_sodium, aes(x = brand, y = mean_sodium, fill = category)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = sprintf("%.2f", mean_sodium)),
            vjust = -0.5, size = 3.5) +
  scale_fill_manual(
    values = c("Lowest" = "darkorange", 
               "Other" = "gray80")
  ) +
  labs(
    title = "Average Sodium Content by Pizza Brand",
    subtitle = "Lowest two brands highlighted in orange",
    x = "Brand",
    y = "Sodium (g/100g)",
    fill = "Brand Group"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

# Highlight the two lowest brands (the first two)
p + annotate("rect", 
             xmin = 0.5, xmax = 2.5,                    
             ymin = 0,                               
             ymax = lowest_max * 1.1,                    
             alpha = 0.1, fill = "darkorange") +
  annotate("text", 
           x = 1.5,                                     
           y = lowest_max * 1.5,                      
           label = "Lowest sodium", 
           color = "darkorange", size = 3.5, fontface = "bold")

(a)

1

Target Audience: Tom needs to launch a marketing campaign for pizza to promote healthy eating and introduce low-sodium products. The product data must be accurate and reliable, with the information on low-sodium products more prominently displayed to quickly draw attention to these product data.

3

Aesthetic Design: X-axis: Brand name Y-axis: Average sodium content Bar height: The value of average sodium content Bar fill color: Brand group (the two brands with the lowest sodium content; other brands) Bar label: The exact value of average sodium content Rectangular annotation: Use explanatory text to highlight the brand group with the lowest sodium content.

Pre-Attention: Color: The deep orange color stands out immediately against the gray background, allowing the low-sodium brand to be located without a detailed search. Length: The significant height differences among the columns, arranged in ascending order, make the growth trend clearly visible.

Gestalt principles: Similarity: Columns of the same color naturally form a group (the lowest group). Closure: The semi-transparent rectangular frame encloses the two brands with the lowest sodium content, reinforcing the perception of the lowest group.

Conclusion: The deep orange bars in the chart represent Brand H and Brand G, which have the lowest sodium content among all brands, at 0.42 grams per 100 grams and 0.44 grams per 100 grams respectively. The color and labels enable me to immediately identify these key brands. ## (b)

# Calculate energy: 17 kJ/g for protein, 37 kJ/g for fat, and 17 kJ/g for carbohydrates
pizza <- pizza %>%
  mutate(energy = prot * 17 + fat * 37 + carb * 17)

# Sort by median brand energy
brand_order <- pizza %>%
  group_by(brand) %>%
  summarise(med_energy = median(energy)) %>%
  arrange(med_energy) %>%
  pull(brand)

pizza$brand <- factor(pizza$brand, levels = brand_order)

# Mark brand C for color filling.
pizza$highlight <- ifelse(pizza$brand == "C", "Brand C", "Others")

# Basic Box Plot
p <- ggplot(pizza, aes(x = brand, y = energy, fill = highlight)) +
  geom_boxplot(alpha = 0.7, outlier.color = "firebrick", outlier.shape = 1) +
  scale_fill_manual(values = c("Brand C" = "darkorange", "Others" = "gray80")) +
  labs(
    title = "Comparison of Total Energy Content Across Pizza Brands",
    subtitle = "Brands ordered by median energy (lowest to highest)",
    x = "Brand",
    y = "Energy (kJ per 100g)",
    fill = "Brand"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

# Add manual annotations, emphasizing brand C with "annotate"
# Find the position index of brand C on the x-axis
brand_c_index <- which(levels(pizza$brand) == "C")

# Obtain the approximate range of the y-axis for brand C to place the rectangle box
p + 
  annotate("rect",
           xmin = brand_c_index - 0.4,
           xmax = brand_c_index + 0.4,
           ymin = 1050,   # 根据实际数据微调
           ymax = 1350,
           alpha = 0.2,
           fill = "darkorange") +
  annotate("text",
           x = brand_c_index,
           y = 1600,
           label = "Brand C",
           color = "darkorange",
           fontface = "bold",
           size = 4)

1

Target Audience: Amy works at Brand C. The marketing and sales director needs to understand the energy data reports of their own brand and competing brands in order to formulate positioning strategies for their own products. The report must clearly mark the data of Brand C to quickly locate their own products.

3

Aesthetics: X-axis: Brand name Y-axis: Total energy per 100 grams of pizza Box plot components: - Thick center line: Median - Upper and lower boundaries of the box: First and third quartiles - Whiskers: Data points within 1.5 times the interquartile range (IQR) - Points: Outliers Fill color: Brand C is highlighted in orange to immediately draw visual attention, while other brands are displayed in light gray. Manual annotation: Add a semi-transparent orange rectangle overlay and the label “Brand C” to further emphasize the position of the target brand.

Pre-Attention: Color: The orange container stands out prominently against the gray background, facilitating quick identification of Brand C. Position: The vertical height of the container is directly related to the energy value, and the middle position makes it easy to conduct quick comparative analysis.

Gestalt principles: Similarity: The gray boxes are classified as “Other Brands”, while the orange boxes are regarded as a different category. Closure: Each box serves as an independent visual unit, representing the energy distribution of the corresponding brand. Continuity: Brands are arranged from left to right in ascending order of energy, facilitating readers’ better understanding of the competitive data.

Conclusion: The median energy of brand C in the graph is approximately 1200 kj per 100 grams, slightly lower than the market average: it is higher than brand I on the far left but lower than brands J and D on the far right. Brand A has the highest energy. The box width of brand C is moderate, indicating that the energy of its product line is relatively stable.

(c)

# Calculate the energy contribution of each component
# Protein: 17 kJ/g, Fat: 37 kJ/g, Carbohydrate: 17 kJ/g
pizza <- pizza %>%
  mutate(
    energy_fat   = fat * 37,
    energy_prot  = prot * 17,
    energy_carb  = carb * 17,
    energy_total = energy_fat + energy_prot + energy_carb
  )

# Calculate the average percentage of each energy source by brand
brand_energy <- pizza %>%
  group_by(brand) %>%
  summarise(
    fat_pct   = mean(energy_fat / energy_total) * 100,
    prot_pct  = mean(energy_prot / energy_total) * 100,
    carb_pct  = mean(energy_carb / energy_total) * 100
  ) %>%
  arrange(fat_pct) %>%                    # Sort in ascending order by fat ratio
  mutate(brand = factor(brand, levels = brand))

# Convert to long format for ggplot plotting convenience
brand_long <- brand_energy %>%
  pivot_longer(
    cols = c(fat_pct, prot_pct, carb_pct),
    names_to = "source",
    values_to = "percentage"
  ) %>%
  mutate(
    source = factor(source, 
                    levels = c("fat_pct", "prot_pct", "carb_pct"),
                    labels = c("Fat", "Protein", "Carbohydrate"))
  )

# Plot Stacked Percentage Bar Charts
p <- ggplot(brand_long, aes(x = brand, y = percentage, fill = source)) +
  geom_col(position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  scale_fill_manual(
    values = c("Fat"         = "darkorange",
               "Protein"     = "olivedrab3",
               "Carbohydrate" = "lightblue")
  ) +
  labs(
    title = "Proportions of Energy from Fat, Protein, and Carbohydrates by Pizza Brand",
    subtitle = "Brands are ordered by increasing fat percentage",
    x = "Brand",
    y = "Percentage of total energy",
    fill = "Nutrient source"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

# Add numerical labels within each bar
p <- p + geom_text(aes(label = sprintf("%.1f%%", percentage)),
                   position = position_fill(vjust = 0.5),
                   size = 3, color = "white")

# Mark the brand with the lowest fat ratio and the brand with the highest fat ratio
n_brands <- nrow(brand_energy)
p + 
  annotate("rect", xmin = 0.5, xmax = 1.5, ymin = 0, ymax = 1,
           alpha = 0.2, fill = "gray10", color = NA) +
  annotate("text", x = 1, y = 0.4, label = "Lowest fat", 
           color = "red", size = 3, fontface = "bold") +
  annotate("rect", xmin = n_brands - 0.5, xmax = n_brands + 0.5, 
           ymin = 0, ymax = 1, alpha = 0.2, fill = "gray10", color = NA) +
  annotate("text", x = n_brands, y = 0.4, label = "Highest fat", 
           color = "red", size = 3, fontface = "bold")

1

Target audience: Vivian is an editor of a nutrition magazine. She needs to publish the energy ratios of different brands of pizza products to determine if these products are healthy. The published chart must be able to clearly compare the differences between each product, and the data must be accurate.

#3 Aesthetics: X-axis: Brand names (arranged in ascending order of the proportion of energy contribution from fat) Y-axis: The contribution percentage of each energy source Fill colors: Dark orange: Fat Olive green 3: Protein Light blue: Carbohydrates Square height: Represents the average energy proportion provided by each nutrient White numerical labels: Located in the center of each colored square, showing the exact percentage. Semi-transparent rectangular frames: Highlight the areas corresponding to the brands with the lowest and highest fat proportions.

Pre-attention: Color scheme: Three nutrients are represented by contrasting colors, allowing for immediate visual distinction of different energy sources; the deep orange section representing fat is particularly prominent, facilitating tracking of the proportionate changes in fat content. Numeric labels: White numbers are clearly legible against the colored background, enabling accurate data interpretation without relying on coordinate axes.

Gestalt principles: Similarity: Colored squares of the same hue are automatically classified into the same nutritional category by different brands, facilitating the comparison of the proportion of this nutrient among different brands. Closure: Each brand is composed of three colored squares forming a complete bar, which is visually perceived as a whole, representing the energy composition of the brand. Continuity: Brands are arranged from left to right according to fat content, creating a gradient of increasing fat proportion, which helps readers perceive the overall trend.

Conclusion: There are significant differences in the energy source composition among different pizza brands. For the majority of the brands, fat is the main energy source, accounting for 38% to 82% respectively. The brands with the lowest fat proportion are G and H (about 38%), and they have a relatively higher proportion of carbohydrates (about 50%); while the brands with the highest fat proportion are A and B (about 80%), and their protein and carbohydrate proportions are correspondingly lower.

Question2

(a)

library(readr)
library(ggplot2)
# Reading data
bikes <- read_csv("26T1/bikes.csv", show_col_types = FALSE)

# An examination of the quantities corresponding to each factor of weathersit (1, 2, 3, 4) revealed an absence of factor 4
table(bikes$weathersit)

## 
##   1   2   3 
## 463 247  21

# Convert weathersit to a factor (1, 2, 3 representing different weather types)
bikes$weathersit <- factor(bikes$weathersit, 
                           labels = c("Sunny", "Misty", "Rain/Snow"))

# Fitting a Linear Regression Model (with predictor variables: yr, workingday, temp, weathersit)
model <- lm(cnt ~ yr + workingday + temp + weathersit, data = bikes)

# Output the model summary
summary(model)

## 
## Call:
## lm(formula = cnt ~ yr + workingday + temp + weathersit, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3317.3  -622.1   -48.1   763.0  2644.7 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           597.39     128.97   4.632 4.30e-06 ***
## yr                   2033.06      74.95  27.124  < 2e-16 ***
## workingday            184.28      80.61   2.286   0.0225 *  
## temp                 6092.42     206.18  29.549  < 2e-16 ***
## weathersitMisty      -571.66      80.14  -7.133 2.38e-12 ***
## weathersitRain/Snow -2156.08     226.56  -9.517  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 725 degrees of freedom
## Multiple R-squared:  0.7303, Adjusted R-squared:  0.7285 
## F-statistic: 392.7 on 5 and 725 DF,  p-value: < 2.2e-16

(a)

I will choose yr, workingday, temp, and weathersit as my predictor variables. Yr can reflect the trend of annual growth; workingday can indicate whether working days affect the need for commuting; temp can show whether the temperature is comfortable for people traveling; weathersit can demonstrate the impact of different weather conditions on travel. If it is a certain day one year later, these four variables are relatively easy to obtain or predict compared to other variables.

(b)

# Two pictures in one line
par(mfrow = c(1, 2))

# Figure 1: Residuals vs Fitted Values
plot(model, which = 1, caption = "Residuals vs Fitted")

# Figure 2: Normal Q-Q Plot
plot(model, which = 2, caption = "Normal Q-Q")

(b)

Residuals vs Fitted: This graph is used to test for linearity and homoscedasticity. The residual points in the graph are roughly randomly distributed on both sides of the zero line and do not show a clear curve trend, indicating that the linear assumption is basically reasonable and the variance is approximately constant. However, when the adjusted values are large (> 6000), the residuals slightly deviate, suggesting the possible existence of mild but not severe heteroscedasticity. Therefore, no obvious violation of the linear assumption was found.

Normal Q-Q Plot: This plot is used to test the normality of the residuals. The points in the plot are generally distributed along the reference line, but there is a slight deviation at both ends, especially the upper end. This indicates that the right tail of the residual distribution is slightly thicker, meaning there are some relatively large positive residuals. This might be due to extreme weather conditions and other factors that cause peak periods for car rentals. Therefore, the residuals are approximately normal but with a slight deviation, which may affect the accuracy of the prediction interval.

(c)

# Extraction of model coefficients and confidence intervals (CI)
cf_df <- as.data.frame(summary(model)$coefficients)
cf_df$variable <- rownames(cf_df)                     # Predictive Variable
ci_int <- confint(model)                               # 95% Confidence Interval
cf_df$lower <- ci_int[, 1]                             # Lower
cf_df$upper <- ci_int[, 2]                             # Upper

# Remove the intercept term
cf_df <- cf_df[cf_df$variable != "(Intercept)", ]

# Draw Forest plot
ggplot(cf_df, aes(x = Estimate, y = reorder(variable, Estimate))) +
  geom_point(size = 3, color = "lightblue") +               # Data point
  geom_errorbar(aes(xmin = lower, xmax = upper),            # Add error bars
                width = 0.2,                                 # Terminal line width of error bars
                color = "gray30", 
                orientation = "y") +                         # Set the specified direction to horizontal
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +  # Zero line
  labs(
    title = "Forest Plot of Regression Coefficients",
    x = "Coefficient Estimate (95% CI)",
    y = "Predictive Variable"
  ) +
  theme_minimal()

(c)

The coefficient of yr is approximately 2000, and the confidence interval is far from 0, indicating that the average daily rental volume in 2012 was about 2000 vehicles more than that in 2011, reflecting an overall increase in demand. The coefficient of workingday is approximately 200, and the interval is positive, suggesting that the rental volume on working days is about 200 vehicles more than that on non-working days, possibly due to the increased commuting demand. The coefficient of temp is approximately 6000, meaning that for each unit increase in temperature (normalized temperature), the rental volume increases by about 6000 vehicles, indicating a strong positive impact of temperature on rental behavior. The coefficient of weathersitMisty (foggy days) is approximately -600, indicating that the average rental volume on foggy days is about 600 vehicles less than that on sunny days. The coefficient of weathersitRain/Snow (rainy and snowy days) is approximately -2000, suggesting that the average rental volume on rainy and snowy days is about 2000 vehicles less than that on sunny days, with a much greater impact than foggy days. The confidence intervals of all variables do not include 0, indicating that they all have significant effects on the average rental volume. Among them, temperature has the greatest impact, followed by rainy and snowy weather, and both year and working days also have obvious positive effects.

DATA5002 Assignment 1

Qiang Xiang

Question 1

(a)

(a)

1

3

1

3

(c)

1

Question2

(a)

(a)

(b)

(b)

(c)

(c)