I chose the Mango Prices dataset because I felt I could use it to make graphics that were both thematically relevant and visually interesting. The dataset includes time as a variable, which featured in many of the video tutorials, and I wanted to practice building visuals around time. The summary statistics show very high values for Mango_4046 and Mango_4225, which are small and medium mangos respectively. Medium mangos (PLU 4225) have the highest average and median sales, followed by small mangos (PLU 4046).
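
The summary statistics mentioned above could be computed along these lines. This is only a minimal sketch: `toy` is a hypothetical stand-in for the real dataset, its numbers are invented for illustration, and only the column names `Mango_4046`, `Mango_4225`, and `Mango_4770` come from the report.

```r
library(dplyr)

# Hypothetical stand-in for mango_prices_dataset; the values are invented.
toy <- data.frame(
  Mango_4046 = c(100, 200, 300),  # small mangos
  Mango_4225 = c(150, 250, 500),  # medium mangos
  Mango_4770 = c(50, 60, 70)      # large mangos
)

# One mean and one median column per size,
# named like Mango_4046_mean and Mango_4046_median.
size_summary <- toy %>%
  summarise(across(
    c(Mango_4046, Mango_4225, Mango_4770),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE)
    )
  ))

print(size_summary)
```

Swapping `toy` for `mango_prices_dataset` after it has been loaded should give the same kind of table for the real data.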

library(dplyr)
library(ggplot2)
library(tidyr)
library(lubridate)
library(scales)
library(ggrepel)
library(plotly)

mango_prices_dataset <- read.csv("mango_prices_dataset.csv")

mango_prices_dataset <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    year = substr(as.character(Date), 1, 4),
    month_date = floor_date(Date, "month"),
    small_mangos = Mango_4046,
    medium_mangos = Mango_4225,
    large_mangos = Mango_4770
  )

mango_year <- mango_prices_dataset %>%
  group_by(year) %>%
  summarise(
    small_mangos = sum(small_mangos, na.rm = TRUE),
    medium_mangos = sum(medium_mangos, na.rm = TRUE),
    large_mangos = sum(large_mangos, na.rm = TRUE),
    .groups = "drop"
  )

mango_long <- mango_year %>%
  pivot_longer(
    cols = c(small_mangos, medium_mangos, large_mangos),
    names_to = "type",
    values_to = "total_sold"
  )

ggplot(mango_long, aes(x = year, y = total_sold, fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Mango sales by type and year",
    x = "",
    y = "Total mangos sold",
    fill = "Mango type"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c(
    small_mangos = "gold",
    medium_mangos = "orange",
    large_mangos = "darkgreen"
  ))

Overall, I feel this is a strong representation of the total number of mangos sold by type and year. My goal was to analyze the data by year, similar to the horizontal stacked bar chart tutorial, and I think the result is a helpful graph with relevant colors. It indicates that 2016 had the most mangos sold, but only barely. It also shows that 2018’s data is not as extensive as that of the other years.

mango_prices_dataset <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    month_date = floor_date(Date, "month")
  )

monthly_totals <- mango_prices_dataset %>%
  group_by(month_date) %>%
  summarise(
    volume_k = sum(`Total.Volume`, na.rm = TRUE) / 1000,
    .groups = "drop"
  )

extreme_points <- monthly_totals %>%
  filter(
    volume_k == max(volume_k) |
      volume_k == min(volume_k)
  )

ggplot(monthly_totals, aes(x = month_date, y = volume_k)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(size = 2, color = "darkgreen") +
  geom_point(data = extreme_points, size = 4, color = "orange") +
  geom_label_repel(
    data = extreme_points,
    aes(label = comma(round(volume_k))),
    size = 3.5,
    box.padding = 0.6,
    point.padding = 0.5,
    segment.color = "darkgreen"
  ) +
  labs(
    title = "Monthly mango volume over time",
    x = "",
    y = "Volume (thousands)"
  ) +
  scale_y_continuous(labels = comma) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid.minor = element_blank()
  )

My second visualization shows overall mango volume over time, with the minimum and maximum labeled. The maximum was 16,274,128 in October of 2016, and the minimum was 11,540,729, following a steep decline in 2018 that showed up throughout my analysis. Seeing mango volume across the years gives a more complete understanding of the data.
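
The labeled extremes can be double-checked outside the plot. The sketch below assumes a table shaped like `monthly_totals` (a `month_date` column and a `volume_k` column); `toy_monthly` and its values are hypothetical and only illustrate the lookup.

```r
library(dplyr)

# Hypothetical monthly totals; the values are invented for illustration.
toy_monthly <- data.frame(
  month_date = as.Date(c("2016-09-01", "2016-10-01", "2016-11-01", "2018-03-01")),
  volume_k = c(14000, 16000, 15000, 11500)
)

# Take the single highest and single lowest month, then add a
# year-month label so the rows are easy to read.
extremes <- bind_rows(
  slice_max(toy_monthly, volume_k, n = 1),
  slice_min(toy_monthly, volume_k, n = 1)
) %>%
  mutate(month_label = format(month_date, "%Y-%m"))

print(extremes)
```

Running the same steps on `monthly_totals` should return the two months highlighted in orange on the chart.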

df <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    year = year(Date),
    month = month(Date, label = TRUE)
  )

monthly_avg <- df %>%
  group_by(year, month) %>%
  summarise(
    avg_price = mean(AveragePrice, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(monthly_avg, aes(x = month, y = avg_price, color = factor(year), group = year)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_color_manual(values = c(
    "darkgreen",
    "orange",
    "tan",
    "gold"
  )) +
  labs(
    title = "Average mango price by month",
    x = "",
    y = "Average price",
    color = "Year"
  ) +
  scale_y_continuous(labels = dollar_format()) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid.minor = element_blank()
  )

This multiple line plot visualizes the average mango price by month and year. The goal was to see whether there are any obvious patterns or trends in mango-buying habits depending on the season (or month specifically). This visualization was inspired by the multiple line plots tutorial, because I again wanted to practice with time as a variable. The 2016 and 2017 data seemed more similar to each other than to any other year; other than that, there were no visible trends.

size_totals <- mango_prices_dataset %>%
  summarise(
    small = sum(small_mangos, na.rm = TRUE),
    medium = sum(medium_mangos, na.rm = TRUE),
    large = sum(large_mangos, na.rm = TRUE)
  )

donut_df <- data.frame(
  size = c("Small", "Medium", "Large"),
  total = c(size_totals$small,
            size_totals$medium,
            size_totals$large)
)

total_all <- sum(donut_df$total)

donut_df %>%
  plot_ly(labels = ~size,
          values = ~total) %>%
  add_pie(hole = 0.6,
          textposition = "outside",
          textinfo = "label+percent",
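          # Colors follow the size order above (Small, Medium, Large)
          # so each size keeps the color it has in the bar chart
          marker = list(colors = c("gold",
                                   "orange",
                                   "darkgreen"))) %>%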
  layout(
    title = "Total mango sales by size",
    annotations = list(
      list(
        text = paste0("Total Mangos<br>", comma(total_all)),
        x = 0.5,
        y = 0.5,
        showarrow = FALSE,
        font = list(size = 14)
      )
    )
  )

This labeled donut visualization shows the total number of mangos sold over roughly four years, broken down by size. The largest category is “Medium” with 36.4%, followed by “Small” with 32.6%, and finally “Large” with 31%. This is one of my favorite visuals because it is simple to understand yet informative, especially with the percentages. Additionally, the solid bright colors remind me of mangos.
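
The percentages in the donut are each size's total divided by the grand total. A minimal sketch of that arithmetic follows; `toy_sizes` is hypothetical, and its totals are made-up numbers chosen only so that the shares mirror the ones reported above.

```r
library(dplyr)
library(scales)

# Hypothetical totals (not the real counts), picked to mirror the reported shares.
toy_sizes <- data.frame(
  size = c("Small", "Medium", "Large"),
  total = c(326, 364, 310)
)

# Share of the grand total for each size, formatted to one decimal place
# and sorted from largest to smallest.
shares <- toy_sizes %>%
  mutate(
    share = total / sum(total),
    share_label = percent(share, accuracy = 0.1)
  ) %>%
  arrange(desc(share))

print(shares)
```

Applying the same `mutate()` to `donut_df` would give the shares behind the real chart.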

heat_df <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    year = year(Date),
    month = month(Date, label = TRUE)
  ) %>%
  group_by(year, month) %>%
  summarise(
    volume_k = sum(`Total.Volume`, na.rm = TRUE) / 1000,
    .groups = "drop"
  )

ggplot(heat_df, aes(x = factor(year), y = month, fill = volume_k)) +
  geom_tile(color = "white") +
  scale_fill_gradient(
    low = "lightgreen",
    high = "orange",
    labels = comma
  ) +
  labs(
    title = "Monthly mango volume by year",
    x = "",
    y = "",
    fill = "Volume (thousands)"
  ) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid = element_blank()
  )

For the last visualization, I wanted to try a heat map of monthly mango volume per year. Like the line graph, it compares month and year, but it measures volume instead of average price. October of 2016 had the highest volume of mangos, while March of 2018 had the lowest.