I chose the Mango Prices dataset because I felt I could make graphics that were both thematically relevant and visually interesting. The dataset includes time as a variable, which featured in many of the video tutorials, and I wanted to practice creating visuals with time as a factor. The summary statistics show very high values for Mango_4046 and Mango_4225, which are small and medium-sized mangos respectively. Medium mangos (PLU 4225) have the highest average and median sales, followed by small mangos (PLU 4046).
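A minimal sketch of how summary statistics like these can be computed, written in base R so it runs without extra packages. The data frame `toy` and its values are hypothetical stand-ins for the real dataset; only the column names Mango_4046, Mango_4225, and Mango_4770 come from this report.

```r
# Hypothetical toy rows standing in for mango_prices_dataset (values invented)
toy <- data.frame(
  Mango_4046 = c(100, 200, 300),
  Mango_4225 = c(150, 250, 500),
  Mango_4770 = c(10, 20, 60)
)

# Mean and median of each PLU column, one row per statistic
plu_cols <- c("Mango_4046", "Mango_4225", "Mango_4770")
plu_summary <- rbind(
  mean   = sapply(toy[plu_cols], mean, na.rm = TRUE),
  median = sapply(toy[plu_cols], median, na.rm = TRUE)
)
print(plu_summary)

# Name of the column with the highest mean
top_by_mean <- names(which.max(plu_summary["mean", ]))
```

On the real data, the same pattern would presumably be applied to `mango_prices_dataset` instead of `toy`.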
library(dplyr)
library(ggplot2)
library(tidyr)
library(lubridate)
library(scales)
library(ggrepel)
library(plotly)
# Load the data
mango_prices_dataset <- read.csv("mango_prices_dataset.csv")

# Parse dates, derive year and month, and give the PLU columns readable names
mango_prices_dataset <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    year = substr(as.character(Date), 1, 4),
    month_date = floor_date(Date, "month"),
    small_mangos = Mango_4046,   # PLU 4046
    medium_mangos = Mango_4225,  # PLU 4225
    large_mangos = Mango_4770    # PLU 4770
  )
# Total mangos sold per year for each size
mango_year <- mango_prices_dataset %>%
  group_by(year) %>%
  summarise(
    small_mangos = sum(small_mangos, na.rm = TRUE),
    medium_mangos = sum(medium_mangos, na.rm = TRUE),
    large_mangos = sum(large_mangos, na.rm = TRUE),
    .groups = "drop"
  )

# Reshape to long format: one row per year and mango type
mango_long <- mango_year %>%
  pivot_longer(
    cols = c(small_mangos, medium_mangos, large_mangos),
    names_to = "type",
    values_to = "total_sold"
  )
# Horizontal stacked bar chart of yearly totals by type
ggplot(mango_long, aes(x = year, y = total_sold, fill = type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Mango sales by type and year",
    x = "",
    y = "Total mangos sold",
    fill = "Mango type"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c(
    small_mangos = "gold",
    medium_mangos = "orange",
    large_mangos = "darkgreen"
  ))
Overall, I feel this is a clear representation of the total number of
mangos sold by type and year. My goal was to analyze the data by year,
similar to the horizontal stacked bar chart tutorial, and I feel this is
a helpful graph with relevant colors. It indicates that 2016 had the
most mangos sold, but only barely. It also shows that 2018’s data is
not as extensive as that of the other years.
# Make sure Date is a Date and derive the first day of each month
mango_prices_dataset <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    month_date = floor_date(Date, "month")
  )

# Total volume per month, in thousands
monthly_totals <- mango_prices_dataset %>%
  group_by(month_date) %>%
  summarise(
    volume_k = sum(`Total.Volume`, na.rm = TRUE) / 1000,
    .groups = "drop"
  )

# Months with the highest and lowest volume, used for the labels
extreme_points <- monthly_totals %>%
  filter(
    volume_k == max(volume_k) |
      volume_k == min(volume_k)
  )
ggplot(monthly_totals, aes(x = month_date, y = volume_k)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(size = 2, color = "darkgreen") +
  geom_point(data = extreme_points, size = 4, color = "orange") +
  geom_label_repel(
    data = extreme_points,
    aes(label = comma(round(volume_k))),
    size = 3.5,
    box.padding = 0.6,
    point.padding = 0.5,
    segment.color = "darkgreen"
  ) +
  labs(
    title = "Monthly mango volume over time",
    x = "",
    y = "Volume (thousands)"
  ) +
  scale_y_continuous(labels = comma) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid.minor = element_blank()
  )
My second visualization shows overall mango volume over time, with the
minimum and maximum labeled. The maximum was 16,274,128 in October of
2016, and the minimum was 11,540,729, following a steep decline in 2018
that showed up in all of my research. Seeing mango volume across the
years builds a more complete understanding of the data.
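# Split each date into a numeric year and a labeled month (Jan to Dec)
df <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    year = year(Date),
    month = month(Date, label = TRUE)
  )

# Average price for each year and month combination
monthly_avg <- df %>%
  group_by(year, month) %>%
  summarise(
    avg_price = mean(AveragePrice, na.rm = TRUE),
    .groups = "drop"
  )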
# One line per year, months along the x-axis
ggplot(monthly_avg, aes(x = month, y = avg_price, color = factor(year), group = year)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_color_manual(values = c(
    "darkgreen",
    "orange",
    "tan",
    "gold"
  )) +
  labs(
    title = "Average mango price by month",
    x = "",
    y = "Average price",
    color = "Year"
  ) +
  scale_y_continuous(labels = dollar_format()) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid.minor = element_blank()
  )
This is a multiple-line plot that visualizes the average mango price by
month and year. The goal was to see whether there are any obvious
patterns or trends in mango-buying habits depending on the season (or
month specifically). This visualization was inspired by the multiple
line plots tutorial, because I again wanted to practice with time as a
variable. The 2016 and 2017 data appeared more similar to each other
than to any other year; other than that, there were no visible trends.
# Total mangos sold per size across the whole dataset
size_totals <- mango_prices_dataset %>%
  summarise(
    small = sum(small_mangos, na.rm = TRUE),
    medium = sum(medium_mangos, na.rm = TRUE),
    large = sum(large_mangos, na.rm = TRUE)
  )

# One row per size, in the shape plotly expects
donut_df <- data.frame(
  size = c("Small", "Medium", "Large"),
  total = c(size_totals$small,
            size_totals$medium,
            size_totals$large)
)

# Grand total shown in the center of the donut
total_all <- sum(donut_df$total)
donut_df %>%
  plot_ly(labels = ~size,
          values = ~total) %>%
  add_pie(hole = 0.6,
          textposition = "outside",
          textinfo = "label+percent",
          # Colors follow the row order (Small, Medium, Large) and reuse
          # the stacked bar chart's palette
          marker = list(colors = c("gold",
                                   "orange",
                                   "darkgreen"))) %>%
  layout(
    title = "Total mango sales by size",
    annotations = list(
      list(
        text = paste0("Total Mangos<br>", comma(total_all)),
        x = 0.5,
        y = 0.5,
        showarrow = FALSE,
        font = list(size = 14)
      )
    )
  )
This labeled donut visualization shows the total number of mangos sold over roughly four years, broken down by size. The largest category is “Medium” with 36.4%, followed by “Small” with 32.6% and finally “Large” with 31%. This is one of my favorite visuals because it is simple to understand yet informative, especially with the percentages. Additionally, the solid bright colors remind me of mangos.
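The percentages on the donut are each size's share of the combined total. A small sketch of that arithmetic in base R follows; the totals are hypothetical numbers chosen only so that the shares match the percentages reported above, and they are not the dataset's actual values.

```r
# Hypothetical totals per size (invented for illustration, not from the dataset)
totals <- c(Small = 326, Medium = 364, Large = 310)

# Share of each size as a percentage of the combined total, to one decimal place
shares <- round(100 * totals / sum(totals), 1)
print(shares)

# Category with the largest share
largest <- names(which.max(shares))
```

With the real data, `totals` would presumably be built from the three sums in `size_totals`.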
# Total volume (in thousands) for each year and month combination
heat_df <- mango_prices_dataset %>%
  mutate(
    Date = as.Date(Date),
    year = year(Date),
    month = month(Date, label = TRUE)
  ) %>%
  group_by(year, month) %>%
  summarise(
    volume_k = sum(`Total.Volume`, na.rm = TRUE) / 1000,
    .groups = "drop"
  )
# Heat map: years across, months down, fill shows volume
ggplot(heat_df, aes(x = factor(year), y = month, fill = volume_k)) +
  geom_tile(color = "white") +
  scale_fill_gradient(
    low = "lightgreen",
    high = "orange",
    labels = comma
  ) +
  labs(
    title = "Monthly mango volume by year",
    x = "",
    y = "",
    fill = "Volume (thousands)"
  ) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid = element_blank()
  )
For the last visualization I wanted to try a heat map of the monthly
mango volume per year. Similar to the line graph, it compares month and
year, but instead of average price it measures volume. October of 2018
had the highest volume of mangos, while March of 2018 had the lowest.
```