library(tidyverse)
ice_movies <- read_delim(
  "https://query.data.world/s/pmbfldxflx7ttdyfs23cx3abehcl5c",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE,
  locale = locale(encoding = "ISO-8859-1")
)
glimpse(ice_movies)
Rows: 3,799
Columns: 12
$ blog.date <date> 2021-01-25, 2021-01-25, 2021-01-25, 2021-01-25, 2…
$ rank.this.week <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ rank.last.week <dbl> NA, 4, 1, 5, 2, 8, 3, 11, 12, 10, 9, NA, 13, 14, 7…
$ weeks.in.release <dbl> NA, 16, 3, 3, 6, 8, 3, 2, 17, 8, 4, NA, 13, 2, 2, …
$ film.title <chr> "Monster Hunter", "Dragon Rider", "Honest Thief", …
$ distributor.name <chr> "Sena", "Samfilm", "Myndform", "Samfilm", "Samfilm…
$ gross.box.o.weekend <dbl> 1572126, 1417327, 934945, 688795, 662733, 520025, …
$ adm.weekend <dbl> 1045, 1204, 650, 594, 467, 320, 266, 274, 220, 197…
$ weekend.start <date> 2021-01-22, 2021-01-22, 2021-01-22, 2021-01-22, 2…
$ weekend.end <date> 2021-01-24, 2021-01-24, 2021-01-24, 2021-01-24, 2…
$ adm.to.date <dbl> 1045, 4744, 3580, 2071, 15197, 3974, 2470, 543, 58…
$ total.box.o.to.date <dbl> 1572126, 5067923, 4924054, 2293264, 21034734, 5837…
Viewing the column names to choose what data to compare.
colnames(ice_movies)
[1] "blog.date" "rank.this.week" "rank.last.week"
[4] "weeks.in.release" "film.title" "distributor.name"
[7] "gross.box.o.weekend" "adm.weekend" "weekend.start"
[10] "weekend.end" "adm.to.date" "total.box.o.to.date"
The relationship I've chosen to analyze is how box office revenue differs by distributor, using the total.box.o.to.date column.
First, I will clean up the column names.
ice_movies <- ice_movies %>%
  rename(
    distributor_name = `distributor.name`,
    total_box_office = `total.box.o.to.date`
  ) %>%
  drop_na(distributor_name, total_box_office)
I need to aggregate my data to summarize total box office revenue per distributor.
agg_data <- ice_movies %>%
  group_by(distributor_name) %>%
  summarise(total_revenue = sum(total_box_office, na.rm = TRUE)) %>%
  arrange(desc(total_revenue))
Before I plot my chart, I need to filter the top 5 distributors based on revenue.
# slice_max() is the current replacement for the superseded top_n()
top_5_distributors <- agg_data %>%
  slice_max(total_revenue, n = 5)
To show how revenue relates to distributor, I chose a bar plot.
ggplot(top_5_distributors,
       aes(x = reorder(distributor_name, total_revenue),
           y = total_revenue,
           fill = distributor_name)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 5 Distributors by Total Revenue",
       x = "Distributor",
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma)
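One thing worth double-checking in the aggregation step: total.box.o.to.date appears to be a running total per film (in the glimpse output, Dragon Rider in week 16 has a weekend gross of 1,417,327 but a to-date total of 5,067,923), and each film has one row per week. If so, summing that column over every row would count a film's earlier weekends more than once. A minimal sketch of one possible alternative, taking each film's largest running total before summing per distributor, is shown below on a small made-up tibble. The toy values and the names `toy_movies`, `film_totals` and `agg_dedup` are illustrative assumptions, not part of the dataset or the original analysis.

```r
library(dplyr)

# Toy data (made up for illustration): one row per film per week,
# with a running total in total_box_office.
toy_movies <- tibble(
  film.title       = c("Film A", "Film A", "Film A", "Film B", "Film B", "Film C"),
  distributor_name = c("Dist 1", "Dist 1", "Dist 1", "Dist 1", "Dist 1", "Dist 2"),
  total_box_office = c(100, 180, 220, 50, 80, 250)
)

# Step 1: keep each film's largest running total (its revenue to date).
film_totals <- toy_movies %>%
  group_by(distributor_name, film.title) %>%
  summarise(film_revenue = max(total_box_office), .groups = "drop")

# Step 2: add up the per-film figures for each distributor.
agg_dedup <- film_totals %>%
  group_by(distributor_name) %>%
  summarise(total_revenue = sum(film_revenue)) %>%
  arrange(desc(total_revenue))

agg_dedup
```

On the toy data, a plain sum over all rows would give 630 for Dist 1, while the per-film maximum approach gives 300. Whether this matters for the real ranking depends on how many weeks each distributor's films stayed in release.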
Key Distribution Patterns
Revenue Gap: The top distributor earns significantly more than the others, showing a clear market leader.
Industry Domination: One or two distributors control most of the revenue, indicating a highly concentrated market.
Performance Spread: The middle-ranked distributors have moderate earnings, while the lowest of the top 5 lag far behind.
Why I chose a Bar Chart
Easy comparison of revenue across distributors.
Descending order highlights the top performer instantly.
Horizontal bars improve readability of distributor names.
✔️ Strengths:
Clear ranking of top distributors.
Simple and uncluttered design.
Effective use of bar length for comparison.
✖️ Limitations:
No time trends – a line chart could show changes over time.
No market share – adding percentages would provide better context.
Limited scope – expanding to the top 10 may reveal more insights.
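The market-share limitation could be addressed by expressing each distributor's revenue as a percentage of the overall total. A rough sketch on made-up numbers follows; `toy_agg` stands in for `agg_data`, and its values, along with the names `share_data` and `top_share`, are invented for illustration.

```r
library(dplyr)

# Stand-in for agg_data with made-up revenue figures.
toy_agg <- tibble(
  distributor_name = c("Dist 1", "Dist 2", "Dist 3", "Dist 4"),
  total_revenue    = c(500, 300, 150, 50)
)

# Share of overall revenue, in percent, for every distributor.
share_data <- toy_agg %>%
  mutate(share_pct = 100 * total_revenue / sum(total_revenue)) %>%
  arrange(desc(share_pct))

# Keep the top 3 here; with the real data this would be n = 5 or n = 10.
top_share <- share_data %>%
  slice_max(total_revenue, n = 3)

top_share
```

Computing the share before filtering keeps the denominator equal to the whole market rather than just the top group, which is what makes the percentages readable as market share.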
library(tidyverse)
ice_movies <- read_delim(
  "https://query.data.world/s/pmbfldxflx7ttdyfs23cx3abehcl5c",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE,
  locale = locale(encoding = "ISO-8859-1")
)
For my temporal data analysis, I chose a connected scatterplot from chapter 5 (pg. 177) to visualize my comparison.
colnames(ice_movies)
[1] "blog.date" "rank.this.week" "rank.last.week"
[4] "weeks.in.release" "film.title" "distributor.name"
[7] "gross.box.o.weekend" "adm.weekend" "weekend.start"
[10] "weekend.end" "adm.to.date" "total.box.o.to.date"
First I need to clean and prepare my data.
ice_movies <- ice_movies %>%
  rename(
    distributor_name = `distributor.name`,
    total_box_office = `total.box.o.to.date`
  ) %>%
  drop_na(distributor_name, total_box_office)
I need to aggregate my data and summarize.
agg_data <- ice_movies %>%
  group_by(distributor_name) %>%
  summarise(total_revenue = sum(total_box_office, na.rm = TRUE)) %>%
  arrange(desc(total_revenue))
I also need to select the top 5 distributors to simplify my chart.
# slice_max() is the current replacement for the superseded top_n()
top_distributors <- agg_data %>%
  slice_max(total_revenue, n = 5)
Now I can plot my connected scatterplot.
ggplot(top_distributors, aes(x = reorder(distributor_name, -total_revenue),
                             y = total_revenue,
                             group = 1)) +
  geom_point(size = 4, color = "black") +
  # linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
  geom_line(linewidth = 1, color = "purple") +
  theme_minimal() +
  labs(title = "Total Revenue of Top 5 Distributors (Connected Scatterplot)",
       x = "Distributor",
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma)
Revenue Distribution: The revenues show a clear ranking, with the highest-earning distributor standing out prominently.
Relative Differences: There is a noticeable gap between the top-performing distributor and the others, suggesting market dominance by a few key players.
Potential Outliers: If one distributor has significantly lower revenue than the others, it may indicate external factors negatively affecting its market performance.
Although the visualization does not specifically track revenue over time, it effectively portrays differences in cumulative revenue, which may reflect long-term industry trends.
The connected scatter plot was chosen to highlight both ranking and revenue comparisons effectively.
Strengths:
✔️ Clear Comparison: The visualization makes it easy to compare revenue rankings among the top distributors.
✔️ Simple & Effective: It avoids excessive complexity while still showing meaningful patterns.
✔️ Intuitive Flow: The combination of points and connecting lines ensures viewers can follow the logical sequence effortlessly.
Limitations & Areas for Refinement:
❌ Lack of Temporal Aspect: Since the dataset is aggregated revenue, it does not explicitly show changes over time. A line graph with time-series data would be more effective if historical trends were available.
❌ Limited Context: It does not show total market revenue, making it hard to judge the relative market share. A stacked bar chart or area plot could complement this.
❌ Potential for Overlapping Points: If distributors have close revenue values, the points might overlap. Adding labels could help mitigate this issue.
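The missing temporal aspect could be explored with the date columns the dataset already has. The sketch below is one possible approach, not the analysis used above: it sums gross.box.o.weekend per distributor per calendar month and draws one line per distributor. It runs on a small made-up tibble; `toy_weekly`, `monthly` and all values are illustrative assumptions, and weekend gross is used instead of the to-date total because the latter appears to be cumulative.

```r
library(dplyr)
library(ggplot2)

# Toy weekly data (made up): one row per distributor per weekend.
toy_weekly <- tibble(
  weekend.start       = as.Date(c("2021-01-08", "2021-01-22", "2021-02-05",
                                  "2021-01-08", "2021-01-22", "2021-02-05")),
  distributor_name    = c("Dist 1", "Dist 1", "Dist 1",
                          "Dist 2", "Dist 2", "Dist 2"),
  gross.box.o.weekend = c(100, 150, 200, 80, 60, 90)
)

# Sum weekend gross per distributor per calendar month.
monthly <- toy_weekly %>%
  mutate(month = as.Date(format(weekend.start, "%Y-%m-01"))) %>%
  group_by(distributor_name, month) %>%
  summarise(monthly_gross = sum(gross.box.o.weekend), .groups = "drop")

# One line per distributor, with time on the x-axis.
p <- ggplot(monthly, aes(x = month, y = monthly_gross, color = distributor_name)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "Monthly Weekend Gross by Distributor (toy data)",
       x = "Month", y = "Weekend Gross", color = "Distributor")

p
```

With the real data, the same pipeline could be restricted to the top 5 distributors first so the chart stays readable.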
The bar chart and connected scatter plot complement each other to give a complete picture of revenue distribution:
Bar Chart – Shows total revenue per distributor with clear, easy-to-compare heights.
Connected Scatterplot – Highlights relative positioning and trends across distributors with a connecting line.
Together, they provide both absolute comparisons (bar chart) and a view of how steeply revenue falls from one rank to the next (scatterplot) for a clearer story.
Trends Over Time – Have distributor revenues changed consistently year-over-year?
Market Share – Do the top 5 distributors dominate, or are smaller players growing?
Genre Impact – Do certain distributors specialize in high-revenue genres compared to others?
External Factors – How do events or seasonal trends affect revenue?
Consumer Behavior – Is there brand loyalty among audiences?
Generative AI use: I used generative AI to assist me in the creation of my bar chart and my connected scatter plot.
Bar chart creation directions: “https://chatgpt.com/share/67c76bf0-8194-8003-94b1-9fd1bab0a929”
Connected scatterplot directions: “https://chatgpt.com/share/67c76c69-4fe4-8003-a029-196fbf9b6a66”
I followed the generative AI output as an outline and used the data I gathered from glimpsing ice_movies and viewing the column names to decide what to compare.