Assignment 2

Author

Brady Heath

Code
library(tidyverse)
ice_movies <- read_delim(
  "https://query.data.world/s/pmbfldxflx7ttdyfs23cx3abehcl5c",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE,
  locale = locale(encoding = "ISO-8859-1")
)
Code
glimpse(ice_movies)
Rows: 3,799
Columns: 12
$ blog.date           <date> 2021-01-25, 2021-01-25, 2021-01-25, 2021-01-25, 2…
$ rank.this.week      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ rank.last.week      <dbl> NA, 4, 1, 5, 2, 8, 3, 11, 12, 10, 9, NA, 13, 14, 7…
$ weeks.in.release    <dbl> NA, 16, 3, 3, 6, 8, 3, 2, 17, 8, 4, NA, 13, 2, 2, …
$ film.title          <chr> "Monster Hunter", "Dragon Rider", "Honest Thief", …
$ distributor.name    <chr> "Sena", "Samfilm", "Myndform", "Samfilm", "Samfilm…
$ gross.box.o.weekend <dbl> 1572126, 1417327, 934945, 688795, 662733, 520025, …
$ adm.weekend         <dbl> 1045, 1204, 650, 594, 467, 320, 266, 274, 220, 197…
$ weekend.start       <date> 2021-01-22, 2021-01-22, 2021-01-22, 2021-01-22, 2…
$ weekend.end         <date> 2021-01-24, 2021-01-24, 2021-01-24, 2021-01-24, 2…
$ adm.to.date         <dbl> 1045, 4744, 3580, 2071, 15197, 3974, 2470, 543, 58…
$ total.box.o.to.date <dbl> 1572126, 5067923, 4924054, 2293264, 21034734, 5837…

Viewing the column names to choose what data to compare.

Code
colnames(ice_movies)
 [1] "blog.date"           "rank.this.week"      "rank.last.week"     
 [4] "weeks.in.release"    "film.title"          "distributor.name"   
 [7] "gross.box.o.weekend" "adm.weekend"         "weekend.start"      
[10] "weekend.end"         "adm.to.date"         "total.box.o.to.date"

The relationship I've chosen to analyze is how distributor relates to box office revenue, measured by the total.box.o.to.date column.

First, I will clean up the column names.

Code
ice_movies <- ice_movies %>%
  rename(
    distributor_name = `distributor.name`,
    total_box_office = `total.box.o.to.date`
  ) %>%
  drop_na(distributor_name, total_box_office)
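
If every dotted column name needs cleaning rather than just two, the renaming can be done in one pass. The following is a minimal sketch on a toy table (toy_movies and toy_clean are hypothetical names, and only three of the original columns are reproduced); it yields names such as total_box_o_to_date rather than the shorter total_box_office chosen above.

```r
library(dplyr)

# Toy stand-in for ice_movies with a few of the original dotted column names.
toy_movies <- tibble(
  film.title          = c("Monster Hunter", "Dragon Rider"),
  distributor.name    = c("Sena", "Samfilm"),
  total.box.o.to.date = c(1572126, 5067923)
)

# Replace every literal dot in the column names with an underscore.
toy_clean <- toy_movies %>%
  rename_with(~ gsub(".", "_", .x, fixed = TRUE))

colnames(toy_clean)
```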

I need to aggregate my data to summarize total box office revenue per distributor.

Code
agg_data <- ice_movies %>%
  group_by(distributor_name) %>%
  summarise(total_revenue = sum(total_box_office, na.rm = TRUE)) %>%
  arrange(desc(total_revenue))
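
The column name total.box.o.to.date suggests a running total per film. If that is the case (an assumption, not something confirmed here), a film that appears in several weekly reports would be counted more than once by a plain sum. A possible alternative, sketched on made-up rows, keeps each film's largest running total before summing per distributor:

```r
library(dplyr)

# Made-up rows: film "A" appears in two weekly reports with a running total.
toy_movies <- tibble(
  film_title       = c("A", "A", "B", "C"),
  distributor_name = c("X", "X", "X", "Y"),
  total_box_office = c(100, 150, 200, 300)
)

toy_agg <- toy_movies %>%
  # Keep each film's largest running total (assumed to be its latest report).
  group_by(distributor_name, film_title) %>%
  summarise(film_total = max(total_box_office), .groups = "drop") %>%
  # Then add up one figure per film for each distributor.
  group_by(distributor_name) %>%
  summarise(total_revenue = sum(film_total)) %>%
  arrange(desc(total_revenue))

toy_agg
```

On these toy rows distributor X totals 350, whereas summing every weekly row would give 450.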

Before I plot my chart, I need to keep only the top 5 distributors by revenue.

Code
top_5_distributors <- agg_data %>%
  slice_max(total_revenue, n = 5)  # keep the 5 highest-revenue distributors

To compare revenue across distributors, I chose a bar plot.

Code
ggplot(top_5_distributors,
       aes(x = reorder(distributor_name, total_revenue),
           y = total_revenue,
           fill = distributor_name)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 5 Distributors by Total Revenue",
       x = "Distributor",
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma)

Analysis and Reflection



Key Distribution Patterns

  1. Revenue Gap: The top distributor earns significantly more than the others, showing a clear market leader.

  2. Industry Domination: One or two distributors control most of the revenue, indicating a highly concentrated market.

  3. Performance Spread: The middle-ranked distributors have moderate earnings, while the lowest of the top 5 lags far behind.

Visualization Choices

Why I chose a Bar Chart

  • Easy comparison of revenue across distributors.

  • Descending order highlights the top performer instantly.

  • Horizontal bars improve readability of distributor names.

Critical Evaluation

✔️ Strengths:

  • Clear ranking of top distributors.

  • Simple and uncluttered design.

  • Effective use of bar length for comparison.

✖️ Limitations:

  • No time trends – a line chart could show changes over time.

  • No market share – adding percentages would provide better context.

  • Limited scope – expanding to the top 10 may reveal more insights.
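
The market share limitation could be addressed by adding a percentage column before filtering to the top 5, so that each share is relative to all distributors. A minimal sketch with made-up totals (toy_agg stands in for agg_data, and share_pct is a hypothetical column name):

```r
library(dplyr)

# Made-up per-distributor totals standing in for agg_data.
toy_agg <- tibble(
  distributor_name = c("X", "Y", "Z"),
  total_revenue    = c(500, 300, 200)
)

# Share of overall revenue, as a percentage, for each distributor.
toy_share <- toy_agg %>%
  mutate(share_pct = 100 * total_revenue / sum(total_revenue)) %>%
  arrange(desc(share_pct))

toy_share
```

The share_pct values could then be printed on or beside each bar.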

Temporal Data Analysis

Code
library(tidyverse)
ice_movies <- read_delim(
  "https://query.data.world/s/pmbfldxflx7ttdyfs23cx3abehcl5c",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE,
  locale = locale(encoding = "ISO-8859-1")
)


For my temporal data analysis, I chose a connected scatterplot from chapter 5 (pg. 177) to visualize my comparison.

Code
colnames(ice_movies)
 [1] "blog.date"           "rank.this.week"      "rank.last.week"     
 [4] "weeks.in.release"    "film.title"          "distributor.name"   
 [7] "gross.box.o.weekend" "adm.weekend"         "weekend.start"      
[10] "weekend.end"         "adm.to.date"         "total.box.o.to.date"

First I need to clean and prepare my data.

Code
ice_movies <- ice_movies %>%
  rename(
    distributor_name = `distributor.name`,
    total_box_office = `total.box.o.to.date`
  ) %>%
  drop_na(distributor_name, total_box_office)


I need to aggregate my data and summarize.

Code
agg_data <- ice_movies %>%
  group_by(distributor_name) %>%
  summarise(total_revenue = sum(total_box_office, na.rm = TRUE)) %>%
  arrange(desc(total_revenue))

I need to also select out the top 5 distributors to simplify my chart.

Code
top_distributors <- agg_data %>%
  slice_max(total_revenue, n = 5)  # keep the 5 highest-revenue distributors

Now I can plot my connected scatterplot.

Code
ggplot(top_distributors, aes(x = reorder(distributor_name, -total_revenue),
                             y = total_revenue,
                             group = 1)) +
  geom_point(size = 4, color = "black") +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(linewidth = 1, color = "purple") +
  theme_minimal() +
  labs(title = "Total Revenue of Top 5 Distributors (Connected Scatterplot)",
       x = "Distributor",
       y = "Total Revenue") +
  scale_y_continuous(labels = scales::comma)

Key Temporal Patterns

The connected scatter plot visualization highlights the total revenue of the top 5 distributors.

Revenue Distribution: The revenues show a clear ranking, with the highest-earning distributor standing out prominently.

Relative Differences: There is a noticeable gap between the top-performing distributor and the others, suggesting market dominance by a few key players.

Potential Outliers: If one distributor has significantly lower revenue than the others, it may indicate external factors negatively affecting its market performance.

Although the visualization does not specifically track revenue over time, it effectively portrays differences in cumulative revenue, which may reflect long-term industry trends.
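
The weekend.start and gross.box.o.weekend columns listed in the glimpse output could support a view of revenue over time. The following is a minimal sketch, assuming weekend gross can simply be summed per distributor per weekend; the rows are made up for illustration, and toy_weekly and weekly_revenue are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Made-up weekly rows using the dataset's original column names.
toy_weekly <- tibble(
  weekend.start = as.Date(c("2021-01-22", "2021-01-22", "2021-01-22",
                            "2021-01-29", "2021-01-29", "2021-01-29")),
  distributor.name    = c("Sena", "Samfilm", "Samfilm",
                          "Sena", "Samfilm", "Samfilm"),
  gross.box.o.weekend = c(150, 40, 60, 90, 50, 30)
)

# One row per distributor per weekend.
weekly_revenue <- toy_weekly %>%
  group_by(distributor.name, weekend.start) %>%
  summarise(weekend_revenue = sum(gross.box.o.weekend, na.rm = TRUE),
            .groups = "drop")

# Time on the x-axis, one connected line per distributor.
p <- ggplot(weekly_revenue,
            aes(x = weekend.start, y = weekend_revenue,
                color = distributor.name)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "Weekend Revenue by Distributor (toy data)",
       x = "Weekend start",
       y = "Weekend revenue",
       color = "Distributor")

p
```

With the real data, the same steps could be restricted to the top 5 distributors to keep the chart readable.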

Visualization Choices

The connected scatter plot was chosen to highlight both ranking and revenue comparisons effectively.

Critical Evaluation

Strengths:

✔️ Clear Comparison: The visualization makes it easy to compare revenue rankings among the top distributors.
✔️ Simple & Effective: It avoids excessive complexity while still showing meaningful patterns.
✔️ Intuitive Flow: The combination of points and connecting lines ensures viewers can follow the logical sequence effortlessly.

Limitations & Areas for Refinement:

✖️ Lack of Temporal Aspect: Because revenue is aggregated into one total per distributor, the chart does not show changes over time. A line graph of time-series data (for example, weekend revenue by weekend.start) would be more effective for showing historical trends.
✖️ Limited Context: It does not show total market revenue, making it hard to judge the relative market share. A stacked bar chart or area plot could complement this.
✖️ Potential for Overlapping Points: If distributors have close revenue values, the points might overlap. Adding labels could help mitigate this issue.
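
One way to add labels is a text layer that prints each revenue value above its point. This is a minimal sketch on made-up totals (toy_top is a hypothetical stand-in for top_distributors); the vjust offset and the expanded y-axis are illustrative choices, not tuned values.

```r
library(ggplot2)

# Made-up totals with two close values to mimic near-overlapping points.
toy_top <- data.frame(
  distributor_name = c("X", "Y", "Z"),
  total_revenue    = c(500, 480, 200)
)

p <- ggplot(toy_top, aes(x = reorder(distributor_name, -total_revenue),
                         y = total_revenue,
                         group = 1)) +
  geom_line(linewidth = 1, color = "purple") +
  geom_point(size = 4, color = "black") +
  # Print the formatted revenue just above each point.
  geom_text(aes(label = scales::comma(total_revenue)), vjust = -1.2) +
  # Leave extra room at the top so the highest label is not clipped.
  scale_y_continuous(labels = scales::comma,
                     expand = expansion(mult = c(0.05, 0.15))) +
  theme_minimal() +
  labs(x = "Distributor", y = "Total Revenue")

p
```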

Brief Conclusion

How the Bar Chart and Connected Scatter plot Work Together

The bar chart and connected scatter plot complement each other to give a complete picture of revenue distribution:

  • Bar Chart – Shows total revenue per distributor with clear, easy-to-compare bar lengths.

  • Connected Scatterplot – Highlights relative positioning and trends across distributors with a connecting line.

Together, they provide both absolute comparisons (bar chart) and a view of how sharply revenue drops from one rank to the next (connected scatterplot) for a clearer story.

Additional Questions Raised

  • Trends Over Time – Have distributor revenues changed consistently year-over-year?

  • Market Share – Do the top 5 distributors dominate, or are smaller players growing?

  • Genre Impact – Do certain distributors specialize in high-revenue genres compared to others?

  • External Factors – How do events or seasonal trends affect revenue?

  • Consumer Behavior – Is there brand loyalty among audiences?



Generative AI use: I used generative AI to assist me in creating my bar chart and my connected scatter plot.

I followed the generative AI's output as an outline and used the data I gathered from glimpsing ice_movies and viewing the column names to decide what to compare.