Purpose: Summarize structure, completeness, and basic distribution.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 1 | How many unique movies are in the dataset after removing duplicates? | Descriptive | None |
| 2 | What are the column names and their data types in the dataset? | Descriptive | None |
| 3 | Are there any missing values in the dataset, and how are they distributed? | Descriptive | None |
| 4 | What is the range (min, max, mean) for Budget, Worldwide_Sales, ROI, and Runtime_Minutes? | Descriptive | None |
| 5 | How many movies fall under each budget category (Low, Medium, High)? | Descriptive | None |
Purpose: Examine how financial inputs/outputs relate.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 6 | How does ROI vary across different budget categories? | Visual + ANOVA | Boxplot |
| 6b | Is the difference in ROI between budget categories statistically significant? | ANOVA | None (Stats Output) |
| 7 | What is the relationship between budget and worldwide sales? | Visual | Scatter Plot + Trendline |
| 8 | Do higher-budget movies consistently have higher ROI? | Visual | Scatter Plot |
| 9 | How do domestic and international sales contribute to total revenue? | Visual | Stacked Bar Chart |
Purpose: Analyze trends over time and genre-based differences.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 10 | How has the number of movies changed over the years? | Visual | Line Chart |
| 11 | What are the most common genres over time? | Visual | Stacked Area Chart or Heatmap |
| 12 | Which genres have the highest average ROI? | Visual + ANOVA | Bar Chart |
| 12b | Is the difference in ROI across genres statistically significant? | ANOVA | None (Stats Output) |
| 13 | Which year had the highest average ROI? | Visual | Bar Chart or Line Chart |
Purpose: Identify high performers and institutional patterns.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 14 | Which distributors released the most high-ROI movies? | Visual | Bar Chart |
| 15 | What are the top 10 movies by ROI? | Visual | Horizontal Bar Chart |
| 16 | What are the top 10 movies by worldwide sales? | Visual | Horizontal Bar Chart |
| 17 | How do top distributors compare in terms of average worldwide sales and ROI? | Visual | Grouped Bar Chart |
Purpose: Final analysis and insights based on descriptive and visual summaries.
library(readr)
imdb_visualisation <- read_csv("imdb_visualisation.csv")
## Rows: 2844 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Title, Movie_Info, Distributor, Domestic_Opening_Original, Genre_Or...
## dbl (9): Year, Budget_Original, Domestic_Sales, International_Sales, Worldwi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(imdb_visualisation)
summary(imdb_visualisation)
## Title Movie_Info Year Distributor
## Length:2844 Length:2844 Min. :1939 Length:2844
## Class :character Class :character 1st Qu.:2003 Class :character
## Mode :character Mode :character Median :2010 Mode :character
## Mean :2008
## 3rd Qu.:2015
## Max. :2023
## Budget_Original Domestic_Opening_Original Domestic_Sales
## Min. : 15000 Length:2844 Min. : 214670
## 1st Qu.: 60000000 Class :character 1st Qu.:100830111
## Median : 92000000 Mode :character Median :141186864
## Mean :102085425 Mean :170646068
## 3rd Qu.:145000000 3rd Qu.:206362140
## Max. :356000000 Max. :936662225
## International_Sales Worldwide_Sales Genre_Original
## Min. :2.450e+07 Min. :1.806e+08 Length:2844
## 1st Qu.:1.360e+08 1st Qu.:2.374e+08 Class :character
## Median :1.995e+08 Median :3.343e+08 Mode :character
## Mean :2.667e+08 Mean :4.321e+08
## 3rd Qu.:3.300e+08 3rd Qu.:5.227e+08
## Max. :2.138e+09 Max. :2.924e+09
## Running_Time_Original License Budget
## Length:2844 Length:2844 Min. : 15000
## Class :character Class :character 1st Qu.: 60000000
## Mode :character Mode :character Median : 92000000
## Mean :102085425
## 3rd Qu.:145000000
## Max. :356000000
## Opening_Sales Runtime_Minutes ROI Budget_Category
## Min. : 2 Min. : 72.0 Min. : 0.014 Length:2844
## 1st Qu.: 21761408 1st Qu.: 98.0 1st Qu.: 1.788 Class :character
## Median : 35363376 Median :113.0 Median : 3.067 Mode :character
## Mean : 43325601 Mean :115.5 Mean : 17.002
## 3rd Qu.: 55101604 3rd Qu.:130.0 3rd Qu.: 5.396
## Max. :357115007 Max. :201.0 Max. :12889.387
head(imdb_visualisation)
## # A tibble: 6 × 17
## Title Movie_Info Year Distributor Budget_Original Domestic_Opening_Ori…¹
## <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Avatar A paraple… 2009 Twentieth … 237000000 77025481
## 2 Avatar A paraple… 2009 Twentieth … 237000000 77025481
## 3 Avatar A paraple… 2009 Twentieth … 237000000 77025481
## 4 Avatar A paraple… 2009 Twentieth … 237000000 77025481
## 5 Avengers:… After the… 2019 Walt Disne… 356000000 357115007
## 6 Avengers:… After the… 2019 Walt Disne… 356000000 357115007
## # ℹ abbreviated name: ¹Domestic_Opening_Original
## # ℹ 11 more variables: Domestic_Sales <dbl>, International_Sales <dbl>,
## # Worldwide_Sales <dbl>, Genre_Original <chr>, Running_Time_Original <chr>,
## # License <chr>, Budget <dbl>, Opening_Sales <dbl>, Runtime_Minutes <dbl>,
## # ROI <dbl>, Budget_Category <chr>
unique_movies <- length(unique(imdb_visualisation$Title))
unique_movies
## [1] 774
# Get the column names and their data types
column_info <- str(imdb_visualisation)
## spc_tbl_ [2,844 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Title : chr [1:2844] "Avatar" "Avatar" "Avatar" "Avatar" ...
## $ Movie_Info : chr [1:2844] "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ ...
## $ Year : num [1:2844] 2009 2009 2009 2009 2019 ...
## $ Distributor : chr [1:2844] "Twentieth Century Fox" "Twentieth Century Fox" "Twentieth Century Fox" "Twentieth Century Fox" ...
## $ Budget_Original : num [1:2844] 2.37e+08 2.37e+08 2.37e+08 2.37e+08 3.56e+08 3.56e+08 3.56e+08 3.56e+08 2.00e+08 2.00e+08 ...
## $ Domestic_Opening_Original: chr [1:2844] "77025481" "77025481" "77025481" "77025481" ...
## $ Domestic_Sales : num [1:2844] 7.85e+08 7.85e+08 7.85e+08 7.85e+08 8.58e+08 ...
## $ International_Sales : num [1:2844] 2.14e+09 2.14e+09 2.14e+09 2.14e+09 1.94e+09 ...
## $ Worldwide_Sales : num [1:2844] 2.92e+09 2.92e+09 2.92e+09 2.92e+09 2.80e+09 ...
## $ Genre_Original : chr [1:2844] "c(Action" "Adventure" "Fantasy" "Sci-Fi)" ...
## $ Running_Time_Original : chr [1:2844] "2 hr 42 min" "2 hr 42 min" "2 hr 42 min" "2 hr 42 min" ...
## $ License : chr [1:2844] "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ Budget : num [1:2844] 2.37e+08 2.37e+08 2.37e+08 2.37e+08 3.56e+08 3.56e+08 3.56e+08 3.56e+08 2.00e+08 2.00e+08 ...
## $ Opening_Sales : num [1:2844] 7.70e+07 7.70e+07 7.70e+07 7.70e+07 3.57e+08 ...
## $ Runtime_Minutes : num [1:2844] 162 162 162 162 181 181 181 181 194 194 ...
## $ ROI : num [1:2844] 11.34 11.34 11.34 11.34 6.86 ...
## $ Budget_Category : chr [1:2844] "High" "High" "High" "High" ...
## - attr(*, "spec")=
## .. cols(
## .. Title = col_character(),
## .. Movie_Info = col_character(),
## .. Year = col_double(),
## .. Distributor = col_character(),
## .. Budget_Original = col_double(),
## .. Domestic_Opening_Original = col_character(),
## .. Domestic_Sales = col_double(),
## .. International_Sales = col_double(),
## .. Worldwide_Sales = col_double(),
## .. Genre_Original = col_character(),
## .. Running_Time_Original = col_character(),
## .. License = col_character(),
## .. Budget = col_double(),
## .. Opening_Sales = col_double(),
## .. Runtime_Minutes = col_double(),
## .. ROI = col_double(),
## .. Budget_Category = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
column_info
## NULL
unique_genres <- length(unique(imdb_visualisation$Genre_Original))
unique_genres
## [1] 46
average_roi <- mean(imdb_visualisation$ROI, na.rm = TRUE)
average_roi
## [1] 17.00191
median_roi <- median(imdb_visualisation$ROI, na.rm = TRUE)
median_roi
## [1] 3.066516
# Ensure necessary libraries are loaded for this phase
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(dplyr)
library(ggplot2)
library(scales) # For label formatting
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(imdb_visualisation, aes(x = Budget_Category, y = ROI, fill = Budget_Category)) +
geom_boxplot(show.legend = FALSE, na.rm = TRUE, outlier.shape = NA) + # Hide outliers outside limits
coord_cartesian(ylim = quantile(imdb_visualisation$ROI, probs = c(0.01, 0.99), na.rm = TRUE)) + # Zoom y-axis
labs(
title = "ROI Distribution Across Budget Categories (Zoomed)",
subtitle = "Note: Y-axis limited to show central distribution; outliers exist.",
x = "Budget Category",
y = "Return on Investment (ROI)"
) +
theme_minimal() +
scale_y_continuous(labels = comma) + # Format y-axis numbers
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 9)
)
roi_anova_results <- aov(ROI ~ Budget_Category, data = imdb_visualisation)
anova_summary <- summary(roi_anova_results)
print(anova_summary)
## Df Sum Sq Mean Sq F value Pr(>F)
## Budget_Category 2 2.04e+06 1019830 7.96 0.000357 ***
## Residuals 2841 3.64e+08 128113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Creating our scatter plot of Budget vs Worldwide Sales
ggplot(imdb_visualisation, aes(x = Budget, y = Worldwide_Sales)) +
geom_point(alpha = 0.5, na.rm = TRUE) + # here im making pts semi-transprnt to see density
geom_smooth(method = "lm", color = "red", se = FALSE, na.rm = TRUE) +
labs(
title = "Relationship Between Budget and Worldwide Sales",
x = "Budget (in $)",
y = "Worldwide Sales (in $)"
) +
theme_minimal() +
# We are Formating axes to show currency in Millions (M) right here
scale_x_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
theme(plot.title = element_text(hjust = 0.5)) # Center title
## `geom_smooth()` using formula = 'y ~ x'
ggplot(imdb_visualisation, aes(x = Budget, y = ROI)) +
geom_point(alpha = 0.4, color = "dodgerblue", size = 1.5, na.rm = TRUE) + # Points for each movie
coord_cartesian(ylim = c(0, 50)) + # Focus y-axis from 0 to 50 ROI for clearer view of central tendency
labs(
title = "Relationship Between Movie Budget and ROI",
x = "Budget ($ Millions)",
y = "Return on Investment (ROI)"
) +
# Format axes for readability
scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M", accuracy = 1)) +
scale_y_continuous(labels = comma, breaks = seq(0, 50, by = 10)) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 9)
)
total_sales_summary <- imdb_visualisation %>%
summarise(
Total_Domestic = sum(Domestic_Sales, na.rm = TRUE),
Total_International = sum(International_Sales, na.rm = TRUE)
)
# 2. Prepare data for ggplot (long format)
total_sales_long <- data.frame(
Sale_Type = c("Domestic", "International"),
Total_Sales = c(total_sales_summary$Total_Domestic, total_sales_summary$Total_International),
Category = "All Movies" # A dummy category for the single bar
)
# 3. Create the stacked bar chart
ggplot(total_sales_long, aes(x = Category, y = Total_Sales, fill = Sale_Type)) +
geom_col(position = "stack", width = 0.5) + # Create the stacked bar, width adjusts thickness
labs(
title = "Overall Contribution of Domestic vs. International Sales",
subtitle = "Summed across all movies in the dataset",
x = NULL, # No need for an x-axis label for a single bar
y = "Total Sales",
fill = "Sales Market" # Legend title
) +
scale_y_continuous(labels = dollar_format(scale = 1e-9, suffix = "B")) + # Format y-axis in Billions
scale_fill_brewer(palette = "Paired") + # Choose a color palette
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_blank(), # Hide x-axis text
axis.ticks.x = element_blank() # Hide x-axis ticks
) +
# Add text labels showing the percentage contribution (optional but informative)
geom_text(
aes(label = paste0(round(Total_Sales / sum(Total_Sales) * 100), "%")),
position = position_stack(vjust = 0.5), # Position text in the middle of each stack segment
size = 4,
color = "white"
)
print(total_sales_summary)
## # A tibble: 1 × 2
## Total_Domestic Total_International
## <dbl> <dbl>
## 1 485317417585 758554813210
# 1. Count unique movies per year (using 'Year' instead of 'Release_Year')
movies_per_year <- imdb_visualisation %>%
filter(!is.na(Year) & !is.na(Title)) %>% # Use the correct year column name here
distinct(Title, Year) %>% # Use the correct year column name here
count(Year, name = "Movie_Count") %>% # Use the correct year column name here
arrange(Year) # Use the correct year column name here
# 2. Create the line chart
ggplot(movies_per_year, aes(x = Year, y = Movie_Count)) + # Use the correct year column name here
geom_line(color = "steelblue", size = 1) +
geom_point(color = "steelblue", size = 2) +
labs(
title = "Number of Unique Movies Released Per Year",
subtitle = "Based on movies present in the dataset",
x = "Release Year", # Label can stay descriptive
y = "Number of Unique Movies"
) +
theme_minimal() +
# Adjust x-axis breaks (using the correct year column name)
scale_x_continuous(breaks = seq(min(movies_per_year$Year, na.rm = TRUE), # Use the correct year column name here
max(movies_per_year$Year, na.rm = TRUE), # Use the correct year column name here
by = 5)) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
imdb_visualisation %>%
# Use 'Year' as the correct column name
filter(!is.na(Genre_Original) & !is.na(Year)) %>%
# Identify and filter for top 10 genres inline
group_by(Genre_Original) %>%
mutate(genre_total_count = n()) %>%
ungroup() %>%
filter(dense_rank(desc(genre_total_count)) <= 10) %>%
# Count occurrences per year ONLY for these top 10 genres
count(Year, Genre_Original, name = "Count") %>%
# Plot directly
ggplot(aes(x = Year, y = Count, fill = Genre_Original)) +
geom_area(alpha = 0.8) +
# --- Use a Brewer palette for better colors ---
scale_fill_brewer(palette = "Paired") + # Or try "Set3", "Spectral" etc.
# --- Labels and Theme ---
labs(
title = "Top 10 Genres Over Time",
subtitle = "Based on movie entries per genre each year",
x = "Release Year", y = "Number of Movie Entries", fill = "Genre"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 8))
imdb_visualisation %>%
mutate(
Genre_Clean = str_remove_all(Genre_Original, "[c()]"),
Genre_Clean = trimws(Genre_Clean)
) %>%
filter(!is.na(ROI) & !is.na(Genre_Clean) & Genre_Clean != "" & ROI > 0) %>% # Filter ROI > 0 for log scale
group_by(Genre_Clean) %>%
summarise(
Average_ROI = mean(ROI, na.rm = TRUE),
n = n()
) %>%
filter(n >= 5) %>%
arrange(desc(Average_ROI)) %>%
slice_head(n = 15) %>%
ggplot(aes(x = reorder(Genre_Clean, -Average_ROI), y = Average_ROI, fill = Genre_Clean)) +
geom_col(show.legend = FALSE) +
# --- Use a Log10 Scale for the Y-axis ---
scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x), # Auto breaks for log
labels = scales::comma # Use comma format for labels
) +
labs(
title = "Top 15 Genres by Average ROI (Log Scale)",
subtitle = "Genres with >= 5 movies. Y-axis uses log scale to show relative differences.",
x = "Genre",
y = "Average ROI (Log Scale)"
) +
theme_minimal(base_size = 10) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 8)
) +
annotation_logticks(sides = "l")
anova_tidy_results <- imdb_visualisation %>%
mutate(
Genre_Clean = str_remove_all(Genre_Original, "[c()]"),
Genre_Clean = trimws(Genre_Clean)
) %>%
filter(!is.na(ROI) & !is.na(Genre_Clean) & Genre_Clean != "") %>%
group_by(Genre_Clean) %>%
filter(n() >= 5) %>%
ungroup() %>%
# Perform ANOVA
aov(ROI ~ Genre_Clean, data = .) %>%
# Tidy the results into a tibble
broom::tidy() %>%
# Format the p-value column specifically
mutate(
# format() prevents scientific notation for this specific column
p.value_formatted = format(p.value, scientific = FALSE, digits = 9)
)
# --- Display Tidied & Formatted Summary ---
print("ANOVA Results (tidied, p-value formatted):")
## [1] "ANOVA Results (tidied, p-value formatted):"
# Print the relevant columns from the tidied table
print(select(anova_tidy_results, term, df, statistic, p.value_formatted))
## # A tibble: 2 × 4
## term df statistic p.value_formatted
## <chr> <dbl> <dbl> <chr>
## 1 Genre_Clean 19 3.89 "0.0000000269587933"
## 2 Residuals 2823 NA " NA"
imdb_visualisation %>%
filter(!is.na(Year) & !is.na(ROI)) %>%
distinct(Title, .keep_all = TRUE) %>%
group_by(Year) %>%
summarise(Average_ROI = mean(ROI, na.rm = TRUE), n = n(), .groups = 'drop') %>%
filter(n >= 5 & Average_ROI > 0) %>% # Filter years with >= 5 movies AND positive ROI
ggplot(aes(x = Year, y = Average_ROI)) +
geom_line(color = "darkorange", linewidth = 0.8) +
geom_point(color = "darkorange", size = 1.5) +
# Highlight layer for the max point
geom_point(data = . %>% slice_max(order_by = Average_ROI, n = 1),
aes(x = Year, y = Average_ROI),
color = "firebrick", size = 3) +
# Text label layer for the max point
geom_text(data = . %>% slice_max(order_by = Average_ROI, n = 1),
aes(label = Year),
vjust = -1, color = "firebrick", size = 3) +
# --- Apply Log Scale to Y-axis ---
scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::comma # Format labels nicely
) +
annotation_logticks(sides = "l") + # Add log ticks
labs(
title = "Average Movie ROI Per Year (Log Scale)",
subtitle = "Based on unique movies (>= 5 releases/year). Y-axis uses log scale.",
x = "Release Year",
y = "Average ROI (Log Scale)"
) +
theme_minimal(base_size = 10) +
scale_x_continuous(breaks = seq(1975, 2020, by = 5)) + # Adjust breaks
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 8)
)
roi_threshold <- imdb_visualisation %>%
filter(!is.na(ROI)) %>%
distinct(Title, .keep_all = TRUE) %>%
pull(ROI) %>%
quantile(0.75, na.rm = TRUE)
# 2. Main pipeline: Filter, count, select top, and plot
imdb_visualisation %>%
filter(!is.na(ROI) & !is.na(Distributor)) %>% # Filter needed NAs
distinct(Title, .keep_all = TRUE) %>% # Get unique movies
filter(ROI >= roi_threshold) %>% # Filter for high ROI
count(Distributor, name = "High_ROI_Movie_Count") %>% # Count per distributor
arrange(desc(High_ROI_Movie_Count)) %>%
slice_head(n = 15) %>% # Take top 15
# Pipe directly into ggplot
ggplot(aes(x = reorder(Distributor, -High_ROI_Movie_Count), y = High_ROI_Movie_Count, fill = Distributor)) +
geom_col(show.legend = FALSE) +
labs(
title = "Top 15 Distributors by Number of High-ROI Movies",
# Use the calculated threshold in the subtitle
subtitle = paste("Movies with ROI >=", round(roi_threshold, 2), "(Top 25%)"),
x = "Distributor",
y = "Number of High-ROI Movies"
) +
theme_minimal(base_size = 10) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 8)
) +
scale_fill_viridis_d() # Apply a color scale
imdb_visualisation %>%
filter(!is.na(ROI) & !is.na(Title)) %>% # Filter out NAs
distinct(Title, .keep_all = TRUE) %>% # Get unique movies
arrange(desc(ROI)) %>% # Sort by ROI descending
slice_head(n = 10) %>% # Take top 10
# Pipe directly into ggplot for horizontal bar chart
ggplot(aes(x = ROI, y = reorder(Title, ROI), fill = Title)) +
geom_col(show.legend = FALSE) +
# Add text labels for ROI values (adjust hjust for placement)
geom_text(aes(label = scales::comma(ROI, accuracy = 1)), # Format with comma
hjust = -0.1, size = 3) + # Position to the right of the bar
labs(
title = "Top 10 Movies by Return on Investment (ROI)",
x = "Return on Investment (ROI)",
y = "Movie Title"
) +
theme_minimal(base_size = 10) +
# Expand x-axis limits slightly to make room for labels
scale_x_continuous(labels = scales::comma, expand = expansion(mult = c(0.01, 0.15))) +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.y = element_text(size = 8) # Adjust y-axis label size if needed
) +
scale_fill_viridis_d() # Optional color scale
imdb_visualisation %>%
filter(!is.na(Worldwide_Sales) & !is.na(Title)) %>% # Filter NAs
distinct(Title, .keep_all = TRUE) %>% # Get unique movies
arrange(desc(Worldwide_Sales)) %>% # Sort by sales descending
slice_head(n = 10) %>% # Take top 10
# Pipe directly into ggplot
ggplot(aes(x = Worldwide_Sales, y = reorder(Title, Worldwide_Sales), fill = Title)) +
geom_col(show.legend = FALSE) +
# Add text labels formatted in Billions of dollars
geom_text(aes(label = dollar(Worldwide_Sales, scale = 1e-9, suffix = "B", accuracy = 0.1)),
hjust = -0.1, size = 3) + # Position to the right of the bar
labs(
title = "Top 10 Movies by Worldwide Sales",
x = "Worldwide Sales ($ Billions)",
y = "Movie Title"
) +
theme_minimal(base_size = 10) +
# Format x-axis in Billions and expand limits for labels
scale_x_continuous(labels = dollar_format(scale = 1e-9, suffix = "B"),
expand = expansion(mult = c(0.01, 0.15))) +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.y = element_text(size = 8)
) +
scale_fill_viridis_d(option = "plasma") # Different color palette
top_n_distributors <- 10
top_distributors_list <- imdb_visualisation %>%
filter(!is.na(Distributor)) %>%
distinct(Title, .keep_all = TRUE) %>%
count(Distributor) %>%
slice_max(order_by = n, n = top_n_distributors) %>%
pull(Distributor)
# Calculate metrics, reshape, and plot
imdb_visualisation %>%
filter(!is.na(Distributor) & !is.na(Worldwide_Sales) & !is.na(ROI)) %>% # Filter NAs
distinct(Title, .keep_all = TRUE) %>% # Use unique movies for averages
filter(Distributor %in% top_distributors_list) %>% # Keep only top distributors
group_by(Distributor) %>%
summarise(
`Avg Worldwide Sales` = mean(Worldwide_Sales, na.rm = TRUE), # Name with spaces for facet title
`Avg ROI` = mean(ROI, na.rm = TRUE), # Name with spaces
.groups = 'drop'
) %>%
# Pivot to long format for faceting
pivot_longer(
cols = c(`Avg Worldwide Sales`, `Avg ROI`),
names_to = "Metric_Type",
values_to = "Value"
) %>%
# Create the faceted plot
ggplot(aes(x = reorder(Distributor, -Value), y = Value, fill = Metric_Type)) +
geom_col(show.legend = FALSE) +
# Facet by metric type with independent y-axes
facet_wrap(~ Metric_Type, scales = "free_y") +
labs(
title = paste("Average Sales vs. Average ROI for Top", top_n_distributors, "Distributors"),
subtitle = "Distributors selected by total movie count in dataset",
x = "Distributor",
y = "Average Value" # Y-axis label is generic due to free scales
) +
# Custom y-axis formatting for each facet
scale_y_continuous(labels = function(x) case_when(
max(x, na.rm = TRUE) > 1e6 ~ dollar(x, scale = 1e-6, suffix = "M"), # Sales in Millions
TRUE ~ comma(x, accuracy = 0.1) # ROI as number
)) +
theme_minimal(base_size = 10) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 8),
strip.text = element_text(face = "bold") # Bold facet titles
) +
scale_fill_brewer(palette = "Set2") # Color palette