————————————————————————————————————————-

Phase 1: Dataset Understanding (Descriptive Only — No Visuals)

Purpose: Summarize structure, completeness, and basic distribution.

Q# Question Type Visualization
1 How many unique movies are in the dataset after removing duplicates? Descriptive None
2 What are the column names and their data types in the dataset? Descriptive None
3 Are there any missing values in the dataset, and how are they distributed? Descriptive None
4 What is the range (min, max, mean) for Budget, Worldwide_Sales, ROI, and Runtime_Minutes? Descriptive None
5 How many movies fall under each budget category (Low, Medium, High)? Descriptive None

Phase 2: Budget, Sales & ROI Patterns

Purpose: Examine how financial inputs/outputs relate.

Q# Question Type Visualization
6 How does ROI vary across different budget categories? Visual + ANOVA Boxplot
6b Is the difference in ROI between budget categories statistically significant? ANOVA None (Stats Output)
7 What is the relationship between budget and worldwide sales? Visual Scatter Plot + Trendline
8 Do higher-budget movies consistently have higher ROI? Visual Scatter Plot
9 How do domestic and international sales contribute to total revenue? Visual Stacked Bar Chart

Phase 4: Distributors & Top Performers

Purpose: Identify high performers and institutional patterns.

Q# Question Type Visualization
14 Which distributors released the most high-ROI movies? Visual Bar Chart
15 What are the top 10 movies by ROI? Visual Horizontal Bar Chart
16 What are the top 10 movies by worldwide sales? Visual Horizontal Bar Chart
17 How do top distributors compare in terms of average worldwide sales and ROI? Visual Grouped Bar Chart

Phase 5: Conclusion

Purpose: Final analysis and insights based on descriptive and visual summaries.

————————————————————————————————————————-

library(readr)
imdb_visualisation <- read_csv("imdb_visualisation.csv")
## Rows: 2844 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Title, Movie_Info, Distributor, Domestic_Opening_Original, Genre_Or...
## dbl (9): Year, Budget_Original, Domestic_Sales, International_Sales, Worldwi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(imdb_visualisation)
summary(imdb_visualisation)
##     Title            Movie_Info             Year      Distributor       
##  Length:2844        Length:2844        Min.   :1939   Length:2844       
##  Class :character   Class :character   1st Qu.:2003   Class :character  
##  Mode  :character   Mode  :character   Median :2010   Mode  :character  
##                                        Mean   :2008                     
##                                        3rd Qu.:2015                     
##                                        Max.   :2023                     
##  Budget_Original     Domestic_Opening_Original Domestic_Sales     
##  Min.   :    15000   Length:2844               Min.   :   214670  
##  1st Qu.: 60000000   Class :character          1st Qu.:100830111  
##  Median : 92000000   Mode  :character          Median :141186864  
##  Mean   :102085425                             Mean   :170646068  
##  3rd Qu.:145000000                             3rd Qu.:206362140  
##  Max.   :356000000                             Max.   :936662225  
##  International_Sales Worldwide_Sales     Genre_Original    
##  Min.   :2.450e+07   Min.   :1.806e+08   Length:2844       
##  1st Qu.:1.360e+08   1st Qu.:2.374e+08   Class :character  
##  Median :1.995e+08   Median :3.343e+08   Mode  :character  
##  Mean   :2.667e+08   Mean   :4.321e+08                     
##  3rd Qu.:3.300e+08   3rd Qu.:5.227e+08                     
##  Max.   :2.138e+09   Max.   :2.924e+09                     
##  Running_Time_Original   License              Budget         
##  Length:2844           Length:2844        Min.   :    15000  
##  Class :character      Class :character   1st Qu.: 60000000  
##  Mode  :character      Mode  :character   Median : 92000000  
##                                           Mean   :102085425  
##                                           3rd Qu.:145000000  
##                                           Max.   :356000000  
##  Opening_Sales       Runtime_Minutes      ROI            Budget_Category   
##  Min.   :        2   Min.   : 72.0   Min.   :    0.014   Length:2844       
##  1st Qu.: 21761408   1st Qu.: 98.0   1st Qu.:    1.788   Class :character  
##  Median : 35363376   Median :113.0   Median :    3.067   Mode  :character  
##  Mean   : 43325601   Mean   :115.5   Mean   :   17.002                     
##  3rd Qu.: 55101604   3rd Qu.:130.0   3rd Qu.:    5.396                     
##  Max.   :357115007   Max.   :201.0   Max.   :12889.387
head(imdb_visualisation)
## # A tibble: 6 × 17
##   Title      Movie_Info  Year Distributor Budget_Original Domestic_Opening_Ori…¹
##   <chr>      <chr>      <dbl> <chr>                 <dbl> <chr>                 
## 1 Avatar     A paraple…  2009 Twentieth …       237000000 77025481              
## 2 Avatar     A paraple…  2009 Twentieth …       237000000 77025481              
## 3 Avatar     A paraple…  2009 Twentieth …       237000000 77025481              
## 4 Avatar     A paraple…  2009 Twentieth …       237000000 77025481              
## 5 Avengers:… After the…  2019 Walt Disne…       356000000 357115007             
## 6 Avengers:… After the…  2019 Walt Disne…       356000000 357115007             
## # ℹ abbreviated name: ¹​Domestic_Opening_Original
## # ℹ 11 more variables: Domestic_Sales <dbl>, International_Sales <dbl>,
## #   Worldwide_Sales <dbl>, Genre_Original <chr>, Running_Time_Original <chr>,
## #   License <chr>, Budget <dbl>, Opening_Sales <dbl>, Runtime_Minutes <dbl>,
## #   ROI <dbl>, Budget_Category <chr>

————————————————————————————————————————-

————————————————————————————————————————-

PHASE 1 : UNDERSTAND CURRENT DATASET

1. How many unique movies are in the dataset after removing duplicates?

unique_movies <- length(unique(imdb_visualisation$Title))
unique_movies
## [1] 774
[Q1-] 774 unique movies in your dataset after removing any duplicates based on the Title column

2. What are the column names and their data types in the dataset?

# Get the column names and their data types
column_info <- str(imdb_visualisation)
## spc_tbl_ [2,844 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Title                    : chr [1:2844] "Avatar" "Avatar" "Avatar" "Avatar" ...
##  $ Movie_Info               : chr [1:2844] "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his order"| __truncated__ ...
##  $ Year                     : num [1:2844] 2009 2009 2009 2009 2019 ...
##  $ Distributor              : chr [1:2844] "Twentieth Century Fox" "Twentieth Century Fox" "Twentieth Century Fox" "Twentieth Century Fox" ...
##  $ Budget_Original          : num [1:2844] 2.37e+08 2.37e+08 2.37e+08 2.37e+08 3.56e+08 3.56e+08 3.56e+08 3.56e+08 2.00e+08 2.00e+08 ...
##  $ Domestic_Opening_Original: chr [1:2844] "77025481" "77025481" "77025481" "77025481" ...
##  $ Domestic_Sales           : num [1:2844] 7.85e+08 7.85e+08 7.85e+08 7.85e+08 8.58e+08 ...
##  $ International_Sales      : num [1:2844] 2.14e+09 2.14e+09 2.14e+09 2.14e+09 1.94e+09 ...
##  $ Worldwide_Sales          : num [1:2844] 2.92e+09 2.92e+09 2.92e+09 2.92e+09 2.80e+09 ...
##  $ Genre_Original           : chr [1:2844] "c(Action" "Adventure" "Fantasy" "Sci-Fi)" ...
##  $ Running_Time_Original    : chr [1:2844] "2 hr 42 min" "2 hr 42 min" "2 hr 42 min" "2 hr 42 min" ...
##  $ License                  : chr [1:2844] "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ Budget                   : num [1:2844] 2.37e+08 2.37e+08 2.37e+08 2.37e+08 3.56e+08 3.56e+08 3.56e+08 3.56e+08 2.00e+08 2.00e+08 ...
##  $ Opening_Sales            : num [1:2844] 7.70e+07 7.70e+07 7.70e+07 7.70e+07 3.57e+08 ...
##  $ Runtime_Minutes          : num [1:2844] 162 162 162 162 181 181 181 181 194 194 ...
##  $ ROI                      : num [1:2844] 11.34 11.34 11.34 11.34 6.86 ...
##  $ Budget_Category          : chr [1:2844] "High" "High" "High" "High" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Title = col_character(),
##   ..   Movie_Info = col_character(),
##   ..   Year = col_double(),
##   ..   Distributor = col_character(),
##   ..   Budget_Original = col_double(),
##   ..   Domestic_Opening_Original = col_character(),
##   ..   Domestic_Sales = col_double(),
##   ..   International_Sales = col_double(),
##   ..   Worldwide_Sales = col_double(),
##   ..   Genre_Original = col_character(),
##   ..   Running_Time_Original = col_character(),
##   ..   License = col_character(),
##   ..   Budget = col_double(),
##   ..   Opening_Sales = col_double(),
##   ..   Runtime_Minutes = col_double(),
##   ..   ROI = col_double(),
##   ..   Budget_Category = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
column_info
## NULL

(Q2 - ) The dataset has 2,844 rows and 17 columns. This is because movies that belong to multiple genres are duplicated in the dataset. (we separated the genres to make them easy to count)

3. How many distinct genres are represented in the dataset?

unique_genres <- length(unique(imdb_visualisation$Genre_Original))
unique_genres
## [1] 46

(Q3 -)There is a total of 46

4. What is the average ROI across all movies?

average_roi <- mean(imdb_visualisation$ROI, na.rm = TRUE)
average_roi
## [1] 17.00191

(Q4 -) The average ROI across all movies is approximately 17.00. This means that, on average, the movies in the dataset earned about 17 times the initial investment, reflecting a strong return on investment.

5. What is the median ROI across all movies?

median_roi <- median(imdb_visualisation$ROI, na.rm = TRUE)
median_roi 
## [1] 3.066516

(Q5 -) The median ROI across all movies is approximately 3.07. This indicates that half of the movies in the dataset had a return on investment (ROI) higher than 3.07, and half had an ROI lower than this value.

————————————————————————————————————————-

————————————————————————————————————————-

Phase 2: Budget, Sales & ROI Patterns

# Ensure necessary libraries are loaded for this phase
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(dplyr)
library(ggplot2)
library(scales) # For label formatting
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

6. How does ROI vary across different budget categories?

ggplot(imdb_visualisation, aes(x = Budget_Category, y = ROI, fill = Budget_Category)) +
  geom_boxplot(show.legend = FALSE, na.rm = TRUE, outlier.shape = NA) + # Hide outliers outside limits
  coord_cartesian(ylim = quantile(imdb_visualisation$ROI, probs = c(0.01, 0.99), na.rm = TRUE)) + # Zoom y-axis
  labs(
    title = "ROI Distribution Across Budget Categories (Zoomed)",
    subtitle = "Note: Y-axis limited to show central distribution; outliers exist.",
    x = "Budget Category",
    y = "Return on Investment (ROI)"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = comma) + # Format y-axis numbers
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, size = 9)
  )

Low-budget films display the highest median ROI (around 6), while medium and high-budget films show more consistent but lower median returns (around 2-3).

Low Budget: Median ROI of about 6, with the central 50% (IQR Interquartile Range.) ranging roughly from 4 to 11.

Medium Budget: Median ROI of about 2.5, with the central 50% (IQR) ranging roughly from 2 to 4.

High Budget: Median ROI of about 1 , with the central 50% (IQR) ranging roughly from 1.5 to 3.5 .

Q6b: Is the difference in ROI between budget categories statistically significant?

Hypotheses:

H₀ (Null Hypothesis): Average ROI is the same for all budget categories.

H₁ (Alternative Hypothesis): Average ROI is different for at least one budget category.

roi_anova_results <- aov(ROI ~ Budget_Category, data = imdb_visualisation)

anova_summary <- summary(roi_anova_results)
print(anova_summary)
##                   Df   Sum Sq Mean Sq F value   Pr(>F)    
## Budget_Category    2 2.04e+06 1019830    7.96 0.000357 ***
## Residuals       2841 3.64e+08  128113                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result & Conclusion:

Since the p-value is less than 0.05, we reject H₀.

Conclusion: There is a statistically significant difference in average ROI between the budget categories.

(Q6b - Explanation) Since this p-value is less than the standard significance level of 0.05, we concluded that there is a statistically significant difference in the average ROI between the budget categories. This suggests the variations in ROI observed in the boxplot are unlikely due to random chance alone.

Q7: What is the relationship between budget and worldwide sales?

# Creating our scatter plot of Budget vs Worldwide Sales

ggplot(imdb_visualisation, aes(x = Budget, y = Worldwide_Sales)) +
  geom_point(alpha = 0.5, na.rm = TRUE) + # here im making pts semi-transprnt to see density
  geom_smooth(method = "lm", color = "red", se = FALSE, na.rm = TRUE) + 
  labs(
    title = "Relationship Between Budget and Worldwide Sales",
    x = "Budget (in $)",
    y = "Worldwide Sales (in $)"
  ) +
  theme_minimal() +
  # We are Formating axes to show currency in Millions (M) right here
  scale_x_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
  theme(plot.title = element_text(hjust = 0.5)) # Center title
## `geom_smooth()` using formula = 'y ~ x'

(Q7 -)The plot indicates that higher budgets generally correlate with higher worldwide sales, but with considerable variation around this trend, especially for more expensive films.

Our Ovaral View : The scatter plot shows a positive relationship between a movie’s budget (on the x-axis) and its worldwide sales (on the y-axis). As the budget increases, worldwide sales tend to increase as well, which is confirmed by the upward slope of the red trendline.

Q8: Do higher-budget movies consistently have higher ROI?

Creating scatter plot of Budget vs ROI

ggplot(imdb_visualisation, aes(x = Budget, y = ROI)) +
  geom_point(alpha = 0.4, color = "dodgerblue", size = 1.5, na.rm = TRUE) + # Points for each movie
  coord_cartesian(ylim = c(0, 50)) + # Focus y-axis from 0 to 50 ROI for clearer view of central tendency
  labs(
    title = "Relationship Between Movie Budget and ROI",
    x = "Budget ($ Millions)",
    y = "Return on Investment (ROI)"
  ) +
  # Format axes for readability
  scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M", accuracy = 1)) +
  scale_y_continuous(labels = comma, breaks = seq(0, 50, by = 10)) +
  theme_minimal(base_size = 11) +
  theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5, size = 9)
  )

(Q8 -) The scatter plot reveals an inverse relationship between budget and ROI.As movie budgets increase, their ROI tends to decrease, indicating diminishing returns on investment for higher-budget productions.

9. How do domestic and international sales contribute to total revenue?

total_sales_summary <- imdb_visualisation %>%
  summarise(
    Total_Domestic = sum(Domestic_Sales, na.rm = TRUE),
    Total_International = sum(International_Sales, na.rm = TRUE)
  )

# 2. Prepare data for ggplot (long format)
total_sales_long <- data.frame(
  Sale_Type = c("Domestic", "International"),
  Total_Sales = c(total_sales_summary$Total_Domestic, total_sales_summary$Total_International),
  Category = "All Movies" # A dummy category for the single bar
)

# 3. Create the stacked bar chart
ggplot(total_sales_long, aes(x = Category, y = Total_Sales, fill = Sale_Type)) +
  geom_col(position = "stack", width = 0.5) + # Create the stacked bar, width adjusts thickness
  labs(
    title = "Overall Contribution of Domestic vs. International Sales",
    subtitle = "Summed across all movies in the dataset",
    x = NULL, # No need for an x-axis label for a single bar
    y = "Total Sales",
    fill = "Sales Market" # Legend title
  ) +
  scale_y_continuous(labels = dollar_format(scale = 1e-9, suffix = "B")) + # Format y-axis in Billions
  scale_fill_brewer(palette = "Paired") + # Choose a color palette
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_blank(), # Hide x-axis text
    axis.ticks.x = element_blank() # Hide x-axis ticks
  ) +
  # Add text labels showing the percentage contribution (optional but informative)
  geom_text(
    aes(label = paste0(round(Total_Sales / sum(Total_Sales) * 100), "%")),
    position = position_stack(vjust = 0.5), # Position text in the middle of each stack segment
    size = 4,
    color = "white"
  )

print(total_sales_summary)
## # A tibble: 1 × 2
##   Total_Domestic Total_International
##            <dbl>               <dbl>
## 1   485317417585        758554813210

(Q9 -) Across all movies analyzed in this dataset, international sales proved to be the dominant revenue driver, contributing a substantial 61% (approximately $759 billion) of the total gross earnings, significantly outweighing the 39% (approximately $485 billion) generated from the domestic market.

————————————————————————————————————————-

————————————————————————————————————————-

10. How has the number of movies changed over the years?

# 1. Count unique movies per year (using 'Year' instead of 'Release_Year')
movies_per_year <- imdb_visualisation %>%
  filter(!is.na(Year) & !is.na(Title)) %>% # Use the correct year column name here
  distinct(Title, Year) %>%              # Use the correct year column name here
  count(Year, name = "Movie_Count") %>%     # Use the correct year column name here
  arrange(Year)                           # Use the correct year column name here

# 2. Create the line chart
ggplot(movies_per_year, aes(x = Year, y = Movie_Count)) + # Use the correct year column name here
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "steelblue", size = 2) +
  labs(
    title = "Number of Unique Movies Released Per Year",
    subtitle = "Based on movies present in the dataset",
    x = "Release Year", # Label can stay descriptive
    y = "Number of Unique Movies"
  ) +
  theme_minimal() +
  # Adjust x-axis breaks (using the correct year column name)
  scale_x_continuous(breaks = seq(min(movies_per_year$Year, na.rm = TRUE), # Use the correct year column name here
                                 max(movies_per_year$Year, na.rm = TRUE), # Use the correct year column name here
                                 by = 5)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

(Q10 -) The number of unique film releases per year shows a distinct upward trend from the late 1980s, peaking at over 40 films annually around 2015-2017, followed by a sharp decline; while the initial drop starting around 2018 likely reflects data incompleteness for recent years within the dataset itself, the extremely low numbers observed from 2020 onwards are probably compounded by the real-world disruptions to film production and releases caused by the COVID-19 pandemic, further emphasizing that the chart reflects the dataset’s limitations rather than the overall industry output.

11. What are the most common genres over time?

imdb_visualisation %>%
  # Use 'Year' as the correct column name
  filter(!is.na(Genre_Original) & !is.na(Year)) %>%
  # Identify and filter for top 10 genres inline
  group_by(Genre_Original) %>%
  mutate(genre_total_count = n()) %>%
  ungroup() %>%
  filter(dense_rank(desc(genre_total_count)) <= 10) %>%
  # Count occurrences per year ONLY for these top 10 genres
  count(Year, Genre_Original, name = "Count") %>%
  # Plot directly
  ggplot(aes(x = Year, y = Count, fill = Genre_Original)) +
    geom_area(alpha = 0.8) +
    # --- Use a Brewer palette for better colors ---
    scale_fill_brewer(palette = "Paired") + # Or try "Set3", "Spectral" etc.
    # --- Labels and Theme ---
    labs(
      title = "Top 10 Genres Over Time",
      subtitle = "Based on movie entries per genre each year",
      x = "Release Year", y = "Number of Movie Entries", fill = "Genre"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 8))

12. Which genres have the highest average ROI?

imdb_visualisation %>%
  mutate(
    Genre_Clean = str_remove_all(Genre_Original, "[c()]"),
    Genre_Clean = trimws(Genre_Clean)
  ) %>%
  filter(!is.na(ROI) & !is.na(Genre_Clean) & Genre_Clean != "" & ROI > 0) %>% # Filter ROI > 0 for log scale
  group_by(Genre_Clean) %>%
  summarise(
    Average_ROI = mean(ROI, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n >= 5) %>%
  arrange(desc(Average_ROI)) %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(Genre_Clean, -Average_ROI), y = Average_ROI, fill = Genre_Clean)) +
    geom_col(show.legend = FALSE) +
    # --- Use a Log10 Scale for the Y-axis ---
    scale_y_log10(
      breaks = scales::trans_breaks("log10", function(x) 10^x), # Auto breaks for log
      labels = scales::comma # Use comma format for labels
    ) +
    labs(
      title = "Top 15 Genres by Average ROI (Log Scale)",
      subtitle = "Genres with >= 5 movies. Y-axis uses log scale to show relative differences.",
      x = "Genre",
      y = "Average ROI (Log Scale)"
    ) +
    theme_minimal(base_size = 10) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5, size = 8)
    ) +
    annotation_logticks(sides = "l")

(Q12 -) Horror and Mystery genres delivered substantially higher average Return on Investment (ROI) compared to other top genres like Drama, Comedy, or Adventure, indicating greater profitability relative to budget for these films within this dataset.

12b. Is the difference in ROI across genres statistically significant?

Hypotheses for ANOVA Test (Q12b):

H₀ (Null Hypothesis): The average Return on Investment (ROI) is the same for all movie genres included in the analysis (those with at least 5 movies).

H₁ (Alternative Hypothesis): The average Return on Investment (ROI) is different for at least one movie genre compared to the others.

anova_tidy_results <- imdb_visualisation %>%
  mutate(
    Genre_Clean = str_remove_all(Genre_Original, "[c()]"),
    Genre_Clean = trimws(Genre_Clean)
  ) %>%
  filter(!is.na(ROI) & !is.na(Genre_Clean) & Genre_Clean != "") %>%
  group_by(Genre_Clean) %>%
  filter(n() >= 5) %>%
  ungroup() %>%
  # Perform ANOVA
  aov(ROI ~ Genre_Clean, data = .) %>%
  # Tidy the results into a tibble
  broom::tidy() %>%
  # Format the p-value column specifically
  mutate(
    # format() prevents scientific notation for this specific column
    p.value_formatted = format(p.value, scientific = FALSE, digits = 9)
    )

# --- Display Tidied & Formatted Summary ---
print("ANOVA Results (tidied, p-value formatted):")
## [1] "ANOVA Results (tidied, p-value formatted):"
# Print the relevant columns from the tidied table
print(select(anova_tidy_results, term, df, statistic, p.value_formatted))
## # A tibble: 2 × 4
##   term           df statistic p.value_formatted   
##   <chr>       <dbl>     <dbl> <chr>               
## 1 Genre_Clean    19      3.89 "0.0000000269587933"
## 2 Residuals    2823     NA    "                NA"

(Q12b -) This very low p-value provides strong statistical evidence to reject the null hypothesis (that all genres have the same average ROI). Therefore, we can confidently conclude that the average ROI does significantly differ depending on the movie genre within this dataset. The variations seen in the previous bar chart are not just due to random chance.

13. Which year had the highest average ROI?

imdb_visualisation %>%
  filter(!is.na(Year) & !is.na(ROI)) %>%
  distinct(Title, .keep_all = TRUE) %>%
  group_by(Year) %>%
  summarise(Average_ROI = mean(ROI, na.rm = TRUE), n = n(), .groups = 'drop') %>%
  filter(n >= 5 & Average_ROI > 0) %>% # Filter years with >= 5 movies AND positive ROI
  ggplot(aes(x = Year, y = Average_ROI)) +
    geom_line(color = "darkorange", linewidth = 0.8) +
    geom_point(color = "darkorange", size = 1.5) +
    # Highlight layer for the max point
    geom_point(data = . %>% slice_max(order_by = Average_ROI, n = 1),
               aes(x = Year, y = Average_ROI),
               color = "firebrick", size = 3) +
    # Text label layer for the max point
    geom_text(data = . %>% slice_max(order_by = Average_ROI, n = 1),
              aes(label = Year),
              vjust = -1, color = "firebrick", size = 3) +
    # --- Apply Log Scale to Y-axis ---
    scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::comma # Format labels nicely
        ) +
    annotation_logticks(sides = "l") + # Add log ticks
    labs(
      title = "Average Movie ROI Per Year (Log Scale)",
      subtitle = "Based on unique movies (>= 5 releases/year). Y-axis uses log scale.",
      x = "Release Year",
      y = "Average ROI (Log Scale)"
    ) +
    theme_minimal(base_size = 10) +
    scale_x_continuous(breaks = seq(1975, 2020, by = 5)) + # Adjust breaks
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5, size = 8)
      )

(Q13 -) The most striking feature is the massive peak in 2007, where the average ROI was exceptionally high (appearing above 300x on the log scale). This is almost certainly driven by outlier hits with extremely low budgets and high returns, such as “Paranormal Activity,” significantly pulling up the average for that year.

————————————————————————————————————————-

————————————————————————————————————————-

Phase 4: Distributors & Top Performers

14. Which distributors released the most high-ROI movies?

roi_threshold <- imdb_visualisation %>%
  filter(!is.na(ROI)) %>%
  distinct(Title, .keep_all = TRUE) %>%
  pull(ROI) %>%
  quantile(0.75, na.rm = TRUE)

# 2. Main pipeline: Filter, count, select top, and plot
imdb_visualisation %>%
  filter(!is.na(ROI) & !is.na(Distributor)) %>% # Filter needed NAs
  distinct(Title, .keep_all = TRUE) %>%         # Get unique movies
  filter(ROI >= roi_threshold) %>%              # Filter for high ROI
  count(Distributor, name = "High_ROI_Movie_Count") %>% # Count per distributor
  arrange(desc(High_ROI_Movie_Count)) %>%
  slice_head(n = 15) %>%                        # Take top 15
  # Pipe directly into ggplot
  ggplot(aes(x = reorder(Distributor, -High_ROI_Movie_Count), y = High_ROI_Movie_Count, fill = Distributor)) +
    geom_col(show.legend = FALSE) +
    labs(
      title = "Top 15 Distributors by Number of High-ROI Movies",
      # Use the calculated threshold in the subtitle
      subtitle = paste("Movies with ROI >=", round(roi_threshold, 2), "(Top 25%)"),
      x = "Distributor",
      y = "Number of High-ROI Movies"
    ) +
    theme_minimal(base_size = 10) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5, size = 8)
      ) +
    scale_fill_viridis_d() # Apply a color scale

(Q14. - )Universal Pictures leads significantly, followed by other major studios like Warner Bros., Fox, Disney, and Paramount, with several smaller or specialized distributors also demonstrating success in releasing highly profitable movies.

15. What are the top 10 movies by ROI?

imdb_visualisation %>%
  filter(!is.na(ROI) & !is.na(Title)) %>%   # Filter out NAs
  distinct(Title, .keep_all = TRUE) %>%   # Get unique movies
  arrange(desc(ROI)) %>%                  # Sort by ROI descending
  slice_head(n = 10) %>%                  # Take top 10
  # Pipe directly into ggplot for horizontal bar chart
  ggplot(aes(x = ROI, y = reorder(Title, ROI), fill = Title)) +
    geom_col(show.legend = FALSE) +
    # Add text labels for ROI values (adjust hjust for placement)
    geom_text(aes(label = scales::comma(ROI, accuracy = 1)), # Format with comma
              hjust = -0.1, size = 3) +  # Position to the right of the bar
    labs(
      title = "Top 10 Movies by Return on Investment (ROI)",
      x = "Return on Investment (ROI)",
      y = "Movie Title"
    ) +
    theme_minimal(base_size = 10) +
    # Expand x-axis limits slightly to make room for labels
    scale_x_continuous(labels = scales::comma, expand = expansion(mult = c(0.01, 0.15))) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.text.y = element_text(size = 8) # Adjust y-axis label size if needed
      ) +
    scale_fill_viridis_d() # Optional color scale

(Q15. ) The chart dramatically shows that “Paranormal Activity” [ 1,286,566% ] and “The Blair Witch Project” [ 4,143,000% ] achieved astronomical Returns on Investment (ROI), vastly exceeding all others. This is because the movies had low budget but got extremely high returns which resulted in very high ROI’s.

16. What are the top 10 movies by worldwide sales?

imdb_visualisation %>%
  filter(!is.na(Worldwide_Sales) & !is.na(Title)) %>% # Filter NAs
  distinct(Title, .keep_all = TRUE) %>%          # Get unique movies
  arrange(desc(Worldwide_Sales)) %>%             # Sort by sales descending
  slice_head(n = 10) %>%                         # Take top 10
  # Pipe directly into ggplot
  ggplot(aes(x = Worldwide_Sales, y = reorder(Title, Worldwide_Sales), fill = Title)) +
    geom_col(show.legend = FALSE) +
    # Add text labels formatted in Billions of dollars
    geom_text(aes(label = dollar(Worldwide_Sales, scale = 1e-9, suffix = "B", accuracy = 0.1)),
              hjust = -0.1, size = 3) +  # Position to the right of the bar
    labs(
      title = "Top 10 Movies by Worldwide Sales",
      x = "Worldwide Sales ($ Billions)",
      y = "Movie Title"
    ) +
    theme_minimal(base_size = 10) +
    # Format x-axis in Billions and expand limits for labels
    scale_x_continuous(labels = dollar_format(scale = 1e-9, suffix = "B"),
                       expand = expansion(mult = c(0.01, 0.15))) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.text.y = element_text(size = 8)
      ) +
    scale_fill_viridis_d(option = "plasma") # Different color palette

(Q16. - ) the top 10 highest-grossing movies are led by “Avatar” and “Avengers: Endgame,” each nearing $3 billion in worldwide sales, with all top 10 films on the list surpassing $1.4 billion.

17. How do top distributors compare in terms of average worldwide sales and ROI?

top_n_distributors <- 10
top_distributors_list <- imdb_visualisation %>%
  filter(!is.na(Distributor)) %>%
  distinct(Title, .keep_all = TRUE) %>%
  count(Distributor) %>%
  slice_max(order_by = n, n = top_n_distributors) %>%
  pull(Distributor)

# Calculate metrics, reshape, and plot
imdb_visualisation %>%
  filter(!is.na(Distributor) & !is.na(Worldwide_Sales) & !is.na(ROI)) %>% # Filter NAs
  distinct(Title, .keep_all = TRUE) %>%           # Use unique movies for averages
  filter(Distributor %in% top_distributors_list) %>% # Keep only top distributors
  group_by(Distributor) %>%
  summarise(
    `Avg Worldwide Sales` = mean(Worldwide_Sales, na.rm = TRUE), # Name with spaces for facet title
    `Avg ROI` = mean(ROI, na.rm = TRUE),                     # Name with spaces
    .groups = 'drop'
  ) %>%
  # Pivot to long format for faceting
  pivot_longer(
    cols = c(`Avg Worldwide Sales`, `Avg ROI`),
    names_to = "Metric_Type",
    values_to = "Value"
  ) %>%
  # Create the faceted plot
  ggplot(aes(x = reorder(Distributor, -Value), y = Value, fill = Metric_Type)) +
    geom_col(show.legend = FALSE) +
    # Facet by metric type with independent y-axes
    facet_wrap(~ Metric_Type, scales = "free_y") +
    labs(
      title = paste("Average Sales vs. Average ROI for Top", top_n_distributors, "Distributors"),
      subtitle = "Distributors selected by total movie count in dataset",
      x = "Distributor",
      y = "Average Value" # Y-axis label is generic due to free scales
    ) +
    # Custom y-axis formatting for each facet
    scale_y_continuous(labels = function(x) case_when(
      max(x, na.rm = TRUE) > 1e6 ~ dollar(x, scale = 1e-6, suffix = "M"), # Sales in Millions
      TRUE ~ comma(x, accuracy = 0.1) # ROI as number
    )) +
    theme_minimal(base_size = 10) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5, size = 8),
      strip.text = element_text(face = "bold") # Bold facet titles
      ) +
    scale_fill_brewer(palette = "Set2") # Color palette

(Q17 - ) Among the top distributors by film count, leaders in average worldwide sales (like DreamWorks and Disney) differ notably from the leader in average ROI (Paramount, heavily influenced by low-budget successes), illustrating distinct patterns between maximizing gross revenue and achieving high investment efficiency.

————————————————————————————————————————-

————————————————————————————————————————-

Phase 5: Conclusion

Summary of Findings:

- While spending more money on a film generally leads to higher total worldwide sales, it consistently results in a lower average Return on Investment (ROI). The most profitable films relative to their cost are often those with significantly smaller budgets that achieve unexpected success.

- Genre Profitability: Genre plays a crucial role in financial efficiency. Horror, Mystery, and Thriller films demonstrate the highest average ROI in this dataset, likely due to lower typical production costs combined with strong audience appeal. Blockbuster genres like Action, Adventure, and Sci-Fi, while generating massive total revenue, show lower average ROI because of their high initial investments. The differences in average ROI across genres were found to be statistically significant.

- Market Dynamics: For the movies included here, the international market contributed more significantly to total worldwide sales (around 61%) than the domestic market (around 39%).

- Distributor Strategies: Major distributors (Disney, Universal, Warner Bros., Fox) excel at generating high average worldwide sales, often driven by big-budget franchise films. However, distributors handling exceptionally successful low-budget hits (like Paramount with “Paranormal Activity” in this data) or those focusing on mid-range or specialty films (like Fox Searchlight or Lionsgate) can achieve higher average ROI, showcasing different paths to profitability.

- Outlier Impact: Average ROI figures, especially on a yearly basis (like the peak in 2007) or for certain distributors, can be heavily skewed by a small number of films with exceptionally high performance relative to their budget (e.g., “Paranormal Activity”, “The Blair Witch Project”).

Overall Takeaway:

- Success in the film industry, as reflected in this dataset, isn’t just about making the most money overall; it’s also about efficiency. While big-budget blockbusters dominate the highest sales charts, lower-budget films in specific genres often provide a much higher percentage return on the money invested, and the international audience is vital for maximizing total revenue.

————————————————————————————————————————-

————————————————————————————————————————-