Budget vs. Worldwide Sales Distribution of Highest Grossing Movies

Author

Kenny Nguyen

This data set focuses on the highest-grossing Hollywood movies. The data is current as of September 25, 2023, and was collected through web scraping. The data set includes the following variables:

Ranking: The movie’s position on the highest-grossing list.
Title: The name of the movie.
Brief Description: A short summary of the movie.
Release Year: The year the movie premiered.
Distributor: The studio or company responsible for distributing the movie.
Budget: The estimated production cost of the movie.
Domestic Opening Revenue: The total earnings in North America during the first week of release.
Domestic Sales: The total revenue earned in North America.
International Sales: The total revenue earned outside of North America.
Worldwide Sales: The combined total revenue from both domestic and international markets.
Release Date: The exact date the movie was released in theaters.
Genre: The category or type of film.
Running Time: The total duration of the movie in minutes.
License (Rating): The movie’s rating, indicating the appropriate audience (e.g., PG, R).

With this data set, I aim to explore whether there is a correlation between a movie’s budget and its worldwide sales. Understanding this relationship could show whether movies with higher budgets tend to earn higher worldwide sales.

library(RColorBrewer)  # Color palettes for plots
library(GGally)  # ggplot2 extension; loading it also attaches ggplot2

Loading required package: ggplot2

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

library(plotly)  # Loading the plotly library for interactive visualizations


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(readr)  # Loading the readr library to read CSV files
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ stringr   1.5.1
✔ forcats   1.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks plotly::filter(), stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("~/Desktop/Data 110")  # Set working directory to the correct path
movies <- read_csv("Highest Holywood Grossing Movies.csv")  # Opening up my dataset

New names:
Rows: 1000 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): Title, Movie Info, Distributor, Budget (in $), Domestic Opening (in... dbl
(5): ...1, Year, Domestic Sales (in $), International Sales (in $), Worl...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

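names(movies) <- tolower(names(movies)) # convert column names to lowercase
names(movies) <- gsub("[$]", "", names(movies)) # remove the $ symbol
names(movies) <- gsub(" ", "_", names(movies)) # replace spaces with underscores
names(movies) <- gsub("[()]", "", names(movies)) # remove parentheses from column names
# strip all non-digit characters from budget, then convert to numeric
movies$budget_in_ <- as.numeric(gsub("[^0-9]", "", movies$budget_in_))
# Note: budget_in_ is already numeric at this point, so grepl() checks its character form.
# Values whose text is all digits become NA, and as.integer() returns NA (with a warning)
# for values beyond the integer range.
movies$budget_in_ <- ifelse(!grepl("[^0-9]", movies$budget_in_), NA, as.integer(movies$budget_in_))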
head(movies)

# A tibble: 6 × 14
   ...1 title       movie_info  year distributor budget_in_ domestic_opening_in_
  <dbl> <chr>       <chr>      <dbl> <chr>            <int> <chr>               
1     0 Avatar      A paraple…  2009 Twentieth …  237000000 77025481            
2     1 Avengers: … After the…  2019 Walt Disne…  356000000 357115007           
3     2 Avatar: Th… Jake Sull…  2022 20th Centu…         NA 134100226           
4     3 Titanic     A sevente…  1997 Paramount …  200000000 28638131            
5     4 Star Wars:… As a new …  2015 Walt Disne…  245000000 247966675           
6     5 Avengers: … The Aveng…  2018 Walt Disne…         NA 257698183           
# ℹ 7 more variables: domestic_sales_in_ <dbl>, international_sales_in_ <dbl>,
#   world_wide_sales_in_ <dbl>, release_date <chr>, genre <chr>,
#   running_time <chr>, license <chr>
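
The budget-cleaning idea can be illustrated on a few made-up strings. This is a minimal sketch, not the code used above: `parse_budget` is a hypothetical helper, and the sample values are invented for illustration rather than taken from the CSV. It strips every non-digit character and treats empty results as missing.

```r
# Hypothetical helper: turn strings such as "$237,000,000" into numbers.
parse_budget <- function(x) {
  digits <- gsub("[^0-9]", "", x)            # keep digits only
  digits <- ifelse(nzchar(digits), digits, NA)  # empty string -> NA
  as.numeric(digits)
}

# Invented sample values (assumption: budgets are stored as formatted text)
sample_budgets <- c("$237,000,000", "", NA, "$5,000,000")
parsed <- parse_budget(sample_budgets)
print(parsed)
```

Keeping the result as a double avoids the integer range limit that `as.integer()` imposes.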

nona <- na.omit(movies,"budget_in_") # Note: na.omit() drops rows with an NA in any column; the second argument does not restrict it to budget
  
cor(nona$world_wide_sales_in_, nona$budget_in_) # Finding the correlation between worldwide sales and budget

[1] 0.5417298
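
When filtering missing values, it helps to know that `na.omit()` on a data frame removes every row that has an NA in any column. The sketch below uses an invented toy data frame (the column names and values are assumptions for illustration) to compare that behavior with dropping rows based on a single column.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  budget = c(100, NA, 300, 400),
  sales  = c(250, 500, 900, 1000),
  genre  = c("Action", "Drama", NA, "Comedy")
)

any_na_dropped <- na.omit(toy)                 # drops rows 2 and 3
budget_only    <- toy[!is.na(toy$budget), ]    # drops row 2 only

print(nrow(any_na_dropped))
print(nrow(budget_only))
```

If only budget and worldwide sales matter for the analysis, filtering on those columns alone may keep more rows than `na.omit()` on the full table.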

fit1 <- lm(world_wide_sales_in_ ~ budget_in_, data = nona) # Creating a model to predict worldwide sales based on budget
  
summary(fit1) # Getting a summary of the regression model (p-value, R-squared, coefficients, etc.)


Call:
lm(formula = world_wide_sales_in_ ~ budget_in_, data = nona)

Residuals:
       Min         1Q     Median         3Q        Max 
-554893226 -143153414  -40005605   84672116 2120405030 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.519e+08  1.883e+07   8.065 3.09e-15 ***
budget_in_  2.749e+00  1.593e-01  17.257  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 256900000 on 717 degrees of freedom
Multiple R-squared:  0.2935,    Adjusted R-squared:  0.2925 
F-statistic: 297.8 on 1 and 717 DF,  p-value: < 2.2e-16

The correlation between budget and worldwide sales is 0.54, indicating a moderate positive relationship. This suggests that as a movie’s budget increases, its worldwide sales tend to increase as well, even though other factors also play a role. The regression model is represented by the equation: worldwide_sales = 151,900,000 + 2.749(budget)

This means that for each additional $1 in budget, the model predicts an increase of about $2.75 in worldwide sales. The p-value for the budget variable is extremely small (< 2e-16) and has three asterisks, indicating that budget is a highly significant predictor of worldwide sales. The Adjusted R-Squared value is 0.2925, meaning that about 29% of the variation in worldwide sales can be explained by a movie’s budget. This suggests that while budget is associated with worldwide sales, a large portion (roughly 71%) of the variation is likely influenced by other factors, such as marketing or franchise popularity.
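
As a quick worked example of reading the fitted line, the sketch below plugs a hypothetical $100 million budget into the coefficients printed in the regression summary. The budget value is an assumption chosen for illustration. It also checks that the squared correlation agrees with the reported Multiple R-squared, which holds for a regression with a single predictor.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 1.519e8
slope     <- 2.749

# Hypothetical budget for illustration: $100 million
budget    <- 100e6
predicted <- intercept + slope * budget  # about 426.8 million
print(predicted)

# For one predictor, R-squared equals the squared correlation
r         <- 0.5417298
r_squared <- r^2
print(round(r_squared, 4))
```

With `fit1` available, `predict(fit1, newdata = data.frame(budget_in_ = 100e6))` should give a similar figure using the unrounded coefficients.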

cc <- nona |>  
  select(title, budget_in_, world_wide_sales_in_) # Selecting the variables I want
 

summary(cc)  # Summary statistics for each selected column

    title             budget_in_        world_wide_sales_in_
 Length:719         Min.   :  5000000   Min.   :1.806e+08   
 Class :character   1st Qu.: 55000000   1st Qu.:2.311e+08   
 Mode  :character   Median : 90000000   Median :3.294e+08   
                    Mean   :101809458   Mean   :4.317e+08   
                    3rd Qu.:145000000   3rd Qu.:5.210e+08   
                    Max.   :356000000   Max.   :2.924e+09

groups <- cc |>  # Create a new dataset with categories for budget and worldwide sales
  mutate(
    # Create a new column 'budget_category' that categorizes movies based on budget ranges
    budget_category = case_when(
      budget_in_ < 500000 ~ "< $500,000",  # budget less than $500,000
      budget_in_ >= 500000 & budget_in_ < 1000000 ~ "$500,000 - $1 Million",  # budget between $500,000 and $1 Million
      budget_in_ >= 1000000 & budget_in_ < 10000000 ~ "$1 Million - $10 Million",  # budget between $1 Million and $10 Million
      budget_in_ >= 10000000 & budget_in_ < 50000000 ~ "$10 Million - $50 Million",  # budget between $10 Million and $50 Million
      budget_in_ >= 50000000 & budget_in_ < 100000000 ~ "$50 Million - $100 Million",  # budget between $50 Million and $100 Million
      budget_in_ >= 100000000 & budget_in_ < 500000000 ~ "$100 Million - $500 Million",  # budget between $100 Million and $500 Million
      budget_in_ >= 500000000 & budget_in_ < 1000000000 ~ "$500 Million - $1 Billion"  # budget between $500 Million and $1 Billion
    ),
    
    # Create a new column 'worldwide_category' that categorizes movies based on worldwide sales
    worldwide_category = case_when(
      world_wide_sales_in_ < 500000 ~ "< $500,000",  # Sales are less than $500,000
      world_wide_sales_in_ >= 500000 & world_wide_sales_in_ < 1000000 ~ "$500,000 - $1 Million",  # finding films with world wide sales between $500,000 and $1 Million
      world_wide_sales_in_ >= 1000000 & world_wide_sales_in_ < 10000000 ~ "$1 Million - $10 Million",  # finding films with world wide sales between $1 Million and $10 Million
      world_wide_sales_in_ >= 10000000 & world_wide_sales_in_ < 50000000 ~ "$10 Million - $50 Million",  # finding films with world wide sales between $10 Million and $50 Million
      world_wide_sales_in_ >= 50000000 & world_wide_sales_in_ < 100000000 ~ "$50 Million - $100 Million",  # finding films with world wide sales between $50 Million and $100 Million
      world_wide_sales_in_ >= 100000000 & world_wide_sales_in_ < 500000000 ~ "$100 Million - $500 Million",  # finding films with world wide sales between $100 Million and $500 Million
      world_wide_sales_in_ >= 500000000 & world_wide_sales_in_ < 1000000000 ~ "$500 Million - $1 Billion",  # finding films with world wide sales between $500 Million and $1 Billion
      world_wide_sales_in_ >= 1000000000 & world_wide_sales_in_ < 2000000000 ~ "$1 Billion - $2 Billion",  # finding films with world wide sales between $1 Billion and $2 Billion
      world_wide_sales_in_ >= 2000000000 ~ "> $2 Billion"  # finding films with world wide sales greater than $2 Billion
    ))
head(groups) # Display the first few rows of the new dataset

# A tibble: 6 × 5
  title       budget_in_ world_wide_sales_in_ budget_category worldwide_category
  <chr>            <int>                <dbl> <chr>           <chr>             
1 Avatar       237000000           2923706026 $100 Million -… > $2 Billion      
2 Avengers: …  356000000           2799439100 $100 Million -… > $2 Billion      
3 Titanic      200000000           2264743305 $100 Million -… > $2 Billion      
4 Star Wars:…  245000000           2071310218 $100 Million -… > $2 Billion      
5 Jurassic W…  150000000           1671537444 $100 Million -… $1 Billion - $2 B…
6 The Lion K…  260000000           1663075401 $100 Million -… $1 Billion - $2 B…
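
The same kind of binning can also be written with base R's `cut()`, which takes the break points once instead of repeating each range as a pair of comparisons. This is an alternative sketch, not the approach used above; the `breaks`, `labels`, and sample budgets are written out here for illustration. Setting `right = FALSE` makes each interval include its lower bound and exclude its upper bound, matching the `>=` and `<` comparisons in the `case_when()` version (apart from negative values, which the lower break of 0 excludes).

```r
# Break points and labels mirroring the budget ranges above
breaks <- c(0, 5e5, 1e6, 1e7, 5e7, 1e8, 5e8, 1e9)
labels <- c(
  "< $500,000",
  "$500,000 - $1 Million",
  "$1 Million - $10 Million",
  "$10 Million - $50 Million",
  "$50 Million - $100 Million",
  "$100 Million - $500 Million",
  "$500 Million - $1 Billion"
)

# Sample budgets chosen for illustration
budgets <- c(5e6, 9e7, 2.37e8)

# right = FALSE gives intervals of the form [lower, upper)
binned <- cut(budgets, breaks = breaks, labels = labels, right = FALSE)
print(binned)
```

A side benefit is that `cut()` returns a factor whose levels follow the order of `labels`, so the categories already carry a logical order.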

n <- groups |> 
  select(title, budget_category, worldwide_category) |>  # Select columns I want
  count(budget_category, worldwide_category) # Getting the number count of films in each category
head(n)  # Display the first few rows

# A tibble: 6 × 3
  budget_category             worldwide_category              n
  <chr>                       <chr>                       <int>
1 $1 Million - $10 Million    $100 Million - $500 Million     6
2 $10 Million - $50 Million   $100 Million - $500 Million   130
3 $10 Million - $50 Million   $500 Million - $1 Billion       7
4 $100 Million - $500 Million $1 Billion - $2 Billion        30
5 $100 Million - $500 Million $100 Million - $500 Million   192
6 $100 Million - $500 Million $500 Million - $1 Billion     106

ggplot(n,aes(x = budget_category, y = worldwide_category, fill = n)) +  # telling r what I want in my plot 
  geom_tile(color = "black") +  # Add black grid lines to separate tiles
  scale_fill_distiller(palette = "Blues", direction = 1, name = "Movie Count") +  # Add a blue gradient and used direction to make it go from light to dark
  theme_bw() +  # Use a clean, white-background theme
  labs(
    title = "Budget vs. Worldwide Sales Distribution of \n Highest Grossing Movies",  # Meaningful title and used "\n" to drop the title 
    x = "Budget Category",  # Added a meaningful x axis label
    y = "Worldwide Sales Category",  # Added a meaningful Y-axis label
    caption = "Source: Web Scraping"  # Added the source for data
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels to read better
    legend.position = "right"  # Place legend to the right
  )

ggplotly()

I cleaned this data set by first converting all the column headers to lowercase. I then removed the “$” symbol from the column names, replaced spaces with underscores, removed parentheses, and stripped all non-numeric characters from the budget values.

The visualization illustrates the relationship between budget and worldwide sales. One interesting finding is that six movies with a budget of $1 million to $10 million generated between $100 million and $500 million in worldwide sales. However, one thing I was unable to accomplish was ordering the x- and y-axes in a logical sequence.
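
One possible way to get a logical axis order is to convert the category columns to factors with explicitly ordered levels, since ggplot2 places discrete axis values in factor-level order rather than alphabetically. The sketch below is only an illustration on a small invented data frame; `category_levels` and `toy` are hypothetical names, and the same idea would need to be applied to both `budget_category` and `worldwide_category` before plotting.

```r
# Desired order, from smallest range to largest (assumed label set)
category_levels <- c(
  "< $500,000",
  "$500,000 - $1 Million",
  "$1 Million - $10 Million",
  "$10 Million - $50 Million",
  "$50 Million - $100 Million",
  "$100 Million - $500 Million",
  "$500 Million - $1 Billion"
)

# Toy data, invented for illustration
toy <- data.frame(
  budget_category = c(
    "$50 Million - $100 Million",
    "$100 Million - $500 Million",
    "$1 Million - $10 Million"
  )
)

# As plain text these sort alphabetically, which puts "$100 ..." before "$50 ..."
alphabetical <- sort(toy$budget_category)

# As a factor they sort (and plot) in the order given by category_levels
toy$budget_category <- factor(toy$budget_category, levels = category_levels)
ordered_categories <- as.character(sort(toy$budget_category))
print(ordered_categories)
```

Applying `factor(..., levels = ...)` to the category columns of `n` before calling `ggplot()` should then place the tiles in increasing order on both axes.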