Budget vs. Worldwide Sales Distribution of Highest Grossing Movies
Author
Kenny Nguyen
This data set focuses on the highest-grossing Hollywood movies. The data is current as of September 25, 2023, and was sourced through web scraping. The data set includes the following variables:
Ranking – The movie’s position on the highest-grossing list.
Title – The name of the movie.
Brief Description – A short summary of the movie.
Release Year – The year the movie premiered.
Distributor – The studio or company responsible for distributing the movie.
Budget – The estimated production cost of the movie.
Domestic Opening Revenue – The total earnings in North America during the first week of release.
Domestic Sales – The total revenue earned in North America.
International Sales – The total revenue earned outside of North America.
Worldwide Sales – The combined total revenue from both domestic and international markets.
Release Date – The exact date the movie was released in theaters.
Genre – The category or type of film.
Running Time – The total duration of the movie in minutes.
License (Rating) – The movie’s rating, indicating the appropriate audience (e.g., PG, R).
With this data set, I aim to explore whether there is a correlation between a movie’s budget and its worldwide sales. Understanding this relationship could show whether movies with larger budgets tend to earn higher worldwide sales.
library(RColorBrewer) # Loading a necessary library for plots
library(GGally) # Loading GGally, a ggplot2 extension (this also attaches ggplot2)
Loading required package: ggplot2
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
library(plotly) # Loading the plotly library for interactive visualizations
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(readr) # Loading the readr library to read CSV files
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks plotly::filter(), stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/Data 110") # Set working directory to the correct path
movies <- read_csv("Highest Holywood Grossing Movies.csv") # Opening up my dataset
New names:
• `` -> `...1`
Rows: 1000 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Title, Movie Info, Distributor, Budget (in $), Domestic Opening (in...
dbl (5): ...1, Year, Domestic Sales (in $), International Sales (in $), Worl...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(movies) <- tolower(names(movies)) # convert column names to lowercase
names(movies) <- gsub("[$]", "", names(movies)) # remove the $ symbol
names(movies) <- gsub(" ", "_", names(movies)) # replace spaces with underscores
names(movies) <- gsub("[()]", "", names(movies)) # remove parentheses from column names
movies$budget_in_ <- as.numeric(gsub("[^0-9]", "", movies$budget_in_)) # strip all non-numeric characters from budget, then convert to numeric
# Convert budget to integer. Caution: grepl() here tests the already-numeric column as text, so which values become NA depends on how R formats each number
movies$budget_in_ <- ifelse(!grepl("[^0-9]", movies$budget_in_), NA, as.integer(movies$budget_in_))
head(movies)
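The column-name cleanup can be traced on a few header strings. The sketch below uses three names that appear in the column specification printed by read_csv() and applies the same four steps in base R; it shows why the cleaned names end in a trailing underscore (for example budget_in_). The vector names raw_names and clean_names are illustrative only.

```r
# Header strings taken from the read_csv() column specification
raw_names <- c("Title", "Budget (in $)", "Domestic Sales (in $)")

clean_names <- tolower(raw_names)             # lowercase everything
clean_names <- gsub("[$]", "", clean_names)   # drop the dollar sign
clean_names <- gsub(" ", "_", clean_names)    # spaces become underscores
clean_names <- gsub("[()]", "", clean_names)  # drop parentheses

print(clean_names)
```

The space between "in" and "$" survives the second step and becomes the final underscore in the third.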
# A tibble: 6 × 14
...1 title movie_info year distributor budget_in_ domestic_opening_in_
<dbl> <chr> <chr> <dbl> <chr> <int> <chr>
1 0 Avatar A paraple… 2009 Twentieth … 237000000 77025481
2 1 Avengers: … After the… 2019 Walt Disne… 356000000 357115007
3 2 Avatar: Th… Jake Sull… 2022 20th Centu… NA 134100226
4 3 Titanic A sevente… 1997 Paramount … 200000000 28638131
5 4 Star Wars:… As a new … 2015 Walt Disne… 245000000 247966675
6 5 Avengers: … The Aveng… 2018 Walt Disne… NA 257698183
# ℹ 7 more variables: domestic_sales_in_ <dbl>, international_sales_in_ <dbl>,
# world_wide_sales_in_ <dbl>, release_date <chr>, genre <chr>,
# running_time <chr>, license <chr>
nona <- na.omit(movies, "budget_in_") # Drop rows with missing values to avoid errors (na.omit() checks every column, not only budget_in_)
cor(nona$world_wide_sales_in_, nona$budget_in_) # Finding the correlation between worldwide sales and budget
[1] 0.5417298
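One detail worth keeping in mind when reading the row counts: na.omit() on a data frame drops every row that has a missing value in any column, and a column name passed as a second argument does not restrict that check. The sketch below contrasts this with filtering on a single column. The data frame toy and its values are hypothetical and exist only to illustrate the difference.

```r
# Hypothetical toy data: one NA in budget, one NA in an unrelated column
toy <- data.frame(
  title   = c("A", "B", "C"),
  budget  = c(100, NA, 300),
  license = c("PG", "R", NA)
)

all_columns <- na.omit(toy)               # drops rows B and C
budget_only <- toy[!is.na(toy$budget), ]  # drops row B only

nrow(all_columns)
nrow(budget_only)
```

If only the budget column should decide which films are kept, the second form (or an equivalent dplyr filter) would retain more rows.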
fit1 <- lm(world_wide_sales_in_ ~ budget_in_, data = nona) # Creating a model to predict worldwide sales based on budget
summary(fit1) # Getting a summary of the regression model (p-value, R-squared, coefficients, etc.)
Call:
lm(formula = world_wide_sales_in_ ~ budget_in_, data = nona)
Residuals:
Min 1Q Median 3Q Max
-554893226 -143153414 -40005605 84672116 2120405030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.519e+08 1.883e+07 8.065 3.09e-15 ***
budget_in_ 2.749e+00 1.593e-01 17.257 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 256900000 on 717 degrees of freedom
Multiple R-squared: 0.2935, Adjusted R-squared: 0.2925
F-statistic: 297.8 on 1 and 717 DF, p-value: < 2.2e-16
The correlation between budget and worldwide sales is 0.54, indicating a moderate positive relationship. This suggests that as a movie’s budget increases, its worldwide sales tend to increase as well, even though other factors also play a role. The regression model is represented by the equation: worldwide_sales = 151,900,000 + 2.749 × budget
This means that for each additional $1 in budget, the model predicts an increase of about $2.75 in worldwide sales. The p-value for the budget variable is extremely small (< 2e-16) and is marked with three asterisks, indicating that budget is a statistically significant predictor of worldwide sales. The Adjusted R-squared value is 0.2925, meaning that about 29.3% of the variation in worldwide sales can be explained by a movie’s budget. This suggests that while budget is associated with worldwide sales, a large portion (about 70.7%) of the variation is likely influenced by other factors, such as marketing or franchise popularity.
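The arithmetic behind the regression summary can be checked by hand. The sketch below uses the correlation and the coefficients exactly as printed in the cor() and summary(fit1) output; the $100 million budget is a hypothetical input chosen only for illustration. It relies on the fact that in a simple linear regression with one predictor, Multiple R-squared equals the squared correlation.

```r
# Values copied from the cor() and summary(fit1) output
r         <- 0.5417298
intercept <- 1.519e+08
slope     <- 2.749

# With a single predictor, R-squared is the squared correlation
r_squared <- r^2

# Hypothetical budget of $100 million, chosen only for illustration
budget    <- 100e6
predicted <- intercept + slope * budget

round(r_squared, 4)
predicted
```

The rounded value agrees with the Multiple R-squared of 0.2935 in the summary, and the prediction comes to roughly $427 million in worldwide sales.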
cc <- nona |> select(title, budget_in_, world_wide_sales_in_) # Selecting the variables I want
summary(cc) # Showing summary statistics for the selected data
title budget_in_ world_wide_sales_in_
Length:719 Min. : 5000000 Min. :1.806e+08
Class :character 1st Qu.: 55000000 1st Qu.:2.311e+08
Mode :character Median : 90000000 Median :3.294e+08
Mean :101809458 Mean :4.317e+08
3rd Qu.:145000000 3rd Qu.:5.210e+08
Max. :356000000 Max. :2.924e+09
groups <- cc |> # Created a new dataset with categories for budget and worldwide sales
  mutate(
    # Create a new column 'budget_category' that categorizes movies based on budget ranges
    budget_category = case_when(
      budget_in_ < 500000 ~ "< $500,000", # budget less than $500,000
      budget_in_ >= 500000 & budget_in_ < 1000000 ~ "$500,000 - $1 Million",
      budget_in_ >= 1000000 & budget_in_ < 10000000 ~ "$1 Million - $10 Million",
      budget_in_ >= 10000000 & budget_in_ < 50000000 ~ "$10 Million - $50 Million",
      budget_in_ >= 50000000 & budget_in_ < 100000000 ~ "$50 Million - $100 Million",
      budget_in_ >= 100000000 & budget_in_ < 500000000 ~ "$100 Million - $500 Million",
      budget_in_ >= 500000000 & budget_in_ < 1000000000 ~ "$500 Million - $1 Billion"
    ),
    # Create a new column 'worldwide_category' that categorizes movies based on worldwide sales
    worldwide_category = case_when(
      world_wide_sales_in_ < 500000 ~ "< $500,000", # sales less than $500,000
      world_wide_sales_in_ >= 500000 & world_wide_sales_in_ < 1000000 ~ "$500,000 - $1 Million",
      world_wide_sales_in_ >= 1000000 & world_wide_sales_in_ < 10000000 ~ "$1 Million - $10 Million",
      world_wide_sales_in_ >= 10000000 & world_wide_sales_in_ < 50000000 ~ "$10 Million - $50 Million",
      world_wide_sales_in_ >= 50000000 & world_wide_sales_in_ < 100000000 ~ "$50 Million - $100 Million",
      world_wide_sales_in_ >= 100000000 & world_wide_sales_in_ < 500000000 ~ "$100 Million - $500 Million",
      world_wide_sales_in_ >= 500000000 & world_wide_sales_in_ < 1000000000 ~ "$500 Million - $1 Billion",
      world_wide_sales_in_ >= 1000000000 & world_wide_sales_in_ < 2000000000 ~ "$1 Billion - $2 Billion",
      world_wide_sales_in_ >= 2000000000 ~ "> $2 Billion" # sales of $2 Billion or more
    )
  )
head(groups) # Display the first few rows of the new dataset
# A tibble: 6 × 5
title budget_in_ world_wide_sales_in_ budget_category worldwide_category
<chr> <int> <dbl> <chr> <chr>
1 Avatar 237000000 2923706026 $100 Million -… > $2 Billion
2 Avengers: … 356000000 2799439100 $100 Million -… > $2 Billion
3 Titanic 200000000 2264743305 $100 Million -… > $2 Billion
4 Star Wars:… 245000000 2071310218 $100 Million -… > $2 Billion
5 Jurassic W… 150000000 1671537444 $100 Million -… $1 Billion - $2 B…
6 The Lion K… 260000000 1663075401 $100 Million -… $1 Billion - $2 B…
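As an alternative to spelling out each range by hand, the same budget bins could be produced with base R's cut(). This is a sketch of a different technique, not the method used in this report: the budgets vector is hypothetical, while the break points and labels mirror the budget ranges listed in the case_when() call.

```r
# Hypothetical budgets in dollars, chosen to land in different bins
budgets <- c(5e6, 55e6, 90e6, 237e6)

breaks <- c(0, 5e5, 1e6, 1e7, 5e7, 1e8, 5e8, 1e9)
labels <- c("< $500,000", "$500,000 - $1 Million", "$1 Million - $10 Million",
            "$10 Million - $50 Million", "$50 Million - $100 Million",
            "$100 Million - $500 Million", "$500 Million - $1 Billion")

# right = FALSE makes each bin [lower, upper), matching the >= / < rules
budget_category <- cut(budgets, breaks = breaks, labels = labels, right = FALSE)

as.character(budget_category)
```

Because the breaks live in one vector, changing or adding a range means editing a single number and its label.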
n <- groups |>
  select(title, budget_category, worldwide_category) |> # Select columns I want
  count(budget_category, worldwide_category) # Getting the number count of films in each category
head(n) # Display the first few rows
# A tibble: 6 × 3
budget_category worldwide_category n
<chr> <chr> <int>
1 $1 Million - $10 Million $100 Million - $500 Million 6
2 $10 Million - $50 Million $100 Million - $500 Million 130
3 $10 Million - $50 Million $500 Million - $1 Billion 7
4 $100 Million - $500 Million $1 Billion - $2 Billion 30
5 $100 Million - $500 Million $100 Million - $500 Million 192
6 $100 Million - $500 Million $500 Million - $1 Billion 106
ggplot(n, aes(x = budget_category, y = worldwide_category, fill = n)) + # telling R what I want in my plot
  geom_tile(color = "black") + # Add black grid lines to separate tiles
  scale_fill_distiller(palette = "Blues", direction = 1, name = "Movie Count") + # Add a blue gradient and use direction to make it go from light to dark
  theme_bw() + # Use a clean, white-background theme
  labs(
    title = "Budget vs. Worldwide Sales Distribution of \n Highest Grossing Movies", # Meaningful title, with "\n" to break it onto two lines
    x = "Budget Category", # Added a meaningful x-axis label
    y = "Worldwide Sales Category", # Added a meaningful y-axis label
    caption = "Source: Web Scraping" # Added the source for data
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels to read better
    legend.position = "right" # Place legend to the right
  )
ggplotly()
I cleaned up this data set by first converting all the column headers to lowercase. I removed the “$” symbol from the column names to prevent errors when referring to them. Additionally, I replaced spaces with underscores, removed parentheses from the column names, and eliminated all non-numeric characters from the budget column.
The visualization illustrates the relationship between budget and worldwide sales. One interesting finding is that six movies with a budget of $1 million to $10 million managed to generate between $100 million and $500 million in worldwide sales. However, one thing I was unable to accomplish was ordering the x- and y-axes in a logical sequence.
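One possible way to approach the axis ordering, offered as a sketch and not as something tested on this data set: ggplot2 orders a discrete axis by factor levels, and a plain character column falls back to alphabetical order. Converting each category column to a factor with explicitly listed levels should therefore put the tiles in low-to-high order. The vector sales_category below is hypothetical; the labels are the worldwide sales categories that appear in the output.

```r
# Category labels in the intended low-to-high order
sales_levels <- c("$100 Million - $500 Million", "$500 Million - $1 Billion",
                  "$1 Billion - $2 Billion", "> $2 Billion")

# Hypothetical values in arbitrary order
sales_category <- c("> $2 Billion", "$100 Million - $500 Million",
                    "$1 Billion - $2 Billion", "$500 Million - $1 Billion")

# A factor remembers the order given in 'levels'
ordered_cat <- factor(sales_category, levels = sales_levels)

levels(ordered_cat)
as.character(sort(ordered_cat))
```

Applied to the plot data, the same idea would mean wrapping budget_category and worldwide_category in factor() with their own level vectors before calling ggplot().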