data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
Prepping Data 1:
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
data$runtime <- as.numeric(data$runtime)
data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data <- data |>
  mutate(budget_per_runtime = ifelse(!is.na(runtime) & runtime > 0, budget / runtime, NA)) |>
  filter(!is.na(budget_per_runtime) & !is.na(popularity) & budget > 0 & runtime > 0)
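Here we prep the data by converting the budget, runtime and
popularity columns to numeric so that our
calculations work correctly. The coercion warnings show that some entries
could not be parsed as numbers and became NA. We then create a new
column, budget_per_runtime, by dividing the budget by the runtime,
setting it to NA where the runtime is missing or zero to avoid division by zero.
Finally, we keep only rows with a valid budget_per_runtime and popularity
and with a positive budget and runtime.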
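The coercion and guarded division described above can be checked on a small example. This is a minimal sketch in base R; the vectors `raw_budget` and `runtime` are hypothetical values invented for illustration, not rows from movies_metadata.csv.

```r
# Hypothetical raw values standing in for a character column such as budget
raw_budget <- c("30000000", "0", "unknown", "65000000", NA)

# Convert to numeric, suppressing the expected coercion warning
num_budget <- suppressWarnings(as.numeric(raw_budget))

# Entries that were present as text but failed to parse as numbers
failed <- raw_budget[!is.na(raw_budget) & is.na(num_budget)]
failed

# Guarded division: NA whenever runtime is missing or not positive
runtime <- c(120, 90, NA, 0, 100)
budget_per_runtime <- ifelse(!is.na(runtime) & runtime > 0,
                             num_budget / runtime, NA)
budget_per_runtime
```

Listing the entries that failed to parse before filtering makes it easier to judge whether the dropped rows are junk or recoverable values.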
ggplot(data, aes(x = budget_per_runtime, y = popularity)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Budget per Runtime vs. Popularity",
       x = "Budget per Runtime ($)", y = "Popularity Score") +
  theme_minimal()
Here we visualize the relationship between Popularity and Budget per Runtime; each data point represents a movie.
We can see that high popularity does not generally go together with a high budget per runtime.
The plot also reveals that some points fall far from the main
cluster, hinting at outliers. It suggests that certain films have
popularity that is out of proportion to their budget per runtime.
correlation_bpr_popularity <- cor(data$budget_per_runtime, data$popularity, use = "complete.obs")
correlation_bpr_popularity
## [1] 0.345524
Next, we compute the Pearson correlation coefficient to measure the linear relationship between the two variables.
The correlation coefficient is about 0.35, which suggests a weak to moderate positive correlation between Budget per Runtime and Popularity.
There are outliers, though, and the modest coefficient indicates that factors other than budget per runtime may influence a film's popularity.
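Because Pearson's coefficient can be sensitive to extreme points, a rank-based coefficient is one possible cross-check. The sketch below uses simulated, hypothetical data (not the movie dataset) and swaps in Spearman's method through the `method` argument of `cor()`; it is an illustration of the comparison, not part of the original analysis.

```r
# Simulated, hypothetical data: a right-skewed predictor and a noisy response
set.seed(1)
x <- rexp(200)
y <- 0.5 * x + rnorm(200)

# Inject a single extreme outlier
x[1] <- 100
y[1] <- 0

# Pearson measures linear association; Spearman works on ranks
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

If the two coefficients differ noticeably on the real data, that would be a sign that a few extreme films are driving the Pearson value.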
mean_bpr <- mean(data$budget_per_runtime, na.rm = TRUE) #Mean
se_bpr <- sd(data$budget_per_runtime, na.rm = TRUE) / sqrt(sum(!is.na(data$budget_per_runtime))) #Standard error
P <- 0.95 # % confidence
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
ci_bpr_z <- mean_bpr + c(-z_score, z_score) * se_bpr
print(paste("95% Confidence Interval for Budget Per Runtime: (", round(ci_bpr_z[1], 2), ",", round(ci_bpr_z[2], 2), ")"))
## [1] "95% Confidence Interval for Budget Per Runtime: ( 190168.11 , 202614.94 )"
We then compute the 95% confidence interval for the mean of budget_per_runtime using a z-score. This interval gives a range that is expected to contain the true mean with 95% confidence.
The 95% confidence interval indicates that mean spending lies between about $190,168.11 and $202,614.94 per minute of runtime. It describes the mean, not the range within which most individual films fall.
This range provides an estimate of average per-minute spending across the films in this dataset, giving insight into how resources are distributed in film production.
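A confidence interval for a mean and the spread of individual values answer different questions. The following sketch contrasts the two on simulated, hypothetical right-skewed data (the vector `spend` is invented and does not reproduce the dataset's values).

```r
set.seed(42)
# Hypothetical right-skewed per-minute spending values (not the real data)
spend <- rlnorm(5000, meanlog = 11, sdlog = 1.2)

# 95% confidence interval for the mean (normal approximation)
se <- sd(spend) / sqrt(length(spend))
ci_mean <- mean(spend) + c(-1, 1) * qnorm(0.975) * se

# Central 95% range of the individual values
range_95 <- quantile(spend, probs = c(0.025, 0.975))

ci_mean
range_95
```

The interval for the mean shrinks as the sample grows, while the quantile range reflects how widely individual observations vary.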
Prepping Data 2:
data$budget <- as.numeric(data$budget)
data$revenue <- as.numeric(data$revenue)
data <- data %>%
mutate(profit = revenue - budget)
Here we prep the data, ensuring that our budget and revenue columns are converted to numeric. We repeat the budget conversion so that this section can be run without the first one. We then create a new variable called profit by subtracting budget from revenue.
plot(data$revenue, data$profit,
     main = "Revenue vs. Profit",
     xlab = "Revenue", ylab = "Profit", pch = 16, col = "red")
Here we visualize the relationship between Revenue and Profit to gain a better understanding, allowing us to see general trends and notice any outliers.
The scatterplot indicates that movies generating higher revenue tend to generate more profit.
Most of the points follow a roughly linear cluster, suggesting few outliers. Any outliers are likely to be movies with extremely high revenue and profit.
correlation_Revenue_Profit <- cor(data$revenue, data$profit, use = "complete.obs")
correlation_Revenue_Profit
## [1] 0.9791504
Next, we compute the Pearson correlation coefficient to measure the linear relationship between the two variables.
The correlation coefficient (about 0.98, very close to 1) confirms our understanding of the scatterplot.
It indicates that profit is strongly associated with revenue, which is partly by construction because profit is computed as revenue minus budget.
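Since profit is defined from revenue, some of this correlation is built in. The sketch below illustrates the effect with hypothetical revenue and budget values that are simulated independently of each other; the distributions and parameters are assumptions chosen only for illustration.

```r
set.seed(7)
# Hypothetical, independently simulated revenue and budget (not the real data)
revenue <- rlnorm(1000, meanlog = 17, sdlog = 1.5)
budget  <- rlnorm(1000, meanlog = 16, sdlog = 1.0)
profit  <- revenue - budget

# Revenue and budget were generated independently of each other
cor_rev_budget <- cor(revenue, budget)

# Profit contains revenue by definition, so the two move together
cor_rev_profit <- cor(revenue, profit)

c(revenue_budget = cor_rev_budget, revenue_profit = cor_rev_profit)
```

When revenue varies much more than budget, as in this simulation, revenue dominates the profit calculation and the correlation is high even without any real link between revenue and budget.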
avg_profit <- mean(data$profit, na.rm = TRUE)
n <- sum(!is.na(data$profit)) # Number of non-missing profit values
profit_se_boot <- sd(data$profit, na.rm = TRUE) / sqrt(n) # Analytic standard error of the mean (no resampling, despite the name)
P <- 0.95 # % confidence
t_star <- qt(p = (1 - P)/2, df = n - 1, lower.tail = FALSE)
CI_t <- t_star * profit_se_boot
z_score <- qnorm(p = (1 - P)/2, lower.tail = FALSE)
CI_z <- z_score * profit_se_boot
CI_t_interval <- c(avg_profit - CI_t, avg_profit + CI_t) # t-distribution
CI_z_interval <- c(avg_profit - CI_z, avg_profit + CI_z) # normal distribution
print(paste("95% Confidence Interval for Profit (T-distribution): (", round(CI_t_interval[1], 2), ",", round(CI_t_interval[2], 2), ")"))
## [1] "95% Confidence Interval for Profit (T-distribution): ( 30968633.06 , 35726784.42 )"
print(paste("95% Confidence Interval for Profit (Normal-distribution): (", round(CI_z_interval[1], 2), ",", round(CI_z_interval[2], 2), ")"))
## [1] "95% Confidence Interval for Profit (Normal-distribution): ( 30968959.61 , 35726457.86 )"
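Using the standard error of the mean (the code does not resample, so this is not a bootstrap despite the variable name profit_se_boot), we computed a 95% confidence interval for the mean profit using both the t-distribution and the normal distribution.
The t-distribution confidence interval is (30,968,633.06 to 35,726,784.42), while the normal-distribution interval is (30,968,959.61 to 35,726,457.86), suggesting that the true mean falls within this range. The two intervals are nearly identical because the sample is large.
The positive interval suggests that the average movie in this filtered dataset is profitable. Because a mean can be pulled up by a few very profitable films, it does not by itself show that most movies are successful.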
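A percentile bootstrap interval for a mean resamples the observed values with replacement and takes quantiles of the resampled means. The sketch below shows the idea in base R on hypothetical skewed data standing in for data$profit; the number of resamples `B` is an arbitrary choice, and none of this reproduces the figures reported above.

```r
set.seed(123)
# Hypothetical skewed profit values standing in for data$profit
profit <- rlnorm(500, meanlog = 15, sdlog = 1) - 2e6

B <- 2000  # number of bootstrap resamples (an arbitrary choice)

# Mean of each resample drawn with replacement from the observed values
boot_means <- replicate(B, mean(sample(profit, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the mean
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))
ci_boot
```

On the real data, missing values would need to be dropped first, for example with `profit <- data$profit[!is.na(data$profit)]`.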