data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

Approach:

In the first part of this analysis, we explore the relationship between a movie’s popularity and its budget per runtime. Because runtime varies significantly across films, dividing the budget by runtime normalizes the data and gives a clearer picture of how resources are used. It also lets us explore whether higher spending per minute is associated with greater audience interest. In the second part we analyse the relationship between a movie’s profit and its revenue.

Popularity vs. Budget Per Runtime

Prepping Data 1:

data$budget <- as.numeric(data$budget)

## Warning: NAs introduced by coercion

data$runtime <- as.numeric(data$runtime)
data$popularity <- as.numeric(data$popularity)

## Warning: NAs introduced by coercion

data <- data |>
  mutate(budget_per_runtime = ifelse(!is.na(runtime) & runtime > 0, budget / runtime, NA)) |>
  filter(!is.na(budget_per_runtime) & !is.na(popularity) & budget > 0 & runtime > 0)

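Here we prep the data, converting the budget, runtime and popularity columns to numeric so that our calculations work correctly. Values that cannot be parsed as numbers become NA, which is what the coercion warnings report. We then create a new column, budget_per_runtime, by dividing the budget by the runtime; rows with a missing or non-positive runtime are set to NA to avoid division by zero. Finally, we keep only rows with a valid budget_per_runtime, a non-missing popularity, and a positive budget.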
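To make the coercion and the division-by-zero guard concrete, here is a minimal base R sketch on a made-up data frame. The name `toy` and all of its values are hypothetical and are not taken from movies_metadata.csv.

```r
# Hypothetical toy data; not from movies_metadata.csv
toy <- data.frame(
  budget  = c("1000000", "0", "not_a_number", "500000"),
  runtime = c("100", "90", "120", "0"),
  stringsAsFactors = FALSE
)

# suppressWarnings() hides the "NAs introduced by coercion" warning
toy$budget  <- suppressWarnings(as.numeric(toy$budget))
toy$runtime <- suppressWarnings(as.numeric(toy$runtime))

# Number of budget values that could not be parsed as numbers
n_bad_budget <- sum(is.na(toy$budget))

# Divide only where the runtime is known and positive; otherwise use NA
ok <- !is.na(toy$runtime) & toy$runtime > 0
toy$budget_per_runtime <- ifelse(ok, toy$budget / toy$runtime, NA_real_)

# Keep rows with a usable ratio and a positive budget
toy_clean <- toy[!is.na(toy$budget_per_runtime) & toy$budget > 0, ]
toy_clean
```

Counting the NA values before filtering is a cheap way to check how many rows the coercion step silently discards.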

ggplot(data, aes(x = budget_per_runtime, y = popularity)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Budget per Runtime vs. Popularity",
       x = "Budget per Runtime ($)", y = "Popularity Score") +
  theme_minimal()

Here we visualize the relationship between Popularity and Budget Per Runtime. Each data point represents a movie.

We can see that a high popularity score does not necessarily mean a film has a high budget per runtime.
The plot also reveals that some points fall far from the main cluster, hinting at outliers. This suggests that certain films have a popularity that is out of proportion to their budget per runtime.
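One way to make "far from the main cluster" concrete is the common 1.5 × IQR rule. This is a convention brought in for illustration, not something the analysis above uses, and the vector `toy_pop` is made up.

```r
# Hypothetical popularity scores; the last value is deliberately extreme
toy_pop <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 60)

# First and third quartiles (R's default quantile type)
q <- quantile(toy_pop, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Flag values outside the fences
is_outlier <- toy_pop < lower_fence | toy_pop > upper_fence
toy_pop[is_outlier]
```

The same rule could be applied to data$popularity or data$budget_per_runtime, although for heavily skewed variables it may flag a large share of the rows.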

correlation_bpr_popularity <- cor(data$budget_per_runtime, data$popularity, use = "complete.obs")

correlation_bpr_popularity

## [1] 0.345524

Next, we compute the Pearson correlation coefficient to measure the linear relationship between the two variables.

The correlation coefficient is about 0.35, which suggests a weak to moderate positive correlation between Budget per Runtime and Popularity.
The outliers seen in the scatterplot indicate that factors other than budget per runtime may also influence a film’s popularity.
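Because Pearson's coefficient is sensitive to extreme points, a rank-based coefficient can serve as a cross-check. The sketch below compares the two on made-up vectors (`toy_x` and `toy_y` are hypothetical); Spearman's method is a suggestion here, not part of the original analysis.

```r
# Hypothetical values; the last x value is an extreme outlier
toy_x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
toy_y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)

# Pearson measures linear association and is pulled by the outlier
pearson <- cor(toy_x, toy_y, method = "pearson")

# Spearman works on ranks, so the size of the outlier does not matter
spearman <- cor(toy_x, toy_y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

On the movie data, the same comparison would be cor(data$budget_per_runtime, data$popularity, method = "spearman", use = "complete.obs").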

mean_bpr <- mean(data$budget_per_runtime, na.rm = TRUE) #Mean
se_bpr <- sd(data$budget_per_runtime, na.rm = TRUE) / sqrt(sum(!is.na(data$budget_per_runtime))) #Standard error

P <- 0.95  # % confidence

z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)

ci_bpr_z <- mean_bpr + c(-z_score, z_score) * se_bpr

print(paste("95% Confidence Interval for Budget Per Runtime: (", round(ci_bpr_z[1], 2), ",", round(ci_bpr_z[2], 2), ")"))

## [1] "95% Confidence Interval for Budget Per Runtime: ( 190168.11 , 202614.94 )"

We then compute a 95% confidence interval for the mean of budget_per_runtime using a z-score. This interval gives a range of plausible values for the true mean at the 95% confidence level.

The interval runs from about $190,168 to $202,615 per minute. It describes the uncertainty in the average per-minute spending, not the range within which most individual films fall.
It therefore estimates the average per-minute spending of the films in this dataset, giving insight into how resources are distributed in film production.
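A confidence interval for a mean is easy to confuse with the range that covers most observations. The sketch below contrasts the two on made-up per-minute budgets; `toy_bpr` and its values are hypothetical.

```r
# Hypothetical per-minute budgets in dollars (200 films, strongly skewed)
toy_bpr <- rep(c(20000, 50000, 100000, 200000, 900000), times = 40)

toy_n  <- length(toy_bpr)
toy_m  <- mean(toy_bpr)
toy_se <- sd(toy_bpr) / sqrt(toy_n)   # standard error of the mean
z      <- qnorm(0.975)                # two-sided 95% critical value

# Interval for the MEAN: mean +/- z * se
ci_mean <- toy_m + c(-1, 1) * z * toy_se

# Central 95% range of the FILMS themselves
central_range <- quantile(toy_bpr, probs = c(0.025, 0.975), names = FALSE)

# Share of individual films that lie inside the interval for the mean
share_inside_ci <- mean(toy_bpr >= ci_mean[1] & toy_bpr <= ci_mean[2])
```

In this toy sample none of the individual values falls inside the interval for the mean, even though the interval is a perfectly valid estimate of where the mean lies.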

Revenue vs. Profit

Prepping Data 2:

data$budget <- as.numeric(data$budget)
data$revenue <- as.numeric(data$revenue)

data <- data %>%
  mutate(profit = revenue - budget)

Here we prep the data, ensuring that the budget and revenue columns are numeric. We repeat the budget conversion in case someone runs the second part without running the first. We then create a new variable called profit by subtracting budget from revenue.

plot(data$revenue, data$profit, 
    main = "Revenue vs. Profit", 
    xlab = "Revenue", ylab = "Profit", pch = 16, col = "red")

Here we visualize the relationship between Revenue and Profit to gain a better understanding, allowing us to see general trends and notice any outliers.

The scatterplot indicates that movies generating higher revenues tend to generate more profit.
Most of the points fall along a roughly linear band, hinting at few outliers. Outliers may include movies with extremely high revenue and profit.

correlation_Revenue_Profit <- cor(data$revenue, data$profit, use = "complete.obs")
correlation_Revenue_Profit

## [1] 0.9791504

Next, we compute the Pearson correlation coefficient to measure the linear relationship between the two variables.

The correlation coefficient, being very close to 1, confirms what the scatterplot shows.
It indicates that profit is strongly associated with revenue. This is partly expected, since profit is computed directly from revenue.
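Since profit is defined as revenue minus budget, part of this correlation is built in. The sketch below checks the identity cov(revenue, profit) = var(revenue) - cov(revenue, budget) on made-up numbers; the `toy_` vectors are hypothetical.

```r
# Hypothetical revenues and budgets (arbitrary units)
toy_revenue <- c(10, 20, 30, 40, 50)
toy_budget  <- c(2, 8, 3, 9, 4)
toy_profit  <- toy_revenue - toy_budget

# Correlation between revenue and the profit derived from it
r_rev_profit <- cor(toy_revenue, toy_profit)

# Covariance identity that follows from profit = revenue - budget
lhs <- cov(toy_revenue, toy_profit)
rhs <- var(toy_revenue) - cov(toy_revenue, toy_budget)

all.equal(lhs, rhs)
```

When revenue varies much more than budget, the var(revenue) term dominates and the correlation is pushed toward 1.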

avg_profit <- mean(data$profit, na.rm = TRUE)
n <- sum(!is.na(data$profit))  # Number of non-missing profit values
profit_se <- sd(data$profit, na.rm = TRUE) / sqrt(n)  # Analytic standard error of the mean (no resampling)

P <- 0.95  # Confidence level

t_star <- qt(p = (1 - P)/2, df = n - 1, lower.tail = FALSE)  # t critical value

CI_t <- t_star * profit_se  # Margin of error (t-distribution)

z_score <- qnorm(p = (1 - P)/2, lower.tail = FALSE)  # Normal critical value

CI_z <- z_score * profit_se  # Margin of error (normal distribution)

CI_t_interval <- c(avg_profit - CI_t, avg_profit + CI_t)  # t-distribution
CI_z_interval <- c(avg_profit - CI_z, avg_profit + CI_z)  # normal distribution

print(paste("95% Confidence Interval for Profit (T-distribution): (", round(CI_t_interval[1], 2), ",", round(CI_t_interval[2], 2), ")"))

## [1] "95% Confidence Interval for Profit (T-distribution): ( 30968633.06 , 35726784.42 )"

print(paste("95% Confidence Interval for Profit (Normal-distribution): (", round(CI_z_interval[1], 2), ",", round(CI_z_interval[2], 2), ")"))

## [1] "95% Confidence Interval for Profit (Normal-distribution): ( 30968959.61 , 35726457.86 )"

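Here we computed a 95% confidence interval for the mean profit using both the t-distribution and the normal distribution. Both intervals are based on the analytic standard error (standard deviation divided by the square root of n); no bootstrap resampling is performed.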
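The t and normal intervals printed above differ only by a few hundred dollars. The sketch below shows why: as the degrees of freedom grow, the t critical value approaches the normal one. The list of degrees of freedom in `dfs` is an arbitrary choice for illustration.

```r
conf_level <- 0.95
dfs <- c(5, 30, 100, 1000, 10000)   # arbitrary degrees of freedom

# qt() is vectorised over df, so this returns one critical value per entry
t_crit <- qt((1 - conf_level) / 2, df = dfs, lower.tail = FALSE)
z_crit <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)

# Gap between the t and normal critical values shrinks as df grows
data.frame(df = dfs,
           t_crit = round(t_crit, 4),
           gap = round(t_crit - z_crit, 4))
```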

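The t-distribution confidence interval is (30,968,633.06 to 35,726,784.42), while the normal-distribution interval is (30,968,959.61 to 35,726,457.86). The two are nearly identical, as expected for a large sample, and both give a range of plausible values for the true mean profit.
The high mean profit suggests that movies in this dataset are profitable on average, although the mean alone does not show that most individual movies are successful.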
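The standard error above is computed analytically as sd / sqrt(n). For comparison, a percentile bootstrap estimates the interval by resampling the data with replacement. The following is only a sketch on made-up profits; `toy_profit`, `n_boot`, and the seed are assumptions, not part of the original analysis.

```r
set.seed(42)  # fixed seed so the resampling is reproducible

# Hypothetical profits in millions of dollars
toy_profit <- c(-5, 2, 3, 8, 12, 15, 20, 40, 55, 120)

n_boot <- 2000  # number of bootstrap resamples (arbitrary choice)

# Resample with replacement and record the mean of each resample
boot_means <- replicate(n_boot, mean(sample(toy_profit, replace = TRUE)))

# Percentile bootstrap 95% interval and bootstrap standard error
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)
boot_se <- sd(boot_means)

boot_ci
```

On the movie data, the same idea would resample data$profit after dropping NA values. For skewed profit distributions the percentile interval can be asymmetric around the mean, unlike the t and normal intervals.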

Data Dive Week 6

2025-02-20
