Introduction

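This document analyzes the TMDB movie dataset. It loads the data, derives profit-related variables, plots the distributions of and relationships between the main numeric variables, and then computes pairwise correlations and confidence intervals for the mean Score and mean Revenue.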

Data Loading and Processing

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and dplyr are already attached by tidyverse, so they are not loaded again
library(ggthemes)
library(gapminder)
# Read the dataset
tmdb <- read.csv("TMDB.csv")
head(tmdb)
##                         Names                  Orig_title           Orig_lang
## 1                   Creed III                   Creed III             English
## 2    Avatar: The Way of Water    Avatar: The Way of Water             English
## 3 The Super Mario Bros. Movie The Super Mario Bros. Movie             English
## 4                     Mummies                      Momias  Spanish, Castilian
## 5                   Supercell                   Supercell             English
## 6                Cocaine Bear                Cocaine Bear             English
##                                           Genre Release_Date Score Budget
## 1                                 Drama, Action   02-03-2023    73   75.0
## 2            Science Fiction, Adventure, Action   15-12-2022    78  460.0
## 3 Animation, Adventure, Family, Fantasy, Comedy   05-04-2023    76  100.0
## 4 Animation, Comedy, Family, Adventure, Fantasy   05-01-2023    70   12.3
## 5                                        Action   17-03-2023    61   77.0
## 6                       Thriller, Comedy, Crime   23-02-2023    66   35.0
##     Revenue    Status Country
## 1  271.6167  Released      AU
## 2 2316.7949  Released      AU
## 3  724.4590  Released      AU
## 4   34.2000  Released      AU
## 5  340.9420  Released      US
## 6   80.0000  Released      AU
##                                                                                                                                                                                                                                                                                                              Crew
## 1          Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duke' Evers, Phylicia Rashād, Mary Anne Creed, Mila Davis-Kent, Amara Creed, Florian Munteanu, Viktor Drago, José Benavidez Jr., Felix Chavez, Selenis Leyva, Laura Chavez
## 2                                    Sam Worthington, Jake Sully, Zoe Saldaña, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Miles Quaritch, Kate Winslet, Ronal, Cliff Curtis, Tonowari, Joel David Moore, Norm Spellman, CCH Pounder, Mo'at, Edie Falco, General Frances Ardmore
## 3 Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Keegan-Michael Key, Toad (voice), Seth Rogen, Donkey Kong (voice), Fred Armisen, Cranky Kong (voice), Kevin Michael Richardson, Kamek (voice), Sebastian Maniscalco, Spike (voice)
## 4    Óscar Barberán, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis Pérez Reina, Carnaby (voice), María Luisa Solá, Madre (voice), Jaume Solà, Sekhem (voice), José Luis Mediavilla, Ed (voice), José Javier Serrano Rodríguez, Danny (voice), Aleix Estadella, Dennis (voice), María Moscardó, Usi (voice)
## 5                                                                Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quinn Brody, Daniel Diemer, William Brody, Jordan Kristine Seamón, Harper Hunter, Alec Baldwin, Zane Rogers, Richard Gunn, Bill Brody, Praya Lundberg, Amy, Johnny Wactor, Martin, Anjul Nigam, Ramesh
## 6                                                                      Keri Russell, Sari, Alden Ehrenreich, Eddie, O'Shea Jackson Jr., Daveed, Ray Liotta, Syd, Kristofer Hivju, Olaf (Kristoffer), Margo Martindale, Ranger Liz, Christian Convery, Henry, Isiah Whitlock Jr., Bob, Jesse Tyler Ferguson, Peter
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Overview
## 1 After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodigy, Damien Anderson, resurfaces after serving a long sentence in prison, he is eager to prove that he deserves his shot in the ring. The face-off between former friends is more than just a fight. To settle the score, Adonis must put his future on the line to battle Damien — a fighter who has nothing to lose.
## 2                                                                                                                                                                                           Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.
## 3                                                                                                                                                                                                               While working underground to fix a water main, Brooklyn plumbers—and brothers—Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi.
## 4                                                                                                                                                                                                                                     Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey in search of an old ring belonging to the Royal Family, stolen by ambitious archaeologist Lord Carnaby.
## 5                                                        Good-hearted teenager William always lived in hope of following in his late father’s footsteps and becoming a storm chaser. His father’s legacy has now been turned into a storm-chasing tourist business, managed by the greedy and reckless Zane Rogers, who is now using William as the main attraction to lead a group of unsuspecting adventurers deep into the eye of the most dangerous supercell ever seen.
## 6                                                                                                                                                                                                                                                           Inspired by a true story, an oddball group of cops, criminals, tourists and teens converge in a Georgia forest where a 500-pound black bear goes on a murderous rampage after unintentionally ingesting cocaine.
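# Create new variables
# Profit: revenue minus budget, in the same units as Revenue and Budget
tmdb$Profit <- tmdb$Revenue - tmdb$Budget
# ROI: profit as a percentage of budget
tmdb$ROI <- (tmdb$Profit / tmdb$Budget) * 100
# ProfitMargin: profit as a percentage of revenue
# Note: both ratios are non-finite (Inf, -Inf, or NaN) wherever the denominator is 0
tmdb$ProfitMargin <- (tmdb$Profit / tmdb$Revenue) * 100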
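ROI and ProfitMargin divide by Budget and Revenue, so any row with a zero denominator produces a non-finite value. The plot warning and the NaN correlation for ProfitMargin later in this document are consistent with that. The following is a minimal sketch of one way to return NA instead, shown on a small made-up data frame. The names `safe_ratio` and `toy` are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: percentage ratio that yields NA when the denominator is 0.
safe_ratio <- function(num, den) {
  ifelse(den == 0, NA_real_, num / den * 100)
}

# Made-up rows for illustration only (not taken from TMDB.csv).
toy <- data.frame(Budget = c(50, 20, 10), Revenue = c(100, 0, 5))
toy$Profit <- toy$Revenue - toy$Budget
toy$ROI <- safe_ratio(toy$Profit, toy$Budget)
toy$ProfitMargin <- safe_ratio(toy$Profit, toy$Revenue)
toy
```

With NA in place of Inf, functions such as `cor()` and `mean()` can skip the affected rows through their usual `use = "complete.obs"` or `na.rm = TRUE` arguments.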

Visualizations

Histograms

# One histogram per variable; .data[[var]] replaces the deprecated aes_string()
hist_vars <- c("Score", "Revenue", "Budget", "Profit")
for (var in hist_vars) {
  print(ggplot(tmdb, aes(x = .data[[var]])) +
    geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
    labs(x = var) +
    ggtitle(paste("Histogram of", var)))
}
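The histograms use a fixed `binwidth = 10` for variables on very different scales: the Score values in the preview are two-digit numbers, while Revenue reaches 2316.79. One common alternative is the Freedman-Diaconis rule, which sets the bin width to 2 * IQR / n^(1/3). The sketch below applies it to simulated values. `fd_binwidth` is a hypothetical helper name, and the simulated vectors only stand in for the real columns.

```r
# Hypothetical helper: Freedman-Diaconis bin width, ignoring non-finite values.
fd_binwidth <- function(x) {
  x <- x[is.finite(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

set.seed(1)
scores  <- rnorm(1000, mean = 63, sd = 13)          # simulated, roughly Score-like
revenue <- rlnorm(1000, meanlog = 4, sdlog = 1.5)   # simulated, right-skewed

# Suggested bin width for each variable, in that variable's own units
c(Score = fd_binwidth(scores), Revenue = fd_binwidth(revenue))
```

The result could be passed to `geom_histogram(binwidth = ...)` per variable instead of a shared constant.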

Boxplots

# One boxplot per variable; .data[[var]] replaces the deprecated aes_string()
box_vars <- c("Score", "Revenue", "Budget")
for (var in box_vars) {
  print(ggplot(tmdb, aes(x = 1, y = .data[[var]])) +
    geom_boxplot() +
    labs(y = var) +
    ggtitle(paste("Boxplot of", var)))
}
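By default, `geom_boxplot()` draws as individual points any observations that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces that rule in base R on made-up numbers, which can help when listing the flagged rows rather than only viewing them. `iqr_outliers` is a hypothetical helper name.

```r
# Hypothetical helper: values beyond 1.5 * IQR from the first or third quartile.
iqr_outliers <- function(x, coef = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  x[!is.na(x) & (x < q[1] - coef * spread | x > q[2] + coef * spread)]
}

# Made-up values for illustration only
x <- c(10, 12, 13, 15, 16, 18, 20, 95)
iqr_outliers(x)
```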

Scatter Plots with Regression Lines

# Score vs. Budget and Score vs. Profit
print(ggplot(tmdb, aes(x = Budget, y = Score)) + 
  geom_point() + 
  geom_smooth(method = lm, col = "red") +
  ggtitle("Score vs. Budget with Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'

print(ggplot(tmdb, aes(x = Profit, y = Score)) + 
  geom_point() + 
  geom_smooth(method = lm, col = "red") +
  ggtitle("Score vs. Profit with Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'

# Revenue vs. Budget and Revenue vs. ROI
print(ggplot(tmdb, aes(x = Budget, y = Revenue)) + 
  geom_point() + 
  geom_smooth(method = lm, col = "red") +
  ggtitle("Revenue vs. Budget with Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'

print(ggplot(tmdb, aes(x = ROI, y = Revenue)) + 
  geom_point() + 
  geom_smooth(method = lm, col = "red") +
  ggtitle("Revenue vs. ROI with Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'

# Score vs. Revenue and ProfitMargin
print(ggplot(tmdb, aes(x = Revenue, y = Score)) + 
  geom_point() + 
  geom_smooth(method = lm, col = "red") +
  ggtitle("Score vs. Revenue with Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'

print(ggplot(tmdb, aes(x = ProfitMargin, y = Score)) + 
  geom_point() + 
  geom_smooth(method = lm, col = "red") +
  ggtitle("Score vs. ProfitMargin with Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 73 rows containing non-finite values (`stat_smooth()`).
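The red lines in these plots come from `geom_smooth(method = lm)`, which fits an ordinary least-squares line of the form y ~ x. To read off the slope and fit quality numerically, the same model can be fitted with `lm()` directly. The sketch below uses a small made-up data frame, so the numbers it prints say nothing about the TMDB data.

```r
# Made-up values for illustration only (not taken from TMDB.csv)
toy <- data.frame(
  Budget  = c(10, 20, 40, 80, 160),
  Revenue = c(15, 45, 70, 210, 380)
)

# Same model that geom_smooth(method = lm) draws for Revenue vs. Budget
fit <- lm(Revenue ~ Budget, data = toy)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of variance in Revenue explained by Budget
```

For a single-predictor model, R-squared equals the squared Pearson correlation between the two variables, which links these plots to the correlation table in the next section.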

Correlation Coefficients

cor_set1_budget <- cor(tmdb$Budget, tmdb$Score)
cor_set1_profit <- cor(tmdb$Profit, tmdb$Score)

cor_set2_budget <- cor(tmdb$Budget, tmdb$Revenue)
cor_set2_ROI <- cor(tmdb$ROI, tmdb$Revenue)

cor_set3_revenue <- cor(tmdb$Revenue, tmdb$Score)
cor_set3_profitMargin <- cor(tmdb$ProfitMargin, tmdb$Score)

correlations <- data.frame(
  Set = c("Set 1 (Budget)", "Set 1 (Profit)", "Set 2 (Budget)", "Set 2 (ROI)", "Set 3 (Revenue)", "Set 3 (ProfitMargin)"),
  Correlation = c(cor_set1_budget, cor_set1_profit, cor_set2_budget, cor_set2_ROI, cor_set3_revenue, cor_set3_profitMargin)
)

correlations
##                    Set   Correlation
## 1       Set 1 (Budget) -0.2354700991
## 2       Set 1 (Profit)  0.1656486921
## 3       Set 2 (Budget)  0.6738295692
## 4          Set 2 (ROI) -0.0009230343
## 5      Set 3 (Revenue)  0.0965328705
## 6 Set 3 (ProfitMargin)           NaN
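The last row is NaN because ProfitMargin contains non-finite values, as the plot warning above also indicates, and `cor()` cannot produce a usable result from them. One possible remedy is to restrict the calculation to rows where both variables are finite. The sketch below shows this on made-up vectors. `finite_cor` is a hypothetical helper, not a function from any package.

```r
# Hypothetical helper: Pearson correlation over pairs where both values are finite.
finite_cor <- function(x, y) {
  keep <- is.finite(x) & is.finite(y)
  cor(x[keep], y[keep])
}

# Made-up values for illustration only; the -Inf mimics a zero-revenue row
margin <- c(50, -Inf, -100, 20, 35)
score  <- c(70, 55, 40, 62, 66)

cor(margin, score)         # not a usable number because of the -Inf
finite_cor(margin, score)  # computed from the four finite pairs
```

Dropping rows changes the sample, so the number of excluded rows should be reported alongside the correlation.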

Confidence Intervals

conf_int_score <- t.test(tmdb$Score)$conf.int
conf_int_revenue <- t.test(tmdb$Revenue)$conf.int

confidence_intervals <- data.frame(
  Variable = c("Score", "Revenue"),
  Lower_Bound = c(conf_int_score[1], conf_int_revenue[1]),
  Upper_Bound = c(conf_int_score[2], conf_int_revenue[2])
)

confidence_intervals
##   Variable Lower_Bound Upper_Bound
## 1    Score    63.23403    63.76007
## 2  Revenue   247.74272   258.53746
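`t.test()` with its default settings returns a 95% confidence interval for the mean, computed as the sample mean plus or minus the t critical value times the standard error. The sketch below reproduces that calculation by hand on simulated data and checks it against `t.test()`. The simulated vector only stands in for a column such as Score.

```r
set.seed(42)
x <- rnorm(200, mean = 63.5, sd = 13)  # simulated values, for illustration only

n      <- length(x)
se     <- sd(x) / sqrt(n)               # standard error of the mean
t_crit <- qt(0.975, df = n - 1)         # two-sided 95% critical value

manual  <- mean(x) + c(-1, 1) * t_crit * se
builtin <- as.numeric(t.test(x)$conf.int)

all.equal(manual, builtin)
```

The intervals describe uncertainty about the mean, not the range of individual movies, which is why they are narrow even though the variables themselves are widely spread.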

Conclusion

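The results above support a few concrete observations. Budget and Revenue are fairly strongly correlated (r ≈ 0.67), whereas Score is only weakly related to the financial variables: r ≈ -0.24 with Budget, 0.17 with Profit, and 0.10 with Revenue. ROI is essentially uncorrelated with Revenue (r ≈ -0.001). The Score vs. ProfitMargin correlation is NaN because ProfitMargin contains non-finite values (73 rows were dropped from the corresponding plot), so that pair needs cleaning before it can be interpreted. The 95% confidence intervals for the means are 63.23 to 63.76 for Score and 247.74 to 258.54 for Revenue. These are descriptive, pairwise results; further statistical tests and modeling would be needed to support predictions from this data.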