The IMDB dataset contains information about movies, including their names, release dates, user ratings, genres, overviews, cast and crew members, original titles, production status, original languages, budgets, revenues, and countries of origin. This data can be used for various analyses, such as identifying trends in movie genres, exploring the relationship between budget and revenue, and predicting the success of future movies.
# Load the required packages
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(plyr)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(vcd)
## Loading required package: grid
# Read the raw CSV (the path is specific to the author's machine)
movieData <- read.csv('C:/Users/govin/OneDrive/Desktop/RStudio/Data/imdb_movies.csv')
# Replace "/" with "-" and parse date_x as month-day-year in one vectorised step
movieData$date_x <- as.Date(gsub("/", "-", movieData$date_x), format = "%m-%d-%Y")
movieData <- type_convert(movieData)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## names = col_character(),
## genre = col_character(),
## overview = col_character(),
## crew = col_character(),
## orig_title = col_character(),
## status = col_character(),
## orig_lang = col_character(),
## country = col_character()
## )
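The date handling above can be illustrated on a few made-up strings. This is only a sketch: `raw_dates` is a hypothetical vector, not a slice of the dataset, and it assumes the month/day/year layout implied by the format string used earlier.

```r
# Hypothetical sample strings in month/day/year form (not taken from the dataset)
raw_dates <- c("03/02/2023", "12/15/2022", "04/05/2023")

# Replace "/" with "-" in one vectorised call, then parse as month-day-year
parsed <- as.Date(gsub("/", "-", raw_dates), format = "%m-%d-%Y")

print(parsed)
print(class(parsed))
```

Because `gsub()` and `as.Date()` are both vectorised, no `sapply()` or `lapply()` wrapper is needed for a single column.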
head(movieData, 10)
## names date_x score
## 1 Creed III 2023-03-02 73
## 2 Avatar: The Way of Water 2022-12-15 78
## 3 The Super Mario Bros. Movie 2023-04-05 76
## 4 Mummies 2023-01-05 70
## 5 Supercell 2023-03-17 61
## 6 Cocaine Bear 2023-02-23 66
## 7 John Wick: Chapter 4 2023-03-23 80
## 8 Puss in Boots: The Last Wish 2022-12-26 83
## 9 Attack on Titan 2022-09-30 59
## 10 The Park 2023-03-02 58
## genre
## 1 Drama, Action
## 2 Science Fiction, Adventure, Action
## 3 Animation, Adventure, Family, Fantasy, Comedy
## 4 Animation, Comedy, Family, Adventure, Fantasy
## 5 Action
## 6 Thriller, Comedy, Crime
## 7 Action, Thriller, Crime
## 8 Animation, Family, Fantasy, Adventure, Comedy
## 9 Action, Science Fiction
## 10 Action, Drama, Horror, Science Fiction, Thriller
## overview
## 1 After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodigy, Damien Anderson, resurfaces after serving a long sentence in prison, he is eager to prove that he deserves his shot in the ring. The face-off between former friends is more than just a fight. To settle the score, Adonis must put his future on the line to battle Damien — a fighter who has nothing to lose.
## 2 Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.
## 3 While working underground to fix a water main, Brooklyn plumbers—and brothers—Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi.
## 4 Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey in search of an old ring belonging to the Royal Family, stolen by ambitious archaeologist Lord Carnaby.
## 5 Good-hearted teenager William always lived in hope of following in his late father’s footsteps and becoming a storm chaser. His father’s legacy has now been turned into a storm-chasing tourist business, managed by the greedy and reckless Zane Rogers, who is now using William as the main attraction to lead a group of unsuspecting adventurers deep into the eye of the most dangerous supercell ever seen.
## 6 Inspired by a true story, an oddball group of cops, criminals, tourists and teens converge in a Georgia forest where a 500-pound black bear goes on a murderous rampage after unintentionally ingesting cocaine.
## 7 With the price on his head ever increasing, John Wick uncovers a path to defeating The High Table. But before he can earn his freedom, Wick must face off against a new enemy with powerful alliances across the globe and forces that turn old friends into foes.
## 8 Puss in Boots discovers that his passion for adventure has taken its toll: He has burned through eight of his nine lives, leaving him with only one life left. Puss sets out on an epic journey to find the mythical Last Wish and restore his nine lives.
## 9 As viable water is depleted on Earth, a mission is sent to Saturn's moon Titan to retrieve sustainable H2O reserves from its alien inhabitants. But just as the humans acquire the precious resource, they are attacked by Titan rebels, who don't trust that the Earthlings will leave in peace.
## 10 A dystopian coming-of-age movie focused on three kids who find themselves in an abandoned amusement park, aiming to unite whoever remains. With dangers lurking around every corner, they will do whatever it takes to survive their hellish Neverland.
## crew
## 1 Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duke' Evers, Phylicia Rashād, Mary Anne Creed, Mila Davis-Kent, Amara Creed, Florian Munteanu, Viktor Drago, José Benavidez Jr., Felix Chavez, Selenis Leyva, Laura Chavez
## 2 Sam Worthington, Jake Sully, Zoe Saldaña, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Miles Quaritch, Kate Winslet, Ronal, Cliff Curtis, Tonowari, Joel David Moore, Norm Spellman, CCH Pounder, Mo'at, Edie Falco, General Frances Ardmore
## 3 Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Keegan-Michael Key, Toad (voice), Seth Rogen, Donkey Kong (voice), Fred Armisen, Cranky Kong (voice), Kevin Michael Richardson, Kamek (voice), Sebastian Maniscalco, Spike (voice)
## 4 Óscar Barberán, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis Pérez Reina, Carnaby (voice), María Luisa Solá, Madre (voice), Jaume Solà, Sekhem (voice), José Luis Mediavilla, Ed (voice), José Javier Serrano Rodríguez, Danny (voice), Aleix Estadella, Dennis (voice), María Moscardó, Usi (voice)
## 5 Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quinn Brody, Daniel Diemer, William Brody, Jordan Kristine Seamón, Harper Hunter, Alec Baldwin, Zane Rogers, Richard Gunn, Bill Brody, Praya Lundberg, Amy, Johnny Wactor, Martin, Anjul Nigam, Ramesh
## 6 Keri Russell, Sari, Alden Ehrenreich, Eddie, O'Shea Jackson Jr., Daveed, Ray Liotta, Syd, Kristofer Hivju, Olaf (Kristoffer), Margo Martindale, Ranger Liz, Christian Convery, Henry, Isiah Whitlock Jr., Bob, Jesse Tyler Ferguson, Peter
## 7 Keanu Reeves, John Wick, Donnie Yen, Caine, Bill Skarsgård, Marquis de Gramont, Ian McShane, Winston, Laurence Fishburne, Bowery King, Lance Reddick, Charon, Clancy Brown, The Harbinger, Hiroyuki Sanada, Shimazu, Shamier Anderson, Mr Nobody
## 8 Antonio Banderas, Puss in Boots (voice), Salma Hayek, Kitty Softpaws (voice), Harvey Guillén, Perrito (voice), Wagner Moura, Wolf (voice), Florence Pugh, Goldilocks (voice), Olivia Colman, Mama Bear (voice), Ray Winstone, Papa Bear (voice), Samson Kayo, Baby Bear (voice), John Mulaney, Jack Horner (voice)
## 9 Paul Bianchi, Computer (voice), Erin Coker, Allison Quince, Jack Pearson, Max Reece, Anthony Jensen, Jowers, Neli Sabour, Heidi Quince, Karan Sagoo, Adrian Naidu, Natalie Storrs, Saoirse Parker, Justin Tanks, Mark Morales, Jenny Tran, Kim Costa
## 10 Chloe Guidry, Ines, Nhedrick Jabier, Bui, Carmina Garay, Kuan, Billy Slaughter, Martin Parker, Carli McIntyre, Rue, Laura Coover, Reporter, Presley Richardson, Bennett, Sean Papajohn, Jack, Legend Jay Jones, Slingshot Gang
## orig_title status orig_lang budget_x
## 1 Creed III Released English 75000000
## 2 Avatar: The Way of Water Released English 460000000
## 3 The Super Mario Bros. Movie Released English 100000000
## 4 Momias Released Spanish, Castilian 12300000
## 5 Supercell Released English 77000000
## 6 Cocaine Bear Released English 35000000
## 7 John Wick: Chapter 4 Released English 100000000
## 8 Puss in Boots: The Last Wish Released English 90000000
## 9 Attack on Titan Released English 71000000
## 10 The Park Released English 119200000
## revenue country
## 1 271616668 AU
## 2 2316794914 AU
## 3 724459031 AU
## 4 34200000 AU
## 5 340941959 US
## 6 80000000 AU
## 7 351349364 AU
## 8 483480577 AU
## 9 254946484 US
## 10 488962491 US
# Caution: this subtracts revenue from itself, so every value of score becomes 0
movieData["score"] <- movieData["revenue"] - movieData["revenue"]
colnames(movieData)
## [1] "names" "date_x" "score" "genre" "overview"
## [6] "crew" "orig_title" "status" "orig_lang" "budget_x"
## [11] "revenue" "country"
Hypothesis 1
H0: There is no significant difference in mean budget_x across countries.
HA: There is a significant difference in mean budget_x across countries.
Significance level: 0.05, Power level: 0.8, Minimum Effect Size: 0.3.
# budget_x is continuous and country is categorical, so a one-way ANOVA is the appropriate test
result_countryvsbudget_x <- aov(budget_x ~ country, data = movieData)
summary(result_countryvsbudget_x)
## Df Sum Sq Mean Sq F value Pr(>F)
## country 59 2.361e+18 4.003e+16 13.15 <2e-16 ***
## Residuals 10118 3.079e+19 3.043e+15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (< 2e-16) is less than 0.05, we reject the null hypothesis.
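The decision can be checked directly from the numbers printed in the ANOVA table. The sketch below recomputes the p-value from the reported F statistic and degrees of freedom, and also derives eta-squared and Cohen's f from the reported sums of squares so the observed effect can be compared with the stated minimum effect size of 0.3. All inputs are copied from the rounded table, so the results are approximate.

```r
# Values copied from the ANOVA summary above (rounded as printed)
ss_country  <- 2.361e18
ss_residual <- 3.079e19
df_country  <- 59
df_residual <- 10118
f_value     <- 13.15

# Upper-tail probability of the F statistic under H0
p_value <- pf(f_value, df_country, df_residual, lower.tail = FALSE)

# Eta-squared: share of total variation in budget_x attributable to country
eta_sq <- ss_country / (ss_country + ss_residual)

# Cohen's f, the effect-size scale that pwr.anova.test works with
cohens_f <- sqrt(eta_sq / (1 - eta_sq))

cat("p-value:", format(p_value), "\n")
cat("eta squared:", round(eta_sq, 3), "\n")
cat("Cohen's f:", round(cohens_f, 3), "\n")
```

With these inputs eta-squared is about 0.071 and Cohen's f about 0.277, so the effect is statistically significant but slightly below the 0.3 threshold chosen for this analysis.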
effect_size <- 0.3
sample_size <- nrow(movieData)
# Load the pwr package
library(pwr)
# Performing power analysis for ANOVA
# (pwr.anova.test expects k = number of groups and n = observations per group)
power_analysis <- pwr.anova.test(k = 49,
                                 f = effect_size,
                                 sig.level = 0.05,
                                 n = sample_size)
# Printing the power analysis results
power_analysis
##
## Balanced one-way analysis of variance power calculation
##
## k = 49
## n = 10178
## f = 0.3
## sig.level = 0.05
## power = 1
##
## NOTE: n is number in each group
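As the note in the output says, `n` is interpreted as the size of each group, whereas the call above passes the total row count. The ANOVA table also reports 59 degrees of freedom for country, which implies 60 groups rather than 49. A minimal sketch of the same power calculation under those assumptions follows, using the standard noncentral-F formula in base R. It assumes a balanced design with the average group size, which the real data almost certainly does not satisfy, so treat it as a rough check only.

```r
# Assumptions: 60 country groups (the ANOVA reports 59 degrees of freedom)
# and equal group sizes, approximated by the average group size
total_n   <- 10178
k_groups  <- 60
n_per_grp <- total_n / k_groups
f_effect  <- 0.3
alpha     <- 0.05

# Noncentral F power calculation for a balanced one-way ANOVA
df1    <- k_groups - 1
df2    <- k_groups * (n_per_grp - 1)
lambda <- k_groups * n_per_grp * f_effect^2   # noncentrality parameter
f_crit <- qf(1 - alpha, df1, df2)
power  <- pf(f_crit, df1, df2, ncp = lambda, lower.tail = FALSE)

cat("n per group:", round(n_per_grp, 1), "\n")
cat("power:", round(power, 4), "\n")
```

Even with roughly 170 observations per group instead of 10178, the power to detect f = 0.3 stays far above the 0.8 target, so the conclusion about adequate power is unchanged.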
Hypothesis 2
Null Hypothesis (H0): There is no significant association between “country” and “status”.
Alternative Hypothesis (HA): There is a significant association between “country” and “status”.
Significance level: 0.05, Power level: 0.8, Minimum Effect Size: 0.3.
country_vs_status <- chisq.test(table(movieData$country, movieData$status))
## Warning in chisq.test(table(movieData$country, movieData$status)): Chi-squared
## approximation may be incorrect
country_vs_status
##
## Pearson's Chi-squared test
##
## data: table(movieData$country, movieData$status)
## X-squared = 722.09, df = 118, p-value < 2.2e-16
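A significant chi-squared result on a sample of this size says little about how strong the association is. The sketch below derives Cohen's w and Cramér's V from the printed statistic. It assumes 10178 rows (the row count reported in the power analysis) and a 60 x 3 table, which is inferred from df = 118 = 59 x 2 rather than stated in the output.

```r
# Values copied from the chi-squared output and the row count reported earlier
chi_sq  <- 722.09
df_chi  <- 118
n_total <- 10178

# Table dimensions inferred from df = (rows - 1) * (cols - 1) = 59 * 2
n_rows <- 60
n_cols <- 3

# Upper-tail p-value for the printed statistic
p_value <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

# Cohen's w and Cramer's V as measures of association strength
cohens_w  <- sqrt(chi_sq / n_total)
cramers_v <- sqrt(chi_sq / (n_total * min(n_rows - 1, n_cols - 1)))

cat("p-value:", format(p_value), "\n")
cat("Cohen's w:", round(cohens_w, 3), "\n")
cat("Cramer's V:", round(cramers_v, 3), "\n")
```

With these inputs Cohen's w is about 0.266 and Cramér's V about 0.188, which can be set against the minimum effect size of 0.3 chosen for this hypothesis.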
Since the p-value is less than 0.05, we reject the null hypothesis.
Reasons for choosing the alpha level, power, and effect size:
Alpha level (significance level): 0.05. This is the conventional threshold and corresponds to a 5% chance of rejecting the null hypothesis when it is true.
Power level: 0.80. This is a commonly used value and corresponds to an 80% chance of detecting a true effect if one exists. Together with alpha = 0.05 it is generally regarded as a reasonable compromise between Type I and Type II error rates.
Minimum effect size: 0.3. The effect size represents the practical significance of a result, and a minimum of 0.3 targets a moderate effect. This value may be set from domain knowledge or previous research.
We can also perform a Neyman-Pearson (likelihood-ratio) test.
# Creating a contingency table
# (as written, country is tabulated against itself, hence df = 59 * 59 = 3481 below)
contingency_table <- table(movieData$country, movieData$country)
# Performing the chi-square test
chi_square_result <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
# Checking if the chi-square test was successful
if (chi_square_result$p.value > 0) {
  # Extract observed and expected frequencies
  observed_freq <- chi_square_result$observed
  expected_freq <- chi_square_result$expected
  # Calculating the likelihood ratio test statistic
  likelihood_ratio_statistic <- 2 * sum(observed_freq * log(observed_freq / expected_freq))
  # Choosing a significance level (alpha)
  alpha <- 0.05
  # Determining the critical value from the chi-square distribution
  # (chi_square_result$parameter holds the degrees of freedom)
  critical_value <- qchisq(1 - alpha, df = chi_square_result$parameter)
  # Comparing the test statistic to the critical value
  if (!is.na(likelihood_ratio_statistic) && !is.na(critical_value)) {
    if (likelihood_ratio_statistic > critical_value) {
      cat("Reject the null hypothesis at the", alpha, "significance level.\n")
    } else {
      cat("Fail to reject the null hypothesis at the", alpha, "significance level.\n")
    }
  } else {
    cat("Error: Unable to perform the likelihood ratio test.\n")
  }
  # Print the result of the chi-square test
  print(chi_square_result)
} else {
  cat("Error: Chi-square test unsuccessful. Check your data.\n")
}
## Error: Chi-square test unsuccessful. Check your data.
# Print the result of the chi-square test
print(chi_square_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 600502, df = 3481, p-value < 2.2e-16
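The likelihood-ratio formula used in the block above, 2 * sum(O * log(O / E)), returns NaN as soon as any cell has an observed count of zero, because R evaluates 0 * log(0) as NaN. A common convention is to let empty cells contribute zero to the sum. The sketch below shows both versions on a small hypothetical table, not on the movie data.

```r
# Hypothetical 2 x 3 table of counts with one empty cell (not the movie data)
observed <- matrix(c(12, 0, 7,
                      5, 9, 4), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# The naive formula breaks because 0 * log(0) is NaN in R
naive_g <- 2 * sum(observed * log(observed / expected))

# Let empty cells contribute zero by summing over non-empty cells only
nonzero <- observed > 0
g_stat  <- 2 * sum(observed[nonzero] * log(observed[nonzero] / expected[nonzero]))

# Compare against the chi-squared reference distribution
df_g     <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical <- qchisq(1 - 0.05, df = df_g)
p_value  <- pchisq(g_stat, df = df_g, lower.tail = FALSE)

cat("naive G:", naive_g, "\n")
cat("G with empty cells skipped:", round(g_stat, 3), "\n")
cat("critical value:", round(critical, 3), " p-value:", format(p_value), "\n")
```

The same masking could be applied to `observed_freq` and `expected_freq`, although with many sparse cells the chi-squared approximation for G is as questionable as it is for Pearson's statistic.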
The likelihood-ratio statistic comes out as NaN here, because cells with zero observed counts produce 0 * log(0), and hence we won’t be able to compute it.
Fisher's exact test for significance
Test 1
# Creating a contingency table
contingency_table <- table(movieData$status, movieData$country)
# Performing the Fisher's exact test
fisher_test_results <- fisher.test(contingency_table, simulate.p.value = TRUE)
# Checking the p-value
p_value <- fisher_test_results$p.value
# Making a decision
cat("p: ", p_value, "\n")
## p: 0.0004997501
if (p_value < 0.05) {
  # Rejecting the null hypothesis
  print("There is a significant association between status and country")
} else {
  # Failing to reject the null hypothesis
  print("There is not enough evidence to conclude that there is a significant association between status and country")
}
## [1] "There is a significant association between status and country"
Test 2
contingency_table <- table(movieData$status, movieData$country)
# Performing the Fisher's exact test
fisher_test_results <- fisher.test(contingency_table, simulate.p.value = TRUE)
# Checking the p-value
p_value <- fisher_test_results$p.value
# Making a decision
cat("p: ", p_value, "\n")
## p: 0.0009995002
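The two runs give different p-values because `simulate.p.value = TRUE` uses Monte Carlo sampling. With B replicates the reported value is (1 + m) / (B + 1), where m is the number of simulated tables at least as extreme as the observed one. With the default B = 2000, the two results above correspond to m = 0 and m = 1. The sketch below reproduces that arithmetic and shows, on a hypothetical table, that fixing the seed makes the result repeatable.

```r
# Smallest attainable simulated p-values with the default number of replicates
B <- 2000
smallest_p <- 1 / (B + 1)   # no simulated table as extreme as the observed one
next_p     <- 2 / (B + 1)   # exactly one simulated table as extreme
print(c(smallest_p, next_p))

# Hypothetical 3 x 3 table (not the movie data)
toy <- matrix(c(20,  5,  3,
                 4, 18,  6,
                 2,  7, 15), nrow = 3, byrow = TRUE)

# Resetting the seed before each call gives identical simulated p-values
set.seed(42)
p1 <- fisher.test(toy, simulate.p.value = TRUE, B = B)$p.value
set.seed(42)
p2 <- fisher.test(toy, simulate.p.value = TRUE, B = B)$p.value
print(c(p1, p2))
```

Both reported values therefore sit at the resolution limit of the simulation. A larger `B` would be needed to say more than that p is below roughly 0.001.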
Build two visualizations that best illustrate the results from the two pairs of hypothesis tests, one for each null hypothesis.
library(ggplot2)
ggplot(movieData, aes(x = country, y = budget_x)) +
  geom_boxplot(fill = "lightblue", color = "blue") +
  labs(title = "Boxplot of budget_x by country",
       x = "country",
       y = "budget_x")
# Load necessary libraries
library(ggplot2)
# Assuming 'status' and 'country' are categorical variables
movieData$status <- as.factor(movieData$status)
movieData$country <- as.factor(movieData$country)
# Create a grouped bar plot
ggplot(movieData, aes(x = status, fill = country)) +
  geom_bar(position = "dodge", color = "black", stat = "count") +
  labs(title = "Grouped Bar Plot of status and country",
       x = "status",
       y = "Count")
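With around 60 country codes, both plots above are likely to be crowded. One possible preparation step, sketched here in base R on a hypothetical vector standing in for `movieData$country`, is to keep the most frequent codes and pool the rest into an "Other" level before plotting. The cut-off `n_keep` is an arbitrary illustrative choice.

```r
# Hypothetical country codes standing in for movieData$country
country <- c("AU", "AU", "AU", "US", "US", "US", "US",
             "JP", "JP", "FR", "KR", "MX")

# Keep the n_keep most frequent codes and pool everything else as "Other"
n_keep <- 2
counts <- sort(table(country), decreasing = TRUE)
keep   <- names(counts)[seq_len(n_keep)]
country_lumped <- factor(ifelse(country %in% keep, country, "Other"),
                         levels = c(keep, "Other"))

print(table(country_lumped))
```

The lumped factor could then be used in place of `country` in the `aes()` mappings, at the cost of hiding differences among the pooled countries.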