Variable Set 1: Movie Revenue Prediction
Response Variable: “Revenue” (continuous, movie revenue; a movie is counted as successful when its revenue exceeds its budget)
Explanatory Variables: “budget_x” (continuous, movie budget); “score” (continuous, movie IMDb score)
Variable Set 2: Genre Popularity Analysis (Revised)
Response Variable: “genre_popularity” (continuous, a measure of how popular a movie genre is)
Explanatory Variables: “genre” (categorical, movie genre, e.g., [‘Action’, ‘Drama’, ‘Comedy’, ‘Sci-Fi’, ‘Fantasy’]); “score” (continuous, movie IMDb score)
Variable Set 3: Release Date Analysis
Response Variable: “date_success_proportions” (continuous, proportion of movies released on a given date with an IMDb score above 80)
Explanatory Variables: “date_x” (movie release date, stored as text in mm/dd/yyyy format, e.g., 07/09/1985, and converted to a numeric day count for the correlation analysis); “score” (continuous, movie IMDb score)
The scatterplots below visualize the relationship between the response variable “Revenue” and each of the explanatory variables “Budget” (budget_x) and “IMDb Score” (score).
library(ggplot2)
# Revenue against budget, one point per movie
ggplot(data = data, aes(x = budget_x, y = revenue)) +
  geom_point() +
  labs(title = "Revenue vs. Budget", x = "Budget (in dollars)", y = "Revenue")
# Revenue against IMDb score, one point per movie
ggplot(data, aes(x = score, y = revenue)) +
  geom_point() +
  labs(title = "Revenue vs. IMDb Score", x = "IMDb Score", y = "Revenue")
The R code below plots the continuous variable “IMDb Score” against the categorical variable “Genre”, with one point per movie. The plot shows how IMDb scores are spread within each genre.
library(ggplot2)
ggplot(data, aes(x = genre, y = score)) +
geom_point() +
labs(title = "Genre vs. IMDb Score", x = "Genre", y = "IMDb Score")
The next two plots show how IMDb scores vary with release date and how the movies in the dataset are distributed across countries.
library(ggplot2)
ggplot(data, aes(x = date_x, y = score)) +
geom_point() +
labs(title = "Release Date vs. IMDb Score", x = "Release Date", y = "IMDb Score")
# after_stat(count) replaces the ..count.. notation deprecated in ggplot2 3.4.0
ggplot(data, aes(x = country, fill = after_stat(count))) +
  geom_bar() +
  labs(title = "Distribution of Movies by Country", x = "Country", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
correlation_budget_revenue <- cor(data$budget_x, data$revenue, method = "pearson")
correlation_score_revenue <- cor(data$score, data$revenue, method = "pearson")
correlation_budget_revenue
## [1] 0.6738296
correlation_score_revenue
## [1] 0.09653287
The correlation coefficient between “Revenue” and “Budget” (budget_x) is approximately 0.674, a moderately strong positive linear relationship. The correlation coefficient between “Revenue” and “IMDb Score” (score) is approximately 0.097, a very weak positive linear relationship. In summary, revenue is much more strongly associated with a movie’s budget than with its IMDb score. These coefficients measure linear association only and do not by themselves show that a larger budget causes higher revenue.
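Squaring a correlation coefficient gives the share of variance linearly associated between the two variables: roughly 45% for budget (0.674 squared) and under 1% for IMDb score (0.097 squared). The sketch below shows how `cor.test()` adds a confidence interval to a Pearson correlation. It runs on simulated stand-in data, so the variable names and values are assumptions and the numbers it prints are not results from the movie dataset.

```r
# Simulated stand-in data; these values are NOT from the movie dataset
set.seed(42)
budget  <- runif(200, min = 1e6, max = 2e8)
revenue <- 2.5 * budget + rnorm(200, sd = 1e8)

# cor.test() reports Pearson's r together with a t-test and a 95% confidence interval
result <- cor.test(budget, revenue, method = "pearson")
r <- unname(result$estimate)

# r squared: share of variance in revenue linearly associated with budget
r_squared <- r^2

print(result$conf.int)
print(r_squared)
```

On the real data, the same call would be `cor.test(data$budget_x, data$revenue)`.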
ggplot(data, aes(x = genre, y = score)) +
  geom_boxplot() +
  labs(title = "Distribution of IMDb Scores by Genre", x = "Genre", y = "IMDb Score")
Correlation analysis is suited to examining the relationship between two continuous variables, whereas box plots are useful for comparing the distribution of a continuous variable across the levels of a categorical one. Because “genre” is categorical and “score” is continuous, I chose a box plot rather than a correlation coefficient to show this relationship.
library(lubridate) # For date manipulation
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
data$date_x <- as.Date(data$date_x, format = "%m/%d/%Y")
reference_date <- as.Date("1900-01-01")
data$date_numeric <- as.numeric(data$date_x - reference_date)
correlation_date_score <- cor(data$date_numeric, data$score, method = "pearson")
print(correlation_date_score)
## [1] -0.144178
The Pearson correlation coefficient between “date_x” (release date expressed as the number of days since 1900-01-01) and “score” (IMDb score) is approximately -0.144. This indicates a weak negative linear relationship: more recent releases tend to have slightly lower IMDb scores.
# Install binom only if it is not already available, then load it
if (!require(binom)) {
  install.packages("binom")
  library(binom)
}
## Loading required package: binom
successful_movies <- sum(data$revenue > data$budget_x)
total_movies <- nrow(data)
prop_successful <- successful_movies / total_movies
conf_interval <- binom.confint(successful_movies, total_movies, method = "wilson")
prop_successful
## [1] 0.8037925
conf_interval
## method x n mean lower upper
## 1 wilson 8181 10178 0.8037925 0.7959633 0.8113925
Conclusion:
Based on the above analysis, a large proportion of the movies in my dataset can be considered successful, where success means that revenue exceeds budget. The estimated proportion is approximately 80.38% (8,181 of 10,178 movies). The 95% Wilson confidence interval suggests that the true proportion of successful movies is likely to be between 79.60% and 81.14%.
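To make the interval less of a black box, the Wilson score interval can be recomputed by hand from the two counts in the output above. This is a minimal base R sketch; the variable names are illustrative, and the formula is the standard Wilson score interval rather than code taken from the binom package.

```r
# Counts copied from the binom.confint() output above
x <- 8181    # movies with revenue > budget_x
n <- 10178   # total movies
p_hat <- x / n
z <- qnorm(0.975)   # about 1.96 for a 95% interval

# Wilson score interval: shift the centre toward 0.5, then add/subtract the margin
denom  <- 1 + z^2 / n
centre <- (p_hat + z^2 / (2 * n)) / denom
margin <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom

ci_by_hand <- c(lower = centre - margin, upper = centre + margin)
print(round(ci_by_hand, 7))
```

The result should agree with the `lower` and `upper` columns reported by `binom.confint()`.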
# Proportion of movies scoring above 80 within each genre combination
genre_success_proportions <- aggregate(data$score > 80, by = list(data$genre), mean)
colnames(genre_success_proportions) <- c("Genre", "Proportion_Successful")
# Number of movies per genre combination, in the same group order as above
genre_n <- aggregate(data$score, by = list(data$genre), length)$x
# Wald margin of error based on each group's own size rather than nrow(data)
genre_margin <- qnorm(0.975) * sqrt((genre_success_proportions$Proportion_Successful * (1 - genre_success_proportions$Proportion_Successful)) / genre_n)
genre_success_proportions$Lower <- genre_success_proportions$Proportion_Successful - genre_margin
genre_success_proportions$Upper <- genre_success_proportions$Proportion_Successful + genre_margin
genre_success_proportions_above_80 <- subset(genre_success_proportions, Proportion_Successful > 0.80)
print(genre_success_proportions_above_80)
## Genre
## 182 Action, Comedy, Science Fiction, Animation
## 411 Adventure, Animation, Comedy, Fantasy, Mystery
## 515 Adventure, Fantasy, Action, Family
## 520 Adventure, Fantasy, Animation
## 587 Animation, Action, Adventure, Comedy, Drama, Fantasy, Romance
## 603 Animation, Action, Adventure, Fantasy, Thriller
## 611 Animation, Action, Comedy, Mystery, Crime, Fantasy
## 635 Animation, Action, Science Fiction, Drama
## 640 Animation, Action, War, Fantasy
## 656 Animation, Adventure, Crime, Family, Comedy
## 727 Animation, Comedy, Romance
## 728 Animation, Comedy, Romance, Drama, Fantasy
## 794 Animation, Family, Comedy, Fantasy, Drama
## 801 Animation, Family, Drama, Fantasy, Adventure
## 806 Animation, Family, Fantasy, Adventure, Comedy
## 836 Animation, Fantasy, Adventure, Action
## 837 Animation, Fantasy, Adventure, Action, Family
## 840 Animation, Fantasy, Drama, Music
## 852 Animation, Fantasy, Romance, Drama
## 896 Animation, Thriller
## 899 Animation, TV Movie, Fantasy, Action
## 936 Comedy, Animation, Adventure, Fantasy, Romance, Action
## 1052 Comedy, Music, Romance, Crime
## 1095 Comedy, War, Drama
## 1262 Drama, Comedy, War
## 1291 Drama, Fantasy, Animation
## 1301 Drama, Fantasy, Mystery
## 1364 Drama, Mystery, Science Fiction
## 1431 Drama, War, Mystery
## 1498 Family, Animation, Drama
## 1505 Family, Animation, Fantasy, Music, Comedy, Adventure
## 1656 Fantasy, Drama, Crime
## 1911 Mystery, Romance, Thriller
## 1937 Romance, Animation, Drama
## 2079 Science Fiction, Mystery, Adventure
## 2224 TV Movie, Animation, Science Fiction, Action, Adventure, Comedy, Drama, Fantasy, Music
## Proportion_Successful Lower Upper
## 182 1 1 1
## 411 1 1 1
## 515 1 1 1
## 520 1 1 1
## 587 1 1 1
## 603 1 1 1
## 611 1 1 1
## 635 1 1 1
## 640 1 1 1
## 656 1 1 1
## 727 1 1 1
## 728 1 1 1
## 794 1 1 1
## 801 1 1 1
## 806 1 1 1
## 836 1 1 1
## 837 1 1 1
## 840 1 1 1
## 852 1 1 1
## 896 1 1 1
## 899 1 1 1
## 936 1 1 1
## 1052 1 1 1
## 1095 1 1 1
## 1262 1 1 1
## 1291 1 1 1
## 1301 1 1 1
## 1364 1 1 1
## 1431 1 1 1
## 1498 1 1 1
## 1505 1 1 1
## 1656 1 1 1
## 1911 1 1 1
## 1937 1 1 1
## 2079 1 1 1
## 2224 1 1 1
Conclusion:
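The analysis looked for genre combinations in which the proportion of movies with an IMDb score above 80 exceeds 0.80. Out of more than 2,000 genre combinations in the dataset, 36 meet this criterion, and in every one of them the proportion is exactly 1, meaning all of their movies scored above 80. Because the normal-approximation standard error is zero when the proportion is 1, the intervals collapse to [1, 1] and carry no information about uncertainty. Each of these combinations may contain only a few movies, so the result is best read as a description of this dataset rather than as evidence that these genres consistently produce high-scoring movies.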
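The degenerate [1, 1] intervals follow directly from the normal-approximation (Wald) formula, whose standard error contains the factor p(1 - p). The sketch below contrasts it with the Wilson score interval for a small group. The helper names `wald_ci` and `wilson_ci` are hypothetical, and the counts are toy values rather than figures from the movie dataset.

```r
# Hypothetical helpers; toy counts, not taken from the movie dataset
wald_ci <- function(x, n, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  p  <- x / n
  se <- sqrt(p * (1 - p) / n)   # zero whenever p is exactly 0 or 1
  c(lower = p - z * se, upper = p + z * se)
}

wilson_ci <- function(x, n, conf = 0.95) {
  z      <- qnorm(1 - (1 - conf) / 2)
  p      <- x / n
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  margin <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - margin, upper = centre + margin)
}

# A genre combination with 2 movies, both scoring above the threshold
wald_small   <- wald_ci(2, 2)     # collapses to a single point
wilson_small <- wilson_ci(2, 2)   # stays wide, reflecting n = 2

print(wald_small)
print(wilson_small)
```

With only two movies, the Wilson interval remains wide, which is a more honest summary of how little a group of that size can show.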
# Proportion of movies scoring above 80 on each release date
date_success_proportions <- aggregate(data$score > 80, by = list(data$date_x), mean)
colnames(date_success_proportions) <- c("Release_Date", "Proportion_Successful")
# Number of movies per release date, in the same group order as above
date_n <- aggregate(data$score, by = list(data$date_x), length)$x
# Wald margin of error based on each date's own group size rather than nrow(data)
date_margin <- qnorm(0.975) * sqrt((date_success_proportions$Proportion_Successful * (1 - date_success_proportions$Proportion_Successful)) / date_n)
date_success_proportions$Lower <- date_success_proportions$Proportion_Successful - date_margin
date_success_proportions$Upper <- date_success_proportions$Proportion_Successful + date_margin
date_success_proportions_above_80 <- subset(date_success_proportions, Proportion_Successful > 0.80)
print(date_success_proportions_above_80)
## Release_Date Proportion_Successful Lower Upper
## 9 1928-04-07 1 1 1
## 10 1928-10-25 1 1 1
## 15 1931-03-07 1 1 1
## 16 1931-05-11 1 1 1
## 25 1938-02-17 1 1 1
## 33 1940-01-12 1 1 1
## 40 1941-02-07 1 1 1
## 49 1944-12-01 1 1 1
## 50 1944-12-08 1 1 1
## 62 1948-02-06 1 1 1
## 69 1950-09-29 1 1 1
## 71 1950-11-03 1 1 1
## 72 1951-02-05 1 1 1
## 74 1951-07-11 1 1 1
## 82 1952-03-20 1 1 1
## 83 1952-04-02 1 1 1
## 84 1952-12-25 1 1 1
## 99 1955-06-23 1 1 1
## 103 1956-01-01 1 1 1
## 115 1957-10-10 1 1 1
## 117 1957-12-17 1 1 1
## 119 1958-04-26 1 1 1
## 126 1959-01-23 1 1 1
## 130 1959-10-15 1 1 1
## 139 1960-06-09 1 1 1
## 143 1960-08-24 1 1 1
## 144 1960-09-21 1 1 1
## 162 1962-09-15 1 1 1
## 168 1963-03-01 1 1 1
## 186 1964-06-12 1 1 1
## 203 1965-05-31 1 1 1
## 234 1968-05-02 1 1 1
## 262 1969-07-31 1 1 1
## 269 1970-02-12 1 1 1
## 303 1971-12-22 1 1 1
## 309 1972-02-10 1 1 1
## 323 1972-11-02 1 1 1
## 380 1975-02-20 1 1 1
## 387 1975-06-18 1 1 1
## 408 1976-04-01 1 1 1
## 415 1976-08-05 1 1 1
## 452 1977-10-26 1 1 1
## 465 1978-03-08 1 1 1
## 501 1979-05-25 1 1 1
## 513 1979-10-19 1 1 1
## 516 1979-11-15 1 1 1
## 533 1980-05-21 1 1 1
## 552 1980-11-13 1 1 1
## 564 1981-01-16 1 1 1
## 619 1982-02-26 1 1 1
## 636 1982-10-21 1 1 1
## 724 1984-10-04 1 1 1
## 744 1985-02-28 1 1 1
## 764 1985-07-09 1 1 1
## 793 1985-12-01 1 1 1
## 837 1986-09-01 1 1 1
## 987 1989-07-21 1 1 1
## 1010 1989-12-26 1 1 1
## 1054 1990-09-27 1 1 1
## 1115 1991-07-27 1 1 1
## 1125 1991-09-05 1 1 1
## 1258 1993-07-30 1 1 1
## 1345 1994-08-25 1 1 1
## 1363 1994-11-17 1 1 1
## 1398 1995-04-16 1 1 1
## 1447 1995-11-02 1 1 1
## 1648 1998-03-12 1 1 1
## 1695 1998-09-24 1 1 1
## 1704 1998-10-28 1 1 1
## 1710 1998-11-12 1 1 1
## 1712 1998-11-19 1 1 1
## 1749 1999-04-08 1 1 1
## 1756 1999-05-06 1 1 1
## 1793 1999-11-11 1 1 1
## 1838 2000-05-04 1 1 1
## 1849 2000-07-15 1 1 1
## 1882 2000-11-06 1 1 1
## 1908 2001-01-10 1 1 1
## 1948 2001-06-02 1 1 1
## 2084 2002-08-30 1 1 1
## 2153 2003-05-02 1 1 1
## 2277 2004-05-20 1 1 1
## 2299 2004-08-10 1 1 1
## 2330 2004-11-06 1 1 1
## 2423 2005-08-26 1 1 1
## 2476 2005-12-21 1 1 1
## 2513 2006-04-07 1 1 1
## 2524 2006-05-06 1 1 1
## 2690 2007-08-07 1 1 1
## 2893 2008-12-07 1 1 1
## 2924 2009-02-20 1 1 1
## 2996 2009-08-19 1 1 1
## 3080 2010-02-18 1 1 1
## 3103 2010-04-17 1 1 1
## 3130 2010-06-22 1 1 1
## 3199 2010-11-02 1 1 1
## 3308 2011-07-13 1 1 1
## 3453 2012-05-18 1 1 1
## 3612 2013-05-02 1 1 1
## 3718 2013-11-26 1 1 1
## 3852 2014-08-24 1 1 1
## 3886 2014-10-22 1 1 1
## 3991 2015-05-23 1 1 1
## 4152 2016-04-02 1 1 1
## 4256 2016-10-16 1 1 1
## 4301 2017-01-06 1 1 1
## 4422 2017-08-07 1 1 1
## 4499 2017-12-27 1 1 1
## 4595 2018-06-02 1 1 1
## 4689 2018-11-12 1 1 1
## 4719 2018-12-31 1 1 1
## 4814 2019-06-17 1 1 1
## 4854 2019-09-02 1 1 1
## 4858 2019-09-06 1 1 1
## 4907 2019-12-03 1 1 1
## 4989 2020-05-20 1 1 1
## 5011 2020-07-03 1 1 1
## 5034 2020-08-22 1 1 1
## 5058 2020-09-26 1 1 1
## 5078 2020-10-24 1 1 1
## 5098 2020-11-25 1 1 1
## 5113 2020-12-21 1 1 1
## 5140 2021-02-15 1 1 1
## 5382 2022-03-05 1 1 1
## 5540 2022-10-23 1 1 1
## 5639 2023-03-22 1 1 1
## 5688 2023-12-31 1 1 1
Conclusion:
To examine the relationship between movie release dates (“date_x”) and IMDb scores (“score”), we computed, for each release date, the proportion of movies with a score above 80 and kept the dates where that proportion exceeds 0.80. A total of 127 release dates meet this criterion, each with a proportion of exactly 1.
These dates may point to periods when movies tend to receive high IMDb scores, but the evidence is tentative. Because the data are grouped by exact release date, many dates are likely to contain only one or a few movies, so a proportion of 1 can reflect a single high-scoring movie rather than a broader trend. The intervals of [1, 1] arise because the normal-approximation standard error is zero when the proportion is 1, so they do not measure that uncertainty.
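Grouping by exact release date leaves few movies in each group. One possible refinement is to group by a coarser unit such as release year, so that each proportion rests on more movies. The sketch below illustrates the idea on toy data; the column names mirror the dataset, but the values are invented and the choice of year as the grouping unit is an assumption.

```r
# Toy data; column names mirror the dataset but the values are invented
toy <- data.frame(
  date_x = as.Date(c("1999-04-08", "1999-05-06", "1999-11-11",
                     "2000-05-04", "2000-07-15", "2000-11-06")),
  score  = c(85, 62, 81, 70, 90, 55)
)

# Extract the release year and flag movies scoring above 80
toy$year <- as.integer(format(toy$date_x, "%Y"))
high     <- toy$score > 80

# Count high-scoring movies and all movies per year
successes <- tapply(high, toy$year, sum)
n         <- tapply(high, toy$year, length)

by_year <- data.frame(
  year       = as.integer(names(n)),
  successes  = as.vector(successes),
  n          = as.vector(n)
)
by_year$proportion <- by_year$successes / by_year$n
print(by_year)
```

Keeping the group size `n` next to each proportion makes it easy to see which years rest on enough movies to be worth interpreting.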