Build at least three sets of variable combinations

Variable Set 1: Movie Revenue Prediction

Response Variable: “revenue” (continuous, movie revenue; for the confidence interval, a movie counts as successful when its revenue exceeds “budget_x”)

Explanatory Variables: “budget_x” (continuous, movie budget) and “score” (continuous, movie IMDb score)

Variable Set 2: Genre Popularity Analysis (Revised)

Response Variable: “genre_popularity” (continuous, a measure of how popular a movie genre is)

Explanatory Variables: “genre” (categorical, movie genre, e.g., [‘Action’, ‘Drama’, ‘Comedy’, ‘Sci-Fi’, ‘Fantasy’]; the genres have no natural order) and “score” (continuous, movie IMDb score)

Variable Set 3: Release Date Analysis

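Response Variable: “date_success_proportions” (proportion of successful movies per release date)

Explanatory Variables: “date_x” (movie release date, stored as text in mm/dd/yyyy format, e.g., 07/09/1985, and converted to a number of days for the correlation) and “score” (continuous, movie IMDb score)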

Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot

The scatterplots below visualize the relationships between the response variable “Revenue” and the explanatory variables “Budget” (budget_x) and “IMDb Score” (score).

library(ggplot2)


ggplot(data = data, aes(x = budget_x, y = revenue)) +
  geom_point() +
  labs(title = "Revenue vs. Budget", x = "Budget (in dollars)", y = "Revenue")

ggplot(data, aes(x = score, y = revenue)) +
  geom_point() +
  labs(title = "Revenue vs. IMDb Score", x = "IMDb Score", y = "Revenue")

The R code below plots IMDb score against genre, with one point per movie. Because “Genre” is categorical and “IMDb Score” is continuous, the points form a vertical strip for each genre, which shows how scores are spread within each genre.

library(ggplot2)

ggplot(data, aes(x = genre, y = score)) +
  geom_point() +
  labs(title = "Genre vs. IMDb Score", x = "Genre", y = "IMDb Score")

The next two plots show IMDb score against release date and the number of movies per country. They help reveal trends over time and how the movies are distributed across countries.

library(ggplot2)

ggplot(data, aes(x = date_x, y = score)) +
  geom_point() +
  labs(title = "Release Date vs. IMDb Score", x = "Release Date", y = "IMDb Score")

ggplot(data, aes(x = country, fill = after_stat(count))) +
  geom_bar() +
  labs(title = "Distribution of Movies by Country", x = "Country", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Calculate the appropriate correlation coefficient for each of these combinations

correlation_budget_revenue <- cor(data$budget_x, data$revenue, method = "pearson")


correlation_score_revenue <- cor(data$score, data$revenue, method = "pearson")

correlation_budget_revenue
## [1] 0.6738296
correlation_score_revenue
## [1] 0.09653287

The correlation coefficient between “Revenue” and “Budget” (budget_x) is approximately 0.674, a moderately strong positive linear relationship. The correlation coefficient between “Revenue” and “IMDb Score” (score) is approximately 0.097, which is close to zero. In summary, revenue is much more strongly associated with a movie’s budget than with its IMDb score. Correlation alone does not show that a larger budget causes higher revenue.
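A point estimate from cor() says nothing about its uncertainty. As a rough sketch, base R's cor.test() reports the same Pearson estimate together with a 95% confidence interval and a p-value. The data below are simulated stand-ins (the variable names and the generating model are assumptions, not the movie dataset):

```r
# Simulated stand-in for the movie data (NOT the real dataset).
set.seed(42)
n <- 500
budget  <- runif(n, 1e6, 2e8)
revenue <- 2.5 * budget + rnorm(n, sd = 1.5e8)  # revenue depends on budget
score   <- rnorm(n, mean = 65, sd = 10)         # generated independently of revenue

# cor.test() returns the estimate plus a confidence interval and p-value.
ct_budget <- cor.test(budget, revenue, method = "pearson")
ct_score  <- cor.test(score,  revenue, method = "pearson")

ct_budget$estimate   # sample correlation
ct_budget$conf.int   # 95% confidence interval
ct_score$conf.int    # interval for the weaker relationship
```

On the real data, the equivalent call would presumably be cor.test(data$budget_x, data$revenue).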

ggplot(data, aes(x = genre, y = score)) +
  geom_boxplot() +
  labs(title = "Distribution of IMDb Scores by Genre", x = "Genre", y = "IMDb Score")

Correlation analysis is best suited to relationships between two continuous variables, while visualizations such as box plots are useful for comparing the distribution of a continuous variable across the levels of a categorical one. Because genre is categorical, I chose a box plot to show its relationship with IMDb score.
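If a single number is still wanted for a categorical-continuous pair, one common option is eta squared from a one-way ANOVA: the share of variance in the continuous variable explained by group membership. This is a sketch on simulated data (the three genres and their means are made up for illustration), not a result from the movie dataset:

```r
# Simulated scores for three hypothetical genres.
set.seed(1)
genre <- factor(rep(c("Action", "Comedy", "Drama"), each = 100))
score <- rnorm(300, mean = c(60, 65, 70)[as.integer(genre)], sd = 8)

# One-way ANOVA of score on genre.
fit <- aov(score ~ genre)

# Sum of squares: first entry is between-genre, second is residual.
ss <- summary(fit)[[1]][["Sum Sq"]]

# Eta squared = between-group SS / total SS, always between 0 and 1.
eta_sq <- ss[1] / sum(ss)
eta_sq
```

Values near 0 mean genre explains little of the variation in score; values near 1 mean it explains most of it.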

library(lubridate)  # For date manipulation
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
data$date_x <- as.Date(data$date_x, format = "%m/%d/%Y")


reference_date <- as.Date("1900-01-01")

data$date_numeric <- as.numeric(data$date_x - reference_date)


correlation_date_score <- cor(data$date_numeric, data$score, method = "pearson")

print(correlation_date_score)
## [1] -0.144178

The Pearson correlation coefficient between the numeric release date (days since 1900-01-01, derived from “date_x”) and “score” (IMDb scores) is approximately -0.144. This indicates a weak negative linear relationship: more recent movies tend to have slightly lower IMDb scores.

Build a confidence interval for each of the response variables

if (!require(binom)) {
  install.packages("binom")
  library(binom)
}
## Loading required package: binom
successful_movies <- sum(data$revenue > data$budget_x)
total_movies <- nrow(data)
prop_successful <- successful_movies / total_movies


conf_interval <- binom.confint(successful_movies, total_movies, method = "wilson")


prop_successful
## [1] 0.8037925
conf_interval
##   method    x     n      mean     lower     upper
## 1 wilson 8181 10178 0.8037925 0.7959633 0.8113925

Conclusion:

Based on the above analysis, about 80.38% of the movies in the dataset earned more revenue than their budget and are counted as successful. The 95% Wilson confidence interval suggests that the true proportion of successful movies lies between 79.60% and 81.14%.
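For reference, the Wilson score interval can be computed by hand in base R. The sketch below plugs in the counts from the output above (8181 successes out of 10178 movies) and should agree with the binom.confint() result to the printed precision:

```r
x <- 8181                 # successful movies (from the output above)
n <- 10178                # total movies
z <- qnorm(0.975)         # critical value for a 95% interval
p_hat <- x / n

# Wilson score interval: a shifted center and an adjusted half-width.
denom  <- 1 + z^2 / n
center <- (p_hat + z^2 / (2 * n)) / denom
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom

wilson <- c(lower = center - half, upper = center + half)
round(wilson, 7)
```

With a sample this large the Wilson interval is almost identical to the simpler normal-approximation interval; the two differ noticeably only for small samples or proportions near 0 or 1.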

# Proportion of movies per genre with a score above 80
genre_success_proportions <- aggregate(data$score > 80, by = list(data$genre), mean)

colnames(genre_success_proportions) <- c("Genre", "Proportion_Successful")

# Movies per genre, in the same group order; the standard error uses this count, not nrow(data)
genre_n <- aggregate(data$score, by = list(data$genre), length)$x

genre_success_proportions$Lower <- genre_success_proportions$Proportion_Successful - 
  qnorm(0.975) * sqrt((genre_success_proportions$Proportion_Successful * (1 - genre_success_proportions$Proportion_Successful)) / genre_n)

genre_success_proportions$Upper <- genre_success_proportions$Proportion_Successful + 
  qnorm(0.975) * sqrt((genre_success_proportions$Proportion_Successful * (1 - genre_success_proportions$Proportion_Successful)) / genre_n)


genre_success_proportions_above_80 <- subset(genre_success_proportions, Proportion_Successful > 0.80)


print(genre_success_proportions_above_80)
##                                                                                       Genre
## 182                                              Action, Comedy, Science Fiction, Animation
## 411                                          Adventure, Animation, Comedy, Fantasy, Mystery
## 515                                                      Adventure, Fantasy, Action, Family
## 520                                                           Adventure, Fantasy, Animation
## 587                           Animation, Action, Adventure, Comedy, Drama, Fantasy, Romance
## 603                                         Animation, Action, Adventure, Fantasy, Thriller
## 611                                      Animation, Action, Comedy, Mystery, Crime, Fantasy
## 635                                               Animation, Action, Science Fiction, Drama
## 640                                                         Animation, Action, War, Fantasy
## 656                                             Animation, Adventure, Crime, Family, Comedy
## 727                                                              Animation, Comedy, Romance
## 728                                              Animation, Comedy, Romance, Drama, Fantasy
## 794                                               Animation, Family, Comedy, Fantasy, Drama
## 801                                            Animation, Family, Drama, Fantasy, Adventure
## 806                                           Animation, Family, Fantasy, Adventure, Comedy
## 836                                                   Animation, Fantasy, Adventure, Action
## 837                                           Animation, Fantasy, Adventure, Action, Family
## 840                                                        Animation, Fantasy, Drama, Music
## 852                                                      Animation, Fantasy, Romance, Drama
## 896                                                                     Animation, Thriller
## 899                                                    Animation, TV Movie, Fantasy, Action
## 936                                  Comedy, Animation, Adventure, Fantasy, Romance, Action
## 1052                                                          Comedy, Music, Romance, Crime
## 1095                                                                     Comedy, War, Drama
## 1262                                                                     Drama, Comedy, War
## 1291                                                              Drama, Fantasy, Animation
## 1301                                                                Drama, Fantasy, Mystery
## 1364                                                        Drama, Mystery, Science Fiction
## 1431                                                                    Drama, War, Mystery
## 1498                                                               Family, Animation, Drama
## 1505                                   Family, Animation, Fantasy, Music, Comedy, Adventure
## 1656                                                                  Fantasy, Drama, Crime
## 1911                                                             Mystery, Romance, Thriller
## 1937                                                              Romance, Animation, Drama
## 2079                                                    Science Fiction, Mystery, Adventure
## 2224 TV Movie, Animation, Science Fiction, Action, Adventure, Comedy, Drama, Fantasy, Music
##      Proportion_Successful Lower Upper
## 182                      1     1     1
## 411                      1     1     1
## 515                      1     1     1
## 520                      1     1     1
## 587                      1     1     1
## 603                      1     1     1
## 611                      1     1     1
## 635                      1     1     1
## 640                      1     1     1
## 656                      1     1     1
## 727                      1     1     1
## 728                      1     1     1
## 794                      1     1     1
## 801                      1     1     1
## 806                      1     1     1
## 836                      1     1     1
## 837                      1     1     1
## 840                      1     1     1
## 852                      1     1     1
## 896                      1     1     1
## 899                      1     1     1
## 936                      1     1     1
## 1052                     1     1     1
## 1095                     1     1     1
## 1262                     1     1     1
## 1291                     1     1     1
## 1301                     1     1     1
## 1364                     1     1     1
## 1431                     1     1     1
## 1498                     1     1     1
## 1505                     1     1     1
## 1656                     1     1     1
## 1911                     1     1     1
## 1937                     1     1     1
## 2079                     1     1     1
## 2224                     1     1     1

Conclusion:

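The analysis flagged genre combinations in which more than 80% of the movies have an IMDb score above 80. The 36 combinations listed above meet this threshold, and in every one of them the proportion is exactly 1. At a proportion of 1 the normal-approximation interval has zero width, so Lower and Upper both equal 1 and carry no information about uncertainty. Many of these combinations are very specific and probably contain only a few movies, so the list is best read as a set of candidates and not as evidence that these genres consistently produce high-scoring movies.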
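One way to get intervals that stay informative for small groups is to use each group's own size together with a Wilson interval. The following is a minimal sketch on made-up data (the `toy` data frame, its two genres, and their scores are assumptions for illustration, not the movie dataset):

```r
# Toy data: genre "A" has 2 movies (both above 80), genre "B" has 200.
set.seed(7)
toy <- data.frame(
  genre = c(rep("A", 2), rep("B", 200)),
  score = c(85, 90, rnorm(200, mean = 70, sd = 10))
)
toy$hit <- toy$score > 80

k <- tapply(toy$hit, toy$genre, sum)      # hits per genre
n <- tapply(toy$hit, toy$genre, length)   # movies per genre
p <- k / n
z <- qnorm(0.975)

# Normal-approximation (Wald) lower bound: equals p whenever p is 0 or 1.
wald_lower <- p - z * sqrt(p * (1 - p) / n)

# Wilson interval per genre: has nonzero width even when p is exactly 1.
denom        <- 1 + z^2 / n
center       <- (p + z^2 / (2 * n)) / denom
half         <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
wilson_lower <- center - half
wilson_upper <- center + half

data.frame(n = as.vector(n), p = as.vector(p),
           wald_lower = as.vector(wald_lower),
           wilson_lower = as.vector(wilson_lower),
           wilson_upper = as.vector(wilson_upper),
           row.names = names(n))
```

For genre "A" the Wald bound stays at 1, while the Wilson lower bound drops well below 1, reflecting that two movies are very little evidence.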

# Proportion of movies per release date with a score above 80
date_success_proportions <- aggregate(data$score > 80, by = list(data$date_x), mean)

colnames(date_success_proportions) <- c("Release_Date", "Proportion_Successful")

# Movies per release date, in the same group order; the standard error uses this count, not nrow(data)
date_n <- aggregate(data$score, by = list(data$date_x), length)$x

date_success_proportions$Lower <- date_success_proportions$Proportion_Successful - 
  qnorm(0.975) * sqrt((date_success_proportions$Proportion_Successful * (1 - date_success_proportions$Proportion_Successful)) / date_n)

date_success_proportions$Upper <- date_success_proportions$Proportion_Successful + 
  qnorm(0.975) * sqrt((date_success_proportions$Proportion_Successful * (1 - date_success_proportions$Proportion_Successful)) / date_n)


date_success_proportions_above_80 <- subset(date_success_proportions, Proportion_Successful > 0.80)


print(date_success_proportions_above_80)
##      Release_Date Proportion_Successful Lower Upper
## 9      1928-04-07                     1     1     1
## 10     1928-10-25                     1     1     1
## 15     1931-03-07                     1     1     1
## 16     1931-05-11                     1     1     1
## 25     1938-02-17                     1     1     1
## 33     1940-01-12                     1     1     1
## 40     1941-02-07                     1     1     1
## 49     1944-12-01                     1     1     1
## 50     1944-12-08                     1     1     1
## 62     1948-02-06                     1     1     1
## 69     1950-09-29                     1     1     1
## 71     1950-11-03                     1     1     1
## 72     1951-02-05                     1     1     1
## 74     1951-07-11                     1     1     1
## 82     1952-03-20                     1     1     1
## 83     1952-04-02                     1     1     1
## 84     1952-12-25                     1     1     1
## 99     1955-06-23                     1     1     1
## 103    1956-01-01                     1     1     1
## 115    1957-10-10                     1     1     1
## 117    1957-12-17                     1     1     1
## 119    1958-04-26                     1     1     1
## 126    1959-01-23                     1     1     1
## 130    1959-10-15                     1     1     1
## 139    1960-06-09                     1     1     1
## 143    1960-08-24                     1     1     1
## 144    1960-09-21                     1     1     1
## 162    1962-09-15                     1     1     1
## 168    1963-03-01                     1     1     1
## 186    1964-06-12                     1     1     1
## 203    1965-05-31                     1     1     1
## 234    1968-05-02                     1     1     1
## 262    1969-07-31                     1     1     1
## 269    1970-02-12                     1     1     1
## 303    1971-12-22                     1     1     1
## 309    1972-02-10                     1     1     1
## 323    1972-11-02                     1     1     1
## 380    1975-02-20                     1     1     1
## 387    1975-06-18                     1     1     1
## 408    1976-04-01                     1     1     1
## 415    1976-08-05                     1     1     1
## 452    1977-10-26                     1     1     1
## 465    1978-03-08                     1     1     1
## 501    1979-05-25                     1     1     1
## 513    1979-10-19                     1     1     1
## 516    1979-11-15                     1     1     1
## 533    1980-05-21                     1     1     1
## 552    1980-11-13                     1     1     1
## 564    1981-01-16                     1     1     1
## 619    1982-02-26                     1     1     1
## 636    1982-10-21                     1     1     1
## 724    1984-10-04                     1     1     1
## 744    1985-02-28                     1     1     1
## 764    1985-07-09                     1     1     1
## 793    1985-12-01                     1     1     1
## 837    1986-09-01                     1     1     1
## 987    1989-07-21                     1     1     1
## 1010   1989-12-26                     1     1     1
## 1054   1990-09-27                     1     1     1
## 1115   1991-07-27                     1     1     1
## 1125   1991-09-05                     1     1     1
## 1258   1993-07-30                     1     1     1
## 1345   1994-08-25                     1     1     1
## 1363   1994-11-17                     1     1     1
## 1398   1995-04-16                     1     1     1
## 1447   1995-11-02                     1     1     1
## 1648   1998-03-12                     1     1     1
## 1695   1998-09-24                     1     1     1
## 1704   1998-10-28                     1     1     1
## 1710   1998-11-12                     1     1     1
## 1712   1998-11-19                     1     1     1
## 1749   1999-04-08                     1     1     1
## 1756   1999-05-06                     1     1     1
## 1793   1999-11-11                     1     1     1
## 1838   2000-05-04                     1     1     1
## 1849   2000-07-15                     1     1     1
## 1882   2000-11-06                     1     1     1
## 1908   2001-01-10                     1     1     1
## 1948   2001-06-02                     1     1     1
## 2084   2002-08-30                     1     1     1
## 2153   2003-05-02                     1     1     1
## 2277   2004-05-20                     1     1     1
## 2299   2004-08-10                     1     1     1
## 2330   2004-11-06                     1     1     1
## 2423   2005-08-26                     1     1     1
## 2476   2005-12-21                     1     1     1
## 2513   2006-04-07                     1     1     1
## 2524   2006-05-06                     1     1     1
## 2690   2007-08-07                     1     1     1
## 2893   2008-12-07                     1     1     1
## 2924   2009-02-20                     1     1     1
## 2996   2009-08-19                     1     1     1
## 3080   2010-02-18                     1     1     1
## 3103   2010-04-17                     1     1     1
## 3130   2010-06-22                     1     1     1
## 3199   2010-11-02                     1     1     1
## 3308   2011-07-13                     1     1     1
## 3453   2012-05-18                     1     1     1
## 3612   2013-05-02                     1     1     1
## 3718   2013-11-26                     1     1     1
## 3852   2014-08-24                     1     1     1
## 3886   2014-10-22                     1     1     1
## 3991   2015-05-23                     1     1     1
## 4152   2016-04-02                     1     1     1
## 4256   2016-10-16                     1     1     1
## 4301   2017-01-06                     1     1     1
## 4422   2017-08-07                     1     1     1
## 4499   2017-12-27                     1     1     1
## 4595   2018-06-02                     1     1     1
## 4689   2018-11-12                     1     1     1
## 4719   2018-12-31                     1     1     1
## 4814   2019-06-17                     1     1     1
## 4854   2019-09-02                     1     1     1
## 4858   2019-09-06                     1     1     1
## 4907   2019-12-03                     1     1     1
## 4989   2020-05-20                     1     1     1
## 5011   2020-07-03                     1     1     1
## 5034   2020-08-22                     1     1     1
## 5058   2020-09-26                     1     1     1
## 5078   2020-10-24                     1     1     1
## 5098   2020-11-25                     1     1     1
## 5113   2020-12-21                     1     1     1
## 5140   2021-02-15                     1     1     1
## 5382   2022-03-05                     1     1     1
## 5540   2022-10-23                     1     1     1
## 5639   2023-03-22                     1     1     1
## 5688   2023-12-31                     1     1     1

Conclusion:

To relate movie release dates (“date_x”) to IMDb scores (“score”), we computed, for each release date, the proportion of successful movies (movies with a score greater than 80) and kept the dates where that proportion is above 80%. The output lists 127 such dates.

On every one of these dates the proportion is exactly 1, so the interval again collapses to Lower = Upper = 1. A single release date usually covers only a handful of movies, so a proportion of 1 may simply mean that the one or two movies released that day scored above 80. These dates are therefore candidates for closer inspection, and the analysis does not by itself show that there are windows of time in which movies tend to receive high IMDb scores.
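Grouping by a coarser unit, such as release year, gives each group more movies and makes the proportions easier to interpret. The sketch below uses simulated dates and scores (the `toy` data frame and the choice of year as the grouping unit are assumptions for illustration):

```r
# Simulated release dates (2000 to 2009) and scores, NOT the movie dataset.
set.seed(3)
toy <- data.frame(
  date_x = as.Date("2000-01-01") + sample(0:3650, 400, replace = TRUE),
  score  = rnorm(400, mean = 65, sd = 12)
)
toy$year <- as.integer(format(toy$date_x, "%Y"))  # extract the release year
toy$hit  <- toy$score > 80

# Movies per year and proportion of hits per year.
n_year <- tapply(toy$hit, toy$year, length)
p_year <- tapply(toy$hit, toy$year, mean)

by_year <- data.frame(
  year = as.integer(names(n_year)),
  n    = as.vector(n_year),
  prop = as.vector(p_year)
)
by_year
```

Reporting `n` next to each proportion makes it clear how much data stands behind every estimate.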