2025-11-02

Description of Movie Dataset

This movie dataset was obtained from Kaggle and is designed to simulate practical movie industry metrics, such as financial performance and audience engagement. Movie releases span from 1950 to 2025 and include the drama, action, comedy, thriller, romance, sci-fi, horror, and documentary genres. However, the following analysis focuses on major trends from 2014 to 2024.

Overview of Graphs

  • Bar Graph (ggplot): Movie Releases by Year
  • Pie Chart (plotly): Movie Releases by Genre
  • 3D Scatter Plot (ggplot): Relationship Between Opening Day Sales, Budget, and U.S. Box Office
  • Boxplot (plotly): Distribution of IMDb Rating by Genre and Country
  • ANOVA Test: Mean of IMDb Rating Across Genres
  • Linear Regression: Budget vs. U.S. Box Office

Bar Graph: Movie Releases Per Year

From 2014 to 2024, the number of movie releases followed an increasing trend. 2024 had the highest number of releases, while 2014 had the lowest.
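
The yearly counts behind a bar graph like this one can be sketched as follows. This is a minimal illustration on invented toy data, not the original code: the data frame `movies_toy` and the column name `ReleaseYear` are assumptions, since the report does not show how the release year is stored. The plotting step is guarded so the sketch still runs where ggplot2 is not installed.

```r
# Toy stand-in for the movie data; `ReleaseYear` is an assumed column name
# and the values are invented for illustration.
movies_toy <- data.frame(
  ReleaseYear = c(2012, 2014, 2014, 2019, 2024, 2024, 2024, 2025)
)

# Keep only the years covered by the analysis (2014 to 2024 inclusive).
in_window <- movies_toy$ReleaseYear >= 2014 & movies_toy$ReleaseYear <= 2024
movies_window <- movies_toy[in_window, , drop = FALSE]

# Count releases per year and convert the table to a data frame.
year_counts <- as.data.frame(table(ReleaseYear = movies_window$ReleaseYear),
                             responseName = "Releases")

# Year with the most releases in the toy data.
peak_year <- as.character(year_counts$ReleaseYear[which.max(year_counts$Releases)])

# Optional: build the bar graph object if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  releases_plot <- ggplot2::ggplot(
    year_counts, ggplot2::aes(x = ReleaseYear, y = Releases)
  ) +
    ggplot2::geom_col() +
    ggplot2::labs(title = "Movie Releases by Year", x = "Year", y = "Releases")
}
```

Counting before plotting keeps the per-year totals available as a plain data frame, so the highest and lowest years can be read off directly rather than estimated from the chart.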

Pie Chart: Proportions of Movie Releases by Genre

3D Scatter Plot: Movie Financial Performance

Boxplot: Distribution of IMDb Rating by Genre and Country

Boxplot Analysis

Based on the boxplot, median IMDb ratings per genre differed by country. In Australia, action had the highest median rating at 6.6, while drama had the lowest at 6.25. Similarly, in Japan, action had the highest median at 7, while documentary had the lowest at 6.1. In the US, median ratings were very similar across all genres, with romance and thriller rated highest at 6.6. In France, however, horror had the highest median at 6.8. To determine whether IMDb ratings differ significantly across genres, an ANOVA test was conducted.
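
The medians read off a boxplot can also be computed directly. The sketch below does this on invented toy data: `IMDbRating` and `Genre` are the column names used in the report's ANOVA call, while `Country`, the data frame `ratings_toy`, and all rating values are assumptions made for illustration.

```r
# Toy stand-in for the movie data. `Country` is an assumed column name and
# the ratings are invented; they are not taken from the dataset.
ratings_toy <- data.frame(
  Country    = c("Japan", "Japan", "Japan", "Japan", "France", "France"),
  Genre      = c("Action", "Action", "Action", "Documentary", "Horror", "Horror"),
  IMDbRating = c(6.8, 7.0, 7.4, 6.1, 6.6, 7.0)
)

# Median rating for every country and genre combination.
median_ratings <- aggregate(IMDbRating ~ Country + Genre,
                            data = ratings_toy, FUN = median)

# Sort by country, then by descending median, and keep the first row per
# country to get the highest-median genre in each one.
sorted_medians <- median_ratings[
  order(median_ratings$Country, -median_ratings$IMDbRating),
]
top_genre <- sorted_medians[!duplicated(sorted_medians$Country), ]
```

A table of medians like `median_ratings` makes the per-country comparisons above reproducible without reading values off the plot.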

ANOVA Test

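The ANOVA test does not show a statistically significant difference in mean IMDb ratings across genres. The p-value of 0.112 is above the conventional 0.05 significance level, so we fail to reject the null hypothesis that the genre means are equal. This is an absence of evidence for a difference, not proof that the means are the same.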

anova_mod <- aov(IMDbRating ~ Genre, data = movies_2019)
summary(anova_mod)
##                Df Sum Sq Mean Sq F value Pr(>F)
## Genre           7     26   3.712   1.666  0.112
## Residuals   22208  49489   2.228
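
As a check on how the table's columns relate, the F value and p-value can be recomputed from the rounded figures printed above. This is a sketch using only the numbers in the output; the 0.05 threshold is an assumed conventional significance level, not something the output states.

```r
# Values copied from the ANOVA table in the report.
df_genre         <- 7
df_residual      <- 22208
mean_sq_genre    <- 3.712
mean_sq_residual <- 2.228

# F statistic: between-genre mean square over residual mean square.
f_stat <- mean_sq_genre / mean_sq_residual

# The upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_genre, df_residual, lower.tail = FALSE)

# Decision at the conventional 0.05 level (an assumed threshold).
reject_null <- p_value < 0.05
```

Because the inputs are rounded, the recomputed values should land close to, but not necessarily exactly on, the reported F value of 1.666 and p-value of 0.112.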

Linear Regression

Linear Regression Output

model <- lm(US_BoxOfficeUSD ~ BudgetUSD, data = movies_clean)
summary(model)
## 
## Call:
## lm(formula = US_BoxOfficeUSD ~ BudgetUSD, data = movies_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -331942774   -1783754    -132050    1405133  560481667 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.153e+04  4.074e+04   2.001   0.0454 *  
## BudgetUSD   1.517e+00  1.643e-03 923.298   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18470000 on 244358 degrees of freedom
## Multiple R-squared:  0.7772, Adjusted R-squared:  0.7772 
## F-statistic: 8.525e+05 on 1 and 244358 DF,  p-value: < 2.2e-16

Linear Regression Analysis

The linear regression of U.S. box office revenue on movie budget shows a strong positive linear relationship between the two variables. The regression equation, US_BoxOfficeUSD = 81525.98 + 1.52 × BudgetUSD, and an R² of approximately 0.7772 indicate that about 77.7% of the variation in box office earnings can be explained by a movie’s budget. The slope implies that each additional dollar of budget is associated with about $1.52 more in U.S. box office revenue. The p-value (< 0.001) indicates that this relationship is statistically significant: higher budgets are associated with higher box office returns. This trend is clearly visible in the scatter plot, where data points cluster closely around the upward-sloping regression line. While budget plays a major role in predicting box office success, the remaining variation suggests that other factors, such as genre, marketing, release timing, and audience reception, also influence movie performance. Overall, the analysis indicates that budget is a statistically significant and strong predictor of U.S. box office revenue in this dataset.
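
To make the fitted equation concrete, the sketch below plugs the reported coefficients into a small prediction function and recomputes two quantities from the output. The function name `predict_box_office` and the 50 million USD example budget are hypothetical, chosen only for illustration; the coefficients are copied from the report.

```r
# Coefficients copied from the regression output and equation in the report.
intercept <- 81525.98   # intercept as written in the regression equation
slope     <- 1.517      # BudgetUSD estimate
slope_se  <- 1.643e-03  # standard error of the BudgetUSD estimate
r_squared <- 0.7772

# Predicted U.S. box office (USD) for a given budget (USD).
predict_box_office <- function(budget_usd) {
  intercept + slope * budget_usd
}

# Example: a hypothetical 50 million USD budget.
predicted <- predict_box_office(50e6)

# With a single predictor, the correlation is the square root of R-squared,
# carrying the sign of the slope.
correlation <- sign(slope) * sqrt(r_squared)

# The t value in the output is the estimate divided by its standard error.
t_value <- slope / slope_se
```

For the hypothetical 50 million USD budget, the fitted line gives a predicted U.S. box office of roughly 75.9 million USD, and the implied correlation between budget and box office is about 0.88. The residual range in the output spans hundreds of millions of dollars, so individual predictions can still miss by a wide margin.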

Conclusion

Based on the analysis of movie releases from 2014 to 2024, several key trends emerged. The increase in movie releases over the period, peaking in 2024, reflects growing output in the dataset. The genre distribution revealed diverse production across all categories, and IMDb ratings were similar across genres, with no statistically significant difference in genre means. The 3D scatter plot and linear regression analysis showed that budget is a strong predictor of box office success, explaining approximately 77.7% of revenue variation. However, the remaining variation suggests that factors beyond budget and genre, such as marketing strategies, release timing, star power, and audience preferences, likely play important roles in a movie’s ultimate success. This analysis highlights that while financial investment is significant, a movie’s success involves a multifaceted interplay of various elements.