#load the data libraries - remove or add as needed
library(tidyverse) #tools for data science; attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, and lubridate as core packages.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded for visualization formatting, including currency labels
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")
#clean up column names to avoid constant issues with special characters and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"
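As an aside, with the tidyverse attached the same renames can be written as a single dplyr::rename() call. This is an equivalent alternative to the assignments above (not meant to be run in addition to them), with backticks quoting the non-syntactic original names:
# One-step equivalent of the renames above using dplyr
movies <- movies %>%
  rename(
    MovieName          = `Release Group`,
    WorldwideRevenue   = `$Worldwide`,
    DomesticRevenue    = `$Domestic`,
    ForeignRevenue     = `$Foreign`,
    DomesticPercentage = `Domestic %`,
    ForeignPercentage  = `Foreign %`,
    RankForYear        = Rank
  )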
# View updated column names
#cat(colnames(movies), sep = ", ") #commented out after test
#create a version of the data set which excludes worldwide revenue outliers
# Calculate IQR for WorldwideRevenue
Q1 <- quantile(movies$WorldwideRevenue, 0.25) # First quartile
Q3 <- quantile(movies$WorldwideRevenue, 0.75) # Third quartile
iqr <- Q3 - Q1 # interquartile range; named iqr to avoid masking stats::IQR()
# Define lower and upper bounds (1.5 * IQR rule)
lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr
# Filter out outliers
movies_clean <- movies[movies$WorldwideRevenue >= lower_bound & movies$WorldwideRevenue <= upper_bound, ]
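Equivalently, the same outlier filter can be written with dplyr (a sketch; between() is inclusive on both ends, matching the >= and <= above):
# Same filter using dplyr::filter() and between()
movies_clean <- movies %>%
  filter(between(WorldwideRevenue, lower_bound, upper_bound))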
# View summary of the uncleaned data
summary(movies$WorldwideRevenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1666028 24662197 48446575 119213693 119758766 2799439100
# View summary of the cleaned data (outliers removed)
summary(movies_clean$WorldwideRevenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1666028 22937293 41258298 63580189 86322304 261930436
Thanks to previous assignments, it is clear that there are blockbuster movies with much greater returns than the other movies. While this is an interesting set of data on its own, it hides some trends. For this assignment, a limited data set excluding these outliers, called movies_clean, will be used (code listed above).
The summary statistics for worldwide revenue in the cleaned set are far less extreme than in the uncleaned set; the mean drops from about $119 million to about $64 million.
“Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses. What is that thing (or, one of “those” things) for your data? Select a categorical column of data (explanatory variable) that you expect might influence the response variable. Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results (e.g., use box plots). Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”. Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. Build a linear regression model of the response using just this column, and evaluate its fit. Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?”
WorldwideRevenue was selected as the response variable, as it is a continuous value that would logically be influenced by other factors. Additionally, revenue is of significant importance to the industry, so factors that might drive it are worth analyzing.
For the categorical variable we will focus on Prime_Production_Country. Examining the relationship between this variable and WorldwideRevenue might provide insight into whether the production country has any impact on a movie's revenue.
Null Hypothesis (\(H_0\)):
The mean WorldwideRevenue for all Prime_Production_Country groups is equal.
Alternative Hypothesis (\(H_1\)):
The mean WorldwideRevenue differs for at least one pair of country groups.
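Formally, letting \(\mu_i\) denote the mean WorldwideRevenue of country group \(i\) (with 10 groups after the consolidation below), the hypotheses are:
\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_{10}
\qquad
H_1: \mu_i \neq \mu_j \ \text{for at least one pair}\ i \neq j
\]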
( If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.)
From previous work I know there are more than 10 countries, so the categories need to be consolidated in a meaningful way. The idea is to check whether a small number of countries represents most of the data. The first question is how many countries appear in the data set.
# Find unique countries in Prime_Production_Country
uniqueCountries <- unique(movies_clean$Prime_Production_Country)
# Count the number of unique countries
numCountries <- length(uniqueCountries)
# Print the results
#print(uniqueCountries)
print(paste("Number of unique countries:", numCountries))
## [1] "Number of unique countries: 61"
Now we know there are 61 countries. How many films are there per country?
#determine how many times each country is in the data frame
# Count and sort countries by frequency
countryCounts <- sort(
  table(movies_clean$Prime_Production_Country),
  decreasing = TRUE
)
# Print frequency table
print(countryCounts)
##
## United States of America Japan China
## 1693 379 314
## France United Kingdom Germany
## 292 225 223
## South Korea Canada India
## 193 188 129
## Belgium Italy Australia
## 68 66 62
## Spain Hong Kong Russia
## 55 54 44
## Mexico Denmark Turkey
## 22 21 20
## Switzerland Brazil Ireland
## 18 17 17
## Czech Republic Austria Bulgaria
## 14 13 12
## Poland Finland Netherlands
## 12 7 7
## New Zealand Sweden United Arab Emirates
## 7 7 7
## Argentina South Africa Vietnam
## 6 6 6
## Iceland Thailand Greece
## 5 5 4
## Luxembourg Chile Estonia
## 4 3 2
## Iran Taiwan Algeria
## 2 2 1
## Bahamas Belarus Cambodia
## 1 1 1
## Dominican Republic Ecuador Hungary
## 1 1 1
## Indonesia Lithuania Morocco
## 1 1 1
## Nigeria Norway Pakistan
## 1 1 1
## Peru Philippines Romania
## 1 1 1
## Singapore Slovakia Ukraine
## 1 1 1
A smaller group of countries clearly represents the majority of the data. One reasonable cutoff would be to keep any country with more than 100 films, but a more precise, value-based criterion would be better. If we compute each country's percentage of the whole data set and accumulate those shares in order, we can see where the running total reaches a meaningful level.
# Check cumulative representation
cumulativePercentage <- cumsum(countryCounts) / sum(countryCounts) * 100
print(cumulativePercentage)
## United States of America Japan China
## 39.83529 48.75294 56.14118
## France United Kingdom Germany
## 63.01176 68.30588 73.55294
## South Korea Canada India
## 78.09412 82.51765 85.55294
## Belgium Italy Australia
## 87.15294 88.70588 90.16471
## Spain Hong Kong Russia
## 91.45882 92.72941 93.76471
## Mexico Denmark Turkey
## 94.28235 94.77647 95.24706
## Switzerland Brazil Ireland
## 95.67059 96.07059 96.47059
## Czech Republic Austria Bulgaria
## 96.80000 97.10588 97.38824
## Poland Finland Netherlands
## 97.67059 97.83529 98.00000
## New Zealand Sweden United Arab Emirates
## 98.16471 98.32941 98.49412
## Argentina South Africa Vietnam
## 98.63529 98.77647 98.91765
## Iceland Thailand Greece
## 99.03529 99.15294 99.24706
## Luxembourg Chile Estonia
## 99.34118 99.41176 99.45882
## Iran Taiwan Algeria
## 99.50588 99.55294 99.57647
## Bahamas Belarus Cambodia
## 99.60000 99.62353 99.64706
## Dominican Republic Ecuador Hungary
## 99.67059 99.69412 99.71765
## Indonesia Lithuania Morocco
## 99.74118 99.76471 99.78824
## Nigeria Norway Pakistan
## 99.81176 99.83529 99.85882
## Peru Philippines Romania
## 99.88235 99.90588 99.92941
## Singapore Slovakia Ukraine
## 99.95294 99.97647 100.00000
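The same cutoff can be read off programmatically (85 is the threshold used below; for this data the answer is the 9th country, India):
# First position where the running share of films reaches 85%
which(cumulativePercentage >= 85)[1]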
That did it. The cumulative sums show that the first three countries represent about 56% of the data, the first 7 about 78%, and the first 9 about 85%. Tagging all remaining countries as "Other" gets us to 10 categories: USA, Japan, China, France, U.K., Germany, S. Korea, Canada, India, and Other.
# Define the top 9 countries by film count
topCountries <- names(sort(
  table(movies_clean$Prime_Production_Country),
  decreasing = TRUE
)[1:9])
# Create the new column for grouping
movies_clean <- movies_clean %>%
  mutate(
    top9ProdCountries = ifelse(
      Prime_Production_Country %in% topCountries,
      Prime_Production_Country,
      "Other"
    )
  )
# Verify the grouping
table(movies_clean$top9ProdCountries)
##
## Canada China France
## 188 314 292
## Germany India Japan
## 223 129 379
## Other South Korea United Kingdom
## 811 193 225
## United States of America
## 1693
That worked: ten categories, with "Other" collecting the remaining 52 countries.
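As an aside, forcats (attached with the tidyverse) can do this consolidation in one step with fct_lump_n(). A minimal, non-destructive sketch follows; top9_fct is a hypothetical scratch column, and the result is a factor rather than a character vector:
# Same top-9-plus-"Other" grouping via forcats, shown as a check
# rather than a replacement for the column created above
movies_clean %>%
  mutate(top9_fct = fct_lump_n(Prime_Production_Country, n = 9,
                               other_level = "Other")) %>%
  count(top9_fct, sort = TRUE)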
# Display the distribution in a cleaner format using cat()
countryCounts <- table(movies_clean$top9ProdCountries) # Calculate the counts
totalMovies <- nrow(movies_clean) # Total number of movies
# Loop through each category and print with cat()
for (country in names(countryCounts)) {
  percentage <- (countryCounts[country] / totalMovies) * 100
  cat(sprintf("%s: %d movies (%.2f%% of total)\n",
              country, countryCounts[country], percentage))
}
## Canada: 188 movies (4.23% of total)
## China: 314 movies (7.06% of total)
## France: 292 movies (6.57% of total)
## Germany: 223 movies (5.01% of total)
## India: 129 movies (2.90% of total)
## Japan: 379 movies (8.52% of total)
## Other: 811 movies (18.24% of total)
## South Korea: 193 movies (4.34% of total)
## United Kingdom: 225 movies (5.06% of total)
## United States of America: 1693 movies (38.07% of total)
# Perform ANOVA test
anovaResult <- aov(
  WorldwideRevenue ~ top9ProdCountries,
  data = movies_clean
)
# View ANOVA summary
summary(anovaResult)
## Df Sum Sq Mean Sq F value
## top9ProdCountries 9 1394368176594929152 154929797399436576 50.91
## Residuals 4437 13502055684873889792 3043059654017104
## Pr(>F)
## top9ProdCountries <0.0000000000000002 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Visualize results with a box plot
boxplot(
  WorldwideRevenue ~ top9ProdCountries,
  data = movies_clean,
  main = "Worldwide Revenue by Production Group",
  xlab = "Production Group",
  ylab = "Worldwide Revenue",
  las = 2, # Rotate x-axis labels for readability
  col = rainbow(length(unique(movies_clean$top9ProdCountries)))
)
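Since scales was loaded above partly for currency formatting, a ggplot2 version of the same box plot with dollar-formatted axis labels might look like this sketch (it complements rather than replaces the base plot):
# ggplot2 variant of the box plot, with the y axis formatted as
# dollars via scales::label_dollar()
ggplot(movies_clean, aes(x = top9ProdCountries, y = WorldwideRevenue)) +
  geom_boxplot(fill = "steelblue") +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Worldwide Revenue by Production Group",
       x = "Production Group", y = "Worldwide Revenue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))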
Analysis of Variance (ANOVA) Results
An ANOVA test was conducted to evaluate whether the mean WorldwideRevenue differs significantly across the top 9 production countries plus the "Other" category. The results are as follows:
Degrees of Freedom (Df): 9 (between groups) and 4437 (residuals).
Sum of Squares (Sum Sq): The between-group sum of squares was about 1.39 quintillion; the residual sum of squares was about 13.5 quintillion.
F-Value: 50.91, indicating a strong ratio of explained to unexplained variance.
p-Value: <0.0000000000000002, confirming statistical significance at any reasonable threshold.
Conclusion: The results strongly reject the null hypothesis: mean WorldwideRevenue is not the same across all production country groups. Further analysis (e.g., post-hoc testing) can identify which specific groups differ.
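A sketch of that post-hoc step, using base R's TukeyHSD() on the fitted aov object (it runs all pairwise comparisons with family-wise error control):
# Pairwise comparisons of group means, post-hoc to the ANOVA above
tukeyResult <- TukeyHSD(anovaResult)
# Keep only the pairs whose adjusted p-value is below 0.05
tukeyPairs <- as.data.frame(tukeyResult$top9ProdCountries)
head(tukeyPairs[tukeyPairs$`p adj` < 0.05, ])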
Next, we look for a good continuous predictor of revenue, starting with Vote_Count and Rating_of_10.
# Scatter plot for Vote_Count
plot(movies_clean$Vote_Count, movies_clean$WorldwideRevenue,
     main = "Vote Count vs. Worldwide Revenue",
     xlab = "Vote Count", ylab = "Worldwide Revenue",
     pch = 19, col = "blue")
This looks roughly linear: the cloud of points climbs upward, with higher revenue at higher vote counts.
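A correlation coefficient backs up this visual impression with a number; use = "complete.obs" drops the rows with missing vote counts (the regression below reports 167 such rows):
# Pearson correlation between vote count and worldwide revenue
cor(movies_clean$Vote_Count, movies_clean$WorldwideRevenue,
    use = "complete.obs")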
# Scatter plot for Rating_of_10
plot(movies_clean$Rating_of_10, movies_clean$WorldwideRevenue,
     main = "Rating of 10 vs. Worldwide Revenue",
     xlab = "Average Rating", ylab = "Worldwide Revenue",
     pch = 19, col = "green")
Of the two, Vote_Count looks better: its relationship with revenue appears roughly linear. The Rating_of_10 plot does not suggest anything linear; the data is grouped heavily in one area and shows no clear connection to revenue.
# Fit the regression model
regressionModel <- lm(WorldwideRevenue ~ Vote_Count, data = movies_clean)
# View model summary
summary(regressionModel)
##
## Call:
## lm(formula = WorldwideRevenue ~ Vote_Count, data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -322050397 -29582414 -13986823 17363661 214483667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41710918.4 925994.8 45.04 <0.0000000000000002 ***
## Vote_Count 13920.5 331.7 41.97 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49080000 on 4278 degrees of freedom
## (167 observations deleted due to missingness)
## Multiple R-squared: 0.2917, Adjusted R-squared: 0.2915
## F-statistic: 1762 on 1 and 4278 DF, p-value: < 0.00000000000000022
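To make the coefficients concrete before interpreting them, the fitted line can be evaluated at a few vote counts (the values below are hypothetical, chosen only for illustration):
# Predicted worldwide revenue at illustrative vote counts, with
# prediction intervals
newMovies <- data.frame(Vote_Count = c(1000, 5000, 10000))
predict(regressionModel, newdata = newMovies, interval = "prediction")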
# Scatter plot with regression line
plot(
  movies_clean$Vote_Count, movies_clean$WorldwideRevenue,
  main = "Vote Count vs. Worldwide Revenue", # Plot title
  xlab = "Vote Count",        # x-axis label
  ylab = "Worldwide Revenue", # y-axis label
  pch = 19,                   # Point type
  col = "blue"                # Point color
)
# Add the regression line
abline(
  lm(WorldwideRevenue ~ Vote_Count, data = movies_clean),
  col = "red", # Line color
  lwd = 2      # Line width
)
Intercept:
The intercept is 41,710,918.4, meaning a movie with zero votes is predicted to earn about $41.7 million worldwide. Since no movie in the data has zero votes, this serves mainly as the baseline of the fitted line.
Vote_Count Coefficient:
The coefficient is 13,920.5, meaning each additional vote is associated with roughly $13,920 more worldwide revenue on average.
Statistical Significance:
Both coefficients have p-values below 0.0000000000000002, so the relationship is statistically significant at any reasonable threshold.
Residual Standard Error (RSE):
The RSE is about 49,080,000, meaning predictions from this model are typically off by roughly $49 million.
R-Squared (R²):
The Multiple R-squared is 0.2917, meaning the model explains about 29.17% of the variation in WorldwideRevenue using Vote_Count. While this isn't a perfect fit, it suggests Vote_Count is a meaningful predictor, though other factors also contribute to revenue variability.
F-Statistic:
The F-statistic is 1762 on 1 and 4278 degrees of freedom, with p < 0.00000000000000022, confirming the model explains significantly more variance than an intercept-only model.
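Residual diagnostics would round out the fit evaluation; base R's built-in plots for lm objects are a quick way to check:
# Standard lm diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(regressionModel)
par(mfrow = c(1, 1))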
These findings suggest that popularity and revenue are strongly related. Note that vote counts accumulate after release, so this is an association rather than a pre-release predictor. Still, from a strictly revenue perspective, it supports investing in sequels or additions to franchises that have already proven popular.