Week 8 Data Dive - ANOVA and Regression Testing

Initialization Step 1 - Load Libraries

#load the data libraries - remove or add as needed
library(tidyverse)   #tools for data science; includes ggplot2, dplyr, tidyr, readr, tibble, stringr, and forcats as core packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)      #loaded to address viz issues, including currency issues
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
options(scipen=999)  #disable scientific notation since high values are used

Initialization Step 2 - Load Data Set

#clean up the work space
rm(list = ls())
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")

#cleanup of column names to avoid constant issues with special chars and inaccurate names
# Rename specific columns
colnames(movies)[which(colnames(movies) == "Release Group")] <- "MovieName"
colnames(movies)[which(colnames(movies) == "$Worldwide")] <- "WorldwideRevenue"
colnames(movies)[which(colnames(movies) == "$Domestic")] <- "DomesticRevenue"
colnames(movies)[which(colnames(movies) == "$Foreign")] <- "ForeignRevenue"
colnames(movies)[which(colnames(movies) == "Domestic %")] <- "DomesticPercentage"
colnames(movies)[which(colnames(movies) == "Foreign %")] <- "ForeignPercentage"
colnames(movies)[which(colnames(movies) == "Rank")] <- "RankForYear"

# View updated column names
#cat(colnames(movies), sep = ", ")             #commented out after test
#create a version of the data set which excludes worldwide revenue outliers
# Calculate IQR for WorldwideRevenue
Q1 <- quantile(movies$WorldwideRevenue, 0.25) # First quartile
Q3 <- quantile(movies$WorldwideRevenue, 0.75) # Third quartile
IQR <- Q3 - Q1

# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Filter out outliers
movies_clean <- movies[movies$WorldwideRevenue >= lower_bound & movies$WorldwideRevenue <= upper_bound, ]
# View summary of the uncleaned data
summary(movies$WorldwideRevenue)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##    1666028   24662197   48446575  119213693  119758766 2799439100
# View summary of the cleaned data
summary(movies_clean$WorldwideRevenue)  
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   1666028  22937293  41258298  63580189  86322304 261930436

Insights

Thanks to previous assignments, it is clear that there are blockbuster movies with much greater returns than the rest of the data. While that is interesting on its own, it hides some trends. For this assignment a limited data set that excludes those outliers, movies_clean, will be used (code listed above).

The summary statistics for worldwide revenue in the cleaned set are far less extreme than those of the uncleaned set; in particular, the mean falls from roughly $119 million to about $64 million.
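As a quick sanity check on how much data the IQR filter removes, the row counts of the two data frames can be compared (a small sketch using the objects created above):

# Compare row counts before and after removing WorldwideRevenue outliers
nrow(movies)                      # all movies in the original copy
nrow(movies_clean)                # movies within the IQR bounds
nrow(movies) - nrow(movies_clean) # number of outlier rows dropped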

Task Demonstrations

The assignment

  • Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses. What is that thing (or, one of “those” things) for your data?

  • Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

  • Devise a null hypothesis for an ANOVA test given this situation.

  • Test this hypothesis using ANOVA, and summarize your results (e.g., use box plots). Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.

  • Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

  • Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

  • Build a linear regression model of the response using just this column, and evaluate its fit.

  • Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

Step 1: Select a Response variable

WorldwideRevenue is selected as the response variable, as it is a continuous value that would logically be influenced by other factors. Revenue is also of significant importance to the industry, so factors that might drive it are worth analyzing.

Step 2: Select a Categorical variable

For the categorical variable we will focus on Prime_Production_Country. Examining the relationship between this variable and WorldwideRevenue might provide insight into whether the production country has any impact on a movie’s revenue.

Step 3: Devise a Null hypothesis for ANOVA

Null Hypothesis (\(H_0\)):

The mean WorldwideRevenue for all Prime_Production_Country groups is equal.

Alternative Hypothesis (\(H_1\)):

There is at least one pair of country groups whose mean WorldwideRevenue differs.
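Stated formally, with \(\mu_i\) denoting the mean WorldwideRevenue of country group \(i\) and \(k\) the number of groups:

\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_k
\qquad \text{vs.} \qquad
H_1: \mu_i \neq \mu_j \ \text{for at least one pair } (i, j)
\]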

Step 4: Perform and summarize ANOVA

A: Reduce the categories if there are too many.

(If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.)

From previous work I know there are more than 10 countries, so the question is how to simplify the categories in a meaningful way. The idea is to figure out whether most of the data is represented by a smaller number of countries.

The first question is how many countries are represented in the data set.

# Find unique countries in Prime_Production_Country
uniqueCountries <- unique(movies_clean$Prime_Production_Country)

# Count the number of unique countries
numCountries <- length(uniqueCountries)

# Print the results
#print(uniqueCountries)
print(paste("Number of unique countries:", numCountries))
## [1] "Number of unique countries: 61"

Now that we know there are 61 countries, how many films are there per country?

#determine how many times each country is in the data frame
# Count and sort countries by frequency
countryCounts <- sort(
  table(movies_clean$Prime_Production_Country),
  decreasing = TRUE
)

# Print frequency table
print(countryCounts)
## 
## United States of America                    Japan                    China 
##                     1693                      379                      314 
##                   France           United Kingdom                  Germany 
##                      292                      225                      223 
##              South Korea                   Canada                    India 
##                      193                      188                      129 
##                  Belgium                    Italy                Australia 
##                       68                       66                       62 
##                    Spain                Hong Kong                   Russia 
##                       55                       54                       44 
##                   Mexico                  Denmark                   Turkey 
##                       22                       21                       20 
##              Switzerland                   Brazil                  Ireland 
##                       18                       17                       17 
##           Czech Republic                  Austria                 Bulgaria 
##                       14                       13                       12 
##                   Poland                  Finland              Netherlands 
##                       12                        7                        7 
##              New Zealand                   Sweden     United Arab Emirates 
##                        7                        7                        7 
##                Argentina             South Africa                  Vietnam 
##                        6                        6                        6 
##                  Iceland                 Thailand                   Greece 
##                        5                        5                        4 
##               Luxembourg                    Chile                  Estonia 
##                        4                        3                        2 
##                     Iran                   Taiwan                  Algeria 
##                        2                        2                        1 
##                  Bahamas                  Belarus                 Cambodia 
##                        1                        1                        1 
##       Dominican Republic                  Ecuador                  Hungary 
##                        1                        1                        1 
##                Indonesia                Lithuania                  Morocco 
##                        1                        1                        1 
##                  Nigeria                   Norway                 Pakistan 
##                        1                        1                        1 
##                     Peru              Philippines                  Romania 
##                        1                        1                        1 
##                Singapore                 Slovakia                  Ukraine 
##                        1                        1                        1

It looks like a smaller group of countries represents the majority of the data. A reasonable cutoff would be any country with more than 100 films, but something more precise, based on a specific value, would be better. If we look at each country’s percentage of the whole data set and add those percentages together sequentially, we should be able to see when the running total reaches something significant.

# Check cumulative representation
cumulativePercentage <- cumsum(countryCounts) / sum(countryCounts) * 100
print(cumulativePercentage)
## United States of America                    Japan                    China 
##                 39.83529                 48.75294                 56.14118 
##                   France           United Kingdom                  Germany 
##                 63.01176                 68.30588                 73.55294 
##              South Korea                   Canada                    India 
##                 78.09412                 82.51765                 85.55294 
##                  Belgium                    Italy                Australia 
##                 87.15294                 88.70588                 90.16471 
##                    Spain                Hong Kong                   Russia 
##                 91.45882                 92.72941                 93.76471 
##                   Mexico                  Denmark                   Turkey 
##                 94.28235                 94.77647                 95.24706 
##              Switzerland                   Brazil                  Ireland 
##                 95.67059                 96.07059                 96.47059 
##           Czech Republic                  Austria                 Bulgaria 
##                 96.80000                 97.10588                 97.38824 
##                   Poland                  Finland              Netherlands 
##                 97.67059                 97.83529                 98.00000 
##              New Zealand                   Sweden     United Arab Emirates 
##                 98.16471                 98.32941                 98.49412 
##                Argentina             South Africa                  Vietnam 
##                 98.63529                 98.77647                 98.91765 
##                  Iceland                 Thailand                   Greece 
##                 99.03529                 99.15294                 99.24706 
##               Luxembourg                    Chile                  Estonia 
##                 99.34118                 99.41176                 99.45882 
##                     Iran                   Taiwan                  Algeria 
##                 99.50588                 99.55294                 99.57647 
##                  Bahamas                  Belarus                 Cambodia 
##                 99.60000                 99.62353                 99.64706 
##       Dominican Republic                  Ecuador                  Hungary 
##                 99.67059                 99.69412                 99.71765 
##                Indonesia                Lithuania                  Morocco 
##                 99.74118                 99.76471                 99.78824 
##                  Nigeria                   Norway                 Pakistan 
##                 99.81176                 99.83529                 99.85882 
##                     Peru              Philippines                  Romania 
##                 99.88235                 99.90588                 99.92941 
##                Singapore                 Slovakia                  Ukraine 
##                 99.95294                 99.97647                100.00000

That did it. The cumulative sum shows that the first three countries represent about 56% of the data, the first 7 about 78%, and the first 9 about 86%. Tagging all remaining countries as “Other” gets us to 10 categories: USA, Japan, China, France, U.K., Germany, S. Korea, Canada, India, and Other.

# Define top 9 countries
topCountries <- names(sort(
  table(movies_clean$Prime_Production_Country),
  decreasing = TRUE
)[1:9])

# Create the new column for grouping
movies_clean <- movies_clean %>%
  mutate(
    top9ProdCountries = ifelse(
      Prime_Production_Country %in% topCountries,
      Prime_Production_Country,
      "Other"
    )
  )

# Verify the grouping
table(movies_clean$top9ProdCountries)
## 
##                   Canada                    China                   France 
##                      188                      314                      292 
##                  Germany                    India                    Japan 
##                      223                      129                      379 
##                    Other              South Korea           United Kingdom 
##                      811                      193                      225 
## United States of America 
##                     1693

That did it.
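For reference, the forcats package (loaded with the tidyverse) can do the same lumping in a single call. This alternative is only a sketch and is not used here; unlike the ifelse() approach above, it returns a factor column rather than a character column.

# Alternative: keep the 9 most frequent countries and lump the rest into "Other"
movies_clean %>%
  mutate(
    top9ProdCountries = fct_lump_n(
      Prime_Production_Country,
      n = 9,
      other_level = "Other"
    )
  )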

# Display the distribution in a cleaner format using cat()
countryCounts <- table(movies_clean$top9ProdCountries) # Calculate the counts
totalMovies <- nrow(movies_clean) # Total number of movies

# Loop through each category and print with cat()
for (country in names(countryCounts)) {
  percentage <- (countryCounts[country] / totalMovies) * 100
  cat(sprintf("%s: %d movies (%.2f%% of total)\n", country, countryCounts[country], percentage))
}
## Canada: 188 movies (4.23% of total)
## China: 314 movies (7.06% of total)
## France: 292 movies (6.57% of total)
## Germany: 223 movies (5.01% of total)
## India: 129 movies (2.90% of total)
## Japan: 379 movies (8.52% of total)
## Other: 811 movies (18.24% of total)
## South Korea: 193 movies (4.34% of total)
## United Kingdom: 225 movies (5.06% of total)
## United States of America: 1693 movies (38.07% of total)
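The same breakdown can also be produced with dplyr and the scales package loaded earlier; a sketch using the existing top9ProdCountries column:

# Count movies per group and express each count as a percentage of the total
movies_clean %>%
  count(top9ProdCountries, name = "movies") %>%
  mutate(percentage = percent(movies / sum(movies), accuracy = 0.01))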

B: Run the ANOVA test

# Perform ANOVA test
anovaResult <- aov(
  WorldwideRevenue ~ top9ProdCountries,
  data = movies_clean
)

# View ANOVA summary
summary(anovaResult)
##                     Df               Sum Sq            Mean Sq F value
## top9ProdCountries    9  1394368176594929152 154929797399436576   50.91
## Residuals         4437 13502055684873889792   3043059654017104        
##                                Pr(>F)    
## top9ProdCountries <0.0000000000000002 ***
## Residuals                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Visualize results with a box plot
boxplot(
  WorldwideRevenue ~ top9ProdCountries,
  data = movies_clean,
  main = "Worldwide Revenue by Production Group",
  xlab = "Production Group",
  ylab = "Worldwide Revenue",
  las = 2, # Rotate x-axis labels for readability
  col = rainbow(length(unique(movies_clean$top9ProdCountries)))
)
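Because the revenue values are so large, the default axis labels on the plot above can be hard to read. A ggplot2 version with dollar-formatted labels (using the scales package loaded earlier) is sketched below; it assumes the same columns:

# Same box plot in ggplot2 with currency formatting on the y-axis
ggplot(movies_clean, aes(x = top9ProdCountries, y = WorldwideRevenue)) +
  geom_boxplot(fill = "steelblue") +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Worldwide Revenue by Production Group",
    x = "Production Group",
    y = "Worldwide Revenue"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))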

Analysis of Variance (ANOVA) Results

An ANOVA test was conducted to evaluate whether the mean WorldwideRevenue differs significantly across the top 9 production groups (and an “Other” category). The results are as follows:

  • Degrees of Freedom (Df): 9 (between groups) and 4437 (residuals).

  • Sum of Squares (Sum Sq): The between-group sum of squares was about 1.39 quintillion, and the residual sum of squares was about 13.5 quintillion.

  • F-Value: 50.91, indicating a strong ratio of explained to unexplained variance.

  • p-Value: <0.0000000000000002, confirming statistical significance at any reasonable threshold.

Conclusion: The results provide strong evidence against the null hypothesis, suggesting that production country is associated with meaningful differences in average WorldwideRevenue. Further analysis (e.g., post-hoc testing) may identify which specific production groups differ.
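One common post-hoc option is Tukey’s HSD on the fitted ANOVA object; a minimal sketch (output not shown here):

# Pairwise comparisons of group means with family-wise error control
posthoc <- TukeyHSD(anovaResult)
posthoc                 # table of pairwise differences and adjusted p-values
plot(posthoc, las = 1)  # confidence intervals for each pairwise difference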

Step 5: Find a continuous predictor for regression

We are looking for a good continuous predictor; the two candidates examined here are Vote_Count and Rating_of_10.

# Scatter plot for Vote_Count
plot(movies_clean$Vote_Count, movies_clean$WorldwideRevenue,
     main = "Vote Count vs. Worldwide Revenue",
     xlab = "Vote Count", ylab = "Worldwide Revenue", pch = 19, col = "blue")

This relationship looks roughly linear: revenue climbs as the vote count increases, with higher revenue where there are more votes.

# Scatter plot for Rating_of_10
plot(movies_clean$Rating_of_10, movies_clean$WorldwideRevenue,
     main = "Rating of 10 vs. Worldwide Revenue",
     xlab = "Average Rating", ylab = "Worldwide Revenue", pch = 19, col = "green")

The rating plot, by contrast, doesn’t suggest anything linear: the data is grouped heavily in one area and shows no clear connection to revenue. Of the two, Vote_Count looks better, as its relationship with revenue appears roughly linear, so it will be used for the regression.
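A quick correlation check can back up the visual impression; a small sketch (rows with missing values are excluded):

# Pearson correlation between vote count and worldwide revenue
cor(movies_clean$Vote_Count, movies_clean$WorldwideRevenue,
    use = "complete.obs")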

Step 6: Build and Evaluate a regression model

# Fit the regression model
regressionModel <- lm(WorldwideRevenue ~ Vote_Count, data = movies_clean)

# View model summary
summary(regressionModel)
## 
## Call:
## lm(formula = WorldwideRevenue ~ Vote_Count, data = movies_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -322050397  -29582414  -13986823   17363661  214483667 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 41710918.4   925994.8   45.04 <0.0000000000000002 ***
## Vote_Count     13920.5      331.7   41.97 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49080000 on 4278 degrees of freedom
##   (167 observations deleted due to missingness)
## Multiple R-squared:  0.2917, Adjusted R-squared:  0.2915 
## F-statistic:  1762 on 1 and 4278 DF,  p-value: < 0.00000000000000022
# Scatter plot with regression line
plot(
  movies_clean$Vote_Count, movies_clean$WorldwideRevenue,
  main = "Vote Count vs. Worldwide Revenue",      # Add a title
  xlab = "Vote Count",                            # Label for the x-axis
  ylab = "Worldwide Revenue",                     # Label for the y-axis
  pch = 19,                                       # Point type
  col = "blue"                                    # Point color
)

# Add the regression line
abline(
  lm(WorldwideRevenue ~ Vote_Count, data = movies_clean),
  col = "red",                                   # Line color
  lwd = 2                                       # Line width
)

Regression Results Interpretation

  1. Intercept:

    • The intercept is $41,710,918.40. This means that, theoretically, a movie with zero votes has a predicted baseline revenue of nearly $42 million.

  2. Vote_Count Coefficient:

    • The coefficient for Vote_Count is $13,920.50. This suggests that each additional vote is associated with roughly $13,900 more in worldwide revenue, a strong positive relationship between vote counts and revenue.
  3. Statistical Significance:

    • The p-values for the intercept and the Vote_Count coefficient are both very small, suggesting it is unlikely that these results are due to chance.
  4. Residual Standard Error (RSE):

    • The RSE is $49,080,000, which represents the typical distance between the observed and predicted revenue values.
  5. R-Squared (R²):

    • The Multiple R-squared is 0.2917, meaning the model explains about 29.17% of the variation in WorldwideRevenue using Vote_Count.

    • While this isn’t a perfect fit, it suggests Vote_Count is a meaningful predictor, though other factors also contribute to revenue variability.

  6. F-Statistic:

    • The F-statistic is 1762, with a corresponding p-value of < 2e-16, confirming that the model as a whole is statistically significant.

These findings suggest that popularity and revenue are strongly related. From a strictly revenue perspective, this supports the idea of investing in sequels or additions to franchises that have already proven popular.
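To make the coefficients concrete, the fitted model can predict revenue for a hypothetical film; the 5,000-vote figure below is purely illustrative:

# Predicted worldwide revenue for a hypothetical film with 5,000 votes,
# with a 95% confidence interval for the mean prediction
newFilm <- data.frame(Vote_Count = 5000)
predict(regressionModel, newdata = newFilm, interval = "confidence")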