#load the necessary libraries and packages
library(tidyverse) #tools for data science; core packages include ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, and lubridate.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) #loaded to address viz issues, including currency issues
## Warning: package 'scales' was built under R version 4.4.3
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
options(scipen=999) #disable scientific notation since high values are used
# Load the tsibble package
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
#clean up the work space
rm(list = ls())
#load the data set
#load the adjusted version of the csv from the local desktop
t_box_office <- read_delim("C:/Users/danjh/Grad School/H510 Stats for DS/Datasets/box_office_data_2000_24_adj.csv", delim = ",")
## Rows: 5000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Release Group, Genres, Rating, Original_Language, Production_Count...
## dbl (10): Rank, $Worldwide, $Domestic, Domestic %, $Foreign, Foreign %, Year...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#adjust the data set for easier handling
#create a copy of the data set for this activity
movies <- t_box_office
#cat(colnames(movies),sep = ", ")
#clean up column names to avoid recurring issues with special characters and inaccurate names
# Rename specific columns (new_name = old_name); column order is unchanged
movies <- rename(
  movies,
  MovieName          = `Release Group`,
  WorldwideRevenue   = `$Worldwide`,
  DomesticRevenue    = `$Domestic`,
  ForeignRevenue     = `$Foreign`,
  DomesticPercentage = `Domestic %`,
  ForeignPercentage  = `Foreign %`,
  RankForYear        = Rank
)
#cat(colnames(movies), sep = ", ")
Note: if you are completing this exercise in the fully-online version of H510, you’ll be doing it individually. So, you can ignore all instances of “your group”, and just think about them as “yourself”.
Model Critique
For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
- Create an explicit business scenario which might leverage the data (and methods) used in the lab.
- Critique the models (or analyses) present in the lab based on this scenario.
- Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.
For the purposes of this model critique we will use the week 6 lab on
Confidence Intervals. In that lab we evaluated two variable pairs:
$Domestic with Foreign_to_Domestic_Ratio, and Vote_Count with
Popularity_Index. We then examined the relationship within each
pair in plots, calculated the correlation coefficients, and
built the confidence intervals.
The original lab can be found here for reference.
Goal 1: Business Scenario
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
- Customer or Audience: who exactly will use your results?
- Problem Statement: reference this article to help you write a SMART problem statement.
- E.g., the statement “we need to analyze sales data” is not a good problem summary/statement, but “for <this> reason, the company needs to know if they should stop selling product A …” is on a better track.
- Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
- If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
- Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might want to “identify the factors that most influence <some variable>.”
- Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!
Customer:
The customer for this lab might be a production company looking to make some strategic decisions.
Problem Statement:
A new US-based production company needs to determine which factors drive domestic revenue so that it can build a business model that takes advantage of them. At this time the company is interested in the relationship between vote_count (a proxy for popularity) and domestic box office revenue, as it is preparing to select its first US-targeted project within the next 6 months.
Scope:
In this case the key variables would be the domestic revenue
($Domestic) as the dependent variable, with
Vote_Count, Popularity_Index and
Foreign_to_Domestic_Ratio as the explanatory
(independent) variables. 95% confidence intervals
for mean domestic revenue and mean vote count
will help estimate the expected revenue and audience engagement levels.
Correlation coefficients between Vote_Count, Popularity_Index and
$Domestic will indicate whether audience engagement can predict higher
revenue.
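As a rough sketch of the confidence-interval step described in the scope, the snippet below computes a t-based 95% interval for a mean, first by hand and then with `t.test()` as a cross-check. It runs on simulated, right-skewed values; the vector `sim_revenue` and its parameters are made up for illustration and are not taken from the lab data.

```r
# Simulated, right-skewed "revenue" values (hypothetical; not the lab data)
set.seed(510)
sim_revenue <- rlnorm(500, meanlog = 16, sdlog = 1.2)

# t-based 95% confidence interval for the mean, computed by hand
n      <- length(sim_revenue)
xbar   <- mean(sim_revenue)
se     <- sd(sim_revenue) / sqrt(n)      # standard error of the mean
t_crit <- qt(0.975, df = n - 1)          # two-sided 95% critical value
ci_manual <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)

# The same interval from t.test(), as a cross-check
ci_ttest <- t.test(sim_revenue, conf.level = 0.95)$conf.int

print(ci_manual)
print(ci_ttest)
```

On the real data the same calculation would be applied to DomesticRevenue and Vote_Count after dropping missing values.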
Objective:
If we find that the correlation is strong and positive, it would suggest
that high audience engagement is associated with higher revenue, although
correlation alone cannot show that engagement causes the revenue.
Goal 2: Model Critique
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
- Analytical issues, such as the current model assumptions.
- Issues with the data itself.
- Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
- Are there better visualizations which could have been used?
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
Instead of relying only on correlation coefficients, a linear regression model can estimate how domestic box office revenue changes with vote count while controlling for other factors (here, Rating_of_10 and ForeignPercentage).
#proof of concept #1
#linear model
# create the linear regression model
model <- lm(DomesticRevenue ~ Vote_Count
+ Rating_of_10
+ ForeignPercentage,
data = movies)
# View regression summary
summary(model)
##
## Call:
## lm(formula = DomesticRevenue ~ Vote_Count + Rating_of_10 + ForeignPercentage,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -328431836 -17787662 -1335400 6565583 646781076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58452626.4 5411222.1 10.802 < 0.0000000000000002 ***
## Vote_Count 14392.1 200.2 71.886 < 0.0000000000000002 ***
## Rating_of_10 -3498824.8 825490.4 -4.238 0.0000229 ***
## ForeignPercentage -411262.3 24646.6 -16.686 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51780000 on 4826 degrees of freedom
## (170 observations deleted due to missingness)
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5621
## F-statistic: 2067 on 3 and 4826 DF, p-value: < 0.00000000000000022
Results:
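Vote_Count has a strong positive association with DomesticRevenue: each additional vote is associated with roughly $14,392 more domestic revenue, holding the other predictors constant.
Rating_of_10 has a statistically significant negative coefficient (about -$3.5 million per rating point).
ForeignPercentage has a statistically significant negative coefficient (about -$411,000 per percentage point).
The R-squared value indicates this model explains about 56% of the variance in DomesticRevenue.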
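Since the original lab focused on confidence intervals, one natural follow-up to the regression summary is to put intervals around the coefficients rather than reading only the point estimates. The sketch below uses `confint()` on a small simulated data set; the data frame `sim_movies` and its data-generating coefficients are invented for illustration. On the real data the equivalent call would be `confint(model)`.

```r
# Simulated data (hypothetical; not the lab data)
set.seed(510)
n <- 300
sim_movies <- data.frame(
  Vote_Count        = runif(n, 0, 20000),
  ForeignPercentage = runif(n, 0, 100)
)
# Made-up data-generating process, loosely on the scale of the summary above
sim_movies$DomesticRevenue <- 5e6 +
  14000 * sim_movies$Vote_Count -
  400000 * sim_movies$ForeignPercentage +
  rnorm(n, sd = 5e7)

sim_model <- lm(DomesticRevenue ~ Vote_Count + ForeignPercentage,
                data = sim_movies)

# 95% confidence intervals for the intercept and each slope
ci <- confint(sim_model, level = 0.95)
print(ci)
```

An interval that excludes zero tells the same story as a small p-value, but it also shows the plausible range of the effect in dollars, which is easier to communicate to a production company.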
# Visualize Vote_Count vs. DomesticRevenue relationship
ggplot(movies, aes(x = Vote_Count, y = DomesticRevenue)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Vote Count vs. Domestic Revenue with linear regression line", x = "Vote Count", y = "Domestic Revenue")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 170 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 170 rows containing missing values or values outside the scale range
## (`geom_point()`).
The plot shows a linear relationship with a strong upward trend: domestic revenue increases as vote count increases.
Why This Improves the Analysis?
The original lab uses t-distribution confidence intervals, but bootstrapping provides a more robust estimate, especially if the data isn’t normally distributed.
#proof of concept #2
#bootstrapping
# Load package
library(boot)
# Define a bootstrap function that returns the mean of the resampled values
bootstrap_mean <- function(data, index) {
return(mean(data[index]))
}
# Run the test
boot_result <- boot(movies$DomesticRevenue,
bootstrap_mean,
R = 10000)
# Compute confidence interval
boot.ci(boot_result, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_result, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (42643608, 46894520 )
## Calculations and Intervals on Original Scale
Results:
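Establishes the 95% CI for mean domestic revenue at $42,643,608 to $46,894,520. Because no random seed is set, the bounds shift slightly from run to run.
This method doesn’t assume a normal distribution.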
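Bootstrap intervals depend on random resampling, so the reported bounds only reproduce exactly when the random seed is fixed. The sketch below shows a seeded percentile bootstrap written in base R. The helper `percentile_boot_ci()` and the simulated vector `sim_revenue` are hypothetical names introduced for illustration; this is not the `boot` package workflow used above, only the same idea in a few lines.

```r
# Simulated, right-skewed values standing in for revenue (hypothetical)
set.seed(510)
sim_revenue <- rlnorm(500, meanlog = 16, sdlog = 1.2)

# Percentile bootstrap CI for the mean.
# Fixing the seed inside the function makes the resampling repeatable.
percentile_boot_ci <- function(x, n_boot = 2000, level = 0.95, seed = 1) {
  set.seed(seed)
  # Resample with replacement and record the mean of each resample
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

ci_first  <- percentile_boot_ci(sim_revenue)
ci_second <- percentile_boot_ci(sim_revenue)  # same seed, so same interval

print(ci_first)
print(identical(ci_first, ci_second))
```

With the `boot` package, calling `set.seed()` immediately before `boot()` serves the same purpose.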
Why This Improves the Analysis?
Pearson correlation assumes a linear relationship. Spearman’s rank correlation is a nonparametric alternative that only assumes a monotonic relationship, so it is less sensitive to skewed values and outliers.
#proof of concept #3
#spearman's Rank
# Apply jitter to Vote_Count and DomesticRevenue to break ties, which caused warnings
# Note: this overwrites the original columns in movies with the jittered values
movies$Vote_Count <- jitter(movies$Vote_Count)
movies$DomesticRevenue <- jitter(movies$DomesticRevenue)
# run test
spearman_corr_jittered <- cor.test(movies$Vote_Count,
movies$DomesticRevenue,
method = "spearman")
# View results
print(spearman_corr_jittered)
##
## Spearman's rank correlation rho
##
## data: movies$Vote_Count and movies$DomesticRevenue
## S = 4346871744, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7685343
Results:
The correlation of rho = 0.7685 with a near-zero p-value indicates a strong, positive, statistically significant monotonic relationship between Vote_Count and DomesticRevenue. A relationship this strong is highly unlikely to have arisen by chance.
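Jittering breaks ties by adding random noise, which changes the data slightly and makes the result vary between runs. A possible alternative, sketched below on simulated data, is to leave the values untouched and pass `exact = FALSE` to `cor.test()`, which requests the asymptotic p-value and so avoids the ties warning. The vectors `votes` and `revenue` are hypothetical stand-ins, not the lab columns.

```r
# Simulated data with deliberate ties (hypothetical; not the lab data)
set.seed(510)
votes   <- rpois(200, lambda = 20)                 # integer counts, so ties occur
revenue <- round(votes * 3 + rnorm(200, sd = 10))  # rounded, so ties occur here too

# exact = FALSE uses the asymptotic approximation for the p-value,
# so tied ranks do not trigger the "Cannot compute exact p-value" warning
spearman_res <- cor.test(votes, revenue, method = "spearman", exact = FALSE)

print(spearman_res$estimate)
print(spearman_res$p.value)
```

Tied values receive average ranks, so the rho estimate is still well defined without modifying the data.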
Why This Improves the Analysis?
Goal 3: Ethical and Epistemological Concerns
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
- Overcoming biases (existing or potential).
- Possible risks or societal implications.
- Crucial issues which might not be measurable.
- Who would be affected by this project, and how does that affect your critique?
Epistemology is concerned with how we acquire, validate and interpret the information obtained through analysis, while ethics addresses the fairness and potential bias of that analysis. Both raise key points to consider for the current analysis and, where possible, to apply in future analysis efforts.
It is important to note that the data set used for this analysis covers only the top 200 movies per year for 2000 through 2024. This creates a potential sampling bias that could color the results. It may be a legitimate approach to focus on the top results to reflect trends, but is it true, or even likely, that the top 200 movies are reflective of the trends for all movies released in the same time period? Ethically, it is critical for the analyst to make this issue known, so the customer of the analysis can judge the analysis through that lens.
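To illustrate why a "top performers only" sample can misrepresent the wider population, the sketch below simulates a full population of films and compares the correlation across all of them with the correlation among only the 200 highest earners. Everything here is simulated under assumed parameters (a true correlation of 0.7 between the hypothetical `engagement` and `revenue` variables); the direction and size of any distortion in the real data are unknown.

```r
# Simulated "population" of films (hypothetical; not the lab data)
set.seed(510)
n_all <- 5000
engagement <- rnorm(n_all)
# Construct revenue so its population correlation with engagement is 0.7
revenue <- 0.7 * engagement + sqrt(1 - 0.7^2) * rnorm(n_all)

# Keep only the 200 highest-revenue films, mimicking a "top 200" list
top_idx <- order(revenue, decreasing = TRUE)[1:200]

cor_all <- cor(engagement, revenue)
cor_top <- cor(engagement[top_idx], revenue[top_idx])

print(c(all_films = cor_all, top_200 = cor_top))
```

In this simulation the correlation within the top slice is weaker than in the full population, because selecting on revenue restricts its range. Conclusions drawn from a ranked subset therefore may not carry over to films outside it.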
As this analysis is concerned with popularity, there is a growing ethical concern about what represents “popularity”. This analysis attempts to establish statistically whether vote_count, as a measure of popularity, might influence revenue success. The specific challenge here is how vote counts are gathered. In a largely connected, online environment there is a growing risk of result manipulation, bots flooding review sites, and targeted campaigns which reward users for leaving reviews. These techniques may distort “vote_count” results, which may reduce its effectiveness as a predictor of revenue success.
Outside of the analysis, there is also the concern of using data-supported success metrics as the primary tool to drive project decisions. If movie makers are only able to secure funding in statistically safe or high-likelihood situations, it may have the adverse effect of reducing risk taking in the movie industry. Creative freedom and imaginative or experimental storytelling approaches could be significantly stifled and discouraged in favor of a more formulaic approach.
Additionally, we need to consider the important concept that correlation does not necessarily equate to causation. Movies specifically are driven by so many factors that creating a highly significant and reliable model would likely require an immense data set with an extremely varied and robust set of variables. While our analysis may show significant correlation between variables, it is critical to keep in mind that the cause of that correlation could be any number of external factors that are difficult to track. Some to consider are:
What is the impact of star power?
In the current world of social influencers, how might they impact popularity? Not everyone is on social media, but how heavily do influencers change popularity scores, not because the movie is good, but because the influencer is “trending”?
What about “late bloomer” movies that don’t find their audience until several years after release, when someone discovers them?
Conclusion
It is critically important to balance the statistical results with a measure of critical thinking and reality checking. Statistics are very useful and can reveal some incredibly compelling information. While we, as data scientists, must be able to justify and explain our results, we also have a responsibility to remain skeptical and to keep improving the accuracy, relevance and legitimacy of our analyses by questioning the results, researching the results, validating the sources and representing our analyses openly and ethically.