2024-11-01

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'plotly'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'BSDA'
## 
## 
## The following object is masked from 'package:datasets':
## 
##     Orange

Hypothesis Testing Using Z Tests

Steps for Success:

  1. Generate a hypothesis (null hypothesis) about a population

  2. Determine the alternative of that hypothesis

  3. Set the significance level (\(\alpha\)) for the test

  4. Determine critical z-values based on the significance level

  5. Calculate the z test score

  6. Decide whether to reject or fail to reject the null hypothesis

Hypothesis

A hypothesis is a testable claim about a population.

Hypothesis for examination: Is it reasonable to claim that the average budget of a movie in IMDB is $12.5 million?

Null Hypothesis (\(H_0\)): The average budget of a movie in IMDB is $12.5 million, that is, \(\mu = 12500000\).

Alternate Hypothesis (\(H_1\)): The average budget of a movie in IMDB is not $12.5 million, that is, \(\mu \neq 12500000\).

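Significance Level and Confidence Level

The significance level \(\alpha\) is the probability of rejecting the null hypothesis when it is actually true

Fixing \(\alpha\) before running the test keeps the decision rule objective and repeatable

The most common significance level is \(\alpha = .05\), which corresponds to a 95 percent confidence level

For this analysis the significance level will be \(\alpha = .05\)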
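To make the meaning of \(\alpha = .05\) concrete, the following is a minimal simulation sketch. It is not part of the movie analysis: the sample size, number of simulations, seed, and the standard normal population are all assumptions chosen for illustration. It draws many samples for which the null hypothesis is true and counts how often a two-tailed z test rejects it anyway.

```r
# Simulate many samples for which the null hypothesis is TRUE and count how
# often a two-tailed z test at alpha = 0.05 rejects it anyway.
set.seed(42)                       # arbitrary seed, for reproducibility only

alpha  <- 0.05
n      <- 50                       # hypothetical sample size
n_sims <- 10000                    # hypothetical number of simulated studies
z_crit <- qnorm(1 - alpha / 2)     # two-tailed critical value

# Each sample comes from N(0, 1), so mu = 0 and sigma = 1 are both correct.
z_scores <- replicate(n_sims, {
  x <- rnorm(n, mean = 0, sd = 1)
  (mean(x) - 0) / (1 / sqrt(n))
})

# Share of simulated studies that wrongly reject the true null hypothesis.
false_rejection_rate <- mean(abs(z_scores) > z_crit)
print(false_rejection_rate)        # expected to land close to alpha
```

The observed rate should sit near 5 percent, which is what choosing \(\alpha = .05\) commits the analysis to.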

Distribution of Budgets

Before beginning the analysis, here is a preliminary look at the distribution of budgets. The vast majority of movies have a low budget, but the average budget is unclear from this plot.

Further Details about the Data

This plot makes it clearer that the mean is between $10 million and $15 million.

Determining the Critical Z Values

The critical Z values are determined by the significance level selected and the type of hypothesis tested.

Since the significance level is \(\alpha = .05\) and this is a two-tailed test, the rejection region is split evenly, with \(\alpha / 2 = .025\) in each tail of the curve.

Use a table of values to determine the Z values at these points. For \(\alpha = .05\), the critical values are \(\pm 1.96\), so the null hypothesis is rejected when \(Z < -1.96\) or \(Z > 1.96\).
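As an alternative to a printed table, the critical values can be looked up with base R's `qnorm()`, the quantile function of the standard normal distribution. This is a small sketch; the variable names are illustrative.

```r
# Two-tailed critical values for a z test at significance level alpha.
alpha <- 0.05

z_lower <- qnorm(alpha / 2)        # cuts off the lower 2.5 percent tail
z_upper <- qnorm(1 - alpha / 2)    # cuts off the upper 2.5 percent tail

print(round(c(lower = z_lower, upper = z_upper), 2))
```

Rounded to two decimals, these are the same \(\pm 1.96\) values given by the table.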

Calculating the Z Test Score

The Z score is calculated according to the formula \(Z = (\bar{x} - \mu) / (\sigma / \sqrt{n})\)

\(\bar{x}\) is the mean of the sample being evaluated, \(\mu\) is the hypothesized mean of the population, \(\sigma\) is the standard deviation of the population, and n is the sample size

R has a library named BSDA that provides statistical tests, including Z tests, and that can be installed and imported

Run the test with the hypothesized mean of $12.5 million
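The formula can also be written out by hand in base R, which helps show what a packaged z test computes internally. The helper name `z_statistic` and the toy numbers below are hypothetical and are not taken from the movie data.

```r
# Hand-rolled z statistic: (sample mean - hypothesized mean) / standard error.
# `z_statistic` is an illustrative helper, not a BSDA function.
z_statistic <- function(x, mu, sigma) {
  n <- length(x)
  (mean(x) - mu) / (sigma / sqrt(n))
}

# Toy data: sample mean 13, hypothesized mean 12, sigma 2, n 4.
toy_sample <- c(10, 12, 14, 16)
z_toy <- z_statistic(toy_sample, mu = 12, sigma = 2)
print(z_toy)   # (13 - 12) / (2 / sqrt(4)) = 1
```

With the toy sample the standard error is \(2 / \sqrt{4} = 1\), so the sample mean sits exactly one standard error above the hypothesized mean.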

Calculating the Z Test Score Continued

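# Two-sided one-sample z test of H0: mean budget = $12.5 million.
# The population standard deviation is unknown, so the sample standard
# deviation is passed as sigma.x (a common approximation for large samples).
z_test_result <- z.test(movies_filtered$budget,
                        mu = 12500000,
                        alternative = "two.sided",
                        sigma.x = sd(movies_filtered$budget),
                        conf.level = 0.95)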

print(z_test_result)
## 
##  One-sample z-Test
## 
## data:  movies_filtered$budget
## z = 3.0625, p-value = 0.002195
## alternative hypothesis: true mean is not equal to 12500000
## 95 percent confidence interval:
##  12858322 14132323
## sample estimates:
## mean of x 
##  13495323

Conclusion of the Test

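The Z score of 3.0625 is greater than the critical z value of 1.96, so the null hypothesis is rejected.

This is reflected in the p-value of 0.002195. Since \(.002195 < .05\), the result is statistically significant at the \(\alpha = .05\) level. Equivalently, $12.5 million falls outside the 95 percent confidence interval of $12,858,322 to $14,132,323.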

Thus, it is not reasonable to state that the mean budget of movies on IMDB is $12.5 million.
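The two decision rules used in this conclusion, comparing the Z score to the critical value and comparing the p-value to \(\alpha\), can be checked with base R. This sketch starts from the rounded Z score printed in the test output rather than from the raw data, so the recomputed p-value may differ from the reported one in the last digit.

```r
# Recompute the decision from the reported (rounded) z score.
z_reported <- 3.0625               # z value printed in the test output
alpha      <- 0.05

z_crit  <- qnorm(1 - alpha / 2)           # two-tailed critical value
p_value <- 2 * pnorm(-abs(z_reported))    # two-tailed p-value

reject_by_critical_value <- abs(z_reported) > z_crit
reject_by_p_value        <- p_value < alpha

print(p_value)
print(c(critical_value_rule = reject_by_critical_value,
        p_value_rule        = reject_by_p_value))
```

Both rules always agree for a z test, because the p-value drops below \(\alpha\) exactly when the Z score passes the critical value.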

Plot of Valid Guesses

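This plot highlights how the confidence interval determines which hypothesized means would not be rejected: values inside the 95 percent interval ($12,858,322 to $14,132,323) are consistent with the data at \(\alpha = .05\), while values outside it, such as $12.5 million, are rejected.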
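The same idea can be checked numerically. The sketch below takes the interval endpoints from the test output and asks whether a few candidate means would be rejected. The helper `would_reject` is hypothetical, and every candidate value other than $12.5 million and the sample mean is made up for illustration.

```r
# 95 percent confidence interval endpoints copied from the test output.
ci_lower <- 12858322
ci_upper <- 14132323

# A hypothesized mean is rejected at alpha = 0.05 when it lies outside the
# 95 percent interval. `would_reject` is an illustrative helper.
would_reject <- function(mu0) mu0 < ci_lower | mu0 > ci_upper

# Candidate means: the tested value, two values inside the interval
# (including the sample mean), and one value above the interval.
candidates <- c(12500000, 13000000, 13495323, 14500000)

print(data.frame(hypothesized_mean = candidates,
                 rejected          = would_reject(candidates)))
```

Only the candidates outside the interval, $12.5 million and $14.5 million, are rejected.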