Hypothesis Testing

2024-11-01

Hypothesis Testing Using Z Tests

Steps for Success:

Generate a hypothesis (null hypothesis) about a population
Determine the alternative of that hypothesis
Set the significance level for the test
Determine critical z-values based on the significance level
Calculate the z test score
Decide whether to reject or fail to reject the null hypothesis

Hypothesis

The hypothesis asks a question about a population.

Hypothesis for examination: Is it reasonable to claim that the average budget of a movie in IMDB is $12.5 million?

Null Hypothesis: The average budget of a movie in IMDB is equal to $12.5 million.

Alternate Hypothesis: The average budget of a movie in IMDB is not equal to $12.5 million (a two-sided alternative).

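Significance Level

The significance level $\alpha$ is the probability of rejecting the null hypothesis when it is actually true

A smaller $\alpha$ makes a false positive less likely, so a significant result is less likely to be a false alarm

The most common significance level is $\alpha = .05$, which corresponds to a 95 percent confidence level

For this analysis the significance level will be $\alpha = .05$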
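To make the role of $\alpha$ concrete, the following is a minimal simulation sketch, not part of the original analysis. It assumes a made-up standard normal population (the names `n_sims`, `n`, `mu0`, and `sigma` are illustrative choices) in which the null hypothesis is true, and counts how often a two-tailed Z test rejects it anyway. The rejection rate should land close to $\alpha$.

```r
# Illustrative simulation: how often does a two-tailed Z test reject a TRUE null?
set.seed(1)            # arbitrary seed so the run is reproducible
alpha  <- 0.05         # significance level
n_sims <- 5000         # number of simulated samples (assumed value)
n      <- 50           # size of each sample (assumed value)
mu0    <- 0            # true and hypothesized mean (assumed value)
sigma  <- 1            # known population standard deviation (assumed value)

z_crit <- qnorm(1 - alpha / 2)  # upper critical value for a two-tailed test

rejected <- replicate(n_sims, {
  x <- rnorm(n, mean = mu0, sd = sigma)
  z <- (mean(x) - mu0) / (sigma / sqrt(n))
  abs(z) > z_crit      # TRUE when the test (wrongly) rejects the null
})

rejection_rate <- mean(rejected)
print(rejection_rate)  # expected to be near alpha
```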

Distribution of Budgets

Before beginning the analysis, here is a preliminary look at the distribution of budgets. The vast majority of movies have a low budget, but the average budget is unclear from this plot.

Further Details about the Data

This plot makes it clearer that the mean is between $10 million and $15 million

Determining the Critical Z Values

The critical Z values are determined by the significance level selected and the type of hypothesis tested.

Since the significance level is $\alpha = .05$ and this is a two-tailed test, $\alpha / 2 = .025$ of the probability lies in each tail of the standard normal curve.

Use a table of values to determine the Z values at these points. For $\alpha = .05$, the null hypothesis is rejected when $Z < -1.96 \quad \text{or} \quad Z > 1.96$.
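Instead of a printed table, the same critical values can be looked up with base R's `qnorm()`, the quantile function of the normal distribution. This is a small sketch; the variable names are illustrative.

```r
# Critical values for a two-tailed Z test at alpha = 0.05
alpha   <- 0.05
z_lower <- qnorm(alpha / 2)      # cuts off the lower 2.5% of the curve
z_upper <- qnorm(1 - alpha / 2)  # cuts off the upper 2.5% of the curve

print(round(c(lower = z_lower, upper = z_upper), 2))
# lower is -1.96 and upper is 1.96
```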

Calculating the Z Test Score

The Z score is calculated according to the formula $Z = (\bar{x} - \mu) / (\sigma / \sqrt{n})$

$\bar{x}$ is the mean of the sample being evaluated, $\mu$ is the hypothesized mean of the population, $\sigma$ is the standard deviation of the population, and $n$ is the sample size

R has a library named BSDA that provides statistical tests, including Z tests, and can be installed and imported

Run the test with the hypothesized mean of $12.5 million
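The formula can also be written out by hand in base R. The sketch below uses a hypothetical helper, `z_statistic()`, and a made-up four-value sample; neither comes from BSDA or from the movie data. It only shows how the pieces of the formula fit together.

```r
# Hypothetical helper (not part of BSDA): Z statistic for a one-sample test
z_statistic <- function(x, mu0, sigma) {
  n <- length(x)                        # sample size
  (mean(x) - mu0) / (sigma / sqrt(n))   # (sample mean - hypothesized mean) / standard error
}

# Made-up toy data: mean is 13, and the standard error is 2 / sqrt(4) = 1
toy_sample <- c(10, 12, 14, 16)
z_toy <- z_statistic(toy_sample, mu0 = 12, sigma = 2)
print(z_toy)  # 1
```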

Calculating the Z Test Score Continued

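library(BSDA)  # provides z.test()

# movies_filtered is assumed to be defined earlier in the analysis.
# The population standard deviation is unknown, so the sample standard
# deviation is used in its place, a common substitution for large samples.
z_test_result <- z.test(movies_filtered$budget, 
                         mu = 12500000,              # hypothesized mean budget
                         alternative = "two.sided",  # two-tailed test
                         sigma.x = sd(movies_filtered$budget), 
                         conf.level = 0.95)

print(z_test_result)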

## 
##  One-sample z-Test
## 
## data:  movies_filtered$budget
## z = 3.0625, p-value = 0.002195
## alternative hypothesis: true mean is not equal to 12500000
## 95 percent confidence interval:
##  12858322 14132323
## sample estimates:
## mean of x 
##  13495323
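As a sanity check, the reported statistic can be approximately reproduced from the printed output alone. The sketch below assumes the standard error can be recovered from the width of the 95 percent confidence interval; because the printed endpoints are rounded, the results should agree with the reported z and p-value only up to rounding.

```r
# Values copied from the z.test() output above
sample_mean <- 13495323
mu0         <- 12500000   # hypothesized mean
ci_lower    <- 12858322
ci_upper    <- 14132323

# A 95% interval spans sample_mean +/- qnorm(0.975) * standard error,
# so the standard error is the interval width divided by 2 * qnorm(0.975)
std_error <- (ci_upper - ci_lower) / (2 * qnorm(0.975))

z       <- (sample_mean - mu0) / std_error  # Z statistic
p_value <- 2 * pnorm(-abs(z))               # two-tailed p-value

print(round(z, 4))
print(signif(p_value, 4))
```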

Conclusion of the Test

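The Z score of 3.0625 is greater than the upper critical Z value of 1.96, so it falls in the rejection region and the null hypothesis is rejected.

The p-value of 0.002195 leads to the same decision. Since $.002195 < .05$, the result is statistically significant at $\alpha = .05$. Equivalently, the hypothesized mean of 12,500,000 lies outside the 95 percent confidence interval of 12,858,322 to 14,132,323.

Thus, it is not reasonable to state that the mean budget of movies on IMDB is $12.5 million. The sample mean of about $13.5 million is significantly higher.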

Plot of Valid Guesses

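This plot highlights how the confidence interval determines which hypothesized means would not be rejected: any value inside the 95 percent confidence interval is consistent with the data at $\alpha = .05$.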
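The link between the confidence interval and the test can be sketched numerically. The code below takes the sample mean and interval from the test output, recovers the standard error from the interval width (an approximation, since the printed endpoints are rounded), and tests a hypothetical grid of candidate means. Only the candidates inside the interval survive the test.

```r
# Values copied from the z.test() output
sample_mean <- 13495323
ci_lower    <- 12858322
ci_upper    <- 14132323

alpha     <- 0.05
z_crit    <- qnorm(1 - alpha / 2)
std_error <- (ci_upper - ci_lower) / (2 * z_crit)  # recovered from the interval width

# Hypothetical grid of candidate means: $12.0M to $15.0M in $0.5M steps
candidate_means <- seq(12e6, 15e6, by = 5e5)

# Z score of the sample mean against each candidate
z_scores <- (sample_mean - candidate_means) / std_error

# Candidates that would NOT be rejected at alpha = 0.05
retained <- candidate_means[abs(z_scores) < z_crit]
print(retained)
```

Under these assumptions only the 13.0, 13.5, and 14.0 million candidates are retained, which are exactly the grid values that fall between the two interval endpoints.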