P-Value

2024-01-29

Foundations: Null and Alternative Hypotheses

Null Hypothesis

- The statement of “no effect”; e.g., mood does not affect performance

- Represented by H₀

Alternative Hypothesis

- The statement that an effect exists, which the study sets out to support; e.g., mood affects performance

- Represented by H_a

Understanding P-Value

The p-value is a standard tool of scientific research and is widely used across many fields
The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed
A small p-value is evidence against the null hypothesis: researchers reject the null when the p-value falls below a chosen significance level, and otherwise fail to reject it
This lets researchers quantify how strongly the data support a relationship between variables

Calculating P-Value

First, calculate the test statistic using either a z-test or a t-test
Examples:

- One-sample z-test (for a proportion) \[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} \]

- Two-sample t-test \[ t = \left| \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \right| \]

The test statistic measures how far the sample result lies from the value expected under the null hypothesis, in units of standard error.
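Both statistics above can be computed in a few lines of base R. The sketch below is only an illustration: the helper names `z_stat_proportion` and `t_stat_two_sample` and the two small samples are hypothetical, and the 70-out-of-100 input is borrowed from the coin flip example later in this deck.

```r
# One-sample z statistic for a proportion:
# (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z_stat_proportion <- function(successes, n, p0) {
  p_hat <- successes / n
  (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
}

# Two-sample t statistic with unpooled variances,
# taking the absolute value as in the formula above
t_stat_two_sample <- function(x1, x2) {
  se <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))
  abs((mean(x1) - mean(x2)) / se)
}

# 70 heads in 100 flips, tested against p0 = 0.5
z <- z_stat_proportion(successes = 70, n = 100, p0 = 0.5)
z

# Hypothetical samples, made up for illustration only
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0)
group_b <- c(4.2, 4.8, 4.4, 5.0, 4.6)
t_value <- t_stat_two_sample(group_a, group_b)
t_value
```

The t statistic computed this way should agree in magnitude with the statistic reported by `t.test(group_a, group_b)`, which uses unpooled variances by default.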

Interpreting test results

- The normal distribution curve helps us interpret test results

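- Assuming no effect, the test statistic would fall near the middle of the curve

- The farther the test statistic falls from the middle, the smaller the p-value and the stronger the evidence against the null hypothesis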
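To make the link between position on the curve and the p-value concrete, here is a minimal sketch that converts a few z statistics into two-sided p-values under the standard normal distribution. The helper `p_two_sided` and the chosen z values are assumptions made for illustration.

```r
# Two-sided p-value for a z statistic: the area in both tails
# beyond |z| under the standard normal curve
p_two_sided <- function(z) 2 * pnorm(-abs(z))

# Hypothetical z statistics, from the center of the curve outward
z_values <- c(0, 1, 1.96, 3)
p_values <- p_two_sided(z_values)

# The p-value shrinks as the statistic moves away from the center
data.frame(z = z_values, p = signif(p_values, 3))
```

A statistic exactly at the center gives a p-value of 1, while a statistic at 1.96 gives a p-value of roughly 0.05.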

Confidence Levels

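- A researcher also sets a desired significance level (α), which is the threshold used to decide whether to reject the null hypothesis

- This bell curve illustrates a 95% confidence level: the middle 95% of the distribution is shaded, leaving α = 0.05 split between the two tails

- To reject the null hypothesis in favor of the alternative at this level, the test statistic would need to fall in one of the tails (beyond ±1.96 for a z-test)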
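The cutoff that separates the middle of the curve from the tails can be obtained from the normal quantile function rather than read off a chart. This is a sketch for a two-sided z-test; the helper `reject_null` and the two example statistics are hypothetical.

```r
alpha <- 0.05

# Critical value for a two-sided z-test: alpha / 2 lies in each tail
z_crit <- qnorm(1 - alpha / 2)
z_crit

# Reject the null hypothesis when the statistic falls in either tail
reject_null <- function(z, alpha = 0.05) {
  abs(z) > qnorm(1 - alpha / 2)
}

# Hypothetical statistics: one inside the middle 95%, one in the upper tail
decisions <- reject_null(c(1.2, 2.5))
decisions
```

Choosing a stricter level such as α = 0.01 moves the cutoff outward (to about 2.58), so a more extreme statistic is needed to reject the null hypothesis.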

Z & T tables: Understanding CDF

Once you have determined your test statistic, you can compare it to a z or t table
These tables are generated from the cumulative distribution function (CDF), which is visualized above

Formula for CDF

The formula for CDF is as follows: \(F(x) = P(X \leq x)\)

Where:

- P() denotes probability,

- X is a random variable,

- and x is the value at which the cumulative probability is evaluated
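In R, the CDF values that a z or t table would provide are available directly through `pnorm()` and `pt()`. The sketch below evaluates F(x) at a few illustrative points; the inputs 1.96, 2 and 10 degrees of freedom are arbitrary choices, not values from the slides.

```r
# F(x) = P(X <= x) for a standard normal random variable
lower_area <- pnorm(1.96)
lower_area

# The upper-tail area is the complement of the CDF
upper_area <- 1 - pnorm(1.96)
upper_area

# The same lookup for a t distribution with 10 degrees of freedom
t_cdf <- pt(2, df = 10)
t_cdf

# qnorm() is the inverse of pnorm(): it recovers x from F(x)
recovered_x <- qnorm(lower_area)
recovered_x
```

The t distribution has heavier tails than the normal, so `pt(2, df = 10)` comes out slightly smaller than `pnorm(2)`.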

Example: Coin Flip

Someone might suspect that a person is using a weighted quarter to cheat during a coin flip
To test this suspicion, they could use p-value testing
They would need a comparison sample (a trusted quarter) as well as a null hypothesis and an alternative hypothesis
Null hypothesis: there is no difference between the heads/tails ratio of the trusted coin and that of the suspected coin
Alternative hypothesis: the suspected coin lands on one side more often than the trusted coin

Example Continued

You perform testing and determine that the weighted coin landed on heads 70 times in 100 flips
You then determine that this has a P-value of:

## [1] 3.92507e-05

Given a significance level of α = 0.05, you conclude that p < α and thus the result is statistically significant
You now have strong statistical evidence that the coin is biased toward heads, although a p-value alone cannot prove that it is being used to cheat
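The slides do not show how this p-value was computed. One calculation that reproduces the printed value is an exact one-sided binomial test of 70 heads in 100 flips against a fair coin. Treat the sketch below as an assumption about the method, not as the original code.

```r
heads <- 70
flips <- 100

# Exact one-sided tail probability: P(X >= 70) when X ~ Binomial(100, 0.5)
p_value <- pbinom(heads - 1, size = flips, prob = 0.5, lower.tail = FALSE)
p_value
# [1] 3.92507e-05

# The same result from the built-in exact binomial test
bt <- binom.test(heads, flips, p = 0.5, alternative = "greater")
bt$p.value

# Compare against the significance level
alpha <- 0.05
significant <- p_value < alpha
significant
```

A two-sided version of the same test (`alternative = "two.sided"`) would double the p-value for this symmetric null, which is still far below 0.05.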

Example Visualized

## [1] 3.92507e-05

R Code Utilized

library(ggplot2)

# Wrapper around dnorm() so the density can be passed to stat_function()
dnorm_label <- function(x, mean, sd) {
  dnorm(x, mean = mean, sd = sd)
}

mean_value <- 0
sd_value <- 1
set.seed(123)

# Standard normal density evaluated on a grid from -3 to 3
data <- data.frame(x = seq(-3, 3, length.out = 1000), y = dnorm(seq(-3, 3, length.out = 1000)))

# Label each grid point as lower tail, middle 95%, or upper tail
data$center <- cut(data$x, breaks = c(-Inf, -1.96, 1.96, Inf), labels = c("Lower", "Middle", "Upper"))

# Bell curve with the middle 95% shaded
ggplot(data, aes(x = x, y = y)) +
  geom_line(color = "blue") +
  geom_ribbon(data = subset(data, center == "Middle"), aes(ymin = 0, ymax = y), fill = "gray", alpha = 0.5) +
  labs(title = "Bell Curve with 95% Confidence Interval", x = "X-axis", y = "Density") +
  theme_minimal()

# Plain normal density curve over -4 to 4 standard deviations
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm_label, args = list(mean = mean_value, sd = sd_value), color = "blue") +
  labs(title = "Normal Distribution with P-value Labels", x = "Standard Deviations from Mean", y = "Density") +
  theme_minimal()

library(plotly)

# Draw 1000 values from a standard normal distribution
set.seed(42)
data <- rnorm(1000, mean = 0, sd = 1)

# Empirical CDF: sorted values against evenly spaced cumulative probabilities in [0, 1]
sorted_data <- sort(data)
cumulative_prob <- seq(0, 1, length.out = length(sorted_data))

cdf_plot <- plot_ly(x = sorted_data, y = cumulative_prob, type = "scatter", mode = "lines", name = "CDF")

# Pass the title and axis labels as named arguments to plotly's layout()
cdf_plot <- cdf_plot %>% layout(title = "Cumulative Distribution Function (CDF)", xaxis = list(title = "Data"), yaxis = list(title = "Cumulative Probability"))

cdf_plot