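- A p-value is a probability between 0 and 1: the chance of seeing data at least as extreme as ours if the null hypothesis were true
- It’s used in hypothesis testing to decide whether to reject or fail to reject the null hypothesis (i.e., whether the result is statistically significant)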
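The decision rule in the bullets above can be sketched in a few lines of R. The helper name `decide` and the default significance level `alpha = 0.05` are illustrative assumptions, not part of the original slides.

```r
# Hypothetical helper: compare a p-value against a chosen significance level.
decide <- function(p_value, alpha = 0.05) {
  stopifnot(p_value >= 0, p_value <= 1)  # a p-value is a probability
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

decide(0.01)  # "reject H0"
decide(0.20)  # "fail to reject H0"
```

Note that the second outcome is "fail to reject", not "accept": a large p-value only means the data do not give enough evidence against the null hypothesis.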
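```{r setup, include=FALSE}
# Assumed contents: the original setup chunk was not preserved.
# These are the packages the code on the following slides relies on.
library(dplyr)    # %>%, mutate(), filter()
library(ggplot2)  # ggplot(), geom_bar(), geom_line(), geom_area()
library(plotly)   # plot_ly(), layout()
```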
We calculate a \(\chi^2\) statistic that measures the total difference between the observed counts and the expected counts in a dataset.

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

\(O\) is the observed frequency in a cell and \(E\) is the expected frequency in that cell; the sum runs over every cell of the table. A large \(\chi^2\) value means our observations are very different from what we’d expect, giving a small p-value.

The next slide runs a \(\chi^2\) test on the mtcars dataset.
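# Recode cylinders and transmission as labelled factors for plotting
mtcars_data <- mtcars %>%
  mutate(cylinder = as.factor(cyl),
         AorM = factor(am, labels = c("Automatic", "Manual")))
# Cross-tabulate cylinders (rows) against transmission (columns: 0 = automatic, 1 = manual)
chitable <- table(mtcars$cyl, mtcars$am)
# Pearson's chi-squared test of independence
result <- chisq.test(chitable)
print(result)
# Proportion of each transmission type within each cylinder group
ggplot(mtcars_data, aes(x = cylinder, fill = AorM)) +
  geom_bar(position = "fill", alpha = 0.8) +
  labs(title = "Transmission Type by Number of Cylinders",
       x = "Number of Cylinders", y = "Proportion", fill = "Transmission")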
##
##  Pearson's Chi-squared test
##
## data:  chitable
## X-squared = 8.7407, df = 2, p-value = 0.01265
As the p-value (0.01265) is less than 0.05, we can say there is a statistically significant relationship between whether a car is manual or automatic and how many cylinders it has.
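As a check on the formula, the reported statistic can be reproduced by hand from the observed and expected counts. This is a minimal sketch using only base R; the variable names (`observed`, `expected`, `chi_sq`, `dof`) are chosen here for illustration.

```r
# Observed counts: cylinders (rows) by transmission (columns).
observed <- table(mtcars$cyl, mtcars$am)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic, degrees of freedom, and upper-tail p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
dof     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

round(c(statistic = chi_sq, df = dof, p.value = p_value), 4)
```

The result matches the `chisq.test()` output (8.7407 on 2 degrees of freedom). Several of the expected counts in this table are below 5, which is why `chisq.test()` warns that the chi-squared approximation may be inexact here.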
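# Two-tailed p-value for a z-score of 1.5: the area in both tails beyond |z| = 1.5
p <- pnorm(-1.5) * 2
# Standard normal density curve from -4 to 4
data <- data.frame(x = seq(-4, 4, length.out = 1000))
data$y <- dnorm(data$x)
# Tail regions to shade
left <- data %>% filter(x <= -1.5)
right <- data %>% filter(x >= 1.5)
ggplot(data, aes(x, y)) +
  geom_line(linewidth = 1) +  # use size = 1 on ggplot2 versions before 3.4.0
  geom_area(data = left, aes(x, y), alpha = 0.6) +
  geom_area(data = right, aes(x, y), alpha = 0.6) +
  labs(title = "Two-Tailed Test", subtitle = paste("p-value =", round(p, 4)),
       y = "Density", x = "Z-score")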
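The shaded tails together have an area of about 0.134. This p-value is above 0.05, so a z-score of ±1.5 is not significant at the 5% level; the z-score would need to be beyond roughly ±1.96 for the shaded area to fall below 0.05.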
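A quick numeric check of the shaded area, and of how far out the z-score has to be for a two-tailed test at the 5% level. The helper name `two_tailed_p` is an illustrative assumption; `pnorm()` and `qnorm()` are base R.

```r
# Two-tailed p-value for a z-score: twice the area beyond |z|.
two_tailed_p <- function(z) 2 * pnorm(-abs(z))

p_at_1.5  <- two_tailed_p(1.5)   # about 0.134, above 0.05
p_at_1.96 <- two_tailed_p(1.96)  # about 0.05

# Critical value: the |z| at which the two-tailed p-value equals 0.05.
z_crit <- qnorm(1 - 0.05 / 2)    # about 1.96

round(c(p_at_1.5 = p_at_1.5, p_at_1.96 = p_at_1.96, z_crit = z_crit), 4)
```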
Last, let’s show the relationship between the t-value and the degrees of freedom in a t-test. Here’s the code:
# Grid of degrees of freedom and t-values
df <- seq(1, 20, by = 1)
x <- seq(-5, 5, by = 0.1)
# Density matrix: one row per t-value, one column per df
density <- outer(x, df, FUN = function(x, df) dt(x, df))
plot_ly(x = ~df, y = ~x, z = ~density, type = "surface") %>%
  layout(
    title = "T-Distribution by T-value and Degrees of Freedom",
    scene = list(xaxis = list(title = "Degrees of Freedom (df)"),
                 yaxis = list(title = "T-value"),
                 zaxis = list(title = "Density")))
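The surface can be hard to read numerically, so here is a small sketch of the same relationship as numbers: the two-tailed 5% critical t-value for a few degrees of freedom, compared with the normal critical value. The particular `dfs` values are chosen here for illustration.

```r
# Two-tailed 5% critical values of the t-distribution for a few df values.
dfs    <- c(1, 5, 20, 100)
t_crit <- qt(0.975, df = dfs)

# Matching critical value of the standard normal distribution.
z_crit <- qnorm(0.975)

data.frame(df = dfs, t_critical = round(t_crit, 3))
round(z_crit, 3)
```

The critical value drops from about 12.7 at 1 degree of freedom to about 2.09 at 20, approaching the normal value of about 1.96 as the degrees of freedom grow. That is the same flattening of the heavy tails that the surface plot shows.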