
The p-value

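  • A p-value is a probability between 0 and 1: the chance, assuming the null hypothesis is true, of seeing a result at least as extreme as the one observed
  • It’s used in hypothesis testing to decide whether to reject or fail to reject the null hypothesis (i.e., whether the result is statistically significant)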
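
To make the definition concrete, here is a minimal sketch with a made-up example that is not part of the slides: 60 heads in 100 coin flips, with the null hypothesis that the coin is fair. The two-sided p-value is the probability, under the null, of a count at least as far from 50 as the one observed.

```r
# Hypothetical example: 60 heads in 100 flips, null hypothesis P(heads) = 0.5
n <- 100
heads <- 60

# Probability of a result at least as extreme in either direction (<= 40 or >= 60)
p_manual <- pbinom(40, n, 0.5) + pbinom(59, n, 0.5, lower.tail = FALSE)

# binom.test() computes the same exact two-sided p-value
p_builtin <- binom.test(heads, n, p = 0.5)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```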

Significance

  • A “small” p-value (< 0.05) means that the null hypothesis can be rejected: there is a statistically significant association between the two variables.
  • A “large” p-value (≥ 0.05) means that the null hypothesis cannot be rejected: there is not enough evidence of an association, which is not the same as proving there is none.
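
The decision rule can be sketched as a small helper. The function name `decide` and the example p-values are illustrative assumptions, and 0.05 is only the conventional threshold.

```r
# Hypothetical helper: apply the conventional alpha = 0.05 rule to p-values
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value < alpha,
         "reject the null hypothesis",
         "fail to reject the null hypothesis")
}

print(decide(c(0.01, 0.20)))
```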

The chi-squared test

We calculate a \(\chi^2\) statistic that measures the total discrepancy between the observed counts and the expected counts in a dataset:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

Here \(O\) is the observed frequency in a cell and \(E\) is the expected frequency in that cell if the null hypothesis were true; the sum runs over all cells of the table. A large \(\chi^2\) value means our observations are very different from what we’d expect, giving a small p-value.

The next slide runs a \(\chi^2\) test on the mtcars dataset.
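
As a rough illustration of the formula, the sketch below computes the statistic by hand for a made-up 2x2 table and compares it with `chisq.test()`. The counts are hypothetical. The expected counts assume the null hypothesis of independence, so each is row total × column total / grand total.

```r
# Hypothetical 2x2 table of observed counts (made-up numbers)
observed <- matrix(c(20, 10,
                     15, 25),
                   nrow = 2, byrow = TRUE)

# Expected count per cell under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom and the upper-tail p-value
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# chisq.test() without the continuity correction should agree with the manual value
builtin <- chisq.test(observed, correct = FALSE)
print(c(manual = chi_sq, builtin = unname(builtin$statistic), p = p_value))
```

For 2x2 tables `chisq.test()` applies Yates' continuity correction by default, so `correct = FALSE` is needed here to match the plain formula.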

Cars \(\chi^2\) test code

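library(dplyr)
library(ggplot2)

# Recode cylinders and transmission as labelled factors for plotting
mtcars_data <- mtcars %>%
  mutate(cylinder = as.factor(cyl),
         AorM = factor(am, labels = c("Automatic", "Manual")))

# Contingency table of cylinders by transmission, then the chi-squared test
chitable <- table(mtcars$cyl, mtcars$am)
result <- chisq.test(chitable)
print(result)

# Share of each transmission type within each cylinder count
ggplot(mtcars_data, aes(x = cylinder, fill = AorM)) +
  geom_bar(position = "fill", alpha = 0.8) +
  labs(title = "Transmission Type by Number of Cylinders",
       x = "Number of Cylinders", y = "Ratio", fill = "Transmission")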

Cars \(\chi^2\) data plot

Cars \(\chi^2\) test results

## 
##  Pearson's Chi-squared test
## 
## data:  chitable
## X-squared = 8.7407, df = 2, p-value = 0.01265

As the p-value (0.01265) is less than 0.05, we can say there is a statistically significant relationship between whether a car is manual or automatic and how many cylinders it has.
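
As a sanity check, the reported p-value can be recovered from the printed statistic and degrees of freedom. This sketch also prints the expected counts. Several are below 5, which is the usual reason R warns that the chi-squared approximation may be inaccurate for this table, so the result is best read with some caution.

```r
# Values copied from the printed test output
statistic <- 8.7407
dof <- 2  # (3 cylinder levels - 1) * (2 transmission types - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(statistic, df = dof, lower.tail = FALSE)
print(round(p_value, 5))

# Expected counts for the same table (mtcars ships with R);
# suppressWarnings() hides the small-count approximation warning
res <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))
expected <- res$expected
print(round(expected, 2))
```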

Two-tailed Test

  • For our next example, let’s compute a two-tailed p-value under the standard normal distribution for a z-score of 1.5. Here is the code:
# Two-tailed p-value for z = 1.5 under the standard normal distribution
p <- pnorm(-1.5) * 2
data <- data.frame(x = seq(-4, 4, length.out = 1000))
data$y <- dnorm(data$x)

# Tail regions beyond |z| = 1.5, shaded in the plot
left <- data %>% filter(x <= -1.5)
right <- data %>% filter(x >= 1.5)

# linewidth replaces size for lines in ggplot2 >= 3.4.0
ggplot(data, aes(x, y)) + geom_line(linewidth = 1) +
  geom_area(data = left, aes(x, y), alpha = 0.6) +
  geom_area(data = right, aes(x, y), alpha = 0.6) +
  labs(title = "Two-Tailed Test", subtitle = paste("p-value =", round(p, 4)),
       y = "Density", x = "Z-score")

Two-tailed Test plot

Results

The p-value is about 0.134, which is above 0.05, so a z-score of 1.5 is not statistically significant at the 5% level in a two-tailed test.
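
A quick numerical check of the two-tailed calculation, sketched with base R only. It also shows the one-tailed p-value and the critical z-score for comparison. The variable names are illustrative.

```r
alpha <- 0.05
z <- 1.5

# Two-tailed p-value: area in both tails beyond |z|
p_two <- 2 * pnorm(-abs(z))

# One-tailed p-value (upper tail only), shown for comparison
p_one <- pnorm(z, lower.tail = FALSE)

# Critical value: |z| must exceed this for a two-tailed result to be significant
z_crit <- qnorm(1 - alpha / 2)

print(round(c(p_two = p_two, p_one = p_one, z_crit = z_crit), 4))
```

Since 1.5 is below the critical value of roughly 1.96, the two-tailed p-value stays above 0.05.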

T-test vs degrees of freedom

Last, let’s show how the density of the t-distribution depends on the t-value and the degrees of freedom. Here’s the code:

library(plotly)

# Grid of degrees of freedom and t-values
df <- seq(1, 20, by = 1)
x <- seq(-5, 5, by = 0.1)

# Density matrix: rows follow x (t-values), columns follow df
density <- outer(x, df, FUN = function(x, df) dt(x, df))

plot_ly(x = ~df, y = ~x, z = ~density, type = "surface") %>%
  layout(title = "T-Distribution by T-value and Degrees of Freedom",
         scene = list(xaxis = list(title = "Degrees of Freedom (df)"),
                      yaxis = list(title = "T-value"),
                      zaxis = list(title = "Density")))

T Density Plot
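
One way to read the surface numerically: as the degrees of freedom grow, the tails of the t-distribution thin out and it approaches the standard normal. The sketch below compares two-tailed 5% critical values. The chosen df values are arbitrary.

```r
# Arbitrary selection of degrees of freedom to compare
dof <- c(1, 2, 5, 10, 20, 100)

# Two-tailed 5% critical values of the t-distribution for each df
t_crit <- qt(0.975, df = dof)

# Corresponding critical value of the standard normal distribution
z_crit <- qnorm(0.975)

print(data.frame(df = dof, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3)))
```

The t critical values shrink toward the normal value as df increases, which matches the surface becoming more sharply peaked at higher df.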

Thanks for tuning in!