2023-10-15

What is a p-value?

The p-value is the probability of obtaining a test statistic equal to or more extreme than the one computed from the observed data, assuming the null hypothesis is true.

A p-value does NOT tell you how large the difference between two groups is. That is measured by effect size. Rather, a p-value indicates how surprising the observed difference would be if there were truly no difference between the groups.
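To make the distinction between effect size and p-value concrete, here is a minimal sketch using simulated data (not the murders dataset). The helper `cohens_d` is written here for illustration and is not part of any package; Cohen's d is one common effect-size measure among several. Both comparisons are drawn from populations whose means differ by the same modest amount, and only the sample size changes.

```r
# Illustrative sketch with simulated data (not the murders dataset).
set.seed(1)

# Cohen's d: difference in means divided by the pooled standard deviation.
# Hypothetical helper defined here for illustration.
cohens_d <- function(a, b) {
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                      (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled_sd
}

# Same underlying difference in means (0.2 SD), two different sample sizes.
small_a <- rnorm(20, mean = 0.2)
small_b <- rnorm(20, mean = 0)
large_a <- rnorm(5000, mean = 0.2)
large_b <- rnorm(5000, mean = 0)

results <- data.frame(
  n        = c(20, 5000),
  cohens_d = c(cohens_d(small_a, small_b), cohens_d(large_a, large_b)),
  p_value  = c(t.test(small_a, small_b)$p.value,
               t.test(large_a, large_b)$p.value)
)
print(results)
```

With 5000 observations per group the p-value becomes very small even though the underlying effect is still modest, which is why a p-value should be read alongside an effect size rather than in place of one.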

The built-in murders dataset will be used in this presentation.

Common p-values

  • By convention, there is a statistically significant difference between two groups if \(p < 0.05\)
  • There is a marginally significant difference between two groups if \(0.05 \le p < 0.1\)
  • There is not a statistically significant difference between two groups if \(p \ge 0.1\)
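The cutoffs above can be turned into a small labelling function. This is only a sketch: the name `classify_p` and the label strings are made up for illustration, and it assumes that a p-value of exactly 0.05 counts as marginal and exactly 0.1 counts as not significant.

```r
# Label p-values using the conventional 0.05 and 0.1 cutoffs.
# classify_p is a hypothetical helper, not a function from any package.
classify_p <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  ifelse(p < 0.05, "significant",
         ifelse(p < 0.1, "marginally significant", "not significant"))
}

classify_p(c(0.01, 0.07, 0.36))
# [1] "significant"            "marginally significant" "not significant"
```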

Calculating p-values

  • When comparing the means of two groups, p-values are typically calculated with a t-test

T-test Equation

\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}\]

  • \(\bar{x}_1, \bar{x}_2\) = mean of first, second sample
  • \(s_1, s_2\) = standard deviation of first, second sample
  • \(n_1, n_2\) = size of first, second sample
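As a check on the formula, the t statistic can be computed by hand and compared with R's `t.test()`, which defaults to this unequal-variance (Welch) form. The sample values below are made up for illustration. Turning t into a p-value also needs degrees of freedom; the sketch uses the Welch-Satterthwaite approximation, which is not shown on the slide.

```r
# Two small made-up samples (illustrative values only).
x1 <- c(12, 15, 11, 19, 14, 17)
x2 <- c(22, 18, 25, 16, 21)

# Squared standard error of each mean: s^2 / n.
v1 <- var(x1) / length(x1)
v2 <- var(x2) / length(x2)

# t statistic from the equation: difference in means over sqrt(s1^2/n1 + s2^2/n2).
t_manual <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom, then a two-sided p-value.
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# t.test() with default arguments performs the same Welch test.
fit <- t.test(x1, x2)
all.equal(unname(fit$statistic), t_manual)  # TRUE
all.equal(fit$p.value, p_manual)            # TRUE
```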

Load Libraries and Dataset

Libraries Needed: dslabs, ggplot2, grid, gridExtra, lattice, plotly

Built-in Dataset: murders

Regions in Murders Dataset

# Load the dataset and list the distinct regions it contains
library(dslabs)
data(murders)
print(unique(murders[c('region')]))
##           region
## 1          South
## 2           West
## 7      Northeast
## 14 North Central

Comparing Different Regions in Murders Dataset

# One subset per pair of regions (six pairs in total)
murders1 <- subset(murders, region %in% c('West', 'South'))
murders2 <- subset(murders, region %in% c('South', 'Northeast'))
murders3 <- subset(murders, region %in% c('Northeast', 'North Central'))
murders4 <- subset(murders, region %in% c('North Central', 'West'))
murders5 <- subset(murders, region %in% c('North Central', 'South'))
murders6 <- subset(murders, region %in% c('West', 'Northeast'))

Using T-Test to Calculate Statistical Significance Between Groups

# t.test() defaults to the unequal-variance (Welch) test; 'total' is the murder count per state
murders1t <- t.test(total ~ region, data = murders1)
murders2t <- t.test(total ~ region, data = murders2)
murders3t <- t.test(total ~ region, data = murders3)
murders4t <- t.test(total ~ region, data = murders4)
murders5t <- t.test(total ~ region, data = murders5)
murders6t <- t.test(total ~ region, data = murders6)
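The six subsets and six `t.test()` calls can also be generated in a loop. The following is a minimal sketch, not the approach used in this presentation: `pairwise_welch_p` is a hypothetical helper written for illustration, and the toy data frame stands in for a real dataset so the example runs without any packages.

```r
# Hypothetical helper: Welch t-test p-value for every pair of groups.
pairwise_welch_p <- function(df, value, group) {
  groups <- unique(as.character(df[[group]]))
  pairs  <- combn(groups, 2, simplify = FALSE)  # all unordered pairs
  p <- sapply(pairs, function(pr) {
    a <- df[[value]][df[[group]] == pr[1]]
    b <- df[[value]][df[[group]] == pr[2]]
    t.test(a, b)$p.value
  })
  names(p) <- sapply(pairs, paste, collapse = " vs ")
  p
}

# Toy data standing in for a real dataset (values are made up).
toy <- data.frame(
  total  = c(5, 7, 6, 9, 20, 22, 19, 25, 6, 8, 7, 5),
  region = rep(c("A", "B", "C"), each = 4)
)
toy_p <- pairwise_welch_p(toy, "total", "region")
print(toy_p)

# With dslabs loaded, the six regional comparisons would correspond to:
# pairwise_welch_p(murders, "total", "region")
```

Because `t.test(a, b)` and `t.test(total ~ region, ...)` run the same two-sample Welch test, the p-values should agree with the formula interface; only the sign of the t statistic can differ, depending on group order.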

Statistical Significance Results Part I

# West and South
murders1t[3]
## $p.value
## [1] 0.3640829
# South and Northeast
murders2t[3]
## $p.value
## [1] 0.3357019
# Northeast and North Central
murders3t[3]
## $p.value
## [1] 0.8938812

Statistical Significance Results Part II

# North Central and West
murders4t[3]
## $p.value
## [1] 0.9597182
# North Central and South
murders5t[3]
## $p.value
## [1] 0.1773486
# West and Northeast
murders6t[3]
## $p.value
## [1] 0.8895603

Statistical Significance Results Part III

  • All six p-values are greater than 0.1, so none is even marginally significant
  • There are no statistically significant differences in the mean number of murders per state between any pair of regions
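The summary can be verified directly from the six p-values reported on the previous slides. This short sketch copies those values into a named vector; the variable name `p_values` is chosen here for illustration.

```r
# p-values copied from the six t-test outputs on the previous slides.
p_values <- c(
  "West vs South"              = 0.3640829,
  "South vs Northeast"         = 0.3357019,
  "Northeast vs North Central" = 0.8938812,
  "North Central vs West"      = 0.9597182,
  "North Central vs South"     = 0.1773486,
  "West vs Northeast"          = 0.8895603
)

all(p_values > 0.05)        # TRUE
names(which.min(p_values))  # "North Central vs South"
```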

Boxplot of Number of Murders in Each Region

Population vs. Number of Murders

Across states, the number of murders tends to increase as the state population increases.

Barplots to Show Lack of Statistical Significance Between Regions (ggplot)

References