Pearson Correlation Test


When to use: Use this test when you want to determine if there is a linear relationship between two numerical variables.

How it works: The test calculates the correlation coefficient (r) and tests if it is significantly different from zero.

Pearson’s correlation coefficient (r) measures the strength and direction of the linear relationship between two numerical variables.

Range: The value of r ranges from -1 to 1.

Strength: The closer the absolute value of r is to 1, the stronger the linear relationship.

Direction: The sign of r indicates the direction of the relationship.
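
As a quick illustration of range and direction, here is a minimal sketch using made-up vectors (x, y_up, and y_down are hypothetical and not taken from any dataset) and R's built-in cor() function, which computes Pearson's r by default:

```r
# Hypothetical toy data: not part of the mtcars example
x <- 1:10
y_up <- 2 * x + 3        # perfectly linear, increasing
y_down <- -0.5 * x + 1   # perfectly linear, decreasing

r_up <- cor(x, y_up)     # Pearson's r is the default method
r_down <- cor(x, y_down)

print(r_up)
print(r_down)
```

A perfectly increasing line sits at the top of the range (r = 1) and a perfectly decreasing line sits at the bottom (r = -1); real data usually falls somewhere in between.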

Let’s take a look at an example from the ‘mtcars’ dataset in R. This data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Let’s explore the relationship between the horsepower and miles per gallon of various cars.

Null Hypothesis (H0): There is no correlation between miles per gallon (mpg) and horsepower (hp) (correlation coefficient = 0).

Alternative Hypothesis (H1): There is a correlation between miles per gallon (mpg) and horsepower (hp) (correlation coefficient ≠ 0).

The following scatterplot shows this relationship visually:

data(mtcars)
library(ggplot2)

ggplot(mtcars, aes(x=hp, y=mpg)) +
  geom_point() +
  labs(title="Scatter Plot of MPG vs Horsepower",
       x="Horsepower (hp)",
       y="Miles Per Gallon (mpg)")

We can run the Pearson correlation test for these two variables using the cor.test() function in R:

cor.test(mtcars$mpg, mtcars$hp)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684

Since the p-value is below 0.05 (1.788e-07 is scientific notation for 0.0000001788), we can reject the null hypothesis and conclude that there is a statistically significant correlation between miles per gallon and horsepower.

The “cor” value of -0.776 is Pearson’s correlation coefficient for these two variables. This would be considered a strong, negative relationship: as the horsepower of a car increases, its miles per gallon tends to decrease.
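
To make the "How it works" step concrete, the sketch below recomputes r by hand and converts it into the t statistic reported by cor.test(). It assumes the standard textbook formulas (r as the scaled sum of cross-products, and t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom); the variable names r_manual, t_stat, and p_value are just illustrative.

```r
data(mtcars)
x <- mtcars$hp
y <- mtcars$mpg
n <- length(x)

# Pearson's r: sum of cross-products scaled by the spread of x and y
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Test statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_stat <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(r = r_manual, t = t_stat, p = p_value))
```

The three values should agree with the cor, t, and p-value lines in the cor.test() output above.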


Chi-Squared Test


When to use: Use this test to examine the association between two categorical variables.

How it works: The test compares the observed frequencies of categories to the expected frequencies if the variables were independent.

Let’s take a look at an example from the same dataset (mtcars). Take a look at the bar plot below showing the distribution of automatic and manual transmissions across cars with 4, 6, and 8 cylinders:
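
The code for this bar plot is not shown here. A minimal sketch that would draw a comparable plot, assuming a side-by-side (dodged) ggplot2 bar chart; the original plot may have been styled differently, and the name cyl_am_plot is only illustrative:

```r
library(ggplot2)
data(mtcars)

# Assumed styling: one bar per transmission type within each cylinder count
cyl_am_plot <- ggplot(mtcars,
                      aes(x = factor(cyl),
                          fill = factor(am, labels = c("Automatic", "Manual")))) +
  geom_bar(position = "dodge") +
  labs(title = "Transmission Type by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Number of Cars",
       fill = "Transmission")

print(cyl_am_plot)
```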

Looking at this bar plot, there appears to be a relationship between the transmission type of the car and the number of cylinders the car has (the proportion of automatic cars seems to increase as the number of cylinders increases). Let’s perform a Chi-Squared Test to see if these two variables are independent:

Null Hypothesis (H0): The number of cylinders in a car is independent of the type of transmission. (In other words, the proportion of cars with different types of transmission (automatic or manual) is the same across different cylinder counts.)

Alternative Hypothesis (H1): The number of cylinders in a car is not independent of the type of transmission. (In other words, the proportion of cars with different types of transmission (automatic or manual) varies across different cylinder counts.)

We first need to create a contingency table to perform the test. This is a table that has three rows (the number of cylinders: 4, 6, or 8) and two columns (automatic (0) and manual (1)). The value in each cell is the count of cars that fall into that subgroup (e.g., the number of cars with 4 cylinders that are manual, the number of cars with 8 cylinders that are automatic, etc.).

contingency_table <- table(mtcars$cyl, mtcars$am)
print(contingency_table)
##    
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2

Now we can run the Chi-Squared Test using the chisq.test() function in R, which takes the contingency table we just made as the input.

chi_squared_result <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print(chi_squared_result)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 8.7407, df = 2, p-value = 0.01265

Uh oh! While the test ran, we received a warning message that the chi-squared approximation may be incorrect. Why might that be?

Well, if you take a look at the contingency table, some of the counts are very low (only 3 automatic cars have 4 cylinders!). With a sample this small, some cells also end up with low expected counts, which are the counts we would expect to see if the two variables were independent. When a table has cells with expected counts below 5, the chi-squared approximation becomes unreliable, and Fisher’s Exact Test is a more appropriate statistical test to use. Fortunately, we can easily perform that test instead using the fisher.test() function in R:

fisher_result <- fisher.test(contingency_table)
print(fisher_result)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.009105
## alternative hypothesis: two.sided

Since our p-value (0.009105) is below 0.05, we can reject the null hypothesis and conclude that the proportion of cars in the population with different types of transmission (automatic or manual) does indeed vary across different cylinder counts.
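
To see why the chi-squared approximation was flagged, the sketch below computes the expected counts by hand and rebuilds the X-squared statistic from them. It assumes the usual formulas (expected count = row total * column total / grand total, and X-squared = sum of (observed - expected)^2 / expected); the names expected, n_small, and x_squared are only illustrative.

```r
data(mtcars)
contingency_table <- table(mtcars$cyl, mtcars$am)

# Expected count per cell under independence: row total * column total / n
expected <- outer(rowSums(contingency_table), colSums(contingency_table)) /
  sum(contingency_table)
print(round(expected, 2))

# Number of cells whose expected count is below 5
n_small <- sum(expected < 5)
print(n_small)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((contingency_table - expected)^2 / expected)
print(x_squared)
```

The same expected counts are also available from the test object itself via chisq.test(contingency_table)$expected, and x_squared should match the X-squared value reported by chisq.test() above.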


Two-Sample t-Test


When to use: Use this test to compare the means of a numerical variable between two different levels of a categorical variable. For example, you might use this test to determine if individuals who adopted a special diet (two categories: diet or no diet) had lower blood pressure than individuals who did not.

How it works: The test compares the means of the two groups and determines if they are statistically significantly different.

Let’s take a look at an example from the same dataset (mtcars). Take a look at the boxplots below comparing mpg for each type of transmission (the red diamond indicates the mean mpg):

ggplot(mtcars, aes(x=factor(am, labels=c("Automatic", "Manual")), y=mpg)) +
  geom_boxplot() +
  stat_summary(fun=mean, geom="point", shape=23, size=3, color="red", fill = "red") +
  labs(title="MPG by Transmission Type",
       x="Transmission Type",
       y="Miles Per Gallon (mpg)")

There appears to be a clear difference in the mean mpg between automatic and manual cars in our sample. Let’s perform a Two-Sample t-Test to see if this difference is statistically significant.

Null Hypothesis (H0): The mean mpg for automatic transmission cars is equal to the mean mpg for manual transmission cars.

Alternative Hypothesis (H1): The mean mpg for automatic transmission cars is not equal to the mean mpg for manual transmission cars.

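We will use the t.test() function in R to perform this test. Here we pass it a formula with the numerical variable on the left of the ~ and the categorical (grouping) variable on the right. By default, t.test() runs Welch’s version of the test, which does not assume that the two groups have equal variances: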

t.test(mtcars$mpg ~ mtcars$am)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

Since our p-value (0.001374) is below 0.05, we can reject the null hypothesis and conclude that there is a statistically significant difference between the average mpg for manual and automatic cars. In this sample, manual cars (group 1) averaged about 24.4 mpg, compared with about 17.1 mpg for automatic cars (group 0).
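
For readers curious about what t.test() computes, here is a minimal sketch that rebuilds the Welch statistic by hand. It assumes the standard Welch formulas (difference in means divided by the square root of the summed per-group variance / n terms, with Welch-Satterthwaite degrees of freedom); the variable names are only illustrative.

```r
data(mtcars)
auto <- mtcars$mpg[mtcars$am == 0]     # automatic cars (group 0)
manual <- mtcars$mpg[mtcars$am == 1]   # manual cars (group 1)

# Squared standard error contribution from each group
se2_auto <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

print(c(t = t_stat, df = welch_df, p = p_value))
```

These values should line up with the t, df, and p-value lines in the t.test() output above.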

These are the main types of statistical inference tests you can use to determine statistically significant relationships in your data. Happy hypothesis testing!