Introduction

In the previous sessions we tested the significance of means and proportions using a single sample, two samples, and more than two samples (ANOVA for comparing more than two means). Those tests were univariate because each of them involves only one test variable. In other words, the analysis of a single variable is called univariate analysis.

When we talk about simple correlation, we have two variables under study, so we are dealing with a bivariate distribution. Similarly, when we analyze more than two variables, we are doing multivariate analysis.

In this session we discuss how to test whether the correlation coefficient (r) between two variables can be regarded as statistically significant. But first, let us recall the simple correlation coefficient and its properties.

Simple correlation coefficient (r)

Consider two variables, say X and Y. The linear relationship between them is known as correlation, and its strength is measured by the correlation coefficient, traditionally denoted by r (or sometimes by \(r_{xy}\)), which is given by

\[r = \frac{Cov(X,Y)}{\sqrt {Var(X)} \times \sqrt {Var(Y)}} \qquad (1)\]

where

\(Cov(X,Y)\) = Covariance of X and Y = \(\frac{1}{n} \times \sum {(X - \overline X) (Y - \overline Y)}\)

\(Var(X)\) = Variance of X = \(\frac{1}{n} \times \sum (X - \overline X)^2\)

\(Var(Y)\) = Variance of Y = \(\frac{1}{n} \times \sum (Y - \overline Y)^2\)

Substituting all these values into equation (1) and simplifying, we get

\[r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - (\sum X)^2} \times {\sqrt{n \sum Y^2 - (\sum Y)^2}}} \qquad (2)\]

which is known as Karl Pearson’s product-moment formula for computing the coefficient of correlation.
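As a quick check, formula (2) can be coded directly and compared with R's built-in cor(). This is only a minimal sketch: the function name pearson_r and the vectors x and y are hypothetical and used purely for illustration.

```r
# Pearson's r from the computational formula (2).
# `pearson_r` is an illustrative helper, not a built-in R function.
pearson_r <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  n <- length(x)
  numerator <- n * sum(x * y) - sum(x) * sum(y)
  denominator <- sqrt(n * sum(x^2) - sum(x)^2) *
    sqrt(n * sum(y^2) - sum(y)^2)
  numerator / denominator
}

# Hypothetical data, chosen only to illustrate the formula.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)

r_formula <- pearson_r(x, y)
r_builtin <- cor(x, y)  # built-in Pearson correlation

print(c(formula = r_formula, builtin = r_builtin))
```

Both values should agree. cor() divides by n - 1 rather than n internally, but the divisor cancels between the numerator and the denominator of equation (1), so r is unaffected.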

Properties of r

Following are some important properties of r.

  1. The correlation coefficient is symmetric, i.e., \(r_{xy} = r_{yx}\)

  2. The value of \(\large r\) always lies between -1 and +1, i.e., \(-1 \leq r \leq +1\)

    • if \(r = 1\), there is perfect and positive correlation between X and Y
    • if \(0 < r < 0.5\), there is low degree and positive correlation between them
    • if \(0.5 \leq r < 0.7\), there is moderate and positive correlation
    • if \(0.7 \leq r < 1\), there is high degree and positive correlation
    • if \(r = 0\), there is no linear correlation between them
    • negative values of \(r\) are interpreted in the same way with the direction reversed, and \(r = -1\) indicates perfect and negative correlation
  3. The correlation coefficient is independent of change of origin and scale.
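
Properties 1 and 3 are easy to verify numerically. The sketch below uses hypothetical data and arbitrary positive scale factors. Note that a negative scale factor applied to only one variable would flip the sign of r while leaving its magnitude unchanged.

```r
# Hypothetical data, used only to illustrate the properties.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)

# Property 1: symmetry.
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Property 3: change of origin and scale.
# u and v shift and rescale x and y with positive scale factors.
u <- (x - 5) / 2
v <- (y - 3) * 10
r_uv <- cor(u, v)

print(c(r_xy = r_xy, r_yx = r_yx, r_uv = r_uv))
```

All three printed values should be identical up to floating point error.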

Test of significance of simple correlation coefficient

Suppose we want to test whether the simple correlation coefficient is statistically significant or not (see note 1 at the end of this session). The test of hypothesis then proceeds as follows.

  1. Null Hypothesis (H0): \(\rho = 0\) [The correlation coefficient is not significant.]
    Against
    Alternative Hypothesis (H1): \(\rho \neq 0\) [The correlation coefficient is significant.]

  2. Level of significance (\(\alpha\)) = 0.05

  3. The test statistic: \(\large t = \frac{r - \rho_0}{\sqrt{\frac{1 - r^2}{n -2}}}\)

Where:
\(r\) = sample correlation coefficient
\(\rho_0\) = Hypothesized population correlation coefficient (= 0)
\(n\) = Size of the sample (number of paired observations)

Note :

  • the denominator part \(\sqrt{\frac{1 -r^2}{n-2}}\) is known as the standard error of correlation coefficient (s.e. of \(r\)).

  • \(r^2\) is known as coefficient of determination.

  4. Decision Rule: Accept H0 if \(Cal|t| \leq tab \ t\) for n - 2 degrees of freedom at the \(\alpha\) level of significance (don’t forget to check the nature of H1).
    Reject H0 otherwise.

  5. Conclusion: Based on the decision and the problem statement.
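
The steps above can be collected into a small helper. This is a sketch rather than a definitive implementation: the function name cor_t_test and the inputs r = 0.6, n = 20 are hypothetical, and the test is assumed to be two-tailed with n - 2 degrees of freedom, as in the worked example and in the output of cor.test() later in this session.

```r
# Two-tailed t test for H0: rho = 0, given a sample r and sample size n.
# `cor_t_test` is an illustrative helper, not a built-in R function.
cor_t_test <- function(r, n, alpha = 0.05) {
  stopifnot(abs(r) < 1, n > 2)
  df <- n - 2
  se_r <- sqrt((1 - r^2) / df)         # standard error of r
  t_cal <- (r - 0) / se_r              # rho_0 = 0 under H0
  t_tab <- qt(1 - alpha / 2, df)       # two-tailed critical value
  p_value <- 2 * pt(-abs(t_cal), df)   # two-tailed p-value
  list(t = t_cal, df = df, t_tab = t_tab, p_value = p_value,
       reject_H0 = abs(t_cal) > t_tab)
}

# Hypothetical inputs, chosen only for illustration.
res <- cor_t_test(r = 0.6, n = 20)
print(res)
```

The element reject_H0 encodes the decision rule: it is TRUE when the calculated |t| exceeds the tabulated t.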

Assumptions of correlation test

  1. The observations are paired.
  2. Both variables should be continuous (not ordinal) and linearly related.
  3. Both variables should follow a normal distribution.
  4. Homogeneity of variance: the variance of one variable should be stable at all levels of the other variable.
  5. There are no major outliers.
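
Some of these assumptions can be screened in R before running the test. The sketch below uses the sales and expenses figures from the numerical example in this session. The choice of the Shapiro-Wilk test for normality and the boxplot rule for outliers is our assumption, not a prescribed procedure, and with only 5 observations such checks have very little power.

```r
Sales <- c(43, 41, 36, 34, 50)
Expenses <- c(12, 24, 15, 21, 19)

# Normality: Shapiro-Wilk test on each variable (H0: the data are normal).
sw_sales <- shapiro.test(Sales)
sw_expenses <- shapiro.test(Expenses)
print(c(Sales = sw_sales$p.value, Expenses = sw_expenses$p.value))

# Outliers: values beyond 1.5 * IQR from the hinges (boxplot rule).
out_sales <- boxplot.stats(Sales)$out
out_expenses <- boxplot.stats(Expenses)$out
print(list(Sales = out_sales, Expenses = out_expenses))
```

A p-value above 0.05 in the Shapiro-Wilk test means normality is not rejected. Linearity is usually judged by eye from a scatter plot, for example with plot(Sales, Expenses).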

Numerical Example

Calculate Karl Pearson’s correlation coefficient for the following data on sales and expenses (in thousand rupees) of 5 firms. Also interpret the value of the correlation coefficient.

Sales Expenses
43 12
41 24
36 15
34 21
50 19

Test the significance of the correlation coefficient at the 5% level of significance. Can you generalize that sales and expenses are correlated according to this sample data?

Solution

Part I

Let us first compute the sums \(\sum {X}, \sum {Y}, \sum {XY}, \sum {X^2}, \sum {Y^2}\) needed to find the sample correlation coefficient (\(r\)) using equation (2). (Students are required to compute all the sums by themselves.)

The sample correlation coefficient
\((r) = \frac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - (\sum X)^2} \times {\sqrt{n \sum Y^2 - (\sum Y)^2}}} = - 0.0732\)

There is a very low and negative correlation between sales and expenses according to the sample data.
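
To check hand calculations, the sums and the value of r from equation (2) can be reproduced in R. This is a minimal sketch. The names sums, n and r are our own choices.

```r
Sales <- c(43, 41, 36, 34, 50)      # X
Expenses <- c(12, 24, 15, 21, 19)   # Y
n <- length(Sales)

# The five sums required by equation (2).
sums <- c(
  sum_x  = sum(Sales),
  sum_y  = sum(Expenses),
  sum_xy = sum(Sales * Expenses),
  sum_x2 = sum(Sales^2),
  sum_y2 = sum(Expenses^2)
)
print(sums)

# Equation (2) evaluated with these sums.
r <- (n * sums[["sum_xy"]] - sums[["sum_x"]] * sums[["sum_y"]]) /
  (sqrt(n * sums[["sum_x2"]] - sums[["sum_x"]]^2) *
     sqrt(n * sums[["sum_y2"]] - sums[["sum_y"]]^2))
print(r)
```

The sums come out as 204, 91, 3704, 8482 and 1747, and r agrees with the value reported above up to rounding.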

Part II. Test of hypothesis

  1. H0 : \(\rho = 0\) (The correlation coefficient is not significant)
    H1 : \(\rho \neq 0\)

  2. Level of significance \(\alpha = 0.05\)

  3. The test statistic: \(t = \frac{r - \rho_0}{\sqrt{\frac{1 - r^2}{n -2}}}\) \(= \frac{(-0.0732) - 0}{\sqrt{\frac{1 - (-0.0732)^2}{5 - 2}}}\)
    Calculated t = - 0.12728

\(\therefore\) Calculated |t| = 0.12728

Now we need the tabulated value of t, which depends on the following:

degrees of freedom (df) = n - 2 = 5 - 2 = 3
level of significance (\(\alpha\)) = 0.05
Alternative Hypothesis: Two tailed

\(\therefore\) the tabulated value = 3.182 (see t-table)

  4. Decision: Since cal|t| = 0.12728 < tab t = 3.182 at the 0.05 level of significance for 3 df, we accept the null hypothesis.

  5. Conclusion: We conclude that there is no significant correlation between Sales and Expenses according to this sample data.

No, we cannot generalize that Sales and Expenses are significantly related.
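
Instead of reading a printed t-table, the tabulated value can be obtained with qt(). The sketch below assumes the two-tailed test at the 5% level with 3 degrees of freedom used in this example, and takes the calculated t from the solution above.

```r
alpha <- 0.05
df <- 5 - 2                      # n - 2 degrees of freedom

# Two-tailed test: put alpha / 2 in each tail.
t_tab <- qt(1 - alpha / 2, df)
print(round(t_tab, 3))           # 3.182

t_cal <- -0.12728                # calculated t from the solution above
reject_H0 <- abs(t_cal) > t_tab
print(reject_H0)                 # FALSE, so H0 is not rejected
```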

Correlation test using R

The cor.test() function is used to test whether the correlation coefficient is significant or not. Using R we can test this hypothesis as below.

Sales <- c(43, 41, 36, 34, 50)      # creates `Sales` as an object
Expenses <- c(12, 24, 15, 21, 19)   # creates `Expenses`
cor.test(Sales, Expenses)           # Pearson correlation test (the default method)
## 
##  Pearson's product-moment correlation
## 
## data:  Sales and Expenses
## t = -0.12728, df = 3, p-value = 0.9068
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8975205  0.8649034
## sample estimates:
##        cor 
## -0.0732849

Since the p-value (0.9068) is greater than 0.05, we accept H0. Therefore, we conclude that the correlation coefficient is not significant.
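
The object returned by cor.test() can also be stored, so that the individual quantities are available for the decision step. A minimal sketch, where the names res, alpha and significant are our own:

```r
Sales <- c(43, 41, 36, 34, 50)
Expenses <- c(12, 24, 15, 21, 19)

res <- cor.test(Sales, Expenses)   # an object of class "htest"

res$estimate    # sample correlation coefficient r
res$statistic   # calculated t
res$parameter   # degrees of freedom (n - 2)
res$p.value     # two-tailed p-value
res$conf.int    # 95 percent confidence interval for rho

# Decision at the 5% level of significance.
alpha <- 0.05
significant <- res$p.value < alpha
print(significant)
```

Here significant is FALSE, which matches the conclusion drawn from the printed output.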


  1. "Significant" here refers to whether the computed value of a statistic, for example the sample mean (\(\overline{x}\)), sample proportion (\(p\)) or sample correlation coefficient (\(r\)), can be generalized or not. If a sample statistic is significant, it can be generalized, that is, inferred to the population from which the sample has been drawn.