In the previous sessions we tested the significance of means and proportions using a single sample, two samples and more than two samples (ANOVA for comparing more than two means). Those tests were univariate, because each involved only one test variable; the analysis of a single variable is called univariate analysis.
When we talk about simple correlation, we have two variables under study, so we are dealing with a bivariate distribution. Similarly, when we analyze more than two variables, we are doing multivariate analysis.
In this session we discuss how to test whether the correlation coefficient (r) between two variables can be regarded as statistically significant. But first, let us recall simple correlation and its properties.
Consider two variables, say X and Y. The linear relationship between these two variables is known as correlation, and its strength is measured by the correlation coefficient, traditionally denoted by r (or sometimes by \(r_{xy}\)) and given by:
\[r = \frac{Cov(X,Y)}{\sqrt {Var(X)} \times \sqrt {Var(Y)}} \qquad (1)\]
where,
\(Cov(X,Y)\) = Covariance of X and Y = \(\frac{1}{n} \times \sum {(X - \overline X) (Y - \overline Y)}\)
\(Var(X)\) = Variance of X = \(\frac{1}{n} \times \sum (X - \overline X)^2\)
\(Var(Y)\) = Variance of Y = \(\frac{1}{n} \times \sum (Y - \overline Y)^2\)
Substituting all these values in equation (1) and simplifying, we get:
\[r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - (\sum X)^2} \times {\sqrt{n \sum Y^2 - (\sum Y)^2}}} \qquad (2)\]
which is known as Karl Pearson’s product-moment formula for computing the coefficient of correlation.
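The simplification from equation (1) to equation (2) can be sketched as follows. It relies on the standard identities \(\sum (X - \overline X)(Y - \overline Y) = \sum XY - n \overline X \, \overline Y\) and \(\sum (X - \overline X)^2 = \sum X^2 - n \overline X^2\) (and likewise for Y), so that equation (1) becomes

\[r = \frac{\frac{1}{n} \sum XY - \overline X \, \overline Y}{\sqrt{\frac{1}{n} \sum X^2 - \overline X^2} \times \sqrt{\frac{1}{n} \sum Y^2 - \overline Y^2}}\]

Multiplying the numerator and the denominator by \(n^2\) (a factor \(n\) goes inside each square root as \(n^2\)) and using \(n \overline X = \sum X\) and \(n \overline Y = \sum Y\) gives equation (2).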
Following are some important properties of r.
The correlation coefficient is symmetric, i.e., \(r_{xy} = r_{yx}\)
The value of \(\large r\) always lies between -1 and +1, i.e., \(-1 \leq r \leq +1\)
The correlation coefficient is independent of change of origin and scale.
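These three properties can be checked numerically. The sketch below uses made-up vectors `x` and `y` (they are not from this session) and assumes positive scale factors for the change of scale; it is only an illustration, not a proof.

```r
# Hypothetical paired data, used only for illustration
x <- c(2, 4, 5, 7, 9, 12)
y <- c(10, 9, 13, 12, 18, 17)

r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Change of origin and scale: u = (x - a) / h, v = (y - b) / k, with h, k > 0
u <- (x - 5) / 2
v <- (y - 10) / 3
r_uv <- cor(u, v)

symmetric <- isTRUE(all.equal(r_xy, r_yx))   # r_xy equals r_yx
in_range  <- r_xy >= -1 && r_xy <= 1         # -1 <= r <= +1
invariant <- isTRUE(all.equal(r_xy, r_uv))   # unchanged by origin and scale

print(c(symmetric = symmetric, in_range = in_range, invariant = invariant))
```

All three checks should come out `TRUE` for any data with non-zero variance in both variables.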
Suppose we want to test whether the simple correlation coefficient is statistically significant or not.¹ The test of hypothesis then proceeds as follows.
Null Hypothesis (H0): \(\rho = 0\) [The correlation coefficient is not significant.]
Against
Alternative Hypothesis (H1): \(\rho \neq 0\) [The correlation coefficient is significant.]
Level of significance (\(\alpha\)) = 0.05
The test statistic: \(\large t = \frac{r - \rho_0}{\sqrt{\frac{1 - r^2}{n -2}}}\)
Where:
\(r\) = sample correlation coefficient
\(\rho_0\) = Hypothesized population correlation coefficient (= 0)
\(n\) = Size of the sample (number of paired observations)
Note:
The denominator \(\sqrt{\frac{1 -r^2}{n-2}}\) is known as the standard error of the correlation coefficient (s.e. of \(r\)).
\(r^2\) is known as the coefficient of determination.
Decision Rule: Accept H0 if \(Cal|t| \leq tab \ t\) for n - 2 degrees of freedom at the \(\alpha\) level of significance. (Don’t forget to check whether H1 is one-tailed or two-tailed.)
Reject H0 otherwise.
Conclusion: On the basis of decision and problem statement.
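The steps above can be collected into a small function. This is a minimal sketch for the two-tailed case only; `cor_t_test` is a hypothetical helper name (it is not part of base R), and the values `r = 0.60`, `n = 20` in the usage line are illustrative, not data from this session.

```r
# Hypothetical helper: t-test of H0: rho = 0 against H1: rho != 0
cor_t_test <- function(r, n, alpha = 0.05) {
  df      <- n - 2                        # degrees of freedom
  se      <- sqrt((1 - r^2) / df)         # standard error of r
  t_cal   <- (r - 0) / se                 # calculated t (rho_0 = 0)
  t_tab   <- qt(1 - alpha / 2, df)        # two-tailed tabulated t
  p_value <- 2 * pt(-abs(t_cal), df)      # two-tailed p-value
  list(t = t_cal, df = df, t_tab = t_tab, p_value = p_value,
       reject_H0 = abs(t_cal) > t_tab)
}

# Illustrative values only
res <- cor_t_test(r = 0.60, n = 20)
str(res)
```

The function works from the summary values `r` and `n` alone, which mirrors the hand calculation; when the raw data are available, `cor.test()` does the same job directly.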
Calculate Karl Pearson’s correlation coefficient for the following data on sales and expenses (in thousand rupees) of 5 firms. Also interpret the value of the correlation coefficient.
| Sales | Expenses |
|---|---|
| 43 | 12 |
| 41 | 24 |
| 36 | 15 |
| 34 | 21 |
| 50 | 19 |
Test the significance of the correlation coefficient at the 5% level of significance. Can you generalize that sales and expenses are correlated according to this sample data?
Solution
Part I
Let us first compute the sums \(\sum {X}, \sum {Y}, \sum {XY}, \sum {X^2}, \sum {Y^2}\) to find the sample correlation coefficient (\(r\)) using equation (2). (Students are required to compute all the sums by themselves.)
The sample correlation coefficient
\((r) = \frac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - (\sum X)^2} \times {\sqrt{n \sum Y^2 - (\sum Y)^2}}} = - 0.0732\)
There is a very low negative correlation between sales and expenses according to the sample data.
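As a check on the hand calculation, the sums and equation (2) can be evaluated in R. The sketch below takes X as Sales and Y as Expenses; the intermediate variable names are only illustrative.

```r
Sales    <- c(43, 41, 36, 34, 50)
Expenses <- c(12, 24, 15, 21, 19)
n <- length(Sales)

# The five sums needed by equation (2)
sum_x  <- sum(Sales)             # 204
sum_y  <- sum(Expenses)          # 91
sum_xy <- sum(Sales * Expenses)  # 3704
sum_x2 <- sum(Sales^2)           # 8482
sum_y2 <- sum(Expenses^2)        # 1747

# Equation (2): Karl Pearson's product-moment formula
r_manual <- (n * sum_xy - sum_x * sum_y) /
  (sqrt(n * sum_x2 - sum_x^2) * sqrt(n * sum_y2 - sum_y^2))

r_manual              # about -0.0733
cor(Sales, Expenses)  # built-in function, should agree with r_manual
```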
Part II. Test of hypothesis
H0 : \(\rho = 0\) (The correlation coefficient is not significant)
H1 : \(\rho \neq 0\)
Level of significance \(\alpha = 0.05\)
The test statistic: \(t = \frac{r - \rho_0}{\sqrt{\frac{1 - r^2}{n -2}}}\) \(= \frac{(-0.0732) - 0}{\sqrt{\frac{1 - (-0.0732)^2}{5 - 2}}}\)
Calculated t = - 0.12728
\(\therefore\) Calculated |t| = 0.12728
Now we need the tabulated t, which depends on the following:
degrees of freedom (df) = n - 2 = 5 - 2 = 3
level of significance (\(\alpha\)) = 0.05
Alternative Hypothesis: Two tailed
\(\therefore\) the tabulated value = 3.182 (see t-table)
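Instead of reading the t-table, the same tabulated value can be obtained with the quantile function `qt()`. A minimal sketch for the two-tailed case, where the upper tail area is \(\alpha / 2\):

```r
alpha <- 0.05
df    <- 5 - 2                       # n - 2 degrees of freedom

# Two-tailed test: put alpha / 2 in each tail
t_tab <- qt(1 - alpha / 2, df = df)
round(t_tab, 3)                      # 3.182
```

For a one-tailed alternative the whole of \(\alpha\) would go in one tail, i.e. `qt(1 - alpha, df = df)`.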
Decision: Since cal|t| < tab t, at 0.05 level of significance & for 3 df, we accept the Null Hypothesis.
Conclusion: We conclude that there is no significant correlation between Sales and Expenses according to this sample data.
No, we cannot generalize that Sales and Expenses are significantly related.
The cor.test() function is used to test whether the correlation coefficient is significant or not. Using R we can test this hypothesis as below.
Sales <- c(43, 41, 36, 34, 50)     # creates `Sales` as an object
Expenses <- c(12, 24, 15, 21, 19)  # creates `Expenses`
cor.test(Sales, Expenses)
##
## Pearson's product-moment correlation
##
## data: Sales and Expenses
## t = -0.12728, df = 3, p-value = 0.9068
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8975205 0.8649034
## sample estimates:
## cor
## -0.0732849
Since the p-value (0.9068) > 0.05, we accept H0. Therefore, we conclude that the correlation coefficient is not significant.
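The decision can also be made programmatically. `cor.test()` returns a list of class `htest`, so the individual pieces of the printed output can be extracted by name. The sketch below assumes the 0.05 level used throughout this session; the `decision` variable is only illustrative.

```r
Sales    <- c(43, 41, 36, 34, 50)
Expenses <- c(12, 24, 15, 21, 19)

res <- cor.test(Sales, Expenses)

res$estimate    # sample correlation coefficient r
res$statistic   # calculated t
res$parameter   # degrees of freedom (n - 2)
res$p.value     # two-tailed p-value
res$conf.int    # 95 percent confidence interval for rho

# p-value approach: reject H0 only when the p-value is below alpha
alpha <- 0.05
decision <- if (res$p.value > alpha) "do not reject H0" else "reject H0"
decision
```

The p-value approach and the tabulated-t approach always lead to the same decision for the same \(\alpha\).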
¹ "Significant" here refers to whether the computed value of a statistic, for example the sample mean (\(\overline{x}\)), sample proportion (\(p\)) or sample correlation coefficient (\(r\)), can be generalized. If a sample statistic is significant, it can be generalized, or inferred, to the population from which the sample was drawn.