- Prove that \(cov(x,x) = var(x)\)
- Prove that \(cor(x,x) = 1\)
- In a finite sample, what is the difference between: the average difference between an individual and the mean VS the standard deviation (yes it’s not the same)?
- Compute the Pearson and Spearman correlation coefficient on the different variables from the iris dataset. Are they any differences? Why?
- With R, create two samples: one where the Pearson coefficient correlation is equal to 1 (but with 2 distinct variables) and one where it’s between [-0.1 ; 0.1]. Use a sample size >100.
- What is the condition on the type of the variables used in order to compute the Pearson (or Spearman) coefficient correlation (it’s implicitly said in the course)?
- What are the condition(s) that make possible that
cov(x,y)is equal tocor(x,y)?
- Why the Pearson correlation coefficient isn’t enough to say that there is a causal link between the variables?
- What happens when doing
cor(X), where \(X\) is a matrix of quantitative variables?
- With your own words, make a short description of what correlation is.
- Based on the sample below, explain with your own words why 2 variables created the same way are not correlated.
x = rnorm(n=1000)
y = rnorm(n=1000)
cor.test(x,y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -0.21092, df = 998, p-value = 0.833
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06864118 0.05533948
## sample estimates:
## cor
## -0.00667651
- Create a scatter plot with the sample size in the x-axis and the Pearson correlation coefficient in the y-axis for a range of 10 to 500 for the sample size. For this, you have to use the
generate()function below. Do the same thing with the p-value in the y-axis. What do you see? What are the implications?
generate = function(sample_size){
#generate sample
x = rnorm(n=sample_size, sd=15)
y = 1.001*x + rnorm(n=sample_size, sd=30)
#compute correlation for the sample
correlation = cor.test(x,y)
p = correlation$p.value
r = correlation$estimate
return(r) #change it to p for the second plot
}