Covariation and Correlation
library(tidyverse)
library(knitr)
library(Hmisc)
options(scipen=999)
Questions
Covariation and Correlation
Consider a correlation coefficient.
What specifically does the correlation coefficient quantify?
Answer:
The correlation coefficient quantifies the direction and strength of the linear relationship between two quantitative variables on a scale from −1 to 1. A nonlinear relationship may exist, but the correlation coefficient does not quantify it.
What is a type of relationship that the correlation does not adequately describe?
Answer:
The correlation coefficient does not adequately describe nonlinear relationships. For example, a curved, U-shaped pattern can produce a correlation near 0 even when the two variables are strongly related.
Correlation also does not mean causation. Just because two variables are correlated, it does not mean that the occurrence of one directly causes the other to occur.
How does a correlation value relate to a covariance value?
Answer:
The covariance describes the direction and strength of the linear relationship in terms of the units of measurement for x and y, so its magnitude depends on those units. Standardizing the covariance (dividing it by the product of the standard deviations of x and y) removes the units of measurement and provides a “unit free” measure called the correlation coefficient.
Five observations taken for two variables follow.
Note that the mean of \(x\) is 8 and the mean of \(y\) is 46. Further, note that the variance of \(x\) is 29.5 and the variance of \(y\) is 130. Perform the following calculations by hand (you may check your answers with a program).
| Name | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| \(x_i\) | 4 | 6 | 11 | 3 | 16 |
| \(y_i\) | 50 | 50 | 40 | 60 | 30 |
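For reference, the hand calculations requested here rest on the standard definitions of the sample covariance and the Pearson correlation coefficient. The symbols \(s_{xy}\), \(s_x\), \(s_y\), and \(n\) are notation introduced for this summary and do not appear in the assignment text:

\[
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1},
\qquad
r_{xy} = \frac{s_{xy}}{s_x \, s_y}
\]

Here \(n = 5\), and \(s_x\) and \(s_y\) are the sample standard deviations, i.e., the square roots of the variances stated above.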
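x <- c(4, 6, 11, 3, 16)
y <- c(50, 50, 40, 60, 30)
dataframe <- data.frame(x, y)
# varx and vary hold deviations from the mean (not variances); cross is their product
dataframe <- dataframe %>%
  mutate(varx = x - mean(x), vary = y - mean(y)) %>%
  mutate(cross = varx * vary)
# Append the column sums, means, and standard deviations of the five observations
dataframe <- rbind(dataframe, sapply(dataframe[1:5, ], FUN = 'sum'))
dataframe <- rbind(dataframe, sapply(dataframe[1:5, ], FUN = 'mean'))
dataframe <- rbind(dataframe, sapply(dataframe[1:5, ], FUN = 'sd'))
rownames(dataframe) <- c(1, 2, 3, 4, 5, 'sum', 'mean', 'sd')
# Covariance: sum of cross products divided by n - 1 (n = number of observations)
covar <- dataframe$cross[6] / (length(x) - 1)
# Correlation: covariance divided by the product of the two standard deviations
correlation <- covar / (dataframe$varx[8] * dataframe$vary[8])
kable(dataframe)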
| | x | y | varx | vary | cross |
|---|---|---|---|---|---|
| 1 | 4.00000 | 50.00000 | -4.00000 | 4.00000 | -16.0000 |
| 2 | 6.00000 | 50.00000 | -2.00000 | 4.00000 | -8.0000 |
| 3 | 11.00000 | 40.00000 | 3.00000 | -6.00000 | -18.0000 |
| 4 | 3.00000 | 60.00000 | -5.00000 | 14.00000 | -70.0000 |
| 5 | 16.00000 | 30.00000 | 8.00000 | -16.00000 | -128.0000 |
| sum | 40.00000 | 230.00000 | 0.00000 | 0.00000 | -240.0000 |
| mean | 8.00000 | 46.00000 | 0.00000 | 0.00000 | -48.0000 |
| sd | 5.43139 | 11.40175 | 5.43139 | 11.40175 | 51.0098 |
cor.test(x,y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -6.7792, df = 3, p-value = 0.00656
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9980245 -0.5965211
## sample estimates:
## cor
## -0.9688768
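As a quick cross-check on the table-based calculation, base R's `cov()` and `cor()` can be applied directly to the same five observations. This is a minimal sketch; the names `cov_xy` and `r_xy` are introduced here for illustration only.

```r
# Same five observations as above
x <- c(4, 6, 11, 3, 16)
y <- c(50, 50, 40, 60, 30)

# Sample covariance (denominator n - 1)
cov_xy <- cov(x, y)

# Pearson correlation: the covariance standardized by both standard deviations
r_xy <- cov_xy / (sd(x) * sd(y))

cov_xy
r_xy

# The standardized covariance should agree with cor()
all.equal(r_xy, cor(x, y))
```

Dividing by `sd(x) * sd(y)` is what removes the units of measurement, which is why the result stays between -1 and 1 regardless of how `x` and `y` are scaled.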
What is the sum of the cross products (i.e., the numerator of the covariance)?
Answer:
The sum of the cross products is -240.
What is the covariance?
-240/(5-1)
## [1] -60
What is the value of the correlation coefficient?
-60/(5.43139*11.40175)
## [1] -0.9688772
Describe, in words, the relationship between \(x\) and \(y\).
Answer:
The correlation coefficient, \(r_{xy}\) = -0.9688768, lies within the range of -1 to 1 and is very close to -1. The linear relationship is therefore negative and very strong: as \(x\) increases, \(y\) tends to decrease.
The covariance is -60, which describes the linear relationship in terms of the units of measurement for x and y. The sign of the covariance (positive or negative) gives the direction of the relationship between x and y; here it is negative.
PC World provided ratings for 15 (non-Mac) PCs. The performance score is a measure of how fast a PC can run a mix of common applications as compared to a (common) baseline machine. The dataset is provided below:
|Notebook | Score| Rating| varx| vary| cross|
|:-----------------------------|-----:|------:|-----------:|-----------:|-----------:|
|AMS Tech Roadster 15CTA380 | 115| 67| -68.4666667| -11.4666667| 785.0844444|
|Compaq Armada M700 | 191| 78| 7.5333333| -0.4666667| -3.5155556|
|Compaq Prosignia Notebook 150 | 153| 79| -30.4666667| 0.5333333| -16.2488889|
|Dell Inspiron 3700 C466GT | 194| 80| 10.5333333| 1.5333333| 16.1511111|
|Dell Inspiron 7500 R500VT | 236| 84| 52.5333333| 5.5333333| 290.6844444|
|Dell Latitude Cpi A366XT | 184| 76| 0.5333333| -2.4666667| -1.3155556|
|Enpower ENP-313 Pro | 184| 77| 0.5333333| -1.4666667| -0.7822222|
|Gateway Solo 9300LS | 216| 92| 32.5333333| 13.5333333| 440.2844444|
|HP Pavillion Notebook PC | 185| 83| 1.5333333| 4.5333333| 6.9511111|
|IBM ThinkPad I Series 1480 | 183| 78| -0.4666667| -0.4666667| 0.2177778|
|Micro Express NP7400 | 189| 77| 5.5333333| -1.4666667| -8.1155556|
|Micron TransPort NX PII-400 | 202| 78| 18.5333333| -0.4666667| -8.6488889|
|NEC Versa SX | 192| 78| 8.5333333| -0.4666667| -3.9822222|
|Sceptre Soundx 5200 | 141| 73| -42.4666667| -5.4666667| 232.1511111|
|Sony VAIO PCG-F340 | 187| 77| 3.5333333| -1.4666667| -5.1822222|
```
##
## Pearson's product-moment correlation
##
## data: Perform$Score and Perform$Rating
## t = 4.491, df = 13, p-value = 0.0006072
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4454777 0.9232530
## sample estimates:
## cor
## 0.7797909
```
What is the covariance between the Performance Score and Overall Rating?
sum(Perform$cross)/(15-1)
## [1] 123.1238
What is the correlation between the two measures?
123.124/(sd(Perform$varx)*sd(Perform$vary))
## [1] 0.7797921
Perform a two-sided significance test on the correlation coefficient. What is the \(p\)-value from the null hypothesis significance test?
Answer:
p-value = 0.0006072
What is the 95% confidence interval for the population correlation coefficient?
Answer:
95 percent confidence interval: 0.4454777 to 0.9232530
Describe, in words, the relationship between the variables.
Answer:
As the performance score goes up, the overall rating tends to go up. The relationship is positive, linear, and fairly strong (\(r\) = 0.78).
Create a scatterplot, labeled appropriately, and provide it below.
plot(Perform$Score, Perform$Rating, pch = 19,
     xlab = "Performance Score", ylab = "Overall Rating",
     main = "Overall Rating vs. Performance Score")
For a random sample of size 100 that yields a correlation coefficient of \(r=.50\), what is the 95% confidence interval?
Answer:
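One common way to obtain this interval is the Fisher \(z\) transformation, which is also what `cor.test()` uses for its confidence interval. The sketch below assumes that approach; the variable names are illustrative.

```r
r <- 0.50   # sample correlation
n <- 100    # sample size

z      <- atanh(r)           # Fisher z transformation of r
se     <- 1 / sqrt(n - 3)    # approximate standard error of z
z_crit <- qnorm(0.975)       # critical value for a 95% interval

# Build the interval on the z scale, then back-transform to the r scale
ci <- tanh(z + c(-1, 1) * z_crit * se)
round(ci, 4)
```

Because the back-transformation is nonlinear, the resulting interval is not symmetric around \(r\).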
95% CI: (0.3366, 0.6341)
Morningstar tracks the performance of a large number of companies and publishes an evaluation of each. Along with a variety of financial data, Morningstar includes a Fair Value estimate for the price that should be paid for a share of the company’s common stock. Data for 30 companies are available in the file named Fair Value. The data include the Fair Value estimate per share of common stock, the most recent price per share, and the earnings per share for the company (Morningstar Stocks 500). The Fair Value dataset is available here and in the following directory: http://bit.ly/MSBA_Data.
```r
morning <- read_csv('https://www.dropbox.com/s/nr20yk7mcmg8s49/FairValue.csv?dl=1')
```
```
## Parsed with column specification:
## cols(
## Company = col_character(),
## Fair_Value = col_double(),
## Share_Price = col_double(),
## Earnings_Per_Share = col_double()
## )
```
What are the correlation coefficients for the following pairs of variables:
Fair Value vs. Share Price
cor.test(morning$Fair_Value, morning$Share_Price)
##
## Pearson's product-moment correlation
##
## data: morning$Fair_Value and morning$Share_Price
## t = 16.893, df = 28, p-value = 0.0000000000000003226
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9052256 0.9782346
## sample estimates:
## cor
## 0.9542803
Fair Value vs. Earnings Per Share
cor.test(morning$Fair_Value, morning$Earnings_Per_Share)
##
## Pearson's product-moment correlation
##
## data: morning$Fair_Value and morning$Earnings_Per_Share
## t = 4.2293, df = 28, p-value = 0.0002266
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3407074 0.8038088
## sample estimates:
## cor
## 0.624341
Share Price vs. Earnings Per Share
cor.test(morning$Share_Price, morning$Earnings_Per_Share)
##
## Pearson's product-moment correlation
##
## data: morning$Share_Price and morning$Earnings_Per_Share
## t = 4.0029, df = 28, p-value = 0.0004169
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3105310 0.7915331
## sample estimates:
## cor
## 0.6033056
Which correlation above describes the strongest linear relationship?
Answer:
Share Price and Fair Value (\(r\) = 0.954).
Create a scatter matrix (pairs plot) and include it below.
pairs(morning[,2:4], pch = 19)
Describe what the phrase “correlation does not imply causation” means. How does this phrase take on special meaning in observational studies?
Answer:
The phrase means that an association between two variables does not, by itself, show that one is causing the other to occur. Inferring causation from correlation is closely related to the informal fallacy post hoc ergo propter hoc, which states: “since event Y followed event X, event Y must have been caused by event X.” In observational studies the phrase takes on special meaning because the researcher does not control the predictor, so we need to be especially aware of this fallacy. We must not jump to conclusions and say that because two events or factors are correlated, one is causing the other. We need to account for all other factors that may be causing both events. Random assignment to the levels of a predictor is needed in order to be in a position to eliminate lurking variables as potential confounds of causal relationships.
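A small simulation can make the role of a lurking variable concrete. In this sketch, a hypothetical variable `temperature` drives both `ice_cream` and `crime`; all names, coefficients, the sample size, and the seed are illustrative assumptions, not data from this assignment.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 1000

# Hypothetical lurking variable that drives both outcomes
temperature <- rnorm(n)
ice_cream   <- 2.0 * temperature + rnorm(n)
crime       <- 1.5 * temperature + rnorm(n)

# Raw correlation: clearly positive, although neither variable causes the other
raw_r <- cor(ice_cream, crime)

# Correlation after removing the linear effect of temperature from each variable
partial_r <- cor(resid(lm(ice_cream ~ temperature)),
                 resid(lm(crime ~ temperature)))

round(c(raw = raw_r, partial = partial_r), 3)
```

Under these assumptions the raw correlation is sizable, while the correlation between the residuals is close to zero, which is the pattern one would expect when a third variable accounts for the association.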
**Example:** Ice cream sales and crime are correlated, but one does not cause the other. The occurrence of both increases during the summer.
The results of 10 college football bowl games are given below. The predicted winning point margin was based on Las Vegas betting odds approximately one week before the bowl games were played. For example, Auburn was predicted to beat Northwestern in the Outback Bowl by 5 points. The actual winning point margin for Auburn was 3 points. A negative predicted winning point margin means that the team that won the bowl game was an underdog (i.e., expected to lose). For example, in the Rose Bowl, Ohio State was a 2-point underdog to Oregon and ended up winning by 9 points.
| Bowl Game | Score | Predicted Point Margin | Actual Point Margin |
|---|---|---|---|
| Outback | Auburn 38-Northwestern 35 | 5 | 3 |
| Gator | Florida State 33-West Virginia 21 | 1 | 12 |
| Capital One | Penn State 19-LSU 17 | 3 | 2 |
| Rose | Ohio State 26-Oregon 17 | -2 | 9 |
| Sugar | Florida 51-Cincinnati 24 | 14 | 27 |
| Cotton | Mississippi State 21-Oklahoma St. 7 | 3 | 14 |
| Alamo | Texas Tech 41-Michigan State 31 | 9 | 10 |
| Fiesta | Boise State 17-TCU 10 | -4 | 7 |
| Orange | Iowa 24-Georgia Tech 14 | -3 | 10 |
| Championship | Alabama 37-Texas 21 | 4 | 16 |
bowl <- c('Outback', 'Gator', 'Capital One', 'Rose', 'Sugar',
          'Cotton', 'Alamo', 'Fiesta', 'Orange', 'Championship')
score <- c('Auburn 38-Northwestern 35', 'Florida State 33-West Virginia 21',
           'Penn State 19-LSU 17', 'Ohio State 26-Oregon 17', 'Florida 51-Cincinnati 24',
           'Mississippi State 21-Oklahoma St. 7', 'Texas Tech 41-Michigan State 31',
           'Boise State 17-TCU 10', 'Iowa 24-Georgia Tech 14', 'Alabama 37-Texas 21')
predicted <- c(5, 1, 3, -2, 14, 3, 9, -4, -3, 4)
actual <- c(3, 12, 2, 9, 27, 14, 10, 7, 10, 16)
# data.frame() keeps predicted and actual numeric; cbind() would coerce them to character
betting <- data.frame(bowl, score, predicted, actual)
# varpredicted and varactual are deviations from the mean; cross is their product
betting <- betting %>%
  mutate(varpredicted = predicted - mean(predicted),
         varactual = actual - mean(actual),
         cross = varpredicted * varactual)
cor.test(betting$predicted, betting$actual)
##
## Pearson's product-moment correlation
##
## data: betting$predicted and betting$actual
## t = 1.9385, df = 8, p-value = 0.08855
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09981793 0.88127377
## sample estimates:
## cor
## 0.5653388
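The \(t\) statistic and \(p\)-value in this output can be recovered from \(r\) and the sample size alone, using the usual test of the null hypothesis that the population correlation is 0. A minimal sketch, with `t_stat` and `p_val` as illustrative names:

```r
r <- 0.5653388  # sample correlation reported by cor.test() above
n <- 10         # number of bowl games

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_val), 4)
```

With only 8 degrees of freedom, even a moderate correlation yields a \(t\) value below the two-sided .05 cutoff, which is why the confidence interval above is so wide.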
Develop a scatter diagram with predicted point margin on the \(x\)-axis. Be sure to use appropriate labels and titles, and that it clearly displays the data.
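plot(betting$predicted, betting$actual, pch = 19,
     xlab = "Predicted Point Margin", ylab = "Actual Point Margin",
     main = "Actual vs. Predicted Point Margin, 10 Bowl Games")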
What is the relationship between predicted and actual point margins, as indicated by the scatter diagram?
Answer:
There is a positive linear relationship.
Compute the sample covariance.
sum(betting$cross)/(10-1)
## [1] 22.33333
Compute the sample correlation coefficient.
22.33333/(sd(betting$varpredicted)*sd(betting$varactual))
## [1] 0.5653387
What is the \(t\)-value for the test of the null hypothesis for the value of \(r\) in part d?
Answer:
t = 1.9385
What is the \(p\)-value for the test of the null hypothesis for the \(r\) value in part d?
Answer:
p-value = 0.08855
Describe the relationship between the results of the bowl games and the Las Vegas predictions.
Answer:
There is a positive, moderate linear relationship (\(r\) = .565). With only 10 games, however, it is not statistically significant at the .05 level (\(p\) = 0.08855), and the 95% confidence interval includes 0.
Using the La Quinta Hotels data (available here and in the following directory: http://bit.ly/MSBA_Data), provide the following. Use SPSS.
```r
hotels <- read_csv('https://www.dropbox.com/s/4ufywh9z8rygpmb/LaQuintaInns.csv?dl=1')
```
```
## Parsed with column specification:
## cols(
## Margin = col_double(),
## Number = col_double(),
## Nearest = col_double(),
## OfficeSpace = col_double(),
## Enrollment = col_double(),
## Income = col_double(),
## Distance = col_double()
## )
```
The correlation matrix for any three variables of interest (with \(p\)-values).
rcorr(as.matrix(hotels))
##             Margin Number Nearest OfficeSpace Enrollment Income Distance
## Margin        1.00  -0.47    0.16        0.50       0.12   0.25    -0.09
## Number       -0.47   1.00    0.08       -0.09      -0.06   0.04     0.07
## Nearest       0.16   0.08    1.00        0.04       0.07  -0.05     0.09
## OfficeSpace   0.50  -0.09    0.04        1.00       0.00   0.15     0.03
## Enrollment    0.12  -0.06    0.07        0.00       1.00  -0.11     0.10
## Income        0.25   0.04   -0.05        0.15      -0.11   1.00    -0.05
## Distance     -0.09   0.07    0.09        0.03       0.10  -0.05     1.00
##
## n= 100
##
##
## P
##             Margin Number Nearest OfficeSpace Enrollment Income Distance
## Margin             0.0000 0.1112  0.0000      0.2227     0.0130 0.3612
## Number      0.0000        0.4192  0.3550      0.5276     0.7137 0.4704
## Nearest     0.1112 0.4192         0.6727      0.4812     0.6543 0.3664
## OfficeSpace 0.0000 0.3550 0.6727              0.9919     0.1296 0.7456
## Enrollment  0.2227 0.5276 0.4812  0.9919                 0.2645 0.3354
## Income      0.0130 0.7137 0.6543  0.1296      0.2645            0.6106
## Distance    0.3612 0.4704 0.3664  0.7456      0.3354     0.6106
A scatterplot matrix of the data in part a.
plot(hotels)