Covariance and Correlation

Joe Ripberger

So far, we have covered…

  • Two-sample t-test (difference of means)
    • DV: continuous variable
    • IV: dichotomous categorical variable
  • Two-proportion z-test (difference of proportions)
    • DV: dichotomous categorical variable
    • IV: dichotomous categorical variable

Today we add…

  • Covariance and correlation
    • DV: continuous variable
    • IV: continuous variable

Example Research

  • Research question: why are some countries democracies whereas others are not?
  • Theory: economic development causes democratization
    • Modernization theory (Lipset 1963; Przeworski et al. 2000)
  • Hypothesis: there is a positive relationship between economic development and democracy; more “developed” countries will be more democratic than less developed countries (and vice versa)
  • Data:
    • Democracy (dependent variable): 0-100 rating based on the Freedom House composite index
    • Economic development (independent variable): per capita GDP (in US $), based on UN statistics
  • Unit of Analysis: Country in 2000 (n = 188)

Descriptive Statistics

# A tibble: 2 × 8
  name       min   max   mean median     sd skewness kurtosis
  <chr>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 fh_index  14.3   100   64.6   67.8   28.3   -0.271     1.68
2 gdp       92   45001 6060.  1728   9297.     2.00      6.24

Descriptive Statistics

# A tibble: 3 × 8
  name       min     max    mean  median      sd skewness kurtosis
  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
1 fh_index 14.3    100     64.6    67.8    28.3    -0.271     1.68
2 gdp      92    45001   6060.   1728    9297.      2.00      6.24
3 log_gdp   4.52    10.7    7.55    7.45    1.62    0.167     2.00

Logarithmic Scale
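
The log_gdp row in the table above is consistent with a natural-log transformation of per capita GDP. A minimal sketch, using only the minimum, median, and maximum GDP values reported in that table (the vector names are made up for illustration):

```r
# Minimum, median, and maximum per capita GDP from the descriptive statistics table
gdp_summary <- c(min = 92, median = 1728, max = 45001)

# The natural log compresses the long right tail of GDP
log_gdp_summary <- log(gdp_summary)
round(log_gdp_summary, 2)
```

Because log() is monotone, the minimum and maximum map exactly onto the log scale; the median agrees with the table to rounding.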

Review: Two-Sample T-Test

  • We can test H_1 with a two-sample t-test

    • H_{0}:\mu_{MDC} - \mu_{LDC} = 0
    • H_{A}:\mu_{MDC} - \mu_{LDC} > 0
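# Split countries at the mean of log GDP: above the mean = "MDC", otherwise "LDC"
ds <- ds %>%
  mutate(log_gdp_category = ifelse(log_gdp > mean(log_gdp), "MDC", "LDC"))
# Welch two-sample t-test of democracy scores by category (two-sided by default)
t.test(ds$fh_index ~ ds$log_gdp_category)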

    Welch Two Sample t-test

data:  ds$fh_index by ds$log_gdp_category
t = -7.9998, df = 177.77, p-value = 1.545e-13
alternative hypothesis: true difference in means between group LDC and group MDC is not equal to 0
95 percent confidence interval:
 -35.76340 -21.61045
sample estimates:
mean in group LDC mean in group MDC 
         51.06364          79.75056 
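
The hypotheses above are one-sided, but t.test() runs a two-sided test by default. It also orders the groups alphabetically, so the estimated difference is LDC minus MDC, which is why t is negative. A sketch of the matching one-sided test, using made-up toy data (the data frame toy and its values are hypothetical, not the country data):

```r
# Hypothetical toy data: democracy scores for two development groups
toy <- data.frame(
  score = c(35, 48, 52, 60, 41, 55, 70, 82, 77, 90, 68, 85),
  group = rep(c("LDC", "MDC"), each = 6)
)

# Default: two-sided Welch test; the estimated difference is LDC - MDC
two_sided <- t.test(score ~ group, data = toy)

# H_A: mu_MDC - mu_LDC > 0 is the same as mu_LDC - mu_MDC < 0
one_sided <- t.test(score ~ group, data = toy, alternative = "less")

two_sided$p.value
one_sided$p.value  # half the two-sided p-value, because t is negative
```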

Covariance and Correlation

  • Covariance: measure of how two continuous variables vary together around their expected values (means)
    • Positive covariance: as x goes up, y tends to go up; as x goes down, y tends to go down
    • Negative covariance: as x goes up, y tends to go down; as x goes down, y tends to go up
    • No covariance: as x goes up, y shows no systematic tendency to go up or down
  • Pearson’s (product-moment) correlation coefficient: standardized measure of covariance
    • Ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship)
  • Scatter plot: the plot we use to explore the relationship (covariance and correlation) between two continuous variables
    • geom_point() for the points; geom_smooth() to overlay a trend line
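
A minimal scatter plot sketch, assuming the ggplot2 package is installed. The data frame toy and its values are for illustration only; method = "lm" is one possible choice of trend line, not necessarily the one used in the original figures.

```r
library(ggplot2)

# Small illustrative data set (values chosen for illustration)
toy <- data.frame(
  x = c(28, 22, 32, 34, 18, 19),
  y = c(20, 26, 36, 32, 19, 20)
)

# geom_point() draws the scatter plot; geom_smooth() overlays a fitted straight line
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```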

Scatter Plots are Important
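
The figure for this slide is not reproduced here. One standard illustration of the point, which may or may not be what the slide showed, is Anscombe's quartet: four x/y pairs with nearly identical correlations but very different shapes. A sketch using the anscombe data set that ships with base R:

```r
# anscombe (datasets package) holds four x/y pairs: x1..x4 and y1..y4
r_values <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r_values, 3)  # all four are about 0.816

# Plot the four pairs side by side to see how different they look
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```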

Covariance and Correlation

  • Are x and y correlated?
  • Is the correlation positive or negative?
  • How close to 1 is the absolute value of the correlation, |r|?

Calculating Covariance

  • The covariance between two variables j and k is calculated using the following formula:

    • q_{jk} = \frac{1}{n-1}\sum_{i = 1}^n (x_{ij} - \bar{x}_{j})(x_{ik} - \bar{x}_{k})
  • What is the covariance of x and y?

    • x = \{28, 22, 32, 34, 18, 19\}
    • y = \{20, 26, 36, 32, 19, 20\}
    • x-\bar{x} = \{2.5, -3.5, 6.5, 8.5, -7.5, -6.5\}
    • y-\bar{y} = \{-5.5, 0.5, 10.5, 6.5, -6.5, -5.5\}
    • (x-\bar{x})(y-\bar{y}) = \{-13.75, -1.75, 68.25, 55.25, 48.75, 35.75\}
    • \sum_{i = 1}^n (x_i-\bar{x})(y_i-\bar{y}) = 192.5
    • q_{xy} = \frac{1}{6-1} \times 192.5 = 38.5

Calculating Covariance

(x <- c(28, 22, 32, 34, 18, 19))
[1] 28 22 32 34 18 19
(y <- c(20, 26, 36, 32, 19, 20))
[1] 20 26 36 32 19 20
(xdev <- x - mean(x))
[1]  2.5 -3.5  6.5  8.5 -7.5 -6.5
(ydev <- y - mean(y))
[1] -5.5  0.5 10.5  6.5 -6.5 -5.5
(xdev_ydev <-  xdev * ydev)
[1] -13.75  -1.75  68.25  55.25  48.75  35.75
(1 / (length(x) - 1)) * sum(xdev_ydev)
[1] 38.5

Calculating Covariance

cov(x, y)
[1] 38.5

Calculating Pearson’s Correlation Coefficient

  • Standardizes the covariance of two variables by dividing it by the product of their standard deviations, which lets us compare correlations across scales, units, and measures:

    • r_{jk} = \frac{q_{jk}}{s_{j}s_{k}}
  • If q_{xy} = 38.5, s_x = 6.8, and s_y = 7.1, what is r_{xy}?

    • r_{xy} = \frac{38.5}{6.8 \times 7.1} \approx 0.80
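
A short sketch of why the standardization matters, using the same x and y as the covariance example. The rescaling of x by 100 is a hypothetical change of units, chosen only to show that covariance changes with units while the correlation does not.

```r
x <- c(28, 22, 32, 34, 18, 19)
y <- c(20, 26, 36, 32, 19, 20)

# Correlation from the definition: covariance over the product of standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Rescale x (a hypothetical change of units, e.g. dollars to cents)
x_rescaled <- x * 100

cov(x, y)            # 38.5
cov(x_rescaled, y)   # 3850: covariance depends on the units
cor(x, y)            # 0.7915162
cor(x_rescaled, y)   # unchanged by the rescaling
```

With unrounded standard deviations the result is about 0.79 rather than the 0.80 obtained from the rounded values 6.8 and 7.1.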

Calculating Pearson’s Correlation Coefficient

cor(x, y)
[1] 0.7915162

Inference with Correlation Coefficients

  • To draw inference (and test hypotheses) using correlation coefficients, we follow these steps:

    1. Calculate the point estimate
    2. Calculate the standard error of the point estimate: SE(r) = \sqrt{\frac{1-r^2}{n-2}}
    3. Calculate the confidence interval: 95\%CI = r \pm t_{\nu} \times SE(r), where \nu = n - 2
    4. Calculate the t-statistic: t = \frac{r-0}{SE(r)}
    5. Calculate the p-value
  • Is the correlation statistically different from zero?

    • H_{0}:\rho_{xy} = 0
    • H_{A}:\rho_{xy} > 0
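
The five steps can be sketched by hand for the small x and y example. Variable names such as se_r and t_crit are illustrative. Note that the interval from step 3 is an approximation that can extend beyond 1 in small samples; cor.test() instead reports an interval based on Fisher's z transformation, so the two intervals will differ.

```r
x <- c(28, 22, 32, 34, 18, 19)
y <- c(20, 26, 36, 32, 19, 20)
n <- length(x)
df <- n - 2

# 1. Point estimate
r <- cor(x, y)

# 2. Standard error of r
se_r <- sqrt((1 - r^2) / df)

# 3. Approximate 95% confidence interval: r +/- t * SE(r)
t_crit <- qt(0.975, df = df)
ci_approx <- r + c(-1, 1) * t_crit * se_r

# 4. t-statistic for H_0: rho = 0
t_stat <- (r - 0) / se_r

# 5. p-values: two-sided, and one-sided for H_A: rho > 0
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

c(r = r, se = se_r, t = t_stat, p_two_sided = p_two_sided, p_one_sided = p_one_sided)
```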

Inference with Correlation Coefficients

cor.test(x,  y)

    Pearson's product-moment correlation

data:  x and y
t = 2.5903, df = 4, p-value = 0.06067
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.05604893  0.97607934
sample estimates:
      cor 
0.7915162 
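
The output above is for the default two-sided alternative. To match the one-sided H_A stated earlier, cor.test() accepts alternative = "greater"; a brief sketch with the same x and y:

```r
x <- c(28, 22, 32, 34, 18, 19)
y <- c(20, 26, 36, 32, 19, 20)

two_sided <- cor.test(x, y)                           # H_A: rho != 0 (default)
one_sided <- cor.test(x, y, alternative = "greater")  # H_A: rho > 0

two_sided$p.value  # 0.06067, as in the output above
one_sided$p.value  # half of the two-sided value, because r is positive
```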

Development and Democracy

H_1: there is a positive relationship between economic development and democracy

  • We can test H_1 by looking at the correlation between the IV and DV
    • H_{0}:\rho_{Dev, Dem} = 0
    • H_{A}:\rho_{Dev, Dem} > 0

Development and Democracy

ds %>% 
  summarise(cov = cov(fh_index,  log_gdp), 
            cor = cor(fh_index,  log_gdp))
# A tibble: 1 × 2
    cov   cor
  <dbl> <dbl>
1  25.1 0.550
cor.test(ds$fh_index,  ds$log_gdp)

    Pearson's product-moment correlation

data:  ds$fh_index and ds$log_gdp
t = 8.9739, df = 186, p-value = 3.101e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4412788 0.6422632
sample estimates:
      cor 
0.5496762 

Development and Democracy

There is a positive and statistically significant correlation between development (log per capita GDP) and democracy (r = 0.55, p < 0.001): as development increases, democracy tends to increase.

  • By how much?

CAUTION!!!

  • Correlation does not equal causation!!
    • The relationship between development and democracy may be simultaneous (or codetermined)
    • The relationship between development and democracy may be spurious (caused by a “third” variable)
    • The relationship between development and democracy may be coincidental (due to chance)
  • To identify a causal relationship, we need:
    • Theory: explain why there is a compelling reason to believe that variation in the IV causes variation in the DV
    • Association: show that variation in the IV is correlated with variation in the DV
    • Specification: show that the correlation holds when controlling for potential confounding (“third”) variables
    • Time Order: show that the change in the IV happened before the change in the DV
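
A simulated sketch of a spurious relationship and of the specification check. All data here are simulated; z is a hypothetical third variable that drives both x and y, and x has no effect on y. Correlating the residuals after regressing each variable on z is one simple way to control for z, used here for illustration.

```r
set.seed(42)   # arbitrary seed so the simulation is reproducible
n <- 1000

# z is a "third" variable that drives both x and y; x has no effect on y
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)

# The raw correlation is strong even though x does not cause y
r_raw <- cor(x, y)

# Controlling for z: correlate what is left of x and y after removing z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = r_raw, controlling_for_z = r_partial)
```

In this setup the raw correlation is large, while the correlation after controlling for z is close to zero, which is the pattern a spurious relationship produces.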

Example: Correlation or Causation?