---
title: "Pearson similarity and Cosine similarity"
output: 
  html_notebook:
    toc: true
    toc_float: true
    toc_depth: 3
    number_sections: true
    theme: lumen
---

Here, we inspect the relationship between cosine similarity and Pearson similarity.

```{r}
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(tidyverse))
```


```{r}
cosine_sim <- function(x, y) {
  sum(x*y)/(sqrt(sum(x*x)) * sqrt(sum(y*y)))
}
```
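
A quick sanity check (with hypothetical throwaway values): the cosine similarity of a vector with itself is 1, and with its negation is -1.

```{r}
v <- c(1, 2, 3)
c(cosine_sim(v, v), cosine_sim(v, -v))  # expect 1 and -1
```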

Pearson correlation (interchangeably called Pearson similarity) of two vectors is the cosine similarity of the vectors after subtracting the means of each.

```{r}
pearson <- function(x, y) {
  cosine_sim(x - mean(x), y - mean(y))
}
```

Verify that our implementation is correct

```{r}
all(sapply(seq(100),
      function(i) {
        x <- rnorm(100)
        y <- rnorm(100)
        # compare with abs(): a signed difference could be negative and still
        # pass `< .Machine$double.eps`; use a small tolerance for rounding
        abs(pearson(x, y) - cor(x, y)) < 1e-12
      }))
```


$$
cosine(\boldsymbol{x},\boldsymbol{y}) 
\doteq  
\frac{\boldsymbol{x}\cdot\boldsymbol{y}}{\|\boldsymbol{x}\|\|\boldsymbol{y}\|}
$$

$$
\rho(\boldsymbol{x},\boldsymbol{y}) 
\doteq  
cosine(\boldsymbol{x}-\boldsymbol{\bar{x}}, \boldsymbol{y}-\boldsymbol{\bar{y}})
$$

Pearson similarity is invariant to scaling of the features (i.e. scale all features of a vector by the same positive value; a negative scale flips the sign). The first equality below uses $\overline{u\boldsymbol{x}} = u\boldsymbol{\bar{x}}$.

$$
\rho(u\boldsymbol{x},v\boldsymbol{y})
=
cosine(u(\boldsymbol{x}-\boldsymbol{\bar{x}}), v(\boldsymbol{y}-\boldsymbol{\bar{y}})) =
cosine(\boldsymbol{x}-\boldsymbol{\bar{x}}, \boldsymbol{y}-\boldsymbol{\bar{y}}) =
\rho(\boldsymbol{x},\boldsymbol{y})
$$

Pearson similarity is invariant to a translation of the features (i.e. translate all features of a vector by the same value), because the shift cancels when the mean is subtracted:

$$
(\boldsymbol{x} - a\boldsymbol{1}) - \overline{(\boldsymbol{x}-a\boldsymbol{1})} =
(\boldsymbol{x} - a\boldsymbol{1}) - (\boldsymbol{\bar{x}} - a\boldsymbol{1}) =
\boldsymbol{x} - \boldsymbol{\bar{x}}
$$

$$
\rho(\boldsymbol{x}-a\boldsymbol{1},\boldsymbol{y}-b\boldsymbol{1}) =
cosine(\boldsymbol{x}-\boldsymbol{\bar{x}}, \boldsymbol{y}-\boldsymbol{\bar{y}}) =
\rho(\boldsymbol{x},\boldsymbol{y})
$$
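
A direct numeric check of this shift invariance (a minimal sketch; the vectors `x0`, `y0` and the shifts `a`, `b` are hypothetical throwaway values):

```{r}
x0 <- rnorm(10)
y0 <- rnorm(10)
a <- 5
b <- -3
# Pearson is unchanged when each vector is shifted by its own scalar;
# cosine generally is not
c(pearson(x0, y0), pearson(x0 + a, y0 + b))
c(cosine_sim(x0, y0), cosine_sim(x0 + a, y0 + b))
```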

Test it out

Two random vectors

```{r}
x <- rnorm(10)
y <- rnorm(10)
```

Scale each separately 

```{r}
xa <- abs(rnorm(1)) * x
ya <- abs(rnorm(1)) * y
```

Pearson similarity is invariant to scale

```{r}
pearson(x,y)
pearson(xa,ya)
```

Cosine similarity is invariant to scale

```{r}
cosine_sim(x,y)
cosine_sim(xa,ya)
```

Now create new vectors by appending extra entries (all zeros)

```{r}
xz <- c(x, rep(0, 2))
yz <- c(y, rep(0, 2))
```


This affects Pearson but not cosine (because Pearson subtracts the mean)

```{r}
pearson(xz,yz)
cosine_sim(xz,yz)
```

Now test shift and scale in higher dimensions (we haven't tested shift yet)

```{r}
set.seed(43)
n <- 2   # number of vectors
d <- 10  # number of features per vector
m <- matrix(rnorm(d * n), n, d)

b <- matrix(rep(3, n), n, d, byrow = FALSE)  # uniform shift of 3 on every feature
m_shift <- m + b

s <- matrix(rep(2, n), n, d, byrow = FALSE)  # uniform scale of 2 on every feature
m_scale <- m * s

m_scale_shift <- m * s + b
```


```{r}
m
```


```{r}
b
```

```{r}
s
```


```{r}
m_shift
```


```{r}
m_scale
```

```{r}
m_scale_shift
```


```{r}
df <- bind_rows(
  as.data.frame(m) %>% mutate(type = "orig"),
  as.data.frame(m_shift) %>% mutate(type = "shift"),
  as.data.frame(m_scale) %>% mutate(type = "scale"),
  as.data.frame(m_scale_shift) %>% mutate(type = "scale_shift")
)

df %<>%
  mutate(type =
           factor(
             type,
             levels = c("orig", "scale", "shift", "scale_shift"),
             ordered = TRUE
           ))
```


```{r}
df %<>%
  inner_join(
    df %>%
      group_by(type) %>%
      nest() %>%
      mutate(cosine_val = map(data, function(data) {
        m <- as.matrix(data)
        round(cosine_sim(m[1,], m[2,]), 2)
      })) %>%
      mutate(cor_val = map(data, function(data) {
        m <- as.matrix(data)
        round(cor(m[1,], m[2,]), 2)
      })) %>%
      unnest(c(cosine_val, cor_val)) %>%
      select(-data),
    by = "type"
  )

df
```


```{r}
ggplot() +
  geom_segment(data = df,
               aes(
                 x = 0,
                 y = 0,
                 xend = V1,
                 yend = V2
               ),
               arrow = arrow()) +
  coord_equal() +
  ggforce::geom_circle(data = data.frame(x0 = 0, y0 = 0, r = 1),
                       aes(x0 = x0, y0 = y0, r = r), color = "red") +
  facet_wrap( ~ type ~ cor_val ~ cosine_val, labeller = "label_both", nrow = 1) +
  ggtitle("Only first two dimensions shown")
```

Scatter plot of each simulated dataset, along with Pearson similarity and cosine similarity for each.

Note: This is NOT the feature space. These are `d` dimensions from 2 feature vectors displayed in a scatter plot.

- Cosine similarity is scale invariant but not shift invariant
- Pearson similarity is scale invariant and shift invariant
- Note that this scale and shift invariance refers to scaling or shifting all features by one constant, not each dimension separately. Pearson is not invariant to shifting each dimension by a different value (see the sketch below)
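
A minimal sketch of that last bullet (the per-dimension shift values are hypothetical, drawn at random here): adding a different constant to each dimension of both vectors changes Pearson similarity.

```{r}
# one shift value per dimension (illustrative), applied to both vectors
per_dim_shift <- rnorm(d)
x0 <- m[1, ]
y0 <- m[2, ]
c(pearson(x0, y0), pearson(x0 + per_dim_shift, y0 + per_dim_shift))
```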

```{r}
dfm <-
  df %>% 
  group_by(type, cor_val, cosine_val) %>% 
  mutate(repid = paste0("r", seq(n()))) %>% 
  pivot_longer(matches("^V[0-9]+")) %>% 
  pivot_wider(names_from = repid, values_from = value)

ggplot(dfm, aes(r1, r2)) + 
  geom_point() + 
  facet_wrap(~ type ~ cor_val ~ cosine_val, labeller = "label_both", nrow = 1) +
  coord_equal()

```



Unlike Euclidean distance, Pearson similarity

1. is shift invariant along the features (i.e. add a constant scalar to all features, where each vector may get its own scalar)
2. is not shift invariant in the feature space (i.e. add a constant vector, with a different value per dimension, to the two vectors)
3. is scale invariant along the features (i.e. multiply all features by a constant positive scalar)

Cosine similarity is the same as Pearson in this sense, except for #1 (it is not shift invariant along the features). The sketch below contrasts all three.
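
A minimal numeric sketch of these points (the `euclidean_dist` helper and all values here are hypothetical, defined only for this comparison):

```{r}
# hypothetical helper, used only for this comparison
euclidean_dist <- function(x, y) sqrt(sum((x - y)^2))

x0 <- rnorm(10)
y0 <- rnorm(10)
rbind(
  original = c(euclidean = euclidean_dist(x0, y0),
               pearson   = pearson(x0, y0),
               cosine    = cosine_sim(x0, y0)),
  # shift each vector by its own scalar (#1): only Pearson is unchanged
  shifted  = c(euclidean = euclidean_dist(x0 + 3, y0 - 2),
               pearson   = pearson(x0 + 3, y0 - 2),
               cosine    = cosine_sim(x0 + 3, y0 - 2)),
  # scale each vector by its own positive scalar (#3): Pearson and cosine unchanged
  scaled   = c(euclidean = euclidean_dist(2 * x0, 5 * y0),
               pearson   = pearson(2 * x0, 5 * y0),
               cosine    = cosine_sim(2 * x0, 5 * y0))
)
```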


Now let's systematically add a constant scalar (a translation) to a vector, report the cosine similarity with the untranslated vector, and visualize it.

```{r}
k <- 9
# k identical copies of one random d-vector, one per row
m <- matrix(rep(rnorm(d), k), k, d, byrow = TRUE)
m
```


```{r}
# row i of bshift is the constant 4 * (i - 1) / (k - 1), i.e. shifts 0, 0.5, ..., 4
bshift <- matrix(rep(seq(k)-1, d)/(k-1)*4, k, d, byrow = FALSE)
m_shift <- m + bshift
df <- as.data.frame(m_shift)
df$shift <- bshift[,1]

df0 <- as.data.frame(m)
df0$shift <- bshift[,1]

df <- bind_rows(df, df0)
```


Consider $k$ pairs of vectors.


```{r}
df %>% select(V1, V2, shift) %>% arrange(shift)
```


```{r}
df %<>%
  inner_join(
    df %>%
      group_by(shift) %>%
      nest() %>%
      mutate(cosine_val = map(data, function(data) {
        m <- as.matrix(data)
        round(cosine_sim(m[1,], m[2,]), 2)
      })) %>%
      mutate(cor_val = map(data, function(data) {
        m <- as.matrix(data)
        round(cor(m[1,], m[2,]), 2)
      })) %>%
      unnest(c(cosine_val, cor_val)) %>%
      select(-data),
    by = "shift"
  )

df %>% select(V1, V2, shift, cosine_val, cor_val)
```


```{r}
ggplot() +
  geom_segment(data = df,
               aes(
                 x = 0,
                 y = 0,
                 xend = V1,
                 yend = V2,
                 color = cosine_val
               ),
               arrow = arrow(type = "closed", angle = 20, length = unit(0.1, "inches"))) +
  coord_equal() +
  ggforce::geom_circle(data = data.frame(x0 = 0, y0 = 0, r = 1),
                       aes(x0 = x0, y0 = y0, r = r), color = "red") +
  ggtitle("Only first two dimensions shown", subtitle = "Pearson sim. = 1 for all cases") + facet_wrap(~shift)
```

The invariance to this uniform shift is rarely useful in practice, because it requires shifting all dimensions by the same value. So while this invariance exists for Pearson, it seldom seems advantageous over cosine.

- Cosine is better than Pearson when it is useful to ignore missing features encoded as zeros (cosine remains the same, but Pearson changes)
- Pearson is better than cosine when it is useful to ignore the mean across features (because Pearson subtracts the mean)

We've seen this above, but here it is again in a different form

```{r}
x <- runif(100)*100
y <- runif(100)*100
xz <- c(x, rep(0, 1000))
yz <- c(y, rep(0, 1000))
```


```{r}
mean(x)
mean(y)
```

```{r}
mean(xz)
mean(yz)
```

```{r}
cosine_sim(x, y)
pearson(x, y)
```


```{r}
cosine_sim(xz, yz)
pearson(xz, yz)
```

Useful pages
- https://www.quora.com/In-what-scenario-is-using-Pearson-correlation-better-than-Cosine-similarity
- https://grouplens.org/blog/similarity-functions-for-user-user-collaborative-filtering/
