A student asked me whether scaling matters when running correlations. Let’s investigate.

library(tidyverse)
library(knitr)
set.seed(1)

First, let’s get two random samples of integers as our data:

x = sample.int(20,20)
y = sample.int(50,20)
data.frame(x,y)%>%
  kable()

x y
6 47
8 11
11 32
16 6
4 13
14 18
15 1
9 17
19 37
1 14
3 20
2 24
20 19
10 7
5 30
7 39
12 28
17 4
18 35
13 46

data.frame(x,y)%>%
  summary()
       x               y        
 Min.   : 1.00   Min.   : 1.00  
 1st Qu.: 5.75   1st Qu.:12.50  
 Median :10.50   Median :19.50  
 Mean   :10.50   Mean   :22.40  
 3rd Qu.:15.25   3rd Qu.:32.75  
 Max.   :20.00   Max.   :47.00  

The values of \(x\) and \(y\) were drawn uniformly at random, meaning that every integer in the range is equally likely to be picked (sample.int() draws without replacement by default, so no value repeats within a variable). Let’s plot these data points:

df = data.frame(x,y)
ggplot(df, aes(x,y))+
  geom_point()+
  geom_smooth(method = "lm")

As you can see, the data points are scattered more or less at random and the fitted line is nearly flat, so there is very little linear relationship between the two variables.

What happens if we scale the data?

df$x_scaled = scale(df$x)
df$y_scaled = scale(df$y)
select(df, one_of(c("x_scaled", "y_scaled")))%>%
  summary()
     x_scaled.V1          y_scaled.V1     
 Min.   :-1.6057931   Min.   :-1.5438196  
 1st Qu.:-0.8028965   1st Qu.:-0.7141969  
 Median : 0.0000000   Median :-0.2092092  
 Mean   : 0.0000000   Mean   : 0.0000000  
 3rd Qu.: 0.8028965   3rd Qu.: 0.7466604  
 Max.   : 1.6057931   Max.   : 1.7746711  
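
A side note on the `.V1` suffix in the column names above: it appears because scale() returns a one-column matrix rather than a plain vector. The sketch below uses a small made-up vector (the values are hypothetical, chosen only for illustration) to check that scaling is just "subtract the mean, divide by the sample standard deviation", and that the result has mean 0 and standard deviation 1.

```r
# Hypothetical example vector; any numeric vector would do
x <- c(6, 8, 11, 16, 4, 14, 15, 9, 19, 1)

x_scaled <- scale(x)           # one-column matrix with centring/scaling attributes
x_vec <- as.numeric(x_scaled)  # drop the matrix structure to get a plain vector

is.matrix(x_scaled)  # TRUE, which is where the ".V1" in summary() comes from
mean(x_vec)          # zero, up to floating-point error
sd(x_vec)            # 1

# scale() with default arguments matches the manual calculation
manual <- (x - mean(x)) / sd(x)
all.equal(x_vec, manual)
```

Wrapping the call as as.numeric(scale(...)) before storing it in a data frame would avoid the matrix columns, at the cost of losing the stored centre and scale attributes.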

The data now has a mean of zero (and a standard deviation of 1; not shown). If we plot this data:

ggplot(df, aes(x_scaled,y_scaled))+
  geom_point()+
  geom_smooth(method = "lm")

We see no difference other than that the axes are now centred at 0 and measured in standard deviations. We can even compute the Pearson correlation coefficient to show that the relationship between the two variables is the same in the unscaled and scaled data:

# cor() computes the Pearson correlation coefficient by default
# unscaled
cor(df$x, df$y)
[1] -0.05391063
#scaled
cor(df$x_scaled, df$y_scaled)
            [,1]
[1,] -0.05391063

Same value.
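
The same check works for any linear rescaling, not just scale(). Here is a minimal sketch on fresh random data (the seed, sample size, and multipliers are arbitrary choices for illustration): shifting and multiplying by positive constants leaves the correlation untouched, while a negative multiplier only flips its sign.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
u <- runif(30)
v <- runif(30)

r <- cor(u, v)

# Shift and stretch both variables by positive constants: same correlation
same <- all.equal(cor(3 * u + 100, 0.5 * v - 7), r)

# Multiply one variable by a negative constant: the sign flips, the size does not
flipped <- all.equal(cor(-2 * u, v), -r)

same     # TRUE
flipped  # TRUE
```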


But this has all been with a uniform sample; in the real world, much of our data looks more like a normal distribution. Let’s take a sample from the normal distribution.

a = rnorm(20, 50, 10) %>% round()  # mean 50, sd 10
b = rnorm(20, 30, 8) %>% round()   # mean 30, sd 8
data.frame(a,b)%>%
  summary()
       a               b        
 Min.   :30.00   Min.   :21.00  
 1st Qu.:46.00   1st Qu.:26.50  
 Median :49.00   Median :31.00  
 Mean   :49.85   Mean   :31.10  
 3rd Qu.:56.50   3rd Qu.:35.25  
 Max.   :64.00   Max.   :46.00  
data.frame(a,b)%>%
  kable()
a b
59 29
58 28
51 36
30 34
56 24
49 24
48 33
35 36
45 29
54 37
64 33
49 25
54 33
49 21
36 41
46 46
46 27
49 22
61 35
58 29
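
With only 20 draws, the sample statistics will not match the parameters passed to rnorm() exactly. As a quick check, the sketch below retypes the values from the table above by hand (so it does not depend on the random number generator) and computes the sample means and standard deviations.

```r
# Values copied from the table above
a <- c(59, 58, 51, 30, 56, 49, 48, 35, 45, 54,
       64, 49, 54, 49, 36, 46, 46, 49, 61, 58)
b <- c(29, 28, 36, 34, 24, 24, 33, 36, 29, 37,
       33, 25, 33, 21, 41, 46, 27, 22, 35, 29)

mean(a)  # 49.85, matching the summary() output
mean(b)  # 31.1

sd(a)    # about 8.83, below the 10 passed to rnorm()
sd(b)    # about 6.49, below the 8 passed to rnorm()
```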

Now \(a\) and \(b\) have different means. They were drawn with standard deviations of 10 and 8 respectively (the sample standard deviations, not shown, come out at about 8.8 and 6.5). Let’s plot these data:

df2 = data.frame(a,b)
ggplot(df2, aes(a,b))+
  geom_point()+
  geom_smooth(method = "lm")

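You can see a weak negative trend here, although with only 20 points drawn independently of each other it is most likely just sampling noise. Now let’s scale these variables to see if that changes anything: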

df2$a_scaled = scale(df2$a)
df2$b_scaled = scale(df2$b)
select(df2, one_of(c("a_scaled", "b_scaled")))%>%
  summary()
     a_scaled.V1          b_scaled.V1     
 Min.   :-2.2486702   Min.   :-1.5567091  
 1st Qu.:-0.4361401   1st Qu.:-0.7089962  
 Median :-0.0962907   Median :-0.0154130  
 Mean   : 0.0000000   Mean   : 0.0000000  
 3rd Qu.: 0.7533329   3rd Qu.: 0.6396379  
 Max.   : 1.6029564   Max.   : 2.2965313  

Once again, the mean has been brought to 0 and the standard deviation is now 1 for both variables.

Now we plot:

ggplot(df2, aes(a_scaled,b_scaled))+
  geom_point()+
  geom_smooth(method = "lm")

Same story; the data looks exactly the same except for the fact that the axes are centred at 0.

The correlation coefficient confirms this:

#unscaled
cor(df2$a, df2$b)
[1] -0.2496821
#scaled
cor(df2$a_scaled, df2$b_scaled)
           [,1]
[1,] -0.2496821
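
There is also a tidy link between the fitted line and the correlation coefficient. In the raw units the slope of the regression line is r times sd(b) / sd(a), and once both variables are scaled the slope is simply r. The sketch below checks this on the values from the table (retyped by hand; the names `fit_raw`, `fit_scaled`, `a_s` and `b_s` are my own, not from the code above).

```r
# Values copied from the table of a and b above
a <- c(59, 58, 51, 30, 56, 49, 48, 35, 45, 54,
       64, 49, 54, 49, 36, 46, 46, 49, 61, 58)
b <- c(29, 28, 36, 34, 24, 24, 33, 36, 29, 37,
       33, 25, 33, 21, 41, 46, 27, 22, 35, 29)

r <- cor(a, b)

# Slope in the original units
fit_raw <- lm(b ~ a)
slope_raw <- coef(fit_raw)[["a"]]

# Slope after scaling both variables (as.numeric() drops the matrix structure)
a_s <- as.numeric(scale(a))
b_s <- as.numeric(scale(b))
fit_scaled <- lm(b_s ~ a_s)
slope_scaled <- coef(fit_scaled)[["a_s"]]

all.equal(slope_raw, r * sd(b) / sd(a))  # TRUE
all.equal(slope_scaled, r)               # TRUE
```

So the line in the scaled plot is not just similar to the correlation; its slope is the correlation.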

Remember that scaling is what’s called a transformation. Some transformations do wild things to data, but scaling is a linear one: it does not change the fundamental qualities of the data (like the shape of the spread or the relative positions of the points), only where the data sits in space and the units it is measured in. The Pearson correlation coefficient does not depend on location or units, which is why it comes out the same before and after. Scaling is done to standardise data, mostly so that when we read the data (and when we apply statistical methods to it), the variables are more comparable.
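
For anyone who wants the algebra behind this, here is a short sketch using the usual sample definitions (the constants \(c_1, c_2, d_1, d_2\) are notation introduced just for this derivation). Take any linear transformations \(x' = c_1 x + d_1\) and \(y' = c_2 y + d_2\) with \(c_1, c_2 \neq 0\). Then:

\[
\operatorname{cov}(x', y') = c_1 c_2 \operatorname{cov}(x, y), \qquad s_{x'} = |c_1| \, s_x, \qquad s_{y'} = |c_2| \, s_y
\]

\[
r_{x'y'} = \frac{\operatorname{cov}(x', y')}{s_{x'} \, s_{y'}} = \frac{c_1 c_2}{|c_1| \, |c_2|} \cdot \frac{\operatorname{cov}(x, y)}{s_x \, s_y} = \operatorname{sign}(c_1 c_2) \, r_{xy}
\]

Scaling is the special case \(c_1 = 1/s_x\), \(d_1 = -\bar{x}/s_x\) (and likewise for \(y\)). Both multipliers are positive, so \(r_{x'y'} = r_{xy}\).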
