A student asked me whether scaling matters when running correlations. Let’s investigate.
library(tidyverse)
library(knitr)
set.seed(1)
First, let’s get two random samples of integers as our data:
x = sample.int(20,20)
y = sample.int(50,20)
data.frame(x,y)%>%
kable()
x | y |
---|---|
6 | 47 |
8 | 11 |
11 | 32 |
16 | 6 |
4 | 13 |
14 | 18 |
15 | 1 |
9 | 17 |
19 | 37 |
1 | 14 |
3 | 20 |
2 | 24 |
20 | 19 |
10 | 7 |
5 | 30 |
7 | 39 |
12 | 28 |
17 | 4 |
18 | 35 |
13 | 46 |
data.frame(x,y)%>%
summary()
x y
Min. : 1.00 Min. : 1.00
1st Qu.: 5.75 1st Qu.:12.50
Median :10.50 Median :19.50
Mean :10.50 Mean :22.40
3rd Qu.:15.25 3rd Qu.:32.75
Max. :20.00 Max. :47.00
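A quick aside on how these values were drawn: by default, sample.int() samples without replacement, so no value can appear twice. The sketch below re-enters the values from the table by hand (so it does not depend on the seed or the R version) and checks this; the variable names `x_is_permutation`, `y_has_no_repeats` and `x_with_repeats` are just illustrative.

```r
# Values copied from the table above
x <- c(6, 8, 11, 16, 4, 14, 15, 9, 19, 1, 3, 2, 20, 10, 5, 7, 12, 17, 18, 13)
y <- c(47, 11, 32, 6, 13, 18, 1, 17, 37, 14, 20, 24, 19, 7, 30, 39, 28, 4, 35, 46)

# x contains every integer from 1 to 20 exactly once
x_is_permutation <- all(sort(x) == 1:20)

# y has no repeated values (anyDuplicated() returns 0 when there are none)
y_has_no_repeats <- anyDuplicated(y) == 0

x_is_permutation
y_has_no_repeats

# To allow repeats, sampling with replacement has to be requested explicitly
x_with_repeats <- sample.int(20, 20, replace = TRUE)
```

None of this affects the scaling question; it only explains why \(x\) is a shuffled 1 to 20.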
These values for \(x\) and \(y\) were drawn uniformly at random (without replacement, so there are no repeats), meaning that every integer in the range was equally likely to be picked. Let’s plot these data points:
df = data.frame(x,y)
ggplot(df, aes(x,y))+
geom_point()+
geom_smooth(method = "lm")
As you can see, the data points are pretty much randomly scattered, and the fitted line is close to flat, so there is little sign of a linear relationship between the two variables.
What happens if we scale the data?
df$x_scaled = scale(df$x)
df$y_scaled = scale(df$y)
select(df, one_of(c("x_scaled", "y_scaled")))%>%
summary()
x_scaled.V1 y_scaled.V1
Min. :-1.6057931 Min. :-1.5438196
1st Qu.:-0.8028965 1st Qu.:-0.7141969
Median : 0.0000000 Median :-0.2092092
Mean : 0.0000000 Mean : 0.0000000
3rd Qu.: 0.8028965 3rd Qu.: 0.7466604
Max. : 1.6057931 Max. : 1.7746711
The data now has a mean of zero (and a standard deviation of 1; not shown). If we plot this data:
ggplot(df, aes(x_scaled,y_scaled))+
geom_point()+
geom_smooth(method = "lm")
We see no difference in the pattern of the points; only the axes have changed, and they are now centred at 0 and measured in standard deviations. We can also compute the Pearson correlation coefficient to show that the relationship between the two variables is the same for the unscaled and the scaled data:
#cor() computes the Pearson correlation coefficient by default
#unscaled
cor(df$x,df$y)
[1] -0.05391063
#scaled
cor(df$x_scaled, df$y_scaled)
[,1]
[1,] -0.05391063
Same value.
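To see why the value cannot change, it helps to do the scaling by hand. The sketch below assumes scale() is called with its default arguments (centre on the mean, divide by the sample standard deviation) and uses the values from the first table; `x_manual` is an illustrative name. It also shows why the scaled result printed as a 1 x 1 matrix: scale() returns a one-column matrix, not a plain vector.

```r
# Values copied from the first table
x <- c(6, 8, 11, 16, 4, 14, 15, 9, 19, 1, 3, 2, 20, 10, 5, 7, 12, 17, 18, 13)
y <- c(47, 11, 32, 6, 13, 18, 1, 17, 37, 14, 20, 24, 19, 7, 30, 39, 28, 4, 35, 46)

# With default arguments, scale() subtracts the mean and divides by the sample sd
x_manual <- (x - mean(x)) / sd(x)
x_scaled <- scale(x)

# scale() returns a one-column matrix, which is why cor() printed [1,] and [,1]
is.matrix(x_scaled)
all.equal(as.numeric(x_scaled), x_manual)

# Converting back to plain vectors gives a single number, equal to cor(x, y)
r_unscaled <- cor(x, y)
r_scaled <- cor(as.numeric(scale(x)), as.numeric(scale(y)))
all.equal(r_unscaled, r_scaled)
```

If the matrix columns are a nuisance in a data frame, wrapping scale() in as.numeric() when creating the column is one way to avoid them.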
But this has all been with a uniform sample; in the real world, much of our data is closer to normally distributed. Let’s take a sample of data from the normal distribution.
a = rnorm(20,50, 10)%>%round()
b = rnorm(20, 30, 8)%>%round()
data.frame(a,b)%>%
summary()
a b
Min. :30.00 Min. :21.00
1st Qu.:46.00 1st Qu.:26.50
Median :49.00 Median :31.00
Mean :49.85 Mean :31.10
3rd Qu.:56.50 3rd Qu.:35.25
Max. :64.00 Max. :46.00
data.frame(a,b)%>%
kable()
a | b |
---|---|
59 | 29 |
58 | 28 |
51 | 36 |
30 | 34 |
56 | 24 |
49 | 24 |
48 | 33 |
35 | 36 |
45 | 29 |
54 | 37 |
64 | 33 |
49 | 25 |
54 | 33 |
49 | 21 |
36 | 41 |
46 | 46 |
46 | 27 |
49 | 22 |
61 | 35 |
58 | 29 |
Now each collection of data \(a\) and \(b\) has a different mean. They were also drawn with different standard deviations, 10 and 8 respectively (the sample values are not shown and will differ somewhat from these). Let’s plot these data:
df2 = data.frame(a,b)
ggplot(df2, aes(a,b))+
geom_point()+
geom_smooth(method = "lm")
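The fitted line suggests a weak negative relationship here, although \(a\) and \(b\) were generated independently, so in a sample of 20 this is down to chance. Now let’s scale these variables to see if that changes anything: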
df2$a_scaled = scale(df2$a)
df2$b_scaled = scale(df2$b)
select(df2, one_of(c("a_scaled", "b_scaled")))%>%
summary()
a_scaled.V1 b_scaled.V1
Min. :-2.2486702 Min. :-1.5567091
1st Qu.:-0.4361401 1st Qu.:-0.7089962
Median :-0.0962907 Median :-0.0154130
Mean : 0.0000000 Mean : 0.0000000
3rd Qu.: 0.7533329 3rd Qu.: 0.6396379
Max. : 1.6029564 Max. : 2.2965313
Once again, the mean has been brought to 0 and the standard deviation is now 1 for both variables.
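The standard deviations are not shown in the summary, so here is a small sketch that checks them directly. It re-enters \(a\) and \(b\) from the table above by hand and assumes scale() with default arguments; the attribute names "scaled:center" and "scaled:scale" are where scale() records the values it used.

```r
# Values copied from the table above
a <- c(59, 58, 51, 30, 56, 49, 48, 35, 45, 54, 64, 49, 54, 49, 36, 46, 46, 49, 61, 58)
b <- c(29, 28, 36, 34, 24, 24, 33, 36, 29, 37, 33, 25, 33, 21, 41, 46, 27, 22, 35, 29)

a_scaled <- scale(a)
b_scaled <- scale(b)

# Sample standard deviations before scaling (not exactly the 10 and 8 used to generate the data)
sd(a)
sd(b)

# After scaling: mean is 0 (up to floating point error) and sd is 1
mean(a_scaled)
sd(as.numeric(a_scaled))
mean(b_scaled)
sd(as.numeric(b_scaled))

# scale() keeps the centre and scale it used as attributes
attr(a_scaled, "scaled:center")  # the mean of a
attr(a_scaled, "scaled:scale")   # the sample sd of a
```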
Now we plot:
ggplot(df2, aes(a_scaled,b_scaled))+
geom_point()+
geom_smooth(method = "lm")
Same story; the data look exactly the same except that the axes are now centred at 0.
The correlation coefficient confirms this:
#unscaled
cor(df2$a, df2$b)
[1] -0.2496821
#scaled
cor(df2$a_scaled, df2$b_scaled)
[,1]
[1,] -0.2496821
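Standardising is only one member of a wider family: multiplying a variable by a positive constant and adding another constant leaves the Pearson correlation untouched as well. The sketch below uses arbitrary constants chosen purely for illustration, and also shows two cases that behave differently: a negative multiplier flips the sign, and a nonlinear transformation such as log() is not guaranteed to preserve the value.

```r
# Values copied from the second table
a <- c(59, 58, 51, 30, 56, 49, 48, 35, 45, 54, 64, 49, 54, 49, 36, 46, 46, 49, 61, 58)
b <- c(29, 28, 36, 34, 24, 24, 33, 36, 29, 37, 33, 25, 33, 21, 41, 46, 27, 22, 35, 29)

r_original <- cor(a, b)

# Positive multiplier plus a shift (constants are arbitrary): r is unchanged
r_rescaled <- cor(2 * a + 100, b / 10 - 3)
all.equal(r_original, r_rescaled)

# Negative multiplier on one variable: same magnitude, opposite sign
r_flipped <- cor(-a, b)
all.equal(r_flipped, -r_original)

# Nonlinear transformation: the difference from r_original need not be zero
r_log <- cor(log(a), b)
r_log - r_original
```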
Remember that scaling is a transformation, and specifically a linear one: each value has the mean subtracted and is then divided by the standard deviation. Some transformations do wild things to data, but scaling does not change the shape of the spread or how the points sit relative to one another; it only changes where the data are located and the units they are measured in. Because the Pearson correlation coefficient is itself computed from standardised deviations, it is unaffected. Scaling is done to standardise data, mostly so that variables measured on different scales are more comparable when we read them and when we apply statistical methods to them.
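For readers who want the argument spelled out, here is a short derivation sketch. It assumes the sample standard deviation with an \(n-1\) denominator (which is what R's sd() and cor() use), and the constants \(c\) and \(d\) are illustrative symbols, not anything from the code above.

```latex
% Pearson correlation written as a sum of products of standardised values
\[
r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n}
  \left( \frac{x_i - \bar{x}}{s_x} \right)
  \left( \frac{y_i - \bar{y}}{s_y} \right)
\]

% Rescale x linearly: u_i = c x_i + d with c > 0
\[
\bar{u} = c\,\bar{x} + d, \qquad s_u = c\,s_x
\]

% The standardised values of u are identical to those of x
\[
\frac{u_i - \bar{u}}{s_u}
  = \frac{c\,x_i + d - c\,\bar{x} - d}{c\,s_x}
  = \frac{x_i - \bar{x}}{s_x}
\quad\Longrightarrow\quad
r_{uy} = r_{xy}
\]
```

Standardising with scale() is the special case \(c = 1/s_x\) and \(d = -\bar{x}/s_x\), and the same argument applies to \(y\).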