Preliminaries

Let’s generate some data:

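set.seed(42)  # optional: fix the seed so the figures are reproducible
n = 2000
# rlnorm() takes the mean and sd of log(salary) as meanlog and sdlog
tab = rbind(data.frame(salary=rlnorm(n/2,meanlog=8,sdlog=0.3),group="A"),
            data.frame(salary=rlnorm(n/2,meanlog=10,sdlog=0.3),group="B"))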

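We have two groups, A and B, of the same size (1000 people each) but with different salaries: both are log-normal, with a median of about 3,000 in group A and about 22,000 in group B.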

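We also need two helper functions: one that draws a filled polygon under a curve, and one that integrates a curve numerically:

# Draw the area under the curve (x,y) as a filled polygon resting on y=0
mypolygon = function(x,y,...) {
  n = length(x)
  x = c(x[1],x,x[n])
  y = c(0,y,0)
  polygon(x,y,...)
}
# Integrate y over x with the trapezoidal rule
integral = function(x,y) {
  sum(diff(x)*(y[-1]-diff(y)/2))
}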
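
As a quick sanity check, the `integral` helper is the trapezoidal rule, so it should be exact for a straight line and very close to 1 for a density sampled on a fine grid. The sketch below repeats the helper so that it runs on its own; the test grids are illustrative choices and not part of the analysis.

```r
# Trapezoidal rule, same as the helper above (repeated so this runs alone)
integral = function(x,y) {
  sum(diff(x)*(y[-1]-diff(y)/2))
}

# Straight line y = x on [0,1]: the trapezoidal rule is exact, area = 1/2
x = seq(0,1,length.out=11)
line_area = integral(x,x)

# Standard normal density on [-6,6]: area should be very close to 1
z = seq(-6,6,length.out=1001)
normal_area = integral(z,dnorm(z))

print(c(line_area,normal_area))
```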

Visualization

Let’s start with something simple. Let’s visualize the number of people with specific salaries.

Visualize the number of people

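Let’s estimate the density of salaries in each group. Both estimates use the same grid, and each integrates to roughly 1, so stacking them directly is only meaningful because the groups are of equal size: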

d1 = density(tab$salary[tab$group=="A"],from=500,to=50000,n=512)
d2 = density(tab$salary[tab$group=="B"],from=500,to=50000,n=512)

Let’s plot it on a linear scale:

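# Stacked areas: the total is drawn first (col=2, red), then group A on top
# (col=3, green), so the red area that remains visible is group B
plot(d1$x,d1$y+d2$y,type="n",yaxt='n',xlab="Salary [$]",ylab="")
mypolygon(d1$x,d1$y+d2$y,col=2)
mypolygon(d1$x,d1$y,col=3)
d1_i = integral(d1$x,d1$y)
d2_i = integral(d2$x,d2$y)
percent = c(d1_i,d2_i)/(d1_i+d2_i)
# Legend colours must match the polygons: A is col=3, B is col=2
legend("topleft",sprintf("Group %s - %.0f%% population", c("A","B"), percent*100), fill=c(3,2))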

You can see now that both groups are of equal size (number of people).

Let’s plot it on a logarithmic scale:

plot(d1$x,d1$y+d2$y,type="n",yaxt='n',xlab="Salary",ylab="",log="x")
mypolygon(d1$x,d1$y+d2$y,col=2)
mypolygon(d1$x,d1$y,col=3)

The salaries are spread out better, but now the green group (A) looks bigger.

The cause is simple: the density was estimated with respect to x, not log(x), so stretching the x axis logarithmically no longer preserves areas. Let’s estimate the density of log(salary) instead:

d1 = density(log(tab$salary[tab$group=="A"]),from=log(500),to=log(50000),n=512)
d1$x = exp(d1$x)
d2 = density(log(tab$salary[tab$group=="B"]),from=log(500),to=log(50000),n=512)
d2$x = exp(d2$x)

Now we have a density with respect to the quantity we want on the x axis, the logarithm of salary. The x values are transformed back to salaries only so that the axis is labelled in money.
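
The underlying relation is the change of variables: if f_X is the density of the salary x, then the density of l = log(x) is f_L(l) = x * f_X(x). Here is a minimal check using the exact log-normal densities rather than the kernel estimates. The parameters 8 and 0.3 are those of group A; the grid and the variable names are illustrative.

```r
# Exact densities for group A: log(salary) ~ Normal(8, 0.3)
x = exp(seq(log(500),log(50000),length.out=512))
f_x = dlnorm(x,meanlog=8,sdlog=0.3)   # density with respect to salary
f_l = dnorm(log(x),mean=8,sd=0.3)     # density with respect to log(salary)

# Change of variables: f_l(log x) = x * f_x(x), up to rounding error
max_rel_err = max(abs(f_l - x*f_x)/f_l)
print(max_rel_err)
```

The factor x is why the low-salary group looked inflated when the density with respect to x was drawn on a logarithmic axis.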

We can now plot it on a logarithmic scale:

plot(d1$x,d1$y+d2$y,type="n",yaxt='n',xlab="Salary",ylab="",log="x")
mypolygon(d1$x,d1$y+d2$y,col=2)
mypolygon(d1$x,d1$y,col=3)
d1_i = integral(log(d1$x),d1$y)
d2_i = integral(log(d2$x),d2$y)
percent = c(d1_i,d2_i)/(d1_i+d2_i)
legend("topleft",sprintf("Group %s - %.0f%% population", c("A","B"), percent*100), fill=c(3,2))

Now we see that we have a nice spread of the salaries (log scale), and it is clearly visible that the two groups are of the same size.

Visualize the total income

So far we have visualized the populations of the groups, and the plots showed that the two populations are of equal size. But if we read the same plot as showing the total income of each group, it would be misleading: the two groups don’t earn the same. Group B is much wealthier.

If we want to show the total amount of money earned, we have to multiply the y value by the salary:

plot(d1$x,d1$x*(d1$y+d2$y),type="n",yaxt='n',xlab="Salary",ylab="",log="x")
mypolygon(d1$x,d1$x*(d1$y+d2$y),col=2)
mypolygon(d1$x,d1$x*(d1$y),col=3)
d1_i = integral(log(d1$x),d1$x*d1$y)
d2_i = integral(log(d2$x),d2$x*d2$y)
percent = c(d1_i,d2_i)/(d1_i+d2_i)
legend("topleft",sprintf("Group %s - %.0f%% of total income", c("A","B"), percent*100), fill=c(3,2))

And with a linear x axis:

d1 = density(tab$salary[tab$group=="A"],from=500,to=50000,n=512)
d2 = density(tab$salary[tab$group=="B"],from=500,to=50000,n=512)
plot(d1$x,d1$x*(d1$y+d2$y),type="n",yaxt='n',xlab="Salary",ylab="")
mypolygon(d1$x,d1$x*(d1$y+d2$y),col=2)
mypolygon(d1$x,d1$x*(d1$y),col=3)
d1_i = integral(d1$x,d1$x*d1$y)
d2_i = integral(d2$x,d2$x*d2$y)
percent = c(d1_i,d2_i)/(d1_i+d2_i)
legend("topleft",sprintf("Group %s - %.0f%% of total income", c("A","B"), percent*100), fill=c(3,2))
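
Both versions of the income plot integrate salary times density, so they should report the same shares. As a cross-check, the sketch below compares the share computed directly from a simulated sample with the theoretical value, using the log-normal mean exp(meanlog + sdlog^2/2). The seed, the larger sample size, and the variable names are assumptions made only to keep the comparison tight and reproducible.

```r
set.seed(1)          # arbitrary seed, not part of the original analysis
n_check = 200000     # larger than the n used above, to reduce sampling noise
a = rlnorm(n_check/2,meanlog=8,sdlog=0.3)
b = rlnorm(n_check/2,meanlog=10,sdlog=0.3)

# Share of total income earned by each group in the sample
sample_share = c(A=sum(a),B=sum(b))/(sum(a)+sum(b))

# Theoretical share: the group means are exp(meanlog + sdlog^2/2), and since
# the groups are of equal size and share sdlog, that term cancels
mean_a = exp(8+0.3^2/2)
mean_b = exp(10+0.3^2/2)
theory_share = c(A=mean_a,B=mean_b)/(mean_a+mean_b)

print(round(100*cbind(sample_share,theory_share),1))
```

With these parameters the theoretical share of group A is 1/(1+exp(2)), roughly 12%, so the legends of both income plots should show about 12% for A and 88% for B.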