The friendship paradox is the observation that most people have fewer friends than their friends do, on average. It is often seen on social networks such as Facebook. The phenomenon is perhaps obvious in hindsight, since a person with many friends is more likely to be one of yours, but it can be disconcerting to find that you are less popular than your friends.
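To make the paradox concrete, here is a minimal sketch in base R on a made-up network of five people. The names and friendships are invented for illustration and have nothing to do with the Facebook data. It compares each person's own friend count with the average friend count of their friends.

```r
# Hypothetical toy network: one undirected friendship per row.
edges <- data.frame(
  a = c("ann", "ann", "ann", "ann", "bob"),
  b = c("bob", "cat", "dan", "eve", "cat"),
  stringsAsFactors = FALSE
)

# Degree = number of times a person appears at either end of an edge.
degree <- table(c(edges$a, edges$b))

# List every friendship in both directions, so each person shows up as `person`.
both <- data.frame(
  person = c(edges$a, edges$b),
  friend = c(edges$b, edges$a),
  stringsAsFactors = FALSE
)
both$friend_degree <- as.vector(degree[both$friend])

# Average friend count among each person's friends.
friends_mean <- tapply(both$friend_degree, both$person, mean)
own <- as.vector(degree[names(friends_mean)])

# TRUE for people who have fewer friends than their friends do on average.
fewer <- own < friends_mean
print(data.frame(own = own, friends_mean = as.vector(friends_mean),
                 fewer = as.vector(fewer), row.names = names(friends_mean)))
```

In this toy network only ann, the hub, has more friends than her friends do on average; the other four all fall below their friends' average.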
Eric Tsai asked the following question: if \(X\) is the number of friends that have more friends than oneself, and \(Y\) is the number of friends that have fewer friends than oneself, what is the average value of \(X/Y\)?
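As a rough illustration of these definitions, the sketch below counts \(X\) and \(Y\) for every person in a small invented network. The names and edges are hypothetical, and only base R is used.

```r
# Hypothetical toy network: one undirected friendship per row.
edges <- data.frame(
  a = c("ann", "ann", "ann", "ann", "bob", "bob", "cat"),
  b = c("bob", "cat", "dan", "eve", "cat", "dan", "fay"),
  stringsAsFactors = FALSE
)

# Degree = number of friends of each person.
degree <- table(c(edges$a, edges$b))

# List every friendship in both directions.
person <- c(edges$a, edges$b)
friend <- c(edges$b, edges$a)
own    <- as.vector(degree[person])
theirs <- as.vector(degree[friend])

# X: friends with strictly more friends; Y: friends with strictly fewer.
X <- tapply(theirs > own, person, sum)
Y <- tapply(theirs < own, person, sum)
print(data.frame(degree = as.vector(degree[names(X)]),
                 X = as.vector(X), Y = as.vector(Y), row.names = names(X)))
```

Note that a friend with exactly the same number of friends counts toward neither \(X\) nor \(Y\). Here bob and cat are tied with three friends each, so bob has three friends but \(X + Y = 2\).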
Alas, the answer is infinity. A finite social network has at least one person with the least number of friends. For this unfortunate soul, \(Y\) is zero, so the ratio \(X/Y\) is infinite. And if even one term of an average is infinite, the average is infinite too. One might reasonably object that this answer is unhelpful, but that’s what you get for asking a mathematician. Let’s agree to throw out the infinite values and compute the average of the remaining terms.
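Spelled out, the quantity we will actually compute is the following. The notation is my own shorthand, not part of the original question: \(V\) is the set of people, \(X_v\) and \(Y_v\) are the values of \(X\) and \(Y\) for person \(v\), and \(S\) is the set of people we keep.

\[
\bar{Z} = \frac{1}{|S|} \sum_{v \in S} \frac{X_v}{Y_v},
\qquad
S = \{\, v \in V : Y_v > 0 \,\}.
\]

Requiring \(Y_v > 0\) also removes anyone whose friends all tie with them, for whom the ratio would be the undefined \(0/0\).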
A reasonable way to approach this problem is to download some anonymized data from actual Facebook friend networks and compute the average of \(X/Y\) directly. That is exactly what the following R script does. The dataset is from https://snap.stanford.edu/data/egonets-Facebook.html. Our conclusion is that, on this dataset, the average value of the ratio is about 4.
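# Load packages; unlike require(), library() stops with an error if one is missing.
library(data.table)
library(dplyr)
library(ggplot2)

# Download the combined edge list once and cache it locally.
local.file <- 'facebook_combined.txt.gz'
URL <- 'https://snap.stanford.edu/data/facebook_combined.txt.gz'
if (!file.exists(local.file)) {
  download.file(URL, local.file, mode = 'wb')  # binary mode keeps the gzip file intact
}

# Treat friendships as undirected: list every edge in both directions.
edges <- read.table(gzfile(local.file), header = FALSE, sep = ' ')
edges <- data.table(source = c(edges$V1, edges$V2),
                    target = c(edges$V2, edges$V1))

# Degree of each person, keyed once by source and once by target for the merges.
degree.source <- edges %>%
  group_by(source) %>%
  summarise(degree.source = n())
degree.target <- edges %>%
  group_by(target) %>%
  summarise(degree.target = n())

# For each person, count friends with more (X) and fewer (Y) friends,
# drop people with Y = 0, and compute the ratio Z = X / Y.
# From here on, edges holds one row per person rather than one row per edge.
edges <- edges %>%
  merge(degree.source, by = 'source') %>%
  merge(degree.target, by = 'target') %>%
  group_by(source) %>%
  summarise(X = sum(degree.target > degree.source),
            Y = sum(degree.target < degree.source)) %>%
  filter(Y > 0) %>%
  mutate(Z = X / Y)
rm(degree.source, degree.target)

print(mean(edges$Z))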
## [1] 3.997289
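# Histogram of the ratio. Bins start at 0, and coord_cartesian() zooms to
# [0, 20] without discarding data or the bars at the edges of the range.
ggplot(edges, aes(x = Z)) +
  geom_histogram(binwidth = 1, boundary = 0, fill = "blue") +
  coord_cartesian(xlim = c(0, 20))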