Using the pseudo_facebook.tsv dataset, I am going to analyze how users behave on Facebook: what they do there and how their activity varies.
First, I read the data and check its summary. Next, I get the list of names for all the variables of the dataset.
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
# Alternatively:
# pf <- read.delim('pseudo_facebook.tsv')
summary(pf)
## userid age dob_day dob_year
## Min. :1000008 Min. : 13.00 Min. : 1.00 Min. :1900
## 1st Qu.:1298806 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.:1963
## Median :1596148 Median : 28.00 Median :14.00 Median :1985
## Mean :1597045 Mean : 37.28 Mean :14.53 Mean :1976
## 3rd Qu.:1895744 3rd Qu.: 50.00 3rd Qu.:22.00 3rd Qu.:1993
## Max. :2193542 Max. :113.00 Max. :31.00 Max. :2000
##
## dob_month gender tenure friend_count
## Min. : 1.000 female:40254 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 male :58574 1st Qu.: 226.0 1st Qu.: 31.0
## Median : 6.000 NA's : 175 Median : 412.0 Median : 82.0
## Mean : 6.283 Mean : 537.9 Mean : 196.4
## 3rd Qu.: 9.000 3rd Qu.: 675.0 3rd Qu.: 206.0
## Max. :12.000 Max. :3139.0 Max. :4923.0
## NA's :2
## friendships_initiated likes likes_received
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 17.0 1st Qu.: 1.0 1st Qu.: 1.0
## Median : 46.0 Median : 11.0 Median : 8.0
## Mean : 107.5 Mean : 156.1 Mean : 142.7
## 3rd Qu.: 117.0 3rd Qu.: 81.0 3rd Qu.: 59.0
## Max. :4144.0 Max. :25111.0 Max. :261197.0
##
## mobile_likes mobile_likes_received www_likes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.0 Median : 4.00 Median : 0.00
## Mean : 106.1 Mean : 84.12 Mean : 49.96
## 3rd Qu.: 46.0 3rd Qu.: 33.00 3rd Qu.: 7.00
## Max. :25111.0 Max. :138561.00 Max. :14865.00
##
## www_likes_received
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean : 58.57
## 3rd Qu.: 20.00
## Max. :129953.00
##
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
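The two ways of reading the file shown above should give the same result, since read.delim() differs from read.csv() only in its default separator. Here is a minimal check on a small temporary file (the two rows are made up for illustration and are not taken from pseudo_facebook.tsv):

```r
# Hypothetical two-row sample written to a temporary file.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender",
             "1\t14\tmale",
             "2\t27\tfemale"), tmp)

a <- read.csv(tmp, sep = "\t")   # comma reader with the separator overridden
b <- read.delim(tmp)             # tab separator is the default

identical(a, b)                  # compare the two data frames
unlink(tmp)                      # clean up the temporary file
```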
Scatterplots show two continuous variables on a single plot. Here again, I will produce them with both the qplot() shortcut and the full ggplot() syntax, both of which come from the ggplot2 library.
library(ggplot2)
qplot(age, friend_count, data = pf)
It looks like younger users (under the age of thirty) have many more friends than users in other age ranges.
The tall, dense vertical lines, such as those around the ages of 69 and 100, most likely mark users who entered a false age.
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point() +
xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).
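Overplotting makes it difficult to tell how many points there are in each region. To reduce it, I use geom_jitter(alpha = 1/20): the jitter adds a little random noise to each position, and the alpha value draws every point at 5% opacity, so a region looks dark only where many points (on the order of 20) overlap.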
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha = 1/20) +
xlim(13, 90)
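To get a feel for what alpha = 1/20 does, here is a rough sketch of how opacity builds up when points overlap. It assumes standard "over" alpha compositing, which is common but may differ between graphics devices; stacked_opacity is a hypothetical helper, not a ggplot2 function.

```r
# Opacity of n identical semi-transparent points drawn on top of each other,
# ASSUMING standard "over" compositing: each layer lets (1 - alpha) through.
stacked_opacity <- function(n, alpha) 1 - (1 - alpha)^n

stacked_opacity(c(1, 5, 20, 50), alpha = 1/20)
```

Under this assumption, 20 overlapping points reach roughly 64% opacity, so fully dark regions correspond to noticeably more than 20 users.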
Next, I will apply a square-root transformation to the y axis to make the data more readable.
# position = position_jitter(h = 0) jitters the points horizontally only, so that
# users with 0 friends do not get negative y values, whose square root is undefined
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = 'sqrt')
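The reason for h = 0 can be seen without any plotting. The sketch below uses base R's jitter() as a stand-in for vertical jitter in ggplot2 (an assumption made for illustration; the exact noise ggplot2 adds differs) and applies it to simulated users with 0 friends.

```r
set.seed(42)

friend_count <- rep(0, 1000)                     # simulated users with 0 friends
jittered <- jitter(friend_count, amount = 0.4)   # uniform noise in [-0.4, 0.4]

sum(jittered < 0)                                # how many values became negative

# sqrt() of a negative number is NaN, so those points could not be drawn
transformed <- suppressWarnings(sqrt(jittered))
sum(is.nan(transformed))
```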
Now, I examine the relationship between friendships_initiated and age using the ggplot syntax.
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point(alpha = 1/10, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = 'sqrt')
It is sometimes useful to see how the average value of one variable changes with the value of another variable. I will use the dplyr package for this.
For example, I will plot how the average number of friends changes with the age of the users.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
head(pf.fc_by_age)
## Source: local data frame [6 x 4]
##
## age friend_count_mean friend_count_median n
## (int) (dbl) (dbl) (int)
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
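The same group-and-summarise steps can also be chained with dplyr's %>% operator. Here is a small sketch on a made-up five-row data frame (toy and its values are hypothetical, chosen so the results are easy to verify by hand):

```r
library(dplyr)

# Hypothetical data: two users aged 13 and three users aged 14.
toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 15, 100))

toy.fc_by_age <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

toy.fc_by_age
```

The age-14 group shows why both statistics are worth keeping: a single user with 100 friends pulls the mean well above the median.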
Now that I have all the necessary summarised information, I will create the plot:
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
Now, I will show both the original raw data and the summary information on the same plot.
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20,
position = position_jitter(h = 0),
color = "orange") +
coord_cartesian(xlim = c(13, 70), ylim = c(0,1000)) +
geom_line(stat = "summary", fun.y = mean) +
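# 10th percentile, median and 90th percentile of friend_count at each age
# (ggplot2 >= 3.3.0 prefers fun = over fun.y =)
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1),
linetype = 2, color = "blue") +
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.5),
color = "blue") +
geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9),
linetype = 2, color = "blue")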
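The summary lines on this plot are per-age statistics. As a rough illustration of what a quantile summary computes for each age, the sketch below calculates the 10th, 50th and 90th percentiles with base R on simulated data (toy is hypothetical, and its exponential friend counts are only an assumption for the example).

```r
set.seed(1)

# Hypothetical data: 500 users aged 20 and 500 users aged 40.
toy <- data.frame(
  age = rep(c(20, 40), each = 500),
  friend_count = c(rexp(500, rate = 1/300), rexp(500, rate = 1/100))
)

# One column per age, one row per quantile.
q_by_age <- sapply(split(toy$friend_count, toy$age),
                   quantile, probs = c(0.1, 0.5, 0.9))
q_by_age
```

Each row of q_by_age corresponds to one of the quantile lines, and each column to one position on the x axis.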
Having more than 1000 friends is rare, even for the younger users, as their 90th percentile is well below 1000.
I will calculate the correlation to measure the strength of the linear relationship between age and the number of friends users have.
cor.test(pf$age, pf$friend_count, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
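As the most meaningful data are for users aged 70 or under, I will use only this subset for calculating the correlation coefficient. I also switch to Spearman's rank correlation here, which measures a monotonic rather than a strictly linear relationship.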
with( subset(pf, age <= 70), cor.test(age, friend_count,
method = "spearman"))
##
## Spearman's rank correlation rho
##
## data: age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2552934
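The difference between the two methods used above is easiest to see on a relationship that is monotonic but far from linear. Here is a minimal sketch on made-up values (x and y are hypothetical and unrelated to the Facebook data):

```r
x <- 1:20
y <- exp(x)   # strictly increasing, but strongly non-linear

pearson  <- cor(x, y, method = "pearson")    # measures linear association
spearman <- cor(x, y, method = "spearman")   # measures monotonic association (uses ranks)

c(pearson = pearson, spearman = spearman)
```

Spearman's rho is 1 here because the ranks of x and y agree perfectly, while Pearson's r is much lower because the points are far from a straight line.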
Next, I will create a scatterplot of two highly correlated variables, likes_received and www_likes_received.
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point(alpha = 1/5, position = position_jitter(h = 0, w = 0)) +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
#coord_trans(x = "sqrt", y = "sqrt") +
geom_smooth(method = "lm", color = "red")
To determine the correlation between these two variables, I use the cor.test() command.
cor.test(pf$www_likes_received, pf$likes_received)
##
## Pearson's product-moment correlation
##
## data: pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Correlation coefficients may be deceptive. To show this, I will use some data from the Mitchell dataset.
#install.packages('alr3')
library(alr3)
## Loading required package: car
data("Mitchell")
Now, I will create a scatterplot of temperature (Temp) against month (Month) from this dataset.
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point()
cor.test(Mitchell$Month, Mitchell$Temp)
##
## Pearson's product-moment correlation
##
## data: Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
The above scatterplot looks quite messy, and the correlation coefficient between the two variables is very low as well.
However, I know that months repeat in a 12-month cycle, so I will analyse the data a bit further with this knowledge in mind.
I will place the x axis breaks at every 12 months to mark each yearly cycle. I will also change the graph from a scatterplot to a line plot.
To find the range of the variable Month, I check its summary.
summary(Mitchell$Month)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 50.75 101.50 101.50 152.20 203.00
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_line() +
# Month is numeric, so the breaks go on a continuous scale
scale_x_continuous(breaks = seq(0, 203, 12))
With such a plot, we see that the temperature fluctuates seasonally. So it is always important to put the data into context and not to rely on the correlation coefficient alone.
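The same effect can be reproduced on synthetic data. The sketch below assumes a purely sinusoidal "temperature" over 204 months (an idealised stand-in, not the Mitchell measurements): the correlation with the running month index is close to zero, yet the value is fully determined by the month of the year.

```r
month <- 0:203                        # same range as Mitchell$Month
temp  <- sin(2 * pi * month / 12)     # hypothetical seasonal signal, period of 12 months

cor(month, temp)                      # linear correlation with the running month index

# Averaging by month of the year (0 to 11) reveals the seasonal pattern.
monthly_mean <- tapply(temp, month %% 12, mean)
round(monthly_mean, 2)
```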
Let’s return to the original dataset of Facebook users and further analyse the relationship between users’ age and their number of friends.
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line()
Some of the variation here makes sense, e.g. the peak around the age of 69, but the rest is just noise.
I will plot the same relationship between age and the number of friends, but this time with age measured in months rather than in years, to see how much the noise increases with this change.
pf$age_with_months <- pf$age + (12 - pf$dob_month)/12
age_with_months <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_with_months,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
head(pf.fc_by_age_months)
## Source: local data frame [6 x 4]
##
## age_with_months friend_count_mean friend_count_median n
## (dbl) (dbl) (dbl) (int)
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
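To make the age_with_months formula concrete, here it is wrapped in a small function and applied to a few birth months. The name age_in_months is a hypothetical helper introduced only for this illustration; the formula itself is the one used above, which measures the fraction of a year from the birth month to the end of the year.

```r
# Same formula as pf$age + (12 - pf$dob_month)/12, wrapped for illustration.
age_in_months <- function(age, dob_month) age + (12 - dob_month) / 12

# A 13-year-old born in October, September and January, respectively.
age_in_months(age = 13, dob_month = c(10, 9, 1))
```

The first value, 13 + 2/12, is the 13.16667 in the first row of the table above, so that row corresponds to users aged 13 who were born in October.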
Now, I plot the results, and the plot is much noisier.
ggplot(aes(x = age_with_months, y = friend_count_mean),
data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()
geom_line()
I will put the two plots one above the other to make the difference easier to see.
library(gridExtra)
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line()
grid.arrange(p2, p1, ncol = 1)
To reduce the noise even further, I bin the ages into 5-year groups by dividing age by 5, rounding, and multiplying by 5.
With more users in each bin, every mean is estimated more precisely, but some important features will most probably be smoothed away.
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line() +
geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean),
data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line() +
geom_smooth()
p3 <- ggplot(aes(x = round(age/5)*5, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line(stat = "summary", fun.y = mean)
grid.arrange(p2, p1, p3, ncol = 1)
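The binning expression round(age/5)*5 used for the third plot can be checked on a handful of values. A minimal sketch with made-up ages:

```r
age <- c(13, 17, 18, 22, 23, 27)   # hypothetical ages

age_bin <- round(age / 5) * 5      # nearest multiple of 5
age_bin

table(age_bin)                     # number of ages that fall into each bin
```

Every bin pools several neighbouring ages, which is what smooths the line: each plotted mean rests on more observations, at the cost of hiding changes that happen within a 5-year span.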
In exploratory data analysis, you do not need to choose just one plot. Sometimes, different plots reveal different details about the same data.
I describe exploratory data analysis with more than two variables in further documents. Please check my website for more details.