Having the dataset pseudo_facebook.tsv, I am going to analyze the users’ behaviour in Facebook, understand what they are doing there and how they behave.
First, I read the data and check its summary. Next, I get the list of names for all the variables of the dataset.
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
# Alternatively:
# pf <- read.delim('pseudo_facebook.tsv')
summary(pf)
## userid age dob_day dob_year
## Min. :1000008 Min. : 13.00 Min. : 1.00 Min. :1900
## 1st Qu.:1298806 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.:1963
## Median :1596148 Median : 28.00 Median :14.00 Median :1985
## Mean :1597045 Mean : 37.28 Mean :14.53 Mean :1976
## 3rd Qu.:1895744 3rd Qu.: 50.00 3rd Qu.:22.00 3rd Qu.:1993
## Max. :2193542 Max. :113.00 Max. :31.00 Max. :2000
##
## dob_month gender tenure friend_count
## Min. : 1.000 female:40254 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 male :58574 1st Qu.: 226.0 1st Qu.: 31.0
## Median : 6.000 NA's : 175 Median : 412.0 Median : 82.0
## Mean : 6.283 Mean : 537.9 Mean : 196.4
## 3rd Qu.: 9.000 3rd Qu.: 675.0 3rd Qu.: 206.0
## Max. :12.000 Max. :3139.0 Max. :4923.0
## NA's :2
## friendships_initiated likes likes_received
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 17.0 1st Qu.: 1.0 1st Qu.: 1.0
## Median : 46.0 Median : 11.0 Median : 8.0
## Mean : 107.5 Mean : 156.1 Mean : 142.7
## 3rd Qu.: 117.0 3rd Qu.: 81.0 3rd Qu.: 59.0
## Max. :4144.0 Max. :25111.0 Max. :261197.0
##
## mobile_likes mobile_likes_received www_likes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.0 Median : 4.00 Median : 0.00
## Mean : 106.1 Mean : 84.12 Mean : 49.96
## 3rd Qu.: 46.0 3rd Qu.: 33.00 3rd Qu.: 7.00
## Max. :25111.0 Max. :138561.00 Max. :14865.00
##
## www_likes_received
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean : 58.57
## 3rd Qu.: 20.00
## Max. :129953.00
##
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
Here, I will create a plot to examine if the age distribution has an impact on the number of friends across genders. For this, I will add a third variable of age to the line plot of friend count vs age.
library(ggplot2)
# ggplot(aes(x = gender, y = age),
# data = subset(pf, !is.na(gender))) +
# geom_boxplot() +
# stat_summary(fun.y = mean, geom = 'point', shape = 4)
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(gender))) +
geom_line(aes(color = gender), stat = "summary", fun.y = median)
The plot shows that almost everywhere the median count of friends is larger for women than it is for men. There are some exaptions and they include the noisy regions for older ages.
It is possible to create the same plot using a dplyr package.
library(dplyr)
# chain functions together %>%
pf.fc_by_age_gender <- pf %>%
filter(!is.na(gender)) %>%
group_by(age, gender) %>%
summarise(mean_friend_count = mean(friend_count),
median_friend_count = median(friend_count),
n = n()) %>%
ungroup() %>%
arrange(age)
head(pf.fc_by_age_gender)
## Source: local data frame [6 x 5]
##
## age gender mean_friend_count median_friend_count n
## (int) (fctr) (dbl) (dbl) (int)
## 1 13 female 259.1606 148.0 193
## 2 13 male 102.1340 55.0 291
## 3 14 female 362.4286 224.0 847
## 4 14 male 164.1456 92.5 1078
## 5 15 female 538.6813 276.0 1139
## 6 15 male 200.6658 106.5 1478
ggplot(aes(x = age, y = median_friend_count), data = pf.fc_by_age_gender) +
geom_line(aes(color = gender))
I will use the library reshape2 to transform the data to the wide format.
#install.packages('reshape2')
library(reshape2)
pf.fc_by_age_gender_wide <- dcast(pf.fc_by_age_gender,
age ~ gender,
value.var = 'median_friend_count')
head(pf.fc_by_age_gender_wide)
## age female male
## 1 13 148.0 55.0
## 2 14 224.0 92.5
## 3 15 276.0 106.5
## 4 16 258.5 136.0
## 5 17 245.5 125.0
## 6 18 243.0 122.0
Now, having the pf.fc_by_age_gender.wide dataframe, I will plot the ratio of the female to male median friend counts.
ggplot(aes(x = age, y = female/male),
data = pf.fc_by_age_gender_wide) +
geom_line() +
geom_hline(yintercept = 1, alpha = 0.3, color = 'red')
Now, we can easily see that females have on average more friend then males up to the age of about 70. Very young female users have almost 2.5 more friends on average than males.
I am going to add another quantitative variable to the pf dataframe that would show in which year each user joined Facebook.
pf$year_joined <- floor(2014 - pf$tenure/365)
summary(pf$year_joined)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2005 2012 2012 2012 2013 2014 2
table(pf$year_joined)
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 9 15 581 1507 4557 5448 9860 33366 43588 70
Now, I will transform the variable year_joined into a new categorical variable with just four backets.
# There will be four backets:
# (2004, 2009]
# (2009, 2011]
# (2011, 2012]
# (2012, 2014]
pf$year_joined_backet <- cut(pf$year_joined,
c(2004, 2009, 2011, 2012, 2014))
summary(pf$year_joined_backet)
## (2004,2009] (2009,2011] (2011,2012] (2012,2014] NA's
## 6669 15308 33366 43658 2
Now, I will plot the distribution of friends count over the age for each of the four backets of the joining year.
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined_backet))) +
geom_line(aes(color = year_joined_backet), stat = "summary", fun.y = median)
The plot shows that the younger users that joined Facebook in its early years (between 2004 and 2009) have the biggest average number of friends. This relationship becomes messy for the older users.
In the next plot, I will change the median number of friends to the mean value and will add the grand mean value for all the users as a dash line.
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined_backet))) +
geom_line(aes(color = year_joined_backet), stat = "summary", fun.y = mean) +
geom_line(stat = 'summary', fun.y = mean, linetype = 2)
The grand mean line is a great reminder that the most of the data in our dataset is about the users of the recent cohort, who joined Facebook between the years 2012 and 2014.
Next, I will calculate the number of friends per day since the user has been active on Facebook.
with(subset(pf, tenure >=1), summary(friend_count / tenure))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0775 0.2205 0.6096 0.5658 417.0000
In addition, I will create a line graph of mean of friendships_initiated per day (of tenure) vs. tenure colored by year_joined.bucket
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined_backet), stat = "summary", fun.y = mean)
A closer look at this plot shows that people with a longer tenure typically initiate less friendships.
I make several transformations to the y xis to visualize a bias / variance tradeoff. As the bin size increases, we see less noise in the graph.
library(gridExtra)
p1 <- ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined_backet),
stat = 'summary',
fun.y = mean)
p2 <- ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined_backet),
stat = "summary",
fun.y = mean)
p3 <- ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined_backet),
stat = "summary",
fun.y = mean)
p4 <- ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined_backet),
stat = "summary",
fun.y = mean)
grid.arrange(p1, p2, p3, p4, ncol = 1)
# ggplot(aes(x = tenure, y = friendships_initiated / tenure),
# data = subset(pf, tenure >= 1)) +
# geom_smooth(aes(color = year_joined_backet))
The Yogurt dataset is another dataset I will use in this document. It has a different structure from the Facebook dataset, as each client here has multiple rows of data, each containing some data of that user’s purchases.
yo <- read.csv("yogurt.csv")
summary(yo)
## obs id time strawberry
## Min. : 1.0 Min. :2100081 Min. : 9662 Min. : 0.0000
## 1st Qu.: 696.5 1st Qu.:2114348 1st Qu.: 9843 1st Qu.: 0.0000
## Median :1369.5 Median :2126532 Median :10045 Median : 0.0000
## Mean :1367.8 Mean :2128592 Mean :10050 Mean : 0.6492
## 3rd Qu.:2044.2 3rd Qu.:2141549 3rd Qu.:10255 3rd Qu.: 1.0000
## Max. :2743.0 Max. :2170639 Max. :10459 Max. :11.0000
## blueberry pina.colada plain mixed.berry
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.3571 Mean : 0.3584 Mean :0.2176 Mean :0.3887
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :12.0000 Max. :10.0000 Max. :6.0000 Max. :8.0000
## price
## Min. :20.00
## 1st Qu.:50.00
## Median :65.04
## Mean :59.25
## 3rd Qu.:68.96
## Max. :68.96
head(yo)
## obs id time strawberry blueberry pina.colada plain mixed.berry
## 1 1 2100081 9678 0 0 0 0 1
## 2 2 2100081 9697 0 0 0 0 1
## 3 3 2100081 9825 0 0 0 0 1
## 4 4 2100081 9999 0 0 0 0 1
## 5 5 2100081 10015 1 0 1 0 1
## 6 6 2100081 10029 1 0 2 0 1
## price
## 1 58.96
## 2 58.96
## 3 65.04
## 4 65.04
## 5 48.96
## 6 65.04
Most of the values in this dataset are integers, but I will convert one of the variable to a factor.
yo$id <- factor(yo$id)
str(yo)
## 'data.frame': 2380 obs. of 9 variables:
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : Factor w/ 332 levels "2100081","2100370",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ time : int 9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
## $ strawberry : int 0 0 0 0 1 1 0 0 0 0 ...
## $ blueberry : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pina.colada: int 0 0 0 0 1 2 0 0 0 0 ...
## $ plain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mixed.berry: int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 59 59 65 65 49 ...
qplot(x = price, data = yo, fill = I("#F79420"))
It is worth to make a closer look at the distribution of prices in our dataset, as the previous plot showed the prices are discrete and some prices are quite frequent while there are no purchases at all for some other prices.
Afterwards, I will calculate and plot the total number of yogurt purchases by each user (household) in one time.
length(unique((yo$price)))
## [1] 20
table(yo$price)
##
## 20 24.96 33.04 33.2 33.28 33.36 33.52 39.04 44 45.04 48.96 49.52
## 2 11 54 1 1 22 1 234 21 11 81 1
## 49.6 50 55.04 58.96 62 63.04 65.04 68.96
## 1 205 6 303 15 2 799 609
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)
summary(yo$all.purchases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.971 2.000 21.000
qplot(x = all.purchases, data = yo,
binwidth = 1,
fill = I("#099DD9"))
This histogram reveals that most households buy one or two yogurts in a time.
Now I will plot price over time.
ggplot(aes(x = time, y = price), data = yo) +
geom_jitter(alpha = 1/4, shape = 21, fill = I("#F79420"))
The plot shows that price seems to be increasing a bit over time.
When you are familiarazing yourself with a new dataset that contains multile observations of the same units, it is often useful to work with the sample of that unit so that it is easier to display the raw data for that unit. In terms of the Yogurt dataset, it is useful to look into the sample of several households with more details to know what could be the within-sample variations.
set.seed(4230)
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(x = time, y = price),
data = subset(yo, id %in% sample.ids)) +
facet_wrap(~ id) +
geom_line() +
geom_point(aes(size = all.purchases), pch = 1)
The plot shows that some households purchase more yogurts than others. For most of the households the price of the yogurt tends to stay the same or increase over time, although there are some exceptions, e.g. id 2123463 or id 2141341. The down-peaks (as for id 2122705 or id 2124750) most probably represent coupons that reduce the price.
It is sometimes very useful to create a matrix of scatterplots to speed up the data analysis.
library(GGally)
theme_set(theme_minimal(20))
set.seed(1836)
# I omit categorical variables
pf_subset <- pf[, c(2:15)]
names(pf_subset)
## [1] "age" "dob_day"
## [3] "dob_year" "dob_month"
## [5] "gender" "tenure"
## [7] "friend_count" "friendships_initiated"
## [9] "likes" "likes_received"
## [11] "mobile_likes" "mobile_likes_received"
## [13] "www_likes" "www_likes_received"
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),])
The ggpair uses different plot types for different types of combinations of variables. However, ggpairs does not do things like transforming axis into logarithmic ones etc.
Here is an example of how to create a heat map for the data of another dataset, containing information of different genes.
nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
## gene case value
## 1 1 1 0.300
## 2 2 1 1.180
## 3 3 1 0.550
## 4 4 1 1.140
## 5 5 1 -0.265
## 6 6 1 -0.070
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))