Using the pseudo_facebook.tsv dataset, I am going to analyze how users behave on Facebook and what they do there.
First, I read the data and check its summary. Next, I get the list of names for all the variables of the dataset.
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
summary(pf)
## userid age dob_day dob_year
## Min. :1000008 Min. : 13.00 Min. : 1.00 Min. :1900
## 1st Qu.:1298806 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.:1963
## Median :1596148 Median : 28.00 Median :14.00 Median :1985
## Mean :1597045 Mean : 37.28 Mean :14.53 Mean :1976
## 3rd Qu.:1895744 3rd Qu.: 50.00 3rd Qu.:22.00 3rd Qu.:1993
## Max. :2193542 Max. :113.00 Max. :31.00 Max. :2000
##
## dob_month gender tenure friend_count
## Min. : 1.000 female:40254 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 male :58574 1st Qu.: 226.0 1st Qu.: 31.0
## Median : 6.000 NA's : 175 Median : 412.0 Median : 82.0
## Mean : 6.283 Mean : 537.9 Mean : 196.4
## 3rd Qu.: 9.000 3rd Qu.: 675.0 3rd Qu.: 206.0
## Max. :12.000 Max. :3139.0 Max. :4923.0
## NA's :2
## friendships_initiated likes likes_received
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 17.0 1st Qu.: 1.0 1st Qu.: 1.0
## Median : 46.0 Median : 11.0 Median : 8.0
## Mean : 107.5 Mean : 156.1 Mean : 142.7
## 3rd Qu.: 117.0 3rd Qu.: 81.0 3rd Qu.: 59.0
## Max. :4144.0 Max. :25111.0 Max. :261197.0
##
## mobile_likes mobile_likes_received www_likes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.0 Median : 4.00 Median : 0.00
## Mean : 106.1 Mean : 84.12 Mean : 49.96
## 3rd Qu.: 46.0 3rd Qu.: 33.00 3rd Qu.: 7.00
## Max. :25111.0 Max. :138561.00 Max. :14865.00
##
## www_likes_received
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean : 58.57
## 3rd Qu.: 20.00
## Max. :129953.00
##
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
I will look at a histogram of the day of the month on which users were born, using the ggplot2 library:
#install.packages('ggplot2')
library(ggplot2)
# dob_day is numeric, so it needs a continuous scale
qplot(x = dob_day, data = pf) +
scale_x_continuous(breaks = 1:31)
According to the histogram, most people were born on the first day of the month. This does not look plausible.
The number of people born on the 31st is the smallest. This, however, is expected, as not every month has 31 days.
I am going to break the histogram into 12 histograms, one for each month of a year.
# Two options here: facet_wrap and facet_grid
# facet_wrap(~variable)
# facet_grid(vertical~horizontal)
qplot(x = dob_day, data = pf) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month, ncol = 3)
This new plot shows that only the first day of the first month (January) is an outlier. This suggests that ‘1st of January’ is the default date of birth offered by Facebook, and that many people give false information by simply keeping the default. This outlier represents bad data in our dataset.
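The spike can also be checked numerically by counting each month-day combination. This is only a minimal sketch: the data frame `toy` and its values are made up for illustration and are not taken from pseudo_facebook.tsv.

```r
# Hypothetical birthdays with an inflated 1st of January
toy <- data.frame(
  dob_month = c(1, 1, 1, 1, 2, 3, 3, 7, 12),
  dob_day   = c(1, 1, 1, 15, 1, 9, 20, 4, 31)
)

# Build a "month-day" key and count how often each one occurs
key        <- paste(toy$dob_month, toy$dob_day, sep = "-")
key_counts <- sort(table(key), decreasing = TRUE)

# The most frequent combination comes first
key_counts[1]
```

On the real data, the same idea would use `pf$dob_month` and `pf$dob_day` in place of the toy columns.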
Now, I will look at the histogram of friend count.
qplot(x = friend_count, data = pf)
The data has a long tail along the x axis, which makes it hard to see the bulk of the distribution. I will limit the x axis to see the data in more detail.
qplot(x = friend_count, data = pf, xlim = c(0,1000))
# There is another way to limit the scale:
# qplot(x = friend_count, data = pf) +
# scale_x_continuous(limits = c(0,1000)) +
# scale_y_continuous(limits = c(0,20000))
To make the histogram more readable, I add axis breaks every 50 friends.
qplot(x = friend_count, data = pf) +
scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000,50))
Now, I facet the histogram of friend counts by gender and set the bin width to 10. This will help answer the question of who has more friends on average, males or females.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
The users with a missing gender are not useful for this comparison, so I omit the rows where gender is NA.
# Using na.omit() will omit all the data with NA values, not
# only NA values of gender. Instead, I will use is.na()
qplot(x = friend_count, data = subset(pf, !is.na(gender)), binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
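The difference between na.omit() and is.na() mentioned in the comment above can be seen on a small example. This is a sketch on a made-up data frame (`toy` and its values are hypothetical), not on the real dataset.

```r
# Toy data frame: one row has a missing gender, another a missing tenure
toy <- data.frame(
  gender       = c("female", "male", NA, "male", "female"),
  tenure       = c(100, NA, 250, 300, 50),
  friend_count = c(10, 20, 30, 40, 50)
)

# na.omit() drops every row that has an NA in any column
all_complete <- na.omit(toy)

# subset() with is.na() drops only the rows where gender is missing
gender_known <- subset(toy, !is.na(gender))

nrow(all_complete)  # 3: the NA-gender row and the NA-tenure row are gone
nrow(gender_known)  # 4: only the NA-gender row is gone
```

In pf, tenure has 2 missing values, so na.omit() could remove rows whose gender is known.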
It is difficult to say who has more friends, males or females, by just looking at the histogram of friend counts.
The output of the table command shows there are noticeably more males (58574) than females (40254) in our dataset.
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
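The by command gives us enough information to answer two questions: who has more friends on average, and how the median friend counts differ.
Response: females have more friends on average; their mean number of friends is 242 vs 165 for males.
Response: the median friend count is 96 for females vs 74 for males, a difference of 22.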
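The means above are much larger than the medians because a few users with very many friends pull the mean up. A minimal sketch with made-up numbers (the `toy` data frame is hypothetical) shows the same effect, using tapply() as a compact alternative to by():

```r
# Hypothetical friend counts; one heavy user per group creates a long tail
toy <- data.frame(
  gender       = c("female", "female", "female", "male", "male", "male"),
  friend_count = c(10, 20, 900, 5, 15, 400)
)

group_mean   <- tapply(toy$friend_count, toy$gender, mean)
group_median <- tapply(toy$friend_count, toy$gender, median)

group_mean    # female 310, male 140
group_median  # female 20,  male 15
```

For skewed data like this, the median is usually the more robust summary of a typical user.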
Next, I will analyze the tenure of Facebook users. In other words, I will look at how long people have been using Facebook, according to our dataset.
qplot(x = tenure, data = pf, binwidth = 30,
color = I('black'), fill = I('#099DD9'))
# Equivalent ggplot syntax:
# ggplot(aes(x = tenure), data = pf) +
# geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9')
To create a histogram of tenure in years, I change the x value to tenure/365 and set the binwidth to 0.25, so that each bin covers one quarter of a year.
qplot(x = tenure/365, data = pf, binwidth = 0.25,
color = I('black'), fill = I('#F79420')) +
scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))
# Equivalent ggplot syntax:
# ggplot(aes(x = tenure/365), data = pf) +
# geom_histogram(binwidth = .25, color = 'black', fill = '#F79420')
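To see what quarter-year bins mean in numbers, the counts can be computed without plotting. This is only a sketch: the tenure values are made up, and ggplot2 may place its bin edges slightly differently from cut().

```r
# Hypothetical tenure values in days
tenure_days  <- c(30, 80, 100, 200, 365, 400, 730)
tenure_years <- tenure_days / 365

# Bin edges every 0.25 years, covering 0 to 7 years like the plot's x limits
edges <- seq(0, 7, by = 0.25)
bins  <- cut(tenure_years, breaks = edges, include.lowest = TRUE)

# Count users per quarter and show only the non-empty bins
quarter_counts <- table(bins)
quarter_counts[quarter_counts > 0]
```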
I label the axes to make the plot easily understandable to viewers.
qplot(x = tenure/365, data = pf, binwidth = 0.25,
xlab = "Number of years using Facebook",
ylab = "Number of users in sample",
color = I('black'), fill = I('#F79420')) +
scale_x_continuous(breaks = seq(1,7,1), limits = c(0,7))
# Equivalent ggplot syntax:
# ggplot(aes(x = tenure / 365), data = pf) +
# geom_histogram(binwidth = .25, color = 'black', fill = '#F79420') +
# scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
# xlab('Number of years using Facebook') +
# ylab('Number of users in sample')
Now, I create a histogram of users’ ages. But first, I check the summary for the age variable to find out the min and max values of the users’ age and use them to update the x axis.
summary(pf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 28.00 37.28 50.00 113.00
qplot(x = age, data = pf, binwidth = 1,
xlab = "Age of a user",
ylab = "Number of users in sample",
fill = I('#5760AB')) +
scale_x_continuous(breaks = seq(13,113,5))
There are no users younger than 13, which matches Facebook's policy. The largest number of users is around age 20, and there is also a peak for ages above 100, which most likely reflects false data.
Sometimes we need to transform the data to make its distribution look more like a normal distribution. For example, the histogram of friend counts was heavily skewed, so it is a good candidate for a transformation.
summary(pf$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(log10(pf$friend_count + 1)) # we add 1 to avoid -Inf for those users who have 0 friends
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.505 1.919 1.868 2.316 3.692
summary(sqrt(pf$friend_count))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.568 9.055 11.090 14.350 70.160
#install.packages('gridExtra')
library(gridExtra)
p1 <- qplot(x = friend_count, data = pf)
p2 <- qplot(x = log10(friend_count+1), data = pf)
p3 <- qplot(x = sqrt(friend_count), data = pf)
grid.arrange(p1, p2, p3, ncol = 1)
# Equivalent in ggplot:
# p1 <- ggplot(aes(x = friend_count), data = pf) + geom_histogram()
# p2 <- p1 + scale_x_log10()
# p3 <- p1 + scale_x_sqrt()
# grid.arrange(p1, p2, p3, ncol = 1)
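The effect of the + 1 offset is easy to check on a handful of values. A minimal sketch, with counts made up for illustration:

```r
friend_count <- c(0, 9, 99, 999)  # hypothetical counts, including one zero

raw_log     <- log10(friend_count)      # first element is -Inf
shifted_log <- log10(friend_count + 1)  # 0 1 2 3
root        <- sqrt(friend_count)       # defined at zero, no offset needed

raw_log[1]
shifted_log
root[1]
```

Without the offset, users with zero friends end up at -Inf and cannot be placed in any histogram bin.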
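There are two ways to transform the variable: the first wraps the variable directly in a function such as log10(), and the second adds a scaling layer such as scale_x_log10() to the plot. The difference is in the x axis: with the scaling layer it is labeled in the original units of friend_count, while with the wrapper it is labeled in units of log10(friend_count).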
logScale <- qplot(x = log10(friend_count), data = pf)
countScale <- ggplot(aes(x = friend_count), data = pf) +
geom_histogram() +
scale_x_log10()
grid.arrange(logScale, countScale, ncol = 2)
Frequency polygons let us show several groups of the data on the same plot. For example, we can compare the friend counts of the two genders in a single frequency polygon plot.
qplot(x = friend_count, y = ..count../sum(..count..),
data = subset(pf, !is.na(gender)),
xlab = "Friend Count",
ylab = "Proportion of Users with that Friend Count",
binwidth = 10, geom = "freqpoly", color = gender) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))
# Equivalent ggplot syntax:
# ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) +
# geom_freqpoly(aes(color = gender), binwidth=10) +
# scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
# xlab('Friend Count') +
# ylab('Proportion of users with that friend count')
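When turning counts into proportions, it matters what the counts are divided by. My understanding is that sum(..count..) is evaluated over all groups together, so the plot above would show shares of all users rather than shares within each gender; treat that as an assumption to verify. The sketch below uses a made-up data frame (`toy` is hypothetical) to show both normalizations in base R.

```r
# Hypothetical data with more males than females
toy <- data.frame(
  gender       = c(rep("female", 4), rep("male", 6)),
  friend_count = c(5, 15, 15, 25, 5, 5, 5, 15, 15, 25)
)

# Bins of width 10, mirroring binwidth = 10 in the plot
toy$bin <- cut(toy$friend_count, breaks = seq(0, 30, by = 10))
counts  <- table(toy$gender, toy$bin)

# Share of all users: the whole table sums to 1
share_of_all <- prop.table(counts)

# Share within each gender: every row sums to 1
share_within <- prop.table(counts, margin = 1)

share_of_all
share_within
```

With groups of different sizes, the within-gender shares are the fairer basis for comparing the shapes of the two distributions.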
Next, I will use frequency polygons to determine which gender gives more likes on the web version of Facebook (www_likes).
qplot(x = www_likes, data = subset(pf, !is.na(gender)),
xlab = "Number of Likes",
ylab = "Users with that Number of Likes",
geom = "freqpoly", color = gender) +
scale_x_log10()
by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
by(pf$www_likes, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 87.14 25.00 14860.00
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 24.42 2.00 12900.00
I will create box plots to compare the friend count distributions of the two genders visually.
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(pf$gender)),
geom = "boxplot")
qplot(x = gender, y = friend_count,
data = subset(pf, !is.na(pf$gender)),
geom = "boxplot") +
coord_cartesian(ylim = c(0,250))
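# Other ways to limit the y axis:
# + scale_y_continuous(limits = c(0,1000))
# or
# ylim = c(0,1000)
# Note: unlike coord_cartesian, both of these drop the points outside
# the limits before the boxplot statistics are computed.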
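The choice between coord_cartesian and scale limits matters because the box in a boxplot is built from quartiles. A minimal base R sketch, using made-up friend counts, shows how removing values above a limit changes the median, while zooming leaves it untouched:

```r
friend_count <- c(10, 50, 100, 200, 300, 2000, 4000)  # hypothetical values

# Zooming only (the coord_cartesian approach): statistics use all the data
zoomed_median <- median(friend_count)

# Dropping values above the limit first (the scale-limits approach)
filtered_median <- median(friend_count[friend_count <= 250])

zoomed_median    # 200
filtered_median  # 75
```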
The boxplot shows a slightly higher median number of friends for females than for males. To be sure about it, I use the by command.
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
I create an additional boxplot to answer the question of who initiated more friendships, men or women:
qplot(x = gender, y = friendships_initiated,
data = subset(pf, !is.na(pf$gender)),
geom = "boxplot") +
coord_cartesian(ylim = c(0,150))
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
Response: women initiated more friendships than men, both in terms of the mean (113.9 vs 103.1) and the median (49 vs 44).
It is possible to transform a variable into a boolean one. I will do this for the mobile_likes variable to find out whether a user has used a mobile check-in or not.
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE NA's
## logical 35056 63947 0
The summary shows that many users (35056) have never used a mobile check-in at all (mobile_likes = 0). So, I create a new variable mobile_check_in in the dataset and set its value to 1 or 0 depending on whether a user has checked in via mobile or not.
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes>0, 1, 0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## 0 1
## 35056 63947
Now I can analyze the users’ behaviour in terms of mobile check-ins further, e.g. calculate the share of users who have checked in using mobile.
sum(pf$mobile_check_in==1)/length(pf$mobile_check_in)
## [1] 0.6459097
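The same share can be obtained more directly, because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch on made-up values (not the real mobile_likes column):

```r
mobile_likes <- c(0, 3, 0, 12, 7)  # hypothetical values

# Logical vector: TRUE where the user has at least one mobile like
used_mobile <- mobile_likes > 0
share <- mean(used_mobile)
share  # 0.6

# The factor version from the text gives the same share
check_in     <- factor(ifelse(mobile_likes > 0, 1, 0))
factor_share <- sum(check_in == 1) / length(check_in)
factor_share
```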
I describe exploratory data analysis with more than one variable in further documents. Please check my website for more details.