This report examines to what extent the variables in the Facebook data set can be used to describe user engagement, and how those variables correlate with one another.
As of Q1 2019, Facebook is one of the most important digital advertising channels that organizations invest in. It is crucial to know Facebook users in order to segment and target them based on their status and engagement. In this data set the median user age is 28, so roughly half of the users are under 30, and there are more male accounts (58,574) than female accounts (40,254). There is a positive correlation between friend count and likes, but it is weak (about 0.2 to 0.3).
1- Introduction
2- About the Data Set
3- Data Collection and Understanding
4- Data Wrangling
5- Data Exploration
6- Conclusion
The aim of this report is to perform an exploratory analysis of the collected data set and see whether we can find surprising facts and correlations around Facebook user engagement. The business questions we are trying to answer are: "Which variables have an impact on a Facebook account and its engagement? How does engagement differ across age groups and gender? What are the correlations between the Facebook variables?"
Features of the data can lead us to unexpected results about how users set up their Facebook accounts and how they engage within the social network. Our approach focuses mainly on understanding the variables within the data set and their impact on user engagement.
The data set is provided by Facebook as part of an open source exploratory analysis exercise for data scientists to learn and grow their skills. The variable descriptions are as follows:
userid: The account number of the Facebook user
age: Age of the Facebook user
dob_day: Day of the month of the user's date of birth
dob_year: Year of the user's date of birth
dob_month: Month of the user's date of birth
gender: Gender of the Facebook user
tenure: Length of time the user has been on Facebook (in days)
friend_count: Friend count of the Facebook user
friendships_initiated: Number of friendships initiated by the Facebook user
likes: Likes given by the Facebook user
likes_received: Likes received on the Facebook user's posts
mobile_likes: Likes given by the Facebook user from mobile
mobile_likes_received: Likes received on the Facebook user's posts from mobile
www_likes: Likes given by the Facebook user from desktop
www_likes_received: Likes received on the Facebook user's posts from desktop
fb_data <- read.csv(file='pseudo_facebook.tsv', sep='\t')
head(fb_data)
## userid age dob_day dob_year dob_month gender tenure friend_count
## 1 2094382 14 19 1999 11 male 266 0
## 2 1192601 14 2 1999 11 female 6 0
## 3 2083884 14 16 1999 11 male 13 0
## 4 1203168 14 25 1999 12 female 93 0
## 5 1733186 14 4 1999 12 male 82 0
## 6 1524765 14 1 1999 12 male 15 0
## friendships_initiated likes likes_received mobile_likes
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## mobile_likes_received www_likes www_likes_received
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
str(fb_data)
## 'data.frame': 99003 obs. of 15 variables:
## $ userid : int 2094382 1192601 2083884 1203168 1733186 1524765 1136133 1680361 1365174 1712567 ...
## $ age : int 14 14 14 14 14 14 13 13 13 13 ...
## $ dob_day : int 19 2 16 25 4 1 14 4 1 2 ...
## $ dob_year : int 1999 1999 1999 1999 1999 1999 2000 2000 2000 2000 ...
## $ dob_month : int 11 11 11 12 12 12 1 1 1 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 1 2 2 ...
## $ tenure : int 266 6 13 93 82 15 12 0 81 171 ...
## $ friend_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ friendships_initiated: int 0 0 0 0 0 0 0 0 0 0 ...
## $ likes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ likes_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mobile_likes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mobile_likes_received: int 0 0 0 0 0 0 0 0 0 0 ...
## $ www_likes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ www_likes_received : int 0 0 0 0 0 0 0 0 0 0 ...
When we look at the data set details, we see that it contains 99,003 observations of 15 variables:
userid, age, dob_day, dob_year, dob_month, gender, tenure, friend_count, friendships_initiated, likes, likes_received, mobile_likes, mobile_likes_received, www_likes, www_likes_received.
We have one categorical variable, gender, which has 2 levels: female and male.
levels(fb_data$gender)
## [1] "female" "male"
Let’s get the summary of the data set as an overview.
summary(fb_data)
## userid age dob_day dob_year
## Min. :1000008 Min. : 13.00 Min. : 1.00 Min. :1900
## 1st Qu.:1298806 1st Qu.: 20.00 1st Qu.: 7.00 1st Qu.:1963
## Median :1596148 Median : 28.00 Median :14.00 Median :1985
## Mean :1597045 Mean : 37.28 Mean :14.53 Mean :1976
## 3rd Qu.:1895744 3rd Qu.: 50.00 3rd Qu.:22.00 3rd Qu.:1993
## Max. :2193542 Max. :113.00 Max. :31.00 Max. :2000
##
## dob_month gender tenure friend_count
## Min. : 1.000 female:40254 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 male :58574 1st Qu.: 226.0 1st Qu.: 31.0
## Median : 6.000 NA's : 175 Median : 412.0 Median : 82.0
## Mean : 6.283 Mean : 537.9 Mean : 196.4
## 3rd Qu.: 9.000 3rd Qu.: 675.0 3rd Qu.: 206.0
## Max. :12.000 Max. :3139.0 Max. :4923.0
## NA's :2
## friendships_initiated likes likes_received
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 17.0 1st Qu.: 1.0 1st Qu.: 1.0
## Median : 46.0 Median : 11.0 Median : 8.0
## Mean : 107.5 Mean : 156.1 Mean : 142.7
## 3rd Qu.: 117.0 3rd Qu.: 81.0 3rd Qu.: 59.0
## Max. :4144.0 Max. :25111.0 Max. :261197.0
##
## mobile_likes mobile_likes_received www_likes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 4.0 Median : 4.00 Median : 0.00
## Mean : 106.1 Mean : 84.12 Mean : 49.96
## 3rd Qu.: 46.0 3rd Qu.: 33.00 3rd Qu.: 7.00
## Max. :25111.0 Max. :138561.00 Max. :14865.00
##
## www_likes_received
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean : 58.57
## 3rd Qu.: 20.00
## Max. :129953.00
##
Let’s see if there are any missing values within our data set.
sum(is.na(fb_data$userid))
## [1] 0
sum(is.na(fb_data$age))
## [1] 0
sum(is.na(fb_data$dob_day))
## [1] 0
sum(is.na(fb_data$dob_year))
## [1] 0
sum(is.na(fb_data$dob_month))
## [1] 0
sum(is.na(fb_data$gender))
## [1] 175
sum(is.na(fb_data$tenure))
## [1] 2
sum(is.na(fb_data$friend_count))
## [1] 0
sum(is.na(fb_data$friendships_initiated))
## [1] 0
sum(is.na(fb_data$likes))
## [1] 0
sum(is.na(fb_data$likes_received))
## [1] 0
sum(is.na(fb_data$mobile_likes))
## [1] 0
sum(is.na(fb_data$mobile_likes_received))
## [1] 0
sum(is.na(fb_data$www_likes))
## [1] 0
sum(is.na(fb_data$www_likes_received))
## [1] 0
We see that 175 observations are missing a value for gender and 2 observations are missing a value for tenure.
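The column-by-column checks above can also be collapsed into a single call. The following is a minimal sketch on a small, made-up data frame (`toy` and its values are illustrative assumptions, not rows from `pseudo_facebook.tsv`), using `colSums()` over `is.na()` to count the missing values per column:

```r
# Hypothetical toy data standing in for fb_data; the values are made up.
toy <- data.frame(
  gender = factor(c("male", NA, "female", "male")),
  tenure = c(266, 6, NA, 93),
  likes  = c(0, 3, 1, 7)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE entries in each column.
na_counts <- colSums(is.na(toy))

# Show only the columns that actually contain missing values.
na_counts[na_counts > 0]
```

Applied to `fb_data`, the same call should report the 175 missing `gender` values and the 2 missing `tenure` values found above.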
Now that we have a basic understanding of our data set, we can clean and structure it to help with our data exploration and analysis.
Below is the list of actions we can take to clean the data set:
1- The user id is not needed, as it does not help answer the questions we are trying to solve. We can remove that column.
2- There are 99,003 observations in our data set. Given that volume, we can drop the 175 observations with missing gender and the 2 observations with missing tenure.
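exclude_vars <- names(fb_data) %in% c('userid') # flag the columns to exclude
fb_data_new <- fb_data[!exclude_vars] # copy of the data without the flagged columns
# Drop the rows that have missing values. Note that na.omit() is applied to fb_data rather than
# to fb_data_new, so the previous step is overwritten and userid is still present in the output below.
# Use na.omit(fb_data_new) instead to keep userid excluded.
fb_data_new <- na.omit(fb_data)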
head(fb_data_new)
## userid age dob_day dob_year dob_month gender tenure friend_count
## 1 2094382 14 19 1999 11 male 266 0
## 2 1192601 14 2 1999 11 female 6 0
## 3 2083884 14 16 1999 11 male 13 0
## 4 1203168 14 25 1999 12 female 93 0
## 5 1733186 14 4 1999 12 male 82 0
## 6 1524765 14 1 1999 12 male 15 0
## friendships_initiated likes likes_received mobile_likes
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## mobile_likes_received www_likes www_likes_received
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
sum(is.na(fb_data_new$gender))
## [1] 0
sum(is.na(fb_data_new$tenure))
## [1] 0
We cleaned our data set so we can further start our analysis.
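The two cleaning steps described above (removing `userid`, then dropping incomplete rows) can be chained so that the second step builds on the result of the first. This is a minimal sketch on a made-up data frame; `toy` and `toy_clean` are hypothetical names used only for illustration:

```r
# Hypothetical toy data with an identifier column and some missing values.
toy <- data.frame(
  userid = c(101, 102, 103, 104),
  gender = factor(c("male", NA, "female", "male")),
  tenure = c(266, 6, NA, 93)
)

# Step 1: drop the identifier column.
toy_clean <- toy[, setdiff(names(toy), "userid"), drop = FALSE]

# Step 2: drop incomplete rows, starting from the result of step 1.
toy_clean <- na.omit(toy_clean)

names(toy_clean)  # userid is no longer present
nrow(toy_clean)   # only the complete rows remain
```

Passing the output of the first step into `na.omit()` is what keeps the column removal in effect.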
install.packages('ggplot2', repos="http://cran.us.r-project.org")
## Installing package into 'C:/Users/Anil Akyildirim/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Anil Akyildirim\AppData\Local\Temp\Rtmp4K0l6o\downloaded_packages
library(ggplot2)
We can look at how the users' dates of birth are distributed across the days of each month.
ggplot(data = fb_data_new, aes(x = dob_day)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:31) + # dob_day is an integer, so use a continuous scale
  facet_wrap(~dob_month, ncol = 4) # one panel per month
Please keep in mind that the panel numbers represent the month, so 1 stands for January, 2 stands for February, and so on. We can see that a disproportionately large number of users set their date of birth to the first of January. This might be due to the default settings of the Facebook sign-up form and users simply not changing them. We might also assume that, for this user set, the information provided within the Facebook account does not always represent the correct user information.
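To put a number on the January 1 spike rather than reading it off the plot, the birthday combinations can be tabulated directly. This is a minimal sketch, assuming a made-up data frame `toy` that reuses the `dob_month` and `dob_day` column names of the data set:

```r
# Hypothetical toy data; four of the six users report a January birthday.
toy <- data.frame(
  dob_month = c(1, 1, 1, 5, 12, 1),
  dob_day   = c(1, 1, 1, 20, 25, 2)
)

# Share of users whose recorded birthday is January 1.
share_jan_1 <- mean(toy$dob_month == 1 & toy$dob_day == 1)

# Count every month/day combination and keep the most frequent one.
birthday_counts <- as.data.frame(table(month = toy$dob_month, day = toy$dob_day))
top_birthday <- birthday_counts[which.max(birthday_counts$Freq), ]

share_jan_1
top_birthday
```

Comparing the observed share with the roughly 1/365 expected under evenly spread birthdays gives a quick sense of how unusual the spike is.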
library(ggplot2)
theme_set(theme_classic())
# Histogram of a continuous (numeric) variable, filled by gender
g <- ggplot(fb_data_new, aes(age)) + scale_fill_brewer(palette = "Spectral")
g + geom_histogram(aes(fill = gender),
                   binwidth = .1, # narrow bins: one thin bar per integer age
                   col = "black",
                   size = .1) +
  labs(title = "Histogram with Gender",
       subtitle = "User count across age, filled by gender")
g + geom_histogram(aes(fill = gender),
                   bins = 5, # fixed number of bins instead of a bin width
                   col = "black",
                   size = .1) +
  labs(title = "Histogram with Fixed Bins and Gender",
       subtitle = "User count across five age bins, filled by gender")
When we look at the age and gender variables, we see that the largest share of the Facebook accounts belongs to users below 30 years of age (the median age is 28). As age increases, the number of accounts goes down. The gender summary above shows more male (58,574) than female (40,254) accounts overall. We also see some age values above 100, which we can assume were not provided correctly by the user.
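The age and gender breakdown read from the histograms can be cross-checked with a simple cross-tabulation. This is a minimal sketch on made-up values; the age bands and the `age_group` column are assumptions chosen for illustration and are not part of the data set:

```r
# Hypothetical toy data; the values are made up.
toy <- data.frame(
  age    = c(14, 22, 28, 35, 51, 103),
  gender = factor(c("male", "female", "male", "male", "female", "male"))
)

# Bin ages into bands; intervals are right-closed, e.g. (12, 29] covers ages 13 to 29.
toy$age_group <- cut(toy$age,
                     breaks = c(12, 29, 49, 99, Inf),
                     labels = c("13-29", "30-49", "50-99", "100+"))

# Number of users per age band and gender.
age_gender <- table(toy$age_group, toy$gender)

# Share of users younger than 30.
share_under_30 <- mean(toy$age < 30)

age_gender
share_under_30
```

A separate "100+" band keeps the implausible ages visible instead of letting them blend into the oldest group.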
summary(fb_data_new$friend_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 31.0 82.0 196.4 206.0 4923.0
summary(fb_data_new$www_likes_received)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 2.00 58.55 20.00 129953.00
The summaries of friend count and desktop likes received (www_likes_received) give us some insight into both variables. The mean friend count is 196.4 while the median is 82, and the user with the most friends has 4,923. The mean number of desktop likes received is 58.55 while the median is only 2, so both distributions are heavily right-skewed.
options(scipen = 999) # turn off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme
# Scatterplot of friend count against desktop likes received
gg_1 <- ggplot(fb_data_new, aes(x = friend_count, y = www_likes_received)) +
  geom_point(aes(col = gender, size = age)) +
  geom_smooth(method = "loess", se = FALSE) +
  xlim(c(0, 5000)) +
  ylim(c(0, 12300)) + # rows above this limit are dropped, hence the warnings below
  labs(subtitle = "Friend Count vs Desktop Likes Received",
       y = "Desktop Likes Received",
       x = "Friend Count",
       title = "Scatterplot",
       caption = "Source: fb_data_new")
plot(gg_1)
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
We are able to see some correlation between friend count and likes received. However it is not as strong as one would expect.
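The strength of this relationship can be quantified instead of judged by eye. This is a minimal sketch using simulated counts (the vectors below are generated, not taken from the data set). It reports Pearson's r with its p-value from `cor.test()` and, because like counts are heavily skewed, Spearman's rank correlation as a more robust companion:

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-ins for friend_count and www_likes_received.
friend_count   <- rpois(500, lambda = 80)
likes_received <- rpois(500, lambda = 2 + 0.05 * friend_count)

# Pearson correlation together with a significance test.
pearson <- cor.test(friend_count, likes_received, method = "pearson")
pearson$estimate  # strength of the linear relationship
pearson$p.value   # evidence against a true correlation of zero

# Spearman rank correlation, less sensitive to extreme values.
spearman <- cor(friend_count, likes_received, method = "spearman")
spearman
```

A weak correlation can still be statistically significant in a large sample, so strength and significance are best reported separately.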
library(ggplot2)
theme_set(theme_classic())
# Density plot of friend count, one curve per gender
g_1 <- ggplot(fb_data_new, aes(friend_count))
g_1 + geom_density(aes(fill = factor(gender)), alpha = 0.8) +
  xlim(c(0, 1000)) + # users with more than 1000 friends are dropped, hence the warning below
  labs(title = "Density plot",
       subtitle = "Friend Count grouped by Gender",
       caption = "Source: fb_data_new",
       x = "Friend Count",
       fill = "Gender")
## Warning: Removed 2949 rows containing non-finite values (stat_density).
Let’s see the correlation between variables. For us to do that, we need to create a numeric type data frame.
library(ggplot2)
library(ggcorrplot)
# Build a numeric-only data frame: cor() needs numeric columns, so drop the gender factor
exclude_vars <- names(fb_data_new) %in% c('gender') # flag the columns to exclude
fb_data_new_numeric <- fb_data_new[!exclude_vars] # keep every column except the flagged ones
corr <- round(cor(fb_data_new_numeric), 1) # pairwise Pearson correlations, rounded to 1 decimal
corr
## userid age dob_day dob_year dob_month tenure
## userid 1 0.0 0.0 0.0 0.0 0.0
## age 0 1.0 0.0 -1.0 0.0 0.5
## dob_day 0 0.0 1.0 0.0 0.1 0.0
## dob_year 0 -1.0 0.0 1.0 0.0 -0.5
## dob_month 0 0.0 0.1 0.0 1.0 0.0
## tenure 0 0.5 0.0 -0.5 0.0 1.0
## friend_count 0 0.0 0.0 0.0 0.0 0.2
## friendships_initiated 0 -0.1 0.0 0.1 0.0 0.1
## likes 0 0.0 0.0 0.0 0.0 0.1
## likes_received 0 0.0 0.0 0.0 0.0 0.0
## mobile_likes 0 0.0 0.0 0.0 0.0 0.0
## mobile_likes_received 0 0.0 0.0 0.0 0.0 0.0
## www_likes 0 0.0 0.0 0.0 0.0 0.1
## www_likes_received 0 0.0 0.0 0.0 0.0 0.0
## friend_count friendships_initiated likes
## userid 0.0 0.0 0.0
## age 0.0 -0.1 0.0
## dob_day 0.0 0.0 0.0
## dob_year 0.0 0.1 0.0
## dob_month 0.0 0.0 0.0
## tenure 0.2 0.1 0.1
## friend_count 1.0 0.8 0.3
## friendships_initiated 0.8 1.0 0.3
## likes 0.3 0.3 1.0
## likes_received 0.2 0.2 0.3
## mobile_likes 0.2 0.2 0.9
## mobile_likes_received 0.2 0.2 0.3
## www_likes 0.2 0.2 0.6
## www_likes_received 0.2 0.2 0.3
## likes_received mobile_likes mobile_likes_received
## userid 0.0 0.0 0.0
## age 0.0 0.0 0.0
## dob_day 0.0 0.0 0.0
## dob_year 0.0 0.0 0.0
## dob_month 0.0 0.0 0.0
## tenure 0.0 0.0 0.0
## friend_count 0.2 0.2 0.2
## friendships_initiated 0.2 0.2 0.2
## likes 0.3 0.9 0.3
## likes_received 1.0 0.3 1.0
## mobile_likes 0.3 1.0 0.3
## mobile_likes_received 1.0 0.3 1.0
## www_likes 0.3 0.2 0.2
## www_likes_received 0.9 0.2 0.9
## www_likes www_likes_received
## userid 0.0 0.0
## age 0.0 0.0
## dob_day 0.0 0.0
## dob_year 0.0 0.0
## dob_month 0.0 0.0
## tenure 0.1 0.0
## friend_count 0.2 0.2
## friendships_initiated 0.2 0.2
## likes 0.6 0.3
## likes_received 0.3 0.9
## mobile_likes 0.2 0.2
## mobile_likes_received 0.2 0.9
## www_likes 1.0 0.3
## www_likes_received 0.3 1.0
# Plot the lower triangle of the correlation matrix as a correlogram
ggcorrplot(corr,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method = "circle",
           colors = c("tomato2", "white", "springgreen3"),
           title = "Correlogram of fb_data_new_numeric",
           ggtheme = theme_bw)
Based on the correlogram of the data set, we are able to see the following correlations:
1- There is a negative correlation (-1.0 after rounding) between year of birth (dob_year) and age, which is expected
2- There is a moderate positive correlation (0.5) between tenure and age
3- There is a weak positive correlation (0.2) between friend count and tenure
4- There is a strong positive correlation (0.8) between friendships initiated and friend count
5- There is a weak positive correlation (0.3) between likes and both friend count and friendships initiated
6- There is a strong positive correlation (0.9) between likes and mobile likes
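The pairs listed above were read from the correlogram; they can also be extracted programmatically by ranking the entries of the correlation matrix. This is a minimal sketch on a made-up numeric data frame (`toy`, `pairs_df` and the column values are illustrative assumptions):

```r
# Hypothetical toy data: b closely follows a, c is unrelated noise.
toy <- data.frame(
  a = 1:10,
  b = 2 * (1:10) + c(0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5),
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

corr <- cor(toy)

# Take each pair once by indexing the upper triangle of the symmetric matrix.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

# Strongest relationships first, regardless of sign.
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
pairs_df
```

Sorting by absolute value keeps strong negative correlations, such as the one between dob_year and age, at the top of the list alongside the strong positive ones.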
For Facebook users and their engagement, age and gender shape the overall user base. The largest share of users is below 30 years old, and the summary counts show more male accounts (58,574) than female accounts (40,254). One interesting finding is that, even though there is a correlation between friend count and likes, it is small. This means users might have a large number of friends but may not be getting the engagement that they are looking for. Another thing to note is that most of the likes come from mobile rather than desktop.
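The mobile-versus-desktop observation can be checked against the means reported by `summary(fb_data)` earlier in the report (106.1 mobile likes and 49.96 desktop likes per user). This is a minimal worked calculation, assuming those rounded means are representative of the cleaned data as well:

```r
mean_mobile_likes <- 106.1  # mean of mobile_likes from the summary output
mean_www_likes    <- 49.96  # mean of www_likes from the summary output

# Fraction of all likes given that come from mobile.
share_mobile <- mean_mobile_likes / (mean_mobile_likes + mean_www_likes)
round(share_mobile, 2)  # 0.68
```

On these figures, roughly two thirds of the likes are given from mobile.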
Key Takeaway: When creating Facebook posts and content, it is suggested to target an audience below 30 years old. Do not only increase the number of Facebook followers, but also focus on the content itself to get the most engagement. Content should also be designed for mobile rather than desktop.