EXECUTIVE SUMMARY

This report examines which variables in the data set can be used to describe Facebook user engagement, and how strongly those variables correlate with engagement and with one another.

As of Q1 2019, Facebook is one of the most important digital advertising channels that organizations invest in. It is crucial to understand Facebook users in order to segment and target them based on their status and engagement. Most of the users in this data set are under 30 (the median age is 28), and there are more male users (58,574) than female users (40,254). There is a positive correlation between friend count and likes, but it is weak (about 0.2 to 0.3).

1- Introduction

2- About the Data Set

3- Data Collection and Understanding

4- Data Wrangling

5- Data Exploration

6- Conclusion

INTRODUCTION

The aim of this report is to perform an exploratory analysis of the collected data set and see whether we can find surprising facts and correlations around Facebook user engagement. The business questions we are trying to answer are: Which variables have an impact on Facebook account activity and engagement? How does engagement differ across age groups and genders? What are the correlations between the Facebook variables?

Features of the data can lead us to unexpected results about how users set up their Facebook accounts and how they engage within the social network. Our approach focuses mainly on understanding the variables within the data set and their impact on user engagement.

ABOUT THE DATA SET

The data set is provided by Facebook as part of an open source exploratory analysis exercise for data scientists to learn and grow their skills. The variable descriptions are as follows:

userid: The account number of the Facebook user

age: Age of the Facebook user

dob_day: Day of the month of the user's date of birth

dob_year: Year of the user's date of birth

dob_month: Month of the user's date of birth

gender: Gender of the Facebook user

tenure: How long the user has been on Facebook (the values appear to be a number of days)

friend_count: Friend count of the Facebook user

friendships_initiated: Friendships initiated by the Facebook user

likes: Likes given by the Facebook user

likes_received: Likes received on the user's posts

mobile_likes: Likes given by the user from mobile

mobile_likes_received: Likes received from mobile on the user's posts

www_likes: Likes given by the user from desktop

www_likes_received: Likes received from desktop on the user's posts

DATA COLLECTION AND UNDERSTANDING

fb_data <- read.csv(file='pseudo_facebook.tsv', sep='\t', stringsAsFactors=TRUE) # keep gender as a factor; the default became FALSE in R 4.0
head(fb_data)

##    userid age dob_day dob_year dob_month gender tenure friend_count
## 1 2094382  14      19     1999        11   male    266            0
## 2 1192601  14       2     1999        11 female      6            0
## 3 2083884  14      16     1999        11   male     13            0
## 4 1203168  14      25     1999        12 female     93            0
## 5 1733186  14       4     1999        12   male     82            0
## 6 1524765  14       1     1999        12   male     15            0
##   friendships_initiated likes likes_received mobile_likes
## 1                     0     0              0            0
## 2                     0     0              0            0
## 3                     0     0              0            0
## 4                     0     0              0            0
## 5                     0     0              0            0
## 6                     0     0              0            0
##   mobile_likes_received www_likes www_likes_received
## 1                     0         0                  0
## 2                     0         0                  0
## 3                     0         0                  0
## 4                     0         0                  0
## 5                     0         0                  0
## 6                     0         0                  0

str(fb_data)

## 'data.frame':    99003 obs. of  15 variables:
##  $ userid               : int  2094382 1192601 2083884 1203168 1733186 1524765 1136133 1680361 1365174 1712567 ...
##  $ age                  : int  14 14 14 14 14 14 13 13 13 13 ...
##  $ dob_day              : int  19 2 16 25 4 1 14 4 1 2 ...
##  $ dob_year             : int  1999 1999 1999 1999 1999 1999 2000 2000 2000 2000 ...
##  $ dob_month            : int  11 11 11 12 12 12 1 1 1 2 ...
##  $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 1 2 2 ...
##  $ tenure               : int  266 6 13 93 82 15 12 0 81 171 ...
##  $ friend_count         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ friendships_initiated: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ likes                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ likes_received       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mobile_likes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mobile_likes_received: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ www_likes            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ www_likes_received   : int  0 0 0 0 0 0 0 0 0 0 ...

When we look at the data set details, we see there are 15 variables in the data set:

userid, age, dob_day, dob_year, dob_month, gender, tenure, friend_count, friendships_initiated, likes, likes_received, mobile_likes, mobile_likes_received, www_likes, www_likes_received.

We have one categorical variable, gender, which has 2 levels: female and male.

levels(fb_data$gender)

## [1] "female" "male"

Let’s get the summary of the data set as an overview.

summary(fb_data)

##      userid             age            dob_day         dob_year   
##  Min.   :1000008   Min.   : 13.00   Min.   : 1.00   Min.   :1900  
##  1st Qu.:1298806   1st Qu.: 20.00   1st Qu.: 7.00   1st Qu.:1963  
##  Median :1596148   Median : 28.00   Median :14.00   Median :1985  
##  Mean   :1597045   Mean   : 37.28   Mean   :14.53   Mean   :1976  
##  3rd Qu.:1895744   3rd Qu.: 50.00   3rd Qu.:22.00   3rd Qu.:1993  
##  Max.   :2193542   Max.   :113.00   Max.   :31.00   Max.   :2000  
##                                                                   
##    dob_month         gender          tenure        friend_count   
##  Min.   : 1.000   female:40254   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 3.000   male  :58574   1st Qu.: 226.0   1st Qu.:  31.0  
##  Median : 6.000   NA's  :  175   Median : 412.0   Median :  82.0  
##  Mean   : 6.283                  Mean   : 537.9   Mean   : 196.4  
##  3rd Qu.: 9.000                  3rd Qu.: 675.0   3rd Qu.: 206.0  
##  Max.   :12.000                  Max.   :3139.0   Max.   :4923.0  
##                                  NA's   :2                        
##  friendships_initiated     likes         likes_received    
##  Min.   :   0.0        Min.   :    0.0   Min.   :     0.0  
##  1st Qu.:  17.0        1st Qu.:    1.0   1st Qu.:     1.0  
##  Median :  46.0        Median :   11.0   Median :     8.0  
##  Mean   : 107.5        Mean   :  156.1   Mean   :   142.7  
##  3rd Qu.: 117.0        3rd Qu.:   81.0   3rd Qu.:    59.0  
##  Max.   :4144.0        Max.   :25111.0   Max.   :261197.0  
##                                                            
##   mobile_likes     mobile_likes_received   www_likes       
##  Min.   :    0.0   Min.   :     0.00     Min.   :    0.00  
##  1st Qu.:    0.0   1st Qu.:     0.00     1st Qu.:    0.00  
##  Median :    4.0   Median :     4.00     Median :    0.00  
##  Mean   :  106.1   Mean   :    84.12     Mean   :   49.96  
##  3rd Qu.:   46.0   3rd Qu.:    33.00     3rd Qu.:    7.00  
##  Max.   :25111.0   Max.   :138561.00     Max.   :14865.00  
##                                                            
##  www_likes_received 
##  Min.   :     0.00  
##  1st Qu.:     0.00  
##  Median :     2.00  
##  Mean   :    58.57  
##  3rd Qu.:    20.00  
##  Max.   :129953.00  
##

Let’s see if there are any missing values within our data set.

sum(is.na(fb_data$userid))

## [1] 0

sum(is.na(fb_data$age))

## [1] 0

sum(is.na(fb_data$dob_day))

## [1] 0

sum(is.na(fb_data$dob_year))

## [1] 0

sum(is.na(fb_data$dob_month))

## [1] 0

sum(is.na(fb_data$gender))

## [1] 175

sum(is.na(fb_data$tenure))

## [1] 2

sum(is.na(fb_data$friend_count))

## [1] 0

sum(is.na(fb_data$friendships_initiated))

## [1] 0

sum(is.na(fb_data$likes))

## [1] 0

sum(is.na(fb_data$likes_received))

## [1] 0

sum(is.na(fb_data$mobile_likes))

## [1] 0

sum(is.na(fb_data$mobile_likes_received))

## [1] 0

sum(is.na(fb_data$www_likes))

## [1] 0

sum(is.na(fb_data$www_likes_received))

## [1] 0

We see that 175 values are missing in the gender variable and 2 values are missing in the tenure variable.
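The column-by-column checks above can also be done in a single call. The following is a minimal sketch, assuming a small made-up data frame `toy` that stands in for fb_data (its values are invented purely for illustration):

```r
# Hypothetical stand-in for fb_data; the values are invented for illustration.
toy <- data.frame(
  gender = factor(c("male", NA, "female", "male")),
  tenure = c(266, 6, NA, 93),
  friend_count = c(0, 12, 5, 40)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
```

Applied to the real data, the same call would be `colSums(is.na(fb_data))`.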

Now that we have a basic understanding of our data set, we can clean and structure it to help us with our data exploration and analysis.

DATA WRANGLING

Below is the list of actions we can take to clean the data set:

1- User id is something we do not need, as it does not help answer the problem that we are trying to solve. We can remove that variable column.

2- There are 99003 observations within our data set. Considering the number of observations, we can drop the rows with the 175 missing gender values and the 2 missing tenure values.

exclude_vars <- names(fb_data) %in% c('userid') # selecting variables to exclude.
fb_data_new <- fb_data[!exclude_vars] # excluding selected variables for the new dataset.


fb_data_new <- na.omit(fb_data_new) # exclude the rows that have missing values, keeping userid dropped.
head(fb_data_new)

##   age dob_day dob_year dob_month gender tenure friend_count
## 1  14      19     1999        11   male    266            0
## 2  14       2     1999        11 female      6            0
## 3  14      16     1999        11   male     13            0
## 4  14      25     1999        12 female     93            0
## 5  14       4     1999        12   male     82            0
## 6  14       1     1999        12   male     15            0
##   friendships_initiated likes likes_received mobile_likes
## 1                     0     0              0            0
## 2                     0     0              0            0
## 3                     0     0              0            0
## 4                     0     0              0            0
## 5                     0     0              0            0
## 6                     0     0              0            0
##   mobile_likes_received www_likes www_likes_received
## 1                     0         0                  0
## 2                     0         0                  0
## 3                     0         0                  0
## 4                     0         0                  0
## 5                     0         0                  0
## 6                     0         0                  0

sum(is.na(fb_data_new$gender))

## [1] 0

sum(is.na(fb_data_new$tenure))

## [1] 0

We have cleaned our data set, so we can now start our analysis.
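A quick sanity check after cleaning is to confirm how many rows were lost and that no missing values remain. The sketch below assumes a small invented data frame `toy` in place of fb_data; the names `toy_clean` and `rows_dropped` are only illustrative:

```r
# Hypothetical stand-in for fb_data; the values are invented for illustration.
toy <- data.frame(
  userid = c(1, 2, 3, 4),
  gender = factor(c("male", NA, "female", "male")),
  tenure = c(266, 6, NA, 93)
)

# Drop the identifier column first, then drop rows with any missing value.
toy_clean <- na.omit(toy[, names(toy) != "userid", drop = FALSE])

# Compare row counts before and after to see what the cleaning cost.
rows_dropped <- nrow(toy) - nrow(toy_clean)
cat("rows dropped:", rows_dropped, "\n")
cat("share dropped:", rows_dropped / nrow(toy), "\n")

# Fail loudly if any missing value survived the cleaning.
stopifnot(!anyNA(toy_clean))
```

Doing the column removal and the `na.omit()` call on the same object keeps both cleaning steps in the final data frame.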

DATA EXPLORATION

# Install ggplot2 only if it is not already available.
if (!requireNamespace('ggplot2', quietly = TRUE)) {
  install.packages('ggplot2', repos="http://cran.us.r-project.org")
}

library(ggplot2)

We can look at the data set and see the number of Facebook accounts by birth day and birth month.

ggplot(data = fb_data_new, aes(x = dob_day)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:31) + # dob_day is numeric, so a continuous scale is needed
  facet_wrap(~dob_month, ncol = 4)

Please keep in mind that the numbers represent the months, so 1 stands for January, 2 stands for February, and so on. We can see that a disproportionately large number of users set their date of birth as the first of January. This might be due to the default settings of the Facebook sign-up form and users simply not changing them. We might also assume that, for this user set, the information provided within the Facebook account does not always represent the correct user information.
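The spike on the first of January can also be checked numerically by counting users per birth month and day. This is a minimal sketch on an invented data frame `toy`; the helper column `n` and the name `birthday_counts` are assumptions made for the example:

```r
# Hypothetical stand-in for fb_data_new; the values are invented for illustration.
toy <- data.frame(
  dob_month = c(1, 1, 1, 2, 12, 1),
  dob_day   = c(1, 1, 15, 2, 25, 1)
)

# Add a column of ones and sum it per (month, day) pair to get user counts.
birthday_counts <- aggregate(n ~ dob_month + dob_day,
                             data = transform(toy, n = 1),
                             FUN = sum)

# Sort so that the most common birthday comes first.
birthday_counts <- birthday_counts[order(-birthday_counts$n), ]
print(head(birthday_counts, 3))
```

On the real data, comparing the top count with the typical count per day would show how unusual the first of January is.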

library(ggplot2)
theme_set(theme_classic())

# Histogram on a Continuous (Numeric) Variable
g <- ggplot(fb_data_new, aes(age)) + scale_fill_brewer(palette = "Spectral")

g + geom_histogram(aes(fill=gender), 
                   binwidth = 1, 
                   col="black", 
                   size=.1) +  # one bin per year of age
  labs(title="Histogram with Gender", 
       subtitle="User count across Age")

g + geom_histogram(aes(fill=gender), 
                   bins=5, 
                   col="black", 
                   size=.1) +   # change number of bins
  labs(title="Histogram with Fixed Bins Gender", 
       subtitle="User count across Age, split by Gender")

When we look at the age and gender variables, we see that the majority of the Facebook accounts belong to users below 30 years of age (the median age is 28). Overall, male accounts (58,574) outnumber female accounts (40,254) in the summary above. As age increases, the account count goes down. We also see some age values above 100 (the maximum is 113); we can assume that these ages were not entered correctly by the users.
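Since one of the questions is about age groups and gender, the histogram can be complemented with a simple cross-tabulation. The sketch below uses an invented data frame `toy`, and the age-group boundaries (13-29, 30-49, 50+) are an assumption chosen for illustration, not a grouping used elsewhere in this report:

```r
# Hypothetical stand-in for fb_data_new; the values are invented for illustration.
toy <- data.frame(
  age = c(14, 22, 28, 35, 51, 67, 103),
  gender = factor(c("male", "female", "male", "male",
                    "female", "female", "male"))
)

# cut() bins the ages; right = FALSE gives intervals [13,30), [30,50), [50,Inf).
toy$age_group <- cut(toy$age,
                     breaks = c(13, 30, 50, Inf),
                     labels = c("13-29", "30-49", "50+"),
                     right = FALSE)

# Count users per age group and gender.
counts <- table(toy$age_group, toy$gender)
print(counts)

# Share of each gender within every age group (rows sum to 1).
print(round(prop.table(counts, margin = 1), 2))
```

The row-wise shares make it easy to see whether the gender split changes with age.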

summary(fb_data_new$friend_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    31.0    82.0   196.4   206.0  4923.0

summary(fb_data_new$www_likes_received)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      0.00      2.00     58.55     20.00 129953.00

The summaries of friend count and desktop likes received (www_likes_received) give us some insight into both variables. The mean friend count is 196.4, well above the median of 82, and the largest friend count is 4923. The mean number of desktop likes received is 58.55, while the median is only 2, so both distributions are strongly right-skewed.

options(scipen=999)  # turn-off scientific notation like 1e+48

library(ggplot2)
theme_set(theme_bw())  # pre-set the bw theme.

# Scatterplot
gg_1 <- ggplot(fb_data_new, aes(x=friend_count, y=www_likes_received)) + 
  geom_point(aes(col=gender, size=age)) + 
  geom_smooth(method="loess", se=FALSE) + 
  xlim(c(0, 5000)) + 
  ylim(c(0, 12300)) + 
  labs(subtitle="Friend Count vs Likes Received", 
       y="Likes Received", 
       x="Friend Count", 
       title="Scatterplot", 
       caption = "Source: fb_data_new")

plot(gg_1)

## Warning: Removed 13 rows containing non-finite values (stat_smooth).

## Warning: Removed 13 rows containing missing values (geom_point).

We are able to see some correlation between friend count and likes received. However it is not as strong as one would expect.
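The strength of a correlation and its statistical significance are two different things, and both can be checked with `cor.test()`. The sketch below runs on simulated data, so the vectors and the relationship between them are assumptions made only for illustration. Spearman's rank correlation is included as an alternative that is less sensitive to the extreme values seen in count data like this:

```r
# Simulated stand-in data; the relationship below is invented for illustration.
set.seed(1)
n <- 200
friend_count <- rpois(n, lambda = 100)
likes_received <- rpois(n, lambda = 5 + 0.05 * friend_count)

# Pearson measures linear association; Spearman works on ranks.
pearson <- cor.test(friend_count, likes_received, method = "pearson")
spearman <- cor.test(friend_count, likes_received,
                     method = "spearman", exact = FALSE)

# The estimate is the strength, the p-value relates to significance.
cat("Pearson r:", round(pearson$estimate, 2),
    " p-value:", signif(pearson$p.value, 2), "\n")
cat("Spearman rho:", round(spearman$estimate, 2),
    " p-value:", signif(spearman$p.value, 2), "\n")
```

With tens of thousands of rows, even a weak correlation will often come out as statistically significant, so the size of the coefficient is usually the more informative number.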

library(ggplot2)
theme_set(theme_classic())

# Plot
g_1 <- ggplot(fb_data_new, aes(friend_count))
g_1 + geom_density(aes(fill=factor(gender)), alpha=0.8) + 
  xlim(c(0, 1000)) + 
    labs(title="Density plot", 
         subtitle="Friend Count grouped by Gender",
         caption="Source: fb_data_new",
         x="Friend Count",
         fill="Gender")

## Warning: Removed 2949 rows containing non-finite values (stat_density).

Let’s see the correlation between variables. For us to do that, we need to create a numeric type data frame.

library(ggplot2)
library(ggcorrplot)

# numeric fb_data
exclude_vars <- names(fb_data_new) %in% c('userid', 'gender') # exclude the identifier and the non-numeric factor.
fb_data_new_numeric <- fb_data_new[!exclude_vars] # excluding selected variables for the new dataset.


corr <- round(cor(fb_data_new_numeric), 1)
corr

##                        age dob_day dob_year dob_month tenure
## age                    1.0     0.0     -1.0       0.0    0.5
## dob_day                0.0     1.0      0.0       0.1    0.0
## dob_year              -1.0     0.0      1.0       0.0   -0.5
## dob_month              0.0     0.1      0.0       1.0    0.0
## tenure                 0.5     0.0     -0.5       0.0    1.0
## friend_count           0.0     0.0      0.0       0.0    0.2
## friendships_initiated -0.1     0.0      0.1       0.0    0.1
## likes                  0.0     0.0      0.0       0.0    0.1
## likes_received         0.0     0.0      0.0       0.0    0.0
## mobile_likes           0.0     0.0      0.0       0.0    0.0
## mobile_likes_received  0.0     0.0      0.0       0.0    0.0
## www_likes              0.0     0.0      0.0       0.0    0.1
## www_likes_received     0.0     0.0      0.0       0.0    0.0
##                       friend_count friendships_initiated likes
## age                            0.0                  -0.1   0.0
## dob_day                        0.0                   0.0   0.0
## dob_year                       0.0                   0.1   0.0
## dob_month                      0.0                   0.0   0.0
## tenure                         0.2                   0.1   0.1
## friend_count                   1.0                   0.8   0.3
## friendships_initiated          0.8                   1.0   0.3
## likes                          0.3                   0.3   1.0
## likes_received                 0.2                   0.2   0.3
## mobile_likes                   0.2                   0.2   0.9
## mobile_likes_received          0.2                   0.2   0.3
## www_likes                      0.2                   0.2   0.6
## www_likes_received             0.2                   0.2   0.3
##                       likes_received mobile_likes mobile_likes_received
## age                              0.0          0.0                   0.0
## dob_day                          0.0          0.0                   0.0
## dob_year                         0.0          0.0                   0.0
## dob_month                        0.0          0.0                   0.0
## tenure                           0.0          0.0                   0.0
## friend_count                     0.2          0.2                   0.2
## friendships_initiated            0.2          0.2                   0.2
## likes                            0.3          0.9                   0.3
## likes_received                   1.0          0.3                   1.0
## mobile_likes                     0.3          1.0                   0.3
## mobile_likes_received            1.0          0.3                   1.0
## www_likes                        0.3          0.2                   0.2
## www_likes_received               0.9          0.2                   0.9
##                       www_likes www_likes_received
## age                         0.0                0.0
## dob_day                     0.0                0.0
## dob_year                    0.0                0.0
## dob_month                   0.0                0.0
## tenure                      0.1                0.0
## friend_count                0.2                0.2
## friendships_initiated       0.2                0.2
## likes                       0.6                0.3
## likes_received              0.3                0.9
## mobile_likes                0.2                0.2
## mobile_likes_received       0.2                0.9
## www_likes                   1.0                0.3
## www_likes_received          0.3                1.0

# Plot
ggcorrplot(corr, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of fb_data_new_numeric", 
           ggtheme=theme_bw)

Based on the correlogram of the data set, we can see the following correlations:

1- There is a negative correlation (-1.0 after rounding) between birth year (dob_year) and age, which is expected

2- There is a moderate positive correlation (0.5) between tenure and age

3- There is a small positive correlation (0.2) between friend count and tenure

4- There is a strong positive correlation (0.8) between friendships initiated and friend count

5- There is a small positive correlation (0.3) between likes and both friend count and friendships initiated

6- There is a strong positive correlation (0.9) between likes and mobile likes.
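Rather than reading the strongest relationships off the matrix by eye, the pairs can be ranked programmatically. The sketch below uses a small simulated data frame; the variables `a`, `b`, `c` and the name `ranked` are hypothetical and only serve to show the technique:

```r
# Simulated stand-in data; the variables are invented for illustration.
set.seed(42)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)  # built to be strongly related to a
toy$c <- rnorm(50)                    # independent noise

corr <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle.
corr[lower.tri(corr, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair.
ranked <- as.data.frame(as.table(corr))
names(ranked) <- c("var1", "var2", "r")
ranked <- ranked[!is.na(ranked$r), ]

# Sort by absolute strength so strong negative pairs are not missed.
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```

Sorting on the absolute value matters here, because a pair such as age and birth year would otherwise end up at the bottom of the list.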

CONCLUSION

For Facebook users and their engagement, age and gender shape the overall make-up of the user base. The majority of the users are below 30 years old, and there are more male users than female users. One interesting finding is that, even though there is a correlation between friend count and likes, it is small. This means users might have a large number of friends but may not be getting the engagement they are looking for. Another thing to note is that most of the likes come from mobile rather than desktop.
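The mobile versus desktop comparison can be expressed as a single share. This is a minimal sketch on an invented data frame `toy`; the name `mobile_share` is an assumption made for the example:

```r
# Hypothetical stand-in for fb_data_new; the values are invented for illustration.
toy <- data.frame(
  mobile_likes = c(0, 4, 46, 10),
  www_likes    = c(0, 0, 7, 3)
)

# Total likes given from each channel across all users.
total_mobile <- sum(toy$mobile_likes)
total_www <- sum(toy$www_likes)

# Share of all likes that were given from mobile.
mobile_share <- total_mobile / (total_mobile + total_www)
cat("share of likes given from mobile:", round(mobile_share, 2), "\n")
```

The same calculation on the `_received` columns would show the split for likes received.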

Key Takeaway: When creating Facebook posts and content, it is suggested to target an audience below 30 years old. Do not only increase the number of Facebook followers; also focus on the content itself to get the most engagement from Facebook. Content should also be focused on mobile rather than desktop.

Facebook-User-Data-Analysis

Anil Akyildirim

8/1/2019