Introduction

This Markdown document covers one of four data science (DS) projects. Each open dataset was specially selected to showcase different technical aspects of a project and its business value.

The primary objectives are to demonstrate the following:

  1. Explore, understand & cleanse data with explained reasoning (Exploratory Data Analysis)

  2. Identify, implement, test & select the best possible statistical model based on evidence from the data

  3. Document systematically with reproducible R code for others to understand & improve

  4. Summarise observed findings with concise business value & insights through an executive summary

  5. Understand the full data project cycle well enough to create a commercial data product, with the ability to carry out the statistics in R

What it doesn't cover

  1. Data product via dashboard: visualising processed output data through visualisation tools such as Qlik or Tableau - though I am trained in these ;)

  2. Deep business domain knowledge about the dataset

  3. External dataset(s) to enrich the analysis or give another dimension to the business question

  4. Working style or collaboration methods with other data team members (ranging from data analyst, business analyst, data scientist, data engineer and visualisation engineer to business stakeholder and project manager)

  5. Other statistical models which might be better or could be used as alternatives

Reasons:

  1. There is limited time & scope for executing this project.
  2. The essential goal is to meet the primary objectives stated above, not to showcase the breadth of techniques.
  3. There is always a better or alternative solution to test the stated hypothesis.

All data sources used are taken from the public domain and the projects are implemented using RStudio. For the full Markdown version, kindly contact me via teochunwey@gmail.com

—IN PROGRESS—

Content

  1. Overview
  2. Data Dictionary
  3. Exploratory Data Analysis
  4. Analysis & Modeling
  5. Evaluation & Interpretation
  6. Findings & Executive Summary

Overview

The use case in this project is to predict the interaction performance of Facebook posts based on the input features (category, page total likes, type, month, hour, weekday, paid). The posts revolve around cosmetics.

Facebook is chosen for this project as it is one of the popular social media platforms used in Singapore. In addition, businesses use this channel to advertise or brand themselves. As such, this project can be reused for similar brand analysis on Facebook.

Data Dictionary

The dataset has 19 semicolon-separated columns:

  1. Page total likes
  2. Type
  3. Category
  4. Post Month
  5. Post Weekday
  6. Post Hour
  7. Paid
  8. Lifetime Post Total Reach
  9. Lifetime Post Total Impressions
  10. Lifetime Engaged Users
  11. Lifetime Post Consumers
  12. Lifetime Post Consumptions
  13. Lifetime Post Impressions by people who have liked your Page
  14. Lifetime Post reach by people who like your Page
  15. Lifetime People who have liked your Page and engaged with your post
  16. comment
  17. like
  18. share
  19. Total Interactions

Data Source: https://archive.ics.uci.edu/ml/datasets/Facebook+metrics

Exploratory Data Analysis

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(stats)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
#library(corrplot)
library(ggfortify)
#library(psych)
library(lattice)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(caret)
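# stringsAsFactors = TRUE keeps Type as a factor, as shown in the output below (R >= 4.0 no longer does this by default)
df.original <- read.csv('dataset_Facebook.csv', sep = ";", stringsAsFactors = TRUE)
set.seed(1)  # fix the random seed so the split is reproducible
df <- sample_frac(df.original, 0.7)  # training set: a random 70% of the rows
df.index <- as.numeric(rownames(df))  # original row numbers of the sampled rows
df.test <- df.original[-df.index, ]  # test set: the remaining 30%
attach(df)  # make the training columns accessible by name
head(df)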
##     Page.total.likes   Type Category Post.Month Post.Weekday Post.Hour
## 133           136393  Photo        1         10            6         9
## 186           134879  Photo        1          9            1        10
## 286           126141 Status        2          6            6         4
## 452            93470  Photo        1          2            5        12
## 101           137020  Photo        1         10            4         9
## 445            96749  Photo        2          3            2         5
##     Paid Lifetime.Post.Total.Reach Lifetime.Post.Total.Impressions
## 133    0                       659                            1158
## 186    0                      2232                            4005
## 286    0                      8628                           14847
## 452    0                      6416                           11459
## 101    1                      1357                            2453
## 445    0                      5312                            9411
##     Lifetime.Engaged.Users Lifetime.Post.Consumers
## 133                    199                     194
## 186                    374                     335
## 286                    870                     843
## 452                   1362                    1313
## 101                     37                      37
## 445                    603                     582
##     Lifetime.Post.Consumptions
## 133                        239
## 186                        458
## 286                       1692
## 452                       1652
## 101                         55
## 445                        795
##     Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
## 133                                                         1041
## 186                                                         3247
## 286                                                        11970
## 452                                                         9122
## 101                                                         2154
## 445                                                         7034
##     Lifetime.Post.reach.by.people.who.like.your.Page
## 133                                              576
## 186                                             1740
## 286                                             6796
## 452                                             4716
## 101                                             1120
## 445                                             3588
##     Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
## 133                                                                 169
## 186                                                                 278
## 286                                                                 774
## 452                                                                 497
## 101                                                                  32
## 445                                                                 345
##     comment like share Total.Interactions
## 133       1    7     2                 10
## 186       0   62    10                 72
## 286       4   72    18                 94
## 452       0   96    29                125
## 101       0    0     0                  0
## 445       0   48    20                 68
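
The split above recovers the test rows from the row names of the sampled data frame, which relies on `sample_frac()` keeping those row names. A minimal sketch of an alternative that samples row indices directly is shown below. The helper name `split_train_test` and the toy data frame are hypothetical and used only for illustration, and this approach is not guaranteed to select the same rows as the code above.

```r
# Hypothetical helper: split a data frame into train/test sets by sampling row indices.
split_train_test <- function(data, train_frac = 0.7, seed = 1) {
  set.seed(seed)                                  # reproducible sampling
  n_train <- round(train_frac * nrow(data))       # number of training rows
  train_idx <- sample(seq_len(nrow(data)), size = n_train)
  list(
    train = data[train_idx, , drop = FALSE],      # sampled rows
    test  = data[-train_idx, , drop = FALSE]      # everything not sampled
  )
}

# Toy data standing in for dataset_Facebook.csv (illustrative values only)
toy <- data.frame(id = 1:10, like = c(7, 62, 72, 96, 0, 48, 47, 33, 107, 107))
parts <- split_train_test(toy)
nrow(parts$train)  # 7
nrow(parts$test)   # 3
```

Because the indices are held explicitly, the train and test sets cannot overlap, regardless of how row names are handled.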

We run the following EDA methods to check data integrity and NA values, understand the data in context, and visualise the important variables as histograms or density charts.

  1. head(df): Check that the data is loaded correctly, based on the data and column headers

  2. str(df): Get the data types, number of observations and variables

  3. summary(df): Check the range of each variable, including min, max, outliers, NAs and anomalies such as -1 or 99 for Post.Month

  4. summary(is.na(df)): Quick way to check for NA values

str(df)
## 'data.frame':    350 obs. of  19 variables:
##  $ Page.total.likes                                                   : int  136393 134879 126141 93470 137020 96749 91865 121540 124940 138895 ...
##  $ Type                                                               : Factor w/ 4 levels "Link","Photo",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ Category                                                           : int  1 1 2 1 1 2 2 3 3 2 ...
##  $ Post.Month                                                         : int  10 9 6 2 10 3 2 6 6 12 ...
##  $ Post.Weekday                                                       : int  6 1 6 5 4 2 5 1 2 4 ...
##  $ Post.Hour                                                          : int  9 10 4 12 9 5 13 3 3 2 ...
##  $ Paid                                                               : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ Lifetime.Post.Total.Reach                                          : int  659 2232 8628 6416 1357 5312 4840 3110 3754 4940 ...
##  $ Lifetime.Post.Total.Impressions                                    : int  1158 4005 14847 11459 2453 9411 7466 5405 6295 9390 ...
##  $ Lifetime.Engaged.Users                                             : int  199 374 870 1362 37 603 949 732 791 385 ...
##  $ Lifetime.Post.Consumers                                            : int  194 335 843 1313 37 582 923 712 730 306 ...
##  $ Lifetime.Post.Consumptions                                         : int  239 458 1692 1652 55 795 1116 892 1072 501 ...
##  $ Lifetime.Post.Impressions.by.people.who.have.liked.your.Page       : int  1041 3247 11970 9122 2154 7034 5362 4605 4343 5860 ...
##  $ Lifetime.Post.reach.by.people.who.like.your.Page                   : int  576 1740 6796 4716 1120 3588 3370 2540 2590 2930 ...
##  $ Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post: int  169 278 774 497 32 345 447 342 482 273 ...
##  $ comment                                                            : int  1 0 4 0 0 0 3 2 8 33 ...
##  $ like                                                               : int  7 62 72 96 0 48 47 33 107 107 ...
##  $ share                                                              : int  2 10 18 29 0 20 21 10 20 22 ...
##  $ Total.Interactions                                                 : int  10 72 94 125 0 68 71 45 135 162 ...
summary(df)
##  Page.total.likes     Type        Category       Post.Month    
##  Min.   : 81370   Link  : 14   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:113028   Photo :295   1st Qu.:1.000   1st Qu.: 4.000  
##  Median :130791   Status: 35   Median :2.000   Median : 7.000  
##  Mean   :123480   Video :  6   Mean   :1.857   Mean   : 7.083  
##  3rd Qu.:136393                3rd Qu.:3.000   3rd Qu.:10.000  
##  Max.   :139441                Max.   :3.000   Max.   :12.000  
##                                                                
##   Post.Weekday     Post.Hour           Paid       
##  Min.   :1.000   Min.   : 1.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.:0.0000  
##  Median :4.000   Median : 9.000   Median :0.0000  
##  Mean   :3.963   Mean   : 7.763   Mean   :0.2743  
##  3rd Qu.:6.000   3rd Qu.:11.000   3rd Qu.:1.0000  
##  Max.   :7.000   Max.   :23.000   Max.   :1.0000  
##                                                   
##  Lifetime.Post.Total.Reach Lifetime.Post.Total.Impressions
##  Min.   :   238            Min.   :   570                 
##  1st Qu.:  3340            1st Qu.:  5807                 
##  Median :  5370            Median :  9356                 
##  Mean   : 14430            Mean   : 27940                 
##  3rd Qu.: 13778            3rd Qu.: 24872                 
##  Max.   :180480            Max.   :665792                 
##                                                           
##  Lifetime.Engaged.Users Lifetime.Post.Consumers Lifetime.Post.Consumptions
##  Min.   :   15          Min.   :   15.0         Min.   :   19.0           
##  1st Qu.:  418          1st Qu.:  343.5         1st Qu.:  514.2           
##  Median :  652          Median :  568.5         Median :  868.5           
##  Mean   :  973          Mean   :  840.6         Mean   : 1471.5           
##  3rd Qu.: 1128          3rd Qu.:  993.8         3rd Qu.: 1606.2           
##  Max.   :11452          Max.   :11328.0         Max.   :19779.0           
##                                                                           
##  Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
##  Min.   :   567                                              
##  1st Qu.:  4104                                              
##  Median :  6705                                              
##  Mean   : 14781                                              
##  3rd Qu.: 15776                                              
##  Max.   :648611                                              
##                                                              
##  Lifetime.Post.reach.by.people.who.like.your.Page
##  Min.   :  236                                   
##  1st Qu.: 2246                                   
##  Median : 3715                                   
##  Mean   : 6724                                   
##  3rd Qu.: 8526                                   
##  Max.   :51456                                   
##                                                  
##  Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
##  Min.   :  15.0                                                     
##  1st Qu.: 301.0                                                     
##  Median : 430.5                                                     
##  Mean   : 640.8                                                     
##  3rd Qu.: 700.5                                                     
##  Max.   :4376.0                                                     
##                                                                     
##     comment             like            share        Total.Interactions
##  Min.   :  0.000   Min.   :   0.0   Min.   :  0.00   Min.   :   0.0    
##  1st Qu.:  1.000   1st Qu.:  57.0   1st Qu.: 10.00   1st Qu.:  72.0    
##  Median :  3.000   Median : 107.0   Median : 19.50   Median : 130.5    
##  Mean   :  8.526   Mean   : 194.3   Mean   : 29.21   Mean   : 231.2    
##  3rd Qu.:  7.750   3rd Qu.: 188.0   3rd Qu.: 36.00   3rd Qu.: 232.8    
##  Max.   :372.000   Max.   :5172.0   Max.   :790.00   Max.   :6334.0    
##                    NA's   :1        NA's   :4

Findings:

  1. There are only 5 NA values (1 in like, 4 in share), which is insignificant for 350 observations. As such, no action will be taken.

  2. Photos have the highest count (295 of 350 posts), which might skew the data during analysis.

  3. For the lifetime-related metrics, the maximum values appear to be outliers.

  4. Post Month/Weekday/Hour, Paid and Category seem to have no anomalies.
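
The outlier observation can be checked numerically rather than by eye. The sketch below applies the common 1.5 x IQR rule (Tukey's fences); this rule is an assumption on my part, not necessarily the criterion used for the findings above. The helper `flag_outliers` is a hypothetical name, and the vector is a toy sample, not the real column.

```r
# Hypothetical helper: return the values lying beyond 1.5 * IQR from the quartiles.
flag_outliers <- function(x) {
  x <- x[!is.na(x)]                         # ignore missing values
  q <- quantile(x, c(0.25, 0.75))           # first and third quartiles
  fence <- 1.5 * (q[2] - q[1])              # 1.5 times the interquartile range
  x[x < q[1] - fence | x > q[2] + fence]    # values outside the fences
}

# Toy values loosely shaped like a lifetime reach metric (illustrative only)
reach <- c(659, 2232, 8628, 6416, 1357, 5312, 4840, 3110, 3754, 180480)
outliers <- flag_outliers(reach)
outliers
```

On the real data, the same helper could be applied to each lifetime-related column, for example with `lapply()`.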

summary(is.na(df))
##  Page.total.likes    Type          Category       Post.Month     
##  Mode :logical    Mode :logical   Mode :logical   Mode :logical  
##  FALSE:350        FALSE:350       FALSE:350       FALSE:350      
##                                                                  
##  Post.Weekday    Post.Hour          Paid         Lifetime.Post.Total.Reach
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical            
##  FALSE:350       FALSE:350       FALSE:350       FALSE:350                
##                                                                           
##  Lifetime.Post.Total.Impressions Lifetime.Engaged.Users
##  Mode :logical                   Mode :logical         
##  FALSE:350                       FALSE:350             
##                                                        
##  Lifetime.Post.Consumers Lifetime.Post.Consumptions
##  Mode :logical           Mode :logical             
##  FALSE:350               FALSE:350                 
##                                                    
##  Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
##  Mode :logical                                               
##  FALSE:350                                                   
##                                                              
##  Lifetime.Post.reach.by.people.who.like.your.Page
##  Mode :logical                                   
##  FALSE:350                                       
##                                                  
##  Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
##  Mode :logical                                                      
##  FALSE:350                                                          
##                                                                     
##   comment           like           share         Total.Interactions
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical     
##  FALSE:350       FALSE:349       FALSE:346       FALSE:350         
##                  TRUE :1         TRUE :4
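
For a wide data frame, the output of `summary(is.na(df))` is long. A more compact check, sketched here on a toy data frame (the values are illustrative, not taken from the dataset), counts the NA values per column with `colSums()`:

```r
# Toy data frame with missing values in like and share (illustrative only)
toy <- data.frame(
  comment = c(1, 0, 4, 0),
  like    = c(7, NA, 72, 96),
  share   = c(2, 10, NA, NA)
)

na_counts <- colSums(is.na(toy))  # number of NA values in each column
na_counts[na_counts > 0]          # keep only the columns that contain NAs
```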

EDA Visualisation

# Columns are referenced by bare name inside aes(); after_stat(count) needs ggplot2 >= 3.3.0 (older versions use ..count..)
ggplot(df, aes(x = Post.Month)) + geom_bar(color = 'darkblue', fill = 'lightblue') + ggtitle("Posts by Month") + geom_text(stat = 'count', aes(label = after_stat(count)), vjust = 1)

ggplot(df, aes(x = Post.Weekday)) + geom_bar(color = 'darkblue', fill = 'lightblue') + ggtitle("Posts by Weekday") + geom_text(stat = 'count', aes(label = after_stat(count)), vjust = 1)

ggplot(df, aes(x = Post.Hour)) + geom_bar(color = 'darkblue', fill = 'lightblue') + ggtitle("Posts by Hour") + geom_text(stat = 'count', aes(label = after_stat(count)), vjust = 1)

Findings on post counts by time period:

  1. By month, posting frequency increased in April, July and October.

  2. Weekdays 1 & 7 have the highest number of posts.

  3. By hour, posting peaks at hours 3, 10, 11 and 13.
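
The peaks read off the bar charts can also be listed directly from the counts. The sketch below uses a hypothetical helper, `top_counts`, on a toy vector of posting hours (illustrative values, not the real column).

```r
# Hypothetical helper: the k most frequent values of a vector, with their counts.
top_counts <- function(x, k = 3) {
  counts <- sort(table(x), decreasing = TRUE)  # frequency table, largest first
  head(counts, k)                              # keep the k most frequent values
}

# Toy posting hours (illustrative only)
post_hour <- c(3, 3, 3, 10, 10, 11, 13, 9, 3, 10)
top_hours <- top_counts(post_hour, k = 2)
top_hours
```

On the training data, the equivalent call would be along the lines of `top_counts(df$Post.Hour, k = 4)`.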

ggplot(df, aes( Post.Month, Page.total.likes)) +   
  geom_bar(color='darkblue', fill='lightblue', position = "dodge", stat="identity")+
  theme_minimal()

Findings:

  1. The plot above shows the page's total likes accumulated over the months.

  2. March & April seem to have the largest increase in page likes, which then stagnate from around August onwards.
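
The month-on-month growth can be quantified instead of being read from bar heights. The sketch below assumes that the highest `Page.total.likes` value seen in a month is a reasonable stand-in for that month's level. The toy data frame reuses the column names but contains illustrative values only.

```r
# Toy data: page likes recorded at the time of each post (illustrative values only)
toy <- data.frame(
  Post.Month       = c(1, 1, 2, 2, 3, 3, 4),
  Page.total.likes = c(81370, 84000, 87000, 93470, 96749, 100000, 110000)
)

# Highest page-like count observed in each month
monthly <- aggregate(Page.total.likes ~ Post.Month, data = toy, FUN = max)

# Month-on-month increase (NA for the first month, which has no predecessor)
monthly$increase <- c(NA, diff(monthly$Page.total.likes))
monthly
```

Months with the largest values in `increase` correspond to the steepest steps in the bar chart, and values near zero correspond to a plateau.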

ggplot(data=df, aes(x=Type, y=Total.Interactions)) + geom_boxplot() + facet_wrap(~Paid)

ggplot(data=df, aes(x=Type, y=comment)) + geom_boxplot() + facet_wrap(~Paid)

ggplot(data=df, aes(x=Type, y=like)) + geom_boxplot() + facet_wrap(~Paid)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

ggplot(data=df, aes(x=Type, y=share)) + geom_boxplot() + facet_wrap(~Paid)
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

Findings:

  1. In terms of total interactions by post type, paid Status posts seem to perform better. Photos also performed better when paid, which is more obvious because of their large sample size compared to the rest of the types. Link & Video posts are the exceptions, where interaction is higher for posts without any advertisement.

  2. In terms of comments by post type, Photo, Status and Video posts overall perform better with paid advertising. An exceptional result came from a photo post with a record high of 370+ comments. Among unpaid posts, Links seem to perform better.
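
Because the boxplots are dominated by a few extreme posts, a table of medians and group sizes can make the paid vs unpaid comparison easier to read. The sketch below uses `aggregate()` on a toy data frame (illustrative values only) with the same column names as the dataset. The choice of the median as the summary statistic is an assumption.

```r
# Toy data with the same column names as the dataset (illustrative values only)
toy <- data.frame(
  Type = c("Photo", "Photo", "Photo", "Photo", "Status", "Status", "Link", "Link"),
  Paid = c(0, 0, 1, 1, 0, 1, 0, 1),
  Total.Interactions = c(10, 72, 125, 135, 94, 162, 68, 45)
)

# Median interactions for each Type/Paid combination
med <- aggregate(Total.Interactions ~ Type + Paid, data = toy, FUN = median)

# Number of posts behind each median, to show how small some groups are
counts <- aggregate(Total.Interactions ~ Type + Paid, data = toy, FUN = length)
med$n <- counts$Total.Interactions
med
```

Reporting `n` next to each median keeps small groups, such as Link and Video posts, from being over-interpreted.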

Analysis & Modeling

In this analysis, we will run a Random Forest, which is an ensemble of decision trees.

#default parameters with importance true to inspect variable importance.
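
A minimal sketch of such a call is given below. It assumes `Total.Interactions` as the target and a few of the input features as predictors; both choices are assumptions, as is the synthetic toy data, which only mimics the column names. The calls `randomForest()` and `importance()` come from the randomForest package loaded earlier.

```r
library(randomForest)

set.seed(1)  # reproducible toy data and forest
n <- 200

# Synthetic stand-in for the training data (not the real dataset)
toy <- data.frame(
  Type      = factor(sample(c("Link", "Photo", "Status", "Video"), n, replace = TRUE)),
  Paid      = sample(0:1, n, replace = TRUE),
  Post.Hour = sample(1:23, n, replace = TRUE)
)
# Synthetic target: driven by Paid and Type, plus random noise
toy$Total.Interactions <- 100 + 80 * toy$Paid + 40 * (toy$Type == "Photo") + rnorm(n, sd = 20)

# Default parameters, with importance = TRUE so variable importance is recorded
rf <- randomForest(Total.Interactions ~ Type + Paid + Post.Hour,
                   data = toy, importance = TRUE)

imp <- importance(rf)  # matrix with %IncMSE and IncNodePurity for each predictor
imp
```

On the real data, the formula would use the training set `df`, and `predict(rf, newdata = df.test)` could then be compared against the held-out interactions.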

Reference

Variables definition: https://www.facebook.com/business/a/page/page-insights#overview