This Markdown document covers one of four data science (DS) projects. Each open dataset was specially selected to showcase different technical aspects of a project and their business value.
The primary objectives are to demonstrate the following:
Explore, understand & cleanse data with explained reasoning (Exploratory Data Analysis)
Identify, implement, test & select the best possible statistical model based on evidence from the data
Document systematically with reproducible R code for others to understand & improve
Summarise observed findings with concise business value & insights through an executive summary
Understand the data project cycle well enough to create a commercial data product, with the ability to execute statistics through an R application
Data product via dashboard: visualise processed output data through visualisation tools such as Qlik or Tableau - though I am trained in this ;)
Deep business domain knowledge explanation of the dataset
External dataset(s) to enrich the analysis or give other dimensions for tackling the business question
Working style or collaboration methods with other data team members (ranging from data analyst, business analyst, data scientist, data engineer and visualisation engineer to business stakeholder and project manager)
Other statistical models which might be better or could be used as alternatives.
Reasons:
All data sources used are taken from the public domain and the projects are implemented using RStudio. For the full Markdown version, kindly contact me via teochunwey@gmail.com
The use case in this project is to predict the interaction performance of Facebook posts based on the input features (category, page total likes, type, month, hour, weekday, paid). The posts in this dataset revolve around cosmetics.
Facebook is chosen for this project as it is one of the popular social media platforms used in Singapore. In addition, businesses use this channel to advertise or brand themselves. As such, this project can be applied to similar brand analyses on Facebook.
Page total likes;Type;Category;Post Month;Post Weekday;Post Hour;Paid;Lifetime Post Total Reach;Lifetime Post Total Impressions;Lifetime Engaged Users;Lifetime Post Consumers;Lifetime Post Consumptions;Lifetime Post Impressions by people who have liked your Page;Lifetime Post reach by people who like your Page;Lifetime People who have liked your Page and engaged with your post;comment;like;share;Total Interactions
Data Source: https://archive.ics.uci.edu/ml/datasets/Facebook+metrics
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(stats)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
#library(corrplot)
library(ggfortify)
#library(psych)
library(lattice)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(caret)
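# Load the semicolon-separated Facebook metrics file
df.original <- read.csv('dataset_Facebook.csv', sep = ";")
set.seed(1) # Ensure reproducible sampling
df <- sample_frac(df.original, 0.7) # Training data: 70% of the rows (7:3 train/test split)
df.index <- as.numeric(rownames(df)) # Original row numbers of the sampled rows
df.test <- df.original[-df.index, ] # Test data: the remaining 30%
attach(df)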
head(df)
## Page.total.likes Type Category Post.Month Post.Weekday Post.Hour
## 133 136393 Photo 1 10 6 9
## 186 134879 Photo 1 9 1 10
## 286 126141 Status 2 6 6 4
## 452 93470 Photo 1 2 5 12
## 101 137020 Photo 1 10 4 9
## 445 96749 Photo 2 3 2 5
## Paid Lifetime.Post.Total.Reach Lifetime.Post.Total.Impressions
## 133 0 659 1158
## 186 0 2232 4005
## 286 0 8628 14847
## 452 0 6416 11459
## 101 1 1357 2453
## 445 0 5312 9411
## Lifetime.Engaged.Users Lifetime.Post.Consumers
## 133 199 194
## 186 374 335
## 286 870 843
## 452 1362 1313
## 101 37 37
## 445 603 582
## Lifetime.Post.Consumptions
## 133 239
## 186 458
## 286 1692
## 452 1652
## 101 55
## 445 795
## Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
## 133 1041
## 186 3247
## 286 11970
## 452 9122
## 101 2154
## 445 7034
## Lifetime.Post.reach.by.people.who.like.your.Page
## 133 576
## 186 1740
## 286 6796
## 452 4716
## 101 1120
## 445 3588
## Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
## 133 169
## 186 278
## 286 774
## 452 497
## 101 32
## 445 345
## comment like share Total.Interactions
## 133 1 7 2 10
## 186 0 62 10 72
## 286 4 72 18 94
## 452 0 96 29 125
## 101 0 0 0 0
## 445 0 48 20 68
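The split above recovers the training rows from the row names of the sampled data frame, so it depends on `sample_frac()` carrying the original row names through (which is the case in the output shown here). As a hedged alternative, the sketch below samples the row indices directly in base R. The `toy` data frame and its columns are made up for illustration and are not part of the dataset.

```r
# Minimal sketch of an index-based 70/30 split (synthetic data, not the real file)
set.seed(1)
toy <- data.frame(id = 1:500, value = rnorm(500))

# Sample 70% of the row positions, then split by position
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Check the sizes and confirm that no row appears in both sets
print(c(train = nrow(train), test = nrow(test)))
print(length(intersect(train$id, test$id)))
```

Sampling positions rather than reading them back from row names keeps the split independent of how a given package version treats row names.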
We run the following EDA methods to check data integrity and NA values, understand the data in context, and visualise the important variables as histograms or density charts.
head(df): Check that the data is loaded correctly, based on the values and column headers
str(df): Get the data type structure, number of observations and variables
summary(df): Check the range of each variable, including min, max, outliers, NAs and anomalies such as -1 or 99 for Post.Month
summary(is.na(df)): Quick way to check for NA values
str(df)
## 'data.frame': 350 obs. of 19 variables:
## $ Page.total.likes : int 136393 134879 126141 93470 137020 96749 91865 121540 124940 138895 ...
## $ Type : Factor w/ 4 levels "Link","Photo",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ Category : int 1 1 2 1 1 2 2 3 3 2 ...
## $ Post.Month : int 10 9 6 2 10 3 2 6 6 12 ...
## $ Post.Weekday : int 6 1 6 5 4 2 5 1 2 4 ...
## $ Post.Hour : int 9 10 4 12 9 5 13 3 3 2 ...
## $ Paid : int 0 0 0 0 1 0 1 0 0 0 ...
## $ Lifetime.Post.Total.Reach : int 659 2232 8628 6416 1357 5312 4840 3110 3754 4940 ...
## $ Lifetime.Post.Total.Impressions : int 1158 4005 14847 11459 2453 9411 7466 5405 6295 9390 ...
## $ Lifetime.Engaged.Users : int 199 374 870 1362 37 603 949 732 791 385 ...
## $ Lifetime.Post.Consumers : int 194 335 843 1313 37 582 923 712 730 306 ...
## $ Lifetime.Post.Consumptions : int 239 458 1692 1652 55 795 1116 892 1072 501 ...
## $ Lifetime.Post.Impressions.by.people.who.have.liked.your.Page : int 1041 3247 11970 9122 2154 7034 5362 4605 4343 5860 ...
## $ Lifetime.Post.reach.by.people.who.like.your.Page : int 576 1740 6796 4716 1120 3588 3370 2540 2590 2930 ...
## $ Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post: int 169 278 774 497 32 345 447 342 482 273 ...
## $ comment : int 1 0 4 0 0 0 3 2 8 33 ...
## $ like : int 7 62 72 96 0 48 47 33 107 107 ...
## $ share : int 2 10 18 29 0 20 21 10 20 22 ...
## $ Total.Interactions : int 10 72 94 125 0 68 71 45 135 162 ...
summary(df)
## Page.total.likes Type Category Post.Month
## Min. : 81370 Link : 14 Min. :1.000 Min. : 1.000
## 1st Qu.:113028 Photo :295 1st Qu.:1.000 1st Qu.: 4.000
## Median :130791 Status: 35 Median :2.000 Median : 7.000
## Mean :123480 Video : 6 Mean :1.857 Mean : 7.083
## 3rd Qu.:136393 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :139441 Max. :3.000 Max. :12.000
##
## Post.Weekday Post.Hour Paid
## Min. :1.000 Min. : 1.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.:0.0000
## Median :4.000 Median : 9.000 Median :0.0000
## Mean :3.963 Mean : 7.763 Mean :0.2743
## 3rd Qu.:6.000 3rd Qu.:11.000 3rd Qu.:1.0000
## Max. :7.000 Max. :23.000 Max. :1.0000
##
## Lifetime.Post.Total.Reach Lifetime.Post.Total.Impressions
## Min. : 238 Min. : 570
## 1st Qu.: 3340 1st Qu.: 5807
## Median : 5370 Median : 9356
## Mean : 14430 Mean : 27940
## 3rd Qu.: 13778 3rd Qu.: 24872
## Max. :180480 Max. :665792
##
## Lifetime.Engaged.Users Lifetime.Post.Consumers Lifetime.Post.Consumptions
## Min. : 15 Min. : 15.0 Min. : 19.0
## 1st Qu.: 418 1st Qu.: 343.5 1st Qu.: 514.2
## Median : 652 Median : 568.5 Median : 868.5
## Mean : 973 Mean : 840.6 Mean : 1471.5
## 3rd Qu.: 1128 3rd Qu.: 993.8 3rd Qu.: 1606.2
## Max. :11452 Max. :11328.0 Max. :19779.0
##
## Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
## Min. : 567
## 1st Qu.: 4104
## Median : 6705
## Mean : 14781
## 3rd Qu.: 15776
## Max. :648611
##
## Lifetime.Post.reach.by.people.who.like.your.Page
## Min. : 236
## 1st Qu.: 2246
## Median : 3715
## Mean : 6724
## 3rd Qu.: 8526
## Max. :51456
##
## Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
## Min. : 15.0
## 1st Qu.: 301.0
## Median : 430.5
## Mean : 640.8
## 3rd Qu.: 700.5
## Max. :4376.0
##
## comment like share Total.Interactions
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 57.0 1st Qu.: 10.00 1st Qu.: 72.0
## Median : 3.000 Median : 107.0 Median : 19.50 Median : 130.5
## Mean : 8.526 Mean : 194.3 Mean : 29.21 Mean : 231.2
## 3rd Qu.: 7.750 3rd Qu.: 188.0 3rd Qu.: 36.00 3rd Qu.: 232.8
## Max. :372.000 Max. :5172.0 Max. :790.00 Max. :6334.0
## NA's :1 NA's :4
Findings:
There are only 5 NA values in this dataset (1 in like, 4 in share), which is insignificant against 350 observations. As such, no action will be taken.
Photos have by far the highest count (295 of 350 posts), which might skew the analysis.
For the lifetime-related metrics, the maximum values appear to be outliers, as they sit far above the 3rd quartile.
Post Month, Weekday and Hour, Paid and Category appear to have no anomalies.
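To make the outlier remark more concrete, here is a small sketch that applies Tukey's 1.5 x IQR rule to the quartiles of Lifetime.Post.Total.Reach printed by summary(df) above. The rule is our choice for illustration, not something the original analysis states, and `flag_outliers` is a hypothetical helper name.

```r
# Quartiles and maximum of Lifetime.Post.Total.Reach, copied from summary(df)
q1 <- 3340
q3 <- 13778
max_value <- 180480

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
print(upper_fence)              # 29435
print(max_value > upper_fence)  # TRUE

# Hypothetical helper applying the same rule to any numeric vector
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x > q[[2]] + 1.5 * (q[[2]] - q[[1]])
}
print(flag_outliers(c(1, 2, 3, 4, 100)))
```

Under this rule the maximum reach is roughly six times the upper fence, which supports treating it as an outlier.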
summary(is.na(df))
## Page.total.likes Type Category Post.Month
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:350 FALSE:350 FALSE:350 FALSE:350
##
## Post.Weekday Post.Hour Paid Lifetime.Post.Total.Reach
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:350 FALSE:350 FALSE:350 FALSE:350
##
## Lifetime.Post.Total.Impressions Lifetime.Engaged.Users
## Mode :logical Mode :logical
## FALSE:350 FALSE:350
##
## Lifetime.Post.Consumers Lifetime.Post.Consumptions
## Mode :logical Mode :logical
## FALSE:350 FALSE:350
##
## Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
## Mode :logical
## FALSE:350
##
## Lifetime.Post.reach.by.people.who.like.your.Page
## Mode :logical
## FALSE:350
##
## Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
## Mode :logical
## FALSE:350
##
## comment like share Total.Interactions
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:350 FALSE:349 FALSE:346 FALSE:350
## TRUE :1 TRUE :4
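The summary(is.na(df)) output above is long for a table that mostly says FALSE. A more compact check, sketched here on a made-up `toy` data frame, counts NA values per column with colSums() and lists the affected rows with complete.cases().

```r
# Toy data with a few missing values (illustrative only)
toy <- data.frame(
  comment = c(1, 0, 4, 0),
  like    = c(7, NA, 72, 96),
  share   = c(2, 10, NA, NA)
)

# Number of NA values per column; keep only columns that have any
na_counts <- colSums(is.na(toy))
print(na_counts[na_counts > 0])

# Row positions with at least one NA
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```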
# Columns are referenced by name inside aes(); newer ggplot2 versions prefer after_stat(count) over ..count..
ggplot(df, aes(Post.Month)) + geom_bar(color='darkblue', fill='lightblue') + ggtitle("Post by Months") + geom_text(stat='count', aes(label=..count..), vjust=1)
ggplot(df, aes(Post.Weekday)) + geom_bar(color='darkblue', fill='lightblue') + ggtitle("Post by Weekday") + geom_text(stat='count', aes(label=..count..), vjust=1)
ggplot(df, aes(Post.Hour)) + geom_bar(color='darkblue', fill='lightblue') + ggtitle("Post by Hour") + geom_text(stat='count', aes(label=..count..), vjust=1)
Findings on posting counts by time period:
By month, posting frequency rises in April, July and October.
Weekdays 1 and 7 have the highest number of posts.
By hour, posting peaks at the 3rd, 10th, 11th and 13th hours.
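The counts behind these bar charts can also be read off directly with table(), which is handy when the exact numbers matter more than the plot. The sketch below uses a made-up vector of weekdays rather than the real Post.Weekday column.

```r
# Toy weekday values (illustrative only)
toy_weekday <- c(1, 7, 7, 3, 1, 7, 5, 1, 7)

# Frequency of each weekday
weekday_counts <- table(toy_weekday)
print(weekday_counts)

# Weekday(s) with the highest number of posts
busiest <- names(weekday_counts)[as.vector(weekday_counts) == max(weekday_counts)]
print(busiest)
```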
ggplot(df, aes( Post.Month, Page.total.likes)) +
geom_bar(color='darkblue', fill='lightblue', position = "dodge", stat="identity")+
theme_minimal()
Findings:
The plot above shows the page's total likes as they accumulate over the months.
March & April seem to have the largest increase in page likes, and the total plateaus from around August onwards.
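The month-on-month increase can be quantified rather than read off the bars. As a rough sketch on made-up numbers, we take the highest Page.total.likes seen in each month and difference consecutive months. The `toy` values are assumptions for illustration only.

```r
# Toy data: cumulative page likes recorded at each post (illustrative only)
toy <- data.frame(
  Post.Month       = c(1, 1, 2, 2, 3, 3, 4),
  Page.total.likes = c(100, 110, 120, 125, 160, 170, 230)
)

# Highest cumulative likes reached in each month
likes_by_month <- tapply(toy$Page.total.likes, toy$Post.Month, max)

# Gain relative to the previous month, labelled by the later month
monthly_gain <- diff(as.vector(likes_by_month))
names(monthly_gain) <- names(likes_by_month)[-1]
print(monthly_gain)
```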
ggplot(data=df, aes(x=Type, y=Total.Interactions)) + geom_boxplot() + facet_wrap(~Paid)
ggplot(data=df, aes(x=Type, y=comment)) + geom_boxplot() + facet_wrap(~Paid)
ggplot(data=df, aes(x=Type, y=like)) + geom_boxplot() + facet_wrap(~Paid)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
ggplot(data=df, aes(x=Type, y=share)) + geom_boxplot() + facet_wrap(~Paid)
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Findings:
In terms of total interactions by post type, Status posts seem to perform better with paid advertising. Photos also perform better when paid, which is more apparent given their large sample size compared to the rest of the types. Link & Video posts are the exceptions, where interaction is higher for posts without any advertising.
In terms of comments by post type, Photo, Status and Video posts overall perform better with paid advertising. The exceptional result came from a photo post with a record high of 370+ comments. For non-paid posts, links seem to perform better.
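Because the post types differ so much in sample size, it helps to put the group medians next to the group counts when reading the boxplots. The sketch below does this with aggregate() on a made-up `toy` data frame. The column names mirror the dataset, but the values are invented.

```r
# Toy data mirroring the Type / Paid / Total.Interactions columns (values invented)
toy <- data.frame(
  Type = c("Photo", "Photo", "Photo", "Photo", "Status", "Status", "Link", "Link"),
  Paid = c(0, 0, 1, 1, 0, 1, 0, 1),
  Total.Interactions = c(100, 120, 150, 170, 200, 260, 80, 60)
)

# Median interactions and number of posts per Type/Paid group
# (the formula interface of aggregate() drops rows with NA by default)
medians <- aggregate(Total.Interactions ~ Type + Paid, data = toy, FUN = median)
counts  <- aggregate(Total.Interactions ~ Type + Paid, data = toy, FUN = length)
names(counts)[3] <- "n"

summary_table <- merge(medians, counts, by = c("Type", "Paid"))
print(summary_table)
```

A group with only a handful of posts, such as Video in this dataset, deserves a more cautious reading than Photo.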
In this analysis, we will be running Random Forest, an ensemble of Decision Trees.
#default parameters with importance true to inspect variable importance.
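The model-fitting call itself is not shown in this part of the document. As a minimal sketch of what "default parameters with importance true" could look like, the example below fits randomForest() on synthetic data. The `toy` data frame, the choice of three predictors and the simulated response are all assumptions for illustration, not the model actually fitted in this project.

```r
library(randomForest)

# Synthetic data with a few columns named like the dataset (values invented)
set.seed(1)
n <- 200
toy <- data.frame(
  Type      = factor(sample(c("Link", "Photo", "Status", "Video"), n, replace = TRUE)),
  Paid      = sample(0:1, n, replace = TRUE),
  Post.Hour = sample(1:23, n, replace = TRUE)
)
toy$Total.Interactions <- 50 + 100 * toy$Paid + 5 * toy$Post.Hour + rnorm(n, sd = 20)

# Default parameters, with importance = TRUE so variable importance is recorded
fit <- randomForest(Total.Interactions ~ Type + Paid + Post.Hour,
                    data = toy, importance = TRUE)

# For regression, importance() reports %IncMSE and IncNodePurity per predictor
imp <- importance(fit)
print(imp)
```

The same object can be passed to varImpPlot(fit) for a quick visual ranking of the predictors.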
Reference
Variables definition: https://www.facebook.com/business/a/page/page-insights#overview