Data Cleaning and Preparation
Libraries
library(dplyr)
library(tidyr)
library(lubridate)
library(writexl)
library(ggplot2)
library(scales)
library(moments)
library(effsize)
Import Dataset
df <- read.csv("data_input/instagram_post_dataset.csv")
head(df)
In the imported dataset, some entries are empty strings that should be treated as missing values (NA). Therefore, the data is re-imported with an additional argument that converts empty entries to NA.
df <- read.csv("data_input/instagram_post_dataset.csv", na.strings=c("","NA"))
head(df)
Here is a peek at the dataset:
glimpse(df)
> Rows: 1,254
> Columns: 88
> $ coauthorProducers.0.is_verified <chr> NA, "false", NA, NA, NA, NA, NA, NA, N…
> $ coauthorProducers.0.username <chr> NA, "natalie.collins", NA, NA, NA, NA,…
> $ coauthorProducers.1.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.1.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ commentsCount <int> 0, 1, 0, 0, 4, 107, 91, 42, 0, 0, 2, 0…
> $ likesCount <int> 36, 82, 28, 38, 110, 836, 508, 324, 10…
> $ mentions.0 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.7 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.8 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.9 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ ownerFullName <chr> "AO Snack Brand", "AO Snack Brand", "A…
> $ ownerUsername <chr> "aosnack.brand", "aosnack.brand", "aos…
> $ productType <chr> "clips", "clips", NA, NA, "clips", NA,…
> $ taggedUsers.0.full_name <chr> NA, "Natalie Collins", NA, NA, NA, NA,…
> $ taggedUsers.0.is_verified <chr> NA, "false", NA, NA, NA, NA, NA, NA, N…
> $ taggedUsers.0.username <chr> NA, "natalie.collins", NA, NA, NA, NA,…
> $ taggedUsers.1.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ timestamp <chr> "2025-03-30T09:44:38.000Z", "2025-03-3…
> $ type <chr> "Video", "Video", "Sidecar", "Image", …
> $ videoDuration <dbl> 26.933, 13.066, NA, NA, 34.921, NA, NA…
> $ videoPlayCount <int> 3513, 4510, NA, NA, 6480, NA, NA, 3222…
> $ videoViewCount <int> 926, 1303, NA, NA, 2213, NA, NA, 16428…
Data Description
Some of the attributes scraped using Apify contain sub-attributes. For example, the coauthorProducers attribute includes sub-attributes such as id, username, and others. Each attribute and sub-attribute is represented by a single column in the dataset. During the reshaping stage, all indexed attributes and sub-attributes will be merged into a single set of columns (one per attribute) without numerical indexing; a toy sketch after the column list below illustrates this. Here is an explanation of the columns in the dataset:
- coauthorProducers : the accounts that are co-authors of a post. Each account has several attributes, including:
  - is_verified : verification status of the account (true/false).
  - username : username of the account.
- commentsCount : the number of comments on the post.
- likesCount : the number of likes the post received.
- mentions : accounts that are mentioned in the post.
- ownerFullName : full name of the account owner who posted the content.
- ownerUsername : username of the account owner.
- productType : the type of content uploaded (clips, igtv, and photo).
- taggedUsers : accounts that are tagged in the post. Each account has several attributes, including:
  - full_name : full name of the tagged user.
  - is_verified : verification status of the tagged account (true/false).
  - username : username of the tagged account.
- timestamp : the time when the post was created.
- type : the type of post (video, image, and sidecar).
- videoDuration : the duration of the video (if the post is a video).
- videoPlayCount : the number of times the video has been played.
- videoViewCount : the number of times the video has been watched in its entirety.
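To illustrate the merging described above, here is a minimal toy sketch (hypothetical data, not from the dataset) of how indexed columns such as mentions.0 and mentions.1 are gathered into one column. Note that values_drop_na would silently remove a post with no mentions at all, which is why placeholder values are imputed before reshaping in the actual pipeline below.
# Toy frame with indexed wide columns (hypothetical values)
toy <- data.frame(id = 1:2,
                  mentions.0 = c("alice", NA),
                  mentions.1 = c("bob", NA))
# Gather the indexed columns into a single `mention_username` column
pivot_longer(toy, cols = starts_with("mentions"),
             values_to = "mention_username", values_drop_na = TRUE)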
Add Column ID of Posts
Before going further, a column id is added to identify each post, which is represented by one row. The id column will help identify posts throughout the analysis. First, the timestamp column is converted to a datetime type, and the dataset is sorted by timestamp. Sequential IDs are then assigned, so the first post gets ID 1, the second ID 2, and so on.
df_clean <- df %>%
# Change column `timestamp` data type to `datetime`
mutate(timestamp = ymd_hms(timestamp, tz = "Asia/Jakarta")) %>%
# Sort dataset by column `timestamp`
arrange(timestamp)
# Add column `id`
df_clean <- df_clean %>% mutate(id = row_number())
glimpse(df_clean)
> Rows: 1,254
> Columns: 89
> $ coauthorProducers.0.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.0.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.1.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.1.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ commentsCount <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, …
> $ likesCount <int> 236, 287, 71, 202, 210, 1237, 429, 773…
> $ mentions.0 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.7 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.8 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.9 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ ownerFullName <chr> "AO Snack Brand", "AO Snack Brand", "A…
> $ ownerUsername <chr> "aosnack.brand", "aosnack.brand", "aos…
> $ productType <chr> NA, NA, "clips", NA, NA, NA, NA, NA, N…
> $ taggedUsers.0.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.0.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.0.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.username <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ timestamp <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06…
> $ type <chr> "Image", "Image", "Video", "Image", "I…
> $ videoDuration <dbl> NA, NA, 23.033, NA, NA, NA, NA, NA, NA…
> $ videoPlayCount <int> NA, NA, 2977, NA, NA, NA, NA, NA, NA, …
> $ videoViewCount <int> NA, NA, 1468, NA, NA, NA, NA, NA, NA, …
> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
Missing Values
# Check for missing values
colSums(is.na(df_clean))
> coauthorProducers.0.is_verified coauthorProducers.0.username
> 1109 1109
> coauthorProducers.1.is_verified coauthorProducers.1.username
> 1241 1241
> coauthorProducers.2.is_verified coauthorProducers.2.username
> 1249 1249
> coauthorProducers.3.is_verified coauthorProducers.3.username
> 1253 1253
> commentsCount likesCount
> 0 0
> mentions.0 mentions.1
> 1130 1236
> mentions.2 mentions.3
> 1249 1252
> mentions.4 mentions.5
> 1253 1253
> mentions.6 mentions.7
> 1253 1253
> mentions.8 mentions.9
> 1253 1253
> ownerFullName ownerUsername
> 0 0
> productType taggedUsers.0.full_name
> 631 1067
> taggedUsers.0.is_verified taggedUsers.0.username
> 1067 1067
> taggedUsers.1.full_name taggedUsers.1.is_verified
> 1213 1213
> taggedUsers.1.username taggedUsers.10.full_name
> 1213 1253
> taggedUsers.10.is_verified taggedUsers.10.username
> 1253 1253
> taggedUsers.11.full_name taggedUsers.11.is_verified
> 1253 1253
> taggedUsers.11.username taggedUsers.12.full_name
> 1253 1253
> taggedUsers.12.is_verified taggedUsers.12.username
> 1253 1253
> taggedUsers.13.full_name taggedUsers.13.is_verified
> 1253 1253
> taggedUsers.13.username taggedUsers.14.full_name
> 1253 1253
> taggedUsers.14.is_verified taggedUsers.14.username
> 1253 1253
> taggedUsers.15.full_name taggedUsers.15.is_verified
> 1253 1253
> taggedUsers.15.username taggedUsers.16.full_name
> 1253 1253
> taggedUsers.16.is_verified taggedUsers.16.username
> 1253 1253
> taggedUsers.17.full_name taggedUsers.17.is_verified
> 1253 1253
> taggedUsers.17.username taggedUsers.18.full_name
> 1253 1253
> taggedUsers.18.is_verified taggedUsers.18.username
> 1253 1253
> taggedUsers.19.full_name taggedUsers.19.is_verified
> 1253 1253
> taggedUsers.19.username taggedUsers.2.full_name
> 1253 1232
> taggedUsers.2.is_verified taggedUsers.2.username
> 1232 1232
> taggedUsers.3.full_name taggedUsers.3.is_verified
> 1246 1246
> taggedUsers.3.username taggedUsers.4.full_name
> 1246 1249
> taggedUsers.4.is_verified taggedUsers.4.username
> 1249 1249
> taggedUsers.5.full_name taggedUsers.5.is_verified
> 1251 1251
> taggedUsers.5.username taggedUsers.6.full_name
> 1251 1251
> taggedUsers.6.is_verified taggedUsers.6.username
> 1251 1251
> taggedUsers.7.full_name taggedUsers.7.is_verified
> 1251 1251
> taggedUsers.7.username taggedUsers.8.full_name
> 1251 1251
> taggedUsers.8.is_verified taggedUsers.8.username
> 1251 1251
> taggedUsers.9.full_name taggedUsers.9.is_verified
> 1251 1251
> taggedUsers.9.username timestamp
> 1251 0
> type videoDuration
> 0 631
> videoPlayCount videoViewCount
> 635 631
> id
> 0
Based on the result, there are missing values in the mentions.., taggedUsers.., coauthorProducers.., productType, videoViewCount, videoPlayCount, and videoDuration columns. To handle these missing values, I will reshape the dataframe, apply imputation methods, and perform feature engineering.
Reshaping Dataframe
Before reshaping the data, I impute the missing values in the coauthorProducers., mentions., and taggedUsers. columns with index .0. This marks the posts that have no co-author producer, mention, or tagged user.
df_clean <- df_clean %>%
mutate_at(vars(starts_with("coauthorProducers.0")), ~replace(., is.na(.), "No co-author")) %>%
mutate_at(vars(starts_with("mentions.0")), ~replace(., is.na(.), "No mention")) %>%
mutate_at(vars(starts_with("taggedUsers.0")), ~replace(., is.na(.), "No tag"))
The mentions.. columns have no sub-attributes, so their reshaping consists of a pivot longer (wide to long) followed by column modification. The pivot produces two new columns, name and value. The name column is removed because it is not needed, while the value column is kept and renamed to mention_username.
At the end of the process, some rows will contain a missing value in mention_username. These rows are unnecessary because the mention information of each post is already represented by a mentioned username or "No mention", so they are removed.
df_clean <- df_clean %>%
# Reshape attribute `mentions` from wide (columns) to long (rows)
pivot_longer(cols = starts_with("mention"),
values_to = "mention_username") %>%
# Remove column `name` because it is unnecessary
select(-name) %>%
# Remove rows that contains missing value in column `mention_username`
drop_na(mention_username)
The coauthorProducers.. and taggedUsers.. columns do have sub-attributes, so their reshaping differs slightly. For this case, I made a custom function named longer(..). It reshapes the indexed wide columns of an attribute and its sub-attributes into a single set of long columns without the index. The full explanation of this function (longer(..)) is in Custom Function Validation at the end of this project.
# Custom Function -> `longer(..)`
longer <- function(x, y){
# Reshape columns from wide to long
pivot_longer(data = x,
# Select columns with name that starts with argument `y`
cols = starts_with(y),
# Produce three columns of attribute, index, and sub-attribute
names_to = c("attribute", "index", "sub-attribute"),
# Column name separator `.` that separate attribute, index, and sub-attribute
names_sep = "\\.") %>%
# Remove column attribute because it's not needed
select(-attribute) %>%
# Reshape columns of sub-attributes from long to wide to make columns of sub-attributes without indexing
pivot_wider(names_from = "sub-attribute", values_from = "value", names_prefix = paste0(y,"_")) %>%
# Remove column `index` because it is no longer needed
select(-index) %>%
# Remove rows that contains missing value in columns with name that starts with argument `y`
drop_na(starts_with(y))
}
# For more detail explanation, go to 'Function Validation' at the end of this project.
# Apply function `longer(..)` to `coauthorProducer` and `taggedUser` to reshape them.
df_clean <- df_clean %>%
longer("coauthorProducers") %>%
longer("taggedUsers")
After the reshaping process, here is a peek at the dataset:
glimpse(df_clean)
> Rows: 1,496
> Columns: 17
> $ commentsCount <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, 0,…
> $ likesCount <int> 236, 287, 71, 202, 210, 1237, 429, 773, …
> $ ownerFullName <chr> "AO Snack Brand", "AO Snack Brand", "AO …
> $ ownerUsername <chr> "aosnack.brand", "aosnack.brand", "aosna…
> $ productType <chr> NA, NA, "clips", NA, NA, NA, NA, NA, NA,…
> $ timestamp <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06:5…
> $ type <chr> "Image", "Image", "Video", "Image", "Ima…
> $ videoDuration <dbl> NA, NA, 23.033, NA, NA, NA, NA, NA, NA, …
> $ videoPlayCount <int> NA, NA, 2977, NA, NA, NA, NA, NA, NA, NA…
> $ videoViewCount <int> NA, NA, 1468, NA, NA, NA, NA, NA, NA, NA…
> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
> $ mention_username <chr> "No mention", "No mention", "No mention"…
> $ coauthorProducers_is_verified <chr> "No co-author", "No co-author", "No co-a…
> $ coauthorProducers_username <chr> "No co-author", "No co-author", "No co-a…
> $ taggedUsers_full_name <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_is_verified <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_username <chr> "No tag", "No tag", "No tag", "No tag", …
# Check for missing values
colSums(is.na(df_clean))
> commentsCount likesCount
> 0 0
> ownerFullName ownerUsername
> 0 0
> productType timestamp
> 703 0
> type videoDuration
> 0 703
> videoPlayCount videoViewCount
> 707 703
> id mention_username
> 0 0
> coauthorProducers_is_verified coauthorProducers_username
> 0 0
> taggedUsers_full_name taggedUsers_is_verified
> 0 0
> taggedUsers_username
> 0
The coauthorProducers and taggedUsers columns have become longer, without indexing, and no longer contain any missing values. In the next section, I will apply imputation methods to handle the remaining missing values.
Value Imputation
First, I will handle the missing values in the productType column. This column indicates the product type of an Instagram post; its observed values are clips, igtv, or missing (NA).
unique(df_clean$productType)
> [1] NA "clips" "igtv"
df_igtv <- df_clean %>%
filter(productType == "igtv")
df_clips <- df_clean %>%
filter(productType == "clips")
unique(df_igtv$type)
> [1] "Video"
unique(df_clips$type)
> [1] "Video"
unique(df_clean$type)
> [1] "Image" "Video" "Sidecar"
Both "igtv" and "clips" posts are of type Video, while the dataset as a whole also contains Image and Sidecar posts. The rows with a missing productType therefore correspond to non-video posts, so "photo" is imputed for them.
df_clean <- df_clean %>%
mutate(productType = ifelse(is.na(productType), "photo", productType))
Now, to verify that no video-type post ended up with productType "photo", the unique type values among photo posts are checked.
# Filter rows with productType "photo"
df_photo <- df_clean %>%
filter(productType == "photo")
# Check for unique value
unique(df_photo$type)
> [1] "Image" "Sidecar"
colSums(is.na(df_clean))
> commentsCount likesCount
> 0 0
> ownerFullName ownerUsername
> 0 0
> productType timestamp
> 0 0
> type videoDuration
> 0 703
> videoPlayCount videoViewCount
> 707 703
> id mention_username
> 0 0
> coauthorProducers_is_verified coauthorProducers_username
> 0 0
> taggedUsers_full_name taggedUsers_is_verified
> 0 0
> taggedUsers_username
> 0
Next, I will check the rows that contain missing values in videoViewCount.
df_clean %>%
filter(is.na(videoViewCount) == TRUE) %>%
head(10)
Based on the result, the rows with missing values in videoViewCount, videoPlayCount, and videoDuration all share the same product type, photo. Therefore, the value 0 is imputed into these columns for photo posts, since a photo has no duration or play count, and this keeps the columns numeric.
# Impute value 0 into columns `videoDuration`, `videoPlayCount`, and `videoViewCount`
df_clean <- df_clean %>%
mutate(videoDuration = ifelse(productType=="photo",0,videoDuration)) %>%
mutate(videoPlayCount = ifelse(productType=="photo",0,videoPlayCount)) %>%
mutate(videoViewCount = ifelse(productType=="photo",0,videoViewCount))
# Re-check for missing values after imputation
colSums(is.na(df_clean))
> commentsCount likesCount
> 0 0
> ownerFullName ownerUsername
> 0 0
> productType timestamp
> 0 0
> type videoDuration
> 0 0
> videoPlayCount videoViewCount
> 4 0
> id mention_username
> 0 0
> coauthorProducers_is_verified coauthorProducers_username
> 0 0
> taggedUsers_full_name taggedUsers_is_verified
> 0 0
> taggedUsers_username
> 0
Missing values in videoViewCount and videoDuration have been handled. Some missing values remain in videoPlayCount, so a further check must be performed.
# Check for missing values in column `videoPlayCount`
df_clean %>%
filter(is.na(videoPlayCount) == TRUE)
The rows with missing values all share the same product type, "igtv", and the same type, "Video". Before proceeding, the product type "igtv" is changed to "video", as this format is no longer available on Instagram. Similarly, "clips" is changed to "video" for consistency. The upcoming analysis will focus only on the product types "video" and "photo" to maintain clarity and consistency. The type column is therefore dropped, since it carries the same meaning as productType.
df_clean$productType[df_clean$productType == "igtv" | df_clean$productType == "clips"] <- "video"
# Drop column `type`
df_clean <- df_clean %>%
select(-type)
# Check unique value in `productType`
unique(df_clean$productType)
> [1] "photo" "video"
To handle the remaining missing values in videoPlayCount, I will perform feature engineering to create a metric called Completion Rate and use it to calculate imputation values.
Feature Engineering: Completion Rate
The completion rate of a video is a key metric that measures how
often viewers watch a video in its entirety. It is calculated by
dividing the total number of complete views by the total number of video
plays. In this context, the total number of complete views is
represented by the column videoViewCount,
while the total number of video plays is represented by the column
videoPlayCount. The formula can be
expressed as:
\[Completion\:Rate =
\frac{Total\:Number\:Complete\:Views}{Total\:Video\:Plays} \times
100\]
This percentage indicates the engagement level of the audience,
helping content creators and marketers assess how compelling their
videos are. A higher completion rate suggests that viewers are more
likely to watch the video to the end, while a lower rate may indicate a
need for improvements in content or presentation.
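Expressed as a small helper (an illustrative sketch; the pipeline below computes the rate inline):
# Completion rate in percent, given complete views and total plays
completion_rate <- function(views, plays) views / plays * 100
completion_rate(50, 200) # a video played 200 times and watched fully 50 times -> 25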
For this case, I chose to impute missing videoPlayCount values via the completion rate rather than via the mean of videoPlayCount itself. The completion rate reflects viewer engagement patterns, making it a more stable and representative measure. When videoPlayCount is missing for a video, the average completion rate (derived from the other videos) yields an estimate grounded in typical viewer behavior rather than in raw play counts. Imputing the average play count directly could be inaccurate; it may even produce a value lower than the post's view count, which is nonsensical.
For example, if the average completion rate is 50%, and a video has
videoViewCount = 200, the estimated
videoPlayCount would be
200 / 0.5 = 400. This approach reduces
bias from outliers (such as viral videos with unusually high play
counts) and provides a more reliable imputation for missing data. The
formula for total video plays
(videoPlayCount) :
\[Total\:Video\:Plays
=\frac{Total\:Number\:Complete\:Views}{Completion\:Rate} \times
100\]
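As a minimal sketch of this inversion, using the assumed numbers from the worked example above:
# Hypothetical values from the example: 50% average completion rate,
# 200 complete views on a post whose play count is missing
mean_cr <- 50
viewCount <- 200
est_playCount <- viewCount / mean_cr * 100
est_playCount
> [1] 400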
I start by creating a new column named completionRate, valued at (videoViewCount/videoPlayCount)*100 when the product type is video and zero otherwise. The mean completionRate is then computed across all video records, excluding rows with missing values.
# Create a new column `completionRate`
df_clean <- df_clean %>%
mutate(completionRate = ifelse(productType == "video",((videoViewCount/videoPlayCount)*100),0))
# Check completionRate distribution
df_clean %>%
distinct(id, .keep_all = TRUE) %>%
filter(productType == "video") %>%
select(completionRate) %>%
boxplot(horizontal = T)
[Figure: horizontal boxplot of completionRate for video posts]
# Check for outliers with a completion rate greater than 100%
df_clean %>%
filter(completionRate > 100)
Out of 1,254 posts, two have a completion rate exceeding 100%. A completion rate above 100% is illogical, so these posts are removed as they may distort insights.
# Remove data with completion rate > 100%, keeping NA rows so they can be
# imputed below (a bare `filter(completionRate < 100)` would drop them too)
df_clean <- df_clean %>%
filter(completionRate <= 100 | is.na(completionRate))
# Re-check the completionRate distribution after removing the outliers
df_clean %>%
distinct(id, .keep_all = TRUE) %>%
filter(productType == "video") %>%
select(completionRate) %>%
boxplot(horizontal = T)
The completionRate data is fairly
normally distributed, and its mean value will be calculated.
# Calculate mean of `completionRate`
mean_completionRate <- df_clean %>%
distinct(id, .keep_all = TRUE) %>%
filter(productType == "video") %>%
# Remove data that contains missing values
na.omit() %>%
# Take just the column `completionRate` for calculation
pull(completionRate) %>%
# Calculate the mean
mean()
mean_completionRate
> [1] 39.82432
The calculated mean completion rate across video-type posts is around
39.82%. This value will serve as the imputation reference for null
entries in the completionRate column,
enabling subsequent estimation of missing
videoPlayCount values.
df_clean <- df_clean %>%
# Impute the mean completion rate into missing values of `completionRate`
mutate(completionRate = ifelse(is.na(completionRate), mean_completionRate, completionRate)) %>%
# Impute the estimated play count into missing values of `videoPlayCount`
mutate(videoPlayCount = ifelse(is.na(videoPlayCount), round((videoViewCount/completionRate)*100), videoPlayCount))
# Check for missing values
colSums(is.na(df_clean))
> commentsCount likesCount
> 0 0
> ownerFullName ownerUsername
> 0 0
> productType timestamp
> 0 0
> videoDuration videoPlayCount
> 0 0
> videoViewCount id
> 0 0
> mention_username coauthorProducers_is_verified
> 0 0
> coauthorProducers_username taggedUsers_full_name
> 0 0
> taggedUsers_is_verified taggedUsers_username
> 0 0
> completionRate
> 0
The dataset is now complete, with all previously missing values
properly handled.
Duplicate Rows
# Check for duplicate rows
dim(df_clean[duplicated(df_clean),])[1]
> [1] 0
The dataset contains no duplicates.
Data Type Adjustment
# Peek the dataset structure
glimpse(df_clean)
> Rows: 1,490
> Columns: 17
> $ commentsCount <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, 0,…
> $ likesCount <int> 236, 287, 71, 202, 210, 1237, 429, 773, …
> $ ownerFullName <chr> "AO Snack Brand", "AO Snack Brand", "AO …
> $ ownerUsername <chr> "aosnack.brand", "aosnack.brand", "aosna…
> $ productType <chr> "photo", "photo", "video", "photo", "pho…
> $ timestamp <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06:5…
> $ videoDuration <dbl> 0.000, 0.000, 23.033, 0.000, 0.000, 0.00…
> $ videoPlayCount <dbl> 0, 0, 2977, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ videoViewCount <dbl> 0, 0, 1468, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
> $ mention_username <chr> "No mention", "No mention", "No mention"…
> $ coauthorProducers_is_verified <chr> "No co-author", "No co-author", "No co-a…
> $ coauthorProducers_username <chr> "No co-author", "No co-author", "No co-a…
> $ taggedUsers_full_name <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_is_verified <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_username <chr> "No tag", "No tag", "No tag", "No tag", …
> $ completionRate <dbl> 0.00000, 0.00000, 49.31139, 0.00000, 0.0…
Columns requiring data type conversion:
- productType, coauthorProducers_is_verified, and taggedUsers_is_verified to factor.
- videoViewCount and videoPlayCount to integer.
df_clean <- df_clean %>%
# Change columns data type to factor
mutate_at(vars(productType, coauthorProducers_is_verified, taggedUsers_is_verified),as.factor) %>%
# Change columns data type to integer
mutate_at(vars(videoViewCount, videoPlayCount), as.integer)
# Peek the dataset structure
glimpse(df_clean)
> Rows: 1,490
> Columns: 17
> $ commentsCount <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, 0,…
> $ likesCount <int> 236, 287, 71, 202, 210, 1237, 429, 773, …
> $ ownerFullName <chr> "AO Snack Brand", "AO Snack Brand", "AO …
> $ ownerUsername <chr> "aosnack.brand", "aosnack.brand", "aosna…
> $ productType <fct> photo, photo, video, photo, photo, photo…
> $ timestamp <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06:5…
> $ videoDuration <dbl> 0.000, 0.000, 23.033, 0.000, 0.000, 0.00…
> $ videoPlayCount <int> 0, 0, 2977, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ videoViewCount <int> 0, 0, 1468, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
> $ mention_username <chr> "No mention", "No mention", "No mention"…
> $ coauthorProducers_is_verified <fct> No co-author, No co-author, No co-author…
> $ coauthorProducers_username <chr> "No co-author", "No co-author", "No co-a…
> $ taggedUsers_full_name <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_is_verified <fct> No tag, No tag, No tag, No tag, No tag, …
> $ taggedUsers_username <chr> "No tag", "No tag", "No tag", "No tag", …
> $ completionRate <dbl> 0.00000, 0.00000, 49.31139, 0.00000, 0.0…
All columns now have the appropriate data types.
Rearrange Columns
Column positions are rearranged to improve readability, analysis efficiency, and the logical flow of information. Organizing columns in a meaningful order (e.g., grouping related variables, placing key identifiers first, or following a chronological sequence) makes the data more intuitive to interpret and reduces cognitive load. Certain columns will also be renamed to enforce naming consistency.
df_clean <- df_clean %>%
# Re-arrange columns position
select(id, timestamp, ownerFullName, ownerUsername, productType, coauthorProducers_username, coauthorProducers_is_verified, taggedUsers_full_name, taggedUsers_username, taggedUsers_is_verified, mention_username, likesCount, commentsCount, videoDuration, videoPlayCount, videoViewCount, completionRate) %>%
# Rename columns for naming consistency
rename(coauthorProducerUsername = coauthorProducers_username,
coauthorProducerIsVerified = coauthorProducers_is_verified,
taggedUserFullName = taggedUsers_full_name,
taggedUserUsername = taggedUsers_username,
taggedUserIsVerified = taggedUsers_is_verified,
mentionUsername = mention_username)
head(df_clean)
Data cleaning and preparation is complete, resulting in an analysis-ready dataset.
Insights Deep-Dive
The primary goal of this data analysis is to identify post
characteristics that drive high engagement. By leveraging the insights
obtained, we aim to create content that maximizes and improves
engagement. The quantitative analysis is based solely on available data,
focusing exclusively on likes and comments per post as the engagement
metrics. First, a quick statistical summary of likes and comments of all
posts will be examined.
df_clean %>%
distinct(id, .keep_all = TRUE) %>%
select(likesCount,commentsCount) %>%
summary()
> likesCount commentsCount
> Min. : -1.00 Min. : 0.00
> 1st Qu.: 86.75 1st Qu.: 2.00
> Median : 186.00 Median : 5.00
> Mean : 1862.22 Mean : 46.08
> 3rd Qu.: 474.75 3rd Qu.: 17.00
> Max. :214517.00 Max. :7947.00
Across all posts, the ranges of likes and comments vary significantly. The maximum values are extremely high compared to the rest, suggesting some posts may have gone viral. Additionally, an illogical value of -1 was found in likesCount. After manual verification (viewing the actual posts), it was confirmed that -1 represents posts whose like counts were hidden by the user.
Disclaimer: All qualitative evaluations of
Instagram posts in this study were genuinely conducted because the
original (non-anonymized) data from Apify contained post URLs. These
URLs have since been removed to ensure complete data
anonymization.
df_clean %>%
distinct(id, .keep_all = TRUE) %>%
filter(likesCount<0)
# Check number of posts
df_clean %>%
distinct(id, .keep_all = TRUE) %>%
nrow()
> [1] 1248
Only 7 out of 1,248 posts (0.56%) had hidden like counts; removing them won't significantly affect the dataset while keeping the analysis clean and consistent. An engagement column is then created by summing likes and comments.
df_clean <- df_clean %>%
# Remove data with likesCount lower than zero
filter(likesCount>=0) %>%
# Create column engagementCount
mutate(engagementCount = likesCount + commentsCount)
The dataset contains duplicate post IDs caused by variations in other features (tagged users, mentions, etc.). To ensure accuracy and simplify further processing, duplicate post IDs are removed, storing the cleaned data in df_id while retaining only one entry per unique post. This prevents duplicate measurements from skewing results.
df_id <- df_clean %>%
distinct(id, .keep_all = TRUE)
# Export for visualization in Tableau
write_xlsx(df_id, "data_output/data_id.xlsx")
Top Posts by Engagements
# Examine the top 10 posts with the highest engagement
df_id %>%
arrange(desc(engagementCount)) %>%
slice_head(n = 10)
Qualitative analysis of the top 10 posts’ content reveals common
characteristics: these posts consistently feature brand-relevant
content, either showcasing product strengths or sharing authentic
consumer testimonials. Meanwhile, the attached data provides these key
insights:
- From Jan 2022 to Mar 2025, 9 out of 10 top-performing posts are
videos and only 1 is a photo-type post.
- All 10 high-engagement posts involved collaborations – either with
influencers or complementary brands aligned with the brand image (e.g.,
sporty influencers, hampers brand, or sports events where our product
adds value).
- Wilson Nate’s appearance in five of the top posts suggests he is a
key content creator contributing significantly to engagement.
These insights prompt deeper exploration through follow-up business
questions:
- Do posts categorized as video drive higher engagement?
- Do collaboration posts, whether with influencers or complementary brands, significantly boost engagement?
- Which collaborators deliver the largest engagement impact?
Engagement Trends and Growth Rates
# Aggregation for Plot
monthly <- df_id %>%
  mutate(month_year = floor_date(timestamp, "month") %>% # month-year label
           format("%b %Y")) %>%
  group_by(month_year) %>%
  summarise(sum_engagementCount = sum(engagementCount),
            numPost = n()) %>%
  mutate(month_date = my(month_year)) %>%
  arrange(month_date)
# Plot
ggplot(data = monthly,
       mapping = aes(x = month_date,
                     y = sum_engagementCount)) +
  geom_line(color = "#4f5bd5", linewidth = 1) +
  # Y-Axis
  scale_y_continuous(
    labels = function(x) paste0(x/1000, "K"), # format labels as e.g. 300K
    breaks = pretty_breaks(),                 # sensible breaks
    expand = expansion(mult = c(0.1, 0.1))    # a little headroom
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(),              # remove all gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # black axis lines
    axis.ticks = element_line(color = "black") # black tick marks
  ) +
  labs(
    title = "Monthly Engagements",
    x = NULL,
    y = "Engagement Count"
  )
[Figure: line plot of total monthly engagement, Jan 2022 - Mar 2025]
monthly %>%
group_by(year = year(month_date)) %>%
summarise(totalEngagement = sum(sum_engagementCount),
numPost = sum(numPost)) %>%
mutate(engagementGrowth = round(((totalEngagement - lag(totalEngagement))/lag(totalEngagement))*100,2)) %>%
mutate(numPostGrowth = round(((numPost - lag(numPost))/lag(numPost))*100,2))
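The growth rates computed above follow the usual year-over-year formula, applied to both total engagement and the number of posts:
\[Growth\:Rate = \frac{Total_{t} - Total_{t-1}}{Total_{t-1}} \times 100\]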
Insights:
- Between Jan 2022 and Mar 2025, monthly engagement ranged from ~1.2K
to ~368.6K.
- Major spikes occurred in Jun–Jul 2023, Feb 2024, May–Jun 2024, Dec
2024, and another rise starting in Mar 2025, with most spikes being
followed by sharp declines.
- The highest peak in June–July 2023 exceeded 350K, likely driven by
viral content or major campaigns.
- Engagement in 2024 saw multiple peaks (74K–265K), suggesting an
event-driven or seasonal trend, which will be discussed further in the
Impact of Collaborations on
Engagements section below.
Engagements by Post Types
# Aggregation for Plot
monthly_types <- df_id %>%
  mutate(month_year = floor_date(timestamp, "month") %>% # month-year label
           format("%b %Y")) %>%
  group_by(month_year, productType) %>%
  summarise(sum_engagementCount = sum(engagementCount)) %>%
  mutate(month_date = my(month_year)) %>%
  arrange(month_date, productType)
# Plot
ggplot(data = monthly_types,
       mapping = aes(x = month_date,
                     y = sum_engagementCount, color = productType)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Paired") +
  # Y-Axis
  scale_y_continuous(
    labels = function(x) paste0(x/1000, "K"), # format labels as e.g. 300K
    breaks = pretty_breaks(),                 # sensible breaks
    expand = expansion(mult = c(0.1, 0.1))    # a little headroom
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(),              # remove all gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # black axis lines
    axis.ticks = element_line(color = "black") # black tick marks
  ) +
  labs(
    title = "Monthly Engagements by Post Types",
    x = NULL,
    y = "Engagement Count",
    color = "Post Type"
  )
Most of the time, the total engagement trend is driven by video-type posts; the dominant spikes come mainly from video engagement. Next, the number of posts of each type will be compared.
df_id %>%
group_by(productType) %>%
summarise(sum_engagementCount = sum(engagementCount),
numPost = n())
The data distribution between video and photo post types is sufficiently balanced, allowing a direct comparison of their engagement metrics.
[Figure: boxplot of engagementCount by productType]
The data contains many outliers and is heavily skewed. Therefore, the median will be used to compare engagement between the two post types, as it provides a more realistic measure of central tendency than the mean.
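The skew can be quantified with the moments package, which is loaded at the start but not otherwise used in this section; a sketch:
# Skewness well above 1 indicates a strong right (positive) skew,
# supporting the use of the median over the mean
skewness(df_id$engagementCount[df_id$productType == "video"])
skewness(df_id$engagementCount[df_id$productType == "photo"])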
# Engagement stats summary of video type
summary(df_id$engagementCount[df_id$productType == "video"])
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 15 152 361 3420 1022 222262
# Engagement stats summary of photo type
summary(df_id$engagementCount[df_id$productType == "photo"])
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 11.0 67.0 129.5 434.5 253.8 66628.0
Based on the boxplot and statistical summary above, several insights were obtained:
- Several viral posts (the outliers) show significantly higher engagement for videos than for photos, and the maximum engagement for a video (222,262) far exceeds that of a photo (66,628), indicating videos have a higher potential for viral reach.
- Videos tend to generate higher engagement than photos, with a median of 361 versus 129.5. Half of all videos achieve engagement above 361, while half of all photos fall below 129.5.
- Based on Q1 to Q3, videos (152–1022) show a broader spread of engagement than photos (67–253.8), meaning the middle 50% of videos sit considerably higher. This suggests videos are more likely to drive high engagement, though their variability is also greater.
To support these findings, a statistical test will be conducted to determine whether the difference in engagement between the two types is significant or due to random variation. First, the distributions are checked with a normality test.
# Normality test for engagement of photo-type post
shapiro.test(df_id[df_id$productType == "photo",]$engagementCount)
>
> Shapiro-Wilk normality test
>
> data: df_id[df_id$productType == "photo", ]$engagementCount
> W = 0.082677, p-value < 2.2e-16
# Normality test for engagement of video-type post
shapiro.test(df_id[df_id$productType == "video",]$engagementCount)
>
> Shapiro-Wilk normality test
>
> data: df_id[df_id$productType == "video", ]$engagementCount
> W = 0.17691, p-value < 2.2e-16
The results show that the engagement distributions of photo-type and video-type posts are not normal. Therefore, I will use a non-parametric test, the Mann-Whitney U, to compare engagement across post types.
# Check level of productType
levels(df_id$productType)
> [1] "photo" "video"
Hypothesis of Mann-Whitney U test for this case:
\(H_{0}\): There is no difference in
engagement distribution between video and photo posts.
\(H_{1}\): The engagement distribution
of video is stochastically greater than photos.
# Mann-Whitney U Test
# `photo` is the first factor level, so `alternative = "less"` tests whether
# photo engagement is stochastically smaller than video engagement
wilcox.test(engagementCount ~ productType,
data = df_id,
alternative = "less")
>
> Wilcoxon rank sum test with continuity correction
>
> data: engagementCount by productType
> W = 105159, p-value < 2.2e-16
> alternative hypothesis: true location shift is less than 0
With this result (p < 2.2e-16), the null hypothesis (\(H_{0}\)) is rejected. Statistically, the engagement distribution of videos is stochastically greater than that of photos, indicating that videos consistently drive higher engagement than photo posts.
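To complement the significance test with an effect size, Cliff's delta from the effsize package (loaded at the start but not otherwise used here) could quantify how large the difference is; a sketch, not part of the original analysis:
# Cliff's delta: non-parametric effect size; values near +1 mean video
# engagement tends to exceed photo engagement
cliff.delta(df_id$engagementCount[df_id$productType == "video"],
            df_id$engagementCount[df_id$productType == "photo"])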
Engagements and Completion Rates by Video Durations
The analysis will now examine how video duration impacts engagement
metrics. I’ve segmented videos into five duration categories (0-15s,
15-30s, 30-45s, 45-60s, and >60s) to determine optimal length for
engagement maximization.
# Make a custom function to categorize video duration
convert_duration <- function(y){
  if (y <= 15) {
    "0 - 15 sec"
  } else if (y <= 30) {
    "15 - 30 sec"
  } else if (y <= 45) {
    "30 - 45 sec"
  } else if (y <= 60) {
    "45 - 60 sec"
  } else {
    "> 60 sec"
  }
}
# Make a new column of duration category by applying the custom function
df_id$durationCat <- factor(sapply(X = df_id$videoDuration,
FUN = convert_duration), levels = c("0 - 15 sec", "15 - 30 sec", "30 - 45 sec", "45 - 60 sec","> 60 sec"))
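An equivalent, more concise alternative is base R's cut(); a sketch that yields the same categories, since cut()'s default right-closed intervals match the <= checks above:
# Same binning via cut(); breaks (-Inf,15], (15,30], ... mirror convert_duration()
df_id$durationCat <- cut(df_id$videoDuration,
                         breaks = c(-Inf, 15, 30, 45, 60, Inf),
                         labels = c("0 - 15 sec", "15 - 30 sec", "30 - 45 sec",
                                    "45 - 60 sec", "> 60 sec"))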
df_id %>%
filter(productType == "video") %>%
count(durationCat)
While the counts are not perfectly equal, the distribution is
reasonably balanced, with no extreme category dominating the dataset.
This allows for a direct comparison across categories to derive
meaningful insights.
plot(engagementCount ~ durationCat, data = df_id[df_id$productType == "video",])
The data also contains many outliers and is heavily skewed.
Therefore, the median will be used to compare the engagement between
these video duration categories, as it provides a more realistic measure
of central tendency than the mean.
duration_engagement <- df_id %>%
filter(productType == "video") %>%
group_by(durationCat) %>%
summarise(median_engagementCount = median(engagementCount)) %>%
arrange(desc(median_engagementCount))
duration_engagement
# Plot
ggplot(duration_engagement, aes(durationCat, median_engagementCount)) +
  geom_bar(stat = "identity", fill = "#4f5bd5") +
  # Y-Axis
  scale_y_continuous(
    breaks = pretty_breaks(),              # sensible breaks
    expand = expansion(mult = c(0.1, 0.1)) # a little headroom
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(),              # remove all gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # black axis lines
    axis.ticks = element_line(color = "black") # black tick marks
  ) +
  labs(
    title = "Median Engagements by Video Durations",
    x = NULL,
    y = "Engagement Count"
  )
[Figure: bar plot of median engagement by video duration category]
Insights:
- Videos longer than 15 seconds generally drive higher engagement,
with medians ranging from 315 to 479, while videos shorter than 15
seconds record the lowest median engagement (148). This indicates that
longer formats can generate roughly 2-3x more interactions, suggesting
overly brief content may offer less value or appeal to the
audience.
plot(completionRate ~ durationCat, data = df_id[df_id$productType == "video",])
This data contains only a few outliers and is not heavily skewed. Therefore, the mean will be used to compare completion rates across the video duration categories, as it remains a realistic measure of central tendency.
duration_completion <- df_id %>%
filter(productType == "video") %>%
group_by(durationCat) %>%
summarise(mean_completionRate = mean(completionRate)) %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
arrange(desc(mean_completionRate))
duration_completion
# Plot
ggplot(duration_completion, aes(durationCat, mean_completionRate, group = 1)) +
  geom_line(color = "#4f5bd5", linewidth = 1) +
  geom_point(color = "#4f5bd5") +
  # Y-Axis
  scale_y_continuous(
    breaks = pretty_breaks(),              # sensible breaks
    expand = expansion(mult = c(0.1, 0.1)) # a little headroom
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(),              # remove all gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # black axis lines
    axis.ticks = element_line(color = "black") # black tick marks
  ) +
  labs(
    title = "Completion Rates by Video Durations",
    x = NULL,
    y = "Completion Rate"
  )
[Figure: line plot of mean completion rate by video duration category]
Insights:
- The highest average completion rate belongs to videos over 60 seconds (44.66%), while the other durations fall within 37.43%–38.96%, indicating that videos longer than 60 seconds tend to deliver a 5–7 percentage point higher retention rate.
Engagements by Posting Times
In this section, I will analyze engagement based on posting times (by
hour and day). To ensure reliable median engagement measurement and
minimize the impact of outliers, I filtered the data to include only
time slots with at least six posts. Posting times with fewer than six
posts were excluded, as these rare cases could distort the overall
engagement pattern.
# Aggregation for Plot
posting_times <- df_id %>%
mutate(day = wday(timestamp, label = TRUE, abbr = FALSE)) %>%
mutate(hour = hour(timestamp)) %>%
group_by(day, hour) %>%
summarise(median_engagementCount = median(engagementCount),
numPost = n()) %>%
filter(numPost > 5)
# Plot
ggplot(posting_times, aes(day, hour, fill = median_engagementCount)) +
  geom_tile() +
  # Y-Axis
  scale_y_continuous(
    breaks = pretty_breaks() # sensible breaks
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(),              # remove all gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # black axis lines
    axis.ticks = element_line(color = "black") # black tick marks
  ) +
  labs(
    title = "Median Engagements by Posting Times",
    x = NULL,
    y = "Hour"
  )
[Figure: heatmap of median engagement by day of week and hour of day]
Insights:
- The heatmap of median engagement across posting times suggests no
consistent pattern. While certain time slots (e.g., 4 AM Sunday and 3 AM
Tuesday) show higher median engagement, posting time does not appear to
be a dominant driver of engagement.
Impact of Collaborations on Engagements
Before starting the analysis, I performed data preprocessing by
consolidating the username columns into a single column to simplify the
evaluation of each collaborator’s involvement in the posts.
collab <- df_clean %>%
pivot_longer(cols = contains("Username"),
values_to = "username") %>%
select(-name) %>%
filter(username != "No co-author" & username != "No tag" & username != "No mention") %>%
filter(username != "aosnack.brand")
glimpse(collab)
> Rows: 868
> Columns: 15
> $ id <int> 98, 205, 205, 205, 216, 218, 218, 234, 234, …
> $ timestamp <dttm> 2022-04-06 14:45:16, 2022-08-01 13:34:38, 2…
> $ ownerFullName <chr> "AO Snack Brand", "AO Snack Brand", "AO Snac…
> $ productType <fct> photo, video, video, video, video, video, vi…
> $ coauthorProducerIsVerified <fct> No co-author, true, true, true, No co-author…
> $ taggedUserFullName <chr> "No tag", "Liam Johnson", "Liam Johnson", "L…
> $ taggedUserIsVerified <fct> No tag, true, true, true, true, true, true, …
> $ likesCount <int> 92, 326, 326, 326, 25, 31, 31, 74, 74, 74, 6…
> $ commentsCount <int> 2, 11, 11, 11, 1, 0, 0, 5, 5, 5, 1, 1, 1, 22…
> $ videoDuration <dbl> 0.000, 163.000, 163.000, 163.000, 69.400, 88…
> $ videoPlayCount <int> 0, 19902, 19902, 19902, 2182, 2184, 2184, 59…
> $ videoViewCount <int> 0, 11608, 11608, 11608, 1006, 870, 870, 3603…
> $ completionRate <dbl> 0.00000, 58.32580, 58.32580, 58.32580, 46.10…
> $ engagementCount <int> 94, 337, 337, 337, 26, 31, 31, 79, 79, 79, 6…
> $ username <chr> "emma.smith", "liam_johnson", "liam_johnson"…
Since an account can be mentioned, tagged, and listed as a co-author
in the same post, this process may generate duplicates. To address this,
duplicate values were checked.
dim(collab[duplicated(collab[c("id", "username")]),])[1]
> [1] 525
As duplicates were found, they will be removed, as they could distort
the calculations.
collab_clean <- collab[!duplicated(collab[c("id","username")]),]
dim(collab_clean[duplicated(collab_clean[c("id", "username")]),])[1]
> [1] 0
Next, a column indicating collaboration is added to compare engagement between posts with and without collaboration.
# Pull the ids of posts that involve collaboration
id_post_collab <- collab_clean %>%
distinct(id, .keep_all = TRUE) %>%
pull(id)
# Add new column that indicates collaboration
df_id$collaboration <-as.factor(ifelse(df_id$id %in% id_post_collab, "Yes", "No"))
collab_data <- df_id %>%
select(id, collaboration)
# Export for visualization in Tableau
write_xlsx(collab_data,"data_output/collaboration.xlsx")
collaborator_name <- collab_clean %>%
select(id, username)
# Export for visualization in Tableau
write_xlsx(collaborator_name,"data_output/data_collaborator.xlsx")
plot(engagementCount ~ collaboration, data = df_id)
[Figure: boxplot of engagementCount by collaboration status]
Since the data distribution is skewed, the median engagement will be used to compare the two groups, as it provides a more reliable measure of central tendency than the mean.
monthly_collab <- df_id %>%
  mutate(month_year = floor_date(timestamp, "month") %>% # month-year label
           format("%b %Y")) %>%
  group_by(month_year, collaboration) %>%
  summarise(sum_engagementCount = sum(engagementCount)) %>%
  mutate(month_date = my(month_year)) %>%
  arrange(month_date, collaboration)
# Plot
ggplot(data = monthly_collab,
       mapping = aes(x = month_date,
                     y = sum_engagementCount, color = collaboration)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Paired") +
  # Y-Axis
  scale_y_continuous(
    labels = function(x) paste0(x/1000, "K"), # format labels as e.g. 300K
    breaks = pretty_breaks(),                 # sensible breaks
    expand = expansion(mult = c(0.1, 0.1))    # a little headroom
  ) +
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(),              # remove all gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # black axis lines
    axis.ticks = element_line(color = "black") # black tick marks
  ) +
  labs(
    title = "Monthly Engagements by Collaboration",
    x = NULL,
    y = "Engagement Count",
    color = "Collaboration"
  )
[Figure: monthly engagement for collaboration vs non-collaboration posts]
df_id %>%
group_by(collaboration) %>%
summarise(sum_engagementCount = sum(engagementCount),
numPost = n())
# Engagement stats summary of post with collaboration
summary(df_id$engagementCount[df_id$collaboration == "Yes"])
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 21.0 193.2 518.5 8565.4 2806.5 222262.0
# Engagement stats summary of post without collaboration
summary(df_id$engagementCount[df_id$collaboration == "No"])
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 11.0 83.5 169.0 502.7 403.0 39408.0
Insights:
- Engagement patterns frequently align with collaboration-driven posts. Notably, the spike observed in Jun–Jul 2023 was largely the result of content collaborations.
- Posts with collaborations have a much higher median engagement count (518.5) than posts without collaborations (169.0), roughly a 3x increase.
top_collaborators <- collab_clean %>%
group_by(username) %>%
summarise(sum_engagementCount = round(sum(engagementCount)),
numPost = n()) %>%
arrange(desc(sum_engagementCount))
top_collaborators
Insights:
- Wilson Nate, Aubrey Evans, Sportyanz Event, Jacob Miller, and Robert
Carter are the top collaborators contributing the highest total
engagement, with each generating over 77K.
Next, I will analyze the engagement spikes identified earlier.
Year 2023
df_id %>%
filter(year(timestamp) == 2023 & month(timestamp) %in% c(6, 7)) %>%
group_by(month = month(timestamp, label = T, abbr = F)) %>%
summarise(sum_engagementCount = sum(engagementCount))
collab_clean %>%
filter(month(timestamp) %in% c(6, 7), year(timestamp) == 2023) %>%
group_by(month = month(timestamp, label = T, abbr = F), username) %>%
summarise(sum_engagementCount = sum(engagementCount)) %>%
arrange(month, desc(sum_engagementCount)) %>%
top_n(3)
Insights:
- During the Jun–Jul 2023 peak, Wilson Nate was involved in ~575K of the total ~607K engagement, indicating a viral post featuring Wilson Nate.
Year 2024
df_id %>%
filter(year(timestamp) == 2024 & month(timestamp) %in% c(2, 6, 7, 8, 9, 10, 11, 12)) %>%
group_by(month = month(timestamp, label = T, abbr = F)) %>%
summarise(sum_engagementCount = sum(engagementCount))
collab_clean %>%
filter(month(timestamp) %in% c(2, 6, 7, 8, 9, 10, 11, 12), year(timestamp) == 2024) %>%
group_by(month = month(timestamp, label = T, abbr = F), username) %>%
summarise(sum_engagementCount = sum(engagementCount)) %>%
arrange(month, desc(sum_engagementCount)) %>%
top_n(3)
Insights:
- The spike in February 2024 was driven by the introduction of the new
Brand Ambassador, Mason Sanchez, in posts that also featured the
existing ambassadors, Wilson Nate and Robert Carter.
- Most engagement from Jun–Dec 2024 was related to the Sportyanz Event, contributing ~392K of the total ~1,108K engagement. This sustained performance was likely supported by a content strategy that extended beyond the event period, encompassing pre-event build-up and post-event follow-up content.