Disclaimer: The data presented is derived from public Instagram content. All account names and usernames displayed have been replaced with pseudonyms to protect the privacy of the original account holders. This anonymization prevents identity tracing, misuse of data, or harm to third parties. There is no association between these aliases and real Instagram accounts.

Project Overview

AO Snack (@aosnack.brand) (pseudonym), a healthy snack brand, uses Instagram to build brand awareness and engage with its audience. I’m taking a role as a data analyst to extract insights and provide recommendations to the social media and marketing teams to drive higher engagement on the Instagram account.

This page outlines the technical details, covering the processes of data extraction, data cleaning, and preparation for analysis, followed by an in-depth insights exploration. Insights and recommendations report can be accessed on my GitHub here.

Objective

The main objective of this analysis is:

Content Engagement Analysis: To identify and understand post characteristics, including topics, post types, and influencer involvement, that drive high audience engagement.

In line with this objective, the analysis seeks to answer key business questions, including:

Which posts achieved the highest engagement?
How have engagement trends and growth rates evolved over time?
Which post type drives higher engagement: video or photo?
How do video durations affect engagement and completion rates?
Does posting time significantly impact engagement?
How do collaborations influence engagement performance?

Data Extraction

The dataset of AO Snack (@aosnack.brand) Instagram account was collected using the Apify Instagram Scraper. The dataset consists of post metrics retrieved on March 31, 2025 at 10:49 PM (GMT+7) . It includes all posts from the account’s inception up to the time of data scraping. The collected metrics include likes, comments, tagged users, and various other engagement indicators. After the scraping process, the dataset was filtered to include only the required metrics and then exported to a CSV file.

Figure 1. Apify Instagram Scraper Web View

Data Cleaning and Preparation

Libraries

library(dplyr)
library(tidyr)
library(lubridate)
library(writexl)
library(ggplot2)
library(scales)
library(moments)
library(effsize)

Import Dataset

df <- read.csv("data_input/instagram_post_dataset.csv")
head(df)

In the imported dataset, there are empty data that can be considered as missing values (NA). Therefore, the data is re-imported with additional arguments to make empty data into NA.

df <- read.csv("data_input/instagram_post_dataset.csv", na.strings=c("","NA"))
head(df)

Here is a peek of the dataset :

glimpse(df)

> Rows: 1,254
> Columns: 88
> $ coauthorProducers.0.is_verified <chr> NA, "false", NA, NA, NA, NA, NA, NA, N…
> $ coauthorProducers.0.username    <chr> NA, "natalie.collins", NA, NA, NA, NA,…
> $ coauthorProducers.1.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.1.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ commentsCount                   <int> 0, 1, 0, 0, 4, 107, 91, 42, 0, 0, 2, 0…
> $ likesCount                      <int> 36, 82, 28, 38, 110, 836, 508, 324, 10…
> $ mentions.0                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.1                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.2                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.3                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.4                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.5                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.6                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.7                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.8                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.9                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ ownerFullName                   <chr> "AO Snack Brand", "AO Snack Brand", "A…
> $ ownerUsername                   <chr> "aosnack.brand", "aosnack.brand", "aos…
> $ productType                     <chr> "clips", "clips", NA, NA, "clips", NA,…
> $ taggedUsers.0.full_name         <chr> NA, "Natalie Collins", NA, NA, NA, NA,…
> $ taggedUsers.0.is_verified       <chr> NA, "false", NA, NA, NA, NA, NA, NA, N…
> $ taggedUsers.0.username          <chr> NA, "natalie.collins", NA, NA, NA, NA,…
> $ taggedUsers.1.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ timestamp                       <chr> "2025-03-30T09:44:38.000Z", "2025-03-3…
> $ type                            <chr> "Video", "Video", "Sidecar", "Image", …
> $ videoDuration                   <dbl> 26.933, 13.066, NA, NA, 34.921, NA, NA…
> $ videoPlayCount                  <int> 3513, 4510, NA, NA, 6480, NA, NA, 3222…
> $ videoViewCount                  <int> 926, 1303, NA, NA, 2213, NA, NA, 16428…

Data Description

Some of the attributes scraped using Apify contain sub-attributes. For example, the coauthorProducers attribute includes sub-attributes such as id, username, and other related attributes. Each attributes and sub-attributes is represented by a single column in the dataset. During the reshaping stage, all attributes and sub-attributes with indexing will be merged into a single set of columns (one for each attribute) without numerical indexing. Here is an explanation of the columns in the dataset :

coauthorProducers : represents the accounts that are co-authors of a post. Each account has several attributes, including:
- is_verified : verification status of the account (true/false).
- username : username of the account.
commentsCount : the number of comments on the post.
likesCount : the number of likes the post received.
mentions : accounts that are mentioned in the post.
ownerFullName : full name of the account owner who posted the content.
ownerUsername : username of the account owner.
productType : the type of content uploaded (clips, igtv, and photo).
taggedUsers : accounts that are tagged in the post. Each account has several attributes, including:
- full_name : full name of the tagged user.
- is_verified : verification status of the tagged account (true/false).
- username : username of the tagged account.
timestamp : the time when the post was created.
type : the type of post (video, image, and sidecar).
videoDuration : the duration of the video (if the post is a video).
videoPlayCount : the number of times the video has been played.
videoViewCount : the number of times the entire video has been watched.

Add Column ID of Posts

Before going further, column id is added for identification (ID) of each post that represented by each row. Column id will help in identifying each post for further data analysis. First, data type of timestamp will be changed to datetime then the dataset will be sorted by timestamp. This process will assign sequential IDs to the posts, where the first post will be given ID 1, the second post will have ID 2, and so on.

df_clean <- df %>% 
  # Change column `timestamp` data type to `datetime`
  mutate(timestamp = ymd_hms(timestamp, tz = "Asia/Jakarta")) %>%
  # Sort dataset by column `timestamp`
  arrange(timestamp)

# Add column `id`
df_clean <- df_clean %>% mutate(id = row_number())

glimpse(df_clean)

> Rows: 1,254
> Columns: 89
> $ coauthorProducers.0.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.0.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.1.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.1.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.2.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.is_verified <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ coauthorProducers.3.username    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ commentsCount                   <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, …
> $ likesCount                      <int> 236, 287, 71, 202, 210, 1237, 429, 773…
> $ mentions.0                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.1                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.2                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.3                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.4                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.5                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.6                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.7                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.8                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ mentions.9                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ ownerFullName                   <chr> "AO Snack Brand", "AO Snack Brand", "A…
> $ ownerUsername                   <chr> "aosnack.brand", "aosnack.brand", "aos…
> $ productType                     <chr> NA, NA, "clips", NA, NA, NA, NA, NA, N…
> $ taggedUsers.0.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.0.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.0.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.1.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.10.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.11.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.12.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.13.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.14.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.15.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.16.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.17.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.18.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.is_verified      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.19.username         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.2.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.3.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.4.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.5.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.6.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.7.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.8.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.is_verified       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ taggedUsers.9.username          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
> $ timestamp                       <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06…
> $ type                            <chr> "Image", "Image", "Video", "Image", "I…
> $ videoDuration                   <dbl> NA, NA, 23.033, NA, NA, NA, NA, NA, NA…
> $ videoPlayCount                  <int> NA, NA, 2977, NA, NA, NA, NA, NA, NA, …
> $ videoViewCount                  <int> NA, NA, 1468, NA, NA, NA, NA, NA, NA, …
> $ id                              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…

Missing Values

# Check for missing values
colSums(is.na(df_clean))

> coauthorProducers.0.is_verified    coauthorProducers.0.username 
>                            1109                            1109 
> coauthorProducers.1.is_verified    coauthorProducers.1.username 
>                            1241                            1241 
> coauthorProducers.2.is_verified    coauthorProducers.2.username 
>                            1249                            1249 
> coauthorProducers.3.is_verified    coauthorProducers.3.username 
>                            1253                            1253 
>                   commentsCount                      likesCount 
>                               0                               0 
>                      mentions.0                      mentions.1 
>                            1130                            1236 
>                      mentions.2                      mentions.3 
>                            1249                            1252 
>                      mentions.4                      mentions.5 
>                            1253                            1253 
>                      mentions.6                      mentions.7 
>                            1253                            1253 
>                      mentions.8                      mentions.9 
>                            1253                            1253 
>                   ownerFullName                   ownerUsername 
>                               0                               0 
>                     productType         taggedUsers.0.full_name 
>                             631                            1067 
>       taggedUsers.0.is_verified          taggedUsers.0.username 
>                            1067                            1067 
>         taggedUsers.1.full_name       taggedUsers.1.is_verified 
>                            1213                            1213 
>          taggedUsers.1.username        taggedUsers.10.full_name 
>                            1213                            1253 
>      taggedUsers.10.is_verified         taggedUsers.10.username 
>                            1253                            1253 
>        taggedUsers.11.full_name      taggedUsers.11.is_verified 
>                            1253                            1253 
>         taggedUsers.11.username        taggedUsers.12.full_name 
>                            1253                            1253 
>      taggedUsers.12.is_verified         taggedUsers.12.username 
>                            1253                            1253 
>        taggedUsers.13.full_name      taggedUsers.13.is_verified 
>                            1253                            1253 
>         taggedUsers.13.username        taggedUsers.14.full_name 
>                            1253                            1253 
>      taggedUsers.14.is_verified         taggedUsers.14.username 
>                            1253                            1253 
>        taggedUsers.15.full_name      taggedUsers.15.is_verified 
>                            1253                            1253 
>         taggedUsers.15.username        taggedUsers.16.full_name 
>                            1253                            1253 
>      taggedUsers.16.is_verified         taggedUsers.16.username 
>                            1253                            1253 
>        taggedUsers.17.full_name      taggedUsers.17.is_verified 
>                            1253                            1253 
>         taggedUsers.17.username        taggedUsers.18.full_name 
>                            1253                            1253 
>      taggedUsers.18.is_verified         taggedUsers.18.username 
>                            1253                            1253 
>        taggedUsers.19.full_name      taggedUsers.19.is_verified 
>                            1253                            1253 
>         taggedUsers.19.username         taggedUsers.2.full_name 
>                            1253                            1232 
>       taggedUsers.2.is_verified          taggedUsers.2.username 
>                            1232                            1232 
>         taggedUsers.3.full_name       taggedUsers.3.is_verified 
>                            1246                            1246 
>          taggedUsers.3.username         taggedUsers.4.full_name 
>                            1246                            1249 
>       taggedUsers.4.is_verified          taggedUsers.4.username 
>                            1249                            1249 
>         taggedUsers.5.full_name       taggedUsers.5.is_verified 
>                            1251                            1251 
>          taggedUsers.5.username         taggedUsers.6.full_name 
>                            1251                            1251 
>       taggedUsers.6.is_verified          taggedUsers.6.username 
>                            1251                            1251 
>         taggedUsers.7.full_name       taggedUsers.7.is_verified 
>                            1251                            1251 
>          taggedUsers.7.username         taggedUsers.8.full_name 
>                            1251                            1251 
>       taggedUsers.8.is_verified          taggedUsers.8.username 
>                            1251                            1251 
>         taggedUsers.9.full_name       taggedUsers.9.is_verified 
>                            1251                            1251 
>          taggedUsers.9.username                       timestamp 
>                            1251                               0 
>                            type                   videoDuration 
>                               0                             631 
>                  videoPlayCount                  videoViewCount 
>                             635                             631 
>                              id 
>                               0

Based on the result, there are missing values in column mentions.., taggedUsers.., productType, coauthorProducers.., videoViewCount, videoPlayCount, and videoDuration. To handle these missing values, I will reshape the dataframe, apply an imputation method, and perform feature engineering.

Reshaping Dataframe

Before reshaping the data, I will impute the missing values on column coauthorProducers., mentions., dan taggedUsers. with index .0. It will give an information for posts that doesn’t have co-author producer, mention, or tagged user.

df_clean <- df_clean %>% 
  mutate_at(vars(starts_with("coauthorProducers.0")), ~replace(., is.na(.), "No co-author")) %>% 
  mutate_at(vars(starts_with("mentions.0")), ~replace(., is.na(.), "No mention")) %>% 
  mutate_at(vars(starts_with("taggedUsers.0")), ~replace(., is.na(.), "No tag"))

For column mentions.. that doesn’t have sub-attribute, reshaping process will consists of pivot longer (wide to long) and column modification. Pivot longer will produce two new columns i.e. name and value. Column name will be remove because it is unneeded. Column value will be kept and it will be rename to mention_username.

At the end of the process, there will be rows with missing value in column mention_username. These rows are unnecessary because mention attribute information for each post already represented by mentioned username or “No mention”. Therefore, these rows with missing value will be remove.

df_clean <- df_clean %>% 
  # Reshape attribute `mentions` from wide (columns) to long (rows)
  pivot_longer(cols = starts_with("mention"),
               values_to = "mention_username") %>%
  # Removed column 'name' because it is unnecessary
  select(-name) %>% 
  # Remove rows that contains missing value in column `mention_username`
  drop_na(mention_username)

For column coauthorProducers.. and taggedUsers.. that have sub-attributes, there will be some differences reshaping process. For this case, I made a custom function named longer(..). This function will reshape the columns of attributes and its sub-attributes with many index in wider format into columns of attributes and its sub-attributes in longer format without index. The full explanation about this function (longer(..)) is in Custom Function Validation at the end of this project.

# Custom Function -> `longer(..)`
longer <- function(x, y){
  # Reshape columns from wide to long
  pivot_longer(data = x,
               # Select columns with name that starts with argument `y`
               cols = starts_with(y),
               # Produce three columns of attribute, index, and sub-attribute
               names_to = c("attribute", "index", "sub-attribute"),
               # Column name separator `.` that separate attribute, index, and sub-attribute
               names_sep = "\\.") %>% 
    # Remove column attribute because it's not needed
    select(-attribute) %>% 
    # Reshape columns of sub-attributes from long to wide to make columns of sub-attributes without indexing
    pivot_wider(names_from = "sub-attribute", values_from = "value", names_prefix = paste0(y,"_")) %>% 
    # Remove column attribute because it's not longer needed
    select(-index) %>% 
    # Remove rows that contains missing value in columns with name that starts with argument `y`
    drop_na(starts_with(y))
}

# For more detail explanation, go to 'Function Validation' at the end of this project.

# Apply function `longer(..)` to `coauthorProducer` and `taggedUser` to reshape them.
df_clean <- df_clean %>%
  longer("coauthorProducers") %>% 
  longer("taggedUsers")

After the reshaping process, this is a peek of the dataset :

glimpse(df_clean)

> Rows: 1,496
> Columns: 17
> $ commentsCount                 <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, 0,…
> $ likesCount                    <int> 236, 287, 71, 202, 210, 1237, 429, 773, …
> $ ownerFullName                 <chr> "AO Snack Brand", "AO Snack Brand", "AO …
> $ ownerUsername                 <chr> "aosnack.brand", "aosnack.brand", "aosna…
> $ productType                   <chr> NA, NA, "clips", NA, NA, NA, NA, NA, NA,…
> $ timestamp                     <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06:5…
> $ type                          <chr> "Image", "Image", "Video", "Image", "Ima…
> $ videoDuration                 <dbl> NA, NA, 23.033, NA, NA, NA, NA, NA, NA, …
> $ videoPlayCount                <int> NA, NA, 2977, NA, NA, NA, NA, NA, NA, NA…
> $ videoViewCount                <int> NA, NA, 1468, NA, NA, NA, NA, NA, NA, NA…
> $ id                            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
> $ mention_username              <chr> "No mention", "No mention", "No mention"…
> $ coauthorProducers_is_verified <chr> "No co-author", "No co-author", "No co-a…
> $ coauthorProducers_username    <chr> "No co-author", "No co-author", "No co-a…
> $ taggedUsers_full_name         <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_is_verified       <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_username          <chr> "No tag", "No tag", "No tag", "No tag", …

# Check for missing values
colSums(is.na(df_clean))

>                 commentsCount                    likesCount 
>                             0                             0 
>                 ownerFullName                 ownerUsername 
>                             0                             0 
>                   productType                     timestamp 
>                           703                             0 
>                          type                 videoDuration 
>                             0                           703 
>                videoPlayCount                videoViewCount 
>                           707                           703 
>                            id              mention_username 
>                             0                             0 
> coauthorProducers_is_verified    coauthorProducers_username 
>                             0                             0 
>         taggedUsers_full_name       taggedUsers_is_verified 
>                             0                             0 
>          taggedUsers_username 
>                             0

Columns of coauthorProducers and taggedUser has become longer without indexing and doesn’t have any missing values. In the next section, I will apply imputation method to handle the rest missing values.

Imputation Value

First, I will handle missing values on column productType. This columns contains the product type of Instagram Post that is photo, clips, or igtv.

unique(df_clean$productType)

> [1] NA      "clips" "igtv"

df_igtv <- df_clean %>% 
  filter(productType == "igtv")

df_clips <- df_clean %>% 
  filter(productType == "clips")

unique(df_igtv$type)

> [1] "Video"

unique(df_clips$type)

> [1] "Video"

unique(df_clean$type)

> [1] "Image"   "Video"   "Sidecar"

df_clean <- df_clean %>% 
  mutate(productType = ifelse(is.na(productType), "photo", productType))

Now, to verify there is no video type in photo productType, I check the unique value in data that has photo as product type.

# Filter rows that contains "photo" as 
df_photo <- df_clean %>% 
  filter(productType == "photo")

# Check for unique value
unique(df_photo$type)

> [1] "Image"   "Sidecar"

colSums(is.na(df_clean))

>                 commentsCount                    likesCount 
>                             0                             0 
>                 ownerFullName                 ownerUsername 
>                             0                             0 
>                   productType                     timestamp 
>                             0                             0 
>                          type                 videoDuration 
>                             0                           703 
>                videoPlayCount                videoViewCount 
>                           707                           703 
>                            id              mention_username 
>                             0                             0 
> coauthorProducers_is_verified    coauthorProducers_username 
>                             0                             0 
>         taggedUsers_full_name       taggedUsers_is_verified 
>                             0                             0 
>          taggedUsers_username 
>                             0

Next, I will check for the rows that contains missing values in videoViewCount

df_clean %>% 
  filter(is.na(videoViewCount) == TRUE) %>% 
  head(10)

Based on the result, the data that contains missing values in column videoViewCount, videoPlayCount, and videoDuration have the same product type which is photo. Therefore, value 0 will be imputed to these columns with product type “photo” because there is no duration in photo and to keep the columns in numerical data type.

# Imputation value 0 to columnn `videoDuration`, `videoPlayCount`, and `videoViewCount`
df_clean <- df_clean %>% 
  mutate(videoDuration = ifelse(productType=="photo",0,videoDuration)) %>%
  mutate(videoPlayCount = ifelse(productType=="photo",0,videoPlayCount)) %>% 
  mutate(videoViewCount = ifelse(productType=="photo",0,videoViewCount))

# Re-check for missing values after imputation
colSums(is.na(df_clean))

>                 commentsCount                    likesCount 
>                             0                             0 
>                 ownerFullName                 ownerUsername 
>                             0                             0 
>                   productType                     timestamp 
>                             0                             0 
>                          type                 videoDuration 
>                             0                             0 
>                videoPlayCount                videoViewCount 
>                             4                             0 
>                            id              mention_username 
>                             0                             0 
> coauthorProducers_is_verified    coauthorProducers_username 
>                             0                             0 
>         taggedUsers_full_name       taggedUsers_is_verified 
>                             0                             0 
>          taggedUsers_username 
>                             0

Missing values from column videoViewCount and column videoDuration have been handled. There is still some missing values in column videoPlayCount. Therefore, further check must be performed.

# Check for missing values in column `videoPlayCount`
df_clean %>% 
  filter(is.na(videoPlayCount) == TRUE)

The rows with missing values all have the same product type, “igtv,” and the same type, “Video.” Before proceeding, the product type “igtv” will be changed to “video,” as this feature is no longer available. Similarly, the product type “clips” will be changed to “video” for consistency. The upcoming analysis will focus only on product types “video” or “photo” to maintain clarity and consistency. Therefore, column type will be excluded since it have similiar meaning with productType.

df_clean$productType[df_clean$productType == "igtv" | df_clean$productType == "clips"] <-  "video"

# Deselect colum `type`
df_clean <- df_clean %>% 
  select(-type)

# Check unique value in `productType`
unique(df_clean$productType)

> [1] "photo" "video"

To handle this missing values on column videoPlayCount, I will perform feature engineering to make a metrics called Completion Rate of a video and perform a calculation to get value for imputation.

Feature Engineering: Completion Rate

The completion rate of a video is a key metric that measures how often viewers watch a video in its entirety. It is calculated by dividing the total number of complete views by the total number of video plays. In this context, the total number of complete views is represented by the column videoViewCount, while the total number of video plays is represented by the column videoPlayCount. The formula can be expressed as:

\[Completion\:Rate = \frac{Total\:Number\:Complete\:Views}{Total\:Video\:Plays} \times 100\]

This percentage indicates the engagement level of the audience, helping content creators and marketers assess how compelling their videos are. A higher completion rate suggests that viewers are more likely to watch the video to the end, while a lower rate may indicate a need for improvements in content or presentation.

For this case, I choose to using value of videoPlayCount from completion rate calculation for imputation missing values in videoPlayCount. This value will fit better to estimating missing values in videoPlayCount than the mean of videoPlayCount itself. This is because the completion rate accounts for viewer engagement patterns, making it a more stable and representative measure. If videoPlayCount is missing for a video, using the average completion rate (derived from other videos) allows for a more accurate estimation by considering typical viewer behavior rather than just raw play counts. Using the average play count itself can be inaccurate because it may include values lower than the view count in cases of missing data, which is nonsensical.

For example, if the average completion rate is 50%, and a video has videoViewCount = 200, the estimated videoPlayCount would be 200 / 0.5 = 400. This approach reduces bias from outliers (such as viral videos with unusually high play counts) and provides a more reliable imputation for missing data. The formula for total video plays (videoPlayCount) :

\[Total\:Video\:Plays =\frac{Total\:Number\:Complete\:Views}{Completion\:Rate} \times 100\]

Starts with create a new column named completionRate with value from formula (videoViewCount/videoPlayCount)*100 if the type is video and value zero if the type is not video. Then, the mean completionRate will be computed across all records where the type is video except the data with missing values.

# Create a new column `completionRate`
df_clean <- df_clean %>%
  mutate(completionRate = ifelse(productType == "video",((videoViewCount/videoPlayCount)*100),0))

# Check completionRate distribution
df_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  filter(productType == "video") %>% 
  select(completionRate) %>%
  boxplot(horizontal = T)

# Check for the outlier which completion rate more then 100%
df_clean %>% 
  filter(completionRate > 100)

Out of 1,254 data points, two have a completion rate exceeding 100%. A completion rate above 100% is illogical. Therefore, these data points will be removed as they may distort insights.

# Remove data with completion rate > 100%
df_clean <- df_clean %>% 
  filter(completionRate < 100)

# Check again completionRate data distribution after removing outlier
df_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  filter(productType == "video") %>% 
  select(completionRate) %>%
  boxplot(horizontal = T)

The completionRate data is fairly normally distributed, and its mean value will be calculated.

# Calculate mean of `completionRate`
mean_completionRate <- df_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  filter(productType == "video") %>% 
  # Remove data that contains missing values
  na.omit() %>% 
  # Take just the column `completionRate` for calculation
  pull(completionRate) %>% 
  # Calculate the mean
  mean()

mean_completionRate

> [1] 39.82432

The calculated mean completion rate across video-type posts is around 39.82%. This value will serve as the imputation reference for null entries in the completionRate column, enabling subsequent estimation of missing videoPlayCount values.

df_clean <- df_clean %>% 
  # Imputation mean of completion rate for missing value in column `completionRate`
  mutate(completionRate = ifelse(is.na(completionRate)==TRUE, mean_completionRate, completionRate)) %>% 
  # Imputation estimated value of video play count for missing value in column `videoPlayCount`
  mutate(videoPlayCount = ifelse(is.na(videoPlayCount)==TRUE,round(((videoViewCount/completionRate)*100)), videoPlayCount))

# Check for missing values
colSums(is.na(df_clean))

>                 commentsCount                    likesCount 
>                             0                             0 
>                 ownerFullName                 ownerUsername 
>                             0                             0 
>                   productType                     timestamp 
>                             0                             0 
>                 videoDuration                videoPlayCount 
>                             0                             0 
>                videoViewCount                            id 
>                             0                             0 
>              mention_username coauthorProducers_is_verified 
>                             0                             0 
>    coauthorProducers_username         taggedUsers_full_name 
>                             0                             0 
>       taggedUsers_is_verified          taggedUsers_username 
>                             0                             0 
>                completionRate 
>                             0

The dataset is now complete, with all previously missing values properly handled.

Duplicate Rows

# Check for duplicate rows
dim(df_clean[duplicated(df_clean),])[1]

> [1] 0

The dataset contains no duplicates.

Data Type Adjusment

# Peek the dataset structure
glimpse(df_clean)

> Rows: 1,490
> Columns: 17
> $ commentsCount                 <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, 0,…
> $ likesCount                    <int> 236, 287, 71, 202, 210, 1237, 429, 773, …
> $ ownerFullName                 <chr> "AO Snack Brand", "AO Snack Brand", "AO …
> $ ownerUsername                 <chr> "aosnack.brand", "aosnack.brand", "aosna…
> $ productType                   <chr> "photo", "photo", "video", "photo", "pho…
> $ timestamp                     <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06:5…
> $ videoDuration                 <dbl> 0.000, 0.000, 23.033, 0.000, 0.000, 0.00…
> $ videoPlayCount                <dbl> 0, 0, 2977, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ videoViewCount                <dbl> 0, 0, 1468, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ id                            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
> $ mention_username              <chr> "No mention", "No mention", "No mention"…
> $ coauthorProducers_is_verified <chr> "No co-author", "No co-author", "No co-a…
> $ coauthorProducers_username    <chr> "No co-author", "No co-author", "No co-a…
> $ taggedUsers_full_name         <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_is_verified       <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_username          <chr> "No tag", "No tag", "No tag", "No tag", …
> $ completionRate                <dbl> 0.00000, 0.00000, 49.31139, 0.00000, 0.0…

Columns requiring data type conversion :

productType, type, coauthorProducer_is_verified, and taggedUser_is_verified to factor.
videoViewCount and videoPlayCount to integer.

df_clean <- df_clean %>% 
  # Change columns data type to factor
  mutate_at(vars(productType, coauthorProducers_is_verified, taggedUsers_is_verified),as.factor) %>% 
  # Change columns data type to integer
  mutate_at(vars(videoViewCount, videoPlayCount), as.integer)

# Peek the dataset structure
glimpse(df_clean)

> Rows: 1,490
> Columns: 17
> $ commentsCount                 <int> 0, 3, 4, 0, 8, 4, 0, 19, 10, 2, 4, 4, 0,…
> $ likesCount                    <int> 236, 287, 71, 202, 210, 1237, 429, 773, …
> $ ownerFullName                 <chr> "AO Snack Brand", "AO Snack Brand", "AO …
> $ ownerUsername                 <chr> "aosnack.brand", "aosnack.brand", "aosna…
> $ productType                   <fct> photo, photo, video, photo, photo, photo…
> $ timestamp                     <dttm> 2022-01-03 15:07:53, 2022-01-04 14:06:5…
> $ videoDuration                 <dbl> 0.000, 0.000, 23.033, 0.000, 0.000, 0.00…
> $ videoPlayCount                <int> 0, 0, 2977, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ videoViewCount                <int> 0, 0, 1468, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
> $ id                            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
> $ mention_username              <chr> "No mention", "No mention", "No mention"…
> $ coauthorProducers_is_verified <fct> No co-author, No co-author, No co-author…
> $ coauthorProducers_username    <chr> "No co-author", "No co-author", "No co-a…
> $ taggedUsers_full_name         <chr> "No tag", "No tag", "No tag", "No tag", …
> $ taggedUsers_is_verified       <fct> No tag, No tag, No tag, No tag, No tag, …
> $ taggedUsers_username          <chr> "No tag", "No tag", "No tag", "No tag", …
> $ completionRate                <dbl> 0.00000, 0.00000, 49.31139, 0.00000, 0.0…

All columns now have the appropriate data types.

Rearrange Columns

Column positions are rearranged to improve data readability, analysis efficiency, and logical flow of information. By organizing columns in a meaningful order (e.g., grouping related variables, placing key identifiers first, or following a chronological sequence), the data becomes more intuitive to interpret and reduces cognitive load for users. Certain columns also will be renamed to enforce naming consistency.

df_clean <- df_clean %>% 
  # Re-arrange columns position
  select(id,timestamp, ownerFullName, ownerUsername, productType, coauthorProducers_username, coauthorProducers_is_verified, taggedUsers_full_name, taggedUsers_username, taggedUsers_is_verified, mention_username, likesCount, commentsCount, videoDuration, videoPlayCount, videoViewCount, completionRate) %>% 
  # Rename columns
  rename(coauthorProducerUsername = coauthorProducers_username,
         coauthorProduceIsVerified = coauthorProducers_is_verified,
         taggedUseFullName = taggedUsers_full_name,
         taggedUserUsername = taggedUsers_username,
         taggedUserIsVerified = taggedUsers_is_verified,
         mentionUsername = mention_username)
head(df_clean)

Data cleaning and preparation is complete - resulting in an analysis-ready dataset.

Insights Deep-Dive

The primary goal of this data analysis is to identify post characteristics that drive high engagement. By leveraging the insights obtained, we aim to create content that maximizes and improves engagement. The quantitative analysis is based solely on available data, focusing exclusively on likes and comments per post as the engagement metrics. First, a quick statistical summary of likes and comments of all posts will be examined.

df_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  select(likesCount,commentsCount) %>% 
  summary()

>    likesCount        commentsCount    
>  Min.   :    -1.00   Min.   :   0.00  
>  1st Qu.:    86.75   1st Qu.:   2.00  
>  Median :   186.00   Median :   5.00  
>  Mean   :  1862.22   Mean   :  46.08  
>  3rd Qu.:   474.75   3rd Qu.:  17.00  
>  Max.   :214517.00   Max.   :7947.00

Out of all posts, the range of likes and comments varies significantly. The maximum values are extremely high compared to others, suggesting some posts may have gone viral. Additionally, illogical values have been found like -1 in the likesCount data. After manual verification (by viewing the actual posts), it is confirmed that -1 represents posts where likes were hidden by the user.

Disclaimer: All qualitative evaluations of Instagram posts in this study were genuinely conducted because the original (non-anonymized) data from Apify contained post URLs. These URLs have since been removed to ensure complete data anonymization.

df_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  filter(likesCount<0)

# Check number of posts
df_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  nrow()

> [1] 1248

Only 7 out of 1,248 posts (0.56%) had hidden like counts, removing them won’t significantly affect the dataset while keeping the analysis clean and maintain data consistency in our analysis. Then, engagement column will be created by summing likes and comments.

df_clean <- df_clean %>%
  # Remove data with likesCount lower than zero
  filter(likesCount>=0) %>% 
  # Create column engagementCount
  mutate(engagementCount = likesCount + commentsCount)

The dataset contains duplicate post IDs caused by variations in other features (taggedUser, mentions, etc.). To ensure analysis accuracy and simplify further processing, we removed duplicate post IDs, storing the cleaned data in the df_id variable while retaining only one entry per unique post. This prevents skewed results from duplicate content measurements.

df_id <- df_clean %>% 
  distinct(id, .keep_all = TRUE)

# Export for visualization in Tableau
write_xlsx(df_id, "data_output/data_id.xlsx")

Engagement Trends and Growth Rates

# Aggregation for Plot
monthly <- df_id %>%
  mutate(month_year = floor_date(timestamp, "month") %>%  # Bulan-tahun
    format("%b %Y")) %>% 
  mutate(month_date = my(month_year)) %>% 
  group_by(month_year) %>%
  summarise(sum_engagementCount = sum(engagementCount),
            numPost = n()) %>% 
  mutate(month_date = my(month_year)) %>% 
  arrange(month_date)

# Plot
ggplot(data = monthly, 
       mapping = aes(x = month_date,
                     y = sum_engagementCount)) +
  geom_line(color = "#4f5bd5", linewidth = 1) +
  
  # Y-Axis
  scale_y_continuous(
    labels = function(x) paste0(x/1000, "K"), # Format menjadi 300K
    breaks = pretty_breaks(), # Break yang sesuai
    expand = expansion(mult = c(0.1, 0.1)) # Sedikit ruang di atas
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(), # Menghapus semua gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # Garis sumbu berwarna hitam
    axis.ticks = element_line(color = "black") # Tick marks berwarna hitam
  ) +
  
  labs(
    title = "Monthly Engagements by Post Types",
    x = NULL,
    y = "Engagement Count",
    color = "Post Type"
  )

monthly %>%
  group_by(year = year(month_date)) %>% 
  summarise(totalEngagement = sum(sum_engagementCount),
            numPost = sum(numPost)) %>% 
  mutate(engagementGrowth = round(((totalEngagement - lag(totalEngagement))/lag(totalEngagement))*100,2)) %>% 
  mutate(numPostGrowth = round(((numPost - lag(numPost))/lag(numPost))*100,2))

Insights:

Between Jan 2022 and Mar 2025, monthly engagement ranged from ~1.2K to ~368.6K.
Major spikes occurred in Jun–Jul 2023, Feb 2024, May–Jun 2024, Dec 2024, and another rise starting in Mar 2025, with most spikes being followed by sharp declines.
The highest peak in June–July 2023 exceeded 350K, likely driven by viral content or major campaigns.
Engagement in 2024 saw multiple peaks (74K–265K), suggesting an event-driven or seasonal trend, which will be discussed further in the Impact of Collaborations on Engagements section below.

Engagements by Post Types

# Aggregation for Plot
monthlye_types <- df_id %>%
  mutate(month_year = floor_date(timestamp, "month") %>%  # Bulan-tahun
    format("%b %Y")) %>% 
  mutate(month_date = my(month_year)) %>% 
  group_by(month_year, productType) %>%
  summarise(sum_engagementCount = sum(engagementCount)) %>% 
  mutate(month_date = my(month_year)) %>% 
  arrange(month_date, productType)

# Plot
ggplot(data = monthlye_types, 
       mapping = aes(x = month_date,
                     y = sum_engagementCount, color = productType)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Paired") +
  
  # Y-Axis
  scale_y_continuous(
    labels = function(x) paste0(x/1000, "K"), # Format menjadi 300K
    breaks = pretty_breaks(), # Break yang sesuai
    expand = expansion(mult = c(0.1, 0.1)) # Sedikit ruang di atas
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(), # Menghapus semua gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # Garis sumbu berwarna hitam
    axis.ticks = element_line(color = "black") # Tick marks berwarna hitam
  ) +
  
  labs(
    title = "Monthly Engagements by Post Types",
    x = NULL,
    y = "Engagement Count",
    color = "Post Type"
  )

Most of the time, the total engagement trend is influenced a lot by video-type posts. The spikes that appear dominantly influenced by engagement from video-type post. Next, count of posts by type will be compared.

df_id %>%
  group_by(productType) %>% 
  summarise(sum_engagementCount = sum(engagementCount),
            numPost = n())

The data distribution between video and photo post types is sufficiently balanced, allowing for direct comparison of their engagement metrics.

The data contains many outliers and is heavily skewed. Therefore, the median will be used to compare the engagement between the two post types, as it provides a more realistic measure of central tendency than the mean.

# Engagement stats summary of video type
summary(df_id$engagementCount[df_id$productType == "video"])

>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
>      15     152     361    3420    1022  222262

# Engagement stats summary of photo type
summary(df_id$engagementCount[df_id$productType == "photo"])

>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
>    11.0    67.0   129.5   434.5   253.8 66628.0

Based on the boxplot and statistical summary above, several insights were obtained, as follows:

There are several viral posts (as indicated by the outliers) with significantly higher engagement in video compared to photos and the maximum engagement for video (222,262) far exceeds that of photos (66,628), indicating that video have a higher potential for viral reach.
Videos tend to generate higher engagement than photos, with a median of 361 compared to 129.5. Half of all videos achieve engagement above 361, while half of the photos fall below 129.5.
Based on Q1 to Q3, videos (152-1022) show a broader spread of engagement compared to photos (67-253.8), meaning the middle 50% of videos have much higher engagement. This suggests videos are more likely to drive higher engagement, though variability is also greater.

To support this findings a statistical test will be conduct to see the difference in engagement between the two types is significant or due to random variation. First, the distribution will be checked using normality test.

# Normality test for engagement of photo-type post
shapiro.test(df_id[df_id$productType == "photo",]$engagementCount)

> 
>   Shapiro-Wilk normality test
> 
> data:  df_id[df_id$productType == "photo", ]$engagementCount
> W = 0.082677, p-value < 2.2e-16

# Normality test for engagement of video-type post
shapiro.test(df_id[df_id$productType == "video",]$engagementCount)

> 
>   Shapiro-Wilk normality test
> 
> data:  df_id[df_id$productType == "video", ]$engagementCount
> W = 0.17691, p-value < 2.2e-16

Results show the distribution of photo-type post and video-type post are not normal. Therefore, I will use non-parametric statistical test, Mann-Whitney U to compare engagement post by post types.

# Check level of productType
levels(df_id$productType)

> [1] "photo" "video"

Hypothesis of Mann-Whitney U test for this case:

\(H_{0}\): There is no difference in engagement distribution between video and photo posts.
\(H_{1}\): The engagement distribution of video is stochastically greater than photos.

# Mann-Whitney U Test
wilcox.test(engagementCount ~ productType,
            data = df_id,
            alternative = "greater")

> 
>   Wilcoxon rank sum test with continuity correction
> 
> data:  engagementCount by productType
> W = 105159, p-value = 1
> alternative hypothesis: true location shift is greater than 0

With the result (p < 2.2e-16), the null hypothesis (\(H_{0}\)) is rejected. Statistically, the engagement distribution of video is stochastically greater than photos. This results indicate that videos consistently drive higher engagement compared to photo-types.

Engagements and Completion Rates by Video Durations

The analysis will now examine how video duration impacts engagement metrics. I’ve segmented videos into five duration categories (0-15s, 15-30s, 30-45s, 45-60s, and >60s) to determine optimal length for engagement maximization.

# Make a custom function to categorize video duration
convert_duration <- function(y){ 
    if(y <= 15)
      {
      y <- "0 - 15 sec" # video dengan publish_hour antara tengah malam hingga jam 8 pagi
    }
    else 
      if(y > 15 & y <= 30)
      {
      y <- "15 - 30 sec" # video jam 9 pagi hingga jam 4 sore
    }
    else 
      if(y > 30 & y <= 45)
      {
      y <- "30 - 45 sec" # video jam 9 pagi hingga jam 4 sore
    }
    else 
      if(y > 45 & y <= 60)
      {
      y <- "45 - 60 sec" # video jam 9 pagi hingga jam 4 sore
    }
    else
      {
      y <- "> 60 sec" # video jam 5 sore sampai tengah malam
    }  
}

# Make a new column of duration category by applying the custom function
df_id$durationCat <- factor(sapply(X = df_id$videoDuration,
                                                        FUN = convert_duration), levels = c("0 - 15 sec", "15 - 30 sec", "30 - 45 sec", "45 - 60 sec","> 60 sec"))

df_id %>%
  filter(productType == "video") %>% 
  count(durationCat)

While the counts are not perfectly equal, the distribution is reasonably balanced, with no extreme category dominating the dataset. This allows for a direct comparison across categories to derive meaningful insights.

plot(engagementCount ~ durationCat, data = df_id[df_id$productType == "video",])

The data also contains many outliers and is heavily skewed. Therefore, the median will be used to compare the engagement between these video duration categories, as it provides a more realistic measure of central tendency than the mean.

duration_engagement <- df_id %>% 
  filter(productType == "video") %>% 
  group_by(durationCat) %>% 
  summarise(median_engagementCount = median(engagementCount)) %>% 
  arrange(desc(median_engagementCount))

duration_engagement

# Plot
ggplot(duration_engagement, aes(durationCat, median_engagementCount)) + 
  geom_bar(stat = "identity", fill = "#4f5bd5") +
  
  # Y-Axis
  scale_y_continuous(
    breaks = pretty_breaks(), # Break yang sesuai
    expand = expansion(mult = c(0.1, 0.1)) # Sedikit ruang di atas
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(), # Menghapus semua gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # Garis sumbu berwarna hitam
    axis.ticks = element_line(color = "black") # Tick marks berwarna hitam
  ) +
  
  labs(
    title = "Median Engagements by Video Durations",
    x = NULL,
    y = "Engagement Count"
  )

Insights:

Videos longer than 15 seconds generally drive higher engagement, with medians ranging from 315 to 479, while videos shorter than 15 seconds record the lowest median engagement (148). This indicates that longer formats can generate roughly 2-3x more interactions, suggesting overly brief content may offer less value or appeal to the audience.

plot(completionRate ~ durationCat, data = df_id[df_id$productType == "video",])

The data just contains few outliers and not heavily skewed. Therefore, the mean value will be used to compare the engagement between these video duration categories for realistic measure of central tendency.

duration_completion <- df_id %>% 
  filter(productType == "video") %>% 
  group_by(durationCat) %>% 
  summarise(mean_completionRate = mean(completionRate)) %>% 
  mutate(across(where(is.numeric), ~round(., 2))) %>% 
  arrange(desc(mean_completionRate))

duration_completion

# Plot
ggplot(duration_completion, aes(durationCat, mean_completionRate, group = 1)) + 
  geom_line(color = "#4f5bd5", linewidth = 1) +
  geom_point(color = "#4f5bd5") +
  
  # Y-Axis
  scale_y_continuous(
    breaks = pretty_breaks(), # Break yang sesuai
    expand = expansion(mult = c(0.1, 0.1)) # Sedikit ruang di atas
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(), # Menghapus semua gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # Garis sumbu berwarna hitam
    axis.ticks = element_line(color = "black") # Tick marks berwarna hitam
  ) +
  
  labs(
    title = "Completion Rates by Video Durations",
    x = NULL,
    y = "Completion Rate"
  )

Insights:

The highest average completion rate is for videos over 60 seconds (44.66%), while the other durations fall within 37.43%–38.96%, indicating that videos of 60 seconds or longer tend to deliver a 5–7% higher retention rate.

Engagements by Posting Times

In this section, I will analyze engagement based on posting times (by hour and day). To ensure reliable median engagement measurement and minimize the impact of outliers, I filtered the data to include only time slots with at least six posts. Posting times with fewer than six posts were excluded, as these rare cases could distort the overall engagement pattern.

# Aggregation for Plot
posting_times <- df_id %>% 
  mutate(day = wday(timestamp, label = TRUE, abbr = FALSE)) %>% 
  mutate(hour = hour(timestamp)) %>% 
  group_by(day, hour) %>% 
  summarise(median_engagementCount = median(engagementCount),
            numPost = n()) %>% 
  filter(numPost > 5)

# Plot
ggplot(posting_times, aes(day, hour, fill= median_engagementCount)) + 
  geom_tile() +
  
  # Y-Axis
  scale_y_continuous(
    breaks = pretty_breaks() # Break yang sesuai
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(), # Menghapus semua gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # Garis sumbu berwarna hitam
    axis.ticks = element_line(color = "black") # Tick marks berwarna hitam
  ) +
  
  labs(
    title = "Median Engagements by Post Types",
    x = NULL,
    y = "Hour"
  )

Insights:

The heatmap of median engagement across posting times suggests no consistent pattern. While certain time slots (e.g., 4 AM Sunday and 3 AM Tuesday) show higher median engagement, posting time does not appear to be a dominant driver of engagement.

Impact of Collaborations on Engagements

Before starting the analysis, I performed data preprocessing by consolidating the username columns into a single column to simplify the evaluation of each collaborator’s involvement in the posts.

collab <- df_clean %>% 
  pivot_longer(cols = contains("Username"),
               values_to = "username") %>% 
  select(-name) %>% 
  filter(username != "No co-author" & username != "No tag" & username != "No mention") %>% 
  filter(username != "aosnack.brand")
glimpse(collab)

> Rows: 868
> Columns: 15
> $ id                        <int> 98, 205, 205, 205, 216, 218, 218, 234, 234, …
> $ timestamp                 <dttm> 2022-04-06 14:45:16, 2022-08-01 13:34:38, 2…
> $ ownerFullName             <chr> "AO Snack Brand", "AO Snack Brand", "AO Snac…
> $ productType               <fct> photo, video, video, video, video, video, vi…
> $ coauthorProduceIsVerified <fct> No co-author, true, true, true, No co-author…
> $ taggedUseFullName         <chr> "No tag", "Liam Johnson", "Liam Johnson", "L…
> $ taggedUserIsVerified      <fct> No tag, true, true, true, true, true, true, …
> $ likesCount                <int> 92, 326, 326, 326, 25, 31, 31, 74, 74, 74, 6…
> $ commentsCount             <int> 2, 11, 11, 11, 1, 0, 0, 5, 5, 5, 1, 1, 1, 22…
> $ videoDuration             <dbl> 0.000, 163.000, 163.000, 163.000, 69.400, 88…
> $ videoPlayCount            <int> 0, 19902, 19902, 19902, 2182, 2184, 2184, 59…
> $ videoViewCount            <int> 0, 11608, 11608, 11608, 1006, 870, 870, 3603…
> $ completionRate            <dbl> 0.00000, 58.32580, 58.32580, 58.32580, 46.10…
> $ engagementCount           <int> 94, 337, 337, 337, 26, 31, 31, 79, 79, 79, 6…
> $ username                  <chr> "emma.smith", "liam_johnson", "liam_johnson"…

Since an account can be mentioned, tagged, and listed as a co-author in the same post, this process may generate duplicates. To address this, duplicate values were checked.

dim(collab[duplicated(collab[c("id", "username")]),])[1]

> [1] 525

As duplicates were found, they will be removed, as they could distort the calculations.

collab_clean <- collab[!duplicated(collab[c("id","username")]),]

dim(collab_clean[duplicated(collab_clean[c("id", "username")]),])[1]

> [1] 0

Next, column that indicates collaboration will be added to compare the engagament between post with collaboration and post without collaboration.

# Pull id post that involve collaboration
id_post_collab <- collab_clean %>% 
  distinct(id, .keep_all = TRUE) %>% 
  pull(id)

# Add new column that indicates collaboration
df_id$collaboration <-as.factor(ifelse(df_id$id %in% id_post_collab, "Yes", "No"))

collab_data <- df_id %>%
  select(id, collaboration)

# Export for visualization in Tableau  
write_xlsx(collab_data,"data_output/collaboration.xlsx")

collaborator_name <- collab_clean %>% 
  select(id, username)

# Export for visualization in Tableau  
write_xlsx(collaborator_name,"data_output/data_collaborator.xlsx")

plot(engagementCount ~ collaboration, data = df_id)

Since the data distribution is skewed, median engagement will be used to compare the two post types, as it provides a more reliable measure of central tendency than the mean.

monthlye_collab <- df_id %>%
  mutate(month_year = floor_date(timestamp, "month") %>%  # Bulan-tahun
    format("%b %Y")) %>% 
  mutate(month_date = my(month_year)) %>% 
  group_by(month_year, collaboration) %>%
  summarise(sum_engagementCount = sum(engagementCount)) %>% 
  mutate(month_date = my(month_year)) %>% 
  arrange(month_date, collaboration)

# Plot
ggplot(data = monthlye_collab, 
       mapping = aes(x = month_date,
                     y = sum_engagementCount, color = collaboration)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Paired") +
  
  # Y-Axis
  scale_y_continuous(
    labels = function(x) paste0(x/1000, "K"), # Format menjadi 300K
    breaks = pretty_breaks(), # Break yang sesuai
    expand = expansion(mult = c(0.1, 0.1)) # Sedikit ruang di atas
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid = element_blank(), # Menghapus semua gridlines
    panel.background = element_blank(),
    axis.line = element_line(color = "black"), # Garis sumbu berwarna hitam
    axis.ticks = element_line(color = "black") # Tick marks berwarna hitam
  ) +
  
  labs(
    title = "Monthly Engagements by Post Types",
    x = NULL,
    y = "Engagement Count",
    color = "Post Type"
  )

df_id %>% 
  group_by(collaboration) %>% 
  summarise(sum_engagementCount = sum(engagementCount),
            numPost = n())

# Engagement stats summary of post with collaboration
summary(df_id$engagementCount[df_id$collaboration == "Yes"])

>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
>     21.0    193.2    518.5   8565.4   2806.5 222262.0

# Engagement stats summary of post without collaboration
summary(df_id$engagementCount[df_id$collaboration == "No"])

>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
>    11.0    83.5   169.0   502.7   403.0 39408.0

Insights:

Engagement patterns frequently align with collaboration-driven posts. Notably, the spike observed in Jun–Jul 2023 was largely the result of content collaborations.
Posts with collaborations have a much higher median engagement count (518.5) compared to posts without collaborations (162.5), representing around a 3x increase.

top_collaborators <- collab_clean %>% 
  group_by(username) %>% 
  summarise(sum_engagementCount = round(sum(engagementCount)),
            numPost = n()) %>% 
  arrange(desc(sum_engagementCount))

top_collaborators

Insights:

Wilson Nate, Aubrey Evans, Sportyanz Event, Jacob Miller, and Robert Carter are the top collaborators contributing the highest total engagement, with each generating over 77K.

Now, I will analyze the engagement spikes that appear as I mentioned before.

Year 2023

df_id %>%
  filter(year(timestamp) == 2023 & month(timestamp) %in% c(6, 7)) %>% 
  group_by(month = month(timestamp, label = T, abbr = F)) %>% 
  summarise(sum_engagementCount = sum(engagementCount))

collab_clean %>% 
  filter(month(timestamp) %in% c(6, 7), year(timestamp) == 2023) %>% 
  group_by(month = month(timestamp, label = T, abbr = F), username) %>% 
  summarise(sum_engagementCount = sum(engagementCount)) %>% 
  arrange(month, desc(sum_engagementCount)) %>% 
  top_n(3)

Insights:

During the peak in Jun-Jul 2023, Wilson Nate is involved in ~575K out of the total ~607K engagement indicating a viral post from Wilson Nate.

Year 2024

df_id %>%
  filter(year(timestamp) == 2024 & month(timestamp) %in% c(2, 6, 7, 8, 9, 10, 11, 12)) %>% 
  group_by(month = month(timestamp, label = T, abbr = F)) %>% 
  summarise(sum_engagementCount = sum(engagementCount))

collab_clean %>% 
  filter(month(timestamp) %in% c(2, 6, 7, 8, 9, 10, 11, 12), year(timestamp) == 2024) %>% 
  group_by(month = month(timestamp, label = T, abbr = F), username) %>% 
  summarise(sum_engagementCount = sum(engagementCount)) %>% 
  arrange(month, desc(sum_engagementCount)) %>% 
  top_n(3)

Insights:

The spike in February 2024 was driven by the introduction of the new Brand Ambassador, Mason Sanchez, in posts that also featured the existing ambassadors, Wilson Nate and Robert Carter.
Most engagement from Jun-Dec 2024 related to Sportyanz Event, contributing ~392K out of the total ~1,108K engagement. This sustained performance was likely supported by a content strategy that extended beyond the event period, encompassing pre-event build-up and post-event follow-up content.

Recommendations

Optimizing Content Format

Prioritize Video Content: Focus on producing video posts, as they consistently generate higher engagement than photos.
Optimize Video Length: Create videos longer than 15 seconds, since these tend to receive 2–3× more engagement than short videos. Maintain high content quality and storytelling value since length alone does not guarantee performance.
Content Duration Testing (A/B Testing): Post and compare videos with durations especially above 15 seconds (e.g., 15-30s, 30-45s) to identify the optimal balance between engagement, viewer retention, and cost efficiency.
Refine Drivers Beyond Timing: Since posting time is not a strong engagement factor, focus future optimizations on content type, duration, audience targeting, and creative execution.

Maintaining & Expanding Collaborations

Leverage Proven Collaborators: Consider allocating budget to continue collaborating with high-performing collaborators such as Wilson Nate, Aubrey Evans, and Sportyanz Event.
Sustain Engagement via Long-Term Partnerships: Build ongoing collaboration programs instead of one-off campaigns to prevent post-peak engagement drops.
Co-Branding Opportunities: Consider partnering with complementary brands (e.g., sports events, hampers) that align with the brand image to expand audience reach.

Sustaining Engagement Momentum

Plan for Post-Peak Content: Design and post follow-up content series after major engagement spikes (e.g., behind-the-scenes or recap videos) to sustain audience momentum and reduce steep declines.
Optimize Audience Retargeting: Retarget viewers and engagers from peak periods with tailored content to extend their activity cycle.

Caveats and Assumptions

The data presented is derived from public Instagram content. All account names and usernames displayed have been replaced with pseudonyms to protect the privacy of the original account holders. This anonymization prevents identity tracing, misuse of data, or harm to third parties. There is no association between these aliases and real Instagram accounts.
All qualitative evaluations of Instagram posts in this study were genuinely conducted because the original (non-anonymized) data from Apify contained post URLs. These URLs have since been removed to ensure complete data anonymization.
The data analyzed reflects the records captured at the time of data scraping.
videoPlayCount refers to the number of people who started playing a video, while videoViewCount refers to the number of people who watched the video in its entirety.
The analysis does not account for posts that were promoted/advertised, as no data related to paid promotion was available.

Key Questions for Further Analysis

Here are several key questions for deeper insights that couldn’t be explored in this project due to limited data and information, but could provide meaningful value to the business in the future:

Is there historical data on likes, comments, and followers available at the time of posting to calculate engagement rates? This would be useful for measuring the relevance and quality of the content.
Is there data on video categories (e.g., storytelling, event-related, product-focused, etc.)? If not, could this information be obtained from the marketing or social media team? Having such categorization would allow deeper analysis of video characteristics that drive higher engagement.
Is there sales, revenue, or cost data associated with Instagram posts (e.g., advertising spend, CPC, CTR, ROAS, etc.)? Integrating financial metrics would enable a more comprehensive evaluation of content effectiveness and optimization of content strategy.

Custom Function Validation

In this section, I will explain the custom function (longer(..)) that I created to reshape the columns the dataframe for this project. The idea of this function is to transform attributes and sub-attributes that are spread out with indexes (wider format) into a single column without indexes, in a long format. This function can always be used as long as the structure of the data being transformed is similar.

Figure 2. Function longer effect illustration

To explain, demonstrate, and validate the longer function, I will create a dummy dataframe so that the changes made are more clearly visible.

# Create Dummy Dataframe
dummy_df <- data.frame(id = c(1,2,3),
                 name = c("Nick","Alex","Patrick"),
                 tag.0.name = c("No tag", "Feb", "May"),
                 tag.0.verified = c("No tag", "Yes", "Yes"),
                 tag.1.name = c(NA, "Cool", NA),
                 tag.1.verified = c(NA, "No", NA))

dummy_df

The longer function consists of several stages:

Reshaping column using Pivot Longer

Pivot longer will produce three new columns which are names of columns that will be separated into three i.e. attribute, index, and sub-attribute.

# Reshape columns from wide to long
dummy_df <- pivot_longer(data = dummy_df,
             # Select columns with name that starts with the attribute
             cols = starts_with("tag"),
             # Produce three columns of attribute, index, and sub-attribute
             names_to = c("attribute", "index", "sub-attribute"),
             # Column name separator `.` that separate attribute, index, and sub-attribute
             names_sep = "\\.")

dummy_df

Remove column attribute

In this case, column attribute will be remove because it is not required.

dummy_df <- dummy_df %>% 
  select(-attribute)

dummy_df

Reshape column using Pivot Wider

Reshape columns of sub-attributes from long to wide to make columns of sub-attributes without indexing. Also, assign a prefix to the resulting column names, which is the name of the corresponding attribute.

dummy_df <- dummy_df %>% 
  pivot_wider(names_from = "sub-attribute", values_from = "value", names_prefix = paste0("tag","_"))

dummy_df

Remove column index and missing values

In this case, the index column will be removed because it is no longer needed. Missing values in the attribute columns will also be removed, as they do not carry any meaningful information.

dummy_df <- dummy_df %>% 
  select(-index) %>% 
  drop_na(starts_with("tag"))

dummy_df

These steps are then combined into a single function called longer. The names of the attributes that have sub-attributes to be reshaped and will be used repeatedly are set as arguments of the longer function.

# Custom Function -> `longer(..)`
longer <- function(x, y){
  # Reshape columns from wide to long
  pivot_longer(data = x,
               # Select columns with name that starts with argument `y`
               cols = starts_with(y),
               # Produce three columns of attribute, index, and sub-attribute
               names_to = c("attribute", "index", "sub-attribute"),
               # Column name separator `.` that separate attribute, index, and sub-attribute
               names_sep = "\\.") %>% 
    # Remove column attribute because it's not needed
    select(-attribute) %>% 
    # Reshape columns of sub-attributes from long to wide to make columns of sub-attributes without indexing
    pivot_wider(names_from = "sub-attribute", values_from = "value", names_prefix = paste0(y,"_")) %>% 
    # Remove column attribute because it's not longer needed
    select(-index) %>% 
    # Remove rows that contains missing value in columns with name that starts with argument `y`
    drop_na(starts_with(y))
}

Contacts

Thank you for reading through this project. Should you have any feedback, suggestions, or wish to collaborate further, feel free to reach out via email or connect with me on LinkedIn.

🔗 LinkedIn: @andrewoanggoh
📧 Email: andrew.anggoh@gmail.com

Healthy Brand Snack - Instagram Engagement Analysis

Authored by Andrew Anggoh

Project Overview

Objective

Data Extraction

Data Cleaning and Preparation

Libraries

Import Dataset

Data Description

Add Column ID of Posts

Missing Values

Reshaping Dataframe

Imputation Value

Feature Engineering: Completion Rate

Duplicate Rows

Data Type Adjusment

Rearrange Columns

Insights Deep-Dive

Top Posts by Engagements

Engagement Trends and Growth Rates

Engagements by Post Types

Engagements and Completion Rates by Video Durations

Engagements by Posting Times

Impact of Collaborations on Engagements

Year 2023

Year 2024

Recommendations

Optimizing Content Format

Maintaining & Expanding Collaborations

Sustaining Engagement Momentum

Caveats and Assumptions

Key Questions for Further Analysis

Custom Function Validation

Contacts