Analyzing the Impact of Format and Timing on Facebook Engagement

Author

Dashan Richards

Published

May 7, 2026


Introduction

Trying to figure out exactly what makes users engage with social media content is notoriously tricky. With this in mind, this project sets out to answer a very specific and practical question: How does the format of a post (Type) and the time of week it is posted affect total reach and user engagement (likes and shares)?

Rather than only describing what happened in the past, I built a predictive framework. The goal is to identify the statistical makeup of high-performing content so the focus can shift from raw volume metrics such as Total Reach toward efficiency metrics such as Engagement Velocity, meaning the number of likes gained per additional person reached.

Data Description

The dataset is the UCI Facebook Metrics Dataset, which contains 500 posts from an international cosmetics brand. The main variables analyzed are Type, Post.Weekday, Lifetime.Post.Total.Reach, like, and Category; Post.Month, Post.Hour, Paid, share, Total.Interactions, and Page.total.likes support the derived metrics, the plots, and the clustering.

To prepare the data for analysis, I used the tidyverse to clean and transform the dataset. Rows with missing values are dropped (leaving 495 of the 500 posts), a surrogate key is created, and new engagement efficiency metrics are calculated.

# Load required libraries
library(tidyverse)
library(ggrepel)
library(patchwork)
library(plotly)
library(DT)
library(leaflet)

# Turn off scientific notation for cleaner axes
options(scipen = 999)

# Load the raw data
fb_data <- read.csv("dataset_Facebook.csv", sep = ";")

# Create human-readable metadata for Categories using tribble
category_meta <- tribble(
  ~Category, ~Category_Name,
  1, "Action",
  2, "Product",
  3, "Inspiration"
)

# Data Transformation
fb_full <- fb_data |>
  na.omit() |> 
  left_join(category_meta, by = "Category") |>
  mutate(
    # Create surrogate key for unique identification
    Post_ID = row_number(), 
    # Weekend flag; assumes Post.Weekday is coded 1 = Sunday ... 7 = Saturday
    Day_Type = if_else(Post.Weekday %in% c(1, 7), "Weekend", "Weekday"),
    Month_Name = case_when(
      Post.Month == 1 ~ "Jan", Post.Month == 2 ~ "Feb",
      Post.Month == 3 ~ "Mar", Post.Month == 4 ~ "Apr",
      Post.Month == 5 ~ "May", Post.Month == 6 ~ "Jun",
      Post.Month == 7 ~ "Jul", Post.Month == 8 ~ "Aug",
      Post.Month == 9 ~ "Sep", Post.Month == 10 ~ "Oct",
      Post.Month == 11 ~ "Nov", Post.Month == 12 ~ "Dec",
      TRUE ~ "Unknown"
    ),
    # Calculate Efficiency (Interactions per viewer)
    Engage_Rate = Total.Interactions / Lifetime.Post.Total.Reach,
    # Standardize variables to easily spot outliers
    Reach_Std = (Lifetime.Post.Total.Reach - mean(Lifetime.Post.Total.Reach)) / sd(Lifetime.Post.Total.Reach)
  )
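
To make the two derived metrics concrete, here is a minimal sketch on three made-up rows. The `toy` data frame and its values are hypothetical; only the column names and the formulas are taken from the transformation above.

```r
library(dplyr)

# Hypothetical toy rows that reuse the dataset's column names
toy <- data.frame(
  Lifetime.Post.Total.Reach = c(1000, 2000, 6000),
  Total.Interactions        = c(50, 40, 60)
)

toy_metrics <- toy |>
  mutate(
    # Efficiency: interactions per person reached
    Engage_Rate = Total.Interactions / Lifetime.Post.Total.Reach,
    # z-score: distance from the mean reach, in standard deviations
    Reach_Std = (Lifetime.Post.Total.Reach - mean(Lifetime.Post.Total.Reach)) /
      sd(Lifetime.Post.Total.Reach)
  )

toy_metrics
```

In this toy example the post with the largest reach also has the lowest Engage_Rate, which is the kind of gap between volume and efficiency the rest of the analysis looks for.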

Interactive Data Exploration

# Viewing the cleaned dataset
datatable(head(fb_full, 100), options = list(pageLength = 5, scrollX = TRUE), 
          caption = "Cleaned Facebook Metrics Data")

Exploratory Data Analysis & Visualization

I generated nine static plots (Figures 1 to 9) using ggplot2 to break down the narrative.

Analyzing Distributions

# P1: Histogram showing the distribution of Reach
p1 <- ggplot(fb_full, aes(x = Lifetime.Post.Total.Reach)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Figure 1: Histogram of Total Reach", x = "Lifetime Reach", y = "Count")

# P2: Scatter plot of Hour vs Reach
p2 <- ggplot(fb_full, aes(x = Post.Hour, y = Lifetime.Post.Total.Reach, color = Type)) +
  geom_point(alpha = 0.5) +
  theme_classic() +
  labs(title = "Figure 2: Reach by Time of Day", x = "Hour of Day", y = "Total Reach")

# P3: Boxplot of Reach by Format
p3 <- ggplot(fb_full, aes(x = Type, y = Lifetime.Post.Total.Reach, fill = Type)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Figure 3: Reach Distribution by Format", x = "Format", y = "Total Reach")

p1 / p2 # Displaying 1 and 2

p3      # Displaying 3

Figure 1 shows that the vast majority of posts have relatively low reach; the heavy right skew comes from a handful of highly viral posts.
Figure 2 breaks reach down by hour of day. The largest outliers (especially videos) occur regardless of the hour at which they are posted.
Figure 3 highlights the volatility of video posts. Videos tend to reach more people, but the wide vertical spread means their performance is highly unpredictable.
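
The right skew described above can also be quantified rather than only read off the histogram. The sketch below uses synthetic log-normal values as a stand-in for Lifetime.Post.Total.Reach (the numbers are not from the dataset) and compares outlier-sensitive summaries with robust ones.

```r
set.seed(42)

# Synthetic stand-in for Lifetime.Post.Total.Reach (log-normal, so right-skewed)
reach <- round(rlnorm(500, meanlog = 8.5, sdlog = 1))

# Under right skew, a few viral posts pull the mean well above the median
c(mean = mean(reach), median = median(reach))

# The interquartile range is far less sensitive to outliers than the sd
c(sd = sd(reach), iqr = IQR(reach))

# On a log10 scale the distribution is roughly symmetric
log_reach <- log10(reach)
c(mean = mean(log_reach), median = median(log_reach))
```

For the real data, the plotting equivalent would be adding `scale_x_log10()` to the histogram, and medians per Type would be a more stable summary than means.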

Category and Efficiency Analysis

Next, I look at how efficiently each format performs given the topic of the post. Specifically, I wanted to see whether certain formats work better for product posts than for inspirational posts.

# P4: Bar chart of average likes with labels
avg_likes <- fb_full |> group_by(Type) |> summarize(mean_like = mean(like))

p4 <- ggplot(avg_likes, aes(x = reorder(Type, -mean_like), y = mean_like, fill = Type)) +
  geom_col() +
  geom_text(aes(label = round(mean_like, 0)), vjust = -0.5) +
  theme_light() +
  labs(title = "Figure 4: Average Likes by Format", x = "Format", y = "Average Likes")

# P5: Boxplot Faceted by Category
p5 <- ggplot(fb_full, aes(x = Type, y = Engage_Rate, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~Category_Name) +
  theme_minimal() +
  labs(title = "Figure 5: Efficiency Rate by Category", y = "Interactions per Viewer")

# P6: Segment plot showing standardized success over the year
monthly_perf <- fb_full |> 
  group_by(Post.Month, Day_Type) |> 
  summarize(Avg_Reach = mean(Reach_Std), .groups = "drop")

p6 <- ggplot(monthly_perf, aes(x = Avg_Reach, y = factor(Post.Month), color = Day_Type)) +
  geom_point(size = 3) +
  geom_segment(aes(xend = 0, yend = factor(Post.Month)), linetype = "dashed") +
  theme_classic() +
  labs(title = "Figure 6: Standardized Reach by Month", x = "Standardized Reach", y = "Month")

p4

p5

p6

Figure 4 shows that Photos lead in average likes.
Figure 5 shows that across all three categories (Action, Product, and Inspiration), Photos consistently yield the highest efficiency (interactions per viewer).
Figure 6 highlights seasonality: reach fluctuates considerably from month to month, whether a post goes up on a weekday or a weekend.

Relationships

Up to this point, the analysis has focused on distributions and group comparisons. I now turn my attention to the relationship between seeing a post and liking it.

# P7: Scatter plot of Reach vs Likes with Trend Line
p7 <- ggplot(fb_full, aes(x = Lifetime.Post.Total.Reach, y = like, color = Type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light() +
  labs(title = "Figure 7: Reach vs. Likes (Engagement Velocity)")

# P8: Scatter plot with Annotations for Outliers
top_posts <- fb_full |> arrange(desc(like)) |> head(5)

p8 <- ggplot(fb_full, aes(x = share, y = like)) +
  geom_point(aes(color = Type), alpha = 0.6) +
  geom_text_repel(data = top_posts, aes(label = paste("ID:", Post_ID)), size = 3) +
  theme_minimal() +
  labs(title = "Figure 8: Shares vs. Likes with Top Outliers Annotated")

# P9: Interaction Comparison
p9 <- ggplot(fb_full, aes(x = factor(Paid), y = Total.Interactions, fill = factor(Paid))) +
  geom_boxplot() +
  scale_fill_manual(values = c("gray", "purple"), labels = c("Organic", "Paid")) +
  theme_bw() +
  labs(title = "Figure 9: Total Interactions (Organic vs Paid)", x = "Paid Status", fill = "Status")

p7

p8

p9

Figure 7 visualizes the core concept, “Engagement Velocity”: the slope of likes against reach. The steep slope for Photos indicates a high return on reach, whereas the Video trend line is almost flat.
Figure 8 shows a strong positive relationship between shares and likes; the five most-liked posts are labeled with their Post_ID.
Figure 9 yields a surprising result: although Paid content pushes a few outliers higher, the overall distribution of interactions is nearly identical to that of Organic posts.
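
The visual readings of Figures 8 and 9 could be backed by two simple statistics. The sketch below is one possible check, run on synthetic stand-ins for the share, like, and Paid columns (the data and variable names are made up): a rank correlation for shares versus likes, and a rank-sum test for paid versus organic posts.

```r
set.seed(1)
n <- 200

# Synthetic stand-ins for the share, like and Paid columns
share <- rpois(n, lambda = 20)
like  <- 5 * share + rpois(n, lambda = 30)
paid  <- rbinom(n, size = 1, prob = 0.3)

# Spearman's rank correlation is less sensitive to a few viral posts than Pearson's
rho <- cor(share, like, method = "spearman")
rho

# Wilcoxon rank-sum test: do paid and organic posts differ in typical likes?
wt <- wilcox.test(like ~ factor(paid), exact = FALSE)
wt$p.value
```

Rank-based methods are used here because the engagement counts are heavily skewed; on the real data the same two calls would take `fb_full$share`, `fb_full$like`, and `fb_full$Paid`.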

Interactive Visualizations

I converted a few of the most important charts into interactive widgets.

# Interactive version of Figure 7: hoverable scatter plot
plotly_scatter <- ggplotly(p7, tooltip = c("x", "y", "color")) |> 
  layout(title = "Hover to see exact Reach and Likes")

# Interactive version of Figure 4: dynamic bar chart
plotly_bar <- ggplotly(p4, tooltip = "y") |> 
  layout(title = "Average Likes by Format")

plotly_scatter

plotly_bar

Data Modeling

Ordinary Least Squares (OLS) Regression

To put some mathematical weight behind the visual trends I saw in Figure 7, I ran a linear regression model to predict like based on Lifetime.Post.Total.Reach and Type.

# Linear Regression Model
model <- lm(like ~ Lifetime.Post.Total.Reach + Type, data = fb_full)
summary(model)


Call:
lm(formula = like ~ Lifetime.Post.Total.Reach + Type, data = fb_full)

Residuals:
    Min      1Q  Median      3Q     Max 
-1180.1   -69.0   -26.3    43.8  3640.9 

Coefficients:
                              Estimate   Std. Error t value
(Intercept)                -76.0834210   58.4150828  -1.302
Lifetime.Post.Total.Reach    0.0080563    0.0005432  14.832
TypePhoto                  153.1988250   59.0936584   2.592
TypeStatus                 147.4265047   70.2731044   2.098
TypeVideo                 -105.0188541  118.4531099  -0.887
                                      Pr(>|t|)    
(Intercept)                            0.19337    
Lifetime.Post.Total.Reach < 0.0000000000000002 ***
TypePhoto                              0.00981 ** 
TypeStatus                             0.03642 *  
TypeVideo                              0.37574    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 269.9 on 490 degrees of freedom
Multiple R-squared:  0.3135,    Adjusted R-squared:  0.3079 
F-statistic: 55.94 on 4 and 490 DF,  p-value: < 0.00000000000000022

The regression output supports what the charts suggested. The coefficient for Lifetime.Post.Total.Reach plays the role of Engagement Velocity: reach is strongly associated with likes, while the Type coefficients shift the baseline number of likes depending on the format. Because the model has no Reach-by-Type interaction, this velocity is a single slope shared by all formats, and the model explains about 31% of the variance in likes (R-squared = 0.3135).
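
Figure 7 draws a separate trend line per format, while the fitted model estimates one common slope. One way to test whether the slopes really differ is to add an interaction term. The sketch below shows the idea on synthetic data in which the slopes are deliberately set to differ; the variable names and values are illustrative and are not results from the Facebook dataset.

```r
set.seed(7)
n <- 300

# Synthetic data: the slope of likes on reach is set to differ by format
type  <- factor(sample(c("Link", "Photo", "Video"), n, replace = TRUE))
reach <- runif(n, 1000, 50000)
slope <- c(Link = 0.004, Photo = 0.010, Video = 0.002)[as.character(type)]
like  <- 20 + slope * reach + rnorm(n, sd = 30)

additive    <- lm(like ~ reach + type)  # one shared slope for all formats
with_slopes <- lm(like ~ reach * type)  # a separate slope for each format

# F-test: does allowing format-specific slopes improve the fit?
cmp <- anova(additive, with_slopes)
cmp

# Per-format slopes recovered from the interaction model (Link is the baseline)
coefs <- coef(with_slopes)
slopes <- c(
  Link  = unname(coefs["reach"]),
  Photo = unname(coefs["reach"] + coefs["reach:typePhoto"]),
  Video = unname(coefs["reach"] + coefs["reach:typeVideo"])
)
slopes
```

On the real data the equivalent formula would be `like ~ Lifetime.Post.Total.Reach * Type`; a significant F-test there would justify speaking of a format-specific Engagement Velocity.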

The coefficient for Lifetime.Post.Total.Reach is highly significant (p < 0.001). For every 1,000 additional people a post reaches, the company can expect about 8 additional likes (\(0.0080563 \times 1000 \approx 8.06\)), holding the post format constant.

The Type coefficients show how each format compares with the baseline format, Link:

Photos show the strongest performance. Holding reach constant, a photo generates about 153 more likes than a baseline link (p < 0.01).
Status updates are also strong performers, generating about 147 more likes than a link, assuming they have the exact same reach (p < 0.05).
Videos show a negative coefficient of about -105. However, because the p-value is 0.376 (well above the 0.05 threshold), this difference is not statistically significant. In other words, despite their large reach potential, videos do not reliably generate more or fewer likes than a link with the same reach.
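
To see how these coefficients combine, the sketch below plugs the estimates from the summary output into the model equation for a post that reaches 50,000 people. `predict_likes` is a hypothetical helper written for this illustration; with the fitted object available, `predict(model, newdata = ...)` would do the same job.

```r
# Coefficients copied from the summary(model) output above
b <- c(intercept = -76.0834210, reach = 0.0080563,
       Photo = 153.1988250, Status = 147.4265047, Video = -105.0188541)

# Hypothetical helper: expected likes for a given reach and format
predict_likes <- function(reach, type = c("Link", "Photo", "Status", "Video")) {
  type <- match.arg(type)
  # Link is the baseline format, so it gets no Type offset
  offset <- if (type == "Link") 0 else b[[type]]
  b[["intercept"]] + b[["reach"]] * reach + offset
}

formats <- c("Link", "Photo", "Status", "Video")
pred <- sapply(formats, function(f) predict_likes(50000, f))
round(pred, 1)
```

Because the model is additive, the gap between formats is the same at every reach level: a Photo is always predicted about 153 likes above a Link. The Video estimate should be read with caution given its non-significant coefficient.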

K-Means Clustering

I also applied K-Means clustering to see whether posts could be grouped into distinct posting environments. All engagement outcomes (like, share, Total.Interactions) were excluded before clustering, so the groups are based purely on page size and timing (Page.total.likes, Post.Hour, Post.Weekday) and cannot be shaped by how the posts performed.

# Select variables for clustering, removing target variables
cluster_data <- fb_full |>
  select(Page.total.likes, Post.Hour, Post.Weekday) |>
  scale() # Standardize data

# Run K-Means with 3 centers
set.seed(123)
kmeans_res <- kmeans(cluster_data, centers = 3, nstart = 25)

# Add cluster assignments back to a copy of the data
fb_clusters <- fb_full |> mutate(Cluster = as.factor(kmeans_res$cluster))

# Visualize the clusters 
ggplot(fb_clusters, aes(x = Page.total.likes, y = Post.Hour, color = Cluster)) +
  geom_point(size = 3, alpha = 0.6) +
  theme_minimal() +
  labs(title = "Grouping Post Environments", 
       x = "Total Page Likes", y = "Hour of Post")

Facebook pages tend to gain likes over time: as the year progresses, the page matures and its follower count grows. The X-axis of the cluster plot (Total Page Likes) therefore acts as a proxy timeline for the account.

With that in mind, the three clusters can be read as three distinct posting phases:

Cluster 1 (The Red Group): This group represents the account’s earliest phase in the dataset, spanning roughly 80,000 to 110,000 likes. While the bulk of these posts are concentrated in the morning and early afternoon (hours 0 to 15), red dots are scattered all the way up the Y-axis into the late evening. This suggests that early on, the team was experimenting with a wider, less restricted range of posting times.
Cluster 3 (The Blue Group): This cluster captures the growth phase, bridging the gap between roughly 110,000 and 130,000 likes. Because it overlaps with both the early and mature stages, it represents a shift in strategy. More importantly, the vertical scatter begins to tighten: the team starts to home in on the morning to mid-day window (hours 0 to 15), largely abandoning the late-night experimental posts.
Cluster 2 (The Green Group): This group represents the peak maturity of the page in the dataset, heavily concentrated around the 130,000 to 140,000 like mark. By this point the team appears to have locked in its posting schedule: the posts fall almost entirely between hours 0 and 15.

Taken together, the clusters suggest that as the page matured, the marketing team shifted from a scattered, experimental posting schedule to a more disciplined morning to mid-day approach. Because the engagement variables were excluded from the clustering, this pattern reflects posting behavior alone, not how many likes or shares the posts received.
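
The choice of three centers can be sanity-checked with an elbow plot of the total within-cluster sum of squares. The sketch below illustrates the idea on synthetic two-dimensional points with three known groups (the `pts` matrix is a made-up stand-in for the scaled clustering data).

```r
set.seed(123)

# Synthetic stand-in for the scaled clustering matrix: three separated groups
centers <- rbind(c(-3, 0), c(0, 3), c(3, 0))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(pts, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which extra clusters bring only small improvements
round(wss, 1)
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Applied to the real data, the same loop would run over `cluster_data`; a clear bend at k = 3 would support the three-phase reading above, while a smooth curve would suggest the phases are a convenient summary rather than sharply separated groups.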

Conclusion

To summarize, this project points to three takeaways about Facebook engagement:

Format over Timing: In this dataset, what you post (Type) matters more than when you post (weekday versus weekend).
Prioritize Format Based on Goals: Videos are the best bet if the goal is strictly exposure. Photos, however, show a much higher Engagement Velocity in Figure 7 and significantly more likes than Links at equal reach in the regression. If the goal is community interaction, Photos are the winner.
Efficiency Standards: Shifting the focus from raw reach to interaction rates gives a truer picture of the content's value.

Location & Team Information

# KSU Interactive Map
ksu_map <- leaflet() |>
  addTiles() |>
  setView(lng = -84.5810, lat = 34.0379, zoom = 15) |>
  addMarkers(lng = -84.5810, lat = 34.0379, popup = "Kennesaw State University<br>Department of Data Science and Analytics")

ksu_map

Team Contact:

Dashan Richards - drich124@students.kennesaw.edu