Overview

This project applies point and interval estimation to an Instagram post dataset to understand engagement patterns. The objective is to estimate average engagement metrics, such as likes, and to see how they vary with the characteristics of a post. Using R for data processing and visualization, I begin with data cleaning, then use the R libraries plotly, ggplot2 and tinytex to illustrate the relationships and to typeset the equations used.

Point and Interval Estimation

Point estimation uses sample data to calculate a single value that serves as a “best guess” or “best estimate” of an unknown population parameter.
Interval estimation computes a range of values from the sample that is likely to contain the population parameter at a stated confidence level, such as 95%.
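
To make the two definitions concrete, here is a minimal sketch on a made-up vector of like counts. The numbers are hypothetical and are not taken from the Instagram dataset; `t.test()` is base R's t-based interval, used here only as one common way to get a 95% confidence interval for a mean.

```r
# Toy data: hypothetical like counts for eight posts (not from the dataset).
likes <- c(120, 95, 210, 60, 150, 180, 75, 110)

# Point estimate: the sample mean is a single best guess for the population mean.
point_estimate <- mean(likes)

# Interval estimate: a 95% t-based confidence interval around that mean.
interval_estimate <- t.test(likes, conf.level = 0.95)$conf.int

point_estimate
interval_estimate
```

The point estimate is one number, while the interval estimate is a pair of bounds that brackets it.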

Data Cleaning

In preparing the Instagram dataset for analysis, I’ve converted key metrics like likes and comments to numeric values for proper statistical handling and addressed any missing data in these fields.

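# Convert via as.character() first so that factor columns yield their values, not level codes
instagram_data$likes <- as.numeric(as.character(instagram_data$likes))
instagram_data$comments <- as.numeric(as.character(instagram_data$comments))
instagram_data$followers <- as.numeric(as.character(instagram_data$followers))
instagram_data$following <- as.numeric(as.character(instagram_data$following))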
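
The `as.character()` step matters whenever a column was read in as a factor. The short sketch below uses a hypothetical three-value factor (not the real data) to show the difference; whether any column of the actual dataset is stored as a factor is an assumption.

```r
# Hypothetical factor column, as can happen when counts are read in as text.
likes_raw <- factor(c("250", "12", "1033"))

# Direct conversion returns the internal level codes, not the counts.
level_codes <- as.numeric(likes_raw)

# Going through character first recovers the actual counts.
like_counts <- as.numeric(as.character(likes_raw))

level_codes
like_counts
```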

Additionally, I transformed the ‘created_at’ timestamps into a readable date-time format. These cleaning steps make the later analysis easier to follow.

instagram_data$created_at <- as.POSIXct(instagram_data$created_at, origin="1970-01-01", tz="UTC")

Average likes by hour and by post type

The plot shows the average likes for Instagram posts by hour and by post type. Photos generally receive more likes than videos, with pronounced peaks at specific hours.
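
A plot of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the code behind the original figure: `toy_posts` is a made-up stand-in for `instagram_data`, and the use of a line chart and the names `hour`, `likes_by_hour` and `hourly_plot` are illustrative choices.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for instagram_data (values are made up for illustration).
toy_posts <- data.frame(
  created_at = as.POSIXct(c("2019-01-01 08:15:00", "2019-01-01 08:40:00",
                            "2019-01-01 20:05:00", "2019-01-02 20:30:00",
                            "2019-01-02 20:45:00", "2019-01-03 08:50:00"),
                          tz = "UTC"),
  is_video = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  likes = c(100, 40, 300, 500, 60, 200)
)

# Extract the hour of day, label the post type, then average likes per group.
likes_by_hour <- toy_posts %>%
  mutate(hour = as.integer(format(created_at, "%H", tz = "UTC")),
         post_type = if_else(is_video, "Video", "Photo")) %>%
  group_by(hour, post_type) %>%
  summarise(average_likes = mean(likes, na.rm = TRUE), .groups = "drop")

# One line per post type across the hours of the day.
hourly_plot <- ggplot(likes_by_hour,
                      aes(x = hour, y = average_likes, colour = post_type)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Likes by Hour and Post Type",
       x = "Hour of Day (UTC)",
       y = "Average Likes",
       colour = "Post Type") +
  theme_minimal()
```

Printing `hourly_plot` draws the chart; `likes_by_hour` holds the underlying averages.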

Engagement Distribution by Post Type

The histogram on the next slide, displayed on a log scale, shows the distribution of likes for both photo and video posts on Instagram. I used log scaling to better visualize the data, which spans several orders of magnitude and is heavily right-skewed.

A log scale compresses the wide range of likes into a more manageable span, so patterns become visible without one range dominating the view. It also reveals the shape of the like distribution more clearly: most posts accumulate a modest number of likes, while very few receive orders of magnitude more.
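
As an illustration of the log-scale idea, the sketch below draws a histogram of simulated, right-skewed like counts with `scale_x_log10()`. The data are generated from a log-normal distribution purely for demonstration, and adding 1 to avoid zero counts on the log axis is an assumption; the source does not say how posts with zero likes were handled.

```r
library(ggplot2)

set.seed(1)

# Simulated right-skewed like counts (hypothetical, not the real dataset).
# Adding 1 keeps every value positive so it can be placed on a log axis.
toy_posts <- data.frame(
  likes = round(rlnorm(500, meanlog = 4, sdlog = 1.5)) + 1,
  post_type = sample(c("Photo", "Video"), 500, replace = TRUE)
)

# Overlaid histograms per post type with a log10-scaled x axis.
likes_hist <- ggplot(toy_posts, aes(x = likes, fill = post_type)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_x_log10() +
  labs(title = "Distribution of Likes by Post Type (log scale)",
       x = "Likes (log10 scale)",
       y = "Number of Posts",
       fill = "Post Type") +
  theme_minimal()
```

On a linear axis the same data would pile up in the first few bins, with the largest values stretching the axis far to the right.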

Point Estimation Plot

The bar chart on the next slide gives a straightforward visual representation of the point estimates of average likes on Instagram, comparing photos to videos.

library(dplyr)    # %>%, group_by(), summarise(), mutate()
library(ggplot2)  # plotting

# Point estimate: mean likes per post type (is_video is TRUE for videos)
avg_likes <- instagram_data %>%
  group_by(is_video) %>%
  summarise(average_likes = mean(likes, na.rm = TRUE)) %>%
  mutate(post_type = if_else(is_video, "Video", "Photo")) %>%
  ungroup()

# One bar per post type, labelled with its rounded mean
ggplot(avg_likes, aes(x = post_type, y = average_likes, fill = post_type)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = round(average_likes, 1)), vjust = -0.3, size = 3.5) +
  scale_fill_manual(values = c("Photo" = "blue", "Video" = "red")) +
  labs(title = "Point Estimation of Average Likes by Post Type",
       x = "Post Type",
       y = "Average Likes") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0))

Interval Estimation Plot

The following bar chart shows the interval estimation for average likes on Instagram, separated into the two post types: Photo and Video. Each bar represents the point estimate of the average likes for the respective category, while the vertical error bars represent the range of the 95% confidence interval.

The intervals suggest that photos tend to have a higher average number of likes and that their estimate is also more uncertain, as shown by the wider confidence interval.
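
A chart like this could be built roughly as follows. This is a sketch on simulated data, not the code behind the original figure: `toy_posts`, `ci_likes`, `ci_plot` and the intermediate column names are hypothetical, and the interval is the usual t-based 95% confidence interval for a mean.

```r
library(dplyr)
library(ggplot2)

set.seed(42)

# Simulated like counts for 60 photos and 40 videos (hypothetical values).
toy_posts <- data.frame(
  is_video = rep(c(FALSE, TRUE), times = c(60, 40)),
  likes = c(rlnorm(60, meanlog = 5, sdlog = 1),
            rlnorm(40, meanlog = 4.5, sdlog = 0.8))
)

# Mean, standard error and t-based 95% interval per post type.
ci_likes <- toy_posts %>%
  mutate(post_type = if_else(is_video, "Video", "Photo")) %>%
  group_by(post_type) %>%
  summarise(
    n = sum(!is.na(likes)),
    average_likes = mean(likes, na.rm = TRUE),
    std_error = sd(likes, na.rm = TRUE) / sqrt(n),
    .groups = "drop"
  ) %>%
  mutate(
    margin = qt(0.975, df = n - 1) * std_error,
    ci_lower = average_likes - margin,
    ci_upper = average_likes + margin
  )

# Bars show the point estimates; error bars show the 95% intervals.
ci_plot <- ggplot(ci_likes, aes(x = post_type, y = average_likes, fill = post_type)) +
  geom_col(width = 0.5) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.15) +
  labs(title = "Interval Estimation of Average Likes by Post Type",
       x = "Post Type",
       y = "Average Likes") +
  theme_minimal() +
  theme(legend.position = "none")
```

The width of each interval depends on both the spread of likes within the group and the number of posts in it, so a wider bar can reflect more variable data, fewer posts, or both.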

The Mean (Point Estimation) using LaTeX

The mean provides a point estimate of average likes for photos and videos. It serves as a benchmark for the typical number of likes a post can expect. Here \(x_i\) is the number of likes on post \(i\) and \(n\) is the number of posts. \[ \text{Mean} (\bar{x}) = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The Confidence Interval (Interval Estimation) using LaTeX

The confidence intervals around the mean likes represent the uncertainty in the estimates, giving a range that likely contains the true average likes for each post type. In effect, they surround the point estimates with a statistical margin of error. Here \(s\) is the sample standard deviation, \(n\) is the number of posts, and \(t_{\frac{\alpha}{2}, n-1}\) is the critical value of the t distribution with \(n-1\) degrees of freedom. \[ \text{CI} = \bar{x} \pm t_{\frac{\alpha}{2}, n-1} \left(\frac{s}{\sqrt{n}}\right) \]
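
As a worked example of the confidence interval formula, suppose a hypothetical sample of \(n = 25\) posts has mean \(\bar{x} = 120\) likes and sample standard deviation \(s = 40\). These numbers are invented for illustration and do not come from the dataset. For a 95% interval, \(\alpha = 0.05\) and the critical value is \(t_{0.025,\,24} \approx 2.064\).

```latex
\[
\text{CI} = 120 \pm 2.064 \left(\frac{40}{\sqrt{25}}\right)
          = 120 \pm 2.064 \times 8
          \approx 120 \pm 16.51
          \approx (103.49,\ 136.51)
\]
```

The margin of error shrinks as \(n\) grows and widens as \(s\) grows, which is why two groups with similar means can still have intervals of very different widths.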

Point and Interval Approaches to Instagram Data
