Pablo Vasquez
2024-04-08
This project explores statistical analysis through an Instagram post dataset to understand engagement patterns using point and interval estimation. My objective is to understand average engagement metrics, such as likes, and how they are influenced by the characteristics of a post. Using R for data processing and visualization, I begin with data cleaning, then use the R libraries plotly and ggplot2 to illustrate the relationships in the data, and tinytex to render the equations used.
Point estimation uses sample data to
calculate a single value that serves as a “best guess” or “best
estimate” of an unknown population parameter.
Interval estimation evaluates a population parameter
by computing an interval within which the parameter is
most likely to be located, at a stated confidence level.
In preparing the Instagram dataset for analysis, I’ve converted key metrics like likes and comments to numeric values for proper statistical handling and addressed any missing data in these fields.
instagram_data$likes <- as.numeric(as.character(instagram_data$likes))
instagram_data$comments <- as.numeric(instagram_data$comments)
instagram_data$followers <- as.numeric(instagram_data$followers)
instagram_data$following <- as.numeric(instagram_data$following)
Additionally, I transformed the ‘created_at’ timestamps into a readable format. These cleaning steps make the analysis that follows easier to carry out and to interpret.
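The timestamp conversion itself is not shown above, and neither is the raw format of ‘created_at’. As a minimal sketch, assuming the column holds Unix epoch seconds (an assumption, not something the dataset description confirms), the conversion could look like this. The `toy_posts` data frame and the `hour` column are hypothetical names used only for illustration.

```r
# Toy stand-in for instagram_data; the epoch-seconds format is an assumption.
toy_posts <- data.frame(created_at = c(1712534400, 1712577600))

# Interpret the values as seconds since 1970-01-01 UTC.
toy_posts$created_at <- as.POSIXct(toy_posts$created_at,
                                   origin = "1970-01-01", tz = "UTC")

# Derive the posting hour (0-23), useful for grouping likes by hour of day.
toy_posts$hour <- as.integer(format(toy_posts$created_at, "%H"))
```

If the real column stores date strings instead, `as.POSIXct()` with a matching `format` argument would be the analogous step.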
The plot shows the average likes for Instagram posts by hour of posting, highlighting that photos generally receive more likes than videos, with pronounced peaks at specific hours.
The histogram on the next slide, displayed on a log scale, shows the distribution of likes for both photo and video posts on Instagram. I used log scaling to better visualize the data, which spans several orders of magnitude and is heavily right-skewed.
By using a log scale, the wide range of likes is compressed into a more manageable scope, so patterns become visible without one range dominating the view. The log scale reveals the shape of the like distribution more clearly, showing that most posts accumulate a modest number of likes, while very few receive orders of magnitude more.
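The histogram code is not reproduced here, so the following is only a sketch of how a log-scaled histogram can be built with ggplot2. It uses simulated like counts, not the real dataset; `sim_posts` and `log_hist` are hypothetical names. One is added to every simulated count so that no post has zero likes, since zero cannot be placed on a log axis.

```r
library(ggplot2)

# Simulated, right-skewed like counts (illustrative only, not the real data).
set.seed(1)
sim_posts <- data.frame(
  likes    = round(rlnorm(1000, meanlog = 5, sdlog = 2)) + 1,
  is_video = rep(c(FALSE, TRUE), times = c(700, 300))
)
sim_posts$post_type <- ifelse(sim_posts$is_video, "Video", "Photo")

# With a log10 x-axis, equal-width bins cover equal ratios of likes,
# so the long right tail no longer squeezes the bulk of the posts.
log_hist <- ggplot(sim_posts, aes(x = likes, fill = post_type)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_x_log10() +
  labs(x = "Likes (log scale)", y = "Number of posts", fill = "Post type") +
  theme_minimal()
```

Printing `log_hist` draws the chart. Real data containing posts with zero likes would need those rows handled before applying `scale_x_log10()`.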
The bar chart on the next page provides a straightforward visual representation of the point estimation for average likes on Instagram, comparing photos to videos.
library(dplyr)    # group_by(), summarise(), mutate(), %>%
library(ggplot2)  # plotting
# Point estimate: mean likes per post type (is_video is expected to be logical)
avg_likes <- instagram_data %>%
  group_by(is_video) %>%
  summarise(average_likes = mean(likes, na.rm = TRUE)) %>%
  mutate(post_type = if_else(is_video, "Video", "Photo")) %>%
  ungroup()
# Bar chart of the two point estimates, each labelled with its value
ggplot(avg_likes, aes(x = post_type, y = average_likes, fill = post_type)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = round(average_likes, 1)), vjust = -0.3, size = 3.5) +
  scale_fill_manual(values = c("Photo" = "blue", "Video" = "red")) +
  labs(title = "Point Estimation of Average Likes by Post Type",
       x = "Post Type",
       y = "Average Likes") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0))
The following bar chart shows the interval estimation for average likes on Instagram, separated into the two post types: Photo and Video. Each bar represents the point estimate of the average likes for the respective category, while the vertical lines (error bars) represent the range of the 95% confidence interval.
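The intervals suggest that photos tend to have a higher average number of likes, and that the estimate for photos is also less precise, as shown by the wider confidence interval. Because the width of the interval grows with the sample standard deviation and shrinks with the sample size, the wider interval points to greater variability in photo likes, provided the photo sample is not much smaller than the video sample.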
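The code for the interval chart is not shown, so here is a hedged sketch of one way to compute a 95% t-based interval per post type and draw it as error bars. It runs on simulated likes rather than the real dataset, and the names `sim_posts`, `ci_likes`, `margin`, `lower`, `upper` and `ci_plot` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Simulated likes (illustrative only): 400 photos and 100 videos.
set.seed(42)
sim_posts <- data.frame(
  likes    = c(rlnorm(400, 6, 1.5), rlnorm(100, 5.5, 1.2)),
  is_video = rep(c(FALSE, TRUE), times = c(400, 100))
)

# Per post type: sample size, mean, standard error, then the 95% interval.
ci_likes <- sim_posts %>%
  group_by(is_video) %>%
  summarise(
    n             = sum(!is.na(likes)),
    average_likes = mean(likes, na.rm = TRUE),
    se            = sd(likes, na.rm = TRUE) / sqrt(n),
    .groups       = "drop"
  ) %>%
  mutate(
    margin    = qt(0.975, df = n - 1) * se,  # t critical value times SE
    lower     = average_likes - margin,
    upper     = average_likes + margin,
    post_type = if_else(is_video, "Video", "Photo")
  )

# Bars show the point estimates; error bars show the interval estimates.
ci_plot <- ggplot(ci_likes, aes(x = post_type, y = average_likes, fill = post_type)) +
  geom_col(width = 0.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.15) +
  labs(x = "Post Type", y = "Average Likes") +
  theme_minimal() +
  theme(legend.position = "none")
```

Keeping `n` and `se` in the summary table makes it easy to see whether a wide interval comes from a large standard deviation or from a small group.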
The mean provided a point estimate of average likes for photos and videos. It served as a benchmark for the number of likes a post of each type could expect, where \(x_i\) is the number of likes on post \(i\) and \(n\) is the number of posts. \[ \text{Mean} (\bar{x}) = \frac{1}{n} \sum_{i=1}^{n} x_i \]
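The confidence intervals around the mean likes represent the uncertainty in the estimates, giving a range that likely contains the true average likes for each post type, essentially surrounding the point estimates with a statistical margin of error. Here \(s\) is the sample standard deviation, \(n\) is the sample size, and \(t_{\frac{\alpha}{2}, n-1}\) is the critical value of the t distribution with \(n-1\) degrees of freedom (\(\alpha = 0.05\) for a 95% interval). \[ \text{CI} = \bar{x} \pm t_{\frac{\alpha}{2}, n-1} \left(\frac{s}{\sqrt{n}}\right) \]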
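To make the two formulas concrete, the sketch below evaluates them step by step in base R on a small vector of made-up like counts (not taken from the dataset) and compares the result with the interval reported by `t.test()`. The variable names are chosen only for illustration.

```r
# Made-up like counts for eight posts (illustrative only).
likes <- c(120, 95, 310, 45, 210, 180, 75, 400)

n     <- length(likes)   # sample size
x_bar <- mean(likes)     # point estimate: (1/n) * sum(x_i)
s     <- sd(likes)       # sample standard deviation
alpha <- 0.05            # 1 - confidence level

# Critical value t_{alpha/2, n-1} and the resulting margin of error.
t_crit <- qt(1 - alpha / 2, df = n - 1)
margin <- t_crit * s / sqrt(n)

# Interval estimate: x_bar plus or minus the margin of error.
ci_manual <- c(x_bar - margin, x_bar + margin)

# The same interval as computed by R's one-sample t-test.
ci_builtin <- as.numeric(t.test(likes, conf.level = 0.95)$conf.int)
```

Both `ci_manual` and `ci_builtin` hold the lower and upper bounds, and they agree up to floating-point error.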