1 Introduction

In this study, we will try to understand what affects the number of followers on Instagram for the most popular accounts. Specifically, we will explore whether the number of posts relates to the number of followers and what kind of account draws the most followers.

Our data is from Data.world, which derives its data from Iconosquare. This data was collected on December 26, 2016 and is based on the 100 accounts with the most followers.

The variables we will be using are brand, categories_1, media_posted, and num.

  • Brand is an identification variable which indicates who/what is behind the account.
  • Categories_1 is a categorical explanatory/predictor variable which divides the accounts into celebrities, fashion, media, and sport.
  • Media_posted is a numerical explanatory/predictor variable which shows the number of Instagram posts for each account.
  • Num is the outcome variable which indicates how many followers each account has.

One limitation of this data is that the sample of 100 accounts is very small compared to the user base of Instagram. Another limitation is that the data is from 2016 and therefore slightly outdated. Lastly, the data covers only a single social media platform rather than others such as Facebook and Twitter.

The table below is a glimpse at our data.

brand                categories_1  num    media_posted
Selena Gomez         celebrities   105.4  1200
Taylor Swift         celebrities   95.2   958
Ariana Grande        celebrities   92.3   2800
Beyonce              celebrities   90.6   1400
Kim Kardashian West  celebrities   89.3   3600
Cristiano Ronaldo    celebrities   85.1   1600

2 Exploratory data analysis

term          estimate  std_error  statistic  p_value  conf_low  conf_high
intercept     31.039    3.199      9.703      0.000    24.687    37.392
media_posted  0.000     0.001      0.352      0.726    -0.001    0.002
## [1] 0.03642968

In the scatter plot, the correlation between the number of posts and the number of followers is extremely weak. The regression table shows that the slope of the fitted line rounds to 0.000, with a p-value of 0.726. Although the log10-scaled x axis makes the data appear to have a positive slope, the fitted slope is essentially zero.

In this sample, the correlation coefficient is 0.0364, which indicates a very weak positive linear relationship.
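As a rough illustration of how a correlation coefficient and a least-squares slope like the ones above are computed, here is a minimal Python sketch. It uses only the six rows shown in the glimpse table, not the full 100 accounts, so its output will not match the reported 0.0364. The function names `pearson_r` and `ols_line` are our own and are not part of the original analysis.

```python
from math import sqrt

# Six rows copied from the glimpse table: (brand, followers, posts).
# This is NOT the full data set, so results differ from the report.
rows = [
    ("Selena Gomez", 105.4, 1200),
    ("Taylor Swift", 95.2, 958),
    ("Ariana Grande", 92.3, 2800),
    ("Beyonce", 90.6, 1400),
    ("Kim Kardashian West", 89.3, 3600),
    ("Cristiano Ronaldo", 85.1, 1600),
]


def _sums(xs, ys):
    """Return mean_x, mean_y and the centered sums Sxx, Syy, Sxy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    _, _, sxx, syy, sxy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def ols_line(xs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope * x."""
    mean_x, mean_y, sxx, _, sxy = _sums(xs, ys)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


posts = [row[2] for row in rows]
followers = [row[1] for row in rows]
r = pearson_r(posts, followers)
intercept, slope = ols_line(posts, followers)
print(f"r = {r:.4f}, intercept = {intercept:.3f}, slope = {slope:.5f}")
```

The slope and the correlation always share the same sign, which is why a correlation near zero goes together with a fitted slope near zero.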

In the box plot, it appears that sport accounts have the highest median number of followers, followed by celebrity accounts, then fashion accounts, and lastly media accounts. In addition, celebrity accounts seem to show the most variation above the third quartile. Based on an eyeball test, it appears that the type of account may contribute to the number of followers.


3 Multiple regression

term                              estimate  std_error  statistic  p_value  conf_low  conf_high
intercept                         32.933    3.538      9.309      0.000    25.901    39.965
media_posted                      0.000     0.001      0.292      0.771    -0.002    0.002
categories_1fashion               -8.861    14.980     -0.591     0.556    -38.636   20.914
categories_1media                 -42.530   29.131     -1.460     0.148    -100.431  15.372
categories_1sport                 4.280     21.164     0.202      0.840    -37.787   46.346
media_posted:categories_1fashion  -0.001    0.006      -0.216     0.830    -0.013    0.010
media_posted:categories_1media    0.005     0.004      1.505      0.136    -0.002    0.013
media_posted:categories_1sport    -0.001    0.002      -0.508     0.613    -0.006    0.003

Our multiple regression model includes num (number of followers) as the outcome variable, with type of account and number of posts as the explanatory variables. Number of posts is the numerical explanatory variable and type of account is the categorical explanatory variable. In the graph, the y axis is the number of followers, the x axis is the number of posts, and the account categories are distinguished by color. The four categories do not all span the whole x axis because of the small sample size and the lack of data in some categories.

In our regression table, the intercepts and slopes are shown relative to the baseline category, celebrities. Relative to celebrities, the slope for media accounts is higher by 0.005, while the slopes for fashion and sport accounts are each lower by 0.001. None of these differences is statistically significant (all p-values exceed 0.1).

3.1 Statistical interpretation

Category: Celebrities

number of followers = intercept + media_posted(# of posts)

number of followers = 32.933 + 0(# of posts)

Category: Fashion

number of followers = (intercept + categories_1fashion) + (media_posted + media_posted:categories_1fashion)(# of posts)

number of followers = 32.933 - 8.861 + 0(# of posts) - 0.001(# of posts)

number of followers = 24.072 - 0.001(# of posts)

Category: Media

number of followers = (intercept + categories_1media) + (media_posted + media_posted:categories_1media)(# of posts)

number of followers = 32.933 - 42.530 + 0(# of posts) + 0.005(# of posts)

number of followers = -9.597 + 0.005(# of posts)

Category: Sport

number of followers = (intercept + categories_1sport) + (media_posted + media_posted:categories_1sport)(# of posts)

number of followers = 32.933 + 4.280 + 0(# of posts) - 0.001(# of posts)

number of followers = 37.213 - 0.001(# of posts)
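The arithmetic behind the four fitted lines above can be sketched in Python as follows. The estimates are copied from the multiple regression table; the helper names `category_line` and `predict` are hypothetical and only illustrate how the baseline (celebrities) and the offsets combine.

```python
# Estimates copied from the multiple regression table.
# Celebrities is the baseline category, so it has no offset terms.
coef = {
    "intercept": 32.933,
    "media_posted": 0.000,
    "categories_1fashion": -8.861,
    "categories_1media": -42.530,
    "categories_1sport": 4.280,
    "media_posted:categories_1fashion": -0.001,
    "media_posted:categories_1media": 0.005,
    "media_posted:categories_1sport": -0.001,
}


def category_line(category, baseline="celebrities"):
    """Return (intercept, slope) of the fitted line for one category."""
    intercept = coef["intercept"]
    slope = coef["media_posted"]
    if category != baseline:
        # Add the intercept offset and the slope (interaction) offset.
        intercept += coef[f"categories_1{category}"]
        slope += coef[f"media_posted:categories_1{category}"]
    return round(intercept, 3), round(slope, 3)


def predict(category, posts):
    """Fitted number of followers (same units as num) for a post count."""
    intercept, slope = category_line(category)
    return intercept + slope * posts


for category in ["celebrities", "fashion", "media", "sport"]:
    b0, b1 = category_line(category)
    print(f"{category}: followers = {b0} + {b1} * posts")
```

Because the table rounds every estimate to three decimals, the slopes recovered this way are only as precise as the table itself.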

In general, all else being held equal, there seems to be little to no relation between number of posts and number of followers for all categories. This result agrees with the low correlation coefficient and estimate of media_posted from above. However, there are some outliers which may affect the fitted regression lines.

3.2 Non-statistical interpretation

In general, the number of posts seems to have very little influence on the number of followers an account has. Posting more will therefore probably not improve one’s chances of gaining followers.


4 Inference for multiple regression

We will focus on celebrities as our category.

The histogram above shows that the residuals are skewed to the right and therefore do not follow a normal distribution. Because this condition for inference in regression is not met, the p-values and confidence intervals above must be taken with a grain of salt. A log base 10 transformation did not noticeably change the shape of the histogram.
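One way to put a number on the skew seen in a residual histogram is the moment-based sample skewness, where a positive value indicates a right skew. The sketch below is only an illustration under stated assumptions: it uses the six celebrity rows from the glimpse table and the fitted celebrity line from the regression table. Those six rows are the largest accounts in the data, so their residuals are all positive and say nothing about the full sample. The function names are our own.

```python
# Six rows copied from the glimpse table: (followers, posts). All celebrities.
rows = [
    (105.4, 1200),
    (95.2, 958),
    (92.3, 2800),
    (90.6, 1400),
    (89.3, 3600),
    (85.1, 1600),
]

# Fitted celebrity line, copied from the multiple regression table.
INTERCEPT, SLOPE = 32.933, 0.000


def residuals(data, intercept, slope):
    """Observed followers minus fitted followers for each row."""
    return [followers - (intercept + slope * posts) for followers, posts in data]


def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5; positive means right-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


res = residuals(rows, INTERCEPT, SLOPE)
print("residuals:", [round(r, 3) for r in res])
print("skewness:", round(sample_skewness(res), 3))
```

Applied to the residuals of all 100 accounts, the same function would give a single summary number to set beside the histogram.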

The confidence interval for the slope for celebrity accounts (media_posted) is [-0.002, 0.002]. This includes zero, meaning that the slope of the number of followers in relation to the number of posts for celebrities could plausibly be zero, which is consistent with the estimate of 0.000. Its p-value is .771, which is large. We therefore fail to reject the null hypothesis that there is no relationship between the number of posts and the number of followers for celebrities.

For media accounts, the confidence interval for the intercept offset (categories_1media) is [-100.431, 15.372], with a p-value of .148, and the confidence interval for the slope offset (media_posted:categories_1media) is [-0.002, 0.013], with a p-value of .136. Both intervals include zero, meaning that the intercept and the slope for media accounts may not differ from those for celebrities. These p-values are small compared to the other categories, but they are still large. Therefore, we again fail to reject the null hypothesis that there is no relationship between the number of posts and the number of followers.
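The "does the interval include zero" reasoning above can be sketched as follows. This is an approximation, not the method used to produce the table: it builds each 95% interval as estimate ± z × standard error using the normal quantile from Python's `statistics.NormalDist`, whereas the reported intervals come from the regression output, so the endpoints will differ slightly. The estimates and standard errors are copied from the regression table; the function names are hypothetical.

```python
from statistics import NormalDist

# (estimate, std_error) pairs copied from the multiple regression table.
terms = {
    "intercept": (32.933, 3.538),
    "media_posted": (0.000, 0.001),
    "categories_1media": (-42.530, 29.131),
    "media_posted:categories_1media": (0.005, 0.004),
}


def approx_ci(estimate, std_error, level=0.95):
    """Normal-approximation interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def includes_zero(interval):
    """True when zero lies inside the interval (fail to reject the null)."""
    low, high = interval
    return low <= 0 <= high


for name, (estimate, std_error) in terms.items():
    ci = approx_ci(estimate, std_error)
    verdict = "includes zero" if includes_zero(ci) else "excludes zero"
    print(f"{name}: [{ci[0]:.3f}, {ci[1]:.3f}] {verdict}")
```

Only the intercept's interval excludes zero here, which matches the pattern of p-values in the table.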


5 Conclusion

Our results show no evidence of a relationship between the number of posts and the number of followers on an Instagram account.

A take-home message from our analysis is that brands, celebrities, and persons interested in a multitude of followers should not focus on the number of posts they make. This lack of relationship holds for all categories: celebrities, fashion, media, and sport. So whether you are a news anchor or a fashion mogul, posting more will not guarantee you those extra hundred followers.

While analyzing our data, we discovered many limitations and caveats. First, this analysis is limited to Instagram only. It does not help the general public understand the effect of posts on other social media platforms (e.g., Facebook or Twitter). A future direction for this project would be to include different platforms. By including other platforms like Facebook, Twitter, and Tumblr, we could provide information on how posts and followers are similar or different across separate platforms.

A second caveat to remember with this analysis is that there are various reasons that followers choose to follow certain accounts. When someone decides to follow Katy Perry, they follow her because they like her and maybe not for the actual posts she makes. On the other hand, followers may follow a media outlet to learn more about their community and important events. The reasons one may follow an account may be different for every account, which could explain why the data may not show clear relationships.


6 Citations and References

We give thanks to:

Data.world

Iconosquare

ModernDive

We would like to thank you for reading our lovely analysis. Feel free to follow us on Instagram:

Stanley

Elizabeth

Joshua

We would finally like to thank our WONDERFUL, DAPPERLY DRESSED, INTELLIGENT, HUMOROUS, ONE-OF-A-KIND GUY, our professor, Albert Kim. He answered countless questions, soothed our nerves with advice, and showered us with statistical wisdom. Thank you Professor Kim, you will be missed and forever loved.