FreeCodeCamp Facebook Page Activity Overview

This is a quick dive into the stats of Free Code Camp's Facebook page activity: an analysis to see what types of posts bring the most attention in terms of reactions, clicks, and reach. I'll refer to these collectively as Interaction variables for the rest of the analysis.

Importing and Cleaning the Dataset

The data was provided in CSV format, so I'm going to read it in with the read_csv() function from the readr package. After that, let's take a look at the resulting data frame.

library(readr)
freecodecamp <- read_csv('~/Downloads/freeCodeCamp-facebook-page-activity.csv')
str(freecodecamp, max.level = 1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    420 obs. of  7 variables:
##  $ date     : chr  "08/18/2017" "08/18/2017" "08/17/2017" "08/17/2017" ...
##  $ time     :Classes 'hms', 'difftime'  atomic [1:420] 56700 41220 69480 60900 28560 ...
##   .. ..- attr(*, "units")= chr "secs"
##  $ title    : chr  "The origins of t-distributions and how they can help you make accurate estimates from small sample sizes." "How one camper got his developer dream job" "Trying to code when chat's open" "An interaction designer explains how a \"homeless iPhone\" might work." ...
##  $ type     : chr  "Link" "Link" "Video" "Link" ...
##  $ reach    : int  1768 6941 17399 3751 18248 4806 11422 17906 13734 9305 ...
##  $ clicks   : int  44 536 2236 167 1946 200 560 1441 1186 662 ...
##  $ reactions: chr  "21" "99" "750" "10" ...
##  - attr(*, "spec")=List of 2
##   ..- attr(*, "class")= chr "col_spec"

Ok, so there is one problem with the data frame. It seems that read_csv() has cast the reactions variable as chr instead of the expected int. So the next step will be to change that.

freecodecamp$reactions <- as.integer(freecodecamp$reactions)
## Warning: NAs introduced by coercion

That warning tells us that some of the values in this column were not numbers, which is why it was cast as a chr column in the first place. How many of them are NA will determine how much more cleaning we have to do.

sum(is.na(freecodecamp$reactions))
## [1] 1

There's only one value that didn't convert properly, so I feel safe excluding it from the data frame and moving on with the analysis.
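
Out of curiosity, we can also peek at the raw value that refused to parse. Here's a quick sketch that re-reads the column (since the one above has already been coerced):

# re-read the raw reactions column and pull out the entry that won't convert
raw_reactions <- read_csv('~/Downloads/freeCodeCamp-facebook-page-activity.csv')$reactions
raw_reactions[is.na(suppressWarnings(as.integer(raw_reactions)))]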

freecodecamp <- subset(freecodecamp, !is.na(freecodecamp$reactions))
str(freecodecamp, max.level = 1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    419 obs. of  7 variables:
##  $ date     : chr  "08/18/2017" "08/18/2017" "08/17/2017" "08/17/2017" ...
##  $ time     :Classes 'hms', 'difftime'  atomic [1:419] 56700 41220 69480 60900 28560 ...
##   .. ..- attr(*, "units")= chr "secs"
##  $ title    : chr  "The origins of t-distributions and how they can help you make accurate estimates from small sample sizes." "How one camper got his developer dream job" "Trying to code when chat's open" "An interaction designer explains how a \"homeless iPhone\" might work." ...
##  $ type     : chr  "Link" "Link" "Video" "Link" ...
##  $ reach    : int  1768 6941 17399 3751 18248 4806 11422 17906 13734 9305 ...
##  $ clicks   : int  44 536 2236 167 1946 200 560 1441 1186 662 ...
##  $ reactions: int  21 99 750 10 474 56 466 403 217 109 ...

And now we can see that the reactions column is an integer and everything looks good with the dataset.
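
As an aside, the coercion could have been avoided at import time by handing read_csv an explicit column specification; readr would then surface the bad value as a parsing problem instead. A sketch using readr's cols() helper:

# if starting over: declare the type up front and let readr flag the bad row
freecodecamp <- read_csv('~/Downloads/freeCodeCamp-facebook-page-activity.csv',
                         col_types = cols(reactions = col_integer()))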

Posts Overview

Each post is broken down by the type variable in the data frame, which contains four different values: Link, Video, Status, and Photo. So the first look will break the data down by those four types. All plots will be made with ggplot2.

library(ggplot2)
ggplot(aes(x = type), data = freecodecamp) + 
    geom_bar(aes(fill = type), stat = 'count') +
    xlab('Type') + ylab('Count') + labs(fill = 'Type')

It's clear that the Link type is far and away the most common type of post on the Free Code Camp Facebook page. This isn't unexpected, but it will obviously influence the counts in the next graphs, which break each type down by its Interaction variables. I also subset the data frame to values at or below the 99th percentile so outliers don't skew the graphs; we'll take a look at the outliers a little later.

ggplot(aes(x = reactions), 
    data = subset(freecodecamp, reactions <= quantile(reactions, .99))) + 
    facet_wrap(~type) +
    geom_histogram(aes(fill = type), col = I('black'), binwidth = 100) + 
    scale_x_continuous(breaks = seq(0, 2000, 200)) +
    xlab('Reactions') + ylab('Count') +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))

ggplot(aes(x = clicks), data = subset(freecodecamp, 
    clicks <= quantile(clicks,.99))) + 
    facet_wrap(~type) + 
    geom_histogram(aes(fill = type), col = I('black'), binwidth = 100) + 
    scale_x_continuous(breaks = seq(0, 7000, 500)) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    xlab('Clicks') + ylab('Count')

ggplot(aes(x = reach), data = subset(freecodecamp, 
    reach <= quantile(reach, .99))) + 
    facet_wrap(~type) + 
    geom_histogram(aes(fill = type), col = I('black'), binwidth = 1000) + 
    scale_x_continuous(breaks = seq(0, 60000, 5000)) + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    xlab('Reach') + ylab('Count')

I'm not sure this tells us much other than that the sample size is much larger for Links than for the other types, but it is a nice look at the distributions of the Interactions for each type of post. Let's take a look at the actual summary numbers for each type.

library(dplyr)
free_group <- group_by(freecodecamp, type)
summarise(free_group, Average_Click = mean(clicks), Average_Reach = mean(reach),
          Average_Reaction = mean(reactions))
## # A tibble: 4 x 4
##     type Average_Click Average_Reach Average_Reaction
##    <chr>         <dbl>         <dbl>            <dbl>
## 1   Link      849.7836     11494.203        212.67869
## 2  Photo     1748.2500     14816.193        475.98864
## 3 Status      285.0000      5109.667         62.33333
## 4  Video     6172.7391     37154.739       1124.34783

Surprisingly, Video leads the way in all three categories, with Photo second and Link a distant third (the handful of Status posts rank last). There could be several explanations for this, including the fact that Photo and Video have much smaller sample sizes than Link, allowing their outliers to exert far more leverage on these averages than would be possible for Link. Median values may show a clearer picture.

summarise(free_group, Median_Click = median(clicks), Median_Reach = median(reach),
          Median_Reaction = median(reactions))
## # A tibble: 4 x 4
##     type Median_Click Median_Reach Median_Reaction
##    <chr>        <dbl>        <dbl>           <dbl>
## 1   Link          565         9274             134
## 2  Photo         1457        11499             354
## 3 Status          337         5371              41
## 4  Video         1390        15965             273

Photo and Video perform better here as well, although Photo overtakes Video in clicks and reactions. I would think these numbers are more indicative of each type's population, especially given the smaller samples for Photo and Video. Again, I wouldn't completely pivot to video yet, as the small sample sizes relative to Link mean more data needs to be gathered.
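
For reference, the sample sizes behind that caution are easy to pull from the grouped data frame built earlier:

# how many posts of each type back these averages and medians
summarise(free_group, n = n())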

Average Interactions Over Time

So we've seen that Photos and Videos produce more interactions than Links, but what about the posts as a whole? Has Free Code Camp been growing its audience over the time span of the data? I would think they would like to get more and more people coming to their Facebook page every day, so let's see if that's happening.

The first step will be converting the data from a wide format to a long format, where the reach, clicks, and reactions columns are gathered into a single key column and a single value column.

library(tidyr)
long_freecode_data <- gather(freecodecamp, key = 'Interaction', 
                             value = 'counts', 'reach', 'clicks', 'reactions')
head(long_freecode_data[4:6])
## # A tibble: 6 x 3
##    type Interaction counts
##   <chr>       <chr>  <int>
## 1  Link       reach   1768
## 2  Link       reach   6941
## 3 Video       reach  17399
## 4  Link       reach   3751
## 5  Link       reach  18248
## 6  Link       reach   4806
long_freecode_data$date <- as.Date(long_freecode_data$date, '%m/%d/%Y')
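
As an aside, newer versions of tidyr supersede gather() with pivot_longer(); the equivalent call would look something like this (a sketch, assuming tidyr 1.0 or later):

# equivalent reshape with the newer tidyr verb (shown for comparison, not run)
pivot_longer(freecodecamp, cols = c(reach, clicks, reactions),
             names_to = 'Interaction', values_to = 'counts')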

So now that the data frame is ready, let's take a look at the average Interactions by type over the span of time contained in the data.

ggplot(aes(x = date, y = counts), data = subset(long_freecode_data, 
    counts <= quantile(counts, .99))) + 
    geom_line(aes(color = Interaction), stat = 'summary', fun.y = mean) +
    xlab('Date') + ylab('Averages Per Day')

There's a lot of noise there, so it's hard to see whether there's much growth in the different types of Interactions. Maybe if we take the average of all interactions with the Facebook posts, some growth will be evident.

ggplot(aes(x = date, y = counts), data = subset(long_freecode_data, 
    counts <= quantile(counts, .99))) + 
    geom_line(stat = 'summary', fun.y = mean) +
    xlab('Date') + ylab('Average total Interactions') + geom_smooth()

Yeah, there's not a whole lot of increase in the total average interactions with the Free Code Camp Facebook posts over this time period. But maybe if we factor out reach, which just means that people saw the post, there might be some growth in people reacting and clicking the links.

no_reach_df <- long_freecode_data[long_freecode_data$Interaction %in% c('clicks',
                                                                        'reactions'),]
ggplot(aes(x = date, y = counts), data = subset(no_reach_df, 
    counts <= quantile(counts, .99))) + 
    geom_line(stat = 'summary', fun.y = mean) +
    xlab('Date') + ylab('Average Clicks and Reactions') + geom_smooth(method = lm)

Still not much growth in interactions from Facebook users with the Free Code Camp posts, but there isn't any decrease either. So growth is trending up, just not at a fast pace.

Effect on Interactions by Article Publication Time

The other time variable included in the data set is the time of Facebook post. So let's take a look at that. Again we'll just focus on clicks and reactions for this one.

ggplot(aes(x = time, y = counts), data = 
           subset(long_freecode_data, counts <= quantile(counts, .99) & 
                      Interaction %in% c('clicks', 'reactions'))) + 
    geom_point(aes(color = Interaction))

And here is Reach on its own so its much larger values don't mess with our scale.

ggplot(aes(x = time, y = counts), data = 
           subset(long_freecode_data, counts <= quantile(counts, .99) & 
                      Interaction == 'reach')) + 
    geom_point()

While the results for posts between roughly 8AM and 6PM are expected, because that is generally when most people across the US are active, the surprise is that posts between 12AM and 2AM show roughly the same level of Interactions.

One reason for this could be the time zone these times are taken from. I couldn't find a clear answer on what time zone Facebook's internal timestamps use. Assuming they are US Eastern Time timestamps, 12AM to 2AM would translate to 9PM to 11PM on the west coast, which wouldn't make the interaction numbers all that surprising given the large number of tech employees located in California.
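
A more granular way to read this pattern would be binning posts by hour before averaging. Since the time column is stored as seconds after midnight (per the str() output earlier), a sketch might look like:

# bucket posts into hours and average clicks/reactions per hour
hourly <- long_freecode_data %>%
    filter(Interaction %in% c('clicks', 'reactions')) %>%
    mutate(hour = as.numeric(time) %/% 3600) %>%
    group_by(hour, Interaction) %>%
    summarise(avg_count = mean(counts))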

Free Code Camp Facebook Post Themes

The last set of data that hasn't been looked at so far is the actual content of these posts, via their titles. One of the quickest ways I could think of to get a sense of the themes of these posts was a word cloud.

library(tm)
library(wordcloud)
library(SnowballC)

# creating the word cloud to see which topics or ideas were most written about
article_titles <- freecodecamp$title     # pull the titles into a chr vector
article_corpus <- VectorSource(article_titles)
article_corpus <- Corpus(article_corpus) # these two commands prepare the corpus
article_corpus <- tm_map(article_corpus, removePunctuation)     # remove punctuation
article_corpus <- tm_map(article_corpus, removeWords, stopwords('english')) # remove stopwords
article_corpus <- tm_map(article_corpus, stemDocument)          # turn verb conjugations into their stems
wordcloud(article_corpus, max.words = 80, random.order = FALSE)

Themes like develop, code, and freecodecamp are no surprise. Words like 'how' and 'here' may seem like weird ones to include, but I'm sure one of the biggest questions new coders have is "How do I do ..." So the fact that they show up suggests these posts are focused on answering questions, which would be in line with Free Code Camp's mission. (These stopwords likely survive the removeWords step because it is case-sensitive and the titles capitalize them; the terms are only lowercased later when the term counts are built.)
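
If we wanted those capitalized stopwords stripped out too, lowercasing the corpus before the removeWords step should catch them. A minimal sketch using tm's content_transformer wrapper:

# lowercase first so stopwords('english') matches 'How', 'The', etc.
article_corpus <- tm_map(article_corpus, content_transformer(tolower))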

We can see the actual counts by building a term-document matrix from the corpus, summing the rows, and turning the result into a data frame:

dtm <- TermDocumentMatrix(article_corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
##                      word freq
## develop           develop   55
## here                 here   46
## code                 code   45
## how                   how   38
## the                   the   32
## freecodecamp freecodecamp   31
## get                   get   26
## can                   can   25
## one                   one   24
## new                   new   23

Conclusions

This is just a cursory glance at the data in this dataset; there are definitely more analyses that could be done. One I would do is see whether these common title words correlate with more interactions from FreeCodeCamp's Facebook followers, which could definitely help guide what gets posted.
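
A crude first pass at that idea might flag titles containing one of the top terms and compare average reactions between the two groups (a sketch, with 'develop' as an arbitrary example):

# do posts whose titles mention 'develop' earn more reactions on average?
has_term <- grepl('develop', freecodecamp$title, ignore.case = TRUE)
tapply(freecodecamp$reactions, has_term, mean)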

Another would be to quantify the growth shown by the trend lines in the over-time data, to see just how much growth there is between the first-half and second-half averages. I would also suggest binning the publication times and looking at average follower interactions per bin for a more granular view of the data.
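
The first-half versus second-half comparison could be sketched along these lines, reusing the no_reach_df frame from above:

# compare average clicks + reactions between the two halves of the date range
no_reach_df %>%
    mutate(half = ifelse(date <= median(date), 'first half', 'second half')) %>%
    group_by(half) %>%
    summarise(avg_interactions = mean(counts))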

If you have any questions, or suggestions for improvement, please email me at mcbarlowe@gmail.com