Index of Analysis

1- Introduction

2- About the Data Set

3- Data Collection and Understanding

4- Data Wrangling

5- Data Visualization

6- Conclusion

Introduction

Regardless of the variety of different channels such as websites or tv, social media is the first-place people get information. Organizations can easily distribute any kind of content with a Facebook account, cross link their other digital or tv properties as part of their social media strategy to get the most engagement possible with their brand and to get their business idea across.

In October 2016 BuzzFeed released a political article around misleading and extremely bias published Facebook posts. The full article can be found here. The article is based on a dataset that has all the posts, fact-check ratings and Facebook engagement figures for each post. BuzzFeed analyzed the Facebook Posts from hyperpartisan political Facebook pages and posts, selected from the right, left and mainstream media to find out about the nature and popularity of misleading information shared.

The purpose of our analysis is to explore the data set to see the impact of content regardless of it being correct or not. We are not looking at our analysis from a political view as the category or the organization does not matter to our analysis. The problem in question that we are trying to solve is “What are the variables that makes the correlation to user engagement? Does video content get better reaction or comment count?

About the Data Set

The dataset spreadsheet can be found here. The methodology for collecting and rating the pages is outline in below disclaimer from BuzzFeed as below;

“ “Each of our raters was given a rotating selection of pages from each category on different days. In some cases, we found that pages would repost the same link or video within 24 hours, which caused Facebook to assign it the same URL. When this occurred, we did not log or rate the repeat post and instead kept the original date and rating. Each rater was given the same guide for how to review posts:

“Mostly True: The post and any related link or image are based on factual information and portray it accurately. This lets them interpret the event/info in their own way, so long as they do not misrepresent events, numbers, quotes, reactions, etc., or make information up. This rating does not allow for unsupported speculation or claims.
“Mixture of True and False: Some elements of the information are factually accurate, but some elements or claims are not. This rating should be used when speculation or unfounded claims are mixed with real events, numbers, quotes, etc., or when the headline of the link being shared makes a false claim but the text of the story is largely accurate. It should also only be used when the unsupported or false information is roughly equal to the accurate information in the post or link. Finally, use this rating for news articles that are based on unconfirmed information.
“Mostly False: Most or all of the information in the post or in the link being shared is inaccurate. This should also be used when the central claim being made is false.
“No Factual Content: This rating is used for posts that are pure opinion, comics, satire, or any other posts that do not make a factual claim. This is also the category to use for posts that are of the “Like this if you think…” variety.

“In gathering the Facebook engagement data, the API did not return results for some posts. It did not return reaction count data for two posts, and two posts also did not return comment count data. There were 70 posts for which the API did not return share count data. We also used CrowdTangle’s API to check that we had entered all posts from all nine pages on the assigned days. In some cases, the API returned URLs that were no longer active. We were unable to rate these posts and are unsure if they were subsequently removed by the pages or if the URLs were returned in error.” “

Descriptions of each variable are outlined below;

Account ID: The Facebook Account ID information.

Post ID: The Facebook POST ID information.

Category: The category of the organization. Broken down to three different categories as, “mainstream”, “left”, and “right”

Page: Facebook Page Name of the Organization.

Post URL: URL of the actual post rated by BuzzFeed

Date Published: Facebook Post Publish Date.

Debate: If the Facebook post is related to the debate.

Share Count: Amount of shares for that particular post

Reaction Count: Amount of reactions for that particular post

Comment Count: Amount of content for that particular post.

Data Collection and Understanding

In order to import the data to R, we can upload to our github repo and read it from there. The data set can be read from here

fb_data <- read.csv('https://raw.githubusercontent.com/anilak1978/facebook-fact-check-2016/master/facebook-fact-check.csv')
head(fb_data)

##     account_id      post_id   Category              Page
## 1 1.840966e+14 1.035058e+15 mainstream ABC News Politics
## 2 1.840966e+14 1.035269e+15 mainstream ABC News Politics
## 3 1.840966e+14 1.035306e+15 mainstream ABC News Politics
## 4 1.840966e+14 1.035323e+15 mainstream ABC News Politics
## 5 1.840966e+14 1.035353e+15 mainstream ABC News Politics
## 6 1.840966e+14 1.035367e+15 mainstream ABC News Politics
##                                                          Post.URL
## 1 https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100
## 2 https://www.facebook.com/ABCNewsPolitics/posts/1035269309904628
## 3 https://www.facebook.com/ABCNewsPolitics/posts/1035305953234297
## 4 https://www.facebook.com/ABCNewsPolitics/posts/1035322636565962
## 5 https://www.facebook.com/ABCNewsPolitics/posts/1035352946562931
## 6 https://www.facebook.com/ABCNewsPolitics/posts/1035366579894901
##   Date.Published Post.Type             Rating Debate share_count
## 1     2016-09-19     video no factual content                 NA
## 2     2016-09-19      link        mostly true                  1
## 3     2016-09-19      link        mostly true                 34
## 4     2016-09-19      link        mostly true                 35
## 5     2016-09-19     video        mostly true                568
## 6     2016-09-19      link        mostly true                 23
##   reaction_count comment_count
## 1            146            15
## 2             33            34
## 3             63            27
## 4            170            86
## 5           3188          2815
## 6             28            21

As part of our data exploration, we can look at the overview of the data frame.

str(fb_data)

## 'data.frame':    2282 obs. of  12 variables:
##  $ account_id    : num  1.84e+14 1.84e+14 1.84e+14 1.84e+14 1.84e+14 ...
##  $ post_id       : num  1.04e+15 1.04e+15 1.04e+15 1.04e+15 1.04e+15 ...
##  $ Category      : Factor w/ 3 levels "left","mainstream",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Page          : Factor w/ 9 levels "ABC News Politics",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Post.URL      : Factor w/ 2282 levels "https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Date.Published: Factor w/ 7 levels "2016-09-19","2016-09-20",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Post.Type     : Factor w/ 4 levels "link","photo",..: 4 1 1 1 4 1 4 1 1 4 ...
##  $ Rating        : Factor w/ 4 levels "mixture of true and false",..: 4 3 3 3 3 3 3 3 3 3 ...
##  $ Debate        : Factor w/ 2 levels "","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ share_count   : int  NA 1 34 35 568 23 46 7 7 152 ...
##  $ reaction_count: int  146 33 63 170 3188 28 409 62 39 278 ...
##  $ comment_count : int  15 34 27 86 2815 21 105 64 6 59 ...

class(fb_data)

## [1] "data.frame"

mode(fb_data)

## [1] "list"

class(fb_data$account_id)

## [1] "numeric"

class(fb_data$post_id)

## [1] "numeric"

class(fb_data$Category)

## [1] "factor"

class(fb_data$Page)

## [1] "factor"

class(fb_data$Post.URL)

## [1] "factor"

class(fb_data$Date.Published)

## [1] "factor"

class(fb_data$Post.Type)

## [1] "factor"

class(fb_data$Rating)

## [1] "factor"

class(fb_data$Debate)

## [1] "factor"

class(fb_data$share_count)

## [1] "integer"

class(fb_data$reaction_count)

## [1] "integer"

class(fb_data$comment_count)

## [1] "integer"

We can see that, account id and post id is numeric, share, reaction and comment count is integer and category,page, post url, date published, post type, rating and debate are factoral.

We can look at the levels of each factor to see the details of these categorical variables.

levels(fb_data$Post.Type)

## [1] "link"  "photo" "text"  "video"

levels(fb_data$Date.Published)

## [1] "2016-09-19" "2016-09-20" "2016-09-21" "2016-09-22" "2016-09-23"
## [6] "2016-09-26" "2016-09-27"

levels(fb_data$Category)

## [1] "left"       "mainstream" "right"

levels(fb_data$Page)

## [1] "ABC News Politics" "Addicting Info"    "CNN Politics"     
## [4] "Eagle Rising"      "Freedom Daily"     "Occupy Democrats" 
## [7] "Politico"          "Right Wing News"   "The Other 98%"

levels(fb_data$Rating)

## [1] "mixture of true and false" "mostly false"             
## [3] "mostly true"               "no factual content"

levels(fb_data$Debate)

## [1] ""    "yes"

We can also look at the summary of the data set

summary(fb_data)

##    account_id           post_id                Category   
##  Min.   :6.232e+10   Min.   :5.511e+14   left      : 471  
##  1st Qu.:1.145e+14   1st Qu.:1.247e+15   mainstream:1145  
##  Median :1.841e+14   Median :1.291e+15   right     : 666  
##  Mean   :1.867e+14   Mean   :3.300e+15                    
##  3rd Qu.:3.469e+14   3rd Qu.:1.541e+15                    
##  Max.   :4.401e+14   Max.   :1.015e+16                    
##                                                           
##                 Page    
##  Politico         :536  
##  CNN Politics     :409  
##  Eagle Rising     :286  
##  Right Wing News  :268  
##  Occupy Democrats :209  
##  ABC News Politics:200  
##  (Other)          :374  
##                                                             Post.URL   
##  https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100:   1  
##  https://www.facebook.com/ABCNewsPolitics/posts/1035269309904628:   1  
##  https://www.facebook.com/ABCNewsPolitics/posts/1035305953234297:   1  
##  https://www.facebook.com/ABCNewsPolitics/posts/1035322636565962:   1  
##  https://www.facebook.com/ABCNewsPolitics/posts/1035352946562931:   1  
##  https://www.facebook.com/ABCNewsPolitics/posts/1035366579894901:   1  
##  (Other)                                                        :2276  
##     Date.Published Post.Type                          Rating    
##  2016-09-19:306    link :1780   mixture of true and false: 245  
##  2016-09-20:317    photo: 207   mostly false             : 104  
##  2016-09-21:306    text :   4   mostly true              :1669  
##  2016-09-22:293    video: 291   no factual content       : 264  
##  2016-09-23:294                                                 
##  2016-09-26:403                                                 
##  2016-09-27:363                                                 
##  Debate      share_count      reaction_count     comment_count     
##     :1984   Min.   :      1   Min.   :     2.0   Min.   :     0.0  
##  yes: 298   1st Qu.:     24   1st Qu.:   149.0   1st Qu.:    37.0  
##             Median :     96   Median :   545.5   Median :   131.5  
##             Mean   :   4045   Mean   :  5364.3   Mean   :   516.1  
##             3rd Qu.:    739   3rd Qu.:  2416.8   3rd Qu.:   390.2  
##             Max.   :1088995   Max.   :456458.0   Max.   :159047.0  
##             NA's   :70        NA's   :2          NA's   :2

Median for share is:96 and Mean is 4045

Median for reaction is 545 and Mean is 5364.3

Median for comment is 131.5 and Mean is 516.

Post Types are: Link, Photo, Text and Video

Dates start from September 20,21,22,23,26 and 27. So the facebook posts are pulled from these dates. It looks like the API didnt return posts on dates September 24th and 25th. This does not have any impact to our analysis.

Organization categories are: Mainstream, Right and Left

Organization names are: ABC News Politics, Addicting Info, CNN Politics, Eagle Rising, Freedom Daily, Occupy Democrats, Politico, Right Wing News, The Other 98%.

Rating options are: Mixture of True and False, Mostly False, Mostly True, No Factual content.

Debate has two levels. One is yes and the other one is "" which is blank. We can assume the other option is “No”

Data Wrangling

As we can see there are several things we can do to clean up and prepare the data for our analysis. These steps are outlined below;

1- We dont need the account_id, post.url and post_id for our analysis. So we can exclude them

2- We can simplfy the naming for rating variable. TF for Mixture of True and False, MF for Mostly False, MT for Mostly True, F for No Factual Content.

3- Update the missing values to “No” in debate

4- Find the missing values(NA) and either replace them or exclude them from the data set.

exclude_vars <- names(fb_data) %in% c('account_id', 'post_id', 'Post.URL') # selecting variables to exclude.
fb_data_new <- fb_data[!exclude_vars] # exluding selected variables for the new dataset.
levels(fb_data_new$Debate)[levels(fb_data_new$Debate)==""]<-"No" # changing blank values to "No"
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="no factual content"]<-"F" # simplfying names
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="mixture of true and false"]<-"TF" # simplfying names
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="mostly false"]<-"MF" # simplfying names
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="mostly true"]<-"MT" # simplfying names

# simplfying column names
colnames(fb_data_new) <- c("category", "organization", "date", "type", "rating", "debate", "share", "reaction", "comment") 

head(fb_data_new)

##     category      organization       date  type rating debate share
## 1 mainstream ABC News Politics 2016-09-19 video      F     No    NA
## 2 mainstream ABC News Politics 2016-09-19  link     MT     No     1
## 3 mainstream ABC News Politics 2016-09-19  link     MT     No    34
## 4 mainstream ABC News Politics 2016-09-19  link     MT     No    35
## 5 mainstream ABC News Politics 2016-09-19 video     MT     No   568
## 6 mainstream ABC News Politics 2016-09-19  link     MT     No    23
##   reaction comment
## 1      146      15
## 2       33      34
## 3       63      27
## 4      170      86
## 5     3188    2815
## 6       28      21

Let’s see if there are any NA values in our dataset.

sum(is.na(fb_data_new$category))

## [1] 0

sum(is.na(fb_data_new$organization))

## [1] 0

sum(is.na(fb_data_new$date))

## [1] 0

sum(is.na(fb_data_new$type))

## [1] 0

sum(is.na(fb_data_new$rating))

## [1] 0

sum(is.na(fb_data_new$debate))

## [1] 0

sum(is.na(fb_data_new$share))

## [1] 70

sum(is.na(fb_data_new$reaction))

## [1] 2

sum(is.na(fb_data_new$comment))

## [1] 2

Based on above, we can see there are 70 missing share values, 2 reaction and 2 comment count values. We can either get their average and replace them or we can remove them completely. Since the amount of na values are not high we can exclude them in this case

# just in case i am going to create a new dataframe 


fb_data_new_final <- na.omit(fb_data_new)
head(fb_data_new_final)

##     category      organization       date  type rating debate share
## 2 mainstream ABC News Politics 2016-09-19  link     MT     No     1
## 3 mainstream ABC News Politics 2016-09-19  link     MT     No    34
## 4 mainstream ABC News Politics 2016-09-19  link     MT     No    35
## 5 mainstream ABC News Politics 2016-09-19 video     MT     No   568
## 6 mainstream ABC News Politics 2016-09-19  link     MT     No    23
## 7 mainstream ABC News Politics 2016-09-19 video     MT     No    46
##   reaction comment
## 2       33      34
## 3       63      27
## 4      170      86
## 5     3188    2815
## 6       28      21
## 7      409     105

sum(is.na(fb_data_new_final$share))

## [1] 0

sum(is.na(fb_data_new_final$reaction))

## [1] 0

sum(is.na(fb_data_new_final$comment))

## [1] 0

Now we have a new clean data frame and we can start analysing.

summary(fb_data_new_final)

##        category               organization         date        type     
##  left      : 441   Politico         :534   2016-09-19:297   link :1768  
##  mainstream:1116   CNN Politics     :406   2016-09-20:309   photo: 196  
##  right     : 655   Eagle Rising     :277   2016-09-21:295   text :   4  
##                    Right Wing News  :267   2016-09-22:285   video: 244  
##                    Occupy Democrats :201   2016-09-23:285               
##                    ABC News Politics:176   2016-09-26:384               
##                    (Other)          :351   2016-09-27:357               
##  rating    debate         share            reaction       
##  TF: 241   No :1920   Min.   :      1   Min.   :     2.0  
##  MF: 103   yes: 292   1st Qu.:     24   1st Qu.:   154.0  
##  MT:1633              Median :     96   Median :   554.5  
##  F : 235              Mean   :   4045   Mean   :  5410.4  
##                       3rd Qu.:    739   3rd Qu.:  2408.2  
##                       Max.   :1088995   Max.   :456458.0  
##                                                           
##     comment        
##  Min.   :     0.0  
##  1st Qu.:    39.0  
##  Median :   133.0  
##  Mean   :   520.0  
##  3rd Qu.:   391.2  
##  Max.   :159047.0  
##

Data Visualization

Let’s use visualization to see the relationship between variables. In order to do this, we need to install the neccessary libraries.

install.packages('ggplot2', repos="http://cran.us.r-project.org")

## Installing package into 'C:/Users/Anil Akyildirim/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Anil Akyildirim\AppData\Local\Temp\Rtmp0wrH7Q\downloaded_packages

library('ggplot2')

Let’s look at the count of Category variable.

ggplot(data= fb_data_new_final, aes(fb_data_new_final$category))  + 
  geom_bar(width=1, colour = I("black"), aes(fill=..count..))

We can see that majority of the posts in the data set are from the mainstream category.

Let’s look at the count of rating variable.

ggplot(data= fb_data_new_final, aes(fb_data_new_final$rating))  + 
  geom_bar(width=1, colour = I("black"), aes(fill=..count..))

We can also see that the significant amount of the posts in the data are mostly true

Let’s look at the type variable to see the amount of different content type counts within our dataset.

ggplot(data= fb_data_new_final, aes(fb_data_new_final$type))  + 
  geom_bar(width=1, colour = I("black"), aes(fill=..count..))

We can see most of the content type that is in the facebook posts is link. We can assume that the majority of the organizations’ goal is to use Facebook and FB posts to drive the users to their other brand owned properties. They are using FB posts for acquisition of the users

Let’s look at the Rating variable in detail. We can use histrogram to see the distribution of facebook posts across different type of content.

theme_set(theme_classic())

# Histogram on Rating variable
g1 <- ggplot(fb_data_new_final, aes(rating))
g1 + geom_bar(aes(fill=type), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Histogram on Rating Variable", 
       subtitle="Content Type Across All FB Posts")

Based on the histogram on the rating variable. We can see that majority of the False (facebook posts that does not have factual information) have image (photo) content type.

Let’s look at the distribution of the content type data over the time period. We can do this with the density plot.

theme_set(theme_classic())

# Density Plot -  
g2 <- ggplot(fb_data_new_final, aes(date))
g2 + geom_density(aes(fill=factor(type)), alpha=0.8) + 
    labs(title="Density plot", 
         subtitle="Fb Posts Grouped by type",
         caption="Source: fb_data_new_final",
         x="Date",
         fill="Content Type")

## Warning: Groups with fewer than two data points have been dropped.

## Warning: Groups with fewer than two data points have been dropped.

We can see that the density of the facebook posts dropeed from the 19th of september to the 27th of September however the video and link content dominates the distribution of the facebook post content type within this time period.

theme_set(theme_classic())

# Box Plot
g3 <- ggplot(fb_data_new_final, aes(type, share))
g3 + geom_boxplot(varwidth=T, fill="plum") + 
    labs(title="Box plot", 
         subtitle="Share Count Grouped by Content Type",
         caption="Source: fb_data_new_final",
         x="Content Type",
         y="Share")

The distribution of share is higher on video and photo(image) compare to link and text. This is an interesting find as the majority of the content type within the data is link. It is interesting to see users tend to share the video content compare to link and text content type

options(scipen=999)  # this is to turn-off scientific notation like 1e+48
theme_set(theme_bw())  # set the theme


# Scatterplot . We are keeping the comment count to 1000 and reaction count to 5000. 
g4 <- ggplot(fb_data_new_final, aes(x=comment, y=reaction)) + 
  geom_point(aes(col=type, size=reaction)) + 
  geom_smooth(method="loess", se=F) + 
  xlim(c(1, 1000)) + 
  ylim(c(0, 5000)) + 
  labs(subtitle="Reaction Vs Comments", 
       y="Reaction", 
       x="Comment", 
       title="Scatterplot", 
       caption = "Source: fb_data_new_final")

plot(g4)

## Warning: Removed 425 rows containing non-finite values (stat_smooth).

## Warning: Removed 425 rows containing missing values (geom_point).

We see there is definately a relationship between reaction and comment variable. The more the comment count the more the reaction count. We can narrow down the x(comment) and y(reaction) to x=50 and y=250 to see the content type distribution.

options(scipen=999)  # this is to turn-off scientific notation like 1e+48
theme_set(theme_bw())  # set the theme


# Scatterplot . We are keeping the comment count to 250 and reaction count to 1000.
g5 <- ggplot(fb_data_new_final, aes(x=comment, y=reaction)) + 
  geom_point(aes(col=type, size=reaction)) + 
  geom_smooth(method="loess", se=F) + 
  xlim(c(1, 50)) + 
  ylim(c(0, 250)) + 
  labs(subtitle="Reaction Vs Comments", 
       y="Reaction", 
       x="Comment", 
       title="Scatterplot", 
       caption = "Source: fb_data_new_final")

plot(g5)

## Warning: Removed 1709 rows containing non-finite values (stat_smooth).

## Warning: Removed 1709 rows containing missing values (geom_point).

Let’s see if we can create a correlation matrix.

# i need to make sure the data set is numeric. 
# to be able to use the cor() function.

fb_data_new_2 <- subset(fb_data_new_final, select=share:comment)

corr <- round(cor(fb_data_new_2), 1)
corr

##          share reaction comment
## share      1.0      0.9     0.9
## reaction   0.9      1.0     0.7
## comment    0.9      0.7     1.0

library(ggcorrplot)

ggcorrplot(corr, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of fb_data_new_final", 
           ggtheme=theme_bw)

We can see there is a huge positive correlation between share and reaction, share and comment and comment and reaction.

Conclusion

As part of this assignment, the aim of this document was to outline a possible question as a problem, review and clean the data as part of data wrangling and eploration proess, visualize the dataset and provide a conclusion against the problem defined.

Based on our analysis, we are able to see, content type has an impact to user engagement in terms of FB post share. When we analyze the user engagement in detail, we see that there is a positive correlation between comments, reaction and share. The more comments the FB post gets, the more reaction it gets. We also see if the facebook post content does not have factual information, the post might get more reaction rather than comments.

When we look at the content type variable, we see majority of the content type are links, which gets the most engagement in terms of fb post share. The assumption is that these organizations are using FB as a channel to drive the users to their brand owned digital properties. The more user engagement would be more user acqusion for that particular organization. Link content type is followed by video content and imagery. It is interesting to see that video content is the content type that gets shared the most. It is also interesting to see that the photo content type has the most non factual information rating.

R-Bridge-Week3-Assignment

Murat Anil Akyildirim

7/27/2019