1- Introduction
2- About the Data Set
3- Data Collection and Understanding
4- Data Wrangling
5- Data Visualization
6- Conclusion
Across the many channels available, such as websites and TV, social media is often the first place people go for information. With a Facebook account, organizations can easily distribute any kind of content and cross-link their other digital or TV properties as part of their social media strategy, to get the most engagement possible with their brand and to get their business idea across.
In October 2016, BuzzFeed released an article about misleading and extremely biased political Facebook posts. The full article can be found here. The article is based on a dataset that contains every post, its fact-check rating and its Facebook engagement figures. BuzzFeed analyzed posts from hyperpartisan political Facebook pages on the right and left, and from mainstream media pages, to find out about the nature and popularity of the misleading information shared.
The purpose of our analysis is to explore the data set to see the impact of content regardless of whether it is correct. We are not approaching the analysis from a political view, as the category or the organization does not matter to it. The question we are trying to answer is: “Which variables correlate with user engagement? Does video content get a better reaction or comment count?”
The dataset spreadsheet can be found here. The methodology for collecting and rating the pages is outlined in the disclaimer from BuzzFeed below:
“Each of our raters was given a rotating selection of pages from each category on different days. In some cases, we found that pages would repost the same link or video within 24 hours, which caused Facebook to assign it the same URL. When this occurred, we did not log or rate the repeat post and instead kept the original date and rating. Each rater was given the same guide for how to review posts:
“Mostly True: The post and any related link or image are based on factual information and portray it accurately. This lets them interpret the event/info in their own way, so long as they do not misrepresent events, numbers, quotes, reactions, etc., or make information up. This rating does not allow for unsupported speculation or claims.
“Mixture of True and False: Some elements of the information are factually accurate, but some elements or claims are not. This rating should be used when speculation or unfounded claims are mixed with real events, numbers, quotes, etc., or when the headline of the link being shared makes a false claim but the text of the story is largely accurate. It should also only be used when the unsupported or false information is roughly equal to the accurate information in the post or link. Finally, use this rating for news articles that are based on unconfirmed information.
“Mostly False: Most or all of the information in the post or in the link being shared is inaccurate. This should also be used when the central claim being made is false.
“No Factual Content: This rating is used for posts that are pure opinion, comics, satire, or any other posts that do not make a factual claim. This is also the category to use for posts that are of the “Like this if you think…” variety.
“In gathering the Facebook engagement data, the API did not return results for some posts. It did not return reaction count data for two posts, and two posts also did not return comment count data. There were 70 posts for which the API did not return share count data. We also used CrowdTangle’s API to check that we had entered all posts from all nine pages on the assigned days. In some cases, the API returned URLs that were no longer active. We were unable to rate these posts and are unsure if they were subsequently removed by the pages or if the URLs were returned in error.”
Each variable is described below:
Account ID: The Facebook Account ID information.
Post ID: The Facebook POST ID information.
Category: The category of the organization, one of “mainstream”, “left” or “right”.
Page: Facebook Page Name of the Organization.
Post URL: URL of the actual post rated by BuzzFeed
Date Published: Facebook Post Publish Date.
Debate: If the Facebook post is related to the debate.
Share Count: Number of shares for that particular post.
Reaction Count: Number of reactions for that particular post.
Comment Count: Number of comments for that particular post.
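To import the data into R, we can upload the file to our GitHub repo and read it from there. The data set can be read from here.
# stringsAsFactors = TRUE keeps text columns as factors; R >= 4.0 reads them as character by default
fb_data <- read.csv('https://raw.githubusercontent.com/anilak1978/facebook-fact-check-2016/master/facebook-fact-check.csv', stringsAsFactors = TRUE)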
head(fb_data)
## account_id post_id Category Page
## 1 1.840966e+14 1.035058e+15 mainstream ABC News Politics
## 2 1.840966e+14 1.035269e+15 mainstream ABC News Politics
## 3 1.840966e+14 1.035306e+15 mainstream ABC News Politics
## 4 1.840966e+14 1.035323e+15 mainstream ABC News Politics
## 5 1.840966e+14 1.035353e+15 mainstream ABC News Politics
## 6 1.840966e+14 1.035367e+15 mainstream ABC News Politics
## Post.URL
## 1 https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100
## 2 https://www.facebook.com/ABCNewsPolitics/posts/1035269309904628
## 3 https://www.facebook.com/ABCNewsPolitics/posts/1035305953234297
## 4 https://www.facebook.com/ABCNewsPolitics/posts/1035322636565962
## 5 https://www.facebook.com/ABCNewsPolitics/posts/1035352946562931
## 6 https://www.facebook.com/ABCNewsPolitics/posts/1035366579894901
## Date.Published Post.Type Rating Debate share_count
## 1 2016-09-19 video no factual content NA
## 2 2016-09-19 link mostly true 1
## 3 2016-09-19 link mostly true 34
## 4 2016-09-19 link mostly true 35
## 5 2016-09-19 video mostly true 568
## 6 2016-09-19 link mostly true 23
## reaction_count comment_count
## 1 146 15
## 2 33 34
## 3 63 27
## 4 170 86
## 5 3188 2815
## 6 28 21
As part of our data exploration, we can look at the overview of the data frame.
str(fb_data)
## 'data.frame': 2282 obs. of 12 variables:
## $ account_id : num 1.84e+14 1.84e+14 1.84e+14 1.84e+14 1.84e+14 ...
## $ post_id : num 1.04e+15 1.04e+15 1.04e+15 1.04e+15 1.04e+15 ...
## $ Category : Factor w/ 3 levels "left","mainstream",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Page : Factor w/ 9 levels "ABC News Politics",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Post.URL : Factor w/ 2282 levels "https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Date.Published: Factor w/ 7 levels "2016-09-19","2016-09-20",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Post.Type : Factor w/ 4 levels "link","photo",..: 4 1 1 1 4 1 4 1 1 4 ...
## $ Rating : Factor w/ 4 levels "mixture of true and false",..: 4 3 3 3 3 3 3 3 3 3 ...
## $ Debate : Factor w/ 2 levels "","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ share_count : int NA 1 34 35 568 23 46 7 7 152 ...
## $ reaction_count: int 146 33 63 170 3188 28 409 62 39 278 ...
## $ comment_count : int 15 34 27 86 2815 21 105 64 6 59 ...
class(fb_data)
## [1] "data.frame"
mode(fb_data)
## [1] "list"
class(fb_data$account_id)
## [1] "numeric"
class(fb_data$post_id)
## [1] "numeric"
class(fb_data$Category)
## [1] "factor"
class(fb_data$Page)
## [1] "factor"
class(fb_data$Post.URL)
## [1] "factor"
class(fb_data$Date.Published)
## [1] "factor"
class(fb_data$Post.Type)
## [1] "factor"
class(fb_data$Rating)
## [1] "factor"
class(fb_data$Debate)
## [1] "factor"
class(fb_data$share_count)
## [1] "integer"
class(fb_data$reaction_count)
## [1] "integer"
class(fb_data$comment_count)
## [1] "integer"
We can see that account_id and post_id are numeric; share_count, reaction_count and comment_count are integer; and Category, Page, Post.URL, Date.Published, Post.Type, Rating and Debate are factors.
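The column-by-column class() calls above can be collapsed into a single call. The following is a minimal sketch on a small stand-in data frame; the name toy and its values are made up for illustration and are not taken from the real file.

```r
# Toy stand-in for fb_data (illustrative values only, not the real data).
toy <- data.frame(
  account_id  = c(1.84e14, 1.84e14),
  Category    = factor(c("mainstream", "left")),
  share_count = c(1L, 34L)
)

# sapply() applies class() to every column and returns a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)
```

With the real data frame loaded, the same idea would be sapply(fb_data, class).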
We can look at the levels of each factor to see the details of these categorical variables.
levels(fb_data$Post.Type)
## [1] "link" "photo" "text" "video"
levels(fb_data$Date.Published)
## [1] "2016-09-19" "2016-09-20" "2016-09-21" "2016-09-22" "2016-09-23"
## [6] "2016-09-26" "2016-09-27"
levels(fb_data$Category)
## [1] "left" "mainstream" "right"
levels(fb_data$Page)
## [1] "ABC News Politics" "Addicting Info" "CNN Politics"
## [4] "Eagle Rising" "Freedom Daily" "Occupy Democrats"
## [7] "Politico" "Right Wing News" "The Other 98%"
levels(fb_data$Rating)
## [1] "mixture of true and false" "mostly false"
## [3] "mostly true" "no factual content"
levels(fb_data$Debate)
## [1] "" "yes"
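The levels of every factor column can also be listed in one step instead of one levels() call per column. This is a sketch on a made-up data frame (toy is a hypothetical name), assuming the text columns are stored as factors.

```r
# Toy stand-in with two factor columns and one integer column (illustrative only).
toy <- data.frame(
  Post.Type   = factor(c("link", "video", "link")),
  Debate      = factor(c("", "yes", "")),
  share_count = c(1L, 34L, 35L)
)

# Filter() keeps only the factor columns; lapply() then collects their levels.
factor_levels <- lapply(Filter(is.factor, toy), levels)
print(factor_levels)
```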
We can also look at the summary of the data set
summary(fb_data)
## account_id post_id Category
## Min. :6.232e+10 Min. :5.511e+14 left : 471
## 1st Qu.:1.145e+14 1st Qu.:1.247e+15 mainstream:1145
## Median :1.841e+14 Median :1.291e+15 right : 666
## Mean :1.867e+14 Mean :3.300e+15
## 3rd Qu.:3.469e+14 3rd Qu.:1.541e+15
## Max. :4.401e+14 Max. :1.015e+16
##
## Page
## Politico :536
## CNN Politics :409
## Eagle Rising :286
## Right Wing News :268
## Occupy Democrats :209
## ABC News Politics:200
## (Other) :374
## Post.URL
## https://www.facebook.com/ABCNewsPolitics/posts/1035057923259100: 1
## https://www.facebook.com/ABCNewsPolitics/posts/1035269309904628: 1
## https://www.facebook.com/ABCNewsPolitics/posts/1035305953234297: 1
## https://www.facebook.com/ABCNewsPolitics/posts/1035322636565962: 1
## https://www.facebook.com/ABCNewsPolitics/posts/1035352946562931: 1
## https://www.facebook.com/ABCNewsPolitics/posts/1035366579894901: 1
## (Other) :2276
## Date.Published Post.Type Rating
## 2016-09-19:306 link :1780 mixture of true and false: 245
## 2016-09-20:317 photo: 207 mostly false : 104
## 2016-09-21:306 text : 4 mostly true :1669
## 2016-09-22:293 video: 291 no factual content : 264
## 2016-09-23:294
## 2016-09-26:403
## 2016-09-27:363
## Debate share_count reaction_count comment_count
## :1984 Min. : 1 Min. : 2.0 Min. : 0.0
## yes: 298 1st Qu.: 24 1st Qu.: 149.0 1st Qu.: 37.0
## Median : 96 Median : 545.5 Median : 131.5
## Mean : 4045 Mean : 5364.3 Mean : 516.1
## 3rd Qu.: 739 3rd Qu.: 2416.8 3rd Qu.: 390.2
## Max. :1088995 Max. :456458.0 Max. :159047.0
## NA's :70 NA's :2 NA's :2
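Median for share is 96 and mean is 4045.
Median for reaction is 545.5 and mean is 5364.3.
Median for comment is 131.5 and mean is 516.1.
In all three cases the mean is far above the median, which points to a heavily right-skewed distribution driven by a small number of very popular posts.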
Post Types are: Link, Photo, Text and Video
The dates are September 19, 20, 21, 22, 23, 26 and 27, 2016, so the Facebook posts were pulled from these seven days. There are no posts for September 24 and 25, which fall on a weekend, so it looks like no posts were collected or returned for those two days. This does not have any impact on our analysis.
Organization categories are: Mainstream, Right and Left
Organization names are: ABC News Politics, Addicting Info, CNN Politics, Eagle Rising, Freedom Daily, Occupy Democrats, Politico, Right Wing News, The Other 98%.
Rating options are: Mixture of True and False, Mostly False, Mostly True, No Factual content.
Debate has two levels: "yes" and "" (blank). We can assume that blank means “No”.
As we can see, there are several things we can do to clean up and prepare the data for our analysis. These steps are outlined below:
1- We don't need account_id, Post.URL and post_id for our analysis, so we can exclude them.
2- We can simplify the names of the rating levels: TF for Mixture of True and False, MF for Mostly False, MT for Mostly True, F for No Factual Content.
3- Update the blank values in Debate to “No”.
4- Find the missing values (NA) and either replace them or exclude them from the data set.
exclude_vars <- names(fb_data) %in% c('account_id', 'post_id', 'Post.URL') # select the variables to exclude
fb_data_new <- fb_data[!exclude_vars] # drop the selected variables in the new data set
levels(fb_data_new$Debate)[levels(fb_data_new$Debate)==""]<-"No" # change blank values to "No"
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="no factual content"]<-"F" # simplify names
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="mixture of true and false"]<-"TF" # simplify names
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="mostly false"]<-"MF" # simplify names
levels(fb_data_new$Rating)[levels(fb_data_new$Rating)=="mostly true"]<-"MT" # simplify names
# simplify column names
colnames(fb_data_new) <- c("category", "organization", "date", "type", "rating", "debate", "share", "reaction", "comment")
head(fb_data_new)
## category organization date type rating debate share
## 1 mainstream ABC News Politics 2016-09-19 video F No NA
## 2 mainstream ABC News Politics 2016-09-19 link MT No 1
## 3 mainstream ABC News Politics 2016-09-19 link MT No 34
## 4 mainstream ABC News Politics 2016-09-19 link MT No 35
## 5 mainstream ABC News Politics 2016-09-19 video MT No 568
## 6 mainstream ABC News Politics 2016-09-19 link MT No 23
## reaction comment
## 1 146 15
## 2 33 34
## 3 63 27
## 4 170 86
## 5 3188 2815
## 6 28 21
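As an alternative to renaming the rating levels one at a time, base R accepts a named list in the levels replacement, which does all four renames in a single assignment. The sketch below uses a small made-up factor (rating is a hypothetical stand-in for the Rating column).

```r
# Toy factor with the four original rating labels (order is illustrative).
rating <- factor(c("mostly true", "no factual content", "mostly false",
                   "mixture of true and false", "mostly true"))

# Each list name is the new level; each value is the old level it replaces.
levels(rating) <- list(
  TF = "mixture of true and false",
  MF = "mostly false",
  MT = "mostly true",
  F  = "no factual content"
)

print(table(rating))
```

One design note: the resulting levels follow the order of the list, so this form also controls the order in which the ratings appear in tables and plots.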
Let’s see if there are any NA values in our dataset.
sum(is.na(fb_data_new$category))
## [1] 0
sum(is.na(fb_data_new$organization))
## [1] 0
sum(is.na(fb_data_new$date))
## [1] 0
sum(is.na(fb_data_new$type))
## [1] 0
sum(is.na(fb_data_new$rating))
## [1] 0
sum(is.na(fb_data_new$debate))
## [1] 0
sum(is.na(fb_data_new$share))
## [1] 70
sum(is.na(fb_data_new$reaction))
## [1] 2
sum(is.na(fb_data_new$comment))
## [1] 2
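The per-column checks above can be written as one call that counts the NA values in every column at once. This is a sketch on a made-up data frame (toy and its values are illustrative).

```r
# Toy stand-in with a few missing values (illustrative only).
toy <- data.frame(
  type     = factor(c("video", "link", "link", "photo")),
  share    = c(NA, 1L, 34L, NA),
  reaction = c(146L, 33L, NA, 170L),
  comment  = c(15L, 34L, 27L, 86L)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

With the real data, colSums(is.na(fb_data_new)) would report all nine columns in one line of output.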
Based on the above, there are 70 missing share values, 2 missing reaction values and 2 missing comment values. We can either replace them with the column average or remove the affected rows completely. Since the number of NA values is small compared to the 2,282 posts, we can exclude them in this case.
# create a new data frame so that fb_data_new is kept intact
fb_data_new_final <- na.omit(fb_data_new)
head(fb_data_new_final)
## category organization date type rating debate share
## 2 mainstream ABC News Politics 2016-09-19 link MT No 1
## 3 mainstream ABC News Politics 2016-09-19 link MT No 34
## 4 mainstream ABC News Politics 2016-09-19 link MT No 35
## 5 mainstream ABC News Politics 2016-09-19 video MT No 568
## 6 mainstream ABC News Politics 2016-09-19 link MT No 23
## 7 mainstream ABC News Politics 2016-09-19 video MT No 46
## reaction comment
## 2 33 34
## 3 63 27
## 4 170 86
## 5 3188 2815
## 6 28 21
## 7 409 105
sum(is.na(fb_data_new_final$share))
## [1] 0
sum(is.na(fb_data_new_final$reaction))
## [1] 0
sum(is.na(fb_data_new_final$comment))
## [1] 0
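For comparison, here is a sketch of the other option mentioned above, replacing the missing values instead of dropping the rows. It uses the median rather than the mean, which is an assumption made here because the engagement counts are strongly skewed; impute_median and toy are hypothetical names and the values are made up.

```r
# Toy stand-in with missing engagement counts (illustrative only).
toy <- data.frame(
  share    = c(NA, 1, 34, 35, 568),
  reaction = c(146, 33, 63, NA, 3188)
)

# Replace every NA in a numeric vector with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to each column and rebuild the data frame.
toy_imputed <- as.data.frame(lapply(toy, impute_median))
print(toy_imputed)
```

Unlike na.omit(), this keeps every row, at the cost of inserting values that were never observed.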
Now we have a new clean data frame and we can start analysing.
summary(fb_data_new_final)
## category organization date type
## left : 441 Politico :534 2016-09-19:297 link :1768
## mainstream:1116 CNN Politics :406 2016-09-20:309 photo: 196
## right : 655 Eagle Rising :277 2016-09-21:295 text : 4
## Right Wing News :267 2016-09-22:285 video: 244
## Occupy Democrats :201 2016-09-23:285
## ABC News Politics:176 2016-09-26:384
## (Other) :351 2016-09-27:357
## rating debate share reaction
## TF: 241 No :1920 Min. : 1 Min. : 2.0
## MF: 103 yes: 292 1st Qu.: 24 1st Qu.: 154.0
## MT:1633 Median : 96 Median : 554.5
## F : 235 Mean : 4045 Mean : 5410.4
## 3rd Qu.: 739 3rd Qu.: 2408.2
## Max. :1088995 Max. :456458.0
##
## comment
## Min. : 0.0
## 1st Qu.: 39.0
## Median : 133.0
## Mean : 520.0
## 3rd Qu.: 391.2
## Max. :159047.0
##
Let’s use visualization to see the relationship between variables. To do this, we need to install and load the necessary libraries.
install.packages('ggplot2', repos="http://cran.us.r-project.org")
## Installing package into 'C:/Users/Anil Akyildirim/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Anil Akyildirim\AppData\Local\Temp\Rtmp0wrH7Q\downloaded_packages
library('ggplot2')
Let’s look at the count of Category variable.
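# refer to the column by name inside aes(); after_stat(count) replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
ggplot(data = fb_data_new_final, aes(x = category)) +
geom_bar(width = 1, colour = "black", aes(fill = after_stat(count)))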
We can see that about half of the posts in the data set (1,116 of 2,212) are from the mainstream category.
Let’s look at the count of rating variable.
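# refer to the column by name inside aes(); after_stat(count) replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
ggplot(data = fb_data_new_final, aes(x = rating)) +
geom_bar(width = 1, colour = "black", aes(fill = after_stat(count)))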
We can also see that a large majority of the posts in the data (1,633 of 2,212) are rated mostly true.
Let’s look at the type variable to see the amount of different content type counts within our dataset.
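# refer to the column by name inside aes(); after_stat(count) replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
ggplot(data = fb_data_new_final, aes(x = type)) +
geom_bar(width = 1, colour = "black", aes(fill = after_stat(count)))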
We can see that most of the Facebook posts are links (1,768 of 2,212). We can assume that the main goal of most of these organizations is to use Facebook posts to drive users to their other brand-owned properties. In other words, they are using Facebook posts for user acquisition.
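To put numbers on the bar chart, the counts can be turned into percentages. The sketch below copies the four type counts from the summary(fb_data_new_final) output above; type_counts and type_pct are names introduced here for illustration.

```r
# Post counts per content type, copied from the summary output above.
type_counts <- c(link = 1768, photo = 196, text = 4, video = 244)

# Share of each content type as a percentage of all posts, to one decimal.
type_pct <- round(100 * type_counts / sum(type_counts), 1)
print(type_pct)
```

With the data frame loaded, round(100 * prop.table(table(fb_data_new_final$type)), 1) should give the same figures directly.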
Let’s look at the Rating variable in detail. We can use a stacked bar chart (geom_bar on a categorical variable, rather than a true histogram) to see how the content types are distributed within each rating.
theme_set(theme_classic())
# Histogram on Rating variable
g1 <- ggplot(fb_data_new_final, aes(rating))
g1 + geom_bar(aes(fill=type), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Histogram on Rating Variable",
subtitle="Content Type Across All FB Posts")
Based on the chart of the rating variable, we can see that the majority of posts rated F (Facebook posts with no factual content) are photo posts.
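A reading like this can be checked numerically with a cross-tabulation of rating against content type. This is a minimal sketch on made-up rows (toy is a hypothetical stand-in, so the proportions it prints say nothing about the real data).

```r
# Toy stand-in with simplified rating labels and content types (illustrative only).
toy <- data.frame(
  rating = factor(c("F", "F", "F", "MT", "MT", "MT", "MT", "MF")),
  type   = factor(c("photo", "photo", "video", "link", "link", "link", "video", "link"))
)

# Count posts for every rating/type combination.
counts <- table(toy$rating, toy$type)

# margin = 1 turns each row into proportions, i.e. the type mix within a rating.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

On the real data the equivalent call would be prop.table(table(fb_data_new_final$rating, fb_data_new_final$type), margin = 1).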
Let’s look at the distribution of the content type data over the time period. We can do this with the density plot.
theme_set(theme_classic())
# Density Plot -
g2 <- ggplot(fb_data_new_final, aes(date))
g2 + geom_density(aes(fill=factor(type)), alpha=0.8) +
labs(title="Density plot",
subtitle="Fb Posts Grouped by type",
caption="Source: fb_data_new_final",
x="Date",
fill="Content Type")
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
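In the plot, the density of the Facebook posts appears to drop from September 19 to September 27, while video and link content dominate the distribution of content types within this time period. Because date is still a factor here, ggplot2 treats it as a discrete variable, so the curves should be read with care: the daily counts in the summary above (384 and 357 posts on the last two days) do not show a decline.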
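One way to get a real time axis is to convert the date factor to the Date class before plotting. The sketch below shows only the conversion, on a made-up factor (date_factor and date_value are hypothetical names); it does not reproduce the plot.

```r
# Toy factor of ISO-formatted dates, as read.csv would produce with factors on.
date_factor <- factor(c("2016-09-19", "2016-09-23", "2016-09-26", "2016-09-27"))

# Go through character so the labels, not the integer codes, are parsed.
date_value <- as.Date(as.character(date_factor))

print(class(date_value))
# Gaps between consecutive dates, in days; the weekend shows up as a 3-day gap.
print(diff(date_value))
```

A factor would place September 23 and September 26 one step apart, whereas the Date class keeps the three-day gap.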
theme_set(theme_classic())
# Box Plot
g3 <- ggplot(fb_data_new_final, aes(type, share))
g3 + geom_boxplot(varwidth=T, fill="plum") +
labs(title="Box plot",
subtitle="Share Count Grouped by Content Type",
caption="Source: fb_data_new_final",
x="Content Type",
y="Share")
The distribution of share counts is higher for video and photo posts than for link and text posts. This is an interesting finding, as the majority of the posts in the data are links: users tend to share video content more than link and text content.
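The box plot comparison can be backed by a per-type summary statistic. This sketch computes the median share count per content type on made-up rows (toy is hypothetical); the median is chosen here as an assumption because the share counts are heavily skewed.

```r
# Toy stand-in with share counts per content type (illustrative only).
toy <- data.frame(
  type  = factor(c("link", "link", "link", "video", "video", "photo", "photo")),
  share = c(1, 34, 35, 568, 46, 900, 100)
)

# tapply() splits share by type and applies median() to each group.
median_share <- tapply(toy$share, toy$type, median)
print(sort(median_share, decreasing = TRUE))
```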
options(scipen=999) # this is to turn-off scientific notation like 1e+48
theme_set(theme_bw()) # set the theme
# Scatterplot . We are keeping the comment count to 1000 and reaction count to 5000.
g4 <- ggplot(fb_data_new_final, aes(x=comment, y=reaction)) +
geom_point(aes(col=type, size=reaction)) +
geom_smooth(method="loess", se=F) +
xlim(c(1, 1000)) +
ylim(c(0, 5000)) +
labs(subtitle="Reaction Vs Comments",
y="Reaction",
x="Comment",
title="Scatterplot",
caption = "Source: fb_data_new_final")
plot(g4)
## Warning: Removed 425 rows containing non-finite values (stat_smooth).
## Warning: Removed 425 rows containing missing values (geom_point).
We see there is definitely a relationship between the reaction and comment variables: the higher the comment count, the higher the reaction count. We can narrow the axes down to comment ≤ 50 and reaction ≤ 250 to see the content type distribution.
options(scipen=999) # this is to turn-off scientific notation like 1e+48
theme_set(theme_bw()) # set the theme
# Scatterplot. We are limiting the comment count to 50 and the reaction count to 250.
g5 <- ggplot(fb_data_new_final, aes(x=comment, y=reaction)) +
geom_point(aes(col=type, size=reaction)) +
geom_smooth(method="loess", se=F) +
xlim(c(1, 50)) +
ylim(c(0, 250)) +
labs(subtitle="Reaction Vs Comments",
y="Reaction",
x="Comment",
title="Scatterplot",
caption = "Source: fb_data_new_final")
plot(g5)
## Warning: Removed 1709 rows containing non-finite values (stat_smooth).
## Warning: Removed 1709 rows containing missing values (geom_point).
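The question of whether video content gets a better reaction or comment count can also be approached with a grouped summary instead of a zoomed scatterplot. The following is a sketch on made-up rows (toy and by_type are hypothetical names), using the median as an assumed summary statistic.

```r
# Toy stand-in with reaction and comment counts per content type (illustrative only).
toy <- data.frame(
  type     = factor(c("link", "link", "video", "video", "video")),
  reaction = c(33, 63, 146, 3188, 409),
  comment  = c(34, 27, 15, 2815, 105)
)

# cbind() on the left-hand side summarises both columns per type in one call.
by_type <- aggregate(cbind(reaction, comment) ~ type, data = toy, FUN = median)
print(by_type)
```

Because it uses every row, this avoids the rows that xlim() and ylim() remove from the plots.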
Let’s see if we can create a correlation matrix.
# cor() needs numeric input, so keep only the numeric columns
# (share, reaction and comment) before calling it.
fb_data_new_2 <- subset(fb_data_new_final, select=share:comment)
corr <- round(cor(fb_data_new_2), 1)
corr
## share reaction comment
## share 1.0 0.9 0.9
## reaction 0.9 1.0 0.7
## comment 0.9 0.7 1.0
library(ggcorrplot)
ggcorrplot(corr,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of fb_data_new_final",
ggtheme=theme_bw)
We can see a strong positive correlation between share and reaction (0.9) and between share and comment (0.9), and a somewhat weaker one between comment and reaction (0.7).
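Since the engagement counts are heavily skewed, a rank-based correlation is a useful cross-check on the Pearson values, because a few very large posts can dominate the Pearson coefficient. This sketch uses only the six rows printed by head(fb_data_new_final) above, so its numbers are not the correlations of the full data set; toy is a name introduced here.

```r
# The six rows shown by head(fb_data_new_final); far too few for real conclusions.
toy <- data.frame(
  share    = c(1, 34, 35, 568, 23, 46),
  reaction = c(33, 63, 170, 3188, 28, 409),
  comment  = c(34, 27, 86, 2815, 21, 105)
)

# Pearson works on the raw values, Spearman on their ranks.
pearson  <- round(cor(toy, method = "pearson"), 2)
spearman <- round(cor(toy, method = "spearman"), 2)

print(pearson)
print(spearman)
```

On the full data, cor(fb_data_new_2, method = "spearman") would be the matching call.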
As part of this assignment, the aim of this document was to pose a question as a problem, review and clean the data as part of the data wrangling and exploration process, visualize the data set and provide a conclusion for the problem defined.
Based on our analysis, content type has an impact on user engagement in terms of post shares. When we look at user engagement in detail, we see a positive correlation between comments, reactions and shares: the more comments a post gets, the more reactions it gets. We also see that if a post does not have factual content, it might get more reactions than comments.
When we look at the content type variable, we see that the majority of posts are links, followed by video and photo posts. The assumption is that these organizations are using Facebook as a channel to drive users to their brand-owned digital properties, so more user engagement would mean more user acquisition for that particular organization. It is interesting to see that video is the content type that gets shared the most. It is also interesting to see that the photo content type has the most posts rated as having no factual content.