The data for this analysis comes from a random scraping of reddit.com and a handful of its subreddits. The scraping was clearly done haphazardly, and the data was not sanitized at all: each subreddit scrape contained about 40,000 entries, but once duplicates were removed the number was closer to 1,000. The full data can be found at https://github.com/linanqiu/reddit-dataset. For the sake of conciseness, we will focus on two collections, from the Harry Potter subreddit and the Star Wars subreddit. The data consists of one entry per comment, but does not record which overall post each comment belongs to.
I wanted to see if there were certain attributes of a posting that made it more appealing, as measured by the number of “upvotes” it received. When a post receives an upvote, the author of the post (a “redditor”) receives a karma point. Thus, below, author karma is a reflection of how many upvotes their posts have received.
library(tidyverse)
variable.names <- c('text', 'id', 'subreddit', 'meta', 'unixtime', 'author', 'ups', 'downs',
                    'authorlinkkarma', 'authorcommentkarma', 'authorisgold')
hp <- read_csv('reddit-dataset/entertainment_harrypotter.csv')
sw <- read_csv('reddit-dataset/entertainment_starwars.csv')
# remove column and row 1
hp <- hp[,-1]
hp <- hp[-1,]
sw <- sw[,-1]
sw <- sw[-1,]
# combine datasets
dat <- rbind(hp, sw)
# rename columns
names(dat) <- variable.names
# remove unnecessary columns (keep id, since it is needed later to drop duplicate posts)
drops <- c('meta')
dat <- dat[, !(names(dat) %in% drops)]
# convert UNIX timestamp to a datetime, then split it into date and time
library(lubridate)
dat$datetime <- as_datetime(dat$unixtime)
dat <- dat %>%
separate(datetime, c("date", "time"), " ", remove=FALSE)
The data first needed to be cleaned up and combined, which the code above accomplishes: the two subreddit files are read in, the first row and column of each are removed, the datasets are combined, the columns are renamed, unneeded columns are dropped, and the UNIX timestamp is converted to a datetime and split into date and time.
Initially, at this point I proceeded to explore redditor attributes. However, the distribution of total posts per author seemed very sporadic, with oddly specific and identical numbers of posts from many authors, as seen below. If this were a random sample, I would have expected a smoother distribution of the number of posts per author.
author.freq <- as.data.frame(table(dat$author, dat$subreddit))
names(author.freq) <- c("username", "subreddit", "posts")
# include 0 counts, where author only posted in one subreddit
ggplot(author.freq, aes(x=posts)) +
geom_histogram() +
facet_grid(subreddit~.)
After a visual inspection, I discovered that many posts were duplicated in the dataset. As such, I removed all duplicates, which resulted in a much smoother distribution (shown below, after the subreddits are balanced).
# need to use IDs because some rows differ by only 1 upvote
dat.nodupe <- distinct(dat, id, .keep_all = TRUE)
author.freq.nodupe <- as.data.frame(table(dat.nodupe$author, dat.nodupe$subreddit))
names(author.freq.nodupe) <- c("username", "subreddit", "posts")
# ggplot(author.freq.nodupe, aes(x=posts)) +
# geom_histogram() +
# # scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7,8,9,10)) +
# facet_grid(subreddit~.)
In addition, I took two more steps to complete the preprocessing phase.
First, I removed posts whose text was NA. This likely means that the post contained only something such as an emoji or an image, which could not be rendered into plain text for this dataset.
# remove NAs in text
dat.nodupe <- dat.nodupe[!is.na(dat.nodupe$text), ]
table(dat.nodupe$subreddit)
Second, the subreddits were unbalanced, as seen below:
| harrypotter | starwars |
|---|---|
| 657 | 1013 |
To remedy this, I took a random sample of 500 posts from each subreddit, to keep them balanced.
set.seed(1234)
dat.nodupe.rand <- dat.nodupe %>%
group_by(subreddit) %>%
sample_n(500)
The final result shows a much smoother distribution of posts per author. In addition, it was observed that nearly every redditor posted in only one of the two subreddits. This was discovered because the author-frequency table initially contained one row for each redditor in each subreddit, so a redditor who never posted in a given subreddit had a count of \(0\) there. When the \(0\) counts were removed, the number of rows fell by almost exactly \(50\%\), because nearly every redditor had \(0\) posts in one of the two subreddits.
author.freq.nodupe.rand <- as.data.frame(table(dat.nodupe.rand$author, dat.nodupe.rand$subreddit))
names(author.freq.nodupe.rand) <- c("username", "subreddit", "posts")
# almost everyone, with 1 exception, only posted in 1 subreddit
# we see this because when we do a table by subreddit, it doubles the number of rows
# remove zeroes
author.freq.nodupe.rand.nozero <- filter(author.freq.nodupe.rand, posts!=0)
ggplot(author.freq.nodupe.rand.nozero, aes(x=posts)) +
geom_histogram() +
scale_x_continuous(breaks=c(0:10)) +
facet_grid(subreddit~.)
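A quick row-count check, which I am adding here rather than taking from the original write-up, makes this concrete: the author-by-subreddit table has one row per (author, subreddit) pair, and dropping the zero-count rows cuts it roughly in half.
# sanity check: dropping zero-count rows roughly halves the table,
# since nearly every author posted in only one of the two subreddits
nrow(author.freq.nodupe.rand)
nrow(author.freq.nodupe.rand.nozero)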
The primary aspect of redditors I wanted to examine was their total “karma.” Karma points are accumulated when an author’s posts receive upvotes, so karma reflects both how much a redditor posts and the quality of their posts. (This is a very rough estimate; to truly gauge post quality, we would need all of a redditor’s posts in order to measure average upvotes per post.)
Since authors post multiple times, I needed to collapse the data to one row per author.
dat.nodupe.rand$totalkarma <- dat.nodupe.rand$authorlinkkarma +
dat.nodupe.rand$authorcommentkarma
author.nodupe.karma <- dat.nodupe.rand %>%
group_by(author) %>%
slice(1) %>%
ungroup()
The initial plot of redditor karma had a very long tail, because a select few redditors have accumulated an enormous number of karma points.
ggplot(author.nodupe.karma, aes(x=totalkarma)) +
geom_histogram() +
facet_grid(subreddit~.)
As such, I log-transformed the karma numbers, which resulted in a largely log-normal distribution.
ggplot(author.nodupe.karma, aes(x=log(totalkarma))) +
geom_histogram() +
facet_grid(subreddit~.)
To further explore the difference in author karma by subreddit, I also created the box plot below. As can be seen from the large overlap in interquartile ranges, there is not a significant difference between the “types” of redditors in each subreddit, i.e. they have accrued similar amounts of karma. However, there do seem to be more low outliers in the Star Wars subreddit.
ggplot(author.nodupe.karma, aes(y=log(totalkarma))) +
geom_boxplot(aes(color=subreddit))
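To put a rough number on that visual impression, a rank-based test of total karma by subreddit could be run; this is a sketch of my own, not part of the original analysis.
# rank-based comparison of author karma between the two subreddits (added check)
wilcox.test(totalkarma ~ subreddit, data = author.nodupe.karma)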
Finally, I wanted to explore the attributes of each post, rather than each redditor. Specifically, I wanted to focus on how many upvotes each post received. The central question I wanted to answer was “What makes a post receive more upvotes, i.e. more favorability?” To begin the analysis, I looked at the distribution of upvotes across all posts.
ggplot(dat.nodupe.rand, aes(x=log(ups))) +
geom_histogram() +
scale_x_continuous(breaks=c(-10:10)) +
facet_grid(subreddit~.)
As can be seen, there are a disproportionate number of posts with 0 or 1 upvote. Upon further reflection, this makes sense: a post has 1 upvote by default when it is created, so a total of 1 upvote implies that the post never received much attention. The count at 0 upvotes is also large because it means the post received a downvote, after which the site’s ranking algorithms likely gave it little visibility.
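A quick tally, added here rather than taken from the original write-up, makes the imbalance concrete:
# share of posts with at most 1 upvote, per subreddit (added check)
dat.nodupe.rand %>%
  group_by(subreddit) %>%
  summarise(low.vote.share = mean(ups <= 1))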
Next I wanted to look at when posts are created throughout the day, using a time series of post frequencies. However, if each bin were the exact time of a post, each bin would likely contain only one entry. To remedy this, I created a variable that rounds each time to the nearest hour, so each day is split into 24 bins.
# create bin by hour
dat.nodupe.rand$dayhour <- round_date(dat.nodupe.rand$datetime, unit='hour')
I then created a summary table to capture the attributes of the posts in each hourly bin, and then displayed the frequency time series.
dayhour.freq <- dat.nodupe.rand %>%
group_by(dayhour, subreddit) %>%
summarise(post.count = n(),
subreddit.name = first(subreddit))
ggplot(dayhour.freq, aes(x=dayhour, y=post.count)) +
geom_bar(stat='identity') +
facet_grid(subreddit~.)
There is obviously a lot of empty space in this plot. Upon inspection, it was the result of two posts dated 2016-02-02, whereas every other post falls between 2016-02-14 and 2016-02-16. As such, I filtered out the two early posts, which results in a much more effective and informative display.
dayhour.freq <- dat.nodupe.rand %>%
filter(date > as.Date("2016-02-02")) %>%
group_by(dayhour, subreddit) %>%
summarise(post.count = n(),
subreddit.name = first(subreddit))
ggplot(dayhour.freq, aes(x=dayhour, y=post.count)) +
geom_bar(stat='identity') +
facet_grid(subreddit~.)
The next step was to add in the average upvotes for each hourly bin, as well as the average redditor karma for posts in that bin.
dayhour.freq <- dat.nodupe.rand %>%
filter(date > as.Date("2016-02-02")) %>%
group_by(dayhour, subreddit) %>%
summarise(post.count = n(),
subreddit.name = first(subreddit),
avg.upvotes = mean(ups),
avg.karma = mean(totalkarma))
This allowed me to answer two questions. First, is there a connection between the time of day of a post, and how many upvotes it receives?
ggplot(dayhour.freq, aes(x=dayhour, y=post.count)) +
geom_bar(stat='identity', aes(fill=avg.upvotes)) +
scale_fill_distiller(palette="Spectral", trans="reverse") +
facet_grid(subreddit~.)
The data does not seem to point to any trend in this respect. The second question was whether there is a connection between the timing of posts and the average karma of the redditor who is posting. In other words, do the more successful redditors post during certain periods of the day?
ggplot(dayhour.freq, aes(x=dayhour, y=post.count)) +
geom_bar(stat='identity', aes(fill=avg.karma)) +
scale_fill_distiller(palette="Spectral", trans="reverse") +
facet_grid(subreddit~.)
Again, it is difficult to deduce any trends from the data. It is possible, though, that posts receive fewer upvotes, and that less successful redditors are the ones posting, very early and very late in the day.
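One way to probe that hunch, sketched here as an addition rather than part of the original pipeline, is to pool the three days by hour of day and compare average upvotes:
# average upvotes by hour of day, pooling all days together (added check)
dat.nodupe.rand %>%
  filter(date > as.Date("2016-02-02")) %>%
  mutate(hour.of.day = hour(datetime)) %>%
  group_by(subreddit, hour.of.day) %>%
  summarise(avg.upvotes = mean(ups), posts = n()) %>%
  arrange(subreddit, hour.of.day)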
In thinking about attributes of posts, I also wanted to perform a linguistic analysis, to see if we can capture the language elements that result in more upvotes. This is a tricky problem, and likely requires both more posts and more data about each post, such as the thread(s) it is nested beneath.
The two NLP attributes I chose to capture were (1) the length of the post (in space-delimited words) and (2) the average sentiment of the post. Using the R package sentimentr, we can capture the sentiment of entire sentences, rather than scoring only word by word.
# word count
dat.nodupe.rand <- dat.nodupe.rand %>%
mutate(wordcount = str_count(text, "\\S+"))
# sentiment analysis
library(sentimentr)
dat.nodupe.rand <- dat.nodupe.rand %>%
  mutate(sentiment = sentiment_by(text)$ave_sentiment)  # average sentence-level sentiment per post
Starting with word count per post, we can observe a very long tail. Such long tails are very common in linguistics, in the spirit of Zipf’s Law, which describes the heavily skewed distributions of many linguistic phenomena.
# dist of word count
ggplot(dat.nodupe.rand, aes(x=wordcount)) +
geom_histogram() +
facet_grid(subreddit~.)
A log transformation of word count yields a more effective visualization, with a more normal-looking distribution.
ggplot(dat.nodupe.rand, aes(x=log(wordcount))) +
geom_histogram() +
facet_grid(subreddit~.)
We can then ask whether there is a relationship between word count and the number of upvotes the post received.
# relationship between word count and upvotes
ggplot(dat.nodupe.rand, aes(x=log(wordcount), y=log(ups))) +
geom_point() +
geom_smooth(method="loess") +
facet_grid(subreddit~.)
# a few long hp posts have more upvotes
# sw seems to most consistently upvote average-length posts
Overall, there does not appear to be a connection, as the trend line is mostly flat, i.e. longer posts do not receive more upvotes. However, there is a small trend where some lengthier posts about Harry Potter receive more upvotes. On the other hand, the Star Wars community seems to most consistently upvote average-length posts.
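As a rough numerical check on that impression, added here and not part of the original analysis, a rank correlation between post length and upvotes can be computed per subreddit:
# Spearman rank correlation between post length and upvotes, per subreddit (added check)
dat.nodupe.rand %>%
  group_by(subreddit) %>%
  summarise(spearman.rho = cor(wordcount, ups, method = "spearman"))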
Moving on to a sentiment analysis, I first explored the distribution of sentiment among all posts.
# dist of sentiment
ggplot(dat.nodupe.rand, aes(x=sentiment)) +
geom_histogram() +
facet_grid(subreddit~.)
From this we see that most posts are neutral in sentiment, which makes sense: posts are, on average, neither especially positive nor especially negative. However, I was curious whether posts with more extreme polarity receive more upvotes.
# relationship between sentiment and upvotes
ggplot(dat.nodupe.rand, aes(x=sentiment, y=log(ups))) +
geom_point() +
geom_smooth(method="loess") +
facet_grid(subreddit~.)
# neutral posts receive most upvotes
# bigger difference in sw
This does not appear to be the case. Rather, neutral posts by and large receive the most upvotes, rather than extreme views being rewarded. There may be an exception in the Star Wars community, which anecdotally holds very strong viewpoints. However, given the large variance at the extreme ends of the spectrum, it is difficult to conclude whether either subreddit particularly rewards passionate posts.
I concluded this section by running a bare-bones linear mixed-effects model, to test the effects of all of the variables together on the number of upvotes. The output can be seen below; with the current dataset, none of the variables is a significant predictor of upvotes (all of the fixed-effect |t| values are well below 2).
library(lme4)
lmer2 <- lmer(ups ~ sentiment + wordcount + totalkarma + (1|subreddit),
data=dat.nodupe.rand)
summary(lmer2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ups ~ sentiment + wordcount + totalkarma + (1 | subreddit)
## Data: dat.nodupe.rand
##
## REML criterion at convergence: 12977.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -0.4942 -0.1872 -0.0696 -0.0368 20.8028
##
## Random effects:
## Groups Name Variance Std.Dev.
## subreddit (Intercept) 280.1 16.73
## Residual 25191.4 158.72
## Number of obs: 1000, groups: subreddit, 2
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 21.1291252 13.2577813 1.594
## sentiment 7.8894552 15.4942377 0.509
## wordcount -0.0325491 0.0662015 -0.492
## totalkarma 0.0000682 0.0001086 0.628
##
## Correlation of Fixed Effects:
## (Intr) sntmnt wrdcnt
## sentiment -0.073
## wordcount -0.195 0.015
## totalkarma -0.130 -0.018 0.006
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
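The fit warning at the bottom notes that the predictors are on very different scales. One way to address it, sketched here rather than taken from the original analysis, is to standardize the continuous predictors before refitting; the coefficients change scale, but the substantive conclusion should not.
# refit after standardizing the continuous predictors (added sketch, addresses the scaling warning)
dat.scaled <- dat.nodupe.rand %>%
  ungroup() %>%
  mutate(across(c(sentiment, wordcount, totalkarma), ~ as.numeric(scale(.x))))
lmer2.scaled <- lmer(ups ~ sentiment + wordcount + totalkarma + (1 | subreddit),
                     data = dat.scaled)
summary(lmer2.scaled)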
I attempted to combine all of the post attributes, to discover whether any trends stood out. Nothing obvious emerges, although there may be a connection between extreme sentiment polarity (extremely negative or extremely positive) and the number of posts occurring at that time.
### final chart
dayhour.freq <- dat.nodupe.rand %>%
filter(date > as.Date("2016-02-02")) %>%
group_by(dayhour, subreddit) %>%
summarise(post.count = n(),
subreddit.name = first(subreddit),
avg.upvotes = mean(ups),
avg.karma = mean(totalkarma),
avg.sentiment = mean(abs(sentiment)),
avg.wordcount = mean(wordcount))
ggplot(dayhour.freq, aes(x=dayhour, y=post.count)) +
geom_bar(stat='identity', aes(alpha=avg.wordcount,
fill=avg.sentiment,
color=avg.upvotes)) +
scale_fill_distiller(palette="YlOrRd", trans="reverse") +
scale_color_distiller(palette="PuBuGn", trans="reverse") +
facet_grid(subreddit~.) +
ggtitle("Post Count By Hour With Post Attributes") +
labs(x="Date and Hour",
y="Total Posts in Hour",
alpha="Mean Word Count",
fill="Mean Absolute Sentiment",
color="Mean Upvotes")
knitr::include_graphics('Final.chart.png')
This highlights the number of posts per hour, along with their attributes: the bar outline color signifies the mean upvotes, the fill color signifies the mean absolute sentiment, and the bar transparency signifies the mean word count.
As can be seen, it is very difficult to draw strong conclusions from the data that are supported either by visual inspection or by statistical significance. Since the data covers only a very brief snapshot (three days), it is unlikely that any long-term trends would be evident. In future studies, it would make sense to run this analysis on a large number of subreddits, so that even though it is a brief snapshot, it would at least be broad in scope. The current analysis, by contrast, is a brief snapshot of a small subset of the data.