Reddit Piracy Subreddit Data Analysis

27/04/2022

REDDIT :

Reddit.com is a social media news and discussion website, where votes promote user-provided stories, links, and comments to the front page of the site. Reddit.com is one of the most visited sites in the world, according to Alexa Internet.

REDDIT AS A DISCUSSION FORUM

As a site that relies upon users for content, interaction, and moderation, Reddit yields interesting data that can show real-time trends in popular topics. Users “upvote” and “downvote” each other's comments, with upvoted comments rising higher in threads. A popular comment is therefore much more likely to be seen by another user, and perhaps then further upvoted or commented upon. This makes for a complex network of users and discussions that also provides insight into trending topics.

SUBREDDITS :

Reddit is made up of “subreddits”, which are categories of threads on a similar topic or of a similar type. There are subreddits for politics, celebrities, memes, video games, and just about every other topic one can think of. There are thousands of users on the platform, commenting on different topics and forums and expressing their views. As a mass media platform, it can be weaponized into a powerful tool for manipulating the masses. As part of our project, we analyze threads, comments, posts, and moderator bots on Reddit.

PROBLEM STATEMENT :

Digital piracy is often portrayed as a victimless crime, but that portrayal is false. Piracy negatively affects every single person working in the affected industries and their supply chains. On less monitored platforms like Reddit, discussing controversial topics, including piracy, is easier. We analyzed data from the subreddit r/Piracy to see what kind of content normal users and bots pass around that eases access to pirated content.

DATA :

Our data source is the Reddit API, accessed through the R package RedditExtractoR in RStudio (https://cran.r-project.org/web/packages/RedditExtractoR/). The API returns content from across the Reddit ecosystem: subreddits, user data and karma, upvotes and downvotes, and trending posts. We analyzed the subreddit r/Piracy and the comments posted in it to study its content and users.

PEER COMMENTS:

“I think you should choose which subreddits you use carefully. Since some subreddits have very specific behavior by users, ex: people comment short replies or repeat the same thing over and over. This might make it difficult to differentiate between a bot and a real user.”
“What’s the future scope of this project?”
“What is the time frame of data that you are considering for your analysis?”
“Are you planning to consider secondary data for your project?”

DATA EXPLORATION:

Exploring the data, we found many key takeaways in both the primary and the secondary data. The primary dataset contains comments and the secondary dataset contains threads from the subreddit; we visualized both.

Selected Reddit Data
	X	url	author	date	timestamp	title	text	subreddit	score	upvotes	downvotes	up_ratio	total_awards_received	golds	cross_posts	comments
4	4	https://www.reddit.com/r/Piracy/comments/oa9ix2/mas_14_also_works_on_windows_11_build_2200051/	VybeXE	2021-06-29	1624977329	MAS 1.4 also works on Windows 11 build 22000.51!		Piracy	2237	2237	0	0.98	14	0	0	267
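The `timestamp` column holds Unix epoch seconds, so the `date` column can be re-derived from it as a quick consistency check. The following is a minimal Python sketch using the values from the sample row above; it assumes the timestamp is expressed in UTC, which the export itself does not state, and the helper name `timestamp_to_date` is ours.

```python
from datetime import datetime, timezone

# Values copied from the sample row above.
row = {
    "date": "2021-06-29",
    "timestamp": 1624977329,
    "upvotes": 2237,
    "downvotes": 0,
    "comments": 267,
}


def timestamp_to_date(ts: int) -> str:
    """Convert Unix epoch seconds to an ISO date string (assumption: UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


derived = timestamp_to_date(row["timestamp"])
# True when the stored date agrees with the date derived from the timestamp.
print(derived, derived == row["date"])
```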

Analyzing the data, we found comment timelines, top users, top awarded posts, kinds of threads, and more. As an example, we have included the following plot, which shows the top awarded threads. Top awarded threads also have a high number of upvotes and are popular with users. They can be of different types: informational from a technical perspective, opinionated, or posts with external links.
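A ranking of top awarded threads can be sketched as a sort on the `total_awards_received` column. The rows below are hypothetical placeholders, not values from our dataset; only the column names come from the thread data. Breaking ties by `upvotes` is our assumption.

```python
from operator import itemgetter

# Hypothetical rows for illustration; only the column names are real.
threads = [
    {"title": "thread A", "total_awards_received": 3, "upvotes": 150},
    {"title": "thread B", "total_awards_received": 14, "upvotes": 2237},
    {"title": "thread C", "total_awards_received": 7, "upvotes": 980},
]


def top_awarded(rows, n=10):
    """Return the n rows with the most awards, breaking ties by upvotes."""
    key = itemgetter("total_awards_received", "upvotes")
    return sorted(rows, key=key, reverse=True)[:n]


for thread in top_awarded(threads, n=2):
    print(thread["title"], thread["total_awards_received"], thread["upvotes"])
```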

Analyzing the other data, we built a timeline of comments and saw when comment activity spiked.
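One simple way to locate such spikes is to count comments per day and flag days that sit well above the typical day. The sketch below is illustrative only: the dates are made up, and the rule "more than three times the median daily count" is an assumed threshold, not the one behind our plot.

```python
from collections import Counter
from statistics import median

# Hypothetical comment dates, one entry per comment.
comment_dates = (
    ["2021-06-27"] * 2
    + ["2021-06-28"] * 3
    + ["2021-06-29"] * 12
    + ["2021-06-30"] * 3
)


def daily_counts(dates):
    """Count comments per day, ordered by date (days with no comments are absent)."""
    return dict(sorted(Counter(dates).items()))


def find_spikes(counts, factor=3.0):
    """Flag days whose count exceeds factor times the median daily count."""
    baseline = median(counts.values())
    return [day for day, n in counts.items() if n > factor * baseline]


counts = daily_counts(comment_dates)
print(counts)
print(find_spikes(counts))
```

A median baseline is used here because a single very busy day inflates the mean and standard deviation, which can hide the spike itself on short timelines.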

AI/ML MODEL:

We evaluated AutoML, i.e. automated machine learning, which automates the selection, composition, and parameterization of machine learning models. We achieved this with H2O, a machine learning platform whose AutoML function trains and compares several model types, including deep learning models. We used indicators from our primary dataset to predict whether a user is a bot and to examine the kind of content it has posted.

We ran the AutoML model after binding the columns of our primary dataset.

We split the data into three sets: a training set, a validation set, and a testing set. Once put to work, H2O selected the best model it found to give us the results.
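A three-way split of this kind can be sketched as follows. The 60/20/20 proportions and the fixed seed are assumptions chosen for illustration; they are not the proportions used in our H2O run, and the function name `three_way_split` is ours.

```python
import random


def three_way_split(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle rows and cut them into training, validation, and testing sets.

    The testing set receives whatever remains after the first two cuts.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


train_set, valid_set, test_set = three_way_split(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Keeping the testing set untouched until the end means the reported error reflects rows the model selection step never saw.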

KEY TAKEAWAYS OF THE AI/ML MODEL

The AI/ML model achieved an RMSE of 0.38 when predicting whether a user is genuine or a bot. We used H2O's StackedEnsemble model to evaluate the quality of our model.
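For reference, RMSE is the square root of the mean squared difference between the actual and predicted values. The sketch below assumes labels coded as 1 for bot and 0 for genuine, with the model returning a score between 0 and 1; that coding is our assumption for illustration, and the numbers are made up.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels (1 = bot, 0 = genuine) and predicted scores.
labels = [1, 0, 1, 0]
scores = [0.8, 0.3, 0.6, 0.1]
print(round(rmse(labels, scores), 3))

# Baseline for comparison: always predicting 0.5 on 0/1 labels.
print(rmse([1, 0], [0.5, 0.5]))
```

On 0/1 labels, a model that always predicts 0.5 has an RMSE of exactly 0.5, which gives a point of comparison when reading an RMSE value.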

KEY TAKEAWAYS

Through this model, we can curb excessive posts about pirated content and help protect media corporations and their talent. For visual and audio media, software products, and many other goods costing billions of dollars, users may be better equipped to find credible sources. Bots tend to post the same repetitive and excessive content, so this product may be well suited to this business case.
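Repetitive posting can be turned into a numeric indicator per author. The sketch below is a hypothetical example of such an indicator, not one of the features used in our model: it computes the share of an author's comments that duplicate another of their comments after lower-casing and collapsing whitespace. The author names and texts are made up.

```python
from collections import defaultdict


def repetition_ratio(comments):
    """Map each author to the share of their comments that are duplicates.

    comments is an iterable of (author, text) pairs. Texts are compared
    after lower-casing and collapsing runs of whitespace.
    """
    by_author = defaultdict(list)
    for author, text in comments:
        by_author[author].append(" ".join(text.lower().split()))
    return {
        author: 1 - len(set(texts)) / len(texts)
        for author, texts in by_author.items()
    }


# Hypothetical comments for illustration.
sample = [
    ("user_a", "Check the megathread"),
    ("user_a", "check the  megathread"),
    ("user_a", "Check the megathread"),
    ("user_b", "Which site is safe?"),
    ("user_b", "Thanks, that worked."),
]
print(repetition_ratio(sample))
```

A ratio near 1 means almost every comment repeats an earlier one, while 0 means every comment is distinct. As the peer comment on short, repeated replies suggests, a high ratio alone would not prove that an account is a bot.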