- Reddit is a website that works like a collection of forums.
- Each forum is called a subreddit, and users post about and discuss topics relevant to that subreddit.
- For example, in the Apple subreddit, users discuss the latest Apple products.
In our project proposal, we stated the following:
- Our primary source of data will be the Reddit API.
- We will use R data-cleaning libraries such as tidyr and dplyr for data cleaning and manipulation.
- We will use natural language processing techniques to sort the Reddit data into relevant categories.
- For data exploration, we will use ggplot graphs to show our analysis and its outcomes.
The data is uploaded to https://www.dropbox.com/s/hx5yiqkjwc9b4cd/worldnews_test3.csv?dl=0. This link points to a CSV file fetched through the Reddit API. We used RedditExtractoR to download Reddit data from the following subreddits (which make up the majority of the data).
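The download step could be sketched roughly as follows. This is an illustration under assumptions, not the exact script used for the project: it assumes RedditExtractoR version 3.0 or later (older releases exposed a different interface), and the helper name `fetch_subreddit_posts`, the `sort_by` and `period` defaults, and the output file name are all hypothetical.

```r
# Hypothetical helper: fetch thread metadata for one subreddit.
# Assumes RedditExtractoR >= 3.0 is installed; the package is only
# needed when the function is actually called.
fetch_subreddit_posts <- function(subreddit, sort_by = "top", period = "year") {
  RedditExtractoR::find_thread_urls(
    subreddit = subreddit,
    sort_by   = sort_by,
    period    = period
  )
}

# Example call (needs network access, so it is left commented out).
# The output file name is an assumption, not the project's actual file.
# worldnews <- fetch_subreddit_posts("worldnews")
# write.csv(worldnews, "worldnews_posts.csv", row.names = FALSE)
```

Calling the helper once per subreddit and binding the results together would give a single data frame that can then be saved as a CSV.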
Number of posts vs years
Number of posts vs days of the week - People post noticeably less on Wednesdays
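A count of posts per weekday, like the one behind this plot, can be computed in a few lines of base R. The sketch below uses a small made-up data frame; the column name `post_date` and the sample dates are assumptions for illustration, not the project's actual data.

```r
# Hypothetical sample: one row per post, with its posting date.
posts <- data.frame(
  post_date = as.Date(c("2018-03-05", "2018-03-06", "2018-03-07",
                        "2018-03-09", "2018-03-12", "2018-03-16"))
)

# format(x, "%u") returns the ISO weekday number (1 = Monday, 7 = Sunday),
# which avoids depending on the locale's weekday names.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekday_index <- as.integer(format(posts$post_date, "%u"))
posts$weekday <- factor(day_names[weekday_index], levels = day_names)

# Count posts per weekday; days with no posts still appear with a count of 0.
posts_per_weekday <- table(posts$weekday)
print(posts_per_weekday)
```

Storing the weekday as a factor with explicit levels keeps the days in calendar order, which is also the order a bar chart of these counts would use.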
Most common tokens in subreddit - world news - Data related to world news
Most common tokens in subreddit - ask reddit - Data related to ask reddit
Most common tokens in subreddit Bangtan - Data related to the South Korean boy group (Entertainment)
Most common tokens in subreddit NintendoSwitch - Data related to games
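The "most common tokens" counts above rest on a simple idea: split each post into words, drop uninformative ones, and tally what remains. The base R sketch below shows one way to do that. The function name `count_tokens`, the sample titles, and the tiny stop-word list are hypothetical, and the project may have used a dedicated text-mining package instead.

```r
# Hypothetical helper: count token frequencies across a character vector.
count_tokens <- function(texts, stop_words = character(0)) {
  # Lowercase, then split on runs of characters that are not
  # letters, digits, or apostrophes.
  tokens <- unlist(strsplit(tolower(texts), "[^a-z0-9']+"))
  # Drop empty strings and stop words.
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  # Tally and sort from most to least frequent.
  sort(table(tokens), decreasing = TRUE)
}

# Made-up post titles for illustration.
titles <- c("Markets rally as trade talks resume",
            "Trade talks stall again",
            "Talks on climate deal resume")

top_tokens <- count_tokens(titles, stop_words = c("as", "on"))
print(head(top_tokens, 3))
```

Running the same function on the titles of each subreddit separately would give one frequency table per subreddit, ready to plot.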