This document scrapes the titles of Reddit posts relating to Tesla. The primary objective is to gather this information in preparation for a future, more in-depth analysis, in which the data collected here will be combined with financial information on Tesla to determine whether a correlation exists. The secondary objective is to practice web scraping in R and, ultimately, gathering actionable data from the web. Ideally, these posts in aggregate will represent a sample of public opinion on Tesla, as Reddit is a free and open forum.
We begin by loading the necessary packages and examining the structure of the webpage we intend to scrape. We use old.reddit.com because it is easier to scrape: it uses pagination rather than infinite scrolling. We then extract each post’s title, the subreddit it was posted to, and the exact time it was posted, which lets us build a timeline of when the posts were created. All of this can be done with functionality provided by httr, xml2, rvest, and tidyverse. It is worth noting, however, that Reddit offers a public API, which would allow a more streamlined approach than scraping.
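The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the results in this document: the inline HTML is a made-up stand-in for a search results page, and the CSS selectors (`div.search-result`, `a.search-title`, `a.search-subreddit-link`, `time`) are assumptions about the page markup that should be checked against the live page. The helper name `parse_results` is hypothetical.

```r
library(rvest)
library(tibble)

# Stand-in for a downloaded results page (illustrative markup and titles only).
# A live run would fetch the page first, e.g. with httr::GET() and a
# descriptive httr::user_agent(), then pass the response body to read_html().
page_html <- '
<div class="search-result">
  <a class="search-title" href="/r/teslamotors/comments/abc">Tesla road trip review</a>
  <a class="search-subreddit-link" href="/r/teslamotors">r/teslamotors</a>
  <time datetime="2023-10-02T14:05:00+00:00">1 day ago</time>
</div>
<div class="search-result">
  <a class="search-title" href="/r/wallstreetbets/comments/def">TSLA earnings thread</a>
  <a class="search-subreddit-link" href="/r/wallstreetbets">r/wallstreetbets</a>
  <time datetime="2023-10-01T09:30:00+00:00">2 days ago</time>
</div>'

# Hypothetical helper: one row per search result.
parse_results <- function(page) {
  results <- html_elements(page, "div.search-result")
  tibble(
    title     = html_text2(html_element(results, "a.search-title")),
    subreddit = html_text2(html_element(results, "a.search-subreddit-link")),
    # The datetime attribute is assumed to be an ISO 8601 timestamp in UTC;
    # the trailing offset is ignored by the format string.
    posted_at = as.POSIXct(
      html_attr(html_element(results, "time"), "datetime"),
      format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"
    )
  )
}

posts <- parse_results(read_html(page_html))
```

Using `html_element()` (singular) inside each result keeps the three columns aligned: a result that lacks one of the fields yields `NA` in that column rather than shifting the rows.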
The raw results of the scraping process can be seen below. The first column represents the title of the post, the second column represents the subreddit the post was posted to, and the third column represents the exact time the post was created.
We can also summarize the data by subreddit to see which subreddits yield the most posts, and tokenize the titles into individual words to gauge both the general sentiment towards Tesla and the sentiment within individual subreddits.
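As a rough sketch of these two summaries, the example below counts posts per subreddit and splits titles into lowercase words using dplyr, tidyr, and stringr. The sample titles and the short `stop_words` vector are made up for illustration, and the regular-expression tokenizer is an assumption; a dedicated text-mining package may have been used instead.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up sample standing in for the scraped posts.
posts <- tibble(
  title = c(
    "Tesla stock jumps after earnings",
    "TSLA calls are printing",
    "Tesla Model Y road trip review",
    "Is Tesla overvalued"
  ),
  subreddit = c("r/wallstreetbets", "r/wallstreetbets", "r/teslamotors", "r/stocks")
)

# Number of posts per subreddit, most frequent first.
subreddit_counts <- posts %>%
  count(subreddit, sort = TRUE)

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
stop_words <- c("the", "a", "is", "are", "after", "of", "to", "in")

# One row per word: lowercase the title, split on anything that is not a
# letter, digit, or apostrophe, then drop empty strings and stop words.
word_counts <- posts %>%
  mutate(word = str_split(str_to_lower(title), "[^a-z0-9']+")) %>%
  unnest(word) %>%
  filter(word != "", !word %in% stop_words) %>%
  count(word, sort = TRUE)
```

Keeping `subreddit` alongside `word` before the final `count()` (for example `count(subreddit, word, sort = TRUE)`) would give the per-subreddit word frequencies instead of the overall ones.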
Here is a bar chart of the subreddits that contain more than 10 posts in our sample of Tesla posts. Notably, the most common subreddit is r/wallstreetbets, which is dedicated to discussing stocks and investing in general. This suggests that there could be a correlation between public sentiment and the stock price. It is closely followed by r/teslamotors, a subreddit dedicated directly to Tesla.
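A chart of this kind could be produced along the following lines. This is only a sketch: the counts in `subreddit_counts` are placeholder values, not the figures from our sample, and the plotting choices (horizontal bars, ordering by count) are assumptions.

```r
library(dplyr)
library(ggplot2)

# Placeholder counts for illustration only (not the scraped results).
subreddit_counts <- tibble(
  subreddit = c("r/wallstreetbets", "r/teslamotors", "r/stocks", "r/cars"),
  n = c(42L, 37L, 15L, 6L)
)

# Keep only subreddits with more than 10 posts in the sample.
frequent_subreddits <- subreddit_counts %>%
  filter(n > 10)

# Horizontal bar chart, ordered so the most common subreddit is at the top.
subreddit_plot <- ggplot(frequent_subreddits, aes(x = reorder(subreddit, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Subreddit", y = "Posts in sample")
```

Printing `subreddit_plot` renders the chart. The same pattern, with `slice_max(n, n = 10)` in place of the `filter()` call, would select the 10 most common words from a word-count table.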
Below is a bar chart of the 10 most common words in the post titles. The most common word is “Tesla”, which is unsurprising given that it was the subject of the initial search. The second most common word is “tsla”, Tesla’s stock ticker, which further suggests that the posts in our sample, which represent public sentiment towards the company, could be correlated with the stock price. It is worth noting, though, that “tsla” was also one of the search terms.
We can also examine the posts by their post date to see whether posts are more frequent at certain times of the week or appear at a constant rate. Below is a line chart of the number of posts per day over time. It is worth noting, once again, that this data was scraped and represents only a small sample of the total number of posts, which could introduce some bias. The number of posts per day appears fairly steady, with a slight increase as we approach the date of scraping.
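The daily counts behind such a chart can be computed roughly as shown below. The timestamps are made up for illustration, and the column names `posted_at`, `day`, and `n_posts` are assumptions. Days without any posts are filled in with zero so that the line does not skip over them.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up timestamps standing in for the scraped post times (UTC).
posts <- tibble(
  posted_at = as.POSIXct(
    c("2023-10-01 09:15:00", "2023-10-01 18:40:00",
      "2023-10-02 07:05:00", "2023-10-04 12:00:00"),
    tz = "UTC"
  )
)

posts_per_day <- posts %>%
  # Truncate each timestamp to its calendar day.
  mutate(day = as.Date(posted_at, tz = "UTC")) %>%
  count(day, name = "n_posts") %>%
  # Add explicit zero rows for days with no posts in the sample.
  complete(day = seq(min(day), max(day), by = "day"), fill = list(n_posts = 0L))

daily_plot <- ggplot(posts_per_day, aes(x = day, y = n_posts)) +
  geom_line() +
  labs(x = "Date", y = "Posts per day")
```

Without the `complete()` step, a day with no posts would simply be missing, and the line would interpolate straight across it, which can hide gaps in the sample.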
We can also examine the number of comments per post in relation to the number of up-votes per post. This lets us establish an average comment-per-up-vote ratio, which could serve as an indicator of a post’s quality and as an additional sentiment metric in cases where the title does not reflect how the post was received. Below is a scatter plot of the number of comments per post against the number of up-votes per post.
The number of up-votes and the number of comments per post appear to be positively correlated, with the relationship skewed towards up-votes. Assuming a linear function represents this relationship, we could use it as another metric for analyzing the overall sentiment of a post in the event that the sentiment of the title does not reflect the public’s sentiment towards the post.
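One way to quantify this relationship is sketched below with base R. The numbers are invented for illustration and are not taken from our sample; the column names `upvotes` and `comments` are assumptions. The sketch computes the comment-per-up-vote ratio for each post and fits the linear function mentioned above with `lm()`.

```r
# Invented values for illustration only (not the scraped data).
posts <- data.frame(
  upvotes  = c(10, 50, 120, 300, 800),
  comments = c(4, 12, 35, 70, 210)
)

# Comments per up-vote for each post; posts with zero up-votes get NA
# instead of a division by zero.
posts$comments_per_upvote <- ifelse(
  posts$upvotes > 0, posts$comments / posts$upvotes, NA_real_
)
mean_ratio <- mean(posts$comments_per_upvote, na.rm = TRUE)

# Strength of the linear association between the two counts.
upvote_comment_cor <- cor(posts$upvotes, posts$comments)

# Linear model: expected number of comments as a function of up-votes.
fit <- lm(comments ~ upvotes, data = posts)
slope <- unname(coef(fit)["upvotes"])

# Residuals flag posts with more (positive) or fewer (negative) comments
# than their up-vote count would suggest under the linear assumption.
posts$residual <- residuals(fit)
```

A post with a large positive residual attracts more discussion than its score predicts, which is the kind of case where the title alone may not reflect how the post was received.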
All data represented here was scraped from Reddit.